Conversation

@bigcyy bigcyy commented Aug 17, 2025

What's changed?

  • Integration with HertzBeat based on the OpenTelemetry log protocol.
  • Real-time log stream viewing.
  • Log storage.
  • Real-time log threshold alerting (JEXL matching, window aggregation, watermark mechanism).
  • Periodic log threshold alerting (SQL data query, JEXL matching).
  • Log management and statistics.
  • vector.yml configuration generation.
  • Unit tests.
  • E2E tests.
  • Documentation updates.
  • Detail refinement.
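The real-time alerting path above (window aggregation with a watermark) can be sketched as follows. This is a minimal illustration under assumed semantics, not HertzBeat's actual implementation; the class name, tumbling-window policy, and lateness handling are all invented for the example:

```java
import java.util.HashMap;
import java.util.Map;

// Tumbling-window log counter with a watermark: events older than
// (maxSeenTimestamp - allowedLateness) are rejected, so windows behind
// the watermark can safely be flushed for threshold evaluation.
public class WindowCounter {
    private final long windowMs;
    private final long latenessMs;
    private long maxTs = Long.MIN_VALUE;
    private final Map<Long, Long> counts = new HashMap<>(); // windowStart -> count

    public WindowCounter(long windowMs, long latenessMs) {
        this.windowMs = windowMs;
        this.latenessMs = latenessMs;
    }

    /** @return false if the event arrived behind the watermark */
    public boolean offer(long ts) {
        maxTs = Math.max(maxTs, ts);
        if (ts < maxTs - latenessMs) {
            return false; // too late: behind the watermark
        }
        long windowStart = ts / windowMs * windowMs;
        counts.merge(windowStart, 1L, Long::sum);
        return true;
    }

    public long count(long windowStart) {
        return counts.getOrDefault(windowStart, 0L);
    }
}
```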

Checklist

  • I have read the Contributing Guide
  • I have written the necessary doc or comment.
  • I have added the necessary unit tests and all cases have passed.

Add or update API

  • I have added the necessary e2e tests and all cases have passed.

bigcyy added 30 commits July 2, 2025 17:12
…g, log-based alert triggering, and previewing of SQL query results.
…ealtime, add single alert and group alert to log realtime and log periodic
@tomsun28 tomsun28 requested a review from Copilot August 19, 2025 15:32

@Copilot Copilot AI left a comment

Pull Request Overview

This PR implements comprehensive log monitoring capabilities for Apache HertzBeat by integrating OpenTelemetry log protocol (OTLP) support, real-time log streaming, log storage, and both real-time and periodic log threshold alerting mechanisms.

Key changes include:

  • Complete log management system with integration, streaming, and management interfaces
  • Enhanced alert system to support both metric and log data types with real-time and periodic threshold monitoring
  • Multi-language internationalization support for Traditional Chinese, Simplified Chinese, Portuguese, Japanese, and English
  • OpenTelemetry log protocol integration with comprehensive documentation

Reviewed Changes

Copilot reviewed 92 out of 93 changed files in this pull request and generated 6 comments.

Summary per file:

  • web-app/src/assets/i18n/*.json: Added comprehensive internationalization strings for log monitoring features across all supported languages
  • web-app/src/assets/doc/log-integration/*.md: Created OTLP integration documentation in English and Chinese
  • web-app/src/assets/app-data.json: Added log module navigation menu configuration
  • web-app/src/app/service/log.service.ts: Implemented log service for API communication and data management
  • web-app/src/app/routes/routes-routing.module.ts: Added routing configuration for log module
  • web-app/src/app/routes/log/: Complete log module implementation including integration, streaming, and management components
  • web-app/src/app/routes/alert/alert-setting/alert-setting.component.ts: Enhanced alert system to support log-based threshold monitoring


Comment on lines 91 to 92
logWorkerExecutor = new ThreadPoolExecutor(4, 10, 10, TimeUnit.SECONDS,
        new LinkedBlockingQueue<>(),
Member

Hi, if the blocking queue is a LinkedBlockingQueue, the pool will first grow to 4 threads, and then further tasks are put into the LinkedBlockingQueue; the pool size will never increase beyond that until the blocking queue is full.
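The behavior described here can be sketched with plain JDK classes. The bounded capacity (1000) and the CallerRunsPolicy below are illustrative choices for the sketch, not the values from the actual fix:

```java
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class LogWorkerPool {
    // With an unbounded LinkedBlockingQueue, the pool stops growing at the
    // core size (4): extra tasks queue forever and maximumPoolSize (10) is
    // never reached. Bounding the queue lets the pool expand under load,
    // and CallerRunsPolicy applies backpressure once both are saturated.
    public static ThreadPoolExecutor bounded() {
        return new ThreadPoolExecutor(4, 10, 10, TimeUnit.SECONDS,
                new LinkedBlockingQueue<>(1000),
                new ThreadPoolExecutor.CallerRunsPolicy());
    }
}
```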

Member Author

Hi tom, thanks for your feedback! I've pushed a fix to address this.

@tomsun28
Member

Great! 👍👍 Such a big feature, we will need more time to review.

zhangshenghang and others added 12 commits August 20, 2025 09:13
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Signed-off-by: Yang Chen <1597081640@qq.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Signed-off-by: Yang Chen <1597081640@qq.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Signed-off-by: Yang Chen <1597081640@qq.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Signed-off-by: Yang Chen <1597081640@qq.com>
Signed-off-by: Yang Chen <1597081640@qq.com>
Comment on lines +103 to +107
@Schema(title = "Query Expression", example = "SELECT * FROM metrics WHERE value > 90")
@Size(max = 2048)
@Column(length = 2048)
private String queryExpr;

Member

👍 Hi, overall this is very good.
We need to discuss whether the queryExpr for logs here is reasonable. In the previous design, the query expression and the judgment expression were together; for example, in a PromQL threshold rule like expr: (100 - (avg by(instance)(irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)) > 80, the (100 - (avg by(instance)(irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)) part is the query expression and > 80 is the judgment expression.

Member Author

Hi Tom, thank you for your review. I believe that a two-stage design, separating the query and alerting phases, is of paramount importance. If we do not make this separation, our SQL query syntax will be tightly coupled with the alerting threshold syntax. This coupling would lead to insurmountable syntax conflicts when adapting to different SQL dialects, such as those used by Greptime and other databases.

Furthermore, from a user experience perspective, this two-stage model is more intuitive in UI design. It perfectly aligns with the user's natural mental model: first defining "What data do I need?" and then determining "What conditions should trigger an alert?" This approach is much easier to understand and operate than one that mixes all the logic together.

Finally, it is worth noting that this two-stage model is widely adopted in mature industry monitoring platforms, such as Alibaba Cloud's Log Service (SLS). This further validates the stability and maturity of a decoupled solution.

Contributor

> Hi Tom, thank you for your review. I believe that a two-stage design, separating the query and alerting phases, is of paramount importance. If we do not make this separation, our SQL query syntax will be tightly coupled with the alerting threshold syntax. This coupling would lead to insurmountable syntax conflicts when adapting to different SQL dialects, such as those used by Greptime and other databases.
>
> Furthermore, from a user experience perspective, this two-stage model is more intuitive in UI design. It perfectly aligns with the user's natural mental model: first defining "What data do I need?" and then determining "What conditions should trigger an alert?" This approach is much easier to understand and operate than one that mixes all the logic together.
>
> Finally, it is worth noting that this two-stage model is widely adopted in mature industry monitoring platforms, such as Alibaba Cloud's Log Service (SLS). This further validates the stability and maturity of a decoupled solution.

I also have a question about this. If the query field provides no prompts, do I still need to inspect the relevant log output in order to configure the threshold expression correctly?

I use SLS frequently in my daily work. In most cases I only need phrase matching; when I need to format information or process the data further, I use SQL to enrich the query. The convenience is that when writing threshold expressions, I can quickly generate what I need through AI-powered query generation, historical queries, and prompts. Could we also adopt this approach?

Then I looked at the code you discussed at this position. As I understand it, the flow first runs the queryExpr query and then uses expr to perform JEXL threshold matching. Wouldn't this be a bit complicated? I'm also concerned that there might be a performance bottleneck when executing a large number of complex JEXL expressions.

Member Author

> Then I looked at the code you discussed at this position. As I understand it, the flow first runs the queryExpr query and then uses expr to perform JEXL threshold matching. Wouldn't this be a bit complicated? I'm also concerned that there might be a performance bottleneck when executing a large number of complex JEXL expressions.

I believe this is unavoidable. We can, however, add expression caching or SQL result set caching to improve performance.
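The expression caching mentioned here can be sketched with a simple memoizing wrapper (Commons JEXL's JexlBuilder also offers a built-in cache(n) for compiled expressions). The ExprCache name, generic signature, and the trim-based stand-in compiler below are illustrative only:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Memoizes compiled threshold expressions so a rule that fires repeatedly
// is parsed only once; the compiler function stands in for something like
// JexlEngine.createExpression.
public class ExprCache<K, V> {
    private final Map<K, V> cache = new ConcurrentHashMap<>();
    private final Function<K, V> compiler;

    public ExprCache(Function<K, V> compiler) {
        this.compiler = compiler;
    }

    public V get(K source) {
        // Compiles on first sight of the key, returns the cached value after.
        return cache.computeIfAbsent(source, compiler);
    }
}
```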

Member Author

Furthermore, when coupling SQL with a threshold expression, the SQL result must be a single column; a multi-column result set cannot be processed. For example, if I need to count the number of logs for each log level of every web service within 30 seconds, and this SQL is coupled with a threshold expression, the result will only contain the log count column. This leads to a loss of critical information, such as which services and which log levels the logs belong to.
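The point about preserving columns can be illustrated with a row-level predicate applied after the query. The TwoStage name is invented, and the Predicate stands in for a JEXL expression such as "log_count > 20"; this is a sketch, not the PR's implementation:

```java
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;
import java.util.stream.Collectors;

// Stage 1 (the SQL query) returns full rows; stage 2 applies the threshold
// predicate per row, so fired alerts keep their service/level context
// instead of collapsing to a single count column.
public class TwoStage {
    public static List<Map<String, Object>> fire(List<Map<String, Object>> rows,
                                                 Predicate<Map<String, Object>> expr) {
        return rows.stream().filter(expr).collect(Collectors.toList());
    }
}
```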

Member

I wonder whether it is our use of JEXL here that leads to the current separation into two expressions. What if the SQL query itself performed the judgment? I have seen some platforms do this.

For example, suppose I need to count, within 30 seconds, the number of logs for each log level of every web service, where that count is > 20.

For this, the SQL expression would be:

SELECT 
    service_name,
    log_level,
    COUNT(*) AS log_count
FROM logs
WHERE timestamp >= NOW() - INTERVAL '30' SECOND
GROUP BY service_name, log_level
HAVING COUNT(*) > 20;

With that, we do not need a separate expression for the condition. It can also drive rendering of the filtered-data UI.

Member

Why recommend using a single expression instead of two

  • Ease of sharing and dissemination: much like a PromQL threshold expression, an expression such as {level="error"} |~ ".*(timeout|exception).*", or a SQL statement, a single expression is very convenient to share and propagate.
  • Unified design: it stays consistent with the existing expression types, where the expression itself includes both the data query and the trigger condition, as in PromQL, Loki, and ELK.
  • Future unification of expressions: we could define our own unified expression rules for external use and implement them internally on top of different backends.
  • AI can better understand and generate expressions through a single expr.

Comment on lines +71 to +86

public void reduceAndSendAlarmGroup(Map<String, String> groupLabels, List<SingleAlert> alerts) {
    workerExecutor.execute(() -> {
        try {
            // Generate alert fingerprint
            for (SingleAlert alert : alerts) {
                String fingerprint = generateAlertFingerprint(alert.getLabels());
                alert.setFingerprint(fingerprint);
            }
            // Process the group alert
            alarmGroupReduce.processGroupAlert(groupLabels, alerts);
        } catch (Exception e) {
            log.error("Reduce alarm group failed: {}", e.getMessage());
        }
    });
}
Member

Here we handle group alerts separately with special processing, but the reduce stage was designed for all alerts, whether from the internal system or external systems. I think we can keep it for now, since no new database fields have been added.

@Duansg
Contributor

Duansg commented Aug 26, 2025

Hi, @bigcyy I pulled your PR and gave it a quick spin. Overall, aside from some UI interactions that need polishing, it's pretty well-rounded. I'd be happy to be the first user after this PR merges, hahaha. 👍

Maybe we can even build this feature together :)

Here are some of my questions and suggestions:

  1. Without Greptime enabled, can the Log Manage list page display a user-friendly prompt instead of an error message?
  2. In the Threshold configuration, the Alarm Content now displays in a single column. Was this intentional?
  3. In the Periodic Threshold configuration, do both metrics and logs now support PromQL and SQL? I noticed that SQL queries for metrics failed to display properly.
  4. Could we consider initializing the Log Stream in a paused state? When resume is triggered, it would then establish the connection. Additionally, should we make the automatic refresh of input parameters take effect only on a manual pause/resume? Currently, while I am typing filter parameters, if the SSE resets before I finish entering them, I lose the reference logs.
