Conversation

@bigcyy bigcyy commented Aug 17, 2025

What's changed?

  • Integration with HertzBeat based on the OpenTelemetry log protocol.
  • Real-time log stream viewing.
  • Log storage.
  • Real-time log threshold alerting (JEXL matching, window aggregation, watermark mechanism).
  • Periodic log threshold alerting (SQL data query, JEXL matching).
  • Log management and statistics.
  • vector.yml configuration generation.
  • Unit tests.
  • E2E tests.
  • Documentation updates.
  • Detail refinement.
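The real-time alerting path above (window aggregation with a watermark) can be sketched as follows. This is a minimal illustration under assumed semantics, not HertzBeat's actual implementation; the class name, tumbling-window policy, and lateness handling are all invented for the example:

```java
import java.util.HashMap;
import java.util.Map;

// Tumbling-window log counter with a watermark: events older than
// (maxSeenTimestamp - allowedLateness) are rejected, so windows behind
// the watermark can safely be flushed for threshold evaluation.
public class WindowCounter {
    private final long windowMs;
    private final long latenessMs;
    private long maxTs = Long.MIN_VALUE;
    private final Map<Long, Long> counts = new HashMap<>(); // windowStart -> count

    public WindowCounter(long windowMs, long latenessMs) {
        this.windowMs = windowMs;
        this.latenessMs = latenessMs;
    }

    /** @return false if the event arrived behind the watermark */
    public boolean offer(long ts) {
        maxTs = Math.max(maxTs, ts);
        if (ts < maxTs - latenessMs) {
            return false; // too late: behind the watermark
        }
        long windowStart = ts / windowMs * windowMs;
        counts.merge(windowStart, 1L, Long::sum);
        return true;
    }

    public long count(long windowStart) {
        return counts.getOrDefault(windowStart, 0L);
    }
}
```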

Checklist

  • I have read the Contributing Guide
  • I have written the necessary doc or comment.
  • I have added the necessary unit tests and all cases have passed.

Add or update API

  • I have added the necessary e2e tests and all cases have passed.

bigcyy added 30 commits July 2, 2025 17:12
…g, log-based alert triggering, and previewing of SQL query results.
…ealtime, add single alert and group alert to log realtime and log periodic
@tomsun28 tomsun28 requested a review from Copilot August 19, 2025 15:32

@Copilot Copilot AI left a comment

Pull Request Overview

This PR implements comprehensive log monitoring capabilities for Apache HertzBeat by integrating OpenTelemetry log protocol (OTLP) support, real-time log streaming, log storage, and both real-time and periodic log threshold alerting mechanisms.

Key changes include:

  • Complete log management system with integration, streaming, and management interfaces
  • Enhanced alert system to support both metric and log data types with real-time and periodic threshold monitoring
  • Multi-language internationalization support for Traditional Chinese, Simplified Chinese, Portuguese, Japanese, and English
  • OpenTelemetry log protocol integration with comprehensive documentation

Reviewed Changes

Copilot reviewed 92 out of 93 changed files in this pull request and generated 6 comments.

Summary per file:

  • web-app/src/assets/i18n/*.json: Added comprehensive internationalization strings for log monitoring features across all supported languages
  • web-app/src/assets/doc/log-integration/*.md: Created OTLP integration documentation in English and Chinese
  • web-app/src/assets/app-data.json: Added log module navigation menu configuration
  • web-app/src/app/service/log.service.ts: Implemented log service for API communication and data management
  • web-app/src/app/routes/routes-routing.module.ts: Added routing configuration for log module
  • web-app/src/app/routes/log/: Complete log module implementation including integration, streaming, and management components
  • web-app/src/app/routes/alert/alert-setting/alert-setting.component.ts: Enhanced alert system to support log-based threshold monitoring


Comment on lines 91 to 92
logWorkerExecutor = new ThreadPoolExecutor(4, 10, 10, TimeUnit.SECONDS,
        new LinkedBlockingQueue<>(),
Member

Hi, if the blocking queue is a LinkedBlockingQueue, the pool will first grow to 4 threads, and then further tasks are put into the LinkedBlockingQueue; the pool size will never increase beyond that until the blocking queue is full.
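The behavior described here can be sketched with plain JDK classes. The bounded capacity (1000) and the CallerRunsPolicy below are illustrative choices for the sketch, not the values from the actual fix:

```java
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class LogWorkerPool {
    // With an unbounded LinkedBlockingQueue, the pool stops growing at the
    // core size (4): extra tasks queue forever and maximumPoolSize (10) is
    // never reached. Bounding the queue lets the pool expand under load,
    // and CallerRunsPolicy applies backpressure once both are saturated.
    public static ThreadPoolExecutor bounded() {
        return new ThreadPoolExecutor(4, 10, 10, TimeUnit.SECONDS,
                new LinkedBlockingQueue<>(1000),
                new ThreadPoolExecutor.CallerRunsPolicy());
    }
}
```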

Member Author

Hi tom, thanks for your feedback! I've pushed a fix to address this.

@tomsun28
Member

Great! 👍👍 Such a big feature, we will need more time to review.

zhangshenghang and others added 12 commits August 20, 2025 09:13
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Signed-off-by: Yang Chen <1597081640@qq.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Signed-off-by: Yang Chen <1597081640@qq.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Signed-off-by: Yang Chen <1597081640@qq.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Signed-off-by: Yang Chen <1597081640@qq.com>
Signed-off-by: Yang Chen <1597081640@qq.com>
Comment on lines +103 to +107
@Schema(title = "Query Expression", example = "SELECT * FROM metrics WHERE value > 90")
@Size(max = 2048)
@Column(length = 2048)
private String queryExpr;

Member

👍 Hi, overall this is very good.
We need to discuss whether the queryExpr for logs here is reasonable. In the previous design, the query expression and the judgment expression were together; for example, in a PromQL threshold rule like expr: (100 - (avg by(instance)(irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)) > 80, the (100 - (avg by(instance)(irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)) part is the query expression and > 80 is the judgment expression.

Member Author

Hi Tom, thank you for your review. I believe that a two-stage design, separating the query and alerting phases, is of paramount importance. If we do not make this separation, our SQL query syntax will be tightly coupled with the alerting threshold syntax. This coupling would lead to insurmountable syntax conflicts when adapting to different SQL dialects, such as those used by Greptime and other databases.

Furthermore, from a user experience perspective, this two-stage model is more intuitive in UI design. It perfectly aligns with the user's natural mental model: first defining "What data do I need?" and then determining "What conditions should trigger an alert?" This approach is much easier to understand and operate than one that mixes all the logic together.

Finally, it is worth noting that this two-stage model is widely adopted in mature industry monitoring platforms, such as Alibaba Cloud's Log Service (SLS). This further validates the stability and maturity of a decoupled solution.

Contributor

> Hi Tom, thank you for your review. I believe that a two-stage design, separating the query and alerting phases, is of paramount importance. If we do not make this separation, our SQL query syntax will be tightly coupled with the alerting threshold syntax. This coupling would lead to insurmountable syntax conflicts when adapting to different SQL dialects, such as those used by Greptime and other databases.
>
> Furthermore, from a user experience perspective, this two-stage model is more intuitive in UI design. It perfectly aligns with the user's natural mental model: first defining "What data do I need?" and then determining "What conditions should trigger an alert?" This approach is much easier to understand and operate than one that mixes all the logic together.
>
> Finally, it is worth noting that this two-stage model is widely adopted in mature industry monitoring platforms, such as Alibaba Cloud's Log Service (SLS). This further validates the stability and maturity of a decoupled solution.

I also have a question about this. If the query field provides no prompts, do I still need to inspect the relevant log output in order to configure the threshold expression correctly?

I use SLS frequently in my daily work. In most cases I only need phrase matching; when I need to format information or process the data further, I use SQL to enrich the query. The convenience is that when writing threshold expressions, I can quickly generate what I need through AI-powered query generation, historical queries, and prompts. Could we also adopt this approach?

Then I looked at the code you discussed at this position. As I understand it, the flow first runs the queryExpr query and then uses expr to perform JEXL threshold matching. Wouldn't this be a bit complicated? I'm also concerned that there might be a performance bottleneck when executing a large number of complex JEXL expressions.

Member Author

> Then I looked at the code you discussed at this position. As I understand it, the flow first runs the queryExpr query and then uses expr to perform JEXL threshold matching. Wouldn't this be a bit complicated? I'm also concerned that there might be a performance bottleneck when executing a large number of complex JEXL expressions.

I believe this is unavoidable. We can, however, add expression caching or SQL result set caching to improve performance.
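The expression caching mentioned here can be sketched with a simple memoizing wrapper (Commons JEXL's JexlBuilder also offers a built-in cache(n) for compiled expressions). The ExprCache name, generic signature, and the trim-based stand-in compiler below are illustrative only:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Memoizes compiled threshold expressions so a rule that fires repeatedly
// is parsed only once; the compiler function stands in for something like
// JexlEngine.createExpression.
public class ExprCache<K, V> {
    private final Map<K, V> cache = new ConcurrentHashMap<>();
    private final Function<K, V> compiler;

    public ExprCache(Function<K, V> compiler) {
        this.compiler = compiler;
    }

    public V get(K source) {
        // Compiles on first sight of the key, returns the cached value after.
        return cache.computeIfAbsent(source, compiler);
    }
}
```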

Member Author

Furthermore, when coupling SQL with a threshold expression, the SQL result must be a single column; a multi-column result set cannot be processed. For example, if I need to count the number of logs for each log level of every web service within 30 seconds, and this SQL is coupled with a threshold expression, the result will only contain the log count column. This leads to a loss of critical information, such as which services and which log levels the logs belong to.
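The point about preserving columns can be illustrated with a row-level predicate applied after the query. The TwoStage name is invented, and the Predicate stands in for a JEXL expression such as "log_count > 20"; this is a sketch, not the PR's implementation:

```java
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;
import java.util.stream.Collectors;

// Stage 1 (the SQL query) returns full rows; stage 2 applies the threshold
// predicate per row, so fired alerts keep their service/level context
// instead of collapsing to a single count column.
public class TwoStage {
    public static List<Map<String, Object>> fire(List<Map<String, Object>> rows,
                                                 Predicate<Map<String, Object>> expr) {
        return rows.stream().filter(expr).collect(Collectors.toList());
    }
}
```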

Member

I wonder whether it is our use of JEXL here that leads to the current separation into two expressions. What if the SQL query itself performed the judgment? I have seen some platforms do this.

For example, suppose I need to count, within 30 seconds, the number of logs for each log level of every web service, where that count is > 20.

For this, the SQL expression would be:

SELECT 
    service_name,
    log_level,
    COUNT(*) AS log_count
FROM logs
WHERE timestamp >= NOW() - INTERVAL '30' SECOND
GROUP BY service_name, log_level
HAVING COUNT(*) > 20;

With that, we do not need a separate expression for the condition. It can also drive rendering of the filtered-data UI.

Member

Why recommend using a single expression instead of two

  • Ease of sharing and dissemination: much like a PromQL threshold expression, an expression such as {level="error"} |~ ".*(timeout|exception).*", or a SQL statement, a single expression is very convenient to share and propagate.
  • Unified design: it stays consistent with the existing expression types, where the expression itself includes both the data query and the trigger condition, as in PromQL, Loki, and ELK.
  • Future unification of expressions: we could define our own unified expression rules for external use and implement them internally on top of different backends.
  • AI can better understand and generate expressions through a single expr.

Comment on lines +71 to +86

public void reduceAndSendAlarmGroup(Map<String, String> groupLabels, List<SingleAlert> alerts) {
    workerExecutor.execute(() -> {
        try {
            // Generate alert fingerprint
            for (SingleAlert alert : alerts) {
                String fingerprint = generateAlertFingerprint(alert.getLabels());
                alert.setFingerprint(fingerprint);
            }
            // Process the group alert
            alarmGroupReduce.processGroupAlert(groupLabels, alerts);
        } catch (Exception e) {
            log.error("Reduce alarm group failed: {}", e.getMessage());
        }
    });
}
Member

Here we handle group alerts separately with special processing, but the reduce stage was designed for all alerts, whether from the internal system or external systems. I think we can keep it for now, since no new database fields have been added.

@Duansg
Contributor

Duansg commented Aug 26, 2025

Hi, @bigcyy I pulled your PR and gave it a quick spin. Overall, aside from some UI interactions that need polishing, it's pretty well-rounded. I'd be happy to be the first user after this PR merges, hahaha. 👍

Maybe we can even build this feature together :)

Here are some of my questions and suggestions:

  1. Without Greptime enabled, can the Log Manage list page display a user-friendly prompt instead of an error message?
  2. In the Threshold configuration, the Alarm Content now displays in a single column. Was this intentional?
  3. In the Periodic Threshold configuration, do both metrics and logs now support PromQL and SQL? I noticed that SQL queries for metrics failed to display properly.
  4. Could we consider initializing the Log Stream in a paused state? When resume is triggered, it would then establish the connection. Additionally, should we make the automatic refresh of input parameters take effect only on a manual pause/resume? Currently, while I am typing filter parameters, if the SSE resets before I finish entering them, I lose the reference logs.
