Bug 1883719
| Summary: | Alert ElasticsearchBulkRequestsRejectionJumps never gets pending/firing due to there is no `bulk` thread pool. | ||
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Qiaoling Tang <qitang> |
| Component: | Logging | Assignee: | Lukas Vlcek <lvlcek> |
| Status: | CLOSED ERRATA | QA Contact: | Qiaoling Tang <qitang> |
| Severity: | medium | Docs Contact: | |
| Priority: | unspecified | ||
| Version: | 4.6 | CC: | aos-bugs, brejones, jcantril, lvlcek, periklis |
| Target Milestone: | --- | ||
| Target Release: | 4.7.0 | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | logging-exploration | ||
| Fixed In Version: | Doc Type: | If docs needed, set a value | |
| Doc Text: | Story Points: | --- | |
| Clone Of: | Environment: | ||
| Last Closed: | 2021-02-24 11:21:18 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | Category: | --- | |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
| Bug Depends On: | |||
| Bug Blocks: | 1888393 | ||
Target 4.7 as this is not a release blocker Bulk statistics is part of `index` threadpool now. See https://www.elastic.co/guide/en/elasticsearch/reference/6.8/modules-threadpool.html We need to update both the rules and alerts: - https://github.com/openshift/elasticsearch-operator/blob/master/files/prometheus_rules.yml - https://github.com/openshift/elasticsearch-operator/blob/master/files/prometheus_rules.yml Ooops... We need to update both the rules and alerts: - https://github.com/openshift/elasticsearch-operator/blob/master/files/prometheus_rules.yml - https://github.com/openshift/elasticsearch-operator/blob/master/files/prometheus_alerts.yml Tested with elasticsearch-operator.4.6.0-202010081538.p0, the rules are changed to:
- name: logging_elasticsearch.rules
rules:
- expr: |
rate(es_threadpool_threads_count{name="index", type="rejected"}[2m])
record: indexing:rejected_requests:rate2m
- expr: |
rate(es_threadpool_threads_count{name="index", type="completed"}[2m])
record: indexing:completed_requests:rate2m
- expr: |
sum by (cluster, instance, node) (indexing:rejected_requests:rate2m) / on (cluster, instance, node) (indexing:completed_requests:rate2m)
record: indexing:reject_ratio:rate2m
Hi @Lukas, I have some doubts:
why do we use `es_threadpool_threads_count{name="index", type="rejected"}` and `es_threadpool_threads_count{name="index", type="completed"}` here?
I checked the elasticsearch doc(https://www.elastic.co/guide/en/elasticsearch/reference/6.8/modules-threadpool.html), it says `index For index/delete operations. Thread pool type is fixed with a size of # of available processors, queue_size of 200. The maximum size for this pool is 1 + # of available processors.`
Besides, in the pr https://github.com/elastic/elasticsearch/pull/29593, it seems the bulk thread pool has been renamed to write thread pool.
Am I misunderstanding?
You are absolutely correct @Qiaoling Tang Great catch !!! I pushed a new PR to fix this https://github.com/openshift/elasticsearch-operator/pull/533 Verified with quay.io/openshift/origin-elasticsearch-operator@sha256:7cc642127d98d7e63baa5a7ae61a31c87e27f16c8bf0f84387852eb9032e526c Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Errata Advisory for Openshift Logging 5.0.0), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2021:0652 |
Description of problem: ElasticsearchBulkRequestsRejectionJumps never gets pending/firing. - alert: ElasticsearchBulkRequestsRejectionJumps annotations: message: High Bulk Rejection Ratio at {{ $labels.node }} node in {{ $labels.cluster }} cluster. This node may not be keeping up with the indexing speed. summary: High Bulk Rejection Ratio - {{ $value }}% expr: | round( bulk:reject_ratio:rate2m * 100, 0.001 ) > 5 for: 10m labels: severity: warning - name: logging_elasticsearch.rules rules: - expr: | rate(es_threadpool_threads_count{name="bulk", type="rejected"}[2m]) record: bulk:rejected_requests:rate2m - expr: | rate(es_threadpool_threads_count{name="bulk", type="completed"}[2m]) record: bulk:completed_requests:rate2m - expr: | sum by (cluster, instance, node) (bulk:rejected_requests:rate2m) / on (cluster, instance, node) (bulk:completed_requests:rate2m) record: bulk:reject_ratio:rate2m There are no metrics `es_threadpool_threads_count{name="bulk", type="rejected"}` and `es_threadpool_threads_count{name="bulk", type="completed"}` in ES 6.x $ oc exec elasticsearch-cdm-bi4udg5u-1-7fbdf69798-ddr9h -- es_util --query=_prometheus/metrics |grep es_threadpool_threads_count Defaulting container name to elasticsearch. Use 'oc describe pod/elasticsearch-cdm-bi4udg5u-1-7fbdf69798-ddr9h -n openshift-logging' to see all of the containers in this pod. # HELP es_threadpool_threads_count Count of threads in thread pool # TYPE es_threadpool_threads_count gauge es_threadpool_threads_count{cluster="elasticsearch",node="elasticsearch-cdm-bi4udg5u-1",nodeid="Yap68zj5Th-0fBni4Kbyrg",name="fetch_shard_started",type="rejected",} 0.0 es_threadpool_threads_count{cluster="elasticsearch",node="elasticsearch-cdm-bi4udg5u-1",nodeid="Yap68zj5Th-0fBni4Kbyrg",name="snapshot",type="completed",} 0.0 es_threadpool_threads_count{cluster="elasticsearch",node="elasticsearch-cdm-bi4udg5u-1",nodeid="Yap68zj5Th-0fBni4Kbyrg",name="analyze",type="rejected",} 0.0 es_threadpool_threads_count{cluster="elasticsearch",node="elasticsearch-cdm-bi4udg5u-1",nodeid="Yap68zj5Th-0fBni4Kbyrg",name="management",type="rejected",} 0.0 es_threadpool_threads_count{cluster="elasticsearch",node="elasticsearch-cdm-bi4udg5u-1",nodeid="Yap68zj5Th-0fBni4Kbyrg",name="fetch_shard_store",type="completed",} 138.0 es_threadpool_threads_count{cluster="elasticsearch",node="elasticsearch-cdm-bi4udg5u-1",nodeid="Yap68zj5Th-0fBni4Kbyrg",name="search",type="rejected",} 0.0 es_threadpool_threads_count{cluster="elasticsearch",node="elasticsearch-cdm-bi4udg5u-1",nodeid="Yap68zj5Th-0fBni4Kbyrg",name="index",type="rejected",} 0.0 es_threadpool_threads_count{cluster="elasticsearch",node="elasticsearch-cdm-bi4udg5u-1",nodeid="Yap68zj5Th-0fBni4Kbyrg",name="get",type="rejected",} 0.0 es_threadpool_threads_count{cluster="elasticsearch",node="elasticsearch-cdm-bi4udg5u-1",nodeid="Yap68zj5Th-0fBni4Kbyrg",name="generic",type="completed",} 95669.0 es_threadpool_threads_count{cluster="elasticsearch",node="elasticsearch-cdm-bi4udg5u-1",nodeid="Yap68zj5Th-0fBni4Kbyrg",name="search_throttled",type="completed",} 0.0 es_threadpool_threads_count{cluster="elasticsearch",node="elasticsearch-cdm-bi4udg5u-1",nodeid="Yap68zj5Th-0fBni4Kbyrg",name="force_merge",type="rejected",} 0.0 es_threadpool_threads_count{cluster="elasticsearch",node="elasticsearch-cdm-bi4udg5u-1",nodeid="Yap68zj5Th-0fBni4Kbyrg",name="warmer",type="completed",} 345521.0 es_threadpool_threads_count{cluster="elasticsearch",node="elasticsearch-cdm-bi4udg5u-1",nodeid="Yap68zj5Th-0fBni4Kbyrg",name="listener",type="completed",} 0.0 es_threadpool_threads_count{cluster="elasticsearch",node="elasticsearch-cdm-bi4udg5u-1",nodeid="Yap68zj5Th-0fBni4Kbyrg",name="refresh",type="completed",} 694597.0 es_threadpool_threads_count{cluster="elasticsearch",node="elasticsearch-cdm-bi4udg5u-1",nodeid="Yap68zj5Th-0fBni4Kbyrg",name="write",type="rejected",} 0.0 es_threadpool_threads_count{cluster="elasticsearch",node="elasticsearch-cdm-bi4udg5u-1",nodeid="Yap68zj5Th-0fBni4Kbyrg",name="analyze",type="completed",} 0.0 es_threadpool_threads_count{cluster="elasticsearch",node="elasticsearch-cdm-bi4udg5u-1",nodeid="Yap68zj5Th-0fBni4Kbyrg",name="index",type="completed",} 0.0 es_threadpool_threads_count{cluster="elasticsearch",node="elasticsearch-cdm-bi4udg5u-1",nodeid="Yap68zj5Th-0fBni4Kbyrg",name="warmer",type="rejected",} 0.0 es_threadpool_threads_count{cluster="elasticsearch",node="elasticsearch-cdm-bi4udg5u-1",nodeid="Yap68zj5Th-0fBni4Kbyrg",name="write",type="completed",} 1983067.0 es_threadpool_threads_count{cluster="elasticsearch",node="elasticsearch-cdm-bi4udg5u-1",nodeid="Yap68zj5Th-0fBni4Kbyrg",name="get",type="completed",} 0.0 es_threadpool_threads_count{cluster="elasticsearch",node="elasticsearch-cdm-bi4udg5u-1",nodeid="Yap68zj5Th-0fBni4Kbyrg",name="refresh",type="rejected",} 0.0 es_threadpool_threads_count{cluster="elasticsearch",node="elasticsearch-cdm-bi4udg5u-1",nodeid="Yap68zj5Th-0fBni4Kbyrg",name="fetch_shard_started",type="completed",} 62.0 es_threadpool_threads_count{cluster="elasticsearch",node="elasticsearch-cdm-bi4udg5u-1",nodeid="Yap68zj5Th-0fBni4Kbyrg",name="flush",type="completed",} 913.0 es_threadpool_threads_count{cluster="elasticsearch",node="elasticsearch-cdm-bi4udg5u-1",nodeid="Yap68zj5Th-0fBni4Kbyrg",name="snapshot",type="rejected",} 0.0 es_threadpool_threads_count{cluster="elasticsearch",node="elasticsearch-cdm-bi4udg5u-1",nodeid="Yap68zj5Th-0fBni4Kbyrg",name="search",type="completed",} 0.0 es_threadpool_threads_count{cluster="elasticsearch",node="elasticsearch-cdm-bi4udg5u-1",nodeid="Yap68zj5Th-0fBni4Kbyrg",name="management",type="completed",} 1866431.0 es_threadpool_threads_count{cluster="elasticsearch",node="elasticsearch-cdm-bi4udg5u-1",nodeid="Yap68zj5Th-0fBni4Kbyrg",name="listener",type="rejected",} 0.0 es_threadpool_threads_count{cluster="elasticsearch",node="elasticsearch-cdm-bi4udg5u-1",nodeid="Yap68zj5Th-0fBni4Kbyrg",name="search_throttled",type="rejected",} 0.0 es_threadpool_threads_count{cluster="elasticsearch",node="elasticsearch-cdm-bi4udg5u-1",nodeid="Yap68zj5Th-0fBni4Kbyrg",name="force_merge",type="completed",} 0.0 es_threadpool_threads_count{cluster="elasticsearch",node="elasticsearch-cdm-bi4udg5u-1",nodeid="Yap68zj5Th-0fBni4Kbyrg",name="flush",type="rejected",} 0.0 es_threadpool_threads_count{cluster="elasticsearch",node="elasticsearch-cdm-bi4udg5u-1",nodeid="Yap68zj5Th-0fBni4Kbyrg",name="fetch_shard_store",type="rejected",} 0.0 es_threadpool_threads_count{cluster="elasticsearch",node="elasticsearch-cdm-bi4udg5u-1",nodeid="Yap68zj5Th-0fBni4Kbyrg",name="generic",type="rejected",} 0.0 Version-Release number of selected component (if applicable): elasticsearch-operator.4.6.0-202009232303.p0 How reproducible: Always Steps to Reproduce: 1. deploy logging 4.6 2. check ES metrics and prometheus alert rule 3. Actual results: Expected results: Additional info: