Bug 1883719 - Alert ElasticsearchBulkRequestsRejectionJumps never goes pending/firing because there is no `bulk` thread pool.
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Logging
Version: 4.6
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: 4.7.0
Assignee: Lukas Vlcek
QA Contact: Qiaoling Tang
URL:
Whiteboard: logging-exploration
Depends On:
Blocks: 1888393
 
Reported: 2020-09-30 01:12 UTC by Qiaoling Tang
Modified: 2021-02-24 11:21 UTC
CC: 5 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-02-24 11:21:18 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Github openshift elasticsearch-operator pull 511 0 None closed Bug 1883719: Bulk statistics are recorded under index threadpool 2021-02-19 09:45:33 UTC
Github openshift elasticsearch-operator pull 533 0 None closed Bug 1883719: Bulk statistics are recorded under write threadpool 2021-02-19 09:45:33 UTC
Red Hat Product Errata RHBA-2021:0652 0 None None None 2021-02-24 11:21:52 UTC

Description Qiaoling Tang 2020-09-30 01:12:12 UTC
Description of problem:
ElasticsearchBulkRequestsRejectionJumps never gets pending/firing.

    - alert: ElasticsearchBulkRequestsRejectionJumps
      annotations:
        message: High Bulk Rejection Ratio at {{ $labels.node }} node in {{ $labels.cluster }} cluster. This node may not be keeping up with the indexing speed.
        summary: High Bulk Rejection Ratio - {{ $value }}%
      expr: |
        round( bulk:reject_ratio:rate2m * 100, 0.001 ) > 5
      for: 10m
      labels:
        severity: warning

  - name: logging_elasticsearch.rules
    rules:
    - expr: |
        rate(es_threadpool_threads_count{name="bulk", type="rejected"}[2m])
      record: bulk:rejected_requests:rate2m
    - expr: |
        rate(es_threadpool_threads_count{name="bulk", type="completed"}[2m])
      record: bulk:completed_requests:rate2m
    - expr: |
        sum by (cluster, instance, node) (bulk:rejected_requests:rate2m) / on (cluster, instance, node) (bulk:completed_requests:rate2m)
      record: bulk:reject_ratio:rate2m
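The failure mode can be sketched in a few lines: when a recording rule's selector matches no series, the rule records nothing, so an alert built on the empty result can never go pending. A minimal Python illustration of the label matching (hypothetical helper, not the real Prometheus evaluator):

```python
# Hypothetical illustration of a Prometheus label matcher over series labels.
# ES 6.x exposes thread pools such as "write" and "index", but no "bulk",
# so the recording rules' selectors match nothing.

# Thread pool names actually present in the ES 6.x metric dump
scraped_pools = {"write", "index", "search", "get", "refresh", "flush"}

def select(pools, name):
    """Mimic the matcher {name="<name>"}: keep only matching series."""
    return {p for p in pools if p == name}

# The recording rules select name="bulk" -> empty result
bulk_series = select(scraped_pools, "bulk")
print(bulk_series)           # set(): rate() over no series records nothing
print(len(bulk_series) > 0)  # False: the alert expr has no samples to exceed 5
```

Because `bulk:reject_ratio:rate2m` therefore contains no samples, the comparison `> 5` in the alert expression is never evaluated against anything.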

There are no `es_threadpool_threads_count{name="bulk", type="rejected"}` or `es_threadpool_threads_count{name="bulk", type="completed"}` metrics in ES 6.x:
$ oc exec elasticsearch-cdm-bi4udg5u-1-7fbdf69798-ddr9h -- es_util --query=_prometheus/metrics |grep es_threadpool_threads_count
Defaulting container name to elasticsearch.
Use 'oc describe pod/elasticsearch-cdm-bi4udg5u-1-7fbdf69798-ddr9h -n openshift-logging' to see all of the containers in this pod.
# HELP es_threadpool_threads_count Count of threads in thread pool
# TYPE es_threadpool_threads_count gauge
es_threadpool_threads_count{cluster="elasticsearch",node="elasticsearch-cdm-bi4udg5u-1",nodeid="Yap68zj5Th-0fBni4Kbyrg",name="fetch_shard_started",type="rejected",} 0.0
es_threadpool_threads_count{cluster="elasticsearch",node="elasticsearch-cdm-bi4udg5u-1",nodeid="Yap68zj5Th-0fBni4Kbyrg",name="snapshot",type="completed",} 0.0
es_threadpool_threads_count{cluster="elasticsearch",node="elasticsearch-cdm-bi4udg5u-1",nodeid="Yap68zj5Th-0fBni4Kbyrg",name="analyze",type="rejected",} 0.0
es_threadpool_threads_count{cluster="elasticsearch",node="elasticsearch-cdm-bi4udg5u-1",nodeid="Yap68zj5Th-0fBni4Kbyrg",name="management",type="rejected",} 0.0
es_threadpool_threads_count{cluster="elasticsearch",node="elasticsearch-cdm-bi4udg5u-1",nodeid="Yap68zj5Th-0fBni4Kbyrg",name="fetch_shard_store",type="completed",} 138.0
es_threadpool_threads_count{cluster="elasticsearch",node="elasticsearch-cdm-bi4udg5u-1",nodeid="Yap68zj5Th-0fBni4Kbyrg",name="search",type="rejected",} 0.0
es_threadpool_threads_count{cluster="elasticsearch",node="elasticsearch-cdm-bi4udg5u-1",nodeid="Yap68zj5Th-0fBni4Kbyrg",name="index",type="rejected",} 0.0
es_threadpool_threads_count{cluster="elasticsearch",node="elasticsearch-cdm-bi4udg5u-1",nodeid="Yap68zj5Th-0fBni4Kbyrg",name="get",type="rejected",} 0.0
es_threadpool_threads_count{cluster="elasticsearch",node="elasticsearch-cdm-bi4udg5u-1",nodeid="Yap68zj5Th-0fBni4Kbyrg",name="generic",type="completed",} 95669.0
es_threadpool_threads_count{cluster="elasticsearch",node="elasticsearch-cdm-bi4udg5u-1",nodeid="Yap68zj5Th-0fBni4Kbyrg",name="search_throttled",type="completed",} 0.0
es_threadpool_threads_count{cluster="elasticsearch",node="elasticsearch-cdm-bi4udg5u-1",nodeid="Yap68zj5Th-0fBni4Kbyrg",name="force_merge",type="rejected",} 0.0
es_threadpool_threads_count{cluster="elasticsearch",node="elasticsearch-cdm-bi4udg5u-1",nodeid="Yap68zj5Th-0fBni4Kbyrg",name="warmer",type="completed",} 345521.0
es_threadpool_threads_count{cluster="elasticsearch",node="elasticsearch-cdm-bi4udg5u-1",nodeid="Yap68zj5Th-0fBni4Kbyrg",name="listener",type="completed",} 0.0
es_threadpool_threads_count{cluster="elasticsearch",node="elasticsearch-cdm-bi4udg5u-1",nodeid="Yap68zj5Th-0fBni4Kbyrg",name="refresh",type="completed",} 694597.0
es_threadpool_threads_count{cluster="elasticsearch",node="elasticsearch-cdm-bi4udg5u-1",nodeid="Yap68zj5Th-0fBni4Kbyrg",name="write",type="rejected",} 0.0
es_threadpool_threads_count{cluster="elasticsearch",node="elasticsearch-cdm-bi4udg5u-1",nodeid="Yap68zj5Th-0fBni4Kbyrg",name="analyze",type="completed",} 0.0
es_threadpool_threads_count{cluster="elasticsearch",node="elasticsearch-cdm-bi4udg5u-1",nodeid="Yap68zj5Th-0fBni4Kbyrg",name="index",type="completed",} 0.0
es_threadpool_threads_count{cluster="elasticsearch",node="elasticsearch-cdm-bi4udg5u-1",nodeid="Yap68zj5Th-0fBni4Kbyrg",name="warmer",type="rejected",} 0.0
es_threadpool_threads_count{cluster="elasticsearch",node="elasticsearch-cdm-bi4udg5u-1",nodeid="Yap68zj5Th-0fBni4Kbyrg",name="write",type="completed",} 1983067.0
es_threadpool_threads_count{cluster="elasticsearch",node="elasticsearch-cdm-bi4udg5u-1",nodeid="Yap68zj5Th-0fBni4Kbyrg",name="get",type="completed",} 0.0
es_threadpool_threads_count{cluster="elasticsearch",node="elasticsearch-cdm-bi4udg5u-1",nodeid="Yap68zj5Th-0fBni4Kbyrg",name="refresh",type="rejected",} 0.0
es_threadpool_threads_count{cluster="elasticsearch",node="elasticsearch-cdm-bi4udg5u-1",nodeid="Yap68zj5Th-0fBni4Kbyrg",name="fetch_shard_started",type="completed",} 62.0
es_threadpool_threads_count{cluster="elasticsearch",node="elasticsearch-cdm-bi4udg5u-1",nodeid="Yap68zj5Th-0fBni4Kbyrg",name="flush",type="completed",} 913.0
es_threadpool_threads_count{cluster="elasticsearch",node="elasticsearch-cdm-bi4udg5u-1",nodeid="Yap68zj5Th-0fBni4Kbyrg",name="snapshot",type="rejected",} 0.0
es_threadpool_threads_count{cluster="elasticsearch",node="elasticsearch-cdm-bi4udg5u-1",nodeid="Yap68zj5Th-0fBni4Kbyrg",name="search",type="completed",} 0.0
es_threadpool_threads_count{cluster="elasticsearch",node="elasticsearch-cdm-bi4udg5u-1",nodeid="Yap68zj5Th-0fBni4Kbyrg",name="management",type="completed",} 1866431.0
es_threadpool_threads_count{cluster="elasticsearch",node="elasticsearch-cdm-bi4udg5u-1",nodeid="Yap68zj5Th-0fBni4Kbyrg",name="listener",type="rejected",} 0.0
es_threadpool_threads_count{cluster="elasticsearch",node="elasticsearch-cdm-bi4udg5u-1",nodeid="Yap68zj5Th-0fBni4Kbyrg",name="search_throttled",type="rejected",} 0.0
es_threadpool_threads_count{cluster="elasticsearch",node="elasticsearch-cdm-bi4udg5u-1",nodeid="Yap68zj5Th-0fBni4Kbyrg",name="force_merge",type="completed",} 0.0
es_threadpool_threads_count{cluster="elasticsearch",node="elasticsearch-cdm-bi4udg5u-1",nodeid="Yap68zj5Th-0fBni4Kbyrg",name="flush",type="rejected",} 0.0
es_threadpool_threads_count{cluster="elasticsearch",node="elasticsearch-cdm-bi4udg5u-1",nodeid="Yap68zj5Th-0fBni4Kbyrg",name="fetch_shard_store",type="rejected",} 0.0
es_threadpool_threads_count{cluster="elasticsearch",node="elasticsearch-cdm-bi4udg5u-1",nodeid="Yap68zj5Th-0fBni4Kbyrg",name="generic",type="rejected",} 0.0
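The absence can also be checked mechanically. A self-contained sketch over a sample of the dump above (the variable is a stand-in for the live scrape; the label values are taken from the real output):

```shell
# Stand-in for the live scrape: two representative lines from the dump above.
metrics='es_threadpool_threads_count{cluster="elasticsearch",name="write",type="rejected",} 0.0
es_threadpool_threads_count{cluster="elasticsearch",name="index",type="completed",} 0.0'

# The recording rules select name="bulk"; no such series exists.
echo "$metrics" | grep -c 'name="bulk"'   # prints 0
echo "$metrics" | grep -c 'name="write"'  # prints 1
```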


Version-Release number of selected component (if applicable):
elasticsearch-operator.4.6.0-202009232303.p0 

How reproducible:
Always

Steps to Reproduce:
1. Deploy logging 4.6
2. Check the ES metrics and the Prometheus alert rule

Actual results:


Expected results:


Additional info:

Comment 1 Jeff Cantrill 2020-09-30 13:17:38 UTC
Target 4.7 as this is not a release blocker

Comment 4 Lukas Vlcek 2020-10-02 12:28:25 UTC
I created PR: https://github.com/openshift/elasticsearch-operator/pull/511

Comment 7 Qiaoling Tang 2020-10-09 03:20:42 UTC
Tested with elasticsearch-operator.4.6.0-202010081538.p0; the rules were changed to:

  - name: logging_elasticsearch.rules
    rules:
    - expr: |
        rate(es_threadpool_threads_count{name="index", type="rejected"}[2m])
      record: indexing:rejected_requests:rate2m
    - expr: |
        rate(es_threadpool_threads_count{name="index", type="completed"}[2m])
      record: indexing:completed_requests:rate2m
    - expr: |
        sum by (cluster, instance, node) (indexing:rejected_requests:rate2m) / on (cluster, instance, node) (indexing:completed_requests:rate2m)
      record: indexing:reject_ratio:rate2m

Hi @Lukas, I have a couple of questions:

Why do we use `es_threadpool_threads_count{name="index", type="rejected"}` and `es_threadpool_threads_count{name="index", type="completed"}` here?

I checked the Elasticsearch doc (https://www.elastic.co/guide/en/elasticsearch/reference/6.8/modules-threadpool.html), which says: `index For index/delete operations. Thread pool type is fixed with a size of # of available processors, queue_size of 200. The maximum size for this pool is 1 + # of available processors.`

Also, in the PR https://github.com/elastic/elasticsearch/pull/29593, it seems the `bulk` thread pool was renamed to the `write` thread pool.

Am I misunderstanding?

Comment 9 Lukas Vlcek 2020-10-13 14:16:19 UTC
You are absolutely correct, @Qiaoling Tang.
Great catch!

I pushed a new PR to fix this https://github.com/openshift/elasticsearch-operator/pull/533
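For reference, a sketch of what the corrected recording rules would look like after switching to the `write` pool (hypothetical rendering, including the record names; the authoritative change is in PR 533 itself):

```yaml
- name: logging_elasticsearch.rules
  rules:
  - expr: |
      rate(es_threadpool_threads_count{name="write", type="rejected"}[2m])
    record: writing:rejected_requests:rate2m
  - expr: |
      rate(es_threadpool_threads_count{name="write", type="completed"}[2m])
    record: writing:completed_requests:rate2m
  - expr: |
      sum by (cluster, instance, node) (writing:rejected_requests:rate2m) / on (cluster, instance, node) (writing:completed_requests:rate2m)
    record: writing:reject_ratio:rate2m
```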

Comment 11 Qiaoling Tang 2020-10-15 03:07:35 UTC
Verified with quay.io/openshift/origin-elasticsearch-operator@sha256:7cc642127d98d7e63baa5a7ae61a31c87e27f16c8bf0f84387852eb9032e526c

Comment 16 errata-xmlrpc 2021-02-24 11:21:18 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Errata Advisory for Openshift Logging 5.0.0), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:0652

