Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1873013

Summary: Alert `ElasticsearchNodeDiskWatermarkReached` couldn't become Pending/Firing.
Product: OpenShift Container Platform
Reporter: Qiaoling Tang <qitang>
Component: Logging
Assignee: Brett Jones <brejones>
Status: CLOSED ERRATA
QA Contact: Qiaoling Tang <qitang>
Severity: high
Docs Contact:
Priority: unspecified
Version: 4.6
CC: aos-bugs, brejones, jcantril
Target Milestone: ---
Target Release: 4.6.0
Hardware: Unspecified
OS: Unspecified
Whiteboard: logging-exploration
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2020-10-27 15:10:28 UTC
Type: Bug

Description Qiaoling Tang 2020-08-27 07:23:49 UTC
Description of problem:
The three `ElasticsearchNodeDiskWatermarkReached` alerts never become Pending or Firing:

    - alert: ElasticsearchNodeDiskWatermarkReached
      annotations:
        message: Disk Low Watermark Reached at {{ $labels.node }} node in {{ $labels.cluster }} cluster. Shards can not be allocated to this node anymore. You should consider adding more disk to the node.
        summary: Disk Low Watermark Reached - disk saturation is {{ $value }}%
      expr: |
        sum by (cluster, instance, node) (
          round(
            (1 - (
              es_fs_path_available_bytes /
              es_fs_path_total_bytes
            )
          ) * 100, 0.001)
        ) > es_cluster_routing_allocation_disk_watermark_low_pct
      for: 5m
      labels:
        severity: info
    - alert: ElasticsearchNodeDiskWatermarkReached
      annotations:
        message: Disk High Watermark Reached at {{ $labels.node }} node in {{ $labels.cluster }} cluster. Some shards will be re-allocated to different nodes if possible. Make sure more disk space is added to the node or drop old indices allocated to this node.
        summary: Disk High Watermark Reached - disk saturation is {{ $value }}%
      expr: |
        sum by (cluster, instance, node) (
          round(
            (1 - (
              es_fs_path_available_bytes /
              es_fs_path_total_bytes
            )
          ) * 100, 0.001)
        ) > es_cluster_routing_allocation_disk_watermark_high_pct
      for: 5m
      labels:
        severity: warning
    - alert: ElasticsearchNodeDiskWatermarkReached
      annotations:
        message: Disk Flood Stage Watermark Reached at {{ $labels.node }} node in {{ $labels.cluster }} cluster. Every index having a shard allocated on this node is enforced a read-only block. The index block is automatically released when the disk utilization falls below the high watermark.
        summary: Disk Flood Stage Watermark Reached - disk saturation is {{ $value }}%
      expr: |
        sum by (cluster, instance, node) (
          round(
            (1 - (
              es_fs_path_available_bytes /
              es_fs_path_total_bytes
            )
          ) * 100, 0.001)
        ) > es_cluster_routing_allocation_disk_watermark_flood_stage_pct
      for: 5m
      labels:
        severity: critical

Executing `sum by(cluster, instance, node) (round((1 - (es_fs_path_available_bytes / es_fs_path_total_bytes)) * 100, 0.001)) > es_cluster_routing_allocation_disk_watermark_flood_stage_pct` on the Prometheus console returns an empty result.

If I instead execute `sum by(cluster, instance) (round((1 - (es_fs_path_available_bytes / es_fs_path_total_bytes)) * 100, 0.001)) > sum by(cluster, instance) (es_cluster_routing_allocation_disk_watermark_low_pct)`, it returns:
Element	Value
{cluster="elasticsearch",instance="10.129.2.49:60001"}	93.09

Here are the metrics grabbed from Prometheus console:
sum by(cluster, instance, node) (round((1 - (es_fs_path_available_bytes / es_fs_path_total_bytes)) * 100, 0.001)):
Element	Value
{cluster="elasticsearch",instance="10.128.2.21:60001",node="elasticsearch-cdm-c9mvscsg-3"}	79.078
{cluster="elasticsearch",instance="10.129.2.49:60001",node="elasticsearch-cdm-c9mvscsg-1"}	93.085
{cluster="elasticsearch",instance="10.131.0.20:60001",node="elasticsearch-cdm-c9mvscsg-2"}	79.074

es_cluster_routing_allocation_disk_watermark_low_pct does not have the node label that the above metrics have:
es_cluster_routing_allocation_disk_watermark_low_pct:
Element	Value
es_cluster_routing_allocation_disk_watermark_low_pct{cluster="elasticsearch",endpoint="elasticsearch",instance="10.128.2.21:60001",job="elasticsearch-metrics",namespace="openshift-logging",pod="elasticsearch-cdm-c9mvscsg-3-57b8f745f8-fzfz8",service="elasticsearch-metrics"}	85
es_cluster_routing_allocation_disk_watermark_low_pct{cluster="elasticsearch",endpoint="elasticsearch",instance="10.129.2.49:60001",job="elasticsearch-metrics",namespace="openshift-logging",pod="elasticsearch-cdm-c9mvscsg-1-6c6b967c48-qlzv5",service="elasticsearch-metrics"}	85
es_cluster_routing_allocation_disk_watermark_low_pct{cluster="elasticsearch",endpoint="elasticsearch",instance="10.131.0.20:60001",job="elasticsearch-metrics",namespace="openshift-logging",pod="elasticsearch-cdm-c9mvscsg-2-79f66679bc-dhl5w",service="elasticsearch-metrics"}	85
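This label mismatch is why the alert expression is always empty: with PromQL's default one-to-one vector matching, the > comparison only keeps pairs whose label sets are identical, and the left side (cluster, instance, node) can never equal the right side (cluster, endpoint, instance, job, namespace, pod, service). The aggregated query above works because it reduces both sides to the same (cluster, instance) label set. A minimal sketch of a rewritten expression that restricts the match to the shared labels is shown below; this is only an illustration of the vector-matching fix, and the exact expression shipped by the elasticsearch-operator may differ:

    # Sketch only: compare on the labels both vectors share (cluster, instance);
    # group_left() keeps the per-node labels from the left-hand side in the result.
    sum by (cluster, instance, node) (
      round((1 - (es_fs_path_available_bytes / es_fs_path_total_bytes)) * 100, 0.001)
    ) > on (cluster, instance) group_left()
      es_cluster_routing_allocation_disk_watermark_low_pct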



Version-Release number of selected component (if applicable):
elasticsearch-operator.4.6.0-202008261930.p0 

How reproducible:
Always

Steps to Reproduce:
1. deploy logging
2. create some files in the ES pod to make the disk usage > 85% (a way to verify this is sketched after these steps)
oc exec $es-pod -- dd if=/dev/urandom of=/elasticsearch/persistent/file.txt bs=1048576 count=5000
3. check alerts on Prometheus console
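To confirm that step 2 actually pushed disk usage past the 85% low watermark before looking at the console, something like the following can be used (a sketch: it assumes `df` is present in the image and that the Elasticsearch container in the pod is named `elasticsearch`):

    # find the ES data pods (names follow the elasticsearch-cdm-* pattern seen above)
    oc -n openshift-logging get pods | grep elasticsearch-cdm
    # check filesystem usage on the data path inside one of them
    oc -n openshift-logging exec <es-pod> -c elasticsearch -- df -h /elasticsearch/persistent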

Actual results:
The three alerts never appear as Pending or Firing on the Prometheus console, even though disk usage on the Elasticsearch nodes is above the 85% low watermark.

Expected results:
The alerts should become Pending and then Firing once disk usage on an Elasticsearch node crosses the corresponding watermark.

Additional info:

Comment 3 Qiaoling Tang 2020-09-02 00:57:19 UTC
Verified with elasticsearch-operator.4.6.0-202008312113.p0

Comment 5 errata-xmlrpc 2020-10-27 15:10:28 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6.1 extras update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4198