Bug 1873013
| Summary: | Alert `ElasticsearchNodeDiskWatermarkReached` couldn't become Pending/Firing. | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Qiaoling Tang <qitang> |
| Component: | Logging | Assignee: | Brett Jones <brejones> |
| Status: | CLOSED ERRATA | QA Contact: | Qiaoling Tang <qitang> |
| Severity: | high | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 4.6 | CC: | aos-bugs, brejones, jcantril |
| Target Milestone: | --- | | |
| Target Release: | 4.6.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | logging-exploration | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2020-10-27 15:10:28 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Verified with elasticsearch-operator.4.6.0-202008312113.p0

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6.1 extras update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4198
Description of problem:

The three `ElasticsearchNodeDiskWatermarkReached` alerts could not become Pending/Firing:

```yaml
- alert: ElasticsearchNodeDiskWatermarkReached
  annotations:
    message: Disk Low Watermark Reached at {{ $labels.node }} node in {{ $labels.cluster }}
      cluster. Shards can not be allocated to this node anymore. You should consider
      adding more disk to the node.
    summary: Disk Low Watermark Reached - disk saturation is {{ $value }}%
  expr: |
    sum by (cluster, instance, node) (
      round((1 - (es_fs_path_available_bytes / es_fs_path_total_bytes)) * 100, 0.001)
    ) > es_cluster_routing_allocation_disk_watermark_low_pct
  for: 5m
  labels:
    severity: info
- alert: ElasticsearchNodeDiskWatermarkReached
  annotations:
    message: Disk High Watermark Reached at {{ $labels.node }} node in {{ $labels.cluster }}
      cluster. Some shards will be re-allocated to different nodes if possible. Make sure
      more disk space is added to the node or drop old indices allocated to this node.
    summary: Disk High Watermark Reached - disk saturation is {{ $value }}%
  expr: |
    sum by (cluster, instance, node) (
      round((1 - (es_fs_path_available_bytes / es_fs_path_total_bytes)) * 100, 0.001)
    ) > es_cluster_routing_allocation_disk_watermark_high_pct
  for: 5m
  labels:
    severity: warning
- alert: ElasticsearchNodeDiskWatermarkReached
  annotations:
    message: Disk Flood Stage Watermark Reached at {{ $labels.node }} node in {{ $labels.cluster }}
      cluster. Every index having a shard allocated on this node is enforced a read-only
      block. The index block is automatically released when the disk utilization falls
      below the high watermark.
    summary: Disk Flood Stage Watermark Reached - disk saturation is {{ $value }}%
  expr: |
    sum by (cluster, instance, node) (
      round((1 - (es_fs_path_available_bytes / es_fs_path_total_bytes)) * 100, 0.001)
    ) > es_cluster_routing_allocation_disk_watermark_flood_stage_pct
  for: 5m
  labels:
    severity: critical
```

Executing `sum by(cluster, instance, node) (round((1 - (es_fs_path_available_bytes / es_fs_path_total_bytes)) * 100, 0.001)) > es_cluster_routing_allocation_disk_watermark_flood_stage_pct` on the Prometheus console returns an empty result.
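The empty result follows from PromQL's default vector matching: a comparison between two instant vectors keeps only sample pairs whose label sets (ignoring the metric name) are identical on both sides. Here the two sides carry different label sets, so nothing ever pairs up. The annotated low-watermark comparison below spells this out, using label values taken from the metrics listed further down:

```promql
# Left-hand side: one sample per ES node with labels (cluster, instance, node), e.g.
#   {cluster="elasticsearch", instance="10.129.2.49:60001", node="elasticsearch-cdm-c9mvscsg-1"}  93.085
# Right-hand side: the watermark metric has no `node` label but does carry
# `endpoint`, `job`, `namespace`, `pod`, and `service`. Under default
# one-to-one matching no left sample finds a partner, so the result is
# empty and the alert can never enter Pending/Firing.
sum by (cluster, instance, node) (
  round((1 - (es_fs_path_available_bytes / es_fs_path_total_bytes)) * 100, 0.001)
) > es_cluster_routing_allocation_disk_watermark_low_pct
```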
However, if the `node` label is dropped from both sides, e.g. `sum by(cluster, instance) (round((1 - (es_fs_path_available_bytes / es_fs_path_total_bytes)) * 100, 0.001)) > sum by(cluster, instance) (es_cluster_routing_allocation_disk_watermark_low_pct)`, the comparison does return a result:

```
Element                                                    Value
{cluster="elasticsearch",instance="10.129.2.49:60001"}     93.09
```

Here are the metrics grabbed from the Prometheus console.

`sum by(cluster, instance, node) (round((1 - (es_fs_path_available_bytes / es_fs_path_total_bytes)) * 100, 0.001))`:

```
Element                                                                                       Value
{cluster="elasticsearch",instance="10.128.2.21:60001",node="elasticsearch-cdm-c9mvscsg-3"}    79.078
{cluster="elasticsearch",instance="10.129.2.49:60001",node="elasticsearch-cdm-c9mvscsg-1"}    93.085
{cluster="elasticsearch",instance="10.131.0.20:60001",node="elasticsearch-cdm-c9mvscsg-2"}    79.074
```

Unlike the metrics above, `es_cluster_routing_allocation_disk_watermark_low_pct` does not have a `node` label:

```
Element                                                                                                                                                                                                                                                            Value
es_cluster_routing_allocation_disk_watermark_low_pct{cluster="elasticsearch",endpoint="elasticsearch",instance="10.128.2.21:60001",job="elasticsearch-metrics",namespace="openshift-logging",pod="elasticsearch-cdm-c9mvscsg-3-57b8f745f8-fzfz8",service="elasticsearch-metrics"}    85
es_cluster_routing_allocation_disk_watermark_low_pct{cluster="elasticsearch",endpoint="elasticsearch",instance="10.129.2.49:60001",job="elasticsearch-metrics",namespace="openshift-logging",pod="elasticsearch-cdm-c9mvscsg-1-6c6b967c48-qlzv5",service="elasticsearch-metrics"}    85
es_cluster_routing_allocation_disk_watermark_low_pct{cluster="elasticsearch",endpoint="elasticsearch",instance="10.131.0.20:60001",job="elasticsearch-metrics",namespace="openshift-logging",pod="elasticsearch-cdm-c9mvscsg-2-79f66679bc-dhl5w",service="elasticsearch-metrics"}    85
```

Version-Release number of selected component (if applicable):
elasticsearch-operator.4.6.0-202008261930.p0

How reproducible:
Always

Steps to Reproduce:
1. Deploy logging.
2. Create some files in the ES pod to push the disk usage above 85%:
   `oc exec $es-pod -- dd if=/dev/urandom of=/elasticsearch/persistent/file.txt bs=1048576 count=5000`
3. Check the alerts on the Prometheus console.

Actual results:
The `ElasticsearchNodeDiskWatermarkReached` alerts never become Pending/Firing, even though disk usage on one node exceeds the 85% low watermark.

Expected results:
The alerts become Pending and then Firing once a node's disk usage exceeds the corresponding watermark.

Additional info:
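One way to make the comparison match while still keeping the `node` label that the alert annotations interpolate via `{{ $labels.node }}` is to restrict matching to the labels the two sides share. A minimal sketch for the low-watermark rule (illustrative only; not necessarily the exact expression that shipped with the fix):

```promql
sum by (cluster, instance, node) (
  round((1 - (es_fs_path_available_bytes / es_fs_path_total_bytes)) * 100, 0.001)
)
# Match only on the shared labels; group_left keeps the extra `node`
# label from the left-hand side in the result vector.
> on (cluster, instance) group_left
es_cluster_routing_allocation_disk_watermark_low_pct
```

The same rewrite would apply to the high and flood-stage variants by swapping in the corresponding right-hand metric.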