1949123 – Node_filesystem_usage are not being collect and it's not possible to modify the Operator Object

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Monitoring
Sub Component:
Version:	3.11.0
Hardware:	x86_64
OS:	Linux
Priority:	unspecified
Severity:	low
Target Milestone:	---
Target Release:	3.11.z
Assignee:	Damien Grisonnet
QA Contact:	Junqi Zhao
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2021-04-13 13:25 UTC by Odilon Sousa
Modified:	2024-06-14 01:13 UTC
CC List:	10 users
Fixed In Version:
Doc Type:	No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed:	2021-06-09 17:06:30 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Github	openshift cluster-monitoring-operator pull 1156	0	None	open	Bug 1949123: assets: fix instance:node_filesystem_usage:sum	2021-05-10 10:17:38 UTC
Red Hat Product Errata	RHSA-2021:2150	0	None	None	None	2021-06-09 17:06:45 UTC

Description Odilon Sousa 2021-04-13 13:25:09 UTC

Description of problem:

On some nodes, the node_filesystem_size_bytes metric for / is not collected. To work around this, we tried to change the **prometheus-k8s-rules** object to use /host/root instead of /. Even when / is collected, the size does not reflect the root mount size: it reports the rootfs size, which is not the root mount.

If we change the prometheus-k8s-rules object, it is reverted to its original state after a while.

Version-Release number of selected component (if applicable):

OpenShift 3.11
openshift3/prometheus-node-exporter:v3.11.272

How reproducible:

It can be reproduced every time.

Steps to Reproduce:
1. Change the ConfigMap using the prometheusrules object:
$ oc edit -n openshift-monitoring prometheusrules prometheus-k8s-rules

Find the line with the mountpoint rule and replace / with /host/root. The rule will then look like this:

record: instance:node_filesystem_usage:sum
expr: sum by(instance) ((node_filesystem_size{mountpoint="/host/root"} - node_filesystem_free{mountpoint="/host/root"}))
2. Check that the ConfigMap was updated:
$ oc get cm prometheus-k8s-rulefiles-0 -o yaml -n openshift-monitoring | grep -B2 "instance:node_filesystem_usage:sum"
3. After 10 minutes or less, the Operator reverts the prometheus-k8s-rules object to its original state.
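
The revert in step 3 can be watched for rather than waited on. The following is only a sketch, not something from the original report: the function names fetch_rules, rule_uses_mountpoint and wait_for_revert are hypothetical helpers, and it assumes the expr line sits within two lines of the record line in the ConfigMap output, as the grep -B2 in step 2 suggests.

```shell
# Record name taken from the steps above.
RECORD='instance:node_filesystem_usage:sum'

# Fetch the generated rule file ConfigMap (same command as step 2).
fetch_rules() {
  oc get cm prometheus-k8s-rulefiles-0 -o yaml -n openshift-monitoring
}

# Succeed if the rule text on stdin mentions mountpoint $1
# within two lines of the record line (assumption about the layout).
rule_uses_mountpoint() {
  grep -B2 -A2 -F "record: $RECORD" | grep -q -F "mountpoint=\"$1\""
}

# Poll until the edited mountpoint disappears from the rule.
# $1 = mountpoint, $2 = maximum number of polls, $3 = delay in seconds.
wait_for_revert() {
  i=0
  while [ "$i" -lt "$2" ]; do
    rules=$(fetch_rules) || return 2   # could not read the ConfigMap
    if ! printf '%s\n' "$rules" | rule_uses_mountpoint "$1"; then
      echo "reverted after $i polls"
      return 0
    fi
    i=$((i + 1))
    sleep "$3"
  done
  echo "not reverted after $2 polls"
  return 1
}
```

For example, wait_for_revert /host/root 20 60 would poll once a minute for up to 20 minutes. Note that a missing record line is also reported as a revert, so this is a rough check only.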

Actual results:

On some nodes, instance:node_filesystem_usage:sum is empty because there is no rootfs information, which leaves the OpenShift dashboard for the node empty. Even on the nodes that have the /sysroot information, the value does not reflect the actual size of the / mountpoint.

Expected results:

Either the ability to change the Operator object, or a change of the rule from rootfs to /host/root so that it really reflects the root mount size. Right now the information does not look right.

Additional info:

We noticed that the behavior depends on the kernel version: after changing the kernel version, the sysroot starts to show in the dashboard, but after changing to a newer kernel, the sysroot does not show in the output of cat /proc/1/mounts.

The rootfs shows with kernel 3.10.0-1127.19.1.el7.x86_64, and it is not present with 3.10.0-1160.2.1.el7.x86_64.
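
To compare the two kernels, the mounts file can be checked with a small script. This is a minimal sketch under assumptions: has_rootfs and mounts_for are hypothetical helper names, and it assumes the rootfs entry appears with the device name rootfs mounted at / in the usual "device mountpoint fstype options" layout of /proc/1/mounts.

```shell
# Succeed if the mounts file lists a rootfs entry mounted at /.
# $1 = mounts file (defaults to /proc/1/mounts).
has_rootfs() {
  awk '$1 == "rootfs" && $2 == "/" { found = 1 } END { exit !found }' \
    "${1:-/proc/1/mounts}"
}

# Print "device fstype" for every entry mounted at the given mountpoint.
# $1 = mountpoint, $2 = mounts file (defaults to /proc/1/mounts).
mounts_for() {
  awk -v mp="$1" '$2 == mp { print $1, $3 }' "${2:-/proc/1/mounts}"
}
```

Running has_rootfs && echo present on a node under each kernel, plus mounts_for / to see what is mounted at the root, should be enough to confirm the difference described above.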

Comment 1 Damien Grisonnet 2021-04-14 09:51:21 UTC

It's intended that your changes to the monitoring stack are not persisted, as we don't want any user to break their stack. The only way to customize the stack is by tweaking some predefined Ansible variables during installation, but that wouldn't allow you to modify Prometheus rules.

Considering your discovery, this might be caused by a regression in the kernel, but we might still be able to improve the current Prometheus rule.

From what I can see, it is not really meaningful to only consider the `/` or `/host/root` mountpoint as we want to account for all the filesystems. I'll update the recording rule to reflect that.

Comment 2 Damien Grisonnet 2021-04-15 15:38:41 UTC

We suspect that there might be something else to this bug. Could you please provide the list of mountpoints shown by the `sum(node_filesystem_size_bytes) by (mountpoint) > 0` query with both kernel versions?

Comment 14 Junqi Zhao 2021-05-26 10:56:27 UTC

Checked with ose-cluster-monitoring-operator/images/v3.11.445; the expr for "instance:node_filesystem_usage:sum" is updated:

    - expr: sum((node_filesystem_size{mountpoint="/host/root"} - node_filesystem_free{mountpoint="/host/root"})) BY (instance)
      record: instance:node_filesystem_usage:sum
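
The same check can be scripted so that only the expr for one record is printed. This is a sketch, not the verification actually used here: expr_for_record is a hypothetical helper, and it assumes a single-line expr listed directly before its record line, as in the output above.

```shell
# Print the expr that precedes the given record in rule YAML read from stdin.
# $1 = record name. Assumes a single-line expr placed before its record line.
expr_for_record() {
  awk -v rec="$1" '
    /^[ \t]*(- )?expr:/ {
      sub(/^[ \t]*(- )?expr:[ \t]*/, "")   # keep only the expression text
      last = $0
      next
    }
    /^[ \t]*(- )?record:/ {
      sub(/^[ \t]*(- )?record:[ \t]*/, "") # keep only the record name
      if ($0 == rec) print last
      last = ""
    }
  '
}
```

It could be fed from the ConfigMap, for example: oc get cm prometheus-k8s-rulefiles-0 -o yaml -n openshift-monitoring | expr_for_record instance:node_filesystem_usage:sum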

Comment 17 errata-xmlrpc 2021-06-09 17:06:30 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 3.11.452 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2150
