Description of problem:
On some nodes, the node_filesystem_size_bytes metric for / is not collected. As a workaround, we tried to change **prometheus-k8s-rules** to use /host/root instead of /. Even when / is collected, the reported size does not reflect the size of the root mount: it is the size of rootfs, which is not the root mount. Any change to the prometheus-k8s-rules object is reverted to its original state after a while.

Version-Release number of selected component (if applicable):
OpenShift 3.11
openshift3/prometheus-node-exporter:v3.11.272

How reproducible:
Every time.

Steps to Reproduce:
1. Change the ConfigMap by editing the prometheusrules object:

   $ oc edit -n openshift-monitoring prometheusrules prometheus-k8s-rules

   Find the rule with the mountpoint selector and replace / with /host/root, so that it looks like this:

   record: instance:node_filesystem_usage:sum
   expr: sum by(instance) ((node_filesystem_size{mountpoint="/host/root"} - node_filesystem_free{mountpoint="/host/root"}))

2. Check that the ConfigMap was updated:

   $ oc get cm prometheus-k8s-rulefiles-0 -o yaml -n openshift-monitoring | grep -B2 "instance:node_filesystem_usage:sum"

3. After 10 minutes or less, the Operator reverts the prometheus-k8s-rules object.

Actual results:
On some nodes, instance:node_filesystem_usage is empty because there is no rootfs information, which leaves the OpenShift Dashboard for the node empty. Even on nodes that do have the /sysroot information, the value does not reflect the actual size of the / mountpoint.

Expected results:
Either the ability to change the Operator object, or a change of the rule from rootfs to /host/root so that it really reflects the root mount size. Right now the information does not look right.

Additional info:
We noticed that the behavior depends on the kernel version. If we change the kernel version, the sysroot starts to show in the dashboard; if we change to a newer kernel, the sysroot does not show in the output of cat /proc/1/mounts. The rootfs entry is present with 3.10.0-1127.19.1.el7.x86_64 and is not present with 3.10.0-1160.2.1.el7.x86_64.
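To compare the two kernels on a node, the check against /proc/1/mounts can be scripted. The following is only a sketch: the function names `root_mounts` and `has_rootfs` are made up for illustration, and it assumes the standard mounts layout in which the first field is the device, the second the mountpoint, and the third the filesystem type.

```shell
#!/bin/sh
# Print the device and filesystem type of every entry mounted at "/"
# in a mounts file (default: /proc/1/mounts).
# Assumed field order: device mountpoint fstype options dump pass
root_mounts() {
    mounts_file="${1:-/proc/1/mounts}"
    awk '$2 == "/" { print $1, $3 }' "$mounts_file"
}

# Exit status 0 if an entry whose device is "rootfs" is mounted at "/",
# non-zero otherwise.
has_rootfs() {
    root_mounts "$1" | grep -q '^rootfs '
}
```

Running `uname -r; root_mounts; has_rootfs && echo "rootfs present" || echo "rootfs absent"` on a node under each kernel would show whether the rootfs entry is what differs between them.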
It's intended that your changes to the monitoring stack are not persisted, as we don't want any user to break their stack. The only way to customize the stack is by tweaking some predefined Ansible variables during installation, but that wouldn't allow you to modify Prometheus rules. Given what you found, this might be caused by a regression in the kernel, but we might still be able to improve the current Prometheus rule. From what I can see, it is not really meaningful to only consider the `/` or `/host/root` mountpoint, as we want to account for all the filesystems. I'll update the recording rule to reflect that.
We suspect that there might be something else to this bug. Could you please provide the list of mountpoints shown by the `sum(node_filesystem_size_bytes) by (mountpoint) > 0` query with both kernel versions?
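One way to collect that list outside the web UI is the Prometheus HTTP API (`/api/v1/query`). The sketch below is an assumption-laden example, not a supported procedure: the function name `mountpoint_sizes` is hypothetical, and the caller has to supply the Prometheus URL and a bearer token that is allowed to query it.

```shell
#!/bin/sh
# Run the requested query through the Prometheus HTTP API.
# $1: base URL of Prometheus (assumption: reachable from where this runs)
# $2: bearer token with permission to query (assumption)
mountpoint_sizes() {
    prom_url="$1"
    token="$2"
    # -G sends the url-encoded query as GET parameters;
    # -k skips TLS verification, which is only acceptable for a quick check.
    curl -sk -G \
        -H "Authorization: Bearer ${token}" \
        --data-urlencode 'query=sum(node_filesystem_size_bytes) by (mountpoint) > 0' \
        "${prom_url}/api/v1/query"
}
```

On a default install this might be called as `mountpoint_sizes "https://$(oc get route prometheus-k8s -n openshift-monitoring -o jsonpath='{.spec.host}')" "$(oc whoami -t)"`, assuming the route is named prometheus-k8s. If the result is empty under both kernels, it may be worth repeating the query with the `node_filesystem_size` name that the recording rule uses.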
Checked with ose-cluster-monitoring-operator/images/v3.11.445; the expr for "instance:node_filesystem_usage:sum" is updated:

- expr: sum((node_filesystem_size{mountpoint="/host/root"} - node_filesystem_free{mountpoint="/host/root"})) BY (instance)
  record: instance:node_filesystem_usage:sum
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: OpenShift Container Platform 3.11.452 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:2150