Bug 1874028
Summary: | Node filesystem used and total calculations are wrong | |||
---|---|---|---|---|
Product: | OpenShift Container Platform | Reporter: | Pablo Alonso Rodriguez <palonsor> | |
Component: | Management Console | Assignee: | ralpert | |
Status: | CLOSED ERRATA | QA Contact: | Yadan Pei <yapei> | |
Severity: | medium | Docs Contact: | ||
Priority: | medium | |||
Version: | 4.5 | CC: | aos-bugs, fshaikh, jmalde, jokerman, michele.sandro.emma, pkrupa, spadgett, yapei | |
Target Milestone: | --- | |||
Target Release: | 4.6.0 | |||
Hardware: | Unspecified | |||
OS: | Unspecified | |||
Whiteboard: | ||||
Fixed In Version: | Doc Type: | Bug Fix | ||
Doc Text: |
Cause: The query was wrong.
Consequence: The data was wrong.
Fix: Updated the query.
Result: The data is as expected.
|
Story Points: | --- | |
Clone Of: | ||||
: | 1883177 (view as bug list) | Environment: | ||
Last Closed: | 2020-10-27 16:36:11 UTC | Type: | Bug | |
Regression: | --- | Mount Type: | --- | |
Documentation: | --- | CRM: | ||
Verified Versions: | Category: | --- | ||
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | ||
Cloudforms Team: | --- | Target Upstream Version: | ||
Embargoed: | ||||
Bug Depends On: | ||||
Bug Blocks: | 1883177 |
Description
Pablo Alonso Rodriguez
2020-08-31 11:20:39 UTC
I am working under the assumption that `instance:node_cpu:rate:sum` was meant as `instance:node_filesystem_usage:sum`. `instance:node_filesystem_usage:sum` shows only the usage of the `/` mountpoint, so we removed this recording rule in 4.6+. The expression `node_filesystem_size_bytes - node_filesystem_avail_bytes` should be used instead. More on the topic: https://www.robustperception.io/filesystem-metrics-from-the-node-exporter

Additionally, any usage of node_filesystem_* metrics should be restricted by mountpoint and/or filesystem type. Such restrictions can be applied with the `fstype` and/or `mountpoint` labels. For example, this shows all available storage space excluding `/boot`, tmpfs, and squashfs: `node_filesystem_avail_bytes{fstype!~"tmpfs|squashfs",mountpoint!="/boot"}`

Thanks Pawel for more insight! Much appreciated.

Is this okay to close given Pawel's insight? We're using the queries he mentioned.

No, it is not OK. First of all, this issue reproduces in 4.5, so even if it did not reproduce in 4.6, we would need to get the fix into 4.5.

However, I have been looking at the current 4.6 cluster monitoring operator code. The --collector.filesystem.ignored-mount-points option has been expanded, but that is not enough:
- In 4.5: --collector.filesystem.ignored-mount-points=^/(dev|proc|sys|var/lib/docker/.+)($|/)
- In 4.6: --collector.filesystem.ignored-mount-points=^/(dev|proc|sys|var/lib/docker/.+|var/lib/kubelet/pods/.+)($|/)

With that change, only volumes under /var/lib/kubelet/pods would be ignored (pod volume mounts, emptydirs, etc.). However, the following would still be counted incorrectly:
- The mounts at /, /usr and /var would still cause the root filesystem to be counted 3 times.
- tmpfs filesystems outside /var/lib/kubelet/pods (like /tmp or /dev/shm) would still be added to the count.

So no, it is not OK to close; we still need a fix.

Sorry for not being clear enough.
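To make the comparison of the two --collector.filesystem.ignored-mount-points values concrete, here is a minimal sketch that applies both patterns to a handful of mount points. It assumes JavaScript regular expressions behave the same as node_exporter's Go regular expressions for these particular patterns; the list of mount points, the pod volume path, and the helper `stillReported` are hypothetical and exist only for illustration.

```typescript
// Flag values as quoted in the discussion above.
const ignored45 = new RegExp("^/(dev|proc|sys|var/lib/docker/.+)($|/)");
const ignored46 = new RegExp(
  "^/(dev|proc|sys|var/lib/docker/.+|var/lib/kubelet/pods/.+)($|/)",
);

// Hypothetical mount points; the pod volume path is made up.
const mountpoints: string[] = [
  "/",
  "/usr",
  "/var",
  "/tmp",
  "/var/lib/kubelet/pods/example-pod/volumes/data",
];

// Return the mount points that the ignore pattern does NOT match,
// i.e. the ones that would still be exported as filesystem metrics.
function stillReported(pattern: RegExp, mounts: string[]): string[] {
  return mounts.filter((m) => !pattern.test(m));
}

// 4.5 pattern: none of the sample mounts is ignored (all five remain).
const reported45 = stillReported(ignored45, mountpoints);

// 4.6 pattern: only the pod volume is ignored; "/", "/usr", "/var"
// and "/tmp" are still reported.
const reported46 = stillReported(ignored46, mountpoints);
```

Under these assumptions the 4.6 pattern drops only the pod volume, which matches the point made above: `/`, `/usr`, `/var` and `/tmp` still reach the query and have to be filtered there.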
Yes, my suggestion is to use metric labels as filters and remove unwanted data from the query. For example, to calculate used storage with the filter applied, the query would look as follows:
sum by (instance) (node_filesystem_size_bytes{fstype!~"tmpfs|squashfs",mountpoint!~"/usr|/var"} - node_filesystem_free_bytes{fstype!~"tmpfs|squashfs",mountpoint!~"/usr|/var"})

*** Bug 1877136 has been marked as a duplicate of this bug. ***

[NodeQueries.FILESYSTEM_USAGE]: _.template(
  `sum(node_filesystem_size_bytes{instance="<%= node %>",fstype!=""} - node_filesystem_avail_bytes{instance="<%= node %>",fstype!=""})`,
  `sum(node_filesystem_size_bytes{instance="<%= node %>",fstype!~"tmpfs|squashfs",mountpoint!~"/usr|/var"} - node_filesystem_avail_bytes{instance="<%= node %>",fstype!~"tmpfs|squashfs",mountpoint!~"/usr|/var"})`,
),
[NodeQueries.FILESYSTEM_TOTAL]: _.template(
  `node_filesystem_size_bytes{instance='<%= node %>',fstype!~"tmpfs|squashfs",mountpoint!~"/usr|/var"}`,

I think that for NodeQueries.FILESYSTEM_TOTAL the correct query should be `sum(node_filesystem_size_bytes{instance='<%= node %>',fstype!~"tmpfs|squashfs",mountpoint!~"/usr|/var"})`

Assigning back to confirm.

Sorry, this is the current PR fix (there was one extra line in comment 10):

[NodeQueries.FILESYSTEM_USAGE]: _.template(
  `sum(node_filesystem_size_bytes{instance="<%= node %>",fstype!~"tmpfs|squashfs",mountpoint!~"/usr|/var"} - node_filesystem_avail_bytes{instance="<%= node %>",fstype!~"tmpfs|squashfs",mountpoint!~"/usr|/var"})`,
),
[NodeQueries.FILESYSTEM_TOTAL]: _.template(
  `node_filesystem_size_bytes{instance='<%= node %>',fstype!~"tmpfs|squashfs",mountpoint!~"/usr|/var"}`,

The Node Overview page and the Nodes list page are showing the same Filesystem Usage and Total values.
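The effect of the `fstype!~"tmpfs|squashfs",mountpoint!~"/usr|/var"` filter, and the reason the total needs a `sum(...)`, can be sketched with a small in-memory model. This is not how the console or Prometheus evaluates anything; it only mimics the label matching on made-up samples. The node layout, the sizes, and every name in the code (`Sample`, `keep`, `sumSize`) are hypothetical. One known detail is relied on: PromQL regex matchers are fully anchored, so the patterns are wrapped in `^(?:...)$` here.

```typescript
interface Sample {
  mountpoint: string;
  fstype: string;
  sizeGiB: number;
}

// Hypothetical node: one 100 GiB root device visible at /, /usr and /var,
// plus a small /boot partition and a tmpfs. All values are made up.
const samples: Sample[] = [
  { mountpoint: "/", fstype: "xfs", sizeGiB: 100 },
  { mountpoint: "/usr", fstype: "xfs", sizeGiB: 100 },
  { mountpoint: "/var", fstype: "xfs", sizeGiB: 100 },
  { mountpoint: "/boot", fstype: "ext4", sizeGiB: 1 },
  { mountpoint: "/tmp", fstype: "tmpfs", sizeGiB: 8 },
];

// Anchored equivalents of fstype!~"tmpfs|squashfs" and mountpoint!~"/usr|/var".
const excludedFstype = /^(?:tmpfs|squashfs)$/;
const excludedMountpoint = /^(?:\/usr|\/var)$/;

// A sample survives the filter only if neither label matches its exclusion.
const keep = (s: Sample): boolean =>
  !excludedFstype.test(s.fstype) && !excludedMountpoint.test(s.mountpoint);

const sumSize = (xs: Sample[]): number =>
  xs.reduce((acc, s) => acc + s.sizeGiB, 0);

// Every sample has a non-empty fstype, so this mirrors the old fstype!="" query:
// 100 + 100 + 100 + 1 + 8 = 309 GiB, with the root device counted three times.
const unfilteredTotal = sumSize(samples);

// With the filter, only "/" and "/boot" remain: 100 + 1 = 101 GiB.
const filtered = samples.filter(keep);
const filteredTotal = sumSize(filtered);
```

In this model two series survive the filter (`/` and `/boot`), so a total query without `sum(...)` would return two values rather than one, which is the concern raised about NodeQueries.FILESYSTEM_TOTAL.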
And Filesystem Total Usage is showing the same value as the query `sum(node_filesystem_size_bytes{instance="<node>",fstype!~"tmpfs|squashfs",mountpoint!~"/usr|/var"})`

Moving to VERIFIED.

Verified on 4.6.0-0.nightly-2020-09-26-202331

*** Bug 1852770 has been marked as a duplicate of this bug. ***

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.
https://access.redhat.com/errata/RHBA-2020:4196
|