Bug 2128263
| Summary: | Alert KubePersistentVolumeInodesFillingUp (MON-2802) | | |
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat OpenShift Data Foundation | Reporter: | Martin Bukatovic <mbukatov> |
| Component: | ceph-monitoring | Assignee: | Juan Miguel Olmo <jolmomar> |
| Status: | CLOSED CURRENTRELEASE | QA Contact: | Vishakha Kathole <vkathole> |
| Severity: | high | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 4.11 | CC: | hnallurv, jfajersk, jolmomar, lmohanty, mmuench, muagarwa, ndevos, nthomas, ocs-bugs, odf-bz-bot, olim, pkhaire, sarora, scuppett, sdodson, spasquie, ssonigra, tdesala, vcojot |
| Target Milestone: | --- | Keywords: | Upgrades |
| Target Release: | ODF 4.12.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | 4.12.0-79 | Doc Type: | Bug Fix |
Doc Text:

Cause: When a PVC uses CephFS as its storage backend, the inode metrics (kubelet_volume_stats_inodes_free, kubelet_volume_stats_inodes, kubelet_volume_stats_inodes_used) are not correct, because CephFS allocates inodes on demand. (CephFS is a filesystem with dynamic inode allocation; this is by design.)

Consequence: False alerts about PV usage can be raised, because the metrics used to raise these alerts report a storage capacity status that can change dynamically, without any intervention, whenever more storage space is required.

Fix: The kubelet_volume_stats_inodes_free, kubelet_volume_stats_inodes, and kubelet_volume_stats_inodes_used metrics are no longer provided for CephFS-backed PVCs.

Result: False storage-capacity alarms based on inode metrics are no longer raised.
| Story Points: | --- | | |
|---|---|---|---|
| Clone Of: | | | |
| | 2149676 (view as bug list) | Environment: | |
| Last Closed: | 2023-02-08 14:06:28 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | 2132270, 2149677 | | |
| Bug Blocks: | 2149676 | | |
Description
Martin Bukatovic
2022-09-20 12:40:22 UTC
Please check whether there is something we can do on ODF's side (in Ceph CSI or ODF monitoring) to address this.

The alarm "KubePersistentVolumeInodesFillingUp" is defined as a calculation between kubelet_volume_stats_inodes_free and kubelet_volume_stats_inodes. If the filesystem used to back the PV has dynamic inode allocation, the returned value will always be incorrect, or "fictitious" at best. In filesystems with dynamic inode allocation, inodes are allocated as needed for the files created and the space requested. That is the expected behavior, and we cannot change this nice feature. So I would suggest disabling this alarm for this kind of PV, leaving only alarms directly linked with "space" instead of inodes. The alarm is a Kubernetes/OpenShift alarm, so I think this is probably outside the scope of the storage teams (maybe the OpenShift monitoring team?). I can manage this, but I will need access to the OpenShift monitoring code.

Hello! Monitoring team engineer here :)

> So I would suggest disabling this alarm for this kind of PV, leaving only alarms directly linked with "space" instead of inodes.

We've already looked into it and unfortunately found no heuristic that would allow us to "disable" the alert for certain types of PV. It might be possible to filter out the buggy volumes based on kube_persistentvolumeclaim_* and/or kube_persistentvolume_* metrics, but we would need access to a live cluster to investigate.

> If the filesystem used to back the PV has dynamic inode allocation, the returned value will always be incorrect, or "fictitious" at best.

If the value isn't accurate or relevant for this type of setup, then my recommendation (from an instrumentation standpoint) would be not to expose the value at all. If you know that you don't know, it's better not to say anything :)

In MON-2802 [1], I shared this snippet from the CSI spec [2]:

> Similarly, inode information MAY be omitted from NodeGetVolumeStatsResponse when unavailable.

IMHO it isn't only a concern with the KubePersistentVolumeInodesFillingUp alert. Anyone looking at the kubelet_volume_stats_inodes_* metrics would be puzzled by the inconsistent numbers.

[1] https://issues.redhat.com/browse/MON-2802
[2] https://github.com/container-storage-interface/spec/blob/master/spec.md#nodegetvolumestats

Thanks for the feedback, Simon. Working all together, we will probably manage to fix this problem.

> So I would suggest disabling this alarm for this kind of PV, leaving only alarms directly linked with "space" instead of inodes.

>> We've already looked into it and unfortunately found no heuristic that would allow us to "disable" the alert for certain types of PV. It might be possible to filter out the buggy volumes based on kube_persistentvolumeclaim_* and/or kube_persistentvolume_* metrics, but we would need access to a live cluster to investigate.

We (the storage team) are investigating how we can provide the heuristic to determine whether a PV is backed by a data source like CephFS; I am not sure we can achieve that for other types of filesystems.

> If the filesystem used to back the PV has dynamic inode allocation, the returned value will always be incorrect, or "fictitious" at best.

> If the value isn't accurate or relevant for this type of setup, then my recommendation (from an instrumentation standpoint) would be not to expose the value at all. If you know that you don't know, it's better not to say anything :)

You are completely right.
As you pointed out, taking a look at [1], it seems that the value really returned for inodes_free is not 0 (and it must be 0, because the number of inodes returned from "ceph df" must be -1). Let's invite Niels de Vos to shed light on this point.

Sorry, in my previous comment the reference [1] is a link to the following piece of code: https://github.com/ceph/ceph-csi/blob/36b061d426b4c683df11f7deedeb27f8bc8fbad5/internal/csi-common/utils.go#L293

I'm raising the severity of this because it makes it difficult to ascertain whether or not there are real problems, at fleet level, with 4.11. Can we please have a firm plan, agreed upon by both the monitoring and storage teams, which details how we intend to address this and what that timeline looks like?

If the filesystem is expanded, the inode issue will go away, won't it? I am wondering how customers can mitigate this before we fix the root cause. @ndevos Hi, do you think this is a high severity issue?

> I am wondering how customers can mitigate this before we fix the root cause.
From a monitoring standpoint, customers have the ability to silence the alert or redirect the notification to an empty receiver. But it's definitely far from ideal.
As a short-term and temporary fix, we could remove the KubePersistentVolumeInodesFillingUp alerts from cluster monitoring operator until the metric is fixed. The downside is that cluster admins have no way to detect when a PVC is running out of inodes but it's probably way less frequent than a PVC running out of space (which is covered by the KubePersistentVolumeFillingUp alerts).
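As an illustration of the "empty receiver" workaround mentioned above (this fragment is not quoted from the bug itself), a minimal Alertmanager routing sketch could look like the following. The receiver names are made up, the matcher syntax assumes a reasonably recent Alertmanager, and on OpenShift such configuration typically lives in the cluster's Alertmanager configuration secret:

```yaml
# Sketch only: route the inode alerts to a receiver with no notification
# integrations, so they still show as firing but never notify anyone.
route:
  receiver: default
  routes:
    - matchers:
        - alertname = "KubePersistentVolumeInodesFillingUp"
      receiver: "null"
receivers:
  - name: default
    # existing notification integrations stay attached to the default receiver
  - name: "null"
```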
If there is no way to represent this inode value in a better way (I'm not aware of any convention for us to use, but I may be wrong), maybe we could redefine the alert to fire only on volumes labeled in a particular way (instead of the current state, where it's enabled by default but can be disabled for particular volumes). Then we could have either admins or the CSI provisioner label the PV accordingly, only when a filesystem which needs inode monitoring is used. This way, we get rid of false-positive alert firings while keeping the option to raise an inode alert when it makes sense.

> maybe we could redefine the alert to fire only on volumes labeled in a particular way (instead of the current state, where it's enabled by default but can be disabled for particular volumes). Then we could have either admins or the CSI provisioner label the PV accordingly, only when a filesystem which needs inode monitoring is used.
This would be very cumbersome from a user perspective IMHO.
Also it doesn't solve the issue that the CSI driver reports buggy values for kubelet_volume_stats_inodes_* metrics which would still confuse users.
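For reference, the existing per-volume opt-out alluded to above amounts to labeling the PVC; the exact label is spelled out later in this thread. A minimal sketch follows, where the PVC name is taken from the metrics quoted further down and the namespace and storage class are illustrative assumptions:

```yaml
# Sketch only: a CephFS-backed PVC opted out of the KubePersistentVolume*FillingUp
# alerts via the label that the shipped alert expressions exclude.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: registry-cephfs-rwx-pvc            # name reused from the metrics below
  namespace: openshift-image-registry      # illustrative namespace
  labels:
    alerts.k8s.io/KubePersistentVolumeFillingUp: disabled
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: ocs-storagecluster-cephfs   # assumed ODF CephFS storage class
  resources:
    requests:
      storage: 100Gi
```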
(In reply to Juan Miguel Olmo from comment #5)

> > If the filesystem used to back the PV has dynamic inode allocation, the returned value will always be incorrect, or "fictitious" at best.
>
> > If the value isn't accurate or relevant for this type of setup, then my recommendation (from an instrumentation standpoint) would be not to expose the value at all. If you know that you don't know, it's better not to say anything :)
>
> You are completely right. As you pointed out, taking a look at [1], it seems that the value really returned for inodes_free is not 0 (and it must be 0, because the number of inodes returned from "ceph df" must be -1). Let's invite Niels de Vos to shed light on this point.

Indeed, for CephFS (and some other filesystems) the number of free inodes is not relevant; they get allocated on demand.

(In reply to Lalatendu Mohanty from comment #8)

> If the filesystem is expanded, the inode issue will go away, won't it? I am wondering how customers can mitigate this before we fix the root cause. @ndevos Hi, do you think this is a high severity issue?

Growing a filesystem might not result in more free inodes reported by tools like 'df'. If inodes get allocated on demand, the value may be 0. Other filesystems may allocate inodes only during mkfs (ext4?); growing the filesystem does not give you more inodes in that case.

This is an alerting issue, and some customers may be worried about the alert. It may not be obvious that the alert is not relevant for some particular filesystems, so customers worry for nothing.

(In reply to Simon Pasquier from comment #11)

> Also it doesn't solve the issue that the CSI driver reports buggy values for kubelet_volume_stats_inodes_* metrics which would still confuse users.

For the Ceph-CSI driver we should be able to not return the number of free inodes. The NodeGetVolumeStats CSI procedure notes [1] that the free inodes in the VolumeUsage response are optional. We'll need to work out a special case for CephFS (and NFS) in FilesystemNodeGetVolumeStats [2], but how that is done is up to the Ceph-CSI developers :-)

Feel free to open a BZ against OpenShift Data Foundation/ceph-csi so that we can take that request on for a next release.

[1] https://github.com/container-storage-interface/spec/blob/master/spec.md#nodegetvolumestats
[2] https://github.com/ceph/ceph-csi/blob/1eaf6af8824b2367ef07ee1e296f319d72bcaa86/internal/csi-common/utils.go#L240

Actually, the previous comment isn't completely right. It seems that Ceph-CSI does not return the inodes_free part, but only a subset of the information:
Logs:
I0929 13:09:15.373762 1 utils.go:195] ID: 212 GRPC request: {"volume_id":"0001-0011-openshift-storage-0000000000000001-fbc472b1-3ff6-11ed-bbaa-0a580a800213","volume_path":"/var/lib/kubelet/pods/9f1d5e1a-1927-408d-92c1-9d0999384bd8/volumes/kubernetes.io~csi/pvc-fd4d4b1a-4d16-4b64-8573-f805ecb2b661/mount"}
I0929 13:09:15.380245 1 utils.go:202] ID: 212 GRPC response: {"usage":[{"available":968884224,"total":1073741824,"unit":1,"used":104857600},{"total":3187,"unit":2,"used":3188}]}
The JSON output in more readable form:
{
  "usage": [
    {
      "available": 968884224,
      "total": 1073741824,
      "unit": 1,
      "used": 104857600
    },
    {
      "total": 3187,
      "unit": 2,
      "used": 3188
    }
  ]
}
unit=2 is the inodes entry, which includes a (strange) "total" and "used": used (3188) is higher than total (3187), so the kubelet might assume that -1 inodes are free.
Unfortunately "total" (available capacity) is a required attribute according to the CSI specification, and it can not be negative. That means, for CephFS 100% consumption of inodes is normal.
I do not think that Ceph-CSI can do something about this, as it follows the CSI specification. The only way to get this addressed is to change the specification and make "total" optional in some way. This would be a large and long-term effort, as changes to the specification need to be made/accepted/released before Kubernetes and CSI drivers can implement it.
After reading the comments, I think we can deduce that it is not a good idea to introduce any kind of change in Ceph-CSI. @mbukatov pointed out in comment 10 what I think is the ideal solution:

1. We need to mark the PVs (and/or PVCs) backed by CephFS.
2. We need to modify the current alert rule so that it does not raise the alert for these PVs.

@ndevos, what is the best place to add the "label" to these PVs? Can we do that in the Ceph CSI provisioner, as @mbukatov pointed out? Should we create a separate bug to track the solution of the problem?

Once we are able to solve the identification of PVs using CephFS: spasquie, do you think it is possible to introduce this exception in the alert rule?

> Once we are able to solve the identification of PVs using CephFS: spasquie, do you think it is possible to introduce this exception in the alert rule?

We could do that, and in fact the current alert expression makes it possible already: a PVC with the label "alerts.k8s.io/KubePersistentVolumeFillingUp=disabled" will not trigger the alert.

From a monitoring/metrics standpoint though, I'd like to reiterate that reporting incorrect/non-relevant metrics is far from ideal. From [1], we see that we may have "inodes used" > "inodes total":

kubelet_volume_stats_inodes_free{persistentvolumeclaim="registry-cephfs-rwx-pvc"} 0
kubelet_volume_stats_inodes{persistentvolumeclaim="registry-cephfs-rwx-pvc"} 7418
kubelet_volume_stats_inodes_used{persistentvolumeclaim="registry-cephfs-rwx-pvc"} 7419

Even if we tweak the alerts to exclude PVCs based on a predefined label, these metrics are going to confuse users.

> I do not think that Ceph-CSI can do something about this, as it follows the CSI specification. The only way to get this addressed is to change the specification and make "total" optional in some way. This would be a large and long-term effort, as changes to the specification need to be made/accepted/released before Kubernetes and CSI drivers can implement it.

[2] says: "Similarly, inode information MAY be omitted from NodeGetVolumeStatsResponse when unavailable."

It sounds to me like the CSI driver can decide not to report inodes usage while still reporting bytes usage. Am I wrong?

[1] https://issues.redhat.com/browse/MON-2802
[2] https://github.com/container-storage-interface/spec/blob/master/spec.md#nodegetvolumestats

(In reply to Juan Miguel Olmo from comment #14)

> @ndevos, what is the best place to add the "label" to these PVs? Can we do that in the Ceph CSI provisioner, as @mbukatov pointed out? Should we create a separate bug to track the solution of the problem?

CSI drivers do not know about PersistentVolumes or PersistentVolumeClaims. The CSI standard only uses "Volumes", in a format completely independent of the Container Orchestrator (Kubernetes). CSI drivers cannot add annotations/labels to PVs.

Maybe an alerting rule can inspect the PV and check PersistentVolumeSource.CSIPersistentVolumeSource.Driver. This will get you the name of the driver that was used to create the volume. However, driver names are configurable and differ in upstream/downstream deployments. You would also need to keep a list of known driver names that should not cause alerts on inodes.

(In reply to Simon Pasquier from comment #15)

> "Similarly, inode information MAY be omitted from NodeGetVolumeStatsResponse when unavailable."
>
> It sounds to me like the CSI driver can decide not to report inodes usage while still reporting bytes usage. Am I wrong?
Yes, that would be a possible approach. For Ceph-CSI we can introduce a configuration option to not report inode information at all. I do not know if there are users relying on inode information; it can be used to configure quotas, so it may be useful to some. We do not configure inode quotas with ODF, so for the Red Hat product I think it should be fine to not report inode information for CephFS volumes.

Thank you for the quick answer @ndevos, spasquie. So it seems that an acceptable solution for everybody could be what Niels has proposed (a config option to not report inode info). Any other thought or concern, Simon? Niels, what is the best way to speed up this change?

> Any other thought or concern, Simon?

Nothing I can think of. As I commented somewhere else, we probably want to disable the alert in the meantime as it keeps knocking on our bug door [1].

[1] https://issues.redhat.com/browse/OCPBUGS-1766

(In reply to Juan Miguel Olmo from comment #17)

> Thank you for the quick answer @ndevos, spasquie.
>
> So it seems that an acceptable solution for everybody could be what Niels has proposed (a config option to not report inode info).
>
> Any other thought or concern, Simon?
>
> Niels, what is the best way to speed up this change?

I think you can file a bug against the "csi-driver" component in the "OpenShift Data Foundation" product. The change for Ceph-CSI is minimal and should be able to get included in the ODF 4.12 release. If urgent enough, it might be eligible for backporting to the current ODF 4.11 version as a bugfix.

@ndevos, spasquie: I have created bug 2132270 for the Ceph CSI driver component. PTAL.

Apart from introducing the config option in the Ceph CSI driver, do we need another modification in any ODF config file in order to set the value for the option?

(In reply to Juan Miguel Olmo from comment #21)

> Apart from introducing the config option in the Ceph CSI driver, do we need another modification in any ODF config file in order to set the value for the option?

No, the defaults for ODF will be taken care of. At the moment I am not even sure a configuration option is needed at all. We might remove inode metrics for CephFS and CephNFS, and only add an option if someone asks for it.

Niels already answered, clearing the needinfo flag.

Clearing needinfo, we got the feedback from the ODF team.

The resolution of this bug depends on bug 2132270. Resolution verification will be executed in the ODF 4.12.0 release. See https://bugzilla.redhat.com/show_bug.cgi?id=2132270#c7

Please provide qa ack.

The QA team will check that the alert KubePersistentVolumeInodesFillingUp is not firing when an RWX CephFS volume is used to provide persistent storage for some OCP component.

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days.
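For reference, the upstream kubernetes-mixin rule behind the alert discussed in this bug looks roughly like the sketch below. The threshold, metric selectors, and label-based exclusions (including the alerts.k8s.io/KubePersistentVolumeFillingUp=disabled PVC label mentioned above) vary by release, so treat this as an approximation rather than the shipped definition:

```yaml
# Approximate upstream (kubernetes-mixin) form of the alert; selectors and
# exclusions differ between releases and between upstream and OpenShift.
- alert: KubePersistentVolumeInodesFillingUp
  expr: |
    (
      kubelet_volume_stats_inodes_free{job="kubelet"}
        /
      kubelet_volume_stats_inodes{job="kubelet"}
    ) < 0.03
    and
    kubelet_volume_stats_inodes_used{job="kubelet"} > 0
  for: 1m
  labels:
    severity: critical
  annotations:
    summary: PersistentVolume is running out of inodes.
```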