Bug 2128263

Summary: Alert KubePersistentVolumeInodesFillingUp MON-2802
Product: [Red Hat Storage] Red Hat OpenShift Data Foundation Reporter: Martin Bukatovic <mbukatov>
Component: ceph-monitoring    Assignee: Juan Miguel Olmo <jolmomar>
Status: CLOSED CURRENTRELEASE QA Contact: Vishakha Kathole <vkathole>
Severity: high Docs Contact:
Priority: unspecified    
Version: 4.11CC: hnallurv, jfajersk, jolmomar, lmohanty, mmuench, muagarwa, ndevos, nthomas, ocs-bugs, odf-bz-bot, olim, pkhaire, sarora, scuppett, sdodson, spasquie, ssonigra, tdesala, vcojot
Target Milestone: ---    Keywords: Upgrades
Target Release: ODF 4.12.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: 4.12.0-79 Doc Type: Bug Fix
Doc Text:
Cause: When a PVC uses CephFS as the storage backend, the inode metrics (kubelet_volume_stats_inodes_free, kubelet_volume_stats_inodes, kubelet_volume_stats_inodes_used) are not meaningful because CephFS allocates inodes on demand. Note: CephFS is a filesystem with dynamic inode allocation; this is by design. Consequence: False alerts about PV usage can be raised, because the metrics used to trigger these alerts report a storage capacity status that can change dynamically, without any intervention, whenever more storage space is required. Fix: The kubelet_volume_stats_inodes_free, kubelet_volume_stats_inodes, and kubelet_volume_stats_inodes_used metrics are no longer provided for CephFS-backed PVCs. Result: False alarms for storage capacity based on inode metrics are no longer raised.
Story Points: ---
Clone Of:
: 2149676 (view as bug list) Environment:
Last Closed: 2023-02-08 14:06:28 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 2132270, 2149677    
Bug Blocks: 2149676    

Description Martin Bukatovic 2022-09-20 12:40:22 UTC
Description of problem
======================

Since values of total inode capacity for filesystems with dynamic inode
allocation are not well defined (each such filesystem, e.g. CephFS, XFS,
or Btrfs, behaves slightly differently), it's not possible to interpret these
values in the same way as for "traditional" filesystems with static inode
allocation (such as ext4).

And because the alert KubePersistentVolumeInodesFillingUp doesn't distinguish
between the two cases, it can fire for PVCs backed by filesystems with
dynamic inode allocation, causing a false alarm.

This is a tracker bug for https://issues.redhat.com/browse/MON-2802

Version-Release number of selected component
============================================

OCP 4.11.0
ODF 4.11.0

How reproducible
================

100%

Steps to Reproduce
==================

1. Install OCP
2. Reconfigure OpenShift Container Platform registry to use RWX
   CephFS volume provided by ODF
3. Use the cluster for a while
4. Check firing alerts

Actual results
==============

Alert KubePersistentVolumeInodesFillingUp is firing with the following
message:

> The PersistentVolume claimed by registry-cephfs-rwx-pvc in Namespace
> openshift-image-registry only has 0% free inodes.

In this particular case, there will be 2 such alerts, as there are 2 replicas
of the registry.

Expected results
================

Alert KubePersistentVolumeInodesFillingUp is not firing when RWX CephFS volume
is used to provide persistent storage for some OCP component.

Additional info
===============

The definition of the alert looks like this:

```
(
  kubelet_volume_stats_inodes_free{job="kubelet",metrics_path="/metrics",namespace=~"(openshift-.*|kube-.*|default)"}
    /
  kubelet_volume_stats_inodes{job="kubelet",metrics_path="/metrics",namespace=~"(openshift-.*|kube-.*|default)"}
) < 0.03
and
kubelet_volume_stats_inodes_used{job="kubelet",metrics_path="/metrics",namespace=~"(openshift-.*|kube-.*|default)"} > 0
unless on (namespace, persistentvolumeclaim)
  kube_persistentvolumeclaim_access_mode{access_mode="ReadOnlyMany",namespace=~"(openshift-.*|kube-.*|default)"} == 1
unless on (namespace, persistentvolumeclaim)
  kube_persistentvolumeclaim_labels{label_alerts_k8s_io_kube_persistent_volume_filling_up="disabled",namespace=~"(openshift-.*|kube-.*|default)"} == 1
```

So it looks like there was some attempt to prevent this from happening, but
without reliable tracking of which filesystem is used and whether we want to
take inode values seriously for a given volume, the alert can't avoid false
alarms.

Comment 1 Martin Bukatovic 2022-09-20 12:43:39 UTC
Please check whether there is something we can do on ODF's side (in ceph csi or ODF monitoring) to address this.

Comment 2 Juan Miguel Olmo 2022-09-20 14:17:53 UTC
The alarm "KubePersistentVolumeInodesFillingUp" is defined as a calculation between kubelet_volume_stats_inodes_free and kubelet_volume_stats_inodes. If the filesystem used to backed the PV has dynamic inodes allocation, the returning value is going to be always incorrect, or "fictitious" in the better case. 

In filesystems with dynamic inode allocation, inodes are allocated as needed, based on the files created and the space requested. That is the expected behavior, and we cannot change this nice feature.

So I would suggest disabling this alarm for this kind of PV, leaving only alarms directly linked to "space" instead of inodes.

The alarm is a Kubernetes/OpenShift alarm, so I think this is probably outside the scope of the storage teams (maybe the OpenShift monitoring team?).

I can manage this, but I will need access to the OpenShift monitoring code.

Comment 3 Simon Pasquier 2022-09-20 15:10:36 UTC
Hello! Monitoring team engineer here :)

> So I would suggest disabling this alarm for this kind of PV, leaving only alarms directly linked to "space" instead of inodes.

We've already looked into it and we found no heuristic that would allow us to "disable" the alert for certain types of PV, unfortunately. It might be possible to filter out the buggy volumes based on kube_persistentvolumeclaim_* and/or kube_persistentvolume_* metrics, but we would need access to a live cluster to investigate.

> If the filesystem used to back the PV has dynamic inode allocation, the returned value is always going to be incorrect, or "fictitious" at best.

If the value isn't accurate or relevant for this type of setup then my recommendation (from an instrumentation standpoint) would be not to expose the value at all. If you know that you don't know, it's better to not say anything :)

In MON-2892 [1], I shared this snippet from the CSI spec [2]:

> Similarly, inode information MAY be omitted from NodeGetVolumeStatsResponse when unavailable.

IMHO it isn't only a concern with the KubePersistentVolumeInodesFillingUp alert. Anyone looking at the kubelet_volume_stats_inodes_* metrics would be puzzled by the inconsistent numbers.

[1] https://issues.redhat.com/browse/MON-2802
[2] https://github.com/container-storage-interface/spec/blob/master/spec.md#nodegetvolumestats

Comment 5 Juan Miguel Olmo 2022-09-21 15:54:11 UTC
Thanks for the feedback, Simon. Working all together, we will probably get this problem fixed.

> So I would suggest disabling this alarm for this kind of PV, leaving only alarms directly linked to "space" instead of inodes.

>> We've already looked into it and we found no heuristic that would allow us to "disable" the alert for certain types of PV, unfortunately. It might be possible to filter out the buggy volumes based on kube_persistentvolumeclaim_* and/or kube_persistentvolume_* metrics, but we would need access to a live cluster to investigate.

We (the storage team) are investigating how we can provide a heuristic to determine if a PV is using a data source like CephFS; I am not sure if we can achieve that for other types of filesystems.

> If the filesystem used to back the PV has dynamic inode allocation, the returned value is always going to be incorrect, or "fictitious" at best.

> If the value isn't accurate or relevant for this type of setup then my recommendation (from an instrumentation standpoint) would be not to expose the value at all. If you know that you don't know, it's better to not say anything :)

You are completely right. As you pointed out, taking a look at [1], it seems that the real returned value for inodes_free is not 0 when it should be 0 (because the number of free inodes returned from "ceph df" must be -1).
Let's invite Niels de Vos to shed light on this point.

Comment 6 Juan Miguel Olmo 2022-09-21 15:58:17 UTC
Sorry, in my previous comment the reference to [1] is the link to the following piece of code:
https://github.com/ceph/ceph-csi/blob/36b061d426b4c683df11f7deedeb27f8bc8fbad5/internal/csi-common/utils.go#L293

Comment 7 Scott Dodson 2022-09-26 13:28:54 UTC
I'm raising the severity of this because it makes it difficult to ascertain whether or not there are real problems, at fleet level, with 4.11. Can we please have a firm plan agreed upon by both monitoring and storage teams which details how we intend to address this and what that timeline looks like?

Comment 8 Lalatendu Mohanty 2022-09-26 14:55:57 UTC
If the filesystem is expanded, the inode issue will go away, won't it? I am wondering how customers can mitigate this before we fix the root cause. @ndevos Hi, do you think this is a high severity issue?

Comment 9 Simon Pasquier 2022-09-27 12:18:45 UTC
>  I am wondering how customers can mitigate this before we fix the root cause.

From a monitoring standpoint, customers have the ability to silence the alert or redirect the notification to an empty receiver. But it's definitely far from ideal.
As a short-term and temporary fix, we could remove the KubePersistentVolumeInodesFillingUp alerts from cluster monitoring operator until the metric is fixed. The downside is that cluster admins have no way to detect when a PVC is running out of inodes but it's probably way less frequent than a PVC running out of space (which is covered by the KubePersistentVolumeFillingUp alerts).
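
To make the silencing workaround concrete, below is a rough Go sketch that creates an Alertmanager silence for this alert through the Alertmanager v2 API. The Alertmanager URL and bearer token are placeholders (on an OpenShift cluster you would typically use the Alertmanager route in openshift-monitoring and a token with the required permissions); this only illustrates the temporary workaround, not a long-term fix.

```
// Sketch: create an Alertmanager silence for KubePersistentVolumeInodesFillingUp
// via the Alertmanager v2 API. The URL and bearer token below are placeholders.
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
	"time"
)

type matcher struct {
	Name    string `json:"name"`
	Value   string `json:"value"`
	IsRegex bool   `json:"isRegex"`
}

type silence struct {
	Matchers  []matcher `json:"matchers"`
	StartsAt  time.Time `json:"startsAt"`
	EndsAt    time.Time `json:"endsAt"`
	CreatedBy string    `json:"createdBy"`
	Comment   string    `json:"comment"`
}

func main() {
	s := silence{
		Matchers: []matcher{
			{Name: "alertname", Value: "KubePersistentVolumeInodesFillingUp", IsRegex: false},
		},
		StartsAt:  time.Now(),
		EndsAt:    time.Now().Add(30 * 24 * time.Hour), // silence for 30 days
		CreatedBy: "cluster-admin",
		Comment:   "False positive on CephFS-backed PVCs (bug 2128263)",
	}
	body, _ := json.Marshal(s)

	// Placeholder URL and token; replace with the cluster's Alertmanager route
	// and a valid bearer token.
	req, _ := http.NewRequest("POST", "https://alertmanager.example.com/api/v2/silences", bytes.NewReader(body))
	req.Header.Set("Content-Type", "application/json")
	req.Header.Set("Authorization", "Bearer <token>")

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	fmt.Println("silence created, status:", resp.Status)
}
```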

Comment 10 Martin Bukatovic 2022-09-27 16:04:40 UTC
If there is no way to represent this inode value in a better way (I'm not aware of any convention for us to use, but I may be wrong), maybe we could redefine the alert to fire only on volumes labeled in a particular way (instead of the current state, where it's enabled by default but one can disable it for particular volumes). Then we can have either admins or the CSI provisioner label the PV accordingly, only when a filesystem which needs inode monitoring is used.
This way, we get rid of false positive alert fires, while keeping an option to raise an inode alert when it makes sense.

Comment 11 Simon Pasquier 2022-09-28 09:44:06 UTC
> maybe we could redefine the alert to fire only on volumes labeled in a particular way (instead of the current state, where it's enabled by default but one can disable it for particular volumes). Then we can have either admins or the CSI provisioner label the PV accordingly, only when a filesystem which needs inode monitoring is used.

This would be very cumbersome from a user perspective IMHO.
Also it doesn't solve the issue that the CSI driver reports buggy values for kubelet_volume_stats_inodes_* metrics which would still confuse users.

Comment 12 Niels de Vos 2022-09-28 14:33:24 UTC
(In reply to Juan Miguel Olmo from comment #5)
> > If the filesystem used to back the PV has dynamic inode allocation, the returned value is always going to be incorrect, or "fictitious" at best.
> 
> > If the value isn't accurate or relevant for this type of setup then my recommendation (from an instrumentation standpoint) would be not to expose the value at all. If you know that you don't know, it's better to not say anything :)
> 
> You are completely right. As you pointed out, taking a look at [1], it seems
> that the real returned value for inodes_free is not 0 when it should be 0
> (because the number of free inodes returned from "ceph df" must be -1).
> Let's invite Niels de Vos to shed light on this point.

Indeed, for CephFS (and some other filesystems) the number of free inodes is not relevant; they get allocated on demand.

(In reply to Lalatendu Mohanty from comment #8)
> If the filesystem is expanded, the inode issue will go away, won't it? I am
> wondering how customers can mitigate this before we fix the root cause.
> @ndevos Hi, do you think this is a high severity issue?

Growing a filesystem might not result in more free inodes reported by tools like 'df'. If inodes get allocated on demand, the value may be 0. Other filesystems may allocate inodes only during mkfs (ext4?); growing the filesystem does not give you more inodes in that case.

This is an alerting issue, some customers may be worried about the alert. It may not be obvious that the alert is not relevant for some particular filesystems, and hence customers worry for nothing.

(In reply to Simon Pasquier from comment #11)
> Also it doesn't solve the issue that the CSI driver reports buggy values for
> kubelet_volume_stats_inodes_* metrics which would still confuse users.

For the Ceph-CSI driver we should be able to not return the number of free inodes. The NodeGetVolumeStats CSI procedure notes [1] that the free inodes in the VolumeUsage response are optional.
We'll need to work out a special case for CephFS (and NFS) in FilesystemNodeGetVolumeStats [2], but how that is done is up to the Ceph-CSI developers :-)

Feel free to open a BZ against OpenShift Data Foundation/ceph-csi so that we can take that request on for a next release.

[1] https://github.com/container-storage-interface/spec/blob/master/spec.md#nodegetvolumestats
[2] https://github.com/ceph/ceph-csi/blob/1eaf6af8824b2367ef07ee1e296f319d72bcaa86/internal/csi-common/utils.go#L240
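
For illustration only, here is a minimal Go sketch of what "not returning inode information" could look like in a CSI NodeGetVolumeStats handler, using the CSI spec's Go bindings. This is not the actual Ceph-CSI change; the function name and the includeInodes flag are hypothetical, and how Ceph-CSI special-cases CephFS/NFS is up to its developers, as noted above.

```
// Hypothetical sketch: build a NodeGetVolumeStatsResponse that always reports
// bytes usage, and only includes the inode VolumeUsage entry when the backing
// filesystem has meaningful (static) inode accounting. Not the actual Ceph-CSI
// implementation.
package sketch

import (
	"github.com/container-storage-interface/spec/lib/go/csi"
)

func volumeStatsResponse(availBytes, totalBytes, usedBytes,
	freeInodes, totalInodes, usedInodes int64, includeInodes bool) *csi.NodeGetVolumeStatsResponse {

	usage := []*csi.VolumeUsage{
		{
			Available: availBytes,
			Total:     totalBytes,
			Used:      usedBytes,
			Unit:      csi.VolumeUsage_BYTES,
		},
	}

	// For filesystems with dynamic inode allocation (e.g. CephFS) the inode
	// numbers are not meaningful, so the entry is simply omitted; the CSI spec
	// allows inode information to be left out of NodeGetVolumeStatsResponse.
	if includeInodes {
		usage = append(usage, &csi.VolumeUsage{
			Available: freeInodes,
			Total:     totalInodes,
			Used:      usedInodes,
			Unit:      csi.VolumeUsage_INODES,
		})
	}

	return &csi.NodeGetVolumeStatsResponse{Usage: usage}
}
```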

Comment 13 Niels de Vos 2022-09-29 15:14:26 UTC
Actually, the previous comment isn't completely right. It seems that Ceph-CSI does not return the inodes_free part, but only a subset of the information:

Logs:
I0929 13:09:15.373762       1 utils.go:195] ID: 212 GRPC request: {"volume_id":"0001-0011-openshift-storage-0000000000000001-fbc472b1-3ff6-11ed-bbaa-0a580a800213","volume_path":"/var/lib/kubelet/pods/9f1d5e1a-1927-408d-92c1-9d0999384bd8/volumes/kubernetes.io~csi/pvc-fd4d4b1a-4d16-4b64-8573-f805ecb2b661/mount"}
I0929 13:09:15.380245       1 utils.go:202] ID: 212 GRPC response: {"usage":[{"available":968884224,"total":1073741824,"unit":1,"used":104857600},{"total":3187,"unit":2,"used":3188}]}

The JSON output in more readable form:

{
    "usage": [
        {
            "available": 968884224,
            "total": 1073741824,
            "unit": 1,
            "used": 104857600
        },
        {
            "total": 3187,
            "unit": 2,
            "used": 3188
        }
    ]
}

unit=2 means inodes; the entry includes (strange) "total" and "used" values, with "used" larger than "total". This might cause the kubelet to assume that -1 inodes are free.

Unfortunately "total" (available capacity) is a required attribute according to the CSI specification, and it can not be negative. That means, for CephFS 100% consumption of inodes is normal.

I do not think that Ceph-CSI can do something about this, as it follows the CSI specification. The only way to get this addressed is changing the specification and making total optional in some way. This would be a large and long-term effort, as changes to the specification need to be made/accepted/released before Kubernetes and CSI drivers can implement it.

Comment 14 Juan Miguel Olmo 2022-10-03 09:12:58 UTC
After reading the comments, I think we can deduce that it is not a good idea to introduce any kind of change in Ceph-CSI. @mbukatov has pointed out in comment 10 what I think is the ideal solution:
1. We need to mark the PVs (and/or PVCs) that are backed by CephFS.
2. We need to modify the current alert rule to avoid raising the alert for these PVs.

@ndevos, what is the best place to add the "label" to these PVs? Can we do that in the Ceph CSI provisioner, as @mbukatov suggested? Do we need to create a separate bug to track the solution to this problem?

Once we are able to solve the identification of PVs using CephFS: spasquie, do you think it is possible to introduce this exception in the alert rule?

Comment 15 Simon Pasquier 2022-10-03 10:08:37 UTC
> Once we are able to solve the identification of PVs using CephFS: spasquie, do you think it is possible to introduce this exception in the alert rule?

We could do that and in fact the current alert expression makes it possible already: a PVC with the label "alerts.k8s.io/KubePersistentVolumeFillingUp=disabled" will not trigger the alert.

From a monitoring/metrics standpoint, though, I'd like to reiterate that reporting incorrect/non-relevant metrics is far from ideal. From [1], we see that we may have "inodes used" > "inodes total".

kubelet_volume_stats_inodes_free{persistentvolumeclaim="registry-cephfs-rwx-pvc"} 0
kubelet_volume_stats_inodes{persistentvolumeclaim="registry-cephfs-rwx-pvc"} 7418
kubelet_volume_stats_inodes_used{persistentvolumeclaim="registry-cephfs-rwx-pvc"} 7419

Even if we tweak the alerts to exclude PVCs based on a predefined label, these metrics are going to confuse users.

> I do not think that Ceph-CSI can do something about this, as it follows the CSI specification. The only way to get this addressed is changing the specification and making total optional in some way. This would be a large and long-term effort, as changes to the specification need to be made/accepted/released before Kubernetes and CSI drivers can implement it.

[2] says:

"Similarly, inode information MAY be omitted from NodeGetVolumeStatsResponse when unavailable."

It sounds to me like the CSI driver can decide not to report inodes usage while still reporting bytes usage. Am I wrong?

[1] https://issues.redhat.com/browse/MON-2802
[2] https://github.com/container-storage-interface/spec/blob/master/spec.md#nodegetvolumestats
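
As a concrete illustration of the label-based opt-out Simon describes above, here is a minimal client-go sketch that patches the PVC from this bug's example with that label. The label key is copied from the comment above and should be checked against the alert expression actually deployed on the cluster; the kubeconfig handling, namespace, and PVC name are just example values.

```
// Hypothetical sketch: patch a PVC with the opt-out label mentioned above so
// that the KubePersistentVolume*FillingUp alerts skip it. Double-check the
// exact label key against the alert expression deployed on the cluster.
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Assumption: running outside the cluster with a kubeconfig in the default location.
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	// Merge-patch the label onto the PVC from this bug's example.
	patch := []byte(`{"metadata":{"labels":{"alerts.k8s.io/KubePersistentVolumeFillingUp":"disabled"}}}`)
	_, err = clientset.CoreV1().
		PersistentVolumeClaims("openshift-image-registry").
		Patch(context.TODO(), "registry-cephfs-rwx-pvc", types.MergePatchType, patch, metav1.PatchOptions{})
	if err != nil {
		panic(err)
	}
	fmt.Println("PVC labeled; the alert should no longer fire for it")
}
```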

Comment 16 Niels de Vos 2022-10-03 12:08:39 UTC
(In reply to Juan Miguel Olmo from comment #14)
> @ndevos, what is the best place to add the "label" to these PVs?
> Can we do that in the Ceph CSI provisioner, as @mbukatov suggested?
> Do we need to create a separate bug to track the solution to this problem?

CSI drivers do not know about PersistentVolumes or PersistentVolumeClaims. The CSI standard only uses "Volumes", which are in a format completely independent of the Container Orchestrator (Kubernetes). CSI drivers cannot add annotations/labels to PVs.

Maybe an alerting rule can inspect the PV and check PersistentVolumeSource.CSIPersistentVolumeSource.Driver. This will get you the name of the driver that was used to create the volume. However, driver names are configurable and differ between upstream/downstream deployments. You would also need to keep a list of known driver names that should not cause alerts on inodes.

(In reply to Simon Pasquier from comment #15)
> "Similarly, inode information MAY be omitted from NodeGetVolumeStatsResponse
> when unavailable."
> 
> It sounds to me like the CSI driver can decide not to report inodes usage
> while still reporting bytes usage. Am I wrong?

Yes, that would be a possible approach. For Ceph-CSI we can introduce a configuration option to not report inode information at all. I do not know if there are users relying on inode information; it can be used to configure quotas, so it may be useful to some.

We do not configure inode quotas with ODF, so for the Red Hat product I think it should be fine to not report inode information for CephFS volumes.

Comment 17 Juan Miguel Olmo 2022-10-03 14:33:16 UTC
Thank you for the quick answer @ndevos, spasquie.

So it seems that an acceptable solution for everybody could be what Niels has proposed (config option to not report inode info). 

Any other thought or concern Simon? 

Niels, what is the best way to speed up this change?

Comment 18 Simon Pasquier 2022-10-03 14:48:11 UTC
> Any other thought or concern Simon? 

Nothing I can think of. As I commented somewhere else, we probably want to disable the alert in the meantime as it keeps knocking on our bug door [1].

[1] https://issues.redhat.com/browse/OCPBUGS-1766

Comment 19 Niels de Vos 2022-10-03 15:54:06 UTC
(In reply to Juan Miguel Olmo from comment #17)
> Thank you for the quick answer @ndevos, spasquie.
> 
> So it seems that an acceptable solution for everybody could be what Niels
> has proposed (config option to not report inode info). 
> 
> Any other thought or concern Simon? 
> 
> Niels, what is the best way to speed up this change?

I think you can file a bug against the "csi-driver" component in the "OpenShift Data Foundation" product. The change for Ceph-CSI is minimal, and should be able to get included in the ODF-4.12 release. If urgent enough, it might be eligible for backporting to the current ODF-4.11 version as a bugfix.

Comment 20 Juan Miguel Olmo 2022-10-05 10:20:08 UTC
@ndevos, spasquie:
I have created bug 2132270 for the Ceph CSI driver component. PTAL

Comment 21 Juan Miguel Olmo 2022-10-05 10:23:46 UTC
Apart from introducing the config option in the Ceph CSI driver, do we need another modification in any ODF config file in order to set the value for the option?

Comment 22 Niels de Vos 2022-10-05 12:45:21 UTC
(In reply to Juan Miguel Olmo from comment #21)
> Apart from introducing the config option in the Ceph CSI driver, do we need
> another modification in any ODF config file in order to set the value for
> the option?

No, the defaults for ODF will be taken care of. At the moment I am not even sure if a configuration option is needed at all. We might remove inode metrics for CephFS and CephNFS, and only add an option if someone asks for it.

Comment 23 Simon Pasquier 2022-10-10 08:06:06 UTC
Niels already answered, clearing the needinfo flag.

Comment 24 Martin Bukatovic 2022-10-12 08:42:08 UTC
Clearing needinfo, we got the feedback from ODF team.

Comment 25 Juan Miguel Olmo 2022-10-18 11:57:59 UTC
The resolution of this bug depends on bug 2132270. Resolution verification will be executed in the ODF 4.12.0 release.
See https://bugzilla.redhat.com/show_bug.cgi?id=2132270#c7

Comment 27 Mudit Agarwal 2022-11-02 10:51:01 UTC
Please provide QA ack.

Comment 28 Martin Bukatovic 2022-11-03 12:37:53 UTC
QA team will check that alert KubePersistentVolumeInodesFillingUp is not firing when RWX CephFS volume is used to provide persistent storage for some OCP component.

Comment 39 Red Hat Bugzilla 2023-12-08 04:30:47 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days