Created attachment 1719442 [details]
Full screen of alert firing

Description of problem:
DataVolume appears to provision a PV exactly the size of the disk image it ends up writing, causing the KubePersistentVolumeFillingUp alert in Prometheus to fire with a "critical" level of severity.

Version-Release number of selected component (if applicable):
OCP: 4.5.13
OCS: 4.5.0
CNV: 2.4.1
Local Storage: 4.5.0-202009161248.p0

How reproducible:
Pretty consistent

Steps to Reproduce:
1. Install CNV.
2. Configure OCS with local storage for OSD.
3. Use the web UI wizard to create a new Virtual Machine using a URL source and ocs-storagecluster-cephfs for the disk's storage class.
4. Wait a few minutes after the import completes.
5. Observe the currently firing Prometheus alerts.

Actual results:
KubePersistentVolumeFillingUp is firing for the rootdisk PV (screenshots attached).

Expected results:
No alerts firing due to newly provisioned storage.
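For reference, the disk the wizard creates corresponds to a DataVolume roughly like the following. This is a sketch only: the name, source URL, and size are illustrative, not taken from the affected cluster.

```yaml
apiVersion: cdi.kubevirt.io/v1beta1
kind: DataVolume
metadata:
  name: example-vm-rootdisk          # illustrative name
spec:
  source:
    http:
      url: http://example.com/disk.qcow2   # illustrative URL
  pvc:
    storageClassName: ocs-storagecluster-cephfs
    accessModes:
      - ReadWriteMany
    resources:
      requests:
        storage: 10Gi   # the provisioned PV ends up sized like the disk image
```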
Created attachment 1719443 [details]
PV-related alerts for newly created disk
First: it's an OpenShift thing. Second: I do not think it's a bug, as a PV(C) crossed a threshold (85%).
For clarity, this is for a newly created VM. The cluster I was doing this on was reprovisioned (so I can't run df at the moment), but I didn't save a bunch of data to the disk or anything; I just defined a new VM and it generated a critical alert in Prometheus straight away (meaning users would continually be notified of this non-critical condition unless the alert were re-routed). I don't know whether the solution is to overprovision the PVC (or undersize the disk image) so it stays under the threshold, or whether there's a way of updating the OCP rule so that it ignores certain PVCs (I don't have much knowledge of PromQL, but the rule seems to just check all of them).
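For reference, the upstream rule is roughly of the following shape (a hedged sketch; exact thresholds, durations, and labels vary by release), and a negative label matcher is one way a modified rule could skip specific PVCs:

```yaml
# Hedged sketch of a KubePersistentVolumeFillingUp-style rule.
# The persistentvolumeclaim exclusion regex is illustrative, not an
# actual OCP default.
- alert: KubePersistentVolumeFillingUp
  expr: |
    kubelet_volume_stats_available_bytes{job="kubelet", persistentvolumeclaim!~".*rootdisk.*"}
      /
    kubelet_volume_stats_capacity_bytes{job="kubelet", persistentvolumeclaim!~".*rootdisk.*"}
      < 0.15
  for: 1m
  labels:
    severity: critical
```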
Sorry, I thought the needinfo was for me; adding it back on alitke.
Joal, thanks, that's a good hint. And yes, this alarm will probably be kicking off for (at least) pre-allocated file-system (disk.img) PVs, and this is indeed an exception. So far my understanding. We can see if we can fix this. Joal, do you recall what kind of PV it was, FS or block? And if it was pre-allocated or not?
Sorry for the delay getting back. Just reproduced it and it is only happening with filesystem storage but not block storage.
We need to distinguish between two cases:
- FS with a sparse disk image: this should not lead to an alert.
- FS with a preallocated disk image: this should lead to an alert, but we (CNV) would want to opt out of the alert mechanism in this case.

Thus: the disk.img on the PV, of what size was it?

$ ls -shal

should tell us everything that is necessary.

Moving this bug to storage, because this is about how to deal with our specific way of storing data in disk.img files.
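To show what the requested output distinguishes, here is a sketch on a plain local filesystem (not CephFS): in `ls -shal`, the first column is the allocated (on-disk) size, while the usual size column is the apparent length.

```shell
# Create one sparse and one preallocated 100M image (local FS sketch).
truncate -s 100M sparse.img                      # hole only, no data written
dd if=/dev/zero of=prealloc.img bs=1M count=100  # fully written out

ls -shal sparse.img prealloc.img
# sparse.img:   allocated ~0,    apparent 100M
# prealloc.img: allocated ~100M, apparent 100M
```

So for a sparse image the allocated size should be far below the apparent size, while a preallocated image shows both at roughly the full disk size.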
This alert is behaving as the system intended, but it is not relevant for this use case. As a workaround you can silence the alert with the following procedure: https://docs.openshift.com/container-platform/4.2/monitoring/cluster_monitoring/managing-cluster-alerts.html#monitoring-silencing-alerts_managing-alerts
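Besides a manual silence via the console, another option is routing the alert for these PVCs to a no-op receiver in the Alertmanager configuration. A sketch, assuming the VM disk PVC names contain "rootdisk" (the regex and receiver names are illustrative):

```yaml
# Hedged Alertmanager config fragment: send KubePersistentVolumeFillingUp
# for VM root-disk PVCs to a receiver with no notification targets.
route:
  receiver: default
  routes:
    - match:
        alertname: KubePersistentVolumeFillingUp
      match_re:
        persistentvolumeclaim: ".*rootdisk.*"
      receiver: "null"
receivers:
  - name: default
  - name: "null"
```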
> CNV: 2.4.1

We switched back to sparse allocation of disk images to avoid firing off alerts, but that change was already in place in CNV 2.4.1.

I'm not knowledgeable about the internals of CephFS, but I found this quote in "Differences from POSIX" (https://docs.ceph.com/en/latest/cephfs/posix/):

> Sparse files propagate incorrectly to the stat(2) st_blocks field. Because CephFS does not explicitly track which parts of a file are allocated/written, the st_blocks field is always populated by the file size divided by the block size. This will cause tools like du(1) to overestimate consumed space. (The recursive size field, maintained by CephFS, also includes file "holes" in its count.)

Which seems to say that CephFS might report a higher space usage than other file systems when dealing with sparse files. I wonder if the alert is firing for this specific reason.
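To illustrate the st_blocks point: on a typical local filesystem, a sparse file reports only its actually allocated blocks, whereas per the CephFS documentation quoted above, CephFS would report roughly size/512 blocks even for a sparse file, making it look fully consumed. A sketch (run on a local filesystem, not CephFS):

```shell
# stat format: %s is the apparent size in bytes, %b is the number of
# 512-byte blocks allocated on disk.
truncate -s 100M sparse.img
stat -c 'apparent=%s bytes, allocated=%b blocks' sparse.img
# On ext4/xfs: apparent=104857600 bytes, allocated=0 (or near-0) blocks.
# Per the quoted doc, CephFS would report allocated ~ 204800 blocks here.
```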
Clearing my needinfo because I don't see any specific ask here. At this point we are trying to decide if it's possible to automatically silence certain alerts based on DataVolume settings. To me this would probably be an RFE and not a bug.