Bug 1885662

Summary: PV's created by DataVolume appear to trigger prometheus alerts
Product: Container Native Virtualization (CNV) Reporter: Joel Davis <jodavis>
Component: Storage Assignee: Adam Litke <alitke>
Status: CLOSED DEFERRED QA Contact: Alex Kalenyuk <akalenyu>
Severity: unspecified Docs Contact:
Priority: unspecified    
Version: 2.4.1 CC: alitke, cnv-qe-bugs, fdeutsch, mrashish, stirabos
Target Milestone: ---   
Target Release: 4.8.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2021-03-31 19:01:13 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments:
Description                                      Flags
Full screen of alert firing                      none
PV-related alerts for newly created disk         none

Description Joel Davis 2020-10-06 15:58:18 UTC
Created attachment 1719442 [details]
Full screen of alert firing

Description of problem:

DataVolume appears to provision a PV exactly the size of the disk image it ends up writing, causing the KubePersistentVolumeFillingUp alert in Prometheus to fire at "Critical" severity.

Version-Release number of selected component (if applicable):

OCP: 4.5.13
OCS: 4.5.0
CNV: 2.4.1
Local Storage: 4.5.0-202009161248.p0

How reproducible:

Pretty consistent

Steps to Reproduce:
1. Install CNV.
2. Configure OCS with local storage for OSD.
3. Use web UI Wizard to generate a new Virtual Machine using URL source and ocs-storagecluster-cephfs for the disk's storage class.
4. Wait a few minutes after import completes.
5. Observe currently firing Prometheus alerts.
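(For reference, the wizard-driven import boils down to a DataVolume roughly like the sketch below. The name, image URL, size and API version here are assumptions rather than values copied from the cluster; the relevant point is that the PVC is requested at essentially the same size as the disk image that gets written into it.)

$ cat <<'EOF' | oc create -f -
apiVersion: cdi.kubevirt.io/v1alpha1        # assumed API version for this CNV release
kind: DataVolume
metadata:
  name: example-vm-rootdisk                 # hypothetical name
spec:
  source:
    http:
      url: http://example.com/rhel8.qcow2   # hypothetical image URL
  pvc:
    storageClassName: ocs-storagecluster-cephfs
    accessModes:
      - ReadWriteMany
    resources:
      requests:
        storage: 20Gi                       # roughly the virtual size of the imported disk
EOF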

Actual results:
KubePersistentVolumeFillingUp is firing for the rootdisk PV (screenshots attached)

Expected results:
No alerts firing due to newly provisioned storage

Comment 1 Joel Davis 2020-10-06 15:58:50 UTC
Created attachment 1719443 [details]
PV-related alerts for newly created disk

Comment 2 Fabian Deutsch 2020-10-07 09:10:43 UTC
First: It's an OpenShift thing.
Second: I do not think it's a bug, as a PV(C) crossed a threshold (85%).

Comment 3 Joel Davis 2020-10-07 11:29:34 UTC
For clarity, this is for a newly created VM. The cluster I was doing this on has since been reprovisioned (so I can't run df at the moment), but I didn't save a bunch of data to the disk or anything. I just defined a new VM and it generated a critical alert in Prometheus straight away (meaning admins would be continually notified of this non-critical condition unless the alert were re-routed).

I don't know if the solution is to overprovision the PVC (or undersize the disk image) so that usage stays under the threshold, or whether there's a way of updating the OCP rule so that it ignores certain PVCs (I don't have much PromQL knowledge, but it seems to just check all of them).
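(In case it helps anyone triaging this, the rule definition can be dumped from the cluster with something along these lines. This is a sketch; the rule lives in one of the PrometheusRule objects in openshift-monitoring, whose exact name varies between OCP versions.)

$ oc -n openshift-monitoring get prometheusrules -o yaml | grep -B 3 -A 12 KubePersistentVolumeFillingUp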

Comment 4 Joel Davis 2020-10-07 11:30:14 UTC
Sorry I thought the needinfo was for me, adding it back on alitke

Comment 5 Fabian Deutsch 2020-10-07 12:13:19 UTC
Joel, thanks, that's a good hint.
And yes, this alarm will probably be kicking off for (at least) pre-allocated file-system (disk.img) PVs, and this is indeed an exception. That is my understanding so far. We can see if we can fix this.

Joel, do you recall what kind of PV it was? FS or block? And was it pre-allocated or not?

Comment 6 Joel Davis 2020-10-07 15:58:27 UTC
Sorry for the delay getting back. I just reproduced it, and it is only happening with filesystem storage, not block storage.
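(For reference, the volume mode per PV can be checked with something like the following; the fields used are standard PV spec fields, and the column names are just labels chosen for readability.)

$ oc get pv -o custom-columns=NAME:.metadata.name,MODE:.spec.volumeMode,STORAGECLASS:.spec.storageClassName,CLAIM:.spec.claimRef.name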

Comment 7 Fabian Deutsch 2020-10-21 09:34:08 UTC
We should distinguish between two cases:

- fs with sparse disk image, this should not lead to an alert
- fs with preallocated disk image, this should lead to an alert, but we - CNV - would want to opt-out of the alert mechanism in this case

So, the disk.img on the PV: what size was it?
$ ls -shal

should tell us everything that is necessary
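(If the VM is running, one way to get at the image is from inside the virt-launcher pod; the label selector and path below are the usual ones for filesystem-mode disks, but treat them as assumptions for this setup, and substitute the real pod name.)

$ oc get pods -l kubevirt.io=virt-launcher
$ oc exec <virt-launcher-pod> -- ls -shal /var/run/kubevirt-private/vmi-disks/rootdisk/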

Moving this bug to Storage, because this is about how to deal with our specific way of storing data in disk.img files.

Comment 8 Adam Litke 2020-10-26 12:14:30 UTC
This alert is behaving as the system intended, but it is not relevant for this use case. As a workaround you can silence the alert with the following procedure:

https://docs.openshift.com/container-platform/4.2/monitoring/cluster_monitoring/managing-cluster-alerts.html#monitoring-silencing-alerts_managing-alerts
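(The linked procedure uses the web console. If a CLI route is preferred, something along these lines should also work; this is a sketch only, assuming the default openshift-monitoring layout (alertmanager-main-0 pod, Alertmanager listening locally on 9093) and a placeholder PVC name.)

$ oc -n openshift-monitoring exec alertmanager-main-0 -c alertmanager -- \
    amtool silence add alertname=KubePersistentVolumeFillingUp persistentvolumeclaim=<pvc-name> \
    --alertmanager.url=http://localhost:9093 --duration=24h --author=admin \
    --comment="Expected: preallocated VM disk image fills the PV"
$ oc -n openshift-monitoring exec alertmanager-main-0 -c alertmanager -- \
    amtool silence query --alertmanager.url=http://localhost:9093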

Comment 10 Maya Rashish 2020-11-25 09:53:31 UTC
> CNV: 2.4.1

We switched back to sparse allocation of disk images to avoid firing off alerts, but this change was already in place in CNV 2.4.1.

I'm not knowledgeable about the internals of CephFS but I found this quote in "Differences from POSIX" https://docs.ceph.com/en/latest/cephfs/posix/
> Sparse files propagate incorrectly to the stat(2) st_blocks field. Because CephFS does not explicitly track which parts of a file are allocated/written, the st_blocks field is always populated by the file size divided by the block size. This will cause tools like du(1) to overestimate consumed space. (The recursive size field, maintained by CephFS, also includes file “holes” in its count.)

This seems to say that CephFS may report higher space usage than other file systems when dealing with sparse files.
I wonder if the alert is firing for this specific reason.
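(One way to check whether that is what is happening here, sketched under the assumption that you can reach the disk.img on the CephFS PV: compare the apparent size with the allocated blocks. On most filesystems a sparse image shows far fewer allocated blocks than its apparent size; per the quote above, CephFS would report them as roughly equal.)

$ stat -c 'apparent: %s bytes, allocated: %b blocks of %B bytes' disk.img
$ du -h --apparent-size disk.img
$ du -h disk.img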

Comment 11 Adam Litke 2021-01-04 19:56:07 UTC
Clearing my needinfo because I don't see any specific ask here. At this point we are trying to decide whether it's possible to automatically silence certain alerts based on DataVolume settings. To me this would probably be an RFE rather than a bug.