Bug 1885662 - PVs created by DataVolume appear to trigger Prometheus alerts
Summary: PVs created by DataVolume appear to trigger Prometheus alerts
Keywords:
Status: CLOSED DEFERRED
Alias: None
Product: Container Native Virtualization (CNV)
Classification: Red Hat
Component: Storage
Version: 2.4.1
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Target Release: 4.8.0
Assignee: Adam Litke
QA Contact: Alex Kalenyuk
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-10-06 15:58 UTC by Joel Davis
Modified: 2021-03-31 19:01 UTC
CC List: 5 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-03-31 19:01:13 UTC
Target Upstream Version:
Embargoed:


Attachments
Full screen of alert firing (86.06 KB, image/png)
2020-10-06 15:58 UTC, Joel Davis
PV-related alerts for newly created disk (32.67 KB, image/png)
2020-10-06 15:58 UTC, Joel Davis

Description Joel Davis 2020-10-06 15:58:18 UTC
Created attachment 1719442 [details]
Full screen of alert firing

Description of problem:

The DataVolume appears to provision a PV exactly the same size as the disk image it ends up writing, causing the KubePersistentVolumeFillingUp alert in Prometheus to fire with a "Critical" level of severity.

Version-Release number of selected component (if applicable):

OCP: 4.5.13
OCS: 4.5.0
CNV: 2.4.1
Local Storage: 4.5.0-202009161248.p0

How reproducible:

Pretty consistent

Steps to Reproduce:
1. Install CNV.
2. Configure OCS with local storage for OSD.
3. Use the web UI wizard to create a new Virtual Machine using a URL source and ocs-storagecluster-cephfs for the disk's storage class (a roughly equivalent DataVolume manifest is sketched after these steps).
4. Wait a few minutes after import completes.
5. Observe currently firing Prometheus alerts.
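
For reference, the disk import done by the wizard boils down to a DataVolume. A minimal manifest along the following lines would exercise the same path; the apiVersion, name, image URL, and size below are illustrative assumptions, not values taken from this report:

# Hypothetical DataVolume importing a disk image over HTTP into a CephFS-backed PVC.
# On older CDI releases the apiVersion may be v1alpha1; name, URL, and size are placeholders.
apiVersion: cdi.kubevirt.io/v1beta1
kind: DataVolume
metadata:
  name: example-rootdisk
spec:
  source:
    http:
      url: "http://example.com/images/example.qcow2"   # placeholder image URL
  pvc:
    storageClassName: ocs-storagecluster-cephfs
    accessModes:
      - ReadWriteMany            # CephFS supports RWX; RWO would also work
    resources:
      requests:
        storage: 10Gi            # the written disk.img ends up close to this size (see description)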

Actual results:
KubePersistentVolumeFillingUp is firing for the rootdisk PV (screenshots attached)

Expected results:
No alerts firing due to newly provisioned storage

Comment 1 Joel Davis 2020-10-06 15:58:50 UTC
Created attachment 1719443 [details]
PV-related alerts for newly created disk

Comment 2 Fabian Deutsch 2020-10-07 09:10:43 UTC
First: It's an OpenShift thing.
Second: I do not think it's a bug, as a PV(C) crossed a threshold (85%).

Comment 3 Joel Davis 2020-10-07 11:29:34 UTC
For clarity, this is for a newly created VM. The cluster I was doing this on was reprovisioned (so I can't run df at the moment), but I didn't save a bunch of data to the disk or anything. I just defined a new VM and it generated a critical alert in Prometheus straight away (meaning operators would continually be notified of this non-critical condition unless the alert were re-routed).

I don't know if the solution is to overprovision the PVC (or undersize the disk image) so it stays under the threshold, or if there's a way of updating the OCP rule so that it ignores certain PVCs (I don't have much knowledge of PromQL, but it seems to just check all of them).
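
For context, the upstream kubernetes-mixin version of this alert is built on the kubelet volume-stats metrics, and one conceivable way to make it ignore opted-out PVCs is a join against the kube-state-metrics PVC-labels metric. The sketch below only shows the shape of such a rule; the exact expression and threshold shipped in OCP 4.5 may differ, the opt-out label name is hypothetical, and this is not a supported override of the platform rule:

groups:
  - name: kubernetes-storage            # illustrative group; not the shipped OCP rule file
    rules:
      - alert: KubePersistentVolumeFillingUp
        expr: |
          (
            kubelet_volume_stats_available_bytes{job="kubelet"}
              / kubelet_volume_stats_capacity_bytes{job="kubelet"}
            < 0.15
          )
          # Hypothetical opt-out: skip PVCs carrying a dedicated label,
          # joined in from kube-state-metrics.
          unless on (namespace, persistentvolumeclaim)
            kube_persistentvolumeclaim_labels{label_exclude_from_volume_alerts="true"}
        for: 1m                          # threshold and duration are illustrative (comment 2 mentions 85% full)
        labels:
          severity: critical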

Comment 4 Joel Davis 2020-10-07 11:30:14 UTC
Sorry, I thought the needinfo was for me; adding it back on alitke.

Comment 5 Fabian Deutsch 2020-10-07 12:13:19 UTC
Joel, thanks, that's a good hint.
And yes, this alarm will probably be kicking off for (at least) pre-allocated file-system (disk.img) PVs, and this is indeed an exception. That is my understanding so far. We can see if we can fix this.

Joel, do you recall what kind of PV it was? FS or block? And was it pre-allocated or not?

Comment 6 Joel Davis 2020-10-07 15:58:27 UTC
Sorry for the delay getting back. I just reproduced it, and it is only happening with filesystem storage, not block storage.

Comment 7 Fabian Deutsch 2020-10-21 09:34:08 UTC
We should distinguish between two cases:

- fs with a sparse disk image: this should not lead to an alert
- fs with a preallocated disk image: this should lead to an alert, but we (CNV) would want to opt out of the alert mechanism in this case

Thus: the disk.img on the PV, what size was it?
$ ls -shal
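# the first column (from -s) is the allocated size; for a sparse disk.img it is well below the apparent size shown by -l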

should tell us everything that is necessary

Moving this bug to storage, because this is about how to deal with our specific way of storing data in disk.img files.

Comment 8 Adam Litke 2020-10-26 12:14:30 UTC
This alert is behaving as designed, but it is not relevant for this use case. As a workaround, you can silence the alert with the following procedure:

https://docs.openshift.com/container-platform/4.2/monitoring/cluster_monitoring/managing-cluster-alerts.html#monitoring-silencing-alerts_managing-alerts

Comment 10 Maya Rashish 2020-11-25 09:53:31 UTC
> CNV: 2.4.1

We switched back to sparse allocation of disk images to avoid firing off alerts, but this change was already in place in CNV 2.4.1.

I'm not knowledgeable about the internals of CephFS but I found this quote in "Differences from POSIX" https://docs.ceph.com/en/latest/cephfs/posix/
> Sparse files propagate incorrectly to the stat(2) st_blocks field. Because CephFS does not explicitly track which parts of a file are allocated/written, the st_blocks field is always populated by the file size divided by the block size. This will cause tools like du(1) to overestimate consumed space. (The recursive size field, maintained by CephFS, also includes file “holes” in its count.)

This seems to say that CephFS might report higher space usage than other file systems when dealing with sparse files.
I wonder if the alert is firing for this specific reason.

Comment 11 Adam Litke 2021-01-04 19:56:07 UTC
Clearing my needinfo because I don't see any specific ask here.  At this point we are trying to decide if it's possible to automatically silence certain alerts based on DataVolume settings.  To me this would probably be an RFE and not a bug.

