Bug 2119039

Summary: Enhance user experience with corner case of full VDOPOOL
Product: Red Hat Enterprise Linux 9 Reporter: Zdenek Kabelac <zkabelac>
Component: lvm2 Assignee: Zdenek Kabelac <zkabelac>
lvm2 sub component: VDO QA Contact: cluster-qe <cluster-qe>
Status: NEW --- Docs Contact:
Severity: unspecified    
Priority: medium CC: agk, awalsh, heinzm, jbrassow, prajnoha, zkabelac
Version: 9.1 Keywords: Triaged
Target Milestone: rc   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Zdenek Kabelac 2022-08-17 10:40:39 UTC
When the VDOPOOL becomes full with the current version of lvm2, the only way for a user to discover this state is to check for 100% data usage of the VDOPOOL with lvs.
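
A minimal sketch of that manual check, for illustration only. It assumes the two-column output of 'lvs --noheadings -o lv_name,data_percent'; the helper name flag_full_pools is hypothetical and not part of lvm2.

```shell
# Hypothetical helper (not part of lvm2): read "name data_percent" lines
# on stdin, as printed by `lvs --noheadings -o lv_name,data_percent`,
# and print the name of every LV whose data usage has reached 100%.
flag_full_pools() {
    # Lines without a data_percent column (LVs that do not report usage)
    # have fewer than two fields and are skipped.
    awk 'NF >= 2 && $2 + 0 >= 100 { print $1 }'
}
```

Typical use would be 'lvs --noheadings -o lv_name,data_percent vg | flag_full_pools'. This only polls the current state, so it shares the weakness described in this report: a pool that was full only temporarily is not caught.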

However, this is not a good way to inform the user about such a highly problematic situation.

There are several associated problems. The fullness might be a 'temporary' issue, so the user might not even be aware that the full-pool error state ever occurred. It is not even reported by the kvdo target. The only observable signs are the report from 'dmeventd' monitoring and the kernel write error message, and these may be hard to associate with the full pool.

A small example of how to examine the situation:

# Create a small VDO pool with an 'overprovisioned' 200MiB volume,
# while the vdopool can only store <130MiB
#
# lvcreate --vdo -V200M -L2.9G --vdosettings 'vdoslabsizemb=128' -n lv vg 

# Now write urandom data to this volume with 'dd'

# dd if=/dev/urandom of=/dev/vg/lv bs=1M count=140 status=progress

For the user, this operation ends with 'success' (return code 0).
In parallel, the kernel reports some 'async' write errors in dmesg.

The situation improves if the option 'conv=fdatasync' or 'oflag=direct' is used with dd, so that the userspace app at least sees an error on write.
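
A minimal sketch of the 'conv=fdatasync' variant, assuming GNU dd. The wrapper name write_checked is hypothetical; it only turns dd's exit status into a visible message.

```shell
# Hypothetical wrapper: write COUNT MiB of urandom data to TARGET.
# conv=fdatasync makes dd flush the data before it exits, so a write
# that fails during flush shows up in dd's exit status instead of
# appearing only as an async kernel error.
write_checked() {
    target=$1
    count=$2
    if dd if=/dev/urandom of="$target" bs=1M count="$count" \
          conv=fdatasync status=none; then
        echo "write ok"
    else
        echo "write failed" >&2
        return 1
    fi
}
```

With the volume from the example above this would be called as 'write_checked /dev/vg/lv 140'. The wrapper does not touch the lvm2 side, so the full-pool state can still be missed there.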

Yet we can still easily 'miss' the error state on the lvm2 side, as even a simple 'TRIM' on the device might make such a VDO LV look normal again, although once a pool has been overfilled, fsck should always be run.

Enhance this use case. Similarities can probably be found with thin-pool out-of-data_space/out_of_metadata_space handling.