Created attachment 1727709 [details]
VDSM Log

Description of problem:
We back up our VMs in oVirt with Vinchin. Vinchin uses direct (LAN-free) access to the iSCSI LUNs to fetch the data from the data domains. Everything went fine for weeks, but today I noticed a backup failed. After investigating, I noticed the following errors in the logs:

WARNING: invalid metadata text from /dev/mapper/3600a098038305663785d505652713446 at 135074304.
WARNING: metadata on /dev/mapper/3600a098038305663785d505652713446 at 135074304 has invalid summary for VG.

This happens when assigning tags to the LV (see logs attached). The main question here is: is this caused by oVirt, or is Vinchin doing something wrong which triggers this error?

Now I see the LV still exists:

# lvs -o +tags | grep dd91fd00-69f6-41bd-bad5-8db9b04fb1fa
  dd91fd00-69f6-41bd-bad5-8db9b04fb1fa 6e99da85-8414-4ec5-92c3-b6cf741fc125 -wi------- 1.00g OVIRT_VOL_INITIALIZING

Can it be removed without problems (via lvremove on the SPM)?

We also notice snapshots now fail on this VM with the following error:

2020-11-09 08:32:28,997+0100 INFO  (tasks/0) [storage.LVM] Creating LV (vg=6e99da85-8414-4ec5-92c3-b6cf741fc125, lv=643a306e-cab3-446f-90cf-91a355cf893c, size=1024m, activate=True, contiguous=False, initialTags=('OVIRT_VOL_INITIALIZING',), device=None) (lvm:1552)
2020-11-09 08:32:29,125+0100 ERROR (tasks/5) [storage.Image] There is no leaf in the image fb6fb206-4ca1-417c-8c83-21ea002db69a (image:198)
2020-11-09 08:32:29,125+0100 WARN  (tasks/5) [storage.ResourceManager] Resource factory failed to create resource '01_img_6e99da85-8414-4ec5-92c3-b6cf741fc125.fb6fb206-4ca1-417c-8c83-21ea002db69a'. Canceling request.
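A quick way to spot volumes left behind in this state is to filter the lvs output for the transient creation tag. A minimal sketch against the sample line above (assumption: on a live SPM host you would feed real `lvs --noheadings -o lv_name,vg_name,lv_attr,lv_size,lv_tags` output instead of the hard-coded sample):

```shell
# Sample line from the `lvs -o +tags` output in this report; on a real
# host this would come from the lvs command itself.
lvs_output='  dd91fd00-69f6-41bd-bad5-8db9b04fb1fa 6e99da85-8414-4ec5-92c3-b6cf741fc125 -wi------- 1.00g OVIRT_VOL_INITIALIZING'

# Print the names of LVs whose last field is the transient oVirt tag,
# i.e. volumes whose creation never completed.
stale=$(echo "$lvs_output" | awk '$NF == "OVIRT_VOL_INITIALIZING" {print $1}')
echo "$stale"
```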
(resourceManager:522)
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/vdsm/storage/resourceManager.py", line 518, in registerResource
    obj = namespaceObj.factory.createResource(name, lockType)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/resourceFactories.py", line 193, in createResource
    lockType)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/resourceFactories.py", line 122, in __getResourceCandidatesList
    imgUUID=resourceName)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/image.py", line 199, in getChain
    raise se.ImageIsNotLegalChain(imgUUID)
vdsm.storage.exception.ImageIsNotLegalChain: Image is not a legal chain: ('fb6fb206-4ca1-417c-8c83-21ea002db69a',)

How to fix this?
I guess the same happens on the Vinchin node as described here: https://access.redhat.com/solutions/4706501
As expected, the following info was found in the lvm2 metadata on that device/LUN:

# Generated by LVM2 version 2.02.186(2)-RHEL7 (2019-08-27): Sat Nov  7 02:36:14 2020

contents = "Text Format Volume Group"
version = 1

description = ""

creation_host = "vinchin-node001"	# Linux vinchin-node001 3.10.0-1127.19.1.el7.x86_64 #1 SMP Tue Aug 25 17:23:54 UTC 2020 x86_64
creation_time = 1604712974	# Sat Nov 7 02:36:14 2020

So Vinchin itself wrote to the LVM metadata, which caused oVirt to go nuts.
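The creation_host field is what gives the writer away. A sketch of pulling it out of the metadata text (the sample string is taken from the report; on a live system the text could be obtained by dumping the start of the PV, e.g. with `dd` piped through `strings`, assuming the first metadata area sits near the start of the device):

```shell
# creation_host line as found in the on-disk lvm2 metadata above.
metadata='creation_host = "vinchin-node001"   # Linux vinchin-node001 3.10.0-1127.19.1.el7.x86_64'

# Extract the quoted hostname to see which machine last rewrote the VG.
host=$(echo "$metadata" | sed -n 's/^creation_host = "\([^"]*\)".*/\1/p')
echo "$host"
```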
After more troubleshooting, it was indeed Vinchin that caused an LVM metadata update and then triggered the lvchange error in oVirt.

Now, while the root cause might be Vinchin, I think the error handling here could be better. While the snapshot was never created, the base volume was left stuck in 'INTERNAL' state:

CAP=2147483648
CTIME=1600986804
DESCRIPTION=
DISKTYPE=DATA
DOMAIN=6e99da85-8414-4ec5-92c3-b6cf741fc125
FORMAT=COW
GEN=0
IMAGE=fb6fb206-4ca1-417c-8c83-21ea002db69a
LEGALITY=LEGAL
PUUID=00000000-0000-0000-0000-000000000000
TYPE=SPARSE
VOLTYPE=INTERNAL
EOF

Because of this, snapshots no longer work on this VM. I also think a reboot might cause an issue.

The snapshot volume (dd91fd00-69f6-41bd-bad5-8db9b04fb1fa) still existed as well, but an lvremove fixed this :)
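The broken state can be confirmed by checking the VOLTYPE field in the vdsm volume metadata. A minimal sketch over the metadata shown above (sample trimmed to the relevant lines; on a host the full block comes from the domain's metadata LV):

```shell
# Trimmed vdsm volume metadata from this comment.
meta='LEGALITY=LEGAL
TYPE=SPARSE
VOLTYPE=INTERNAL'

# A parent stuck as INTERNAL with no child volume is the broken chain
# vdsm complains about ("There is no leaf in the image ...").
voltype=$(echo "$meta" | sed -n 's/^VOLTYPE=//p')
echo "$voltype"
```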
Amit, can you please have a look?
What oVirt version is being used here? The ticket says 4.4.2.6, but the lvm version here is 2.02 from RHEL 7? The release notes for oVirt 4.4.2 ask to use the el8.2 platform [1], so your Vinchin should be aligned to that as well. There are lvm corruption issues fixed for el8.2 (which ships with lvm 2.03) since then.

[1] https://www.ovirt.org/release/4.4.2/
It's oVirt 4.4.2.6. But Vinchin is indeed still CentOS 7 (there isn't any newer version currently).
(In reply to Jean-Louis Dupond from comment #6)
> It's oVirt 4.4.2.6. But Vinchin is indeed still CentOS 7 (there isn't any
> newer version currently).

I'm not sure you can intermix those. I am not aware of a supported backup tools matrix for oVirt (maybe PM knows), but we rely on all hosts accessing the same LUNs with the same platform tools (lvm 2.03 specifically). We had a similar issue back on 4.3.x with el7 lvm 2.02, with bad VG metadata like shown here.

I think the steps are to put the SD into maintenance and recover the VG metadata from the latest host lvm backup, before doing any cleanups related to snapshots.
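The recovery outlined above might look like the following sketch (assumptions: the storage domain is already in maintenance, this runs on a host that holds a recent automatic backup under /etc/lvm/backup/<vg>, and the guard is only there so the sketch degrades gracefully where lvm2 is not installed):

```shell
# VG name of the affected storage domain, taken from this report.
VG=6e99da85-8414-4ec5-92c3-b6cf741fc125

if command -v vgcfgrestore >/dev/null 2>&1; then
    # --test performs a dry run: report what would be restored from the
    # latest backup without writing anything to the PVs.
    vgcfgrestore --test "$VG" || echo "dry run failed: check /etc/lvm/backup/$VG"
    # After a successful dry run, the real restore would be:
    # vgcfgrestore "$VG"
else
    echo "lvm2 tools not available on this host"
fi
```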
The documentation text flag should only be set after 'doc text' field is provided. Please provide the documentation text and set the flag to '?' again.
The cause is indeed Vinchin. But if for some other reason an lvchange/lvcreate/... fails while taking a snapshot, shouldn't this be handled better so we don't end up with a broken disk (wrong volume type)? Or should I create a new bug for that?
(In reply to Jean-Louis Dupond from comment #9)
> The cause is indeed Vinchin.
>
> But if for some other reason a lvchange/lvcreate/... fails when taking a
> snapshot. Shouldn't this be handled better so we don't end up with a broken
> disk (wrong type)?
> Or should I create a new bug for that?

IMHO it seems to happen just because the child LV is unexpectedly lost due to the LVM corruption. There is a rollback that turns the parent volume from internal back to leaf if the child volume creation fails in the first place. In this case the corruption seems to have caused the child to be lost during the snapshot job, which comes after the parent is cloned to the child.

LVM corruption is not something vdsm knows how to recover from; it needs manual intervention. So I am not sure we need the rollback of the parent volume back to leaf on snapshot job failure, since the child volume is already created and should be intact by the time the failure happens.
An easy way to recover from this would be great (an API to change the volume type). But the root cause here is not in oVirt, so closing.