Bug 1895840 - Failing lvchange when taking snapshot causes "There is no leaf in the image"
Summary: Failing lvchange when taking snapshot causes "There is no leaf in the image"
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: ovirt-engine
Classification: oVirt
Component: BLL.Storage
Version: 4.4.2.6
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: Tal Nisan
QA Contact: Avihai
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-11-09 09:21 UTC by Jean-Louis Dupond
Modified: 2020-11-30 15:49 UTC
CC List: 3 users

Fixed In Version:
Clone Of:
Environment:
Last Closed: 2020-11-30 15:48:25 UTC
oVirt Team: Storage
Embargoed:


Attachments
VDSM Log (29.33 KB, text/plain)
2020-11-09 09:21 UTC, Jean-Louis Dupond

Description Jean-Louis Dupond 2020-11-09 09:21:30 UTC
Created attachment 1727709 [details]
VDSM Log

Description of problem:
We back up our VMs in oVirt with Vinchin.
Vinchin uses direct access (LAN-Free) to the iSCSI LUNs to fetch the data from the data domains.

Everything went fine for weeks, but today I noticed that a backup failed.
After investigating, I found the following error in the logs:

WARNING: invalid metadata text from /dev/mapper/3600a098038305663785d505652713446 at 135074304.', '  WARNING: metadata on /dev/mapper/3600a098038305663785d505652713446 at 135074304 has invalid summary for VG.'

This happens when assigning tags to the LV (see logs attached).
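
For context, my understanding is that the tag assignment is a plain lvchange tag update on the new LV, presumably something along these lines (a simplified sketch using the VG/LV shown further below; the real vdsm invocation goes through its own lvm wrapper with extra options):

# lvchange --addtag OVIRT_VOL_INITIALIZING 6e99da85-8414-4ec5-92c3-b6cf741fc125/dd91fd00-69f6-41bd-bad5-8db9b04fb1fa

So the warning above would be lvm complaining about the on-disk VG metadata while executing that tag change.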


The main question here is: is this caused by oVirt, or is Vinchin doing something wrong that triggers this error?

Now I see the LV still exists:
# lvs -o +tags |grep dd91fd00-69f6-41bd-bad5-8db9b04fb1fa
  dd91fd00-69f6-41bd-bad5-8db9b04fb1fa 6e99da85-8414-4ec5-92c3-b6cf741fc125 -wi-------   1.00g                                                     OVIRT_VOL_INITIALIZING                         


Can it be removed without problems (via lvremove on the SPM)?
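
For reference, this is roughly what I have in mind (a sketch only, using the VG/LV names shown above). First inspect the leftover LV and its tags on the SPM host:

# lvs -o lv_name,vg_name,lv_size,lv_attr,lv_tags 6e99da85-8414-4ec5-92c3-b6cf741fc125/dd91fd00-69f6-41bd-bad5-8db9b04fb1fa

and if it is inactive and only carries OVIRT_VOL_INITIALIZING, drop it:

# lvremove 6e99da85-8414-4ec5-92c3-b6cf741fc125/dd91fd00-69f6-41bd-bad5-8db9b04fb1fa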

We also notice that snapshots now fail on this VM with the following error:
2020-11-09 08:32:28,997+0100 INFO  (tasks/0) [storage.LVM] Creating LV (vg=6e99da85-8414-4ec5-92c3-b6cf741fc125, lv=643a306e-cab3-446f-90cf-91a355cf893c, size=1024m, activate=True, contiguous=False, initialTags=('OVIRT_VOL_INITIALIZING',), device=None) (lvm:1552)
2020-11-09 08:32:29,125+0100 ERROR (tasks/5) [storage.Image] There is no leaf in the image fb6fb206-4ca1-417c-8c83-21ea002db69a (image:198)
2020-11-09 08:32:29,125+0100 WARN  (tasks/5) [storage.ResourceManager] Resource factory failed to create resource '01_img_6e99da85-8414-4ec5-92c3-b6cf741fc125.fb6fb206-4ca1-417c-8c83-21ea002db69a'. Canceling request. (resourceManager:522)
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/vdsm/storage/resourceManager.py", line 518, in registerResource
    obj = namespaceObj.factory.createResource(name, lockType)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/resourceFactories.py", line 193, in createResource
    lockType)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/resourceFactories.py", line 122, in __getResourceCandidatesList
    imgUUID=resourceName)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/image.py", line 199, in getChain
    raise se.ImageIsNotLegalChain(imgUUID)
vdsm.storage.exception.ImageIsNotLegalChain: Image is not a legal chain: ('fb6fb206-4ca1-417c-8c83-21ea002db69a',)

-> How to fix this?

Comment 1 Jean-Louis Dupond 2020-11-09 10:57:54 UTC
I guess the same thing happens on the Vinchin node as described here: https://access.redhat.com/solutions/4706501

Comment 2 Jean-Louis Dupond 2020-11-09 12:00:09 UTC
As expected, the following info was found in the LVM2 metadata on that device/LUN:


# Generated by LVM2 version 2.02.186(2)-RHEL7 (2019-08-27): Sat Nov  7 02:36:14 2020

contents = "Text Format Volume Group"
version = 1
 
description = "" 
 
creation_host = "vinchin-node001"       # Linux vinchin-node001 3.10.0-1127.19.1.el7.x86_64 #1 SMP Tue Aug 25 17:23:54 UTC 2020 x86_64
creation_time = 1604712974      # Sat Nov  7 02:36:14 2020


So Vinchin wrote to the LVM metadata on this LUN, which caused oVirt to go nuts.
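
I assume an LVM device filter on the Vinchin node would avoid this in the future, so that its lvm2 never scans or rewrites the VG metadata on the oVirt LUNs. A sketch for /etc/lvm/lvm.conf on that node (using the WWID from the warning above; the pattern would of course need to cover all data-domain LUNs):

devices {
    global_filter = [ "r|^/dev/mapper/3600a098038305663785d505652713446$|", "a|.*|" ]
}

If I'm not mistaken, 'lvmconfig devices/global_filter' should show the new value afterwards.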

Comment 3 Jean-Louis Dupond 2020-11-09 13:30:59 UTC
So after more troubleshooting, it was indeed Vinchin that caused an LVM metadata update and thereby triggered the lvchange error in oVirt.

Now, while the root cause might be Vinchin, I think the error handling here could be better.
Although the snapshot was never created, the base volume was left stuck in the 'INTERNAL' state:
CAP=2147483648
CTIME=1600986804
DESCRIPTION=
DISKTYPE=DATA
DOMAIN=6e99da85-8414-4ec5-92c3-b6cf741fc125
FORMAT=COW
GEN=0
IMAGE=fb6fb206-4ca1-417c-8c83-21ea002db69a
LEGALITY=LEGAL
PUUID=00000000-0000-0000-0000-000000000000
TYPE=SPARSE
VOLTYPE=INTERNAL
EOF


As a result, snapshots no longer work on this VM.
I also think a reboot of the VM might cause an issue.

The snapshot volume (dd91fd00-69f6-41bd-bad5-8db9b04fb1fa) also still existed, but an lvremove fixed this :)

Comment 4 Tal Nisan 2020-11-09 15:14:54 UTC
Amit, can you please have a look?

Comment 5 Amit Bawer 2020-11-09 15:31:24 UTC
Which oVirt version is being used here? The ticket says 4.4.2.6, but the lvm version here is 2.02 from RHEL 7.
The release notes for oVirt 4.4.2 ask for an el8.2 platform [1], so your Vinchin node should be aligned with that as well.

A number of lvm corruption issues have since been fixed for el8.2 (which ships with lvm-2.03).

[1] https://www.ovirt.org/release/4.4.2/

Comment 6 Jean-Louis Dupond 2020-11-09 15:41:21 UTC
It's oVirt 4.4.2.6. But Vinchin is indeed still CentOS 7 (there isn't any newer version currently).

Comment 7 Amit Bawer 2020-11-09 15:50:42 UTC
(In reply to Jean-Louis Dupond from comment #6)
> It's oVirt 4.4.2.6. But Vinchin is indeed still CentOS 7 (there isn't any
> newer version currently).

I'm not sure you can mix those; I'm not aware of the supported backup-tools matrix for oVirt, maybe PM is,
but we rely on all hosts accessing the same LUNs with the same platform tools (and lvm-2.03 specifically).
We had an issue back on 4.3.x with el7 lvm-2.02 similar to the bad VG metadata shown here.

I think the steps are to put the SD into maintenance and recover the VG metadata from the latest host lvm backup, before doing
any cleanups related to snapshots.
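
Roughly along these lines (a sketch; with the domain in maintenance, on a host whose /etc/lvm/archive still holds a good pre-corruption copy, and with the VG's LVs inactive). List the archived metadata versions for the storage-domain VG:

# vgcfgrestore --list 6e99da85-8414-4ec5-92c3-b6cf741fc125

then restore from the chosen archive file (placeholder file name):

# vgcfgrestore -f /etc/lvm/archive/<good-archive>.vg 6e99da85-8414-4ec5-92c3-b6cf741fc125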

Comment 8 RHEL Program Management 2020-11-09 23:59:53 UTC
The documentation text flag should only be set after 'doc text' field is provided. Please provide the documentation text and set the flag to '?' again.

Comment 9 Jean-Louis Dupond 2020-11-16 15:51:32 UTC
The cause is indeed Vinchin.

But if an lvchange/lvcreate/... fails for some other reason while taking a snapshot, shouldn't this be handled better so we don't end up with a broken disk (wrong volume type)?
Or should I create a new bug for that?

Comment 10 Amit Bawer 2020-11-17 08:36:34 UTC
(In reply to Jean-Louis Dupond from comment #9)
> The cause is indeed Vinchin.
> 
> But if an lvchange/lvcreate/... fails for some other reason while taking a
> snapshot, shouldn't this be handled better so we don't end up with a broken
> disk (wrong volume type)?
> Or should I create a new bug for that?

IMHO it seems to happen just because the child LV is unexpectedly lost due to the LVM corruption.
There is a volume-clone rollback that turns the parent volume from internal back to leaf if the child volume creation fails in
the first place. In this case the corruption seems to have caused the child to be lost during the snapshot job, which comes
after the parent is cloned to the child. LVM corruption is not something vdsm knows how to recover from; it needs manual intervention.
So I am not sure we need the rollback of the parent volume back to leaf on snapshot job failure, since the child
volume is already created and should be intact when it takes place.

Comment 11 Jean-Louis Dupond 2020-11-30 15:49:37 UTC
An easy way to recover from this would be great (an API to change the volume type).
But the root cause here is not in oVirt, so closing.

