## Description of problem:

Every evening, the customer snapshots all VMs as part of a Commvault backup process. We noticed a one-off failure to delete a snapshot. On closer inspection, the snapshot creation initially reported success, but we saw the following error in the logs:

```
2019-03-22 01:19:31,835+13 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.GetVolumeInfoVDSCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-1) [...] Failed building DiskImage: candidate can not be null please use static method createGuidFromString
2019-03-22 01:19:31,836+13 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.GetVolumeInfoVDSCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-1) [...] Command 'org.ovirt.engine.core.vdsbroker.vdsbroker.GetVolumeInfoVDSCommand' return value '
VolumeInfoReturn:{status='Status [code=0, message=Done]'}
status = INVALID
truesize = 0
apparentsize = 0
children: []
'
```

The snapshot delete was failing because the volume's metadata slot was empty:

```
NONE=####################################################################....
```

## Version-Release number of selected component (if applicable):

ovirt-engine-4.2.8.2-0.1.el7ev.noarch (Tue Feb 12 09:58:45 2019)
vdsm-4.20.46-1.el7ev.x86_64 (Thu Jan 17 07:00:24 2019)
(rhvh-4.2.8.0-0.20190116)

## How reproducible:

The environment takes multiple snapshots through the evening (Commvault), but we have noticed only one failure so far.

## Steps to Reproduce:
1. Take a snapshot to do a backup (Commvault).
2. Delete the snapshot after the backup is done.

## Actual results:

The snapshot 'succeeds', but the metadata information is not there.

## Expected results:

If the snapshot volume create failed, the snapshot should be rolled back.

## Additional info:

Why did the snapshot create go ahead even though the volume create failed? What happened to the metadata? Even if this is a one-off storage issue, the engine should be robust enough to roll back the snapshot after a failed VolumeCreate.
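The empty-slot pattern above can be checked programmatically. The following is a minimal sketch, assuming a metadata slot has already been read from the storage domain's metadata LV as raw bytes; the function name and the example slot contents are hypothetical, not part of VDSM:

```python
# Hypothetical helper (names are mine, not VDSM's): decide whether a
# block-storage volume metadata slot has been cleared. A cleared slot
# reads "NONE=" followed by '#' padding, which is exactly the pattern
# seen in this bug.
def slot_is_empty(slot_bytes: bytes) -> bool:
    text = slot_bytes.rstrip(b"\x00").decode("ascii", errors="replace")
    if not text.startswith("NONE="):
        return False
    padding = text[len("NONE="):].rstrip()
    return padding == "" or set(padding) == {"#"}

# Example slots (fabricated contents for illustration):
cleared = b"NONE=" + b"#" * 100 + b"\x00" * 407
populated = b"DOMAIN=deadbeef\nIMAGE=cafebabe\nEOF\n" + b"\x00" * 476
```

Running `slot_is_empty(cleared)` returns `True`, while the populated slot returns `False`, so a script sweeping all slots could flag volumes whose metadata was never written.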
I was checking this with Marcus; we seem to have two problems here:

1. The metadata for the volume was empty ~20 seconds after volume creation (GetVolumeInfo at 01:19:31 shows empty metadata for a volume created at 01:19:07), so most likely the metadata was never written.
2. The engine went ahead with SnapshotVDSCommand even after seeing the empty metadata for the created volume. SnapshotVDSCommand should not have been sent, because GetVolumeInfoVDSCommand showed a bad volume.

Regarding 1, there seems to be something wrong with the LVM metadata of the SD; we see some random and odd failures in the logs. It's a big SD: 50T with 1300 LVs and a lot of fragmentation.
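Problem 2 amounts to a missing guard in the snapshot flow. Below is an illustrative sketch in Python (the real engine is Java, and all function names here are hypothetical) of the behavior being asked for: verify the new volume before sending the snapshot command, and roll back on a bad volume:

```python
# Illustrative sketch, not engine code: after creating the snapshot
# volume, check the volume info; on a bad volume, delete it and fail
# the snapshot instead of sending SnapshotVDSCommand.
class SnapshotError(Exception):
    pass

def create_snapshot(create_volume, get_volume_info, delete_volume, send_snapshot_cmd):
    vol_id = create_volume()
    info = get_volume_info(vol_id)
    if info.get("status") != "OK":
        # GetVolumeInfo reported status=INVALID in this bug: roll back
        # the half-created volume rather than proceeding.
        delete_volume(vol_id)
        raise SnapshotError("volume %s is invalid; snapshot rolled back" % vol_id)
    send_snapshot_cmd(vol_id)
    return vol_id
```

With fakes wired in, an `INVALID` status triggers the rollback path and raises, while an `OK` status lets the snapshot command through.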
sync2jira
This is most likely related to BZ#1553133, which is fixed in vdsm-4.40.7. If we see this symptom again, we'll reopen and attach the relevant logs.

*** This bug has been marked as a duplicate of bug 1553133 ***