Bug 1858420
Summary: | Snapshot creation on host that engine then loses connection to results in missing snapshots table entry | |
---|---|---|---
Product: | Red Hat Enterprise Virtualization Manager | Reporter: | Gordon Watson <gwatson>
Component: | ovirt-engine | Assignee: | Benny Zlotnik <bzlotnik>
Status: | CLOSED ERRATA | QA Contact: | Evelina Shames <eshames>
Severity: | high | Docs Contact: |
Priority: | unspecified | |
Version: | 4.3.9 | CC: | achareka, aefrat, bzlotnik, dfodor, eshenitz, gveitmic, lrotenbe, mavital, sfishbai, tnisan
Target Milestone: | ovirt-4.4.5 | |
Target Release: | --- | |
Hardware: | Unspecified | |
OS: | Unspecified | |
Whiteboard: | | |
Fixed In Version: | | Doc Type: | If docs needed, set a value
Doc Text: | | Story Points: | ---
Clone Of: | | Environment: |
Last Closed: | 2021-04-14 11:39:56 UTC | Type: | Bug
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | Storage | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Description
Gordon Watson
2020-07-17 21:31:41 UTC
Created attachment 1701589
Customer's vdsm log

Gordon, which version are you using exactly? I suspect this issue should have been fixed by bug 1842375, which has been included since ovirt-engine-4.3.10.1. There is an additional related fix in bug 1838493, available in 4.3.11. Any chance you can check these? I haven't dug deep into the logs yet, but this scenario is very familiar and has been worked on a lot in late 4.3/early 4.4 releases.

I didn't look into the log, only at the scenario itself. In 4.4 we moved all live snapshots, with or without memory, to async; the motivation was the memory volumes, which can take a long time to generate. In 4.3, however, the engine snapshot command waits for the VDSM response. In the case of this bug we never get one, which likely makes the engine fail the command and remove the DB entry even though the snapshot actually succeeded on the VDSM/libvirt side, so the VM switches to the new volumes and the storage changes are made.

In 4.4 both the engine and VDSM are async: the engine sends the command to VDSM, and if the connection is lost the job is still queried by the engine. The job status stays in VDSM for 10 minutes after the job is done (success or failure). That means that if the engine's connection to VDSM is restored within 10 minutes of the job finishing, we get the right status and won't see any problem. If it takes longer than 10 minutes, the engine fails the command, which could still cause this bug, but in my opinion 10 minutes is enough, and it is configurable via the VDSM `jobs` configuration option `autodelete_delay` (see the drop-in sketch at the end of this report). Therefore, I think 4.4 does resolve this bug, and the only thing I would ask is for QE to try to reproduce it in 4.4.

Given comment #16, moving to MODIFIED.

(In reply to Gordon Watson from comment #0)
> 1. I created a VM with 3 disks, but that was just my test case to match what
> the customer had. My disks were file-based.
>
> 2. I started the VM.
>
> 3. On this host, I modified 'virt/vm.py' to add a 10 second sleep here;
>
>     try:
>         self.log.info("Taking a live snapshot (drives=%s, memory=%s)",
>                       ', '.join(drive["name"] for drive in newDrives.values()),
>                       memoryParams is not None)
>         time.sleep(10)
>         self._dom.snapshotCreateXML(snapxml, snapFlags)
>
> 4. I created a snapshot, including all 3 disks, with no memory volumes.
>
> 5. During the 10 second sleep, I blocked port 54321;
>     # iptables -I INPUT -p tcp --dport 54321 -j REJECT
>
> 6. When the snapshot had completed on the host ("Completed live snapshot"),
> I unblocked the port;
>     # iptables -D INPUT -p tcp --dport 54321 -j REJECT
>
> 7. The result was;
>
>     - 3 new active volumes were created
>     - all 3 were in use by 'qemu-kvm'
>     - all 3 were in the images table as the active volumes in the database
>     - a new snapshots table entry was created, but then got removed,
>       resulting in only one snapshots table entry
>     - such that the snapshot id in the 'vm_snapshot_id' field for each active
>       volume did not exist, and the Active VM entry was in the 'vm_snapshot_id'
>       field of each parent volume in the chain

Verified with the above steps on an env with rhv-4.4.5-9. The issue didn't reproduce: 3 new active volumes were created, and a new snapshots table entry was created and didn't get removed. The VM was up and accessible. After a reboot, the VM powered on and worked as expected. Moving to 'VERIFIED'.
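For anyone reproducing or verifying this, here is a minimal sketch of how the database state described above could be inspected on the RHV Manager. The psql invocation, the default database name 'engine', and the exact column names are assumptions based on the tables mentioned in comment #0 (snapshots, images, vm_snapshot_id), not taken from this bug:

    # su - postgres -c "psql engine"

    -- Snapshot rows the engine still has for the affected VM:
    SELECT snapshot_id, snapshot_type, description
      FROM snapshots
     WHERE vm_id = '<affected-vm-id>';

    -- Volumes whose 'vm_snapshot_id' no longer points at an existing
    -- snapshots row (the missing-entry symptom described in comment #0):
    SELECT i.image_guid, i.vm_snapshot_id, i.active
      FROM images i
      LEFT JOIN snapshots s ON s.snapshot_id = i.vm_snapshot_id
     WHERE i.vm_snapshot_id IS NOT NULL
       AND s.snapshot_id IS NULL;

On a healthy setup the second query should return no rows for the affected VM's volumes; in the failure scenario from comment #0 it would list the new active volumes whose snapshot entry was removed.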
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: RHV Manager (ovirt-engine) 4.4.z [ovirt-4.4.5] security, bug fix, enhancement), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:1169
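As a closing note on the 10-minute window discussed above: a minimal sketch of how the `autodelete_delay` tunable could be raised on a host, assuming the option lives in a [jobs] section, takes a value in seconds, and is picked up from a vdsm.conf.d drop-in (the drop-in file name below is made up for illustration):

    # cat /etc/vdsm/vdsm.conf.d/99-jobs-autodelete.conf
    [jobs]
    # Assumed tunable: keep finished job status for 30 minutes instead of the
    # default 10, giving the engine more time to re-query the snapshot job
    # after a connection loss.
    autodelete_delay = 1800

    # systemctl restart vdsmd

Restarting vdsmd on a host running VMs should be done with care; this only illustrates the tunable mentioned in the comments above and is not a recommended change.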