Bug 1858420 - Snapshot creation on host that engine then loses connection to results in missing snapshots table entry
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-engine
Version: 4.3.9
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ovirt-4.4.5
Target Release: ---
Assignee: Benny Zlotnik
QA Contact: Evelina Shames
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-07-17 21:31 UTC by Gordon Watson
Modified: 2024-06-13 22:49 UTC
CC List: 10 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-04-14 11:39:56 UTC
oVirt Team: Storage
Target Upstream Version:
Embargoed:


Attachments


Links
System ID Private Priority Status Summary Last Updated
Red Hat Knowledge Base (Solution) 5237501 0 None None None 2020-07-20 21:56:22 UTC
Red Hat Product Errata RHSA-2021:1169 0 None None None 2021-04-14 11:40:29 UTC
oVirt gerrit 104222 0 None MERGED core: change live snapshot to async 2021-02-17 07:32:56 UTC
oVirt gerrit 104275 0 None MERGED virt: change live snapshot to async 2021-02-17 07:32:57 UTC
oVirt gerrit 106107 0 None MERGED virt: async snapshot refactor 2021-02-17 07:32:56 UTC
oVirt gerrit 106267 0 None MERGED virt: preparatory patch for async snapshot 2021-02-17 07:32:56 UTC

Description Gordon Watson 2020-07-17 21:31:41 UTC
Description of problem:

There are two parts to this. The first is the customer's problem. In trying to reproduce it, I encountered a problem that was almost, but not exactly, the same.

In both cases, the end result was that the images and snapshots entries in the database for this VM were incorrect and misaligned. In the customer's case, for one of the three disks, the newly-created volume had actually been deleted (from the database and the storage domain), but it was still in use by the 'qemu-kvm' process. So, even though the new volume had been removed and the metadata for its parent had been reset to LEAF, 'qemu-kvm' was still writing to the new volume, which appeared in the disk's '/rhev/data-center' directory as a file called '.nfs000000000000198000000006' (the NFS client's "silly rename" of a file that was deleted while still open).
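
One way to spot a volume in this state is to scan '/proc' for open file descriptors that still point at deleted (or NFS silly-renamed) files under '/rhev/data-center'. A minimal sketch, run as root on the host:

    import glob
    import os

    # A file deleted while still open shows up as a 'path (deleted)'
    # symlink target; on NFS the client renames it to '.nfsXXXX...'
    # instead, which is exactly what was seen here.
    for fd in glob.glob('/proc/[0-9]*/fd/*'):
        try:
            target = os.readlink(fd)
        except OSError:
            continue  # process or fd went away while scanning
        if '/rhev/data-center' in target and (
                target.endswith(' (deleted)') or '/.nfs' in target):
            pid = fd.split('/')[2]
            print(pid, target)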



Version-Release number of selected component (if applicable):

- customer:

RHV 4.3.9
RHVH 4.3.9 host (7.8):
  vdsm-4.30.44-1.el7ev.x86_64


- me:

RHV 4.3.10
RHEL 7.7 host:
  vdsm-4.30.40-1.el7ev.x86_64


How reproducible:

The customer's problem is not reproducible, but mine is.


Steps to Reproduce (my issue):

1. I created a VM with 3 disks, just to match the customer's setup; my disks were file-based.

2. I started the VM.

3. On this host, I modified 'virt/vm.py' to add a 10-second sleep here:

            try:
                self.log.info("Taking a live snapshot (drives=%s, memory=%s)",
                              ', '.join(drive["name"] for drive in
                                        newDrives.values()),
                              memoryParams is not None)
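                # delay injected for reproduction: widens the window in
                # which the engine's connection to the host can be blocked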
                time.sleep(10)
                self._dom.snapshotCreateXML(snapxml, snapFlags)


4. I created a snapshot, including all 3 disks, with no memory volumes.

5. During the 10-second sleep, I blocked VDSM's port 54321:
   # iptables -I INPUT -p tcp --dport 54321 -j REJECT

6. When the snapshot had completed on the host ("Completed live snapshot"), I unblocked the port:
   # iptables -D INPUT -p tcp --dport 54321 -j REJECT

7. The result was:

   - 3 new active volumes were created
   - all 3 were in use by 'qemu-kvm'
   - all 3 were in the images table as the active volumes in the database
   - a new snapshots table entry was created, but then got removed, leaving only one snapshots table entry
   - as a result, the snapshot id in the 'vm_snapshot_id' field of each active volume did not exist in the snapshots table, and the 'Active VM' entry was in the 'vm_snapshot_id' field of each parent volume in the chain


Actual results:

One fewer snapshots table entry than images, and misaligned snapshot ids.


Expected results:

Images and snapshots to align correctly.


Additional info:
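
If needed, the misalignment can be confirmed directly in the engine database. A minimal sketch, assuming psycopg2 and local access to the 'engine' DB; the column names are the ones this report references ('images.vm_snapshot_id', 'snapshots.snapshot_id'), plus an assumed 'images.active' flag:

    import psycopg2

    # Active volumes whose vm_snapshot_id has no matching snapshots row,
    # i.e. the "one fewer snapshots table entry than images" state above.
    QUERY = """
    SELECT i.image_guid, i.vm_snapshot_id
    FROM images i
    LEFT JOIN snapshots s ON s.snapshot_id = i.vm_snapshot_id
    WHERE i.active AND s.snapshot_id IS NULL
    """

    with psycopg2.connect(dbname="engine", user="engine") as conn:
        with conn.cursor() as cur:
            cur.execute(QUERY)
            for image_guid, vm_snapshot_id in cur.fetchall():
                print("orphaned active volume %s -> snapshot %s"
                      % (image_guid, vm_snapshot_id))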

Comment 2 Gordon Watson 2020-07-17 21:45:24 UTC
Created attachment 1701589 [details]
Customer's vdsm log

Comment 9 Benny Zlotnik 2020-07-20 15:03:04 UTC
Gordon, which version are you using exactly? 
I suspect this issue was already fixed by bug 1842375; the fix has been included since ovirt-engine-4.3.10.1.

There is an additional related fix in bug 1838493, available in 4.3.11.

Any chance you can check these?
I haven't dug deep into the logs yet, but this scenario is very familiar and was worked on a lot in the late 4.3/early 4.4 releases.

Comment 16 Liran Rotenberg 2020-10-06 08:50:03 UTC
I didn't look into the log, only at the scenario itself.
In 4.4 we moved all live snapshots, with or without memory, to async.

The motivation was the memory volumes, which can take a long time to generate.
But in 4.3 the engine snapshot command waits for the VDSM response. In the case of this bug we wouldn't get one, which probably makes the engine fail the command and remove the DB entry even though the snapshot actually succeeded on the VDSM and libvirt side, causing the VM to use the new volumes and make the storage changes.

In 4.4, both the engine and VDSM changed to be async: the engine sends the command to VDSM, and if the connection is lost the job will still be queried by the engine.
The job status stays in VDSM for 10 minutes after the job is done (success or failure). That means that if the engine's connection to VDSM is restored less than 10 minutes after the job finished in VDSM, we will get the right status and won't see any problem. If it takes more than 10 minutes, the engine will fail the command (which can cause this bug, but in my opinion 10 minutes is enough; it can be changed in the VDSM jobs configuration, `autodelete_delay`).

Therefore, I think 4.4 does resolve this bug, and the only thing I would ask is for QE to try to reproduce it in 4.4.
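
A minimal sketch of the engine-side polling model described above; the names here ('get_job_status', the 'unknown' status value) are illustrative stand-ins, not the actual engine/VDSM API:

    import time

    POLL_INTERVAL = 10  # seconds between status queries
    # VDSM keeps a finished job's status for autodelete_delay seconds
    # (10 minutes by default, as noted above) before removing it.

    def wait_for_snapshot_job(get_job_status, job_id):
        """Keep querying VDSM for the async snapshot job.

        A lost connection is not fatal: the engine simply retries until
        the job reports 'done' or 'failed'. Only when the job's status
        has already been auto-deleted (modelled here as 'unknown') must
        the engine fail the command -- the over-10-minutes case.
        """
        while True:
            try:
                status = get_job_status(job_id)
            except ConnectionError:
                time.sleep(POLL_INTERVAL)  # host unreachable: retry later
                continue
            if status == 'running':
                time.sleep(POLL_INTERVAL)
                continue
            if status == 'unknown':
                raise RuntimeError("job status expired after "
                                   "autodelete_delay; the command fails")
            return status  # 'done' or 'failed'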

Comment 17 Tal Nisan 2021-02-04 10:12:35 UTC
Given comment #16, moving to MODIFIED

Comment 28 Evelina Shames 2021-03-15 13:27:58 UTC
(In reply to Gordon Watson from comment #0)
> 
> 1. I created a VM with 3 disks, but that was just my test case to match what
> the customer had. My disks were file-based.
> 
> 2. I started the VM.
> 
> 3. On this host, I modified 'virt/vm.py' to add a 10 second sleep here;
> 
>             try:
>                 self.log.info("Taking a live snapshot (drives=%s, memory=%s)",
>                               ', '.join(drive["name"] for drive in
>                                         newDrives.values()),
>                               memoryParams is not None)
>                 time.sleep(10)
>                 self._dom.snapshotCreateXML(snapxml, snapFlags)
> 
> 
> 4. I created a snapshot, including all 3 disks, with no memory volumes.
> 
> 5. During the 10 second sleep, I blocked port 54321;
>    # iptables -I INPUT -p tcp --dport 54321 -j REJECT
> 
> 6. When the snapshot had completed on the host ("Completed live snapshot"),
> I unblocked the port;
>    # iptables -D INPUT -p tcp --dport 54321 -j REJECT
> 
> 7. The result was;
> 
>    - 3 new active volumes were created
>    - all 3 were in use by 'qemu-kvm'
>    - all 3 were in the images table as the active volumes in the database
>    - a new snapshots table entry was created, but then got removed,
> resulting in only one snapshot table entry
>    - such that the snapshot id in 'vm_snapshot_id' field for each active
> volume did not exist, and the Active VM entry was in the 'vm_snapshot_id'
> field of each parent volume in the chain
> 

Verified with the above steps on an environment with rhv-4.4.5-9.
The issue didn't reproduce:
3 new active volumes were created, and a new snapshots table entry was created and did not get removed.
The VM was up and accessible.
After a reboot, the VM powered on and worked as expected.

Moving to 'VERIFIED'.

Comment 33 errata-xmlrpc 2021-04-14 11:39:56 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: RHV Manager (ovirt-engine) 4.4.z [ovirt-4.4.5] security, bug fix, enhancement), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:1169

