Bug 1858420
Summary: | Snapshot creation on host that engine then loses connection to results in missing snapshots table entry | |
---|---|---|---
Product: | Red Hat Enterprise Virtualization Manager | Reporter: | Gordon Watson <gwatson>
Component: | ovirt-engine | Assignee: | Benny Zlotnik <bzlotnik>
Status: | CLOSED ERRATA | QA Contact: | Evelina Shames <eshames>
Severity: | high | Docs Contact: |
Priority: | unspecified | |
Version: | 4.3.9 | CC: | achareka, aefrat, bzlotnik, dfodor, eshenitz, gveitmic, lrotenbe, mavital, sfishbai, tnisan
Target Milestone: | ovirt-4.4.5 | |
Target Release: | --- | |
Hardware: | Unspecified | |
OS: | Unspecified | |
Whiteboard: | | |
Fixed In Version: | | Doc Type: | If docs needed, set a value
Doc Text: | | Story Points: | ---
Clone Of: | | Environment: |
Last Closed: | 2021-04-14 11:39:56 UTC | Type: | Bug
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | Storage | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Description
Gordon Watson
2020-07-17 21:31:41 UTC
Created attachment 1701589
Customer's vdsm log

Gordon, which version are you using exactly? I suspect this issue should have been fixed by bug 1842375, which has been included since ovirt-engine-4.3.10.1. There is an additional related fix in bug 1838493, available in 4.3.11. Any chance you can check these? I haven't dug deep into the logs yet, but this scenario is very familiar and has been worked on a lot in late 4.3/early 4.4 releases.

I didn't look into the log, only at the scenario itself. In 4.4 we moved all live snapshots, with or without memory, to async; the motivation was the memory volumes, which can take a long time to generate. In 4.3, however, the engine snapshot command waits for the VDSM response. In the case of this bug we never get one, which likely makes the engine fail the command and remove the DB entry even though the snapshot actually succeeded on the VDSM/libvirt side, so the VM switches to the new volumes and the storage changes are made.

In 4.4 both the engine and VDSM are async: the engine sends the command to VDSM, and if the connection is lost the job is still queried by the engine. The job status stays in VDSM for 10 minutes after the job is done (success or failure). That means that if the engine's connection to VDSM is restored within 10 minutes of the job finishing, we get the right status and won't see any problem. If it takes longer than 10 minutes, the engine fails the command, which could still cause this bug, but in my opinion 10 minutes is enough, and it is configurable via the VDSM `jobs` configuration option `autodelete_delay` (see the drop-in sketch at the end of this report). Therefore, I think 4.4 does resolve this bug, and the only thing I would ask is for QE to try to reproduce it in 4.4.

Given comment #16, moving to MODIFIED.

(In reply to Gordon Watson from comment #0)
> 1. I created a VM with 3 disks, but that was just my test case to match what
> the customer had. My disks were file-based.
>
> 2. I started the VM.
>
> 3. On this host, I modified 'virt/vm.py' to add a 10 second sleep here;
>
>     try:
>         self.log.info("Taking a live snapshot (drives=%s, memory=%s)",
>                       ', '.join(drive["name"] for drive in newDrives.values()),
>                       memoryParams is not None)
>         time.sleep(10)
>         self._dom.snapshotCreateXML(snapxml, snapFlags)
>
> 4. I created a snapshot, including all 3 disks, with no memory volumes.
>
> 5. During the 10 second sleep, I blocked port 54321;
>     # iptables -I INPUT -p tcp --dport 54321 -j REJECT
>
> 6. When the snapshot had completed on the host ("Completed live snapshot"),
> I unblocked the port;
>     # iptables -D INPUT -p tcp --dport 54321 -j REJECT
>
> 7. The result was;
>
>     - 3 new active volumes were created
>     - all 3 were in use by 'qemu-kvm'
>     - all 3 were in the images table as the active volumes in the database
>     - a new snapshots table entry was created, but then got removed,
>       resulting in only one snapshots table entry
>     - such that the snapshot id in the 'vm_snapshot_id' field for each active
>       volume did not exist, and the Active VM entry was in the 'vm_snapshot_id'
>       field of each parent volume in the chain

Verified with the above steps on an env with rhv-4.4.5-9. The issue didn't reproduce: 3 new active volumes were created, and a new snapshots table entry was created and didn't get removed. The VM was up and accessible. After a reboot, the VM powered on and worked as expected. Moving to 'VERIFIED'.
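For anyone reproducing or verifying this, here is a minimal sketch of how the database state described above could be inspected on the RHV Manager. The psql invocation, the default database name 'engine', and the exact column names are assumptions based on the tables mentioned in comment #0 (snapshots, images, vm_snapshot_id), not taken from this bug:

    # su - postgres -c "psql engine"

    -- Snapshot rows the engine still has for the affected VM:
    SELECT snapshot_id, snapshot_type, description
      FROM snapshots
     WHERE vm_id = '<affected-vm-id>';

    -- Volumes whose 'vm_snapshot_id' no longer points at an existing
    -- snapshots row (the missing-entry symptom described in comment #0):
    SELECT i.image_guid, i.vm_snapshot_id, i.active
      FROM images i
      LEFT JOIN snapshots s ON s.snapshot_id = i.vm_snapshot_id
     WHERE i.vm_snapshot_id IS NOT NULL
       AND s.snapshot_id IS NULL;

On a healthy setup the second query should return no rows for the affected VM's volumes; in the failure scenario from comment #0 it would list the new active volumes whose snapshot entry was removed.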
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: RHV Manager (ovirt-engine) 4.4.z [ovirt-4.4.5] security, bug fix, enhancement), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:1169
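As a closing note on the 10-minute window discussed above: a minimal sketch of how the `autodelete_delay` tunable could be raised on a host, assuming the option lives in a [jobs] section, takes a value in seconds, and is picked up from a vdsm.conf.d drop-in (the drop-in file name below is made up for illustration):

    # cat /etc/vdsm/vdsm.conf.d/99-jobs-autodelete.conf
    [jobs]
    # Assumed tunable: keep finished job status for 30 minutes instead of the
    # default 10, giving the engine more time to re-query the snapshot job
    # after a connection loss.
    autodelete_delay = 1800

    # systemctl restart vdsmd

Restarting vdsmd on a host running VMs should be done with care; this only illustrates the tunable mentioned in the comments above and is not a recommended change.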