Bug 1654891
| Summary: | Engine sent duplicate SnapshotVDSCommand, causing data corruption | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Virtualization Manager | Reporter: | Germano Veit Michel <gveitmic> |
| Component: | ovirt-engine | Assignee: | Eyal Shenitzky <eshenitz> |
| Status: | CLOSED ERRATA | QA Contact: | Elad <ebenahar> |
| Severity: | urgent | Docs Contact: | |
| Priority: | urgent | | |
| Version: | 4.2.7 | CC: | audgiri, bcholler, bzlotnik, dfediuck, ebenahar, frolland, mkalinin, mperina, mtessun, mwest, nashok, nsoffer, rbarry, Rhev-m-bugs, rnori, tnisan, usurse |
| Target Milestone: | ovirt-4.3.0 | Keywords: | ZStream |
| Target Release: | --- | | |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | ovirt-engine-4.3.0_rc | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | | |
| : | 1658589 (view as bug list) | Environment: | |
| Last Closed: | 2019-05-08 12:39:01 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | Storage | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | 1640225 | | |
| Bug Blocks: | 1658589 | | |
Description by Germano Veit Michel, 2018-11-29 23:33:21 UTC
There is a single POST to the snapshots collection of this VM via the API, just in case...

```
[26/Nov/2018:03:35:15 +1300] "POST /ovirt-engine/api/vms/67711673-32a1-4803-873c-bfd029cfd1ca/snapshots HTTP/1.1" 201 13478
```

No progress, but just to document:

1) I tried to send a POST to /snapshots with the same disk duplicated as below; the engine fails with an Internal Server Error. It does not trigger this bug.

```xml
<snapshot>
  <description>test2</description>
  <persist_memorystate>false</persist_memorystate>
  <disk_attachments>
    <disk_attachment>
      <disk id="0764dcb3-8b21-4cda-9899-5d18ecefc67f"/>
    </disk_attachment>
    <disk_attachment>
      <disk id="0764dcb3-8b21-4cda-9899-5d18ecefc67f"/>
    </disk_attachment>
  </disk_attachments>
</snapshot>
```

So I'm not sure this could have been triggered by an incorrect API request.

2) Also, the SnapshotVDS command seems to depend on what is returned here, and it doesn't look duplicated:

```sql
engine=> select count(*) from all_disks_for_vms where vm_id = '67711673-32a1-4803-873c-bfd029cfd1ca';
 count
-------
     1
```

Following an offline conversation, this issue was reported here - bug 1640225 (upstream).

Germano, did you see any disconnection from vdsm or any other error since the first snapshot request was sent? This may also be an infra issue, if infra retries the same command twice. Martin, do we have protection from sending the same command twice? On the storage side we may also need protection from receiving a response more than once: if we sent a command and received a response, the command should change its internal state so another response for the same command is dropped.

(In reply to Nir Soffer from comment #8)
> Germano, did you see any disconnection from vdsm or any other error since
> the first snapshot request was sent?

Hi Nir,

No, not in these logs, and also not in BZ1640225, which is a very similar issue. The difference between this and BZ1640225 is that the latter occurs during a LSM, so the parent of the CreateSnapshotForVm command is LiveMigrateDisk.
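Nir's suggestion above (the command should change its internal state when the first response arrives, so a resent response is dropped) can be sketched as follows. This is a minimal Python illustration of the idempotency idea, not the ovirt-engine implementation; the class and method names are hypothetical.

```python
import threading

class SnapshotCommand:
    """Toy model of a command that must act on a VDS response at most once."""

    def __init__(self):
        self._lock = threading.Lock()
        self._state = "WAITING"
        self.executions = 0  # how many times the next operation actually ran

    def on_response(self, response):
        # Atomically consume the response: only the first caller flips the
        # state from WAITING to SUCCEEDED and triggers the next operation.
        with self._lock:
            if self._state != "WAITING":
                return False  # duplicate/resent response: dropped
            self._state = "SUCCEEDED"
        self.executions += 1  # "perform next operation" runs exactly once
        return True

cmd = SnapshotCommand()
first = cmd.on_response("ok")
second = cmd.on_response("ok")  # a resent response for the same command
```

With this guard, a duplicated response cannot cause a second SnapshotVDSCommand to be sent, regardless of why the duplicate arrived.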
I haven't dug too deep yet, but given that two different threads executed each of the SnapshotVDS commands, the only explanation that comes to mind is that both of them ran CreateSnapshotForVmCommand#performNextOperation. This might be due to a race condition in the polling stage, causing two scheduling threads to see the command as "SUCCEEDED" at the same time and execute its "next operation" concurrently. It seems somewhat strange given the 6-second delay between the two snapshot attempts, but that could be due to the system being under load.

WARN: Bug status wasn't changed from MODIFIED to ON_QA due to the following reason: [Found non-acked flags: '{'rhevm-4.3-ga': '?'}'] For more info please contact: rhv-devops

Moving to VERIFIED based on the latest 4.3 regression cycle results. Used:

```
ovirt-engine-4.3.0.4-0.1.el7.noarch
vdsm-4.30.8-2.el7ev.x86_64
libvirt-4.5.0-10.el7_6.4.x86_64
qemu-img-rhev-2.12.0-21.el7.x86_64
```

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2019:1085
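The race hypothesized above (two polling threads both observing "SUCCEEDED" and each running the next operation) can be sketched as follows. This is a minimal Python model of a compare-and-set guard on the operation index, not ovirt-engine code; all names here are hypothetical.

```python
import threading

class CallbackPoller:
    """Toy model: two poller threads both see a command as SUCCEEDED and
    race to run performNextOperation for the same step."""

    def __init__(self):
        self._lock = threading.Lock()
        self._next_op = 0        # index of the next operation to execute
        self.snapshots_sent = 0  # how many SnapshotVDS-like commands went out

    def perform_next_operation(self, expected_op):
        # Compare-and-set on the step index: only the thread that claims the
        # step executes it; the losing thread becomes a harmless no-op.
        with self._lock:
            if self._next_op != expected_op:
                return False
            self._next_op += 1
        self.snapshots_sent += 1  # would send the snapshot command exactly once
        return True

poller = CallbackPoller()
results = []
threads = [
    threading.Thread(target=lambda: results.append(poller.perform_next_operation(0)))
    for _ in range(2)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Without the claim step, both threads would pass the status check and each would send its own SnapshotVDSCommand, which matches the duplicate commands seen in this bug.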