Bug 1774230
| Summary: | Block commit (live merge) completed per libvirt, but VDSM still saw cur < end | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Virtualization Manager | Reporter: | Gordon Watson <gwatson> |
| Component: | vdsm | Assignee: | Benny Zlotnik <bzlotnik> |
| Status: | CLOSED ERRATA | QA Contact: | Evelina Shames <eshames> |
| Severity: | high | Docs Contact: | |
| Priority: | high | | |
| Version: | 4.3.5 | CC: | aefrat, bzlotnik, dfodor, jdenemar, lsurette, michal.skrivanek, mzamazal, pelauter, rhodain, srevivo, tnisan, ycui |
| Target Milestone: | ovirt-4.4.0 | Flags: | lsvaty: testing_plan_complete- |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | rhv-4.4.0-31 | Doc Type: | No Doc Update |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | | |
| : | 1791458 (view as bug list) | Environment: | |
| Last Closed: | 2020-08-04 13:27:22 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | Storage | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | 1791458, 1791886 | | |
| Bug Blocks: | | | |
Description
Gordon Watson
2019-11-19 20:26:00 UTC
After the VDSM restart, VDSM picked the job up and finished it correctly, starting with "Starting cleanup thread for job". The issue happened on two different hosts. The one mentioned here, from `vdsm-client Host getAllVmStats`, is:

```
"guestName": "ocp3-node-5.gsslab.brq2.redhat.com",
"elapsedTime": "522237",
"vmJobs": {
    "1b06e423-bf2a-40e8-9f63-e086e49899b2": {
        "end": "81592320",
        "cur": "196608",
        "imgUUID": "e4417b6a-670b-40b1-90c7-e0d68d628691",
        "blockJobType": "commit",
        "bandwidth": 0,
        "id": "1b06e423-bf2a-40e8-9f63-e086e49899b2",
        "jobType": "block"
    }
```

The second one is:

```
"guestName": "ocp3-node-4.gsslab.brq2.redhat.com",
"elapsedTime": "514992",
"vmJobs": {
    "b42414e2-58c8-49af-8d16-e5493cadd433": {
        "end": "60948480",
        "cur": "0",
        "imgUUID": "787b2c4f-f8d6-4902-9278-8f920bbd8c62",
        "blockJobType": "commit",
        "bandwidth": 0,
        "id": "b42414e2-58c8-49af-8d16-e5493cadd433",
        "jobType": "block"
    }
```

So it seems that the liveInfo was collected just once, when the merge operation was triggered, but was never checked again. I just cannot figure out why the monitoring was not running.

Can you attach engine logs as well?

I tried to pursue the guest agent poller angle, but it led nowhere, since we get warnings when a worker is blocked. The good news is that I found the root cause. I downloaded the entire logcollector archive and found out that the VM had been migrated a couple of days before the live merge. The policy used in the migration is post-copy, and the issue, it seems, is that the monitor is not re-enabled properly after a post-copy migration finishes. This can be reproduced consistently using the following steps:

1. Create a VM (with an OS), and set the migration policy to post-copy
2. Create a snapshot
3. In the VM run `stress --cpu 1 --vm 4 --vm-bytes 128M --timeout 300s` - I am not entirely sure this is required, but it seems post-copy won't kick in if there isn't much going on in memory
4. Migrate the VM
5. Remove the snapshot

Hi Jiri,

After investigating this bug and discussing the proposed patch with Milan, there is something unclear. It seems that in a post-copy migration both the source and the destination receive VIR_DOMAIN_EVENT_SUSPENDED_POSTCOPY, and it probably has something to do with the change from [1], as I see the following logs on the destination:

```
2020-01-15 08:17:54.802+0000: 17327: debug : qemuProcessHandleMigrationStatus:1647 : Migration of domain 0x7fae6801c310 vmski changed state to post copy-active
2020-01-15 08:17:54.802+0000: 17327: debug : qemuProcessHandleMigrationStatus:1663 : Correcting paused state reason for domain vmski to post-copy   <--- I assume this emits the VIR_DOMAIN_EVENT_SUSPENDED_POSTCOPY event
2020-01-15 08:17:55.045+0000: 17327: debug : qemuProcessHandleResume:719 : Transitioned guest vmski into running state, reason 'post-copy', event detail 3
```

Is this the correct behaviour? Should the destination receive VIR_DOMAIN_EVENT_SUSPENDED_POSTCOPY as well?

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1647365

(In reply to Benny Zlotnik from comment #13)

Your investigation seems to be correct. The domain is started as paused on the destination (it should get a "suspended" event with a "migration" reason when migration starts). Once migration switches to post-copy, the code in qemuProcessHandleMigrationStatus will update the reason to "post-copy" and emit a new "suspended" event just a moment before the domain is resumed, which should only happen on the source. I'll clone this bug to libvirt and fix it.

(In reply to Benny Zlotnik from comment #12)
> The policy used in the migration is post-copy, and the issue, it seems, is the monitor not being enabled properly after a post-copy migrate is finished.
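To make the analysis in comment #13 concrete, here is a purely illustrative Python model of the two event sequences described there. The `MonitorModel` class and its logic are hypothetical simplifications, not VDSM's actual code; only the event names and the sequences mirror the bug report. It shows how a handler that assumes VIR_DOMAIN_EVENT_SUSPENDED_POSTCOPY only fires on the migration source can leave monitoring disabled on a destination that received the event spuriously:

```python
# Hypothetical simplification of a VM monitor reacting to libvirt lifecycle
# events. Event names mirror libvirt's; the handler logic is illustrative.

SUSPENDED_MIGRATION = ("suspended", "migration")  # destination, migration start
SUSPENDED_POSTCOPY = ("suspended", "postcopy")    # should be source-only
RESUMED_POSTCOPY = ("resumed", "postcopy")        # destination takes over


class MonitorModel:
    def __init__(self):
        self.monitoring = True   # periodic liveInfo/block-job polling
        self.postcopy = False

    def handle(self, event):
        kind, detail = event
        if kind == "suspended":
            self.monitoring = False
            if detail == "postcopy":
                # Assumed to mean: *this* host is the source and the guest
                # now runs elsewhere, so a later resume is not ours.
                self.postcopy = True
        elif kind == "resumed":
            if not self.postcopy:
                self.monitoring = True
            # else: resume attributed to the post-copy hand-over;
            # monitoring stays off.


# Destination before the libvirt fix: a spurious SUSPENDED_POSTCOPY arrives
# just before the resume, so the resume is misattributed and monitoring
# stays off -- block-job progress (cur/end) is never polled again.
buggy = MonitorModel()
for ev in (SUSPENDED_MIGRATION, SUSPENDED_POSTCOPY, RESUMED_POSTCOPY):
    buggy.handle(ev)
print(buggy.monitoring)  # False

# Destination after the fix: no spurious post-copy suspend event.
fixed = MonitorModel()
for ev in (SUSPENDED_MIGRATION, RESUMED_POSTCOPY):
    fixed.handle(ev)
print(fixed.monitoring)  # True
```

This matches the observed symptom: the merge completed on the libvirt side, but VDSM's job statistics were frozen at the values sampled when the merge started.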
I think the workaround for now should be just to use a pre-copy policy.

Verified on libvirt-6.0.0-17.module+el8.2.0+6257+0d066c28.x86_64 with these steps:

(In reply to Benny Zlotnik from comment #12)
> This can be reproduced consistently using the following steps:
> 1. Create a VM (with an OS), and set the migration policy to post-copy
> 2. Create a snapshot
> 3. In the VM run `stress --cpu 1 --vm 4 --vm-bytes 128M --timeout 300s` - I am not entirely sure this is required, but it seems post-copy won't run if there isn't much going on in the memory
> 4. Migrate the VM
> 5. Remove the snapshot

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (RHV RHEL Host (ovirt-host) 4.4), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2020:3246
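For reference, the stuck state visible in the getAllVmStats output quoted in the description can be recognized mechanically. Below is a minimal sketch; the `stuck_jobs` helper is hypothetical (not VDSM code), but the field names and sample values are taken verbatim from the `vmJobs` entries in this report, where `cur` and `end` are reported as strings:

```python
# Hypothetical helper that flags block jobs still reporting cur < end --
# the symptom in this bug, where libvirt had already completed the commit
# but VDSM's cached statistics never advanced.

def stuck_jobs(vm_jobs):
    """Return IDs of block jobs whose progress has not reached the end.

    vm_jobs: the "vmJobs" mapping from getAllVmStats; "cur" and "end"
    are strings in that output, hence the int() conversions.
    """
    return [
        job_id
        for job_id, job in vm_jobs.items()
        if int(job["cur"]) < int(job["end"])
    ]


# The second host from the description: cur stayed at 0 forever.
vm_jobs = {
    "b42414e2-58c8-49af-8d16-e5493cadd433": {
        "end": "60948480",
        "cur": "0",
        "blockJobType": "commit",
        "jobType": "block",
    }
}
print(stuck_jobs(vm_jobs))  # ['b42414e2-58c8-49af-8d16-e5493cadd433']
```

A healthy, completed job (cur equal to end) would simply drop out of the returned list.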