Bug 1774230 - Block commit (live merge) completed per libvirt, but VDSM still saw cur < end
Summary: Block commit (live merge) completed per libvirt, but VDSM still saw cur < end
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: vdsm
Version: 4.3.5
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ovirt-4.4.0
Assignee: Benny Zlotnik
QA Contact: Evelina Shames
URL:
Whiteboard:
Depends On: 1791458 1791886
Blocks:
TreeView+ depends on / blocked
 
Reported: 2019-11-19 20:26 UTC by Gordon Watson
Modified: 2020-08-04 13:27 UTC (History)
12 users

Fixed In Version: rhv-4.4.0-31
Doc Type: No Doc Update
Doc Text:
Clone Of:
: 1791458 (view as bug list)
Environment:
Last Closed: 2020-08-04 13:27:22 UTC
oVirt Team: Storage
Target Upstream Version:
lsvaty: testing_plan_complete-




Links
System ID Priority Status Summary Last Updated
Red Hat Knowledge Base (Solution) 4603381 None None None 2019-11-21 19:18:16 UTC
Red Hat Product Errata RHEA-2020:3246 None None None 2020-08-04 13:27:50 UTC
oVirt gerrit 106306 master ABANDONED vm: clear post-copy phase after migration 2021-01-14 12:13:37 UTC
oVirt gerrit 107741 master ABANDONED spec: bump libvirt dependency 2021-01-14 12:13:37 UTC

Description Gordon Watson 2019-11-19 20:26:00 UTC
Description of problem:

A live merge was performed as the last step of an LSM (live storage migration). Libvirt indicated that the block job (commit) had completed: 'virsh blockjob' showed it at 100% and the XML contained "<mirror type='block' job='active-commit' ready='yes'>".

However, VDSM (in getAllVmStats) still showed 'cur' < 'end':

"end": "81592320",
"cur": "196608",


When VDSM was restarted, the merge completed.
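For context, a block job is considered finished once the reported cursor reaches the end offset. A minimal sketch (a hypothetical helper, not VDSM's actual code) using the counters from this report shows why the stale stats still looked in-progress:

```python
# Minimal sketch (not VDSM's actual implementation): judging block-job
# completion from the 'cur'/'end' counters as reported in getAllVmStats.

def is_job_complete(cur: int, end: int) -> bool:
    """A job is finished once the cursor reaches the end offset."""
    return end > 0 and cur >= end

# The stale counters VDSM kept reporting while libvirt already showed 100%:
print(is_job_complete(cur=196608, end=81592320))  # False: still looks in-progress
```

With the counters never refreshed, any completion check based on them can never fire, which matches the merge only finishing after the VDSM restart forced a re-read.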


Version-Release number of selected component (if applicable):

RHV 4.3.5
RHEL 7.7 host;
  libvirt-4.5.0-23.el7_7.1.x86_64            	
  qemu-kvm-rhev-2.12.0-33.el7_7.4.x86_64            
  vdsm-4.30.33-1.el7ev.x86_64 


How reproducible:

Not reproducible.


Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

Comment 5 Roman Hodain 2019-11-20 08:06:58 UTC
After the vdsm restart, vdsm picked the job up and finished it correctly, starting with "Starting cleanup thread for job".

Comment 6 Roman Hodain 2019-11-20 12:49:36 UTC
The issue happened on two different hosts.

The one mentioned here is

vdsm-client_Host_getAllVmStats:

>         "guestName": "ocp3-node-5.gsslab.brq2.redhat.com",                       
>         "elapsedTime": "522237",                                                 
>         "vmJobs": {                                                              
>             "1b06e423-bf2a-40e8-9f63-e086e49899b2": {                            
>                 "end": "81592320",                                               
>                 "cur": "196608",                                                 
>                 "imgUUID": "e4417b6a-670b-40b1-90c7-e0d68d628691",               
>                 "blockJobType": "commit",                                        
>                 "bandwidth": 0,                                                  
>                 "id": "1b06e423-bf2a-40e8-9f63-e086e49899b2",                    
>                 "jobType": "block"                                               
>             }                                                                    

The second one is:

>         "guestName": "ocp3-node-4.gsslab.brq2.redhat.com",                       
>         "elapsedTime": "514992",                                                 
>         "vmJobs": {                                                              
>             "b42414e2-58c8-49af-8d16-e5493cadd433": {                            
>                 "end": "60948480",                                               
>                 "cur": "0",                                                      
>                 "imgUUID": "787b2c4f-f8d6-4902-9278-8f920bbd8c62",               
>                 "blockJobType": "commit",                                        
>                 "bandwidth": 0,                                                  
>                 "id": "b42414e2-58c8-49af-8d16-e5493cadd433",                    
>                 "jobType": "block"                                               
>             }                                                                    

So it seems that the liveInfo was collected just once, when the merge operation was triggered, but was never checked again. I just cannot figure out why the monitoring was not running.
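The behaviour described above can be sketched as a monitor whose periodic refresh is gated on an "enabled" flag (class and names here are illustrative, not VDSM internals): if the flag is never set, the cached liveInfo stays at whatever was collected when the merge was triggered.

```python
# Illustrative sketch (hypothetical names, not VDSM's real monitor): a job
# monitor whose refresh is gated on an "enabled" flag. If the flag stays off,
# the cached liveInfo never moves past the values from trigger time.

class JobMonitor:
    def __init__(self, initial_info):
        self.enabled = False            # suppose something left this off
        self.live_info = initial_info   # collected once, at trigger time

    def refresh(self, query_fn):
        if not self.enabled:            # gate: no polling while disabled
            return
        self.live_info = query_fn()

# Values from the second host in this comment:
monitor = JobMonitor({"cur": 0, "end": 60948480})
monitor.refresh(lambda: {"cur": 60948480, "end": 60948480})
print(monitor.live_info["cur"])  # 0: the refresh was skipped
```

This matches the symptom exactly: 'cur' frozen at its initial value while libvirt itself reports the job done.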

Comment 7 Benny Zlotnik 2019-12-25 09:27:29 UTC
Can you attach engine logs as well?

Comment 12 Benny Zlotnik 2020-01-14 12:15:59 UTC
I tried to pursue the guest agent poller angle, but it led nowhere, since we have warnings when a worker is blocked.

The good news is that I found the root cause. I downloaded the entire logcollector archive and found out the VM was migrated a couple of days before the live merge.
The policy used in the migration is post-copy, and the issue, it seems, is that the monitor is not re-enabled properly after a post-copy migration finishes.

This can be reproduced consistently using the following steps:
1. Create a VM (with an OS), and set the migration policy to post-copy
2. Create a snapshot
3. In the VM run `stress --cpu 1 --vm 4 --vm-bytes 128M --timeout 300s` - I am not entirely sure this is required, but it seems post-copy won't run if there isn't much going on in the memory
4. Migrate the VM
5. Remove the snapshot

Comment 13 Benny Zlotnik 2020-01-15 09:02:25 UTC
Hi Jiri,

After investigating this bug and discussing the proposed patch with Milan, there is something unclear.
It seems that in a post-copy migration both the source and the destination receive VIR_DOMAIN_EVENT_SUSPENDED_POSTCOPY,
and it probably has something to do with the change from [1], as I see the following logs on the destination:
2020-01-15 08:17:54.802+0000: 17327: debug : qemuProcessHandleMigrationStatus:1647 : Migration of domain 0x7fae6801c310 vmski changed state to post-copy-active
2020-01-15 08:17:54.802+0000: 17327: debug : qemuProcessHandleMigrationStatus:1663 : Correcting paused state reason for domain vmski to post-copy <--- I assume this emits the VIR_DOMAIN_EVENT_SUSPENDED_POSTCOPY event

2020-01-15 08:17:55.045+0000: 17327: debug : qemuProcessHandleResume:719 : Transitioned guest vmski into running state, reason 'post-copy', event detail 3

Is this the correct behaviour? Should the destination receive VIR_DOMAIN_EVENT_SUSPENDED_POSTCOPY as well?




[1] https://bugzilla.redhat.com/show_bug.cgi?id=1647365

Comment 14 Jiri Denemark 2020-01-15 10:06:46 UTC
(In reply to Benny Zlotnik from comment #13) Your investigation seems to be
correct. The domain is started as paused on the destination (it should get a
"suspended" event with "migration" reason when migration starts). Once
migration switches to post-copy, the code in qemuProcessHandleMigrationStatus
will update the reason to "post-copy" and emit a new "suspended" event just a
moment before the domain is resumed, which should only happen on the source.

I'll clone this bug to libvirt and fix it.
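The failure mode described in the two comments above can be sketched as a small event handler (event names and handler logic are illustrative, not libvirt's or VDSM's real code): the destination gets a spurious suspended/post-copy event just before the resume, and a handler that never clears the post-copy phase afterwards leaves monitoring disabled.

```python
# Hedged sketch of the bug pattern (illustrative names, not real libvirt/VDSM
# code): a spurious SUSPENDED/POSTCOPY event on the destination sets a
# post-copy flag that the subsequent RESUMED event never clears, so the
# periodic monitor stays off.

SUSPENDED_POSTCOPY = "suspended/postcopy"   # illustrative event labels
RESUMED_POSTCOPY = "resumed/postcopy"

class VmState:
    def __init__(self):
        self.post_copy = False
        self.monitor_enabled = True

    def on_event(self, event):
        if event == SUSPENDED_POSTCOPY:
            self.post_copy = True        # bookkeeping meant for the source
            self.monitor_enabled = False
        elif event == RESUMED_POSTCOPY:
            # bug pattern: the post-copy phase is not cleared on resume,
            # so monitoring is never re-enabled on the destination
            if not self.post_copy:
                self.monitor_enabled = True

vm = VmState()
for ev in (SUSPENDED_POSTCOPY, RESUMED_POSTCOPY):  # destination's event order
    vm.on_event(ev)
print(vm.monitor_enabled)  # False: monitoring never comes back
```

This is consistent with the abandoned VDSM patch title in the Links section, "vm: clear post-copy phase after migration": clearing the flag (or not emitting the spurious event, as in the libvirt clone) lets the resume path re-enable monitoring.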

Comment 15 Michal Skrivanek 2020-01-15 12:17:13 UTC
(In reply to Benny Zlotnik from comment #12)

> The policy used in the migration is post-copy, and the issue, it seems, is the monitor not being enabled properly after a post-copy migrate is finished.


I think the workaround for now should be just to use a pre-copy policy.

Comment 23 Evelina Shames 2020-04-20 07:08:44 UTC
Verified on libvirt-6.0.0-17.module+el8.2.0+6257+0d066c28.x86_64
with these steps:

(In reply to Benny Zlotnik from comment #12)
> This can be reproduced consistently using the following steps:
> 1. Create a VM (with an OS), and set the migration policy to post-copy
> 2. Create a snapshot
> 3. In the VM run `stress --cpu 1 --vm 4 --vm-bytes 128M --timeout 300s` - I
> am not entirely sure this is required, but it seems post-copy won't run if
> there isn't much going on in the memory
> 4. Migrate the VM
> 5. Remove the snapshot

Comment 31 errata-xmlrpc 2020-08-04 13:27:22 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (RHV RHEL Host (ovirt-host) 4.4), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2020:3246

