Bug 1774230
| Summary: | Block commit (live merge) completed per libvirt, but VDSM still saw cur < end | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Virtualization Manager | Reporter: | Gordon Watson <gwatson> |
| Component: | vdsm | Assignee: | Benny Zlotnik <bzlotnik> |
| Status: | CLOSED ERRATA | QA Contact: | Evelina Shames <eshames> |
| Severity: | high | Docs Contact: | |
| Priority: | high | | |
| Version: | 4.3.5 | CC: | aefrat, bzlotnik, dfodor, jdenemar, lsurette, michal.skrivanek, mzamazal, pelauter, rhodain, srevivo, tnisan, ycui |
| Target Milestone: | ovirt-4.4.0 | Flags: | lsvaty: testing_plan_complete- |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | rhv-4.4.0-31 | Doc Type: | No Doc Update |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | | |
| : | 1791458 (view as bug list) | Environment: | |
| Last Closed: | 2020-08-04 13:27:22 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | Storage | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | 1791458, 1791886 | | |
| Bug Blocks: | | | |
Description
Gordon Watson
2019-11-19 20:26:00 UTC
After the VDSM restart, VDSM picked the job up and finished it correctly, starting with "Starting cleanup thread for job". The issue happened on two different hosts. The one mentioned here, from `vdsm-client Host getAllVmStats`, is:

```
"guestName": "ocp3-node-5.gsslab.brq2.redhat.com",
"elapsedTime": "522237",
"vmJobs": {
    "1b06e423-bf2a-40e8-9f63-e086e49899b2": {
        "end": "81592320",
        "cur": "196608",
        "imgUUID": "e4417b6a-670b-40b1-90c7-e0d68d628691",
        "blockJobType": "commit",
        "bandwidth": 0,
        "id": "1b06e423-bf2a-40e8-9f63-e086e49899b2",
        "jobType": "block"
    }
```

The second one is:

```
"guestName": "ocp3-node-4.gsslab.brq2.redhat.com",
"elapsedTime": "514992",
"vmJobs": {
    "b42414e2-58c8-49af-8d16-e5493cadd433": {
        "end": "60948480",
        "cur": "0",
        "imgUUID": "787b2c4f-f8d6-4902-9278-8f920bbd8c62",
        "blockJobType": "commit",
        "bandwidth": 0,
        "id": "b42414e2-58c8-49af-8d16-e5493cadd433",
        "jobType": "block"
    }
```

So it seems that the liveInfo was collected just once, when the merge operation was triggered, but was never checked again. I just cannot figure out why the monitoring was not running.

Can you attach engine logs as well?

I tried to pursue the guest agent poller angle, but it led nowhere, since we get warnings when a worker is blocked. The good news is that I found the root cause. I downloaded the entire logcollector archive and found out that the VM had been migrated a couple of days before the live merge. The policy used in the migration is post-copy, and the issue, it seems, is that the monitor is not re-enabled properly after a post-copy migration finishes. This can be reproduced consistently using the following steps:

1. Create a VM (with an OS), and set the migration policy to post-copy
2. Create a snapshot
3. In the VM run `stress --cpu 1 --vm 4 --vm-bytes 128M --timeout 300s` - I am not entirely sure this is required, but it seems post-copy won't kick in if there isn't much going on in memory
4. Migrate the VM
5. Remove the snapshot

Hi Jiri,

After investigating this bug and discussing the proposed patch with Milan, there is something unclear. It seems that in a post-copy migration both the source and the destination receive VIR_DOMAIN_EVENT_SUSPENDED_POSTCOPY, and it probably has something to do with the change from [1], as I see the following logs on the destination:

```
2020-01-15 08:17:54.802+0000: 17327: debug : qemuProcessHandleMigrationStatus:1647 : Migration of domain 0x7fae6801c310 vmski changed state to post copy-active
2020-01-15 08:17:54.802+0000: 17327: debug : qemuProcessHandleMigrationStatus:1663 : Correcting paused state reason for domain vmski to post-copy   <--- I assume this emits the VIR_DOMAIN_EVENT_SUSPENDED_POSTCOPY event
2020-01-15 08:17:55.045+0000: 17327: debug : qemuProcessHandleResume:719 : Transitioned guest vmski into running state, reason 'post-copy', event detail 3
```

Is this the correct behaviour? Should the destination receive VIR_DOMAIN_EVENT_SUSPENDED_POSTCOPY as well?

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1647365

(In reply to Benny Zlotnik from comment #13)

Your investigation seems to be correct. The domain is started as paused on the destination (it should get a "suspended" event with a "migration" reason when migration starts). Once migration switches to post-copy, the code in qemuProcessHandleMigrationStatus will update the reason to "post-copy" and emit a new "suspended" event just a moment before the domain is resumed, which should only happen on the source. I'll clone this bug to libvirt and fix it.

(In reply to Benny Zlotnik from comment #12)
> The policy used in the migration is post-copy, and the issue, it seems, is the monitor not being enabled properly after a post-copy migrate is finished.
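To make the analysis in comment #13 concrete, here is a purely illustrative Python model of the two event sequences described there. The `MonitorModel` class and its logic are hypothetical simplifications, not VDSM's actual code; only the event names and the sequences mirror the bug report. It shows how a handler that assumes VIR_DOMAIN_EVENT_SUSPENDED_POSTCOPY only fires on the migration source can leave monitoring disabled on a destination that received the event spuriously:

```python
# Hypothetical simplification of a VM monitor reacting to libvirt lifecycle
# events. Event names mirror libvirt's; the handler logic is illustrative.

SUSPENDED_MIGRATION = ("suspended", "migration")  # destination, migration start
SUSPENDED_POSTCOPY = ("suspended", "postcopy")    # should be source-only
RESUMED_POSTCOPY = ("resumed", "postcopy")        # destination takes over


class MonitorModel:
    def __init__(self):
        self.monitoring = True   # periodic liveInfo/block-job polling
        self.postcopy = False

    def handle(self, event):
        kind, detail = event
        if kind == "suspended":
            self.monitoring = False
            if detail == "postcopy":
                # Assumed to mean: *this* host is the source and the guest
                # now runs elsewhere, so a later resume is not ours.
                self.postcopy = True
        elif kind == "resumed":
            if not self.postcopy:
                self.monitoring = True
            # else: resume attributed to the post-copy hand-over;
            # monitoring stays off.


# Destination before the libvirt fix: a spurious SUSPENDED_POSTCOPY arrives
# just before the resume, so the resume is misattributed and monitoring
# stays off -- block-job progress (cur/end) is never polled again.
buggy = MonitorModel()
for ev in (SUSPENDED_MIGRATION, SUSPENDED_POSTCOPY, RESUMED_POSTCOPY):
    buggy.handle(ev)
print(buggy.monitoring)  # False

# Destination after the fix: no spurious post-copy suspend event.
fixed = MonitorModel()
for ev in (SUSPENDED_MIGRATION, RESUMED_POSTCOPY):
    fixed.handle(ev)
print(fixed.monitoring)  # True
```

This matches the observed symptom: the merge completed on the libvirt side, but VDSM's job statistics were frozen at the values sampled when the merge started.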
I think the workaround for now should be just to use a pre-copy policy.

Verified on libvirt-6.0.0-17.module+el8.2.0+6257+0d066c28.x86_64 with these steps:

(In reply to Benny Zlotnik from comment #12)
> This can be reproduced consistently using the following steps:
> 1. Create a VM (with an OS), and set the migration policy to post-copy
> 2. Create a snapshot
> 3. In the VM run `stress --cpu 1 --vm 4 --vm-bytes 128M --timeout 300s` - I am not entirely sure this is required, but it seems post-copy won't run if there isn't much going on in the memory
> 4. Migrate the VM
> 5. Remove the snapshot

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (RHV RHEL Host (ovirt-host) 4.4), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2020:3246
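For reference, the stuck state visible in the getAllVmStats output quoted in the description can be recognized mechanically. Below is a minimal sketch; the `stuck_jobs` helper is hypothetical (not VDSM code), but the field names and sample values are taken verbatim from the `vmJobs` entries in this report, where `cur` and `end` are reported as strings:

```python
# Hypothetical helper that flags block jobs still reporting cur < end --
# the symptom in this bug, where libvirt had already completed the commit
# but VDSM's cached statistics never advanced.

def stuck_jobs(vm_jobs):
    """Return IDs of block jobs whose progress has not reached the end.

    vm_jobs: the "vmJobs" mapping from getAllVmStats; "cur" and "end"
    are strings in that output, hence the int() conversions.
    """
    return [
        job_id
        for job_id, job in vm_jobs.items()
        if int(job["cur"]) < int(job["end"])
    ]


# The second host from the description: cur stayed at 0 forever.
vm_jobs = {
    "b42414e2-58c8-49af-8d16-e5493cadd433": {
        "end": "60948480",
        "cur": "0",
        "blockJobType": "commit",
        "jobType": "block",
    }
}
print(stuck_jobs(vm_jobs))  # ['b42414e2-58c8-49af-8d16-e5493cadd433']
```

A healthy, completed job (cur equal to end) would simply drop out of the returned list.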