Bug 2074205

Summary: While live-migrating many instances concurrently, libvirt sometimes returns "internal error: migration was active, but no RAM info was set"
Product: Red Hat Enterprise Linux 8
Reporter: David Hill <dhill>
Component: qemu-kvm
qemu-kvm sub component: Live Migration
Assignee: Virtualization Maintenance <virt-maint>
QA Contact: Li Xiaohui <xiaohli>
Status: CLOSED ERRATA
Severity: medium
Priority: high
CC: chayang, coli, gconsalv, ggrimaux, gveitmic, jferlan, jinzhao, jmitterm, jsuchane, juzhang, kchamart, lmen, mwitt, peterx, pkrempa, schhabdi, virt-maint, xuzhang, yama, ymankad
Version: 8.2
Keywords: Triaged, ZStream
Target Milestone: rc
Flags: pm-rhel: mirror+
Hardware: x86_64
OS: Linux
Fixed In Version: qemu-kvm-6.2.0-29.module+el8.8.0+17991+08d03241
Doc Type: If docs needed, set a value
Clones (view as bug list): 2074219, 2161781, 2161784, 2168214, 2168217, 2168218, 2168219
Last Closed: 2023-05-16 08:16:30 UTC
Type: Bug
Bug Depends On: 2161784
Bug Blocks: 2074219, 2161781, 2168214, 2168217, 2168218, 2168219

Description David Hill 2022-04-11 18:48:15 UTC
Description of problem:
While live-migrating many instances concurrently, libvirt sometimes returns "internal error: migration was active, but no RAM info was set":
~~~
2022-03-30 06:08:37.197 7 WARNING nova.virt.libvirt.driver [req-5c3296cf-88ee-4af6-ae6a-ddba99935e23 - - - - -] [instance: af339c99-1182-4489-b15c-21e52f50f724] Error monitoring migration: internal error: migration was active, but no RAM info was set: libvirt.libvirtError: internal error: migration was active, but no RAM info was set
~~~


Version-Release number of selected component (if applicable):
libvirt-daemon-6.0.0-25.6.module+el8.2.1+12457+868e9540.ppc64le	

How reproducible:
Random

Steps to Reproduce:
1. Live-evacuate a compute node
2.
3.

Actual results:
Live migration fails and leaves the database info in a dire state.

Expected results:
Live migration completes successfully.

Additional info:

Comment 2 Peter Krempa 2022-04-12 12:43:24 UTC
Could you please attach debug logs [1] of libvirtd when this happens so that we can see what qemu returned? The error message you've seen happens when the RAM info is missing from qemu's reply to 'query-migrate'.

[1] https://www.libvirt.org/kbase/debuglogs.html
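
For context, the error is raised on the libvirt side when the 'query-migrate' reply says the job is active but carries no "ram" section. A minimal sketch of that kind of check (helper names are illustrative, not the exact libvirt code):

~~~
/* Illustrative sketch: parsing the QMP 'query-migrate' reply.
 * json_get_object() and report_error() stand in for libvirt's helpers. */
if (strcmp(status, "active") == 0) {
    ram = json_get_object(reply, "ram");   /* RAM transfer statistics */
    if (!ram) {
        /* This is the message seen in the nova/libvirt logs above. */
        report_error("migration was active, but no RAM info was set");
        return -1;
    }
    /* otherwise parse transferred/remaining/total from "ram" ... */
}
~~~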

Comment 3 Fangge Jin 2022-04-13 03:22:18 UTC
My understanding:
According to the description in the attached customer case, the VM migration is successful; the only issue is that libvirt reports the error below when querying job info:
migration was active, but no RAM info was set

Comment 4 Dr. David Alan Gilbert 2022-04-13 09:42:00 UTC
Looking at the customers logs, it doesn't look to me as if they're migrating lots in parallel on any one machine; only one or two VMs at a time.
Most of the VMs migrate very quickly (a few seconds).

Looking at the qemu code, it looks pretty solid; I can kind of see a 'maybe' theoretical race:
the code that generates the 'query-migrate' reply reads the 'status' twice. Maybe if it changes
between the two reads you could end up with an inconsistency. Maybe. Never seen it though.
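
A rough sketch of that racy shape (illustrative only, not the exact QEMU source): the reply is built by switching on the migration state, but the status reported to the caller comes from a second read of the same state, so a setup->active transition in between could yield an 'active' reply with no RAM block.

~~~
/* Illustrative sketch of the suspected race when building the
 * 'query-migrate' reply; names are simplified, not the exact QEMU code. */
static void fill_migration_info(MigrationInfo *info, MigrationState *s)
{
    switch (s->state) {              /* first read: may still be SETUP */
    case MIGRATION_STATUS_SETUP:
        break;                       /* RAM stats not populated yet */
    case MIGRATION_STATUS_ACTIVE:
        populate_ram_info(info, s);  /* RAM stats added only here */
        break;
    }
    info->status = s->state;         /* second read: may already be ACTIVE */
}
~~~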

Comment 5 Fangge Jin 2022-04-13 10:56:22 UTC
Comparing the timestamp of initiating migration and the error:
1) 2022-03-30 06:08:37.025+0000: initiating migration

2) 2022-03-30 06:08:37.197 7 WARNING nova.virt.libvirt.driver [req-5c3296cf-88ee-4af6-ae6a-ddba99935e23 - - - - -] [instance: af339c99-1182-4489-b15c-21e52f50f724] Error monitoring migration: internal error: migration was active, but no RAM info was set: libvirt.libvirtError: internal error: migration was active, but no RAM info was set

We can see that the error happened at a very early phase of the migration.

Comment 6 Dr. David Alan Gilbert 2022-04-13 11:17:57 UTC
That could be the race I can imagine: if the code read the status as 'setup' it wouldn't save the RAM info.
If the migration then switched setup->active,
and the code then copied the status field as 'active',
you'd see this symptom.

Comment 7 Dr. David Alan Gilbert 2022-04-13 11:35:09 UTC
I've just posted a qemu fix:
migration: Read state once

It's a theoretical fix, in the sense we've not got enough debug to know if this is the real cause.
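
Roughly, the fix's approach is to sample the state once and use that same value for both the switch and the reported status. A sketch using the same simplified names as the snippet above (not the exact upstream patch):

~~~
/* Sketch of the "Read state once" approach: a single sample of the state
 * keeps the populated fields and the reported status consistent. */
static void fill_migration_info(MigrationInfo *info, MigrationState *s)
{
    int state = qatomic_read(&s->state);   /* read once */

    switch (state) {
    case MIGRATION_STATUS_SETUP:
        break;
    case MIGRATION_STATUS_ACTIVE:
        populate_ram_info(info, s);
        break;
    }
    info->status = state;                   /* same value as the switch */
}
~~~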

Comment 8 Kashyap Chamarthy 2022-04-13 12:16:31 UTC
(In reply to Dr. David Alan Gilbert from comment #7)
> I've just posted a qemu fix:
> migration: Read state once

Link to the above: 

https://lists.gnu.org/archive/html/qemu-devel/2022-04/msg01395.html

Thanks, Dave (G)!

> It's a theoretical fix, in the sense we've not got enough debug to know if
> this is the real cause.

Comment 11 Jaroslav Suchanek 2022-04-14 15:04:03 UTC
Passing down to qemu-kvm for further processing.

Comment 12 Dr. David Alan Gilbert 2022-04-14 15:21:26 UTC
Hmm, so I'm not too sure what to do with this; assuming my fix for the theoretical cause goes in upstream, then what?
We've not got a reproducer for it; do we bother backporting it or just take it in the next one?

Comment 16 Dr. David Alan Gilbert 2022-04-25 08:38:50 UTC
My qemu fix is in upstream qemu: commit 552de79bfdd5e9e53847eb3c6d6e4cd898a4370e ("migration: Read state once").

Comment 31 Yanan Fu 2023-01-28 02:31:19 UTC
QE bot (pre verify): Set 'Verified:Tested,SanityOnly' as gating/tier1 test pass.

Comment 34 Li Xiaohui 2023-01-31 08:23:43 UTC
Tested on kernel-4.18.0-449.el8.x86_64 && qemu-kvm-6.2.0-29.module+el8.8.0+17991+08d03241.x86_64, repeating the reproducer (see Comment 13) nearly 300 times; all runs passed and didn't hit any issues.


So marking this bug verified per the above results.

Comment 43 errata-xmlrpc 2023-05-16 08:16:30 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: virt:rhel and virt-devel:rhel security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2023:2757