Bug 2074205 - while live-migrating many instances concurrently, libvirt sometimes returns internal error: migration was active, but no RAM info was set
Summary: while live-migrating many instances concurrently, libvirt sometimes returns internal error: migration was active, but no RAM info was set
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 8
Classification: Red Hat
Component: qemu-kvm
Version: 8.2
Hardware: x86_64
OS: Linux
Priority: high
Severity: medium
Target Milestone: rc
Target Release: ---
Assignee: Virtualization Maintenance
QA Contact: Li Xiaohui
URL:
Whiteboard:
Depends On: 2161784
Blocks: 2074219 2161781 2168214 2168217 2168218 2168219
 
Reported: 2022-04-11 18:48 UTC by David Hill
Modified: 2023-05-16 08:56 UTC
CC List: 20 users

Fixed In Version: qemu-kvm-6.2.0-29.module+el8.8.0+17991+08d03241
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 2074219 2161781 2161784 2168214 2168217 2168218 2168219
Environment:
Last Closed: 2023-05-16 08:16:30 UTC
Type: Bug
Target Upstream Version:
Embargoed:
pm-rhel: mirror+




Links
System                             ID                                                     Last Updated
Gitlab                             redhat/rhel/src/qemu-kvm qemu-kvm merge_requests 249   2023-01-19 17:24:41 UTC
Red Hat Issue Tracker              RHELPLAN-118598                                        2022-04-11 18:56:30 UTC
Red Hat Knowledge Base (Solution)  6903451                                                2022-04-11 18:49:22 UTC
Red Hat Product Errata             RHSA-2023:2757                                         2023-05-16 08:18:13 UTC

Description David Hill 2022-04-11 18:48:15 UTC
Description of problem:
While live-migrating many instances concurrently, libvirt sometimes returns the internal error "migration was active, but no RAM info was set":
~~~
2022-03-30 06:08:37.197 7 WARNING nova.virt.libvirt.driver [req-5c3296cf-88ee-4af6-ae6a-ddba99935e23 - - - - -] [instance: af339c99-1182-4489-b15c-21e52f50f724] Error monitoring migration: internal error: migration was active, but no RAM info was set: libvirt.libvirtError: internal error: migration was active, but no RAM info was set
~~~


Version-Release number of selected component (if applicable):
libvirt-daemon-6.0.0-25.6.module+el8.2.1+12457+868e9540.ppc64le	

How reproducible:
Random

Steps to Reproduce:
1. Live-evacuate a compute node
2.
3.

Actual results:
Live migration fails and leaves the database info in a dire state.

Expected results:
Live migration completes successfully.

Additional info:

Comment 2 Peter Krempa 2022-04-12 12:43:24 UTC
Could you please attach debug logs [1] of libvirtd from when this happens so that we can see what qemu returned? The error message you've seen happens when RAM info is missing in the reply to 'query-migrate' from qemu.

[1] https://www.libvirt.org/kbase/debuglogs.html
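
For illustration, here is a minimal, self-contained sketch of the kind of guard being described: the error fires when the query-migrate reply claims an active migration but carries no RAM section. The struct and function names are invented for the example and are not libvirt's actual code.

~~~
/* Hypothetical sketch of the guard described above: if the query-migrate
 * reply says the migration is "active" but carries no "ram" section,
 * raise the internal error seen in this bug. Names are illustrative,
 * not libvirt's real API. */
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

typedef struct {
    const char *status;   /* e.g. "setup", "active", "completed" */
    bool has_ram_info;    /* true if the reply contained a "ram" object */
} MigrateReply;

static int check_migrate_reply(const MigrateReply *reply)
{
    if (strcmp(reply->status, "active") == 0 && !reply->has_ram_info) {
        fprintf(stderr,
                "internal error: migration was active, but no RAM info was set\n");
        return -1;
    }
    return 0;
}

int main(void)
{
    /* Status already reads "active", but no RAM stats were attached. */
    MigrateReply reply = { .status = "active", .has_ram_info = false };
    return check_migrate_reply(&reply) == 0 ? 0 : 1;
}
~~~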

Comment 3 Fangge Jin 2022-04-13 03:22:18 UTC
My understanding:
According to the description in the attached customer case, the VM migration is successful; the only issue is that libvirt reports the error below when querying job info:
migration was active, but no RAM info was set

Comment 4 Dr. David Alan Gilbert 2022-04-13 09:42:00 UTC
Looking at the customer's logs, it doesn't look to me as if they're migrating lots in parallel on any one machine; only one or two VMs at a time.
Most of the VMs migrate very quickly (a few seconds).

Looking at the qemu code, it looks pretty solid; I can kind of see a 'maybe' theoretical race:
the code that generates the 'query-migrate' reply reads the 'status' twice. Maybe if it changes
between the two reads you could end up with an inconsistency.  Maybe.  Never seen it though.
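
For illustration only, a minimal sketch of the double-read pattern being described, with invented names rather than the actual QEMU source:

~~~
/* Sketch of the suspected race: the migration state is read twice, once to
 * decide whether to fill in RAM stats and once to report the status back.
 * If the state flips from "setup" to "active" between the two reads, the
 * caller sees an active migration with no RAM info. Names are invented. */
#include <stdbool.h>

typedef enum { MIG_SETUP, MIG_ACTIVE, MIG_COMPLETED } MigState;

struct MigrationState { MigState state;   /* updated by the migration thread */ };
struct MigrationInfo  { MigState status; bool has_ram; };

static void fill_migration_info_racy(struct MigrationState *s,
                                     struct MigrationInfo *info)
{
    switch (s->state) {              /* first read: still MIG_SETUP */
    case MIG_ACTIVE:
        info->has_ram = true;        /* RAM stats only filled when active */
        break;
    default:
        break;
    }
    /* ... migration thread flips the state to MIG_ACTIVE right here ... */
    info->status = s->state;         /* second read: now MIG_ACTIVE, but
                                      * has_ram was never set */
}
~~~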

Comment 5 Fangge Jin 2022-04-13 10:56:22 UTC
Comparing the timestamp of initiating migration and the error:
1) 2022-03-30 06:08:37.025+0000: initiating migration

2) 2022-03-30 06:08:37.197 7 WARNING nova.virt.libvirt.driver [req-5c3296cf-88ee-4af6-ae6a-ddba99935e23 - - - - -] [instance: af339c99-1182-4489-b15c-21e52f50f724] Error monitoring migration: internal error: migration was active, but no RAM info was set: libvirt.libvirtError: internal error: migration was active, but no RAM info was set

We can see the error happened at the very early phase of migration.

Comment 6 Dr. David Alan Gilbert 2022-04-13 11:17:57 UTC
That could be the race I can imagine: if the code read the status as 'setup', it wouldn't save the RAM info.
If the migration then switched setup->active,
and the code then copied the status field as 'active',
you'd see this symptom.

Comment 7 Dr. David Alan Gilbert 2022-04-13 11:35:09 UTC
I've just posted a qemu fix:
migration: Read state once

It's a theoretical fix, in the sense we've not got enough debug to know if this is the real cause.
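
Purely as an illustration of the shape of such a "read the state once" change (invented names again, not the actual patch; the upstream commit is referenced in comment 16 below):

~~~
/* Illustrative counterpart to the racy sketch in comment 4: take one
 * snapshot of the state and use it both for the RAM-stats decision and
 * for the reported status, so the two can never disagree. */
#include <stdbool.h>

typedef enum { MIG_SETUP, MIG_ACTIVE, MIG_COMPLETED } MigState;

struct MigrationState { MigState state; };
struct MigrationInfo  { MigState status; bool has_ram; };

static void fill_migration_info_fixed(struct MigrationState *s,
                                      struct MigrationInfo *info)
{
    MigState state = s->state;       /* single read of the shared state */

    switch (state) {
    case MIG_ACTIVE:
        info->has_ram = true;        /* filled from the same snapshot ... */
        break;
    default:
        break;
    }
    info->status = state;            /* ... that is reported back, so status
                                      * and RAM info stay consistent */
}
~~~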

Comment 8 Kashyap Chamarthy 2022-04-13 12:16:31 UTC
(In reply to Dr. David Alan Gilbert from comment #7)
> I've just posted a qemu fix:
> migration: Read state once

Link to the above: 

https://lists.gnu.org/archive/html/qemu-devel/2022-04/msg01395.html

Thanks, Dave (G)!

> It's a theoretical fix, in the sense we've not got enough debug to know if
> this is the real cause.

Comment 11 Jaroslav Suchanek 2022-04-14 15:04:03 UTC
Passing down to qemu-kvm for further processing.

Comment 12 Dr. David Alan Gilbert 2022-04-14 15:21:26 UTC
Hmm, so I'm not too sure what to do with this; assuming my fix for the theoretical cause goes in upstream, then what?
We've not got a reproducer for it; do we bother backporting it, or just take it in the next one?

Comment 16 Dr. David Alan Gilbert 2022-04-25 08:38:50 UTC
My qemu fix is in upstream qemu: 552de79bfdd5e9e53847eb3c6d6e4cd898a4370e ("migration: Read state once").

Comment 31 Yanan Fu 2023-01-28 02:31:19 UTC
QE bot (pre verify): Set 'Verified:Tested,SanityOnly' as gating/tier1 test pass.

Comment 34 Li Xiaohui 2023-01-31 08:23:43 UTC
Tested on kernel-4.18.0-449.el8.x86_64 and qemu-kvm-6.2.0-29.module+el8.8.0+17991+08d03241.x86_64, repeating the reproduction steps nearly 300 times according to the reproduce case (see Comment 13); all runs passed and didn't hit any issues.


So marking this bug verified per the above results.

Comment 43 errata-xmlrpc 2023-05-16 08:16:30 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: virt:rhel and virt-devel:rhel security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2023:2757

