Bug 2074205
| Summary: | while live-migrating many instances concurrently, libvirt sometimes returns internal error: migration was active, but no RAM info was set | |||
|---|---|---|---|---|
| Product: | Red Hat Enterprise Linux 8 | Reporter: | David Hill <dhill> | |
| Component: | qemu-kvm | Assignee: | Virtualization Maintenance <virt-maint> | |
| qemu-kvm sub component: | Live Migration | QA Contact: | Li Xiaohui <xiaohli> | |
| Status: | CLOSED ERRATA | Docs Contact: | ||
| Severity: | medium | |||
| Priority: | high | CC: | chayang, coli, gconsalv, ggrimaux, gveitmic, jferlan, jinzhao, jmitterm, jsuchane, juzhang, kchamart, lmen, mwitt, peterx, pkrempa, schhabdi, virt-maint, xuzhang, yama, ymankad | |
| Version: | 8.2 | Keywords: | Triaged, ZStream | |
| Target Milestone: | rc | Flags: | pm-rhel: mirror+ | |
| Target Release: | --- | |||
| Hardware: | x86_64 | |||
| OS: | Linux | |||
| Whiteboard: | ||||
| Fixed In Version: | qemu-kvm-6.2.0-29.module+el8.8.0+17991+08d03241 | Doc Type: | If docs needed, set a value | |
| Doc Text: | Story Points: | --- | ||
| Clone Of: | ||||
| : | 2074219 2161781 2161784 2168214 2168217 2168218 2168219 | Environment: | ||
| Last Closed: | 2023-05-16 08:16:30 UTC | Type: | Bug | |
| Regression: | --- | Mount Type: | --- | |
| Documentation: | --- | CRM: | ||
| Verified Versions: | Category: | --- | ||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | ||
| Cloudforms Team: | --- | Target Upstream Version: | ||
| Embargoed: | ||||
| Bug Depends On: | 2161784 | |||
| Bug Blocks: | 2074219, 2161781, 2168214, 2168217, 2168218, 2168219 | |||
Description
David Hill
2022-04-11 18:48:15 UTC
Could you please attach debug logs [1] of libvirtd when this happens so that we can see what qemu returned? The error message you've seen happens when ram info is missing in the reply of 'query-migrate' from qemu.

[1] https://www.libvirt.org/kbase/debuglogs.html

My understanding: according to the description in the attached customer case, the VM migration is successful; the only issue is that libvirt reports the error below when querying job info:

migration was active, but no RAM info was set

Looking at the customer's logs, it doesn't look to me as if they're migrating lots in parallel on any one machine; only one or two VMs at a time. Most of the VMs migrate very quickly (a few seconds). Looking at the qemu code, it looks pretty solid; I can kind of see a 'maybe' theoretical race: the code that generates the 'query-migrate' reply reads the 'status' twice. Maybe if it changes between the two reads you could end up with an inconsistency. Maybe. Never seen it though.

Comparing the timestamp of initiating migration and the error:

1) 2022-03-30 06:08:37.025+0000: initiating migration
2) 2022-03-30 06:08:37.197 7 WARNING nova.virt.libvirt.driver [req-5c3296cf-88ee-4af6-ae6a-ddba99935e23 - - - - -] [instance: af339c99-1182-4489-b15c-21e52f50f724] Error monitoring migration: internal error: migration was active, but no RAM info was set: libvirt.libvirtError: internal error: migration was active, but no RAM info was set

We can see the error happened at the very early phase of migration.

That could be the race I can imagine: if the code read the status as 'setup' it wouldn't save the ram info. If the migration then switched setup->active and the code then copied the status field as 'active', you'd see this symptom. (A minimal sketch of this pattern is included at the end of this report.)

I've just posted a qemu fix:

migration: Read state once

It's a theoretical fix, in the sense that we've not got enough debug to know if this is the real cause.

(In reply to Dr. David Alan Gilbert from comment #7)
> I've just posted a qemu fix:
> migration: Read state once

Link to the above: https://lists.gnu.org/archive/html/qemu-devel/2022-04/msg01395.html

Thanks, Dave (G)!

> It's a theoretical fix, in the sense we've not got enough debug to know if
> this is the real cause.

Passing down to qemu-kvm for further processing.

Hmm, so I'm not too sure what to do with this; assuming my fix for the theoretical reason goes in upstream, then what? We've not got a reproducer for it; do we bother backporting it, or just take it in the next one?

My qemu fix is in upstream qemu:

552de79bfdd5e9e53847eb3c6d6e4cd898a4370e migration: Read state once

QE bot (pre verify): set 'Verified:Tested,SanityOnly' as gating/tier1 test pass.

Tested on kernel-4.18.0-449.el8.x86_64 && qemu-kvm-6.2.0-29.module+el8.8.0+17991+08d03241.x86_64, repeating the reproduce case (see Comment 13) nearly 300 times; all runs passed and no issues were hit. So marking this bug verified per the above results.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: virt:rhel and virt-devel:rhel security, bug fix, and enhancement update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2023:2757
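To make the suspected race easier to follow, here is a minimal C sketch of the double-read pattern and of the "read state once" fix described in the comments above. This is not the actual QEMU code; all names (`MigStatus`, `MigrationInfo`, `fill_migration_info_racy`, `fill_migration_info_fixed`, `mig_state`, `ram_transferred`) are hypothetical stand-ins, loosely modeled on the query-migrate semantics discussed here.

```c
/*
 * Minimal sketch of the suspected race, NOT the actual QEMU code.
 * All names below are hypothetical stand-ins.
 */
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

typedef enum { MIG_SETUP, MIG_ACTIVE, MIG_COMPLETED } MigStatus;

typedef struct {
    MigStatus status;
    bool has_ram;           /* was the "ram" section of the reply filled in? */
    long ram_transferred;
} MigrationInfo;

static _Atomic MigStatus mig_state;   /* written by the migration thread */
static long ram_transferred;          /* updated while migration is active */

/* Racy variant: the state is read twice and may change in between. */
void fill_migration_info_racy(MigrationInfo *info)
{
    if (atomic_load(&mig_state) != MIG_SETUP) {       /* first read */
        info->has_ram = true;
        info->ram_transferred = ram_transferred;
    }
    /*
     * Second read: if the migration thread moved setup->active between
     * the two reads, the reply says "active" but has_ram was never set,
     * which is exactly the symptom libvirt reports as
     * "migration was active, but no RAM info was set".
     */
    info->status = atomic_load(&mig_state);
}

/* Fixed variant: read the state once and use that single snapshot. */
void fill_migration_info_fixed(MigrationInfo *info)
{
    MigStatus state = atomic_load(&mig_state);        /* read once */

    info->status = state;
    if (state != MIG_SETUP) {
        info->has_ram = true;
        info->ram_transferred = ram_transferred;
    }
}

int main(void)
{
    MigrationInfo info = {0};

    atomic_store(&mig_state, MIG_ACTIVE);
    ram_transferred = 4096;
    fill_migration_info_fixed(&info);
    printf("status=%d has_ram=%d ram=%ld\n",
           info.status, info.has_ram, info.ram_transferred);
    return 0;
}
```

The upstream fix (commit 552de79bfdd5e9e53847eb3c6d6e4cd898a4370e, "migration: Read state once") follows the same idea: snapshot the migration state once and derive both the reported status and the decision to fill in the RAM statistics from that single value.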