Bug 1580325
| Summary: | Postcopy migration failed with nvdimm device | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 9 | Reporter: | Yumei Huang <yuhuang> |
| Component: | qemu-kvm | Assignee: | Dr. David Alan Gilbert <dgilbert> |
| qemu-kvm sub component: | Live Migration | QA Contact: | Li Xiaohui <xiaohli> |
| Status: | CLOSED WONTFIX | Docs Contact: | |
| Severity: | medium | | |
| Priority: | low | CC: | chayang, dgilbert, fdeutsch, fgarciad, fjin, jinzhao, juzhang, knoel, mdean, peterx, stefanha, virt-maint, xiaohli, yuhuang |
| Version: | 9.0 | Keywords: | Reopened, Triaged |
| Target Milestone: | rc | | |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2022-04-08 07:27:23 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Yumei Huang
2018-05-21 09:29:13 UTC
Dr. David Alan Gilbert (comment #2)

What is 'test.img' in this case? Postcopy only supports migration of RAM backed by normal memory or hugepage memory. You might find it works if you create test.img in /dev/shm, for example. If it's a file on disk, or a real NVDIMM, I wouldn't expect it to work. Kernel work to support NVDIMM postcopy would be needed, I think.

Yumei Huang (comment #3)

(In reply to Dr. David Alan Gilbert from comment #2)
> What is 'test.img' in this case?

It is a file on disk.

> Postcopy only supports migration of RAM backed by normal memory, or hugepage
> memory. You might find it works if you create test.img in /dev/shm for
> example.

I tried creating test.img in /dev/shm; when doing postcopy migration, I hit another error:

(qemu) qemu-kvm: postcopy_place_page_zero: File exists zero host: 0x7fc1265b1000
qemu-kvm: error while loading state section id 4(ram)
qemu-kvm: postcopy_ram_listen_thread: loadvm failed: -17

Maybe this is the real issue?

> If it's a file on disk, or a real nVDIMM I wouldn't expect it to work.
> Kernel work to support nVDIMM postcopy would be needed I think.

Should I file a bug on the kernel component to add that support?

Thanks,
Yumei Huang

Yumei Huang

(In reply to Yumei Huang from comment #3)
> I tried creating test.img in /dev/shm; when doing postcopy migration, I hit
> another error: [...]
> Maybe this is the real issue?

Sorry, please ignore this. I did a local migration and both the src guest and the dst guest used the same test.img; that's why the error happens. Postcopy succeeded when I used different files under /dev/shm as the backend.
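For reference, the working /dev/shm setup described above can be sketched roughly as follows. All paths, sizes, object ids, and the destination host/port are illustrative placeholders, not values taken from the original report.

```shell
# Illustrative sketch: emulated NVDIMM backed by a file in /dev/shm (tmpfs),
# which postcopy's userfault support can handle; a backing file on a regular
# disk filesystem cannot be postcopy-migrated. Run a matching command (plus
# -incoming) on the destination, using a different backing file.
qemu-kvm \
  -machine pc,nvdimm=on \
  -m 4G,slots=4,maxmem=32G \
  -object memory-backend-file,id=mem1,share=on,mem-path=/dev/shm/test-src.img,size=1G \
  -device nvdimm,id=nvdimm1,memdev=mem1

# Then, in the HMP monitor on BOTH sides before migrating:
#   (qemu) migrate_set_capability postcopy-ram on
# and on the source:
#   (qemu) migrate -d tcp:DST_HOST:4444
#   (qemu) migrate_start_postcopy
```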
Dr. David Alan Gilbert (comment #5)

(In reply to Yumei Huang from comment #3)
> Should I file a bug on the kernel component to add that support?

I'll leave that question up to Hai, but I think that would be an RFE. Note that your test is really just testing it with a file on disk, which there's no good reason to support; support on a real NVDIMM might make some sense, but would require different kernel support.

Closing as not-supported.

Yumei Huang (comment #6)

(In reply to Dr. David Alan Gilbert from comment #5)
> Note your test is really just testing it with a file on disk, which there's
> no good reason to support; support on a real nVDIMM might make some sense,
> but would require different kernel support.

It would look the same to the guest whether we use a real NVDIMM or an emulated NVDIMM backed by a host file. If postcopy with NVDIMM is not supported, I think we should let the src QEMU know about it and handle it correctly.
But in fact the dst QEMU quits with an error message while the src QEMU thinks the migration went well and pauses (postmigrate). Then there is no guest alive, which could have terrible consequences.

Dr. David Alan Gilbert (comment #7)

(In reply to Yumei Huang from comment #6)
> It would look the same to the guest whether we use a real NVDIMM or an
> emulated NVDIMM backed by a host file.

They require different support in the kernel to do that.

> If postcopy with NVDIMM is not supported, I think we should let the src QEMU
> know about it and handle it correctly. But in fact the dst QEMU quits with an
> error message while the src QEMU thinks the migration went well and pauses
> (postmigrate). Then there is no guest alive, which could have terrible
> consequences.

Did you set the postcopy-ram capability on the destination? If you did, it should have failed very early on, before the start of migration.

Yumei Huang (comment #8)

(In reply to Dr. David Alan Gilbert from comment #7)
> They require different support in the kernel to do that.

I don't quite understand. Putting migration aside, I meant it looks the same to the guest whether we use a real NVDIMM or a host file as the backend.
We see the same "/dev/pmemX" in the guest either way. By "different support in the kernel", did you mean the guest side or the host side? Could you please explain why it would be different?

> Did you set the postcopy-ram capability on the destination? If you did, it
> should have failed very early on, before the start of migration.

Yes, I set it on both src and dst in step 3. It didn't fail until migrate_start_postcopy. BTW, precopy migration works well with nvdimm.

Dr. David Alan Gilbert (comment #9)

(In reply to Yumei Huang from comment #8)
> I don't quite understand. Putting migration aside, I meant it looks the same
> to the guest whether we use a real NVDIMM or a host file as the backend. We
> see the same "/dev/pmemX" in the guest either way. By "different support in
> the kernel", did you mean the guest side or the host side?

In the host kernel: the postcopy code uses a kernel feature called 'userfault', and it has to be taught about different types of backing memory.
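That "taught about backing memory" point is also why an early check is possible in principle. Below is a hypothetical sketch, not QEMU's actual code: the function name and the set of supported backing types are illustrative assumptions. It shows the shape of an up-front test that would let the source refuse postcopy before migration starts, rather than leaving both sides dead mid-postcopy.

```python
# Hypothetical sketch (not QEMU's real implementation): userfaultfd, the
# 'userfault' feature mentioned above, handles anonymous memory and, with
# later kernel work, tmpfs and hugetlbfs. Disk-backed or pmem mappings are
# not covered, so a postcopy setup check should reject them up front.

SUPPORTED_BACKINGS = {"anonymous", "tmpfs", "hugetlbfs"}  # illustrative set

def postcopy_supported(ram_blocks):
    """Return (ok, offending_block) for a list of (name, backing) RAM blocks."""
    for name, backing in ram_blocks:
        if backing not in SUPPORTED_BACKINGS:
            return False, name
    return True, None

# A guest like the one in this report: normal RAM plus a disk-file NVDIMM.
blocks = [("pc.ram", "anonymous"), ("mem1", "disk-file")]
ok, bad = postcopy_supported(blocks)
print(ok, bad)  # -> False mem1: fail before migration, not during postcopy
```

With a check of this shape, the source would report the unsupported block when the capability is set, instead of the destination dying at `migrate_start_postcopy` time.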
> > Did you set the postcopy-ram capability on the destination? If you did, it
> > should have failed very early on, before the start of migration.
>
> Yes, I set it on both src and dst in step 3. It didn't fail until
> migrate_start_postcopy.

Yes, I checked my code: the kernel doesn't give me a way to check whether an area of memory is supported or not. I've got an explicit check for some cases I know about; I might be able to add another test.

> BTW, precopy migration works well with nvdimm.

Yes, potentially.

Yumei Huang

(In reply to Dr. David Alan Gilbert from comment #9)
> In the host kernel: the postcopy code uses a kernel feature called
> 'userfault', and it has to be taught about different types of backing memory.

Thanks for the explanation.

> Yes, I checked my code: the kernel doesn't give me a way to check whether an
> area of memory is supported or not. I've got an explicit check for some
> cases I know about; I might be able to add another test.

If so, should we reopen this bz?

Tried to reproduce this bz on rhel8.1 (kernel-4.18.0-129.el8.x86_64 & qemu-4.1.0-rc4) with comment 0's test steps; got errors like the following:

1. On the src host:

(qemu) migrate_start_postcopy
(qemu) qemu-kvm: Detected IO failure for postcopy. Migration paused.
(qemu) info migrate
globals:
store-global-state: on
only-migratable: off
send-configuration: on
send-section-footer: on
decompress-error-check: on
clear-bitmap-shift: 18
capabilities: xbzrle: off rdma-pin-all: off auto-converge: off zero-blocks: off compress: off events: off postcopy-ram: on x-colo: off release-ram: off block: off return-path: off pause-before-switchover: off multifd: off dirty-bitmaps: off postcopy-blocktime: off late-block-activate: off x-ignore-shared: off
Migration status: postcopy-paused
total time: 12332 milliseconds
expected downtime: 300 milliseconds
setup: 24 milliseconds
transferred ram: 209625 kbytes
throughput: 268.57 mbps
remaining ram: 2990892 kbytes
total ram: 5260104 kbytes
duplicate: 672649 pages
skipped: 0 pages
normal: 50829 pages
normal bytes: 203316 kbytes
dirty sync count: 2
page size: 4 kbytes
multifd bytes: 0 kbytes
pages-per-second: 8230
dirty pages rate: 245673 pages

(qemu) info status
VM status: paused (finish-migrate)

2. On the dst host:

(qemu) qemu-kvm: ram_block_enable_notify userfault register: Invalid argument
qemu-kvm: ram_block_enable_notify failed
qemu-kvm: cleanup_range: userfault unregister Invalid argument
qemu-kvm: load of migration failed: Operation not permitted

Interesting; I'd hoped upstream 469dd51bc66 would have helped.

QEMU has recently been split into sub-components, and as a one-time operation to avoid breakage of tools, we are setting the QEMU sub-component of this BZ to "General". Please review and change the sub-component if necessary the next time you review this BZ. Thanks.

After evaluating this issue, there are no plans to address it further or fix it in an upcoming release. Therefore, it is being closed. If plans change such that this issue will be fixed in an upcoming release, then the bug can be reopened.

Still able to reproduce on rhel8.4.0-AV, so reopening this bz. Hosts info: kernel-4.18.0-287.el8.dt4.x86_64 & qemu-img-5.2.0-11.module+el8.4.0+10268+62bcbbed.x86_64

Move RHEL-AV bugs to RHEL9.
If necessary to resolve in RHEL8, then clone to the current RHEL8 release.

Still able to reproduce this bz on rhel9 (qemu-kvm-6.1.0-1.el9.x86_64), so delaying the stale date.

After evaluating this issue, there are no plans to address it further or fix it in an upcoming release. Therefore, it is being closed. If plans change such that this issue will be fixed in an upcoming release, then the bug can be reopened.