Created attachment 1967803 [details]
libvirt VM XML

Description of problem:
While moving VMs to a host with 4.18.0-477.10.1.el8_8.x86_64, they start to hang. Issue does not reproduce with 4.18.0-425.19.2.el8_7.x86_64.

Version-Release number of selected component (if applicable):

How reproducible:
Every time. Takes about 20 minutes to reproduce under lab conditions.

Steps to Reproduce:
1. On a node with 128 GiB RAM and more than one NUMA node, start 12 VMs with 8 GiB of memory each
2. Start some load in the VMs to allocate memory (for example, memtest)
3. Run "migratepages" for one of the VMs
(see the shell sketch at the end of this comment)

Actual results:
VMs hang and the following shows in the kernel log:

May 23 12:41:36 t6 kernel: INFO: task CPU 0/KVM:159520 blocked for more than 120 seconds.
May 23 12:41:36 t6 kernel: Not tainted 4.18.0-477.10.1.el8_8.x86_64 #1
May 23 12:41:36 t6 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
May 23 12:41:36 t6 kernel: task:CPU 0/KVM state:D stack: 0 pid:159520 ppid: 1 flags:0x80000182
May 23 12:41:36 t6 kernel: Call Trace:
May 23 12:41:36 t6 kernel: __schedule+0x2d1/0x870
May 23 12:41:36 t6 kernel: schedule+0x55/0xf0
May 23 12:41:36 t6 kernel: io_schedule+0x12/0x40
May 23 12:41:36 t6 kernel: migration_entry_wait_on_locked+0x1ea/0x290
May 23 12:41:36 t6 kernel: ? filemap_fdatawait_keep_errors+0x50/0x50
May 23 12:41:36 t6 kernel: do_swap_page+0x5b0/0x710
May 23 12:41:36 t6 kernel: ? pmd_devmap_trans_unstable+0x2e/0x40
May 23 12:41:36 t6 kernel: ? handle_pte_fault+0x5d/0x880
May 23 12:41:36 t6 kernel: __handle_mm_fault+0x453/0x6c0
May 23 12:41:36 t6 kernel: handle_mm_fault+0xca/0x2a0
May 23 12:41:36 t6 kernel: __get_user_pages+0x2e1/0x810
May 23 12:41:36 t6 kernel: get_user_pages_unlocked+0xd5/0x2a0
May 23 12:41:36 t6 kernel: hva_to_pfn+0xf5/0x430 [kvm]
May 23 12:41:36 t6 kernel: ? mmu_spte_update_no_track+0xaf/0x100 [kvm]
May 23 12:41:36 t6 kernel: kvm_faultin_pfn+0x95/0x2e0 [kvm]
May 23 12:41:36 t6 kernel: direct_page_fault+0x3b4/0x860 [kvm]
May 23 12:41:36 t6 kernel: kvm_mmu_page_fault+0x114/0x680 [kvm]
May 23 12:41:36 t6 kernel: ? default_do_nmi+0x49/0x110
May 23 12:41:36 t6 kernel: ? do_nmi+0x104/0x220
May 23 12:41:36 t6 kernel: ? vmx_deliver_interrupt+0x92/0x1c0 [kvm_intel]
May 23 12:41:36 t6 kernel: ? vmx_vmexit+0x9f/0x72d [kvm_intel]
May 23 12:41:36 t6 kernel: ? vmx_vmexit+0xae/0x72d [kvm_intel]
May 23 12:41:36 t6 kernel: ? gfn_to_pfn_cache_invalidate_start+0x190/0x190 [kvm]
May 23 12:41:36 t6 kernel: vmx_handle_exit+0x177/0x770 [kvm_intel]
May 23 12:41:36 t6 kernel: ? gfn_to_pfn_cache_invalidate_start+0x190/0x190 [kvm]
May 23 12:41:36 t6 kernel: vcpu_enter_guest+0xaf9/0x18d0 [kvm]
May 23 12:41:36 t6 kernel: kvm_arch_vcpu_ioctl_run+0x112/0x600 [kvm]
May 23 12:41:36 t6 kernel: kvm_vcpu_ioctl+0x2c9/0x640 [kvm]
May 23 12:41:36 t6 kernel: ? __handle_mm_fault+0x453/0x6c0
May 23 12:41:36 t6 kernel: do_vfs_ioctl+0xa4/0x690
May 23 12:41:36 t6 kernel: ksys_ioctl+0x64/0xa0
May 23 12:41:36 t6 kernel: __x64_sys_ioctl+0x16/0x20
May 23 12:41:36 t6 kernel: do_syscall_64+0x5b/0x1b0
May 23 12:41:36 t6 kernel: entry_SYSCALL_64_after_hwframe+0x61/0xc6
May 23 12:41:36 t6 kernel: RIP: 0033:0x7f55249107cb
May 23 12:41:36 t6 kernel: Code: Unable to access opcode bytes at RIP 0x7f55249107a1.
May 23 12:41:36 t6 kernel: RSP: 002b:00007f551adbc6d8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
May 23 12:41:36 t6 kernel: RAX: ffffffffffffffda RBX: 0000564500c862c0 RCX: 00007f55249107cb
May 23 12:41:36 t6 kernel: RDX: 0000000000000000 RSI: 000000000000ae80 RDI: 0000000000000015
May 23 12:41:36 t6 kernel: RBP: 000000000000ae80 R08: 00005644ffdce5a8 R09: 00007f53040008de
May 23 12:41:36 t6 kernel: R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
May 23 12:41:36 t6 kernel: R13: 00005644ffdfef20 R14: 0000000000000000 R15: 00007f5528095000

Expected results:
The VMs shouldn't hang.

Additional info:
Issue does not reproduce with 4.18.0-425.19.2.el8_7.x86_64. Attached are a list of packages, lshw, lscpu and a dump of one of the test VMs' libvirt XML.
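A rough shell sketch of the reproduction steps above; the VM names, the guest memory-load command and the NUMA node numbers are placeholders, not taken from this report:

  # start the 12 guests (8 GiB each, as described above)
  for i in $(seq 1 12); do
      virsh start "testvm-$i"
  done

  # inside each guest, keep memory allocated and dirtied, e.g. with memtester
  # (the report only says "memtest"):
  #   memtester 6G

  # move the pages of one VM's qemu-kvm process to the other NUMA node
  pid=$(pgrep -f 'qemu.*testvm-1' | head -n1)
  migratepages "$pid" 0 1    # migratepages <pid> <from-nodes> <to-nodes>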
Created attachment 1967804 [details] List of packages on the system
Created attachment 1967805 [details] lscpu
Created attachment 1967806 [details] lshw
Probably a duplicate of bz2188249. Can you retry with kernel-4.18.0-492.el8 or later?
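For reference, a quick way to check which kernel is booted and whether the suggested build is available (a sketch assuming a dnf-based CentOS Stream 8 host; package names may differ):

  uname -r                                        # currently running kernel
  dnf list --showduplicates kernel | grep 492     # is 4.18.0-492.el8 or later available?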
I can confirm that the issue does not reproduce with 4.18.0-492.el8.

At the same time, a customer is seeing a similar issue on 3.10.0-1160.90.1.el7. As I don't have access to bz2188249, I can't look at the patch to judge whether it would apply to a CentOS 7 kernel. Is it known whether the issue is isolated to RHEL 8.8?
(In reply to redhat from comment #5)
> I can confirm that the issue does not reproduce with 4.18.0-492.el8.
>
> At the same time, a customer is seeing a similar issue on
> 3.10.0-1160.90.1.el7. As I don't have access to bz2188249, I can't look at
> the patch to judge whether it would apply to a CentOS 7 kernel. Is it known
> whether the issue is isolated to RHEL 8.8?

I don't think your CentOS-7 issue is related to the CentOS-Stream-8 problem reported here, given that the latter is due to an oversight in the backport of the following upstream commit into RHEL-8.8:

commit ffa65753c43142f3b803486442813744da71cff2
Author: Alistair Popple <apopple>
Date:   Fri Jan 21 22:10:46 2022 -0800

    mm/migrate.c: rework migration_entry_wait() to not take a pageref

and the aforementioned change set has not been backported to RHEL-7.

*** This bug has been marked as a duplicate of bug 2188249 ***
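One way to tell the two reports apart on a given host: migration_entry_wait_on_locked(), which appears in the hung-task backtrace above, was only introduced by the upstream rework quoted in this comment, so it should not exist at all on a 3.10.0 (RHEL/CentOS 7) kernel. A hedged sketch for checking (the changelog may not mention the function by name):

  # present on 4.18.0-477/-492 builds, absent on 3.10.0 kernels
  grep migration_entry_wait /proc/kallsyms

  # the package changelog of a fixed build may also reference the rework
  rpm -q --changelog kernel-core | grep -i migration_entry_wait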