Bug 1805656
| Summary: | Guest hang after "echo 1 > /sys/bus/pci/devices/$vhost_user_nic_pcie/reset" | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 9 | Reporter: | Pei Zhang <pezhang> |
| Component: | qemu-kvm | Assignee: | Eugenio Pérez Martín <eperezma> |
| qemu-kvm sub component: | Networking | QA Contact: | Pei Zhang <pezhang> |
| Status: | CLOSED CURRENTRELEASE | Docs Contact: | |
| Severity: | high | | |
| Priority: | high | CC: | aadam, ailan, ameynarkhede03, chayang, eperezma, jinzhao, juzhang, maxime.coquelin, smitterl, virt-maint |
| Version: | unspecified | Keywords: | Triaged |
| Target Milestone: | rc | | |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2021-12-01 07:27:14 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1948358 | | |
Description
Pei Zhang
2020-02-21 10:17:01 UTC
I managed to reproduce the issue with RHEL 8.3 in both the host and the guest. I get a soft lockup in the guest; it seems virtnet_send_command never completes:

[ 196.018102] Hardware name: Red Hat KVM/RHEL-AV, BIOS 1.13.0-1.module+el8.3.0+6423+e4cb6418 04/01/2014
[ 196.019266] RIP: 0010:virtnet_send_command+0x100/0x150 [virtio_net]
[ 196.020066] Code: 74 24 48 e8 e2 74 5a d5 48 8b 7b 08 e8 e9 57 5a d5 84 c0 75 11 eb 22 48 8b 7b 08 e8 7a 52 5a d5 84 c0 75 15 f3 90 48 8b 7b 08 <48> 8d 74 24 04 e8 16 61 5a d5 48 85 c0 74 de 48 8b 83 58 01 00 00
[ 196.022418] RSP: 0018:ffffaa0000b1fa68 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff13
[ 196.023377] RAX: 0000000000000000 RBX: ffff88ea736b6ac0 RCX: 0000000000000001
[ 196.024286] RDX: 0000000000000000 RSI: ffffaa0000b1fa6c RDI: ffff88ea7269d180
[ 196.025192] RBP: 0000000000000002 R08: 0000771640000000 R09: ffff88ea736b6ac0
[ 196.026091] R10: 0000000171213000 R11: 0000000000000000 R12: ffffaa0000b1fa90
[ 196.026999] R13: 0000000000000000 R14: 0000000000000000 R15: ffffffff9679c3c0
[ 196.027904] FS: 00007fac4745a3c0(0000) GS:ffff88eabbb00000(0000) knlGS:0000000000000000
[ 196.028923] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 196.029658] CR2: 00007f3449ad3000 CR3: 00000001771b0001 CR4: 0000000000760ee0
[ 196.030559] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 196.031466] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 196.032364] PKRU: 55555554
[ 196.032718] Call Trace:
[ 196.033038]  virtnet_set_rx_mode+0xbc/0x330 [virtio_net]
[ 196.033721]  __dev_mc_del+0x64/0x70
[ 196.034171]  igmp6_group_dropped+0xee/0x200
[ 196.034715]  ? netlink_broadcast_filtered+0x145/0x400
[ 196.035356]  __ipv6_dev_mc_dec+0xbc/0x130
[ 196.035871]  addrconf_leave_solict.part.65+0x42/0x60
[ 196.036513]  __ipv6_ifa_notify+0x10a/0x320
[ 196.037036]  addrconf_ifdown+0x2b9/0x570
[ 196.037543]  addrconf_notify+0x24c/0xaf0
[ 196.038048]  ? copy_overflow+0x20/0x20
[ 196.038532]  ? copy_overflow+0x20/0x20
[ 196.039019]  ? __do_proc_dointvec+0x21d/0x410
[ 196.039578]  ? dev_disable_change+0x4c/0x80
[ 196.040115]  dev_disable_change+0x4c/0x80
[ 196.040635]  addrconf_sysctl_disable+0x11e/0x1a0
[ 196.041227]  ? dev_disable_change+0x80/0x80
[ 196.041772]  proc_sys_call_handler+0x1a5/0x1c0
[ 196.042342]  vfs_write+0xa5/0x1a0
[ 196.042776]  ksys_write+0x4f/0xb0
[ 196.043205]  do_syscall_64+0x5b/0x1a0
[ 196.043682]  entry_SYSCALL_64_after_hwframe+0x65/0xca
[ 196.044324] RIP: 0033:0x7fac44c50847
[ 196.044788] Code: c3 66 90 41 54 49 89 d4 55 48 89 f5 53 89 fb 48 83 ec 10 e8 1b fd ff ff 4c 89 e2 48 89 ee 89 df 41 89 c0 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 35 44 89 c7 48 89 44 24 08 e8 54 fd ff ff 48
[ 196.047135] RSP: 002b:00007fff067bcf50 EFLAGS: 00000293 ORIG_RAX: 0000000000000001
[ 196.048097] RAX: ffffffffffffffda RBX: 000000000000001c RCX: 00007fac44c50847
[ 196.049001] RDX: 0000000000000002 RSI: 00007fff067bcf80 RDI: 000000000000001c
[ 196.049903] RBP: 00007fff067bcf80 R08: 0000000000000000 R09: 00007fac449f9d40
[ 196.050809] R10: 0000000000000000 R11: 0000000000000293 R12: 0000000000000002
[ 196.051709] R13: 000000000000001c R14: 0000000000000000 R15: 00007fff067bcf80
[ 202.014002] rcu: INFO: rcu_sched self-detected stall on CPU
[ 202.014720] rcu:     2-....: (59922 ticks this GP) idle=29a/1/0x4000000000000002 softirq=12857/12857 fqs=14951
[ 202.015958]  (t=60000 jiffies g=18969 q=540)
[ 202.016503] NMI backtrace for cpu 2
[ 202.016952] CPU: 2 PID: 752 Comm: NetworkManager Kdump: loaded Tainted: G L --------- - - 4.18.0-215.el8.x86_64 #1

This issue still exists with the latest rhel8.4-av. Versions:
4.18.0-276.el8.x86_64
qemu-kvm-5.2.0-3.module+el8.4.0+9499+42e58f08.x86_64
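For reference, a minimal reproduction sketch from inside the guest, assuming the vhost-user-backed NIC shows up as a virtio-net PCI device at the hypothetical address 0000:06:00.0 (substitute the address lspci reports on your system):

# lspci -D | grep -i 'virtio network'
0000:06:00.0 Ethernet controller: Red Hat, Inc. Virtio network device   (example output)
# echo 1 > /sys/bus/pci/devices/0000:06:00.0/reset

If the reset leaves the device's control virtqueue unresponsive, any later virtnet_send_command call (here apparently triggered by NetworkManager writing an IPv6 sysctl for the interface) spins forever and produces the soft lockup above. The backtrace of the spinning CPU can be re-captured at any time via SysRq, assuming SysRq is enabled in the guest:

# echo 1 > /proc/sys/kernel/sysrq
# echo l > /proc/sysrq-trigger
# dmesg | tail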
Currently there is a hard coded policy/ordering of reset methods in the kernel (Dev specific -> FLR -> AF_FLR -> Power Management -> slot -> bus). I proposed a patch that lets the user see all supported reset methods and call a specific one through a new reset_method sysfs attribute:
https://lore.kernel.org/linux-pci/20210409192324.30080-1-ameynarkhede03@gmail.com/

Can you test which reset method is being used by the vhost-user NIC using that patch?

(In reply to Amey Narkhede from comment #15)
> Currently there is hard coded policy/ordering of reset methods in kernel
> (Dev specific->FLR->AF_FLR->Power Management->slot->bus).
> I proposed a patch that would let user to see all supported reset
> methods and call the specific one through new reset_methods sysfs attribute.
> https://lore.kernel.org/linux-pci/20210409192324.30080-1-
> ameynarkhede03/
>
> Can you test which reset method is being used by vhost-user-nic using that
> patch?

Hi Amey. Thanks for the suggestion.

I'm not able to see any reset_method file under /sys/bus/pci/devices/ after applying your patch on top of v5.12-rc2. Am I missing something?

I'm not able to apply it over the latest master due to conflicts, in case you want to send an updated version.

Thanks!

Hi Amey. Not sure about what failed. The output of the sysfs file is flr,pm,bus.

Thanks!

Hi Eugenio,
Can you try writing pm and bus to the reset_method file and then performing the reset?
# echo pm > /sys/bus/..../reset_method
Then try performing the reset with:
# echo 1 > /sys/bus/..../reset
You can try the same steps for the bus reset.

Also, you can use the latest version of the patches from here if you get merge conflicts:
https://lore.kernel.org/linux-pci/20210529192527.2708-1-ameynarkhede03@gmail.com/T/#t
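Putting the suggestion above together, a minimal sketch of the sequence on a kernel carrying the reset_method patches, again using the hypothetical address 0000:06:00.0; the flr,pm,bus output matches what was reported earlier in this bug:

# cat /sys/bus/pci/devices/0000:06:00.0/reset_method
flr,pm,bus
# echo pm > /sys/bus/pci/devices/0000:06:00.0/reset_method
# echo 1 > /sys/bus/pci/devices/0000:06:00.0/reset

Reading reset_method lists the reset methods the device supports in the order the kernel would try them; writing pm restricts subsequent resets to the power-management (D3hot/D0) reset, bypassing FLR. Writing bus instead exercises the secondary bus reset, as suggested above.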
(In reply to Amey Narkhede from comment #19)
> Hi Eugenio,
> Can you try writing pm and bus to reset_method file and then perform the
> reset?
> # echo pm > /sys/bus/..../reset_method
> Then try performing reset by
> # echo 1 > /sys/bus/..../reset
> You can try same steps for the bus reset.
>
> Also you can use latest version of patches from here
> https://lore.kernel.org/linux-pci/20210529192527.2708-1-ameynarkhede03@gmail.
> com/T/#t
> if you get merge conflicts.

Hi Amey.

Thank you very much, the soft lockup is gone with pm.

Could you expand on the differences between these methods? Would it be right to switch to pm, or does it have undesired consequences?

Thanks!

(In reply to Eugenio Pérez Martín from comment #20)
> (In reply to Amey Narkhede from comment #19)
> > Hi Eugenio,
> > Can you try writing pm and bus to reset_method file and then perform the
> > reset?
> > # echo pm > /sys/bus/..../reset_method
> > Then try performing reset by
> > # echo 1 > /sys/bus/..../reset
> > You can try same steps for the bus reset.
> >
> > Also you can use latest version of patches from here
> > https://lore.kernel.org/linux-pci/20210529192527.2708-1-ameynarkhede03@gmail.
> > com/T/#t
> > if you get merge conflicts.
>
> Hi Armey.
>
> Thank you very much, the soft lockup is gone with pm.
>
> Could you expand on the differences of these methods? Would it be right to
> switch to pm or does it have undesired consequences?
>
> Thanks!

I think the difference is device specific. It looks like the problem is in the FLR implementation of the vhost-user NIC. Can you try asking on the qemu mailing list?

Thanks,
Amey

Bulk update: Move RHEL-AV bugs to RHEL 9. If it is necessary to resolve this in RHEL 8, clone the bug to the current RHEL 8 release.

After evaluating this issue, there are no plans to address it further or fix it in an upcoming release. Therefore, it is being closed. If plans change such that this issue will be fixed in an upcoming release, the bug can be reopened.

Testing update: This issue cannot be reproduced with the latest rhel9.0. Versions:
5.14.0-39.el9.x86_64
qemu-kvm-6.2.0-1.el9.x86_64
openvswitch2.15-2.15.0-33.el9fdp.x86_64

Following the steps in the Description, the guest keeps working well, so this issue is gone. Moving status to CurrentRelease.
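For completeness, a hedged sketch of the verification flow on RHEL 9, under the same assumptions as the earlier examples (hypothetical device address 0000:06:00.0, placeholder guest NIC name eth0 and peer IP):

On the host, check the versions in use:
# uname -r
# rpm -q qemu-kvm openvswitch2.15

Inside the guest, repeat the reset and confirm the NIC stays usable:
# echo 1 > /sys/bus/pci/devices/0000:06:00.0/reset
# ip link show eth0
# ping -c 3 <gateway or peer IP>

With the versions listed above, the echo should return promptly and the guest should stay responsive, matching the "guest keeps working well" result reported in the testing update.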