Bug 1841900
Summary: | [Hyper-V][RHEL8] Backport "hv_netvsc: fix race that may miss tx queue wakeup" to 8.1.z and hotfix | ||
---|---|---|---|
Product: | Red Hat Enterprise Linux 8 | Reporter: | Jim Minter <jminter> |
Component: | kernel | Assignee: | Rick Barry <ribarry> |
kernel sub component: | Hyper-V | QA Contact: | Huijuan Zhao <huzhao> |
Status: | CLOSED CURRENTRELEASE | Docs Contact: | |
Severity: | urgent | ||
Priority: | unspecified | CC: | ailan, dhoward, dornelas, eparis, ffranz, hhei, hkrzesin, huzhao, imcleod, jminter, miabbott, mmorsy, nmurray, nstielau, rhowe, ribarry, smilner, xialiu, xuli, yacao, yuxisun |
Version: | 8.1 | ||
Target Milestone: | rc | ||
Target Release: | 8.0 | ||
Hardware: | Unspecified | ||
OS: | Unspecified | ||
Whiteboard: | |||
Fixed In Version: | Doc Type: | If docs needed, set a value | |
Doc Text: | Story Points: | --- | |
Clone Of: | Environment: | ||
Last Closed: | 2020-06-23 14:14:01 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | |||
Bug Depends On: | 1774687, 1842485 | ||
Bug Blocks: |
Description
Jim Minter
2020-05-29 18:43:15 UTC
We've requested the 8.1.z back-port using the original RHEL 8.2 BZ 1774687. I'll update progress here. Current status is we're waiting for GSS approval for the 8.1.z back-port.

Here is a brew build for 8.1 that contains the fix in case you need to test it:
https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=28933341

(In reply to Mohammed Gamal from comment #2)
> Here is a brew build for 8.1 that contains the fix in case you need to test
> it
> https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=28933341

Thanks, Mohammed.

Jim, is there someone who can test the scratch build provided by Mohammed while we wait for the official 8.1.z back-port? It will be good to know if this fix alone helps resolve the problem.

RHEL 8.1.0.z clone created: Bug 1842485

1. Tested with kernel-4.18.0-147.8.1.el8_1, can reproduce the scenario in comment 0: "ethtool -S eth1 shows stop_queue with a value of 1 + wake_queue"

# uname -a
Linux vm-197-113.lab.eng.pek2.redhat.com 4.18.0-147.8.1.el8_1.x86_64 #1 SMP Wed Feb 26 03:08:15 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux

Test steps:
1. Create VM1 with kernel-4.18.0-147.8.1.el8_1 on hyper-v 2019
2. Send packets to VM1 with iperf3
# iperf3 -c 192.168.10.244 -P4 -i 10 -t 60
3. Check on VM1, the stop_queue and wake_queue values are not 0:
# ethtool -S eth1
NIC statistics:
     tx_scattered: 0
     tx_no_memory: 0
     tx_no_space: 0
     tx_too_big: 0
     tx_busy: 0
     tx_send_full: 0
     rx_comp_busy: 0
     rx_no_memory: 0
     stop_queue: 13
     wake_queue: 13
     vf_rx_packets: 0
     vf_rx_bytes: 0
     vf_tx_packets: 0
     vf_tx_bytes: 0
     vf_tx_dropped: 0
     tx_queue_0_packets: 1388853
     tx_queue_0_bytes: 91706659
     rx_queue_0_packets: 20832099
     rx_queue_0_bytes: 31539199755
     tx_queue_1_packets: 382518
     tx_queue_1_bytes: 25248153
     rx_queue_1_packets: 90196752
     rx_queue_1_bytes: 136501485858
     cpu0_rx_packets: 90196752
     cpu0_rx_bytes: 136501485858
     cpu0_tx_packets: 382518
     cpu0_tx_bytes: 25248153
     cpu0_vf_rx_packets: 0
     cpu0_vf_rx_bytes: 0
     cpu0_vf_tx_packets: 0
     cpu0_vf_tx_bytes: 0
     cpu1_rx_packets: 20832099
     cpu1_rx_bytes: 31539199755
     cpu1_tx_packets: 1388853
     cpu1_tx_bytes: 91706659
     cpu1_vf_rx_packets: 0
     cpu1_vf_rx_bytes: 0
     cpu1_vf_tx_packets: 0
     cpu1_vf_tx_bytes: 0

2. Tested with the scratch build in comment 2, can also reproduce the scenario in comment 0: "ethtool -S eth1 shows stop_queue with a value of 1 + wake_queue".

# uname -a
Linux vm-197-113.lab.eng.pek2.redhat.com 4.18.0-147.17.1.el8_1.test.x86_64 #1 SMP Fri May 29 19:40:31 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
# ethtool -S eth1
NIC statistics:
     tx_scattered: 0
     tx_no_memory: 0
     tx_no_space: 0
     tx_too_big: 0
     tx_busy: 0
     tx_send_full: 5
     rx_comp_busy: 0
     rx_no_memory: 0
     stop_queue: 19
     wake_queue: 19
     vf_rx_packets: 0
     vf_rx_bytes: 0
     vf_tx_packets: 0
     vf_tx_bytes: 0
     vf_tx_dropped: 0
     tx_queue_0_packets: 2305274
     tx_queue_0_bytes: 76021212175
     rx_queue_0_packets: 63032192
     rx_queue_0_bytes: 94626258494
     tx_queue_1_packets: 1681254
     tx_queue_1_bytes: 90215932925
     rx_queue_1_packets: 660721
     rx_queue_1_bytes: 43613641
     cpu0_rx_packets: 63692913
     cpu0_rx_bytes: 94669872135
     cpu0_tx_packets: 3986528
     cpu0_tx_bytes: 166237145100
     cpu0_vf_rx_packets: 0
     cpu0_vf_rx_bytes: 0
     cpu0_vf_tx_packets: 0
     cpu0_vf_tx_bytes: 0
     cpu1_rx_packets: 0
     cpu1_rx_bytes: 0
     cpu1_tx_packets: 0
     cpu1_tx_bytes: 0
     cpu1_vf_rx_packets: 0
     cpu1_vf_rx_bytes: 0
     cpu1_vf_tx_packets: 0
     cpu1_vf_tx_bytes: 0

I am not sure if the stop_queue value shows that we hit the customer's issue; let's wait for Jim's test results.

3. Yuxin ran the automation regression test for the scratch build in comment 2 -- PASS, no regression issue found.

4. Triggered automation regression test on hyper-v for the scratch build in comment 2, will update here once done.

Thanks!

(In reply to Huijuan Zhao from comment #5)
> 3. Yuxin ran the automation regression test for the scratch build in comment 2
> -- PASS, no regression issue found.

Automation regression test on Azure -- PASS

> 4. Triggered automation regression test on hyper-v for the scratch build in
> comment 2, will update here once done.
>

Automation regression test on hyper-v 2019 -- PASS, no regression issue found.

@Huijuan Zhao, when the issue occurs, on every call to `ethtool -S` we see that the stop_queue number is one greater than the wake_queue number, e.g.:

stop_queue: 53
wake_queue: 52

The numbers may increment over time, but the indicator is that they are repeatedly seen to be not equal.

It would be amazing if we could see a reproducer for this on the 8.1 kernel and not on the hotfix kernel, however I am not aware that anyone has succeeded in writing a reproducer for this bug.

I think that in https://bugzilla.redhat.com/show_bug.cgi?id=1841900#c5 you are saying that you could NOT reproduce this with either kernel?

Thanks!

(In reply to Jim Minter from comment #7)
> @Huijuan Zhao, when the issue occurs, on every call to `ethtool -S` we see
> that the stop_queue number is one greater than the wake_queue number, e.g.:
>
> stop_queue: 53
> wake_queue: 52
>
> The numbers may increment over time, but the indicator is that they are
> repeatedly seen to be not equal.
>
> It would be amazing if we could see a reproducer for this on the 8.1 kernel
> and not on the hotfix kernel, however I am not aware that anyone has
> succeeded in writing a reproducer for this bug.
>
> I think that in https://bugzilla.redhat.com/show_bug.cgi?id=1841900#c5 you
> are saying that you could NOT reproduce this with either kernel?
>
> Thanks!

@Jim Minter, thanks for the explanation. I misunderstood "stop_queue with a value of 1 + wake_queue", so in comment 5 I did NOT reproduce "stop_queue with a value of 1 + wake_queue" with either kernel. In comment 5, for both kernels, the stop_queue and wake_queue values (> 0) are the same.

Additional info: I tested rhel-7.8 and rhel-8.2, the stop_queue and wake_queue values are 0.

(In reply to Mohammed Gamal from comment #2)
> Here is a brew build for 8.1 that contains the fix in case you need to test
> it
> https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=28933341

Mohammed,

Could you create a non-scratch build of this kernel?

We'd like to have a "real" build in hand before we get final approval to ship this as a hotfix.

(In reply to Ian McLeod from comment #9)
> (In reply to Mohammed Gamal from comment #2)
> > Here is a brew build for 8.1 that contains the fix in case you need to test
> > it
> > https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=28933341
>
> Mohammed,
>
> Could you create a non-scratch build of this kernel?
>
> We'd like to have a "real" build in hand before we get final approval to
> ship this as a hotfix.

Ian,

You might want to wait for the RHEL 8.1.0.z kernel build that is planned for June 8 or 9. See bug 1842485, which is the formal RHEL 8.1.0.z BZ for this fix.

(In reply to Ian McLeod from comment #9)
> (In reply to Mohammed Gamal from comment #2)
> > Here is a brew build for 8.1 that contains the fix in case you need to test
> > it
> > https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=28933341
>
> Mohammed,
>
> Could you create a non-scratch build of this kernel?
>
> We'd like to have a "real" build in hand before we get final approval to
> ship this as a hotfix.

Unfortunately, that's not in my capacity. It's the kernel maintainers who provide "real" builds; I can only provide a scratch build. Looking at BZ#1842485, the patch is merged into the latest z-stream kernel and should be released soon.

@Herton: Is there any way, other than just waiting for z-stream, in which we can provide a build for the OCP team?

The fix "[netdrv] hv_netvsc: fix race that may miss tx queue wakeup" was included through bug 1842485. It's present starting with kernel-4.18.0-147.19.1.el8_1 (which is a non-scratch build). You can check with Norm Murray or Don Howard to provide it as a hotfix.

The build above is scheduled to go with 8.1.z EUS batch 5 (however, we should have one more build with more patches, so the kernel we ship with the batch 5 update will likely be 147.20.1).

Ian (or Jim Minter), as Herton indicates in comment 12, the kernel just provided in bug 1842485 (kernel-4.18.0-147.19.1.el8_1) contains the fix. I don't know if RHOCP/RHCOS is able to take that kernel for testing or if you only want to test with an official "hot-fix". I believe you can request a hot-fix in bug 1842485 by setting the hot-fix? flag. I'm cc'ing Norm Murray and Don Howard for their guidance on requesting a hot-fix.

RHEL 8.1.0.z bug 1842485 is ON_QA. A kernel is available for testing by RHOSP/RHOCP.

Tested with kernel-4.18.0-147.19.1.el8_1 manually. On every call to `ethtool -S` we see that the stop_queue and wake_queue values (> 0) are the same; the numbers increment over time and remain equal, e.g.:

stop_queue: 2
wake_queue: 2

Will trigger the automation regression test when a resource is available later and update here once done.

Jim Minter, as Rick asked in comment 13: is RHOCP/RHCOS able to take kernel-4.18.0-147.19.1.el8_1 for testing, or do you only want to test with an official "hot-fix"?

Thanks!

(In reply to Huijuan Zhao from comment #15)
> Tested with kernel-4.18.0-147.19.1.el8_1 manually. On every call to `ethtool
> -S` we see that the stop_queue and wake_queue values (> 0) are the same; the
> numbers increment over time and remain equal, e.g.:
> stop_queue: 2
> wake_queue: 2
>
> Will trigger the automation regression test when a resource is available
> later and update here once done.
>

Network automation regression test on hyper-v: PASS, no regression issue found.

Moving this BZ to ON_QA to reflect the state of the RHEL 8.1.0.z BZ (bug 1842485) containing the fix.

Huijuan, it's not up to me. I think that imcleod is a better person to ask. The difficulty here is that we don't have a good reproducer; this probably needs to be integrated on faith.

(In reply to Jim Minter from comment #18)
> Huijuan, it's not up to me. I think that imcleod is a better person to ask.
> The difficulty here is that we don't have a good reproducer; this probably
> needs to be integrated on faith.

Jim, thanks for the explanation and information. Yes, I understand.

Ian McLeod, do you have a reproducer to validate the official kernel-4.18.0-147.19.1.el8_1?

Thanks!
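
The symptom check described in comment 7 (stop_queue repeatedly seen to be not equal to wake_queue across `ethtool -S` calls) could be automated roughly as follows. This is only a sketch, not a reproducer for the race: the helper names, the sample count, the polling interval, and the rule that at least two samples are required are assumptions made for illustration. The only external command it runs is `ethtool -S <iface>`.

```python
import subprocess
import time


def parse_ethtool_stats(text):
    """Parse `ethtool -S` output into a dict of counter name -> int."""
    stats = {}
    for line in text.splitlines():
        name, sep, value = line.partition(":")
        value = value.strip()
        # Skip the "NIC statistics:" header and any non-numeric line.
        if sep and value.isdigit():
            stats[name.strip()] = int(value)
    return stats


def queue_gap(stats):
    """Return stop_queue - wake_queue for one sample."""
    return stats["stop_queue"] - stats["wake_queue"]


def looks_stuck(samples):
    """True if every sample shows stop_queue != wake_queue.

    Assumption: at least two samples are required, because a single
    unequal reading may just be a queue that is briefly stopped.
    """
    return len(samples) >= 2 and all(queue_gap(s) != 0 for s in samples)


def sample_interface(iface, count=5, interval=2.0):
    """Collect `count` samples of `ethtool -S <iface>` (hypothetical helper)."""
    samples = []
    for _ in range(count):
        out = subprocess.run(
            ["ethtool", "-S", iface],
            check=True, capture_output=True, text=True,
        ).stdout
        samples.append(parse_ethtool_stats(out))
        time.sleep(interval)
    return samples


if __name__ == "__main__":
    # Counter values copied from the comments in this bug, not read
    # from a live system.
    equal = parse_ethtool_stats(
        "NIC statistics:\n     stop_queue: 13\n     wake_queue: 13\n")
    unequal = parse_ethtool_stats(
        "NIC statistics:\n     stop_queue: 53\n     wake_queue: 52\n")
    print(looks_stuck([equal, equal]))
    print(looks_stuck([unequal, unequal]))
```

On a guest, `looks_stuck(sample_interface("eth1"))` would apply the same check to live counters. A True result only matches the symptom as described in this thread; it does not by itself prove that the tx queue wakeup race was hit.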