Bug 1921601
| Field | Value |
|---|---|
| Summary | [RHEL8.2rt] sosreport hang in rt-kvm if the isolated vCPU of Guest fully utilised 100% |
| Product | Red Hat Enterprise Linux 8 |
| Component | kernel-rt |
| kernel-rt sub component | KVM |
| Version | 8.2 |
| Target Milestone | rc |
| Target Release | 8.4 |
| Hardware | x86_64 |
| OS | Linux |
| Status | CLOSED WONTFIX |
| Severity | low |
| Priority | low |
| Keywords | Reopened, Triaged |
| Type | Bug |
| Reporter | Pradipta Kumar Sahoo <psahoo> |
| Assignee | Virtualization Maintenance <virt-maint> |
| QA Contact | Pei Zhang <pezhang> |
| CC | atheurer, bhu, daolivei, jinzhao, jlelli, juzhang, lcapitulino, mtosatti, nilal, peterx, pezhang, virt-maint, williams |
| Bug Depends On | 1983167 |
| Bug Blocks | 1825271 |
| Last Closed | 2022-01-23 07:26:58 UTC |
| Attachments | VM XML in OSP 16.1 (attachment 1768106) |
Description (Pradipta Kumar Sahoo, 2021-01-28 09:46:39 UTC)
Hi Pradipta, Based on the guest kernel cmdline, I am assuming you are using vIOMMU and hence a userspace IOAPIC. Can you please confirm whether you are setting the emulator thread sched priority to fifo:1? Thanks

Pradipta, another thing that I missed is the kernel version. It looks like you are running an older 8.2.z kernel. Please upgrade the kernel to batch#5 or the latest candidate, as some important RT-specific kernel fixes were included in batch#5.

I tried to reproduce this issue on rhel8.4 (without osp16.1), but didn't trigger this issue.

Versions:
4.18.0-279.rt7.44.el8.x86_64
openvswitch2.13-2.13.0-82.el8fdp.x86_64
tuned-2.15.0-1.el8.noarch
qemu-kvm-5.2.0-3.scrmod+el8.4.0+9533+d0097f55.wrb210120.x86_64
libvirt-7.0.0-2.scrmod+el8.4.0+9533+d0097f55.x86_64
dpdk-20.11-1.el8.x86_64

No hang when running "sosreport --batch" with the RT guest running testpmd.

Somehow a needinfo got set up for Pei which is not required, so clearing it. Thanks

Hi Nitesh, As suggested in comment #10, I updated the kernel on both the Compute (rt-kvm) and the Guest (rt-kernel), but no luck. The sosreport issue is still reproducible while the vCPU is engaged with the PMD cycle. Please let me know if you would like to take a look at my setup for further investigation.

DUT (rt-kvm):
# uname -r
4.18.0-240.14.1.rt7.68.el8_3.x86_64

Guest:
# uname -r
4.18.0-240.14.1.rt7.68.el8_3.x86_64

Hi Pradipta, Just wanted to follow up: are you still triggering the issue? If so, can you please share the access details of your setup for investigation? Thanks

Hi Nitesh, Sorry for my late response, I was busy with pbench tests. I have reproduced the issue on the RHEL 8.2, 8.3 and 8.4 real-time kernels.

# uname -r
4.18.0-287.rt7.52.el8.x86_64

This issue is affecting our pbench tool's collection of system SOS data. Below is the sos command that got stuck on the RT-KVM host while a vCPU inside the rt-guest was busy with testpmd.

root 173230 3.1 0.0 1023012 43784 ? Sl 04:38 0:07 /usr/libexec/platform-python -s /usr/sbin/sosreport -o date -o host -o release -o kernel -o filesys -o devicemapper -o system -o memory -o hardware -o networking -o block -o processor -o collectd -o tuned --batch --quiet --tmp-dir=/var/lib/pbench-agent/trafficgen_RHOS-16.2-RHEL-8.4Beta-KERNEL-4.18.0-287.rt7.52.el8.x86_64-SRIOV-1Q-PVP-Loss-LAT_tg:trex-profile_pf:profile.json_ml:0.0005_tt:os_2021-02-16T04:38:34/sysinfo/beg/nfv-compute-rt-offload-0 --name pbench-nfv-compute-rt-offload-0.localdomain

Although this issue is not a blocker for our tests, it would be nice if we could find the root cause for a further solution/workaround. I am going to schedule a 30-minute meeting with you for further investigation. -- Pradipta

(In reply to Pradipta Kumar Sahoo from comment #16)
> Sorry for my late response, I was busy with pbench tests. I have reproduced
> the issue on the RHEL 8.2, 8.3 and 8.4 real-time kernels.

That's perfectly fine, thank you for testing. Was the 8.4 testing also done in an OSP environment? I am assuming that the reason Pei has not been able to reproduce this with 8.4 is that she was not testing on an OSP-based deployment.

Can you also share the testpmd cmdline that you run in the KVM-RT guest?

> I am going to schedule a 30-minute meeting with you for further investigation.

I am assuming this meeting is about walking through the steps required to reproduce the issue. For the investigation I would still need access to your machine, at least for the next few days. If you are using this box for some other testing, then maybe you can reproduce this on some other machine from Beaker.

I synced up with Pradipta yesterday; he is going to reproduce the issue on another Beaker machine and will share access for further investigation. @Pradipta, in case you are overloaded with other testing efforts, please let us know; we can check whether Pei has the bandwidth to reproduce this in her OSP environment, although that may take some time as she is currently busy with other activities. Thanks

Hi Nitesh, Building another OSP rt-kernel environment is taking more time, as the NIC templates and configurations are not identical between the Beaker and other scale-lab servers. I plan to share my environment after the latency test. I will share the environment details with you once the current tests are completed (~1 week). BR, Pradipta

Hi Pradipta, Thanks for the update.

Hi Pradipta, Did you complete your testing in the environment? Thanks

Hi Pradipta, Any new update on the environment or the reproducibility of the issue? Thanks

Hi Nitesh, I am occupied with the latency measurement test; we are hitting higher packet latency in the same environment, where we run the latest RHEL 8.4 rt-kernel. Please allow us some time. The issue is still reproducible in my environment irrespective of the rt-kernel version. Please let me know if you need any specific information for further investigation so I can collect it; otherwise I can schedule a meeting with you for a further sync-up to reproduce the issue. BR, Pradipta

Hi Pradipta, It would be faster if I can get access to the environment for a few days for further investigation of the issue. One thing that you can check, if you have not done it already, is dmesg after the sosreport command gets stuck. If there is an RCU stall then we can capture a vmcore for further investigation.

A couple of other experiments that we can do are:

- Try bumping up the priority of the sosreport command/process by using chrt
- Using stalld

Have you tried either of those? Since you are occupied, let's also check with Pei whether she was able to reproduce it at her end. @pezhang did you try reproducing this issue with an OSP deployment? Thanks, Nitesh
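For reference, both experiments can be tried with standard tooling; a minimal sketch (the PID lookup and the FIFO priority value are illustrative, not values prescribed in this report):

~~~
# Experiment 1: bump the stuck sosreport process to SCHED_FIFO priority 2 (illustrative value)
chrt --fifo -p 2 "$(pgrep -of sosreport)"

# Experiment 2: let stalld detect starved tasks and boost them automatically
yum install -y stalld
systemctl enable --now stalld.service
~~~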
(In reply to Nitesh Narayan Lal from comment #27)
> @pezhang did you try reproducing this issue with an OSP deployment?

Hello Nitesh, I have the OSP 16.1 + rhel8.2.z deployment ready. However, I still failed to reproduce this issue. The compute node does not hang when running "sosreport --batch" while the VM is running testpmd.

In the VM:

Port statistics ====================================
######################## NIC statistics for port 0 ########################
RX-packets: 11029174  RX-missed: 0  RX-bytes: 772042576
RX-errors: 0
RX-nombuf: 0
TX-packets: 11029617  TX-errors: 0  TX-bytes: 772073190
Throughput (since last show)
Rx-pps: 75711  Rx-bps: 42398264
Tx-pps: 75712  Tx-bps: 42398824
############################################################################
######################## NIC statistics for port 1 ########################
RX-packets: 11029675  RX-missed: 0  RX-bytes: 772077250
RX-errors: 0
RX-nombuf: 0
TX-packets: 11029220  TX-errors: 0  TX-bytes: 772045400
Throughput (since last show)
Rx-pps: 75723  Rx-bps: 42405368
Tx-pps: 75721  Tx-bps: 42404248
############################################################################

On the compute node:

[heat-admin@overcloud-computerealtime-0 ~]$ top
top - 14:26:28 up 6:25, 2 users, load average: 4.71, 2.17, 1.29
Tasks: 813 total, 3 running, 810 sleeping, 0 stopped, 0 zombie
%Cpu(s): 5.5 us, 4.6 sy, 0.0 ni, 88.7 id, 0.0 wa, 0.0 hi, 1.2 si, 0.0 st
MiB Mem : 63939.1 total, 20074.4 free, 43139.2 used, 725.6 buff/cache
MiB Swap: 0.0 total, 0.0 free, 0.0 used.
20514.0 avail Mem PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 72870 qemu 20 0 11.0g 321640 46672 S 209.3 0.5 13:32.92 qemu-kvm 99320 root 20 0 0 0 0 R 80.8 0.0 0:20.16 vhost-72870 99582 root 20 0 0 0 0 R 79.5 0.0 0:19.93 vhost-72870 2096 openvsw+ 10 -10 565160 66240 34636 S 12.3 0.1 0:33.77 ovs-vswitchd 4509 root 20 0 1770624 33396 19332 S 2.0 0.1 0:48.23 libvirtd 14807 heat-ad+ 20 0 161180 4916 3572 S 1.3 0.0 0:03.09 sshd 96081 root 20 0 257816 15724 13884 S 1.3 0.0 0:02.61 virsh 4371 root 20 0 177028 15512 13812 S 1.0 0.0 0:04.81 virtlogd 100241 heat-ad+ 20 0 68228 5668 4140 R 0.7 0.0 0:00.79 top 11 root -2 0 0 0 0 I 0.3 0.0 0:03.74 rcu_preempt 65 root -5 0 0 0 0 S 0.3 0.0 0:02.78 rcuc/6 82 root -5 0 0 0 0 S 0.3 0.0 0:02.87 rcuc/8 83 root -3 0 0 0 0 S 0.3 0.0 0:15.73 ksoftirqd/8 2215 root -51 0 0 0 0 S 0.3 0.0 0:03.95 irq/85-ens3f0-T 3123 root 20 0 101120 14588 8860 S 0.3 0.0 0:23.75 snmpd 4174 42435 20 0 311792 99236 14780 S 0.3 0.2 0:08.24 /usr/bin/python 87615 root 20 0 0 0 0 I 0.3 0.0 0:00.76 kworker/u64:0-events_unbound 99699 root 20 0 0 0 0 I 0.3 0.0 0:00.24 kworker/u64:2-ixgbe 1 root 20 0 245968 13524 7968 S 0.0 0.0 0:23.12 systemd 2 root 20 0 0 0 0 S 0.0 0.0 0:00.03 kthreadd 3 root 0 -20 0 0 0 I 0.0 0.0 0:00.00 rcu_gp 4 root 0 -20 0 0 0 I 0.0 0.0 0:00.00 rcu_par_gp 9 root 0 -20 0 0 0 I 0.0 0.0 0:00.00 mm_percpu_wq 10 root -3 0 0 0 0 S 0.0 0.0 0:06.80 ksoftirqd/0 12 root -5 0 0 0 0 S 0.0 0.0 0:00.00 rcub/1 13 root -5 0 0 0 0 S 0.0 0.0 0:02.32 rcuc/0 14 root rt 0 0 0 0 S 0.0 0.0 0:00.00 posixcputmr/0 15 root rt 0 0 0 0 S 0.0 0.0 0:00.01 migration/0 16 root 20 0 0 0 0 P 0.0 0.0 0:00.00 watchdog/0 17 root 20 0 0 0 0 S 0.0 0.0 0:00.00 cpuhp/0 18 root 20 0 0 0 0 S 0.0 0.0 0:00.00 cpuhp/1 19 root 20 0 0 0 0 P 0.0 0.0 0:00.00 watchdog/1 20 root rt 0 0 0 0 S 0.0 0.0 0:00.00 migration/1 21 root rt 0 0 0 0 S 0.0 0.0 0:00.00 posixcputmr/1 22 root -5 0 0 0 0 S 0.0 0.0 0:00.00 rcuc/1 23 root -3 0 0 0 0 S 0.0 0.0 0:00.00 ksoftirqd/1 24 root 20 0 0 0 0 I 0.0 0.0 0:00.00 kworker/1:0-mm_percpu_wq 25 root 0 -20 0 0 0 I 0.0 0.0 0:00.00 kworker/1:0H [heat-admin@overcloud-computerealtime-0 ~]$ sosreport --batch sosreport (version 3.8) WARNING: unable to set option for disabled or non-existing plugin (process) WARNING: unable to set option for disabled or non-existing plugin (process) This command will collect diagnostic and configuration information from this Red Hat Enterprise Linux system and installed applications. An archive containing the collected information will be generated in /var/tmp/sos.avq7srox and may be provided to a Red Hat support representative. Any information provided to Red Hat will be treated in accordance with the published support policies at: https://access.redhat.com/support/ The generated archive may contain data considered sensitive and its content should be reviewed by the originating organization before being passed to any third party. No changes will be made to system configuration. Setting up archive ... Setting up plugins ... Not all environment variables set. Source the environment file for the user intended to connect to the OpenStack environment. Running plugins. Please wait ... Starting 1/1 openstack_aodh [Running: openstack_aodh] Finished running plugins Creating compressed archive... Your sosreport has been generated and saved in: /var/tmp/sosreport-overcloud-computerealtime-0-2021-03-31-zjlcmzb.tar.xz The checksum is: 57022b4d60d5eb108064510ed11cf809 Please send this file to your support representative. Created attachment 1768106 [details]
VM XML in OSP 16.1
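The CPU-tuning settings being compared in the next comments (vcpupin, emulatorpin, and the fifo:1 scheduler entries asked about at the top of this report) can be pulled straight from a running domain with virsh; a small sketch, using the instance name that appears later in this report:

~~~
# Dump only the <cputune> section of the guest XML for comparison
virsh dumpxml instance-00000002 | awk '/<cputune>/,/<\/cputune>/'
~~~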
Thank you Pei for testing. Pradipta, since Pei is failing to reproduce the issue, I am wondering whether some misconfiguration is causing it in your environment. Can you please share your KVM-RT XML so that we can compare it with Pei's configuration? Thanks

Thanks, Pei, for reproducing the issue and for sharing the environment. So, as I suspected, this is a typical case of starvation caused by the testpmd application, which occupies 100% of a guest vCPU running with SCHED_FIFO. Although I am not fully aware of all the actions performed by the sosreport command, I would assume it involves syscalls, workqueue workers, etc. that can be blocked/starved when they try to run on the isolated CPUs occupied by the testpmd application. Clear evidence for this is in the dmesg of both the non-debug and the debug rt-kernel:

Non-debug kernel:
----------------
[169350.668608] INFO: task sysctl:2345101 blocked for more than 600 seconds.
[169350.676180]  Not tainted 4.18.0-304.rt7.71.el8.x86_64 #1
[169350.682686] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[169350.691519] sysctl  D  0  2345101  2345099  0x00084084
[169350.698026] Call Trace:
[169350.700850]  __schedule+0x34d/0x880
[169350.704838]  schedule+0x56/0xe0
[169350.708437]  schedule_timeout+0x250/0x400
[169350.713005]  ? preempt_count_add+0x68/0xa0
[169350.717670]  ? _raw_spin_lock+0x13/0x40
[169350.722043]  ? _raw_spin_unlock_irqrestore+0x20/0x60
[169350.727679]  wait_for_completion+0x85/0xe0
[169350.732344]  flush_work+0x123/0x1d0
[169350.736331]  ? flush_workqueue_prep_pwqs+0x130/0x130
[169350.741966]  ? __queue_work+0x1aa/0x590
[169350.746340]  schedule_on_each_cpu+0xa7/0xe0
[169350.751103]  vmstat_refresh+0x21/0xa0
[169350.755283]  proc_sys_call_handler+0x1ad/0x1c0
[169350.760336]  vfs_read+0x91/0x140
[169350.764031]  ksys_read+0x52/0xc0
[169350.767728]  do_syscall_64+0x87/0x1a0
[169350.771908]  entry_SYSCALL_64_after_hwframe+0x65/0xca
[169350.777639] RIP: 0033:0x7f900bca85a5
[169350.781723] Code: Unable to access opcode bytes at RIP 0x7f900bca857b.
[169350.789102] RSP: 002b:00007fff3275fc18 EFLAGS: 00000246 ORIG_RAX: 0000000000000000

Debug kernel:
------------
[  840.792750] Using feature eBPF/rawtrace.
[  941.338161] megaraid_sas 0000:01:00.0: Firmware crash dump is not available
[  945.692390] megaraid_sas 0000:01:00.0: Firmware crash dump is not available
[ 1872.664007] INFO: task sosreport:7544 blocked for more than 600 seconds.
[ 1872.671491]  Tainted: G  W  --------- -  - 4.18.0-304.rt7.71.el8.x86_64+debug #1
[ 1872.681393] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1872.690132] sosreport  D23504  7544  2765  0x00084080
[ 1872.696266] Call Trace:
[ 1872.699004]  __schedule+0x91d/0x2320
[ 1872.702999]  ? __sched_text_start+0x8/0x8
[ 1872.707477]  ? kernel_text_address+0x125/0x140
[ 1872.712437]  ? __kernel_text_address+0xe/0x30
[ 1872.717301]  ? unwind_get_return_address+0x56/0xa0
[ 1872.722642]  ? stack_trace_save+0xb0/0xb0
[ 1872.727179]  schedule+0x101/0x290
[ 1872.730882]  schedule_timeout+0x6f6/0xc90
[ 1872.735360]  ? usleep_range+0x130/0x130
[ 1872.739645]  ? do_profile_hits.isra.5.cold.13+0x2d/0x2d
[ 1872.745482]  ? lock_release+0x4a1/0xb60
[ 1872.749763]  ? lock_release+0x4a1/0xb60
....

The simple solution here is to use stalld, as I suggested in comment 27. Stalld allows runnable tasks that are waiting/blocked behind a thread occupying 100% of a CPU to run with a SCHED_DEADLINE runtime/period.
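The boost that stalld applies automatically can also be given to a single starved task by hand with chrt, which is sometimes useful to check whether boosting alone unblocks the hang; a minimal sketch (the PID is a placeholder, and the runtime/period mirror the stalld parameters visible in the systemctl output later in this report):

~~~
# Give one starved task (e.g. a kworker stuck behind the FIFO vCPU thread)
# a SCHED_DEADLINE reservation of 20us per 1s period, like stalld's -r 20000 -p 1000000000.
PID=12345   # example PID; use the task reported as starved in dmesg/stalld logs
chrt --deadline --sched-runtime 20000 --sched-deadline 1000000000 \
     --sched-period 1000000000 -p 0 "$PID"
~~~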
We can see stalld boosting the tasks in its logs:

Apr 15 13:06:11 dell-per430-11.lab.eng.pek2.redhat.com stalld[12425]: cpu 14 had 301682 idle time, and now has 302182
Apr 15 13:06:11 dell-per430-11.lab.eng.pek2.redhat.com stalld[12425]: cpu 15 had 301640 idle time, and now has 302140
Apr 15 13:06:11 dell-per430-11.lab.eng.pek2.redhat.com stalld[12425]: cpu 16 had 294651 idle time, and now has 295142
Apr 15 13:06:11 dell-per430-11.lab.eng.pek2.redhat.com stalld[12425]: cpu 17 had 301678 idle time, and now has 302179
Apr 15 13:06:11 dell-per430-11.lab.eng.pek2.redhat.com stalld[12425]: cpu 18 had 299678 idle time, and now has 300177
Apr 15 13:06:11 dell-per430-11.lab.eng.pek2.redhat.com stalld[12425]: cpu 19 had 301677 idle time, and now has 302177
Apr 15 13:06:13 dell-per430-11.lab.eng.pek2.redhat.com stalld[12425]: stalld: turbostat-21492 starved on CPU 1 for 33 seconds
Apr 15 13:06:13 dell-per430-11.lab.eng.pek2.redhat.com stalld[12425]: stalld: boosted pid 21492 using SCHED_DEADLINE
Apr 15 13:06:13 dell-per430-11.lab.eng.pek2.redhat.com stalld[12425]: turbostat-21492 starved on CPU 1 for 33 seconds
Apr 15 13:06:13 dell-per430-11.lab.eng.pek2.redhat.com stalld[12425]: boosted pid 21492 using SCHED_DEADLINE

Note: CPU 1 is the same CPU where the guest vCPU running the testpmd thread is pinned. With stalld, I did not see a hang in Pei's environment in 5/5 runs.

@Pradipta can you please try to reproduce the issue in your environment with stalld?

Pradipta, to install and enable stalld on 8.4, please do the following:

# yum install -y stalld
# systemctl enable stalld.service
# systemctl start stalld.service

Then try running the testpmd and "sosreport --batch" commands. Thanks

As per comment 16 from Pradipta, this issue is not a blocker. Hence, reducing the priority and severity of the bug. I will still be looking into it to identify the exact process/thread that is trying to enqueue jobs on the isolated CPUs, resulting in this hang, and that only when we are using a PF. Thanks

(In reply to Nitesh Narayan Lal from comment #42)
> I will still be looking into it to identify the exact process/thread
> that is trying to enqueue jobs on the isolated CPUs, resulting in this
> hang, and that only when we are using a PF.

Nitesh, I think it's easy to reproduce the problem as follows:

1) Boot the host (or a guest VM on a host without realtime...) with isolcpus=CPULIST.
2) Run oslat on some CPU(s) of CPULIST, with FIFO priority.
3) echo 1 > /proc/sys/vm/stat_refresh

(no need to investigate the PF/non-PF issue, IMHO). Agree it's low priority.
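This recipe can be condensed into a self-contained sketch that uses a plain SCHED_FIFO busy loop in place of oslat (CPU numbers are illustrative, and the echo blocks for as long as the hog monopolises the isolated CPU, which is exactly the hang shown in the traces above):

~~~
# Assumes the kernel was booted with isolcpus=1 (or a similar CPULIST).
# Start a 100% SCHED_FIFO hog on the isolated CPU, standing in for oslat/testpmd:
chrt --fifo 1 taskset -c 1 sh -c 'while :; do :; done' &

# vmstat_refresh queues work on every CPU via schedule_on_each_cpu(),
# so this write blocks behind the starved per-CPU kworker on CPU 1:
echo 1 > /proc/sys/vm/stat_refresh

# Clean up the hog afterwards (from another shell if the echo is still blocked):
kill %1
~~~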
(In reply to Marcelo Tosatti from comment #43)
> (no need to investigate the PF/non-PF issue, IMHO).

I think it is good to at least understand what is happening there with PF and without PF so that we don't ignore any issue.

(In reply to Nitesh Narayan Lal from comment #39)
> Pradipta, to install and enable stalld on 8.4, please do the following:
> # yum install -y stalld
> # systemctl enable stalld.service
> # systemctl start stalld.service

Hi Nitesh/Pei, Thank you so much for being able to reproduce the issue. As my environment is broken due to a network issue, please allow me some time to bring it up. I will install the stalld service and rerun the sosreport during the workload. I will update you with my findings soon. BR, Pradipta

Hi Nitesh, I installed the stalld package and started the service on the OSP compute node. I intermittently reproduced the issue in scenario 2.

Scenario 1:
1. Run testpmd inside the rt-guest VM.
2. Run "sosreport --batch" on the compute node.
3. It took ~4-5 min to finish all the sos plugins, and it was able to generate the sosreport.

Scenario 2:
1. Run testpmd inside the rt-guest VM and push traffic through it.
2. Run "sosreport --batch --all-logs" on the compute node.
3. The first sample test failed with a processor plugin timeout and the ssh connection broke:
~~~
Starting 129/129 yum
[Running: logs processor xfs yum]
Finishing plugins [Running: logs processor yum]
Finishing plugins [Running: logs processor]
Finishing plugins [Running: processor]

Plugin processor timed out
client_loop: send disconnect: Broken pipe
~~~
4. In the second sample we still noticed the processor plugin timeout, but this time it created the sosreport archive:
~~~
Starting 129/129 yum
[Running: logs processor systemd yum]
Finishing plugins [Running: logs processor yum]
Finishing plugins [Running: logs processor]
Finishing plugins [Running: processor]

Plugin processor timed out
Creating compressed archive...

Your sosreport has been generated and saved in:
    /var/tmp/sosreport-nfv-compute-rt-offload-0-2021-05-06-duotxba.tar.xz

 Size   556.60MiB
 Owner  root
 md5    077b195b6e2072f87a2353bd23ea4aa7

Please send this file to your support representative.
~~~
5. The stalld log shows the starved tasks being boosted, and I didn't find any issue there. But is there any further tuning required in stalld to avoid the plugin timeout in sosreport?
~~~
May 6 05:41:42 nfv-compute-rt-offload-0 stalld[838604]: turbostat-885072 starved on CPU 8 for 30 seconds
May 6 05:41:42 nfv-compute-rt-offload-0 stalld[838604]: boosted pid 885072 using SCHED_DEADLINE
May 6 05:41:45 nfv-compute-rt-offload-0 stalld[838604]: turbostat-892562 starved on CPU 8 for 33 seconds
May 6 05:41:45 nfv-compute-rt-offload-0 stalld[838604]: boosted pid 892562 using SCHED_DEADLINE
May 6 05:41:48 nfv-compute-rt-offload-0 stalld[838604]: turbostat-897929 starved on CPU 8 for 36 seconds
May 6 05:41:48 nfv-compute-rt-offload-0 stalld[838604]: boosted pid 897929 using SCHED_DEADLINE
May 6 05:42:07 nfv-compute-rt-offload-0 stalld[838604]: turbostat-897929 might starve on CPU 34 (waiting for 15 seconds)
May 6 05:42:22 nfv-compute-rt-offload-0 stalld[838604]: turbostat-897929 starved on CPU 34 for 30 seconds
May 6 05:42:22 nfv-compute-rt-offload-0 stalld[838604]: boosted pid 897929 using SCHED_DEADLINE
~~~
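If the immediate goal is only to unblock collection while the starvation is investigated, sos can be told to skip the plugin that keeps timing out; a sketch, assuming the processor plugin is the one being starved and that losing its data is acceptable for the run:

~~~
# Skip the processor plugin for this collection only
sosreport --batch --all-logs -n processor
~~~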
(In reply to Pradipta Kumar Sahoo from comment #49)
> But is there any further tuning required in stalld to avoid the plugin
> timeout in sosreport?

Hi Pradipta, Thank you for testing. We can try a few things with the stalld configuration; however, before doing that, can you tell me what kind of impact you observed on the testpmd performance numbers with stalld enabled? Having said that, scenario 2 does indicate that fixing these issues at the source is actually the better way to do things.

Hi Nitesh, Apologies for my late response. Our performance tool "pbench-trafficgen" usually collects the SOS report from the required nodes before the beginning of the network test, so earlier during the test our tool got stuck while collecting the sosreport of the rt-compute node whose pCPUs were reserved for the PMD threads inside the guest. After enabling stalld on the compute node (DUT), we noticed that the trafficgen tool no longer gets stuck at sosreport collection. Also, we received similar performance reports with and without stalld on the compute node. Please find the results below, and share any best practices for using stalld in FDP performance tests with the rt-kernel.

VM (TestPMD) information:
------------------------

# virsh vcpupin instance-00000002
 VCPU   CPU Affinity
----------------------
 0      26
 1      32
 2      22
 3      28
 4      34   --->> PMD Core1
 5      8    --->> PMD Core2
 6      30
 7      10
 8      16
 9      6
 10     12
 11     18
 12     24
 13     14

testpmd -l 3,4,5 -n 4 --huge-dir=/dev/hugepages -w 0000:00:05.0 -w 0000:00:06.0 --socket-mem 10000,0 -- --nb-cores=2 --forward-mode=io --rxd=2048 --txd=2048 --rxq=1 --txq=1 --burst=64 --mbcache=512 -a --rss-udp --disable-crc-strip --stats-period 1

Network Performance Test with Stalld Service:
---------------------------------------------
1. Started the stalld service on the DUT:

[root@nfv-compute-rt-offload-0 ~]# systemctl status stalld
  stalld.service - Stall Monitor
    Loaded: loaded (/usr/lib/systemd/system/stalld.service; enabled; vendor preset: disabled)
    Active: active (running) since Mon 2021-05-24 06:04:12 UTC; 1 day 21h ago
  Main PID: 1480 (stalld)
     Tasks: 2 (limit: 768443)
    Memory: 1.6M
    CGroup: /system.slice/stalld.service
            └─1480 /usr/bin/stalld --systemd -p 1000000000 -r 20000 -d 3 -t 30 --foreground --pidfile /run/stalld.pid

2. Ran the trafficgen test with sosreport collection enabled for all the nodes, including the DUT and the TestPMD VMs.
During the execution, it successfully collected the sosreport before starting the trafficgen test.

3. It looks like the stalld service always keeps monitoring and continuously takes actions ("turbostat starved", "boosted", "kworker starved") on the pCPUs that are pinned to the PMD processes inside the guest:

# grep stalld /var/log/messages
May 26 02:52:03 nfv-compute-rt-offload-0 stalld[1480]: turbostat-476138 might starve on CPU 8 (waiting for 15 seconds)
May 26 02:52:18 nfv-compute-rt-offload-0 stalld[1480]: turbostat-476138 starved on CPU 8 for 30 seconds
May 26 02:52:18 nfv-compute-rt-offload-0 stalld[1480]: boosted pid 476138 using SCHED_DEADLINE
May 26 02:52:56 nfv-compute-rt-offload-0 stalld[1480]: turbostat-476138 starved on CPU 8 for 30 seconds
May 26 02:52:56 nfv-compute-rt-offload-0 stalld[1480]: boosted pid 476138 using SCHED_DEADLINE
May 26 02:53:34 nfv-compute-rt-offload-0 stalld[1480]: turbostat-476138 starved on CPU 8 for 30 seconds
May 26 02:53:34 nfv-compute-rt-offload-0 stalld[1480]: boosted pid 476138 using SCHED_DEADLINE
May 26 02:53:52 nfv-compute-rt-offload-0 stalld[1480]: turbostat-479184 starved on CPU 8 for 33 seconds
May 26 02:53:52 nfv-compute-rt-offload-0 stalld[1480]: boosted pid 479184 using SCHED_DEADLINE
May 26 02:54:13 nfv-compute-rt-offload-0 stalld[1480]: turbostat-479184 might starve on CPU 34 (waiting for 15 seconds)
May 26 02:54:15 nfv-compute-rt-offload-0 stalld[1480]: kworker/8:1-719 starved on CPU 8 for 33 seconds
May 26 02:54:15 nfv-compute-rt-offload-0 stalld[1480]: boosted pid 719 using SCHED_DEADLINE
May 26 02:54:18 nfv-compute-rt-offload-0 stalld[1480]: turbostat-476138 starved on CPU 8 for 36 seconds
May 26 02:54:18 nfv-compute-rt-offload-0 stalld[1480]: boosted pid 476138 using SCHED_DEADLINE
May 26 02:54:28 nfv-compute-rt-offload-0 stalld[1480]: turbostat-479184 starved on CPU 34 for 30 seconds
May 26 02:54:28 nfv-compute-rt-offload-0 stalld[1480]: boosted pid 479184 using SCHED_DEADLINE
May 26 02:54:56 nfv-compute-rt-offload-0 stalld[1480]: turbostat-476138 starved on CPU 8 for 30 seconds
May 26 02:54:56 nfv-compute-rt-offload-0 stalld[1480]: boosted pid 476138 using SCHED_DEADLINE

4. We collected the network test results with stalld, where we received a 42 Mpps line rate. Please find the detailed results below for further reference, including the sosreport:

http://pbench.perf.lab.eng.bos.redhat.com/results/perf122.perf.lab.eng.bos.redhat.com/nfv-osp16.2-rhel8.4rt-trial/trafficgen_RHOS-16.2-RHEL-8.4-KERNEL-4.18.0-296.rt7.63.el8.x86_64-OVS-HW-OFFLOAD-1Q-PVP-BS_tg:trex_r:none_fs:64_nf:1024_fm:si_td:bi_ml:0.002_tt:bs_2021-05-26T02:47:07/

Network Performance Test without Stalld Service:
------------------------------------------------
1. Stopped the stalld service on the DUT:

[root@nfv-compute-rt-offload-0 ~]# systemctl status stalld
  stalld.service - Stall Monitor
    Loaded: loaded (/usr/lib/systemd/system/stalld.service; enabled; vendor preset: disabled)
    Active: inactive (dead) since Wed 2021-05-26 04:03:47 UTC; 3s ago
   Process: 550006 ExecStopPost=/usr/bin/throttlectl on (code=exited, status=0/SUCCESS)
   Process: 1480 ExecStart=/usr/bin/stalld --systemd $CLIST $AGGR $BP $BR $BD $THRESH $LOGGING $FG $PF (code=exited, status=0/SUCCESS)
  Main PID: 1480 (code=exited, status=0/SUCCESS)

2. Ran the trafficgen test with sosreport collection disabled, since the sosreport command is expected to hang while testpmd is running inside the guest.
3. We also received a 42 Mpps traffic rate without the stalld service on the DUT:
http://pbench.perf.lab.eng.bos.redhat.com/results/perf122.perf.lab.eng.bos.redhat.com/nfv-osp16.2-rhel8.4rt-trial/trafficgen_RHOS-16.2-RHEL-8.4-KERNEL-4.18.0-296.rt7.63.el8.x86_64-OVS-HW-OFFLOAD-1Q-PVP-BS_tg:trex_r:none_fs:64_nf:1024_fm:si_td:bi_ml:0.002_tt:bs_2021-05-26T04:06:44/

Hence, so far we have not noticed any performance impact from using the stalld service on the rt-compute node.

Regards,
Pradipta

Hi Pradipta, Thank you for sharing a detailed report. Yes, stalld remains active in the background. Whenever it finds a starving thread in the run queue, it allows that thread to make progress by giving it a few time slices, as per the configuration in /etc/sysconfig/stalld. By default, the boost run time (BR) is set to 10us.

Ideally, we would want these issues to be fixed in the kernel so that these threads never try to run anything on isolated CPUs in the first place, because having stalld enabled for a latency-sensitive workload adds overhead. Marcelo is already working on one of the issues, but it will take time to first get it upstream and then to get that fix downstream. In the meantime, for this particular use case, I think we can use stalld as an acceptable solution, specifically because we are not seeing any impact on the performance. @Pradipta do you agree with the above?

In terms of best practices, I think the stalld configuration may change for different workloads, but as long as the default configuration is not impacting the results we should stick with it (IMHO). @Clark do you have any suggestions about best practices when enabling stalld for FDP performance tests with the rt-kernel? Thanks Nitesh

(In reply to Nitesh Narayan Lal from comment #52)
> @Clark do you have any suggestions about best practices when enabling
> stalld for FDP performance tests with the rt-kernel?

Nitesh, my only concern currently is that dropping the boost runtime to 10us may push us into a diminishing-returns situation, where the overhead of switching cuts into the runtime so that the boosted task only gets a few cycles to make progress. A boosted thread gets 3 boosts, so with the current settings that means 10us every second, for 30us of total boost time. If the thread doesn't complete its work in that period it will be stalled again and will have to repeat the detection and boost phases from stalld.

I'd say monitor the stalld activity and, if we see it boosting the same thread all the time, change the boost runtime to 15us and see if that helps.

(In reply to Clark Williams from comment #53)
> I'd say monitor the stalld activity and, if we see it boosting the same
> thread all the time, change the boost runtime to 15us and see if that helps.

Ah, that's an excellent point, and looking at the log it does look like stalld is boosting turbostat multiple times.
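Clark's suggestion maps to the boost runtime that stalld receives through the $BR variable referenced in the ExecStart line shown above; a sketch of the change (the exact variable syntax in /etc/sysconfig/stalld is an assumption based on that unit file):

~~~
# /etc/sysconfig/stalld: raise the SCHED_DEADLINE boost runtime from 10us to 15us
BR="-r 15000"    # nanoseconds, passed to stalld's -r option

# restart the service so the new value takes effect
systemctl restart stalld.service
~~~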
@Pradipta, what we can try is to set the BR value in /etc/sysconfig/stalld to 15000ns and see if that helps in reducing the number of times stalld boosts the turbostat process. Also, please see whether there is any performance degradation because of this in your test case. There are a few possibilities here:

- If there is no performance degradation with a 15us BR and the starving threads are not getting boosted multiple times, then we can use that stalld configuration in your testing.
- If somehow there is a performance impact because of it, then perhaps we will have to stick with 10us.
- If there is no performance degradation but the threads are still getting boosted multiple times, I would say to also try 20us (20000ns) while you are at it.

Please let me know what you think. Thanks

Hi Nitesh/Clark, Regarding dropping the boost time to 10us (highlighted in comment #53), I believe it is important for us to revalidate the traffic test with rt-kvm using the stalld service. In my last test I executed the traffic test for a 10-minute duration, which is not sufficient to validate traffic performance with rt-kvm using stalld. Also, I recently upgraded my lab environment to the latest 8.4 kernel and have yet to redeploy the environment with the latest rt-kernel. Due to other time-bound activity, we will re-execute the test as soon as the lab is upgraded with the latest rt-kernel. Please let me know if there is a particular stable rt-kernel and stalld version we should use for the test, so that I can customise the image accordingly. Also, I will sync up with Andrew and our NFV DFG on whether the stalld service is recommended for use with OpenStack rt-kvm or not. Please bear with me for some time and I will update you soon. -- Pradipta

Hi Pradipta, Thanks, and please take your time in performing the tests. Just a note: we are actually increasing the BR from 10 to 15us. It is a good idea to discuss this with Andrew, but please note that having stalld enabled in general for all KVM-RT deployments is **not recommended** yet, as it incurs latency overhead. The enablement is a work in progress as part of Bug 1905799. This is precisely why we are monitoring the impact on the performance of the workload that you are testing. Thanks

Hi Pradipta, Just wanted to follow up to see if you got a chance to try the experiments Clark suggested? Thanks Nitesh

I had a discussion with Luiz today about this issue and we have concluded the following. There are two ways to fix this issue:

1. With the kernel fixes that Marcelo is working on
2. By enabling stalld in KVM-RT environments

Getting the kernel fixes merged upstream and then getting them backported downstream is going to take some time, so a more straightforward way to fix this issue, at least for the time being, is via stalld. The work to enable stalld for the KVM-RT environment is still underway as part of BZ 1905799, and once that completes the issue reported in this BZ will also be resolved. Hence, I am closing this bug as a DUP of BZ 1905799. Please feel free to comment if there are any concerns. Thanks

*** This bug has been marked as a duplicate of bug 1905799 ***

Reopening to track turbostat changes.

After evaluating this issue, there are no plans to address it further or fix it in an upcoming release. Therefore, it is being closed. If plans change such that this issue will be fixed in an upcoming release, then the bug can be reopened.

Hello Luiz, As Nitesh explained in comment 68, there are two ways to solve this bug. It seems item 2 (BZ 1905799 has been closed as WONTFIX) is not a valid solution now.
Do we still plan to do item 1? Do we plan to fix this bug? What do you think? Thanks.

Best regards,
Pei

(In reply to Pei Zhang from comment #73)
> Do we still plan to do item 1? Do we plan to fix this bug? What do you think?

@Nitesh, can you remind us which fixes of Marcelo's you are referring to in comment 68? My recollection from this BZ is that we deprioritized it since there are possible workarounds.

(In reply to Luiz Capitulino from comment #74)
> @Nitesh, can you remind us which fixes of Marcelo's you are referring to in
> comment 68?

It is the task isolation work that Marcelo is doing upstream.

(In reply to Nitesh Narayan Lal from comment #75)
> It is the task isolation work that Marcelo is doing upstream.

OK, so that's long term. Pei, I'll stick with the deprioritization of this BZ. Maybe Marcelo can comment once he's back from PTO.