Description of problem:

On Fedora 33 with kernel-5.11.16-200.fc33.x86_64 and bcc-tools-0.18.0-1.fc33.x86_64, /usr/share/bcc/tools/runqlat works as expected.

When running /usr/share/bcc/tools/runqlat on Fedora 34 with kernel-5.11.16-300.fc34.x86_64 and bcc-tools-0.18.0-4.fc34.x86_64, the system hangs after the script has started up:

# rpm -q kernel bcc-tools
kernel-5.11.16-300.fc34.x86_64
bcc-tools-0.18.0-4.fc34.x86_64
root@f34-server:~# /usr/share/bcc/tools/runqlat
...
3 warnings generated.
Tracing run queue latency... Hit Ctrl-C to end.

CPU usage is high, Ctrl-C doesn't do anything, and even logging in via the virtual console doesn't work.

I'm not sure how to debug this further, but this should be trivial to reproduce.

Thanks.
Cannot reproduce with kernel-5.11.17-300:

$: uname -r
5.11.17-300.fc34.x86_64
$: rpm -q bcc-tools
bcc-tools-0.18.0-4.fc34.x86_64
# /usr/share/bcc/tools/runqlat
...
3 warnings generated.
Tracing run queue latency... Hit Ctrl-C to end.
^C
     usecs               : count     distribution
         0 -> 1          : 0        |                                        |
         2 -> 3          : 0        |                                        |
         4 -> 7          : 3        |*                                       |
         8 -> 15         : 40       |************************                |
        16 -> 31         : 66       |****************************************|
        32 -> 63         : 9        |*****                                   |
        64 -> 127        : 15       |*********                               |
       128 -> 255        : 7        |****                                    |
       256 -> 511        : 0        |                                        |
       512 -> 1023       : 0        |                                        |
      1024 -> 2047       : 1        |                                        |
      2048 -> 4095       : 1        |                                        |
      4096 -> 8191       : 0        |                                        |
      8192 -> 16383      : 1        |                                        |
     16384 -> 32767      : 3        |*                                       |
     32768 -> 65535      : 4        |**                                      |
     65536 -> 131071     : 1        |                                        |
    131072 -> 262143     : 2        |*                                       |
Thanks for looking into this.

I have now installed a new Fedora 34 Server VM (on an otherwise idle RHEL 7 host) using all defaults and tried the .12/.16/.17 kernels. The hang happens occasionally with all of those kernel versions here. Changing the VM CPU doesn't seem to have any notable effect on the frequency of the issue.

My testing procedure is:

1) Force off the VM
2) Power on the VM
3) Login via console as root
4) Run 'sync'
5) Run /usr/share/bcc/tools/runqlat

If the command works after a reboot, it seems to work several times in a row. However, after that the system seems to get stuck on shutdown. If it fails, it gets stuck after the "3 warnings generated" or the "Hit Ctrl-C" line is printed.

When booting without the "rhgb quiet" boot parameters, I sometimes see audit messages on the console saying that ~100 messages were suppressed and that the backlog limit was exceeded.

When booting with "audit=0", runqlat seems to work reliably on two different F34 VMs, and the system does not hang on power down.

Does this help explain what might be going on here?

Thanks.
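Since the results above differ depending on whether the VM was booted with "audit=0", it may help to record that state before each run of step 5. The following is a minimal sketch, not a tested reproducer: audit_disabled is a hypothetical helper name, and it only inspects the boot parameters, not the runtime audit state.

```shell
#!/bin/sh
# Sketch only: audit_disabled is a hypothetical helper, not part of bcc or Fedora.
# It succeeds (exit 0) when the kernel command line contains the exact
# parameter audit=0, and fails (exit 1) otherwise.
audit_disabled() {
    # Use the argument when given; otherwise read the running kernel's command line.
    if [ "$#" -gt 0 ]; then
        cmdline=$1
    else
        cmdline=$(cat /proc/cmdline)
    fi
    # Pad with spaces so only a whole-word audit=0 matches (not audit=01).
    case " $cmdline " in
        *" audit=0 "*) return 0 ;;
        *) return 1 ;;
    esac
}
```

Calling audit_disabled with no argument checks the running system. On Fedora, one common way to add the parameter to all installed kernels is grubby --update-kernel=ALL --args="audit=0", followed by a reboot. For a bounded test, runqlat also accepts an interval and a count (for example, /usr/share/bcc/tools/runqlat 1 5), although that will not help if the whole system hangs.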
It's a known deadlock issue in the kernel. AFAIK, it's still being worked on upstream.

*** This bug has been marked as a duplicate of bug 1938312 ***