Bug 1956237

Summary: Running runqlat from bcc-tools hangs the system
Product: [Fedora] Fedora
Reporter: Marko Myllynen <myllynen>
Component: bcc
Assignee: Jiri Olsa <jolsa>
Status: CLOSED DUPLICATE
QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: unspecified
Priority: unspecified
Version: 34
CC: acaringi, agerstmayr, jmarchan, jolsa, rdossant, skozina
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Doc Type: If docs needed, set a value
Story Points: ---
Last Closed: 2021-05-03 15:44:57 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
Category: ---
oVirt Team: ---
Cloudforms Team: ---

Description Marko Myllynen 2021-05-03 09:30:37 UTC
Description of problem:
On Fedora 33 with kernel-5.11.16-200.fc33.x86_64 and bcc-tools-0.18.0-1.fc33.x86_64 /usr/share/bcc/tools/runqlat works as expected.

When running /usr/share/bcc/tools/runqlat on Fedora 34 with kernel-5.11.16-300.fc34.x86_64 and bcc-tools-0.18.0-4.fc34.x86_64 the system hangs after the script has started up:

# rpm -q kernel bcc-tools              
kernel-5.11.16-300.fc34.x86_64
bcc-tools-0.18.0-4.fc34.x86_64
root@f34-server:~# /usr/share/bcc/tools/runqlat  
...
3 warnings generated.
Tracing run queue latency... Hit Ctrl-C to end.

CPU usage is high, Ctrl-C doesn't do anything and even logging in via the virtual console doesn't work.

I'm not sure how to debug this further but this should be trivial to reproduce.

Thanks.

Comment 1 Rafael Fonseca 2021-05-03 10:24:35 UTC
Cannot reproduce with kernel-5.11.17-300:

$: uname -r
5.11.17-300.fc34.x86_64

$: rpm -q bcc-tools
bcc-tools-0.18.0-4.fc34.x86_64

# /usr/share/bcc/tools/runqlat
...
3 warnings generated.
Tracing run queue latency... Hit Ctrl-C to end.
^C
     usecs               : count     distribution
         0 -> 1          : 0        |                                        |
         2 -> 3          : 0        |                                        |
         4 -> 7          : 3        |*                                       |
         8 -> 15         : 40       |************************                |
        16 -> 31         : 66       |****************************************|
        32 -> 63         : 9        |*****                                   |
        64 -> 127        : 15       |*********                               |
       128 -> 255        : 7        |****                                    |
       256 -> 511        : 0        |                                        |
       512 -> 1023       : 0        |                                        |
      1024 -> 2047       : 1        |                                        |
      2048 -> 4095       : 1        |                                        |
      4096 -> 8191       : 0        |                                        |
      8192 -> 16383      : 1        |                                        |
     16384 -> 32767      : 3        |*                                       |
     32768 -> 65535      : 4        |**                                      |
     65536 -> 131071     : 1        |                                        |
    131072 -> 262143     : 2        |*                                       |

Comment 2 Marko Myllynen 2021-05-03 11:48:04 UTC
Thanks for looking into this.

I have now installed a new Fedora 34 Server VM (on an otherwise idle RHEL 7 host) using all defaults and tried the .12/.16/.17 kernels.

Here the hang happens occasionally with all of those kernel versions. Changing the VM CPU doesn't seem to have any notable effect on the frequency of the issue.

My testing procedure is:

1) Force off the VM
2) Power on the VM
3) Login via console as root
4) Run 'sync'
5) Run /usr/share/bcc/tools/runqlat

After a reboot, if the command works, it seems to work several times in a row. However, the system then seems to get stuck on shutdown. If it fails, it gets stuck after the "3 warnings generated" line or after the "Hit Ctrl-C" line is printed. When booting without the "rhgb quiet" boot parameters, I sometimes see audit messages on the console about ~100 messages being suppressed and the backlog limit being exceeded.

When booting with "audit=0", runqlat seems to work reliably on two different F34 VMs and the system does not hang on power down.

Does this help explain what might be going on here?

Thanks.
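
For anyone retrying the "audit=0" observation from this comment, the following is a minimal sketch of how one might check whether the running kernel was booted with that parameter. The helper name audit_disabled is hypothetical, and the grubby invocation it prints assumes a stock Fedora install where grubby manages the boot entries; it is a suggestion only, not something that was run in this report.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if the given kernel command line
# contains the exact token "audit=0".
audit_disabled() {
    # Pad with spaces so the token also matches at the start or end.
    case " $1 " in
        *" audit=0 "*) return 0 ;;
        *) return 1 ;;
    esac
}

# /proc/cmdline holds the parameters the running kernel was booted with.
if audit_disabled "$(cat /proc/cmdline 2>/dev/null)"; then
    echo "audit=0 is set on the kernel command line"
else
    echo "audit=0 is not set; to add it for all installed kernels (as root, assuming grubby):"
    echo "  grubby --update-kernel=ALL --args=audit=0"
fi
```

The script only reports the current state and prints the command; a reboot is needed before a changed kernel command line takes effect.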

Comment 3 Jerome Marchand 2021-05-03 15:44:57 UTC
It's a known deadlock issue in the kernel. AFAIK, it's still being worked on upstream.

*** This bug has been marked as a duplicate of bug 1938312 ***