Bug 1956237 - Running runqlat from bcc-tools hangs the system
Summary: Running runqlat from bcc-tools hangs the system
Keywords:
Status: CLOSED DUPLICATE of bug 1938312
Alias: None
Product: Fedora
Classification: Fedora
Component: bcc
Version: 34
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Assignee: Jiri Olsa
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2021-05-03 09:30 UTC by Marko Myllynen
Modified: 2021-05-03 15:44 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-05-03 15:44:57 UTC
Type: Bug


Description Marko Myllynen 2021-05-03 09:30:37 UTC
Description of problem:
On Fedora 33 with kernel-5.11.16-200.fc33.x86_64 and bcc-tools-0.18.0-1.fc33.x86_64 /usr/share/bcc/tools/runqlat works as expected.

When running /usr/share/bcc/tools/runqlat on Fedora 34 with kernel-5.11.16-300.fc34.x86_64 and bcc-tools-0.18.0-4.fc34.x86_64 the system hangs after the script has started up:

# rpm -q kernel bcc-tools              
kernel-5.11.16-300.fc34.x86_64
bcc-tools-0.18.0-4.fc34.x86_64
root@f34-server:~# /usr/share/bcc/tools/runqlat  
...
3 warnings generated.
Tracing run queue latency... Hit Ctrl-C to end.

CPU usage is high, Ctrl-C doesn't do anything and even logging in via the virtual console doesn't work.

I'm not sure how to debug this further but this should be trivial to reproduce.

Thanks.

Comment 1 Rafael Fonseca 2021-05-03 10:24:35 UTC
Cannot reproduce with kernel-5.11.17-300:

$: uname -r
5.11.17-300.fc34.x86_64

$: rpm -q bcc-tools
bcc-tools-0.18.0-4.fc34.x86_64

# /usr/share/bcc/tools/runqlat
...
3 warnings generated.
Tracing run queue latency... Hit Ctrl-C to end.
^C
     usecs               : count     distribution
         0 -> 1          : 0        |                                        |
         2 -> 3          : 0        |                                        |
         4 -> 7          : 3        |*                                       |
         8 -> 15         : 40       |************************                |
        16 -> 31         : 66       |****************************************|
        32 -> 63         : 9        |*****                                   |
        64 -> 127        : 15       |*********                               |
       128 -> 255        : 7        |****                                    |
       256 -> 511        : 0        |                                        |
       512 -> 1023       : 0        |                                        |
      1024 -> 2047       : 1        |                                        |
      2048 -> 4095       : 1        |                                        |
      4096 -> 8191       : 0        |                                        |
      8192 -> 16383      : 1        |                                        |
     16384 -> 32767      : 3        |*                                       |
     32768 -> 65535      : 4        |**                                      |
     65536 -> 131071     : 1        |                                        |
    131072 -> 262143     : 2        |*                                       |

Comment 2 Marko Myllynen 2021-05-03 11:48:04 UTC
Thanks for looking into this.

I have now installed a new Fedora 34 Server VM (on an otherwise idle RHEL 7 host) using all defaults and tried the .12/.16/.17 kernels.

The hang occurs occasionally with all of those kernel versions here. Changing the VM CPU doesn't seem to have any notable effect on how often it happens.

My testing procedure is:

1) Force off the VM
2) Power on the VM
3) Login via console as root
4) Run 'sync'
5) Run /usr/share/bcc/tools/runqlat

After a reboot, if the command works, it seems to work several times in a row. However, after that the system seems to get stuck on shutdown. If it fails, it gets stuck after the "3 warnings generated" or the "Hit Ctrl-C" line is printed. When booting without the "rhgb quiet" boot parameters, I sometimes see audit messages on the console about ~100 messages being suppressed and the backlog limit being exceeded.

When booting with "audit=0", runqlat seems to work reliably on two different F34 VMs and the system does not hang on power down.
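
To confirm that a given boot really has auditing disabled before running runqlat, the kernel command line can be inspected. The following is only a minimal Python sketch under stated assumptions: the helper names audit_disabled and read_cmdline and the sample command line are hypothetical (they are not part of bcc or the kernel), and the check assumes that the last audit= token on the command line is the one that takes effect.

```python
def audit_disabled(cmdline: str) -> bool:
    """Return True if the last audit= token on the command line is audit=0.

    Assumption: when audit= appears more than once, the last one wins.
    """
    value = None
    for token in cmdline.split():
        if token.startswith("audit="):
            value = token[len("audit="):]
    return value == "0"


def read_cmdline(path: str = "/proc/cmdline") -> str:
    """Read the command line of the running kernel (Linux only)."""
    with open(path) as f:
        return f.read()


if __name__ == "__main__":
    # Hypothetical sample; on a real system pass read_cmdline() instead.
    sample = "BOOT_IMAGE=/vmlinuz root=/dev/mapper/fedora-root ro audit=0"
    print("audit disabled:", audit_disabled(sample))
```

On the affected VM, calling audit_disabled(read_cmdline()) before starting runqlat would show whether the run is happening with or without the "audit=0" workaround.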

Does this help explain what might be going on here?

Thanks.

Comment 3 Jerome Marchand 2021-05-03 15:44:57 UTC
It's a known deadlock issue in the kernel. AFAIK, it's still being worked on upstream.

*** This bug has been marked as a duplicate of bug 1938312 ***

