1956237 – Running runqlat from bcc-tools hangs the system

Keywords:
Status:	CLOSED DUPLICATE of bug 1938312
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	bcc
Sub Component:
Version:	34
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	unspecified
Target Milestone:	---
Assignee:	Jiri Olsa
QA Contact:	Fedora Extras Quality Assurance
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2021-05-03 09:30 UTC by Marko Myllynen
Modified:	2021-05-03 15:44 UTC
CC List:	6 users
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2021-05-03 15:44:57 UTC
Type:	Bug
Embargoed:
Dependent Products:

Description Marko Myllynen 2021-05-03 09:30:37 UTC

Description of problem:
On Fedora 33 with kernel-5.11.16-200.fc33.x86_64 and bcc-tools-0.18.0-1.fc33.x86_64 /usr/share/bcc/tools/runqlat works as expected.

When running /usr/share/bcc/tools/runqlat on Fedora 34 with kernel-5.11.16-300.fc34.x86_64 and bcc-tools-0.18.0-4.fc34.x86_64 the system hangs after the script has started up:

# rpm -q kernel bcc-tools              
kernel-5.11.16-300.fc34.x86_64
bcc-tools-0.18.0-4.fc34.x86_64
root@f34-server:~# /usr/share/bcc/tools/runqlat  
...
3 warnings generated.
Tracing run queue latency... Hit Ctrl-C to end.

CPU usage is high, Ctrl-C doesn't do anything and even logging in via the virtual console doesn't work.

I'm not sure how to debug this further but this should be trivial to reproduce.

Thanks.

Comment 1 Rafael Fonseca 2021-05-03 10:24:35 UTC

Cannot reproduce with kernel-5.11.17-300:

$: uname -r
5.11.17-300.fc34.x86_64

$: rpm -q bcc-tools
bcc-tools-0.18.0-4.fc34.x86_64

# /usr/share/bcc/tools/runqlat
...
3 warnings generated.
Tracing run queue latency... Hit Ctrl-C to end.
^C
     usecs               : count     distribution
         0 -> 1          : 0        |                                        |
         2 -> 3          : 0        |                                        |
         4 -> 7          : 3        |*                                       |
         8 -> 15         : 40       |************************                |
        16 -> 31         : 66       |****************************************|
        32 -> 63         : 9        |*****                                   |
        64 -> 127        : 15       |*********                               |
       128 -> 255        : 7        |****                                    |
       256 -> 511        : 0        |                                        |
       512 -> 1023       : 0        |                                        |
      1024 -> 2047       : 1        |                                        |
      2048 -> 4095       : 1        |                                        |
      4096 -> 8191       : 0        |                                        |
      8192 -> 16383      : 1        |                                        |
     16384 -> 32767      : 3        |*                                       |
     32768 -> 65535      : 4        |**                                      |
     65536 -> 131071     : 1        |                                        |
    131072 -> 262143     : 2        |*                                       |

Comment 2 Marko Myllynen 2021-05-03 11:48:04 UTC

Thanks for looking into this.

I have now installed a new Fedora 34 Server VM (on an otherwise idle RHEL 7 host) using all defaults and tried the .12/.16/.17 kernels.

The hang happens occasionally with all of those kernel versions here. Changing the VM CPU doesn't seem to have any notable effect on the frequency of the issue.

My testing procedure is:

1) Force off the VM
2) Power on the VM
3) Login via console as root
4) Run 'sync'
5) Run /usr/share/bcc/tools/runqlat

After a reboot, if the command works, it seems to work several times in a row. However, after that the system seems to get stuck on shutdown. If it fails, it gets stuck after the "3 warnings generated" or the "Hit Ctrl-C" line is printed. When booting without the "rhgb quiet" boot parameters, I sometimes see audit messages on the console about ~100 messages suppressed and the backlog limit being exceeded.

When booting with "audit=0", runqlat seems to work reliably on two different F34 VMs and the system does not hang on power down.

Does this help explain what might be going on here?
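
For anyone retesting the "audit=0" workaround, the following is a minimal sketch of how one might confirm that the parameter is actually active on the running kernel before starting runqlat. The function name audit_disabled_in and the "check" argument are made up for this illustration and are not part of bcc or Fedora; the grubby invocation in the comment is only one possible way to set the parameter, and the comment above does not say how it was set.

```shell
#!/bin/sh
# Hypothetical helper (not part of bcc-tools): return 0 if the given
# kernel command line contains the exact token "audit=0".
audit_disabled_in() {
    # Unquoted $1 is intentional: split the command line on whitespace.
    for arg in $1; do
        [ "$arg" = "audit=0" ] && return 0
    done
    return 1
}

# Inspect the running kernel only when invoked with the argument "check".
# Assumption: one way to add the parameter persistently on Fedora is
#   grubby --update-kernel=ALL --args="audit=0"
# followed by a reboot; editing the boot entry by hand works as well.
if [ "${1:-}" = "check" ]; then
    if audit_disabled_in "$(cat /proc/cmdline)"; then
        echo "audit=0 is active on this boot"
    else
        echo "audit=0 is not set on this boot"
    fi
fi
```

Saved to a file, it can be run as "sh ./script check" (the file name is arbitrary) right after logging in and before starting /usr/share/bcc/tools/runqlat.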

Thanks.

Comment 3 Jerome Marchand 2021-05-03 15:44:57 UTC

It's a known deadlock issue in the kernel. AFAIK, it's still being worked on upstream.

*** This bug has been marked as a duplicate of bug 1938312 ***
