Bug 1938312

Summary: bpf_probe_read_compat appears to deadlock the kernel
Product: [Fedora] Fedora Reporter: Serguei Makarov <smakarov>
Component: kernelAssignee: Kernel Maintainer List <kernel-maint>
Status: CLOSED ERRATA QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: unspecified Docs Contact:
Priority: unspecified    
Version: 34CC: acaringi, adscvr, agerstmayr, airlied, alciregi, bskeggs, daniel, fche, hdegoede, jarodwilson, jeremy, jforbes, jglisse, jmarchan, jolsa, jonathan, josef, kernel-maint, lgoncalv, linville, masami256, mchehab, myllynen, ptalbert, steved, wcohen
Target Milestone: ---   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: kernel-5.12.10-300.fc34 kernel-5.12.10-200.fc33 Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
: 1945162 (view as bug list) Environment:
Last Closed: 2021-06-15 01:05:11 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1945162    
Attachments:
Description Flags
deadlock warning
none
example dmesg on rawhide (did not capture the final deadlock)
none
example dmesg on f34 (did not capture final deadlock warning) none

Description Serguei Makarov 2021-03-12 18:49:53 UTC
Created attachment 1763055 [details]
deadlock warning

1. Please describe the problem:

Kernel deadlock began to manifest consistently when testing previously-working testcases of the SystemTap BPF backend that use BPF to probe the sched_switch tracepoint. Filing a bug against the kernel package since 'kernel deadlock' is not an expected failure mode of BPF operations.

Example backtrace shows deadlock in bpf_probe_read_compat

[  730.868702] stack backtrace:
[  730.869590] CPU: 1 PID: 701 Comm: in:imjournal Not tainted 5.12.0-0.rc2.20210309git144c79ef3353.166.fc35.x86_64 #1
[  730.871605] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014
[  730.873278] Call Trace:
[  730.873770]  dump_stack+0x7f/0xa1
[  730.874433]  check_noncircular+0xdf/0x100
[  730.875232]  __lock_acquire+0x1202/0x1e10
[  730.876031]  ? __lock_acquire+0xfc0/0x1e10
[  730.876844]  lock_acquire+0xc2/0x3a0
[  730.877551]  ? __wake_up_common_lock+0x52/0x90
[  730.878434]  ? lock_acquire+0xc2/0x3a0
[  730.879186]  ? lock_is_held_type+0xa7/0x120
[  730.880044]  ? skb_queue_tail+0x1b/0x50
[  730.880800]  _raw_spin_lock_irqsave+0x4d/0x90
[  730.881656]  ? __wake_up_common_lock+0x52/0x90
[  730.882532]  __wake_up_common_lock+0x52/0x90
[  730.883375]  audit_log_end+0x5b/0x100
[  730.884104]  slow_avc_audit+0x69/0x90
[  730.884836]  avc_has_perm+0x8b/0xb0
[  730.885532]  selinux_lockdown+0xa5/0xd0
[  730.886297]  security_locked_down+0x20/0x40
[  730.887133]  bpf_probe_read_compat+0x66/0xd0
[  730.887983]  bpf_prog_250599c5469ac7b5+0x10f/0x820
[  730.888917]  trace_call_bpf+0xe9/0x240
[  730.889672]  perf_trace_run_bpf_submit+0x4d/0xc0
[  730.890579]  perf_trace_sched_switch+0x142/0x180
[  730.891485]  ? __schedule+0x6d8/0xb20
[  730.892209]  __schedule+0x6d8/0xb20
[  730.892899]  schedule+0x5b/0xc0
[  730.893522]  exit_to_user_mode_prepare+0x11d/0x240
[  730.894457]  syscall_exit_to_user_mode+0x27/0x70
[  730.895361]  entry_SYSCALL_64_after_hwframe+0x44/0xae

2. What is the Version-Release number of the kernel:

5.11.0-0.rc7.20210210gite0756cfc7d7c.150.fc34.x86_64, reconfirmed 5.11.5-300.fc34.x86_64 - in Fedora 34
5.12.0-0.rc2.20210309git144c79ef3353.166.fc35.x86_64 - in Rawhide

3. Did it work previously in Fedora? If so, what kernel version did the issue
   *first* appear?  Old kernels are available for download at
   https://koji.fedoraproject.org/koji/packageinfo?packageID=8 :

Yes, I have successful results from the same testcases on 5.11.0-0.rc6.20210204git61556703b610.144.fc34.x86_64

4. Can you reproduce this issue? If so, please provide the steps to reproduce
   the issue below:

Using Fedora systemtap packages:

    dnf install systemtap systemtap-testsuite
    dnf debuginfo-install kernel
    cd /usr/share/systemtap/testsuite
    env DEBUGINFOD_URLS=https://debuginfod.elfutils.org make installcheck RUNTESTFLAGS=bpf.exp CHECK_ONLY="reg_alloc1 reg_alloc2 reg_alloc3"

5. Does this problem occur with the latest Rawhide kernel? To install the
   Rawhide kernel, run ``sudo dnf install fedora-repos-rawhide`` followed by
   ``sudo dnf update --enablerepo=rawhide kernel``:

Yes, also seen consistently on rawhide 5.12.0-0.rc2.20210309git144c79ef3353.166.fc35.x86_64

6. Are you running any modules that not shipped with directly Fedora's kernel?:

These tests are running with the SystemTap BPF backend, which does not generate any kernel modules unlike the regular SystemTap backend.

7. Please attach the kernel logs. You can get the complete kernel log
   for a boot with ``journalctl --no-hostname -k > dmesg.txt``. If the
   issue occurred on a previous boot, use the journalctl ``-b`` flag.

Will attach dmesgs from journalctl -b -1 in a followup.

Note that journalctl was not able to record the deadlock, which I've pasted from the system's console. See bpf.2021-03-deadlock.txt. (Updating the Fedora 34 kernel package to 5.11.5-300.fc34.x86_64 still manifests the deadlock, but I wasn't able to make the stack trace show up on the console.)

Comment 1 Serguei Makarov 2021-03-12 18:50:51 UTC
Created attachment 1763056 [details]
example dmesg on rawhide (did not capture the final deadlock)

Comment 2 Serguei Makarov 2021-03-12 18:51:23 UTC
Created attachment 1763057 [details]
example dmesg on f34 (did not capture final deadlock warning)

Comment 3 Serguei Makarov 2021-03-16 18:01:18 UTC
Fortunately the problem reproduced on the upstream kernel with the same known-good and known-bad versions, so it's not due to a Fedora patch. Currently in the midst of bisecting from known-good 5.11.0-rc6 to known-bad 5.11.0-rc7.

Note that the known-good 5.11.0-rc6 does have reg_alloc3 trigger the following warnings:

[  813.712431] audit: audit_backlog=65 > audit_backlog_limit=64
[  813.714375] audit: audit_lost=1 audit_rate_limit=0 audit_backlog_limit=64
[  813.716607] audit: backlog limit exceeded
[  813.717972] audit: audit_backlog=65 > audit_backlog_limit=64
[  813.719850] audit: audit_lost=2 audit_rate_limit=0 audit_backlog_limit=64
[  813.722070] audit: backlog limit exceeded
[  813.723421] audit: audit_backlog=65 > audit_backlog_limit=64
[  813.725294] audit: audit_lost=3 audit_rate_limit=0 audit_backlog_limit=64
[  813.727524] audit: backlog limit exceeded
[  813.728864] audit: audit_backlog=65 > audit_backlog_limit=64
[  877.699700] audit_log_start: 2282 callbacks suppressed
[  877.699706] audit: audit_backlog=71 > audit_backlog_limit=64
[  877.703509] audit: audit_lost=766 audit_rate_limit=0 audit_backlog_limit=64
[  877.705869] audit: backlog limit exceeded
[  877.708933] audit: audit_backlog=71 > audit_backlog_limit=64
[  877.710910] audit: audit_lost=767 audit_rate_limit=0 audit_backlog_limit=64
[  877.713297] audit: backlog limit exceeded
[  877.714679] audit: audit_backlog=71 > audit_backlog_limit=64
[  877.716832] audit: audit_lost=768 audit_rate_limit=0 audit_backlog_limit=64
[  877.719198] audit: backlog limit exceeded

But it doesn't result in a deadlock.

Comment 4 Serguei Makarov 2021-03-17 14:33:45 UTC
Bisection result is so far puzzling. I may recheck some of the 'good' commits in the unlikely case the problem was only manifesting intermittently for them, but it's also possible something subtle is going on with this patch.

git bisect start
# bad: [92bf22614b21a2706f4993b278017e437f7785b3] Linux 5.11-rc7
git bisect bad 92bf22614b21a2706f4993b278017e437f7785b3
# good: [1048ba83fb1c00cd24172e23e8263972f6b5d9ac] Linux 5.11-rc6
git bisect good 1048ba83fb1c00cd24172e23e8263972f6b5d9ac
# bad: [654eb3f2a009af1fc64b10442e559e0d1e50904a] MAINTAINERS/.mailmap: use my @kernel.org address
git bisect bad 654eb3f2a009af1fc64b10442e559e0d1e50904a
# good: [40615974f88a918d01606ba27d75de2ff50b8d4e] Merge tag 'gpio-fixes-for-v5.11-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/brgl/linux
git bisect good 40615974f88a918d01606ba27d75de2ff50b8d4e
# good: [54fe3ffef0ebb60b1273d0d7b047ee9b4723cc61] Merge tag 'arm-soc-fixes-v5.11-3' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc
git bisect good 54fe3ffef0ebb60b1273d0d7b047ee9b4723cc61
# good: [4cb2c00c43b3fe88b32f29df4f76da1b92c33224] Merge tag 'ovl-fixes-5.11-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs
git bisect good 4cb2c00c43b3fe88b32f29df4f76da1b92c33224
# good: [4f6ec8602341e97b364e4e0d41a1ed08148f5e98] mm/vmalloc: separate put pages and flush VM flags
git bisect good 4f6ec8602341e97b364e4e0d41a1ed08148f5e98
# bad: [9c41e526a56f2cf25816e58284f4a5f9c12ccef7] mailmap: add entries for Manivannan Sadhasivam
git bisect bad 9c41e526a56f2cf25816e58284f4a5f9c12ccef7
# good: [1c2f67308af4c102b4e1e6cd6f69819ae59408e0] mm: thp: fix MADV_REMOVE deadlock on shmem THP
git bisect good 1c2f67308af4c102b4e1e6cd6f69819ae59408e0
# bad: [4c415b9a710b6ebce6517f6d4cdc5c4c31cfd7d9] mailmap: fix name/email for Viresh Kumar
git bisect bad 4c415b9a710b6ebce6517f6d4cdc5c4c31cfd7d9
# bad: [2dcb3964544177c51853a210b6ad400de78ef17d] memblock: do not start bottom-up allocations with kernel_end
git bisect bad 2dcb3964544177c51853a210b6ad400de78ef17d
# first bad commit: [2dcb3964544177c51853a210b6ad400de78ef17d] memblock: do not start bottom-up allocations with kernel_end

2dcb3964544177c51853a210b6ad400de78ef17d is the first bad commit
commit 2dcb3964544177c51853a210b6ad400de78ef17d
Author: Roman Gushchin <guro>
Date:   Thu Feb 4 18:32:36 2021 -0800

    memblock: do not start bottom-up allocations with kernel_end
    
    With kaslr the kernel image is placed at a random place, so starting the
    bottom-up allocation with the kernel_end can result in an allocation
    failure and a warning like this one:
    
      hugetlb_cma: reserve 2048 MiB, up to 2048 MiB per node
      ------------[ cut here ]------------
      memblock: bottom-up allocation failed, memory hotremove may be affected
      WARNING: CPU: 0 PID: 0 at mm/memblock.c:332 memblock_find_in_range_node+0x178/0x25a
      Modules linked in:
      CPU: 0 PID: 0 Comm: swapper Not tainted 5.10.0+ #1169
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-1.fc33 04/01/2014
      RIP: 0010:memblock_find_in_range_node+0x178/0x25a
      Code: e9 6d ff ff ff 48 85 c0 0f 85 da 00 00 00 80 3d 9b 35 df 00 00 75 15 48 c7 c7 c0 75 59 88 c6 05 8b 35 df 00 01 e8 25 8a fa ff <0f> 0b 48 c7 44 24 20 ff ff ff ff 44 89 e6 44 89 ea 48 c7 c1 70 5c
      RSP: 0000:ffffffff88803d18 EFLAGS: 00010086 ORIG_RAX: 0000000000000000
      RAX: 0000000000000000 RBX: 0000000240000000 RCX: 00000000ffffdfff
      RDX: 00000000ffffdfff RSI: 00000000ffffffea RDI: 0000000000000046
      RBP: 0000000100000000 R08: ffffffff88922788 R09: 0000000000009ffb
      R10: 00000000ffffe000 R11: 3fffffffffffffff R12: 0000000000000000
      R13: 0000000000000000 R14: 0000000080000000 R15: 00000001fb42c000
      FS:  0000000000000000(0000) GS:ffffffff88f71000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: ffffa080fb401000 CR3: 00000001fa80a000 CR4: 00000000000406b0
      Call Trace:
        memblock_alloc_range_nid+0x8d/0x11e
        cma_declare_contiguous_nid+0x2c4/0x38c
        hugetlb_cma_reserve+0xdc/0x128
        flush_tlb_one_kernel+0xc/0x20
        native_set_fixmap+0x82/0xd0
        flat_get_apic_id+0x5/0x10
        register_lapic_address+0x8e/0x97
        setup_arch+0x8a5/0xc3f
        start_kernel+0x66/0x547
        load_ucode_bsp+0x4c/0xcd
        secondary_startup_64_no_verify+0xb0/0xbb
      random: get_random_bytes called from __warn+0xab/0x110 with crng_init=0
      ---[ end trace f151227d0b39be70 ]---
    
    At the same time, the kernel image is protected with memblock_reserve(),
    so we can just start searching at PAGE_SIZE.  In this case the bottom-up
    allocation has the same chances to success as a top-down allocation, so
    there is no reason to fallback in the case of a failure.  All together it
    simplifies the logic.
    
    Link: https://lkml.kernel.org/r/20201217201214.3414100-2-guro@fb.com
    Fixes: 8fabc623238e ("powerpc: Ensure that swiotlb buffer is allocated from low memory")
    Signed-off-by: Roman Gushchin <guro>
    Reviewed-by: Mike Rapoport <rppt.com>
    Cc: Joonsoo Kim <iamjoonsoo.kim>
    Cc: Michal Hocko <mhocko>
    Cc: Rik van Riel <riel>
    Cc: Wonhyuk Yang <vvghjk1234>
    Cc: Thiago Jung Bauermann <bauerman.com>
    Signed-off-by: Andrew Morton <akpm>
    Signed-off-by: Linus Torvalds <torvalds>

 mm/memblock.c | 49 ++++++-------------------------------------------
 1 file changed, 6 insertions(+), 43 deletions(-)

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?h=v5.11-rc7&id=2dcb3964544177c51853a210b6ad400de78ef17d

Comment 5 Jiri Olsa 2021-03-17 15:56:32 UTC
Jerome got audit related lock as well:
https://bugzilla.redhat.com/show_bug.cgi?id=1938640#c4

Comment 6 Jiri Olsa 2021-03-17 16:00:18 UTC
(In reply to Jiri Olsa from comment #5)
> Jerome got audit related lock as well:
> https://bugzilla.redhat.com/show_bug.cgi?id=1938640#c4

I meant with bpf_probe_read_compat ;-)

Comment 7 Serguei Makarov 2021-03-18 16:46:08 UTC
Looks like the bug appears in earlier reversions, though not consistently (which threw off my first bisection attempt). See upstream discussion:

https://lore.kernel.org/bpf/CANYvDQOH5ZDpQBAHtz13YNiJ2Bhd56wnoas71UdYco62g-xBDg@mail.gmail.com/T/#t

Comment 8 Jerome Marchand 2021-05-03 15:44:53 UTC
*** Bug 1956237 has been marked as a duplicate of this bug. ***

Comment 9 Daniel Borkmann 2021-05-03 16:07:28 UTC
As discussed upstream, based on the stack trace, the issue is very clearly from audit subsystem. You can also easily verify by running the same reproducer on a kernel which has no compiled-in audit subsystem.

Comment 10 Daniel Borkmann 2021-05-03 16:19:12 UTC
Also shown as in the other duplicate (1956237) that it's clearly an audit (and /not/ bpf) issue:

> When booting with "audit=0" then it seems runqlat works reliably on two different F34 VMs and the system does not hang on power down.

Comment 11 Jiri Olsa 2021-06-09 11:24:56 UTC
upstream fix:
  ff40e51043af bpf, lockdown, audit: Fix buggy SELinux lockdown permission checks

Comment 12 Justin M. Forbes 2021-06-09 12:56:20 UTC
This patch is queued for 5.12.10 stable, so should make the next update.

Comment 13 Fedora Update System 2021-06-10 23:15:38 UTC
FEDORA-2021-bc2a819bc5 has been submitted as an update to Fedora 34. https://bodhi.fedoraproject.org/updates/FEDORA-2021-bc2a819bc5

Comment 14 Fedora Update System 2021-06-10 23:15:41 UTC
FEDORA-2021-db2bb87f35 has been submitted as an update to Fedora 33. https://bodhi.fedoraproject.org/updates/FEDORA-2021-db2bb87f35

Comment 15 Fedora Update System 2021-06-11 01:42:44 UTC
FEDORA-2021-db2bb87f35 has been pushed to the Fedora 33 testing repository.
Soon you'll be able to install the update with the following command:
`sudo dnf upgrade --enablerepo=updates-testing --advisory=FEDORA-2021-db2bb87f35`
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2021-db2bb87f35

See also https://fedoraproject.org/wiki/QA:Updates_Testing for more information on how to test updates.

Comment 16 Fedora Update System 2021-06-11 02:08:06 UTC
FEDORA-2021-bc2a819bc5 has been pushed to the Fedora 34 testing repository.
Soon you'll be able to install the update with the following command:
`sudo dnf upgrade --enablerepo=updates-testing --advisory=FEDORA-2021-bc2a819bc5`
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2021-bc2a819bc5

See also https://fedoraproject.org/wiki/QA:Updates_Testing for more information on how to test updates.

Comment 17 Fedora Update System 2021-06-15 01:05:11 UTC
FEDORA-2021-bc2a819bc5 has been pushed to the Fedora 34 stable repository.
If problem still persists, please make note of it in this bug report.

Comment 18 Fedora Update System 2021-06-15 01:21:25 UTC
FEDORA-2021-db2bb87f35 has been pushed to the Fedora 33 stable repository.
If problem still persists, please make note of it in this bug report.