Bug 1767539 - BUG: kernel NULL pointer dereference RIP: 0010:rb_erase+0x1b1/0x370
Summary: BUG: kernel NULL pointer dereference RIP: 0010:rb_erase+0x1b1/0x370
Keywords:
Status: CLOSED EOL
Alias: None
Product: Fedora
Classification: Fedora
Component: kernel
Version: 31
Hardware: x86_64
OS: Linux
Priority: high
Severity: high
Target Milestone: ---
Assignee: Kernel Maintainer List
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Duplicates: 1768092
Depends On:
Blocks:
 
Reported: 2019-10-31 16:24 UTC by Chris Evich
Modified: 2023-09-14 05:45 UTC
CC: 29 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-11-24 20:22:58 UTC
Type: Bug
Embargoed:


Attachments
Panic message from serial console (48.43 KB, text/plain), 2019-10-31 16:25 UTC, Chris Evich
Integration tests ginkgo debug output (180.00 KB, text/plain), 2019-10-31 16:32 UTC, Chris Evich
audit log (400.25 KB, text/plain), 2019-10-31 16:35 UTC, Chris Evich
System journal from relevant boot (231.17 KB, text/plain), 2019-10-31 16:38 UTC, Chris Evich
Debug patch for BFQ (117.42 KB, patch), 2019-11-05 17:45 UTC, Paolo
Tentative fix patch, to be applied on top of my dev-bfq branch (3.67 KB, patch), 2019-11-11 09:14 UTC, Paolo
serial output from kernel soft-locks then hung VM (209.29 KB, text/plain), 2020-01-07 15:53 UTC, Chris Evich
Tarball of V2 set of patches (5.24 KB, application/octet-stream), 2020-02-03 15:50 UTC, Chris Evich
Tarball of V2 set of patches (5.24 KB, application/gzip), 2020-02-03 15:54 UTC, Chris Evich


Links
Github containers/libpod pull 3901 (closed): Cirrus: Support testing with F31 (last updated 2021-01-12 20:49:13 UTC)
Linux Kernel bug 205447 (last updated 2020-01-07 16:15:18 UTC)

Internal Links: 1768092 1826091

Description Chris Evich 2019-10-31 16:24:17 UTC
Description of problem:

During automated podman integration testing on F31 (and, I believe, also during the beta) I've been tripping a kernel panic.  It happens on cloud VMs, so I had been putting off the debugging, hoping it would clear up on release.  It has not, but it is reproducible, and I can modify or set up the VMs in whatever way helps debugging.

Version-Release number of selected component (if applicable):
conmon-2.0.2-1.fc31-x86_64
containernetworking-plugins-0.8.2-2.1.dev.git485be65.fc31-x86_64
containers-common-0.1.40-2.fc31-x86_64
container-selinux-2.119.0-2.fc31-noarch
criu-3.13-5.fc31-x86_64
crun-0.10.3-1.fc31-x86_64
golang-1.13.3-1.fc31-x86_64
slirp4netns-0.4.0-20.1.dev.gitbbd6f25.fc31-x86_64

How reproducible:
Within about 20-25 minutes, using the libpod CI automation setup

Steps to Reproduce:
1. Use the libpod repo from PR #3901, with proper Google Cloud credentials
2. $ hack/get_ci_vm.sh fedora-31-libpod-6322976592494592
3. # contrib/cirrus/integration_test.sh

Actual results:

Kernel Panic

Expected results:

No panic, even if one or more integration tests fail

Additional info:

Prior to the F31 release, our automated testing of libpod with cgroups v2 (and crun) was limited to a temporary F30 setup.  Upstream wants to migrate testing to the latest Fedora release to support ongoing libpod development.

I have full control over these VMs: I can describe their current setup precisely, extract a live VM from the test environment as needed, and instrument them however needed to assist debugging.

Comment 1 Chris Evich 2019-10-31 16:25:47 UTC
Created attachment 1631145 [details]
Panic message from serial console

Comment 2 Chris Evich 2019-10-31 16:32:05 UTC
Created attachment 1631157 [details]
Integration tests ginkgo debug output

Comment 3 Chris Evich 2019-10-31 16:35:06 UTC
Created attachment 1631160 [details]
audit log

Comment 4 Chris Evich 2019-10-31 16:38:19 UTC
Created attachment 1631161 [details]
System journal from relevant boot

Comment 5 Chris Evich 2019-11-01 18:15:15 UTC
I'm setting up kexec on the VM that reproduced this, and will try to reproduce and capture a kernel core.  Unless anyone has a better/easier idea.

Comment 6 Chris Evich 2019-11-01 18:43:24 UTC
Okay, I tripped another panic; it has the exact same 'RIP: 0010:rb_erase+0x1b1/0x370' and a similar call trace on the serial console.  The VM appears to hang here and doesn't automatically boot the dump kernel.  I tried feeding a [break]-C to it, but there is no response.

ref: https://cloud.google.com/compute/docs/instances/interacting-with-serial-console#sending_a_serial_break

What am I forgetting?

Comment 7 Chris Evich 2019-11-04 21:04:06 UTC
No luck getting a core; something is broken with kexec or my /etc/kdump.conf setup:

...cut comments...
ext4 /dev/sda2
path /
core_collector makedumpfile -l --message-level 1 -d 31
kdump_post /var/crash/scripts/kdump-post.sh
kdump_pre /var/crash/scripts/kdump-pre.sh
failure_action shell

I created and formatted the sda2 partition as ext4 and have it mounted as /var/crash from fstab.  The kdump service is enabled and active after reporting success building its special ramdisk.  The pre/post scripts simply echo some text to stdout.  I turned on /proc/sys/kernel/sysrq, then tried to manually test dumping:

Send [BREAK]c over serial console -> *bam* VM reboots kernel -> panics in ramdisk:

...cut kernel messages...
[    2.168485] Freeing unused kernel image memory: 2272K
[    2.169448] Write protecting the kernel read-only data: 20480k
[    2.171306] Freeing unused kernel image memory: 2016K
[    2.173010] Freeing unused kernel image memory: 1580K
[    2.182054] x86/mm: Checked W+X mappings: passed, no W+X pages found.
[    2.183605] rodata_test: all tests were successful
[    2.184715] x86/mm: Checking user space page tables
[    2.194015] x86/mm: Checked W+X mappings: passed, no W+X pages found.
[    2.195655] Run /init as init process
[    2.197587] Kernel panic - not syncing: Attempted to kill init! exitcode=0x00007f00
[    2.198892] CPU: 0 PID: 1 Comm: init Not tainted 5.3.7-301.fc31.x86_64 #1
[    2.200215] Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
[    2.202507] Call Trace:
[    2.203085]  dump_stack+0x5c/0x80
[    2.204165]  panic+0x101/0x2d7
[    2.204933]  do_exit.cold+0x1a/0xd1
[    2.205748]  ? __do_sys_newstat+0x48/0x70
[    2.206422]  do_group_exit+0x3a/0xa0
[    2.207206]  __x64_sys_exit_group+0x14/0x20
[    2.208367]  do_syscall_64+0x5f/0x1a0
[    2.209117]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[    2.210386] RIP: 0033:0x7f01408f118e
[    2.210978] Code: Bad RIP value.
[    2.211912] RSP: 002b:00007ffcd0e7f358 EFLAGS: 00000206 ORIG_RAX: 00000000000000e7
[    2.213386] RAX: ffffffffffffffda RBX: 00007f01400c3bb0 RCX: 00007f01408f118e
[    2.215020] RDX: 000000000000007f RSI: 000000000000003c RDI: 000000000000007f
[    2.216459] RBP: 00007ffcd0e7ffe0 R08: 00000000000000e7 R09: 00007ffcd0e7f268
[    2.217832] R10: 0000000000000000 R11: 0000000000000206 R12: 00007f01408fa570
[    2.219167] R13: 0000000000000000 R14: 0000000000000019 R15: 00007f01400c3be0
[    2.220720] Kernel Offset: disabled
[    2.221871] Rebooting in 10 seconds..
[   12.224262] ACPI MEMORY or I/O RESET_REG.
...cut system reboots from bios...
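
(For reference: with sysrq enabled, a kdump setup can also be exercised locally, without serial-console BREAK access.  A minimal sketch; note the second command immediately crashes the kernel, so only run it on a disposable VM:)

# echo 1 > /proc/sys/kernel/sysrq       # enable all sysrq functions
# echo c > /proc/sysrq-trigger          # force a crash; kdump should boot the capture kernel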

Comment 8 Jeff Moyer 2019-11-05 13:22:51 UTC
Paolo, have you seen a trace like this before?

Comment 10 Chris Evich 2019-11-05 16:10:17 UTC
When the problem occurs, the system hangs.  I've now set `kernel.panic_on_oops = 1`, increased the verbosity of makedumpfile, rebuilt the ramdisk, and rebooted; we'll see if any of that helps...
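
(For reference, a minimal sketch of making that setting persistent via sysctl; the drop-in file name is illustrative:)

# echo 'kernel.panic_on_oops = 1' > /etc/sysctl.d/90-panic-on-oops.conf
# sysctl --system       # reload sysctl settings from all configuration files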

Comment 11 Chris Evich 2019-11-05 17:25:40 UTC
...it did not make any difference; the system still hangs and requires a manual panic via the serial console.

Note: I switched our CI system to use the deadline scheduler for Fedora 31, and initial runs look promising.  In other words, the BFQ elevator seems to be specifically required/involved in the panic.  BFQ is the default scheduler (and ext4 the default filesystem) for the F31 cloud image.

https://dl.fedoraproject.org/pub/fedora/linux/releases/31/Cloud/x86_64/images/
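
(For reference, the active elevator can be checked and switched per-device at runtime through sysfs; a minimal sketch, assuming the root disk is sda.  The change lasts until reboot:)

# cat /sys/block/sda/queue/scheduler                  # active scheduler is shown in brackets, e.g. [bfq]
# echo mq-deadline > /sys/block/sda/queue/scheduler   # switch to the deadline scheduler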

Comment 12 Paolo 2019-11-05 17:35:22 UTC
(In reply to Jeff Moyer from comment #8)
> Paolo, have you seen a trace like this before?

Nope, but it looks like you found work for me :) I'm about to share a debugging patch for BFQ.

Comment 13 Paolo 2019-11-05 17:44:32 UTC
(In reply to Chris Evich from comment #5)
> I'm setting up kexec on the VM that reproduced this, and will try to
> reproduce and capture a kernel core.  Unless anyone has a better/easier idea.

I'm about to attach a kernel debugging patch for BFQ. Could you apply it and retry? The patch is for 5.3.0, so it should be ok for your kernel.

The goal of the patch is to hunt the cause of this crash, through a lot of invariant checks (BUG_ONs). If a BUG_ON triggers, the OOPS will hopefully tell us something useful.

Comment 14 Paolo 2019-11-05 17:45:31 UTC
Created attachment 1633036 [details]
Debug patch for BFQ

Comment 15 Chris Evich 2019-11-05 18:13:36 UTC
Sure, happy to... but it's been years since I've built a kernel.  Is there a quick reference somewhere?

Comment 16 Paolo 2019-11-05 18:27:26 UTC
(In reply to Chris Evich from comment #15)
> Sure happy to...but it's been years since I've built a kernel, is there a
> quick-reference somewhere?

Unless you happen to be rather proficient in Italian, I don't have any resource I know well enough to suggest :)

But I had a look at this Fedora wiki page, which seems good:
https://fedoraproject.org/wiki/Building_a_custom_kernel

I'm also willing to make and install a modified kernel for you, if you can give me access to the affected system.
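
(For reference, a rough sketch of the fedpkg-based flow that wiki page describes; the patch name and spec edit are illustrative, not exact:)

$ fedpkg clone -a kernel && cd kernel
$ fedpkg switch-branch f31
$ cp ~/bfq-debug.patch .       # then reference the patch from kernel.spec with a Patch: line
$ fedpkg local                 # build the kernel RPMs locally (takes a while)
# dnf install ./x86_64/kernel-*.rpm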

Comment 17 Chris Evich 2019-11-05 18:59:08 UTC
Oh even easier, yes happy to give you access.  Do you have a ssh key I can add?

Note: There's a /root/repro.sh that will trigger the panic after some time.  However, serial-console and hard-reset access needs a bigger list of permissions, and there's a chance the IP address will change on hard reset.  So it's best to let me run that and copy-paste the details for ya (assuming kdump/kexec can't be made to work).

Comment 18 Chris Evich 2019-11-05 19:00:04 UTC
...or come find me (cevich) on Freenode IRC, and I can set a root password for you.

Comment 19 Paolo 2019-11-06 06:54:59 UTC
(In reply to Chris Evich from comment #17)
> Oh even easier, yes happy to give you access.  Do you have a ssh key I can
> add?
> 
> Note: There's a /root/repro.sh that will trigger the panic after some time. 
> However, serial-console and hard-reset access needs a bigger list of
> permissions.  There's also a chance that the IP address will change on
> hard-reset.  So best let me run that and copy-paste the details for ya
> (assuming kdump/kexec can't be made to work).

Great, I'll send you my key by email.  Then I guess we can proceed privately for a little while, and get back to this thread once we have some progress.

Comment 20 Chris Evich 2019-11-06 14:54:53 UTC
Sounds good.  I'm just starting my day now; I'll grab your key and install it...

Comment 21 Paolo 2019-11-11 09:13:07 UTC
I may have found the bug. I've attached a tentative fix, to be applied on top of the default branch of my dev-bfq repo.

Comment 22 Paolo 2019-11-11 09:14:05 UTC
Created attachment 1634784 [details]
Tentative fix patch, to be applied on top of my dev-bfq branch

Comment 23 Paolo 2019-11-12 17:32:05 UTC
Fix accepted for mainline:
https://www.spinics.net/lists/kernel/msg3313638.html

Comment 24 Pavel Raiskup 2019-11-14 06:36:26 UTC
Is this in 5.3.11 release?

Comment 25 Paolo 2019-11-14 18:43:14 UTC
(In reply to Pavel Raiskup from comment #24)
> Is this in 5.3.11 release?

I guess so.

Comment 26 Paolo 2019-11-14 18:44:49 UTC
The previous fix lacked a check. Fixed fix here:
https://lkml.org/lkml/2019/11/14/199

Comment 27 Pavel Raiskup 2019-11-28 08:46:25 UTC
Is this in 5.3.12 release?

Comment 28 Chris Evich 2020-01-06 21:09:18 UTC
This is not in any Fedora kernel that I can find.  I'm trying to build Paolo's patch into the v5.3.10 source (after reproducing the issue) to see if it fixes it...

Comment 29 Chris Evich 2020-01-07 15:53:52 UTC
Created attachment 1650442 [details]
serial output from kernel soft-locks then hung VM

...nope :( It seems to have made the problem worse.  Now my reproducer spits out tons of CPU soft-lockups before grinding to a halt.  Log attached.

Comment 30 Chris Evich 2020-01-07 16:03:01 UTC
(In reply to Paolo from comment #26)
> The previous fix lacked a check. Fixed fix here:
> https://lkml.org/lkml/2019/11/14/199

Paolo,

I only applied this patch.  Should I have also applied anything else?

In any case, please let me know what you need.  I have a fresh VM and knowledge of how to add a patch and build the Fedora kernel package...and you're pinging me on IRC now...

Comment 31 Chris Evich 2020-01-07 16:14:32 UTC
...work is being tracked now in https://bugzilla.kernel.org/show_bug.cgi?id=205447

Comment 32 Chris Evich 2020-02-03 15:34:45 UTC
All: We now have a fully tested and blessed fix for this bug.  I've tested it using the F31 5.4 kernel source and confirmed that my reproducer no longer triggers the hang & OOPS.

I will also attach the patches, but this is the final LKML thread for reference:

https://lkml.org/lkml/2020/2/3/168

Comment 33 Chris Evich 2020-02-03 15:50:29 UTC
Created attachment 1657370 [details]
Tarball of V2 set of patches

Please double-check the patch files against the LKML content, as I'm fairly new at this kind of thing.

Comment 34 Chris Evich 2020-02-03 15:54:19 UTC
Created attachment 1657371 [details]
Tarball of V2 set of patches

Comment 35 Chris Evich 2020-02-05 16:53:51 UTC
Anything else to do from my end?  I preserved my original reproducer and am happy to help test if needed.

Comment 37 Chris Evich 2020-02-27 15:54:24 UTC
Paolo,

Can the latest patch for this be submitted to 'stable'?  I believe that's the condition needed for it to be accepted for inclusion in Fedora.

Comment 38 Paolo 2020-02-27 16:10:24 UTC
(In reply to Chris Evich from comment #37)
> Paolo,
> 
> Can the latest patch for this be submitted to 'stable'?  I believe that's
> the condition needed for it to be accepted for inclusion in Fedora.

Absolutely! Unfortunately, I don't know the process for that. In my workflow I submit stuff only for mainline and, after it is accepted, I see it being progressively selected and applied to stable.

Comment 39 Paolo 2020-03-03 17:16:12 UTC
I've asked some colleagues at Linaro, and they are willing to help me with this. But first, which stable kernel(s) do you need these fix commits added to?

Comment 40 Justin M. Forbes 2020-03-03 19:47:16 UTC
So typically, if there is a Fixes: tag, the patch can be picked up for stable, though that can take time because it is done by a bot.
If you know at submission time that it needs to go to stable, you can include a Cc: stable@vger.kernel.org in the sign-off area, and it will go to stable without you needing to do anything else, as long as it applies cleanly.
For patches that have already been pulled to mainline without such tags:

Send the patch, after verifying that it follows the above rules, to
stable@vger.kernel.org.  You must note the upstream commit ID in the
changelog of your submission, as well as the kernel version you wish
it to be applied to. If the patch deviates from the original
upstream patch (for example because it had to be backported), this must be very
clearly documented and justified in the patch description.

In this case it would be 5.4 and 5.5
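
(For illustration, the trailers described above look like this in a commit message; the commit ID, subject, and author are placeholders:)

Fixes: <upstream-commit-id> ("block, bfq: subject of the broken commit")
Cc: <stable@vger.kernel.org> # 5.4.x, 5.5.x
Signed-off-by: Developer Name <developer@example.com>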

Comment 41 Justin M. Forbes 2020-03-03 19:51:14 UTC
*** Bug 1768092 has been marked as a duplicate of this bug. ***

Comment 42 Chris Evich 2020-03-05 18:41:51 UTC
(In reply to Justin M. Forbes from comment #40)
> In this case it would be 5.4 and 5.5

Upstream would like to know: is 5.5.6 okay for inclusion in F31 and beyond?

If so, I still have my original reproducer VMs available.  All I need is the ability to pull down the 5.5.6 kernel with the patches applied, using 'fedpkg'.  Then I can build and test it fairly quickly (hours).

Comment 43 Paolo 2020-03-06 18:36:29 UTC
The email thread in which I requested these fixes be ported to the stable branches is now archived:
https://www.spinics.net/lists/stable/msg371209.html

As you can see, these fixes are now available for 5.4-stable and 5.5-stable.

If someone needs to drop in, I can send an email with them in CC.

Thanks,
Paolo

Comment 44 Chris Evich 2020-03-11 15:12:14 UTC
Jeff,

So now that it's in 5.4-stable and beyond, is anything needed to have this picked up in F30 and F31?


My motivation comes from wanting to remove a workaround I have in place, deep within some automation machinery I maintain.  Though I'm sure humans will also appreciate never hitting this difficult-to-diagnose kernel panic :D

Comment 45 Chris Evich 2020-04-13 20:32:55 UTC
ping

Comment 46 Ben Cotton 2020-11-03 17:18:54 UTC
This message is a reminder that Fedora 31 is nearing its end of life.
Fedora will stop maintaining and issuing updates for Fedora 31 on 2020-11-24.
It is Fedora's policy to close all bug reports from releases that are no longer
maintained. At that time this bug will be closed as EOL if it remains open with a
Fedora 'version' of '31'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version' 
to a later Fedora version.

Thank you for reporting this issue, and we are sorry that we were not
able to fix it before Fedora 31 reached end of life. If you would still like
to see this bug fixed and are able to reproduce it against a later version
of Fedora, you are encouraged to change the 'version' to a later Fedora
version before this bug is closed, as described in the policy above.

Although we aim to fix as many bugs as possible during every release's 
lifetime, sometimes those efforts are overtaken by events. Often a 
more recent Fedora release includes newer upstream software that fixes 
bugs or makes them obsolete.

Comment 47 Ben Cotton 2020-11-24 20:22:58 UTC
Fedora 31 changed to end-of-life (EOL) status on 2020-11-24. Fedora 31 is
no longer maintained, which means that it will not receive any further
security or bug fix updates. As a result we are closing this bug.

If you can reproduce this bug against a currently maintained version of
Fedora please feel free to reopen this bug against that version. If you
are unable to reopen this bug, please file a new report against the
current release. If you experience problems, please add a comment to this
bug.

Thank you for reporting this bug and we are sorry it could not be fixed.

Comment 48 Red Hat Bugzilla 2023-09-14 05:45:17 UTC
The needinfo request[s] on this closed bug have been removed, as they had been unresolved for 1000 days.

