Bug 807019

Summary: kernel 3.3.0 (including rcs) broken on ARM: Segfaults and oom-killer during boot.
Product: [Fedora] Fedora
Reporter: Brendan Conoboy <blc>
Component: kernel
Assignee: Jon Masters <jcm>
Status: CLOSED CURRENTRELEASE
QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: high
Docs Contact:
Priority: unspecified
Version: 17
CC: alexvillacislasso, dmarlin, gansalmon, itamar, jdisnard, jonathan, kernel-maint, madhu.chinakonda, pbrobinson, wcohen
Target Milestone: ---   
Target Release: ---   
Hardware: arm   
OS: Linux   
Whiteboard: first=3.3 audit arm
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2012-06-28 12:16:36 UTC
Type: ---
Bug Depends On:    
Bug Blocks: 245418    
Attachments:
Log of a kernel 3.3.0-based kernel booting on a trimslice, going badly. (flags: none)
Log of a kernel 2.6.41-based kernel booting on a trimslice, operating normally (flags: none)
boot log for trimslice kernel-3.3 built using gcc-4.6 (flags: none)
Log output with systemd debugging enabled (flags: none)

Description Brendan Conoboy 2012-03-26 19:10:29 UTC
Created attachment 572849 [details]
Log of a kernel 3.3.0-based kernel booting on a trimslice, going badly.

Description of problem:

When booting kernel-tegra-3.3.0-5 on a Trimslice the system fails to come up fully.  Warnings start up at the 0.0 timestamp between NR_IRQS and sched_clock messages.  Segmentation faults begin when starting the network.  Shortly thereafter systemd invokes the oom-killer and backtraces start flying.

Version-Release number of selected component (if applicable):

Present in 3.3.0-5.  Was also a problem in earlier 3.3.0-rc4 kernel.

How reproducible:

Every time.

Steps to Reproduce:
1. Install kernel-tegra-3.3-*
2. Boot
3. Watch the fireworks

Additional info:

I'm attaching two logs. The first is the boot log from the bad 3.3.0 kernel. The second is a successful boot of the same system using a "2.6.41" kernel from F15.

Comment 1 Brendan Conoboy 2012-03-26 19:11:54 UTC
Created attachment 572851 [details]
Log of a kernel 2.6.41-based kernel booting on a trimslice, operating normally

Comment 2 D. Marlin 2012-03-29 00:04:10 UTC
I tried booting a Panda Board (ARM OMAP4) using the 3.3.0-0.rc4.git3.1.fc17.armv7hl.omap kernel with an F17 armv7hl rootfs, and I see the same type of errors as I saw on the Trim Slice (Tegra2), i.e.,

[   39.046722] BUG: sleeping function called from invalid context at include/linux/freezer.h:46
[   39.051757] in_atomic(): 0, irqs_disabled(): 128, pid: 1417, name: systemd
[   39.051757] INFO: lockdep is turned off.
[   39.051757] irq event stamp: 0
[   39.051757] hardirqs last  enabled at (0): [<  (null)>]   (null)
[   39.072662] hardirqs last disabled at (0): [<c003b7f0>] copy_process.part.35+0x3dc/0x117c
[   39.085052] softirqs last  enabled at (0): [<c003b7f0>] copy_process.part.35+0x3dc/0x117c
[   39.085449] softirqs last disabled at (0): [<  (null)>]   (null)
[   39.100006] [<c0015b24>] (unwind_backtrace+0x0/0xf8) from [<c0010bac>] (do_signal+0x40/0x550)
[   39.103851] [<c0010bac>] (do_signal+0x40/0x550) from [<c0011654>] (do_notify_resume+0x54/0x60)
[   39.111694] [<c0011654>] (do_notify_resume+0x54/0x60) from [<c000dec4>] (work_pending+0x24/0x28)
        :
[ 1108.801422] Out of memory: Kill process 368 (systemd) score 1 or sacrifice child
[ 1108.810119] Killed process 375 (systemd) total-vm:5408kB, anon-rss:1320kB, file-rss:68kB


so this issue does not seem to be isolated to Tegra-based systems.
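
To compare boot logs across boards, the failure signatures quoted above can be counted with a short script. This is only a rough sketch: the marker strings are copied from the log excerpts in this bug, while the function name, the marker labels, and the embedded sample are illustrative assumptions, not part of any existing tool.

```python
import re
from collections import Counter

# Marker strings taken from the log excerpts in this bug report.
# The dictionary keys are hypothetical labels chosen for this sketch.
MARKERS = {
    "sleeping_in_atomic": re.compile(
        r"BUG: sleeping function called from invalid context"),
    "oom_kill": re.compile(r"Out of memory: Kill process \d+"),
    "segfault": re.compile(r"Segmentation fault"),
}


def count_markers(log_text):
    """Count how many log lines match each failure marker."""
    counts = Counter()
    for line in log_text.splitlines():
        for name, pattern in MARKERS.items():
            if pattern.search(line):
                counts[name] += 1
    return counts


# Three lines copied from the Panda Board excerpt above.
sample = """\
[   39.046722] BUG: sleeping function called from invalid context at include/linux/freezer.h:46
[ 1108.801422] Out of memory: Kill process 368 (systemd) score 1 or sacrifice child
[ 1108.810119] Killed process 375 (systemd) total-vm:5408kB, anon-rss:1320kB, file-rss:68kB
"""

counts = count_markers(sample)
for name in MARKERS:
    print(name, counts[name])
```

Running the same function over the full attached logs from each board would show whether the same signatures appear on Tegra, OMAP, and Kirkwood.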

Comment 3 D. Marlin 2012-03-29 17:55:09 UTC
Created attachment 573756 [details]
boot log for trimslice kernel-3.3 built using gcc-4.6

This is the boot log from a Trim Slice using an upstream 3.3 kernel and Fedora Tegra kernel config, built with gcc version 4.6.1 20110908 (Red Hat 4.6.1-9).

Comment 4 Guy Streeter 2012-04-12 20:06:12 UTC
The systemd fork-bomb is no longer happening on my dreamplug. The only difference I know of is I have done a yum update recently. I'm still running
fedora-arm 3.3.0-0.rc4.git3.1.fc17.armv5tel.kirkwood

I am still getting "sleeping function called from invalid context" traces pretty much constantly.

I have to boot with libertas.disable=1 or udev hangs. I'm investigating that.

Comment 5 Guy Streeter 2012-04-12 20:32:17 UTC
I spoke too soon. I just rebooted and the systemd fork-bomb is back. I don't know why it worked once.

Comment 6 William Cohen 2012-04-17 21:25:22 UTC
Created attachment 578158 [details]
Log output with systemd debugging enabled

According to http://fedoraproject.org/wiki/How_to_debug_Systemd_problems there are kernel boot parameters that can be set to find out what systemd is doing.  The attached log was generated with "systemd.log_level=debug systemd.log_target=console" on the command line of a linux-3.4.0-rc4 kernel that exhibits the problem.

Looking at the PIDs in the "Forked ... as <pid>" lines, one sees that the PID numbers increase very quickly after NetworkManager starts:

Forked /sbin/auditd as 308
Forked /sbin/auditctl as 309
Forked /usr/sbin/rootfs-resize as 312
Forked /usr/sbin/NetworkManager as 318
Forked /usr/lib/systemd/systemd-logind as 637
Forked /usr/lib/systemd/systemd-user-sessions as 780
Forked /bin/plymouth as 951
Forked /bin/plymouth as 1131

For a non-problem kernel (2.6.42.12-1.fc15.armv7hl.tegra) see:

Forked /sbin/auditd as 344
Forked /sbin/auditctl as 345
Forked /usr/sbin/rootfs-resize as 348
Forked /usr/sbin/NetworkManager as 354
Forked /usr/lib/systemd/systemd-logind as 356
Forked /usr/lib/systemd/systemd-user-sessions as 357
Forked /bin/plymouth as 358
Forked /bin/plymouth as 360
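
The difference between the two listings can be quantified by computing the PID gap between consecutive "Forked" lines. The following is a minimal sketch, assuming the systemd debug output always has the form "Forked <path> as <pid>" as in the excerpts above; the function names are made up for illustration.

```python
import re

# Matches lines such as "Forked /sbin/auditd as 308".
FORKED_RE = re.compile(r"Forked (\S+) as (\d+)")


def forked_pids(log_text):
    """Return (path, pid) pairs for every 'Forked <path> as <pid>' line."""
    return [(m.group(1), int(m.group(2)))
            for m in FORKED_RE.finditer(log_text)]


def pid_gaps(pairs):
    """Return the PID difference between consecutive forked services."""
    pids = [pid for _, pid in pairs]
    return [later - earlier for earlier, later in zip(pids, pids[1:])]


# Excerpt from the problem kernel (linux-3.4.0-rc4), copied from this comment.
bad_log = """\
Forked /sbin/auditd as 308
Forked /sbin/auditctl as 309
Forked /usr/sbin/rootfs-resize as 312
Forked /usr/sbin/NetworkManager as 318
Forked /usr/lib/systemd/systemd-logind as 637
Forked /usr/lib/systemd/systemd-user-sessions as 780
Forked /bin/plymouth as 951
Forked /bin/plymouth as 1131
"""

# Excerpt from the working kernel (2.6.42.12-1.fc15.armv7hl.tegra).
good_log = """\
Forked /sbin/auditd as 344
Forked /sbin/auditctl as 345
Forked /usr/sbin/rootfs-resize as 348
Forked /usr/sbin/NetworkManager as 354
Forked /usr/lib/systemd/systemd-logind as 356
Forked /usr/lib/systemd/systemd-user-sessions as 357
Forked /bin/plymouth as 358
Forked /bin/plymouth as 360
"""

bad_gaps = pid_gaps(forked_pids(bad_log))
good_gaps = pid_gaps(forked_pids(good_log))
print("problem kernel gaps:", bad_gaps)   # [1, 3, 6, 319, 143, 171, 180]
print("working kernel gaps:", good_gaps)  # [1, 3, 6, 2, 1, 1, 2]
```

Up to NetworkManager the gaps are identical on both kernels. After it, the problem kernel consumes 813 PIDs across four forks, while the working kernel consumes 6.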

In attachment 572849 ("Log of a kernel 3.3.0-based kernel booting on a trimslice, going badly.") there are a number of networking-related segmentation faults, such as:

network[301]: /etc/sysconfig/network-scripts/init.ipv6-global: line 67:   461 Segmentation fault      /sbin/sysctl -e -w net.ipv6.conf.$sinter

Is there some way to turn off NetworkManager and see if the problem kernel boots up then?

Comment 7 Peter Robinson 2012-04-17 21:36:46 UTC
You can turn off NM and enable the old "network" service. Also make sure you have openssl 1.0.0 and not 1.0.1, as the latter has issues.

Comment 8 Brendan Conoboy 2012-04-18 00:18:00 UTC
Disabling both NM and network does not stop the OOM storm.

Comment 9 Guy Streeter 2012-04-18 20:08:15 UTC
Booting with a read-only root fs seems somehow to avoid the systemd fork-bomb. I don't know if that's a useful datapoint, but there it is. I can get up to multi-user that way, and then remount the rootfs as rw.

Comment 10 Brendan Conoboy 2012-04-18 20:26:36 UTC
Tried booting trimslice ro, no help.  Perhaps this is only effective on armv5tel or only on kirkwood kernels.

Comment 11 William Cohen 2012-04-20 13:40:22 UTC
On #fedora-arm, jonmasters mentioned that turning off auditd allows a 3.3/3.4 kernel to boot. I verified that Linux 3.4.0-rc3+ boots after turning off audit with:

# systemctl disable auditd.service

Comment 12 Jon Disnard 2012-04-20 21:09:31 UTC
As a workaround, you might append "audit=0" to the kernel cmdline.
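
After rebooting, one way to confirm that the parameter actually reached the kernel is to inspect /proc/cmdline. The snippet below is a small sketch under two assumptions: that a later "audit=" token overrides an earlier one, and that the helper names (which are hypothetical) suit your environment. The root device in the example string is also made up.

```python
def audit_disabled(cmdline):
    """Return True if the last audit= token on the command line is audit=0."""
    value = None
    for token in cmdline.split():
        if token.startswith("audit="):
            # Assumption: a later token overrides an earlier one.
            value = token[len("audit="):]
    return value == "0"


def read_cmdline(path="/proc/cmdline"):
    """Read the running kernel's command line (Linux only)."""
    with open(path) as f:
        return f.read()


# Example with a hypothetical command line; the root device is made up.
example = "root=/dev/mmcblk0p1 ro audit=0"
print(audit_disabled(example))

if __name__ == "__main__":
    try:
        print("audit disabled on this boot:", audit_disabled(read_cmdline()))
    except OSError:
        print("/proc/cmdline is not available on this system")
```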

Comment 13 Peter Robinson 2012-06-28 12:16:36 UTC
A patch was sent upstream that stopped the crash when audit was enabled, but there were still issues with audit on ARM; I'm not sure whether it has been properly fixed. 3.4.x doesn't crash and boots fine on both PandaBoard and Trimslice.