Bug 821090 - Spontaneous reboots occurring while testing RHEL-7.0-20120508.n.0 on x86_64.
Spontaneous reboots occurring while testing RHEL-7.0-20120508.n.0 on x86_64.
Status: CLOSED NOTABUG
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: acpid (Show other bugs)
7.0
x86_64 Linux
high Severity high
: alpha
: ---
Assigned To: Jaroslav Škarvada
qe-baseos-daemons
: TestBlocker
Depends On:
Blocks:
  Show dependency treegraph
 
Reported: 2012-05-11 16:28 EDT by Barry Donahue
Modified: 2012-05-15 09:35 EDT (History)
2 users (show)

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2012-05-15 09:35:41 EDT
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments (Terms of Use)

  None (edit)
Description Barry Donahue 2012-05-11 16:28:56 EDT
I'm trying to run the pNATE regression suite on RHEL 7. While pNATE is still in the process of configuring the client systems, they reboot on their own. Following are the only clues I see. This is very reproducible.

The last messages out of the console are:

May 11 15:56:58 pnate-client-01 abrtd: Got signal 15, exiting
May 11 15:56:58 pnate-client-01 acpid: exiting
May 11 15:56:58 pnate-client-01 smartd[2033]: smartd received signal 15: Terminated
May 11 15:56:58 pnate-client-01 kernel: Kernel logging (proc) stopped.
May 11 15:56:58 pnate-client-01 rsyslogd: [origin software="rsyslogd" swVersion="5.8.10" x-pid="2204" x-info="http://www.rsyslog.com"] exiting on signal 15.

Broadcast message from root@pnate-client-01.lab.bos.redhat.com on pts/0 (Fri, 11 May 2012 15:56:58 -0400):

The system is going down for reboot NOW!

The only interesting things in dmesg are:

[   76.423414] ACPI Error: No handler for Region [IPMI] (ffff880220114320) [IPMI] (20120111/evregion-376)
[   76.423421] ACPI Error: Region IPMI (ID=7) has no handler (20120111/exfldio-306)
[   76.423495] ACPI Error: Method parse/execution failed [\_SB_.PMI0._GHL] (Node ffff880220126b20), AE_NOT_EXIST (20120111/psparse-536)
[   76.423817] ACPI Error: Method parse/execution failed [\_SB_.PMI0._PMC] (Node ffff880220126e00), AE_NOT_EXIST (20120111/psparse-536)
[   76.424331] ACPI Exception: AE_NOT_EXIST, Evaluating _PMC (20120111/power_meter-773)
[   76.740760] EDAC MC: Ver: 2.1.0
[   76.757410] EDAC MC0: Giving out device to 'i7core_edac.c' 'i7 core #0': DEV 0000:fe:03.0
[   76.763334] EDAC PCI0: Giving out device to module 'i7core_edac' controller 'EDAC PCI controller': DEV '0000:fe:03.0' (POLLED)
[   76.764363] 
[   76.764372] ===============================
[   76.764381] [ INFO: suspicious RCU usage. ]
[   76.764388] 3.3.0-0.12.el7.x86_64 #1 Not tainted
[   76.764395] -------------------------------
[   76.764402] arch/x86/kernel/cpu/mcheck/mce.c:194 suspicious rcu_dereference_index_check() usage!
[   76.764414] 
[   76.764414] other info that might help us debug this:
[   76.764416] 
[   76.764425] 
[   76.764426] rcu_scheduler_active = 1, debug_locks = 0
[   76.764435] 3 locks held by udevd/436:
[   76.764441]  #0:  (&__lockdep_no_validate__){......}, at: [<ffffffff8141c0ab>] __driver_attach+0x5b/0xb0
[   76.764462]  #1:  (&__lockdep_no_validate__){......}, at: [<ffffffff8141c0b9>] __driver_attach+0x69/0xb0
[   76.764479]  #2:  (i7core_edac_lock){+.+.+.}, at: [<ffffffffa008db26>] i7core_probe+0x22/0xd8f [i7core_edac]
[   76.764498] 
[   76.764499] stack backtrace:
[   76.764507] Pid: 436, comm: udevd Not tainted 3.3.0-0.12.el7.x86_64 #1
[   76.764515] Call Trace:
[   76.764523]  [<ffffffff810cbe45>] lockdep_rcu_suspicious+0xe5/0x100
[   76.764533]  [<ffffffff810301c0>] mce_register_decode_chain+0x1c0/0x1e0
[   76.764545]  [<ffffffff814e2707>] ? dmi_walk+0x67/0xa0
[   76.764554]  [<ffffffffa008e746>] i7core_probe+0xc42/0xd8f [i7core_edac]
[   76.764567]  [<ffffffff810a5f15>] ? sched_clock_cpu+0xc5/0x120
[   76.764578]  [<ffffffff81354a5c>] local_pci_probe+0x5c/0xd0
[   76.764586]  [<ffffffff81354be1>] pci_device_probe+0x111/0x120
[   76.764595]  [<ffffffff8141bdf6>] driver_probe_device+0x96/0x2f0
[   76.764604]  [<ffffffff8141c0fb>] __driver_attach+0xab/0xb0
[   76.764613]  [<ffffffff8141c050>] ? driver_probe_device+0x2f0/0x2f0
[   76.764622]  [<ffffffff81419ff5>] bus_for_each_dev+0x55/0x90
[   76.764631]  [<ffffffffa0102000>] ? 0xffffffffa0101fff
[   76.764640]  [<ffffffff8141b8de>] driver_attach+0x1e/0x20
[   76.764648]  [<ffffffff8141b5e8>] bus_add_driver+0x1b8/0x2b0
[   76.764656]  [<ffffffffa0102000>] ? 0xffffffffa0101fff
[   76.764663]  [<ffffffff8141c8d7>] driver_register+0x77/0x160
[   76.764670]  [<ffffffffa0102000>] ? 0xffffffffa0101fff
[   76.764678]  [<ffffffff81353833>] __pci_register_driver+0x73/0xf0
[   76.764686]  [<ffffffffa0102000>] ? 0xffffffffa0101fff
[   76.764694]  [<ffffffffa010203f>] i7core_init+0x3f/0x1000 [i7core_edac]
[   76.764705]  [<ffffffff8100212a>] do_one_initcall+0x12a/0x180
[   76.764713]  [<ffffffff810dee00>] sys_init_module+0xc0/0x220
[   76.764724]  [<ffffffff8166b8e9>] system_call_fastpath+0x16/0x1b
[   76.764731] EDAC i7core: Driver loaded, 1 memory controller(s) found.


   This may or may not be related to the pNATE tests trying to start up the ntpd service to sync up the logs. It could be a pure coincidence.

# systemctl status ntpd.service
ntpd.service
	  Loaded: error (Reason: No such file or directory)
	  Active: inactive (dead)
Comment 1 Jaroslav Škarvada 2012-05-14 05:43:19 EDT
The RCU related backtrace maybe dupe of bug 789644 but AFAIK it shouldn't cause reboots.

It doesn't seem to be related to acpid component, because it seems it got SIGTERM like others due to previously initiated reboot:

> May 11 15:56:58 pnate-client-01 abrtd: Got signal 15, exiting
> May 11 15:56:58 pnate-client-01 acpid: exiting
> May 11 15:56:58 pnate-client-01 smartd[2033]: smartd received signal 15:
> Terminated
Comment 2 Jaroslav Škarvada 2012-05-14 06:11:00 EDT
Is there anything more before (maybe in /var/log/messages):
May 11 15:56:58 pnate-client-01 abrtd: Got signal 15, exiting
?
Comment 3 Barry Donahue 2012-05-14 09:41:58 EDT
These are the messages that precede the earlier messages posted.

May 11 15:49:04 pnate-client-02 systemd-logind[2046]: New session 1 of user root.
May 11 15:51:18 pnate-client-02 systemd-logind[2046]: New session 3 of user root.



May 11 15:54:59 pnate-client-02 systemd-logind[2046]: Removed session 1.
May 11 15:56:19 pnate-client-02 systemd-logind[2046]: New session 4 of user root.
May 11 15:56:30 pnate-client-02 kernel: [  732.865363] sanlun used greatest stack depth: 2944 bytes left
May 11 15:56:49 pnate-client-02 systemd-logind[2046]: New session 5 of user root.
May 11 15:56:54 pnate-client-02 systemd-logind[2046]: Removed session 5.
May 11 15:56:58 pnate-client-02 acpid: exiting
Comment 6 Jaroslav Škarvada 2012-05-14 11:06:00 EDT
The kernel patch from bug 789644 doesn't seem to be in RHEL-7 kernel. It should at least quiet the RCU backtrace.

Could you run acpid with -ld command line parameters? It would be more verbose about events.
Comment 7 Jaroslav Škarvada 2012-05-14 11:11:31 EDT
Cloned bug 789644 as bug 821464.

Note You need to log in before you can comment on or make changes to this bug.