Bug 459478 - [FOCUS] Long boot time and strange Hardware Clock message
[FOCUS] Long boot time and strange Hardware Clock message
Status: CLOSED ERRATA
Product: Red Hat Enterprise MRG
Classification: Red Hat
Component: realtime-kernel (Show other bugs)
1.0
x86_64 All
medium Severity high
: 1.0.3
: ---
Assigned To: Red Hat Real Time Maintenance
:
Depends On:
Blocks:
  Show dependency treegraph
 
Reported: 2008-08-19 05:03 EDT by IBM Bug Proxy
Modified: 2008-11-03 07:35 EST (History)
5 users (show)

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2008-10-07 15:21:02 EDT
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments (Terms of Use)
Normal kernel boot log, where no problem is seen (29.77 KB, text/plain)
2008-08-19 05:03 EDT, IBM Bug Proxy
no flags Details
Debug kernel boot log, where the problem is seen (41.55 KB, text/plain)
2008-08-19 05:03 EDT, IBM Bug Proxy
no flags Details

  None (edit)
Description IBM Bug Proxy 2008-08-19 05:03:33 EDT
=Comment: #0=================================================
JEFFREY D. LANE <jdlane@us.ibm.com> - 2008-07-09 16:10 EDT
Problem description:

Installed RHEL 5.2 x86_64 base on x3455.  Followed that up with an install of
the R2 RC1 tarball and a reboot into the R2 kernel.  

The first issue of note is that the boot takes FOREVER once the kernel starts
loading.  As of this writing, in fact, the machine has been in the boot/startup
process for over 15 minutes and is currently stuck at the "Starting udev"
prompt.  It may well be locked up completely, but I want to give it more time
before saying that.

This issue was NOT seen when booting R2 on a Z Pro system.

The second issue is that I noticed a weird hardware clock message immediatly
following the "I for Interactive Startup" message.  Basically, the system boots
and then the kernel is uncompressed and booted. 

Next the Red Hat nash startup message appears.  Then there is a pause for a few
minutes.

Following that, the system displays the "Reading logical volumes" message and
then the "Found volume group" message.

Next comes the "Welcome to Red Hat" messages and the "Press 'I' for interactive
startup" messages.

Immediately after that, the system displays this message after sitting for
several minutes:

"Cannot access the Hardware Clock via any known method.  Use the --debug option
to see the details of our search for an access method."

Ironically, this is then followed up by the clock set message:

"Setting Clock (utc):  Wed Jul 9  15:41:07 EDT 2008   [OK]"

then we finally get to starting udev and that's where it stops.  It appears that
the machine is stuck at udev at this point in some sort of churn.  It appears
frozen, and even appeared as though it no longer accepted keyboard interrupts,
however, after hitting CapsLock and walking away for a few minutes, on my return
I noticed that the CapsLock light had lit up.  So, I used sysrq to trigger a
core dump. 

About 30-45 seconds after hitting sysrq-c, the sysrq core dump trigger message
appeared.  I left the system sitting at that point, hopefully kdump will fire
and I'll have a core dump to share.

The imporant thing here is that while the system appears frozen to me, it DOES
still accept keyboard input, so maybe it's just overwhelmed.

If this is a customer issue, please indicate the impact to the customer:

Well, system won't boot, so I'd say that's pretty impactful.


If this is not an installation problem,
       Describe any custom patches installed.
       R2 RC1 kernel and packages

       Provide output from "uname -a", if possible: Unable.  system not fully
booted.


Hardware Environment
    Machine type (p650, x235, SF2, etc.): x3455
    Cpu type (Power4, Power5, IA-64, etc.): 2x dual core Xeon
    Describe any special hardware you think might be relevant to this problem:
    system has a Broadcom PCIe NIC and a QLogic 4GB FC PCIe adapter installed
currently.

    48GB RAM.


Is this reproducible?

    Not sure.  This is the first boot after installing R2.  I'm going to try one
more time, then I'm going to yank the cards and try it again.

    If so, how long does it (did it) take to reproduce it?

    Boot process is agonizingly slow and then locks up while starting udev.  So
not far into the boot process, but it took 5 -10 mintues to get there.

    Describe the steps: 
    Boot the system
    note the Hardware Clock message
    Note the udev message
    note that you no longer have keyboard interrupts.
    PROFIT!


Did the system produce an OOPS message on the console?
    If so, copy it here:
    No, but I MAY be able to get a kdump core


Is the system sitting in a debugger right now?  No
    If so, how long may it stay there?


Additional information:
=Comment: #1=================================================
John G. Stultz <jstultz@us.ibm.com> - 2008-07-09 16:16 EDT
Hrmmm.. I was just looking at a similar community issue. Does booting w/
pci=noacpi work around the issue?
=Comment: #2=================================================
JEFFREY D. LANE <jdlane@us.ibm.com> - 2008-07-09 18:35 EDT
Ok... good news, bad news...

The good news.  The non-booting thing so far seems to only be the debug kernel.
 On boot, I had the default set to 0 in grub thinking that the first entry was
the regular RT kernel.  Turns out, 0 is the debug kernel.  So I reset grub to
boot entry 3, which is the RT kernel.  It booted without incident and I have
stress running on it now.

The bad news...  pci=noacip did not fix the clock issue.

hwclock --debug says:

hwclock from util-linux-2.13-pre7
hwclock: Open of /dev/rtc failed, errno=19: No such device
No usable clock interface found.
Cannot access the Hardware Clock via any known method.

I've only seen this RTC issue on x86_64 systems when running 32bit OSs... I've
not encountered this on a 64bit OS until now.  

To work around the problem previously, we used pnpacpi=off

I also saw this happen on an x3950 system w/ 32bit RHEL 4.7 but the pnpacpi
parameter did not fix it there.

=Comment: #5=================================================
Sripathi Kodi <sripathi@in.ibm.com> - 2008-08-04 05:40 EDT
I have seen a problem very similar to this (may be the same) on LS20 today. I
was trying to boot into the R2-RC2 debug kernel and run some tests on it, but
the system stopped booting up after printing the following messages:

Setting up new root fs
no fstab.sys, mounting internal defaults
Switching to new root and running init.
unmounting old /dev
unmounting old /proc
unmounting old /sys
SELinux:  Disabled at runtime.
audit(1217842227.809:2): selinux=0 auid=4294967295
INIT: version 2.86 booting
                Welcome to Red Hat Enterprise Linux Server
                Press 'I' to enter interactive startup.
Cannot access the Hardware Clock via any known method.
Use the --debug option to see the details of our search for an access method.
date used greatest stack depth: 2816 bytes left
Setting clock  (localtime): Mon Aug  4 05:30:32 EDT 2008 [  OK  ]
Starting udev: 

I will wait for a while to see if the system will recover and continue to boot.
=Comment: #6=================================================
Sripathi Kodi <sripathi@in.ibm.com> - 2008-08-04 06:48 EDT
After what must have been around 45 minutes, the system booted up. It is as if
the system was too slow. Every initscript was taking ages to complete. Finally,
when the system booted up, I see the following message in dmesg:

BUG: MAX_STACK_TRACE_ENTRIES too low!
turning off the locking correctness validator.
Pid: 2112, comm: ip Not tainted 2.6.24.7-74ibmrt2.5debug #1

Call Trace:
 [<ffffffff810146b5>] ? save_stack_trace+0x2a/0x49
 [<ffffffff8105d851>] save_trace+0x93/0x9b
 [<ffffffff8105d8d7>] add_lock_to_list+0x7e/0xac
 [<ffffffff81060eb9>] __lock_acquire+0xb43/0xcdc
 [<ffffffff81067443>] ? rt_mutex_slowtrylock+0x18/0x85
 [<ffffffff810610e0>] lock_acquire+0x8e/0xb2
 [<ffffffff81067443>] ? rt_mutex_slowtrylock+0x18/0x85
 [<ffffffff812a7bd2>] __spin_lock_irqsave+0x40/0x73
 [<ffffffff81067443>] rt_mutex_slowtrylock+0x18/0x85
 [<ffffffff812a5694>] rt_mutex_trylock+0x9/0xb
 [<ffffffff812a7105>] rt_spin_lock+0x31/0x56
 [<ffffffff8127e194>] ip_mc_inc_group+0x176/0x232
 [<ffffffff8127e296>] ip_mc_up+0x46/0x64
 [<ffffffff81279947>] inetdev_event+0x263/0x470
 [<ffffffff810882d0>] ? __rcu_read_unlock+0x8c/0x95
 [<ffffffff812aa943>] notifier_call_chain+0x33/0x5b
 [<ffffffff81058001>] __raw_notifier_call_chain+0x9/0xb
 [<ffffffff81058012>] raw_notifier_call_chain+0xf/0x11
 [<ffffffff812305c2>] call_netdevice_notifiers+0x16/0x18
 [<ffffffff81231f2f>] dev_open+0x80/0x88
 [<ffffffff8123072f>] dev_change_flags+0xaf/0x16b
 [<ffffffff81279ee3>] devinet_ioctl+0x267/0x5f2
 [<ffffffff8127a686>] inet_ioctl+0x82/0xa0
 [<ffffffff81223dc1>] sock_ioctl+0x1e7/0x20c
 [<ffffffff810ce955>] do_ioctl+0x2d/0x83
 [<ffffffff810cec20>] vfs_ioctl+0x275/0x292
 [<ffffffff812a6a5b>] ? trace_hardirqs_on_thunk+0x3a/0x3f
 [<ffffffff810cec94>] sys_ioctl+0x57/0x7b
 [<ffffffff8100c248>] ? system_call+0xb8/0xef
 [<ffffffff8100c27f>] system_call_ret+0x0/0x6d

INFO: lockdep is turned off.
---------------------------
| preempt count: 00000001 ]
| 1-level deep critical section nesting:
----------------------------------------
.. [<ffffffff812a7bb5>] .... __spin_lock_irqsave+0x23/0x73
.....[<ffffffff81067443>] ..   ( <= rt_mutex_slowtrylock+0x18/0x85)

The BUG above has been reported in bug 44149. I will add a comment to that bug
as well.
=Comment: #7=================================================
Sripathi Kodi <sripathi@in.ibm.com> - 2008-08-05 06:15 EDT
I saw this problem on HS21 as well. The debug kernel just doesn't seem to boot
alright. This time, I saw a few messages from lockdep that look scary. I am
going to dig a bit more at this one. I will attach normal and debug kernel boot
logs to this bug.

I am also updating the bug summary to indicate that this problem is not limited
to x3455. We have confirmed this problem on LS20, HS21 and x3455 so far.
=Comment: #8=================================================
Sripathi Kodi <sripathi@in.ibm.com> - 2008-08-05 06:15 EDT

Normal kernel boot log, where no problem is seen

=Comment: #9=================================================
Sripathi Kodi <sripathi@in.ibm.com> - 2008-08-05 06:16 EDT

Debug kernel boot log, where the problem is seen

=Comment: #10=================================================
Sripathi Kodi <sripathi@in.ibm.com> - 2008-08-05 08:28 EDT
Hmm... The MRG kernel shows lockdep errors, but doesn't take that long to boot
up. Strange.
=Comment: #11=================================================
Sripathi Kodi <sripathi@in.ibm.com> - 2008-08-05 08:29 EDT
Marking this one FOCUS, as I will be looking at this a bit more. This problem
could affect anyone.
=Comment: #13=================================================
Sripathi Kodi <sripathi@in.ibm.com> - 2008-08-08 02:55 EDT
The message

"Cannot access the Hardware Clock via any known method.  Use the --debug option
to see the details of our search for an access method.""Cannot access the
Hardware Clock via any known method.  Use the --debug option
to see the details of our search for an access method."

seems to be because RTC is not enabled in the kernel configuration. To confirm
whether CONFIG_RTC is needed for us, I am adding John Stultz to the CC list.
John, could you please comment on this?

We still need to find out why it takes so long to boot debug kernel.
=Comment: #14=================================================
John G. Stultz <jstultz@us.ibm.com> - 2008-08-08 14:26 EDT
Hrmm. So does the R2 normal kernel have RTC enabled, but the debug doesn't?

I'm not as familiar with this code, as its been reworked recently (causing a
number of these sorts of messages). 
=Comment: #15=================================================
Sripathi Kodi <sripathi@in.ibm.com> - 2008-08-11 02:36 EDT
(In reply to comment #14)
> Hrmm. So does the R2 normal kernel have RTC enabled, but the debug doesn't?

Neither normal nor debug kernel have CONFIG_RTC enabled. My understanding
(though not 100% sure) is that we don't have any applications that use RTC, so
not having it enabled doesn't hurt us. Also, long boot up time reported in this
bug may not be related to this message about hwclock.
=Comment: #16=================================================
Sripathi Kodi <sripathi@in.ibm.com> - 2008-08-11 02:36 EDT
I enabled CONFIG_RTC and confirmed that the messages about hwclock go away.
=Comment: #17=================================================
Sripathi Kodi <sripathi@in.ibm.com> - 2008-08-11 04:58 EDT
(In reply to comment #16)
> I enabled CONFIG_RTC and confirmed that the messages about hwclock go away.

and confirmed that this has nothing to do with the problem of long boot time we
are seeing.
=Comment: #18=================================================
Sripathi Kodi <sripathi@in.ibm.com> - 2008-08-11 07:46 EDT
The problem is because of CONFIG_PROVE_LOCKING. Digging this a bit more.
=Comment: #19=================================================
Sripathi Kodi <sripathi@in.ibm.com> - 2008-08-13 02:54 EDT
This problem does not exist with vanilla 2.6.26. I tried 2.6.16-rt1, I hit
another problem with lockdep and I have reported it on linux-rt-users.
Comment 1 IBM Bug Proxy 2008-08-19 05:03:39 EDT
Created attachment 314516 [details]
Normal kernel boot log, where no problem is seen
Comment 2 IBM Bug Proxy 2008-08-19 05:03:45 EDT
Created attachment 314517 [details]
Debug kernel boot log, where the problem is seen
Comment 3 IBM Bug Proxy 2008-08-19 05:21:18 EDT
Peter Zijlstra pointed me to these two patches:
419ca3f13532793b81aff09f80c60af3eacbb43d
d6672c501852d577097f6757c311d937aca0b04b

These solve the problem of long boot-up time when CONFIG_PROVE_LOCKING is
enabled. We surely need these patches in MRG. Hence please include these patches
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff_plain;h=419ca3f13532793b81aff09f80c60af3eacbb43d

and
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff_plain;h=d6672c501852d577097f6757c311d937aca0b04b
in MRG. These patches apply on current MRG kernel 2.6.24-7-75el5rt with some
offsets.
Comment 4 Luis Claudio R. Goncalves 2008-08-26 11:00:26 EDT
These patches and two other from Peter Zijlstra have been added to 2.6.24.7-78.el5rt. The patches:

0001-lockdep-fix-combinatorial-explosion-in-lock-subgrap.patch
0001-lockdep-build-fix.patch
0002-lockdep-change-scheduler-annotation.patch
0004-lockdep-re-annotate-scheduler-runqueues.patch
Comment 6 IBM Bug Proxy 2008-09-25 20:00:50 EDT
Ok, Darren agreed that it should go in to -rc3, so don't have any last minute
changes that don't get wide testing.

I'm going to submit the patches for review on the list, and then try to get acks
and commit them so the -rc3 can be built.
Comment 7 IBM Bug Proxy 2008-09-26 20:10:47 EDT
Ok, the patches are committed and I've rebuilt -rc3. Marking this as accepted.
Comment 8 David Sommerseth 2008-09-30 07:44:35 EDT
Verified that these patches is included into the mrg-rt-2.6.24.7-81 kernel:

* 0001-lockdep-fix-combinatorial-explosion-in-lock-subgrap.patch
       mrg-rt.git commit 7fdba90ff2a2dfd693fab7b76736334b76d8563e
       linux-2.6  commit 419ca3f13532793b81aff09f80c60af3eacbb43d

* 0001-lockdep-build-fix.patch
       mrg-rt.git commit 2e3b018b87616c9d622625889d3270b6bf93870c
       linux-2.6  commit d6672c501852d577097f6757c311d937aca0b04b

* 0002-lockdep-change-scheduler-annotation.patch
       mrg-rt.git commit 2768deaf8c7a1e8bc929303ad897da05fc943389

* 0004-lockdep-re-annotate-scheduler-runqueues.patch
       mrg-rt.git commit b0a890b69d2a807adab797a0936f54a980bb8988


The mrg-rt-2.6.24.7-81 kernel boots without locking up around udev initialisation as the mrg-rt-2.6.24.7.74 kernel does.

Both kernels report the hwclock issue: "Cannot access the Hardware Clock via any known method.  Use the --debug option to see the details of our search for an access method." ... this is found on dell-pesc430-03.rhts.bos.redhat.com.
Comment 9 David Sommerseth 2008-09-30 09:06:58 EDT
Re-tested /sbin/hwclock (without arguments) on stock RHEL5 kernel - 2.6.18-92 - it works as expected.  Fails on 2.6.24.7-81.el5rt.

One observation was that /proc/devices was missing rtc on 2.6.18-92.  On the 2.6.24.7-81rt it was found as a character device with id 254.

On both kernels /dev/rtc was found as character device 10 135.

I tried to recreate /dev/rtc with as character device 10 254, without any success.  Doing cat /dev/rtc also gives "No such device", no matter which id being used.  In dmesg rtc seems to be found:

rtc_cmos 00:05: rtc core: registered rtc_cmos as rtc0
rtc0: alarms up to one day
Comment 10 David Sommerseth 2008-09-30 10:38:48 EDT
Noticed that I recreated /dev/rtc with the wrong major/minor id's.  By doing this:

       # rm /dev/rtc ; mknod /dev/rtc c 254 0 

After this, /sbin/hwclock will work.
Comment 11 David Sommerseth 2008-10-06 13:03:34 EDT
Moved bug to verified.  The issue regarding hwclock is not solved, but tracked in bug #465837.  Long boot time is solved in mrg-rt-2.6.24.7-81rt.
Comment 13 errata-xmlrpc 2008-10-07 15:21:02 EDT
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on therefore solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2008-0857.html
Comment 14 IBM Bug Proxy 2008-10-22 16:50:45 EDT
The fixes committed to resolve this issue have caused a regression in the non-debug kernel (LTC Bug #49268)

Here are the commits in question:
* 0001-lockdep-fix-combinatorial-explosion-in-lock-subgrap.patch
mrg-rt.git commit 7fdba90ff2a2dfd693fab7b76736334b76d8563e
linux-2.6  commit 419ca3f13532793b81aff09f80c60af3eacbb43d

* 0001-lockdep-build-fix.patch
mrg-rt.git commit 2e3b018b87616c9d622625889d3270b6bf93870c
linux-2.6  commit d6672c501852d577097f6757c311d937aca0b04b

* 0002-lockdep-change-scheduler-annotation.patch
mrg-rt.git commit 2768deaf8c7a1e8bc929303ad897da05fc943389

* 0004-lockdep-re-annotate-scheduler-runqueues.patch
mrg-rt.git commit b0a890b69d2a807adab797a0936f54a980bb8988

From a bisection it seems the final patch (lockdep-re-annotate-scheduler-runqueues.patch) is where the regression appears.

We're going to revert all 4 patches in our tree for now, but try to sort out the problem shortly.

I'm reopening this issue.
Comment 15 Clark Williams 2008-10-22 17:29:19 EDT
unfortunately, this BZ (on the RH side) was closed by the errata process and we can't reopen it. We'll need a new BZ on our side.
Comment 16 IBM Bug Proxy 2008-10-30 02:20:45 EDT
(In reply to comment #46)
> ------- Comment From williams@redhat.com 2008-10-22 17:29:19 EDT-------
> unfortunately, this BZ (on the RH side) was closed by the errata process and we
> can't reopen it. We'll need a new BZ on our side.
>

Okay, we will open a fresh one.
Comment 17 IBM Bug Proxy 2008-11-03 00:30:43 EST
I have opened LTC bug 49534 - RH 469186 to track the regression due to lockdep patches.

Note You need to log in before you can comment on or make changes to this bug.