=Comment: #0================================================= JEFFREY D. LANE <jdlane.com> - 2008-07-09 16:10 EDT Problem description: Installed RHEL 5.2 x86_64 base on x3455. Followed that up with an install of the R2 RC1 tarball and a reboot into the R2 kernel. The first issue of note is that the boot takes FOREVER once the kernel starts loading. As of this writing, in fact, the machine has been in the boot/startup process for over 15 minutes and is currently stuck at the "Starting udev" prompt. It may well be locked up completely, but I want to give it more time before saying that. This issue was NOT seen when booting R2 on a Z Pro system. The second issue is that I noticed a weird hardware clock message immediatly following the "I for Interactive Startup" message. Basically, the system boots and then the kernel is uncompressed and booted. Next the Red Hat nash startup message appears. Then there is a pause for a few minutes. Following that, the system displays the "Reading logical volumes" message and then the "Found volume group" message. Next comes the "Welcome to Red Hat" messages and the "Press 'I' for interactive startup" messages. Immediately after that, the system displays this message after sitting for several minutes: "Cannot access the Hardware Clock via any known method. Use the --debug option to see the details of our search for an access method." Ironically, this is then followed up by the clock set message: "Setting Clock (utc): Wed Jul 9 15:41:07 EDT 2008 [OK]" then we finally get to starting udev and that's where it stops. It appears that the machine is stuck at udev at this point in some sort of churn. It appears frozen, and even appeared as though it no longer accepted keyboard interrupts, however, after hitting CapsLock and walking away for a few minutes, on my return I noticed that the CapsLock light had lit up. So, I used sysrq to trigger a core dump. About 30-45 seconds after hitting sysrq-c, the sysrq core dump trigger message appeared. I left the system sitting at that point, hopefully kdump will fire and I'll have a core dump to share. The imporant thing here is that while the system appears frozen to me, it DOES still accept keyboard input, so maybe it's just overwhelmed. If this is a customer issue, please indicate the impact to the customer: Well, system won't boot, so I'd say that's pretty impactful. If this is not an installation problem, Describe any custom patches installed. R2 RC1 kernel and packages Provide output from "uname -a", if possible: Unable. system not fully booted. Hardware Environment Machine type (p650, x235, SF2, etc.): x3455 Cpu type (Power4, Power5, IA-64, etc.): 2x dual core Xeon Describe any special hardware you think might be relevant to this problem: system has a Broadcom PCIe NIC and a QLogic 4GB FC PCIe adapter installed currently. 48GB RAM. Is this reproducible? Not sure. This is the first boot after installing R2. I'm going to try one more time, then I'm going to yank the cards and try it again. If so, how long does it (did it) take to reproduce it? Boot process is agonizingly slow and then locks up while starting udev. So not far into the boot process, but it took 5 -10 mintues to get there. Describe the steps: Boot the system note the Hardware Clock message Note the udev message note that you no longer have keyboard interrupts. PROFIT! Did the system produce an OOPS message on the console? If so, copy it here: No, but I MAY be able to get a kdump core Is the system sitting in a debugger right now? No If so, how long may it stay there? Additional information: =Comment: #1================================================= John G. Stultz <jstultz.com> - 2008-07-09 16:16 EDT Hrmmm.. I was just looking at a similar community issue. Does booting w/ pci=noacpi work around the issue? =Comment: #2================================================= JEFFREY D. LANE <jdlane.com> - 2008-07-09 18:35 EDT Ok... good news, bad news... The good news. The non-booting thing so far seems to only be the debug kernel. On boot, I had the default set to 0 in grub thinking that the first entry was the regular RT kernel. Turns out, 0 is the debug kernel. So I reset grub to boot entry 3, which is the RT kernel. It booted without incident and I have stress running on it now. The bad news... pci=noacip did not fix the clock issue. hwclock --debug says: hwclock from util-linux-2.13-pre7 hwclock: Open of /dev/rtc failed, errno=19: No such device No usable clock interface found. Cannot access the Hardware Clock via any known method. I've only seen this RTC issue on x86_64 systems when running 32bit OSs... I've not encountered this on a 64bit OS until now. To work around the problem previously, we used pnpacpi=off I also saw this happen on an x3950 system w/ 32bit RHEL 4.7 but the pnpacpi parameter did not fix it there. =Comment: #5================================================= Sripathi Kodi <sripathi.com> - 2008-08-04 05:40 EDT I have seen a problem very similar to this (may be the same) on LS20 today. I was trying to boot into the R2-RC2 debug kernel and run some tests on it, but the system stopped booting up after printing the following messages: Setting up new root fs no fstab.sys, mounting internal defaults Switching to new root and running init. unmounting old /dev unmounting old /proc unmounting old /sys SELinux: Disabled at runtime. audit(1217842227.809:2): selinux=0 auid=4294967295 INIT: version 2.86 booting Welcome to Red Hat Enterprise Linux Server Press 'I' to enter interactive startup. Cannot access the Hardware Clock via any known method. Use the --debug option to see the details of our search for an access method. date used greatest stack depth: 2816 bytes left Setting clock (localtime): Mon Aug 4 05:30:32 EDT 2008 [ OK ] Starting udev: I will wait for a while to see if the system will recover and continue to boot. =Comment: #6================================================= Sripathi Kodi <sripathi.com> - 2008-08-04 06:48 EDT After what must have been around 45 minutes, the system booted up. It is as if the system was too slow. Every initscript was taking ages to complete. Finally, when the system booted up, I see the following message in dmesg: BUG: MAX_STACK_TRACE_ENTRIES too low! turning off the locking correctness validator. Pid: 2112, comm: ip Not tainted 2.6.24.7-74ibmrt2.5debug #1 Call Trace: [<ffffffff810146b5>] ? save_stack_trace+0x2a/0x49 [<ffffffff8105d851>] save_trace+0x93/0x9b [<ffffffff8105d8d7>] add_lock_to_list+0x7e/0xac [<ffffffff81060eb9>] __lock_acquire+0xb43/0xcdc [<ffffffff81067443>] ? rt_mutex_slowtrylock+0x18/0x85 [<ffffffff810610e0>] lock_acquire+0x8e/0xb2 [<ffffffff81067443>] ? rt_mutex_slowtrylock+0x18/0x85 [<ffffffff812a7bd2>] __spin_lock_irqsave+0x40/0x73 [<ffffffff81067443>] rt_mutex_slowtrylock+0x18/0x85 [<ffffffff812a5694>] rt_mutex_trylock+0x9/0xb [<ffffffff812a7105>] rt_spin_lock+0x31/0x56 [<ffffffff8127e194>] ip_mc_inc_group+0x176/0x232 [<ffffffff8127e296>] ip_mc_up+0x46/0x64 [<ffffffff81279947>] inetdev_event+0x263/0x470 [<ffffffff810882d0>] ? __rcu_read_unlock+0x8c/0x95 [<ffffffff812aa943>] notifier_call_chain+0x33/0x5b [<ffffffff81058001>] __raw_notifier_call_chain+0x9/0xb [<ffffffff81058012>] raw_notifier_call_chain+0xf/0x11 [<ffffffff812305c2>] call_netdevice_notifiers+0x16/0x18 [<ffffffff81231f2f>] dev_open+0x80/0x88 [<ffffffff8123072f>] dev_change_flags+0xaf/0x16b [<ffffffff81279ee3>] devinet_ioctl+0x267/0x5f2 [<ffffffff8127a686>] inet_ioctl+0x82/0xa0 [<ffffffff81223dc1>] sock_ioctl+0x1e7/0x20c [<ffffffff810ce955>] do_ioctl+0x2d/0x83 [<ffffffff810cec20>] vfs_ioctl+0x275/0x292 [<ffffffff812a6a5b>] ? trace_hardirqs_on_thunk+0x3a/0x3f [<ffffffff810cec94>] sys_ioctl+0x57/0x7b [<ffffffff8100c248>] ? system_call+0xb8/0xef [<ffffffff8100c27f>] system_call_ret+0x0/0x6d INFO: lockdep is turned off. --------------------------- | preempt count: 00000001 ] | 1-level deep critical section nesting: ---------------------------------------- .. [<ffffffff812a7bb5>] .... __spin_lock_irqsave+0x23/0x73 .....[<ffffffff81067443>] .. ( <= rt_mutex_slowtrylock+0x18/0x85) The BUG above has been reported in bug 44149. I will add a comment to that bug as well. =Comment: #7================================================= Sripathi Kodi <sripathi.com> - 2008-08-05 06:15 EDT I saw this problem on HS21 as well. The debug kernel just doesn't seem to boot alright. This time, I saw a few messages from lockdep that look scary. I am going to dig a bit more at this one. I will attach normal and debug kernel boot logs to this bug. I am also updating the bug summary to indicate that this problem is not limited to x3455. We have confirmed this problem on LS20, HS21 and x3455 so far. =Comment: #8================================================= Sripathi Kodi <sripathi.com> - 2008-08-05 06:15 EDT Normal kernel boot log, where no problem is seen =Comment: #9================================================= Sripathi Kodi <sripathi.com> - 2008-08-05 06:16 EDT Debug kernel boot log, where the problem is seen =Comment: #10================================================= Sripathi Kodi <sripathi.com> - 2008-08-05 08:28 EDT Hmm... The MRG kernel shows lockdep errors, but doesn't take that long to boot up. Strange. =Comment: #11================================================= Sripathi Kodi <sripathi.com> - 2008-08-05 08:29 EDT Marking this one FOCUS, as I will be looking at this a bit more. This problem could affect anyone. =Comment: #13================================================= Sripathi Kodi <sripathi.com> - 2008-08-08 02:55 EDT The message "Cannot access the Hardware Clock via any known method. Use the --debug option to see the details of our search for an access method.""Cannot access the Hardware Clock via any known method. Use the --debug option to see the details of our search for an access method." seems to be because RTC is not enabled in the kernel configuration. To confirm whether CONFIG_RTC is needed for us, I am adding John Stultz to the CC list. John, could you please comment on this? We still need to find out why it takes so long to boot debug kernel. =Comment: #14================================================= John G. Stultz <jstultz.com> - 2008-08-08 14:26 EDT Hrmm. So does the R2 normal kernel have RTC enabled, but the debug doesn't? I'm not as familiar with this code, as its been reworked recently (causing a number of these sorts of messages). =Comment: #15================================================= Sripathi Kodi <sripathi.com> - 2008-08-11 02:36 EDT (In reply to comment #14) > Hrmm. So does the R2 normal kernel have RTC enabled, but the debug doesn't? Neither normal nor debug kernel have CONFIG_RTC enabled. My understanding (though not 100% sure) is that we don't have any applications that use RTC, so not having it enabled doesn't hurt us. Also, long boot up time reported in this bug may not be related to this message about hwclock. =Comment: #16================================================= Sripathi Kodi <sripathi.com> - 2008-08-11 02:36 EDT I enabled CONFIG_RTC and confirmed that the messages about hwclock go away. =Comment: #17================================================= Sripathi Kodi <sripathi.com> - 2008-08-11 04:58 EDT (In reply to comment #16) > I enabled CONFIG_RTC and confirmed that the messages about hwclock go away. and confirmed that this has nothing to do with the problem of long boot time we are seeing. =Comment: #18================================================= Sripathi Kodi <sripathi.com> - 2008-08-11 07:46 EDT The problem is because of CONFIG_PROVE_LOCKING. Digging this a bit more. =Comment: #19================================================= Sripathi Kodi <sripathi.com> - 2008-08-13 02:54 EDT This problem does not exist with vanilla 2.6.26. I tried 2.6.16-rt1, I hit another problem with lockdep and I have reported it on linux-rt-users.
Created attachment 314516 [details] Normal kernel boot log, where no problem is seen
Created attachment 314517 [details] Debug kernel boot log, where the problem is seen
Peter Zijlstra pointed me to these two patches: 419ca3f13532793b81aff09f80c60af3eacbb43d d6672c501852d577097f6757c311d937aca0b04b These solve the problem of long boot-up time when CONFIG_PROVE_LOCKING is enabled. We surely need these patches in MRG. Hence please include these patches http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff_plain;h=419ca3f13532793b81aff09f80c60af3eacbb43d and http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff_plain;h=d6672c501852d577097f6757c311d937aca0b04b in MRG. These patches apply on current MRG kernel 2.6.24-7-75el5rt with some offsets.
These patches and two other from Peter Zijlstra have been added to 2.6.24.7-78.el5rt. The patches: 0001-lockdep-fix-combinatorial-explosion-in-lock-subgrap.patch 0001-lockdep-build-fix.patch 0002-lockdep-change-scheduler-annotation.patch 0004-lockdep-re-annotate-scheduler-runqueues.patch
Ok, Darren agreed that it should go in to -rc3, so don't have any last minute changes that don't get wide testing. I'm going to submit the patches for review on the list, and then try to get acks and commit them so the -rc3 can be built.
Ok, the patches are committed and I've rebuilt -rc3. Marking this as accepted.
Verified that these patches is included into the mrg-rt-2.6.24.7-81 kernel: * 0001-lockdep-fix-combinatorial-explosion-in-lock-subgrap.patch mrg-rt.git commit 7fdba90ff2a2dfd693fab7b76736334b76d8563e linux-2.6 commit 419ca3f13532793b81aff09f80c60af3eacbb43d * 0001-lockdep-build-fix.patch mrg-rt.git commit 2e3b018b87616c9d622625889d3270b6bf93870c linux-2.6 commit d6672c501852d577097f6757c311d937aca0b04b * 0002-lockdep-change-scheduler-annotation.patch mrg-rt.git commit 2768deaf8c7a1e8bc929303ad897da05fc943389 * 0004-lockdep-re-annotate-scheduler-runqueues.patch mrg-rt.git commit b0a890b69d2a807adab797a0936f54a980bb8988 The mrg-rt-2.6.24.7-81 kernel boots without locking up around udev initialisation as the mrg-rt-2.6.24.7.74 kernel does. Both kernels report the hwclock issue: "Cannot access the Hardware Clock via any known method. Use the --debug option to see the details of our search for an access method." ... this is found on dell-pesc430-03.rhts.bos.redhat.com.
Re-tested /sbin/hwclock (without arguments) on stock RHEL5 kernel - 2.6.18-92 - it works as expected. Fails on 2.6.24.7-81.el5rt. One observation was that /proc/devices was missing rtc on 2.6.18-92. On the 2.6.24.7-81rt it was found as a character device with id 254. On both kernels /dev/rtc was found as character device 10 135. I tried to recreate /dev/rtc with as character device 10 254, without any success. Doing cat /dev/rtc also gives "No such device", no matter which id being used. In dmesg rtc seems to be found: rtc_cmos 00:05: rtc core: registered rtc_cmos as rtc0 rtc0: alarms up to one day
Noticed that I recreated /dev/rtc with the wrong major/minor id's. By doing this: # rm /dev/rtc ; mknod /dev/rtc c 254 0 After this, /sbin/hwclock will work.
Moved bug to verified. The issue regarding hwclock is not solved, but tracked in bug #465837. Long boot time is solved in mrg-rt-2.6.24.7-81rt.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on therefore solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHSA-2008-0857.html
The fixes committed to resolve this issue have caused a regression in the non-debug kernel (LTC Bug #49268) Here are the commits in question: * 0001-lockdep-fix-combinatorial-explosion-in-lock-subgrap.patch mrg-rt.git commit 7fdba90ff2a2dfd693fab7b76736334b76d8563e linux-2.6 commit 419ca3f13532793b81aff09f80c60af3eacbb43d * 0001-lockdep-build-fix.patch mrg-rt.git commit 2e3b018b87616c9d622625889d3270b6bf93870c linux-2.6 commit d6672c501852d577097f6757c311d937aca0b04b * 0002-lockdep-change-scheduler-annotation.patch mrg-rt.git commit 2768deaf8c7a1e8bc929303ad897da05fc943389 * 0004-lockdep-re-annotate-scheduler-runqueues.patch mrg-rt.git commit b0a890b69d2a807adab797a0936f54a980bb8988 From a bisection it seems the final patch (lockdep-re-annotate-scheduler-runqueues.patch) is where the regression appears. We're going to revert all 4 patches in our tree for now, but try to sort out the problem shortly. I'm reopening this issue.
unfortunately, this BZ (on the RH side) was closed by the errata process and we can't reopen it. We'll need a new BZ on our side.
(In reply to comment #46) > ------- Comment From williams 2008-10-22 17:29:19 EDT------- > unfortunately, this BZ (on the RH side) was closed by the errata process and we > can't reopen it. We'll need a new BZ on our side. > Okay, we will open a fresh one.
I have opened LTC bug 49534 - RH 469186 to track the regression due to lockdep patches.