Bug 746097

Summary:

3.1.0 kernel will not boot with ACPI enabled on Thinkpad T510

Product:

[Fedora] Fedora

Reporter:

David L. Crow <crow>

Component:

kernel

Assignee:

Kernel Maintainer List <kernel-maint>

Status:

CLOSED ERRATA

QA Contact:

Fedora Extras Quality Assurance <extras-qa>

Severity:

high

Docs Contact:

Priority:

unspecified

Version:

CC:

gansalmon, itamar, jfeeney, jonathan, kernel-maint, madhu.chinakonda, mike.reid, redhat, stefan.hoelldampf, tomi.ollila

Target Milestone:

---

Target Release:

---

Hardware:

x86_64

OS:

Linux

Whiteboard:

Fixed In Version:

Doc Type:

Bug Fix

Doc Text:

Last Closed:

2012-02-23 23:01:43 UTC

Attachments:

Description	Flags
Output from "cat /proc/cpuinfo"	none
Output from acpidump	none
screen shot of last kernel messages	none
Output of dmesg command after boot with debug kernel.	none
another screen shot of boot at hang point	none

Description David L. Crow 2011-10-13 21:10:46 UTC

Description of problem:

Hardware: Thinkpad T510 Intel(R) Core(TM) i7 CPU M 620 @ 2.67GHz

I installed Fedora 16 alpha and have been performing regular upgrades via yum since then. When the kernel changed from 3.0.0 to 3.1.0, it stopped booting unless I add acpi=off to the kernel command line arguments.

The last two lines in the kernel boot output (I removed the quiet flag) are

[ 2.578202] Refined TSC clocksource calibration: 2659.999 MHz.
[ 2.578494] Switching to clocksource tsc

I went back to try to find old releases and was only able to find

kernel-3.0.0-1.fc16.x86_64
kernel-3.1.0-0.rc6.git0.3.fc16.x86_64
kernel-3.1.0-0.rc9.git0.0.fc16.x86_64

These are from the alpha release, the beta release, and the latest. The builds on koji have been removed. Using these, the 3.0.0 build always boots, but the 3.1.0 builds rarely succeed.

Version-Release number of selected component (if applicable):

3.1.0

How reproducible:

98% of the time, the boot hangs at the exact same place. Every once in a while, it succeeds completely. At one time, I thought I could boot with "acpi=off single" and then immediately reboot (a warm boot instead of a cold boot) and it would work, but that did not turn out to be consistent. I also thought that the behavior changed depending on whether the laptop was in a dock, plugged in without a dock, or running on battery, but I could not get reliable results that way, either.

Steps to Reproduce:
1. Boot a Fedora 16 installation

Actual results:

Boot hangs and does not complete unless ACPI is disabled.

Expected results:

Boot should succeed with ACPI enabled.

Additional info:

Per one of the other open kernel bugs, I added initcall_debug=1, but did not see a change.

Per the other bugs, it seems interesting to have the output of /proc/cpuinfo and acpidump, so I will attach those. I will also attach a screenshot of the hung boot.

If there is any other information that I can gather or tests to try, I am more than happy to work hard to help find the resolution.

Comment 1 David L. Crow 2011-10-13 21:11:31 UTC

Created attachment 528102 [details]
Output from "cat /proc/cpuinfo"

Comment 2 David L. Crow 2011-10-13 21:11:57 UTC

Created attachment 528103 [details]
Output from acpidump

Comment 3 David L. Crow 2011-10-13 21:18:39 UTC

Created attachment 528105 [details]
screen shot of last kernel messages

Comment 4 Josh Boyer 2011-10-13 21:19:52 UTC

If you install kernel-debug, do you get a trace instead of just a hang?

Comment 5 David L. Crow 2011-10-14 02:03:35 UTC

After many tries, I can't get the boot to hang with the debug kernel.  I'll continue to try, but in the meantime, I'll attach the dmesg output after booting with the debug kernel in case it is useful.

Comment 6 David L. Crow 2011-10-14 02:04:30 UTC

Created attachment 528135 [details]
Output of dmesg command after boot with debug kernel.

Comment 7 Dave Jones 2011-10-14 14:12:43 UTC

If you can still get the normal kernel to hang, try booting with initcall_debug (and remove 'quiet'). This should tell you the last function we entered before the kernel hangs.

Comment 8 David L. Crow 2011-10-14 14:39:08 UTC

Actually, the screenshot in attachment 528105 [details] is with initcall_debug enabled and the normal kernel.  I did just repro again and had the exact same screen.

Still no luck in getting the debug kernel to fail :-(.

Comment 9 Chuck Ebbert 2011-10-14 18:26:28 UTC

I assume adding clocksource=hpet works?

Comment 10 David L. Crow 2011-10-21 18:00:46 UTC

Sorry for the delay in responding.

The only change when adding clocksource=hpet is that the "Switching to clocksource tsc" line is not printed.  Otherwise it still hangs at the same place.

I have updated to the 3.1.0-0.rc10.git0.1.fc16.x86_64 kernel and the behaviour is exactly the same.  The debug kernel works every time and the non-debug kernel hangs about 80-90% of the time.

Comment 11 Chuck Ebbert 2011-10-25 00:57:49 UTC

Can you get a backtrace by using the sysrq key? Add "sysrq_always_enabled" to the boot options and try hitting alt-sysrq-p for a dump of the current CPU state and/or alt-sysrq-l to show all CPUs:

*  How do I use the magic SysRq key?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
On x86   - You press the key combo 'ALT-SysRq-<command key>'. Note - Some
           keyboards may not have a key labeled 'SysRq'. The 'SysRq' key is
           also known as the 'Print Screen' key. Also some keyboards cannot
           handle so many keys being pressed at the same time, so you might
           have better luck with "press Alt", "press SysRq", "release SysRq",
           "press <command key>", release everything.


See also: http://en.wikipedia.org/wiki/Magic_SysRq_key

Comment 12 David L. Crow 2011-10-27 16:46:16 UTC

I'm now on 3.1.0-1.fc16.x86_64 .

The good news (I guess) is that the failure is happening less often.  The boot now succeeds about 40% of the time.  I still do not know what is different between success and failure.

The bad news is that the magic SysRq key is not working.  Once booted, it works just fine, so I feel confident I have the magic keyboard incantation (while holding down alt, hold fn, press sysrq, release fn, then press <command key>) down.

Perhaps this is not enabled this early in the boot?

Comment 13 David L. Crow 2011-10-27 16:48:03 UTC

Created attachment 530528 [details]
another screen shot of boot at hang point

In one of the failed boots today, the screen output was a little different.  Two of the output lines that were previously about 5-10 lines up were at the bottom as is shown in the attachment.

I'm not sure if that means anything or provides any clues.

Comment 14 Josh Boyer 2011-12-07 16:23:04 UTC

*** Bug 756154 has been marked as a duplicate of this bug. ***

Comment 15 Josh Boyer 2011-12-15 20:58:26 UTC

*** Bug 768133 has been marked as a duplicate of this bug. ***

Comment 16 Mike Reid 2011-12-19 05:04:29 UTC

I'm having similar problems with a Lenovo T510. 

However, it turns out that if I leave it long enough it works: 

Booting: 
3.1.5-6.fc16.i686.PAE 
(which is in updates-testing)

from dmesg:
...
[    1.178143] ACPI: Deprecated procfs I/F for battery is loaded, please retry with CONFIG_ACPI_PROCFS_POWER cleared
[    1.178412] ACPI: Battery Slot [BAT0] (battery present)
[    1.521036] isapnp: No Plug & Play device found
[    1.521293] Serial: 8250/16550 driver, 4 ports, IRQ sharing enabled
[    1.522723] Non-volatile memory driver v1.3
[    1.522891] Linux agpgart interface v0.103
[    1.523298] tpm_tis 00:0b: 1.2 TPM (device-id 0x0, rev-id 78)
[    2.115995] Refined TSC clocksource calibration: 2526.999 MHz.
[    2.116178] Switching to clocksource tsc
[  121.457029] tpm_tis 00:0b: Operation Timed out
...
So it actually works, but takes a VERY long time to time-out...
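
A stall like the one above shows up as a jump between consecutive printk timestamps. The following is a minimal Python sketch, not something used in this report, that scans dmesg text for the largest such jump. The function name `largest_gap` is hypothetical, and the sample lines are copied from the extract above.

```python
import re

# Matches the "[  121.457029] message" prefix that printk timestamps produce.
STAMP = re.compile(r"^\[\s*(\d+\.\d+)\]\s?(.*)$")


def largest_gap(dmesg_text):
    """Return (gap_seconds, message_before, message_after) for the biggest
    jump between consecutive timestamped lines, or None if there are fewer
    than two timestamped lines."""
    previous = None
    best = None
    for raw in dmesg_text.splitlines():
        match = STAMP.match(raw.strip())
        if not match:
            continue  # skip "..." markers and wrapped continuation lines
        stamp, message = float(match.group(1)), match.group(2)
        if previous is not None:
            gap = stamp - previous[0]
            if best is None or gap > best[0]:
                best = (gap, previous[1], message)
        previous = (stamp, message)
    return best


# Lines taken from the dmesg extract in this comment.
sample = """\
[    1.523298] tpm_tis 00:0b: 1.2 TPM (device-id 0x0, rev-id 78)
[    2.115995] Refined TSC clocksource calibration: 2526.999 MHz.
[    2.116178] Switching to clocksource tsc
[  121.457029] tpm_tis 00:0b: Operation Timed out
"""

gap, before, after = largest_gap(sample)
print(f"{gap:.1f}s between {before!r} and {after!r}")
```

On a live system the same function could be fed the saved output of the dmesg command. The message printed right after the gap (here the tpm_tis timeout) is usually the better clue, since the message before it only shows what happened to be logged last.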

Comment 17 Josh Boyer 2011-12-19 14:02:59 UTC

Are all of you seeing the same thing that is reported in comment #16?  If so, this sounds like bug 733964

Comment 18 David L. Crow 2011-12-19 17:21:32 UTC

Sure enough, that is what I see

[    1.029206] ACPI: Deprecated procfs I/F for battery is loaded, please retry with CONFIG_ACPI_PROCFS_POWER cleared
[    1.029418] ACPI: Battery Slot [BAT0] (battery present)
[    1.030659] serial 0000:00:16.3: PCI INT B -> GSI 17 (level, low) -> IRQ 17
[    1.051314] 0000:00:16.3: ttyS0 at I/O 0x1800 (irq = 17) is a 16550A
[    1.060647] Non-volatile memory driver v1.3
[    1.060789] Linux agpgart interface v0.103
[    1.061093] tpm_tis 00:0b: 1.2 TPM (device-id 0x0, rev-id 78)
[    1.930427] Refined TSC clocksource calibration: 2659.999 MHz.
[    1.930573] Switching to clocksource tsc
[  120.760356] tpm_tis 00:0b: Operation Timed out
[  120.784825] loop: module loaded
[  120.785014] ahci 0000:00:1f.2: version 3.0
[  120.785023] ahci 0000:00:1f.2: PCI INT B -> GSI 16 (level, low) -> IRQ 16
[  120.785203] ahci 0000:00:1f.2: irq 41 for MSI/MSI-X
[  120.785238] ahci: SSS flag set, parallel bus scan disabled
[  120.785432] ahci 0000:00:1f.2: AHCI 0001.0300 32 slots 6 ports 3 Gbps 0x33 impl SATA mode
[  120.785641] ahci 0000:00:1f.2: flags: 64bit ncq sntf ilck stag pm led clo pio slum part ems sxs apst 
[  120.785849] ahci 0000:00:1f.2: setting latency timer to 64


I apologize that I did not have the patience to wait.

Just to keep up-to-date, I am now running kernel-3.1.5-6.fc16.x86_64 .

Comment 19 David L. Crow 2011-12-19 17:25:37 UTC

Regarding whether this is a duplicate, I never had a problem in Fedora 15 and the problem only started when Fedora 16 alpha/beta moved to the 3.1 kernel.  The 3.0 kernel never showed problems for me.

Comment 20 Mike Reid 2011-12-19 18:00:28 UTC

As described for bug 733964, I have tried adding: 
tpm_tis.interrupts=0
to the boot options. 

This does seem to be helpful. Sometimes it boots immediately, but sometimes not. 

I can't see a pattern in which boots are slow after trying various combinations of restart, cold boot, with external power, with only battery power...

Comment 21 Josh Boyer 2011-12-20 18:36:32 UTC

Can you try adding tpm_tis.itpm=1 to the kernel command line and see if that helps?  If not, can you try booting with nohz=off and see if the hang goes away?

The TPM driver is doing some weird stuff and I'd like to see if this is a problem with the iTPM probe function, or something more general.
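
When testing boot parameters like these, it helps to confirm that the option actually reached the running kernel. Below is a minimal Python sketch, assuming the usual space-separated `name=value` layout of /proc/cmdline. The helper names and the example command line are hypothetical, and quoted values containing spaces are not handled.

```python
def parse_cmdline(cmdline):
    """Split a kernel command line into a {name: value} dict.

    Flags without '=' (such as initcall_debug) map to None.
    Quoted values containing spaces are NOT handled by this sketch.
    """
    params = {}
    for token in cmdline.split():
        name, sep, value = token.partition("=")
        params[name] = value if sep else None
    return params


def read_cmdline(path="/proc/cmdline"):
    """Read the command line of the running kernel (Linux only)."""
    with open(path) as handle:
        return handle.read()


# Hypothetical command line, for illustration only.
example = "ro root=/dev/mapper/vg-root tpm_tis.itpm=1 initcall_debug"
params = parse_cmdline(example)

print("tpm_tis.itpm =", params.get("tpm_tis.itpm"))
print("quiet present:", "quiet" in params)
```

On the affected machine, `parse_cmdline(read_cmdline())` would show whether tpm_tis.itpm=1 or nohz=off was in effect for a given boot.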

Comment 22 Mike Reid 2011-12-20 19:50:59 UTC

I added tpm_tis.itpm=1 to my Lenovo T510 and it's booted 5 times now without hanging using a variety of battery power/mains power/warm restart/cold boot. 

Any other info you need?

Comment 23 Josh Boyer 2011-12-20 20:01:14 UTC

(In reply to comment #22)
> I added tpm_tis.itpm=1 to my Lenovo T510 and it's booted 5 times now without
> hanging using a variety of battery power/mains power/warm restart/cold boot. 
> 
> Any other info you need?

I don't think so.  The problem is that the tpm driver is built into the kernel in f15/f16 and on this particular machine it goes off and probes for an iTPM because it's detected via ACPI.  Except for some reason nothing returns data and the driver sits there for up to 2 minutes waiting for it to respond.  This timeout seems rather excessive, and it probably should be blacklisted (or something) on these machines anyway.

For rawhide, we actually switched this to a module, where it can be loaded and hang as long as it wants without stalling the boot.  I'll think of what to do for F15/F16.
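
Whether a driver is built into a given kernel or shipped as a module can be checked from the modules.builtin file that the kernel build installs under /lib/modules/<release>/. The sketch below is a minimal Python illustration under that assumption. The helper names are hypothetical and the sample entries are illustrative, not taken from an actual F16 kernel.

```python
import os


def is_builtin(module, builtin_text):
    """Return True if `module` is listed in the text of a modules.builtin file.

    Entries in that file look like 'kernel/drivers/char/tpm/tpm_tis.ko'.
    Dashes and underscores are treated as equivalent, as they are in
    module names.
    """
    wanted = module.replace("-", "_")
    for line in builtin_text.splitlines():
        name = os.path.basename(line.strip())
        if name.endswith(".ko"):
            name = name[:-3]
        if name.replace("-", "_") == wanted:
            return True
    return False


def builtin_path():
    """Path of modules.builtin for the running kernel (Linux only)."""
    return f"/lib/modules/{os.uname().release}/modules.builtin"


# Illustrative entries only; a real file lists every built-in driver.
sample = (
    "kernel/drivers/char/tpm/tpm.ko\n"
    "kernel/drivers/char/tpm/tpm_tis.ko\n"
)

print("tpm_tis built in:", is_builtin("tpm_tis", sample))
print("loop built in:", is_builtin("loop", sample))
```

A built-in driver runs its probe during the boot sequence itself, which is why a slow probe can stall the whole boot in the way described above.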

Comment 24 Josh Boyer 2011-12-20 20:08:56 UTC

I've also emailed the upstream maintainers to see if there are other options or debug to pursue.

Comment 25 Mike Reid 2011-12-20 20:46:16 UTC

OK, thanks. Of course for now, from my point of view the problem is "solved" (in that I don't have to wait 2 minutes to boot). Just let me know if you want me to test something else.

Comment 26 Dave Jones 2011-12-21 02:40:00 UTC

try the build at http://koji.fedoraproject.org/koji/buildinfo?buildID=279607

(without any boot parameters, it contains a patch which should make it just do the right thing by default).

Comment 27 Mike Reid 2011-12-21 04:17:45 UTC

OK, booted twice OK on my Lenovo T510. 
Can see a slight pause when it hits the point where it used to hang, as you can see from this dmesg extract:
 
...
[    1.974235] Refined TSC clocksource calibration: 2526.999 MHz.
[    1.974411] Switching to clocksource tsc
[    3.414125] loop: module loaded
... 

And I guess this is part of what was causing the problem: 
...
[    3.625251] IMA: No TPM chip found, activating TPM-bypass!
...

Comment 28 tomi ollila 2011-12-28 19:11:35 UTC

I had the same problem: the boot hung for about 2 minutes (?) after
'loading initial ramdisk' and then printed something about a timeout.

Now I have yum updated my F16 to kernel 3.1.6-1.fc16.x86_64.

Now it stops only briefly (for 1 or 2 seconds), if what I paste below
is what happened:

...
[    0.994495] tpm_tis 00:0b: 1.2 TPM (device-id 0x0, rev-id 78)
[    1.922197] Refined TSC clocksource calibration: 2393.999 MHz.
[    1.922204] Switching to clocksource tsc
[    3.021751] loop: module loaded
[    3.021836] ahci 0000:00:1f.2: version 3.0
...
[    3.077076] IMA: No TPM chip found, activating TPM-bypass!
...

This is a definite improvement (presuming it lasts...). Thanks!

Comment 29 David L. Crow 2012-02-23 22:55:31 UTC

I haven't seen this problem in a while, so as far as I am concerned, this bug can be closed.