Bug 727865 - boot hang w/ 2.6.40 unless processor.nocst=1 - Asus P4V8X-X, Intel D865GBF
Status: CLOSED ERRATA
Product: Fedora
Classification: Fedora
Component: kernel
Version: 15
Hardware: i686 Unspecified
Priority: unspecified, Severity: unspecified
Assigned To: Kernel Maintainer List
QA Contact: Fedora Extras Quality Assurance
Reported: 2011-08-03 09:05 EDT by Stefan Stanacar
Modified: 2013-01-10 03:21 EST (History)
Fixed In Version: kernel-3.2.9-1.fc16
Doc Type: Bug Fix
Last Closed: 2012-03-06 14:29:58 EST

Attachments
Smolt profile (2.16 KB, application/octet-stream)
2011-08-03 09:05 EDT, Stefan Stanacar
Dmesg when booting 2.6.38.8-35 without acpi=off (45.51 KB, application/octet-stream)
2011-08-03 09:06 EDT, Stefan Stanacar
Dmesg when booting 2.6.40-4 with acpi=off (118.38 KB, application/octet-stream)
2011-08-03 09:06 EDT, Stefan Stanacar
kernel output dumped to serial console. (32.12 KB, text/plain)
2011-08-21 16:44 EDT, Adam K Kirchhoff
Output from acpidump (79.87 KB, text/plain)
2011-08-26 20:57 EDT, Adam K Kirchhoff

Description Stefan Stanacar 2011-08-03 09:05:04 EDT
After the update to kernel-2.6.40-4.fc15.i686, the system fails to boot unless I pass acpi=off as a boot parameter. This wasn't necessary with the previous kernel (kernel-2.6.38.8-35.fc15.i686).

The last displayed messages resemble this: 

 input: Power Button as /devices/LNXSYSTM:00/device:00/PNP0C0C:00/input/input0
 ACPI: Power Button [PWRB]
 input: Sleep Button as /devices/LNXSYSTM:00/device:00/PNP0C0E:00/input/input1
 ACPI: Sleep Button [SLPB]
 input: Power Button as /devices/LNXSYSTM:00/LNXPWRBN:00/input/input2
 ACPI: Power Button [PWRF]


The system just hangs: Ctrl+Alt+Del, Caps Lock, and Num Lock do nothing, and no other messages appear after those.
It also fails if I use noapic as a boot parameter; it only works if I use acpi=off.

Is this a regression or just a case of a crappy motherboard? The mainboard is an Asus P4V8X-X with a VIA P4X533 chipset.

I will attach the smolt profile, the dmesg for 2.6.38.8-35, and the dmesg for 2.6.40-4 with acpi=off.
Comment 1 Stefan Stanacar 2011-08-03 09:05:34 EDT
Created attachment 516508
Smolt profile
Comment 2 Stefan Stanacar 2011-08-03 09:06:24 EDT
Created attachment 516509
Dmesg when booting 2.6.38.8-35 without acpi=off
Comment 3 Stefan Stanacar 2011-08-03 09:06:52 EDT
Created attachment 516510
Dmesg when booting 2.6.40-4 with acpi=off
Comment 4 abelbennett 2011-08-10 10:20:26 EDT
I also have the same problem.
This is my smolt UUID: pub_9fbe0e20-dabc-4df0-8c82-886fad980710
Comment 5 Stefan Stanacar 2011-08-18 08:14:40 EDT
Update: nothing changed with kernel-2.6.40.3-0.fc15.i686
Comment 6 Adam K Kirchhoff 2011-08-21 16:43:24 EDT
I'm having the same problem with 2.6.40.3-0.fc15.i686.  It only boots with acpi=off:

http://www.smolts.org/client/show/pub_a2e25252-5bf9-46b7-8597-695a377b4aa7

Attaching the serial log from when I boot without acpi=off.
Comment 7 Adam K Kirchhoff 2011-08-21 16:44:21 EDT
Created attachment 519205
kernel output dumped to serial console.
Comment 8 Chuck Ebbert 2011-08-22 15:20:04 EDT
 BUG: unable to handle kernel paging request at 00c0b141
 IP: [<f4402300>] 0xf44022ff
 *pde = 00000000 
 Oops: 0000 [#1] SMP 
 Modules linked in:
 
 Pid: 0, comm: swapper Not tainted 2.6.40.3-0.fc15.i686 #1                  /D865GBF                        
 EIP: 0060:[<f4402300>] EFLAGS: 00010246 CPU: 0
 EIP is at 0xf4402300
 EAX: 00000000 EBX: f4620800 ECX: 00000000 EDX: 00000000
 ESI: f4402300 EDI: c0634bf5 EBP: f4620800 ESP: f4493d08
  DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
 Process swapper (pid: 0, ti=f4492000 task=c0a29fe0 task.ti=c09dc000)
 Stack:
  f4493d40 c04deb9b f45b4a20 00001001 f45b4a30 f4493d48 c06354b2 f45b4a30
  0000001c 000080d0 f45b4a20 f4620800 00000000 f45b4a20 00000000 f4620da8
  f4493d6c c0634ccd f4493d58 f442e618 c096fd0d 00000000 f4620800 f4409c00
 Call Trace:
  [<c04deb9b>] ? __kmalloc+0x103/0x110
  [<c06354b2>] ? acpi_ns_evaluate+0x3a/0x18d
  [<c0634ccd>] ? acpi_evaluate_object+0xd6/0x1c5
  [<c064240d>] ? acpi_processor_get_power_info+0x5a/0x53d
  [<c07e9a6b>] ? _raw_spin_unlock_irqrestore+0x13/0x15
  [<c042a839>] ? task_rq_unlock+0x17/0x19
  [<c043877c>] ? set_cpus_allowed_ptr+0xc7/0xd1
  [<c0641470>] ? acpi_processor_get_throttling_fadt+0x72/0x7a
  [<c06416b0>] ? acpi_processor_get_throttling+0x65/0x6e
  [<c0642339>] ? acpi_processor_get_throttling_info+0x4d1/0x500
  [<c0634daf>] ? acpi_evaluate_object+0x1b8/0x1c5
  [<c07dfeaf>] ? acpi_processor_power_init+0xdc/0x10c
  [<c07dfcf7>] ? acpi_processor_add+0x40e/0x4ea
  [<c0535793>] ? sysfs_do_create_link+0x120/0x157
  [<c0620107>] ? acpi_device_probe+0x41/0xf5
  [<c067ff74>] ? driver_probe_device+0x123/0x1ff
  [<c04297af>] ? should_resched+0xd/0x27
  [<c07e8801>] ? _cond_resched+0xd/0x21
  [<c0680098>] ? __driver_attach+0x48/0x64
  [<c067f1d6>] ? bus_for_each_dev+0x42/0x6b
  [<c067fbd1>] ? driver_attach+0x1f/0x23
  [<c0680050>] ? driver_probe_device+0x1ff/0x1ff
  [<c067f870>] ? bus_add_driver+0xca/0x210
  [<c06804c2>] ? driver_register+0x84/0xe3
  [<c0620888>] ? acpi_bus_register_driver+0x3f/0x41
  [<c0aafe81>] ? acpi_processor_init+0x65/0xd0
  [<c040118a>] ? do_one_initcall+0x8c/0x142
  [<c0aafe1c>] ? acpi_pci_slot_init+0x1b/0x1b
  [<c0a84827>] ? kernel_init+0xaa/0x136
  [<c0a8477d>] ? start_kernel+0x353/0x353
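
For anyone triaging a similar oops, here is a minimal Python sketch that pulls the symbol names out of a call trace like the one above and finds the innermost acpi_processor_* frame. The names FRAME_RE, parse_frames and first_matching are made up for this illustration; they are not part of any kernel tooling. Note that every frame above carries a "?", which marks an address found on the stack that the unwinder could not verify, so the result is a hint rather than proof.

```python
import re

# Matches oops call-trace lines such as
#   [<c064240d>] ? acpi_processor_get_power_info+0x5a/0x53d
# Lines without a symbol (for example "IP: [<f4402300>] 0xf44022ff") do not match.
FRAME_RE = re.compile(
    r"\[<(?P<addr>[0-9a-f]+)>\]\s+(?P<uncertain>\?\s+)?"
    r"(?P<symbol>[A-Za-z_][\w.]*)\+0x(?P<offset>[0-9a-f]+)/0x(?P<size>[0-9a-f]+)"
)


def parse_frames(trace_text):
    """Return a list of (symbol, offset, size) tuples, innermost frame first."""
    frames = []
    for line in trace_text.splitlines():
        m = FRAME_RE.search(line)
        if m:
            frames.append(
                (m.group("symbol"), int(m.group("offset"), 16), int(m.group("size"), 16))
            )
    return frames


def first_matching(frames, prefix):
    """Return the first (innermost) frame whose symbol starts with prefix, or None."""
    for frame in frames:
        if frame[0].startswith(prefix):
            return frame
    return None


# A few lines copied from the trace above, used as sample input.
SAMPLE = """\
 [<c04deb9b>] ? __kmalloc+0x103/0x110
 [<c06354b2>] ? acpi_ns_evaluate+0x3a/0x18d
 [<c0634ccd>] ? acpi_evaluate_object+0xd6/0x1c5
 [<c064240d>] ? acpi_processor_get_power_info+0x5a/0x53d
"""
```

Calling first_matching(parse_frames(SAMPLE), "acpi_processor_") returns the acpi_processor_get_power_info frame, which is the function singled out in comment 12.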
Comment 9 Chuck Ebbert 2011-08-22 15:20:57 EDT
That oops address is somewhere in the ACPI BIOS, I think.
Comment 10 Chuck Ebbert 2011-08-22 15:40:28 EDT
People reporting this bug have either:
 Intel(R) Pentium(R) D CPU 2.66GHz
Or:
 Intel(R) Pentium(R) 4 CPU 3.00GHz
Comment 11 Chuck Ebbert 2011-08-26 15:06:58 EDT
Disassembly of the oopsing code shows that the bytes there are not even valid instructions. So the ACPI code just jumped to some invalid address.
Comment 12 Len Brown 2011-08-26 20:25:33 EDT
Would be great to know via serial console capture if all the failures
look like Adam's in comment 7/8, or if there are multiple failures here.

If we really are crashing under acpi_processor_get_power_info(),
then something in C-states is broken.

Do any of these (individual) cmdline params allow boot to succeed?

idle=poll
idle=halt
processor.nocst=1
processor.max_cstate=1

Adam, please attach the output from acpidump.

re: comment #10
actually comment #1 shows this:

CPU Model: Intel(R) Celeron(R) CPU 2.40GHz
CPU Family: 15
CPU Model Num: 4

but that is still a version of the P4.
Apparently these are all 32-bit processors, so we don't
have the option to try the x86_64 kernel.

Can this be reproduced with an upstream kernel.org kernel?

Presumably upstream 2.6.38.stable works, b/c FC15's
kernel-2.6.38.8-35.fc15.i686 worked.

What about
2.6.39
3.0.0 (I assume that FC is using 2.6.40 as a synonym for this?)
3.1-rc?

BTW, unrelated to the cause of this bug report, but present in comment #3
Linux version 2.6.40-4.fc15.i686 

WARNING: at arch/x86/kernel/apm_32.c:908 apm_cpu_idle+0x42/0x251()
...
deprecated apm_cpu_idle will be deleted in 2012

I added that warning to 3.0 to let folks
know that CONFIG_APM_CPU_IDLE=y may not be what you want.
If you think you really do need it, I need to hear from you...
Comment 13 Adam K Kirchhoff 2011-08-26 20:57:56 EDT
Created attachment 520150
Output from acpidump
Comment 14 Adam K Kirchhoff 2011-08-26 20:59:16 EDT
It is my understanding that Fedora's 2.6.40 is some version of 3.0.*.  I can try 3.0.3 over the weekend, assuming NJ isn't completely washed away.

I will try those other kernel parameters as well when I get a chance.
Comment 15 Adam K Kirchhoff 2011-08-26 21:37:11 EDT
idle=poll
idle=halt
processor.nocst=1

Each one let the machine boot.  It still crashed with:

processor.max_cstate=1
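
To check which of these settings a running system was actually booted with, something like the following rough sketch may help. It assumes a plain whitespace-separated command line (no quoted values), and the names parse_cmdline, KNOWN_WORKAROUNDS and active_workarounds are invented for this example. The table lists only the settings reported in this bug as letting an affected machine boot.

```python
def parse_cmdline(cmdline):
    """Split a kernel command line into {name: value}; bare flags map to None."""
    params = {}
    for token in cmdline.split():
        name, sep, value = token.partition("=")
        params[name] = value if sep else None
    return params


# Settings reported in this bug as allowing boot. processor.max_cstate=1 is
# deliberately absent because the machine still crashed with it.
KNOWN_WORKAROUNDS = {
    "acpi": {"off"},
    "idle": {"poll", "halt"},
    "processor.nocst": {"1"},
}


def active_workarounds(cmdline):
    """Return a sorted list of "name=value" workarounds present in cmdline."""
    params = parse_cmdline(cmdline)
    return sorted(
        f"{name}={params[name]}"
        for name, values in KNOWN_WORKAROUNDS.items()
        if params.get(name) in values
    )
```

On a live system the input would be the contents of /proc/cmdline, for example active_workarounds(open("/proc/cmdline").read()).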
Comment 16 Len Brown 2011-08-26 22:29:31 EDT
processor.nocst=1 worked -- yay, that's a big clue.

Using that param, please show the output from

grep . /sys/devices/system/cpu/cpu0/cpuidle/*/*

Also, if you can show that same output for the working
2.6.38 kernel, that would be helpful to compare
what the FADT does vs _CST.

It seems that acpi_processor_get_power_info_cst()
is bombing out, presumably in the evaluation of _CST itself.
That routine has not changed recently, so it must be something
funky in the actual AML/interpreter.

Unfortunately, the version of acpidump you used didn't
grab the dynamic tables where your _CST lives,
or they were not exported.
Can you attach the files from here?

/sys/firmware/acpi/tables/dynamic/*

If you don't see anything there w/ the latest kernel,
then go back to working 2.6.38 and they should be present there.
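
If it is easier to capture the cpuidle data from a script, the grep above can be approximated with this small Python sketch. The function name read_cpuidle is made up for the example, and the default path is the one from the grep command. An empty result means the cpuidle directory is missing or exposes no readable, non-empty attributes.

```python
import glob
import os


def read_cpuidle(base="/sys/devices/system/cpu/cpu0/cpuidle"):
    """Return {"state0/name": "...", ...} for every non-empty file under base/*/."""
    result = {}
    for path in sorted(glob.glob(os.path.join(base, "*", "*"))):
        if not os.path.isfile(path):
            continue
        try:
            with open(path) as f:
                text = f.read().strip()
        except OSError:
            continue  # some sysfs attributes cannot be read; skip them
        if text:
            result[os.path.relpath(path, base)] = text
    return result
```

Printing the items of read_cpuidle() on both the working and the failing kernel gives two listings that can be compared side by side.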
Comment 17 Adam K Kirchhoff 2011-08-27 08:22:49 EDT
On 2.6.40, /sys/devices/system/cpu/cpu0/ doesn't contain cpuidle on this machine...  There is also nothing under /sys/firmware/acpi/tables/dynamic/ on 2.6.40.

I will have to build a 2.6.38 kernel first as the previous version I have installed is a 2.6.35 F14 kernel.

Adam
Comment 18 Adam K Kirchhoff 2011-08-27 13:48:41 EDT
On 2.6.38.8 /sys/firmware/acpi/tables/dynamic/ contains nothing and /sys/devices/system/cpu/cpu0/cpuidle does not exist either.
Comment 19 Josh Boyer 2011-10-10 15:45:24 EDT
Has anyone tried a 2.6.39 kernel as Len asked?  You might be able to use

http://koji.fedoraproject.org/koji/buildinfo?buildID=244663

to test with.  Bug 730007 is showing similar issues and thus far we only see it on particular Pentium 4 models.  If someone is willing to git bisect this on an afflicted machine, that would be very helpful.
Comment 20 Josh Boyer 2012-01-06 10:08:35 EST
Matthew Garrett pointed me at a patch for a regression in ACPI yesterday.  I've started a scratch build with this patch applied.  Could those with an impacted machine please try this kernel when it finishes building and let us know the results?

http://koji.fedoraproject.org/koji/taskinfo?taskID=3624930
Comment 21 Josh Boyer 2012-01-06 11:05:14 EST
My apologies, I pasted the wrong link to the scratch build.  This is the one that should be tested:

http://koji.fedoraproject.org/koji/taskinfo?taskID=3625177
Comment 22 Stefan Stanacar 2012-01-06 12:05:46 EST
Unfortunately I can't test it anymore, I'm running F16 now, with 3.1.6-1.fc16.i686, and I still have to use processor.nocst=1...
Comment 23 Josh Boyer 2012-01-06 12:23:27 EST
(In reply to comment #22)
> Unfortunately I can't test it anymore, I'm running F16 now, with
> 3.1.6-1.fc16.i686, and I still have to use processor.nocst=1...

The 2.6.41.x kernels are almost identical to the F16 3.1.x kernels.  They are both based on the 3.1.x stable series.  You should be able to install the kernel from the scratch build without issue.
Comment 24 Gary Buhrmaster 2012-01-07 00:54:47 EST
(In reply to comment #21)
> My apologies, I pasted the wrong link to the scratch build.  This is the one
> that should be tested:
> 
> http://koji.fedoraproject.org/koji/taskinfo?taskID=3625177

While I too am now running F16, installing the kernel
and trying a boot indicates that there is still a panic 
with the backtrace indicating acpi issues, so I do not
think this is (yet) identified/fixed.
Comment 25 Stefan Stanacar 2012-01-09 04:07:52 EST
Same here, nothing changed.
Comment 26 Dark Shenada 2012-01-13 12:56:31 EST
One more test case: a D865PERL gets the same result at boot.
(DMI: /D865PERL , BIOS RL86510A.86A.0061.P09.0308281850 08/28/2003)
P.S.: The P09, P15, and P21 BIOS versions all give the same result.

After enabling ACPI debug and an early-stage serial console, I got a bad CST.

Here are the combinations that can boot:
enable HT in BIOS, acpi=off (no HT, no ACPI)
disable HT, no kernel parameters (no HT, have ACPI)
enable HT, processor.nocst=1 (have HT, have ACPI)

My CPU is in the same family as the one in bug 730007:

processor       : 1
vendor_id       : GenuineIntel
cpu family      : 15
model           : 2
model name      : Intel(R) Pentium(R) 4 CPU 2.60GHz
stepping        : 9
cpu MHz         : 2600.000
cache size      : 512 KB
physical id     : 0
siblings        : 2
core id         : 0
cpu cores       : 1
apicid          : 1
initial apicid  : 1
fdiv_bug        : no
hlt_bug         : no
f00f_bug        : no
coma_bug        : no
fpu             : yes
fpu_exception   : yes
cpuid level     : 2
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe pebs bts cid xtpr
bogomips        : 5185.92
clflush size    : 64
cache_alignment : 128
address sizes   : 36 bits physical, 32 bits virtual
power management:
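
To compare another machine against the listing above, a rough Python sketch like the following can parse /proc/cpuinfo-style text. The function names are invented for this example. Treating "siblings greater than cpu cores" as "Hyper-Threading active" is a common heuristic, not something established in this bug, and matching family 15 with HT active only mirrors the reports collected here; it does not prove a machine is affected.

```python
def parse_cpuinfo(text):
    """Parse /proc/cpuinfo-style text into a list of per-processor dicts."""
    cpus, current = [], {}
    for line in text.splitlines():
        if not line.strip():
            # A blank line separates one logical processor from the next.
            if current:
                cpus.append(current)
                current = {}
            continue
        key, _, value = line.partition(":")
        current[key.strip()] = value.strip()
    if current:
        cpus.append(current)
    return cpus


def hyperthreading_active(cpu):
    """Heuristic: more sibling threads than physical cores in the package."""
    return int(cpu.get("siblings", "1")) > int(cpu.get("cpu cores", "1"))


def matches_reports(cpu):
    """True for a family-15 CPU with HT active, like the listing above."""
    return cpu.get("cpu family") == "15" and hyperthreading_active(cpu)
```

Feeding it the contents of /proc/cpuinfo and calling matches_reports on each entry shows whether the machine resembles the ones reported in this bug.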
Comment 27 Dark Shenada 2012-01-13 12:58:57 EST
(In reply to comment #26)
3.1.6-1.fc16.i686.PAE
Comment 28 Dave Jones 2012-02-24 11:15:46 EST
Assuming that the 2.6.42 (3.2) builds don't make any difference either?
Comment 29 Gary Buhrmaster 2012-02-24 12:12:20 EST
(In reply to comment #28)
> assuming that the 2.6.42 (3.2) builds don't make any difference either ?

Running the F16 kernel 3.2.7-1 still requires processor.nocst=1
(crash otherwise).

Is there something specific in the F15 2.6.42 kernel builds
that was intended to fix this (i.e. is it worth getting a
F15 2.6.42 kernel to test?)
Comment 30 Dave Jones 2012-02-24 13:02:38 EST
No, that's pretty much equivalent (as far as ACPI is concerned).

Len, is there any hope for resolution on this, or shall we just start DMI blacklisting the affected systems? There don't seem to be too many of them, at least.
Comment 31 Josh Boyer 2012-02-24 13:11:30 EST
From 730007 we know that whatever caused this issue showed up in 2.6.39-rc1.  What we haven't been able to do is find someone with an impacted machine that is willing to do a git bisect to figure out which commit changed things.
Comment 32 Gary Buhrmaster 2012-02-24 16:06:56 EST
(In reply to comment #31)
> From 730007 we know that whatever caused this issue showed up in 2.6.39-rc1. 
> What we haven't been able to do is find someone with an impacted machine that
> is willing to do a git bisect to figure out which commit changed things.

Ok, I'll bite(*).  I had been hoping someone else would do the
work (compiling a kernel on that old system can take many hours,
so I can usually only get one/two tests per day), but I'll see
if I can get any useful results from a git bisect.

Gary

(*) I think the term is you shamed me into it.... :-)
Comment 33 Josh Boyer 2012-02-26 11:30:24 EST
Matthew pointed me to:

http://marc.info/?l=linux-acpi&m=133002974918284&w=2

That seems like a rather plausible fix for this.
Comment 34 Gary Buhrmaster 2012-02-26 16:44:35 EST
My git bisect has completed, and seems to confirm that
the commit referenced in comment 33 is the commit that
caused the problems.

---


$ git bisect good
64b3db22c04586997ab4be46dd5a5b99f8a2d390 is the first bad commit
commit 64b3db22c04586997ab4be46dd5a5b99f8a2d390
Author: Bob Moore <robert.moore@intel.com>
Date:   Mon Feb 14 15:50:42 2011 +0800

    ACPICA: Remove use of unreliable FADT revision field
    
    The revision number in the FADT has been found to be completely
    unreliable and cannot be trusted. Only the table length can be
    used to infer the actual version.
    
    Signed-off-by: Bob Moore <robert.moore@intel.com>
    Signed-off-by: Lin Ming <ming.m.lin@intel.com>
    Signed-off-by: Len Brown <len.brown@intel.com>

:040000 040000 e40ed2fa28b82990cc8fb147f61841fd6400e711 544b3a6eb35875e502695e35366f23c6e5c80d2c M	drivers
:040000 040000 165441d52fb3ece49801c33a91f1b5e266d53abf 4b5394b30c2c89d29372d999098ebeba16fbe23d M	include
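
For readers who have not used git bisect, the idea behind the run above can be sketched in Python as a binary search. This is a simplification under stated assumptions: it treats history as a linear list ordered oldest to newest, whereas git works on a commit graph, and the names first_bad, is_bad and estimated_tests are hypothetical. The is_bad callback stands in for "build this kernel, boot it without processor.nocst=1, and see whether it crashes".

```python
import math


def first_bad(commits, is_bad):
    """Binary search for the first bad commit.

    commits is ordered oldest to newest; commits[0] is known good and
    commits[-1] is known bad. Returns (first_bad_commit, tests_run).
    """
    lo, hi = 0, len(commits) - 1  # lo: known good, hi: known bad
    tests = 0
    while hi - lo > 1:
        mid = (lo + hi) // 2
        tests += 1
        if is_bad(commits[mid]):
            hi = mid  # the regression is at mid or earlier
        else:
            lo = mid  # the regression is after mid
    return commits[hi], tests


def estimated_tests(n_commits):
    """Rough upper bound on the number of builds needed for n_commits."""
    return math.ceil(math.log2(n_commits)) if n_commits > 1 else 0
```

Because each test halves the remaining range, the number of kernel builds grows only logarithmically with the number of commits, which is what makes a bisect feasible even at the one or two tests per day mentioned in comment 32.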
Comment 35 Josh Boyer 2012-02-27 08:53:45 EST
Excellent.  Thank you very much Gary.  I should have a scratch-build with the patch I referenced in just a bit for people to test out.
Comment 36 Josh Boyer 2012-02-27 09:08:26 EST
This scratch build should have the patch mentioned above:

http://koji.fedoraproject.org/koji/taskinfo?taskID=3822374

Testing when it completes would be much appreciated.
Comment 37 Gary Buhrmaster 2012-02-27 12:04:41 EST
(In reply to comment #36)
> Testing when it completes would be much appreciated.

This new kernel works in my environment without
needing the previous workaround of processor.nocst=1

Thanks!
Comment 38 Stefan Stanacar 2012-02-27 13:29:38 EST
Yup, works for me too.
Thanks!
Comment 39 Josh Boyer 2012-02-27 13:56:37 EST
Excellent.  Thank you both for testing.  I will get this committed to the Fedora branches today and it should be in the next update.
Comment 40 Fedora Update System 2012-02-28 20:34:44 EST
kernel-3.2.8-3.fc16 has been submitted as an update for Fedora 16.
https://admin.fedoraproject.org/updates/kernel-3.2.8-3.fc16
Comment 41 Fedora Update System 2012-03-01 04:29:53 EST
Package kernel-3.2.8-3.fc16:
* should fix your issue,
* was pushed to the Fedora 16 testing repository,
* should be available at your local mirror within two days.
Update it with:
# su -c 'yum update --enablerepo=updates-testing kernel-3.2.8-3.fc16'
as soon as you are able to.
Please go to the following url:
https://admin.fedoraproject.org/updates/FEDORA-2012-2745/kernel-3.2.8-3.fc16
then log in and leave karma (feedback).
Comment 42 Fedora Update System 2012-03-01 17:53:03 EST
kernel-3.2.9-1.fc16 has been submitted as an update for Fedora 16.
https://admin.fedoraproject.org/updates/kernel-3.2.9-1.fc16
Comment 43 Fedora Update System 2012-03-06 14:29:58 EST
kernel-3.2.9-1.fc16 has been pushed to the Fedora 16 stable repository.  If problems still persist, please make note of it in this bug report.
Comment 44 Len Brown 2012-04-06 14:40:22 EDT
commit 3e80acd1af40fcd91a200b0416a7616b20c5d647
Author: Julian Anastasov <ja@ssi.bg>
Date:   Thu Feb 23 22:40:43 2012 +0200

    ACPICA: Fix regression in FADT revision checks

shipped in upstream Linux 3.4-rc1
