Description of problem:
I'm seeing an intermittent infinite loop in init_cacheinfo that causes random processes to hang and use 100% of the CPU. Attaching to the processes with gdb shows that it is always in init_cacheinfo, inside the loop at cacheinfo.c:400:

  /* Query until desired cache level is enumerated.  */
  do
    {
      asm volatile ("cpuid"
                    : "=a" (eax), "=b" (ebx), "=c" (ecx), "=d" (edx)
                    : "0" (4), "2" (i++));
    }
  while (((eax >> 5) & 0x7) != level);

gdb shows level == 2, and eax is optimized out.

Version-Release number of selected component (if applicable):
2.6-4

How reproducible:
It seems to happen most when running lots of processes quickly, like during a build or while running a configure script. I see it extremely often while trying to run the net-snmp configure script.

Steps to Reproduce:
1. Run a configure script and wait for the hang.

Actual results:
A random process (echo, rm, grep, sed, make) hangs and uses 100% of the CPU until manually killed.

Expected results:
No hang.

Additional info:
This machine previously ran Fedora Core 5 for over a year with no problems. The problem started right after upgrading to Fedora 7. I've attached the contents of /proc/cpuinfo.
Created attachment 220381 [details] Contents of /proc/cpuinfo
Can you please run the attached program?
Created attachment 220771 [details] cacheinfo.c
After running the program multiple times, I see three distinct sets of output. The most common outputs are:

  shared 1048576
  level 2
  max_cpuid 5
  cpuid (4, 0) = 04000121
  cpuid (4, 1) = 04000143

And:

  shared 1048576
  level 2
  max_cpuid 3

Twice it has printed the following, where the second number inside the parentheses keeps incrementing until I hit CTRL-C:

  shared 1048576
  level 2
  max_cpuid 5
  cpuid (4, 0) = 04000121
  cpuid (4, 1) = 00000000
  cpuid (4, 2) = 00000000
  cpuid (4, 3) = 00000000
  ...
My suspicion is that this is a CPU bug. The program cannot really print different values in different runs; the only possible variation in the code is the content of the other registers, and those (except eax and ecx) should be irrelevant. The return of max_cpuid == 3 is especially suspicious. I've asked Intel to look at this. In upstream glibc I've implemented a work-around.
Looking at the /proc/cpuinfo dump, CPU 0 has "cpuid level : 3" and no physical id/siblings/core id/cpu cores lines. So it makes sense: you sometimes get "shared 1048576 / level 2 / max_cpuid 3" when the process is scheduled on CPU 0, or "max_cpuid 5" when it is scheduled on CPU 1, and if the process is rescheduled between that check and the following loop, it can loop forever. I wonder whether max_cpuid can be tweaked in the BIOS and the BIOS wrongly tweaks it only for the boot CPU and not the other CPUs; of course it could also be a CPU bug. In any case, if different CPUs report different cpuid levels, the cpuid insn is completely unusable in userland unless the process is pinned to just one CPU (which is very much undesirable for libc initialization).
What does dmidecode report?
Created attachment 227861 [details] Output of dmidecode
You have BIOS A01. Can you try BIOS A03 at http://support.dell.com/support/downloads/download.aspx?c=us&l=en&s=gen&releaseid=R129670&SystemID=DIM_P4_9100&servicetag=&os=WW1&osl=en&deviceid=308&devlib=0&typecnt=0&vercnt=3&catid=-1&impid=-1&formatcnt=1&libid=1&fileid=172852
The BIOS upgrade seems to have fixed the problem. Now /proc/cpuinfo shows both cores with a cpuid level of 5, and the cacheinfo.c program reliably gives this output:

  shared 1048576
  level 2
  max_cpuid 5
  cpuid (4, 0) = 04000121
  cpuid (4, 1) = 04000143

I also ran a few stress tests and didn't see any hung processes. Thank you all for your help, and I'm sorry it was something as mundane as an old BIOS.
Closing. Rawhide glibc has a workaround in case other people have a buggy BIOS, though of course it would be best if people upgraded their BIOSes to fixed ones.
May I pick up the rawhide glibc for testing on F7?