Bug 324081 - Intermittent infinite loop in init_cacheinfo on Intel Pentium D
Product: Fedora
Classification: Fedora
Component: glibc
Hardware: x86_64 Linux
Priority: low
Severity: high
Assigned To: Jakub Jelinek
QA Contact: Fedora Extras Quality Assurance
Reported: 2007-10-08 20:27 EDT by Steve Mead
Modified: 2007-11-30 17:12 EST (History)
Doc Type: Bug Fix
Last Closed: 2007-10-16 07:38:49 EDT

Attachments
Contents of /proc/cpuinfo (1.13 KB, text/plain)
2007-10-08 20:27 EDT, Steve Mead
cacheinfo.c (13.27 KB, text/plain)
2007-10-09 03:27 EDT, Jakub Jelinek
Output of dmidecode (16.20 KB, text/plain)
2007-10-15 14:33 EDT, Steve Mead

Description Steve Mead 2007-10-08 20:27:15 EDT
Description of problem:

I'm seeing an intermittent infinite loop in init_cacheinfo that causes random
processes to hang and use 100% of the CPU.  Attaching to the processes with gdb
shows that it is always in init_cacheinfo inside of the loop at cacheinfo.c:400.

/* Query until desired cache level is enumerated. */
do
  {
    asm volatile ("cpuid"
                  : "=a" (eax), "=b" (ebx), "=c" (ecx), "=d" (edx)
                  : "0" (4), "2" (i++));
  }
while (((eax >> 5) & 0x7) != level);

Gdb gives level == 2 and eax optimized out.

Version-Release number of selected component (if applicable):


How reproducible:

It seems to happen most when running lots of processes quickly, like during a
build or running a configure script.  I definitely see it extremely often while
trying to run the net-snmp configure script.

Steps to Reproduce:
1. Run a configure script, wait for hang.
Actual results:

Random process (echo, rm, grep, sed, make) will hang and use 100% of CPU until
manually killed.

Expected results:

No hang.

Additional info:

This machine previously ran Fedora Core 5 for over a year with no problems. 
Problem started right after upgrading to Fedora 7.  I've attached the contents
of /proc/cpuinfo.
Comment 1 Steve Mead 2007-10-08 20:27:15 EDT
Created attachment 220381 [details]
Contents of /proc/cpuinfo
Comment 2 Jakub Jelinek 2007-10-09 03:23:37 EDT
Can you please run attached program?
Comment 3 Jakub Jelinek 2007-10-09 03:27:32 EDT
Created attachment 220771 [details]
cacheinfo.c
Comment 4 Steve Mead 2007-10-09 15:07:12 EDT
After running the program multiple times, I see three distinct sets of output. 
The most common outputs are:

shared 1048576 level 2 max_cpuid 5
cpuid (4, 0) = 04000121
cpuid (4, 1) = 04000143


shared 1048576 level 2 max_cpuid 3

Twice it has printed out the following, where the second number inside the
parentheses keeps incrementing until I hit CTRL-C:

shared 1048576 level 2 max_cpuid 5
cpuid (4, 0) = 04000121
cpuid (4, 1) = 00000000
cpuid (4, 2) = 00000000
cpuid (4, 3) = 00000000
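
For readers decoding the values above: in CPUID leaf 4, bits 4:0 of EAX hold the cache type (0 means "no more caches") and bits 7:5 hold the cache level. The sketch below decodes those two fields. The struct and function names are illustrative only and are not taken from glibc or from the attached cacheinfo.c.

```c
#include <stdint.h>

/* Decoded fields of CPUID leaf 4 EAX (names are illustrative). */
struct leaf4_eax {
    unsigned type;   /* bits 4:0: 0 = no more caches, 1 = data,
                        2 = instruction, 3 = unified */
    unsigned level;  /* bits 7:5: cache level, starting at 1 */
};

/* Extract the cache type and cache level from a leaf 4 EAX value. */
static struct leaf4_eax decode_leaf4_eax(uint32_t eax)
{
    struct leaf4_eax f;
    f.type = eax & 0x1f;
    f.level = (eax >> 5) & 0x7;
    return f;
}
```

With this decoding, 04000121 is a level 1 data cache, 04000143 is a level 2 unified cache, and 00000000 has type 0 and level 0. In the failing runs the level field therefore never equals 2, so a loop that only tests `((eax >> 5) & 0x7) != level` has no exit.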
Comment 5 Ulrich Drepper 2007-10-09 21:26:38 EDT
My suspicion is that this is a CPU bug.  The program should not be able to print
different values in different runs.  The only thing that can vary in the code is
the content of the other registers, and that (except for eax and ecx) should be
irrelevant.  The return of max_cpuid == 3 is especially suspicious.

I've asked Intel to look at this.  In upstream glibc I've implemented a work-around.
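
The upstream change itself is not quoted in this report. As a hedged sketch of how a loop like the one at cacheinfo.c:400 could be guarded, the version below stops as soon as CPUID leaf 4 reports cache type 0 (no more caches) instead of waiting for the requested level to appear. The function name, the callback type, the iteration cap, and the return convention are all assumptions, chosen so that the logic can be exercised with the values from comment 4 rather than on real hardware.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical callback: returns EAX of CPUID leaf 4 for a given subleaf. */
typedef uint32_t (*leaf4_query_fn)(uint32_t subleaf);

/* Return the subleaf index that describes the requested cache level,
   or -1 if enumeration ends (cache type 0) before that level is seen. */
static int find_cache_subleaf(leaf4_query_fn query, unsigned level)
{
    assert(query != NULL);
    /* The cap of 256 is an arbitrary second safety net (assumption). */
    for (uint32_t i = 0; i < 256; ++i) {
        uint32_t eax = query(i);
        if ((eax & 0x1f) == 0)            /* type 0: no more caches */
            return -1;
        if (((eax >> 5) & 0x7) == level)  /* found the requested level */
            return (int)i;
    }
    return -1;
}

/* Simulated CPUs replaying the two leaf 4 sequences from comment 4. */
static uint32_t healthy_cpu(uint32_t subleaf)
{
    static const uint32_t eax[] = { 0x04000121u, 0x04000143u };
    return subleaf < 2 ? eax[subleaf] : 0;
}

static uint32_t buggy_cpu(uint32_t subleaf)
{
    return subleaf == 0 ? 0x04000121u : 0;
}
```

Calling `find_cache_subleaf(healthy_cpu, 2)` finds the level 2 entry at subleaf 1, while `find_cache_subleaf(buggy_cpu, 2)` returns -1 instead of spinning, which lets the caller fall back to "no cache information available".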
Comment 6 Jakub Jelinek 2007-10-10 03:19:09 EDT
Looking at the /proc/cpuinfo dump, CPU 0 has
cpuid level     : 3
and no physical id/siblings/core id/cpu cores lines.
So it makes sense: sometimes you get
shared 1048576 level 2 max_cpuid 3
when the process is scheduled on CPU 0, or
shared 1048576 level 2 max_cpuid 5
when the process is scheduled on CPU 1, and if it is then rescheduled between
that check and the following loop, it can loop forever.

I wonder whether max_cpuid can be tweaked in the BIOS, e.g. with the BIOS wrongly
tweaking it only for the boot CPU and not for the other CPUs; of course it could
also be a CPU bug.  In any case, CPUs reporting different cpuid levels make the
cpuid insn completely unusable in userland unless the process is pinned to a
single CPU (which is very much undesirable for libc initialization).
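
As a diagnostic sketch only (not something libc initialization should do, for the reason given above), a test program could pin itself to each CPU in turn and compare the highest basic CPUID leaf each one reports. The helper names below are hypothetical. The code assumes Linux with glibc's `sched_setaffinity` and a GCC or Clang `<cpuid.h>`; on other platforms the reader function simply returns -1.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>
#include <stddef.h>

#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>
#endif

/* Pin the calling thread to one CPU and return that CPU's highest basic
   CPUID leaf, or -1 if pinning fails or the platform is unsupported.
   Note: the thread stays pinned afterwards; a real tool would save and
   restore the original affinity mask. */
static int max_leaf_on_cpu(int cpu)
{
#if defined(__linux__) && (defined(__x86_64__) || defined(__i386__))
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    if (sched_setaffinity(0, sizeof(set), &set) != 0)
        return -1;
    return (int)__get_cpuid_max(0, NULL);
#else
    (void)cpu;
    return -1;
#endif
}

/* Return 1 if every successfully read entry (>= 0) has the same value,
   0 if at least two readable CPUs disagree. */
static int leaves_consistent(const int *leaves, size_t n)
{
    assert(leaves != NULL || n == 0);
    int seen = -1;
    for (size_t i = 0; i < n; ++i) {
        if (leaves[i] < 0)
            continue;            /* skip CPUs that could not be read */
        if (seen < 0)
            seen = leaves[i];
        else if (leaves[i] != seen)
            return 0;
    }
    return 1;
}
```

Given the /proc/cpuinfo dump discussed above, the per-CPU values on the affected machine would presumably have been 3 and 5, for which `leaves_consistent` returns 0.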
Comment 7 H.J. Lu 2007-10-15 10:08:53 EDT
What does dmidecode report?
Comment 8 Steve Mead 2007-10-15 14:33:27 EDT
Created attachment 227861 [details]
Output of dmidecode
Comment 10 Steve Mead 2007-10-15 18:44:18 EDT
The BIOS upgrade seems to have fixed the problem.  Now /proc/cpuinfo shows both
cores with a cpuid level of 5, and the cacheinfo.c program reliably gives output of:

shared 1048576 level 2 max_cpuid 5
cpuid (4, 0) = 04000121
cpuid (4, 1) = 04000143

I also ran a few stress tests and didn't see any hung processes.  Thank you all
for your help, and I'm sorry it was something as mundane as an old BIOS.
Comment 11 Jakub Jelinek 2007-10-16 07:38:49 EDT
Closing.  The rawhide glibc has a workaround in case other people have a buggy
BIOS, though of course it would be best if people upgraded to a fixed BIOS.
Comment 12 Davide Rossetti 2007-10-23 12:10:29 EDT
May I pick up the rawhide glibc for testing?  On F7, I mean.
