Bug 324081 - Intermittent infinite loop in init_cacheinfo on Intel Pentium D
Summary: Intermittent infinite loop in init_cacheinfo on Intel Pentium D
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Fedora
Classification: Fedora
Component: glibc
Version: 7
Hardware: x86_64
OS: Linux
Priority: low
Severity: high
Target Milestone: ---
Assignee: Jakub Jelinek
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2007-10-09 00:27 UTC by Steve Mead
Modified: 2007-11-30 22:12 UTC

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2007-10-16 11:38:49 UTC
Type: ---


Attachments
Contents of /proc/cpuinfo (1.13 KB, text/plain)
2007-10-09 00:27 UTC, Steve Mead
cacheinfo.c (13.27 KB, text/plain)
2007-10-09 07:27 UTC, Jakub Jelinek
Output of dmidecode (16.20 KB, text/plain)
2007-10-15 18:33 UTC, Steve Mead

Description Steve Mead 2007-10-09 00:27:15 UTC
Description of problem:

I'm seeing an intermittent infinite loop in init_cacheinfo that causes random
processes to hang and use 100% of the CPU.  Attaching to a hung process with gdb
shows that it is always in init_cacheinfo, inside the loop at cacheinfo.c:400.

/* Query until desired cache level is enumerated. */
do
  {
     asm volatile ("cpuid"
                  : "=a" (eax), "=b" (ebx), "=c" (ecx), "=d" (edx)
                  : "0" (4), "2" (i++));
  }
while (((eax >> 5) & 0x7) != level);

gdb reports level == 2; eax is optimized out.

Version-Release number of selected component (if applicable):

2.6-4

How reproducible:

It seems to happen most when running lots of processes quickly, like during a
build or running a configure script.  I definitely see it extremely often while
trying to run the net-snmp configure script.

Steps to Reproduce:
1. Run a configure script and wait for a hang.

Actual results:

Random process (echo, rm, grep, sed, make) will hang and use 100% of CPU until
manually killed.

Expected results:

No hang.

Additional info:

This machine previously ran Fedora Core 5 for over a year with no problems. 
Problem started right after upgrading to Fedora 7.  I've attached the contents
of /proc/cpuinfo.

Comment 1 Steve Mead 2007-10-09 00:27:15 UTC
Created attachment 220381
Contents of /proc/cpuinfo

Comment 2 Jakub Jelinek 2007-10-09 07:23:37 UTC
Can you please run the attached program?

Comment 3 Jakub Jelinek 2007-10-09 07:27:32 UTC
Created attachment 220771
cacheinfo.c

Comment 4 Steve Mead 2007-10-09 19:07:12 UTC
After running the program multiple times, I see three distinct sets of output. 
The most common outputs are:

shared 1048576 level 2 max_cpuid 5
cpuid (4, 0) = 04000121
cpuid (4, 1) = 04000143

And:

shared 1048576 level 2 max_cpuid 3

Twice it has printed out the following, where the second number inside the
parentheses keeps incrementing until I hit CTRL-C:

shared 1048576 level 2 max_cpuid 5
cpuid (4, 0) = 04000121
cpuid (4, 1) = 00000000
cpuid (4, 2) = 00000000
cpuid (4, 3) = 00000000
...
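
For readers decoding the values above: in CPUID leaf 4, bits 4:0 of EAX give the cache type (0 means "no more caches") and bits 7:5 give the cache level, which is the field the glibc loop compares against `level`. The following is a minimal sketch of that decoding; the helper names are illustrative and are not taken from glibc or from the attached cacheinfo.c.

```c
#include <stdio.h>

/* Illustrative helpers (hypothetical names): extract fields from the EAX
   value returned by CPUID leaf 4.  */

/* Bits 4:0: 0 = no more caches, 1 = data, 2 = instruction, 3 = unified.  */
static unsigned int
cache_type (unsigned int eax)
{
  return eax & 0x1f;
}

/* Bits 7:5: cache level (1, 2, ...).  This is the expression used in the
   loop condition quoted in the description.  */
static unsigned int
cache_level (unsigned int eax)
{
  return (eax >> 5) & 0x7;
}

/* Print one decoded leaf 4 result in the style of the output above.  */
static void
describe_leaf4 (unsigned int subleaf, unsigned int eax)
{
  printf ("cpuid (4, %u) = %08x -> type %u, level %u\n",
          subleaf, eax, cache_type (eax), cache_level (eax));
}
```

Under this decoding, 04000121 is a level 1 data cache and 04000143 is a level 2 unified cache. A value of 00000000 has type 0 and level 0, so a loop that only waits for level == 2 never sees a match on those subleaves.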

Comment 5 Ulrich Drepper 2007-10-10 01:26:38 UTC
My suspicion is that this is a CPU bug.  The program should not be able to print
different values in different runs.  The only thing that can vary in the code is
the content of the other registers, and those (except eax and ecx) should be
irrelevant.  The return of max_cpuid == 3 is especially suspicious.

I've asked Intel to look at this.  In upstream glibc I've implemented a work-around.
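
One way such a loop can be made to terminate is to stop as soon as the CPU reports a cache type of 0 ("no more caches"). The sketch below illustrates that idea only; it is not the actual upstream glibc change, and `leaf4_query_fn`, `find_cache_subleaf`, `MAX_SUBLEAVES`, and the two replay functions are hypothetical names. The CPUID query is passed in as a callback so the loop logic can be exercised without the affected hardware.

```c
/* Hypothetical callback: returns EAX for CPUID leaf 4, subleaf I.  On real
   hardware this would wrap the "cpuid" instruction.  */
typedef unsigned int (*leaf4_query_fn) (unsigned int i);

/* Arbitrary safety cap on the number of subleaves tried (an assumption).  */
#define MAX_SUBLEAVES 16

/* Return the subleaf index describing a cache of LEVEL, or -1 if the CPU
   signals "no more caches" (type field == 0) or the cap is reached first.  */
static int
find_cache_subleaf (leaf4_query_fn query, unsigned int level)
{
  for (unsigned int i = 0; i < MAX_SUBLEAVES; ++i)
    {
      unsigned int eax = query (i);

      if ((eax & 0x1f) == 0)            /* Type 0: enumeration is over.  */
        return -1;
      if (((eax >> 5) & 0x7) == level)  /* Same test as the original loop.  */
        return (int) i;
    }
  return -1;
}

/* Replays the common output from comment 4.  */
static unsigned int
replay_good (unsigned int i)
{
  static const unsigned int eax[] = { 0x04000121, 0x04000143 };
  return i < 2 ? eax[i] : 0;
}

/* Replays the failing output from comment 4: zeros after subleaf 0.  */
static unsigned int
replay_bad (unsigned int i)
{
  return i == 0 ? 0x04000121 : 0;
}
```

With the replayed values, `find_cache_subleaf (replay_good, 2)` finds the level 2 cache at subleaf 1, while `find_cache_subleaf (replay_bad, 2)` returns -1 instead of spinning.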

Comment 6 Jakub Jelinek 2007-10-10 07:19:09 UTC
Looking at the /proc/cpuinfo dump, CPU 0 has
cpuid level     : 3
and no physical id/siblings/core id/cpu cores lines.
So it makes sense: sometimes you get
shared 1048576 level 2 max_cpuid 3
when the process is scheduled on CPU 0, or
shared 1048576 level 2 max_cpuid 5
when the process is scheduled on CPU 1.  If it is then rescheduled onto CPU 0
between that check and the following loop, it can loop forever.

I wonder whether max_cpuid can be tweaked in the BIOS, for example, and the BIOS
wrongly tweaks it only for the boot CPU and not the other CPUs; of course it
could also be a CPU bug.  Certainly, CPUs reporting different cpuid levels makes
the cpuid insn completely unusable in userland, unless the process is pinned to
just one CPU (but that's very much undesirable for libc initialization).
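
The mismatch described above can be spotted by comparing the "cpuid level" lines that /proc/cpuinfo prints for each processor. The following is a rough sketch of such a check; `cpuid_levels_consistent` is a hypothetical helper, and it works on text already read into memory, so it assumes the caller has loaded /proc/cpuinfo into a NUL-terminated buffer.

```c
#include <stdio.h>
#include <string.h>

/* Scan /proc/cpuinfo-style TEXT and compare every "cpuid level" value.
   Returns 1 if all values agree, 0 if two processors disagree, and -1 if
   no "cpuid level" line could be parsed.  */
static int
cpuid_levels_consistent (const char *text)
{
  static const char key[] = "cpuid level";
  int found = 0;
  int first = 0;

  for (const char *p = strstr (text, key); p != NULL;
       p = strstr (p + 1, key))
    {
      const char *colon = strchr (p, ':');
      int level;

      /* Skip lines that do not look like "cpuid level : N".  */
      if (colon == NULL || sscanf (colon + 1, "%d", &level) != 1)
        continue;

      if (!found)
        {
          first = level;
          found = 1;
        }
      else if (level != first)
        return 0;
    }

  return found ? 1 : -1;
}
```

On the reporter's machine before the BIOS upgrade, where one core showed level 3 and the other level 5, a check like this would return 0.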

Comment 7 H.J. Lu 2007-10-15 14:08:53 UTC
What does dmidecode report?

Comment 8 Steve Mead 2007-10-15 18:33:27 UTC
Created attachment 227861
Output of dmidecode

Comment 10 Steve Mead 2007-10-15 22:44:18 UTC
The BIOS upgrade seems to have fixed the problem.  Now /proc/cpuinfo shows both
cores with a cpuid level of 5, and the cacheinfo.c program reliably gives output of:

shared 1048576 level 2 max_cpuid 5
cpuid (4, 0) = 04000121
cpuid (4, 1) = 04000143

I also ran a few stress tests and didn't see any hung processes.  Thank you all
for your help, and I'm sorry it was something as mundane as an old BIOS.

Comment 11 Jakub Jelinek 2007-10-16 11:38:49 UTC
Closing.  Rawhide glibc has a workaround in case other people have a buggy BIOS,
though of course it would be best if people upgraded their BIOSes to fixed ones.

Comment 12 Davide Rossetti 2007-10-23 16:10:29 UTC
May I use the rawhide glibc for testing on F7?

