Description of problem:
I'm seeing an intermittent infinite loop in init_cacheinfo that causes random processes to hang and use 100% of the CPU. Attaching to the processes with gdb shows that it is always in init_cacheinfo, inside the loop at cacheinfo.c:400:

  /* Query until desired cache level is enumerated.  */
  do
    {
      asm volatile ("cpuid"
                    : "=a" (eax), "=b" (ebx), "=c" (ecx), "=d" (edx)
                    : "0" (4), "2" (i++));
    }
  while (((eax >> 5) & 0x7) != level);

gdb shows level == 2, and eax is optimized out.

Version-Release number of selected component (if applicable):
2.6-4

How reproducible:
It seems to happen most when running lots of processes quickly, like during a build or while running a configure script. I see it extremely often while trying to run the net-snmp configure script.

Steps to Reproduce:
1. Run a configure script and wait for the hang.

Actual results:
A random process (echo, rm, grep, sed, make) hangs and uses 100% of the CPU until manually killed.

Expected results:
No hang.

Additional info:
This machine previously ran Fedora Core 5 for over a year with no problems. The problem started right after upgrading to Fedora 7. I've attached the contents of /proc/cpuinfo.
Created attachment 220381 [details] Contents of /proc/cpuinfo
Can you please run the attached program?
Created attachment 220771 [details] cacheinfo.c
After running the program multiple times, I see three distinct sets of output. The most common outputs are:

  shared 1048576
  level 2
  max_cpuid 5
  cpuid (4, 0) = 04000121
  cpuid (4, 1) = 04000143

And:

  shared 1048576
  level 2
  max_cpuid 3

Twice it has printed the following, where the second number inside the parentheses keeps incrementing until I hit CTRL-C:

  shared 1048576
  level 2
  max_cpuid 5
  cpuid (4, 0) = 04000121
  cpuid (4, 1) = 00000000
  cpuid (4, 2) = 00000000
  cpuid (4, 3) = 00000000
  ...
My suspicion is that this is a CPU bug. The program cannot really print different values in different runs; the only possible variation in the code is the content of the other registers, and those (except eax and ecx) should be irrelevant. The return of max_cpuid == 3 is especially suspicious. I've asked Intel to look at this. In upstream glibc I've implemented a work-around.
Looking at the /proc/cpuinfo dump, CPU 0 has "cpuid level : 3" and no physical id/siblings/core id/cpu cores lines. So it makes sense: you sometimes get "shared 1048576 / level 2 / max_cpuid 3" when the process is scheduled on CPU 0, or "max_cpuid 5" when it is scheduled on CPU 1, and if the process is rescheduled between that check and the following loop, it can loop forever. I wonder whether max_cpuid can be tweaked in the BIOS and the BIOS wrongly tweaks it only for the boot CPU and not the other CPUs; of course it could also be a CPU bug. In any case, if different CPUs report different cpuid levels, the cpuid insn is completely unusable in userland unless the process is pinned to just one CPU (which is very much undesirable for libc initialization).
What does dmidecode report?
Created attachment 227861 [details] Output of dmidecode
You have BIOS A01. Can you try BIOS A03 at http://support.dell.com/support/downloads/download.aspx?c=us&l=en&s=gen&releaseid=R129670&SystemID=DIM_P4_9100&servicetag=&os=WW1&osl=en&deviceid=308&devlib=0&typecnt=0&vercnt=3&catid=-1&impid=-1&formatcnt=1&libid=1&fileid=172852
The BIOS upgrade seems to have fixed the problem. Now /proc/cpuinfo shows both cores with a cpuid level of 5, and the cacheinfo.c program reliably gives this output:

  shared 1048576
  level 2
  max_cpuid 5
  cpuid (4, 0) = 04000121
  cpuid (4, 1) = 04000143

I also ran a few stress tests and didn't see any hung processes. Thank you all for your help, and I'm sorry it was something as mundane as an old BIOS.
Closing. Rawhide glibc has a workaround in case other people have a buggy BIOS, though of course it would be best if people upgraded their BIOSes to fixed ones.
May I pick up the rawhide glibc for testing on F7?