Hardware Environment: IntelliStation M Pro, 2 x 333MHz PII CPUs, 128MB RAM, 9GB SCSI disk attached to an Adaptec AIC-7895 Ultra SCSI host adapter.

Software Environment: Red Hat Advanced Server, 2.4.9-e.10smp kernel, GPFS 1.3.0-1

Steps to Reproduce:
1. Create a GPFS filesystem, at least 1GB in size, on any SCSI disk partition.
2. Unpack the GCC distribution (gcc-3.2.tar.gz) on the GPFS filesystem.
3. Start the GCC build: cd gcc-3.2 ; ./configure ; make

Actual Results: Normally, GCC builds fine on a GPFS filesystem. However, twice now in about two weeks of testing, the kernel has oopsed; see the tracebacks below.

Expected Results: The build should run normally to completion.

Additional Information: So far, the oops at the same address (schedule 0x1bb) has been hit twice. On the first occurrence, GPFS was in the stack:

[root@gpfs10 root]# Oops: 0000
Kernel 2.4.9-e.10smp
CPU:    1
EIP:    0010:[<c011945b>]    Tainted: PF
EFLAGS: 00010007
EIP is at schedule [kernel] 0x1bb
eax: 0000008c   ebx: c0350120   ecx: c0350140   edx: c4aa4000
esi: ffffffd4   edi: c4aa4000   ebp: c4aa5c78   esp: c4aa5c58
ds: 0018   es: 0018   ss: 0018
Process threaddirmaker (pid: 10388, stackpage=c4aa5000)
Stack: c7034040 c46b7ac0 c0349900 c46b4360 c4aa4000 00000202 c89ef4f8 c2ef487c
       c2ef4884 c893c671 c4aa4000 c46b4000 c4aa5de8 00000000 c2ef487c c46b5ca4
       00000001 c4aa5cac c4aa5cac 00000000 c4aa4000 c4aa5c9c c4aa5c9c 00000000
Call Trace: [<c89ef4f8>] _14BaseMutexClass.kernelSynchState [mmfs] 0x818
[<c893c671>] cxiWaitEventWait [mmfslinux] 0xc1
[<c89ef4f8>] _14BaseMutexClass.kernelSynchState [mmfs] 0x818
[<c8989600>] internalAcquire__14BaseMutexClassUi [mmfs] 0xdc
[<c89ef4f8>] _14BaseMutexClass.kernelSynchState [mmfs] 0x818
[<c8989550>] internalAcquire__14BaseMutexClassUi [mmfs] 0x2c
[<c89ef4f8>] _14BaseMutexClass.kernelSynchState [mmfs] 0x818
[<c89ef4f8>] _14BaseMutexClass.kernelSynchState [mmfs] 0x818
[<c89897b2>] kAcquireSlow__19NotGlobalMutexClass [mmfs] 0x86
[<c89ef4f8>] _14BaseMutexClass.kernelSynchState [mmfs] 0x818
[<c89ef4f8>] _14BaseMutexClass.kernelSynchState [mmfs] 0x818
[<c89899cb>] kWait__6ThCondiPCc [mmfs] 0x8b
[<c8954913>] sendToDaemonWithReply__13KernelMailbox [mmfs] 0x17b
[<c89b6f89>] .rodata [mmfs] 0x15e9
[<c89547ce>] sendToDaemonWithReply__13KernelMailbox [mmfs] 0x36
[<c8964304>] kSFSRmdir__FP15KernelOperationG7FileUIDPcT1P10ext_cred_t [mmfs] 0x198
[<c8962f46>] ReleaseDaemonSegAndSG__F12SegMapStatusiP11StripeGroupUiiUi [mmfs] 0x152
[<c896cabb>] gpfsRmdir__FP13gpfsVfsData_tP9cxiNode_tT1PcP10ext_cred_t [mmfs] 0x187
[<c0159fd4>] clear_inode [kernel] 0x124
[<c893f8fc>] .text.trace [mmfslinux] 0x121c
[<c893835b>] gpfs_i_rmdir [mmfslinux] 0x9b
[<c0243482>] call_reschedule_interrupt [kernel] 0x5
[<c0243482>] call_reschedule_interrupt [kernel] 0x5
[<c0157fec>] dput [kernel] 0x1c
[<c014ef60>] cached_lookup [kernel] 0x10
[<c014faf8>] path_walk [kernel] 0x908
[<c01510d7>] vfs_rmdir [kernel] 0x1f7
[<c014fe9a>] lookup_hash [kernel] 0x4a
[<c0151217>] sys_rmdir [kernel] 0xa7
[<c01072e3>] system_call [kernel] 0x33
Code: 8b 46 58 89 45 ec 8b 47 54 89 45 e8 8b 4d ec 85 c9 75 32 89
<0>Kernel panic: not continuing

However, on the second occurrence, GPFS wasn't in the stack at all:

gpfs10 login: Oops: 0000
Kernel 2.4.9-e.10smp
CPU:    1
EIP:    0010:[<c011945b>]    Tainted: PF
EFLAGS: 00010007
EIP is at schedule [kernel] 0x1bb
eax: 0000008c   ebx: c0350120   ecx: c03505c0   edx: c4076000
esi: ffffffd4   edi: c4076000   ebp: c4077f04   esp: c4077ee4
ds: 0018   es: 0018   ss: 0018
Process sh (pid: 29543, stackpage=c4077000)
Stack: c5e13fb4 c4076000 c42629c0 c4262900 c4076000 c41c37a0 c4076000 c41c3740
       c4077f2c c014dfec 00000000 c4076000 00000000 00000000 00000000 c4076000
       c4076000 c4262900 00000000 c4076000 c22ba564 c22ba564 00000000 00000286
Call Trace: [<c014dfec>] pipe_wait [kernel] 0x8c
[<c014e0d4>] pipe_read [kernel] 0xb4
[<c01441b6>] sys_read [kernel] 0x96
[<c01072e3>] system_call [kernel] 0x33
Code: 8b 46 58 89 45 ec 8b 47 54 89 45 e8 8b 4d ec 85 c9 75 32 89

The identical test never produced this particular oops on other (non-AS) kernels, despite very extensive testing. The bug is apparently caused by the linux-2.4.9-o1-scheduler.patch, which contains a lot of 2.5.x code. Either 'prev' or 'next' is NULL when it shouldn't be, around sched.c:823. Analysis of the GPFS code involved in the first oops shows nothing wrong or unusual, so GPFS is most likely not the cause; but since GPFS consumes most of the CPU cycles during this test, there is a good chance it will be on the stack when the bug occurs.

------- Additional Comment #1 From Khoa Huynh 2003-02-07 18:17 ------- Internal Only

Sreelatha - please have your team look at this bug for me. This is the type of bug where we can get help from the LTC kernel team (under Hans Tanneberger), since they are very familiar with the O(1) scheduler. Please see if you can reproduce this bug on one of your Intel machines, do some initial debugging, and let me know if you need help from the LTC kernel team. Thanks.

------- Additional Comment #2 From Srikrishnan 2003-02-10 09:14 ------- Internal Only

I have installed RH AS 2.1, which comes with kernel 2.4.9-e.3. I could only locate the GPFS Portability Layer on the internet (LTC page).
1. Where do I get the GPFS for Linux installables?
2. To recreate this problem, do I need to set up a cluster, use CSM, RSCT, etc., or can I set up just one machine with RH AS 2.1, install GPFS, and recreate it there?
3. Can you point me to a place/manual that might help in setting up GPFS?

------- Additional Comment #3 From Yuri Volobuev 2003-02-10 13:48 ------- Internal Only

I'll send FTP download info for the GPFS images by e-mail. The problem was encountered when running on a single node (GPFS cluster type 'single'). The GPFS RPMs prereq RSCT, so the RSCT RPMs need to be installed, but no extra configuration is required for RSCT. CSM is not involved here. GPFS docs are available at http://www-1.ibm.com/servers/eserver/clusters/software/gpfs.html. Please note that the 2.4.9-e.3 kernel is not very stable, with GPFS or without (especially with); upgrade to 2.4.9-e.10 before you start. There are other known GPFS-specific problems with RH AS, such as periodic deadlocks, which occur with some regularity (about once a week). Recreating this oops won't be easy, and I'm not sure what the intent is here. If you want to recreate the oops, e.g. under kdb, that may pose a problem too. RH AS ships with the 'ikd' patch, which includes the kdb patch along with a bunch of other debugging aids, but this patch is only applied when building the 'debug' kernel, which is UP and is thus useless for troubleshooting this particular problem. Building an SMP kernel with the ikd patch is possible, but it requires some hacking, and if not done right it will break GPFS. Not to mention that applying the ikd patch throws off timing quite a bit (I've been running with kdb enabled and haven't been able to hit this bug). All in all, trying to reproduce this bug may not be very profitable.
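For readers following the original analysis (EIP at schedule [kernel] 0x1bb, and the suspicion that 'prev' or 'next' is NULL around sched.c:823): the sketch below is a trimmed paraphrase of the main path of Ingo Molnar's O(1) scheduler as it appeared in early 2.5 kernels. It is not the actual 2.4.9-e.10 source, so while the names are real, line numbers and details of the Red Hat backport may differ; it is included only to show where 'prev' and 'next' get dereferenced.

/*
 * Trimmed, paraphrased sketch of the O(1) scheduler's main path (based on
 * early-2.5 sched.c, NOT the 2.4.9-e.10 backport).  Signal handling, the
 * expired/active array swap, and load balancing are omitted.
 */
asmlinkage void schedule(void)
{
	runqueue_t *rq = this_rq();
	task_t *prev = current, *next;
	prio_array_t *array;
	struct list_head *queue;
	int idx;

	spin_lock_irq(&rq->lock);

	/* A task that is going to sleep is removed from its priority array
	 * here; this is the path that dereferences prev->array. */
	if (prev->state != TASK_RUNNING)
		deactivate_task(prev, rq);

	/* Pick the highest-priority runnable task from the active array. */
	array = rq->active;
	idx = sched_find_first_bit(array->bitmap);
	queue = array->queue + idx;
	next = list_entry(queue->next, task_t, run_list);

	if (prev != next) {
		rq->curr = next;
		/* context_switch() is inlined here and begins with
		 * "struct mm_struct *mm = next->mm;", so a bogus 'next'
		 * (e.g. picked off an empty or corrupted queue) oopses
		 * inside schedule() -- which is where comment #6 below
		 * later places the EIP of the original oops. */
		context_switch(prev, next);
	}

	spin_unlock_irq(&rq->lock);
}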
------- Additional Comment #4 From Srikrishnan 2003-02-11 09:07 ------- Internal Only

Yuri - thanks for your e-mail and the info about GPFS. Can you give me access to kernel-source-2.4.9-e.10.i386.rpm?

------- Additional Comment #5 From Yuri Volobuev 2003-02-11 18:29 ------- Internal Only

I've made the kernel-source rpm available in the same location as indicated in the private e-mail, in directory '1912'.

------- Additional Comment #6 From Yuri Volobuev 2003-02-17 13:47 ------- Internal Only

I hit another oops, which may or may not be the same problem; it's certainly along the same lines. I'm still running RH AS 2.1 with the 2.4.9-e.10 kernel, only now with the ikd patch applied and kdb enabled (no other debugging aids from ikd are activated). The test was running simultaneous GCC and OpenSSH builds on a GPFS filesystem. The oops happened in schedule(), but at a different offset:

EBP        EIP        Function(args)
0xc6cd3f04 0xc011aafe schedule+0xbe (0x0, 0xc6cd2000, 0x0, 0x0, 0x0)
                      kernel .text 0xc0100000 0xc011aa40 0xc011adf0
0xc6cd3f58 0xc014d94a pipe_wait+0x8a (0xc3ede200, 0x0, 0x0)
                      kernel .text 0xc0100000 0xc014d8c0 0xc014d980
0xc6cd3f78 0xc014da34 pipe_read+0xb4 (0xc1c028a0, 0xbffec970, 0x80, 0xc1c028c0, 0x0)
                      kernel .text 0xc0100000 0xc014d980 0xc014db90
0xc6cd3fbc 0xc0143c78 sys_read+0x98 (0x3, 0xbffec970, 0x80, 0x80, 0xbffec970)
                      kernel .text 0xc0100000 0xc0143be0 0xc0143cf0
           0xc010748b system_call+0x33
                      kernel .text 0xc0100000 0xc0107458 0xc0107490

The calling process is 'configure':

0xc6cd2000 00020693 00001472 1 000 stop 0xc6cd2360*configure

Registers:
eax = 0xc6cd202c ebx = 0xc037a780 ecx = 0xc6cd2000 edx = 0xc6cd2000
esi = 0x00000000 edi = 0xc6cd2000 esp = 0xc6cd3ed4 eip = 0xc011aafe
ebp = 0xc6cd3f04 xss = 0x00000018 xcs = 0x00000010 eflags = 0x00010046
xds = 0xc6cd0018 xes = 0x00000018 origeax = 0xffffffff &regs = 0xc6cd3ea0

The offending instruction sequence:

0xc011aaf8 schedule+0xb8   mov    0xffffffe0(%ebp),%ecx
0xc011aafb schedule+0xbb   mov    0x34(%ecx),%esi
0xc011aafe schedule+0xbe   decl   (%esi)

The oops is due to esi containing NULL. My humble opinion on what happened: the assembly above is from sched.c:180, "array->nr_active--;" in function dequeue_task(), which is called as "dequeue_task(p, p->array);" at line 247, which in turn is called as "deactivate_task(prev, rq);" from schedule(), line 783. So apparently 'prev' contains a NULL 'array' field. Here's a look at prev:

[0]kdb> md %ecx 20
0xc6cd2000 00000001 00000000 00000000 c0000000 ...............@
0xc6cd2010 c0321520 00000000 00000000 ffffffff .2@........
0xc6cd2020 00000000 0000007c 0000007d c6cd202c ....|...}..., MF
0xc6cd2030 c6cd202c 00000000 00000085 004abed6 , MF........V>J.
0xc6cd2040 00000000 ffffffff 00000003 c691e000 .........`.F
0xc6cd2050 c305e000 c256cce0 c256cce0 c0324d4c .`.C`LVB`LVBLM2@
0xc6cd2060 00000000 00000011 00000000 00000000 ................
0xc6cd2070 00000001 000050d5 000005c0 00000000 ....UP..@.......
0xc6cd2080 00000445 000050d5 00000000 c4ce2000 E...UP....... ND
0xc6cd2090 c4ce2000 c691e000 00000000 00000000 . ND.`.F........
0xc6cd20a0 c6cd20a0 c6cd20a0 00000000 c03a9274 MF MF....t.:@
0xc6cd20b0 00000001 c6cd20b4 c6cd20b4 00000000 ....4 MF4 MF....
0xc6cd20c0 00000000 00000000 00000000 00000000 ................
0xc6cd20d0 00000000 00000000 00000000 00000000 ................
0xc6cd20e0 00000000 000285f2 c6cd2000 c01216b0 ....r.... MF0..@
0xc6cd20f0 c03b31c0 000000b8 00000152 000010ad @1;@8...R...-...
0xc6cd2100 0000097f 004a9faf 0000007b 0000003d ..../.J.{...=...
0xc6cd2110 00000000 00000000 00000000 00000000 ................
0xc6cd2120 00000000 00000000 00000000 00000000 ................
0xc6cd2130 00000000 00000000 00000000 00000000 ................
0xc6cd2140 00000000 00000000 00000000 00000000 ................
0xc6cd2150 00000000 00000000 00000000 00000000 ................
0xc6cd2160 00000000 00000000 00000000 00000000 ................
0xc6cd2170 00000000 00000000 00000000 00000000 ................
0xc6cd2180 00000000 00000000 000000da 00000078 ........Z...x...
0xc6cd2190 00000000 00000000 00000000 00000000 ................
0xc6cd21a0 00000000 00000000 00000000 00000000 ................
0xc6cd21b0 00000000 00000000 00000000 00000000 ................
0xc6cd21c0 00000000 00000000 00000000 00000000 ................
0xc6cd21d0 00000000 00000000 00000000 00000000 ................
0xc6cd21e0 00000000 00000000 00000000 00000000 ................
0xc6cd21f0 00000000 00000000 00000000 00000000 ................
0xc6cd2200 00000000 00000000 0001622e 000000ec .........b..l...
0xc6cd2210 00000000 0004dade 0006b4ba 00000000 ....^Z..:4......
0xc6cd2220 00000001 00000000 00000000 00000000 ................
0xc6cd2230 00000000 00000000 00000000 00000000 ................
0xc6cd2240 00000000 00000007 00000000 00000001 ................
0xc6cd2250 00000002 00000003 00000004 00000006 ................
0xc6cd2260 0000000a 00000000 00000000 00000000 ................
0xc6cd2270 00000000 00000000 00000000 00000000 ................
0xc6cd2280 00000000 00000000 00000000 00000000 ................
0xc6cd2290 00000000 00000000 00000000 00000000 ................
0xc6cd22a0 00000000 00000000 00000000 00000000 ................
0xc6cd22b0 00000000 00000000 00000000 00000000 ................
0xc6cd22c0 00000000 00000000 fffffeff 00000000 ........~....
0xc6cd22d0 fffffeff 00000000 c0322708 ffffffff ~.....'2@
0xc6cd22e0 ffffffff ffffffff ffffffff ffffffff
0xc6cd22f0 ffffffff 00800000 ffffffff 00000000 ........
0xc6cd2300 ffffffff ffffffff ffffffff 000001ff ...
0xc6cd2310 000001ff 00000400 00000400 ffffffff ...........
0xc6cd2320 ffffffff ffffffff ffffffff ffffffff
0xc6cd2330 ffffffff 6f630000 6769666e 00657275 ..configure.
0xc6cd2340 68732e64 00000000 00000000 00000000 d.sh............
0xc6cd2350 00000000 00000000 00000000 00000000 ................
0xc6cd2360 c6cd4000 c011aafe c6cd3ed4 00000000 .@MF~*.@T>MF....
0xc6cd2370 00000000 00000000 00000000 00000000 ................
0xc6cd2380 00000000 00000000 00000000 00000000 ................
0xc6cd2390 00000000 00000000 00000000 00000000 ................
0xc6cd23a0 4020037f 077d0000 40025926 00000023 .. @..}.&Y.@#...
0xc6cd23b0 bffec7f0 0000002b 00000000 00000000 pG~? ...........
0xc6cd23c0 00000000 00000000 00000000 00000000 ................
0xc6cd23d0 00000000 00000000 00000000 00000000 ................
0xc6cd23e0 00000000 00000000 00000000 00000000 ................
0xc6cd23f0 00000000 00000000 00000000 00000000 ................
0xc6cd2400 00000000 00000000 00000000 00000000 ................
0xc6cd2410 00000000 00000000 00000000 00000000 ................
0xc6cd2420 566490a4 963b70ef 00003fff 00000000 $.dVop;.?......
0xc6cd2430 00000001 8f45c800 00004014 00000000 .....HE..@......
0xc6cd2440 00000000 00000000 00000000 00000000 ................
0xc6cd2450 00000000 00000000 00000000 00000000 ................
0xc6cd2460 00000000 00000000 00000000 00000000 ................
0xc6cd2470 00000000 00000000 00000000 00000000 ................
0xc6cd2480 00000000 00000000 00000000 00000000 ................
0xc6cd2490 00000000 00000000 00000000 00000000 ................
0xc6cd24a0 00000000 00000000 00000000 00000000 ................
0xc6cd24b0 00000000 00000000 00000000 00000000 ................
0xc6cd24c0 00000000 00000000 00000000 00000000 ................
0xc6cd24d0 00000000 00000000 00000000 00000000 ................
0xc6cd24e0 00000000 00000000 00000000 00000000 ................
0xc6cd24f0 00000000 00000000 00000000 00000000 ................
0xc6cd2500 00000000 00000000 00000000 00000000 ................
0xc6cd2510 00000000 00000000 00000000 00000000 ................
0xc6cd2520 00000000 00000000 00000000 00000000 ................
0xc6cd2530 00000000 00000000 00000000 00000000 ................
0xc6cd2540 00000000 00000000 00000000 00000000 ................
0xc6cd2550 00000000 00000000 00000000 00000000 ................
0xc6cd2560 00000000 00000000 00000000 00000000 ................
0xc6cd2570 00000000 00000000 00000000 00000000 ................
0xc6cd2580 00000000 00000000 00000000 00000000 ................
0xc6cd2590 00000000 00000000 00000000 00000000 ................
0xc6cd25a0 00000000 00000000 00000000 00000000 ................
0xc6cd25b0 00000000 00000000 00000000 ffffffff ............
0xc6cd25c0 00000000 00000000 00000000 00000000 ................
0xc6cd25d0 00000000 00000000 00000000 00000000 ................
0xc6cd25e0 00000000 00000000 00000000 00000000 ................
0xc6cd25f0 00000000 00000000 00000000 00000000 ................
0xc6cd2600 00000000 00000000 00000000 00000000 ................
0xc6cd2610 00000000 00000000 00000000 00000000 ................
0xc6cd2620 00000000 00000000 00000000 00000000 ................
0xc6cd2630 00000000 00000000 00000000 00000000 ................
0xc6cd2640 c5a513a0 c62a2280 00000000 00000001 .%E."*F........
0xc6cd2650 c55ab060 00000000 00000000 00000000 `0ZE............
0xc6cd2660 c6cd265c 00000000 00000000 00000000 \&MF............
0xc6cd2670 00000000 00000000 00000000 00000000 ................
0xc6cd2680 00000000 00000000 0000000c 0000000d ................
0xc6cd2690 00000001 40000000 00000000 00000000 .......@........
0xc6cd26a0 00000000 00000000 00000003 00000000 ................
0xc6cd26b0 00000000 00000000 00000000 00000000 ................

The fields preceding 'array' look reasonable, but 'array' itself, at offset 0x34, is NULL. Don't know about the rest of the struct. BTW, I did confirm my theory that the original oops (at schedule 0x1bb) is the result of "struct mm_struct *mm = next->mm;" at sched.c:403. The offending instruction is "mov 0x58(%esi),%eax", and 'mm' does have offset 0x58, while %esi appears to contain 'next' at this point.

------- Additional Comment #7 From Khoa Huynh 2003-02-18 18:02 ------- Internal Only

I've requested technical help from the LTC Kernel Team....

------- Additional Comment #8 From Khoa Huynh 2003-02-20 17:01 ------- Internal Only

Rick - thanks for looking at this bug for us!

------- Additional Comment #9 From Yuri Volobuev 2003-03-03 13:22 ------- Internal Only

Any updates on this problem? I'm raising the severity of the defect per request from marketing. A very significant customer opportunity depends on being able to run RH AS 2.1 on the x440, and right now that is far from rock solid. Perhaps Red Hat could be contacted to take a look at this? The person who wrote the O(1) scheduler patch, Ingo Molnar, now works for Red Hat, so if the right communication channel can be found, Red Hat could contribute significantly to fixing this issue.
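As a reference for the disassembly walk-through in comment #6 (%ecx holding the task, 'array' loaded from offset 0x34 into %esi, and "decl (%esi)" faulting on a NULL pointer), here is a paraphrased sketch of the two helpers involved. It is again based on the mainline O(1) scheduler code rather than the exact 2.4.9-e backport, so the line numbers quoted (sched.c:180 and :247) are the reporter's, and field offsets here are not meant to be exact.

/*
 * Paraphrased from the O(1) scheduler (early-2.5 sched.c); not the exact
 * 2.4.9-e.10 backport source.
 */
static inline void dequeue_task(task_t *p, prio_array_t *array)
{
	/* Reported as sched.c:180 -- the "decl (%esi)" that oopsed,
	 * because 'array' (i.e. p->array, offset 0x34) was NULL. */
	array->nr_active--;
	list_del(&p->run_list);
	if (list_empty(array->queue + p->prio))
		__clear_bit(p->prio, array->bitmap);
}

static inline void deactivate_task(task_t *p, runqueue_t *rq)
{
	rq->nr_running--;
	if (p->state == TASK_UNINTERRUPTIBLE)
		rq->nr_uninterruptible++;
	/* Reported as sched.c:247 -- passes p->array, which the kdb dump
	 * of the task above shows as NULL. */
	dequeue_task(p, p->array);
	/* Note that deactivate_task() itself is what normally leaves
	 * p->array set to NULL once the task is off the runqueue. */
	p->array = NULL;
}

deactivate_task() is called as deactivate_task(prev, rq) from schedule(), as sketched after comment #3 above, which matches the reporter's conclusion that 'prev' reached schedule() with a NULL 'array' field.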
Please try to reproduce this bug without any binary-only kernel modules loaded. In addition, you appear to be using insmod -f; please don't do that.
Moving to close this as Not a Bug.