Bug 86017 - LTC1912 - oops at schedule+0x1bb
Summary: LTC1912 - oops at schedule+0x1bb
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Red Hat Enterprise Linux 2.1
Classification: Red Hat
Component: kernel
Version: 2.1
Hardware: i586
OS: Linux
Priority: high
Severity: high
Target Milestone: ---
Assignee: Arjan van de Ven
QA Contact: Brian Brock
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2003-03-12 15:15 UTC by Need Real Name
Modified: 2007-11-30 22:06 UTC (History)
1 user

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2003-03-25 21:43:04 UTC
Target Upstream Version:
Embargoed:



Description Need Real Name 2003-03-12 15:15:00 UTC
Hardware Environment:
Intellistation M Pro, 2 333MHz PII CPUs, 128MB RAM, 9GB SCSI disk hooked up
to Adaptec AIC-7895 Ultra SCSI host adapter.

Software Environment:
Red Hat Advanced Server, 2.4.9-e.10smp kernel, GPFS 1.3.0-1

Steps to Reproduce:
1. Create a GPFS filesystem on any SCSI disk partition, at least 1GB in size.
2. Unpack the GCC distribution (gcc-3.2.tar.gz) on the GPFS filesystem.
3. Start the GCC build: cd gcc-3.2 ; ./configure; make

Actual Results:
Normally, GCC builds fine on a GPFS filesystem.  However, twice now in about two
weeks of testing, the kernel has oopsed; see the tracebacks below.

Expected Results:
The build should run normally to completion.

Additional Information:
So far, the oops at the same address (schedule+0x1bb) has been hit twice.  On the
first occurrence, GPFS was in the stack:
[root@gpfs10 root]# Oops: 0000
Kernel 2.4.9-e.10smp
CPU:    1
EIP:    0010:[<c011945b>]    Tainted: PF
EFLAGS: 00010007
EIP is at schedule [kernel] 0x1bb
eax: 0000008c   ebx: c0350120   ecx: c0350140   edx: c4aa4000
esi: ffffffd4   edi: c4aa4000   ebp: c4aa5c78   esp: c4aa5c58
ds: 0018   es: 0018   ss: 0018
Process threaddirmaker (pid: 10388, stackpage=c4aa5000)
Stack: c7034040 c46b7ac0 c0349900 c46b4360 c4aa4000 00000202 c89ef4f8 c2ef487c
      c2ef4884 c893c671 c4aa4000 c46b4000 c4aa5de8 00000000 c2ef487c c46b5ca4
      00000001 c4aa5cac c4aa5cac 00000000 c4aa4000 c4aa5c9c c4aa5c9c 00000000
Call Trace: [<c89ef4f8>] _14BaseMutexClass.kernelSynchState [mmfs] 0x818
[<c893c671>] cxiWaitEventWait [mmfslinux] 0xc1
[<c89ef4f8>] _14BaseMutexClass.kernelSynchState [mmfs] 0x818
[<c8989600>] internalAcquire__14BaseMutexClassUi [mmfs] 0xdc
[<c89ef4f8>] _14BaseMutexClass.kernelSynchState [mmfs] 0x818
[<c8989550>] internalAcquire__14BaseMutexClassUi [mmfs] 0x2c
[<c89ef4f8>] _14BaseMutexClass.kernelSynchState [mmfs] 0x818
[<c89ef4f8>] _14BaseMutexClass.kernelSynchState [mmfs] 0x818
[<c89897b2>] kAcquireSlow__19NotGlobalMutexClass [mmfs] 0x86
[<c89ef4f8>] _14BaseMutexClass.kernelSynchState [mmfs] 0x818
[<c89ef4f8>] _14BaseMutexClass.kernelSynchState [mmfs] 0x818
[<c89899cb>] kWait__6ThCondiPCc [mmfs] 0x8b
[<c8954913>] sendToDaemonWithReply__13KernelMailbox [mmfs] 0x17b
[<c89b6f89>] .rodata [mmfs] 0x15e9
[<c89547ce>] sendToDaemonWithReply__13KernelMailbox [mmfs] 0x36
[<c8964304>] kSFSRmdir__FP15KernelOperationG7FileUIDPcT1P10ext_cred_t [mmfs] 0x198
[<c8962f46>] ReleaseDaemonSegAndSG__F12SegMapStatusiP11StripeGroupUiiUi [mmfs] 0x152
[<c896cabb>] gpfsRmdir__FP13gpfsVfsData_tP9cxiNode_tT1PcP10ext_cred_t [mmfs] 0x187
[<c0159fd4>] clear_inode [kernel] 0x124
[<c893f8fc>] .text.trace [mmfslinux] 0x121c
[<c893835b>] gpfs_i_rmdir [mmfslinux] 0x9b
[<c0243482>] call_reschedule_interrupt [kernel] 0x5
[<c0243482>] call_reschedule_interrupt [kernel] 0x5
[<c0157fec>] dput [kernel] 0x1c
[<c014ef60>] cached_lookup [kernel] 0x10
[<c014faf8>] path_walk [kernel] 0x908
[<c01510d7>] vfs_rmdir [kernel] 0x1f7
[<c014fe9a>] lookup_hash [kernel] 0x4a
[<c0151217>] sys_rmdir [kernel] 0xa7
[<c01072e3>] system_call [kernel] 0x33


Code: 8b 46 58 89 45 ec 8b 47 54 89 45 e8 8b 4d ec 85 c9 75 32 89
<0>Kernel panic: not continuing


However, on the second occurrence, GPFS wasn't in the stack at all:

gpfs10 login: Oops: 0000
Kernel 2.4.9-e.10smp
CPU:    1
EIP:    0010:[<c011945b>]    Tainted: PF
EFLAGS: 00010007
EIP is at schedule [kernel] 0x1bb
eax: 0000008c   ebx: c0350120   ecx: c03505c0   edx: c4076000
esi: ffffffd4   edi: c4076000   ebp: c4077f04   esp: c4077ee4
ds: 0018   es: 0018   ss: 0018
Process sh (pid: 29543, stackpage=c4077000)
Stack: c5e13fb4 c4076000 c42629c0 c4262900 c4076000 c41c37a0 c4076000 c41c3740
      c4077f2c c014dfec 00000000 c4076000 00000000 00000000 00000000 c4076000
      c4076000 c4262900 00000000 c4076000 c22ba564 c22ba564 00000000 00000286
Call Trace: [<c014dfec>] pipe_wait [kernel] 0x8c
[<c014e0d4>] pipe_read [kernel] 0xb4
[<c01441b6>] sys_read [kernel] 0x96
[<c01072e3>] system_call [kernel] 0x33


Code: 8b 46 58 89 45 ec 8b 47 54 89 45 e8 8b 4d ec 85 c9 75 32 89

The identical test never produced this particular oops on other (non-AS) kernels
despite very extensive testing.  The bug is apparently caused by the
linux-2.4.9-o1-scheduler.patch, which contains a lot of 2.5.x code.  Either
'prev' or 'next' is NULL when it shouldn't be, around sched.c:823.  Analysis of
the GPFS code involved in the first oops shows nothing wrong or unusual, so GPFS
is most likely not the cause; but since most CPU cycles are consumed by GPFS
during this test, there's a good chance it will be on the stack when the bug
occurs.
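
To make the suspected failure mode concrete, here is a small self-contained C
model of the spot in question.  This is an illustration only, not the Red Hat
source: the structure and field names (runqueue, prio_array, task_struct->array,
->mm) loosely follow the public O(1) scheduler patch, and the 140 priority
queues are collapsed into a single pointer.

/*
 * Simplified model of the O(1) scheduler's pick-next step in schedule().
 * Illustration only; not the patched 2.4.9-e sched.c.
 */
#include <stdio.h>

struct mm_struct { int mm_users; };

struct prio_array {
	int nr_active;
	struct task_struct *first;   /* stand-in for the per-priority run lists */
};

struct task_struct {
	struct prio_array *array;    /* runqueue array the task sits on, or NULL */
	struct mm_struct *mm;        /* NULL for kernel threads */
};

struct runqueue {
	struct prio_array *active;
	struct task_struct *curr;
};

static void schedule_model(struct runqueue *rq)
{
	struct task_struct *prev = rq->curr;
	/* real patch: idx = sched_find_first_bit(array->bitmap);
	 * next = list_entry(queue->next, task_t, run_list);       */
	struct task_struct *next = rq->active->first;

	/* From roughly this point on (around sched.c:823 in the patched tree)
	 * both pointers are dereferenced unconditionally (prev->array,
	 * next->mm, ...), so a NULL prev or next faults inside schedule()
	 * itself, which matches the tracebacks above.                       */
	printf("prev->array=%p next->mm=%p\n",
	       (void *)prev->array, (void *)next->mm);
}

int main(void)
{
	struct mm_struct mm = { 1 };
	struct task_struct task = { NULL, &mm };
	struct prio_array arr = { 1, &task };
	struct runqueue rq = { &arr, &task };

	schedule_model(&rq);   /* with a NULL prev or next this would oops */
	return 0;
}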


------- Additional Comment #1 From Khoa Huynh 2003-02-07 18:17 ------- Internal Only

Sreelatha - please have your team look at this bug for me.  This is the
type of bug where we can get help from the LTC kernel team (under
Hans Tanneberger), since they are very familiar with the O(1) scheduler.
Please see if you can reproduce this bug on one of your Intel
machines, do some initial debugging, and let me know if you need
help from the LTC kernel team.  Thanks.


------- Additional Comment #2 From Srikrishnan 2003-02-10 09:14 ------- Internal
Only

I have installed RH AS 2.1, which comes with kernel 2.4.9-e.3.  I could locate
only the GPFS Portability Layer on the internet (the LTC page).
1. Where do I get the GPFS for Linux installables?
2. To recreate this problem, do I need to set up a cluster (with CSM, RSCT,
etc.), or can I set up just one machine with RH AS 2.1, install GPFS, and
recreate?
3. Can you point me to a document or manual that might help with setting up
GPFS?


------- Additional Comment #3 From Yuri Volobuev 2003-02-10 13:48 -------
Internal Only

I'll send FTP download info for the GPFS images by e-mail.

The problem was encountered when running on a single node (GPFS cluster type
'single').  The GPFS RPMs have a prereq on RSCT, so the RSCT RPMs need to be
installed, but there's no extra configuration required for RSCT.  CSM is not
involved here.

GPFS docs are available at
http://www-1.ibm.com/servers/eserver/clusters/software/gpfs.html.

Please note that the 2.4.9-e.3 kernel is not very stable, with GPFS or without,
especially with.  Upgrade to 2.4.9-e.10 before you start.  There are other known
GPFS-specific problems with RH AS, such as periodic deadlocks, which occur with
some regularity (about once a week).  Recreating this oops won't be easy, and
I'm not sure what the intent is here.  If you want to recreate the oops, e.g. in
kdb, this may pose a problem too.  RH AS ships with the 'ikd' patch, which
includes kdb along with a bunch of other debugging aids, but this patch is only
applied when building the 'debug' kernel, which is UP and thus useless for
troubleshooting this particular problem.  Building an SMP kernel with the ikd
patch is possible, but requires some hacking, and if not done right it will
break GPFS.  Not to mention that applying the ikd patch throws off timing quite
a bit (I've been running with kdb enabled and haven't been able to hit this
bug).  All in all, trying to reproduce this bug may not be very profitable.


------- Additional Comment #4 From Srikrishnan 2003-02-11 09:07 ------- Internal
Only

Yuri - thanks for your email and the info about GPFS.  Can you give me access to
the kernel-source-2.4.9-e.10.i386.rpm?


------- Additional Comment #5 From Yuri Volobuev 2003-02-11 18:29 -------
Internal Only

I've made the kernel-source rpm available in the same location as indicated in
the private e-mail, directory '1912'.


------- Additional Comment #6 From Yuri Volobuev 2003-02-17 13:47 -------
Internal Only

I hit another oops, which may or may not be the same problem.  It's certainly
along the same lines.  I'm still running RH AS 2.1 with the 2.4.9-e.10 kernel,
only now with the ikd patch applied and kdb enabled (no other debugging aids
from ikd are activated).  The test was running simultaneous GCC and OpenSSH
builds on a GPFS filesystem.  The oops happened in schedule(), but at a
different offset:

   EBP       EIP         Function(args)
0xc6cd3f04 0xc011aafe schedule 0xbe (0x0, 0xc6cd2000, 0x0, 0x0, 0x0)
                              kernel .text 0xc0100000 0xc011aa40 0xc011adf0
0xc6cd3f58 0xc014d94a pipe_wait 0x8a (0xc3ede200, 0x0, 0x0)
                              kernel .text 0xc0100000 0xc014d8c0 0xc014d980
0xc6cd3f78 0xc014da34 pipe_read 0xb4 (0xc1c028a0, 0xbffec970, 0x80, 0xc1c028c0, 0x0)
                              kernel .text 0xc0100000 0xc014d980 0xc014db90
0xc6cd3fbc 0xc0143c78 sys_read 0x98 (0x3, 0xbffec970, 0x80, 0x80, 0xbffec970)
                              kernel .text 0xc0100000 0xc0143be0 0xc0143cf0
          0xc010748b system_call 0x33
                              kernel .text 0xc0100000 0xc0107458 0xc0107490

The calling process is 'configure'.
0xc6cd2000 00020693 00001472  1  000  stop  0xc6cd2360*configure

Registers:

eax = 0xc6cd202c ebx = 0xc037a780 ecx = 0xc6cd2000 edx = 0xc6cd2000
esi = 0x00000000 edi = 0xc6cd2000 esp = 0xc6cd3ed4 eip = 0xc011aafe
ebp = 0xc6cd3f04 xss = 0x00000018 xcs = 0x00000010 eflags = 0x00010046
xds = 0xc6cd0018 xes = 0x00000018 origeax = 0xffffffff &regs = 0xc6cd3ea0


The offending instruction sequence:

0xc011aaf8 schedule 0xb8   mov    0xffffffe0(%ebp),%ecx
0xc011aafb schedule 0xbb   mov    0x34(%ecx),%esi
0xc011aafe schedule 0xbe   decl   (%esi)

The oops is due to esi containing NULL.  My humble opinion on what happened:
the assembly above is from sched.c:180, "array->nr_active--;" in dequeue_task(),
called as dequeue_task(p, p->array) at line 247, which in turn is called as
deactivate_task(prev, rq) from schedule() at line 783.  So apparently 'prev'
contains a NULL 'array' field.  Here's a look at prev:

[0]kdb> md %ecx 20
0xc6cd2000 00000001 00000000 00000000 c0000000  ...............@
0xc6cd2010 c0321520 00000000 00000000 ffffffff   .2@........
0xc6cd2020 00000000 0000007c 0000007d c6cd202c  ....|...}..., MF
0xc6cd2030 c6cd202c 00000000 00000085 004abed6  , MF........V>J.
0xc6cd2040 00000000 ffffffff 00000003 c691e000  .........`.F
0xc6cd2050 c305e000 c256cce0 c256cce0 c0324d4c  .`.C`LVB`LVBLM2@
0xc6cd2060 00000000 00000011 00000000 00000000  ................
0xc6cd2070 00000001 000050d5 000005c0 00000000  ....UP..@.......
0xc6cd2080 00000445 000050d5 00000000 c4ce2000  E...UP....... ND
0xc6cd2090 c4ce2000 c691e000 00000000 00000000  . ND.`.F........
0xc6cd20a0 c6cd20a0 c6cd20a0 00000000 c03a9274    MF  MF....t.:@
0xc6cd20b0 00000001 c6cd20b4 c6cd20b4 00000000  ....4 MF4 MF....
0xc6cd20c0 00000000 00000000 00000000 00000000  ................
0xc6cd20d0 00000000 00000000 00000000 00000000  ................
0xc6cd20e0 00000000 000285f2 c6cd2000 c01216b0  ....r.... MF0..@
0xc6cd20f0 c03b31c0 000000b8 00000152 000010ad  @1;@8...R...-...
0xc6cd2100 0000097f 004a9faf 0000007b 0000003d  ..../.J.{...=...
0xc6cd2110 00000000 00000000 00000000 00000000  ................
0xc6cd2120 00000000 00000000 00000000 00000000  ................
0xc6cd2130 00000000 00000000 00000000 00000000  ................
0xc6cd2140 00000000 00000000 00000000 00000000  ................
0xc6cd2150 00000000 00000000 00000000 00000000  ................
0xc6cd2160 00000000 00000000 00000000 00000000  ................
0xc6cd2170 00000000 00000000 00000000 00000000  ................
0xc6cd2180 00000000 00000000 000000da 00000078  ........Z...x...
0xc6cd2190 00000000 00000000 00000000 00000000  ................
0xc6cd21a0 00000000 00000000 00000000 00000000  ................
0xc6cd21b0 00000000 00000000 00000000 00000000  ................
0xc6cd21c0 00000000 00000000 00000000 00000000  ................
0xc6cd21d0 00000000 00000000 00000000 00000000  ................
0xc6cd21e0 00000000 00000000 00000000 00000000  ................
0xc6cd21f0 00000000 00000000 00000000 00000000  ................
0xc6cd2200 00000000 00000000 0001622e 000000ec  .........b..l...
0xc6cd2210 00000000 0004dade 0006b4ba 00000000  ....^Z..:4......
0xc6cd2220 00000001 00000000 00000000 00000000  ................
0xc6cd2230 00000000 00000000 00000000 00000000  ................
0xc6cd2240 00000000 00000007 00000000 00000001  ................
0xc6cd2250 00000002 00000003 00000004 00000006  ................
0xc6cd2260 0000000a 00000000 00000000 00000000  ................
0xc6cd2270 00000000 00000000 00000000 00000000  ................
0xc6cd2280 00000000 00000000 00000000 00000000  ................
0xc6cd2290 00000000 00000000 00000000 00000000  ................
0xc6cd22a0 00000000 00000000 00000000 00000000  ................
0xc6cd22b0 00000000 00000000 00000000 00000000  ................
0xc6cd22c0 00000000 00000000 fffffeff 00000000  ........~....
0xc6cd22d0 fffffeff 00000000 c0322708 ffffffff  ~.....'2@
0xc6cd22e0 ffffffff ffffffff ffffffff ffffffff  
0xc6cd22f0 ffffffff 00800000 ffffffff 00000000  ........
0xc6cd2300 ffffffff ffffffff ffffffff 000001ff  ...
0xc6cd2310 000001ff 00000400 00000400 ffffffff  ...........
0xc6cd2320 ffffffff ffffffff ffffffff ffffffff  
0xc6cd2330 ffffffff 6f630000 6769666e 00657275  ..configure.
0xc6cd2340 68732e64 00000000 00000000 00000000  d.sh............
0xc6cd2350 00000000 00000000 00000000 00000000  ................
0xc6cd2360 c6cd4000 c011aafe c6cd3ed4 00000000  .@MF~*.@T>MF....
0xc6cd2370 00000000 00000000 00000000 00000000  ................
0xc6cd2380 00000000 00000000 00000000 00000000  ................
0xc6cd2390 00000000 00000000 00000000 00000000  ................
0xc6cd23a0 4020037f 077d0000 40025926 00000023  .. @..}.&Y.@#...
0xc6cd23b0 bffec7f0 0000002b 00000000 00000000  pG~? ...........
0xc6cd23c0 00000000 00000000 00000000 00000000  ................
0xc6cd23d0 00000000 00000000 00000000 00000000  ................
0xc6cd23e0 00000000 00000000 00000000 00000000  ................
0xc6cd23f0 00000000 00000000 00000000 00000000  ................
0xc6cd2400 00000000 00000000 00000000 00000000  ................
0xc6cd2410 00000000 00000000 00000000 00000000  ................
0xc6cd2420 566490a4 963b70ef 00003fff 00000000  $.dVop;.?......
0xc6cd2430 00000001 8f45c800 00004014 00000000  .....HE..@......
0xc6cd2440 00000000 00000000 00000000 00000000  ................
0xc6cd2450 00000000 00000000 00000000 00000000  ................
0xc6cd2460 00000000 00000000 00000000 00000000  ................
0xc6cd2470 00000000 00000000 00000000 00000000  ................
0xc6cd2480 00000000 00000000 00000000 00000000  ................
0xc6cd2490 00000000 00000000 00000000 00000000  ................
0xc6cd24a0 00000000 00000000 00000000 00000000  ................
0xc6cd24b0 00000000 00000000 00000000 00000000  ................
0xc6cd24c0 00000000 00000000 00000000 00000000  ................
0xc6cd24d0 00000000 00000000 00000000 00000000  ................
0xc6cd24e0 00000000 00000000 00000000 00000000  ................
0xc6cd24f0 00000000 00000000 00000000 00000000  ................
0xc6cd2500 00000000 00000000 00000000 00000000  ................
0xc6cd2510 00000000 00000000 00000000 00000000  ................
0xc6cd2520 00000000 00000000 00000000 00000000  ................
0xc6cd2530 00000000 00000000 00000000 00000000  ................
0xc6cd2540 00000000 00000000 00000000 00000000  ................
0xc6cd2550 00000000 00000000 00000000 00000000  ................
0xc6cd2560 00000000 00000000 00000000 00000000  ................
0xc6cd2570 00000000 00000000 00000000 00000000  ................
0xc6cd2580 00000000 00000000 00000000 00000000  ................
0xc6cd2590 00000000 00000000 00000000 00000000  ................
0xc6cd25a0 00000000 00000000 00000000 00000000  ................
0xc6cd25b0 00000000 00000000 00000000 ffffffff  ............
0xc6cd25c0 00000000 00000000 00000000 00000000  ................
0xc6cd25d0 00000000 00000000 00000000 00000000  ................
0xc6cd25e0 00000000 00000000 00000000 00000000  ................
0xc6cd25f0 00000000 00000000 00000000 00000000  ................
0xc6cd2600 00000000 00000000 00000000 00000000  ................
0xc6cd2610 00000000 00000000 00000000 00000000  ................
0xc6cd2620 00000000 00000000 00000000 00000000  ................
0xc6cd2630 00000000 00000000 00000000 00000000  ................
0xc6cd2640 c5a513a0 c62a2280 00000000 00000001   .%E."*F........
0xc6cd2650 c55ab060 00000000 00000000 00000000  `0ZE............
0xc6cd2660 c6cd265c 00000000 00000000 00000000  \&MF............
0xc6cd2670 00000000 00000000 00000000 00000000  ................
0xc6cd2680 00000000 00000000 0000000c 0000000d  ................
0xc6cd2690 00000001 40000000 00000000 00000000  .......@........
0xc6cd26a0 00000000 00000000 00000003 00000000  ................
0xc6cd26b0 00000000 00000000 00000000 00000000  ................

The fields preceding 'array' look reasonable, but 'array' itself at offset
0x34 is NULL.  I don't know about the rest of the struct.
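
For readers without the patched source handy, the call chain blamed above
reduces to something like the following sketch.  Field names follow the O(1)
patch; the 0x34 offset of 'array' is quoted from the kdb dump, not reproduced
by this simplified layout.  Illustration only, not the actual patched sched.c.

#include <stddef.h>

struct prio_array { int nr_active; /* bitmap and priority queues omitted */ };

struct task_struct {
	struct prio_array *array;    /* NULL for this 'configure' task per the dump */
};

/* roughly sched.c:180 in the patch */
static void dequeue_task(struct task_struct *p, struct prio_array *array)
{
	array->nr_active--;          /* the "decl (%esi)" at schedule 0xbe */
	(void)p;                     /* list_del(&p->run_list) etc. omitted */
}

/* roughly sched.c:247, called from schedule() as deactivate_task(prev, rq) */
static void deactivate_task(struct task_struct *p)
{
	dequeue_task(p, p->array);   /* a NULL p->array is passed straight through */
}

int main(void)
{
	struct prio_array rq_array = { 1 };
	struct task_struct prev = { &rq_array };

	deactivate_task(&prev);       /* fine: array is valid, nr_active drops to 0 */

	prev.array = NULL;            /* the state seen in the dump above ...       */
	/* deactivate_task(&prev); */ /* ... would fault exactly like this oops     */
	return 0;
}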

BTW, I did confirm my theory that the original oops (at schedule+0x1bb) is the
result of "struct mm_struct *mm = next->mm;" at sched.c:403.  The offending
instruction is "mov    0x58(%esi),%eax", and 'mm' does have offset 0x58, while
%esi appears to contain 'next' at this point.


------- Additional Comment #7 From Khoa Huynh 2003-02-18 18:02 ------- Internal Only

I've requested technical help from the LTC Kernel Team.


------- Additional Comment #8 From Khoa Huynh 2003-02-20 17:01 ------- Internal Only

Rick - thanks for looking at this bug for us!


------- Additional Comment #9 From Yuri Volobuev 2003-03-03 13:22 -------
Internal Only

Any updates on this problem?  I'm raising the severity of the defect per request
from marketing.  A very significant customer opportunity depends on being able
to run RH AS 2.1 on the x440, and right now this is far from rock solid.

Perhaps Red Hat could be contacted to take a look at this?  The person who wrote
the O(1) scheduler patch, Ingo Molnar, works for Red Hat now, so if the right
communication channel could be found, Red Hat could contribute significantly to
fixing this issue.

Comment 1 Arjan van de Ven 2003-03-12 15:20:41 UTC
Please try to reproduce this bug without any binary-only kernel modules loaded.
In addition, you seem to be using insmod -f; please don't do that.
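
For context, a hedged gloss rather than the exact 2.4.9-e source: the
"Tainted: PF" in both oops headers comes from the kernel's taint tracking,
where 'P' means a proprietary/binary-only module (such as mmfs) is loaded and
'F' means a module was force-loaded, which is what the comment above asks to
rule out.  A minimal user-space model of that flag logic:

#include <stdio.h>

/* Minimal model of the 2.4-era taint flags behind the "Tainted: PF" header.
 * Flag values mirror the mainline definitions of the time; treat this as an
 * illustration, not the exact Red Hat 2.4.9-e code. */
#define TAINT_PROPRIETARY_MODULE  (1 << 0)   /* 'P': binary-only module loaded */
#define TAINT_FORCED_MODULE       (1 << 1)   /* 'F': module loaded with insmod -f */

static void print_tainted_model(unsigned int tainted)
{
	printf("Tainted: %c%c\n",
	       (tainted & TAINT_PROPRIETARY_MODULE) ? 'P' : 'G',
	       (tainted & TAINT_FORCED_MODULE)      ? 'F' : ' ');
}

int main(void)
{
	/* GPFS (mmfs/mmfslinux) is binary-only and was apparently force-loaded: */
	print_tainted_model(TAINT_PROPRIETARY_MODULE | TAINT_FORCED_MODULE);
	return 0;
}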


Comment 2 Need Real Name 2003-03-25 21:43:04 UTC
Moving to close this as Not a Bug.

