Bug 205592 - Processes hanging in stafs() system call to GFS mounted filesystem
Processes hanging in stafs() system call to GFS mounted filesystem
Status: CLOSED ERRATA
Product: Red Hat Enterprise Linux 4
Classification: Red Hat
Component: GFS-kernel (Show other bugs)
4.4
x86_64 Linux
medium Severity medium
: ---
: ---
Assigned To: Wendy Cheng
Dean Jansa
:
Depends On:
Blocks: 207487
  Show dependency treegraph
 
Reported: 2006-09-07 10:50 EDT by Eric Z. Ayers
Modified: 2010-01-11 22:23 EST (History)
6 users (show)

See Also:
Fixed In Version: RHBA-2006-0705
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2006-10-11 12:49:32 EDT
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments (Terms of Use)
/var/log/messages after sysrq-w sysrq-m sysrq-t (14.02 KB, application/x-tar)
2006-09-07 12:13 EDT, Eric Z. Ayers
no flags Details
screen shot of console of blade that had kernel panic (385.75 KB, image/png)
2006-09-08 17:58 EDT, Eric Z. Ayers
no flags Details

  None (edit)
Description Eric Z. Ayers 2006-09-07 10:50:32 EDT
Description of problem:

(Note, I mailed this to Wendy Chang this morning, but then I found the 
exact system call the problem was failing on)

we checked on the cluster this morning we found processed running at full CPU
that we cannot attach to with the debugger.  We have seen this before - the only
solution is to reboot some or all nodes in the cluster

from blade02:

top - 10:16:58 up 22 days, 16:13,  2 users,  load average: 11.00, 11.00, 11.00
Tasks: 156 total,  11 running, 144 sleeping,   1 stopped,   0 zombie
Cpu(s):  0.2% us,  3.4% sy,  0.0% ni, 96.2% id,  0.1% wa,  0.0% hi,  0.0% si
Mem:   8005372k total,  2314368k used,  5691004k free,   106640k buffers
Swap:  8385920k total,        0k used,  8385920k free,  1869916k cached

 PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
9283 cgi       35  10 26272  12m 2732 R   22  0.2 216:51.86 dcs_operation_m
9306 cgi       35  10 26276  12m 2732 R   22  0.2 217:08.51 dcs_operation_m
18606 root      25   0 49928  604  524 R   22  0.0  78:27.56 df
9312 cgi       35  10 26272  12m 2732 R   22  0.2 218:20.53 dcs_operation_m
9285 cgi       35  10 26276  12m 2732 R   20  0.2 209:36.55 dcs_operation_m
9286 cgi       35  10 26276  12m 2732 R   20  0.2 204:58.36 dcs_operation_m
9304 cgi       35  10 26272  12m 2732 R   20  0.2 199:41.17 dcs_operation_m
9308 cgi       35  10 26276  12m 2732 R   20  0.2 204:08.35 dcs_operation_m
9310 cgi       35  10 26272  12m 2732 R   18  0.2 203:50.90 dcs_operation_m
9314 cgi       35  10 26272  12m 2732 R   16  0.2 217:11.20 dcs_operation_m
   1 root      16   0  4748  524  440 S    0  0.0   0:02.70 init
   2 root      RT   0     0    0    0 S    0  0.0   0:13.85 migration/0
   3 root      34  19     0    0    0 S    0  0.0   0:00.26 ksoftirqd/0
   4 root      RT   0     0    0    0 S    0  0.0   0:14.73 migration/1
   5 root      34  19     0    0    0 S    0  0.0   0:00.20 ksoftirqd/1
   6 root       5 -10     0    0    0 S    0  0.0   0:00.01 events/0
...

dcs_operation_mgr is part of our application, but 'df' is the one that came with
the OS and run automatically

-bash-3.00$ ps -fp 18606
UID        PID  PPID  C STIME TTY          TIME CMD
root     18606 18604 20 04:02 ?        01:19:48 df -hP
UID        PID  PPID  C STIME TTY          TIME CMD
root     18604 18278  0 04:02 ?        00:00:00 /bin/bash
/etc/log.d/scripts/services/zz-disk_space

I tried to attach to our dcs_operation_mgr tasks or to the df task with gdb and
strace and   just got a hang.  This same behavior is happening on other nodes in
our cluster as well.

We also found this in 'dmesg' on one of the other nodes (similar to the message
we saw when trying up upgrade the cluster the last time)

dlm: message size from 2 too big 43337(pkt len=114)
02 03 ff e5 03 ee 49 a9-00 84 01 00 00 00 00 03
0e 00 00 00 00 00 00 00-00 00 00 00 12 00 00 00
00 00 00 00 00 00 00 00-00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00-00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00-02 05 ff a9 03 83 49 d5
02 8e 01 00 00 00 00 03-00 00 00 33 5d 00 00 01
09 00 00 00 00 00 00 00-09 00 00 00 00 00 00 00
09
00 

I ran 'df -hP' under sttrace and I see the last system call accessed is hanging:

statfs("/data01",

This is our GFS mount.

$ mount | grep data01
/dev/mapper/IME-imelv on /data01 type gfs (rw,debug)

Version-Release number of selected component (if applicable):

$ uname -a
Linux blade04-1 2.6.9-34.0.1.ELsmp #1 SMP Wed May 17 16:59:36 EDT 2006 x86_64
x86_64 x86_64 GNU/Linux

GFS-kernel-2.6.9-49.1.1

$ cat /etc/issue
Red Hat Enterprise Linux ES release 4 (Nahant Update 4)

How reproducible:

We've had it happen several times by running our application with lots of
activity for a period of a few hours.   

Additional info:

Our database is pretty lock and we've
been working with Wendy Cheng and Rob Kenna to help us isolate a problem with
corruption we're seeing from out database library when using GFS.
Comment 1 Wendy Cheng 2006-09-07 11:26:55 EDT
The df is built upon statfs and gfs_statfs would need to check all resource
group locks. If one of the locks is kept by other process, gfs df will appear to
be hanging. Since it polls for these asynchrous locks, it may sometime look like
hogging the CPUs. 

So (instead of looking at statfs), this issue should get started with DLM. 
Comment 2 Eric Z. Ayers 2006-09-07 12:10:54 EDT
(From Wendy  Cheng):
We do have a dlm wrap-around issue under heavy io. Will talk to Rob Kenna to
make sure you get the fix for the new RPMs he is working on. If you ever bump
into this again, could you try:

shell > echo "w" > /proc/sysrq-trigger (do this two or three times)
shell> echo "m" > /proc/sysrq-trigger
shell> echo "t" > /proc/sysrq-trigger

The sysrq-t will print out the kernel threads trace into /var/log/messages file.
The sysrq-m will print out kernel memory usage. The sysrq-w will show stack
traces of active CPUs (this needs to get repeated a couple of times so we would
know if the CPUs are hogged by certain process). Then tar the /var/log/messages
file and open a bugzilla. Put the tar ball into either the bugzilla or email it
to us. Warning - when system goes wrong, sysrq-t could take a while to finish
(or even hang) but sysrq-w and "m" are relatively safe. So do sysrq-t as the
last command. If the system is forced rebooted, the "w" and "m" info should
still be captured in the /var/log/messages file.

--

Added attachment
Comment 3 Eric Z. Ayers 2006-09-07 12:13:29 EDT
Created attachment 135782 [details]
/var/log/messages after sysrq-w sysrq-m sysrq-t

This is the contents of /var/log/messages after running sysreq-w/m/t from 
before rebooting the cluster members when in this state.
Comment 4 Wendy Cheng 2006-09-07 12:22:29 EDT
None of the gfs threads stack trace was captured - look like sysrq-t didn't
complete. If it happens again, please re-try on the ill system. Thanks. 
Comment 5 Eric Z. Ayers 2006-09-08 16:55:05 EDT
FYI, our cluster was upgraded today

$ uname -a
Linux blade04-1 2.6.9-42.0.2.ELsmp #1 SMP Thu Aug 17 17:57:31 EDT 2006 x86_64
x86_64 x86_64 GNU/Linux
$ rpm -q GFS-kernel
GFS-kernel-2.6.9-58.3
Comment 6 David Teigland 2006-09-08 17:48:09 EDT
This message is really bad, we need to determine if it's still happening,
if it is, then it'll be the cause of any problems you're seeing:

dlm: message size from 2 too big 43337(pkt len=114)

Quoting from a mailing list message from Aug 15:

>>     dlm: message size from 1 too big 32767(pkt len=874)
>>
>>
>> on a three-node GFS 6.1 cluster using x86_64.
>>
>> is this anything to be worried about.
>>
> 
> yes it is :)
> 
> It's either a corrupted packet or a bug. Which, exact version of cluster suite
> are you using because a bug that caused these sorts of errors was fixed in U3
> (I think).

Correction, it's fixed in U4 but there was an errata released for it.

RHBA-2006:0558

The message is slightly different so it might not be the same thing, but it
looks similar to me.

patrick
Comment 7 David Teigland 2006-09-08 17:54:18 EDT
If the "dlm: message size" errors are no longer occuring, but you're
still seeing processes stuck accessing gfs, then check the cluster
status and whether there's a recovery in progress by running the
following on all the cluster nodes:

> cman_tool status
> cman_tool nodes
> cman_tool services
> ps ax -o pid,cmd,wchan
Comment 8 Eric Z. Ayers 2006-09-08 17:56:08 EDT
I am looking for the message, but I don't see it in any of our /var/log/messages
files since we upgraded today.

The way this happend was to run our application at full load.  This afternoon
when we try to run at full load,  one of the nodes is experiencing a kernel
panic.  Our console setup is a bit odd (a cluster of 4 blades) so all I can do
is get a screenshot when this happens (I can't cut & paste the text or scroll
back.)  Any advice on how to get you a better oops message/ crash dump?

I'll attach the screen dump - it is panicing in GFS it appears.
Comment 9 Eric Z. Ayers 2006-09-08 17:58:04 EDT
Created attachment 135885 [details]
screen shot of console of blade that had kernel panic

Shows what we see on the blade console when the node is fenced off
after upgrading the kernel/GFS to solve the process hanging issue.
Comment 10 David Teigland 2006-09-08 18:10:10 EDT
Is it possible to scroll back in that console view in comment 9 and
see the earlier lines?  Is this a consistent panic, do you see the
same panic each time you run your application now?

It looks like the panic may be coming from within the i/o path rather
than the locking path but I'm not certain without seeing the other lines
in the panic message.
Comment 11 Eric Z. Ayers 2006-09-08 21:27:13 EDT
I suppose this panic is the one that has been the cause of all of our node
fencing today, but I've only been able to catch the actual panic message once.

This console is from the HP Blade System's Integrated Lights Out (ILO) server
which is piggybacked on each blade.  Unfortunately, when the RHCS software
'fences' the node and turns off the power, the console locks up and then resets
itself - you can't see any of the previous history.  Luckily, I got the last
screen shot before the reset.

I don't see anything in /var/log/messages and AFAIK the blades don't have a
serial port either, and that's about all I know about how to get a panic message. 

Is there any other way to get it off the machine in this environment?  Jesse of
Intec thought that changing to manual fencing might be a good thing to try (then
maybe our console will be able to scroll back).  Wendy of RedHat mentioned some
kind of 'netdump' or something like that...
Comment 12 Eric Z. Ayers 2006-09-09 09:39:38 EDT
netdump captured the following after running the application for about 5 minutes
under heavy load (load average was > 15 on the node that crashed.  this was also
the node doing the most database access)

GFS: fsid=ime_cluster:data01.2: fatal: assertion "FALSE" failed
GFS: fsid=ime_cluster:data01.2:   function = xmote_bh
GFS: fsid=ime_cluster:data01.2:   file =
/usr/src/build/765946-x86_64/BUILD/gfs-kernel-2.6.9-58/smp/src/gfs/glock.c, line
= 1093
GFS: fsid=ime_cluster:data01.2:   time = 1157808840
GFS: fsid=ime_cluster:data01.2: about to withdraw from the cluster
GFS: fsid=ime_cluster:data01.2: waiting for outstanding I/O
----------- [cut here ] --------- [please bite here ] ---------
Kernel BUG at lm:190
invalid operand: 0000 [1] SMP
CPU 0
Modules linked in: netconsole netdump nfsd exportfs sctp parport_pc lp parport
autofs4 i2c_dev i2c_core lock_dlm(U) gfs(U) lock_harness(U) nfs lockd nfs_acl
dlm(U) cman(U) md5 ipv6 sunrpc ds yenta_socket pcmcia_core dm_mirror emcphr(U)
emcpmpap(U) emcpmpaa(U) emcpmpc(U) emcpmp(U) emcp(U) emcplib(U) button battery
ac ohci_hcd hw_random tg3 ext3 jbd dm_mod qla2300 qla2xxx scsi_transport_fc
cciss sd_mod scsi_mod
Pid: 3846, comm: lock_dlm1 Tainted: P      2.6.9-42.0.2.ELsmp
RIP: 0010:[<ffffffffa0455b27>] <ffffffffa0455b27>{:gfs:gfs_lm_withdraw+215}
RSP: 0018:00000100f135bcc8  EFLAGS: 00010202
RAX: 000000000000003c RBX: ffffff00004ea8c0 RCX: 0000000000000246
RDX: 0000000000005da0 RSI: 0000000000000246 RDI: ffffffff803e2000
RBP: ffffff00004b2000 R08: 0000000000000004 R09: ffffff00004ea8c0
R10: 0000000000000000 R11: 000000000000003c R12: 00000101f0a5aaac
R13: 00000101f0a5aa80 R14: ffffff00004b2000 R15: 0000000000000003
FS:  0000002a95571b00(0000) GS:ffffffff804e5180(0000) knlGS:00000000f7d4c6c0
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 00000000080c818c CR3: 0000000000101000 CR4: 00000000000006e0
Process lock_dlm1 (pid: 3846, threadinfo 00000100f135a000, task 00000101f31b3030)
Stack: 0000003000000030 00000100f135bde0 00000100f135bce8 ffffffff80132155
       00000101f0f7c728 ffffffffa03c8c2b ffffff00004ea8c0 ffffffffa0470948
       ffffff00004ea8c0 ffffffffa046ecf2
Call Trace:<ffffffff80132155>{recalc_task_prio+337}
<ffffffffa03c8c2b>{:dlm:send_cluster_request+1001}
       <ffffffff80132155>{recalc_task_prio+337} <ffffffff8030a1c1>{thread_return+0}
       <ffffffff8030a219>{thread_return+88}
<ffffffffa046e37b>{:gfs:gfs_assert_withdraw_i+48}
       <ffffffffa044ca7c>{:gfs:xmote_bh+850}
<ffffffffa044e43a>{:gfs:gfs_glock_cb+194}
       <ffffffffa0493a69>{:lock_dlm:dlm_async+1989}
<ffffffff80133dfa>{__wake_up_common+67}
       <ffffffff80133da9>{default_wake_function+0}
<ffffffff8014b4f0>{keventd_create_kthread+0}
       <ffffffffa04932a4>{:lock_dlm:dlm_async+0}
<ffffffff8014b4f0>{keventd_create_kthread+0}
       <ffffffff8014b4c7>{kthread+200} <ffffffff80110f47>{child_rip+8}
       <ffffffff8014b4f0>{keventd_create_kthread+0} <ffffffff8014b3ff>{kthread+0}
       <ffffffff80110f3f>{child_rip+0}

Code: 0f 0b ec 1f 47 a0 ff ff ff ff be 00 8b 85 a0 88 03 00 85 c0
RIP <ffffffffa0455b27>{:gfs:gfs_lm_withdraw+215} RSP <00000100f135bcc8>

Comment 13 Eric Z. Ayers 2006-09-09 09:59:53 EDT
Here's another one... It doesn't look the same to me.

uio_file_serv is one of our application processes that server database i/o to
other nodes.

[root@ns crash]# cat 10.0.0.237-2006-09-09-09:38/log
[...network console startup...]
Module sctp cannot be unloaded due to unsafe usage in net/sctp/protocol.c:1171
GFS: fsid=ime_cluster:data01.2: jid=3: Trying to acquire journal lock...
GFS: fsid=ime_cluster:data01.2: jid=3: Looking at journal...
GFS: fsid=ime_cluster:data01.2: jid=3: Acquiring the transaction lock...
GFS: fsid=ime_cluster:data01.2: jid=3: Replaying journal...
GFS: fsid=ime_cluster:data01.2: jid=3: Replayed 146 of 155 blocks
GFS: fsid=ime_cluster:data01.2: jid=3: replays = 146, skips = 5, sames = 4
GFS: fsid=ime_cluster:data01.2: jid=3: Journal replayed in 1s
GFS: fsid=ime_cluster:data01.2: jid=3: Done
GFS: fsid=ime_cluster:data01.2: fatal: assertion "!queue_empty(gl,
&gh->gh_list)" failed
GFS: fsid=ime_cluster:data01.2:   function = gfs_glock_dq
GFS: fsid=ime_cluster:data01.2:   file =
/usr/src/build/765946-x86_64/BUILD/gfs-kernel-2.6.9-58/smp/src/gfs/glock.c, line
= 1616
GFS: fsid=ime_cluster:data01.2:   time = 1157810154
GFS: fsid=ime_cluster:data01.2: about to withdraw from the cluster
GFS: fsid=ime_cluster:data01.2: waiting for outstanding I/O
----------- [cut here ] --------- [please bite here ] ---------
Kernel BUG at lm:190
invalid operand: 0000 [1] SMP
CPU 1
Modules linked in: nfsd exportfs sctp parport_pc lp parport netconsole netdump
autofs4 i2c_dev i2c_core lock_dlm(U) gfs(U) lock_harness(U) nfs lockd nfs_acl
dlm(U) cman(U) md5 ipv6 sunrpc ds yenta_socket pcmcia_core dm_mirror emcphr(U)
emcpmpap(U) emcpmpaa(U) emcpmpc(U) emcpmp(U) emcp(U) emcplib(U) button battery
ac ohci_hcd hw_random tg3 ext3 jbd dm_mod qla2300 qla2xxx scsi_transport_fc
cciss sd_mod scsi_mod
Pid: 14343, comm: uio_file_serv Tainted: P      2.6.9-42.0.2.ELsmp
RIP: 0010:[<ffffffffa0455b27>] <ffffffffa0455b27>{:gfs:gfs_lm_withdraw+215}
RSP: 0018:00000100db789bb8  EFLAGS: 00010202
RAX: 000000000000003c RBX: ffffff00004ea8c0 RCX: 0000000000000246
RDX: 0000000000005d74 RSI: 0000000000000246 RDI: ffffffff803e2000
RBP: ffffff00004b2000 R08: 0000000000000004 R09: ffffff00004ea8c0
R10: 0000000000000000 R11: 000000000000003c R12: 00000100eb3c3e14
R13: 00000100d2372000 R14: ffffff00004b2000 R15: 0000000000000000
FS:  0000002a95574860(0000) GS:ffffffff804e5200(005b) knlGS:00000000f7d4c6c0
CS:  0010 DS: 002b ES: 002b CR0: 000000008005003b
CR2: 00000000080b0090 CR3: 0000000003d72000 CR4: 00000000000006e0
Process uio_file_serv (pid: 14343, threadinfo 00000100db788000, task
00000100d12447f0)
Stack: 0000003000000030 00000100db789cd0 00000100db789bd8 ffffff00004b2000
       00000100ea4fc040 ffffffff8030a445 ffffff00004ea8c0 ffffffffa0470a38
       ffffff00004ea8c0 ffffffffa046ec84
Call Trace:<ffffffff8030a445>{wait_for_completion+231}
<ffffffff8030a219>{thread_return+88}
       <ffffffff8030a445>{wait_for_completion+231}
<ffffffffa046e37b>{:gfs:gfs_assert_withdraw_i+48}
       <ffffffffa044bac0>{:gfs:gfs_glock_dq+124}
<ffffffffa044bc1c>{:gfs:gfs_glock_dq_uninit+9}
       <ffffffffa046c95f>{:gfs:gfs_stat_gfs+292}
<ffffffffa044bbd4>{:gfs:gfs_glock_dq+400}
       <ffffffffa0464aae>{:gfs:gfs_statfs+39} <ffffffff80177982>{vfs_statfs+82}
       <ffffffff8019e523>{compat_fstatfs64+69} <ffffffff801283c1>{sys32_open+130}
       <ffffffff80125d33>{cstar_do_call+27} <ffffffffa045f56e>{:gfs:do_read_buf+0}


Code: 0f 0b ec 1f 47 a0 ff ff ff ff be 00 8b 85 a0 88 03 00 85 c0
RIP <ffffffffa0455b27>{:gfs:gfs_lm_withdraw+215} RSP <00000100db789bb8>
Comment 14 Eric Z. Ayers 2006-09-09 12:46:11 EDT
I've since configured the application to process less data in a steady state in
hopes of reproducing the hang that started this report. Its been running for
almost 2 hours with no problems encountered, butI thought I'd note that I'm
seeing the following messages come out the console on several of the nodes:

10.0.0.234-2006-09-09-09:13/log:dlm: data01: process_lockqueue_reply id 1247022a
state 0
10.0.0.235-2006-09-09-10:10/log:dlm: data01: process_lockqueue_reply id 10247
state 0
10.0.0.237-2006-09-09-10:00/log:dlm: data01: process_lockqueue_reply id 6be0389
state 0
10.0.0.237-2006-09-09-10:00/log:dlm: data01: process_lockqueue_reply id 8460033
state 0
10.0.0.237-2006-09-09-10:00/log:dlm: data01: process_lockqueue_reply id e6a0111
state 0
10.0.0.237-2006-09-09-10:00/log:dlm: data01: process_lockqueue_reply id f790200
state 0
Comment 15 David Teigland 2006-09-12 10:53:22 EDT
The dlm messages in comment 14 are not a problem.
The gfs assertion failures in comments 12 and 13 are problems.
I vaguely recall seeing a report of the bug in comment 12 before.
Given how simple these problems are to create in your environment,
I'd expect we should be able to create them with a bit of work.
Comment 16 David Teigland 2006-09-12 10:56:28 EDT
changing the summary to remove the oblique reference to an
unrelated dlm bug
Comment 17 David Teigland 2006-09-12 12:13:18 EDT
The "process_lockqueue_reply" messages in comment 14 are very
common and appear on almost all clusters.  They appear when
the message granting the lock is processed ahead of the reply
to the initial request for the lock.  This is a harmless.
Comment 18 Eric Z. Ayers 2006-09-13 21:08:11 EDT
I've not noticed a process hang like in the original report since we put on the
new kernel from Wendy & Rob.  I changed some code today that made a big
difference in the performance of the application on the cluster (after
discovering there is still a problem in fcntl(F_GETLCK)).  I cranked up the load
on the application again and it crashed after about 2.5 hours (as opposed to
about 5-10 minutes before).  The crash dump looks completely different to me so
I thought I'd include it in this report:

# cat  10.0.0.237-2006-09-09-10:00/log
[...network console startup...]
Module sctp cannot be unloaded due to unsafe usage in net/sctp/protocol.c:1171
GFS: fsid=ime_cluster:data01.2: jid=3: Trying to acquire journal lock...
GFS: fsid=ime_cluster:data01.2: jid=3: Looking at journal...
GFS: fsid=ime_cluster:data01.2: jid=3: Acquiring the transaction lock...
GFS: fsid=ime_cluster:data01.2: jid=3: Replaying journal...
GFS: fsid=ime_cluster:data01.2: jid=3: Replayed 516 of 587 blocks
GFS: fsid=ime_cluster:data01.2: jid=3: replays = 516, skips = 57, sames = 14
GFS: fsid=ime_cluster:data01.2: jid=3: Journal replayed in 1s
GFS: fsid=ime_cluster:data01.2: jid=3: Done
dlm: data01: process_lockqueue_reply id 6be0389 state 0
dlm: data01: process_lockqueue_reply id 8460033 state 0
dlm: data01: process_lockqueue_reply id e6a0111 state 0
dlm: data01: process_lockqueue_reply id f790200 state 0
dlm: data01: process_lockqueue_reply id 3587017d state 0
dlm: data01: process_lockqueue_reply id 456e0224 state 0
eip: ffffffff80162187
----------- [cut here ] --------- [please bite here ] ---------
Kernel BUG at spinlock:118
invalid operand: 0000 [1] SMP
CPU 0
Modules linked in: nfsd exportfs sctp parport_pc lp parport netconsole netdump
autofs4 i2c_dev i2c_core lock_dlm(U) gfs(U) lock_harness(U) nfs lockd nfs_acl
dlm(U) cman(U) md5 ipv6 sunrpc ds yenta_socket pcmcia_core dm_mirror emcphr(U)
emcpmpap(U) emcpmpaa(U) emcpmpc(U) emcpmp(U) emcp(U) emcplib(U) button battery
ac ohci_hcd hw_random tg3 ext3 jbd dm_mod qla2300 qla2xxx scsi_transport_fc
cciss sd_mod scsi_mod
Pid: 19260, comm: uio_file_serv Tainted: P      2.6.9-42.0.2.ELsmp
RIP: 0010:[<ffffffff8030b0d0>] <ffffffff8030b0d0>{__lock_text_start+32}
RSP: 0018:00000100ee569ce8  EFLAGS: 00010012
RAX: 0000000000000016 RBX: 000001000397d6d0 RCX: ffffffff803e1fe8
RDX: ffffffff803e1fe8 RSI: 0000000000000046 RDI: ffffffff803e1fe0
RBP: 00000100036ffc88 R08: ffffffff803e1fe8 R09: 000001000397d6d0
R10: 0000000100000000 R11: 0000000000000016 R12: 000001000397d628
R13: 00000100ceda8000 R14: 0000000000000000 R15: ffffff00004b2000
FS:  0000002a95571b00(0000) GS:ffffffff804e5180(005b) knlGS:00000000f7d4a6c0
CS:  0010 DS: 002b ES: 002b CR0: 000000008005003b
CR2: 00000000f7dbb341 CR3: 0000000000101000 CR4: 00000000000006e0
Process uio_file_serv (pid: 19260, threadinfo 00000100ee568000, task
00000100e4b5e7f0)
Stack: 000000000397d628 ffffffff80162187 00000100036ffc88 00000100ceda8000
       00000100036ffc98 ffffffff80161e79 0000000000000001 0000000000000206
       00000100ceda9600 0000000000000000
Call Trace:<ffffffff80162187>{cache_flusharray+25} <ffffffff80161e79>{kfree+195}
       <ffffffffa046c9f5>{:gfs:gfs_stat_gfs+442}
<ffffffffa044bbd4>{:gfs:gfs_glock_dq+400}
       <ffffffffa0464aae>{:gfs:gfs_statfs+39} <ffffffff80177982>{vfs_statfs+82}
       <ffffffff8019e523>{compat_fstatfs64+69} <ffffffff801283c1>{sys32_open+130}
       <ffffffff80125d33>{cstar_do_call+27}

Code: 0f 0b 37 4e 32 80 ff ff ff ff 76 00 f0 ff 0b 0f 88 8b 03 00
RIP <ffffffff8030b0d0>{__lock_text_start+32} RSP <00000100ee569ce8>
Comment 19 Wendy Cheng 2006-09-13 21:27:38 EDT
This is one of the few places we use asynchronous locking (to boost statfs
performance since obtaining filesystem wide resource group locks is a time
consuming task). Look like there is a memory corruption bug lurking in this
code - so far only seen on x86_64 machine and under heavy lock activities.   

I'll find myself a pair of RHEL4 x86_64 machines to test it out tomorrow 
morning.  

Comment 20 Wendy Cheng 2006-09-14 11:35:15 EDT
Eric, while we're trying to do the in-house recreation, could you tell us what
you did that could make this issue poped up 5-10 minutes (instead of 2.5 hours) ?
Comment 21 Eric Z. Ayers 2006-09-14 12:21:37 EDT
The only difference seems to be the load on the host. I just reproduced it
within just a few minutes by increasing the load settings in the application. 
The application starts many processes on the master node (around 70 or so?) that
are doing database i/o on behalf of processes running on the other nodes.  All
70 processes are interleaving and acessing similar database files (using fcntl()
to lock and unlock the database, and fdatasync() when changes are made.)
Comment 22 Eric Z. Ayers 2006-09-14 12:23:23 EDT
One more thing - the load on the master node was > 10 when the node crashed (we
have a 2 processor AMD 64 bit machine).
Comment 23 Eric Z. Ayers 2006-09-14 12:29:44 EDT
One more thing - we have a 4 node cluster, but all this contention is happening
only one one node.  All database access is on the master node, and it is the
master node that always crashes, and the uio_file_serv process usually seems to
be caught in the 'Oops' message.

Just to bolster your memory corruption theory - this time the stack trace was
different again.

[...network console startup...]
Module sctp cannot be unloaded due to unsafe usage in net/sctp/protocol.c:1171
GFS: fsid=ime_cluster:data01.2: jid=3: Trying to acquire journal lock...
GFS: fsid=ime_cluster:data01.2: jid=3: Busy
Unable to handle kernel NULL pointer dereference at 0000000000000079 RIP:
<ffffffffa044aed0>{:gfs:run_queue+477}
PML4 dde3a067 PGD e45ea067 PMD 0
Oops: 0000 [1] SMP
CPU 1
Modules linked in: nfsd exportfs sctp parport_pc lp parport netconsole netdump
autofs4 i2c_dev i2c_core lock_dlm(U) gfs(U) lock_harness(U) nfs lockd nfs_acl
dlm(U) cman(U) md5 ipv6 sunrpc ds yenta_socket pcmcia_core dm_mirror emcphr(U)
emcpmpap(U) emcpmpaa(U) emcpmpc(U) emcpmp(U) emcp(U) emcplib(U) button battery
ac ohci_hcd hw_random tg3 ext3 jbd dm_mod qla2300 qla2xxx scsi_transport_fc
cciss sd_mod scsi_mod
Pid: 28242, comm: uio_file_serv Tainted: P      2.6.9-42.0.2.ELsmp
RIP: 0010:[<ffffffffa044aed0>] <ffffffffa044aed0>{:gfs:run_queue+477}
RSP: 0018:00000100da8bfcb8  EFLAGS: 00010202
RAX: 00000000017c034e RBX: 00000100ef739578 RCX: 0000000000000001
RDX: ffffffffa048b580 RSI: 0000000000000001 RDI: 00000100ef739520
RBP: 00000100daab2f20 R08: 00000100da8be000 R09: dead4ead00000001
R10: 0000000100000000 R11: 00000101e1df4138 R12: 0000000000000001
R13: 0000000000000001 R14: ffffff00004b2000 R15: 00000100ef739520
FS:  0000002a95571b00(0000) GS:ffffffff804e5200(005b) knlGS:00000000f7d4a6c0
CS:  0010 DS: 002b ES: 002b CR0: 000000008005003b
CR2: 0000000000000079 CR3: 0000000003d72000 CR4: 00000000000006e0
Process uio_file_serv (pid: 28242, threadinfo 00000100da8be000, task
00000100eaa28030)Stack: 00000000000000c2 00000100ef73954c 00000100dcd4cf20
0000000000000000
       00000100ef739520 ffffff00004b2000 0000000000000000 ffffffffa044b8c2
       ffffffffa048b580 00000100dcd4cf20
Call Trace:<ffffffffa044b8c2>{:gfs:glock_wait_internal+702}
<ffffffffa044c2eb>{:gfs:gfs_glock_wait+61}
       <ffffffffa046c927>{:gfs:gfs_stat_gfs+236}
<ffffffffa044bbd4>{:gfs:gfs_glock_dq+400}
       <ffffffffa0464aae>{:gfs:gfs_statfs+39} <ffffffff80177982>{vfs_statfs+82}
       <ffffffff8019e523>{compat_fstatfs64+69} <ffffffff801283c1>{sys32_open+130}
       <ffffffff80125d33>{cstar_do_call+27}

Code: 49 8b 45 78 4d 8b b5 18 01 00 00 48 89 04 24 41 8b 4d 34 8b
RIP <ffffffffa044aed0>{:gfs:run_queue+477} RSP <00000100da8bfcb8>
CR2: 0000000000000079


Too bad you can't run the kernel under valgrind...
Comment 24 David Teigland 2006-09-14 12:32:05 EDT
Something to keep in mind is that fcntl on gfs does not work nearly as
well as flock.  fcntl will be far slower, and is far more complex and
prone to bugs when the "odd" features of posix locks are used.
(fcntl posix locks have been greatly improved for the next major gfs version)
So, flocks would be a really good choice if the application is adaptable.

It doesn't look like this is related to the current panics, though.
Comment 25 Eric Z. Ayers 2006-09-14 12:45:55 EDT
I appreciate the suggestion, but for our database library, we really do use
those 'odd' features of posix locks, like being able to lock a region of a file
(not just the entire file) and querying to see what process is interfering with
a lock.
So I will look forward to the next version of GFS.
Comment 27 Eric Z. Ayers 2006-09-15 19:39:49 EDT
Here is some different behavior. We are getting a process that will not exit
with kill -9 and we can't attach to it with the debugger.ps shows it has a weird
state:

USER       PID %CPU %MEM   VSZ  RSS TTY      STAT START   TIME COMMAND
cgi       5442  0.0  0.0 59724 4180 pts/0    T    19:30   0:00 gdb dcs_event_mgr
25851
cgi      25851  0.0  0.0 61652 6600 ?        Dsl  16:45   0:00
/opt/dcs/dcs/bin/dcs_event_mgr -reload -delay 60


Sysreq -T prints out this process as the last one:

...
dcs_event_mgr D 00000101e916dbe0     0 26328      1         25851 25852 (NOTLB)
00000101d6fbdcf8 0000000000000006 00000101d6fbdd08 ffffffff00000076
       00000100f4fe87f0 0000000000000069 000001010005fb00 0000000100000001
       00000101ec3f9030 0000000000000303
Call Trace:<ffffffffa03ca2af>{:dlm:put_lockspace+16} <ffffffffa03c7c3f>{:dlm:dlm
_lock+1182}
       <ffffffff8030a405>{wait_for_completion+167} <ffffffff80133da9>{default_wa
ke_function+0}
       <ffffffff80133da9>{default_wake_function+0} <ffffffffa0491b2b>{:lock_dlm:
wait_async+17}
       <ffffffffa0491c42>{:lock_dlm:wait_async_list+201} <ffffffffa049239a>{:loc
k_dlm:lm_dlm_punlock+581}
       <ffffffffa0455e06>{:gfs:gfs_lm_punlock+46} <ffffffffa0460dc1>{:gfs:gfs_lo
ck+173}
       <ffffffff8018e49b>{locks_remove_posix+156} <ffffffff80159f49>{wait_on_pag
e_writeback_range+218}
       <ffffffff80178e3c>{filp_close+95} <ffffffff80178ecd>{sys_close+130}
       <ffffffff80125d33>{cstar_do_call+27}
bash          R  running task       0  6747  29857                     (NOTLB)
(that's the end of the log)


I was trying to recreate the crash with a simple program running on blade01, but
this process that is stuck is running on blade02.
Comment 28 Wendy Cheng 2006-09-16 01:28:38 EDT
Eric, wait_for_completion() is an uninterruptible and blocking kernel call. 
The thread will wait there (forever) until the event it is expected happens.   
So it is not killable. Not sure why dlm got into this state though. 

The urgent need here is to get a test case that can reliably triggering the 
problems. We're trying - but so far no news yet. 
Comment 31 Eric Z. Ayers 2006-09-16 11:56:07 EDT
To kill the process I had to reboot. In fact, I rebooted all nodes in the
cluster and ran fsck.gfs - only one seemingly minor problem was found.

I have been trying to reproduce the issue without the full blown application
running and thus far have had no luck.  One thing I thought I'd mention, each of
our processes that runs on the non-master nodes in the cluster calls 'statvfs()'
2 times on the gfs filesystem.  They also read files from the gfs filesystem and
create new ones (using sequential i/o)  They start and stop frequently.  

When the crash occurs, we have maybe 20 processes on each of the 3 other blades,
and about 60 processes performing database i/o on the 'master' node. (those
start and stop frequently as well) The database processes are  doing lots of
fcntl() locking along with read() and write(). The database files are growing
during the test.
Comment 32 Wendy Cheng 2006-09-18 17:43:48 EDT
Eric, I think I've recreated the hang - not sure whether it is the same as
yours. Still working on it.

stack trace on one of the nodes (just save here for future reference)

crash> bt 4573
PID: 4573   TASK: 1006f6e37f0       CPU: 0   COMMAND: "fl_64"
 #0 [1006d86bb78] schedule at ffffffff8030a1c1
 #1 [1006d86bc50] wait_for_completion at ffffffff8030a405
 #2 [1006d86bcd0] do_dlm_lock_sync at ffffffffa02c2ab7
 #3 [1006d86bcf0] unlock_resource at ffffffffa02c472a
 #4 [1006d86bd10] lm_dlm_plock at ffffffffa02c5fed
 #5 [1006d86bd90] inode_has_perm at ffffffff801d1e3f
 #6 [1006d86bdf0] selinux_file_lock at ffffffff801d45a9
 #7 [1006d86be50] gfs_lm_plock at ffffffffa02e6dcb
 #8 [1006d86be60] gfs_lock at ffffffffa02f1dd3
 #9 [1006d86be90] fcntl_setlk at ffffffff8018e244
#10 [1006d86bed0] sys_select at ffffffff8018bcc5
#11 [1006d86bee0] dnotify_parent at ffffffff801932d4
#12 [1006d86bf50] sys_fcntl at ffffffff8018a613
#13 [1006d86bf80] system_call at ffffffff8011026a
Comment 33 Wendy Cheng 2006-09-20 22:54:36 EDT
Eric, here is a quick update on the issue ..

One problem we've found is that under heavy system load, the memory could 
be in shortage that results in kmalloc() failure. In previous versions, 
upon kmalloc() failure, in certain code paths, sometime it was impossible 
to back out - so GFS either panic or hung. I worked with the customer who 
reported this problem and had added a vmalloc() call if kmalloc() failed. 
The changes seemed to work very well (for that customer who ran on i686 
boxes) so GFS has been carried that fixes since then. Unfortuntely,
I forgot to add vmalloc.h header - since the compiler has not been 
not able to get the correct vmalloc() prototype, it assumes the default 
that casts vmalloc() return code to "int" which is 4 byes. I seem to get 
away with i686 machines but this obviously causes problem with x86_64 
machine which has 8 bytes addressing.  

We may have other areas that could have the very same issues so will 
comb thru the code and re-run the test. Will keep you updated.
Comment 34 Wendy Cheng 2006-09-21 16:44:24 EDT
The test case has been running for 6 hours without hang or panic - I can 
no longer recreate my hang. Will package the test RPMs for Eric to try 
out tonight. Let's shoot for tomorrow morning delivery.

There is one (performance) issue that I should mention:

Occasionally a process may send a dlm request and gets einval (as shown in
/proc/cluster/dlm_debug) as part of the plock call. Look to me that it is
caused by the window between sent-request and actual-process that the lock
could be obtained by another node (? I'm not very sure though as dlm is
new to me). The process_lock_queue_reply() seems to retry (by re-lookuping 
who is the master?). However, after this happens, that particular process 
will get *very* little chance to obtain the lock and normally stay in the 
following route for a *very* long time (unless I actually kill some other
processes that compete for this lock - otherwise it seems to hang there 
forever):

PID: 18177  TASK: 10070fb3030       CPU: 0   COMMAND: "fl_64"
 #0 [1004f621c38] schedule at ffffffff8030a1c1
 #1 [1004f621d10] lm_dlm_plock at ffffffffa030194f
 #2 [1004f621d90] inode_has_perm at ffffffff801d1e3f
 #3 [1004f621e50] gfs_lm_plock at ffffffffa02c4dd3
 #4 [1004f621e60] gfs_lock at ffffffffa02cfddb
 #5 [1004f621e90] fcntl_setlk at ffffffff8018e244
 #6 [1004f621ed0] sys_select at ffffffff8018bcc5
 #7 [1004f621ee0] dnotify_parent at ffffffff801932d4
 #8 [1004f621f50] sys_fcntl at ffffffff8018a613
 #9 [1004f621f80] system_call at ffffffff8011026a`



Comment 35 Eric Z. Ayers 2006-09-21 16:51:02 EDT
I have seen hangs of particluar processes (like running multiple instances of
'df' in a loop during my other load testing) but they did recover eventually (a
few minutes.)  We'll be ready to put on the new RPMs tomorrow. -EZA
Comment 38 Wendy Cheng 2006-09-22 01:04:35 EDT
Eric, the RPMs (in tar file) format are in:
http://people.redhat.com/wcheng/Patches/GFS/R4/intec.tar

If something goes wrong with the tar file, the real RPMs are in 
http://people.redhat.com/wcheng/Patches/GFS/R4/intec.0922 directory. 

Note that these RPMs do not include GFS statfs performance improvement. 
Please be aware that "df" (that uses GFS statfs system call) needs to 
get *all* the resource group locks. If for whatever reason one of these 
locks are hogged by other processes, "df" will appear to hang. We have
statfs improvement on the way but not included in these RPMs.
Comment 39 Wendy Cheng 2006-09-22 09:31:29 EDT
hmm, something is still not right - I have two processes that have been hanging
(waiting for) DLM plock since 2:45 am. So I just kill all other processes and
they start to run again. 
Comment 40 Jesse Marlin 2006-09-22 10:50:05 EDT
I updated the nodes, rebooted.  All seemed to run for a while fine.  I cranked 
up the load and it ran for probably about 30 minutes and then crashed.  Here is 
backtrace from netdump:

Module sctp cannot be unloaded due to unsafe usage in net/sctp/protocol.c:1171
GFS: fsid=ime_cluster:data01.1: fatal: assertion "!queue_empty(gl, &gh-
>gh_list)" failed
GFS: fsid=ime_cluster:data01.1:   function = gfs_glock_dq
GFS: fsid=ime_cluster:data01.1:   file = /usr/src/redhat/BUILD/gfs-kernel-2.6.9-
58/smp/src/gfs/glock.c, line = 1616
GFS: fsid=ime_cluster:data01.1:   time = 1158934946
GFS: fsid=ime_cluster:data01.1: about to withdraw from the cluster
GFS: fsid=ime_cluster:data01.1: waiting for outstanding I/O
----------- [cut here ] --------- [please bite here ] ---------
Kernel BUG at lm:190
invalid operand: 0000 [1] SMP 
CPU 0 
Modules linked in: nfsd exportfs sctp parport_pc lp parport netconsole netdump 
autofs4 i2c_dev i2c_core lock_dlm(U) gfs(U) lock_harness(U) nfs lockd
 nfs_acl dlm(U) cman(U) md5 ipv6 sunrpc ds yenta_socket pcmcia_core dm_mirror 
emcphr(U) emcpmpap(U) emcpmpaa(U) emcpmpc(U) emcpmp(U) emcp(U) emcplib
(U) button battery ac ohci_hcd hw_random tg3 ext3 jbd dm_mod qla2300 qla2xxx 
scsi_transport_fc cciss sd_mod scsi_mod
Pid: 26655, comm: uio_file_serv Tainted: P      2.6.9-42.0.2.ELsmp
RIP: 0010:[<ffffffffa0456b2f>] <ffffffffa0456b2f>{:gfs:gfs_lm_withdraw+215}
RSP: 0018:00000100dd9a7bb8  EFLAGS: 00010202
RAX: 000000000000003c RBX: ffffff00004ea8c0 RCX: 0000000000000246
RDX: 0000000000005962 RSI: 0000000000000246 RDI: ffffffff803e2000
RBP: ffffff00004b2000 R08: 0000000000000004 R09: ffffff00004ea8c0
R10: 0000000000000000 R11: 000000000000003c R12: 00000100f02a860c
R13: 00000100e2688000 R14: ffffff00004b2000 R15: 0000000000000000
FS:  0000002a95572820(0000) GS:ffffffff804e5180(005b) knlGS:00000000f7d4a6c0
CS:  0010 DS: 002b ES: 002b CR0: 000000008005003b
CR2: 000000000805544f CR3: 0000000000101000 CR4: 00000000000006e0
Process uio_file_serv (pid: 26655, threadinfo 00000100dd9a6000, task 
00000101e32a4030)
Stack: 0000003000000030 00000100dd9a7cd0 00000100dd9a7bd8 0000000000000018 
       00000100dd9a7c48 ffffffff80132155 ffffff00004ea8c0 ffffffffa04719fd 
       ffffff00004ea8c0 ffffffffa046fca4 
Call 
Trace:<ffffffff80132155>{recalc_task_prio+337} <ffffffff802d030d>{tcp_sendpage+0}
 
       
<ffffffff8030a219>{thread_return+88} <ffffffff8030a445>{wait_for_completion+231}
 
       
<ffffffffa046f39f>{:gfs:gfs_assert_withdraw_i+48} <ffffffffa044cac4>{:gfs:gfs_glock_dq+124}
 
       
<ffffffffa044cc20>{:gfs:gfs_glock_dq_uninit+9} <ffffffffa046d983>{:gfs:gfs_stat_gfs+292}
 
       
<ffffffffa044cbd8>{:gfs:gfs_glock_dq+400} <ffffffffa0465ad2>{:gfs:gfs_statfs+39}
 
       
<ffffffff80177982>{vfs_statfs+82} <ffffffff8019e523>{compat_fstatfs64+69} 
       <ffffffff80125d33>{cstar_do_call+27} 

Code: 0f 0b 7d 2f 47 a0 ff ff ff ff be 00 8b 85 a0 88 03 00 85 c0 
RIP <ffffffffa0456b2f>{:gfs:gfs_lm_withdraw+215} RSP <00000100dd9a7bb8>
Comment 41 Jesse Marlin 2006-09-22 12:14:14 EDT
Here is another which looks different:


Kernel BUG at lm:190
invalid operand: 0000 [1] SMP 
CPU 1 
Modules linked in: nfsd exportfs sctp parport_pc lp parport netconsole netdump 
autofs4 i2c_dev i2c_core lock_dlm(U) gfs(U) lock_harness(U) nfs lockd
 nfs_acl dlm(U) cman(U) md5 ipv6 sunrpc ds yenta_socket pcmcia_core dm_mirror 
emcphr(U) emcpmpap(U) emcpmpaa(U) emcpmpc(U) emcpmp(U) emcp(U) emcplib
(U) button battery ac ohci_hcd hw_random tg3 ext3 jbd dm_mod qla2300 qla2xxx 
scsi_transport_fc cciss sd_mod scsi_mod
Pid: 7764, comm: dcs_cftp_flx Tainted: P      2.6.9-42.0.2.ELsmp
RIP: 0010:[<ffffffffa0456b2f>] <ffffffffa0456b2f>{:gfs:gfs_lm_withdraw+215}
RSP: 0018:00000100eff3fa08  EFLAGS: 00010202
RAX: 000000000000003c RBX: ffffff00004ea8c0 RCX: ffffffff803e1fe8
RDX: ffffffff803e1fe8 RSI: 0000000000000246 RDI: ffffffff803e1fe0
RBP: ffffff00004b2000 R08: ffffffff803e1fe8 R09: ffffff00004ea8c0
R10: 0000000100000000 R11: ffffffff8011e884 R12: 0000010037e30200
R13: 00000000ffffffff R14: 0000000000000001 R15: 0000000000000006
FS:  0000002a95573b00(0000) GS:ffffffff804e5200(005b) knlGS:00000000f72a06c0
CS:  0010 DS: 002b ES: 002b CR0: 000000008005003b
CR2: 00000000f76c71f4 CR3: 0000000003d72000 CR4: 00000000000006e0
Process dcs_cftp_flx (pid: 7764, threadinfo 00000100eff3e000, task 
00000100f14e27f0)
Stack: 0000003000000030 00000100eff3fb20 00000100eff3fa28 ffffff00004b2000 
       0000000300000000 ffffff00004ea770 ffffff00004ea8c0 ffffffffa04758dc 
       ffffff00004ea8c0 ffffffffa0470b30 
Call 
Trace:<ffffffff80133e50>{__wake_up+54} <ffffffffa046f39f>{:gfs:gfs_assert_withdraw_i+48}
<ffffffffa046b8e5>{:gfs:blkalloc_internal+266} <ffffffffa046bf10>{:gfs:gfs_dialloc+102}
<ffffffffa0450ece>{:gfs:dinode_alloc+137} <ffffffffa0451237>{:gfs:gfs_createi+156}
<ffffffff80133f1d>{complete+53} <ffffffffa044bb83>{:gfs:lock_on_glock+112}
<ffffffffa0463476>{:gfs:gfs_create+118} <ffffffffa0465065>{:gfs:gfs_permission+475}
<ffffffffa045f601>{:gfs:gfs_drevalidate+409} <ffffffff80187563>{vfs_create+214}
<ffffffff80187991>{open_namei+430} <ffffffff80178b6e>{filp_open+80}
<ffffffff8012633d>{sys32_stat64+17} <ffffffff801ecf45>{strncpy_from_user+74}
<ffffffff80128375>{sys32_open+54} <ffffffff80125d33>{cstar_do_call+27}
Comment 42 Wendy Cheng 2006-09-22 13:32:46 EDT
Eric, is it possible to get your database in-house (so we can do problem
recreation here) ? Remote login normally don't do well with kernel debugging. 
Comment 45 Wendy Cheng 2006-09-26 01:23:39 EDT
I need a bugzilla to check in vmalloc changes for tomorrow's RHEL 4.4 errata 
build - this seems to be one that I can use.

Since we have more than one issues with this cluster (and its application),
will open another bugzilla to log the issues we've found. It will start
with our first round of live debugging tomorrow. 
Comment 47 Red Hat Bugzilla 2006-10-11 12:49:32 EDT
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2006-0705.html

Note You need to log in before you can comment on or make changes to this bug.