Description of problem:

(Note: I mailed this to Wendy Cheng this morning, but then I found the exact system call the problem was failing on.) When we checked the cluster this morning we found processes running at full CPU that we cannot attach to with the debugger. We have seen this before - the only solution is to reboot some or all nodes in the cluster.

From blade02:

top - 10:16:58 up 22 days, 16:13, 2 users, load average: 11.00, 11.00, 11.00
Tasks: 156 total, 11 running, 144 sleeping, 1 stopped, 0 zombie
Cpu(s): 0.2% us, 3.4% sy, 0.0% ni, 96.2% id, 0.1% wa, 0.0% hi, 0.0% si
Mem: 8005372k total, 2314368k used, 5691004k free, 106640k buffers
Swap: 8385920k total, 0k used, 8385920k free, 1869916k cached

  PID USER  PR NI  VIRT  RES  SHR S %CPU %MEM  TIME+     COMMAND
 9283 cgi   35 10 26272  12m 2732 R   22  0.2 216:51.86 dcs_operation_m
 9306 cgi   35 10 26276  12m 2732 R   22  0.2 217:08.51 dcs_operation_m
18606 root  25  0 49928  604  524 R   22  0.0  78:27.56 df
 9312 cgi   35 10 26272  12m 2732 R   22  0.2 218:20.53 dcs_operation_m
 9285 cgi   35 10 26276  12m 2732 R   20  0.2 209:36.55 dcs_operation_m
 9286 cgi   35 10 26276  12m 2732 R   20  0.2 204:58.36 dcs_operation_m
 9304 cgi   35 10 26272  12m 2732 R   20  0.2 199:41.17 dcs_operation_m
 9308 cgi   35 10 26276  12m 2732 R   20  0.2 204:08.35 dcs_operation_m
 9310 cgi   35 10 26272  12m 2732 R   18  0.2 203:50.90 dcs_operation_m
 9314 cgi   35 10 26272  12m 2732 R   16  0.2 217:11.20 dcs_operation_m
    1 root  16  0  4748  524  440 S    0  0.0   0:02.70 init
    2 root  RT  0     0    0    0 S    0  0.0   0:13.85 migration/0
    3 root  34 19     0    0    0 S    0  0.0   0:00.26 ksoftirqd/0
    4 root  RT  0     0    0    0 S    0  0.0   0:14.73 migration/1
    5 root  34 19     0    0    0 S    0  0.0   0:00.20 ksoftirqd/1
    6 root   5 -10    0    0    0 S    0  0.0   0:00.01 events/0
...

dcs_operation_mgr is part of our application, but 'df' is the one that came with the OS and runs automatically:

-bash-3.00$ ps -fp 18606
UID   PID   PPID  C STIME TTY TIME     CMD
root  18606 18604 20 04:02 ?  01:19:48 df -hP

UID   PID   PPID  C STIME TTY TIME     CMD
root  18604 18278  0 04:02 ?  00:00:00 /bin/bash /etc/log.d/scripts/services/zz-disk_space

I tried to attach to our dcs_operation_mgr tasks or to the df task with gdb and strace and just got a hang. This same behavior is happening on other nodes in our cluster as well. We also found this in 'dmesg' on one of the other nodes (similar to the message we saw when trying to upgrade the cluster the last time):

dlm: message size from 2 too big 43337(pkt len=114)
02 03 ff e5 03 ee 49 a9-00 84 01 00 00 00 00 03
0e 00 00 00 00 00 00 00-00 00 00 00 12 00 00 00
00 00 00 00 00 00 00 00-00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00-00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00-02 05 ff a9 03 83 49 d5
02 8e 01 00 00 00 00 03-00 00 00 33 5d 00 00 01
09 00 00 00 00 00 00 00-09 00 00 00 00 00 00 00
09 00

I ran 'df -hP' under strace and I see that the last system call it makes is hanging:

statfs("/data01",

This is our GFS mount:

$ mount | grep data01
/dev/mapper/IME-imelv on /data01 type gfs (rw,debug)

Version-Release number of selected component (if applicable):

$ uname -a
Linux blade04-1 2.6.9-34.0.1.ELsmp #1 SMP Wed May 17 16:59:36 EDT 2006 x86_64 x86_64 x86_64 GNU/Linux
GFS-kernel-2.6.9-49.1.1
$ cat /etc/issue
Red Hat Enterprise Linux ES release 4 (Nahant Update 4)

How reproducible:
We've had it happen several times by running our application with lots of activity for a period of a few hours.

Additional info:
Our database is pretty lock-intensive, and we've been working with Wendy Cheng and Rob Kenna to help isolate a corruption problem we're seeing from our database library when using GFS.
df is built upon statfs, and gfs_statfs needs to check all resource group locks. If one of those locks is held by another process, df on GFS will appear to hang. Since it polls for these asynchronous locks, it may sometimes look as if it is hogging the CPUs. So (instead of looking at statfs), this investigation should start with DLM.
(From Wendy Cheng): We do have a dlm wrap-around issue under heavy I/O. Will talk to Rob Kenna to make sure you get the fix in the new RPMs he is working on. If you ever bump into this again, could you try:

shell> echo "w" > /proc/sysrq-trigger   (do this two or three times)
shell> echo "m" > /proc/sysrq-trigger
shell> echo "t" > /proc/sysrq-trigger

sysrq-t prints the kernel thread traces into the /var/log/messages file. sysrq-m prints kernel memory usage. sysrq-w shows stack traces of the active CPUs (this needs to be repeated a couple of times so we can tell whether the CPUs are hogged by a particular process). Then tar up the /var/log/messages file and open a bugzilla; put the tarball in the bugzilla or email it to us. Warning: when the system goes wrong, sysrq-t can take a while to finish (or even hang), but sysrq-w and sysrq-m are relatively safe, so do sysrq-t last. If the system is force-rebooted, the "w" and "m" info should still be captured in the /var/log/messages file.
Created attachment 135782 [details]
/var/log/messages after sysrq-w sysrq-m sysrq-t

This is the contents of /var/log/messages after running sysrq-w/m/t, captured before rebooting the cluster members while they were in this state.
None of the gfs threads' stack traces were captured - it looks like sysrq-t didn't complete. If it happens again, please retry on the ill system. Thanks.
FYI, our cluster was upgraded today:

$ uname -a
Linux blade04-1 2.6.9-42.0.2.ELsmp #1 SMP Thu Aug 17 17:57:31 EDT 2006 x86_64 x86_64 x86_64 GNU/Linux
$ rpm -q GFS-kernel
GFS-kernel-2.6.9-58.3
This message is really bad; we need to determine whether it's still happening. If it is, it'll be the cause of any problems you're seeing:

dlm: message size from 2 too big 43337(pkt len=114)

Quoting from a mailing list message from Aug 15:

>> dlm: message size from 1 too big 32767(pkt len=874)
>>
>> on a three-node GFS 6.1 cluster using x86_64.
>>
>> is this anything to be worried about.
>
> yes it is :)
>
> It's either a corrupted packet or a bug. Which exact version of cluster suite
> are you using? A bug that caused these sorts of errors was fixed in U3
> (I think).

Correction: it's fixed in U4, but there was an errata released for it: RHBA-2006:0558. The message is slightly different so it might not be the same thing, but it looks similar to me.

patrick
If the "dlm: message size" errors are no longer occurring, but you're still seeing processes stuck accessing gfs, then check the cluster status and whether a recovery is in progress by running the following on all the cluster nodes:

> cman_tool status
> cman_tool nodes
> cman_tool services
> ps ax -o pid,cmd,wchan
I am looking for the message, but I don't see it in any of our /var/log/messages files since we upgraded today. The way this happened before was by running our application at full load. This afternoon, when we tried to run at full load, one of the nodes experienced a kernel panic. Our console setup is a bit odd (a cluster of 4 blades), so all I can do is get a screenshot when this happens (I can't cut & paste the text or scroll back). Any advice on how to get you a better oops message/crash dump? I'll attach the screen dump - it appears to be panicking in GFS.
Created attachment 135885 [details] screen shot of console of blade that had kernel panic Shows what we see on the blade console when the node is fenced off after upgrading the kernel/GFS to solve the process hanging issue.
Is it possible to scroll back in that console view in comment 9 and see the earlier lines? Is this a consistent panic, do you see the same panic each time you run your application now? It looks like the panic may be coming from within the i/o path rather than the locking path but I'm not certain without seeing the other lines in the panic message.
I suppose this panic is the one that has been the cause of all of our node fencing today, but I've only been able to catch the actual panic message once. This console is from the HP Blade System's Integrated Lights Out (iLO) server, which is piggybacked on each blade. Unfortunately, when the RHCS software 'fences' the node and turns off the power, the console locks up and then resets itself - you can't see any of the previous history. Luckily, I got the last screenshot before the reset. I don't see anything in /var/log/messages, and AFAIK the blades don't have a serial port either, and that's about all I know about how to capture a panic message. Is there any other way to get it off the machine in this environment? Jesse of Intec thought that switching to manual fencing might be worth trying (then maybe our console would be able to scroll back). Wendy of Red Hat mentioned something called 'netdump'...
netdump captured the following after running the application for about 5 minutes under heavy load (the load average was > 15 on the node that crashed; this was also the node doing the most database access):

GFS: fsid=ime_cluster:data01.2: fatal: assertion "FALSE" failed
GFS: fsid=ime_cluster:data01.2:   function = xmote_bh
GFS: fsid=ime_cluster:data01.2:   file = /usr/src/build/765946-x86_64/BUILD/gfs-kernel-2.6.9-58/smp/src/gfs/glock.c, line = 1093
GFS: fsid=ime_cluster:data01.2:   time = 1157808840
GFS: fsid=ime_cluster:data01.2: about to withdraw from the cluster
GFS: fsid=ime_cluster:data01.2: waiting for outstanding I/O

----------- [cut here ] --------- [please bite here ] ---------
Kernel BUG at lm:190
invalid operand: 0000 [1] SMP
CPU 0
Modules linked in: netconsole netdump nfsd exportfs sctp parport_pc lp parport autofs4 i2c_dev i2c_core lock_dlm(U) gfs(U) lock_harness(U) nfs lockd nfs_acl dlm(U) cman(U) md5 ipv6 sunrpc ds yenta_socket pcmcia_core dm_mirror emcphr(U) emcpmpap(U) emcpmpaa(U) emcpmpc(U) emcpmp(U) emcp(U) emcplib(U) button battery ac ohci_hcd hw_random tg3 ext3 jbd dm_mod qla2300 qla2xxx scsi_transport_fc cciss sd_mod scsi_mod
Pid: 3846, comm: lock_dlm1 Tainted: P 2.6.9-42.0.2.ELsmp
RIP: 0010:[<ffffffffa0455b27>] <ffffffffa0455b27>{:gfs:gfs_lm_withdraw+215}
RSP: 0018:00000100f135bcc8 EFLAGS: 00010202
RAX: 000000000000003c RBX: ffffff00004ea8c0 RCX: 0000000000000246
RDX: 0000000000005da0 RSI: 0000000000000246 RDI: ffffffff803e2000
RBP: ffffff00004b2000 R08: 0000000000000004 R09: ffffff00004ea8c0
R10: 0000000000000000 R11: 000000000000003c R12: 00000101f0a5aaac
R13: 00000101f0a5aa80 R14: ffffff00004b2000 R15: 0000000000000003
FS: 0000002a95571b00(0000) GS:ffffffff804e5180(0000) knlGS:00000000f7d4c6c0
CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 00000000080c818c CR3: 0000000000101000 CR4: 00000000000006e0
Process lock_dlm1 (pid: 3846, threadinfo 00000100f135a000, task 00000101f31b3030)
Stack: 0000003000000030 00000100f135bde0 00000100f135bce8 ffffffff80132155
       00000101f0f7c728 ffffffffa03c8c2b ffffff00004ea8c0 ffffffffa0470948
       ffffff00004ea8c0 ffffffffa046ecf2
Call Trace: <ffffffff80132155>{recalc_task_prio+337} <ffffffffa03c8c2b>{:dlm:send_cluster_request+1001}
       <ffffffff80132155>{recalc_task_prio+337} <ffffffff8030a1c1>{thread_return+0}
       <ffffffff8030a219>{thread_return+88} <ffffffffa046e37b>{:gfs:gfs_assert_withdraw_i+48}
       <ffffffffa044ca7c>{:gfs:xmote_bh+850} <ffffffffa044e43a>{:gfs:gfs_glock_cb+194}
       <ffffffffa0493a69>{:lock_dlm:dlm_async+1989} <ffffffff80133dfa>{__wake_up_common+67}
       <ffffffff80133da9>{default_wake_function+0} <ffffffff8014b4f0>{keventd_create_kthread+0}
       <ffffffffa04932a4>{:lock_dlm:dlm_async+0} <ffffffff8014b4f0>{keventd_create_kthread+0}
       <ffffffff8014b4c7>{kthread+200} <ffffffff80110f47>{child_rip+8}
       <ffffffff8014b4f0>{keventd_create_kthread+0} <ffffffff8014b3ff>{kthread+0}
       <ffffffff80110f3f>{child_rip+0}
Code: 0f 0b ec 1f 47 a0 ff ff ff ff be 00 8b 85 a0 88 03 00 85 c0
RIP <ffffffffa0455b27>{:gfs:gfs_lm_withdraw+215} RSP <00000100f135bcc8>
Here's another one... It doesn't look the same to me. uio_file_serv is one of our application processes that serves database i/o to the other nodes.

[root@ns crash]# cat 10.0.0.237-2006-09-09-09:38/log
[...network console startup...]
Module sctp cannot be unloaded due to unsafe usage in net/sctp/protocol.c:1171
GFS: fsid=ime_cluster:data01.2: jid=3: Trying to acquire journal lock...
GFS: fsid=ime_cluster:data01.2: jid=3: Looking at journal...
GFS: fsid=ime_cluster:data01.2: jid=3: Acquiring the transaction lock...
GFS: fsid=ime_cluster:data01.2: jid=3: Replaying journal...
GFS: fsid=ime_cluster:data01.2: jid=3: Replayed 146 of 155 blocks
GFS: fsid=ime_cluster:data01.2: jid=3: replays = 146, skips = 5, sames = 4
GFS: fsid=ime_cluster:data01.2: jid=3: Journal replayed in 1s
GFS: fsid=ime_cluster:data01.2: jid=3: Done
GFS: fsid=ime_cluster:data01.2: fatal: assertion "!queue_empty(gl, &gh->gh_list)" failed
GFS: fsid=ime_cluster:data01.2:   function = gfs_glock_dq
GFS: fsid=ime_cluster:data01.2:   file = /usr/src/build/765946-x86_64/BUILD/gfs-kernel-2.6.9-58/smp/src/gfs/glock.c, line = 1616
GFS: fsid=ime_cluster:data01.2:   time = 1157810154
GFS: fsid=ime_cluster:data01.2: about to withdraw from the cluster
GFS: fsid=ime_cluster:data01.2: waiting for outstanding I/O

----------- [cut here ] --------- [please bite here ] ---------
Kernel BUG at lm:190
invalid operand: 0000 [1] SMP
CPU 1
Modules linked in: nfsd exportfs sctp parport_pc lp parport netconsole netdump autofs4 i2c_dev i2c_core lock_dlm(U) gfs(U) lock_harness(U) nfs lockd nfs_acl dlm(U) cman(U) md5 ipv6 sunrpc ds yenta_socket pcmcia_core dm_mirror emcphr(U) emcpmpap(U) emcpmpaa(U) emcpmpc(U) emcpmp(U) emcp(U) emcplib(U) button battery ac ohci_hcd hw_random tg3 ext3 jbd dm_mod qla2300 qla2xxx scsi_transport_fc cciss sd_mod scsi_mod
Pid: 14343, comm: uio_file_serv Tainted: P 2.6.9-42.0.2.ELsmp
RIP: 0010:[<ffffffffa0455b27>] <ffffffffa0455b27>{:gfs:gfs_lm_withdraw+215}
RSP: 0018:00000100db789bb8 EFLAGS: 00010202
RAX: 000000000000003c RBX: ffffff00004ea8c0 RCX: 0000000000000246
RDX: 0000000000005d74 RSI: 0000000000000246 RDI: ffffffff803e2000
RBP: ffffff00004b2000 R08: 0000000000000004 R09: ffffff00004ea8c0
R10: 0000000000000000 R11: 000000000000003c R12: 00000100eb3c3e14
R13: 00000100d2372000 R14: ffffff00004b2000 R15: 0000000000000000
FS: 0000002a95574860(0000) GS:ffffffff804e5200(005b) knlGS:00000000f7d4c6c0
CS: 0010 DS: 002b ES: 002b CR0: 000000008005003b
CR2: 00000000080b0090 CR3: 0000000003d72000 CR4: 00000000000006e0
Process uio_file_serv (pid: 14343, threadinfo 00000100db788000, task 00000100d12447f0)
Stack: 0000003000000030 00000100db789cd0 00000100db789bd8 ffffff00004b2000
       00000100ea4fc040 ffffffff8030a445 ffffff00004ea8c0 ffffffffa0470a38
       ffffff00004ea8c0 ffffffffa046ec84
Call Trace: <ffffffff8030a445>{wait_for_completion+231} <ffffffff8030a219>{thread_return+88}
       <ffffffff8030a445>{wait_for_completion+231} <ffffffffa046e37b>{:gfs:gfs_assert_withdraw_i+48}
       <ffffffffa044bac0>{:gfs:gfs_glock_dq+124} <ffffffffa044bc1c>{:gfs:gfs_glock_dq_uninit+9}
       <ffffffffa046c95f>{:gfs:gfs_stat_gfs+292} <ffffffffa044bbd4>{:gfs:gfs_glock_dq+400}
       <ffffffffa0464aae>{:gfs:gfs_statfs+39} <ffffffff80177982>{vfs_statfs+82}
       <ffffffff8019e523>{compat_fstatfs64+69} <ffffffff801283c1>{sys32_open+130}
       <ffffffff80125d33>{cstar_do_call+27} <ffffffffa045f56e>{:gfs:do_read_buf+0}
Code: 0f 0b ec 1f 47 a0 ff ff ff ff be 00 8b 85 a0 88 03 00 85 c0
RIP <ffffffffa0455b27>{:gfs:gfs_lm_withdraw+215} RSP <00000100db789bb8>
I've since configured the application to process less data in a steady state in hopes of reproducing the hang that started this report. It's been running for almost 2 hours with no problems encountered, but I thought I'd note that I'm seeing the following messages come out on the console of several of the nodes:

10.0.0.234-2006-09-09-09:13/log:dlm: data01: process_lockqueue_reply id 1247022a state 0
10.0.0.235-2006-09-09-10:10/log:dlm: data01: process_lockqueue_reply id 10247 state 0
10.0.0.237-2006-09-09-10:00/log:dlm: data01: process_lockqueue_reply id 6be0389 state 0
10.0.0.237-2006-09-09-10:00/log:dlm: data01: process_lockqueue_reply id 8460033 state 0
10.0.0.237-2006-09-09-10:00/log:dlm: data01: process_lockqueue_reply id e6a0111 state 0
10.0.0.237-2006-09-09-10:00/log:dlm: data01: process_lockqueue_reply id f790200 state 0
The dlm messages in comment 14 are not a problem. The gfs assertion failures in comments 12 and 13 are problems. I vaguely recall seeing a report of the bug in comment 12 before. Given how easily these problems appear in your environment, I'd expect we should be able to recreate them with a bit of work.
changing the summary to remove the oblique reference to an unrelated dlm bug
The "process_lockqueue_reply" messages in comment 14 are very common and appear on almost all clusters. They appear when the message granting the lock is processed ahead of the reply to the initial request for the lock. This is harmless.
I've not noticed a process hang like the one in the original report since we put on the new kernel from Wendy & Rob. I changed some code today that made a big difference in the performance of the application on the cluster (after discovering there is still a problem with fcntl(F_GETLK)). I cranked up the load on the application again and it crashed after about 2.5 hours (as opposed to about 5-10 minutes before). The crash dump looks completely different to me, so I thought I'd include it in this report:

# cat 10.0.0.237-2006-09-09-10:00/log
[...network console startup...]
Module sctp cannot be unloaded due to unsafe usage in net/sctp/protocol.c:1171
GFS: fsid=ime_cluster:data01.2: jid=3: Trying to acquire journal lock...
GFS: fsid=ime_cluster:data01.2: jid=3: Looking at journal...
GFS: fsid=ime_cluster:data01.2: jid=3: Acquiring the transaction lock...
GFS: fsid=ime_cluster:data01.2: jid=3: Replaying journal...
GFS: fsid=ime_cluster:data01.2: jid=3: Replayed 516 of 587 blocks
GFS: fsid=ime_cluster:data01.2: jid=3: replays = 516, skips = 57, sames = 14
GFS: fsid=ime_cluster:data01.2: jid=3: Journal replayed in 1s
GFS: fsid=ime_cluster:data01.2: jid=3: Done
dlm: data01: process_lockqueue_reply id 6be0389 state 0
dlm: data01: process_lockqueue_reply id 8460033 state 0
dlm: data01: process_lockqueue_reply id e6a0111 state 0
dlm: data01: process_lockqueue_reply id f790200 state 0
dlm: data01: process_lockqueue_reply id 3587017d state 0
dlm: data01: process_lockqueue_reply id 456e0224 state 0
eip: ffffffff80162187

----------- [cut here ] --------- [please bite here ] ---------
Kernel BUG at spinlock:118
invalid operand: 0000 [1] SMP
CPU 0
Modules linked in: nfsd exportfs sctp parport_pc lp parport netconsole netdump autofs4 i2c_dev i2c_core lock_dlm(U) gfs(U) lock_harness(U) nfs lockd nfs_acl dlm(U) cman(U) md5 ipv6 sunrpc ds yenta_socket pcmcia_core dm_mirror emcphr(U) emcpmpap(U) emcpmpaa(U) emcpmpc(U) emcpmp(U) emcp(U) emcplib(U) button battery ac ohci_hcd hw_random tg3 ext3 jbd dm_mod qla2300 qla2xxx scsi_transport_fc cciss sd_mod scsi_mod
Pid: 19260, comm: uio_file_serv Tainted: P 2.6.9-42.0.2.ELsmp
RIP: 0010:[<ffffffff8030b0d0>] <ffffffff8030b0d0>{__lock_text_start+32}
RSP: 0018:00000100ee569ce8 EFLAGS: 00010012
RAX: 0000000000000016 RBX: 000001000397d6d0 RCX: ffffffff803e1fe8
RDX: ffffffff803e1fe8 RSI: 0000000000000046 RDI: ffffffff803e1fe0
RBP: 00000100036ffc88 R08: ffffffff803e1fe8 R09: 000001000397d6d0
R10: 0000000100000000 R11: 0000000000000016 R12: 000001000397d628
R13: 00000100ceda8000 R14: 0000000000000000 R15: ffffff00004b2000
FS: 0000002a95571b00(0000) GS:ffffffff804e5180(005b) knlGS:00000000f7d4a6c0
CS: 0010 DS: 002b ES: 002b CR0: 000000008005003b
CR2: 00000000f7dbb341 CR3: 0000000000101000 CR4: 00000000000006e0
Process uio_file_serv (pid: 19260, threadinfo 00000100ee568000, task 00000100e4b5e7f0)
Stack: 000000000397d628 ffffffff80162187 00000100036ffc88 00000100ceda8000
       00000100036ffc98 ffffffff80161e79 0000000000000001 0000000000000206
       00000100ceda9600 0000000000000000
Call Trace: <ffffffff80162187>{cache_flusharray+25} <ffffffff80161e79>{kfree+195}
       <ffffffffa046c9f5>{:gfs:gfs_stat_gfs+442} <ffffffffa044bbd4>{:gfs:gfs_glock_dq+400}
       <ffffffffa0464aae>{:gfs:gfs_statfs+39} <ffffffff80177982>{vfs_statfs+82}
       <ffffffff8019e523>{compat_fstatfs64+69} <ffffffff801283c1>{sys32_open+130}
       <ffffffff80125d33>{cstar_do_call+27}
Code: 0f 0b 37 4e 32 80 ff ff ff ff 76 00 f0 ff 0b 0f 88 8b 03 00
RIP <ffffffff8030b0d0>{__lock_text_start+32} RSP <00000100ee569ce8>
This is one of the few places where we use asynchronous locking (to boost statfs performance, since obtaining filesystem-wide resource group locks is a time-consuming task). It looks like there is a memory corruption bug lurking in this code - so far only seen on x86_64 machines and under heavy lock activity. I'll find myself a pair of RHEL4 x86_64 machines to test it out tomorrow morning.
Eric, while we're trying to do the in-house recreation, could you tell us what you did that made this issue pop up in 5-10 minutes (instead of 2.5 hours)?
The only difference seems to be the load on the host. I just reproduced it within a few minutes by increasing the load settings in the application. The application starts many processes on the master node (around 70 or so?) that do database i/o on behalf of processes running on the other nodes. All 70 processes are interleaving and accessing similar database files (using fcntl() to lock and unlock the database, and fdatasync() when changes are made).
One more thing - the load on the master node was > 10 when the node crashed (we have a 2 processor AMD 64 bit machine).
One more thing - we have a 4-node cluster, but all this contention is happening on only one node. All database access is on the master node, it is the master node that always crashes, and the uio_file_serv process usually seems to be the one caught in the 'Oops' message. Just to bolster your memory corruption theory - this time the stack trace was different again.

[...network console startup...]
Module sctp cannot be unloaded due to unsafe usage in net/sctp/protocol.c:1171
GFS: fsid=ime_cluster:data01.2: jid=3: Trying to acquire journal lock...
GFS: fsid=ime_cluster:data01.2: jid=3: Busy
Unable to handle kernel NULL pointer dereference at 0000000000000079
RIP: <ffffffffa044aed0>{:gfs:run_queue+477}
PML4 dde3a067 PGD e45ea067 PMD 0
Oops: 0000 [1] SMP
CPU 1
Modules linked in: nfsd exportfs sctp parport_pc lp parport netconsole netdump autofs4 i2c_dev i2c_core lock_dlm(U) gfs(U) lock_harness(U) nfs lockd nfs_acl dlm(U) cman(U) md5 ipv6 sunrpc ds yenta_socket pcmcia_core dm_mirror emcphr(U) emcpmpap(U) emcpmpaa(U) emcpmpc(U) emcpmp(U) emcp(U) emcplib(U) button battery ac ohci_hcd hw_random tg3 ext3 jbd dm_mod qla2300 qla2xxx scsi_transport_fc cciss sd_mod scsi_mod
Pid: 28242, comm: uio_file_serv Tainted: P 2.6.9-42.0.2.ELsmp
RIP: 0010:[<ffffffffa044aed0>] <ffffffffa044aed0>{:gfs:run_queue+477}
RSP: 0018:00000100da8bfcb8 EFLAGS: 00010202
RAX: 00000000017c034e RBX: 00000100ef739578 RCX: 0000000000000001
RDX: ffffffffa048b580 RSI: 0000000000000001 RDI: 00000100ef739520
RBP: 00000100daab2f20 R08: 00000100da8be000 R09: dead4ead00000001
R10: 0000000100000000 R11: 00000101e1df4138 R12: 0000000000000001
R13: 0000000000000001 R14: ffffff00004b2000 R15: 00000100ef739520
FS: 0000002a95571b00(0000) GS:ffffffff804e5200(005b) knlGS:00000000f7d4a6c0
CS: 0010 DS: 002b ES: 002b CR0: 000000008005003b
CR2: 0000000000000079 CR3: 0000000003d72000 CR4: 00000000000006e0
Process uio_file_serv (pid: 28242, threadinfo 00000100da8be000, task 00000100eaa28030)
Stack: 00000000000000c2 00000100ef73954c 00000100dcd4cf20 0000000000000000
       00000100ef739520 ffffff00004b2000 0000000000000000 ffffffffa044b8c2
       ffffffffa048b580 00000100dcd4cf20
Call Trace: <ffffffffa044b8c2>{:gfs:glock_wait_internal+702} <ffffffffa044c2eb>{:gfs:gfs_glock_wait+61}
       <ffffffffa046c927>{:gfs:gfs_stat_gfs+236} <ffffffffa044bbd4>{:gfs:gfs_glock_dq+400}
       <ffffffffa0464aae>{:gfs:gfs_statfs+39} <ffffffff80177982>{vfs_statfs+82}
       <ffffffff8019e523>{compat_fstatfs64+69} <ffffffff801283c1>{sys32_open+130}
       <ffffffff80125d33>{cstar_do_call+27}
Code: 49 8b 45 78 4d 8b b5 18 01 00 00 48 89 04 24 41 8b 4d 34 8b
RIP <ffffffffa044aed0>{:gfs:run_queue+477} RSP <00000100da8bfcb8>
CR2: 0000000000000079

Too bad you can't run the kernel under valgrind...
Something to keep in mind is that fcntl on gfs does not work nearly as well as flock. fcntl will be far slower, and is far more complex and prone to bugs when the "odd" features of posix locks are used. (fcntl posix locks have been greatly improved for the next major gfs version) So, flocks would be a really good choice if the application is adaptable. It doesn't look like this is related to the current panics, though.
I appreciate the suggestion, but for our database library, we really do use those 'odd' features of posix locks, like being able to lock a region of a file (not just the entire file) and querying to see what process is interfering with a lock. So I will look forward to the next version of GFS.
Here is some different behavior. We are getting a process that will not exit with kill -9, and we can't attach to it with the debugger. ps shows it in a weird state:

USER  PID   %CPU %MEM VSZ   RSS  TTY   STAT START TIME COMMAND
cgi   5442  0.0  0.0  59724 4180 pts/0 T    19:30 0:00 gdb dcs_event_mgr 25851
cgi   25851 0.0  0.0  61652 6600 ?     Dsl  16:45 0:00 /opt/dcs/dcs/bin/dcs_event_mgr -reload -delay 60

Sysrq-t prints out this process as the last one:

...
dcs_event_mgr D 00000101e916dbe0 0 26328 1 25851 25852 (NOTLB)
00000101d6fbdcf8 0000000000000006 00000101d6fbdd08 ffffffff00000076
00000100f4fe87f0 0000000000000069 000001010005fb00 0000000100000001
00000101ec3f9030 0000000000000303
Call Trace: <ffffffffa03ca2af>{:dlm:put_lockspace+16} <ffffffffa03c7c3f>{:dlm:dlm_lock+1182}
       <ffffffff8030a405>{wait_for_completion+167} <ffffffff80133da9>{default_wake_function+0}
       <ffffffff80133da9>{default_wake_function+0} <ffffffffa0491b2b>{:lock_dlm:wait_async+17}
       <ffffffffa0491c42>{:lock_dlm:wait_async_list+201} <ffffffffa049239a>{:lock_dlm:lm_dlm_punlock+581}
       <ffffffffa0455e06>{:gfs:gfs_lm_punlock+46} <ffffffffa0460dc1>{:gfs:gfs_lock+173}
       <ffffffff8018e49b>{locks_remove_posix+156} <ffffffff80159f49>{wait_on_page_writeback_range+218}
       <ffffffff80178e3c>{filp_close+95} <ffffffff80178ecd>{sys_close+130}
       <ffffffff80125d33>{cstar_do_call+27}
bash R running task 0 6747 29857 (NOTLB)

(that's the end of the log)

I was trying to recreate the crash with a simple program running on blade01, but this stuck process is running on blade02.
Eric, wait_for_completion() is an uninterruptible, blocking kernel call. The thread will wait there (forever) until the event it is waiting for happens, so it is not killable. Not sure why dlm got into this state, though. The urgent need here is a test case that reliably triggers the problems. We're trying - but no news so far.
To kill the process I had to reboot. In fact, I rebooted all nodes in the cluster and ran fsck.gfs - only one seemingly minor problem was found. I have been trying to reproduce the issue without the full-blown application running, and thus far have had no luck. One thing I thought I'd mention: each of our processes that runs on the non-master nodes calls statvfs() twice on the gfs filesystem. They also read files from the gfs filesystem and create new ones (using sequential i/o), and they start and stop frequently. When the crash occurs, we have maybe 20 processes on each of the 3 other blades, and about 60 processes performing database i/o on the 'master' node (those start and stop frequently as well). The database processes are doing lots of fcntl() locking along with read() and write(), and the database files are growing during the test.
Eric, I think I've recreated the hang - not sure whether it is the same as yours. Still working on it. Stack trace on one of the nodes (just saved here for future reference):

crash> bt 4573
PID: 4573 TASK: 1006f6e37f0 CPU: 0 COMMAND: "fl_64"
 #0 [1006d86bb78] schedule at ffffffff8030a1c1
 #1 [1006d86bc50] wait_for_completion at ffffffff8030a405
 #2 [1006d86bcd0] do_dlm_lock_sync at ffffffffa02c2ab7
 #3 [1006d86bcf0] unlock_resource at ffffffffa02c472a
 #4 [1006d86bd10] lm_dlm_plock at ffffffffa02c5fed
 #5 [1006d86bd90] inode_has_perm at ffffffff801d1e3f
 #6 [1006d86bdf0] selinux_file_lock at ffffffff801d45a9
 #7 [1006d86be50] gfs_lm_plock at ffffffffa02e6dcb
 #8 [1006d86be60] gfs_lock at ffffffffa02f1dd3
 #9 [1006d86be90] fcntl_setlk at ffffffff8018e244
#10 [1006d86bed0] sys_select at ffffffff8018bcc5
#11 [1006d86bee0] dnotify_parent at ffffffff801932d4
#12 [1006d86bf50] sys_fcntl at ffffffff8018a613
#13 [1006d86bf80] system_call at ffffffff8011026a
Eric, here is a quick update on the issue. One problem we've found is that under heavy system load, memory can run short, resulting in kmalloc() failures. In previous versions, certain code paths could not back out of a kmalloc() failure - so GFS either panicked or hung. I worked with a customer who reported this problem and added a vmalloc() fallback when kmalloc() fails. The change seemed to work very well (for that customer, who ran on i686 boxes), so GFS has carried that fix since then. Unfortunately, I forgot to add the vmalloc.h header - since the compiler could not see the correct vmalloc() prototype, it assumed the default, which casts vmalloc()'s return value to an "int", which is 4 bytes. I seemed to get away with it on i686 machines, but this obviously causes problems on x86_64 machines, which use 8-byte addresses. We may have other areas with the very same issue, so I will comb through the code and re-run the test. Will keep you updated.
The test case has been running for 6 hours without a hang or panic - I can no longer recreate my hang. Will package the test RPMs for Eric to try out tonight; let's shoot for tomorrow-morning delivery. There is one (performance) issue I should mention: occasionally a process sends a dlm request and gets EINVAL (as shown in /proc/cluster/dlm_debug) as part of the plock call. It looks to me as if it is caused by the window between sending the request and processing it, during which the lock could be obtained by another node (I'm not very sure, though, as dlm is new to me). process_lockqueue_reply() seems to retry (by re-looking up who the master is?). However, after this happens, that particular process gets *very* little chance to obtain the lock and normally stays in the following path for a *very* long time (unless I actually kill some other processes that compete for this lock - otherwise it seems to hang there forever):

PID: 18177 TASK: 10070fb3030 CPU: 0 COMMAND: "fl_64"
 #0 [1004f621c38] schedule at ffffffff8030a1c1
 #1 [1004f621d10] lm_dlm_plock at ffffffffa030194f
 #2 [1004f621d90] inode_has_perm at ffffffff801d1e3f
 #3 [1004f621e50] gfs_lm_plock at ffffffffa02c4dd3
 #4 [1004f621e60] gfs_lock at ffffffffa02cfddb
 #5 [1004f621e90] fcntl_setlk at ffffffff8018e244
 #6 [1004f621ed0] sys_select at ffffffff8018bcc5
 #7 [1004f621ee0] dnotify_parent at ffffffff801932d4
 #8 [1004f621f50] sys_fcntl at ffffffff8018a613
 #9 [1004f621f80] system_call at ffffffff8011026a
I have seen hangs of particular processes (e.g. running multiple instances of 'df' in a loop during my other load testing), but they did recover eventually (after a few minutes). We'll be ready to put on the new RPMs tomorrow. -EZA
Eric, the RPMs (in a tar file) are at:

http://people.redhat.com/wcheng/Patches/GFS/R4/intec.tar

If something goes wrong with the tar file, the individual RPMs are in the directory:

http://people.redhat.com/wcheng/Patches/GFS/R4/intec.0922

Note that these RPMs do not include the GFS statfs performance improvement. Please be aware that "df" (which uses the GFS statfs system call) needs to obtain *all* of the resource group locks. If for whatever reason one of these locks is held by another process, "df" will appear to hang. We have a statfs improvement on the way, but it is not included in these RPMs.
Hmm, something is still not right - I have two processes that have been hanging (waiting for a DLM plock) since 2:45 am. When I killed all the other processes, they started to run again.
I updated the nodes and rebooted. All seemed to run fine for a while. I cranked up the load and it ran for probably about 30 minutes and then crashed. Here is the backtrace from netdump:

Module sctp cannot be unloaded due to unsafe usage in net/sctp/protocol.c:1171
GFS: fsid=ime_cluster:data01.1: fatal: assertion "!queue_empty(gl, &gh->gh_list)" failed
GFS: fsid=ime_cluster:data01.1:   function = gfs_glock_dq
GFS: fsid=ime_cluster:data01.1:   file = /usr/src/redhat/BUILD/gfs-kernel-2.6.9-58/smp/src/gfs/glock.c, line = 1616
GFS: fsid=ime_cluster:data01.1:   time = 1158934946
GFS: fsid=ime_cluster:data01.1: about to withdraw from the cluster
GFS: fsid=ime_cluster:data01.1: waiting for outstanding I/O

----------- [cut here ] --------- [please bite here ] ---------
Kernel BUG at lm:190
invalid operand: 0000 [1] SMP
CPU 0
Modules linked in: nfsd exportfs sctp parport_pc lp parport netconsole netdump autofs4 i2c_dev i2c_core lock_dlm(U) gfs(U) lock_harness(U) nfs lockd nfs_acl dlm(U) cman(U) md5 ipv6 sunrpc ds yenta_socket pcmcia_core dm_mirror emcphr(U) emcpmpap(U) emcpmpaa(U) emcpmpc(U) emcpmp(U) emcp(U) emcplib(U) button battery ac ohci_hcd hw_random tg3 ext3 jbd dm_mod qla2300 qla2xxx scsi_transport_fc cciss sd_mod scsi_mod
Pid: 26655, comm: uio_file_serv Tainted: P 2.6.9-42.0.2.ELsmp
RIP: 0010:[<ffffffffa0456b2f>] <ffffffffa0456b2f>{:gfs:gfs_lm_withdraw+215}
RSP: 0018:00000100dd9a7bb8 EFLAGS: 00010202
RAX: 000000000000003c RBX: ffffff00004ea8c0 RCX: 0000000000000246
RDX: 0000000000005962 RSI: 0000000000000246 RDI: ffffffff803e2000
RBP: ffffff00004b2000 R08: 0000000000000004 R09: ffffff00004ea8c0
R10: 0000000000000000 R11: 000000000000003c R12: 00000100f02a860c
R13: 00000100e2688000 R14: ffffff00004b2000 R15: 0000000000000000
FS: 0000002a95572820(0000) GS:ffffffff804e5180(005b) knlGS:00000000f7d4a6c0
CS: 0010 DS: 002b ES: 002b CR0: 000000008005003b
CR2: 000000000805544f CR3: 0000000000101000 CR4: 00000000000006e0
Process uio_file_serv (pid: 26655, threadinfo 00000100dd9a6000, task 00000101e32a4030)
Stack: 0000003000000030 00000100dd9a7cd0 00000100dd9a7bd8 0000000000000018
       00000100dd9a7c48 ffffffff80132155 ffffff00004ea8c0 ffffffffa04719fd
       ffffff00004ea8c0 ffffffffa046fca4
Call Trace: <ffffffff80132155>{recalc_task_prio+337} <ffffffff802d030d>{tcp_sendpage+0}
       <ffffffff8030a219>{thread_return+88} <ffffffff8030a445>{wait_for_completion+231}
       <ffffffffa046f39f>{:gfs:gfs_assert_withdraw_i+48} <ffffffffa044cac4>{:gfs:gfs_glock_dq+124}
       <ffffffffa044cc20>{:gfs:gfs_glock_dq_uninit+9} <ffffffffa046d983>{:gfs:gfs_stat_gfs+292}
       <ffffffffa044cbd8>{:gfs:gfs_glock_dq+400} <ffffffffa0465ad2>{:gfs:gfs_statfs+39}
       <ffffffff80177982>{vfs_statfs+82} <ffffffff8019e523>{compat_fstatfs64+69}
       <ffffffff80125d33>{cstar_do_call+27}
Code: 0f 0b 7d 2f 47 a0 ff ff ff ff be 00 8b 85 a0 88 03 00 85 c0
RIP <ffffffffa0456b2f>{:gfs:gfs_lm_withdraw+215} RSP <00000100dd9a7bb8>
Here is another, which looks different:

Kernel BUG at lm:190
invalid operand: 0000 [1] SMP
CPU 1
Modules linked in: nfsd exportfs sctp parport_pc lp parport netconsole netdump autofs4 i2c_dev i2c_core lock_dlm(U) gfs(U) lock_harness(U) nfs lockd nfs_acl dlm(U) cman(U) md5 ipv6 sunrpc ds yenta_socket pcmcia_core dm_mirror emcphr(U) emcpmpap(U) emcpmpaa(U) emcpmpc(U) emcpmp(U) emcp(U) emcplib(U) button battery ac ohci_hcd hw_random tg3 ext3 jbd dm_mod qla2300 qla2xxx scsi_transport_fc cciss sd_mod scsi_mod
Pid: 7764, comm: dcs_cftp_flx Tainted: P      2.6.9-42.0.2.ELsmp
RIP: 0010:[<ffffffffa0456b2f>] <ffffffffa0456b2f>{:gfs:gfs_lm_withdraw+215}
RSP: 0018:00000100eff3fa08  EFLAGS: 00010202
RAX: 000000000000003c RBX: ffffff00004ea8c0 RCX: ffffffff803e1fe8
RDX: ffffffff803e1fe8 RSI: 0000000000000246 RDI: ffffffff803e1fe0
RBP: ffffff00004b2000 R08: ffffffff803e1fe8 R09: ffffff00004ea8c0
R10: 0000000100000000 R11: ffffffff8011e884 R12: 0000010037e30200
R13: 00000000ffffffff R14: 0000000000000001 R15: 0000000000000006
FS:  0000002a95573b00(0000) GS:ffffffff804e5200(005b) knlGS:00000000f72a06c0
CS:  0010 DS: 002b ES: 002b CR0: 000000008005003b
CR2: 00000000f76c71f4 CR3: 0000000003d72000 CR4: 00000000000006e0
Process dcs_cftp_flx (pid: 7764, threadinfo 00000100eff3e000, task 00000100f14e27f0)
Stack: 0000003000000030 00000100eff3fb20 00000100eff3fa28 ffffff00004b2000
       0000000300000000 ffffff00004ea770 ffffff00004ea8c0 ffffffffa04758dc
       ffffff00004ea8c0 ffffffffa0470b30
Call Trace: <ffffffff80133e50>{__wake_up+54} <ffffffffa046f39f>{:gfs:gfs_assert_withdraw_i+48}
       <ffffffffa046b8e5>{:gfs:blkalloc_internal+266} <ffffffffa046bf10>{:gfs:gfs_dialloc+102}
       <ffffffffa0450ece>{:gfs:dinode_alloc+137} <ffffffffa0451237>{:gfs:gfs_createi+156}
       <ffffffff80133f1d>{complete+53} <ffffffffa044bb83>{:gfs:lock_on_glock+112}
       <ffffffffa0463476>{:gfs:gfs_create+118} <ffffffffa0465065>{:gfs:gfs_permission+475}
       <ffffffffa045f601>{:gfs:gfs_drevalidate+409} <ffffffff80187563>{vfs_create+214}
       <ffffffff80187991>{open_namei+430} <ffffffff80178b6e>{filp_open+80}
       <ffffffff8012633d>{sys32_stat64+17} <ffffffff801ecf45>{strncpy_from_user+74}
       <ffffffff80128375>{sys32_open+54} <ffffffff80125d33>{cstar_do_call+27}
Eric, is it possible to get your database in-house (so we can do the problem recreation here)? Remote logins normally don't work well for kernel debugging.
I need a bugzilla to check in the vmalloc changes for tomorrow's RHEL 4.4 errata build - this seems to be one that I can use. Since we have more than one issue with this cluster (and its application), I will open another bugzilla to log the issues we've found. It will start with our first round of live debugging tomorrow.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHBA-2006-0705.html