Bug 206477: GFS2 crashes server

| Field | Value |
|---|---|
| Product | Red Hat Enterprise Linux 5 |
| Component | kernel |
| Version | 5.0 |
| Hardware | i686 |
| OS | Linux |
| Status | CLOSED WORKSFORME |
| Severity | medium |
| Priority | medium |
| Reporter | Gary Lindstrom <gplindstrom> |
| Assignee | Christine Caulfield <ccaulfie> |
| QA Contact | Brian Brock <bbrock> |
| CC | cattelan, cluster-maint, davej, wtogami |
| Doc Type | Bug Fix |
| Last Closed | 2006-11-29 10:40:17 UTC |
Description (Gary Lindstrom, 2006-09-14 16:41:30 UTC)
> If you boot with vga=1, you'll get more lines of text on the screen, so you
> should be able to get the full context. Capturing that text is vital to
> diagnosing this problem. Even a digital camera picture will do as long as
> it's readable.

Hi Dave, I managed to get this much of a fragment of what comes on the screen out of the messages file... Any more than this and I will have to get a camera... I'll try and bring one in tomorrow... BTW, with VGA=1 on boot, it starts in the alternate mode but resets back to normal just before the SELinux stuff loads... Anyway, here is the fragment:

Sep 14 19:00:43 spool5 kernel: SELinux: initialized (dev dm-3, type gfs2), uses xattr
Sep 14 19:01:46 spool5 kernel: ------------[ cut here ]------------
Sep 14 19:01:46 spool5 kernel: kernel BUG at arch/i386/mm/highmem.c:43!
Sep 14 19:01:46 spool5 kernel: invalid opcode: 0000 [#1]
Sep 14 19:01:46 spool5 kernel: SMP
Sep 14 19:01:46 spool5 kernel: last sysfs file: /devices/system/cpu/cpu3/cpufreq/scaling_setspeed
Sep 14 19:01:46 spool5 kernel: Modules linked in: autofs4 md5 sctp lock_dlm gfs2 dlm configfs sunrpc xt_state ip_conntrack nfnetlink xt_tcpudp ip6table_filter ip6_tables x_tables ipv6 acpi_cpufreq video sbs i2c_ec i2c_core button battery asus_acpi ac parport_pc lp parport sg ide_cd intel_rng floppy qla2xxx e752x_edac cdrom tg3 edac_mc serio_raw scsi_transport_fc pcspkr dm_snapshot dm_zero dm_mirror dm_mod cciss sd_mod scsi_mod ext3 jbd ehci_hcd ohci_hcd uhci_hcd
Sep 14 19:01:46 spool5 kernel: CPU: 3
Sep 14 19:01:46 spool5 kernel: EIP: 0060:[<c041cb17>] Not tainted VLI
Sep 14 19:01:46 spool5 kernel: EFLAGS: 00010206 (2.6.17-1.2617.2.1.fc6 #1)
Sep 14 19:01:46 spool5 kernel: EIP is at kmap_atomic+0x61/0x85
Sep 14 19:01:46 spool5 kernel: eax: c0004d8c ebx: fff63000 ecx: c1e9f9c0 edx: 00000160
Sep 14 19:01:46 spool5 kernel: esi: 00000000 edi: 00000000 ebp: e36f8dc0 esp: e36f8dbc
Sep 14 19:01:46 spool5 kernel: ds: 007b es: 007b ss: 0068
Sep 14 19:05:09 spool5 syslogd 1.4.1: restart.

It would be very useful if it is possible to get at least the first few calls on the stack. I've grepped through the GFS2 source for all the kmap_atomic() calls and I can't spot anywhere that we are obviously doing anything wrong, so a pointer as to which call is causing the problem would be very helpful.

Created attachment 136381 [details]
Dump from server with GFS2 problems...
This is a dump from my server when I attempted to copy from an ext3 partition
to a GFS2 partition...
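For context on the kmap_atomic() discussion above: the mapping handed out by kmap_atomic() is a per-CPU fixmap slot, and preemption is disabled while it is held, so nothing between it and the matching kunmap_atomic() may sleep or otherwise schedule. A minimal sketch of that rule, using the 2.6.17-era two-argument API; the function, slot, and data below are purely illustrative, not GFS2 code:

```c
#include <linux/highmem.h>
#include <linux/mm.h>
#include <linux/string.h>
#include <linux/types.h>

/*
 * Illustrative only -- not GFS2 code.  kmap_atomic() claims a per-CPU
 * fixmap slot and disables preemption; if the task schedules (sleeps,
 * faults, blocks on a mutex) before kunmap_atomic(), another user of the
 * same slot on that CPU can trip the BUG in arch/i386/mm/highmem.c.
 */
static void fill_page_atomically(struct page *page, const void *src, size_t len)
{
	char *kaddr = kmap_atomic(page, KM_USER0);	/* preemption now disabled */

	memcpy(kaddr, src, len);			/* must not sleep or fault here */

	kunmap_atomic(kaddr, KM_USER0);			/* slot released, preemption re-enabled */

	/*
	 * Anything that can schedule (mutexes, I/O waits, copy_from_user)
	 * belongs here, outside the atomic mapping, or should use
	 * kmap()/kunmap() instead.
	 */
}
```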
OK, see the attached messages.rh log file... While trying to take pictures, I managed (first and only time) to get a good dump into a log file that I could see...

A little bit of an update... I applied the latest round of updates for FC6 Test3. The basics were a new kernel (kernel-2.6.17-1.2647.fc6), a new cman, rgmanager, and some other stuff. After rebooting the cluster, I could not get anything to work. After struggling with it for 6-8 hours, I reverted back to cman-2.0.13-2.fc6 and things sort of worked. I still have the same kernel issue, so nothing in that regard was resolved. However, from my perspective this latest version of cman (2.0.14) is bad, but that is a separate issue. So basically, my original problem still exists in kernel-2.6.17-1.2647.fc6.

Can you elaborate on the cman problems? It may not help the other bug, but we don't know of any problems with that version of cman.

Sure... Maybe I am doing something wrong, and if so, please correct me. Basically, after loading cman everything initially looks OK... clustat says all three servers are online. After a minute or two, fenced "fences" one of the servers for no good reason and reboots the fenced server through the iLO interface. When the rebooted server rejoins the cluster, the fenced, dlm_controld, and gfs_controld daemons all die off on the server that initiated the reboot. It seems as though the nodes can't communicate with each other. However, cman-2.0.13-2.fc6 seems to be able to do so. I am adding my latest version of cluster.conf to this bug report in case you see something obvious... Note that one server does not have an iLO interface, so I just use manual fencing since, at this point, this is a test system. Also, I tried it with both multicast enabled and disabled, with the same results.

So, is it just me, do I have some unsupported hardware, or am I missing something? I had the cluster working with EL4 and GFS, but I thought I would give GFS2 a try and see what is new in Fedora. I have basically been using EL3 and EL4, FC1 being the last Fedora I had used. I know this is a test release and I have no problem with that. I primarily want to test Xen and GFS2. This is a test system, used to design a new system for our business. If I can help work out the bugs, I am happy to help.

Created attachment 136651 [details]
My latest cluster.conf
My latest cluster.conf just in case it helps...
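The attachment itself is not reproduced here, but for readers following along, a skeletal cman-era cluster.conf for a three-node cluster of the kind described above might look roughly like this. All node names, fence-device names, and credentials are purely illustrative, and the `<totem>` line is the workaround suggested in the next comment:

```xml
<?xml version="1.0"?>
<cluster name="testcluster" config_version="1">
  <!-- Workaround suggested below: raise the totem token timeout to 5000 ms -->
  <totem token="5000"/>
  <clusternodes>
    <clusternode name="node1" nodeid="1" votes="1">
      <fence>
        <method name="1"><device name="ilo-node1"/></method>
      </fence>
    </clusternode>
    <clusternode name="node2" nodeid="2" votes="1">
      <fence>
        <method name="1"><device name="ilo-node2"/></method>
      </fence>
    </clusternode>
    <!-- Node without an iLO interface: manual fencing, test setups only -->
    <clusternode name="node3" nodeid="3" votes="1">
      <fence>
        <method name="1"><device name="human"/></method>
      </fence>
    </clusternode>
  </clusternodes>
  <fencedevices>
    <fencedevice name="ilo-node1" agent="fence_ilo" hostname="ilo-node1.example.com" login="admin" passwd="secret"/>
    <fencedevice name="ilo-node2" agent="fence_ilo" hostname="ilo-node2.example.com" login="admin" passwd="secret"/>
    <fencedevice name="human" agent="fence_manual"/>
  </fencedevices>
</cluster>
```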
It might well be Xen that is the problem in this setup. Is it possible for you to retry without Xen? I'm not certain what Xen does with respect to multicast forwarding/routing, so that might be part of the problem. This is a bit of a hack, but worth trying: add this stanza to your cluster.conf file, just inside the <cluster> bit:

<totem token="5000"/>

Sorry, Xen has never been loaded on the machines I am trying to set up as a cluster (not yet, anyway), so that cannot be the problem. I'll try the suggested '<totem token="5000"/>' part shortly. However, is that an attempt to fix the cman problem or the opcode problem in highmem.c? I assume the cman problem?

Yes, the cman problem. I'm still wondering about the highmem problem. The trace suggests that the problem is in the VFS. It looks like the cause is scheduling while holding an atomic kmap (which is not allowed), and the VFS happened to be the next code that was scheduled and called the same code, causing the oops. The problem is in tracking down where the original problem is. There are a few possible call sites; I've checked them all over and I cannot see any reason why scheduling should take place at those points.

Sounds to me like a missing lock somewhere causing a race... I used to do a lot of programming, but it has been 15+ years since I did anything serious, so I would only be guessing... :) swhiteho, see the email I sent you directly and let me know...

OK, same problem with <totem token="5000"/> added. Once I finally got the nodes to stop rebooting each other and keep fenced running, I told it to load clvmd. On one machine the startup script finished, but then fenced and the two control daemons (dlm and gfs) died off. The other servers never completed loading clvmd. This is the same problem I had the other day... Tried restarting the daemons manually... fenced runs, but dlm_controld and gfs_controld die again... This is from the messages log:

Sep 20 11:13:16 spool5 openais[2269]: [TOTEM] entering GATHER state from 12.
Sep 20 11:13:16 spool5 openais[2269]: [TOTEM] Saving state aru 2d high seq received 2d
Sep 20 11:13:16 spool5 openais[2269]: [TOTEM] Storing new sequence id for ring 2128
Sep 20 11:13:16 spool5 openais[2269]: [TOTEM] entering COMMIT state.
Sep 20 11:13:16 spool5 openais[2269]: [TOTEM] entering RECOVERY state.
Sep 20 11:13:16 spool5 openais[2269]: [TOTEM] position [0] member 209.161.1.151:
Sep 20 11:13:16 spool5 openais[2269]: [TOTEM] previous ring seq 8480 rep 209.161.1.151
Sep 20 11:13:16 spool5 openais[2269]: [TOTEM] aru 2d high delivered 2d received flag 0
Sep 20 11:13:16 spool5 openais[2269]: [TOTEM] position [1] member 209.161.1.170:
Sep 20 11:13:16 spool5 openais[2269]: [TOTEM] previous ring seq 8480 rep 209.161.1.151
Sep 20 11:13:16 spool5 openais[2269]: [TOTEM] aru 2d high delivered 2d received flag 0
Sep 20 11:13:16 spool5 openais[2269]: [TOTEM] position [2] member 209.161.1.171:
Sep 20 11:13:16 spool5 openais[2269]: [TOTEM] previous ring seq 8484 rep 209.161.1.171
Sep 20 11:13:16 spool5 openais[2269]: [TOTEM] aru 9 high delivered 9 received flag 0
Sep 20 11:13:16 spool5 openais[2269]: [TOTEM] Did not need to originate any messages in recovery.
Sep 20 11:13:17 spool5 openais[2269]: [CLM ] CLM CONFIGURATION CHANGE
Sep 20 11:13:17 spool5 openais[2269]: [CLM ] New Configuration:
Sep 20 11:13:17 spool5 openais[2269]: [CLM ] r(0) ip(209.161.1.151)
Sep 20 11:13:17 spool5 openais[2269]: [CLM ] r(0) ip(209.161.1.170)
Sep 20 11:13:17 spool5 openais[2269]: [CLM ] r(0) ip(209.161.1.171)
Sep 20 11:13:17 spool5 openais[2269]: [CLM ] Members Left:
Sep 20 11:13:17 spool5 openais[2269]: [CLM ] Members Joined:
Sep 20 11:13:17 spool5 openais[2269]: [SYNC ] This node is within the primary component and will provide service.
Sep 20 11:13:17 spool5 openais[2269]: [CLM ] CLM CONFIGURATION CHANGE
Sep 20 11:13:17 spool5 openais[2269]: [CLM ] New Configuration:
Sep 20 11:13:17 spool5 openais[2269]: [CLM ] r(0) ip(209.161.1.151)
Sep 20 11:13:17 spool5 openais[2269]: [CLM ] r(0) ip(209.161.1.170)
Sep 20 11:13:17 spool5 openais[2269]: [CLM ] r(0) ip(209.161.1.171)
Sep 20 11:13:17 spool5 openais[2269]: [CLM ] Members Left:
Sep 20 11:13:17 spool5 openais[2269]: [CLM ] Members Joined:
Sep 20 11:13:17 spool5 openais[2269]: [SYNC ] This node is within the primary component and will provide service.
Sep 20 11:13:17 spool5 openais[2269]: [TOTEM] entering OPERATIONAL state.
Sep 20 11:13:17 spool5 openais[2269]: [CLM ] got nodejoin message 209.161.1.151
Sep 20 11:13:17 spool5 openais[2269]: [CLM ] got nodejoin message 209.161.1.170
Sep 20 11:13:17 spool5 openais[2269]: [CLM ] got nodejoin message 209.161.1.171
Sep 20 11:13:17 spool5 openais[2269]: [CPG ] got joinlist message from node 3
Sep 20 11:13:17 spool5 openais[2269]: [CPG ] got joinlist message from node 1
Sep 20 11:26:08 spool5 gfs_controld[2723]: groupd_dispatch error -1 errno 22
Sep 20 11:26:08 spool5 gfs_controld[2723]: groupd connection died
Sep 20 11:26:08 spool5 fenced[2711]: groupd is down, exiting
Sep 20 11:26:08 spool5 gfs_controld[2723]: cluster is down, exiting
Sep 20 11:26:08 spool5 dlm_controld[2717]: groupd is down, exiting
Sep 20 11:26:17 spool5 clvmd: Cluster LVM daemon started - connected to CMAN
Sep 20 11:30:42 spool5 fenced[3958]: group_init error 0 111
Sep 20 11:30:46 spool5 dlm_controld[3960]: group_init error 0 111
Sep 20 11:31:05 spool5 gfs_controld[3965]: group_init error 0 111

The original bug reported here, relating to highmem and kmap_atomic, has also been reported again more recently, and hopefully, with another angle on it, we'll now have a better chance at tracking it down. I had thought about marking this bug a dup of bz #207764, but perhaps the best solution is to leave this one open to deal with the cman problem and transfer the highmem discussion to bz #207764.

Ummm... I can't see bz #207764... It says I am not authorized to see that bug...

FWIW, I tried upgrading to the 2.6.18-1.2689.fc6 kernel in development, upgraded device-mapper to 2.02.10 (my own compile), upgraded lvm to 2.02.10 (my own compile), and recompiled the cman 2.0.14-3 source package on my machine, all with no success... Same problem for both highmem and cman...

However, it did bring up another bug; maybe you want to open another bug for this one. It seems that if you set locking_type to 2 in lvm.conf but don't have clvmd loaded when doing a mkinitrd, lvm.static segfaults and does not add lvm support into the .img file, and of course the kernel fails to boot if you are using lvm for the boot disks... Upgrading to 2.02.10 fixed that, as it does not segfault and also falls back to internal and file locking... It also cleaned up a couple of miscellaneous segfault messages on bootup regarding lvm...
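Not the reporter's actual file, but for context, the locking setting being described lives in the global section of lvm.conf. A rough sketch; the values, comments, and the fallback option shown are illustrative of the LVM2 options of that era, not a recommendation:

```
# Illustrative lvm.conf fragment -- not the reporter's configuration.
global {
    # 1 = local, file-based locking (single host)
    # 2 = external locking library (cluster locking via clvmd's library)
    # 3 = built-in clustered locking
    locking_type = 2

    # Falling back to local locking when clvmd is unavailable (supported by
    # the 2.02.x tools mentioned above) avoids the mkinitrd-time failure
    # described in this comment.
    fallback_to_local_locking = 1
}
```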
Also, while I am at it, I noticed the following in the log file, which is similar to bz #206754 but not exactly. Don't know if it is related or not:

Sep 24 18:29:20 spool6 kernel: INFO: trying to register non-static key.
Sep 24 18:29:20 spool6 kernel: the code is fine but needs lockdep annotation.
Sep 24 18:29:20 spool6 kernel: turning off the locking correctness validator.
Sep 24 18:29:20 spool6 kernel: [<c04051ee>] show_trace_log_lvl+0x58/0x171
Sep 24 18:29:20 spool6 kernel: [<c0405802>] show_trace+0xd/0x10
Sep 24 18:29:20 spool6 kernel: [<c040591b>] dump_stack+0x19/0x1b
Sep 24 18:29:20 spool6 kernel: [<c043b366>] __lock_acquire+0xfd/0x99c
Sep 24 18:29:20 spool6 kernel: [<c043c176>] lock_acquire+0x4b/0x6d
Sep 24 18:29:20 spool6 kernel: [<c06155a4>] _spin_lock_irq+0x1f/0x2e
Sep 24 18:29:20 spool6 kernel: [<c0613422>] wait_for_completion+0x29/0x9e
Sep 24 18:29:20 spool6 kernel: [<f88cfc36>] sendcmd_withirq+0x173/0x289 [cciss]
Sep 24 18:29:20 spool6 kernel: [<f88d0c9a>] cciss_read_capacity+0x2f/0xa6 [cciss]
Sep 24 18:29:20 spool6 kernel: [<f88d2c32>] cciss_revalidate+0xaa/0xfa [cciss]
Sep 24 18:29:20 spool6 kernel: [<c04a8380>] rescan_partitions+0x6b/0x1e7
Sep 24 18:29:20 spool6 kernel: [<c047a5c8>] do_open+0x2e9/0x3e9
Sep 24 18:29:20 spool6 kernel: [<c047a739>] blkdev_get+0x71/0x7c
Sep 24 18:29:20 spool6 kernel: [<c04a82c3>] register_disk+0x115/0x167
Sep 24 18:29:20 spool6 kernel: [<c04e099b>] add_disk+0x2e/0x3d
Sep 24 18:29:20 spool6 kernel: [<f88d356a>] cciss_init_one+0x8e8/0xa5a [cciss]
Sep 24 18:29:20 spool6 kernel: [<c04f4241>] pci_device_probe+0x39/0x5b
Sep 24 18:29:20 spool6 kernel: [<c05552d6>] driver_probe_device+0x45/0x92
Sep 24 18:29:20 spool6 kernel: [<c0555403>] __driver_attach+0x68/0x93
Sep 24 18:29:20 spool6 kernel: [<c0554d4f>] bus_for_each_dev+0x3a/0x5f
Sep 24 18:29:20 spool6 kernel: [<c0555231>] driver_attach+0x14/0x17
Sep 24 18:29:20 spool6 kernel: [<c0554a26>] bus_add_driver+0x68/0x106
Sep 24 18:29:20 spool6 kernel: [<c05556ac>] driver_register+0x78/0x7d
Sep 24 18:29:20 spool6 kernel: [<c04f4392>] __pci_register_driver+0x4f/0x69
Sep 24 18:29:20 spool6 kernel: [<f880401c>] cciss_init+0x1c/0x1e [cciss]
Sep 24 18:29:21 spool6 kernel: [<c04426ce>] sys_init_module+0x16ad/0x1856
Sep 24 18:29:21 spool6 kernel: [<c0403faf>] syscall_call+0x7/0xb
Sep 24 18:29:21 spool6 kernel: DWARF2 unwinder stuck at syscall_call+0x7/0xb
Sep 24 18:29:21 spool6 kernel: Leftover inexact backtrace:
Sep 24 18:29:21 spool6 kernel: [<c0405802>] show_trace+0xd/0x10
Sep 24 18:29:21 spool6 kernel: [<c040591b>] dump_stack+0x19/0x1b
Sep 24 18:29:21 spool6 kernel: [<c043b366>] __lock_acquire+0xfd/0x99c
Sep 24 18:29:21 spool6 kernel: [<c043c176>] lock_acquire+0x4b/0x6d
Sep 24 18:29:21 spool6 kernel: [<c06155a4>] _spin_lock_irq+0x1f/0x2e
Sep 24 18:29:21 spool6 kernel: [<c0613422>] wait_for_completion+0x29/0x9e
Sep 24 18:29:21 spool6 kernel: [<f88cfc36>] sendcmd_withirq+0x173/0x289 [cciss]
Sep 24 18:29:21 spool6 kernel: [<f88d0c9a>] cciss_read_capacity+0x2f/0xa6 [cciss]
Sep 24 18:29:21 spool6 kernel: [<f88d2c32>] cciss_revalidate+0xaa/0xfa [cciss]
Sep 24 18:29:21 spool6 kernel: [<c04a8380>] rescan_partitions+0x6b/0x1e7
Sep 24 18:29:21 spool6 kernel: [<c047a5c8>] do_open+0x2e9/0x3e9
Sep 24 18:29:21 spool6 kernel: [<c047a739>] blkdev_get+0x71/0x7c
Sep 24 18:29:21 spool6 kernel: [<c04a82c3>] register_disk+0x115/0x167
Sep 24 18:29:21 spool6 kernel: [<c04e099b>] add_disk+0x2e/0x3d
Sep 24 18:29:21 spool6 kernel: [<f88d356a>] cciss_init_one+0x8e8/0xa5a [cciss]
Sep 24 18:29:21 spool6 kernel: [<c04f4241>] pci_device_probe+0x39/0x5b
Sep 24 18:29:21 spool6 kernel: [<c05552d6>] driver_probe_device+0x45/0x92
Sep 24 18:29:21 spool6 kernel: [<c0555403>] __driver_attach+0x68/0x93
Sep 24 18:29:21 spool6 kernel: [<c0554d4f>] bus_for_each_dev+0x3a/0x5f
Sep 24 18:29:21 spool6 kernel: [<c0555231>] driver_attach+0x14/0x17
Sep 24 18:29:21 spool6 kernel: [<c0554a26>] bus_add_driver+0x68/0x106
Sep 24 18:29:21 spool6 kernel: [<c05556ac>] driver_register+0x78/0x7d
Sep 24 18:29:21 spool6 kernel: [<c04f4392>] __pci_register_driver+0x4f/0x69
Sep 24 18:29:21 spool6 kernel: [<f880401c>] cciss_init+0x1c/0x1e [cciss]
Sep 24 18:29:21 spool6 kernel: [<c04426ce>] sys_init_module+0x16ad/0x1856
Sep 24 18:29:21 spool6 kernel: [<c0403faf>] syscall_call+0x7/0xb

I am sure this is related to the SCSI controller for the Compaq boxes. Not having ever worked with Bugzilla or you guys, should I just be opening separate bugs for each of these?

Sorry - I'll take a look at the settings on that bug in a moment, so it should be viewable shortly. As you say, that trace looks like a SCSI driver problem. Or, most likely, it's not actually a problem at all, but a case where the lockdep annotation isn't quite right. Please open a separate bug for that if you'd like us to look into it. I'm not sure who the right person is for SCSI bugs, but Bugzilla will know :-) The lvm issue is a separate thing, though, and opening another bug is a good plan. It's best to open a new bug for each separate issue so that we can track each one individually; it makes it easier to pass each particular bug to the right person to investigate.

Thanks Steve, I'll open a couple of other Bugzillas for the other problems... I still can't see 207764, but maybe you have not changed the permissions yet...

I'm still working on getting 207764 opened up. It seems that some of the info is confidential, so it's not as easy as I'd first thought; sorry it's taking so long. One other question: you appeared, in the original stack trace, to be running SELinux. If so, did you find that you got the same results without SELinux running? There is an outstanding SELinux bug in GFS2 which I'm about to receive a patch for, and while I don't think that's the cause here, it would be good to confirm that, and it would help narrow down the possible causes.

> If so, did you find that you got the same results without SELinux running?
Yep, same results with SELinux enabled, permissive, and disabled... Actually, I usually set SELinux to permissive so that it does not get in the way of things, but I still want to know about any issues. However, I have had it disabled completely (for a while now) and the same problem occurs.
Thanks for your work on this!
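As an aside for anyone retracing these steps, switching between the SELinux modes mentioned above is done with the standard tooling; the paths and values below are the usual Fedora/RHEL defaults:

```
# Temporary, until the next reboot:
setenforce 0        # permissive: denials are logged but not enforced
setenforce 1        # enforcing

# Persistent, in /etc/selinux/config (read at boot; "disabled" also skips
# labelling, so a relabel is needed if SELinux is re-enabled later):
SELINUX=permissive  # or: enforcing / disabled
SELINUXTYPE=targeted
```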
Haven't heard anything, but I tried kernel 2.6.18-1.2724.fc6 with the same results... However, something new has started appearing... Not sure where this goes, but I sometimes get it when I load clvmd...

Oct 3 17:35:23 spool6 kernel: SCTP: Hash tables configured (established 32768 bind 32768)
Oct 3 17:35:23 spool6 kernel: Module sctp cannot be unloaded due to unsafe usage in net/sctp/protocol.c:1189
Oct 3 17:35:23 spool6 kernel: dlm: clvmd: recover 1
Oct 3 17:35:23 spool6 kernel: dlm: clvmd: add member 1
Oct 3 17:35:23 spool6 kernel: dlm: clvmd: add member 2
Oct 3 17:35:23 spool6 kernel: dlm: got new/restarted association 1 nodeid 1
Oct 3 17:35:23 spool6 kernel: dlm: clvmd: total members 2 error 0
Oct 3 17:35:23 spool6 kernel: dlm: clvmd: dlm_recover_directory
Oct 3 17:35:23 spool6 kernel: dlm: clvmd: dlm_recover_directory 0 entries
Oct 3 17:35:23 spool6 kernel:
Oct 3 17:35:23 spool6 kernel: =====================================
Oct 3 17:35:23 spool6 kernel: [ BUG: bad unlock balance detected! ]
Oct 3 17:35:23 spool6 kernel: -------------------------------------
Oct 3 17:35:23 spool6 kernel: dlm_recoverd/4419 is trying to release lock (&ls->ls_in_recovery) at:
Oct 3 17:35:23 spool6 kernel: [<f8eff592>] dlm_recoverd+0x23d/0x337 [dlm]
Oct 3 17:35:23 spool6 kernel: but there are no more locks to release!
Oct 3 17:35:23 spool6 kernel:
Oct 3 17:35:23 spool6 kernel: other info that might help us debug this:
Oct 3 17:35:23 spool6 kernel: 2 locks held by dlm_recoverd/4419:
Oct 3 17:35:23 spool6 kernel: #0: (&ls->ls_recoverd_active){--..}, at: [<c0613382>] mutex_lock+0x21/0x24
Oct 3 17:35:23 spool6 kernel: #1: (&ls->ls_recover_lock){--..}, at: [<f8eff56b>] dlm_recoverd+0x216/0x337 [dlm]
Oct 3 17:35:23 spool6 kernel:
Oct 3 17:35:23 spool6 kernel: stack backtrace:
Oct 3 17:35:23 spool6 kernel: [<c04051ed>] show_trace_log_lvl+0x58/0x16a
Oct 3 17:35:23 spool6 kernel: [<c04057fa>] show_trace+0xd/0x10
Oct 3 17:35:23 spool6 kernel: [<c0405913>] dump_stack+0x19/0x1b
Oct 3 17:35:23 spool6 kernel: [<c043a25b>] print_unlock_inbalance_bug+0xce/0xd8
Oct 3 17:35:23 spool6 kernel: [<c043ba84>] lock_release_non_nested+0x85/0x129
Oct 3 17:35:23 spool6 kernel: [<c043bd77>] lock_release+0x111/0x130
Oct 3 17:35:23 spool6 kernel: [<c0438d8f>] up_write+0x16/0x2a
Oct 3 17:35:23 spool6 kernel: [<f8eff592>] dlm_recoverd+0x23d/0x337 [dlm]
Oct 3 17:35:23 spool6 kernel: [<c04361b9>] kthread+0xc3/0xf2
Oct 3 17:35:23 spool6 kernel: [<c0402005>] kernel_thread_helper+0x5/0xb
Oct 3 17:35:23 spool6 kernel: DWARF2 unwinder stuck at kernel_thread_helper+0x5/0xb
Oct 3 17:35:23 spool6 kernel: Leftover inexact backtrace:
Oct 3 17:35:23 spool6 kernel: [<c04057fa>] show_trace+0xd/0x10
Oct 3 17:35:23 spool6 kernel: [<c0405913>] dump_stack+0x19/0x1b
Oct 3 17:35:23 spool6 kernel: [<c043a25b>] print_unlock_inbalance_bug+0xce/0xd8
Oct 3 17:35:23 spool6 kernel: [<c043ba84>] lock_release_non_nested+0x85/0x129
Oct 3 17:35:23 spool6 kernel: [<c043bd77>] lock_release+0x111/0x130
Oct 3 17:35:23 spool6 kernel: [<c0438d8f>] up_write+0x16/0x2a
Oct 3 17:35:24 spool6 kernel: [<f8eff592>] dlm_recoverd+0x23d/0x337 [dlm]
Oct 3 17:35:24 spool6 kernel: [<c04361b9>] kthread+0xc3/0xf2
Oct 3 17:35:24 spool6 kernel: [<c0402005>] kernel_thread_helper+0x5/0xb
Oct 3 17:35:24 spool6 kernel: dlm: clvmd: recover 1 done: 144 ms
Oct 3 17:35:24 spool6 kernel: dlm: clvmd: recover 3
Oct 3 17:35:24 spool6 kernel: dlm: clvmd: add member 3
Oct 3 17:35:24 spool6 kernel: dlm: Initiating association with node 3
Oct 3 17:35:24 spool6 kernel: dlm: clvmd: total members 3 error 0
Oct 3 17:35:24 spool6 kernel: dlm: clvmd: dlm_recover_directory
Oct 3 17:35:24 spool6 kernel: dlm: clvmd: dlm_recover_directory 1 entries
Oct 3 17:35:24 spool6 kernel: dlm: clvmd: recover 3 done: 60 ms

That new bug is not really a bug. It's some debugging code that's getting overexcited and flagging up a problem where a lock is getting unlocked from a different thread to the one which locked it. That's harmless, and the dlm team already know about it. I'm still working on the highmem issue, and I'm building a test kernel at this very moment, in fact.

OK, here is what I have done, in case it helps... Upgraded with the latest updates; the kernel is now 2.6.18-1.2741.fc6. Same problems as above. I rebooted with highmem=0, and the problem seems to disappear: I wrote 40GB to the SAN, so it would seem to be a highmem problem, right? However, as I was doing a tape backup to an ADIC LTO3 drive, I got the following... Don't know if it is related or if this is a different problem, but maybe it will help:

Oct 6 00:58:31 virtual1b kernel: ------------[ cut here ]------------
Oct 6 00:58:31 virtual1b kernel: kernel BUG at mm/filemap.c:553!
Oct 6 00:58:31 virtual1b kernel: invalid opcode: 0000 [#1]
Oct 6 00:58:31 virtual1b kernel: SMP
Oct 6 00:58:31 virtual1b kernel: last sysfs file: /fs/gfs2/fpcl01:vg00lv00/id
Oct 6 00:58:31 virtual1b kernel: Modules linked in: md5 sctp autofs4 ipv6 lock_dlm gfs2 dlm configfs sunrpc video sbs i2c_ec button battery asus_acpi ac parport_pc lp parport st sg lpfc ide_cd aic7xxx e100 cdrom scsi_transport_spi r8169 scsi_transport_fc floppy cpqphp mii pcspkr i2c_piix4 i2c_core serio_raw dm_snapshot dm_zero dm_mirror dm_mod cciss sd_mod scsi_mod ext3 jbd ehci_hcd ohci_hcd uhci_hcd
Oct 6 00:58:31 virtual1b kernel: CPU: 1
Oct 6 00:58:31 virtual1b kernel: EIP: 0060:[<c0453202>] Not tainted VLI
Oct 6 00:58:31 virtual1b kernel: EFLAGS: 00010246 (2.6.18-1.2741.fc6 #1)
Oct 6 00:58:31 virtual1b kernel: EIP is at unlock_page+0xe/0x27
Oct 6 00:58:31 virtual1b kernel: eax: 00000000 ebx: c12a28c0 ecx: e40f0000 edx: e40f0de0
Oct 6 00:58:31 virtual1b kernel: esi: 00000000 edi: 00000001 ebp: e9ea90f4 esp: e40f0ce0
Oct 6 00:58:31 virtual1b kernel: ds: 007b es: 007b ss: 0068
Oct 6 00:58:31 virtual1b kernel: Process save (pid: 3684, ti=e40f0000 task=eadf2450 task.ti=e40f0000)
Oct 6 00:58:31 virtual1b kernel: Stack: c12a28c0 f8bee5fb e40f0d38 e40f0de0 00000000 e715d000 c0685aa8 00000000
Oct 6 00:58:31 virtual1b kernel: 00000044 c0683694 00000202 00000002 00000044 00000000 00000002 00000000
Oct 6 00:58:32 virtual1b kernel: 000201d2 c0685aa8 00000001 00000001 0000000f 00000001 e40f0d38 e40f0d38
Oct 6 00:58:32 virtual1b kernel: Call Trace:
Oct 6 00:58:32 virtual1b kernel: [<f8bee5fb>] gfs2_readpages+0x18e/0x1b1 [gfs2]
Oct 6 00:58:32 virtual1b kernel: [<c0458e93>] __do_page_cache_readahead+0x124/0x1c8
Oct 6 00:58:32 virtual1b kernel: [<c0458f83>] blockable_page_cache_readahead+0x4c/0x9f
Oct 6 00:58:32 virtual1b kernel: [<c045912e>] page_cache_readahead+0xbf/0x196
Oct 6 00:58:32 virtual1b kernel: [<c04538fd>] do_generic_mapping_read+0x13d/0x49b
Oct 6 00:58:32 virtual1b kernel: [<c0454535>] __generic_file_aio_read+0x18c/0x1d1
Oct 6 00:58:32 virtual1b kernel: [<c0455604>] generic_file_read+0xa2/0xb8
Oct 6 00:58:32 virtual1b kernel: [<c046f77c>] vfs_read+0xa6/0x157
Oct 6 00:58:32 virtual1b kernel: [<c046fbeb>] sys_read+0x41/0x67
Oct 6 00:58:32 virtual1b kernel: [<c0403f77>] syscall_call+0x7/0xb
Oct 6 00:58:32 virtual1b kernel: DWARF2 unwinder stuck at syscall_call+0x7/0xb
Oct 6 00:58:32 virtual1b kernel:
Oct 6 00:58:32 virtual1b kernel: Leftover inexact backtrace:
Oct 6 00:58:32 virtual1b kernel:
Oct 6 00:58:32 virtual1b kernel: =======================
Oct 6 00:58:32 virtual1b kernel: Code: 08 0f 0b 5a 02 12 7c 63 c0 89 d8 e8 45 fa ff ff 89 da b9 08 00 00 00 5b e9 c7 37 fe ff 53 89 c3 f0 0f ba 30 00 19 c0 85 c0 75 08 <0f> 0b 29 02 12 7c 63 c0 89 d8 e8 1b fa ff ff 89 da 31 c9 5b e9
Oct 6 00:58:32 virtual1b kernel: EIP: [<c0453202>] unlock_page+0xe/0x27 SS:ESP 0068:e40f0ce0

I've just pushed the fix from Russell Cattelan for the highmem bug to my gfs2 git tree. I'll push it to Linus later today and also get it into both FC and RHEL as soon as I can.

Well, either I am lucky or the fix got into kernel-2.6.18-1.2784.fc6, as I no longer get the highmem problem. However, I have a new problem and have opened bz 211044 for that.

The highmem bug discussion was transferred to another bug and has long since been fixed. I'm reassigning this to Patrick because I believe that the cman issue is probably also fixed, and if so he can mark it as resolved. Patrick, if this is the case, then please add some suitable resolution to this bug to take it off the list. Gary, please let us know if you have any outstanding issues relating to this bug, although it may be better to open a new one to prevent the information here from getting too confusing.

I'm pretty sure that the cman issues have been resolved by the many updates to openais and cman over the last few weeks of 'smoke' work. Not sure if this is the right resolution code, but it gets this bug off the lists!