Bug 206477

Summary: GFS2 crashes server
Product: Red Hat Enterprise Linux 5
Component: kernel
Version: 5.0
Hardware: i686
OS: Linux
Status: CLOSED WORKSFORME
Severity: medium
Priority: medium
Reporter: Gary Lindstrom <gplindstrom>
Assignee: Christine Caulfield <ccaulfie>
QA Contact: Brian Brock <bbrock>
CC: cattelan, cluster-maint, davej, wtogami
Doc Type: Bug Fix
Last Closed: 2006-11-29 10:40:17 UTC

Attachments:
- Dump from server with GFS2 problems... (flags: none)
- My latest cluster.conf (flags: none)

Description Gary Lindstrom 2006-09-14 16:41:30 UTC
OK, I give... time to ask for help, as I think this is a bug. Maybe I am doing
something wrong, but I don't believe so; if I am, feel free to correct me.


Description of problem:

GFS2 crashes the server, requiring a hard reboot.  If someone can tell me how to
capture the errors, I would be happy to send them.  It starts with a seg fault
when copying from an ext3 partition to the gfs2 filesystem.  The console then
gets a huge number of errors: it starts out, I believe, by saying it is a
kernel bug with an invalid opcode: 0000 error, then prints DWARF2 unwinder
errors and scrolls a bunch of other errors.  The computer shortly becomes
unusable, requiring a hard reboot.


Version-Release number of selected component (if applicable):

Loaded FC6 Test 2 with all recent updates...  occurs on both 2.6.17-1.2630.fc6
and 2.6.17-1.2517.fc6.  


How reproducible:

Very...  occurs every time I do it.  Servers are ProLiant DL380 G2s and G4s, 2GB
memory, SMP, attached to a Hitachi 4TB SAN with QLogic and Emulex HBA cards.
Cluster setup with 3 nodes.  Modified lvm.conf to enable cluster locking (tried
both types 2 and 3; btw, which is preferred?).  The GFS2 volume is created using
lvm2, striped across 8 PVs, and is marked as clustered.  I execute the cman and
clvmd startup scripts, mount the gfs2 volumes, and try copying data from the
ext3 partition to the GFS2 partition.  The copy starts, then seg faults...
down she goes...
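
Roughly, the setup amounts to the following (a sketch only; the volume,
cluster, and mount point names are placeholders, not my real ones):

# /etc/lvm/lvm.conf: set locking_type = 3 (built-in clustered locking
# via clvmd); locking_type = 2 uses an external locking library

service cman start                   # join the cluster
service clvmd start                  # clustered LVM daemon

# striped LV across the 8 PVs, then a clustered GFS2 filesystem
lvcreate -i 8 -L 500G -n gfs2lv vg00
mkfs.gfs2 -p lock_dlm -t mycluster:gfs2fs -j 3 /dev/vg00/gfs2lv

mount -t gfs2 /dev/vg00/gfs2lv /mnt/gfs2
cp -ar /data/from-ext3 /mnt/gfs2     # this is where it dies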


Steps to Reproduce:
1.  cp -ar <ext3 volume> <gfs2 volume>
  
Actual results:

System crashes...

Expected results:

Copy of data from ext3 volume to gfs2 volume...

Additional info:

If we need to take this offline or need more info, just email.  I would be happy
to send you whatever you need, but I am not sure how to capture the errors since
the machine requires a hard reboot, and the messages file does not have the
errors listed in it once rebooted.

Comment 1 Dave Jones 2006-09-14 23:53:38 UTC
If you boot with vga=1, you'll get more lines of text on the screen, so you
should be able to get the full context.  Capturing that text is vital to
diagnosing this problem.  Even a digital camera picture will do, as long as
it's readable.
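
Alternatively, a serial console or netconsole will catch the whole oops even
when the box needs a hard reboot.  Roughly (the addresses below are
placeholders for your own network):

# serial console: append to the kernel line in grub.conf
console=ttyS0,115200 console=tty0

# or stream kernel messages over UDP with netconsole
modprobe netconsole netconsole=@/,4444@192.168.1.100/00:11:22:33:44:55
# then listen on 192.168.1.100 with:  nc -u -l 4444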


Comment 2 Gary Lindstrom 2006-09-15 01:37:40 UTC
Hi Dave,

I managed to get this fragment of what comes up on the screen out of the
messages file...  any more than this and I will have to get a camera... I'll
try to bring one in tomorrow...  BTW, with vga=1 on boot, it starts in the
alternate mode but resets back to normal just before the SELinux stuff loads...
 Anyway, here is the fragment:

Sep 14 19:00:43 spool5 kernel: SELinux: initialized (dev dm-3, type gfs2), uses
xattr
Sep 14 19:01:46 spool5 kernel: ------------[ cut here ]------------
Sep 14 19:01:46 spool5 kernel: kernel BUG at arch/i386/mm/highmem.c:43!
Sep 14 19:01:46 spool5 kernel: invalid opcode: 0000 [#1]
Sep 14 19:01:46 spool5 kernel: SMP
Sep 14 19:01:46 spool5 kernel: last sysfs file:
/devices/system/cpu/cpu3/cpufreq/scaling_setspeed
Sep 14 19:01:46 spool5 kernel: Modules linked in: autofs4 md5 sctp lock_dlm gfs2
dlm configfs sunrpc xt_state ip_conntrack nfnetlink xt_tcpudp ip6table_filter
ip6_tables x_tables ipv6 acpi_cpufreq video sbs i2c_ec i2c_core button battery
asus_acpi ac parport_pc lp parport sg ide_cd intel_rng floppy qla2xxx e752x_edac
cdrom tg3 edac_mc serio_raw scsi_transport_fc pcspkr dm_snapshot dm_zero
dm_mirror dm_mod cciss sd_mod scsi_mod ext3 jbd ehci_hcd ohci_hcd uhci_hcd
Sep 14 19:01:46 spool5 kernel: CPU:    3
Sep 14 19:01:46 spool5 kernel: EIP:    0060:[<c041cb17>]    Not tainted VLI
Sep 14 19:01:46 spool5 kernel: EFLAGS: 00010206   (2.6.17-1.2617.2.1.fc6 #1)
Sep 14 19:01:46 spool5 kernel: EIP is at kmap_atomic+0x61/0x85
Sep 14 19:01:46 spool5 kernel: eax: c0004d8c   ebx: fff63000   ecx: c1e9f9c0  
edx: 00000160
Sep 14 19:01:46 spool5 kernel: esi: 00000000   edi: 00000000   ebp: e36f8dc0  
esp: e36f8dbc
Sep 14 19:01:46 spool5 kernel: ds: 007b   es: 007b   ss: 0068
Sep 14 19:05:09 spool5 syslogd 1.4.1: restart.


Comment 3 Steve Whitehouse 2006-09-15 08:02:31 UTC
It would be very useful if it is possible to get at least the first few calls on
the stack. I've grepped through the GFS2 source for all the kmap_atomic() calls
and I can't spot anywhere that we are obviously doing anything wrong, so a
pointer to which call is causing the problem would be very helpful.

Comment 4 Gary Lindstrom 2006-09-15 18:19:18 UTC
Created attachment 136381 [details]
Dump from server with GFS2 problems...

This is a dump from my server when I attempted to copy from an ext3 partition
to a GFS2 partition...

Comment 5 Gary Lindstrom 2006-09-15 18:21:49 UTC
OK, see the attached messages.rh log file...  while trying to take pictures, I
managed (first and only time) to get a good dump into a log file that I
could see...

Comment 6 Gary Lindstrom 2006-09-18 04:15:11 UTC
A little bit of an update... I applied the latest round of updates for FC6
Test3.  The main items were a new kernel (kernel-2.6.17-1.2647.fc6), a new
cman, rgmanager, and some other stuff.  After rebooting the cluster, I could not
get anything to work.  After struggling with it for 6-8 hours, I reverted back
to cman-2.0.13-2.fc6 and things sort of worked.  I still have the same kernel
issue, so nothing in that regard was resolved.  From my perspective, this
latest version of cman (2.0.14) is bad, but that is a separate issue.  So
basically, my original problem still exists in kernel-2.6.17-1.2647.fc6.

Comment 7 Steve Whitehouse 2006-09-19 14:31:34 UTC
Can you elaborate on the cman problems? It may not help the other bug, but we
don't know of any problems with that version of cman.


Comment 8 Gary Lindstrom 2006-09-19 15:51:02 UTC
Sure...  maybe I am doing something wrong and if so, please correct me.

Basically, after loading cman everything initially looks OK... clustat says all
three servers are online.  After a minute or two, fenced "fences" one of the
servers for no good reason and reboots the fenced server through the iLO
interface.  When the rebooted server rejoins the cluster, the fenced,
dlm_controld, and gfs_controld daemons all die off on the server that initiated
the reboot.  It seems as though the daemons can't communicate with each other.
However, cman-2.0.13-2.fc6 seems to manage it.

I am adding my latest version of cluster.conf to this bug report in case you see
something obvious...  Note that one server does not have an iLO interface, so I
just use manual fencing; at this point this is only a test system.  Also, I
tried it with both multicast enabled and disabled, with the same results.

So, is it just me, do I have some unsupported hardware, or am I missing
something?  I had the cluster working with EL4 and GFS, but I thought I would
give gfs2 a try and see what is new in Fedora.  I have basically been using EL3
and EL4, FC1 being the last Fedora I used.  I know this is a test release and I
have no problem with that.  I primarily want to test Xen and GFS2 on a test
system in order to design a new system for our business.  If I can help work
out the bugs, I am happy to help.

Comment 9 Gary Lindstrom 2006-09-19 15:52:26 UTC
Created attachment 136651 [details]
My latest cluster.conf

My latest cluster.conf just in case it helps...

Comment 10 Steve Whitehouse 2006-09-20 09:56:19 UTC
It might well be Xen that is the problem in this setup. Is it possible for you
to retry without Xen?

I'm not certain what Xen does with respect to multicast forwarding/routing, so
that might be part of the problem.


Comment 11 Christine Caulfield 2006-09-20 10:49:13 UTC
This is a bit of a hack but worth trying. Try adding this stanza to your
cluster.conf file, just inside the <cluster> bit.

<totem token="5000"/>
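
so that the file looks roughly like this (the cluster name here is made up;
keep your existing clusternodes/fencedevices sections as they are):

<?xml version="1.0"?>
<cluster name="mycluster" config_version="2">
  <totem token="5000"/>
  <clusternodes>
    ...
  </clusternodes>
  <fencedevices>
    ...
  </fencedevices>
</cluster>

That raises the totem token timeout to 5000ms, so a node that is slow to
answer isn't declared dead (and fenced) quite so eagerly.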


Comment 12 Gary Lindstrom 2006-09-20 15:43:57 UTC
Sorry, Xen has never been loaded on the machines I am trying to set up as a
cluster (not yet, anyway), so that cannot be the problem.  I'll try the
suggested '<totem token="5000"/>' stanza shortly.  However, is that an attempt
to fix the cman problem or the invalid opcode problem in highmem.c?  I assume
the cman problem?

Comment 13 Steve Whitehouse 2006-09-20 15:50:27 UTC
Yes, the cman problem. I'm still wondering about the highmem problem. The trace
suggests that the problem is in the VFS. It looks like the cause is scheduling
while holding an atomic kmap (which is not allowed), and the VFS happened to be
the next thing scheduled that called the same code, causing the oops. The
problem is in tracking down where the original misuse is.

There are a few possible call sites; I've checked them all over and I cannot
see any reason why scheduling should take place at those points.
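
To illustrate the rule being broken, here is a sketch of the pattern in
question (illustrative only, not actual GFS2 code):

#include <linux/highmem.h>
#include <linux/string.h>

/*
 * An atomic kmap pins a per-CPU fixmap slot, so the holder must not
 * sleep.  If it schedules anyway, the next task on that CPU to call
 * kmap_atomic() with the same slot type finds the slot still mapped
 * and trips the BUG() in arch/i386/mm/highmem.c -- so the oops shows
 * up in an innocent caller (here, apparently, the VFS) rather than at
 * the site of the original misuse.
 */
static void copy_to_highmem_page(struct page *page, const void *src,
                                 size_t len)
{
        char *kaddr = kmap_atomic(page, KM_USER0);

        memcpy(kaddr, src, len);        /* fine: cannot sleep */
        /* calling anything that may schedule here is the misuse */
        kunmap_atomic(kaddr, KM_USER0);
}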

Comment 14 Gary Lindstrom 2006-09-20 16:20:41 UTC
Sounds to me like a missing lock somewhere causing a race...  I used to do a lot
of programming, but it has been 15+ years since I did anything serious, so I
would only be guessing... :)

swhiteho, see the email I sent you directly and let me know...

Comment 15 Gary Lindstrom 2006-09-20 17:33:49 UTC
Ok, same problem with <totem token="5000"/> added.  Once I finally got the
nodes to stop rebooting each other and kept fenced running, I told them to load
clvmd.  On one machine the startup script finished, but then fenced and the two
control daemons (dlm and gfs) died off.  The other servers never completed
loading clvmd.  This is the same problem I had the other day...  Tried
restarting the daemons manually...  fenced runs, but dlm_controld and
gfs_controld die again.  This is from the messages log:

Sep 20 11:13:16 spool5 openais[2269]: [TOTEM] entering GATHER state from 12.
Sep 20 11:13:16 spool5 openais[2269]: [TOTEM] Saving state aru 2d high seq
received 2d
Sep 20 11:13:16 spool5 openais[2269]: [TOTEM] Storing new sequence id for ring 2128
Sep 20 11:13:16 spool5 openais[2269]: [TOTEM] entering COMMIT state.
Sep 20 11:13:16 spool5 openais[2269]: [TOTEM] entering RECOVERY state.
Sep 20 11:13:16 spool5 openais[2269]: [TOTEM] position [0] member 209.161.1.151:
Sep 20 11:13:16 spool5 openais[2269]: [TOTEM] previous ring seq 8480 rep
209.161.1.151
Sep 20 11:13:16 spool5 openais[2269]: [TOTEM] aru 2d high delivered 2d received
flag 0
Sep 20 11:13:16 spool5 openais[2269]: [TOTEM] position [1] member 209.161.1.170:
Sep 20 11:13:16 spool5 openais[2269]: [TOTEM] previous ring seq 8480 rep
209.161.1.151
Sep 20 11:13:16 spool5 openais[2269]: [TOTEM] aru 2d high delivered 2d received
flag 0
Sep 20 11:13:16 spool5 openais[2269]: [TOTEM] position [2] member 209.161.1.171:
Sep 20 11:13:16 spool5 openais[2269]: [TOTEM] previous ring seq 8484 rep
209.161.1.171
Sep 20 11:13:16 spool5 openais[2269]: [TOTEM] aru 9 high delivered 9 received flag 0
Sep 20 11:13:16 spool5 openais[2269]: [TOTEM] Did not need to originate any
messages in recovery.
Sep 20 11:13:17 spool5 openais[2269]: [CLM  ] CLM CONFIGURATION CHANGE
Sep 20 11:13:17 spool5 openais[2269]: [CLM  ] New Configuration:
Sep 20 11:13:17 spool5 openais[2269]: [CLM  ]   r(0) ip(209.161.1.151)
Sep 20 11:13:17 spool5 openais[2269]: [CLM  ]   r(0) ip(209.161.1.170)
Sep 20 11:13:17 spool5 openais[2269]: [CLM  ]   r(0) ip(209.161.1.171)
Sep 20 11:13:17 spool5 openais[2269]: [CLM  ] Members Left:
Sep 20 11:13:17 spool5 openais[2269]: [CLM  ] Members Joined:
Sep 20 11:13:17 spool5 openais[2269]: [SYNC ] This node is within the primary
component and will provide service.
Sep 20 11:13:17 spool5 openais[2269]: [CLM  ] CLM CONFIGURATION CHANGE
Sep 20 11:13:17 spool5 openais[2269]: [CLM  ] New Configuration:
Sep 20 11:13:17 spool5 openais[2269]: [CLM  ]   r(0) ip(209.161.1.151)
Sep 20 11:13:17 spool5 openais[2269]: [CLM  ]   r(0) ip(209.161.1.170)
Sep 20 11:13:17 spool5 openais[2269]: [CLM  ]   r(0) ip(209.161.1.171)
Sep 20 11:13:17 spool5 openais[2269]: [CLM  ] Members Left:
Sep 20 11:13:17 spool5 openais[2269]: [CLM  ] Members Joined:
Sep 20 11:13:17 spool5 openais[2269]: [SYNC ] This node is within the primary
component and will provide service.
Sep 20 11:13:17 spool5 openais[2269]: [TOTEM] entering OPERATIONAL state.
Sep 20 11:13:17 spool5 openais[2269]: [CLM  ] got nodejoin message 209.161.1.151
Sep 20 11:13:17 spool5 openais[2269]: [CLM  ] got nodejoin message 209.161.1.170
Sep 20 11:13:17 spool5 openais[2269]: [CLM  ] got nodejoin message 209.161.1.171
Sep 20 11:13:17 spool5 openais[2269]: [CPG  ] got joinlist message from node 3
Sep 20 11:13:17 spool5 openais[2269]: [CPG  ] got joinlist message from node 1
Sep 20 11:26:08 spool5 gfs_controld[2723]: groupd_dispatch error -1 errno 22
Sep 20 11:26:08 spool5 gfs_controld[2723]: groupd connection died
Sep 20 11:26:08 spool5 fenced[2711]: groupd is down, exiting
Sep 20 11:26:08 spool5 gfs_controld[2723]: cluster is down, exiting
Sep 20 11:26:08 spool5 dlm_controld[2717]: groupd is down, exiting
Sep 20 11:26:17 spool5 clvmd: Cluster LVM daemon started - connected to CMAN
Sep 20 11:30:42 spool5 fenced[3958]: group_init error 0 111
Sep 20 11:30:46 spool5 dlm_controld[3960]: group_init error 0 111
Sep 20 11:31:05 spool5 gfs_controld[3965]: group_init error 0 111


Comment 16 Steve Whitehouse 2006-09-25 13:45:08 UTC
The original bug reported here, relating to highmem and kmap_atomic, has also
been reported more recently, hopefully with another angle on it, so we'll now
have a better chance of tracking it down.

I had thought about marking this bug a dup of bz #207764; however, perhaps the
best solution is to leave this one open to deal with the cman problem and
transfer the highmem discussion to bz #207764.


Comment 17 Gary Lindstrom 2006-09-25 17:05:03 UTC
Ummm... I can't see bz #207764... It says I am not authorized to see that bug...

FWIW, I tried upgrading to the 2.6.18-1.2689.fc6 kernel in development, upgraded
device-mapper to 2.02.10 (my own compile), upgraded lvm to 2.02.10 (my own
compile), and recompiled the cman 2.0.14-3 source package on my machine, all
with no success...  same problem for both highmem and cman.  However, it did
turn up another bug; maybe you want to open a new report for this one.  It seems
that if you set locking_type to 2 in lvm.conf but don't have clvmd loaded when
running mkinitrd, lvm.static seg faults and does not add lvm support into the
.img file, and of course the kernel then fails to boot if you are using lvm for
the boot disks.  Upgrading to 2.02.10 fixed that: it no longer seg faults, and
it falls back to internal and file locking instead.  It also cleaned up a
couple of miscellaneous lvm segfault messages on bootup.
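
For reference, the relevant lvm.conf knobs look something like this (a sketch,
not my exact file; the fallback options are, I believe, what the newer lvm2
honours when cluster locking is unavailable):

global {
    # 1 = local file locking, 2 = external locking library,
    # 3 = built-in clustered locking (clvmd)
    locking_type = 3

    # fall back instead of failing when cluster locking is unavailable
    fallback_to_clustered_locking = 1
    fallback_to_local_locking = 1
}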

Also, while I am at it, I noticed the following in the log file, which is
similar to bz #206754 but not exactly the same.  Don't know if it is related
or not:

Sep 24 18:29:20 spool6 kernel: INFO: trying to register non-static key.
Sep 24 18:29:20 spool6 kernel: the code is fine but needs lockdep annotation.
Sep 24 18:29:20 spool6 kernel: turning off the locking correctness validator.
Sep 24 18:29:20 spool6 kernel:  [<c04051ee>] show_trace_log_lvl+0x58/0x171
Sep 24 18:29:20 spool6 kernel:  [<c0405802>] show_trace+0xd/0x10
Sep 24 18:29:20 spool6 kernel:  [<c040591b>] dump_stack+0x19/0x1b
Sep 24 18:29:20 spool6 kernel:  [<c043b366>] __lock_acquire+0xfd/0x99c
Sep 24 18:29:20 spool6 kernel:  [<c043c176>] lock_acquire+0x4b/0x6d
Sep 24 18:29:20 spool6 kernel:  [<c06155a4>] _spin_lock_irq+0x1f/0x2e
Sep 24 18:29:20 spool6 kernel:  [<c0613422>] wait_for_completion+0x29/0x9e
Sep 24 18:29:20 spool6 kernel:  [<f88cfc36>] sendcmd_withirq+0x173/0x289 [cciss]
Sep 24 18:29:20 spool6 kernel:  [<f88d0c9a>] cciss_read_capacity+0x2f/0xa6 [cciss]
Sep 24 18:29:20 spool6 kernel:  [<f88d2c32>] cciss_revalidate+0xaa/0xfa [cciss]
Sep 24 18:29:20 spool6 kernel:  [<c04a8380>] rescan_partitions+0x6b/0x1e7
Sep 24 18:29:20 spool6 kernel:  [<c047a5c8>] do_open+0x2e9/0x3e9
Sep 24 18:29:20 spool6 kernel:  [<c047a739>] blkdev_get+0x71/0x7c
Sep 24 18:29:20 spool6 kernel:  [<c04a82c3>] register_disk+0x115/0x167
Sep 24 18:29:20 spool6 kernel:  [<c04e099b>] add_disk+0x2e/0x3d
Sep 24 18:29:20 spool6 kernel:  [<f88d356a>] cciss_init_one+0x8e8/0xa5a [cciss]
Sep 24 18:29:20 spool6 kernel:  [<c04f4241>] pci_device_probe+0x39/0x5b
Sep 24 18:29:20 spool6 kernel:  [<c05552d6>] driver_probe_device+0x45/0x92
Sep 24 18:29:20 spool6 kernel:  [<c0555403>] __driver_attach+0x68/0x93
Sep 24 18:29:20 spool6 kernel:  [<c0554d4f>] bus_for_each_dev+0x3a/0x5f
Sep 24 18:29:20 spool6 kernel:  [<c0555231>] driver_attach+0x14/0x17
Sep 24 18:29:20 spool6 kernel:  [<c0554a26>] bus_add_driver+0x68/0x106
Sep 24 18:29:20 spool6 kernel:  [<c05556ac>] driver_register+0x78/0x7d
Sep 24 18:29:20 spool6 kernel:  [<c04f4392>] __pci_register_driver+0x4f/0x69
Sep 24 18:29:20 spool6 kernel:  [<f880401c>] cciss_init+0x1c/0x1e [cciss]
Sep 24 18:29:21 spool6 kernel:  [<c04426ce>] sys_init_module+0x16ad/0x1856
Sep 24 18:29:21 spool6 kernel:  [<c0403faf>] syscall_call+0x7/0xb
Sep 24 18:29:21 spool6 kernel: DWARF2 unwinder stuck at syscall_call+0x7/0xb
Sep 24 18:29:21 spool6 kernel: Leftover inexact backtrace:
Sep 24 18:29:21 spool6 kernel:  [<c0405802>] show_trace+0xd/0x10
Sep 24 18:29:21 spool6 kernel:  [<c040591b>] dump_stack+0x19/0x1b
Sep 24 18:29:21 spool6 kernel:  [<c043b366>] __lock_acquire+0xfd/0x99c
Sep 24 18:29:21 spool6 kernel:  [<c043c176>] lock_acquire+0x4b/0x6d
Sep 24 18:29:21 spool6 kernel:  [<c06155a4>] _spin_lock_irq+0x1f/0x2e
Sep 24 18:29:21 spool6 kernel:  [<c0613422>] wait_for_completion+0x29/0x9e
Sep 24 18:29:21 spool6 kernel:  [<f88cfc36>] sendcmd_withirq+0x173/0x289 [cciss]
Sep 24 18:29:21 spool6 kernel:  [<f88d0c9a>] cciss_read_capacity+0x2f/0xa6 [cciss]
Sep 24 18:29:21 spool6 kernel:  [<f88d2c32>] cciss_revalidate+0xaa/0xfa [cciss]
Sep 24 18:29:21 spool6 kernel:  [<c04a8380>] rescan_partitions+0x6b/0x1e7
Sep 24 18:29:21 spool6 kernel:  [<c047a5c8>] do_open+0x2e9/0x3e9
Sep 24 18:29:21 spool6 kernel:  [<c047a739>] blkdev_get+0x71/0x7c
Sep 24 18:29:21 spool6 kernel:  [<c04a82c3>] register_disk+0x115/0x167
Sep 24 18:29:21 spool6 kernel:  [<c04e099b>] add_disk+0x2e/0x3d
Sep 24 18:29:21 spool6 kernel:  [<f88d356a>] cciss_init_one+0x8e8/0xa5a [cciss]
Sep 24 18:29:21 spool6 kernel:  [<c04f4241>] pci_device_probe+0x39/0x5b
Sep 24 18:29:21 spool6 kernel:  [<c05552d6>] driver_probe_device+0x45/0x92
Sep 24 18:29:21 spool6 kernel:  [<c0555403>] __driver_attach+0x68/0x93
Sep 24 18:29:21 spool6 kernel:  [<c0554d4f>] bus_for_each_dev+0x3a/0x5f
Sep 24 18:29:21 spool6 kernel:  [<c0555231>] driver_attach+0x14/0x17
Sep 24 18:29:21 spool6 kernel:  [<c0554a26>] bus_add_driver+0x68/0x106
Sep 24 18:29:21 spool6 kernel:  [<c05556ac>] driver_register+0x78/0x7d
Sep 24 18:29:21 spool6 kernel:  [<c04f4392>] __pci_register_driver+0x4f/0x69
Sep 24 18:29:21 spool6 kernel:  [<f880401c>] cciss_init+0x1c/0x1e [cciss]
Sep 24 18:29:21 spool6 kernel:  [<c04426ce>] sys_init_module+0x16ad/0x1856
Sep 24 18:29:21 spool6 kernel:  [<c0403faf>] syscall_call+0x7/0xb

I am sure this is related to the SCSI controller in the Compaq boxes.

Not having ever worked with Bugzilla or you guys before, should I just be
opening separate bugs for each of these?



Comment 18 Steve Whitehouse 2006-09-27 10:16:02 UTC
Sorry - I'll take a look at the settings on that bug in a moment so it should be
viewable shortly.

As you say, that trace looks like a SCSI driver problem. Or, most likely, it's
not actually a problem at all, but a case where the lockdep annotation isn't
quite right. Please open a separate bug for that if you'd like us to look into
it. I'm not sure who the right person is for SCSI bugs, but bugzilla will
know :-)

The lvm issue is a separate thing though, and opening another bug is a good
plan. It's best to open a new bug for each separate issue so that we can track
each one individually. It makes it easier to pass each particular bug to the
right person to investigate.


Comment 19 Gary Lindstrom 2006-09-27 19:31:24 UTC
Thanks Steve, I'll open a couple of other bugzillas for the other problems...

I still can't see 207764, but maybe you have not changed the permissions yet...

Comment 20 Steve Whitehouse 2006-09-28 15:51:54 UTC
I'm still working on getting 207764 opened up. It seems that some of the info is
confidential, so it's not as easy as I'd first thought; sorry it's taking so
long. One other question: you appeared, from the original stack trace, to be
running SELinux. If so, did you find that you got the same results without
SELinux running?

There is an outstanding SELinux bug in gfs2 which I'm about to receive a patch
for, and while I don't think that's the cause here, it would be good to confirm
that, and it would help narrow down the possible causes.


Comment 21 Gary Lindstrom 2006-09-28 16:36:11 UTC
> If so, did you find that you got the same results without SELinux running?

Yep, same results with SELinux enabled, permissive, and disabled...  actually, I
usually set SELinux to permissive so that it does not get in the way of things
while still telling me about any issues.  However, I have had it disabled
completely for a while now, and it's the same problem.

Thanks for your work on this!

Comment 22 Gary Lindstrom 2006-10-04 00:03:48 UTC
Haven't heard anything, but tried kernel 2.6.18-1.2724.fc6 with the same
results...

However, something new has started appearing... Not sure where this goes, but I
sometimes get it when I load clvmd...

Oct  3 17:35:23 spool6 kernel: SCTP: Hash tables configured (established 32768
bind 32768)
Oct  3 17:35:23 spool6 kernel: Module sctp cannot be unloaded due to unsafe
usage in net/sctp/protocol.c:1189
Oct  3 17:35:23 spool6 kernel: dlm: clvmd: recover 1
Oct  3 17:35:23 spool6 kernel: dlm: clvmd: add member 1
Oct  3 17:35:23 spool6 kernel: dlm: clvmd: add member 2
Oct  3 17:35:23 spool6 kernel: dlm: got new/restarted association 1 nodeid 1
Oct  3 17:35:23 spool6 kernel: dlm: clvmd: total members 2 error 0
Oct  3 17:35:23 spool6 kernel: dlm: clvmd: dlm_recover_directory
Oct  3 17:35:23 spool6 kernel: dlm: clvmd: dlm_recover_directory 0 entries
Oct  3 17:35:23 spool6 kernel:
Oct  3 17:35:23 spool6 kernel: =====================================
Oct  3 17:35:23 spool6 kernel: [ BUG: bad unlock balance detected! ]
Oct  3 17:35:23 spool6 kernel: -------------------------------------
Oct  3 17:35:23 spool6 kernel: dlm_recoverd/4419 is trying to release lock
(&ls->ls_in_recovery) at:
Oct  3 17:35:23 spool6 kernel: [<f8eff592>] dlm_recoverd+0x23d/0x337 [dlm]
Oct  3 17:35:23 spool6 kernel: but there are no more locks to release!
Oct  3 17:35:23 spool6 kernel:
Oct  3 17:35:23 spool6 kernel: other info that might help us debug this:
Oct  3 17:35:23 spool6 kernel: 2 locks held by dlm_recoverd/4419:
Oct  3 17:35:23 spool6 kernel:  #0:  (&ls->ls_recoverd_active){--..}, at:
[<c0613382>] mutex_lock+0x21/0x24
Oct  3 17:35:23 spool6 kernel:  #1:  (&ls->ls_recover_lock){--..}, at:
[<f8eff56b>] dlm_recoverd+0x216/0x337 [dlm]
Oct  3 17:35:23 spool6 kernel:
Oct  3 17:35:23 spool6 kernel: stack backtrace:
Oct  3 17:35:23 spool6 kernel:  [<c04051ed>] show_trace_log_lvl+0x58/0x16a
Oct  3 17:35:23 spool6 kernel:  [<c04057fa>] show_trace+0xd/0x10
Oct  3 17:35:23 spool6 kernel:  [<c0405913>] dump_stack+0x19/0x1b
Oct  3 17:35:23 spool6 kernel:  [<c043a25b>] print_unlock_inbalance_bug+0xce/0xd8
Oct  3 17:35:23 spool6 kernel:  [<c043ba84>] lock_release_non_nested+0x85/0x129
Oct  3 17:35:23 spool6 kernel:  [<c043bd77>] lock_release+0x111/0x130
Oct  3 17:35:23 spool6 kernel:  [<c0438d8f>] up_write+0x16/0x2a
Oct  3 17:35:23 spool6 kernel:  [<f8eff592>] dlm_recoverd+0x23d/0x337 [dlm]
Oct  3 17:35:23 spool6 kernel:  [<c04361b9>] kthread+0xc3/0xf2
Oct  3 17:35:23 spool6 kernel:  [<c0402005>] kernel_thread_helper+0x5/0xb
Oct  3 17:35:23 spool6 kernel: DWARF2 unwinder stuck at kernel_thread_helper+0x5/0xb
Oct  3 17:35:23 spool6 kernel: Leftover inexact backtrace:
Oct  3 17:35:23 spool6 kernel:  [<c04057fa>] show_trace+0xd/0x10
Oct  3 17:35:23 spool6 kernel:  [<c0405913>] dump_stack+0x19/0x1b
Oct  3 17:35:23 spool6 kernel:  [<c043a25b>] print_unlock_inbalance_bug+0xce/0xd8
Oct  3 17:35:23 spool6 kernel:  [<c043ba84>] lock_release_non_nested+0x85/0x129
Oct  3 17:35:23 spool6 kernel:  [<c043bd77>] lock_release+0x111/0x130
Oct  3 17:35:23 spool6 kernel:  [<c0438d8f>] up_write+0x16/0x2a
Oct  3 17:35:24 spool6 kernel:  [<f8eff592>] dlm_recoverd+0x23d/0x337 [dlm]
Oct  3 17:35:24 spool6 kernel:  [<c04361b9>] kthread+0xc3/0xf2
Oct  3 17:35:24 spool6 kernel:  [<c0402005>] kernel_thread_helper+0x5/0xb
Oct  3 17:35:24 spool6 kernel: dlm: clvmd: recover 1 done: 144 ms
Oct  3 17:35:24 spool6 kernel: dlm: clvmd: recover 3
Oct  3 17:35:24 spool6 kernel: dlm: clvmd: add member 3
Oct  3 17:35:24 spool6 kernel: dlm: Initiating association with node 3
Oct  3 17:35:24 spool6 kernel: dlm: clvmd: total members 3 error 0
Oct  3 17:35:24 spool6 kernel: dlm: clvmd: dlm_recover_directory
Oct  3 17:35:24 spool6 kernel: dlm: clvmd: dlm_recover_directory 1 entries
Oct  3 17:35:24 spool6 kernel: dlm: clvmd: recover 3 done: 60 ms


Comment 23 Steve Whitehouse 2006-10-04 08:57:37 UTC
That new bug is not really a bug. It's some debugging code getting overexcited
and flagging up a case where a lock is unlocked from a different thread to the
one which locked it. That's harmless, and the dlm team know about it already.

I'm still working on the highmem issue, and I'm building a test kernel at this
very moment, in fact.

Comment 24 Gary Lindstrom 2006-10-06 07:13:12 UTC
OK, here is what I have done, in case it helps...  Upgraded with the latest
updates - kernel is now 2.6.18-1.2741.fc6.  Same problems as above.  I then
rebooted with highmem=0 and the problem seems to disappear: I wrote 40GB to the
SAN without incident, so this would seem to be a highmem problem, right?
However, as I was doing a tape backup to an ADIC LTO3 drive, I got the
following...  Don't know if it is related or if this is a different problem,
but maybe it will help:

Oct  6 00:58:31 virtual1b kernel: ------------[ cut here ]------------
Oct  6 00:58:31 virtual1b kernel: kernel BUG at mm/filemap.c:553!
Oct  6 00:58:31 virtual1b kernel: invalid opcode: 0000 [#1]
Oct  6 00:58:31 virtual1b kernel: SMP 
Oct  6 00:58:31 virtual1b kernel: last sysfs file: /fs/gfs2/fpcl01:vg00lv00/id
Oct  6 00:58:31 virtual1b kernel: Modules linked in: md5 sctp autofs4 ipv6
lock_dlm gfs2 dlm configfs sunrpc video sbs i2c_ec button battery asus_acpi ac
parport_pc lp parport st sg lpfc ide_c
d aic7xxx e100 cdrom scsi_transport_spi r8169 scsi_transport_fc floppy cpqphp
mii pcspkr i2c_piix4 i2c_core serio_raw dm_snapshot dm_zero dm_mirror dm_mod
cciss sd_mod scsi_mod ext3 jbd ehci_h
cd ohci_hcd uhci_hcd
Oct  6 00:58:31 virtual1b kernel: CPU:    1
Oct  6 00:58:31 virtual1b kernel: EIP:    0060:[<c0453202>]    Not tainted VLI
Oct  6 00:58:31 virtual1b kernel: EFLAGS: 00010246   (2.6.18-1.2741.fc6 #1) 
Oct  6 00:58:31 virtual1b kernel: EIP is at unlock_page+0xe/0x27
Oct  6 00:58:31 virtual1b kernel: eax: 00000000   ebx: c12a28c0   ecx: e40f0000
  edx: e40f0de0
Oct  6 00:58:31 virtual1b kernel: esi: 00000000   edi: 00000001   ebp: e9ea90f4
  esp: e40f0ce0
Oct  6 00:58:31 virtual1b kernel: ds: 007b   es: 007b   ss: 0068
Oct  6 00:58:31 virtual1b kernel: Process save (pid: 3684, ti=e40f0000
task=eadf2450 task.ti=e40f0000)
Oct  6 00:58:31 virtual1b kernel: Stack: c12a28c0 f8bee5fb e40f0d38 e40f0de0
00000000 e715d000 c0685aa8 00000000 
Oct  6 00:58:31 virtual1b kernel:        00000044 c0683694 00000202 00000002
00000044 00000000 00000002 00000000 
Oct  6 00:58:32 virtual1b kernel:        000201d2 c0685aa8 00000001 00000001
0000000f 00000001 e40f0d38 e40f0d38 
Oct  6 00:58:32 virtual1b kernel: Call Trace:
Oct  6 00:58:32 virtual1b kernel:  [<f8bee5fb>] gfs2_readpages+0x18e/0x1b1 [gfs2]
Oct  6 00:58:32 virtual1b kernel:  [<c0458e93>]
__do_page_cache_readahead+0x124/0x1c8
Oct  6 00:58:32 virtual1b kernel:  [<c0458f83>]
blockable_page_cache_readahead+0x4c/0x9f
Oct  6 00:58:32 virtual1b kernel:  [<c045912e>] page_cache_readahead+0xbf/0x196
Oct  6 00:58:32 virtual1b kernel:  [<c04538fd>] do_generic_mapping_read+0x13d/0x49b
Oct  6 00:58:32 virtual1b kernel:  [<c0454535>] __generic_file_aio_read+0x18c/0x1d1
Oct  6 00:58:32 virtual1b kernel:  [<c0455604>] generic_file_read+0xa2/0xb8
Oct  6 00:58:32 virtual1b kernel:  [<c046f77c>] vfs_read+0xa6/0x157
Oct  6 00:58:32 virtual1b kernel:  [<c046fbeb>] sys_read+0x41/0x67
Oct  6 00:58:32 virtual1b kernel:  [<c0403f77>] syscall_call+0x7/0xb
Oct  6 00:58:32 virtual1b kernel: DWARF2 unwinder stuck at syscall_call+0x7/0xb
Oct  6 00:58:32 virtual1b kernel: 
Oct  6 00:58:32 virtual1b kernel: Leftover inexact backtrace:
Oct  6 00:58:32 virtual1b kernel: 
Oct  6 00:58:32 virtual1b kernel:  =======================
Oct  6 00:58:32 virtual1b kernel: Code: 08 0f 0b 5a 02 12 7c 63 c0 89 d8 e8 45
fa ff ff 89 da b9 08 00 00 00 5b e9 c7 37 fe ff 53 89 c3 f0 0f ba 30 00 19 c0 85
c0 75 08 <0f> 0b 29 02 12 7c 63 
c0 89 d8 e8 1b fa ff ff 89 da 31 c9 5b e9 
Oct  6 00:58:32 virtual1b kernel: EIP: [<c0453202>] unlock_page+0xe/0x27 SS:ESP
0068:e40f0ce0


Comment 25 Steve Whitehouse 2006-10-12 08:49:14 UTC
I've just pushed the fix from Russell Cattelan for the highmem bug to my gfs2
git tree. I'll push it to Linus later today and also get it into both FC and
RHEL as soon as I can.

Comment 26 Gary Lindstrom 2006-10-16 22:08:27 UTC
Well, either I am lucky, or the fix got into kernel-2.6.18-1.2784.fc6, as I no
longer get the highmem problem.  However, I have a new problem; I opened bz
211044 for that.

Comment 27 Steve Whitehouse 2006-11-24 11:46:42 UTC
The highmem bug discussion was transferred to another bug and has long since
been fixed. I'm reassigning this to Patrick because I believe the cman issue is
probably also fixed; if so, he can mark it as resolved.

Patrick, if that is the case, then please add a suitable resolution to this
bug to take it off the list.

Gary, please let us know if you have any outstanding issues relating to this
bug, although it may be better to open a new one to keep the information here
from getting too confusing.

Comment 28 Christine Caulfield 2006-11-29 10:40:17 UTC
I'm pretty sure that the cman issues have been resolved by the many updates to
openais and cman over the last few weeks of 'smoke' work.

Not sure if this is the right resolution code, but it gets this bug off the
lists!