Bug 213289

Summary: [DLM] gfs umount hung, message size too big
Product: Red Hat Enterprise Linux 5
Reporter: Nate Straz <nstraz>
Component: kernel
Assignee: Christine Caulfield <ccaulfie>
Status: CLOSED CURRENTRELEASE
QA Contact: Dean Jansa <djansa>
Severity: medium
Priority: medium
Version: 5.1
CC: dzickus, lwang, rkenna, rpeterso, teigland
Target Milestone: ---
Keywords: Reopened
Target Release: ---
Hardware: All
OS: Linux
Fixed In Version: 5.0.0
Doc Type: Bug Fix
Last Closed: 2007-07-19 19:39:01 UTC
Attachments:
- Breakdown of midcomm's buffer
- Possible fix
- New buffer full

Description Nate Straz 2006-10-31 17:25:03 UTC
Description of problem:

Part of gfs_fsck_stress creates a tree from each node, then umounts the file
system from each node.  When the umount ran on the second node, the umount hung
and the following messages were printed on the console:

dlm: morph-cluster0: bad rcom nodeid 3 from 5
dlm: message size 8224 from 5 too big, buf len 3828
dlm: error reading from sctp socket: -7
dlm: morph-cluster0: bad rcom nodeid 3 from 5
dlm: message size 8224 from 5 too big, buf len 3924
dlm: error reading from sctp socket: -7


Version-Release number of selected component (if applicable):
kmod-gfs-0.1.10-1.2.6.18_1.2739.el5
gfs-utils-0.1.7-1.el5
kernel-2.6.18-1.2739.el5


How reproducible:
unknown

Steps to Reproduce:
1. foreach node mount a gfs file system
2. foreach node create a tree on the file system
3. foreach node umount the file system
  
Actual results:

Here are the logs on each node, starting with the tree creation command through
the umount hang.
==========================================================================
Oct 30 16:22:39 morph-01 qarshd[6051]: Running cmdline:
/usr/tests/sts/bin/placemaker -c '/usr/tests/sts/bin/genesis -i 5 -d 2 -n 200'
-d 3 -w 3 -p /mnt/morph-cluster0/morph-01 
Oct 30 16:23:21 morph-01 qarshd[6171]: Running cmdline: mount -t gfs,gfs2 
Oct 30 16:23:21 morph-01 qarshd[6174]: Running cmdline: umount
/mnt/morph-cluster0 > /dev/null 2>&1 
Oct 30 16:23:24 morph-01 qarshd[6189]: Running cmdline: mount 
==========================================================================
Oct 30 16:22:54 morph-02 qarshd[5656]: Running cmdline:
/usr/tests/sts/bin/placemaker -c '/usr/tests/sts/bin/genesis -i 5 -d 2 -n 200'
-d 3 -w 3 -p /mnt/morph-cluster0/morph-02 
Oct 30 16:23:30 morph-02 kernel: dlm: morph-cluster0: recover 9
Oct 30 16:23:30 morph-02 kernel: dlm: morph-cluster0: remove member 1
Oct 30 16:23:30 morph-02 kernel: dlm: morph-cluster0: total members 4 error 0
Oct 30 16:23:30 morph-02 kernel: dlm: morph-cluster0: dlm_recover_directory
Oct 30 16:23:30 morph-02 kernel: dlm: morph-cluster0: dlm_recover_directory
15678 entries
Oct 30 16:23:31 morph-02 kernel: dlm: morph-cluster0: dlm_purge_locks
Oct 30 16:23:31 morph-02 kernel: dlm: morph-cluster0: dlm_recover_masters
Oct 30 16:23:31 morph-02 kernel: dlm: morph-cluster0: dlm_recover_masters 2861
resources
Oct 30 16:23:31 morph-02 kernel: dlm: morph-cluster0: dlm_recover_locks
Oct 30 16:23:31 morph-02 kernel: dlm: morph-cluster0: dlm_recover_locks 1321 locks
Oct 30 16:23:32 morph-02 qarshd[5776]: Running cmdline: mount -t gfs,gfs2 
Oct 30 16:23:32 morph-02 qarshd[5778]: Running cmdline: umount
/mnt/morph-cluster0 > /dev/null 2>&1 
Oct 30 16:35:25 morph-02 kernel: dlm: morph-cluster0: bad rcom nodeid 3 from 5
Oct 30 16:35:25 morph-02 kernel: dlm: message size 8224 from 5 too big, buf len 3828
Oct 30 16:35:25 morph-02 kernel: dlm: error reading from sctp socket: -7
Oct 30 16:35:35 morph-02 kernel: dlm: morph-cluster0: bad rcom nodeid 3 from 5
Oct 30 16:35:35 morph-02 kernel: dlm: message size 8224 from 5 too big, buf len 3924
Oct 30 16:35:35 morph-02 kernel: dlm: error reading from sctp socket: -7
==========================================================================
Oct 30 16:22:49 morph-03 qarshd[5621]: Running cmdline:
/usr/tests/sts/bin/placemaker -c '/usr/tests/sts/bin/genesis -i 5 -d 2 -n 200'
-d 3 -w 3 -p /mnt/morph-cluster0/morph-03 
Oct 30 16:23:16 morph-03 kernel: dlm: morph-cluster0: recover 7
Oct 30 16:23:16 morph-03 kernel: dlm: morph-cluster0: remove member 1
Oct 30 16:23:17 morph-03 kernel: dlm: morph-cluster0: total members 4 error 0
Oct 30 16:23:17 morph-03 kernel: dlm: morph-cluster0: dlm_recover_directory
Oct 30 16:23:17 morph-03 kernel: dlm: morph-cluster0: dlm_recover_directory
16095 entries
Oct 30 16:23:17 morph-03 kernel: dlm: morph-cluster0: dlm_purge_locks
Oct 30 16:23:17 morph-03 kernel: dlm: morph-cluster0: dlm_recover_masters
Oct 30 16:23:17 morph-03 kernel: dlm: morph-cluster0: dlm_recover_masters 2861
resources
Oct 30 16:23:17 morph-03 kernel: dlm: morph-cluster0: dlm_recover_locks
Oct 30 16:23:17 morph-03 kernel: dlm: morph-cluster0: dlm_recover_locks 2680 locks
Oct 30 16:28:21 morph-03 kernel: dlm: Error sending to node 2 -32
Oct 30 16:28:21 morph-03 kernel: dlm: Initiating association with node 2
Oct 30 16:28:21 morph-03 kernel: dlm: Error sending to node 2 -32
Oct 30 16:28:21 morph-03 last message repeated 4 times
==========================================================================
Oct 30 16:23:04 morph-04 qarshd[5589]: Running cmdline:
/usr/tests/sts/bin/placemaker -c '/usr/tests/sts/bin/genesis -i 5 -d 2 -n 200'
-d 3 -w 3 -p /mnt/morph-cluster0/morph-04 
Oct 30 16:23:22 morph-04 kernel: dlm: morph-cluster0: recover 5
Oct 30 16:23:22 morph-04 kernel: dlm: morph-cluster0: remove member 1
Oct 30 16:23:22 morph-04 kernel: dlm: morph-cluster0: total members 4 error 0
Oct 30 16:23:22 morph-04 kernel: dlm: morph-cluster0: dlm_recover_directory
Oct 30 16:23:22 morph-04 kernel: dlm: morph-cluster0: dlm_recover_directory
15871 entries
Oct 30 16:23:22 morph-04 kernel: dlm: morph-cluster0: dlm_purge_locks
Oct 30 16:23:22 morph-04 kernel: dlm: morph-cluster0: dlm_recover_masters
Oct 30 16:23:22 morph-04 kernel: dlm: morph-cluster0: dlm_recover_masters 2861
resources
Oct 30 16:23:22 morph-04 kernel: dlm: morph-cluster0: dlm_recover_locks
Oct 30 16:23:22 morph-04 kernel: dlm: morph-cluster0: dlm_recover_locks 2021 locks
==========================================================================
Oct 30 16:23:03 morph-05 qarshd[5615]: Running cmdline:
/usr/tests/sts/bin/placemaker -c '/usr/tests/sts/bin/genesis -i 5 -d 2 -n 200'
-d 3 -w 3 -p /mnt/morph-cluster0/morph-05 
Oct 30 16:23:12 morph-05 kernel: dlm: morph-cluster0: recover 3
Oct 30 16:23:12 morph-05 kernel: dlm: morph-cluster0: remove member 1
Oct 30 16:23:12 morph-05 kernel: dlm: morph-cluster0: total members 4 error 0
Oct 30 16:23:12 morph-05 kernel: dlm: morph-cluster0: dlm_recover_directory
Oct 30 16:23:13 morph-05 kernel: dlm: morph-cluster0: dlm_recover_directory
15872 entries
Oct 30 16:23:13 morph-05 kernel: dlm: morph-cluster0: dlm_purge_locks
Oct 30 16:23:13 morph-05 kernel: dlm: morph-cluster0: dlm_recover_masters
Oct 30 16:23:13 morph-05 kernel: dlm: morph-cluster0: dlm_recover_masters 2861
resources
Oct 30 16:23:13 morph-05 kernel: dlm: morph-cluster0: dlm_recover_locks
Oct 30 16:23:13 morph-05 kernel: dlm: morph-cluster0: dlm_recover_locks 2690 locks
Oct 30 16:28:17 morph-05 kernel: dlm: Error sending to node 2 -32
Oct 30 16:28:17 morph-05 kernel: dlm: Initiating association with node 2
Oct 30 16:28:17 morph-05 kernel: dlm: Error sending to node 2 -32
==========================================================================

Expected results:
The umount should complete on all nodes.

Additional info:

Comment 1 Kiersten (Kerri) Anderson 2006-10-31 21:36:12 UTC
Switching this one to dlm-kernel.  We can use this bugzilla for the possible
switch from sctp to tcp in the dlm module depending on the effort required.

Comment 2 Kiersten (Kerri) Anderson 2006-10-31 21:38:18 UTC
Setting beta2 blocker flags and devel ACK.  This one is related to the sctp
issues, and whether we continue to use sctp needs to be assessed prior to the
beta2 release.

Comment 3 RHEL Program Management 2006-10-31 21:46:45 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux release.  Product Management has requested further review
of this request by Red Hat Engineering.  This request is not yet committed for
inclusion in release.

Comment 4 Kiersten (Kerri) Anderson 2006-11-02 18:18:15 UTC
Going to use this bugzilla for the switch from sctp to tcp for the dlm stack. 
Marking it as POSTED and changing the component to kernel.  Patch has been
posted to rhkernel-list with the title 
[RHEL5U2 BZ#211777 & others] dlm: add tcp communications

Link is 
http://post-office.corp.redhat.com/archives/rhkernel-list/2006-November/msg00084.html


Comment 5 Don Zickus 2006-11-06 03:28:34 UTC
in kernel-2.6.18-1.2744.el5

Comment 6 Linda Wang 2006-11-06 21:53:18 UTC
*** Bug 213965 has been marked as a duplicate of this bug. ***

Comment 7 Christine Caulfield 2007-01-23 15:27:33 UTC
Reopening this bug as I've seen it in testing on TCP.

Comment 8 David Teigland 2007-01-23 20:13:09 UTC
This was a beta-blocker for RHEL50 back in the day, and reopening it has
put it back at the top of the stop-everything-until-this-is-fixed list, so
I gather folks are suddenly panicking left and right over this bug, oops.
Not to worry, the "blocker issue" has not returned, and we probably should
have opened a new bug to track this "development issue".

I think Patrick saw this during his development testing; it may or may not
be the same issue that QE reported.  It's probably not RHEL5 code that
was being tested, but development code.  QE is _not_ seeing this on RHEL5.

This is not a data corruptor.  If you hit this problem, a dlm message goes
missing and some node sits and waits forever for it; whatever is using the
dlm (usually gfs) will sit and wait forever, i.e. hang.


Comment 9 Benjamin Kahn 2007-01-23 20:14:37 UTC
Updating flag data.

Comment 10 Christine Caulfield 2007-01-24 09:57:51 UTC
OK, sorry to alarm everyone. 

I now think this only affects the new lowcomms code in the head of the git tree
and I'm working on a fix. Yesterday it looked as though it was a more general
problem but that doesn't seem to be the case.

Comment 11 Christine Caulfield 2007-01-24 11:29:06 UTC
Patch posted to cluster-devel

Comment 14 Robert Peterson 2007-03-08 23:03:15 UTC
I did a git clone to pick up the kernel from Steve Whitehouse's latest nmw
git tree, then verified that its lowcomms-tcp.c contained the patches posted
to cluster-devel around 24 January as:
[Cluster-devel] [PATCH] dlm: fix lowcomms receiving
and
[Cluster-devel] [PATCH] dlm: Make sock_sem into a mutex

Then I did a make clean ; make modules; make modules_install; make install
of the new kernel on all nodes.  I rebooted all nodes and made sure they
were running the new kernel.  Then I did a ./configure to point my
cluster code source at that kernel, did a make clean; make; make install
of gfs.  Then I loaded everything and ran gfs_fsck_stress from try.
It panicked right after the same messages.  On the console of trin-11:

dlm: bobs_gfs: remove fr 2 none
dlm: message size 8224 from 2 too big, buf len 4320
dlm: bobs_gfs: remove fr 2 none
dlm: message size 8224 from 2 too big, buf len 8080
BUG: unable to handle kernel NULL pointer dereference at virtual address 00000008
 printing eip:
c010230f
*pde = 00000000
Oops: 0000 [#1]
SMP 
Modules linked in: gfs lock_dlm gfs2 dlm configfs qla2xxx
CPU:    0
EIP:    0060:[<c010230f>]    Not tainted VLI
EFLAGS: 00010086   (2.6.21-rc3 #4)
EIP is at __switch_to+0x10e/0x1a7
eax: 00000000   ebx: c13fa000   ecx: c13fa000   edx: 00000000
esi: c052b5f0   edi: d44a6a70   ebp: c052b440   esp: c0591f44
ds: 007b   es: 007b   fs: 00d8  gs: 0000  ss: 0068
Process dlm_recv/0 (pid: 6925, ti=c0590000 task=d44a6a70 task.ti=d1fa4000)
Stack: d44a6c20 c13f1f00 00000000 00000000 c13f4c60 c052b440 d1fa5f8c c040a755 
       c0591fcc 00000046 c13f6668 00000020 00000046 00000010 c0105094 c010ef77 
       00000001 c052b440 2f565cf0 00000129 00017427 c052b550 c13f4c60 00000000 
Call Trace:
 [<c040a755>] __sched_text_start+0x715/0x7c4
 [<c0105094>] do_IRQ+0x7e/0x92
 [<c010ef77>] smp_apic_timer_interrupt+0x6e/0x7a
 [<c040007b>] rpcauth_unbindcred+0x42/0x52
 [<c010171a>] cpu_idle+0x70/0x72
 [<c0597cc6>] start_kernel+0x409/0x411
 [<c059749b>] unknown_bootoption+0x0/0x202
 =======================
Code: 4e 08 8b 58 02 8b 46 0c 89 4b 38 89 43 3c 8b 44 24 08 03 14 85 80 d3 58 c0
8b 46 14 8b 4a 02 8b 56 10 89 41 44 89 51 40 8b 45 04 <f7> 40 08 00 00 06 00 75
0d 8b 47 04 8b 40 08 a9 00 00 04 00 74 
EIP: [<c010230f>] __switch_to+0x10e/0x1a7 SS:ESP 0068:c0591f44

Systems are: trin-09 trin-10 trin-11.lab.msp.redhat.com

To recreate the failure:
1. Reboot all nodes
2. On all nodes, do service cman start
3. On all nodes, do service clvmd start
4. On all nodes, do modprobe gfs
5. On all nodes, do mount -tgfs /dev/bob_vg/lvol0 /mnt/gfs1
6. From another terminal do:
   ssh to system "try", then:
   cd /local/bob/sts-rhel5/sts-root
   gfs/bin/gfs_fsck_stress -l $PWD -r /usr/tests/sts/ -f
/local/bob/sts-rhel5/sts-root/var/share/resource_files/trin.xml

I'm concerned that the patch that was posted may not have fixed the
problem, at least in the upstream kernel.  Either that, or we have
another problem.


Comment 15 Robert Peterson 2007-03-12 20:21:29 UTC
Created attachment 149859 [details]
Breakdown of midcomm's buffer

I created a patch to midcomms that dumps the dlm buffer when the error
is detected.  I took the resulting output and broke it down into its
constituent dlm messages, and this attachment is what I came up with.
It contains my notes, the most useful of which are at the very bottom.
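
As an illustration of the sort of instrumentation described above (hypothetical
code, not the actual patch; the function and parameter names are made up), the
dump can be produced by a helper that prints the rejected buffer 16 bytes per
line when midcomms reports "message size ... too big":

static void dlm_dump_rejected_buf(const unsigned char *buf, int buflen,
				  int nodeid, int msglen)
{
	char line[3 * 16 + 1];
	int i, j, n;

	printk(KERN_ERR "dlm: message size %d from %d too big, buf len %d\n",
	       msglen, nodeid, buflen);

	/* hex-dump the whole buffer, 16 bytes per line, for offline analysis */
	for (i = 0; i < buflen; i += 16) {
		n = 0;
		for (j = i; j < buflen && j < i + 16; j++)
			n += scnprintf(line + n, sizeof(line) - n,
				       " %02x", buf[j]);
		printk(KERN_ERR "dlm buf %04x:%s\n", i, line);
	}
}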

It looks as though the starting offset in the buffer was at the end
of the page: offset 0xfd0.  I marked this in the buffer as
"Probable starting point".  The only conclusion I can reach looking 
at the breakdown is that it seems as if the buffer wrap likely 
introduced 8 extra bytes.  Perhaps a pointer rounding error.  I need
to take a careful look at this patch again.

All messages in the buffer were m_type 0x0000000c == DLM_MSG_REMOVE.
All messages in the buffer had a length of 0x0070.
This means it isn't as complex a scenario as bz 214585.

Comment 17 Christine Caulfield 2007-03-13 10:55:05 UTC
Two questions whose answers would illuminate this further:
1. Is this corruption happening on receive or on send?
2. What was passed into midcomms from lowcomms?  i.e. what was the buffer
received by recvmsg, and what are the circular buffer parameters set to?

It looks like a new message has just been plonked into the middle of an existing
one, but without knowing what the buffer start and offsets are, it's hard to
determine in which order the messages arrived (a sequence number would be handy here!)

If the last message in that list is the one just received, then there is no
reason why it shouldn't be where it is (the stuff before and after it could be
just junk left in the circular buffer), so it /could/ be just the length that is
wrong, causing midcomms to read past a valid message into open air.

My point is that the data in that buffer, on its own, is not indicative of a
problem - though obviously there is one.

Oh, and I'm not convinced by the 8-byte theory either, sorry. That buffer (if
it's all current data) is more screwed up than 8 bytes; 0x30 or 0x40 would be
nearer the mark. And anyway, I've seen the buffer wrap round correctly so many
times that I don't think the algorithm is intrinsically wrong.

Given that this seems to need high load to produce (and all those REMOVE
messages /have/ to be a clue!), I suspect that most of the data in the buffer
is valid (or was very recently), so it would really be handy to know where it
actually starts and ends.
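
For reference, the "circular buffer parameters" asked about above are the
per-connection bookkeeping lowcomms keeps for its one-page receive buffer.
A simplified sketch, with assumed names (not verbatim lowcomms source):

struct cbuf {
	unsigned int base;	/* offset of the first valid byte */
	unsigned int len;	/* number of valid bytes */
	unsigned int mask;	/* buffer size - 1 (power of two) */
};

/* offset just past the valid data, i.e. where recvmsg() writes next */
#define cbuf_data(cb)	(((cb)->base + (cb)->len) & (cb)->mask)

/* account for newly received bytes */
static inline void cbuf_add(struct cbuf *cb, int n)
{
	cb->len += n;
}

/* consume bytes that midcomms has finished parsing */
static inline void cbuf_eat(struct cbuf *cb, int n)
{
	cb->len  -= n;
	cb->base += n;
	cb->base &= cb->mask;
}

Knowing base and len at the time of the failure would show where the received
data actually started and ended within the dumped page.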

Comment 18 Christine Caulfield 2007-03-13 12:00:23 UTC
Created attachment 149917 [details]
Possible fix

I spotted a potential for corruption in the send path. If there are enough
messages to create a second writequeue entry and the first one fails to send
completely (because the send buffer is full), lowcomms will happily move on to
the second writequeue buffer rather than waiting and retrying the first. This
could cause the data from the second writequeue entry to be intermingled with
the first.
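
A rough sketch of the hazard (assumed structure and helper names such as
send_entry() and struct writequeue_entry; this is not the attached patch):

static void process_writequeue(struct connection *con)
{
	struct writequeue_entry *e, *tmp;

	list_for_each_entry_safe(e, tmp, &con->writequeue, list) {
		int sent = send_entry(con, e);	/* may be a short send */

		if (sent < 0)
			break;			/* try again later */
		if (sent < e->len) {
			/*
			 * Partial send (socket buffer full).  The rest of
			 * this entry must go out before anything from the
			 * next one; moving on here interleaves the two
			 * buffers on the wire, which is the corruption
			 * described above.
			 */
			e->offset += sent;
			e->len -= sent;
			break;			/* retry this entry later */
		}

		list_del(&e->list);
		kfree(e);
	}
}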

Comment 19 Robert Peterson 2007-03-13 15:02:59 UTC
Created attachment 149929 [details]
New buffer full

The patch didn't fix the problem, but the test did seem to run longer.
This attachment gives more information at the bottom.

Comment 20 Christine Caulfield 2007-03-13 15:23:50 UTC
Those are scary numbers!

Comment 21 Robert Peterson 2007-03-13 17:09:23 UTC
The problem seems to have been fixed by a line that was in Patrick's
source tree, but not in Steve Whitehouse's tree.  It's in the lowcomms-tcp.c
function receive_from_sock:

	iov[1].iov_len = 0;

If this value isn't zeroed, len can be unpredictably bigger, and that
seems to have caused the corruption.  So that line of code needs to get
into Steve Whitehouse's git tree, and someone (Patrick?) should verify
it is correct in the RHEL5 kernel.
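
For context, here is a simplified sketch of the receive path in question; the
structure and field names (struct connection, con->rx_page, con->cb, and the
cbuf helpers sketched above) are assumptions for illustration, not verbatim
lowcomms-tcp.c.  The circular page buffer can wrap, so recvmsg() may be given
two iovec elements and the total length is computed from both, which is why a
stale iov[1].iov_len inflates len whenever only the first element is set up.

static int receive_from_sock_sketch(struct connection *con)
{
	struct msghdr msg = {};
	struct kvec iov[2];
	int nvec = 1, len, ret;

	/* free space starts just past the valid data */
	iov[0].iov_base = page_address(con->rx_page) + cbuf_data(&con->cb);
	iov[0].iov_len  = con->cb.base - cbuf_data(&con->cb);
	iov[1].iov_len  = 0;			/* the missing line */

	if (cbuf_data(&con->cb) >= con->cb.base) {
		/* free region wraps: second element covers the page start */
		iov[0].iov_len  = PAGE_SIZE - cbuf_data(&con->cb);
		iov[1].iov_base = page_address(con->rx_page);
		iov[1].iov_len  = con->cb.base;
		nvec = 2;
	}
	len = iov[0].iov_len + iov[1].iov_len;

	ret = kernel_recvmsg(con->sock, &msg, iov, nvec, len,
			     MSG_DONTWAIT | MSG_NOSIGNAL);
	if (ret > 0)
		cbuf_add(&con->cb, ret);	/* then hand data to midcomms */
	return ret;
}

With the zeroing in place, the single-element case always contributes zero
from iov[1], so len matches what the buffer can actually hold.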


Comment 22 Robert Peterson 2007-03-13 17:47:16 UTC
The patch added in comment #18 doesn't seem to be necessary.
I reran the test without that one (just the iov_len = 0 patch) and it 
still seems to work properly.


Comment 23 Robert Peterson 2007-03-13 17:48:45 UTC
Incidentally, Patrick told me the fix is indeed in the RHEL5 kernel,
so that's good news.


Comment 25 David Teigland 2007-07-19 19:25:41 UTC
It appears to me that there was never a specific bug or patch associated with
this bz.  It seems this bz was eventually used to represent "problems with SCTP
in the dlm" and as the basis for the 5.0 patch that added the TCP option back into
the dlm.  This all happened prior to 5.0 GA, so this bz should have been closed
a long time ago.


Comment 26 Don Zickus 2007-07-19 19:39:01 UTC
Closing based on the previous comment.