Bug 213289
Summary: | [DLM] gfs umount hung, message size too big | ||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|
Product: | Red Hat Enterprise Linux 5 | Reporter: | Nate Straz <nstraz> | ||||||||
Component: | kernel | Assignee: | Christine Caulfield <ccaulfie> | ||||||||
Status: | CLOSED CURRENTRELEASE | QA Contact: | Dean Jansa <djansa> | ||||||||
Severity: | medium | Docs Contact: | |||||||||
Priority: | medium | ||||||||||
Version: | 5.1 | CC: | dzickus, lwang, rkenna, rpeterso, teigland | ||||||||
Target Milestone: | --- | Keywords: | Reopened | ||||||||
Target Release: | --- | ||||||||||
Hardware: | All | ||||||||||
OS: | Linux | ||||||||||
Whiteboard: | |||||||||||
Fixed In Version: | 5.0.0 | Doc Type: | Bug Fix | ||||||||
Doc Text: | Story Points: | --- | |||||||||
Clone Of: | Environment: | ||||||||||
Last Closed: | 2007-07-19 19:39:01 UTC | Type: | --- | ||||||||
Regression: | --- | Mount Type: | --- | ||||||||
Documentation: | --- | CRM: | |||||||||
Verified Versions: | Category: | --- | |||||||||
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||||||
Cloudforms Team: | --- | Target Upstream Version: | |||||||||
Embargoed: | |||||||||||
Attachments: |
|
Description
Nate Straz
2006-10-31 17:25:03 UTC
Switching this one to dlm-kernel. We can use this bugzilla for the possible switch from sctp to tcp in the dlm module depending on the effort required. Setting beta2 blocker flags and devel ACK. This one is related to the sctp issues and whether we continue to use sctp needs to be assessed prior to beta2 release. This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux release. Product Management has requested further review of this request by Red Hat Engineering. This request is not yet committed for inclusion in release. Going to use this bugzilla for the switch from sctp to tcp for the dlm stack. Marking it as POSTED and changing the component to kernel. Patch has been posted to rhkernel-list with the title [RHEL5U2 BZ#211777 & others] dlm: add tcp communications Link is http://post-office.corp.redhat.com/archives/rhkernel-list/2006-November/msg00084.html in kernel-2.6.18-1.2744.el5 *** Bug 213965 has been marked as a duplicate of this bug. *** Reopening this bug as I've seen it in testing on TCP. This was a beta-blocker for RHEL50 back in the day, and reopening it has put it back at the top of the stop-everything-until-this-is-fixed list, so I gather folks are suddenly panicking left and right over this bug, oops. Not to worry, the "blocker issue" has not returned, and we probably should have opened a new bug to track this "development issue". I think Patrick saw this during his development testing, it may or may not be the same issue that QE reported. It's probably not RHEL5 code that was being tested, but development code. QE is _not_ seeing this on RHEL5. This is not a data corruptor. If you hit this problem, a dlm message goes missing and some node sits and waits forever for it, i.e. whatever is using the dlm (usually gfs) will sit and wait forever, i.e. hang. Updating flag data. OK, sorry to alarm everyone. I now think this only affects the new lowcomms code in the head of the git tree and I'm working on a fix. Yesterday it looked as though it was a more general problem but that doesn't seem to be the case. Patch posted to cluster-devel I did a git clone to pick up Steve Whitehouse's latest nmw git tree, kernel, then I verified its lowcomms-tcp.c contained the patches posted in cluster-devel near 24 January as: [Cluster-devel] [PATCH] dlm: fix lowcomms receiving and [Cluster-devel] [PATCH] dlm: Make sock_sem into a mutex Then I did a make clean ; make modules; make modules_install; make install of the new kernel on all nodes. I rebooted all nodes and made sure they were running the new kernel. Then I did a ./configure to point my cluster code source at that kernel, did a make clean; make; make install of gfs. Then I loaded everything and ran gfs_fsck_stress from try. It panicked right after the same messages. On the console of trin-11: dlm: bobs_gfs: remove fr 2 none dlm: message size 8224 from 2 too big, buf len 4320 dlm: bobs_gfs: remove fr 2 none dlm: message size 8224 from 2 too big, buf len 8080 BUG: unable to handle kernel NULL pointer dereference at virtual address 00000008 printing eip: c010230f *pde = 00000000 Oops: 0000 [#1] SMP Modules linked in: gfs lock_dlm gfs2 dlm configfs qla2xxx CPU: 0 EIP: 0060:[<c010230f>] Not tainted VLI EFLAGS: 00010086 (2.6.21-rc3 #4) EIP is at __switch_to+0x10e/0x1a7 eax: 00000000 ebx: c13fa000 ecx: c13fa000 edx: 00000000 esi: c052b5f0 edi: d44a6a70 ebp: c052b440 esp: c0591f44 ds: 007b es: 007b fs: 00d8 gs: 0000 ss: 0068 Process dlm_recv/0 (pid: 6925, ti=c0590000 task=d44a6a70 task.ti=d1fa4000) Stack: d44a6c20 c13f1f00 00000000 00000000 c13f4c60 c052b440 d1fa5f8c c040a755 c0591fcc 00000046 c13f6668 00000020 00000046 00000010 c0105094 c010ef77 00000001 c052b440 2f565cf0 00000129 00017427 c052b550 c13f4c60 00000000 Call Trace: [<c040a755>] __sched_text_start+0x715/0x7c4 [<c0105094>] do_IRQ+0x7e/0x92 [<c010ef77>] smp_apic_timer_interrupt+0x6e/0x7a [<c040007b>] rpcauth_unbindcred+0x42/0x52 [<c010171a>] cpu_idle+0x70/0x72 [<c0597cc6>] start_kernel+0x409/0x411 [<c059749b>] unknown_bootoption+0x0/0x202 ======================= Code: 4e 08 8b 58 02 8b 46 0c 89 4b 38 89 43 3c 8b 44 24 08 03 14 85 80 d3 58 c0 8b 46 14 8b 4a 02 8b 56 10 89 41 44 89 51 40 8b 45 04 <f7> 40 08 00 00 06 00 75 0d 8b 47 04 8b 40 08 a9 00 00 04 00 74 EIP: [<c010230f>] __switch_to+0x10e/0x1a7 SS:ESP 0068:c0591f44 Systems are: trin-09 trin-10 trin-11.lab.msp.redhat.com To recreate the failure: 1. Reboot all nodes 2. On all nodes, do service cman start 3. On all nodes, do service clvmd start 4. On all nodes, do modprobe gfs 5. On all nodes, do mount -tgfs /dev/bob_vg/lvol0 /mnt/gfs1 6. From another terminal do: ssh to system "try", then: cd /local/bob/sts-rhel5/sts-root gfs/bin/gfs_fsck_stress -l $PWD -r /usr/tests/sts/ -f /local/bob/sts-rhel5/sts-root/var/share/resource_files/trin.xml I'm concerned that the patch that was posted may not have fixed the problem, at least in the upstream kernel. Either that, or we have another problem. Created attachment 149859 [details] Breakdown of midcomm's buffer I created a patch to midcomms that dumps the dlm buffer when the error is detected. I took the resulting output and broke it down into its constituant dlm messages, and this attachment is what I came up with. It contains my notes, the most useful of which are at the very bottom. It looks as though the starting offset in the buffer was at the end of the page: offset 0xfd0. I marked this in the buffer as "Probable starting point". The only conclusion I can reach looking at the breakdown is that it seems as if the buffer wrap likely introduced 8 extra bytes. Perhaps a pointer rounding error. I need to take a careful look at this patch again. All messages in the buffer were m_type 0x0000000c == DLM_MSG_REMOVE. All messages in the buffer had a length of 0x0070. Which means this isn't as complex a scenario as bz 214585. Two questions whose answers would illuminate this further... 1. Is this corruption happening on receive or send? 2. What was passed into midcomms from lowcomms ? ie what was the buffer received by recvmsg and what are the circular buffer parameters set to? It looks like a new message has been just plonked into the middle of an existing one, but without knowing what the buffer start & offsets are its hard to determine which order the messages arrived (a sequence number would be handy here!) If the last message in that list is the one just received then there is no reason why it shouldn't be where it is (the stuff before and after it could be just junk left in the circular buffer) so it /could/ be just the length that is wrong as causing midcomms to read past a valid message into open air. My point is that the data in that buffer, on its own, is not indicative of a problem - though obviously there is one. Oh, I and I'm not convinced by the 8-bytes theory either, sorry. That buffer (if it's all current data) is more screwed up than 8 bytes. 0x30 or 0x40 would be nearer the mark; and anyway I've seen the buffer wrap round correctly so many times that I don't think the algorithm is intrinsically wrong. Given that this seems to need high load (and all those REMOVE messages /have/ to be a clue!) to produce I suspect that most of the data in the buffer is valid (or was very recently) so it would really be handy to know where it actually starts and ends. Created attachment 149917 [details]
Possible fix
I spotted a potential for corruption in the send path. If there are enough
messages to create a second writequeue entry and the first one fails to send
completely (due to send buffer full), lowcomms will happily move onto the
second writequeue buffer rather than waiting and trying the first again. This
could cause the data from the second wq to be intermingled with the first.
Created attachment 149929 [details]
New buffer full
The patch didn't fix the problem, but it did seem to run longer.
This attachment gives more information at the bottom.
Those are scary numbers! The problem seems to have been fixed by a line that was in Patrick's source tree, but not in Steve Whitehouse's tree. It's in lowcomms-tcp.c function receive_from_sock. iov[1].iov_len = 0; If this value isn't zeroed, that causes len to be unpredictably bigger, and it seems to have caused the corruption. So that line of code needs to get into Steve Whitehouse's git tree, and someone (Patrick?) should verify it is correct in the RHEL5 kernel. The patch added in comment #18 doesn't seem to be necessary. I reran the test without that one (just the iov_len = 0 patch) and it still seems to work properly. Incidentally, Patrick told me the fix is indeed in the RHEL5 kernel, so that's good news. It appears to me that there was never a specific bug or patch associated with this bz. It seems this bz was eventually used to represent "problems with SCTP in the dlm" and as the basis for the 5.0 patch that added the TCP option back into the dlm. This all happened prior to 5.0 GA, so this bz should have been closed a long time ago. closing based on previous comment |