Bug 821060

Summary: dlm: make dlm_recv single threaded
Product: Red Hat Enterprise Linux 6
Component: kernel
Version: 6.3
Status: CLOSED ERRATA
Severity: unspecified
Priority: unspecified
Target Milestone: rc
Target Release: 6.4
Hardware: Unspecified
OS: Unspecified
Reporter: David Teigland <teigland>
Assignee: David Teigland <teigland>
QA Contact: Nate Straz <nstraz>
CC: nstraz, syeghiay
Fixed In Version: kernel-2.6.32-304.el6
Doc Type: Bug Fix
Type: Bug
Last Closed: 2013-02-21 06:12:39 UTC
Attachments:
Description Flags
patch none

Description David Teigland 2012-05-11 17:56:36 UTC
Description of problem:

Nate hit this problem while doing clvm tests on buzz 01-05.

[root@buzz-01 ~]# dlm_tool lockdebug clvmd
Resource len 64  "3jjUnzbhMGE8eVRtuIt2uenthHP6MUIH6PgFiLBO0Z7JP7yuzRRx2VLKvOe8w3rj"
Local 3
Granted
012f0001 CR      Master:   3 03f60001


[root@buzz-02 ~]# dlm_tool lockdebug clvmd
Resource len 64  "3jjUnzbhMGE8eVRtuIt2uenthHP6MUIH6PgFiLBO0Z7JP7yuzRRx2VLKvOe8w3rj"
Local 3
Granted
02cc0002 CR      Master:   3 01d00001

[root@buzz-03 ~]# dlm_tool lockdebug clvmd
Resource len 64  "3jjUnzbhMGE8eVRtuIt2uenthHP6MUIH6PgFiLBO0Z7JP7yuzRRx2VLKvOe8w3rj"
Master
Granted
03f60001 CR      Remote:   1 012f0001
01d00001 CR      Remote:   2 02cc0002
00940001 CR      Remote:   5 038f0001
01850001 CR      Remote:   4 034f0001
01330001 CR

[root@buzz-04 ~]# dlm_tool lockdebug clvmd
Resource len 64  "3jjUnzbhMGE8eVRtuIt2uenthHP6MUIH6PgFiLBO0Z7JP7yuzRRx2VLKvOe8w3rj"
Local 3
Granted
034f0001 CR      Master:   3 01850001

[root@buzz-05 ~]# dlm_tool lockdebug clvmd
Resource len  9  "P_#global"
Master
Granted
01cb0001 PR

Resource len 64  "3jjUnzbhMGE8eVRtuIt2uenthHP6MUIH6PgFiLBO0Z7JP7yuzRRx2VLKvOe8w3rj"
Local 3
Granted
038f0001 CR      Master:   3 00940001

Resource len 18  "V_stripe_8_4096_16"
Local 1          flags 00000000 first_lkid 1360003 root 0 recover 0 locks 0

Expecting reply
nodeid  1 msg request lkid 01360003 resource "V_stripe_8_4096_16 D;"

dmesg from buzz-05
dlm: clvmd: total members 5 error 0
dlm: clvmd: dlm_recover_directory
dlm: clvmd: dlm_recover_directory 0 entries
dlm: clvmd: recover 5 done: 147 ms
dlm: clvmd: receive_request_reply 1360003 1 master diff 1 -53
dlm: clvmd: receive_request_reply 1360003 1 master diff 1 -53
dlm: clvmd: receive_request_reply 1360003 1 master diff 1 -53
dlm: clvmd: receive_request_reply 1360003 1 master diff 1 -53
dlm: clvmd: receive_request_reply 1360003 1 master diff 1 -53
dlm: clvmd: receive_request_reply 1360003 1 master diff 1 -53
dlm: clvmd: receive_request_reply 1360003 1 master diff 1 -53
dlm: clvmd: receive_request_reply 1360003 1 master diff 1 -53
dlm: clvmd: receive_request_reply 1360003 1 master diff 1 -53
dlm: clvmd: receive_request_reply 1360003 1 master diff 1 -53
dlm: clvmd: receive_request_reply 1360003 1 master diff 1 -53
dlm: clvmd: receive_request_reply 1360003 1 master diff 1 -53
dlm: clvmd: receive_request_reply 1360003 1 master diff 1 -53
dlm: clvmd: receive_request_reply 1360003 1 master diff 1 -53
dlm: clvmd: receive_request_reply 1360003 1 master diff 1 -53
dlm: clvmd: receive_request_reply from 1 no lkb 1af0001

It appears that "1af0001" from nodeid 1 should have been "1360003".
On nodeid 1, receive_request() was probably returning another -53
error using setup_stub_lkb().  Because the stub_lkb is shared by
all threads in the lockspace, I suspect the lkid value is being
clobbered by multiple threads using the stub_lkb (a similar problem
with the stub_ms was fixed upstream).  Ultimately, the stub_lkb
should either be protected by a lock or be eliminated.  But none of
this would happen unless multiple dlm_recv threads were using it
(either concurrently, or maybe because one was sleeping in
lowcomms_get_buffer).
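
The suspected clobbering can be replayed by hand in a small userspace
model.  This is only a sketch, not the dlm code: the struct layouts and
the names stub_setup, stub_send, racy_reply_to_node5 and private_reply
are hypothetical, and the interleaving (a second thread overwriting the
stub between the first thread's setup and send) is an assumption based
on the analysis above.  The lkid values are the ones seen in the logs.

```c
#include <stdint.h>

/* Simplified, hypothetical stand-in for the shared per-lockspace stub. */
struct stub_lkb {
    uint32_t remid;   /* requester's lock id, echoed back in the reply */
};

struct lockspace {
    struct stub_lkb stub;   /* one instance shared by all receive threads */
};

struct reply {
    int to_nodeid;
    uint32_t lkid;
};

/* Step 1 of an error reply: record the requester's lkid in the shared stub. */
static void stub_setup(struct lockspace *ls, uint32_t lkid)
{
    ls->stub.remid = lkid;
}

/* Step 2: build the reply from whatever the shared stub holds right now. */
static struct reply stub_send(const struct lockspace *ls, int to_nodeid)
{
    struct reply r = { to_nodeid, ls->stub.remid };
    return r;
}

/*
 * Replays one bad interleaving deterministically: thread A handles a
 * request from node 5, and thread B handles a request from node 4
 * between A's two steps.
 */
static struct reply racy_reply_to_node5(struct lockspace *ls)
{
    stub_setup(ls, 0x01360003);   /* A: request from node 5 */
    stub_setup(ls, 0x01af0001);   /* B: request from node 4 overwrites it */
    return stub_send(ls, 5);      /* A: reply to node 5 carries B's lkid */
}

/* Eliminating the shared stub: each call keeps its own copy on the stack. */
static struct reply private_reply(int to_nodeid, uint32_t lkid)
{
    struct stub_lkb mine = { lkid };
    struct reply r = { to_nodeid, mine.remid };
    return r;
}
```

In this model the racy path sends lkid 1af0001 to node 5, which matches
the "no lkb 1af0001" message on buzz-05, while the per-call copy always
echoes the lkid it was given.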

At some point in the distant past, dlm_recv was incorrectly set up as
a multithreaded workqueue, but the dlm has never really been made
to work correctly with multiple recv threads (the stub issues above
being one of the reasons).  For some reason, these multiple threads
apparently never ran concurrently, so they never posed a problem,
and we never fixed dlm_recv to be single threaded as it should be.

Eventually, upstream, the wq threads began operating concurrently
and these problems started appearing, so we fixed dlm_recv to be
single threaded.  Making a similar fix in rhel would probably resolve
the problem above.
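
Why a single receive thread is enough can be shown with another small
userspace model.  Again this is only a sketch under assumptions: the
types and the names shared_stub_remid, handle_one and recv_worker are
hypothetical and do not come from the dlm or the patch.  The point is
that one worker handling each message to completion sets up and
consumes the shared stub before the next message can touch it.

```c
#include <stddef.h>
#include <stdint.h>

struct request { int from_nodeid; uint32_t lkid; };
struct reply   { int to_nodeid;   uint32_t lkid; };

/* Hypothetical shared scratch slot, standing in for the lockspace stub. */
static uint32_t shared_stub_remid;

/* Handle one request start to finish; nothing else runs in between. */
static struct reply handle_one(const struct request *rq)
{
    shared_stub_remid = rq->lkid;                            /* set up the stub */
    struct reply r = { rq->from_nodeid, shared_stub_remid }; /* consume it */
    return r;
}

/* Single receive worker: drain the queue in order, one item at a time. */
static size_t recv_worker(const struct request *queue, size_t n,
                          struct reply *out)
{
    for (size_t i = 0; i < n; i++)
        out[i] = handle_one(&queue[i]);
    return n;
}
```

With the two requests from the logs queued in either order, each reply
in this model carries the lkid of its own request, even though the
scratch slot is still shared.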



Comment 2 David Teigland 2012-05-11 18:03:52 UTC
This confirms that the analysis above was correct.
I found that 1af0001 was the lkid that node 1 was sending back to
node 4, also using stub_lkb:

dmesg from buzz-04
dlm: clvmd: receive_request_reply 1af0001 1 master diff 1 -53
dlm: clvmd: receive_request_reply 1af0001 1 master diff 1 -53
dlm: clvmd: receive_request_reply 1af0001 1 master diff 1 -53
dlm: clvmd: receive_request_reply 1af0001 1 master diff 1 -53

Comment 3 RHEL Program Management 2012-05-15 04:03:56 UTC
This request was not resolved in time for the current release.
Red Hat invites you to ask your support representative to
propose this request, if still desired, for consideration in
the next release of Red Hat Enterprise Linux.

Comment 4 David Teigland 2012-07-31 18:32:33 UTC
Created attachment 601563 [details]
patch

A scratch build with this patch:
http://brewweb.devel.redhat.com/brew/taskinfo?taskID=4702135

Comment 5 RHEL Program Management 2012-07-31 18:50:21 UTC
This request was evaluated by Red Hat Product Management for
inclusion in a Red Hat Enterprise Linux release.  Product
Management has requested further review of this request by
Red Hat Engineering, for potential inclusion in a Red Hat
Enterprise Linux release for currently deployed products.
This request is not yet committed for inclusion in a release.

Comment 6 David Teigland 2012-07-31 20:58:19 UTC
This bug was never repeatable AFAIK, so we'll just have to test broadly for no regressions.

Comment 8 Jarod Wilson 2012-08-31 18:44:14 UTC
Patch(es) available on kernel-2.6.32-304.el6

Comment 11 Nate Straz 2013-01-21 20:28:48 UTC
Verified this patch is included in kernel-2.6.32-350.el6 and found no regressions during testing.

Comment 13 errata-xmlrpc 2013-02-21 06:12:39 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHSA-2013-0496.html