Bug 821060 - dlm: make dlm_recv single threaded
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: kernel
Version: 6.3
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: rc
Target Release: 6.4
Assignee: David Teigland
QA Contact: Nate Straz
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2012-05-11 17:56 UTC by David Teigland
Modified: 2013-02-21 06:12 UTC (History)

Fixed In Version: kernel-2.6.32-304.el6
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2013-02-21 06:12:39 UTC
Target Upstream Version:
Embargoed:


Attachments
patch (936 bytes, text/plain), 2012-07-31 18:32 UTC, David Teigland


Links
System | ID | Private | Priority | Status | Summary | Last Updated
Red Hat Product Errata | RHSA-2013:0496 | 0 | normal | SHIPPED_LIVE | Important: Red Hat Enterprise Linux 6 kernel update | 2013-02-20 21:40:54 UTC

Description David Teigland 2012-05-11 17:56:36 UTC
Description of problem:

Nate hit this problem while doing clvm tests on buzz 01-05.

[root@buzz-01 ~]# dlm_tool lockdebug clvmd
Resource len 64  "3jjUnzbhMGE8eVRtuIt2uenthHP6MUIH6PgFiLBO0Z7JP7yuzRRx2VLKvOe8w3rj"
Local 3
Granted
012f0001 CR      Master:   3 03f60001


[root@buzz-02 ~]# dlm_tool lockdebug clvmd
Resource len 64  "3jjUnzbhMGE8eVRtuIt2uenthHP6MUIH6PgFiLBO0Z7JP7yuzRRx2VLKvOe8w3rj"
Local 3
Granted
02cc0002 CR      Master:   3 01d00001

[root@buzz-03 ~]# dlm_tool lockdebug clvmd
Resource len 64  "3jjUnzbhMGE8eVRtuIt2uenthHP6MUIH6PgFiLBO0Z7JP7yuzRRx2VLKvOe8w3rj"
Master
Granted
03f60001 CR      Remote:   1 012f0001
01d00001 CR      Remote:   2 02cc0002
00940001 CR      Remote:   5 038f0001
01850001 CR      Remote:   4 034f0001
01330001 CR

[root@buzz-04 ~]# dlm_tool lockdebug clvmd
Resource len 64  "3jjUnzbhMGE8eVRtuIt2uenthHP6MUIH6PgFiLBO0Z7JP7yuzRRx2VLKvOe8w3rj"
Local 3
Granted
034f0001 CR      Master:   3 01850001

[root@buzz-05 ~]# dlm_tool lockdebug clvmd
Resource len  9  "P_#global"
Master
Granted
01cb0001 PR

Resource len 64  "3jjUnzbhMGE8eVRtuIt2uenthHP6MUIH6PgFiLBO0Z7JP7yuzRRx2VLKvOe8w3rj"
Local 3
Granted
038f0001 CR      Master:   3 00940001

Resource len 18  "V_stripe_8_4096_16"
Local 1          flags 00000000 first_lkid 1360003 root 0 recover 0 locks 0

Expecting reply
nodeid  1 msg request lkid 01360003 resource "V_stripe_8_4096_16 D;"

dmesg from buzz-05
dlm: clvmd: total members 5 error 0
dlm: clvmd: dlm_recover_directory
dlm: clvmd: dlm_recover_directory 0 entries
dlm: clvmd: recover 5 done: 147 ms
dlm: clvmd: receive_request_reply 1360003 1 master diff 1 -53
dlm: clvmd: receive_request_reply 1360003 1 master diff 1 -53
dlm: clvmd: receive_request_reply 1360003 1 master diff 1 -53
dlm: clvmd: receive_request_reply 1360003 1 master diff 1 -53
dlm: clvmd: receive_request_reply 1360003 1 master diff 1 -53
dlm: clvmd: receive_request_reply 1360003 1 master diff 1 -53
dlm: clvmd: receive_request_reply 1360003 1 master diff 1 -53
dlm: clvmd: receive_request_reply 1360003 1 master diff 1 -53
dlm: clvmd: receive_request_reply 1360003 1 master diff 1 -53
dlm: clvmd: receive_request_reply 1360003 1 master diff 1 -53
dlm: clvmd: receive_request_reply 1360003 1 master diff 1 -53
dlm: clvmd: receive_request_reply 1360003 1 master diff 1 -53
dlm: clvmd: receive_request_reply 1360003 1 master diff 1 -53
dlm: clvmd: receive_request_reply 1360003 1 master diff 1 -53
dlm: clvmd: receive_request_reply 1360003 1 master diff 1 -53
dlm: clvmd: receive_request_reply from 1 no lkb 1af0001

It appears that "1af0001" from nodeid 1 should have been "1360003".
On nodeid 1, receive_request() was probably returning another -53
error using setup_stub_lkb().  Because the stub_lkb is shared
among all threads in the lockspace, I suspect the lkid value
is being clobbered by multiple threads using the stub_lkb (a similar
problem with the stub_ms was fixed upstream).  Ultimately, the
stub_lkb should either have a lock protecting it or be eliminated.
But none of this would happen unless there were multiple dlm_recv
threads using it (either concurrently, or maybe due to one sleeping
in lowcomms_get_buffer).
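The suspected clobbering can be illustrated with a small user-space model. This is only a sketch of the hypothesis above, not dlm code: the struct and the functions setup_stub(), send_error_reply(), reply_serial() and reply_interleaved() are hypothetical names, and the concurrency is replaced by a fixed interleaving so the outcome is deterministic. The two lkid values are the ones seen in the dmesg output.

```c
#include <stdint.h>

/* Hypothetical stand-in for the per-lockspace stub lkb; only the one
 * field relevant to this bug is modeled. */
struct stub_lkb {
    uint32_t remote_lkid;   /* lkid echoed back to the requesting node */
};

/* A single stub shared by every receive thread in the lockspace. */
static struct stub_lkb shared_stub;

/* Step 1 of an error reply: record the requester's lkid in the stub. */
static void setup_stub(uint32_t requester_lkid)
{
    shared_stub.remote_lkid = requester_lkid;
}

/* Step 2: build the reply from whatever the stub holds at that moment. */
static uint32_t send_error_reply(void)
{
    return shared_stub.remote_lkid;
}

/* One receiver runs both steps back to back: the reply carries the
 * requester's own lkid. */
uint32_t reply_serial(uint32_t lkid)
{
    setup_stub(lkid);
    return send_error_reply();
}

/* Receiver A sets up the stub for lkid_a, then receiver B overwrites it
 * for lkid_b before A has sent its reply. Returns the lkid that ends up
 * in A's reply. */
uint32_t reply_interleaved(uint32_t lkid_a, uint32_t lkid_b)
{
    setup_stub(lkid_a);         /* receiver A, step 1 */
    setup_stub(lkid_b);         /* receiver B, step 1 (clobbers A's value) */
    return send_error_reply();  /* receiver A, step 2 */
}
```

With these definitions, reply_interleaved(0x01360003, 0x01af0001) yields 0x01af0001, which is the kind of mismatch that would produce "no lkb 1af0001" on the node that asked about 1360003.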

At some point in the distant past, dlm_recv was incorrectly set up as
a multithreaded workqueue, but the dlm has never really been made
to work correctly with multiple recv threads (the stub issues above
being one of the reasons).  For some reason, these multiple threads
apparently never ran concurrently, so they never posed a problem,
and we never fixed dlm_recv to be single threaded as it should be.

Eventually upstream, the wq threads began operating concurrently and
these problems started appearing, so we fixed dlm_recv to be single
threaded.  Making a similar fix in RHEL would probably resolve
the problem above.
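Why a single receive thread sidesteps the problem can be shown with another user-space sketch. It assumes, as the analysis above does, that the shared stub is only unsafe when two handlers overlap; the types and the functions handle_request() and drain_single_worker() are hypothetical and are not taken from the attached patch. In a 2.6.32-era kernel a single-threaded workqueue can be created with create_singlethread_workqueue(); whether the patch uses exactly that call is an assumption here.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical message and reply records for the model. */
struct msg {
    int      from_node;
    uint32_t lkid;
};

struct reply {
    int      to_node;
    uint32_t lkid;
};

/* Shared stub value, deliberately left without a lock, as in the report. */
static uint32_t stub_lkid;

/* Handle one request to completion: set up the stub, then build the
 * reply from it. Nothing else touches the stub in between. */
static struct reply handle_request(const struct msg *m)
{
    struct reply r;

    stub_lkid = m->lkid;
    r.to_node = m->from_node;
    r.lkid = stub_lkid;
    return r;
}

/* Model of a single worker draining the receive queue in FIFO order.
 * Writes one reply per message into out[] and returns how many replies
 * carry an lkid different from the one in their request. */
size_t drain_single_worker(const struct msg *queue, size_t n,
                           struct reply *out)
{
    size_t mismatches = 0;

    for (size_t i = 0; i < n; i++) {
        out[i] = handle_request(&queue[i]);
        if (out[i].lkid != queue[i].lkid)
            mismatches++;
    }
    return mismatches;
}
```

Because each handler finishes before the next one starts, the unlocked stub is never observed half-used, so the mismatch count stays at zero for any queue contents. Protecting or eliminating the stub would still be the more robust long-term change, as noted above.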



Comment 2 David Teigland 2012-05-11 18:03:52 UTC
This confirms that the analysis above was correct.
I found that 1af0001 was the lkid that nodeid 1 was sending back to
nodeid 4, also using stub_lkb:

dmesg from buzz-04
dlm: clvmd: receive_request_reply 1af0001 1 master diff 1 -53
dlm: clvmd: receive_request_reply 1af0001 1 master diff 1 -53
dlm: clvmd: receive_request_reply 1af0001 1 master diff 1 -53
dlm: clvmd: receive_request_reply 1af0001 1 master diff 1 -53

Comment 3 RHEL Program Management 2012-05-15 04:03:56 UTC
This request was not resolved in time for the current release.
Red Hat invites you to ask your support representative to
propose this request, if still desired, for consideration in
the next release of Red Hat Enterprise Linux.

Comment 4 David Teigland 2012-07-31 18:32:33 UTC
Created attachment 601563
patch

A scratch build with this patch:
http://brewweb.devel.redhat.com/brew/taskinfo?taskID=4702135

Comment 5 RHEL Program Management 2012-07-31 18:50:21 UTC
This request was evaluated by Red Hat Product Management for
inclusion in a Red Hat Enterprise Linux release.  Product
Management has requested further review of this request by
Red Hat Engineering, for potential inclusion in a Red Hat
Enterprise Linux release for currently deployed products.
This request is not yet committed for inclusion in a release.

Comment 6 David Teigland 2012-07-31 20:58:19 UTC
This bug was never repeatable AFAIK, so we'll just have to test broadly for no regressions.

Comment 8 Jarod Wilson 2012-08-31 18:44:14 UTC
Patch(es) available on kernel-2.6.32-304.el6

Comment 11 Nate Straz 2013-01-21 20:28:48 UTC
Verified this patch is included in kernel-2.6.32-350.el6 and found no regressions during testing.

Comment 13 errata-xmlrpc 2013-02-21 06:12:39 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHSA-2013-0496.html

