Description of problem:
Nate hit this problem while running clvmd tests on nodes buzz-01 through buzz-05.
[root@buzz-01 ~]# dlm_tool lockdebug clvmd
Resource len 64 "3jjUnzbhMGE8eVRtuIt2uenthHP6MUIH6PgFiLBO0Z7JP7yuzRRx2VLKvOe8w3rj"
Local 3
Granted
012f0001 CR Master: 3 03f60001
[root@buzz-02 ~]# dlm_tool lockdebug clvmd
Resource len 64 "3jjUnzbhMGE8eVRtuIt2uenthHP6MUIH6PgFiLBO0Z7JP7yuzRRx2VLKvOe8w3rj"
Local 3
Granted
02cc0002 CR Master: 3 01d00001
[root@buzz-03 ~]# dlm_tool lockdebug clvmd
Resource len 64 "3jjUnzbhMGE8eVRtuIt2uenthHP6MUIH6PgFiLBO0Z7JP7yuzRRx2VLKvOe8w3rj"
Master
Granted
03f60001 CR Remote: 1 012f0001
01d00001 CR Remote: 2 02cc0002
00940001 CR Remote: 5 038f0001
01850001 CR Remote: 4 034f0001
01330001 CR
[root@buzz-04 ~]# dlm_tool lockdebug clvmd
Resource len 64 "3jjUnzbhMGE8eVRtuIt2uenthHP6MUIH6PgFiLBO0Z7JP7yuzRRx2VLKvOe8w3rj"
Local 3
Granted
034f0001 CR Master: 3 01850001
[root@buzz-05 ~]# dlm_tool lockdebug clvmd
Resource len 9 "P_#global"
Master
Granted
01cb0001 PR
Resource len 64 "3jjUnzbhMGE8eVRtuIt2uenthHP6MUIH6PgFiLBO0Z7JP7yuzRRx2VLKvOe8w3rj"
Local 3
Granted
038f0001 CR Master: 3 00940001
Resource len 18 "V_stripe_8_4096_16"
Local 1 flags 00000000 first_lkid 1360003 root 0 recover 0 locks 0
Expecting reply
nodeid 1 msg request lkid 01360003 resource "V_stripe_8_4096_16 D;"
dmesg from buzz-05
dlm: clvmd: total members 5 error 0
dlm: clvmd: dlm_recover_directory
dlm: clvmd: dlm_recover_directory 0 entries
dlm: clvmd: recover 5 done: 147 ms
dlm: clvmd: receive_request_reply 1360003 1 master diff 1 -53
dlm: clvmd: receive_request_reply 1360003 1 master diff 1 -53
dlm: clvmd: receive_request_reply 1360003 1 master diff 1 -53
dlm: clvmd: receive_request_reply 1360003 1 master diff 1 -53
dlm: clvmd: receive_request_reply 1360003 1 master diff 1 -53
dlm: clvmd: receive_request_reply 1360003 1 master diff 1 -53
dlm: clvmd: receive_request_reply 1360003 1 master diff 1 -53
dlm: clvmd: receive_request_reply 1360003 1 master diff 1 -53
dlm: clvmd: receive_request_reply 1360003 1 master diff 1 -53
dlm: clvmd: receive_request_reply 1360003 1 master diff 1 -53
dlm: clvmd: receive_request_reply 1360003 1 master diff 1 -53
dlm: clvmd: receive_request_reply 1360003 1 master diff 1 -53
dlm: clvmd: receive_request_reply 1360003 1 master diff 1 -53
dlm: clvmd: receive_request_reply 1360003 1 master diff 1 -53
dlm: clvmd: receive_request_reply 1360003 1 master diff 1 -53
dlm: clvmd: receive_request_reply from 1 no lkb 1af0001
It appears that "1af0001" from nodeid 1 should have been "1360003".
On nodeid 1, receive_request() was probably returning another -53
error via setup_stub_lkb(). Because the stub_lkb is shared across
all threads for the lockspace, I suspect the lkid value is being
clobbered by multiple threads using the stub_lkb concurrently (a
similar problem with the stub_ms was fixed upstream). Ultimately,
the stub_lkb should either be protected by a lock or be eliminated.
But none of this could happen unless multiple dlm_recv threads were
using it (either truly concurrently, or perhaps because one slept in
lowcomms_get_buffer while another ran).
At some point in the distant past, dlm_recv was incorrectly set up as
a multithreaded workqueue, but the dlm has never really been made to
work correctly with multiple recv threads (the stub issues above
being one of the reasons). For whatever reason, those multiple
threads apparently never ran concurrently, so they never caused a
problem, and we never fixed dlm_recv to be single threaded as it
should be. Eventually, upstream, the workqueue threads began
operating concurrently and these problems started appearing, so we
fixed the workqueue to be single threaded. Making a similar fix in
RHEL would probably resolve the problem above.
Additional info:
A confirmation that the analysis above was correct: I found that
1af0001 was the lkid that node 1 was sending back to node 4, also via
the stub_lkb:
dmesg from buzz-04
dlm: clvmd: receive_request_reply 1af0001 1 master diff 1 -53
dlm: clvmd: receive_request_reply 1af0001 1 master diff 1 -53
dlm: clvmd: receive_request_reply 1af0001 1 master diff 1 -53
dlm: clvmd: receive_request_reply 1af0001 1 master diff 1 -53
Comment 3: RHEL Program Management, 2012-05-15 04:03:56 UTC
This request was not resolved in time for the current release.
Red Hat invites you to ask your support representative to
propose this request, if still desired, for consideration in
the next release of Red Hat Enterprise Linux.
Comment 5: RHEL Program Management, 2012-07-31 18:50:21 UTC
This request was evaluated by Red Hat Product Management for
inclusion in a Red Hat Enterprise Linux release. Product
Management has requested further review of this request by
Red Hat Engineering, for potential inclusion in a Red Hat
Enterprise Linux release for currently deployed products.
This request is not yet committed for inclusion in a release.
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.
For information on the advisory, and where to find the updated
files, follow the link below.
If the solution does not work for you, open a new bug report.
http://rhn.redhat.com/errata/RHSA-2013-0496.html