572695 – clurgmgrd[24623]: <err> #48: Unable to obtain cluster lock: Unknown error 65539

Summary: clurgmgrd[24623]: <err> #48: Unable to obtain cluster lock: Unknown error 65539

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Cluster Suite
Classification:	Retired
Component:	magma-plugins
Sub Component:
Version:	4
Hardware:	All
OS:	Linux
Priority:	urgent
Severity:	urgent
Target Milestone:	rc
Assignee:	Lon Hohberger
QA Contact:	Cluster QE
Docs Contact:
URL:
Whiteboard:
Duplicates (1):	619476
Depends On:
Blocks:	572792

Reported:	2010-03-11 20:49 UTC by Lon Hohberger
Modified:	2018-11-14 17:09 UTC
CC List:	10 users
Fixed In Version:	magma-plugins-1.0.15-2
Clone Of:	294491
Environment:
Last Closed:	2011-02-16 16:11:40 UTC
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHBA-2011:0268	0	normal	SHIPPED_LIVE	magma-plugins bug fix update	2011-02-16 16:11:35 UTC

Description Lon Hohberger 2010-03-11 20:49:44 UTC

+++ This bug was initially created as a clone of Bug #294491 +++

Description of problem:

unexpected reboot

Sep 16 23:10:30 tedse-pro1 clurgmgrd[24623]: <err> #48: Unable to obtain cluster lock: Unknown error 65539
Sep 16 23:10:30 tedse-pro1 clurgmgrd[24623]: <err> #48: Unable to obtain cluster lock: Unknown error 65539
Sep 16 23:10:30 tedse-pro1 clurgmgrd[24623]: <err> #50: Unable to obtain cluster lock: Unknown error 65539
Sep 16 23:10:30 tedse-pro1 clurgmgrd[24622]: <crit> Watchdog: Daemon died, rebooting...
Sep 16 23:10:31 tedse-pro1 kernel: md: stopping all md devices.
Sep 16 23:10:31 tedse-pro1 kernel: md: md0 switched to read-only mode.
Sep 16 23:14:36 tedse-pro1 syslogd 1.4.1: restart.
Sep 16 23:14:36 tedse-pro1 syslog: syslogd startup succeeded
Sep 16 23:14:36 tedse-pro1 kernel: klogd 1.4.1, log source = /proc/kmsg started.
Sep 16 23:14:36 tedse-pro1 kernel: Linux version 2.6.9-42.0.3.ELsmp (brewbuilder.redhat.com) (gcc version 3.4.6 20060404 (Red Hat 3.4.6-3)) #1 SMP Mon Sep 25 17:28:02 EDT 2006
Sep 16 23:14:36 tedse-pro1 kernel: BIOS-provided physical RAM map:


--- Additional comment from lhh on 2010-02-16 15:21:56 EST ---

I believe I figured out how we get error #65539, but it turns out that it occurs only when other bad stuff happens.

sm.c:745 - sm_lock begin
sm.c:811 - call _dlm_lock
sm.c:568 - _dlm_lock begin
sm.c:574 - call libdlm.c: dlm_ls_lock
libdlm.c:538 - dlm_ls_lock begin
libdlm.c:595 - lksb->sb_status = EINPROG (= 65539)
libdlm.c:604 - call dlm_write (write) - --fails--
libdlm.c:608 - dlm_ls_lock return -1
sm.c:574 - ret = -1
sm.c:579 - _dlm_lock return -1
sm.c:811 - ret = -1
sm.c:824 - checking errno, which is set to whatever write() returned
sm.c:840 - errno wasn't EAGAIN
sm.c:842 - ret = lksb->sb_status (= EINPROG = 65539)
sm.c:844 - errno = ret (= 65539)
sm.c:845 - sm_lock return -1

At this point, rgmanager spits out the warning.

Unfortunately, we don't know what bad value write() returned at this point - but it should not matter.
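
For illustration, here is a minimal sketch of the error path traced above. It is not the actual sm.c or libdlm.c code: every sketch_* name is hypothetical, and the value 65539 for EINPROG is simply taken from the trace. The point is that whatever errno the failed write() sets, the caller ends up seeing 65539.

```c
#include <errno.h>

/* Assumed value, copied from the trace (EINPROG = 65539). */
#define SKETCH_EINPROG 65539

/* Hypothetical stand-in for the lock status block; only sb_status matters. */
struct sketch_lksb {
    int sb_status;
};

/* Stand-in for dlm_ls_lock(): marks the request as in progress, then
 * simulates write() failing with the given errno. */
static int sketch_dlm_ls_lock(struct sketch_lksb *lksb, int write_errno)
{
    lksb->sb_status = SKETCH_EINPROG; /* libdlm.c:595 */
    errno = write_errno;              /* libdlm.c:604, write() fails */
    return -1;                        /* libdlm.c:608 */
}

/* Stand-in for the pre-fix sm_lock() error handling. */
static int sketch_sm_lock_buggy(int write_errno)
{
    struct sketch_lksb lksb = { 0 };
    int ret = sketch_dlm_ls_lock(&lksb, write_errno);

    if (ret == -1 && errno != EAGAIN) {
        ret = lksb.sb_status; /* sm.c:842, still EINPROG */
        errno = ret;          /* sm.c:844, the real cause is lost here */
        return -1;            /* sm.c:845 */
    }
    return ret;
}
```

Calling sketch_sm_lock_buggy(EINTR) returns -1 with errno set to 65539. Since that is not a valid errno value, strerror() on glibc would be expected to report it as an unknown error, which is consistent with the log message above.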

The call to dlm_ls_lock should check for:
 - EAGAIN (handled), EINTR -> retry
 - EBADF, EFAULT, EFBIG, EINVAL, EIO, ENOSPC, EPIPE -> fatal
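
The classification above could be sketched roughly as follows. This is an illustration only, not the code from the patch: the enum and function names are hypothetical, and treating any unlisted errno as fatal is an assumption.

```c
#include <errno.h>

/* Hypothetical result type: what the caller should do after a failed write. */
enum sketch_action { SKETCH_RETRY, SKETCH_FATAL };

/* Map the errno left behind by a failed write() to an action. */
static enum sketch_action sketch_classify_write_errno(int err)
{
    switch (err) {
    case EAGAIN: /* already handled before the fix */
    case EINTR:  /* interrupted by a signal: safe to retry */
        return SKETCH_RETRY;
    case EBADF:
    case EFAULT:
    case EFBIG:
    case EINVAL:
    case EIO:
    case ENOSPC:
    case EPIPE:
        return SKETCH_FATAL;
    default:
        /* Assumption: anything not listed is treated as fatal too. */
        return SKETCH_FATAL;
    }
}
```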

My guess as to what happened is that rgmanager received a signal at the exact moment of the write call, causing EINTR to be returned to magma-plugins.  That EINTR was then overwritten with the value of lksb->sb_status (EINPROG) in the manner described above, causing both an incorrect warning in the system logs and improper behavior.

--- Additional comment from lhh on 2010-02-16 16:41:04 EST ---

Created an attachment (id=394649)
magma-plugins: Handle other return values from dlm_lock

If writing to the dlm lockspace file descriptor failed
due to delivery of a signal, libdlm and magma-plugins were
passing EINTR back up to the caller instead of
retrying the lock or unlock request.

Additionally, when this occurred, we were overwriting errno
with the value of lksb->sb_status, which was always EINPROG
(65539) if write(2) returned any error other than
EAGAIN.
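
A rough sketch of the behavior this patch description aims for: retry on EINTR or EAGAIN, and leave errno untouched for anything else. This is not the attached patch. All names are hypothetical, the lock attempt is passed in as a callback purely so the sketch can be exercised without a DLM, and the retry cap (max_retries) is an assumption; the real code may well retry without a limit.

```c
#include <errno.h>

/* Hypothetical callback type: one lock or unlock attempt.
 * Returns -1 with errno set on failure. */
typedef int (*sketch_lock_fn)(void *ctx);

/* Retry transient failures; pass every other error through unchanged. */
static int sketch_lock_with_retry(sketch_lock_fn attempt, void *ctx,
                                  int max_retries)
{
    int tries = 0;

    for (;;) {
        int ret = attempt(ctx);
        if (ret != -1)
            return ret;
        if ((errno == EINTR || errno == EAGAIN) && tries++ < max_retries)
            continue; /* transient: try again */
        return -1;    /* errno is deliberately left as the attempt set it */
    }
}

/* Hypothetical test double: fails with EINTR a number of times, then
 * either succeeds (final_errno == 0) or fails with final_errno. */
struct sketch_stub {
    int eintr_failures;
    int final_errno;
    int calls;
};

static int sketch_stub_attempt(void *ctx)
{
    struct sketch_stub *s = ctx;

    s->calls++;
    if (s->eintr_failures > 0) {
        s->eintr_failures--;
        errno = EINTR;
        return -1;
    }
    if (s->final_errno != 0) {
        errno = s->final_errno;
        return -1;
    }
    return 0;
}
```

With a stub that is interrupted twice and then succeeds, the wrapper returns 0 after three attempts. With a stub that is interrupted once and then fails with EIO, it returns -1 and the caller still sees EIO in errno rather than 65539.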

--- Additional comment from lhh on 2010-02-16 16:53:42 EST ---

Created an attachment (id=394651)
magma-plugins: Handle other errors from dlm_lock

Update to previous patch.  Just better comments.

--- Additional comment from lhh on 2010-02-16 17:41:55 EST ---

Test srpm:

http://people.redhat.com/lhh/magma-plugins-1.0.15-1.1.src.rpm

You can rebuild this on your machine(s) by running:

  rpmbuild --rebuild magma-plugins-1.0.15-1.1.src.rpm

Note that you must have gcc, magma-devel, cman-devel, dlm-devel, gulm-devel, cman-kernheaders and possibly other packages installed in order to build.

Comment 2 Lon Hohberger 2010-03-11 22:31:06 UTC

http://git.fedorahosted.org/git/?p=cluster.git;a=commit;h=69f42e4e36aadfba659a7843303b99bf9064b9f1

Comment 6 Lon Hohberger 2010-10-22 14:32:44 UTC

*** Bug 619476 has been marked as a duplicate of this bug. ***

Comment 8 Florian Nadge 2011-01-03 12:14:27 UTC

    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
Previously, Magma_sm.so handled read errors from the Distributed Lock Manager (DLM) incorrectly and passed them up to callers as EINPROG, which caused errors with rgmanager and other applications. With this update, magma-plugins handles these errors correctly, and the issue is resolved.

Comment 9 errata-xmlrpc 2011-02-16 16:11:40 UTC

An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2011-0268.html
