+++ This bug was initially created as a clone of Bug #294491 +++

Description of problem:
unexpected reboot

Sep 16 23:10:30 tedse-pro1 clurgmgrd[24623]: <err> #48: Unable to obtain cluster lock: Unknown error 65539
Sep 16 23:10:30 tedse-pro1 clurgmgrd[24623]: <err> #48: Unable to obtain cluster lock: Unknown error 65539
Sep 16 23:10:30 tedse-pro1 clurgmgrd[24623]: <err> #50: Unable to obtain cluster lock: Unknown error 65539
Sep 16 23:10:30 tedse-pro1 clurgmgrd[24622]: <crit> Watchdog: Daemon died, rebooting...
Sep 16 23:10:31 tedse-pro1 kernel: md: stopping all md devices.
Sep 16 23:10:31 tedse-pro1 kernel: md: md0 switched to read-only mode.
Sep 16 23:14:36 tedse-pro1 syslogd 1.4.1: restart.
Sep 16 23:14:36 tedse-pro1 syslog: syslogd startup succeeded
Sep 16 23:14:36 tedse-pro1 kernel: klogd 1.4.1, log source = /proc/kmsg started.
Sep 16 23:14:36 tedse-pro1 kernel: Linux version 2.6.9-42.0.3.ELsmp (brewbuilder.redhat.com) (gcc version 3.4.6 20060404 (Red Hat 3.4.6-3)) #1 SMP Mon Sep 25 17:28:02 EDT 2006
Sep 16 23:14:36 tedse-pro1 kernel: BIOS-provided physical RAM map:

--- Additional comment from lhh on 2010-02-16 15:21:56 EST ---

I believe I figured out how we get error #65539, but it turns out that it occurs only when other bad stuff happens.

sm.c:745 - sm_lock begin
sm.c:811 - call _dlm_lock
sm.c:568 - _dlm_lock begin
sm.c:574 - call libdlm.c: dlm_ls_lock
libdlm.c:538 - dlm_ls_lock begin
libdlm.c:595 - lksb->sb_status = EINPROG (= 65539)
libdlm.c:604 - call dlm_write (write) - --fails--
libdlm.c:608 - dlm_ls_lock return -1
sm.c:574 - ret = -1
sm.c:579 - _dlm_lock return -1
sm.c:811 - ret = -1
sm.c:824 - checking errno, which is set to whatever write() returned
sm.c:840 - errno wasn't EAGAIN
sm.c:842 - ret = lksb->sb_status (= EINPROG = 65539)
sm.c:844 - errno = ret (= 65539)
sm.c:845 - sm_lock return -1

At this point, rgmanager spits out the warning. Unfortunately, we don't know what bad value write() returned at this point - but it should not matter.
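
For readers following the trace, here is a minimal C sketch of how the real errno gets lost. It is not the magma-plugins source: every `sketch_` name, the struct, and the constant are hypothetical stand-ins, and only the value 65539 and the order of steps are taken from the trace above.

```c
#include <errno.h>

/* Hypothetical constant: the EINPROG value quoted in the trace (65539). */
#define SKETCH_EINPROG 0x10003

/* Hypothetical stand-in for the lock status block used in the trace. */
struct sketch_lksb {
    int sb_status;
};

/*
 * Stand-in for a lock request whose underlying write fails:
 * the status is marked "in progress" first, then the write error
 * is reported through errno and -1 is returned.
 */
int sketch_failing_lock(struct sketch_lksb *lksb, int write_errno)
{
    lksb->sb_status = SKETCH_EINPROG;
    errno = write_errno;
    return -1;
}

/*
 * Mirrors the order of steps in the trace: on any failure other than
 * EAGAIN, errno is replaced by sb_status, so the original cause
 * (for example EINTR) is no longer visible to the caller.
 */
int sketch_sm_lock_old(struct sketch_lksb *lksb, int write_errno)
{
    int ret = sketch_failing_lock(lksb, write_errno);

    if (ret == -1 && errno != EAGAIN) {
        ret = lksb->sb_status; /* 65539 */
        errno = ret;           /* real write error is overwritten here */
        return -1;
    }
    return ret;                /* EAGAIN path is not shown in the trace */
}
```

Since 65539 is not a standard errno value, glibc's strerror() renders it as "Unknown error 65539", which matches the text in the log lines.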
The call to dlm_ls_lock should check for:
- EAGAIN (handled), EINTR -> retry
- EBADF, EFAULT, EFBIG, EINVAL, EIO, ENOSPC, EPIPE -> fatal

My guess as to what happened is that rgmanager received a signal at the exact moment of the write call, causing EINTR to be returned to magma-plugins. EINTR was then overwritten with the value of lksb->sb_status in the manner described above, causing both an incorrect warning in the system logs and improper behavior.

--- Additional comment from lhh on 2010-02-16 16:41:04 EST ---

Created an attachment (id=394649)
magma-plugins: Handle other return values from dlm_lock

If writing to the dlm lockspace file descriptor failed due to delivery of a signal, libdlm and magma-plugins were passing EINTR back up to the caller instead of retrying the lock or unlock request. Additionally, when this occurred, we were overwriting errno with the value of lksb->sb_status, which was always EINPROG (65539) if write(2) returned any error condition except EAGAIN.

--- Additional comment from lhh on 2010-02-16 16:53:42 EST ---

Created an attachment (id=394651)
magma-plugins: Handle other errors from dlm_lock

Update to previous patch. Just better comments.

--- Additional comment from lhh on 2010-02-16 17:41:55 EST ---

Test srpm:

http://people.redhat.com/lhh/magma-plugins-1.0.15-1.1.src.rpm

You can rebuild this on your machine(s) by running:

rpmbuild --rebuild magma-plugins-1.0.15-1.1.src.rpm

Note that you must have gcc, magma-devel, cman-devel, dlm-devel, gulm-devel, cman-kernheaders and possibly other packages installed in order to build.
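
The behavior described in the comments above (retry on EINTR, report every other error with errno left untouched) might look roughly like the following C sketch. This is not the attached patch. The callback type, the retry bound, and all `sketch_` names are assumptions made for illustration; a real caller would wrap its dlm_ls_lock request in the callback. EAGAIN is passed back unchanged here on the assumption that the existing handling for it stays in the caller.

```c
#include <errno.h>

/*
 * Hypothetical callback standing in for one lock or unlock request.
 * Returns 0 on success, or -1 with errno set on failure.
 */
typedef int (*sketch_request_fn)(void *ctx);

/* Arbitrary upper bound on retries; an assumption, not from the patch. */
enum { SKETCH_MAX_EINTR_RETRIES = 16 };

/*
 * Issue the request and retry while it fails with EINTR.
 * Any other failure returns -1 with errno exactly as the request left it,
 * so the caller can log the real cause.
 */
int sketch_request_retry(sketch_request_fn request, void *ctx)
{
    int attempt;

    for (attempt = 0; attempt < SKETCH_MAX_EINTR_RETRIES; attempt++) {
        if (request(ctx) == 0)
            return 0;
        if (errno != EINTR)
            return -1;  /* EAGAIN or fatal: errno is preserved */
    }
    errno = EINTR;      /* still interrupted after the last attempt */
    return -1;
}

/* Test double used only to exercise the wrapper. */
struct sketch_fake_ctx {
    int eintr_failures; /* fail this many times with EINTR first */
    int final_errno;    /* 0 = then succeed, otherwise fail with this */
    int calls;          /* number of times the request was issued */
};

int sketch_fake_request(void *p)
{
    struct sketch_fake_ctx *c = p;

    c->calls++;
    if (c->eintr_failures > 0) {
        c->eintr_failures--;
        errno = EINTR;
        return -1;
    }
    if (c->final_errno != 0) {
        errno = c->final_errno;
        return -1;
    }
    return 0;
}
```

With the test double, a request that is interrupted three times and then succeeds is issued four times in total, and a request that ends in EIO returns -1 with errno still set to EIO rather than 65539.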
http://git.fedorahosted.org/git/?p=cluster.git;a=commit;h=69f42e4e36aadfba659a7843303b99bf9064b9f1
*** Bug 619476 has been marked as a duplicate of this bug. ***
Technical note added. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team.

New Contents:
Previously, Magma_sm.so handled read errors from the Distributed Lock Manager (DLM) incorrectly and passed them up to callers as EINPROG, which caused errors with the rgmanager and other applications. With this update, the magma-plugins handle these errors correctly, and the issue is resolved.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2011-0268.html