Bug 294491 - clurgmgrd[24623]: <err> #48: Unable to obtain cluster lock: Unknown error 65539
Summary: clurgmgrd[24623]: <err> #48: Unable to obtain cluster lock: Unknown error 65539
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Cluster Suite
Classification: Retired
Component: magma-plugins
Version: 4
Hardware: All
OS: Linux
Priority: urgent
Severity: urgent
Target Milestone: rc
Assignee: Lon Hohberger
QA Contact: Cluster QE
URL:
Whiteboard:
Duplicates: 428062
Depends On:
Blocks: 555907
 
Reported: 2007-09-18 10:55 UTC by Tomasz Jaszowski
Modified: 2018-10-19 23:50 UTC
CC List: 6 users

Fixed In Version: RHBA-2008-0793
Doc Type: Bug Fix
Doc Text:
Clone Of:
Clones: 572695
Environment:
Last Closed: 2010-03-11 21:30:47 UTC
Embargoed:


Attachments
Patch to retry on EINPROG from DLM (1.23 KB, patch)
2007-11-14 19:10 UTC, Lon Hohberger
Fix; regression-tested (1.83 KB, patch)
2008-01-09 18:54 UTC, Lon Hohberger
patched magma-plugins srpm (37.42 KB, application/octet-stream)
2008-01-09 22:51 UTC, Lon Hohberger
magma-plugins: Handle other return values from dlm_lock (3.03 KB, patch)
2010-02-16 21:41 UTC, Lon Hohberger
magma-plugins: Handle other errors from dlm_lock (3.33 KB, patch)
2010-02-16 21:53 UTC, Lon Hohberger


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2008:0793 0 normal SHIPPED_LIVE magma-plugins bug fix update 2008-07-25 19:09:17 UTC
Red Hat Product Errata RHBA-2009:1049 0 normal SHIPPED_LIVE magma-plugins bug-fix update 2009-05-18 21:14:21 UTC

Description Tomasz Jaszowski 2007-09-18 10:55:56 UTC
Description of problem:

unexpected reboot

Sep 16 23:10:30 tedse-pro1 clurgmgrd[24623]: <err> #48: Unable to obtain cluster
lock: Unknown error 65539
Sep 16 23:10:30 tedse-pro1 clurgmgrd[24623]: <err> #48: Unable to obtain cluster
lock: Unknown error 65539
Sep 16 23:10:30 tedse-pro1 clurgmgrd[24623]: <err> #50: Unable to obtain cluster
lock: Unknown error 65539
Sep 16 23:10:30 tedse-pro1 clurgmgrd[24622]: <crit> Watchdog: Daemon died,
rebooting...
Sep 16 23:10:31 tedse-pro1 kernel: md: stopping all md devices.
Sep 16 23:10:31 tedse-pro1 kernel: md: md0 switched to read-only mode.
Sep 16 23:14:36 tedse-pro1 syslogd 1.4.1: restart.
Sep 16 23:14:36 tedse-pro1 syslog: syslogd startup succeeded
Sep 16 23:14:36 tedse-pro1 kernel: klogd 1.4.1, log source = /proc/kmsg started.
Sep 16 23:14:36 tedse-pro1 kernel: Linux version 2.6.9-42.0.3.ELsmp
(brewbuilder.redhat.com) (gcc version 3.4.6 20060404 (Red Hat
3.4.6-3)) #1 SMP Mon Sep 25 17:28:02 EDT 2006
Sep 16 23:14:36 tedse-pro1 kernel: BIOS-provided physical RAM map:



Version-Release number of selected component (if applicable):
:~> rpm -qa | grep rgmanager
rgmanager-1.9.54-3.228823test


How reproducible:


Steps to Reproduce:
1.
2.
3.
  
Actual results:
problem and reboot

Expected results:
explanation of problem 

Additional info:

Comment 1 Lon Hohberger 2007-09-20 14:20:29 UTC
Do you have kernel logs from the incident?

Does this happen with 4.5 packages (esp. dlm, magma-plugins, magma, rgmanager), too?

65539 is what 'errno' was set to; errno is used to report system or library call
errors.  clu_lock() receiving 65539 is unlikely to indicate a problem in rgmanager.
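As an aside on why the log reads "Unknown error 65539": glibc's strerror() falls back to the literal text "Unknown error N" for errno values it does not recognize, so any out-of-range internal status code that leaks into errno is printed verbatim. A minimal illustration (the helper name is made up, not from rgmanager):

```c
#include <string.h>

/* Illustration only: for an errno value glibc does not know about,
 * strerror() returns the fallback string "Unknown error N".  That is
 * how a leaked internal status code such as 65539 (0x10003) ends up
 * verbatim in the "Unable to obtain cluster lock" log message. */
const char *cluster_lock_error_text(int err)
{
    return strerror(err);   /* 65539 -> "Unknown error 65539" on glibc */
}
```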


Comment 2 Tomasz Jaszowski 2007-09-21 09:08:00 UTC
(In reply to comment #1)
> Do you have kernel logs from the incident?

only logs from messages:

Sep 16 23:05:50 tedse-pro1 clurgmgrd[24623]: <err> #48: Unable to obtain cluster
lock: Unknown error 65539
Sep 16 23:05:50 tedse-pro1 clurgmgrd[24623]: <notice> Stopping service SERVICE1.1
Sep 16 23:05:50 tedse-pro1 clurgmgrd: [24623]: <info> Executing SERVICE1.1.init stop
Sep 16 23:05:50 tedse-pro1 su(pam_unix)[5354]: session opened for user
SERVICE1_USER by (uid=0)
Sep 16 23:06:04 tedse-pro1 su(pam_unix)[5354]: session closed for user SERVICE1_USER
Sep 16 23:06:06 tedse-pro1 clurgmgrd: [24623]: <info> unmounting SERVICE1.1
Sep 16 23:06:07 tedse-pro1 clurgmgrd: [24623]: <info> unmounting SERVICE1_FS
Sep 16 23:06:07 tedse-pro1 clurgmgrd[24623]: <notice> Service SERVICE1.1 is
recovering
Sep 16 23:06:07 tedse-pro1 clurgmgrd[24623]: <notice> Recovering failed service
SERVICE1.1
Sep 16 23:06:07 tedse-pro1 clurgmgrd: [24623]: <info> mounting FS_DEVICE on
SERVICE1.1
Sep 16 23:06:07 tedse-pro1 kernel: kjournald starting.  Commit interval 5 seconds
Sep 16 23:06:07 tedse-pro1 kernel: EXT3-fs warning: maximal mount count reached,
running e2fsck is recommended
Sep 16 23:06:07 tedse-pro1 kernel: EXT3 FS on sdf1, internal journal
Sep 16 23:06:07 tedse-pro1 kernel: EXT3-fs: mounted filesystem with ordered data
mode.
Sep 16 23:06:07 tedse-pro1 clurgmgrd: [24623]: <info> mounting /dev/sdj1 on
SERVICE1_FS
Sep 16 23:06:07 tedse-pro1 kernel: kjournald starting.  Commit interval 5 seconds
Sep 16 23:06:07 tedse-pro1 kernel: EXT3-fs warning: maximal mount count reached,
running e2fsck is recommended
Sep 16 23:06:07 tedse-pro1 kernel: EXT3 FS on sdj1, internal journal
Sep 16 23:06:07 tedse-pro1 kernel: EXT3-fs: mounted filesystem with ordered data
mode.
Sep 16 23:06:07 tedse-pro1 clurgmgrd: [24623]: <info> Executing SERVICE1.1.init
start
Sep 16 23:06:07 tedse-pro1 su(pam_unix)[6368]: session opened for user
SERVICE1_USER by (uid=0)
Sep 16 23:06:07 tedse-pro1 su(pam_unix)[6368]: session closed for user SERVICE1_USER
Sep 16 23:06:08 tedse-pro1 clurgmgrd[24623]: <notice> Service SERVICE1.1 started
[...]
[nothing unusual]
[...]
Sep 16 23:10:30 tedse-pro1 clurgmgrd[24623]: <err> #48: Unable to obtain cluster
lock: Unknown error 65539
Sep 16 23:10:30 tedse-pro1 clurgmgrd[24623]: <err> #48: Unable to obtain cluster
lock: Unknown error 65539
Sep 16 23:10:30 tedse-pro1 clurgmgrd[24623]: <err> #50: Unable to obtain cluster
lock: Unknown error 65539
Sep 16 23:10:30 tedse-pro1 clurgmgrd[24622]: <crit> Watchdog: Daemon died,
rebooting...
Sep 16 23:10:31 tedse-pro1 kernel: md: stopping all md devices.
Sep 16 23:10:31 tedse-pro1 kernel: md: md0 switched to read-only mode.
Sep 16 23:14:36 tedse-pro1 syslogd 1.4.1: restart.
Sep 16 23:14:36 tedse-pro1 syslog: syslogd startup succeeded
Sep 16 23:14:36 tedse-pro1 kernel: klogd 1.4.1, log source = /proc/kmsg started.
Sep 16 23:14:36 tedse-pro1 kernel: Linux version 2.6.9-42.0.3.ELsmp
(brewbuilder.redhat.com) (gcc version 3.4.6 20060404 (Red Hat
3.4.6-3)) #1 SMP Mon Sep 25 17:28:02 EDT 2006


> Does this happen with 4.5 packages (esp. dlm, magma-plugins, magma,
rgmanager), too?

No, we are using RHEL 4 Update 4 with an updated rgmanager package. This happened
for the first time, and on only one cluster. We have been using a few identical
two-node clusters as backup and test environments for a few months, without any changes.

> 65539 is what 'errno' was set to; errno is used to report system or library call
> errors.  clu_lock() receiving 65539 is unlikely to indicate a problem in
rgmanager.
> 



Comment 3 Lon Hohberger 2007-09-24 20:36:45 UTC
So, it recovered from an error during a locking call, and the side effect was
that the service was bounced?

Comment 4 Tomasz Jaszowski 2007-09-25 05:00:30 UTC
To me it looked like:
- locking call error
- one of the services installed on this machine failed (maybe due to this locking error)
- the service recovered (it's configured to recover on the same node)
- for a few minutes all service statuses were OK
- multiple locking call errors
- the watchdog rebooted the node

It's fine with me if one service fails and then recovers, but if the node is
rebooted, it's a big problem. A reboot usually blocks GFS for some time, and
sometimes (due to service dependencies) causes other service restarts.



Comment 5 Krzysztof Kopec 2007-10-19 14:01:42 UTC
We have a similar situation again, but on a different server. The only
difference is that the server wasn't rebooted; only services failed and were
restored. Below are logs from one of the services; in fact we had this
situation for many of the services running on this server.

Oct 19 12:53:46 tefse-pro1 clurgmgrd[2582]: <err> #48: Unable to obtain cluster
lock: Unknown error 65539
Oct 19 12:53:46 tefse-pro1 clurgmgrd[2582]: <notice> Stopping service smb1
Oct 19 12:53:51 tefse-pro1 clurgmgrd: [2582]: <notice> Forcefully unmounting
/opt/pro/smb1
Oct 19 12:53:51 tefse-pro1 clurgmgrd: [2582]: <warning> killing process 5640
(smb smbd /opt/pro/smb1)
Oct 19 12:53:51 tefse-pro1 clurgmgrd: [2582]: <warning> killing process 5650
(smb smbd /opt/pro/smb1)
Oct 19 12:54:02 tefse-pro1 clurgmgrd[2582]: <notice> Service smb1 is recovering
Oct 19 12:54:02 tefse-pro1 clurgmgrd[2582]: <notice> Recovering failed service
smb1
Oct 19 12:54:02 tefse-pro1 kernel: EXT3-fs warning: maximal mount count
reached, running e2fsck is recommended
Oct 19 12:54:02 tefse-pro1 kernel: GFS: Trying to join cluster "lock_dlm",
"tefcl-pro:prodatasmbdata"
Oct 19 12:54:05 tefse-pro1 kernel: GFS: fsid=tefcl-pro:prodatasmbdata.0: Joined
cluster. Now mounting FS...
Oct 19 12:54:05 tefse-pro1 kernel: GFS: fsid=tefcl-pro:prodatasmbdata.0: jid=0:
Trying to acquire journal lock...
Oct 19 12:54:05 tefse-pro1 kernel: GFS: fsid=tefcl-pro:prodatasmbdata.0: jid=0:
Looking at journal...
Oct 19 12:54:05 tefse-pro1 kernel: GFS: fsid=tefcl-pro:prodatasmbdata.0: jid=0:
Done
Oct 19 12:54:05 tefse-pro1 clurgmgrd[2582]: <notice> Service smb1 started

Comment 6 Lon Hohberger 2007-10-22 15:33:07 UTC
Ok, I think I know what's happening:

* libdlm is returning EINPROG / -1 (EINPROG is a libdlm-defined status code,
not a real errno; its value is 0x10003 = 65539).
* libdlm normally returns EAGAIN / 0 in that case, because EAGAIN isn't an error.
* EINPROG is normally set when we start a lock operation that is still waiting
for completion.

Ok, so, looking at libdlm code, we flip the status in do_dlm_dispatch - but only
after a successful read.  We're getting a read failure at line ~410 in libdlm.c,
and the errno isn't EAGAIN.  This causes the dlm_lksb to get passed back to the
caller unmodified but with dlm_dispatch returning -1.  This gets passed up the
stack to rgmanager, causing problems.

What we need to do is make the sm plugin retry when this condition occurs.  This
should also solve the subsequent failures.
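The path described above can be sketched as follows. This is a simplified model with illustrative names (fake_lksb, fake_dispatch, a hypothetical EINPROG define), not the real libdlm source: the EINPROG status is only replaced once the completion has been read back successfully, so a failed read hands a stale EINPROG back to the caller.

```c
#include <unistd.h>

#define EINPROG 0x10003   /* libdlm "operation in progress" status = 65539 */

struct fake_lksb { int sb_status; };

/* Simplified model of the dispatch path (illustrative, not the actual
 * libdlm code): sb_status starts at EINPROG when the request is
 * submitted and is only replaced after the completion is successfully
 * read back from the kernel.  If the read fails, the lksb is returned
 * to the caller unmodified while dispatch reports -1 -- and the stale
 * EINPROG is what rgmanager eventually printed as errno 65539. */
int fake_dispatch(int fd, struct fake_lksb *lksb)
{
    int result;

    if (read(fd, &result, sizeof(result)) != (ssize_t)sizeof(result))
        return -1;              /* sb_status left at EINPROG */

    lksb->sb_status = result;   /* status flip only on a successful read */
    return 0;
}
```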


Comment 7 Lon Hohberger 2007-10-22 15:45:44 UTC
Looking at the DLM kernel code, there are really only a couple of ways a read
would return -1:

* Size is wrong -> we're using a library to do the read/writes; this isn't possible
 
* Non-blocking I/O is specified and there's nothing to return -> but we used
select() which indicated data was available...

* A signal is pending prior to the result coming back -> this is what I think is
happening

* some other things happened which we don't care about (lvb / extended data /
etc.) -> we're not using them 



Comment 8 Lon Hohberger 2007-11-14 19:10:08 UTC
Created attachment 258571 [details]
Patch to retry on EINPROG from DLM

Comment 9 RHEL Program Management 2007-11-30 19:05:18 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 10 Lon Hohberger 2008-01-09 18:54:07 UTC
Created attachment 291185 [details]
Fix; regression-tested

Comment 11 Lon Hohberger 2008-01-09 22:51:13 UTC
Created attachment 291213 [details]
patched magma-plugins srpm

Comment 12 Lon Hohberger 2008-01-10 20:40:33 UTC
*** Bug 428062 has been marked as a duplicate of this bug. ***

Comment 13 Lon Hohberger 2008-01-28 19:29:11 UTC
Patch in CVS.

Comment 16 errata-xmlrpc 2008-07-25 19:09:20 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2008-0793.html


Comment 22 Lon Hohberger 2008-09-17 19:27:53 UTC
Updated rgmanager SRPM based on comments from Eduardo Damato:

http://people.redhat.com/lhh/rgmanager-1.9.80-1.3test.src.rpm

Comment 23 Issue Tracker 2008-12-29 12:26:27 UTC
Hi Lon,

FYI, the customer tried the patch exhaustively and wasn't able to reproduce
the problem anymore. I see the BZ is now MODIFIED, and it now just becomes a
question of when this can be released on RHN.

I am verifying with customer about urgencies etc.

thanks again!

cheers
eduardo.


This event sent from IssueTracker by edamato 
 issue 218275

Comment 27 errata-xmlrpc 2009-05-18 21:14:24 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2009-1049.html

Comment 30 Krzysztof Kopec 2010-02-16 13:10:32 UTC
I'd like to reopen this bug, because we again had a problem with our cluster:

NODE1:
Feb 16 13:41:52 tefse-pro1 clurgmgrd[11303]: <err> #48: Unable to obtain cluster lock: Unknown error 65539
Feb 16 13:41:53 tefse-pro1 clurgmgrd[11302]: <crit> Watchdog: Daemon died, rebooting...

NODE2:
Feb 16 13:42:18 tefse-pro2 kernel: CMAN: removing node tefse-pro1 from the cluster : Missed too many heartbeats
Feb 16 13:42:18 tefse-pro2 fenced[11469]: tefse-pro1 not a cluster member after 0 sec post_fail_delay
Feb 16 13:42:18 tefse-pro2 fenced[11469]: fencing node "tefse-pro1"

On both nodes we are using RH4u4 with the following version:

tefse-pro2 ~ $ rpm -qa | egrep "magma|css|manager|GFS|cman|dlm|cluster"
cman-kernel-smp-2.6.9-55.13
cman-devel-1.0.24-1
dlm-1.0.7-1
system-config-cluster-1.0.54-2.0
GFS-6.1.18-1
cmanic-7.6.0-5.rhel4
magma-1.0.8-1
cman-kernel-2.6.9-55.13
cman-1.0.24-1
dlm-kernel-smp-2.6.9-54.11
dlm-kernheaders-2.6.9-54.11
dlm-devel-1.0.7-1
magma-plugins-1.0.15-1
GFS-kernel-2.6.9-80.9
GFS-kernel-smp-2.6.9-80.9
rgmanager-1.9.87-1
magma-devel-1.0.8-1
cman-kernheaders-2.6.9-55.13
dlm-kernel-2.6.9-54.11
GFS-kernheaders-2.6.9-80.9

As you can see, we had applied the erratum from the previous comment
(magma-plugins-1.0.15-1.src.rpm).

Comment 31 Tomasz Jaszowski 2010-02-16 13:11:53 UTC
(In reply to comment #30)
> I'd like to reopen this bug, because we had again problem with our cluster:
[...]
> As you can see we had applied errata from previous comment
> (magma-plugins-1.0.15-1.src.rpm)    

due to above comment - REOPEN

Comment 32 Perry Myers 2010-02-16 16:45:34 UTC
(In reply to comment #31)
> (In reply to comment #30)
> > I'd like to reopen this bug, because we had again problem with our cluster:
> [...]
> > As you can see we had applied errata from previous comment
> > (magma-plugins-1.0.15-1.src.rpm)    
> 
> due to above comment - REOPEN    

Hi, thanks for the additional info.  We'll certainly investigate.

Do you have a ticket open already with Red Hat support?

Thanks

Comment 33 Lon Hohberger 2010-02-16 19:53:19 UTC
1) How long was the cluster up?
2) Did rgmanager drop a core file?

Comment 34 Lon Hohberger 2010-02-16 20:21:56 UTC
I believe I figured out how we get error #65539, but it turns out that it occurs only when other bad stuff happens.

sm.c:745 - sm_lock begin
sm.c:811 - call _dlm_lock
sm.c:568 - _dlm_lock begin
sm.c:574 - call libdlm.c: dlm_ls_lock
libdlm.c:538 - dlm_ls_lock begin
libdlm.c:595 - lksb->sb_status = EINPROG (= 65539)
libdlm.c:604 - call dlm_write (write) - --fails--
libdlm.c:608 - dlm_ls_lock return -1
sm.c:574 - ret = -1
sm.c:579 - _dlm_lock return -1
sm.c:811 - ret = -1
sm.c:824 - checking errno, which is set to whatever write() returned
sm.c:840 - errno wasn't EAGAIN
sm.c:842 - ret = lksb->sb_status (= EINPROG = 65539)
sm.c:844 - errno = ret (= 65539)
sm.c:845 - sm_lock return -1

At this point, rgmanager spits out the warning.

Unfortunately, we don't know what bad value write() returned at this point - but it should not matter.

The call to dlm_ls_lock should check for:
 - EAGAIN (handled), EINTR -> retry
 - EBADF, EFAULT, EFBIG, EINVAL, EIO, ENOSPC, EPIPE -> fatal

My guess as to what happened is that rgmanager received a signal at the exact moment of the write call, causing EINTR to be returned to magma-plugins.  EINTR was then being overwritten with the EINPROG status in the manner described above, causing both an incorrect warning in the system logs and improper behavior.
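The proposed retry behaviour can be sketched like this. It is a minimal illustration with made-up names (lock_with_retry, the fake_dlm_lock test double), not the actual magma-plugins patch: interrupted and transient errors are retried, everything else is passed to the caller with errno intact instead of being clobbered by lksb->sb_status.

```c
#include <errno.h>

/* Minimal sketch of the proposed fix (function names are invented,
 * not the actual magma-plugins code): retry the DLM request when the
 * underlying write(2) was interrupted by a signal (EINTR) or the lock
 * manager asked us to try again (EAGAIN); pass every other error to
 * the caller with the real errno preserved. */
int lock_with_retry(int (*dlm_lock_fn)(void))
{
    int ret;

    do {
        ret = dlm_lock_fn();
    } while (ret < 0 && (errno == EINTR || errno == EAGAIN));

    return ret;   /* 0 on success; -1 with the real errno on failure */
}

/* Test double standing in for dlm_ls_lock(): fails twice with EINTR,
 * then succeeds, mimicking a request interrupted by signals. */
int fake_attempts;
int fake_dlm_lock(void)
{
    if (fake_attempts++ < 2) {
        errno = EINTR;
        return -1;
    }
    return 0;
}
```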

Comment 35 Lon Hohberger 2010-02-16 21:41:04 UTC
Created attachment 394649 [details]
magma-plugins: Handle other return values from dlm_lock

If writing to the dlm lockspace file descriptor failed
due to delivery of a signal, libdlm and magma-plugins were
passing EINTR back up to the caller instead of
retrying the lock or unlock request.

Additionally, when this occurred, we were overwriting errno
with the value of lksb->sb_status, which was always EINPROG
(65539) if write(2) returned any error condition except
EAGAIN.

Comment 36 Lon Hohberger 2010-02-16 21:53:42 UTC
Created attachment 394651 [details]
magma-plugins: Handle other errors from dlm_lock

Update to the previous patch; just better comments.

Comment 37 Lon Hohberger 2010-02-16 22:41:55 UTC
Test srpm:

http://people.redhat.com/lhh/magma-plugins-1.0.15-1.1.src.rpm

You can rebuild this on your machine(s) by running:

  rpmbuild --rebuild magma-plugins-1.0.15-1.1.src.rpm

Note that you must have gcc, magma-devel, cman-devel, dlm-devel, gulm-devel, cman-kernheaders and possibly other packages installed in order to build.

Comment 39 Lon Hohberger 2010-03-11 21:30:47 UTC
I have opened a new bugzilla in order to track the second case in comment #34 separately.

The new bugzilla is here:

https://bugzilla.redhat.com/show_bug.cgi?id=572695

Please direct future attention to the above bugzilla.

