Description of problem: unexpected reboot

Sep 16 23:10:30 tedse-pro1 clurgmgrd[24623]: <err> #48: Unable to obtain cluster lock: Unknown error 65539
Sep 16 23:10:30 tedse-pro1 clurgmgrd[24623]: <err> #48: Unable to obtain cluster lock: Unknown error 65539
Sep 16 23:10:30 tedse-pro1 clurgmgrd[24623]: <err> #50: Unable to obtain cluster lock: Unknown error 65539
Sep 16 23:10:30 tedse-pro1 clurgmgrd[24622]: <crit> Watchdog: Daemon died, rebooting...
Sep 16 23:10:31 tedse-pro1 kernel: md: stopping all md devices.
Sep 16 23:10:31 tedse-pro1 kernel: md: md0 switched to read-only mode.
Sep 16 23:14:36 tedse-pro1 syslogd 1.4.1: restart.
Sep 16 23:14:36 tedse-pro1 syslog: syslogd startup succeeded
Sep 16 23:14:36 tedse-pro1 kernel: klogd 1.4.1, log source = /proc/kmsg started.
Sep 16 23:14:36 tedse-pro1 kernel: Linux version 2.6.9-42.0.3.ELsmp (brewbuilder.redhat.com) (gcc version 3.4.6 20060404 (Red Hat 3.4.6-3)) #1 SMP Mon Sep 25 17:28:02 EDT 2006
Sep 16 23:14:36 tedse-pro1 kernel: BIOS-provided physical RAM map:

Version-Release number of selected component (if applicable):
:~> rpm -qa | grep rgmanager
rgmanager-1.9.54-3.228823test

How reproducible:

Steps to Reproduce:
1.
2.
3.

Actual results: problem and reboot

Expected results: explanation of problem

Additional info:
Do you have kernel logs from the incident? Does this happen with 4.5 packages (esp. dlm, magma-plugins, magma, rgmanager), too? 65539 is what 'errno' was set to; errno is used to report system or library call errors. clu_lock() receiving 65539 is unlikely to indicate a problem in rgmanager.
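As an aside, the reason syslog prints "Unknown error 65539" is that 65539 (0x10003) lies far outside the range of real Linux errno values, so strerror() has no message for it. A minimal sketch (assuming glibc's "Unknown error N" fallback text; `describe_errno` is an invented helper name):

```c
#include <string.h>

/* 65539 decimal == 0x10003.  Real Linux errno values are small
 * integers, so strerror() has no message text for 65539; syslog
 * therefore prints "Unknown error 65539" -- a strong hint that the
 * value stored in errno was never a real errno in the first place. */
static const char *describe_errno(int e)
{
    return strerror(e);  /* glibc: "Unknown error 65539" for 65539 */
}
```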
(In reply to comment #1)
> Do you have kernel logs from the incident?

only logs from messages:

Sep 16 23:05:50 tedse-pro1 clurgmgrd[24623]: <err> #48: Unable to obtain cluster lock: Unknown error 65539
Sep 16 23:05:50 tedse-pro1 clurgmgrd[24623]: <notice> Stopping service SERVICE1.1
Sep 16 23:05:50 tedse-pro1 clurgmgrd: [24623]: <info> Executing SERVICE1.1.init stop
Sep 16 23:05:50 tedse-pro1 su(pam_unix)[5354]: session opened for user SERVICE1_USER by (uid=0)
Sep 16 23:06:04 tedse-pro1 su(pam_unix)[5354]: session closed for user SERVICE1_USER
Sep 16 23:06:06 tedse-pro1 clurgmgrd: [24623]: <info> unmounting SERVICE1.1
Sep 16 23:06:07 tedse-pro1 clurgmgrd: [24623]: <info> unmounting SERVICE1_FS
Sep 16 23:06:07 tedse-pro1 clurgmgrd[24623]: <notice> Service SERVICE1.1 is recovering
Sep 16 23:06:07 tedse-pro1 clurgmgrd[24623]: <notice> Recovering failed service SERVICE1.1
Sep 16 23:06:07 tedse-pro1 clurgmgrd: [24623]: <info> mounting FS_DEVICE on SERVICE1.1
Sep 16 23:06:07 tedse-pro1 kernel: kjournald starting. Commit interval 5 seconds
Sep 16 23:06:07 tedse-pro1 kernel: EXT3-fs warning: maximal mount count reached, running e2fsck is recommended
Sep 16 23:06:07 tedse-pro1 kernel: EXT3 FS on sdf1, internal journal
Sep 16 23:06:07 tedse-pro1 kernel: EXT3-fs: mounted filesystem with ordered data mode.
Sep 16 23:06:07 tedse-pro1 clurgmgrd: [24623]: <info> mounting /dev/sdj1 on SERVICE1_FS
Sep 16 23:06:07 tedse-pro1 kernel: kjournald starting. Commit interval 5 seconds
Sep 16 23:06:07 tedse-pro1 kernel: EXT3-fs warning: maximal mount count reached, running e2fsck is recommended
Sep 16 23:06:07 tedse-pro1 kernel: EXT3 FS on sdj1, internal journal
Sep 16 23:06:07 tedse-pro1 kernel: EXT3-fs: mounted filesystem with ordered data mode.
Sep 16 23:06:07 tedse-pro1 clurgmgrd: [24623]: <info> Executing SERVICE1.1.init start
Sep 16 23:06:07 tedse-pro1 su(pam_unix)[6368]: session opened for user SERVICE1_USER by (uid=0)
Sep 16 23:06:07 tedse-pro1 su(pam_unix)[6368]: session closed for user SERVICE1_USER
Sep 16 23:06:08 tedse-pro1 clurgmgrd[24623]: <notice> Service SERVICE1.1 started
[...]
[nothing unusual]
[...]
Sep 16 23:10:30 tedse-pro1 clurgmgrd[24623]: <err> #48: Unable to obtain cluster lock: Unknown error 65539
Sep 16 23:10:30 tedse-pro1 clurgmgrd[24623]: <err> #48: Unable to obtain cluster lock: Unknown error 65539
Sep 16 23:10:30 tedse-pro1 clurgmgrd[24623]: <err> #50: Unable to obtain cluster lock: Unknown error 65539
Sep 16 23:10:30 tedse-pro1 clurgmgrd[24622]: <crit> Watchdog: Daemon died, rebooting...
Sep 16 23:10:31 tedse-pro1 kernel: md: stopping all md devices.
Sep 16 23:10:31 tedse-pro1 kernel: md: md0 switched to read-only mode.
Sep 16 23:14:36 tedse-pro1 syslogd 1.4.1: restart.
Sep 16 23:14:36 tedse-pro1 syslog: syslogd startup succeeded
Sep 16 23:14:36 tedse-pro1 kernel: klogd 1.4.1, log source = /proc/kmsg started.
Sep 16 23:14:36 tedse-pro1 kernel: Linux version 2.6.9-42.0.3.ELsmp (brewbuilder.redhat.com) (gcc version 3.4.6 20060404 (Red Hat 3.4.6-3)) #1 SMP Mon Sep 25 17:28:02 EDT 2006

> Does this happen with 4.5 packages (esp. dlm, magma-plugins, magma, rgmanager), too?

No, we are using RH4u4 with an updated rgmanager package. It happened for the first time, and on only one cluster. We use a few identical 2-node clusters as backup and test environments, and we have been running them for a few months without any changes.

> 65539 is what 'errno' was set to; errno is used to report system or library call
> errors. clu_lock() receiving 65539 is unlikely to indicate a problem in rgmanager.
So, it recovered from an error during a locking call, and the side effect was that the service was bounced?
To me it looked like:
- locking call error
- one of the services installed on this machine failed (maybe due to this locking error)
- the service recovered (it's configured to recover on the same node)
- for a few minutes all service statuses were OK
- multiple locking call errors
- the watchdog rebooted the node

It's OK with me if one service fails and then recovers, but if the node is rebooted, that's a big problem. A reboot usually blocks GFS for some time, and sometimes (due to service dependencies) causes other service restarts.
We have a similar situation again, but on a different server. The only difference is that the server wasn't rebooted; only services failed and were restored. Here are the logs from one of the services; in fact we have this situation for many of the services running on this server.

Oct 19 12:53:46 tefse-pro1 clurgmgrd[2582]: <err> #48: Unable to obtain cluster lock: Unknown error 65539
Oct 19 12:53:46 tefse-pro1 clurgmgrd[2582]: <notice> Stopping service smb1
Oct 19 12:53:51 tefse-pro1 clurgmgrd: [2582]: <notice> Forcefully unmounting /opt/pro/smb1
Oct 19 12:53:51 tefse-pro1 clurgmgrd: [2582]: <warning> killing process 5640 (smb smbd /opt/pro/smb1)
Oct 19 12:53:51 tefse-pro1 clurgmgrd: [2582]: <warning> killing process 5650 (smb smbd /opt/pro/smb1)
Oct 19 12:54:02 tefse-pro1 clurgmgrd[2582]: <notice> Service smb1 is recovering
Oct 19 12:54:02 tefse-pro1 clurgmgrd[2582]: <notice> Recovering failed service smb1
Oct 19 12:54:02 tefse-pro1 kernel: EXT3-fs warning: maximal mount count reached, running e2fsck is recommended
Oct 19 12:54:02 tefse-pro1 kernel: GFS: Trying to join cluster "lock_dlm", "tefcl-pro:prodatasmbdata"
Oct 19 12:54:05 tefse-pro1 kernel: GFS: fsid=tefcl-pro:prodatasmbdata.0: Joined cluster. Now mounting FS...
Oct 19 12:54:05 tefse-pro1 kernel: GFS: fsid=tefcl-pro:prodatasmbdata.0: jid=0: Trying to acquire journal lock...
Oct 19 12:54:05 tefse-pro1 kernel: GFS: fsid=tefcl-pro:prodatasmbdata.0: jid=0: Looking at journal...
Oct 19 12:54:05 tefse-pro1 kernel: GFS: fsid=tefcl-pro:prodatasmbdata.0: jid=0: Done
Oct 19 12:54:05 tefse-pro1 clurgmgrd[2582]: <notice> Service smb1 started
Ok, I think I know what's happening:

* libdlm is returning EINPROG / -1 (EINPROG = libdlm-defined error condition, not related to the actual errno; = 0x10003 = 65539).
* libdlm normally returns EAGAIN / 0 in that case, because EAGAIN isn't an error.
* EINPROG is normally set when we start a lock operation but it's waiting for completion.

Ok, so, looking at the libdlm code, we flip the status in do_dlm_dispatch - but only after a successful read. We're getting a read failure at line ~410 in libdlm.c, and the errno isn't EAGAIN. This causes the dlm_lksb to get passed back to the caller unmodified, but with dlm_dispatch returning -1. This gets passed up the stack to rgmanager, causing problems.

What we need to do is make the sm plugin retry when this condition occurs. This should also solve the subsequent failures.
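The retry idea can be sketched roughly as follows (`dispatch_with_retry`, `EINPROG_SKETCH`, and the stub are invented names for illustration; this is not the actual magma-plugins patch): if the dispatch call fails while errno still carries libdlm's EINPROG status, the operation hasn't genuinely failed yet, so the sm plugin should try again rather than pass -1 up to rgmanager.

```c
#include <errno.h>

/* 0x10003: libdlm's internal "lock operation in progress" status. */
#define EINPROG_SKETCH 65539

/* Hypothetical retry wrapper: keep re-dispatching while the only
 * "error" seen is the in-progress status; bail out on real errors. */
static int dispatch_with_retry(int (*dispatch_fn)(void), int max_tries)
{
    int ret = -1;
    while (max_tries-- > 0) {
        ret = dispatch_fn();
        if (ret == 0 || errno != EINPROG_SKETCH)
            break;  /* success, or a genuinely fatal error */
    }
    return ret;
}

/* Stub used only for a self-check: fails twice with
 * errno = EINPROG_SKETCH, then succeeds. */
static int stub_calls;
static int stub_dispatch(void)
{
    if (++stub_calls < 3) {
        errno = EINPROG_SKETCH;
        return -1;
    }
    return 0;
}
```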
Looking at the DLM kernel code, there are really only a couple of ways a read would return -1:

* Size is wrong -> we're using a library to do the reads/writes; this isn't possible.
* Non-blocking I/O is specified and there's nothing to return -> but we used select(), which indicated data was available...
* A signal is pending prior to the result coming back -> this is what I think is happening.
* Some other things happened which we don't care about (lvb / extended data / etc.) -> we're not using them.
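The signal-pending case is the classic EINTR scenario: the read fails with errno == EINTR before any data is transferred, and the usual defensive pattern is simply to retry. A generic sketch of that idiom (not the actual libdlm code; `read_retry` is an invented name):

```c
#include <errno.h>
#include <unistd.h>

/* Retry a read() that may be interrupted by a signal before any data
 * arrives: read() fails with errno == EINTR in that case, and the
 * request should be reissued rather than reported as an error. */
static ssize_t read_retry(int fd, void *buf, size_t count)
{
    ssize_t n;
    do {
        n = read(fd, buf, count);
    } while (n < 0 && errno == EINTR);
    return n;
}
```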
Created attachment 258571 [details] Patch to retry on EINPROG from DLM
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.
Created attachment 291185 [details] Fix; regression-tested
Created attachment 291213 [details] patched magma-plugins srpm
*** Bug 428062 has been marked as a duplicate of this bug. ***
Patch in CVS.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHBA-2008-0793.html
Updated fixes in git:
http://git.fedorahosted.org/git/?p=cluster.git;a=commit;h=84ad017e8091cda3e538ab439ec4938e640fe22d

Related to doing lots of relocations quickly:
http://git.fedorahosted.org/git/?p=cluster.git;a=commit;h=12a7b076984a5e92a4c2f4fc6d589f75e6d7b646
(bug #461954)

and (for performance):
http://git.fedorahosted.org/git/?p=cluster.git;a=commit;h=651858b37a27ee735423103e65bc39121b5ece0a
Test source RPMs here:
http://people.redhat.com/lhh/magma-plugins-1.0.14-1.1test.src.rpm
http://people.redhat.com/lhh/rgmanager-1.9.80-1.1test.src.rpm
I pushed the wrong patch; here's the corrected one:
http://git.fedorahosted.org/git/?p=cluster.git;a=blobdiff;f=magma-plugins/sm/sm.c;h=666bb4fd78adf20c0f66f9431cc4c7b6b8bef972;hp=6867b81a0acc3e0df135d57ca1bb881afdf4364d;hb=f59bac1cbb6a1b3812dbdf0637fdeee635aa1a8e;hpb=0179f8e9edc98ace2e59cb432476e2d9dd89cfde

Updated magma-plugins source RPM:
http://people.redhat.com/lhh/magma-plugins-1.0.14-1.2test.src.rpm
Updated rgmanager SRPM based on comments from Eduardo Damato: http://people.redhat.com/lhh/rgmanager-1.9.80-1.3test.src.rpm
Hi Lon,

FYI, the customer tried the patch exhaustively and wasn't able to reproduce the problem anymore.

I see the BZ is now MODIFIED, so it just becomes a question of when this can be released on RHN. I am verifying with the customer about urgency etc.

thanks again!
cheers
eduardo.

This event sent from IssueTracker by edamato issue 218275
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2009-1049.html
I'd like to reopen this bug, because we had a problem with our cluster again:

NODE1:
Feb 16 13:41:52 tefse-pro1 clurgmgrd[11303]: <err> #48: Unable to obtain cluster lock: Unknown error 65539
Feb 16 13:41:53 tefse-pro1 clurgmgrd[11302]: <crit> Watchdog: Daemon died, rebooting...

NODE2:
Feb 16 13:42:18 tefse-pro2 kernel: CMAN: removing node tefse-pro1 from the cluster : Missed too many heartbeats
Feb 16 13:42:18 tefse-pro2 fenced[11469]: tefse-pro1 not a cluster member after 0 sec post_fail_delay
Feb 16 13:42:18 tefse-pro2 fenced[11469]: fencing node "tefse-pro1"

On both nodes we are using RH4u4 with the following versions:

tefse-pro2 ~ $ rpm -qa | egrep "magma|css|manager|GFS|cman|dlm|cluster"
cman-kernel-smp-2.6.9-55.13
cman-devel-1.0.24-1
dlm-1.0.7-1
system-config-cluster-1.0.54-2.0
GFS-6.1.18-1
cmanic-7.6.0-5.rhel4
magma-1.0.8-1
cman-kernel-2.6.9-55.13
cman-1.0.24-1
dlm-kernel-smp-2.6.9-54.11
dlm-kernheaders-2.6.9-54.11
dlm-devel-1.0.7-1
magma-plugins-1.0.15-1
GFS-kernel-2.6.9-80.9
GFS-kernel-smp-2.6.9-80.9
rgmanager-1.9.87-1
magma-devel-1.0.8-1
cman-kernheaders-2.6.9-55.13
dlm-kernel-2.6.9-54.11
GFS-kernheaders-2.6.9-80.9

As you can see, we had applied the erratum from the previous comment (magma-plugins-1.0.15-1.src.rpm).
(In reply to comment #30)
> I'd like to reopen this bug, because we had again problem with our cluster:
[...]
> As you can see we had applied errata from previous comment
> (magma-plugins-1.0.15-1.src.rpm)

Due to the above comment - REOPEN.
(In reply to comment #31)
> (In reply to comment #30)
> > I'd like to reopen this bug, because we had again problem with our cluster:
> [...]
> > As you can see we had applied errata from previous comment
> > (magma-plugins-1.0.15-1.src.rpm)
>
> due to above comment - REOPEN

Hi, thanks for the additional info. We'll certainly investigate. Do you have a ticket open already with Red Hat support?

Thanks
1) How long was the cluster up?
2) Did rgmanager drop a core file?
I believe I figured out how we get error #65539, but it turns out that it occurs only when other bad stuff happens.

sm.c:745     - sm_lock begin
sm.c:811     - call _dlm_lock
sm.c:568     - _dlm_lock begin
sm.c:574     - call libdlm.c: dlm_ls_lock
libdlm.c:538 - dlm_ls_lock begin
libdlm.c:595 - lksb->sb_status = EINPROG (= 65539)
libdlm.c:604 - call dlm_write (write) --fails--
libdlm.c:608 - dlm_ls_lock return -1
sm.c:574     - ret = -1
sm.c:579     - _dlm_lock return -1
sm.c:811     - ret = -1
sm.c:824     - checking errno, which is set to whatever write() returned
sm.c:840     - errno wasn't EAGAIN
sm.c:842     - ret = lksb->sb_status (= EINPROG = 65539)
sm.c:844     - errno = ret (= 65539)
sm.c:845     - sm_lock return -1

At this point, rgmanager spits out the warning. Unfortunately, we don't know what bad value write() returned at this point - but it should not matter. The call to dlm_ls_lock should check for:

- EAGAIN (handled), EINTR -> retry
- EBADF, EFAULT, EFBIG, EINVAL, EIO, ENOSPC, EPIPE -> fatal

My guess as to what happened is that rgmanager received a signal at the exact moment of the write call, causing EINTR to be returned to magma-plugins. EINTR was being overwritten with the value in the manner described above, causing both an incorrect warning in the system logs and improper behavior.
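The retry/fatal split above could be expressed as a small classifier. A hypothetical sketch (`lock_errno_is_transient` is an invented name; this is not the actual sm.c code): the point is that transient errors are retried and fatal ones are reported with the real errno, never with lksb->sb_status (EINPROG = 65539).

```c
#include <errno.h>

/* Classify the errno left behind by a failed dlm_ls_lock() write:
 * transient conditions should be retried; anything else is fatal
 * and should be reported with this errno as-is. */
static int lock_errno_is_transient(int err)
{
    switch (err) {
    case EAGAIN:        /* lock not immediately available */
    case EINTR:         /* write() interrupted by a signal */
        return 1;       /* retry the lock/unlock request */
    case EBADF:
    case EFAULT:
    case EFBIG:
    case EINVAL:
    case EIO:
    case ENOSPC:
    case EPIPE:
    default:
        return 0;       /* fatal: fail with the real errno */
    }
}
```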
Created attachment 394649 [details] Patch: magma-plugins: Handle other return values from dlm_lock

If writing to the dlm lockspace file descriptor failed due to delivery of a signal, libdlm and magma-plugins were passing EINTR back up to the caller instead of retrying the lock or unlock request. Additionally, when this occurred, we were overwriting errno with the value of lksb->sb_status, which was always EINPROG (65539) if write(2) returned any error condition except EAGAIN.
Created attachment 394651 [details] magma-plugins: Handle other errors from dlm_lock Update to previous patch. Just better comments.
Test srpm:
http://people.redhat.com/lhh/magma-plugins-1.0.15-1.1.src.rpm

You can rebuild this on your machine(s) by running:

rpmbuild --rebuild magma-plugins-1.0.15-1.1.src.rpm

Note that you must have gcc, magma-devel, cman-devel, dlm-devel, gulm-devel, cman-kernheaders and possibly other packages installed in order to build.
I have opened a new bugzilla in order to track the second case in comment #34 separately. The new bugzilla is here: https://bugzilla.redhat.com/show_bug.cgi?id=572695 Please direct future attention to the above bugzilla.