Bug 183507
Summary: | gulm LT can start (and fail) before gulm core starts | ||
---|---|---|---|
Product: | [Retired] Red Hat Cluster Suite | Reporter: | Corey Marthaler <cmarthal> |
Component: | gulm | Assignee: | Chris Feist <cfeist> |
Status: | CLOSED ERRATA | QA Contact: | Cluster QE <mspqa-list> |
Severity: | medium | Docs Contact: | |
Priority: | medium | ||
Version: | 4 | CC: | cluster-maint |
Target Milestone: | --- | ||
Target Release: | --- | ||
Hardware: | All | ||
OS: | Linux | ||
Whiteboard: | |||
Fixed In Version: | RHBA-2006-0553 | Doc Type: | Bug Fix |
Last Closed: | 2006-08-10 21:18:00 UTC | Type: | --- |
Bug Blocks: | 180185 |
Description
Corey Marthaler
2006-03-01 16:42:20 UTC
This is looking like a Gulm issue:

[root@morph-01 ~]# tail -f /var/log/messages
Mar 1 10:55:03 morph-01 lock_gulmd_LTPX[7574]: Cannot connect morph-01 ::ffff:10.15.89.61 (Connection refused)
Mar 1 10:55:03 morph-01 lock_gulmd_core[7570]: Got heartbeat from morph-04 at 1141232103407222 (last:10001059 max:10638248 avg:10001058)
Mar 1 10:55:04 morph-01 lock_gulmd_LTPX[7574]: Cannot connect morph-01 ::ffff:10.15.89.61 (Connection refused)
Mar 1 10:55:08 morph-01 last message repeated 4 times
Mar 1 10:55:08 morph-01 lock_gulmd_core[7570]: Got heartbeat from morph-01 at 1141232108530404 (last:10001478 max:10002480 avg:10001149)
Mar 1 10:55:09 morph-01 lock_gulmd_LTPX[7574]: Cannot connect morph-01 ::ffff:10.15.89.61 (Connection refused)
Mar 1 10:55:10 morph-01 lock_gulmd_LTPX[7574]: Cannot connect morph-01 ::ffff:10.15.89.61 (Connection refused)
Mar 1 10:55:11 morph-01 lock_gulmd_core[7570]: Got heartbeat from morph-03 at 1141232111218135 (last:10000429 max:12802905 avg:10000516)
Mar 1 10:55:11 morph-01 lock_gulmd_LTPX[7574]: Cannot connect morph-01 ::ffff:10.15.89.61 (Connection refused)
Mar 1 10:55:11 morph-01 lock_gulmd_core[7570]: Got heartbeat from morph-02 at 1141232111474380 (last:10000431 max:12548367 avg:10000550)
Mar 1 10:55:12 morph-01 lock_gulmd_LTPX[7574]: Cannot connect morph-01 ::ffff:10.15.89.61 (Connection refused)
Mar 1 10:55:13 morph-01 lock_gulmd_LTPX[7574]: Cannot connect morph-01 ::ffff:10.15.89.61 (Connection refused)
Mar 1 10:55:13 morph-01 lock_gulmd_core[7570]: Got heartbeat from morph-04 at 1141232113408278 (last:10001056 max:10638248 avg:10001058)
Mar 1 10:55:14 morph-01 lock_gulmd_LTPX[7574]: Cannot connect morph-01 ::ffff:10.15.89.61 (Connection refused)

[root@morph-02 ~]# tail -f /var/log/messages
Mar 1 10:52:31 morph-02 lock_gulmd_core[25985]: Sending heartbeat to Core Master at 1141231951272364, last was 1141231941271885
Mar 1 10:52:32 morph-02 lock_gulmd_LTPX[25993]: Cannot connect morph-01.lab.msp.redhat.com ::ffff:10.15.89.61 (Connection refused)
Mar 1 10:52:41 morph-02 last message repeated 9 times
Mar 1 10:52:41 morph-02 lock_gulmd_core[25985]: Sending heartbeat to Core Master at 1141231961272844, last was 1141231951272364
Mar 1 10:52:42 morph-02 lock_gulmd_LTPX[25993]: Cannot connect morph-01.lab.msp.redhat.com ::ffff:10.15.89.61 (Connection refused)
Mar 1 10:52:51 morph-02 last message repeated 9 times
Mar 1 10:52:51 morph-02 lock_gulmd_core[25985]: Sending heartbeat to Core Master at 1141231971273324, last was 1141231961272844
Mar 1 10:52:52 morph-02 lock_gulmd_LTPX[25993]: Cannot connect morph-01.lab.msp.redhat.com ::ffff:10.15.89.61 (Connection refused)
Mar 1 10:53:01 morph-02 last message repeated 9 times
Mar 1 10:53:01 morph-02 lock_gulmd_core[25985]: Sending heartbeat to Core Master at 1141231981273803, last was 1141231971273324
Mar 1 10:53:02 morph-02 lock_gulmd_LTPX[25993]: Cannot connect morph-01.lab.msp.redhat.com ::ffff:10.15.89.61 (Connection refused)

[root@morph-03 ~]# tail -f /var/log/messages
Mar 1 10:57:22 morph-03 lock_gulmd_LT000[27793]: Trying to log into Master morph-01.lab.msp.redhat.com ::ffff:10.15.89.61
Mar 1 10:57:22 morph-03 lock_gulmd_LT000[27793]: Cannot connect to morph-01.lab.msp.redhat.com ::ffff:10.15.89.61 (Connection refused)
Mar 1 10:57:22 morph-03 lock_gulmd_LTPX[27797]: Cannot connect morph-01.lab.msp.redhat.com ::ffff:10.15.89.61 (Connection refused)
Mar 1 10:57:23 morph-03 lock_gulmd_LT000[27793]: Trying to log into Master morph-01.lab.msp.redhat.com ::ffff:10.15.89.61
Mar 1 10:57:23 morph-03 lock_gulmd_LT000[27793]: Cannot connect to morph-01.lab.msp.redhat.com ::ffff:10.15.89.61 (Connection refused)
Mar 1 10:57:23 morph-03 lock_gulmd_LTPX[27797]: Cannot connect morph-01.lab.msp.redhat.com ::ffff:10.15.89.61 (Connection refused)
Mar 1 10:57:24 morph-03 lock_gulmd_LT000[27793]: Trying to log into Master morph-01.lab.msp.redhat.com ::ffff:10.15.89.61
Mar 1 10:57:24 morph-03 lock_gulmd_LT000[27793]: Cannot connect to morph-01.lab.msp.redhat.com ::ffff:10.15.89.61 (Connection refused)
Mar 1 10:57:24 morph-03 lock_gulmd_core[27789]: Sending heartbeat to Core Master at 1141232244370484, last was 1141232234370004
Mar 1 10:57:24 morph-03 lock_gulmd_LTPX[27797]: Cannot connect morph-01.lab.msp.redhat.com ::ffff:10.15.89.61 (Connection refused)
Mar 1 10:57:25 morph-03 lock_gulmd_LT000[27793]: Trying to log into Master morph-01.lab.msp.redhat.com ::ffff:10.15.89.61
Mar 1 10:57:25 morph-03 lock_gulmd_LT000[27793]: Cannot connect to morph-01.lab.msp.redhat.com ::ffff:10.15.89.61 (Connection refused)
Mar 1 10:57:25 morph-03 lock_gulmd_LTPX[27797]: Cannot connect morph-01.lab.msp.redhat.com ::ffff:10.15.89.61 (Connection refused)

[root@morph-04 ~]# tail -f /var/log/messages
Mar 1 10:54:39 morph-04 lock_gulmd_LT000[25615]: Cannot connect to morph-01.lab.msp.redhat.com ::ffff:10.15.89.61 (Connection refused)
Mar 1 10:54:39 morph-04 lock_gulmd_LTPX[25619]: Cannot connect morph-01.lab.msp.redhat.com ::ffff:10.15.89.61 (Connection refused)
Mar 1 10:54:40 morph-04 lock_gulmd_LT000[25615]: Trying to log into Master morph-01.lab.msp.redhat.com ::ffff:10.15.89.61
Mar 1 10:54:40 morph-04 lock_gulmd_LT000[25615]: Cannot connect to morph-01.lab.msp.redhat.com ::ffff:10.15.89.61 (Connection refused)
Mar 1 10:54:40 morph-04 lock_gulmd_LTPX[25619]: Cannot connect morph-01.lab.msp.redhat.com ::ffff:10.15.89.61 (Connection refused)
Mar 1 10:54:41 morph-04 lock_gulmd_LT000[25615]: Trying to log into Master morph-01.lab.msp.redhat.com ::ffff:10.15.89.61
Mar 1 10:54:41 morph-04 lock_gulmd_LT000[25615]: Cannot connect to morph-01.lab.msp.redhat.com ::ffff:10.15.89.61 (Connection refused)
Mar 1 10:54:41 morph-04 lock_gulmd_LTPX[25619]: Cannot connect morph-01.lab.msp.redhat.com ::ffff:10.15.89.61 (Connection refused)
Mar 1 10:54:42 morph-04 lock_gulmd_LT000[25615]: Trying to log into Master morph-01.lab.msp.redhat.com ::ffff:10.15.89.61
Mar 1 10:54:42 morph-04 lock_gulmd_LT000[25615]: Cannot connect to morph-01.lab.msp.redhat.com ::ffff:10.15.89.61 (Connection refused)
Mar 1 10:54:42 morph-04 lock_gulmd_LTPX[25619]: Cannot connect morph-01.lab.msp.redhat.com ::ffff:10.15.89.61 (Connection refused)

What is going on here? gulm_tool getstats reports that the cluster is fine, yet I see all these connection refusal messages. I don't think this is another FQDN issue, because everyone is using the short name.

[root@morph-01 ~]# uname -ar
Linux morph-01 2.6.9-34.EL #1 Fri Feb 24 16:44:51 EST 2006 i686 i686 i386 GNU/Linux

[root@morph-01 ~]# cat /etc/cluster/cluster.conf
<?xml version="1.0"?>
<cluster name="morph-cluster" config_version="1">
  <gulm>
    <lockserver name="morph-01"/>
    <lockserver name="morph-03"/>
    <lockserver name="morph-04"/>
  </gulm>
  <clusternodes>
    <clusternode name="morph-01">
      <fence>
        <method name="single">
          <device name="apc" switch="1" port="1"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="morph-02">
      <fence>
        <method name="single">
          <device name="apc" switch="1" port="2"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="morph-03">
      <fence>
        <method name="single">
          <device name="apc" switch="1" port="3"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="morph-04">
      <fence>
        <method name="single">
          <device name="apc" switch="1" port="4"/>
        </method>
      </fence>
    </clusternode>
  </clusternodes>
  <fencedevices>
    <fencedevice name="apc" agent="fence_apc" ipaddr="morph-apc" login="apc" passwd="apc"/>
  </fencedevices>
</cluster>

Looks like gulm LT is getting started (and failing) before gulm core starts.

Mar 1 10:14:35 morph-01 lock_gulmd_main[5806]: Forked lock_gulmd_core.
Mar 1 10:14:36 morph-01 lock_gulmd_main[5806]: Forked lock_gulmd_LT.
Mar 1 10:14:36 morph-01 lock_gulmd_LT[7569]: Starting lock_gulmd_LT 1.0.6. (built Feb 20 2006 13:34:52) Copyright (C) 2004 Red Hat, Inc. All rights reserved.
Mar 1 10:14:36 morph-01 lock_gulmd_LT[7569]: I am running in Fail-over mode.
Mar 1 10:14:36 morph-01 lock_gulmd_LT[7569]: I am (morph-01) with ip (::ffff:10.15.89.61)
Mar 1 10:14:36 morph-01 lock_gulmd_LT[7569]: This is cluster morph-cluster
Mar 1 10:14:36 morph-01 lock_gulmd_LT000[7569]: Locktable 0 started.
Mar 1 10:14:36 morph-01 lock_gulmd_LT000[7569]: ERROR [src/lock_io.c:531] Failed to connect to core. 111:Connection refused
Mar 1 10:14:36 morph-01 lock_gulmd_core[7570]: Starting lock_gulmd_core 1.0.6. (built Feb 20 2006 13:34:52) Copyright (C) 2004 Red Hat, Inc. All rights reserved.
Mar 1 10:14:36 morph-01 lock_gulmd_core[7570]: I am running in Fail-over mode.
Mar 1 10:14:36 morph-01 lock_gulmd_core[7570]: I am (morph-01) with ip (::ffff:10.15.89.61)
Mar 1 10:14:36 morph-01 lock_gulmd_core[7570]: This is cluster morph-cluster
Mar 1 10:14:36 morph-01 lock_gulmd_core[7570]: In state: Pending
Mar 1 10:14:37 morph-01 lock_gulmd_core[7570]: New Service "Magma::5503" connected. idx:1 fd:6
Mar 1 10:14:37 morph-01 lock_gulmd_core[7570]: EOF on xdr (Magma::5503 ::1 idx:1 fd:6)
Mar 1 10:14:37 morph-01 lock_gulmd_core[7570]: Closing connection idx:1, fd:6 to Magma::5503
Mar 1 10:14:37 morph-01 lock_gulmd_main[5806]: Forked lock_gulmd_LTPX.
Mar 1 10:14:37 morph-01 lock_gulmd_LTPX[7574]: Starting lock_gulmd_LTPX 1.0.6. (built Feb 20 2006 13:34:52) Copyright (C) 2004 Red Hat, Inc. All rights reserved.

Restarting the gulm daemon on the master that had the problems fixed this issue.

Changing ownership to cfeist as this is a gulm issue.

This appears to be an issue with the gulm processes not starting up at the correct times.

Modified gulm_lt & gulm_ltpx to retry if they are unable to connect the first time. Fix should be in the next gulm build.

An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2006-0553.html
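
For readers unfamiliar with the pattern, the retry-on-connect behaviour described for gulm_lt and gulm_ltpx can be illustrated roughly as follows. This is only a sketch in Python and not the actual gulm change (which is not shown in this report); the function name connect_with_retry, the attempts and delay parameters, and the injectable connect argument are all hypothetical and exist only for illustration.

```python
import socket
import time


def connect_with_retry(host, port, attempts=5, delay=1.0,
                       connect=socket.create_connection):
    """Connect to (host, port), retrying while the peer refuses connections.

    Hypothetical helper: `attempts`, `delay` and `connect` are illustrative
    parameters, not anything taken from the gulm sources.
    """
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            # Succeeds as soon as the peer (e.g. the core process) is listening.
            return connect((host, port))
        except ConnectionRefusedError as err:
            # errno 111 in the logs above: nothing is listening yet.
            last_error = err
            if attempt < attempts:
                time.sleep(delay)
    # Every attempt was refused; surface the last error to the caller.
    raise last_error
```

With a loop like this, a client forked slightly before its server is listening would see one or two refused attempts and then connect, rather than staying disconnected until the daemon is restarted.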