Bug 178081

Summary: lock_gulmd fails to start after node is rebooted
Product: [Retired] Red Hat Cluster Suite
Component: gulm
Version: 4
Status: CLOSED DUPLICATE
Severity: medium
Priority: medium
Hardware: All
OS: Linux
Reporter: Nate Straz <nstraz>
Assignee: Chris Feist <cfeist>
QA Contact: Cluster QE <mspqa-list>
CC: cluster-maint
Doc Type: Bug Fix
Last Closed: 2006-01-17 19:32:45 UTC

Description Nate Straz 2006-01-17 18:56:23 UTC
Description of problem:

While doing recovery testing, some rebooted nodes are unable to start
lock_gulmd because they cannot complete a login with the master ("Cannot
login if you are expired" / "1008:Bad State Change").

Version-Release number of selected component (if applicable):
gulm-1.0.4-0
RHEL4 U2 security errata

How reproducible:
Seems easy now

Steps to Reproduce:
1. Reboot nodes in a gulm cluster

Actual results:

Here is the scenario that ran:

[revolver]      Gulm Status
[revolver]      ===========
[revolver]      tank-01: Master
[revolver]      tank-02: Client
[revolver]      tank-03: Slave
[revolver]      tank-04: Client
[revolver]      tank-05: Slave
[revolver] Those facing the revolver=tank-04 tank-02

Log from master (tank-01):
Jan 17 11:29:35 tank-01 lock_gulmd_core[7089]: tank-04 missed a heartbeat (time: 1137518975515971 mb:1) 
Jan 17 11:29:43 tank-01 lock_gulmd_core[7089]: tank-02 missed a heartbeat (time: 1137518983016768 mb:1) 
Jan 17 11:29:50 tank-01 lock_gulmd_core[7089]: tank-04 missed a heartbeat (time: 1137518990517639 mb:2) 
Jan 17 11:29:58 tank-01 lock_gulmd_core[7089]: tank-02 missed a heartbeat (time: 1137518998019583 mb:2) 
Jan 17 11:30:05 tank-01 lock_gulmd_core[7089]: tank-04 missed a heartbeat (time: 1137519005520346 mb:3) 
Jan 17 11:30:05 tank-01 lock_gulmd_core[7089]: Client (tank-04) expired 
Jan 17 11:30:05 tank-01 lock_gulmd_core[7974]: Gonna exec fence_node -O tank-04 
Jan 17 11:30:05 tank-01 lock_gulmd_core[7089]: Forked [7974] fence_node -O tank-04 with a 0 pause. 
Jan 17 11:30:09 tank-01 fence_node[7974]: Fence of "tank-04" was successful 
Jan 17 11:30:13 tank-01 lock_gulmd_core[7089]: tank-02 missed a heartbeat (time: 1137519013023240 mb:3) 
Jan 17 11:30:13 tank-01 lock_gulmd_core[7089]: Client (tank-02) expired 
Jan 17 11:30:13 tank-01 lock_gulmd_core[7980]: Gonna exec fence_node -O tank-02 
Jan 17 11:30:13 tank-01 lock_gulmd_core[7089]: Forked [7980] fence_node -O tank-02 with a 0 pause. 
Jan 17 11:30:17 tank-01 fence_node[7980]: Fence of "tank-02" was successful 
Jan 17 11:31:39 tank-01 lock_gulmd_core[7089]: (tank-04 ::ffff:10.15.84.94) Cannot login if you are expired. 
Jan 17 11:31:42 tank-01 lock_gulmd_core[7089]: (tank-04 ::ffff:10.15.84.94) Cannot login if you are expired. 
Jan 17 11:32:17 tank-01 lock_gulmd_core[7089]: (tank-02 ::ffff:10.15.84.92) Cannot login if you are expired. 
Jan 17 11:32:20 tank-01 lock_gulmd_core[7089]: (tank-02 ::ffff:10.15.84.92) Cannot login if you are expired. 
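
For reference, the master's side of this is the standard gulm heartbeat-expiry
sequence: three consecutive missed heartbeats (mb:1 through mb:3, about 7.5
seconds apart) expire the client, and the master immediately forks fence_node
against it. A minimal C sketch of that pattern, using hypothetical names and
structures rather than gulm's actual code:

/*
 * Sketch of the heartbeat-expiry pattern visible in the log above.
 * Hypothetical names and structures -- not gulm's actual code.
 */
#include <stdio.h>

#define MAX_MISSED 3                 /* mb:1, mb:2, mb:3 in the log above */

enum node_state { LOGGED_IN, EXPIRED };

struct node {
    const char *name;
    int missed;                      /* consecutive missed heartbeats */
    enum node_state state;
};

/* Called for each heartbeat interval in which no beat arrived in time. */
static void miss_heartbeat(struct node *n)
{
    n->missed++;
    printf("%s missed a heartbeat (mb:%d)\n", n->name, n->missed);
    if (n->missed >= MAX_MISSED && n->state == LOGGED_IN) {
        n->state = EXPIRED;
        printf("Client (%s) expired\n", n->name);
        /* here the real daemon forks "fence_node -O <node>" */
    }
}

int main(void)
{
    struct node tank04 = { "tank-04", 0, LOGGED_IN };
    for (int i = 0; i < MAX_MISSED; i++)
        miss_heartbeat(&tank04);
    return 0;
}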

Log from tank-02:
Jan 17 11:32:09 tank-02 lock_gulmd_main[2349]: Forked lock_gulmd_core. 
Jan 17 11:32:09 tank-02 lock_gulmd_core[2361]: Starting lock_gulmd_core 1.0.4. (built Aug  1 2005 14:54:33) Copyright (C) 2004 Red Hat, Inc.  All rights reserved. 
Jan 17 11:32:09 tank-02 lock_gulmd_core[2361]: I am running in Fail-over mode. 
Jan 17 11:32:09 tank-02 lock_gulmd_core[2361]: I am (tank-02) with ip (::ffff:10.15.84.92) 
Jan 17 11:32:09 tank-02 lock_gulmd_core[2361]: This is cluster tank-cluster 
Jan 17 11:32:09 tank-02 lock_gulmd_core[2361]: ERROR [src/core_io.c:1058] Got error from reply: (tank-01.lab.msp.redhat.com ::ffff:10.15.84.91) 1008:Bad State Change 
Jan 17 11:32:10 tank-02 lock_gulmd_core[2361]: EOF on xdr (Magma::2296 ::1 idx:1 fd:6) 
Jan 17 11:32:10 tank-02 lock_gulmd_main[2349]: Forked lock_gulmd_LT. 
Jan 17 11:32:10 tank-02 lock_gulmd_LT[2365]: Starting lock_gulmd_LT 1.0.4. (built Aug  1 2005 14:54:33) Copyright (C) 2004 Red Hat, Inc.  All rights reserved. 
Jan 17 11:32:10 tank-02 lock_gulmd_LT[2365]: I am running in Fail-over mode. 
Jan 17 11:32:10 tank-02 lock_gulmd_LT[2365]: I am (tank-02) with ip (::ffff:10.15.84.92) 
Jan 17 11:32:10 tank-02 lock_gulmd_LT[2365]: This is cluster tank-cluster 
Jan 17 11:32:10 tank-02 lock_gulmd_LT000[2365]: Not serving locks from this node. 
Jan 17 11:32:11 tank-02 lock_gulmd_core[2361]: EOF on xdr (Magma::2296 ::1 idx:1 fd:6) 
Jan 17 11:32:11 tank-02 lock_gulmd_main[2349]: Forked lock_gulmd_LTPX. 
Jan 17 11:32:11 tank-02 lock_gulmd_LTPX[2370]: Starting lock_gulmd_LTPX 1.0.4. (built Aug  1 2005 14:54:33) Copyright (C) 2004 Red Hat, Inc.  All rights reserved. 
Jan 17 11:32:11 tank-02 lock_gulmd_LTPX[2370]: I am running in Fail-over mode. 
Jan 17 11:32:11 tank-02 lock_gulmd_LTPX[2370]: I am (tank-02) with ip (::ffff:10.15.84.92) 
Jan 17 11:32:11 tank-02 lock_gulmd_LTPX[2370]: This is cluster tank-cluster 
Jan 17 11:32:12 tank-02 ccsd[2295]: Connected to cluster infrastruture via: GuLM Plugin v1.0.3 
Jan 17 11:32:12 tank-02 ccsd[2295]: Initial status:: Inquorate 
Jan 17 11:32:12 tank-02 lock_gulmd_core[2361]: ERROR [src/core_io.c:1058] Got error from reply: (tank-01.lab.msp.redhat.com ::ffff:10.15.84.91) 1008:Bad State Change 
Jan 17 11:32:12 tank-02 lock_gulmd_core[2361]: finished. 
Jan 17 11:32:12 tank-02 lock_gulmd_LTPX[2370]: finished. 

Log from tank-04:

Jan 17 11:31:32 tank-04 lock_gulmd_main[2346]: Forked lock_gulmd_core. 
Jan 17 11:31:32 tank-04 lock_gulmd_core[2352]: Starting lock_gulmd_core 1.0.4. (built Aug  1 2005 14:54:33) Copyright (C) 2004 Red Hat, Inc.  All rights reserved. 
Jan 17 11:31:32 tank-04 lock_gulmd_core[2352]: I am running in Fail-over mode. 
Jan 17 11:31:32 tank-04 lock_gulmd_core[2352]: I am (tank-04) with ip (::ffff:10.15.84.94) 
Jan 17 11:31:32 tank-04 lock_gulmd_core[2352]: This is cluster tank-cluster 
Jan 17 11:31:32 tank-04 lock_gulmd_core[2352]: ERROR [src/core_io.c:1058] Got error from reply: (tank-01.lab.msp.redhat.com ::ffff:10.15.84.91) 1008:Bad State Change 
Jan 17 11:31:32 tank-04 lock_gulmd_core[2352]: EOF on xdr (Magma::2292 ::1 idx:1 fd:6) 
Jan 17 11:31:33 tank-04 lock_gulmd_main[2346]: Forked lock_gulmd_LT. 
Jan 17 11:31:33 tank-04 lock_gulmd_LT[2356]: Starting lock_gulmd_LT 1.0.4. (built Aug  1 2005 14:54:33) Copyright (C) 2004 Red Hat, Inc.  All rights reserved. 
Jan 17 11:31:33 tank-04 lock_gulmd_LT[2356]: I am running in Fail-over mode. 
Jan 17 11:31:33 tank-04 lock_gulmd_LT[2356]: I am (tank-04) with ip (::ffff:10.15.84.94) 
Jan 17 11:31:33 tank-04 lock_gulmd_LT[2356]: This is cluster tank-cluster 
Jan 17 11:31:33 tank-04 lock_gulmd_LT000[2356]: Not serving locks from this node. 
Jan 17 11:31:33 tank-04 lock_gulmd_core[2352]: EOF on xdr (Magma::2292 ::1 idx:1 fd:6) 
Jan 17 11:31:34 tank-04 lock_gulmd_main[2346]: Forked lock_gulmd_LTPX. 
Jan 17 11:31:34 tank-04 lock_gulmd_LTPX[2361]: Starting lock_gulmd_LTPX 1.0.4. (built Aug  1 2005 14:54:33) Copyright (C) 2004 Red Hat, Inc.  All rights reserved. 
Jan 17 11:31:34 tank-04 lock_gulmd_LTPX[2361]: I am running in Fail-over mode. 
Jan 17 11:31:34 tank-04 lock_gulmd_LTPX[2361]: I am (tank-04) with ip (::ffff:10.15.84.94) 
Jan 17 11:31:34 tank-04 lock_gulmd_LTPX[2361]: This is cluster tank-cluster 
Jan 17 11:31:34 tank-04 ccsd[2291]: Connected to cluster infrastruture via: GuLM Plugin v1.0.3 
Jan 17 11:31:34 tank-04 ccsd[2291]: Initial status:: Inquorate 
Jan 17 11:31:35 tank-04 lock_gulmd_core[2352]: ERROR [src/core_io.c:1058] Got error from reply: (tank-01.lab.msp.redhat.com ::ffff:10.15.84.91) 1008:Bad State Change 
Jan 17 11:31:35 tank-04 lock_gulmd_core[2352]: finished. 
Jan 17 11:31:35 tank-04 lock_gulmd_LTPX[2361]: finished. 
Jan 17 11:31:35 tank-04 ccsd[2291]: Cluster manager shutdown.  Attemping to reconnect... 
Jan 17 11:31:36 tank-04 lock_gulmd: startup failed
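
Reading the client logs against the master log, the failure looks like a
state-machine race: each node was fenced, rebooted, and tried to log back in
(tank-04 roughly 80 seconds after its fence completed) while the master still
had it marked expired, so the master replied 1008:Bad State Change and
lock_gulmd gave up. A minimal C sketch of that kind of login check, again
with hypothetical names, not the actual logic in src/core_io.c:

/*
 * Sketch of the suspected race: the master refuses logins from nodes
 * it still has marked expired, so a fenced node that reboots and
 * retries before that state is cleared fails to start.
 * Hypothetical names -- not gulm's actual code.
 */
#include <stdio.h>

#define ERR_BAD_STATE_CHANGE 1008    /* the "1008:Bad State Change" reply */

enum node_state { LOGGED_OUT, LOGGED_IN, EXPIRED };

static int handle_login(const char *name, enum node_state *state)
{
    if (*state == EXPIRED) {
        /* Fenced, but the master has not reset this node's state yet. */
        printf("(%s) Cannot login if you are expired.\n", name);
        return ERR_BAD_STATE_CHANGE;
    }
    *state = LOGGED_IN;
    return 0;
}

int main(void)
{
    /* tank-04 was fenced at 11:30:09 but was still marked expired
     * when it tried to log back in at 11:31:32. */
    enum node_state tank04 = EXPIRED;
    if (handle_login("tank-04", &tank04) != 0)
        printf("lock_gulmd: startup failed\n");
    return 0;
}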


Expected results:

lock_gulmd should start up and join the cluster.

Additional info:

Comment 1 Chris Feist 2006-01-17 19:32:45 UTC

*** This bug has been marked as a duplicate of 171246 ***