Bug 239080

Summary: gulm and clustermon interaction causes gulm to fence cluster members while running IO load.
Product: [Retired] Red Hat Cluster Suite
Reporter: Dean Jansa <djansa>
Component: gulm
Assignee: Chris Feist <cfeist>
Status: CLOSED WONTFIX
QA Contact: Cluster QE <mspqa-list>
Severity: medium
Docs Contact:
Priority: medium
Version: 4
CC: cluster-maint
Target Milestone: ---
Target Release: ---
Hardware: All
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2010-01-04 21:19:45 UTC
Type: ---

Description Dean Jansa 2007-05-04 19:24:04 UTC
Description of problem:

While running IO (with data journaling turned on) I am seeing a lot of:

May  4 14:16:21 link-13 lock_gulmd_core[16220]: "Magma::18618" is logged out. fd:15 

messages. Gulm appears to become confused and lose messages, which forces cluster members to be
fenced. In turn, other members attempt to rejoin and get:

May  4 08:56:42 link-13 lock_gulmd_core[7156]: ERROR [src/core_io.c:1066] Got error from reply: (link-15.lab.msp.redhat.com ::ffff:10.15.89.165) 1008:Bad State Change

I re-ran the tests after disabling modcluster (chkconfig modcluster off) and was not able to
reproduce the issue. With modcluster turned back on, I can reproduce it every time.
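
To compare a run with modcluster enabled against one with it disabled, the two symptom messages quoted above can be counted in each node's syslog. The following is a minimal sketch, not part of the original report: the function name count_symptoms and the SYMPTOMS table are hypothetical, and the patterns are derived only from the two messages shown here.

```python
import re
from collections import Counter

# Patterns are derived from the two messages quoted in this report.
# The names SYMPTOMS and count_symptoms are hypothetical helpers.
SYMPTOMS = {
    "client_logged_out": re.compile(
        r'lock_gulmd_core\[\d+\]: "Magma::\d+" is logged out'
    ),
    "bad_state_change": re.compile(r"1008:Bad State Change"),
}


def count_symptoms(log_text):
    """Count how often each symptom message occurs in a syslog excerpt."""
    # Collapse hard line wraps so a record split across lines still matches.
    flat = " ".join(log_text.split())
    counts = Counter()
    for name, pattern in SYMPTOMS.items():
        counts[name] = len(pattern.findall(flat))
    return counts


sample = '''May  4 14:16:21 link-13 lock_gulmd_core[16220]: "Magma::18618" is logged out. fd:15
May  4 08:56:42 link-13 lock_gulmd_core[7156]: ERROR [src/core_io.c:1066] Got error from reply:
(link-15.lab.msp.redhat.com ::ffff:10.15.89.165) 1008:Bad State Change
'''

for name, total in count_symptoms(sample).items():
    print(name, total)
```

Running it over the same time window from a run with modcluster on and a run with it off gives a rough side-by-side view of how often each message appears.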


Version-Release number of selected component (if applicable):

gulm-1.0.10-0.ia64
gulm-debuginfo-1.0.10-0.ia64
gulm-devel-1.0.10-0.ia64
modcluster-0.9.1-6.ia64

Comment 1 Dean Jansa 2007-05-04 19:26:52 UTC
A more detailed portion of the log, showing a typical scenario:

May  1 13:36:15 link-13 lock_gulmd_core[6829]: "Magma::9794" is logged out. fd:14
May  1 13:36:19 link-13 kernel: lock_gulmd(6836): unaligned access to 0x600000000003d941, ip=0x400000000005d741
May  1 13:36:52 link-13 lock_gulmd_LT000[6836]: EOF on xdr (link-14 ::ffff:10.15.89.164 idx:4 fd:9)
May  1 13:37:10 link-13 lock_gulmd_core[6829]: link-14 missed a heartbeat (time:1178044630197028 mb:1)
May  1 13:37:10 link-13 kernel: lock_gulmd(6836): unaligned access to 0x600000000003d941, ip=0x400000000005d300
May  1 13:37:10 link-13 lock_gulmd_core[6829]: link-15 missed a heartbeat (time:1178044630197028 mb:1)
May  1 13:37:10 link-13 lock_gulmd_LT000[6836]: EOF on xdr (link-15 ::ffff:10.15.89.165 idx:6 fd:11)
May  1 13:37:10 link-13 kernel: lock_gulmd(6836): unaligned access to 0x600000000003d941, ip=0x400000000005dc10
May  1 13:37:10 link-13 lock_gulmd_core[6829]: link-16 missed a heartbeat (time:1178044630197028 mb:1)
May  1 13:37:10 link-13 lock_gulmd_LT000[6836]: ERROR [src/lock_io.c:1685] Warning! When trying to send a 0x674c4300:gulm_lock_cb_state packet, we got a -32:32:Broken pipe
May  1 13:37:10 link-13 kernel: lock_gulmd(6839): unaligned access to 0x60000000009dbca1, ip=0x400000000005d741
May  1 13:37:10 link-13 lock_gulmd_core[6829]: link-13 missed a heartbeat (time:1178044630197028 mb:1)
May  1 13:37:10 link-13 kernel: lock_gulmd(6836): unaligned access to 0x600000000003e521, ip=0x400000000005d741
May  1 13:37:10 link-13 lock_gulmd_core[6829]: ERROR [src/core_io.c:2082] POLLHUP on idx:4 fd:9 name:link-14
May  1 13:37:10 link-13 kernel: lock_gulmd(6836): unaligned access to 0x600000000003e521, ip=0x400000000005d300
May  1 13:37:10 link-13 lock_gulmd_core[6829]: ERROR [src/core_io.c:2082] POLLHUP on idx:5 fd:10 name:link-15
May  1 13:37:10 link-13 kernel: lock_gulmd(6836): unaligned access to 0x600000000003e521, ip=0x400000000005dc10
May  1 13:37:10 link-13 lock_gulmd_core[6829]: Core lost slave quorum. Have 1, need 2. Switching to Arbitrating.
May  1 13:37:10 link-13 kernel: lock_gulmd(6836): unaligned access to 0x600000000003e521, ip=0x400000000005d611
May  1 13:37:10 link-13 lock_gulmd_core[6829]: Could not send quorum update to slave link-14
May  1 13:37:10 link-13 kernel: lock_gulmd(6839): unaligned access to 0x60000000009dc551, ip=0x400000000005d741
May  1 13:37:10 link-13 lock_gulmd_core[6829]: ERROR [src/core_resources.c:302] Error sending core state information to child Magma::9796: Broken pipe
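
When reading an excerpt like this one, it can help to pull out the "missed a heartbeat" records and decode their time field. The sketch below is an illustration only: the helper name missed_heartbeats is hypothetical, and it assumes the time value is microseconds since the Unix epoch. That assumption is consistent with this log (1178044630 seconds is 18:37:10 UTC on 2007-05-01, matching the 13:37:10 local syslog stamp), but it is not confirmed by the gulm source here. The mb value is copied through without interpretation.

```python
import re
from collections import Counter
from datetime import datetime, timedelta, timezone

# Matches e.g. "link-14 missed a heartbeat (time:1178044630197028 mb:1)".
# \s+ before "mb" tolerates the hard line wrap seen in pasted logs.
MISSED = re.compile(r"(\S+) missed a heartbeat \(time:(\d+)\s+mb:(\d+)\)")

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def missed_heartbeats(log_text):
    """Return (node, utc_time, mb) for each missed-heartbeat record.

    Assumption: the time field is microseconds since the Unix epoch.
    """
    events = []
    for node, usec, mb in MISSED.findall(log_text):
        # Integer microseconds avoid float rounding on 16-digit values.
        when = EPOCH + timedelta(microseconds=int(usec))
        events.append((node, when, int(mb)))
    return events


sample = """May  1 13:37:10 link-13 lock_gulmd_core[6829]: link-14 missed a heartbeat (time:1178044630197028
mb:1)
May  1 13:37:10 link-13 lock_gulmd_core[6829]: link-15 missed a heartbeat (time:1178044630197028
mb:1)
May  1 13:37:10 link-13 lock_gulmd_core[6829]: link-16 missed a heartbeat (time:1178044630197028
mb:1)
May  1 13:37:10 link-13 lock_gulmd_core[6829]: link-13 missed a heartbeat (time:1178044630197028
mb:1)
"""

events = missed_heartbeats(sample)
per_node = Counter(node for node, _, _ in events)
for node, when, mb in events:
    print(node, when.isoformat(), mb)
```

In this excerpt all four nodes, including link-13 itself, are reported with the same time value.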

Comment 5 Corey Marthaler 2008-05-21 20:07:22 UTC
This still appears to be reproducible.

May 21 14:38:46 grant-03 lock_gulmd_main[3087]: Forked lock_gulmd_core.
May 21 14:38:46 grant-03 lock_gulmd_core[3089]: Starting lock_gulmd_core 1.0.10. (built Mar 14 2007 16:40:42) Copyright (C) 2004 Red Hat, Inc.  All rights reserved.
May 21 14:38:46 grant-03 lock_gulmd_core[3089]: I am running in Fail-over mode.
May 21 14:38:46 grant-03 lock_gulmd_core[3089]: I am (grant-03) with ip (::ffff:10.15.89.153)
May 21 14:38:46 grant-03 lock_gulmd_core[3089]: This is cluster GRANT
May 21 14:38:46 grant-03 lock_gulmd_core[3089]: EOF on xdr (Magma::3024 ::1 idx:2 fd:7)
May 21 14:38:47 grant-03 lock_gulmd_main[3087]: Forked lock_gulmd_LT.
May 21 14:38:47 grant-03 lock_gulmd_LT[3092]: Starting lock_gulmd_LT 1.0.10. (built Mar 14 2007 16:40:42) Copyright (C) 2004 Red Hat, Inc.  All rights reserved.
May 21 14:38:47 grant-03 lock_gulmd_LT[3092]: I am running in Fail-over mode.
May 21 14:38:47 grant-03 lock_gulmd_LT[3092]: I am (grant-03) with ip (::ffff:10.15.89.153)
May 21 14:38:47 grant-03 lock_gulmd_LT[3092]: This is cluster GRANT
May 21 14:38:47 grant-03 lock_gulmd_core[3089]: EOF on xdr (Magma::3024 ::1 idx:3 fd:8)
May 21 14:38:48 grant-03 lock_gulmd_main[3087]: Forked lock_gulmd_LTPX.
May 21 14:38:48 grant-03 lock_gulmd_LTPX[3096]: Starting lock_gulmd_LTPX 1.0.10. (built Mar 14 2007 16:40:42) Copyright (C) 2004 Red Hat, Inc.  All rights reserved.
May 21 14:38:48 grant-03 lock_gulmd_LTPX[3096]: I am running in Fail-over mode.
May 21 14:38:48 grant-03 lock_gulmd_LTPX[3096]: I am (grant-03) with ip (::ffff:10.15.89.153)
May 21 14:38:48 grant-03 lock_gulmd_LTPX[3096]: This is cluster GRANT
May 21 14:38:48 grant-03 ccsd[3023]: Connected to cluster infrastruture via: GuLM Plugin v1.0.5
May 21 14:38:48 grant-03 ccsd[3023]: Initial status:: Inquorate
May 21 14:38:49 grant-03 lock_gulmd_core[3089]: ERROR [src/core_io.c:1066] Got error from reply: (grant-02.lab.msp.redhat.com ::ffff:10.15.89.152) 1008:Bad State Change
May 21 14:38:52 grant-03 lock_gulmd_core[3089]: ERROR [src/core_io.c:1066] Got error from reply: (grant-02.lab.msp.redhat.com ::ffff:10.15.89.152) 1008:Bad State Change
May 21 14:38:52 grant-03 lock_gulmd_LTPX[3096]: finished.
May 21 14:38:52 grant-03 lock_gulmd_core[3089]: finished.
May 21 14:38:52 grant-03 lock_gulmd_LT000[3092]: EOF on xdr (_ core _ ::1 idx:1 fd:6)
May 21 14:38:52 grant-03 lock_gulmd_LT000[3092]: In src/lock_io.c:419 (1.0.10) death by: Lost connection to core, cannot continue.node reset required to re-activate cluster operations.
May 21 14:38:52 grant-03 ccsd[3023]: Cluster manager shutdown.  Attemping to reconnect...
May 21 14:38:53 grant-03 lock_gulmd: startup failed
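
To check other nodes for the same failed-rejoin sequence, the "Got error from reply" records and the final "startup failed" line can be extracted from syslog. This is a rough sketch under stated assumptions: rejected_replies and startup_failed are hypothetical helper names, and the pattern handles only the 1008:Bad State Change error shown in this report, not other gulm error codes.

```python
import re

# Matches the rejected-reply record quoted in this report. Only the
# "1008:Bad State Change" form seen here is handled.
REPLY_ERROR = re.compile(
    r"Got error from reply: \((\S+) (\S+)\) (\d+):Bad State Change"
)


def _flatten(log_text):
    """Collapse hard line wraps so records split across lines still match."""
    return " ".join(log_text.split())


def rejected_replies(log_text):
    """Return one dict per rejected reply: peer name, address, error code."""
    return [
        {"peer": host, "address": addr, "code": int(code)}
        for host, addr, code in REPLY_ERROR.findall(_flatten(log_text))
    ]


def startup_failed(log_text):
    """True if the excerpt contains the lock_gulmd startup failure line."""
    return "lock_gulmd: startup failed" in _flatten(log_text)


sample = """May 21 14:38:49 grant-03 lock_gulmd_core[3089]: ERROR [src/core_io.c:1066] Got
error from reply: (grant-02.lab.msp.redhat.com ::ffff:10.15.89.152) 1008:Bad
State Change
May 21 14:38:52 grant-03 lock_gulmd_core[3089]: ERROR [src/core_io.c:1066] Got
error from reply: (grant-02.lab.msp.redhat.com ::ffff:10.15.89.152) 1008:Bad
State Change
May 21 14:38:53 grant-03 lock_gulmd: startup failed
"""

replies = rejected_replies(sample)
for reply in replies:
    print(reply["peer"], reply["address"], reply["code"])
print("startup failed:", startup_failed(sample))
```

In the excerpt above, both rejected replies come from grant-02 and are followed within seconds by the startup failure on grant-03.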