239080 – gulm and clustermon interaction causes gulm to fence cluster members while running IO load.

Keywords:
Status:	CLOSED WONTFIX
Alias:	None
Product:	Red Hat Cluster Suite
Classification:	Retired
Component:	gulm
Sub Component:
Version:	4
Hardware:	All
OS:	Linux
Priority:	medium
Severity:	medium
Target Milestone:	---
Assignee:	Chris Feist
QA Contact:	Cluster QE
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2007-05-04 19:24 UTC by Dean Jansa
Modified:	2016-04-26 14:06 UTC
CC List:	1 user
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2010-01-04 21:19:45 UTC
Embargoed:

Description Dean Jansa 2007-05-04 19:24:04 UTC

Description of problem:

While running an I/O load (with data journaling turned on), I see many messages like this:

May  4 14:16:21 link-13 lock_gulmd_core[16220]: "Magma::18618" is logged out. fd:15

gulm appears to become confused and lose messages, which causes it to fence cluster members. When the other members then attempt to rejoin, they get:

May  4 08:56:42 link-13 lock_gulmd_core[7156]: ERROR [src/core_io.c:1066] Got error from reply: (link-15.lab.msp.redhat.com ::ffff:10.15.89.165) 1008:Bad State Change

I re-ran the tests after running chkconfig modcluster off and could not reproduce the issue.
With modcluster turned back on, I can reproduce it every time.


Version-Release number of selected component (if applicable):

gulm-1.0.10-0.ia64
gulm-debuginfo-1.0.10-0.ia64
gulm-devel-1.0.10-0.ia64
modcluster-0.9.1-6.ia64

Comment 1 Dean Jansa 2007-05-04 19:26:52 UTC

A more detailed portion of the log, showing a typical scenario:

May  1 13:36:15 link-13 lock_gulmd_core[6829]: "Magma::9794" is logged out. fd:14
May  1 13:36:19 link-13 kernel: lock_gulmd(6836): unaligned access to 0x600000000003d941, ip=0x400000000005d741
May  1 13:36:52 link-13 lock_gulmd_LT000[6836]: EOF on xdr (link-14 ::ffff:10.15.89.164 idx:4 fd:9)
May  1 13:37:10 link-13 lock_gulmd_core[6829]: link-14 missed a heartbeat (time:1178044630197028 mb:1)
May  1 13:37:10 link-13 kernel: lock_gulmd(6836): unaligned access to 0x600000000003d941, ip=0x400000000005d300
May  1 13:37:10 link-13 lock_gulmd_core[6829]: link-15 missed a heartbeat (time:1178044630197028 mb:1)
May  1 13:37:10 link-13 lock_gulmd_LT000[6836]: EOF on xdr (link-15 ::ffff:10.15.89.165 idx:6 fd:11)
May  1 13:37:10 link-13 kernel: lock_gulmd(6836): unaligned access to 0x600000000003d941, ip=0x400000000005dc10
May  1 13:37:10 link-13 lock_gulmd_core[6829]: link-16 missed a heartbeat (time:1178044630197028 mb:1)
May  1 13:37:10 link-13 lock_gulmd_LT000[6836]: ERROR [src/lock_io.c:1685] Warning! When trying to send a 0x674c4300:gulm_lock_cb_state packet, we got a -32:32:Broken pipe
May  1 13:37:10 link-13 kernel: lock_gulmd(6839): unaligned access to 0x60000000009dbca1, ip=0x400000000005d741
May  1 13:37:10 link-13 lock_gulmd_core[6829]: link-13 missed a heartbeat (time:1178044630197028 mb:1)
May  1 13:37:10 link-13 kernel: lock_gulmd(6836): unaligned access to 0x600000000003e521, ip=0x400000000005d741
May  1 13:37:10 link-13 lock_gulmd_core[6829]: ERROR [src/core_io.c:2082] POLLHUP on idx:4 fd:9 name:link-14
May  1 13:37:10 link-13 kernel: lock_gulmd(6836): unaligned access to 0x600000000003e521, ip=0x400000000005d300
May  1 13:37:10 link-13 lock_gulmd_core[6829]: ERROR [src/core_io.c:2082] POLLHUP on idx:5 fd:10 name:link-15
May  1 13:37:10 link-13 kernel: lock_gulmd(6836): unaligned access to 0x600000000003e521, ip=0x400000000005dc10
May  1 13:37:10 link-13 lock_gulmd_core[6829]: Core lost slave quorum. Have 1, need 2. Switching to Arbitrating.
May  1 13:37:10 link-13 kernel: lock_gulmd(6836): unaligned access to 0x600000000003e521, ip=0x400000000005d611
May  1 13:37:10 link-13 lock_gulmd_core[6829]: Could not send quorum update to slave link-14
May  1 13:37:10 link-13 kernel: lock_gulmd(6839): unaligned access to 0x60000000009dc551, ip=0x400000000005d741
May  1 13:37:10 link-13 lock_gulmd_core[6829]: ERROR [src/core_resources.c:302] Error sending core state information to child Magma::9796: Broken pipe
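
For anyone triaging a similar failure, the message types seen in this log can be tallied with a small script. This is only a sketch and not part of gulm: the function name tally_gulm_events and the category labels are made up for illustration, the patterns are copied from the messages quoted in this report and are not an exhaustive list of gulm output, and the input is assumed to have one syslog record per line.

```python
import re
from collections import Counter

# Patterns copied from messages quoted in this bug report.
# The category names on the left are illustrative labels, not gulm terms.
PATTERNS = {
    "logged_out": re.compile(r'"Magma::\d+" is logged out'),
    "missed_heartbeat": re.compile(r"\S+ missed a heartbeat"),
    "eof_on_xdr": re.compile(r"EOF on xdr"),
    "pollhup": re.compile(r"POLLHUP on idx:\d+ fd:\d+"),
    "broken_pipe": re.compile(r"Broken pipe"),
    "bad_state_change": re.compile(r"1008:Bad State Change"),
    "lost_quorum": re.compile(r"Core lost slave quorum"),
    "unaligned_access": re.compile(r"unaligned access to 0x[0-9a-f]+"),
}


def tally_gulm_events(lines):
    """Count how often each known message type appears in lock_gulmd records.

    Assumes one syslog record per line; records that do not mention
    lock_gulmd are skipped.
    """
    counts = Counter()
    for line in lines:
        if "lock_gulmd" not in line:
            continue
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[name] += 1
    return counts
```

Assuming the syslog file is /var/log/messages, it could be run as tally_gulm_events(open("/var/log/messages")). Comparing the counts from a run with modcluster enabled against a run with it disabled would show which message types track the problem.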

Comment 5 Corey Marthaler 2008-05-21 20:07:22 UTC

This still appears to be reproducible.

May 21 14:38:46 grant-03 lock_gulmd_main[3087]: Forked lock_gulmd_core.
May 21 14:38:46 grant-03 lock_gulmd_core[3089]: Starting lock_gulmd_core 1.0.10. (built Mar 14 2007 16:40:42) Copyright (C) 2004 Red Hat, Inc.  All rights reserved.
May 21 14:38:46 grant-03 lock_gulmd_core[3089]: I am running in Fail-over mode.
May 21 14:38:46 grant-03 lock_gulmd_core[3089]: I am (grant-03) with ip (::ffff:10.15.89.153)
May 21 14:38:46 grant-03 lock_gulmd_core[3089]: This is cluster GRANT
May 21 14:38:46 grant-03 lock_gulmd_core[3089]: EOF on xdr (Magma::3024 ::1 idx:2 fd:7)
May 21 14:38:47 grant-03 lock_gulmd_main[3087]: Forked lock_gulmd_LT.
May 21 14:38:47 grant-03 lock_gulmd_LT[3092]: Starting lock_gulmd_LT 1.0.10. (built Mar 14 2007 16:40:42) Copyright (C) 2004 Red Hat, Inc.  All rights reserved.
May 21 14:38:47 grant-03 lock_gulmd_LT[3092]: I am running in Fail-over mode.
May 21 14:38:47 grant-03 lock_gulmd_LT[3092]: I am (grant-03) with ip (::ffff:10.15.89.153)
May 21 14:38:47 grant-03 lock_gulmd_LT[3092]: This is cluster GRANT
May 21 14:38:47 grant-03 lock_gulmd_core[3089]: EOF on xdr (Magma::3024 ::1 idx:3 fd:8)
May 21 14:38:48 grant-03 lock_gulmd_main[3087]: Forked lock_gulmd_LTPX.
May 21 14:38:48 grant-03 lock_gulmd_LTPX[3096]: Starting lock_gulmd_LTPX 1.0.10. (built Mar 14 2007 16:40:42) Copyright (C) 2004 Red Hat, Inc.  All rights reserved.
May 21 14:38:48 grant-03 lock_gulmd_LTPX[3096]: I am running in Fail-over mode.
May 21 14:38:48 grant-03 lock_gulmd_LTPX[3096]: I am (grant-03) with ip (::ffff:10.15.89.153)
May 21 14:38:48 grant-03 lock_gulmd_LTPX[3096]: This is cluster GRANT
May 21 14:38:48 grant-03 ccsd[3023]: Connected to cluster infrastruture via: GuLM Plugin v1.0.5
May 21 14:38:48 grant-03 ccsd[3023]: Initial status:: Inquorate
May 21 14:38:49 grant-03 lock_gulmd_core[3089]: ERROR [src/core_io.c:1066] Got error from reply: (grant-02.lab.msp.redhat.com ::ffff:10.15.89.152) 1008:Bad State Change
May 21 14:38:52 grant-03 lock_gulmd_core[3089]: ERROR [src/core_io.c:1066] Got error from reply: (grant-02.lab.msp.redhat.com ::ffff:10.15.89.152) 1008:Bad State Change
May 21 14:38:52 grant-03 lock_gulmd_LTPX[3096]: finished.
May 21 14:38:52 grant-03 lock_gulmd_core[3089]: finished.
May 21 14:38:52 grant-03 lock_gulmd_LT000[3092]: EOF on xdr (_ core _ ::1 idx:1 fd:6)
May 21 14:38:52 grant-03 lock_gulmd_LT000[3092]: In src/lock_io.c:419 (1.0.10) death by: Lost connection to core, cannot continue.node reset required to re-activate cluster operations.
May 21 14:38:52 grant-03 ccsd[3023]: Cluster manager shutdown.  Attemping to reconnect...
May 21 14:38:53 grant-03 lock_gulmd: startup failed