Description of problem:
While running IO (with data journaling turned on) I am seeing a lot of these messages:

May 4 14:16:21 link-13 lock_gulmd_core[16220]: "Magma::18618" is logged out. fd:15

It seems Gulm becomes confused and is losing messages. This forces cluster members to be fenced; other members then attempt to rejoin and get:

May 4 08:56:42 link-13 lock_gulmd_core[7156]: ERROR [src/core_io.c:1066] Got error from reply: (link-15.lab.msp.redhat.com ::ffff:10.15.89.165) 1008:Bad State Change

I re-ran the tests after running chkconfig modcluster off and was not able to reproduce the issue. With modcluster turned back on, I can reproduce it every time.

Version-Release number of selected component (if applicable):
gulm-1.0.10-0.ia64
gulm-debuginfo-1.0.10-0.ia64
gulm-devel-1.0.10-0.ia64
modcluster-0.9.1-6.ia64
A more detailed portion of the log, showing a typical scenario:

May 1 13:36:15 link-13 lock_gulmd_core[6829]: "Magma::9794" is logged out. fd:14
May 1 13:36:19 link-13 kernel: lock_gulmd(6836): unaligned access to 0x600000000003d941, ip=0x400000000005d741
May 1 13:36:52 link-13 lock_gulmd_LT000[6836]: EOF on xdr (link-14 ::ffff:10.15.89.164 idx:4 fd:9)
May 1 13:37:10 link-13 lock_gulmd_core[6829]: link-14 missed a heartbeat (time:1178044630197028 mb:1)
May 1 13:37:10 link-13 kernel: lock_gulmd(6836): unaligned access to 0x600000000003d941, ip=0x400000000005d300
May 1 13:37:10 link-13 lock_gulmd_core[6829]: link-15 missed a heartbeat (time:1178044630197028 mb:1)
May 1 13:37:10 link-13 lock_gulmd_LT000[6836]: EOF on xdr (link-15 ::ffff:10.15.89.165 idx:6 fd:11)
May 1 13:37:10 link-13 kernel: lock_gulmd(6836): unaligned access to 0x600000000003d941, ip=0x400000000005dc10
May 1 13:37:10 link-13 lock_gulmd_core[6829]: link-16 missed a heartbeat (time:1178044630197028 mb:1)
May 1 13:37:10 link-13 lock_gulmd_LT000[6836]: ERROR [src/lock_io.c:1685] Warning! When trying to send a 0x674c4300:gulm_lock_cb_state packet, we got a -32:32:Broken pipe
May 1 13:37:10 link-13 kernel: lock_gulmd(6839): unaligned access to 0x60000000009dbca1, ip=0x400000000005d741
May 1 13:37:10 link-13 lock_gulmd_core[6829]: link-13 missed a heartbeat (time:1178044630197028 mb:1)
May 1 13:37:10 link-13 kernel: lock_gulmd(6836): unaligned access to 0x600000000003e521, ip=0x400000000005d741
May 1 13:37:10 link-13 lock_gulmd_core[6829]: ERROR [src/core_io.c:2082] POLLHUP on idx:4 fd:9 name:link-14
May 1 13:37:10 link-13 kernel: lock_gulmd(6836): unaligned access to 0x600000000003e521, ip=0x400000000005d300
May 1 13:37:10 link-13 lock_gulmd_core[6829]: ERROR [src/core_io.c:2082] POLLHUP on idx:5 fd:10 name:link-15
May 1 13:37:10 link-13 kernel: lock_gulmd(6836): unaligned access to 0x600000000003e521, ip=0x400000000005dc10
May 1 13:37:10 link-13 lock_gulmd_core[6829]: Core lost slave quorum. Have 1, need 2. Switching to Arbitrating.
May 1 13:37:10 link-13 kernel: lock_gulmd(6836): unaligned access to 0x600000000003e521, ip=0x400000000005d611
May 1 13:37:10 link-13 lock_gulmd_core[6829]: Could not send quorum update to slave link-14
May 1 13:37:10 link-13 kernel: lock_gulmd(6839): unaligned access to 0x60000000009dc551, ip=0x400000000005d741
May 1 13:37:10 link-13 lock_gulmd_core[6829]: ERROR [src/core_resources.c:302] Error sending core state information to child Magma::9796: Broken pipe
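
For anyone trying to compare runs with modcluster on and off, a rough way to summarize a log like the one above is to count how often each kind of message shows up. The following is only a minimal sketch in Python: the category names and the tally_events helper are hypothetical and not part of gulm or modcluster, and the match strings are simply copied from the messages quoted in this report.

```python
import re
from collections import Counter

# Hypothetical category names. The patterns are substrings copied from the
# syslog messages quoted in this report, not an official list of gulm errors.
PATTERNS = {
    "logged_out": re.compile(r"is logged out"),
    "unaligned_access": re.compile(r"unaligned access"),
    "missed_heartbeat": re.compile(r"missed a heartbeat"),
    "eof_on_xdr": re.compile(r"EOF on xdr"),
    "pollhup": re.compile(r"POLLHUP"),
    "broken_pipe": re.compile(r"Broken pipe"),
    "lost_quorum": re.compile(r"lost slave quorum"),
    "bad_state_change": re.compile(r"Bad State Change"),
}


def tally_events(log_text):
    """Count how many log lines match each category in PATTERNS.

    A line that matches several patterns is counted once per pattern.
    """
    counts = Counter()
    for line in log_text.splitlines():
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[name] += 1
    return counts
```

Calling tally_events on the text of a syslog capture from each run and comparing the two Counter results should make it easier to see whether the "logged out" and heartbeat messages only appear when modcluster is enabled.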
This appears to still be reproducible.

May 21 14:38:46 grant-03 lock_gulmd_main[3087]: Forked lock_gulmd_core.
May 21 14:38:46 grant-03 lock_gulmd_core[3089]: Starting lock_gulmd_core 1.0.10. (built Mar 14 2007 16:40:42) Copyright (C) 2004 Red Hat, Inc. All rights reserved.
May 21 14:38:46 grant-03 lock_gulmd_core[3089]: I am running in Fail-over mode.
May 21 14:38:46 grant-03 lock_gulmd_core[3089]: I am (grant-03) with ip (::ffff:10.15.89.153)
May 21 14:38:46 grant-03 lock_gulmd_core[3089]: This is cluster GRANT
May 21 14:38:46 grant-03 lock_gulmd_core[3089]: EOF on xdr (Magma::3024 ::1 idx:2 fd:7)
May 21 14:38:47 grant-03 lock_gulmd_main[3087]: Forked lock_gulmd_LT.
May 21 14:38:47 grant-03 lock_gulmd_LT[3092]: Starting lock_gulmd_LT 1.0.10. (built Mar 14 2007 16:40:42) Copyright (C) 2004 Red Hat, Inc. All rights reserved.
May 21 14:38:47 grant-03 lock_gulmd_LT[3092]: I am running in Fail-over mode.
May 21 14:38:47 grant-03 lock_gulmd_LT[3092]: I am (grant-03) with ip (::ffff:10.15.89.153)
May 21 14:38:47 grant-03 lock_gulmd_LT[3092]: This is cluster GRANT
May 21 14:38:47 grant-03 lock_gulmd_core[3089]: EOF on xdr (Magma::3024 ::1 idx:3 fd:8)
May 21 14:38:48 grant-03 lock_gulmd_main[3087]: Forked lock_gulmd_LTPX.
May 21 14:38:48 grant-03 lock_gulmd_LTPX[3096]: Starting lock_gulmd_LTPX 1.0.10. (built Mar 14 2007 16:40:42) Copyright (C) 2004 Red Hat, Inc. All rights reserved.
May 21 14:38:48 grant-03 lock_gulmd_LTPX[3096]: I am running in Fail-over mode.
May 21 14:38:48 grant-03 lock_gulmd_LTPX[3096]: I am (grant-03) with ip (::ffff:10.15.89.153)
May 21 14:38:48 grant-03 lock_gulmd_LTPX[3096]: This is cluster GRANT
May 21 14:38:48 grant-03 ccsd[3023]: Connected to cluster infrastruture via: GuLM Plugin v1.0.5
May 21 14:38:48 grant-03 ccsd[3023]: Initial status:: Inquorate
May 21 14:38:49 grant-03 lock_gulmd_core[3089]: ERROR [src/core_io.c:1066] Got error from reply: (grant-02.lab.msp.redhat.com ::ffff:10.15.89.152) 1008:Bad State Change
May 21 14:38:52 grant-03 lock_gulmd_core[3089]: ERROR [src/core_io.c:1066] Got error from reply: (grant-02.lab.msp.redhat.com ::ffff:10.15.89.152) 1008:Bad State Change
May 21 14:38:52 grant-03 lock_gulmd_LTPX[3096]: finished.
May 21 14:38:52 grant-03 lock_gulmd_core[3089]: finished.
May 21 14:38:52 grant-03 lock_gulmd_LT000[3092]: EOF on xdr (_ core _ ::1 idx:1 fd:6)
May 21 14:38:52 grant-03 lock_gulmd_LT000[3092]: In src/lock_io.c:419 (1.0.10) death by: Lost connection to core, cannot continue.node reset required to re-activate cluster operations.
May 21 14:38:52 grant-03 ccsd[3023]: Cluster manager shutdown. Attemping to reconnect...
May 21 14:38:53 grant-03 lock_gulmd: startup failed