Bug 142063 - lock_gulmd_core: POLLHUP ERROR causes nodes not to receive replies from master
Status: CLOSED WORKSFORME
Product: Red Hat Cluster Suite
Classification: Red Hat
Component: gulm
Version: 3
Hardware: i686 Linux
Priority: medium
Severity: medium
Assigned To: Chris Feist
QA Contact: Cluster QE
Depends On:
Blocks:
Reported: 2004-12-06 18:37 EST by Corey Marthaler
Modified: 2009-04-16 16:01 EDT (History)
1 user

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2006-01-04 15:23:51 EST
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments: None
Description Corey Marthaler 2004-12-06 18:37:38 EST
Description of problem:
I saw this while running regression tests on:
GFS-6.0.2-17
GFS-modules-hugemem-6.0.2-17

kernel:
2.4.21-27.ELhugemem


5 node cluster (morph-01 - 05) morph-01 was the Gulm server.

The whole cluster is busy doing I/O when morph-01 gets these errors:

Dec  6 17:15:54 morph-01 lock_gulmd_core[5487]: ERROR [core_io.c:2086] POLLHUP on idx:1 fd:6 name:morph-05.lab.msp.redhat.com
Dec  6 17:15:54 morph-01 lock_gulmd_core[5487]: ERROR [core_io.c:2086] POLLHUP on idx:2 fd:7 name:morph-02.lab.msp.redhat.com
Dec  6 17:15:54 morph-01 lock_gulmd_core[5487]: ERROR [core_io.c:2086] POLLHUP on idx:3 fd:8 name:morph-03.lab.msp.redhat.com
Dec  6 17:15:54 morph-01 lock_gulmd_core[5487]: ERROR [core_io.c:2086] POLLHUP on idx:4 fd:

Then the other 4 nodes in the cluster failed to receive heartbeat replies:

Dec  6 17:15:23 morph-04 lock_gulmd_core[3226]: Failed to receive a timely heartbeat reply from Master. (t:1102374923530825 mb:1)
Dec  6 17:15:38 morph-04 lock_gulmd_core[3226]: Failed to receive a timely heartbeat reply from Master. (t:1102374938530827 mb:2)
Dec  6 17:15:53 morph-04 lock_gulmd_core[3226]: Failed to receive a timely heartbeat reply from Master. (t:1102374953550825 mb:3)
Dec  6 17:15:53 morph-04 lock_gulmd_core[3226]: In core_io.c:425 (v6.0.0) death by: Lost connection to SLM Master (morph-01.lab.msp.redhat.com), stopping. node reset required to re-activate cluster operations.
Dec  6 17:15:53 morph-04 lock_gulmd_LTPX[3228]: EOF on xdr (_ core _:0.0.0.0 idx:1 fd:5)
Dec  6 17:15:53 morph-04 lock_gulmd_LTPX[3228]: In ltpx_io.c:332 (v6.0.0) death by: Lost connection to core, cannot continue. node reset required to re-activate cluster operations.
Dec  6 17:15:53 morph-04 kernel: lock_gulm: ERROR Got an error in gulm_res_recvd err: -71
Dec  6 17:15:53 morph-04 kernel: lock_gulm: ERROR gulm_LT_recver err -71
Dec  6 17:15:56 morph-04 kernel: lock_gulm: ERROR Got a -111 trying to login to lock_gulmd.  Is it running?
Dec  6 17:16:29 morph-04 last message repeated 11 times


morph-01 thinks they all missed their heartbeats and then fences all four of them.

Dec  6 17:16:17 morph-01 lock_gulmd_core[5487]: morph-04.lab.msp.redhat.com missed a heartbeat (time:1102374977064030 mb:1)
Dec  6 17:16:24 morph-01 lock_gulmd_core[5487]: morph-05.lab.msp.redhat.com missed a heartbeat (time:1102374984564045 mb:2)
Dec  6 17:16:32 morph-01 lock_gulmd_core[5487]: morph-02.lab.msp.redhat.com missed a heartbeat (time:1102374992084030 mb:2)
Dec  6 17:16:32 morph-01 lock_gulmd_core[5487]: morph-03.lab.msp.redhat.com missed a heartbeat (time:1102374992084030 mb:2)
Dec  6 17:16:32 morph-01 lock_gulmd_core[5487]: morph-04.lab.msp.redhat.com missed a heartbeat (time:1102374992084030 mb:2)
Dec  6 17:16:39 morph-01 lock_gulmd_core[5487]: morph-05.lab.msp.redhat.com missed a heartbeat (time:1102374999584043 mb:3)
Dec  6 17:16:39 morph-01 lock_gulmd_core[5487]: Client (morph-05.lab.msp.redhat.com) expired
Dec  6 17:16:39 morph-01 lock_gulmd_core[5487]: Could not send membership update "Expired" about morph-05.lab.msp.redhat.com to slave morph-03.lab.msp.redhat.com
Dec  6 17:16:39 morph-01 lock_gulmd_core[5487]: Could not send membership update "Expired" about morph-05.lab.msp.redhat.com to slave morph-04.lab.msp.redhat.com
Dec  6 17:16:39 morph-01 lock_gulmd_core[5487]: Could not send membership update "Expired" about morph-05.lab.msp.redhat.com to slave morph-02.lab.msp.redhat.com
Dec  6 17:16:39 morph-01 lock_gulmd_core[7354]: Gonna exec fence_node morph-05.lab.msp.redhat.com
Dec  6 17:16:39 morph-01 lock_gulmd_core[5487]: Forked [7354] fence_node morph-05.lab.msp.redhat.com with a 0 pause.


How reproducible:
Didn't try
Comment 1 Corey Marthaler 2004-12-07 10:45:49 EST
Reproduced this again right away after restarting the regression-test I/O on the morph cluster.
Comment 2 michael conrad tadpol tilstra 2004-12-07 11:25:06 EST
Well, assuming the timestamps are all in sync, it looks like everyone else (well, morph-04, but presumably similar on the other nodes, judging by the messages and times) missed heartbeats to the SLM Master. Then they disconnected, which generates the POLLHUP on the SLM Master. (gulm_core on morph-04 prints its final message before dying, and the next second the POLLHUP is seen on morph-01.)

So, this seems to be a question of why the heartbeat replies from the Master are being missed (which could imply that the heartbeats are being missed because of a slow-running or buffer-filled gulm_core on morph-01).
Comment 3 michael conrad tadpol tilstra 2004-12-07 14:23:12 EST
What kind of I/O are the nodes doing when this happens?
Comment 4 Corey Marthaler 2004-12-09 14:36:39 EST
The I/O is a random mix of iogen/doio, genesis, and accordion command lines run 2 wide per node, from 6.0/vedder/lib/regression.herd.m4 in the 6.0 sistina-test tree.
Comment 5 michael conrad tadpol tilstra 2004-12-13 09:53:02 EST
I cannot get this to appear on my test nodes.  Can you try it again with the
'heartbeat' verb flag added?
Comment 6 Corey Marthaler 2005-01-10 19:10:46 EST
Will test this case again the next time I'm running GFS 6.0.
Comment 7 Kiersten (Kerri) Anderson 2006-01-04 10:35:39 EST
Corey, has this one occurred again?
Comment 8 Corey Marthaler 2006-01-04 15:23:51 EST
Hasn't been seen in over a year, will reopen if reproduced again.
