Red Hat Bugzilla – Bug 142063
lock_gulmd_core: POLLHUP ERROR causes nodes not to receive replies from master
Last modified: 2009-04-16 16:01:59 EDT
Description of problem:
I saw this while running regression tests on:
GFS-6.0.2-17
GFS-modules-hugemem-6.0.2-17
kernel: 2.4.21-27.ELhugemem
5 node cluster (morph-01 - 05); morph-01 was the Gulm server.

The whole cluster is busy doing I/O when morph-01 gets these errors:

Dec 6 17:15:54 morph-01 lock_gulmd_core[5487]: ERROR [core_io.c:2086] POLLHUP on idx:1 fd:6 name:morph-05.lab.msp.redhat.com
Dec 6 17:15:54 morph-01 lock_gulmd_core[5487]: ERROR [core_io.c:2086] POLLHUP on idx:2 fd:7 name:morph-02.lab.msp.redhat.com
Dec 6 17:15:54 morph-01 lock_gulmd_core[5487]: ERROR [core_io.c:2086] POLLHUP on idx:3 fd:8 name:morph-03.lab.msp.redhat.com
Dec 6 17:15:54 morph-01 lock_gulmd_core[5487]: ERROR [core_io.c:2086] POLLHUP on idx:4 fd:

Then the other 4 nodes in the cluster failed to receive heartbeat replies:

Dec 6 17:15:23 morph-04 lock_gulmd_core[3226]: Failed to receive a timely heartbeat reply from Master. (t:1102374923530825 mb:1)
Dec 6 17:15:38 morph-04 lock_gulmd_core[3226]: Failed to receive a timely heartbeat reply from Master. (t:1102374938530827 mb:2)
Dec 6 17:15:53 morph-04 lock_gulmd_core[3226]: Failed to receive a timely heartbeat reply from Master. (t:1102374953550825 mb:3)
Dec 6 17:15:53 morph-04 lock_gulmd_core[3226]: In core_io.c:425 (v6.0.0) death by: Lost connection to SLM Master (morph-01.lab.msp.redhat.com), stopping. node reset required to re-activate cluster operations.
Dec 6 17:15:53 morph-04 lock_gulmd_LTPX[3228]: EOF on xdr (_ core _:0.0.0.0 idx:1 fd:5)
Dec 6 17:15:53 morph-04 lock_gulmd_LTPX[3228]: In ltpx_io.c:332 (v6.0.0) death by: Lost connection to core, cannot continue. node reset required to re-activate cluster operations.
Dec 6 17:15:53 morph-04 kernel: lock_gulm: ERROR Got an error in gulm_res_recvd err: -71
Dec 6 17:15:53 morph-04 kernel: lock_gulm: ERROR gulm_LT_recver err -71
Dec 6 17:15:56 morph-04 kernel: lock_gulm: ERROR Got a -111 trying to login to lock_gulmd. Is it running?
Dec 6 17:16:29 morph-04 last message repeated 11 times

morph-01 thinks they all missed their heartbeats and then fences all four of them:

Dec 6 17:16:17 morph-01 lock_gulmd_core[5487]: morph-04.lab.msp.redhat.com missed a heartbeat (time:1102374977064030 mb:1)
Dec 6 17:16:24 morph-01 lock_gulmd_core[5487]: morph-05.lab.msp.redhat.com missed a heartbeat (time:1102374984564045 mb:2)
Dec 6 17:16:32 morph-01 lock_gulmd_core[5487]: morph-02.lab.msp.redhat.com missed a heartbeat (time:1102374992084030 mb:2)
Dec 6 17:16:32 morph-01 lock_gulmd_core[5487]: morph-03.lab.msp.redhat.com missed a heartbeat (time:1102374992084030 mb:2)
Dec 6 17:16:32 morph-01 lock_gulmd_core[5487]: morph-04.lab.msp.redhat.com missed a heartbeat (time:1102374992084030 mb:2)
Dec 6 17:16:39 morph-01 lock_gulmd_core[5487]: morph-05.lab.msp.redhat.com missed a heartbeat (time:1102374999584043 mb:3)
Dec 6 17:16:39 morph-01 lock_gulmd_core[5487]: Client (morph-05.lab.msp.redhat.com) expired
Dec 6 17:16:39 morph-01 lock_gulmd_core[5487]: Could not send membership update "Expired" about morph-05.lab.msp.redhat.com to slave morph-03.lab.msp.redhat.com
Dec 6 17:16:39 morph-01 lock_gulmd_core[5487]: Could not send membership update "Expired" about morph-05.lab.msp.redhat.com to slave morph-04.lab.msp.redhat.com
Dec 6 17:16:39 morph-01 lock_gulmd_core[5487]: Could not send membership update "Expired" about morph-05.lab.msp.redhat.com to slave morph-02.lab.msp.redhat.com
Dec 6 17:16:39 morph-01 lock_gulmd_core[7354]: Gonna exec fence_node morph-05.lab.msp.redhat.com
Dec 6 17:16:39 morph-01 lock_gulmd_core[5487]: Forked [7354] fence_node morph-05.lab.msp.redhat.com with a 0 pause.

How reproducible:
Didn't try
Reproduced this right away again after restarting regression testing I/O on the morph cluster.
Well, assuming the timestamps are all in sync, it looks like everyone else (morph-04 here, but presumably similar messages and times on the other nodes?) missed heartbeat replies from the SLM Master. They then disconnected, which generates the POLLHUP on the SLM Master. (gulm_core on morph-04 prints its final message before dying, and the next second the POLLHUP is seen on morph-01.) So this seems to be a question of why the heartbeat replies from the Master are being missed, which could imply that the heartbeats themselves are being missed by a slow-running or buffer-filled gulm_core on morph-01.
what kind of io are the nodes doing when this happens?
The I/O is a random mix of iogen/doio, genesis, and accordion command lines run 2 wide per node, from 6.0/vedder/lib/regression.herd.m4 in the 6.0 sistina-test tree.
I cannot get this to appear on my test nodes. Can you try it again with the 'heartbeat' verb flag added?
Will test this case again the next time I'm running at GFS 6.0.
Corey, has this one occurred again?
Hasn't been seen in over a year, will reopen if reproduced again.