Bug 144140 - Node heartbeat missed while doing IO, fenced.
Status: CLOSED NOTABUG
Product: Red Hat Cluster Suite
Classification: Red Hat
Component: fence
Version: 4
Hardware: All
OS: Linux
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: ---
Assigned To: Corey Marthaler
QA Contact: Cluster QE
Depends On:
Blocks: 144795
Reported: 2005-01-04 15:10 EST by Dean Jansa
Modified: 2009-04-16 16:03 EDT
CC List: 1 user

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2005-03-14 18:54:07 EST


Attachments: None
Description Dean Jansa 2005-01-04 15:10:30 EST
From Bugzilla Helper: 
User-Agent: Mozilla/5.0 (compatible; Konqueror/3.1; Linux) 
 
Description of problem: 
While attempting to reproduce bug 130665, one of my nodes (tank-02) in a
six-node cluster was fenced after another node missed a HELLO from it.
 
Jan  4 13:46:14 tank-04 kernel: CMAN: no HELLO from tank-02, 
removing from the cluster 
 
One gfs fs mounted on /mnt/gfs/gfs0 
 
Running <On all nodes (in /mnt/gfs/gfs0)>:  
 
iogen -o -m random -s write,writev,readv -t 1b -T1000b 10000b:tfile1  
| doio -avk 
 
 
After about 3 hours, tank-02, which otherwise looked fine, was fenced
after tank-04 did not get the HELLO.
 
Tank-02's /var/log/messages leading up to the fence: 
. 
. 
. 
Jan  4 11:14:58 tank-02 kernel: dlm: gfs0: recover event 26 done 
Jan  4 11:14:59 tank-02 kernel: dlm: gfs0: process held requests 
Jan  4 11:14:59 tank-02 kernel: dlm: gfs0: processed 0 requests 
Jan  4 11:14:59 tank-02 kernel: dlm: gfs0: resend marked requests 
Jan  4 11:14:59 tank-02 kernel: dlm: gfs0: resent 0 requests 
Jan  4 11:14:59 tank-02 kernel: dlm: gfs0: recover event 26 finished 
Jan  4 11:15:55 tank-02 sshd(pam_unix)[4021]: session opened for 
user root by (uid=0) 
Jan  4 13:50:17 tank-02 syslogd 1.4.1: restart. 
Jan  4 13:50:17 tank-02 syslog: syslogd startup succeeded 
 
So tank-02 was just going about the IO load without any log
messages; I didn't have any warning of the impending doom.
 
 
tank-04's /var/log/messages leading up to fence: 
 
Jan  4 11:12:56 tank-04 kernel: dlm: gfs0: resend marked requests 
Jan  4 11:12:56 tank-04 kernel: dlm: gfs0: resent 0 requests 
Jan  4 11:12:56 tank-04 kernel: dlm: gfs0: recover event 18 finished 
Jan  4 11:13:53 tank-04 sshd(pam_unix)[4021]: session opened for 
user root by (uid=0) 
Jan  4 13:46:14 tank-04 kernel: CMAN: no HELLO from tank-02, 
removing from the cluster 
Jan  4 13:46:16 tank-04 fenced[3738]: fencing deferred to 1 
 
 
 
 
Version-Release number of selected component (if applicable): 
 
 
How reproducible: 
Sometimes 
 
Steps to Reproduce: 
1. iogen -o -m random -s write,writev,readv -t 1b -T1000b 
10000b:tfile1  | doio -avk 
 
2. wait. 
 
 
Actual Results:  Node fenced while doing IO; I have seen this more
often with heavier loads than the above.
 
Expected Results:  IO runs without node being fenced. 
 
Additional info: This was hit with 2.6.9-1.906_ELsmp and RPMS built 
on Dec 20th.
Comment 1 Adam "mantis" Manthei 2005-01-05 14:16:13 EST
If nodes are being fenced correctly when heartbeats are being missed
then this is not a fencing bug.  The issue will probably need to be
assigned to either cman or dlm.  

What are you using for fencing?  APC fencing?  If so, you might be
running into bug #143448.  If you are not using APC fencing, were you
able to access tank-02 before needing to reboot it?  This is probably
related to some of the other issues we've seen users complaining about
on the list WRT dlm (as well as some of our own in-house anomalies).
Comment 2 Dean Jansa 2005-01-05 14:27:17 EST
Right, not a fence bug...  CMAN or gang I would guess. 
 
Comment 3 Derek Anderson 2005-01-05 14:35:54 EST
I see this too, and don't believe it's a fencing bug.  It's really
easy to reproduce in my three-node cluster by just running 'while
:; do ls -lR <kernel source>; done' simultaneously on all nodes.
The first node will get fenced after a missed HELLO and the second will
be removed, but it can't be fenced since there is only one node left and
it has lost quorum.
  
I can run 'cman_tool expected -e 1' on the remaining node so that it  
regains quorum.  It then properly fences the second node which was  
left hanging.  
  
Tried this same test with gulm locking and it works fine.  
  
Question: I notice there is a hello_timer (5) and a deadnode_timeout
(21) in the cman config.  So you have to miss four hello intervals before
the deadnode_timeout is reached?  It would be nice to have each miss
logged so you can tell whether you are missing heartbeats when load
starts adding up.
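
For reference, a minimal sketch of how those two timers would typically be
expressed in /etc/cluster/cluster.conf; the values come from the numbers
quoted above, but the exact placement of the attributes is an assumption,
not this cluster's actual config:

  <cluster name="tank" config_version="1">
    <cman hello_timer="5" deadnode_timeout="21"/>
    <!-- clusternodes, fencedevices, etc. omitted -->
  </cluster>

With a 5-second hello interval and a 21-second dead-node timeout, a node
has to stay silent through four consecutive hello intervals before it is
declared dead and handed off for fencing.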
Comment 4 Adam "mantis" Manthei 2005-01-05 15:22:38 EST
Derek's suggestion in comment #3 regarding logging missed HELLOs is
dead on.  These sorts of things should get logged by default.
Unfortunately, that probably means console printk() for the time being.

Derek -- how long does it take your test to cause fencing errors?
Would it be possible for you to disable fencing for now
(fence_agent="/bin/false") to verify that you are not running into bug
#143448?
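
(For illustration, one way that could look in cluster.conf -- the device
and method names below are made up, and the layout is assumed rather than
copied from the test cluster:

  <clusternode name="tank-02">
    <fence>
      <method name="1">
        <device name="noop"/>
      </method>
    </fence>
  </clusternode>
  ...
  <fencedevices>
    <fencedevice name="noop" agent="/bin/false"/>
  </fencedevices>

Pointing the agent at /bin/false means the fence call does nothing to the
node, so the missed-heartbeat behaviour can be observed with the APC
hardware, and bug #143448, out of the picture.)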

Dean -- what about my questions for you in comment #1?

Comment 5 Dean Jansa 2005-01-05 15:29:58 EST
Adam, 
 
I'm using APC fencing. 
 
Comment 6 Adam "mantis" Manthei 2005-01-17 19:07:44 EST
I think that this is a duplicate of bug #143448.  Assigning to Dave
Teigland for now.
Comment 7 David Teigland 2005-01-17 23:11:18 EST

*** This bug has been marked as a duplicate of 143448 ***
Comment 8 Corey Marthaler 2005-02-14 12:06:17 EST
We still appear to be hitting this "slow node ends up missing
heartbeats and gets fenced" bug. I had mount_stress running on 5 nodes
this weekend and after about 8 hours and 1067 iterations, tank-02 got
fenced out of the blue by the master, tank-01. The missed heartbeat
messages are the only thing that implies anything was wrong. According
to the test, tank-02 was fine and had still been doing work up
until it was shot.

.
.
.
Feb 12 01:06:57 tank-01 lock_gulmd_core[6307]: tank-02.lab.msp.redhat.com missed a heartbeat (time:1108188417828536 mb:1)
Feb 12 01:07:12 tank-01 lock_gulmd_core[6307]: tank-02.lab.msp.redhat.com missed a heartbeat (time:1108188432832255 mb:2)
Feb 12 01:07:27 tank-01 lock_gulmd_core[6307]: tank-02.lab.msp.redhat.com missed a heartbeat (time:1108188447837974 mb:3)
Feb 12 01:07:27 tank-01 lock_gulmd_core[6307]: Client (tank-02.lab.msp.redhat.com) expired
Feb 12 01:07:27 tank-01 lock_gulmd_core[6073]: Gonna exec fence_node tank-02.lab.msp.redhat.com
Feb 12 01:07:27 tank-01 lock_gulmd_core[6307]: Forked [6073] fence_node tank-02.lab.msp.redhat.com with a 0 pause.
Feb 12 01:07:58 tank-01 fence_node[6073]: Fence of "tank-02.lab.msp.redhat.com" was successful
Feb 12 01:07:58 tank-01 lock_gulmd_core[6307]: found match on pid 6073, marking node tank-02.lab.msp.redhat.com as logged out.
Feb 12 01:07:58 tank-01 kernel: lock_gulm: Checking for journals for node "tank-02.lab.msp.redhat.com"
Comment 9 Corey Marthaler 2005-02-22 13:09:36 EST
Reassigning and adding to blocker list.
Comment 10 David Teigland 2005-02-28 21:51:07 EST
This bug is incoherent; every comment seems completely unrelated to
all the others.  AFAICT, the latest problem was related to gulm
heartbeats which makes me the last person who should touch it.
I recommend this bug be thrown out and a new clearly defined bug
be created if there's a real issue here.
Comment 11 Corey Marthaler 2005-03-14 18:54:07 EST
To be reopened with a better and more specific description once we hit
this again.
