Bug 490147

Summary: fencing node got killed by new qdisk master
Product: [Retired] Red Hat Cluster Suite
Component: cman
Version: 4
Hardware: All
OS: Linux
Priority: low
Severity: medium
Status: CLOSED WONTFIX
Reporter: Nate Straz <nstraz>
Assignee: Lon Hohberger <lhh>
QA Contact: Cluster QE <mspqa-list>
CC: cluster-maint, cmarthal, iannis, smayhew
Doc Type: Bug Fix
Last Closed: 2010-10-07 19:07:36 UTC
Bug Depends On: 500452
Attachments (flags: none):
  syslog and cluster.conf
  syslog and qdisk log from latest reproduction
  iostat log showing weird behavior on tank-03
  iostat log from tank-04 for same timeframe

Description Nate Straz 2009-03-13 15:20:11 UTC
Created attachment 335102 [details]
syslog and cluster.conf

Description of problem:

While running revolver on a 6-node + qdisk cluster, I hit a situation where a node was unexpectedly rebooted.

Scenario iteration 0.4 started at Fri Mar 13 03:37:40 CDT 2009
Sleeping 3 minute(s) to let the I/O get its lock count up...
Senario: DLM kill Quorum plus one

Those picked to face the revolver... tank-03 morph-04 morph-01 tank-04 

tank-01 started fencing nodes.
morph-03 became qdisk master 18 seconds after deferring to tank-01 and started trying to kill tank-01.

morph-03::

Mar 13 03:41:38 morph-03 fenced[4212]: fencing deferred to tank-01
Mar 13 03:41:56 morph-03 qdiskd[4152]: <info> Assuming master role 
Mar 13 03:41:57 morph-03 qdiskd[4152]: <notice> Writing eviction notice for node 1 
Mar 13 03:41:57 morph-03 kernel: CMAN: removing node tank-01 from the cluster : Killed by another node
Mar 13 03:41:58 morph-03 qdiskd[4152]: <notice> Node 1 evicted 
Mar 13 03:41:58 morph-03 qarshd[5920]: Talking to peer 10.15.89.98:37398
Mar 13 03:41:58 morph-03 qarshd[5920]: Running cmdline: cat /proc/cluster/nodes 
Mar 13 03:42:03 morph-03 kernel: SM: 00000002 process_recovery_barrier status=-104
Mar 13 03:42:03 morph-03 fenced[4212]: tank-01 not a cluster member after 0 sec post_fail_delay
Mar 13 03:42:03 morph-03 fenced[4212]: tank-03 not a cluster member after 0 sec post_fail_delay
Mar 13 03:42:03 morph-03 fenced[4212]: morph-04 not a cluster member after 0 sec post_fail_delay
Mar 13 03:42:03 morph-03 fenced[4212]: tank-04 not a cluster member after 0 sec post_fail_delay
Mar 13 03:42:03 morph-03 fenced[4212]: morph-01 not a cluster member after 0 sec post_fail_delay
Mar 13 03:42:03 morph-03 fenced[4212]: fencing node "tank-01"
Mar 13 03:42:04 morph-03 fenced[4212]: agent "fence_apc" reports: Unable to connect/login to fencing device 
Mar 13 03:42:04 morph-03 fenced[4212]: fence "tank-01" failed


tank-01::

Mar 13 03:41:38 tank-01 kernel: CMAN: Initiating transition, generation 30
Mar 13 03:41:39 tank-01 kernel: CMAN: Completed transition, generation 30
Mar 13 03:41:39 tank-01 fenced[4339]: tank-03 not a cluster member after 0 sec post_fail_delay
Mar 13 03:41:39 tank-01 fenced[4339]: morph-01 not a cluster member after 0 sec post_fail_delay
Mar 13 03:41:39 tank-01 fenced[4339]: tank-04 not a cluster member after 0 sec post_fail_delay
Mar 13 03:41:39 tank-01 fenced[4339]: morph-04 not a cluster member after 0 sec post_fail_delay
Mar 13 03:41:39 tank-01 fenced[4339]: fencing node "tank-03"
Mar 13 03:41:45 tank-01 fenced[4339]: fence "tank-03" success
Mar 13 03:41:45 tank-01 fenced[4339]: fencing node "morph-01"
Mar 13 03:41:51 tank-01 fenced[4339]: fence "morph-01" success
Mar 13 03:41:51 tank-01 fenced[4339]: fencing node "tank-04"
Mar 13 03:41:57 tank-01 kernel: CMAN: Being told to leave the cluster by node 6
Mar 13 03:41:57 tank-01 kernel: CMAN: we are leaving the cluster. 
Mar 13 03:41:57 tank-01 kernel: WARNING: dlm_emergency_shutdown
Mar 13 03:41:57 tank-01 udevd[1794]: udev done!
Mar 13 03:41:58 tank-01 fenced[4339]: fence "tank-04" success
Mar 13 03:41:58 tank-01 fenced[4339]: fencing node "morph-04"
Mar 13 03:41:59 tank-01 qarshd[6493]: Talking to peer 10.15.89.98:37397
Mar 13 03:41:59 tank-01 qarshd[6493]: Running cmdline: cat /proc/cluster/nodes 
Mar 13 03:42:02 tank-01 kernel: md: stopping all md devices.
Mar 13 03:42:02 tank-01 kernel: md: md0 switched to read-only mode.
Mar 13 03:42:03 tank-01 kernel: Synchronizing SCSI cache for disk sdl: 
...
Mar 13 03:42:03 tank-01 kernel: Synchronizing SCSI cache for disk sdb: 
Mar 13 03:44:10 tank-01 syslogd 1.4.1: restart.


Version-Release number of selected component (if applicable):
cman-1.0.27-1.el4
fence-1.32.65-1.el4


How reproducible:
unknown

Comment 1 Lon Hohberger 2009-03-13 15:34:59 UTC
So, on tank-01, the qdiskd process hung, causing a master election even though tank-01 was still quite happily running (and fencing).

I originally assumed that the cause here was that cman_poll_quorum_dev() was hanging on the qdiskd master during the fencing operation, but I ran a couple of tests and found this wasn't the case; qdiskd was behaving normally during fencing requests on the fencing node whether it was the master or not.

My other (untested) idea is that this is caused by I/O hangs while writing to the disk on tank-01; a possible mitigation would be to select the deadline I/O scheduler on the kernel command line.
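
For reference, selecting it at boot would look roughly like this on a RHEL 4 era kernel; the kernel version and root= path below are illustrative, not taken from this cluster:

   # /boot/grub/grub.conf -- append elevator=deadline to the kernel line
   kernel /vmlinuz-2.6.9-78.EL ro root=/dev/VolGroup00/LogVol00 elevator=deadline

A reboot is needed for the scheduler change to take effect.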

Comment 2 Nate Straz 2009-03-13 19:28:58 UTC
I was able to hit something similar with the deadline scheduler.  In this case only one node out of six was supposed to be rebooted, tank-01.  But tank-03 was rebooted too.  tank-03 was the qdisk master before it rebooted.

 Scenario iteration 0.2 started at Fri Mar 13 11:48:48 CDT 2009
 Sleeping 3 minute(s) to let the I/O get its lock count up...
 Senario: DLM kill one node

 Those picked to face the revolver... tank-01 

Mar 13 11:51:51 tank-01 qarshd[6032]: Running cmdline: reboot -fin


Mar 13 11:52:09 tank-03 kernel: CMAN: Initiating transition, generation 58
Mar 13 11:52:09 tank-03 kernel: CMAN: Started transition, generation 58
Mar 13 11:52:13 tank-03 qarshd[6714]: Talking to peer 10.15.89.98:38015
Mar 13 11:52:13 tank-03 qarshd[6714]: Running cmdline: cat /proc/cluster/nodes
Mar 13 11:52:15 tank-03 kernel: md: stopping all md devices.
Mar 13 11:52:15 tank-03 kernel: md: md0 switched to read-only mode.
Mar 13 11:52:15 tank-03 kernel: CMAN: removing node tank-01 from the cluster : No response to messages
Mar 13 11:52:15 tank-03 kernel: Synchronizing SCSI cache for disk sdl:
Mar 13 11:52:16 tank-03 kernel: Synchronizing SCSI cache for disk sdk:
Mar 13 11:54:29 tank-03 syslogd 1.4.1: restart.


Mar 13 11:52:08 tank-04 qdiskd[4154]: <info> Assuming master role
Mar 13 11:52:09 tank-04 qdiskd[4154]: <notice> Writing eviction notice for node 1
Mar 13 11:52:09 tank-04 qdiskd[4154]: <notice> Writing eviction notice for node 2
Mar 13 11:52:09 tank-04 kernel: CMAN: removing node tank-01 from the cluster : Killed by another node
Mar 13 11:52:10 tank-04 qdiskd[4154]: <notice> Node 1 evicted
Mar 13 11:52:10 tank-04 qdiskd[4154]: <notice> Node 2 evicted
Mar 13 11:52:13 tank-04 qarshd[6454]: Talking to peer 10.15.89.98:38016
Mar 13 11:52:13 tank-04 qarshd[6454]: Running cmdline: cat /proc/cluster/nodes
Mar 13 11:52:21 tank-04 kernel: CMAN: Initiating transition, generation 58
Mar 13 11:52:27 tank-04 kernel: CMAN: removing node tank-03 from the cluster : No response to messages
Mar 13 11:52:33 tank-04 kernel: CMAN: Initiating transition, generation 59
Mar 13 11:52:36 tank-04 kernel: CMAN: Completed transition, generation 59
Mar 13 11:52:36 tank-04 fenced[4214]: tank-01 not a cluster member after 0 sec post_fail_delay
Mar 13 11:52:36 tank-04 fenced[4214]: tank-03 not a cluster member after 0 sec post_fail_delay
Mar 13 11:52:36 tank-04 fenced[4214]: fencing node "tank-01"
Mar 13 11:52:41 tank-04 fenced[4214]: fence "tank-01" success
Mar 13 11:52:41 tank-04 fenced[4214]: fencing node "tank-03"
Mar 13 11:52:48 tank-04 fenced[4214]: fence "tank-03" success


Mar 13 11:52:08 morph-01 qdiskd[4272]: <info> Node 3 is the master
Mar 13 11:52:09 morph-01 kernel: CMAN: Initiating transition, generation 58
Mar 13 11:52:09 morph-01 kernel: CMAN: Started transition, generation 58
Mar 13 11:52:09 morph-01 qdiskd[4272]: <notice> Node 1 evicted
Mar 13 11:52:09 morph-01 qdiskd[4272]: <notice> Node 2 evicted
Mar 13 11:52:13 morph-01 qarshd[6383]: Talking to peer 10.15.89.98:38017
Mar 13 11:52:13 morph-01 qarshd[6383]: Running cmdline: cat /proc/cluster/nodes
Mar 13 11:52:15 morph-01 kernel: CMAN: removing node tank-01 from the cluster : No response to messages
Mar 13 11:52:21 morph-01 kernel: CMAN: Started transition, generation 58
Mar 13 11:52:27 morph-01 kernel: CMAN: removing node tank-03 from the cluster : No response to messages
Mar 13 11:52:33 morph-01 kernel: CMAN: Started transition, generation 58
Mar 13 11:52:33 morph-01 kernel: CMAN: Started transition, generation 58
Mar 13 11:52:34 morph-01 kernel: CMAN: Started transition, generation 59
Mar 13 11:52:36 morph-01 kernel: CMAN: Finished transition, generation 59
Mar 13 11:52:36 morph-01 fenced[4338]: fencing deferred to tank-04

Comment 3 Nate Straz 2009-03-13 20:52:16 UTC
Created attachment 335160 [details]
syslog and qdisk log from latest reproduction

Here are the logs from the same cluster hitting the bug again.  This time I did get node undead messages and morph-03 got rebooted a bunch of times before finally settling into quorum.

Comment 4 Lon Hohberger 2009-03-16 18:11:19 UTC
So 'undead' messages are likely a separate issue.

The fact that qdiskd is hanging during fencing operations is disconcerting; I haven't seen this behavior before.

Comment 5 Nate Straz 2009-04-24 18:38:30 UTC
I hit this again during RHCS 4.8 testing.

1. tank-01 was shot by revolver
2. tank-03 becomes qdisk master
3. tank-03 goes into a loop calling tank-01 undead even though it is rebooting
4. tank-04 eventually evicts tank-04 for no reason

Comment 6 Lon Hohberger 2009-04-28 18:09:51 UTC
There are two things we should do:

(a) have your cluster send quorumd log messages to a separate file so they can be analyzed independently -- e.g.

   <quorumd log_level="7" log_facility="local5" ... >

...then in syslog.conf:
 
   local5.* /var/log/qdiskd.log
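
...and then restart the syslog daemon so it re-reads its configuration (on RHEL 4, typically):

   service syslog restart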

(b) I can fix the undead node thing once and for all, but I'm not clear as to why it occurs right now.

Comment 7 Lon Hohberger 2009-05-05 13:27:55 UTC
So on tank/morph last night, Nate saw errant behavior. Looking specifically at tank-01 and its transition, here's what I found:

17:07:04 tank-01 in good shape and qdisk master
17:10:41 tank-03 sees tank-01 miss a qdisk heartbeat
17:10:54 tank-03 assumes master role after tank-01 continues to miss heartbeats
17:10:56 tank-01 eviction notice written
17:10:57 CMAN on tank-01 is -alive- and sees morph-03 die
17:10:59 (theory) qdiskd on tank-01 reads the eviction notice for tank-01 and reboots

So, from 17:10:39 to 17:10:59, qdiskd I/O is blocked.

... 20 seconds is a round number ...

Comment 8 Lon Hohberger 2009-05-05 13:29:12 UTC
17:10:41 tank-03 sees tank-01 miss a *second* qdisk heartbeat

the first miss was at 17:10:40, meaning the last good heartbeat was @ 17:10:39

Comment 9 Nate Straz 2009-05-05 13:55:35 UTC
Here are the iostats from tank-01 for the quorum disk

Time        Device:    rrqm/s wrqm/s   r/s   w/s  rsec/s  wsec/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await  svctm  %util
05:10:00 PM sda1         0.00   0.00 16.03  1.00   16.03    1.00     8.02     0.50     1.00     0.00    0.14   0.14   0.24
05:10:10 PM sda1         0.00   0.00 15.98  1.00   15.98    1.00     7.99     0.50     1.00     0.00    0.12   0.12   0.21
05:10:20 PM sda1         0.00   0.00 16.00  1.00   16.00    1.00     8.00     0.50     1.00     0.00    0.19   0.19   0.33
05:10:30 PM sda1         0.00   0.00 15.98  1.00   15.98    1.00     7.99     0.50     1.00     0.00    0.12   0.12   0.20
05:10:40 PM sda1         0.00   0.00 14.41  0.90   14.51    0.90     7.26     0.45     1.01     0.07    0.23   4.82   7.39
05:10:50 PM sda1         0.00   0.00  0.00  0.00    0.00    0.00     0.00     0.00     0.00     1.00    0.00   0.00 100.02

Comment 10 Lon Hohberger 2009-05-05 15:42:29 UTC
05:10:30 = everything good
05:10:40 = looks like we missed something.  Note %util jump.
05:10:50 = last update before reboot.  Missed EVERYTHING; note 100.02 %util

Comment 11 Lon Hohberger 2009-05-05 16:04:46 UTC
This really doesn't look like a bug in qdiskd.  I have a feeling we can tune around it on this particular array configuration by using larger timeouts for the cluster and qdiskd.

on RHEL4:

  <cman deadnode_timeout="49" .../>

or on RHEL5:

  <totem token="49000" ... />

In these cases:

  <quorumd interval="2" tko="12" ... >
    <heuristic ... />
    ...
  </quorumd>

This will halve qdiskd's per-second I/O requirement (it wasn't a lot to begin with, but whatever) and allow more than 20 seconds for recovery.  If my guess is right and there's some 20-second time limit here that we trip over for some reason, this will work around it.  I still think knowing the real cause here is useful.
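
Spelling out the arithmetic (taking the default quorumd interval of 1 second as the baseline):

   interval="2"             -> one qdiskd read/write cycle every 2 s instead of
                               every 1 s, i.e. half the per-second I/O load
   interval * tko = 2 * 12  -> 24 s before qdiskd declares a node dead, which is
                               longer than the ~20 s I/O stalls seen so far
   deadnode_timeout="49"    -> roughly 2 * 24 + 1 s, keeping CMAN's own membership
                               timeout comfortably above qdiskd's detection window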

Comment 12 Lon Hohberger 2009-05-05 18:08:39 UTC
Created attachment 342512 [details]
iostat log showing weird behavior on tank-03

iostat log.  See how all of the devices are hanging on disk I/O on this node, not just the quorum disk device.

This is very, very strange to me.

Comment 13 Lon Hohberger 2009-05-05 18:28:44 UTC
The 100% utilization on some LUNs w/ zero I/Os completing across ALL LUNs on a particular array for ~20 seconds bothers me quite a bit.

This problem does not universally block access to the entire array for all hosts; other hosts can read/write normally.  So, either the array firmware is blocking a single host or (more likely) something in the kernel on the host in question is to blame for holding things up.

Comment 14 Nate Straz 2009-05-05 18:40:35 UTC
Comment on attachment 342512 [details]
iostat log showing weird behavior on tank-03

updating description to show which host the log came from.

Comment 15 Nate Straz 2009-05-05 18:47:32 UTC
Created attachment 342517 [details]
iostat log from tank-04 for same timeframe

Logs showing that another node was able to continue working on the same LUNs which tank-03 had trouble with.

Comment 16 Nate Straz 2009-05-05 21:01:15 UTC
After implementing the changes from comment #11 I was able to get through more iterations of revolver.  I checked the logs to see whether the I/O stalls were still happening, and they were.  I also found the following log messages coming from qdiskd.  Apologies if bugzilla wraps this.

May  5 13:20:19 tank-04 qdiskd[11954]: <warning> qdisk cycle took more than 2 seconds to complete (20.030000)
May  5 13:20:20 tank-01 qdiskd[4280]: <warning> qdisk cycle took more than 2 seconds to complete (20.050000)
May  5 14:24:29 morph-01 qdiskd[4257]: <warning> qdisk cycle took more than 2 seconds to complete (18.710000)
May  5 14:24:29 morph-03 qdiskd[4161]: <warning> qdisk cycle took more than 2 seconds to complete (19.000000)
May  5 14:24:29 morph-04 qdiskd[4159]: <warning> qdisk cycle took more than 2 seconds to complete (20.330000)
May  5 14:24:48 tank-03 qdiskd[12485]: <warning> qdisk cycle took more than 2 seconds to complete (40.090000)
May  5 14:24:48 tank-04 qdiskd[12120]: <warning> qdisk cycle took more than 2 seconds to complete (40.050000)
May  5 14:36:09 tank-04 qdiskd[12120]: <warning> qdisk cycle took more than 2 seconds to complete (18.780000)
May  5 14:36:10 morph-03 qdiskd[4161]: <warning> qdisk cycle took more than 2 seconds to complete (18.950000)
May  5 14:36:10 tank-01 qdiskd[4283]: <warning> qdisk cycle took more than 2 seconds to complete (20.190000)
May  5 14:43:14 tank-04 qdiskd[12120]: <warning> qdisk cycle took more than 2 seconds to complete (19.280000)
May  5 14:43:15 morph-03 qdiskd[4161]: <warning> qdisk cycle took more than 2 seconds to complete (19.480000)
May  5 15:03:52 morph-01 qdiskd[4284]: <warning> qdisk cycle took more than 2 seconds to complete (19.590000)
May  5 15:10:35 morph-03 qdiskd[4159]: <warning> qdisk cycle took more than 2 seconds to complete (19.940000)
May  5 15:17:46 tank-04 qdiskd[4162]: <warning> qdisk cycle took more than 2 seconds to complete (20.170000)
May  5 15:17:48 tank-03 qdiskd[4162]: <warning> qdisk cycle took more than 2 seconds to complete (19.210000)

Comment 17 Nate Straz 2009-05-05 23:32:15 UTC
I was able to reproduce this with kernel-2.6.9-78.EL and associated GFS modules.

Comment 18 Lon Hohberger 2009-05-12 18:05:01 UTC
*** Bug 500408 has been marked as a duplicate of this bug. ***

Comment 20 Lon Hohberger 2010-10-07 19:08:32 UTC
qdiskd is a real-time system and does not function very well in environments
with non-deterministic I/O timings.

As long as there is an upper bound on the amount of time it takes for a qdiskd cycle to complete, it should be possible to tune qdiskd to sustain this sort of temporal interruption, but it will require increasing the CMAN dead-node timeout by a corresponding amount (ex: in the above case, qdiskd should be configured for at least 25-second detection times, and CMAN's quorum_dev_poll and deadnode_timeout should be set to close to a minute).

This will affect availability as a whole, since it will increase the amount of
time to respond to real failures (ex: kernel panic).

In cases where the upper bound of I/O latency to the quorum disk is unknown or
non-deterministic, it is a better idea to run in a cluster configuration which
does not use qdiskd.
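
For illustration, a cluster.conf fragment along those lines might look roughly like this; the values are chosen only to satisfy the ratios above (deadnode_timeout in seconds as in comment #11, quorum_dev_poll assumed to be in milliseconds):

   <!-- qdiskd: 2 s interval x 13 tko = 26 s detection time (>= 25 s) -->
   <quorumd interval="2" tko="13" ... >
     <heuristic ... />
   </quorumd>

   <!-- CMAN membership timeouts close to a minute -->
   <cman deadnode_timeout="60" quorum_dev_poll="60000" ... />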