Created attachment 335102 [details]
syslog and cluster.conf

Description of problem:
While running revolver on a 6 node + qdisk cluster, I hit a situation where a node was unexpectedly rebooted.

Scenario iteration 0.4 started at Fri Mar 13 03:37:40 CDT 2009
Sleeping 3 minute(s) to let the I/O get its lock count up...
Senario: DLM kill Quorum plus one
Those picked to face the revolver... tank-03 morph-04 morph-01 tank-04 tank-01
started fencing nodes

morph-03 became qdisk master 18 seconds after deferring to tank-01, and then started trying to kill tank-01.

morph-03::
Mar 13 03:41:38 morph-03 fenced[4212]: fencing deferred to tank-01
Mar 13 03:41:56 morph-03 qdiskd[4152]: <info> Assuming master role
Mar 13 03:41:57 morph-03 qdiskd[4152]: <notice> Writing eviction notice for node 1
Mar 13 03:41:57 morph-03 kernel: CMAN: removing node tank-01 from the cluster : Killed by another node
Mar 13 03:41:58 morph-03 qdiskd[4152]: <notice> Node 1 evicted
Mar 13 03:41:58 morph-03 qarshd[5920]: Talking to peer 10.15.89.98:37398
Mar 13 03:41:58 morph-03 qarshd[5920]: Running cmdline: cat /proc/cluster/nodes
Mar 13 03:42:03 morph-03 kernel: SM: 00000002 process_recovery_barrier status=-104
Mar 13 03:42:03 morph-03 fenced[4212]: tank-01 not a cluster member after 0 sec post_fail_delay
Mar 13 03:42:03 morph-03 fenced[4212]: tank-03 not a cluster member after 0 sec post_fail_delay
Mar 13 03:42:03 morph-03 fenced[4212]: morph-04 not a cluster member after 0 sec post_fail_delay
Mar 13 03:42:03 morph-03 fenced[4212]: tank-04 not a cluster member after 0 sec post_fail_delay
Mar 13 03:42:03 morph-03 fenced[4212]: morph-01 not a cluster member after 0 sec post_fail_delay
Mar 13 03:42:03 morph-03 fenced[4212]: fencing node "tank-01"
Mar 13 03:42:04 morph-03 fenced[4212]: agent "fence_apc" reports: Unable to connect/login to fencing device
Mar 13 03:42:04 morph-03 fenced[4212]: fence "tank-01" failed

tank-01::
Mar 13 03:41:38 tank-01 kernel: CMAN: Initiating transition, generation 30
Mar 13 03:41:39 tank-01 kernel: CMAN: Completed transition, generation 30
Mar 13 03:41:39 tank-01 fenced[4339]: tank-03 not a cluster member after 0 sec post_fail_delay
Mar 13 03:41:39 tank-01 fenced[4339]: morph-01 not a cluster member after 0 sec post_fail_delay
Mar 13 03:41:39 tank-01 fenced[4339]: tank-04 not a cluster member after 0 sec post_fail_delay
Mar 13 03:41:39 tank-01 fenced[4339]: morph-04 not a cluster member after 0 sec post_fail_delay
Mar 13 03:41:39 tank-01 fenced[4339]: fencing node "tank-03"
Mar 13 03:41:45 tank-01 fenced[4339]: fence "tank-03" success
Mar 13 03:41:45 tank-01 fenced[4339]: fencing node "morph-01"
Mar 13 03:41:51 tank-01 fenced[4339]: fence "morph-01" success
Mar 13 03:41:51 tank-01 fenced[4339]: fencing node "tank-04"
Mar 13 03:41:57 tank-01 kernel: CMAN: Being told to leave the cluster by node 6
Mar 13 03:41:57 tank-01 kernel: CMAN: we are leaving the cluster.
Mar 13 03:41:57 tank-01 kernel: WARNING: dlm_emergency_shutdown
Mar 13 03:41:57 tank-01 udevd[1794]: udev done!
Mar 13 03:41:58 tank-01 fenced[4339]: fence "tank-04" success
Mar 13 03:41:58 tank-01 fenced[4339]: fencing node "morph-04"
Mar 13 03:41:59 tank-01 qarshd[6493]: Talking to peer 10.15.89.98:37397
Mar 13 03:41:59 tank-01 qarshd[6493]: Running cmdline: cat /proc/cluster/nodes
Mar 13 03:42:02 tank-01 kernel: md: stopping all md devices.
Mar 13 03:42:02 tank-01 kernel: md: md0 switched to read-only mode.
Mar 13 03:42:03 tank-01 kernel: Synchronizing SCSI cache for disk sdl:
...
Mar 13 03:42:03 tank-01 kernel: Synchronizing SCSI cache for disk sdb:
Mar 13 03:44:10 tank-01 syslogd 1.4.1: restart.

Version-Release number of selected component (if applicable):
cman-1.0.27-1.el4
fence-1.32.65-1.el4

How reproducible:
unknown
So, on tank-01, the qdiskd process hung, causing a master election even though tank-01 was still quite happily running (and fencing). I originally assumed that the cause here was that cman_poll_quorum_dev() was hanging on the qdiskd master during the fencing operation, but a couple of tests showed this wasn't the case; qdiskd behaved normally during fencing requests on the fencing node whether it was the master or not. My other (untested) idea is that this is caused by I/O hangs while writing to the disk on tank-01; a possible workaround would be to select the deadline I/O scheduler on the kernel command line.
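For reference, a sketch of that untested workaround (the kernel version, root device, and grub entry below are illustrative only, not taken from these machines): append elevator=deadline to the kernel line in /boot/grub/grub.conf and reboot, e.g.

    # /boot/grub/grub.conf -- illustrative entry only
    title Red Hat Enterprise Linux AS (2.6.9-78.EL)
        root (hd0,0)
        kernel /vmlinuz-2.6.9-78.EL ro root=/dev/VolGroup00/LogVol00 elevator=deadline
        initrd /initrd-2.6.9-78.EL.img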
I was able to hit something similar with the deadline scheduler. In this case only one node out of six was supposed to be rebooted, tank-01. But tank-03 was rebooted too. tank-03 was the qdisk master before it rebooted.

Scenario iteration 0.2 started at Fri Mar 13 11:48:48 CDT 2009
Sleeping 3 minute(s) to let the I/O get its lock count up...
Senario: DLM kill one node
Those picked to face the revolver... tank-01

Mar 13 11:51:51 tank-01 qarshd[6032]: Running cmdline: reboot -fin

Mar 13 11:52:09 tank-03 kernel: CMAN: Initiating transition, generation 58
Mar 13 11:52:09 tank-03 kernel: CMAN: Started transition, generation 58
Mar 13 11:52:13 tank-03 qarshd[6714]: Talking to peer 10.15.89.98:38015
Mar 13 11:52:13 tank-03 qarshd[6714]: Running cmdline: cat /proc/cluster/nodes
Mar 13 11:52:15 tank-03 kernel: md: stopping all md devices.
Mar 13 11:52:15 tank-03 kernel: md: md0 switched to read-only mode.
Mar 13 11:52:15 tank-03 kernel: CMAN: removing node tank-01 from the cluster : No response to messages
Mar 13 11:52:15 tank-03 kernel: Synchronizing SCSI cache for disk sdl:
Mar 13 11:52:16 tank-03 kernel: Synchronizing SCSI cache for disk sdk:
Mar 13 11:54:29 tank-03 syslogd 1.4.1: restart.

Mar 13 11:52:08 tank-04 qdiskd[4154]: <info> Assuming master role
Mar 13 11:52:09 tank-04 qdiskd[4154]: <notice> Writing eviction notice for node 1
Mar 13 11:52:09 tank-04 qdiskd[4154]: <notice> Writing eviction notice for node 2
Mar 13 11:52:09 tank-04 kernel: CMAN: removing node tank-01 from the cluster : Killed by another node
Mar 13 11:52:10 tank-04 qdiskd[4154]: <notice> Node 1 evicted
Mar 13 11:52:10 tank-04 qdiskd[4154]: <notice> Node 2 evicted
Mar 13 11:52:13 tank-04 qarshd[6454]: Talking to peer 10.15.89.98:38016
Mar 13 11:52:13 tank-04 qarshd[6454]: Running cmdline: cat /proc/cluster/nodes
Mar 13 11:52:21 tank-04 kernel: CMAN: Initiating transition, generation 58
Mar 13 11:52:27 tank-04 kernel: CMAN: removing node tank-03 from the cluster : No response to messages
Mar 13 11:52:33 tank-04 kernel: CMAN: Initiating transition, generation 59
Mar 13 11:52:36 tank-04 kernel: CMAN: Completed transition, generation 59
Mar 13 11:52:36 tank-04 fenced[4214]: tank-01 not a cluster member after 0 sec post_fail_delay
Mar 13 11:52:36 tank-04 fenced[4214]: tank-03 not a cluster member after 0 sec post_fail_delay
Mar 13 11:52:36 tank-04 fenced[4214]: fencing node "tank-01"
Mar 13 11:52:41 tank-04 fenced[4214]: fence "tank-01" success
Mar 13 11:52:41 tank-04 fenced[4214]: fencing node "tank-03"
Mar 13 11:52:48 tank-04 fenced[4214]: fence "tank-03" success

Mar 13 11:52:08 morph-01 qdiskd[4272]: <info> Node 3 is the master
Mar 13 11:52:09 morph-01 kernel: CMAN: Initiating transition, generation 58
Mar 13 11:52:09 morph-01 kernel: CMAN: Started transition, generation 58
Mar 13 11:52:09 morph-01 qdiskd[4272]: <notice> Node 1 evicted
Mar 13 11:52:09 morph-01 qdiskd[4272]: <notice> Node 2 evicted
Mar 13 11:52:13 morph-01 qarshd[6383]: Talking to peer 10.15.89.98:38017
Mar 13 11:52:13 morph-01 qarshd[6383]: Running cmdline: cat /proc/cluster/nodes
Mar 13 11:52:15 morph-01 kernel: CMAN: removing node tank-01 from the cluster : No response to messages
Mar 13 11:52:21 morph-01 kernel: CMAN: Started transition, generation 58
Mar 13 11:52:27 morph-01 kernel: CMAN: removing node tank-03 from the cluster : No response to messages
Mar 13 11:52:33 morph-01 kernel: CMAN: Started transition, generation 58
Mar 13 11:52:33 morph-01 kernel: CMAN: Started transition, generation 58
Mar 13 11:52:34 morph-01 kernel: CMAN: Started transition, generation 59
Mar 13 11:52:36 morph-01 kernel: CMAN: Finished transition, generation 59
Mar 13 11:52:36 morph-01 fenced[4338]: fencing deferred to tank-04
Created attachment 335160 [details]
syslog and qdisk log from latest reproduction

Here are the logs from the same cluster hitting the bug again. This time I did get node undead messages, and morph-03 got rebooted a bunch of times before finally settling into quorum.
So 'undead' messages are likely a separate issue. The fact that qdiskd is hanging during fencing operations is disconcerting; I haven't seen this behavior before.
I hit this again during RHCS 4.8 testing.

1. tank-01 was shot by revolver
2. tank-03 becomes qdisk master
3. tank-03 goes into a loop calling tank-01 undead even though it is rebooting
4. tank-04 eventually evicts tank-04 for no reason
There are two things we should do:

(a) Have your cluster send quorumd log messages to another file so they can be analyzed separately, e.g.:

    <quorumd log_level="7" log_facility="local5" ... >

...then in syslog.conf:

    local5.* /var/log/qdiskd.log

(b) I can fix the undead node thing once and for all, but I'm not clear as to why it occurs right now.
So on tank/morph last night, Nate saw errant behavior. Looking specifically at tank-01 and its transition, here's what I found:

17:07:04  tank-01 in good shape and qdisk master
17:10:41  tank-03 sees tank-01 miss a qdisk heartbeat
17:10:54  tank-03 assumes master role after tank-01 continues to miss heartbeats
17:10:56  tank-01 eviction notice written
17:10:57  CMAN on tank-01 is -alive- and sees morph-03 die
17:10:59  (theory) qdiskd on tank-01 reads an eviction notice for itself and reboots

So, from 17:10:39 to 17:10:59, qdiskd I/O is blocked. ... 20 seconds is a round number ...
Correction to the timeline above:

17:10:41  tank-03 sees tank-01 miss a *second* qdisk heartbeat

The first miss was at 17:10:40, meaning the last good heartbeat was at 17:10:39.
Here are the iostats from tank-01 for the quorum disk:

Time         Device:  rrqm/s  wrqm/s    r/s   w/s  rsec/s  wsec/s  rkB/s  wkB/s  avgrq-sz  avgqu-sz  await  svctm   %util
05:10:00 PM  sda1       0.00    0.00  16.03  1.00   16.03    1.00   8.02   0.50      1.00      0.00   0.14   0.14    0.24
05:10:10 PM  sda1       0.00    0.00  15.98  1.00   15.98    1.00   7.99   0.50      1.00      0.00   0.12   0.12    0.21
05:10:20 PM  sda1       0.00    0.00  16.00  1.00   16.00    1.00   8.00   0.50      1.00      0.00   0.19   0.19    0.33
05:10:30 PM  sda1       0.00    0.00  15.98  1.00   15.98    1.00   7.99   0.50      1.00      0.00   0.12   0.12    0.20
05:10:40 PM  sda1       0.00    0.00  14.41  0.90   14.51    0.90   7.26   0.45      1.01      0.07   0.23   4.82    7.39
05:10:50 PM  sda1       0.00    0.00   0.00  0.00    0.00    0.00   0.00   0.00      0.00      1.00   0.00   0.00  100.02
05:10:30 = everything good
05:10:40 = looks like we missed something. Note the %util jump.
05:10:50 = last update before reboot. Missed EVERYTHING; note the 100.02 %util.
This really doesn't look like a bug in qdiskd. I have a feeling we can tune around it on this particular array configuration by using larger timeouts for the cluster and qdiskd.

On RHEL4:

    <cman deadnode_timeout="49" .../>

or on RHEL5:

    <totem token="49000" ... />

In both cases:

    <quorumd interval="2" tko="12" ... >
        <heuristic ... />
        ...
    </quorumd>

This will halve the per-second I/O requirement of qdiskd (it wasn't a lot to begin with, but whatever) and allow >20 seconds for a recovery. If my guess is right and there's some 20 second time limit here that we trip over for some reason, this will work around it. I still think knowing the real cause here would be useful, though.
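For context, a rough sketch of how the RHEL4 fragments above would sit together in cluster.conf (everything elided to "..." is a placeholder; the real node, device, and heuristic definitions are in the attached cluster.conf):

    <?xml version="1.0"?>
    <cluster name="..." config_version="...">
        <!-- 49s deadnode timeout: CMAN waits longer than the observed ~20s I/O stalls -->
        <cman deadnode_timeout="49"/>
        <!-- interval * tko = 2s * 12 = 24s of missed qdisk updates before eviction -->
        <quorumd interval="2" tko="12" votes="..." label="..." device="...">
            <heuristic program="..." score="..." interval="..."/>
        </quorumd>
        <clusternodes>
            ...
        </clusternodes>
        <fencedevices>
            ...
        </fencedevices>
    </cluster>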
Created attachment 342512 [details]
iostat log showing weird behavior on tank-03

iostat log. See how all of the devices are hanging on disk I/O on this node, not just the quorum disk device. This is very, very strange to me.
The 100% utilization on some LUNs w/ zero I/Os completing across ALL LUNs on a particular array for ~20 seconds bothers me quite a bit. This problem does not universally block access to the entire array for all hosts; other hosts can read/write normally. So, either the array firmware is blocking a single host or (more likely) something in the kernel on the host in question is to blame for holding things up.
Comment on attachment 342512 [details]
iostat log showing weird behavior on tank-03

Updating the description to show which host the log came from.
Created attachment 342517 [details]
iostat log from tank-04 for the same timeframe

Logs showing that another node was able to continue working on the same LUNs which tank-03 had trouble with.
After implementing the changes from comment #11 I was able to get through more iterations of revolver. I checked through the logs to see if the I/O stalls were still happening, and I did find them. I also found the following log messages coming from qdiskd. Apologies if bugzilla wraps this.

May 5 13:20:19 tank-04 qdiskd[11954]: <warning> qdisk cycle took more than 2 seconds to complete (20.030000)
May 5 13:20:20 tank-01 qdiskd[4280]: <warning> qdisk cycle took more than 2 seconds to complete (20.050000)
May 5 14:24:29 morph-01 qdiskd[4257]: <warning> qdisk cycle took more than 2 seconds to complete (18.710000)
May 5 14:24:29 morph-03 qdiskd[4161]: <warning> qdisk cycle took more than 2 seconds to complete (19.000000)
May 5 14:24:29 morph-04 qdiskd[4159]: <warning> qdisk cycle took more than 2 seconds to complete (20.330000)
May 5 14:24:48 tank-03 qdiskd[12485]: <warning> qdisk cycle took more than 2 seconds to complete (40.090000)
May 5 14:24:48 tank-04 qdiskd[12120]: <warning> qdisk cycle took more than 2 seconds to complete (40.050000)
May 5 14:36:09 tank-04 qdiskd[12120]: <warning> qdisk cycle took more than 2 seconds to complete (18.780000)
May 5 14:36:10 morph-03 qdiskd[4161]: <warning> qdisk cycle took more than 2 seconds to complete (18.950000)
May 5 14:36:10 tank-01 qdiskd[4283]: <warning> qdisk cycle took more than 2 seconds to complete (20.190000)
May 5 14:43:14 tank-04 qdiskd[12120]: <warning> qdisk cycle took more than 2 seconds to complete (19.280000)
May 5 14:43:15 morph-03 qdiskd[4161]: <warning> qdisk cycle took more than 2 seconds to complete (19.480000)
May 5 15:03:52 morph-01 qdiskd[4284]: <warning> qdisk cycle took more than 2 seconds to complete (19.590000)
May 5 15:10:35 morph-03 qdiskd[4159]: <warning> qdisk cycle took more than 2 seconds to complete (19.940000)
May 5 15:17:46 tank-04 qdiskd[4162]: <warning> qdisk cycle took more than 2 seconds to complete (20.170000)
May 5 15:17:48 tank-03 qdiskd[4162]: <warning> qdisk cycle took more than 2 seconds to complete (19.210000)
I was able to reproduce this with kernel-2.6.9-78.EL and associated GFS modules.
*** Bug 500408 has been marked as a duplicate of this bug. ***
qdiskd is a real-time system and does not function very well in environments with non-deterministic I/O timings. As long as there is an upper-bound for the amount of time it takes for a qdiskd cycle to complete, it should be possible to tune qdiskd to sustain this sort of temporal interruption, but it will require increasing the CMAN deadnode timeout a corresponding amount (ex: in the above case, qdiskd should be configured for at least 25 second detection times and CMAN's quorum_dev_poll and deadnode_timer should be set close to a minute). This will affect availability as a whole, since it will increase the amount of time to respond to real failures (ex: kernel panic). In cases where the upper bound of I/O latency to the quorum disk is unknown or non-deterministic, it is a better idea to run in a cluster configuration which does not use qdiskd.
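As an illustrative sketch of that tuning (values chosen only to satisfy the numbers above, not taken from a tested configuration; attribute names follow the RHEL4 examples in comment #11, and the quorum_dev_poll knob is left out since its exact name and units differ between cman releases):

    <!-- qdiskd detection time: interval * tko = 3s * 10 = 30s (>= the requested 25s) -->
    <quorumd interval="3" tko="10" ... >
        <heuristic ... />
    </quorumd>

    <!-- CMAN waits close to a minute before declaring a node dead, giving qdiskd
         time to ride out a ~20-40s I/O stall and still fail over cleanly -->
    <cman deadnode_timeout="60" ... />

Note that every increase here also delays the response to real failures (e.g. a kernel panic), as described above.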