Bug 602731

Summary: qdisk reboots cluster nodes even if it is configured not to do that
Product: Red Hat Enterprise Linux 5 Reporter: Marc Milgram <mmilgram>
Component: cman Assignee: Lon Hohberger <lhh>
Status: CLOSED WONTFIX QA Contact: Cluster QE <mspqa-list>
Severity: medium Docs Contact:
Priority: high    
Version: 5.5 CC: clasohm, cluster-maint, cww, djansa, edamato, lhh, tao
Target Milestone: rc   
Target Release: ---   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2011-04-05 18:34:55 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments:
Description                   Flags
Proposed qdisk patch          none
Patch noted in comment #16    none

Description Marc Milgram 2010-06-10 15:18:44 UTC
Created attachment 422948 [details]
Proposed qdisk patch

Description of problem:
qdiskd should not reboot a node if the qdisk goes offline when configured with reboot="0" and io_timeout="0".

Version-Release number of selected component (if applicable):
cman-2.0.115-34.el5.x86_64.rpm

How reproducible:
100%

Steps to Reproduce:
1. Configure a 2-node cluster with a shared quorum disk, with reboot="0" and io_timeout="0".
   In /etc/cluster/cluster.conf, something like:
<quorumd allow_kill="0" device="/dev/mpath/ds0003_01a4" interval="3" io_timeout="0" max_error_cycles="5" min_score="1" reboot="0" tko="18" votes="1"/>
2. Start cluster
3. Stop access to quorum disk (so instantly neither cluster node can access the quorum disk).
  
Actual results:
Cluster nodes reboot.

Expected results:
Cluster nodes keep running

Additional info:

Comment 2 Lon Hohberger 2010-08-24 18:07:46 UTC
Looking at that patch again, it doesn't look right; bitwise comparisons are a bit tricky.

The patch ensures that the host never reboots, even if reboot is set.

This is because 0x3 & 0x1 & 0x2 is always 0: no bit is set in all three values.

If you want to compare two bit flags and do a conditional based on it, you need to do it like this:

  if ((flags & (flag1|flag2)) == (flag1|flag2))

ex:

+		    (ctx->qc_flags & (RF_IOTIMEOUT|RF_REBOOT)) == 
+                                    (RF_IOTIMEOUT|RF_REBOOT) ) {
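
To make the difference concrete, here is a minimal C sketch of the two tests. The flag values (0x1 and 0x2) and the helper names are assumptions for illustration only; the real RF_* definitions in qdiskd may differ.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flag values; the real RF_* definitions may differ. */
#define RF_REBOOT    0x1u
#define RF_IOTIMEOUT 0x2u

/* Chained AND: the two single-bit masks share no bits, so the
 * result is 0 for every possible value of flags. */
static bool both_set_chained(unsigned flags)
{
    return (flags & RF_REBOOT & RF_IOTIMEOUT) != 0;
}

/* Mask-and-compare: true only when every bit in the mask is set. */
static bool both_set_masked(unsigned flags)
{
    const unsigned mask = RF_REBOOT | RF_IOTIMEOUT;
    return (flags & mask) == mask;
}

/* Small self-check showing where the two tests disagree. */
static void flag_demo(void)
{
    /* Both flags set: the chained test still reports false. */
    assert(!both_set_chained(RF_REBOOT | RF_IOTIMEOUT));
    assert(both_set_masked(RF_REBOOT | RF_IOTIMEOUT));
    /* Only one flag set: the masked test correctly reports false. */
    assert(!both_set_masked(RF_REBOOT));
}
```

Calling flag_demo() passes silently, which shows that the chained form can never trigger a reboot while the masked form fires only when both flags are set.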

Comment 3 Issue Tracker 2010-08-25 07:26:01 UTC
Event posted on 08:26:01, 25th Aug, 2010 BST by ndevos

Lon, thanks for checking!

That is pretty obvious, I don't know how I (and obviously others) missed
that...

Cheers,
Niels


This event sent from IssueTracker by ndevos 
 issue 935163

Comment 8 Lon Hohberger 2010-09-27 17:16:24 UTC
As it turns out, the patch doesn't really do what we'd want:

1) allow_kill="0" is incompatible with io_timeout="1" by design

2) the part that doesn't "work" today is simply the fact that writing the eviction notice to disk is still performed when allow_kill="0"

Comment 9 Lon Hohberger 2010-10-29 17:27:12 UTC
At this point, I'm at a loss; I configured an iSCSI target and forcefully ripped it out from under the two clients; everything continued operating correctly:

Oct 29 13:14:21 rhel5-1 kernel:  connection1:0: detected conn error (1011)
Oct 29 13:14:22 rhel5-1 iscsid: Kernel reported iSCSI connection 1:0 error (1011) state (3)
Oct 29 13:14:25 rhel5-1 iscsid: connect to 192.168.122.1:3260 failed (Connection refused) 
Oct 29 13:14:26 rhel5-1 qdiskd[19668]: <warning> qdiskd: read (system call) has hung for 5 seconds 
Oct 29 13:14:26 rhel5-1 qdiskd[19668]: <warning> In 5 more seconds, we will be evicted 
Oct 29 13:14:29 rhel5-1 iscsid: connect to 192.168.122.1:3260 failed (Connection refused) 
Oct 29 13:14:31 rhel5-1 openais[1916]: [CMAN ] lost contact with quorum device 
Oct 29 13:14:33 rhel5-1 iscsid: connect to 192.168.122.1:3260 failed (Connection refused) 
Oct 29 13:15:06 rhel5-1 last message repeated 9 times

============

Oct 29 13:14:22 rhel5-2 kernel:  connection1:0: detected conn error (1011)
Oct 29 13:14:22 rhel5-2 iscsid: Kernel reported iSCSI connection 1:0 error (1011) state (3)
Oct 29 13:14:24 rhel5-2 iscsid: connect to 192.168.122.1:3260 failed (Connection refused) 
Oct 29 13:14:27 rhel5-2 qdiskd[31979]: <warning> qdiskd: read (system call) has hung for 5 seconds 
Oct 29 13:14:27 rhel5-2 qdiskd[31979]: <warning> In 5 more seconds, we will be evicted 
Oct 29 13:14:28 rhel5-2 iscsid: connect to 192.168.122.1:3260 failed (Connection refused)

After several minutes, the hosts are still up.  They are spewing I/O errors to the logs, as expected.

Restarting the iSCSI target caused the quorum disk daemon to immediately resume normal operations with no interruption in service; neither node rebooted nor was there any fencing.

My configuration was a 2-node cluster.  The cman/totem/quorumd tags (which are all relevant when discussing qdiskd) look like this:

        <cman expected_votes="3"/>
        <totem token="21000"/>
        <quorumd master_wins="1" label="rhel5" votes="1"/>

Either my test isn't the same as what you tried or it's working as expected.  

It's supposed to work this way, if I recall correctly.

Comment 10 Lon Hohberger 2010-10-29 20:59:50 UTC
I also forcefully killed the connection to the iSCSI target by doing the following:

   echo 1 > /sys/block/sda/device/delete

In this case, I had set:

   <quorumd master_wins="1" label="rhel5" votes="1" allow_kill="0" max_error_cycles="3"/>

Here's what happened:

Oct 29 16:53:20 rhel5-1 qdiskd[22945]: <warning> Error reading node ID block 15 
Oct 29 16:53:20 rhel5-1 kernel: scsi 0:0:0:1: rejecting I/O to dead device
Oct 29 16:53:20 rhel5-1 qdiskd[22945]: <warning> Error reading node ID block 16 
Oct 29 16:53:20 rhel5-1 kernel: scsi 0:0:0:1: rejecting I/O to dead device
Oct 29 16:53:20 rhel5-1 qdiskd[22945]: <err> Error writing to quorum disk 
Oct 29 16:53:20 rhel5-1 kernel: scsi 0:0:0:1: rejecting I/O to dead device
Oct 29 16:53:20 rhel5-1 qdiskd[22945]: <alert> Too many I/O errors; giving up. 
Oct 29 16:53:20 rhel5-1 kernel: scsi 0:0:0:1: rejecting I/O to dead device
Oct 29 16:53:20 rhel5-1 qdiskd[22945]: <warning> Error writing to quorum disk during logout 
Oct 29 16:53:20 rhel5-1 kernel: scsi 0:0:0:1: rejecting I/O to dead device
Oct 29 16:53:20 rhel5-1 last message repeated 10 times
Oct 29 16:53:30 rhel5-1 openais[1916]: [CMAN ] lost contact with quorum device 

===

Oct 29 16:53:33 rhel5-2 qdiskd[5005]: <info> Assuming master role
Oct 29 16:53:34 rhel5-2 qdiskd[5005]: <notice> Writing eviction notice for node 1
Oct 29 16:53:35 rhel5-2 qdiskd[5005]: <notice> Node 1 evicted

No other recovery action was taken; the node neither failed nor rebooted; it was not removed from the cluster by CMAN.

Comment 11 Lon Hohberger 2010-10-29 21:22:12 UTC
The only case I can think of which doesn't "work" is if you restore a node to operation prior to max_error_cycles triggering qdiskd to exit.

It would wake up, read an eviction notice off of disk, and reboot.

I have a preliminary patch to make this "work", but it needs work; it doesn't correctly detect live-hangs if the node goes out to lunch while qdiskd is sleeping.

Comment 12 Lon Hohberger 2010-11-10 15:06:05 UTC
This will require some work upstream before it can be correctly done, as it turns out.

Comment 14 Lon Hohberger 2011-01-25 15:11:50 UTC
*** Bug 600395 has been marked as a duplicate of this bug. ***

Comment 16 Lon Hohberger 2011-02-07 21:24:55 UTC
So, the upstream work referred to here is a proposal to remove multipath from the picture by placing the onus on writing directly to each path of a quorum disk instead of using multipath to sort it out.

The idea is that this will reduce the path failure detection time.  This requires kernel work.

As I recall, the idea was basically to have a 'last-failure, first-success' notification added to AIO.  We write the same data through each path.  Since quorum disks tend to not care about the exact content (just that something changed), it's okay if multiple writes succeed, or even if subsequent writes are out of order.

On the first success, we continue running.  Or, if all I/Os fail, then we note the error (and possibly take action).

This upstream work doesn't "address" the bugzilla directly; rather, it attempts to reduce the chance that multipath environments will have "slow I/O" (potentially during a trespass or path failover), causing qdiskd to "misbehave".
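
The "write the same data through each path" idea can be sketched roughly as follows. This is a synchronous stand-in using plain pwrite(), not the proposed AIO notification (which would need the kernel work mentioned above); the function names and the temp-file demo are hypothetical.

```c
#define _XOPEN_SOURCE 700  /* for pwrite() and mkstemp() */
#include <stddef.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Write the same buffer at the same offset through every path
 * descriptor. Returns the number of paths on which the full write
 * succeeded; 0 means every path failed. A caller would keep running
 * if the result is nonzero and note the error otherwise.
 */
static int write_all_paths(const int *fds, size_t nfds,
                           const void *buf, size_t len, off_t off)
{
    int ok = 0;
    for (size_t i = 0; i < nfds; i++) {
        ssize_t n = pwrite(fds[i], buf, len, off);
        if (n == (ssize_t)len)
            ok++;
    }
    return ok;
}

/* Demo: one dead path (an invalid descriptor) and one good path
 * (a temporary file standing in for a device path). Returns the
 * success count, or -1 if the temp file cannot be created. */
static int demo_one_dead_path(void)
{
    char tmpl[] = "/tmp/qdisk-sketch-XXXXXX";
    int good = mkstemp(tmpl);
    if (good < 0)
        return -1;
    int fds[2] = { -1, good };
    int ok = write_all_paths(fds, 2, "heartbeat", 9, 0);
    close(good);
    unlink(tmpl);
    return ok;
}
```

Because every path carries identical data, a duplicate or late write is harmless, which is why counting successes is enough for this sketch. Unlike the proposal, it waits on each path in turn, so it does not reduce failure detection time.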

Comment 17 Lon Hohberger 2011-02-07 21:47:58 UTC
I think there is some misunderstanding.

I noted on 10/29 that I have a patch; it resolves the "issue".

That patch does prevent disk evictions, but unfortunately, it is at best good for debugging or testing purposes - it would effectively render qdiskd useless for actual "tiebreaker" purposes.

Qdiskd needs *both* timely access to shared storage and the ability to take actions when things fail; "never reboot and do nothing no matter what" renders qdiskd ineffective, if not inoperable, as a quorum determinant.

Once qdiskd cannot take actions at all (including disk-based eviction notices), its behavior becomes unpredictable with respect to both fence-races and fence-loops, two things it was designed to handle.

Comment 18 Lon Hohberger 2011-02-07 21:52:18 UTC
Here's the upstream proposal noted in comment #16:

https://www.redhat.com/archives/dm-devel/2010-November/msg00050.html

Comment 19 Lon Hohberger 2011-02-07 21:56:06 UTC
Created attachment 477504 [details]
Patch noted in comment #16

This patch eliminates disk-based evictions if allow_kill is explicitly set to 0, but applying this patch makes qdiskd non-deterministic as a side effect.

Comment 20 Lon Hohberger 2011-02-07 21:57:16 UTC
Note that the patch in comment #19 is against the STABLE31 branch, not RHEL5; it will not cleanly apply to a RHEL5 cman package.

Comment 21 Lon Hohberger 2011-02-07 21:58:19 UTC
(In reply to comment #19)
> Patch noted in comment #16
> 

Oops, comment #17 - sorry for the confusion.

Comment 25 Lon Hohberger 2011-04-05 18:30:54 UTC
The failure scenarios in the patch proposed in comment #19 are not well-understood; it's best to either:

1) configure timings that will support the I/O interruptions (in the worst-case scenarios) in the event that qdiskd needs to be a quorum determinant

2) use the methods listed in bug 690321

3) roll one's own tiebreaker.  Here's one example that I wrote a few years ago, a "ping" tiebreaker with no disk usage at all:

   https://github.com/lhh/qnet
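
For option 1, the arithmetic can be sketched as below. It takes qdiskd's timeout to be interval multiplied by tko (seconds per cycle times cycles missed). The requirement that the totem token be at least twice that timeout is an assumed rule of thumb, not something stated in this report, and the struct and function names are hypothetical.

```c
#include <stdbool.h>

/* Timing values as they would appear in cluster.conf. */
struct qdisk_timing {
    unsigned interval_s; /* quorumd interval: seconds per cycle */
    unsigned tko;        /* quorumd tko: cycles missed before eviction */
    unsigned token_ms;   /* totem token timeout, in milliseconds */
};

/* Seconds of missed updates before qdiskd declares a node dead. */
static unsigned qdisk_timeout_s(const struct qdisk_timing *t)
{
    return t->interval_s * t->tko;
}

/*
 * True if the worst-case I/O interruption fits inside qdiskd's timeout
 * and the totem token is at least twice that timeout.
 * ASSUMPTION: the 2x factor is a rule of thumb, not taken from this bug.
 */
static bool timing_tolerates(const struct qdisk_timing *t,
                             unsigned worst_outage_s)
{
    unsigned qd = qdisk_timeout_s(t);
    return worst_outage_s < qd && t->token_ms >= 2u * qd * 1000u;
}
```

With the reporter's interval="3" and tko="18", qdisk_timeout_s() gives 54 seconds, so under the assumed 2x rule the token would need to be at least 108000 ms.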

Comment 27 RHEL Program Management 2011-04-05 18:34:55 UTC
Development Management has reviewed and declined this request.  You may appeal
this decision by reopening this request.