Bug 602731
| Field | Value | Field | Value |
|---|---|---|---|
| Summary | qdisk reboots cluster nodes even if it is configured not to do that | | |
| Product | Red Hat Enterprise Linux 5 | Reporter | Marc Milgram <mmilgram> |
| Component | cman | Assignee | Lon Hohberger <lhh> |
| Status | CLOSED WONTFIX | QA Contact | Cluster QE <mspqa-list> |
| Severity | medium | Docs Contact | |
| Priority | high | | |
| Version | 5.5 | CC | clasohm, cluster-maint, cww, djansa, edamato, lhh, tao |
| Target Milestone | rc | | |
| Target Release | --- | | |
| Hardware | All | | |
| OS | Linux | | |
| Whiteboard | | | |
| Fixed In Version | | Doc Type | Bug Fix |
| Doc Text | | Story Points | --- |
| Clone Of | | Environment | |
| Last Closed | 2011-04-05 18:34:55 UTC | Type | --- |
| Regression | --- | Mount Type | --- |
| Documentation | --- | CRM | |
| Verified Versions | | Category | --- |
| oVirt Team | --- | RHEL 7.3 requirements from Atomic Host | |
| Cloudforms Team | --- | Target Upstream Version | |
| Embargoed | | | |
| Attachments | | | |
Looking at that patch again, it doesn't look right; bitwise comparisons are a bit tricky. The patch ensures that the host never reboots, even if reboot is set. This is because `0x3 & 0x1 & 0x2` is always 0: no bit is common to all three values. If you want to compare two bit flags and do a conditional based on it, you need to do it like this:

```
if ((flags & (flag1|flag2)) == (flag1|flag2))
```

For example:

```
+	     (ctx->qc_flags & (RF_IOTIMEOUT|RF_REBOOT)) ==
+	     (RF_IOTIMEOUT|RF_REBOOT) ) {
```

Event posted on 08:26:01, 25th Aug, 2010 BST by ndevos:

> Lon, thanks for checking! That is pretty obvious, I don't know how I (and obviously others) missed that... Cheers, Niels

This event sent from IssueTracker by ndevos, issue 935163.

As it turns out, the patch doesn't really do what we'd want:

1) allow_kill="0" is incompatible with io_timeout="1" by design
2) the part that doesn't "work" today is simply the fact that writing the eviction notice to disk is still performed when allow_kill="0"

At this point, I'm at a loss; I configured an iSCSI target and forcefully ripped it out from under the two clients, and everything continued operating correctly:

```
Oct 29 13:14:21 rhel5-1 kernel: connection1:0: detected conn error (1011)
Oct 29 13:14:22 rhel5-1 iscsid: Kernel reported iSCSI connection 1:0 error (1011) state (3)
Oct 29 13:14:25 rhel5-1 iscsid: connect to 192.168.122.1:3260 failed (Connection refused)
Oct 29 13:14:26 rhel5-1 qdiskd[19668]: <warning> qdiskd: read (system call) has hung for 5 seconds
Oct 29 13:14:26 rhel5-1 qdiskd[19668]: <warning> In 5 more seconds, we will be evicted
Oct 29 13:14:29 rhel5-1 iscsid: connect to 192.168.122.1:3260 failed (Connection refused)
Oct 29 13:14:31 rhel5-1 openais[1916]: [CMAN ] lost contact with quorum device
Oct 29 13:14:33 rhel5-1 iscsid: connect to 192.168.122.1:3260 failed (Connection refused)
Oct 29 13:15:06 rhel5-1 last message repeated 9 times
============
Oct 29 13:14:22 rhel5-2 kernel: connection1:0: detected conn error (1011)
Oct 29 13:14:22 rhel5-2 iscsid: Kernel reported iSCSI connection 1:0 error (1011) state (3)
Oct 29 13:14:24 rhel5-2 iscsid: connect to 192.168.122.1:3260 failed (Connection refused)
Oct 29 13:14:27 rhel5-2 qdiskd[31979]: <warning> qdiskd: read (system call) has hung for 5 seconds
Oct 29 13:14:27 rhel5-2 qdiskd[31979]: <warning> In 5 more seconds, we will be evicted
Oct 29 13:14:28 rhel5-2 iscsid: connect to 192.168.122.1:3260 failed (Connection refused)
```

After several minutes, the hosts are still up. They are spewing I/O errors to the logs, as expected. Restarting the iSCSI target caused the quorum disk daemon to immediately resume normal operations with no interruption in service; neither node rebooted, nor was there any fencing.

My configuration was a 2-node cluster. The cman/totem/quorumd tags (which are all relevant when discussing qdiskd) look like this:

```
<cman expected_votes="3"/>
<totem token="21000"/>
<quorumd master_wins="1" label="rhel5" votes="1"/>
```

Either my test isn't the same as what you tried, or it's working as expected. It's supposed to work this way, if I recall correctly.

I also forcefully killed the connection to the iSCSI target by doing the following:

```
echo 1 > /sys/block/sda/device/delete
```

In this case, I had set:

```
<quorumd master_wins="1" label="rhel5" votes="1" allow_kill="0" max_error_cycles="3"/>
```

Here's what happened:

```
Oct 29 16:53:20 rhel5-1 qdiskd[22945]: <warning> Error reading node ID block 15
Oct 29 16:53:20 rhel5-1 kernel: scsi 0:0:0:1: rejecting I/O to dead device
Oct 29 16:53:20 rhel5-1 qdiskd[22945]: <warning> Error reading node ID block 16
Oct 29 16:53:20 rhel5-1 kernel: scsi 0:0:0:1: rejecting I/O to dead device
Oct 29 16:53:20 rhel5-1 qdiskd[22945]: <err> Error writing to quorum disk
Oct 29 16:53:20 rhel5-1 kernel: scsi 0:0:0:1: rejecting I/O to dead device
Oct 29 16:53:20 rhel5-1 qdiskd[22945]: <alert> Too many I/O errors; giving up.
Oct 29 16:53:20 rhel5-1 kernel: scsi 0:0:0:1: rejecting I/O to dead device
Oct 29 16:53:20 rhel5-1 qdiskd[22945]: <warning> Error writing to quorum disk during logout
Oct 29 16:53:20 rhel5-1 kernel: scsi 0:0:0:1: rejecting I/O to dead device
Oct 29 16:53:20 rhel5-1 last message repeated 10 times
Oct 29 16:53:30 rhel5-1 openais[1916]: [CMAN ] lost contact with quorum device
===
Oct 29 16:53:33 rhel5-2 qdiskd[5005]: <info> Assuming master role
Oct 29 16:53:34 rhel5-2 qdiskd[5005]: <notice> Writing eviction notice for node 1
Oct 29 16:53:35 rhel5-2 qdiskd[5005]: <notice> Node 1 evicted
```

No other recovery action was taken; the node neither failed nor rebooted, and it was not removed from the cluster by CMAN.

The only case I can think of which doesn't "work" is if you restore a node to operation before max_error_cycles triggers qdiskd to exit: the node would wake up, read an eviction notice off of disk, and reboot. I have a preliminary patch to make this "work", but it needs work; it doesn't correctly detect live-hangs if the node goes out to lunch while qdiskd is sleeping. This will require some work upstream before it can be correctly done, as it turns out.

*** Bug 600395 has been marked as a duplicate of this bug. ***

So, the upstream work referred to here is a proposal to remove multipath from the picture by writing directly to each path of a quorum disk instead of relying on multipath to sort it out. The idea is that this will reduce the path failure detection time. This requires kernel work. As I recall, the idea was basically to have a 'last-failure, first-success' notification added to AIO. We write the same data through each path. Since quorum disks tend not to care about the exact content (just that something changed), it's okay if multiple writes succeed, or even if subsequent writes are out of order. On the first success, we continue running. Or, if all I/Os fail, then we note the error (and possibly take action).
This upstream work doesn't "address" the bugzilla directly; rather, it attempts to reduce the chance that multipath environments will have "slow I/O" (potentially during a trespass or path failover), causing qdiskd to "misbehave".

I think there is some misunderstanding. I noted on 10/29 that I have a patch; it resolves the "issue". That patch does prevent disk evictions, but unfortunately it is at best good for debugging or testing purposes: it would effectively render qdiskd useless for actual "tiebreaker" purposes. Qdiskd needs *both* timely access to shared storage and the ability to take actions when things fail; "never reboot and do nothing no matter what" renders qdiskd ineffective, if not inoperable, as a quorum determinant. Once qdiskd cannot take actions at all (including disk-based eviction notices), its behavior becomes unpredictable with respect to both fence-races and fence-loops, two things it was designed to handle.

Here's the upstream proposal noted in comment #16:

https://www.redhat.com/archives/dm-devel/2010-November/msg00050.html

Created attachment 477504 [details] Patch noted in comment #16

This patch eliminates disk-based evictions if allow_kill is explicitly set to 0, but applying this patch makes qdiskd non-deterministic as a side effect.

Note that the patch in comment #19 is against the STABLE31 branch, not RHEL5; it will not cleanly apply to a RHEL5 cman package.

(In reply to comment #19)
> Patch noted in comment #16

Oops, comment #17 - sorry for the confusion.

The failure scenarios in the patch proposed in comment #19 are not well understood; it's best to either:

1) configure timings that will support the I/O interruptions (in the worst-case scenarios) in the event that qdiskd needs to be a quorum determinant
2) use the methods listed in bug 690321
3) roll one's own tiebreaker, like this example: a "ping" tiebreaker with no disk usage at all.
Here's one example that I wrote a few years ago: https://github.com/lhh/qnet

Development Management has reviewed and declined this request. You may appeal this decision by reopening this request.
Created attachment 422948 [details] Proposed qdisk patch

Description of problem: qdiskd should not reboot a node if the qdisk goes offline when it is configured with reboot="0" and io_timeout="0".

Version-Release number of selected component (if applicable): cman-2.0.115-34.el5.x86_64.rpm

How reproducible: 100%

Steps to Reproduce:
1. Configure a 2-node cluster with a shared quorum disk, with reboot="0" and io_timeout="0". In /etc/cluster/cluster.conf, something like:
```
<quorumd allow_kill="0" device="/dev/mpath/ds0003_01a4" interval="3" io_timeout="0" max_error_cycles="5" min_score="1" reboot="0" tko="18" votes="1"/>
```
2. Start the cluster.
3. Stop access to the quorum disk (so that neither cluster node can access it).

Actual results: Cluster nodes reboot.

Expected results: Cluster nodes keep running.

Additional info: