Description of problem:
Qdiskd on RHEL4 does not check whether IO has failed for interval*tko seconds, relying only on "cman kill" to evict a node. If the quorum disk is unavailable to all nodes there is no way to act, and when the disk comes back (for example after mpath flapping or temporary unavailability) one of the nodes will be fenced by a cman kill.

How reproducible:
Every time.

Steps to Reproduce:
1. Set up a two-node RHEL4 cluster with a quorum disk: one vote per machine and 1 vote for the quorum disk.
2. Unplug the FC cables on both machines; notice that the FC link goes down.
3. Wait for the qdisk interval*tko seconds to elapse.
4. Notice that no action takes place: no machine is fenced or rebooted.
5. Plug the FC cable back into one of the machines.
6. The machine whose cables were reconnected sends a "CMAN kill" to the other node as soon as it notices that the other node has missed disk heartbeats for interval*tko seconds.

Actual results:
While both nodes have lost FC connectivity, no fencing or reboots occur.

Expected results:
Nodes should be rebooted/fenced when IO fails for interval*tko seconds.

Additional info:
*** sanitized information ***

Example configuration:

<quorumd interval="5" label="mylabel" min_score="1" tko="10" votes="1" paranoid="1">
    <heuristic interval="2" program="ping -c1 -t1 10.0.0.1" score="1"/>
</quorumd>

Disabled the HBAs on both nodes.
After multipathd failed the quorum device, qdiskd detected IO failures but did not cause a reboot or fence:

Jun 26 11:02:43 node2 qdiskd[4633]: <warning> Error reading node ID block 16
Jun 26 11:02:43 node2 kernel: SCSI error : <1 0 0 3> return code = 0x10000
Jun 26 11:02:43 node2 qdiskd[4633]: <err> Error writing to quorum disk
Jun 26 11:02:43 node2 kernel: end_request: I/O error, dev sdc, sector 80
Jun 26 11:02:43 node2 kernel: SCSI error : <1 0 0 3> return code = 0x10000
Jun 26 11:02:43 node2 kernel: end_request: I/O error, dev sdc, sector 88

After re-enabling one of the cards we see:

Jun 26 11:02:44 node2 kernel: qla2400 0000:05:00.0: LOOP UP detected (4 Gbps).
Jun 26 11:02:48 node2 multipathd: 8:32: tur checker reports path is up
Jun 26 11:02:48 node2 multipathd: 8:32: reinstated
Jun 26 11:02:48 node2 multipathd: quorum: queue_if_no_path enabled
Jun 26 11:02:48 node2 multipathd: quorum: Recovered to normal mode
Jun 26 11:02:48 node2 multipathd: quorum: remaining active paths: 1

leading to eviction of the other node:

Jun 26 11:04:00 node2 kernel: CMAN: Quorum device /dev/sdc timed out
Jun 26 11:04:02 node2 qdiskd[4633]: <info> Assuming master role
Jun 26 11:04:07 node2 qdiskd[4633]: <notice> Writing eviction notice for node 1
Jun 26 11:04:07 node2 kernel: CMAN: removing node node1.qcluster from the cluster : Killed by another node
Jun 26 11:04:07 node2 fenced[4713]: node1.qcluster not a cluster member after 0 sec post_fail_delay
Jun 26 11:04:07 node2 fenced[4713]: fencing node "node1.qcluster"
Jun 26 11:04:12 node2 qdiskd[4633]: <notice> Node 1 evicted
Jun 26 11:04:17 node2 fenced[4713]: fence "node1.qcluster" success
Created attachment 351177 [details]
patch implementing io_timeout for qdisk

The patch above introduces the io_timeout parameter to qdiskd, which allows the node to reboot itself if the time since the last successful write operation exceeds interval*tko seconds.

Eduardo.
http://git.fedorahosted.org/git?p=cluster.git;a=commit;h=ac21ded8bf8ae02aa70b578aaf7e0dd9e715124d http://git.fedorahosted.org/git?p=cluster.git;a=commit;h=d174afd633be8e67e3d0c70bef5e2e98d6a6dc18
Technical note added. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team.

New Contents:
Previously, the disk-based quorum daemon (qdisk) could, under certain circumstances, become suspended on input/output to shared storage. With this update, qdisk correctly self-fences when write errors persist for longer than the technical knockout count (tko) multiplied by the interval.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2011-0271.html