Bug 511113

Summary: qdisk does not autoboot/self-fence system if write errors take longer than interval*tko
Product: Red Hat Enterprise Linux 5
Reporter: Eduardo Damato <edamato>
Component: cman
Assignee: Christine Caulfield <ccaulfie>
Status: CLOSED ERRATA
QA Contact: Cluster QE <mspqa-list>
Severity: medium
Priority: medium
Version: 5.4
CC: cluster-maint, cward, edamato, jkortus, tao
Hardware: All
OS: Linux
Fixed In Version: cman-2.0.115-15.el5.src.rpm
Doc Type: Bug Fix
Clone Of: 510611
Last Closed: 2010-03-30 08:40:30 UTC
Bug Depends On: 510611

Attachments:

Description	Flags
patch1 - implements io_timeout - adapted from RHEL4 patch	none
patch2 - implements io_timeout turning max_error_cycles off	none
patch3 - io_timeout implements independent read and write timeout counters for read and write operations	none
Reformatted patch for patch #1	none
Reformatted patch for patch #2	none
Reformatted patch for patch #3	none
Ancillary fix from Fabio in STABLE3	none

Description Eduardo Damato 2009-07-13 18:16:35 UTC

+++ This bug was initially created as a clone of Bug #510611 +++

Description of problem:

Qdiskd on RHEL5 does not check whether I/O has been failing for interval*tko seconds; it relies only on a cman kill to evict a node. If the quorum disk is unavailable to all nodes, nothing acts on the failure, and when the disk comes back (for example after mpath flapping or unavailability) one of the nodes is fenced by a cman kill.

RHEL5 has the max_error_cycles tunable, which makes a node disconnect itself from qdisk gracefully once the number of cycles with errors exceeds max_error_cycles. This is not the same as an interval*tko limit, which can deliver a timeout as expected.
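
The difference between a cycle-count limit and a wall-clock limit can be sketched with a simplified model. This is not qdiskd code: the function names are hypothetical, and the model assumes that every cycle fails and that each failing I/O call blocks for a fixed `io_latency` seconds. The values interval=5 and tko=10 match the example configuration given under "Additional info"; max_error_cycles=10 and io_latency=30 are made-up illustration values.

```python
def seconds_until_cycle_limit(interval, max_error_cycles, io_latency):
    """Time until a count-based limit trips, assuming every cycle fails.

    Hypothetical model: a cycle lasts at least `interval` seconds, or longer
    when the failing I/O call itself blocks for `io_latency` seconds.
    """
    cycle_length = max(interval, io_latency)
    return max_error_cycles * cycle_length


def seconds_until_time_limit(interval, tko):
    """Time until a wall-clock limit of interval*tko trips."""
    return interval * tko


# interval and tko come from the example configuration in this report;
# max_error_cycles and io_latency are illustration values only.
count_based = seconds_until_cycle_limit(interval=5, max_error_cycles=10, io_latency=30)
time_based = seconds_until_time_limit(interval=5, tko=10)
print(count_based, time_based)
```

Under these assumptions, slow I/O errors stretch the count-based limit well past interval*tko, while the wall-clock limit stays fixed.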

How reproducible:

Every time.

Steps to Reproduce:

1- Have a 2-node cluster on RHEL5 with a quorum disk, with one vote per machine and 1 vote for the quorum disk.

2- Unplug the FC cables on both machines and notice that the FC link goes down.

3- Wait for qdisk interval*tko to elapse.

4- Notice that no action takes place and no machines are fenced or rebooted.

5- Plug the FC cable of one of the machines back in.

6- The machine whose cables were plugged back in will send a 'CMAN kill' to the other as soon as it notices that the other has missed disk heartbeats for interval*tko.
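
The vote arithmetic behind steps 1 to 4 can be sketched as follows. Two things here are assumptions for illustration and are not stated in this report: the usual majority rule (quorum = floor(expected_votes / 2) + 1), and that the two nodes can still reach each other over the network while the quorum disk is unreachable. The function name is hypothetical.

```python
def quorum_threshold(expected_votes):
    """Majority rule (assumed): more than half of the expected votes."""
    return expected_votes // 2 + 1


node_votes = [1, 1]   # one vote per machine (step 1)
qdisk_votes = 1       # one vote for the quorum disk (step 1)

expected_votes = sum(node_votes) + qdisk_votes
threshold = quorum_threshold(expected_votes)

# With the quorum disk unreachable, only the node votes remain.
votes_without_qdisk = sum(node_votes)
still_quorate = votes_without_qdisk >= threshold

print(expected_votes, threshold, votes_without_qdisk, still_quorate)
```

Under these assumptions the two nodes together still hold enough votes without the quorum disk, which is consistent with nothing happening in step 4.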

Actual results:

When both nodes have lost FC connectivity, no fencing or reboots occur.

Expected results:

Be able to reboot/fence nodes when IO fails for interval*tko seconds.

Additional info:

*** sanitized information ***
*** this data comes from RHEL4 via clone of BZ ***

Example configuration:

        <quorumd interval="5" label="mylabel" min_score="1" tko="10" votes="1" paranoid="1"> 
                <heuristic interval="2" program="ping -c1 -t1 10.0.0.1" score="1"/> 
        </quorumd> 
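
As a worked example of the interval*tko value this report refers to, the sketch below parses the quorumd element above and multiplies the two attributes. The helper name `qdisk_timeout_seconds` is hypothetical and is not part of cman or qdiskd.

```python
import xml.etree.ElementTree as ET

# The example configuration from this report, copied verbatim.
QUORUMD_XML = """
<quorumd interval="5" label="mylabel" min_score="1" tko="10" votes="1" paranoid="1">
    <heuristic interval="2" program="ping -c1 -t1 10.0.0.1" score="1"/>
</quorumd>
"""


def qdisk_timeout_seconds(xml_text):
    """Return interval * tko for a <quorumd> element (hypothetical helper)."""
    root = ET.fromstring(xml_text.strip())
    return int(root.get("interval")) * int(root.get("tko"))


print(qdisk_timeout_seconds(QUORUMD_XML))  # 5 * 10 = 50
```

With this configuration, the expected behaviour described above would act after 50 seconds of failed I/O.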
 
Disabled HBAs on both nodes. After multipathd failed the quorum device, qdiskd detected I/O failures, but did not cause a reboot or fence:
  
Jun 26 11:02:43 node2 qdiskd[4633]: <warning> Error reading node ID block 16 
Jun 26 11:02:43 node2 kernel: SCSI error : <1 0 0 3> return code = 0x10000 
Jun 26 11:02:43 node2 qdiskd[4633]: <err> Error writing to quorum disk 
Jun 26 11:02:43 node2 kernel: end_request: I/O error, dev sdc, sector 80 
Jun 26 11:02:43 node2 kernel: SCSI error : <1 0 0 3> return code = 0x10000 
Jun 26 11:02:43 node2 kernel: end_request: I/O error, dev sdc, sector 88 
 
After reenabling one of the cards we see: 
 
Jun 26 11:02:44 node2 kernel: qla2400 0000:05:00.0: LOOP UP detected (4 Gbps). 
Jun 26 11:02:48 node2 multipathd: 8:32: tur checker reports path is up 
Jun 26 11:02:48 node2 multipathd: 8:32: reinstated 
Jun 26 11:02:48 node2 multipathd: quorum: queue_if_no_path enabled 
Jun 26 11:02:48 node2 multipathd: quorum: Recovered to normal mode 
Jun 26 11:02:48 node2 multipathd: quorum: remaining active paths: 1 
 
leading to eviction of the other node: 
 
Jun 26 11:04:00 node2 kernel: CMAN: Quorum device /dev/sdc timed out 
Jun 26 11:04:02 node2 qdiskd[4633]: <info> Assuming master role 
Jun 26 11:04:07 node2 qdiskd[4633]: <notice> Writing eviction notice for node 1 
Jun 26 11:04:07 node2 kernel: CMAN: removing node node1.qcluster from the cluster : Killed by another node 
Jun 26 11:04:07 node2 fenced[4713]: node1.qcluster not a cluster member after 0 sec post_fail_delay 
Jun 26 11:04:07 node2 fenced[4713]: fencing node "node1.qcluster" 
Jun 26 11:04:12 node2 qdiskd[4633]: <notice> Node 1 evicted 
Jun 26 11:04:17 node2 fenced[4713]: fence "node1.qcluster" success

Comment 1 Eduardo Damato 2009-07-13 18:18:26 UTC

Created attachment 351506 [details]
patch1 - implements io_timeout - adapted from RHEL4 patch

Comment 2 Eduardo Damato 2009-07-13 18:19:15 UTC

Created attachment 351507 [details]
patch2 - implements io_timeout turning max_error_cycles off

Comment 3 Eduardo Damato 2009-07-13 18:21:43 UTC

Created attachment 351508 [details]
patch3 - io_timeout implements independent read and write timeout counters for read and write operations

This patch fixes the situation where patch2 turns off max_error_cycles, which disables the detection of read() errors and leaves the system rebooting only on write() errors.

This patch creates a timer for the last successful read and reboots the system if the last successful read was more than interval*tko ago.
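
The idea of independent read and write timers can be sketched as below. This is an illustrative model only, not the attached patch: the class name `IoWatchdog` and its methods are hypothetical, and timestamps are passed in explicitly so the behaviour is easy to follow.

```python
class IoWatchdog:
    """Tracks the last successful read and write separately (hypothetical)."""

    def __init__(self, interval, tko, now):
        self.limit = interval * tko   # seconds allowed without a success
        self.last_read_ok = now
        self.last_write_ok = now

    def record_read_ok(self, now):
        self.last_read_ok = now

    def record_write_ok(self, now):
        self.last_write_ok = now

    def should_reboot(self, now):
        """True if either reads or writes have failed for longer than the limit."""
        read_stale = now - self.last_read_ok > self.limit
        write_stale = now - self.last_write_ok > self.limit
        return read_stale or write_stale


# interval=5 and tko=10 give a 50 second limit.
wd = IoWatchdog(interval=5, tko=10, now=0)
wd.record_write_ok(40)        # writes still succeed, reads do not
print(wd.should_reboot(50))   # False: last good read was exactly 50 s ago
print(wd.should_reboot(51))   # True: last good read was more than 50 s ago
```

Keeping the two timestamps separate means that successful writes cannot mask a stalled read path, and vice versa.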

Comment 4 Lon Hohberger 2009-09-29 14:09:44 UTC

Master branch:

http://git.fedorahosted.org/git/?p=cluster.git;a=commit;h=6e91a44cfb2d6baa1a639a2f6e6023bf82ab3cb7

http://git.fedorahosted.org/git/?p=cluster.git;a=commit;h=51049be41e3c3f198f7b39173bddb2d31786bc5b

http://git.fedorahosted.org/git/?p=cluster.git;a=commit;h=a9ef89ce68381955d35288eb329d249e37b31618

Comment 7 Lon Hohberger 2009-10-30 20:39:39 UTC

Created attachment 366860 [details]
Reformatted patch for patch #1

Comment 8 Lon Hohberger 2009-10-30 20:40:08 UTC

Created attachment 366861 [details]
Reformatted patch for patch #2

Comment 9 Lon Hohberger 2009-10-30 20:40:33 UTC

Created attachment 366862 [details]
Reformatted patch for patch #3

Comment 10 Lon Hohberger 2009-10-30 20:41:07 UTC

Created attachment 366864 [details]
Ancillary fix from Fabio in STABLE3

Comment 11 Lon Hohberger 2009-10-30 20:43:47 UTC

http://git.fedorahosted.org/git/?p=cluster.git;a=commit;h=fe9a89972834d0459c312bede9e4a32df52e445a

http://git.fedorahosted.org/git/?p=cluster.git;a=commit;h=8742ae97a69c8cc282faf39d8c1e7bfda441e5b2

http://git.fedorahosted.org/git/?p=cluster.git;a=commit;h=fe46f6b6e9ed9a40c37fa60966fafc1cf07e36d2

http://git.fedorahosted.org/git/?p=cluster.git;a=commit;h=4ad333b7009ada0af3c1a2a5ad8f9815fb67b582

Comment 12 Lon Hohberger 2009-10-30 21:29:12 UTC

Reassigning to default component owner for build.

Comment 15 Chris Ward 2010-02-11 10:12:20 UTC

~~ Attention Customers and Partners - RHEL 5.5 Beta is now available on RHN ~~

RHEL 5.5 Beta has been released! There should be a fix present in this 
release that addresses your request. Please test and report back results 
here, by March 3rd 2010 (2010-03-03) or sooner.

Upon successful verification of this request, post your results and update 
the Verified field in Bugzilla with the appropriate value.

If you encounter any issues while testing, please describe them and set 
this bug into NEED_INFO. If you encounter new defects or have additional 
patch(es) to request for inclusion, please clone this bug per each request
and escalate through your support representative.

Comment 18 errata-xmlrpc 2010-03-30 08:40:30 UTC

An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2010-0266.html

Comment 20 Red Hat Bugzilla 2023-09-14 01:17:12 UTC

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days