Bug 984783 - rhcs failover because of reservation conflict

Status: CLOSED NOTABUG
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: fence-agents
Version: 6.2
Hardware: x86_64 Linux
Priority: unspecified
Severity: high
Target Milestone: rc
Assigned To: Marek Grac
QA Contact: Cluster QE
Doc Type: Bug Fix
Type: Bug
Reported: 2013-07-15 22:47 EDT by kevin_cai
Modified: 2013-09-12 08:27 EDT
Last Closed: 2013-09-12 08:27:46 EDT

Attachments:
cluster.conf (2.58 KB, text/plain) - 2013-07-16 21:31 EDT, kevin_cai
node 1:/var/log/messages (174.77 KB, text/plain) - 2013-07-16 21:32 EDT, kevin_cai
node 2:/var/log/messages (16.04 KB, text/plain) - 2013-07-16 21:32 EDT, kevin_cai
Description kevin_cai 2013-07-15 22:47:55 EDT
Jul 15 15:41:14 test kernel: igb: eth7 NIC Link is Down
Jul 15 15:41:14 test kernel: bonding: bond3: link status definitely down for interface eth7, disabling it
Jul 15 15:41:14 test kernel: igb: eth6 NIC Link is Down
Jul 15 15:41:14 test kernel: bonding: bond3: link status definitely down for interface eth6, disabling it
Jul 15 15:41:14 test kernel: bonding: bond3: now running without any active interface !
Jul 15 15:41:22 test corosync[2888]:   [TOTEM ] A processor failed, forming new configuration.
Jul 15 15:41:24 test corosync[2888]:   [QUORUM] Members[1]: 1
Jul 15 15:41:24 test corosync[2888]:   [TOTEM ] A processor joined or left the membership and a new membership was formed.
Jul 15 15:41:24 test kernel: dlm: closing connection to node 2
Jul 15 15:41:24 test rgmanager[4001]: State change: mysql_node2 DOWN
Jul 15 15:41:24 test corosync[2888]:   [CPG   ] chosen downlist: sender r(0) ip(12.12.200.6) ; members(old:2 left:1)
Jul 15 15:41:24 test corosync[2888]:   [MAIN  ] Completed service synchronization, ready to provide service.
Jul 15 15:41:24 test fenced[2970]: fencing node mysql_node2
Jul 15 15:41:27 test fenced[2970]: fence mysql_node2 success
Jul 15 15:41:28 test rgmanager[4001]: Starting stopped service service:mysql_jrzxt
Jul 15 15:41:28 test rgmanager[2980]: [fs] mounting /dev/dm-7 on /mysqldata
Jul 15 15:41:28 test rgmanager[3006]: [fs] mount -t ext4  /dev/dm-7 /mysqldata
Jul 15 15:41:28 test kernel: EXT4-fs (dm-7): warning: maximal mount count reached, running e2fsck is recommended
Jul 15 15:41:28 test kernel: EXT4-fs (dm-7): recovery complete
Jul 15 15:41:28 test kernel: EXT4-fs (dm-7): mounted filesystem with ordered data mode. Opts: 
Jul 15 15:41:28 test rgmanager[3114]: [ip] Adding IPv4 address 172.16.200.40/25 to bond0
Jul 15 15:41:28 test Keepalived_healthcheckers: Netlink reflector reports IP 172.16.200.40 added
Jul 15 15:41:28 test Keepalived_vrrp: Netlink reflector reports IP 172.16.200.40 added
Jul 15 15:41:29 test rgmanager[3176]: [script] Executing /etc/init.d/mysqld start
Jul 15 15:41:30 test ntpd[3372]: Listening on interface #12 bond0, 172.16.200.40#123 Enabled
Jul 15 15:41:34 test rgmanager[3853]: [ip] Adding IPv4 address 10.11.200.40/25 to bond2
Jul 15 15:41:34 test Keepalived_healthcheckers: Netlink reflector reports IP 10.11.200.40 added
Jul 15 15:41:34 test Keepalived_vrrp: Netlink reflector reports IP 10.11.200.40 added
Jul 15 15:41:35 test rgmanager[4001]: Service service:mysql_jrzxt started
Jul 15 15:41:36 test ntpd[3372]: Listening on interface #13 bond2, 10.11.200.40#123 Enabled
Jul 15 15:41:40 test kernel: igb: eth7 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX/TX
Jul 15 15:41:40 test kernel: bond3: link status definitely up for interface eth7, 1000 Mbps full duplex.
Jul 15 15:41:40 test kernel: bonding: bond3: making interface eth7 the new active one.
Jul 15 15:41:40 test kernel: bonding: bond3: first active interface up!
Jul 15 15:41:40 test kernel: igb: eth6 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX/TX
Jul 15 15:41:40 test kernel: bond3: link status definitely up for interface eth6, 1000 Mbps full duplex.
Jul 15 15:42:10 test rgmanager[4312]: [script] Executing /etc/init.d/mysqld status
Jul 15 15:42:40 test rgmanager[5001]: [script] Executing /etc/init.d/mysqld status
Jul 15 15:43:10 test rgmanager[5541]: [script] Executing /etc/init.d/mysqld status
Jul 15 15:43:40 test rgmanager[6387]: [script] Executing /etc/init.d/mysqld status
Jul 15 15:44:10 test rgmanager[6922]: [script] Executing /etc/init.d/mysqld status
Jul 15 15:44:40 test rgmanager[7688]: [script] Executing /etc/init.d/mysqld status
Jul 15 15:44:51 test kernel: igb: eth6 NIC Link is Down
Jul 15 15:44:51 test kernel: bonding: bond3: link status definitely down for interface eth6, disabling it
Jul 15 15:44:51 test kernel: igb: eth7 NIC Link is Down
Jul 15 15:44:51 test kernel: bonding: bond3: link status definitely down for interface eth7, disabling it
Jul 15 15:44:51 test kernel: bonding: bond3: now running without any active interface !
Jul 15 15:45:06 test kernel: igb: eth7 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX/TX
Jul 15 15:45:06 test kernel: bond3: link status definitely up for interface eth7, 1000 Mbps full duplex.
Jul 15 15:45:06 test kernel: bonding: bond3: making interface eth7 the new active one.
Jul 15 15:45:06 test kernel: bonding: bond3: first active interface up!
Jul 15 15:45:06 test kernel: igb: eth6 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX/TX
Jul 15 15:45:06 test kernel: bond3: link status definitely up for interface eth6, 1000 Mbps full duplex.
Jul 15 15:45:10 test rgmanager[8230]: [script] Executing /etc/init.d/mysqld status
Jul 15 15:45:40 test rgmanager[9002]: [script] Executing /etc/init.d/mysqld status
Jul 15 15:46:10 test rgmanager[9549]: [script] Executing /etc/init.d/mysqld status
Jul 15 15:46:36 test kernel: sd 7:0:0:0: reservation conflict
Jul 15 15:46:36 test kernel: sd 7:0:0:0: [sdb] Unhandled error code
Jul 15 15:46:36 test kernel: sd 7:0:0:0: [sdb] Result: hostbyte=DID_OK driverbyte=DRIVER_OK
Jul 15 15:46:36 test kernel: sd 7:0:0:0: [sdb] CDB: Write(10): 2a 00 25 84 cb 28 00 00 08 00
Jul 15 15:46:36 test kernel: device-mapper: multipath: Failing path 8:16.
Jul 15 15:46:36 test multipathd: 8:16: mark as failed
Jul 15 15:46:36 test multipathd: mysql: remaining active paths: 3
Jul 15 15:46:36 test kernel: sd 10:0:0:0: reservation conflict
Jul 15 15:46:36 test kernel: sd 10:0:0:0: [sde] Unhandled error code
Jul 15 15:46:36 test kernel: sd 10:0:0:0: [sde] Result: hostbyte=DID_OK driverbyte=DRIVER_OK
Jul 15 15:46:36 test kernel: sd 10:0:0:0: [sde] CDB: Write(10): 2a 00 25 84 cb 28 00 00 08 00
Jul 15 15:46:36 test kernel: device-mapper: multipath: Failing path 8:64.
Jul 15 15:46:36 test kernel: sd 8:0:0:0: reservation conflict
Jul 15 15:46:36 test kernel: sd 8:0:0:0: [sdc] Unhandled error code
Jul 15 15:46:36 test kernel: sd 8:0:0:0: [sdc] Result: hostbyte=DID_OK driverbyte=DRIVER_OK
Jul 15 15:46:36 test kernel: sd 8:0:0:0: [sdc] CDB: Write(10): 2a 00 25 84 cb 28 00 00 08 00
Jul 15 15:46:36 test kernel: device-mapper: multipath: Failing path 8:32.
Jul 15 15:46:36 test kernel: sd 9:0:0:0: reservation conflict
Jul 15 15:46:36 test kernel: sd 9:0:0:0: [sdd] Unhandled error code
Jul 15 15:46:36 test kernel: sd 9:0:0:0: [sdd] Result: hostbyte=DID_OK driverbyte=DRIVER_OK
Jul 15 15:46:36 test kernel: sd 9:0:0:0: [sdd] CDB: Write(10): 2a 00 25 84 cb 28 00 00 08 00
Jul 15 15:46:36 test kernel: device-mapper: multipath: Failing path 8:48.
Jul 15 15:46:36 test kernel: Buffer I/O error on device dm-7, logical block 78682213
Jul 15 15:46:36 test kernel: lost page write due to I/O error on dm-7
Jul 15 15:46:36 test kernel: Aborting journal on device dm-7-8.
Jul 15 15:46:36 test kernel: Buffer I/O error on device dm-7, logical block 65044480
Jul 15 15:46:36 test kernel: lost page write due to I/O error on dm-7
Jul 15 15:46:36 test kernel: JBD2: I/O error detected when updating journal superblock for dm-7-8.
Jul 15 15:46:36 test kernel: journal commit I/O error
Jul 15 15:46:37 test kernel: EXT4-fs error (device dm-7): ext4_journal_start_sb: Detected aborted journal
Jul 15 15:46:37 test kernel: EXT4-fs (dm-7): Remounting filesystem read-only
Jul 15 15:46:37 test multipathd: 8:64: mark as failed
Jul 15 15:46:37 test multipathd: mysql: remaining active paths: 2
Jul 15 15:46:37 test multipathd: 8:32: mark as failed
Jul 15 15:46:37 test multipathd: mysql: remaining active paths: 1
Jul 15 15:46:37 test multipathd: 8:48: mark as failed
Jul 15 15:46:37 test multipathd: mysql: remaining active paths: 0
Jul 15 15:46:39 test multipathd: mysql: sdb - directio checker reports path is up
Jul 15 15:46:39 test multipathd: 8:16: reinstated
Jul 15 15:46:39 test multipathd: mysql: remaining active paths: 1
Jul 15 15:46:40 test rgmanager[10341]: [script] Executing /etc/init.d/mysqld status
Jul 15 15:46:40 test rgmanager[10386]: [script] script:mysqld: status of /etc/init.d/mysqld failed (returned 2)
Jul 15 15:46:40 test rgmanager[4001]: status on script "mysqld" returned 1 (generic error)

Version-Release number of selected component (if applicable):

# rpm -qa|grep multipath
device-mapper-multipath-0.4.9-46.el6.x86_64
device-mapper-multipath-libs-0.4.9-46.el6.x86_64
# rpm -qa|grep iscsi
iscsi-initiator-utils-6.2.0.872-34.el6.x86_64
# rpm -qa|grep cman
cman-3.0.12.1-23.el6.x86_64
# rpm -qa|grep rgmanager
rgmanager-3.0.12.1-5.el6.x86_64

Comment 2 Ryan McCabe 2013-07-16 11:40:38 EDT
We need at least your cluster.conf and relevant cluster logs from all nodes at the time of the failure to troubleshoot this.
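For reference, the relevant window can be pulled out of each node's syslog with something like the following. This is only a sketch: the `Jul 15 15:4` pattern comes from this report, but the filenames are illustrative, and an inline sample log is used here in place of a real `/var/log/messages` so the commands are self-contained.

```shell
# Sketch: extract the failure window from a node's syslog.
# LOG would normally be /var/log/messages copied from each node; a
# small inline sample stands in for it here.
LOG=sample-messages.log
printf '%s\n' \
  'Jul 15 15:41:14 test kernel: igb: eth7 NIC Link is Down' \
  'Jul 15 15:46:36 test kernel: sd 7:0:0:0: reservation conflict' \
  'Jul 15 23:16:10 test corosync[2888]: Corosync Cluster Engine started' > "$LOG"

# Keep only the Jul 15 15:40-15:49 entries, which bracket the fence
# and the reservation conflicts.
grep 'Jul 15 15:4' "$LOG" > failure-window.log
wc -l < failure-window.log   # prints 2: the third sample line is outside the window
```

Attaching a per-node file produced this way keeps the timelines of both nodes easy to correlate.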
Comment 3 kevin_cai 2013-07-16 21:31:41 EDT
Created attachment 774580 [details]
cluster.conf

cluster.conf
Comment 4 kevin_cai 2013-07-16 21:32:21 EDT
Created attachment 774583 [details]
node 1:/var/log/messages

node 1:/var/log/messages
Comment 5 kevin_cai 2013-07-16 21:32:59 EDT
Created attachment 774584 [details]
node 2:/var/log/messages

node 2:/var/log/messages
Comment 6 kevin_cai 2013-07-16 21:34:22 EDT
I have sent the cluster.conf and /var/log/messages; the failure is in the "Jul 15 15:4" time window.

#zcat corosync.log-20130716.gz
Jul 15 15:41:22 corosync [TOTEM ] A processor failed, forming new configuration.
Jul 15 15:41:24 corosync [QUORUM] Members[1]: 1
Jul 15 15:41:24 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed.
Jul 15 15:41:24 corosync [CPG   ] chosen downlist: sender r(0) ip(12.12.200.6) ; members(old:2 left:1)
Jul 15 15:41:24 corosync [MAIN  ] Completed service synchronization, ready to provide service.
Jul 15 23:16:03 corosync [SERV  ] Unloading all Corosync service engines.
Jul 15 23:16:03 corosync [SERV  ] Service engine unloaded: corosync extended virtual synchrony service
Jul 15 23:16:03 corosync [SERV  ] Service engine unloaded: corosync configuration service
Jul 15 23:16:03 corosync [SERV  ] Service engine unloaded: corosync cluster closed process group service v1.01
Jul 15 23:16:03 corosync [SERV  ] Service engine unloaded: corosync cluster config database access v1.01
Jul 15 23:16:03 corosync [SERV  ] Service engine unloaded: corosync profile loading service
Jul 15 23:16:03 corosync [SERV  ] Service engine unloaded: openais checkpoint service B.01.01
Jul 15 23:16:03 corosync [SERV  ] Service engine unloaded: corosync CMAN membership service 2.90
Jul 15 23:16:03 corosync [SERV  ] Service engine unloaded: corosync cluster quorum service v0.1
Jul 15 23:16:03 corosync [MAIN  ] Corosync Cluster Engine exiting with status 0 at main.c:1864.
Jul 15 23:16:10 corosync [MAIN  ] Corosync Cluster Engine ('1.4.1'): started and ready to provide service.
Jul 15 23:16:10 corosync [MAIN  ] Corosync built-in features: nss dbus rdma snmp
Jul 15 23:16:10 corosync [MAIN  ] Successfully read config from /etc/cluster/cluster.conf
Jul 15 23:16:10 corosync [MAIN  ] Successfully parsed cman config
Jul 15 23:16:10 corosync [TOTEM ] Initializing transport (UDP/IP Multicast).
Jul 15 23:16:10 corosync [TOTEM ] Initializing transmit/receive security: libtomcrypt SOBER128/SHA1HMAC (mode 0).
Jul 15 23:16:10 corosync [TOTEM ] The network interface [12.12.200.6] is now up.
Jul 15 23:16:10 corosync [QUORUM] Using quorum provider quorum_cman
Jul 15 23:16:10 corosync [SERV  ] Service engine loaded: corosync cluster quorum service v0.1
Jul 15 23:16:10 corosync [CMAN  ] CMAN 3.0.12.1 (built Sep 30 2011 03:17:43) started
Jul 15 23:16:10 corosync [SERV  ] Service engine loaded: corosync CMAN membership service 2.90
Jul 15 23:16:10 corosync [SERV  ] Service engine loaded: openais checkpoint service B.01.01
Jul 15 23:16:10 corosync [SERV  ] Service engine loaded: corosync extended virtual synchrony service
Jul 15 23:16:10 corosync [SERV  ] Service engine loaded: corosync configuration service
Jul 15 23:16:10 corosync [SERV  ] Service engine loaded: corosync cluster closed process group service v1.01
Jul 15 23:16:10 corosync [SERV  ] Service engine loaded: corosync cluster config database access v1.01
Jul 15 23:16:10 corosync [SERV  ] Service engine loaded: corosync profile loading service
Jul 15 23:16:10 corosync [QUORUM] Using quorum provider quorum_cman
Jul 15 23:16:10 corosync [SERV  ] Service engine loaded: corosync cluster quorum service v0.1
Jul 15 23:16:10 corosync [MAIN  ] Compatibility mode set to whitetank.  Using V1 and V2 of the synchronization engine.
Jul 15 23:16:10 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed.
Jul 15 23:16:10 corosync [CMAN  ] quorum regained, resuming activity
Comment 7 Fabio Massimo Di Nitto 2013-09-12 08:27:46 EDT
Jul 15 15:46:35 NFJD-PSC-SGM-SV4 fenced[2928]: fencing node mysql_node1

is the key event here; it is what leads to the reservation conflicts on node 2.


Jul 15 15:46:10 NFJD-PSC-SGM-SV3 rgmanager[9549]: [script] Executing /etc/init.d/mysqld status
Jul 15 15:46:36 NFJD-PSC-SGM-SV3 kernel: sd 7:0:0:0: reservation conflict
Jul 15 15:46:36 NFJD-PSC-SGM-SV3 kernel: sd 7:0:0:0: [sdb] Unhandled error code

You need to verify why one node fenced the other.
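To make the mechanism concrete: assuming fence_scsi is the agent in use here (the component and the symptom both point that way), fencing removes the victim node's SCSI-3 persistent-reservation key from the LUN, after which every write from that node fails with exactly the "reservation conflict" seen above. The state can be inspected with sg_persist from sg3_utils. This is a sketch only: /dev/sdb is a hypothetical stand-in for one path of the clustered LUN, and the guard lets it run harmlessly where no such device exists.

```shell
# Sketch, assuming fence_scsi-style SCSI-3 persistent reservations.
# DEV is a hypothetical stand-in for one path of the clustered LUN.
DEV=/dev/sdb
if [ -b "$DEV" ] && command -v sg_persist >/dev/null 2>&1; then
    # List the registered keys; a fenced node's key will be missing.
    sg_persist --in --read-keys "$DEV"
    # Show the active reservation. fence_scsi uses "Write Exclusive,
    # registrants only", so an unregistered node cannot write.
    sg_persist --in --read-reservation "$DEV"
    RESULT=checked
else
    echo "no $DEV block device or sg_persist binary; nothing to check"
    RESULT=skipped
fi
```

If the victim's key is gone but that node is still up and writing (i.e. fencing fired without the node actually being dead), its I/O fails path by path with reservation conflicts, which matches the multipath and EXT4 errors in the log.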
