Bug 984783 - rhcs failover because of reservation conflict

Status: CLOSED NOTABUG
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: fence-agents
Version: 6.2
Hardware: x86_64 Linux
Priority: unspecified
Severity: high
Target Milestone: rc
Assigned To: Marek Grac
QA Contact: Cluster QE
Doc Type: Bug Fix
Type: Bug
Reported: 2013-07-15 22:47 EDT by kevin_cai
Modified: 2013-09-12 08:27 EDT
Last Closed: 2013-09-12 08:27:46 EDT

Attachments:
cluster.conf (2.58 KB, text/plain) - 2013-07-16 21:31 EDT, kevin_cai
node 1:/var/log/messages (174.77 KB, text/plain) - 2013-07-16 21:32 EDT, kevin_cai
node 2:/var/log/messages (16.04 KB, text/plain) - 2013-07-16 21:32 EDT, kevin_cai
Description kevin_cai 2013-07-15 22:47:55 EDT
Jul 15 15:41:14 test kernel: igb: eth7 NIC Link is Down
Jul 15 15:41:14 test kernel: bonding: bond3: link status definitely down for interface eth7, disabling it
Jul 15 15:41:14 test kernel: igb: eth6 NIC Link is Down
Jul 15 15:41:14 test kernel: bonding: bond3: link status definitely down for interface eth6, disabling it
Jul 15 15:41:14 test kernel: bonding: bond3: now running without any active interface !
Jul 15 15:41:22 test corosync[2888]:   [TOTEM ] A processor failed, forming new configuration.
Jul 15 15:41:24 test corosync[2888]:   [QUORUM] Members[1]: 1
Jul 15 15:41:24 test corosync[2888]:   [TOTEM ] A processor joined or left the membership and a new membership was formed.
Jul 15 15:41:24 test kernel: dlm: closing connection to node 2
Jul 15 15:41:24 test rgmanager[4001]: State change: mysql_node2 DOWN
Jul 15 15:41:24 test corosync[2888]:   [CPG   ] chosen downlist: sender r(0) ip(12.12.200.6) ; members(old:2 left:1)
Jul 15 15:41:24 test corosync[2888]:   [MAIN  ] Completed service synchronization, ready to provide service.
Jul 15 15:41:24 test fenced[2970]: fencing node mysql_node2
Jul 15 15:41:27 test fenced[2970]: fence mysql_node2 success
Jul 15 15:41:28 test rgmanager[4001]: Starting stopped service service:mysql_jrzxt
Jul 15 15:41:28 test rgmanager[2980]: [fs] mounting /dev/dm-7 on /mysqldata
Jul 15 15:41:28 test rgmanager[3006]: [fs] mount -t ext4  /dev/dm-7 /mysqldata
Jul 15 15:41:28 test kernel: EXT4-fs (dm-7): warning: maximal mount count reached, running e2fsck is recommended
Jul 15 15:41:28 test kernel: EXT4-fs (dm-7): recovery complete
Jul 15 15:41:28 test kernel: EXT4-fs (dm-7): mounted filesystem with ordered data mode. Opts: 
Jul 15 15:41:28 test rgmanager[3114]: [ip] Adding IPv4 address 172.16.200.40/25 to bond0
Jul 15 15:41:28 test Keepalived_healthcheckers: Netlink reflector reports IP 172.16.200.40 added
Jul 15 15:41:28 test Keepalived_vrrp: Netlink reflector reports IP 172.16.200.40 added
Jul 15 15:41:29 test rgmanager[3176]: [script] Executing /etc/init.d/mysqld start
Jul 15 15:41:30 test ntpd[3372]: Listening on interface #12 bond0, 172.16.200.40#123 Enabled
Jul 15 15:41:34 test rgmanager[3853]: [ip] Adding IPv4 address 10.11.200.40/25 to bond2
Jul 15 15:41:34 test Keepalived_healthcheckers: Netlink reflector reports IP 10.11.200.40 added
Jul 15 15:41:34 test Keepalived_vrrp: Netlink reflector reports IP 10.11.200.40 added
Jul 15 15:41:35 test rgmanager[4001]: Service service:mysql_jrzxt started
Jul 15 15:41:36 test ntpd[3372]: Listening on interface #13 bond2, 10.11.200.40#123 Enabled
Jul 15 15:41:40 test kernel: igb: eth7 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX/TX
Jul 15 15:41:40 test kernel: bond3: link status definitely up for interface eth7, 1000 Mbps full duplex.
Jul 15 15:41:40 test kernel: bonding: bond3: making interface eth7 the new active one.
Jul 15 15:41:40 test kernel: bonding: bond3: first active interface up!
Jul 15 15:41:40 test kernel: igb: eth6 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX/TX
Jul 15 15:41:40 test kernel: bond3: link status definitely up for interface eth6, 1000 Mbps full duplex.
Jul 15 15:42:10 test rgmanager[4312]: [script] Executing /etc/init.d/mysqld status
Jul 15 15:42:40 test rgmanager[5001]: [script] Executing /etc/init.d/mysqld status
Jul 15 15:43:10 test rgmanager[5541]: [script] Executing /etc/init.d/mysqld status
Jul 15 15:43:40 test rgmanager[6387]: [script] Executing /etc/init.d/mysqld status
Jul 15 15:44:10 test rgmanager[6922]: [script] Executing /etc/init.d/mysqld status
Jul 15 15:44:40 test rgmanager[7688]: [script] Executing /etc/init.d/mysqld status
Jul 15 15:44:51 test kernel: igb: eth6 NIC Link is Down
Jul 15 15:44:51 test kernel: bonding: bond3: link status definitely down for interface eth6, disabling it
Jul 15 15:44:51 test kernel: igb: eth7 NIC Link is Down
Jul 15 15:44:51 test kernel: bonding: bond3: link status definitely down for interface eth7, disabling it
Jul 15 15:44:51 test kernel: bonding: bond3: now running without any active interface !
Jul 15 15:45:06 test kernel: igb: eth7 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX/TX
Jul 15 15:45:06 test kernel: bond3: link status definitely up for interface eth7, 1000 Mbps full duplex.
Jul 15 15:45:06 test kernel: bonding: bond3: making interface eth7 the new active one.
Jul 15 15:45:06 test kernel: bonding: bond3: first active interface up!
Jul 15 15:45:06 test kernel: igb: eth6 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX/TX
Jul 15 15:45:06 test kernel: bond3: link status definitely up for interface eth6, 1000 Mbps full duplex.
Jul 15 15:45:10 test rgmanager[8230]: [script] Executing /etc/init.d/mysqld status
Jul 15 15:45:40 test rgmanager[9002]: [script] Executing /etc/init.d/mysqld status
Jul 15 15:46:10 test rgmanager[9549]: [script] Executing /etc/init.d/mysqld status
Jul 15 15:46:36 test kernel: sd 7:0:0:0: reservation conflict
Jul 15 15:46:36 test kernel: sd 7:0:0:0: [sdb] Unhandled error code
Jul 15 15:46:36 test kernel: sd 7:0:0:0: [sdb] Result: hostbyte=DID_OK driverbyte=DRIVER_OK
Jul 15 15:46:36 test kernel: sd 7:0:0:0: [sdb] CDB: Write(10): 2a 00 25 84 cb 28 00 00 08 00
Jul 15 15:46:36 test kernel: device-mapper: multipath: Failing path 8:16.
Jul 15 15:46:36 test multipathd: 8:16: mark as failed
Jul 15 15:46:36 test multipathd: mysql: remaining active paths: 3
Jul 15 15:46:36 test kernel: sd 10:0:0:0: reservation conflict
Jul 15 15:46:36 test kernel: sd 10:0:0:0: [sde] Unhandled error code
Jul 15 15:46:36 test kernel: sd 10:0:0:0: [sde] Result: hostbyte=DID_OK driverbyte=DRIVER_OK
Jul 15 15:46:36 test kernel: sd 10:0:0:0: [sde] CDB: Write(10): 2a 00 25 84 cb 28 00 00 08 00
Jul 15 15:46:36 test kernel: device-mapper: multipath: Failing path 8:64.
Jul 15 15:46:36 test kernel: sd 8:0:0:0: reservation conflict
Jul 15 15:46:36 test kernel: sd 8:0:0:0: [sdc] Unhandled error code
Jul 15 15:46:36 test kernel: sd 8:0:0:0: [sdc] Result: hostbyte=DID_OK driverbyte=DRIVER_OK
Jul 15 15:46:36 test kernel: sd 8:0:0:0: [sdc] CDB: Write(10): 2a 00 25 84 cb 28 00 00 08 00
Jul 15 15:46:36 test kernel: device-mapper: multipath: Failing path 8:32.
Jul 15 15:46:36 test kernel: sd 9:0:0:0: reservation conflict
Jul 15 15:46:36 test kernel: sd 9:0:0:0: [sdd] Unhandled error code
Jul 15 15:46:36 test kernel: sd 9:0:0:0: [sdd] Result: hostbyte=DID_OK driverbyte=DRIVER_OK
Jul 15 15:46:36 test kernel: sd 9:0:0:0: [sdd] CDB: Write(10): 2a 00 25 84 cb 28 00 00 08 00
Jul 15 15:46:36 test kernel: device-mapper: multipath: Failing path 8:48.
Jul 15 15:46:36 test kernel: Buffer I/O error on device dm-7, logical block 78682213
Jul 15 15:46:36 test kernel: lost page write due to I/O error on dm-7
Jul 15 15:46:36 test kernel: Aborting journal on device dm-7-8.
Jul 15 15:46:36 test kernel: Buffer I/O error on device dm-7, logical block 65044480
Jul 15 15:46:36 test kernel: lost page write due to I/O error on dm-7
Jul 15 15:46:36 test kernel: JBD2: I/O error detected when updating journal superblock for dm-7-8.
Jul 15 15:46:36 test kernel: journal commit I/O error
Jul 15 15:46:37 test kernel: EXT4-fs error (device dm-7): ext4_journal_start_sb: Detected aborted journal
Jul 15 15:46:37 test kernel: EXT4-fs (dm-7): Remounting filesystem read-only
Jul 15 15:46:37 test multipathd: 8:64: mark as failed
Jul 15 15:46:37 test multipathd: mysql: remaining active paths: 2
Jul 15 15:46:37 test multipathd: 8:32: mark as failed
Jul 15 15:46:37 test multipathd: mysql: remaining active paths: 1
Jul 15 15:46:37 test multipathd: 8:48: mark as failed
Jul 15 15:46:37 test multipathd: mysql: remaining active paths: 0
Jul 15 15:46:39 test multipathd: mysql: sdb - directio checker reports path is up
Jul 15 15:46:39 test multipathd: 8:16: reinstated
Jul 15 15:46:39 test multipathd: mysql: remaining active paths: 1
Jul 15 15:46:40 test rgmanager[10341]: [script] Executing /etc/init.d/mysqld status
Jul 15 15:46:40 test rgmanager[10386]: [script] script:mysqld: status of /etc/init.d/mysqld failed (returned 2)
Jul 15 15:46:40 test rgmanager[4001]: status on script "mysqld" returned 1 (generic error)

Version-Release number of selected component (if applicable):

# rpm -qa|grep multipath
device-mapper-multipath-0.4.9-46.el6.x86_64
device-mapper-multipath-libs-0.4.9-46.el6.x86_64
# rpm -qa|grep iscsi
iscsi-initiator-utils-6.2.0.872-34.el6.x86_64
# rpm -qa|grep cman
cman-3.0.12.1-23.el6.x86_64
# rpm -qa|grep rgmanager
rgmanager-3.0.12.1-5.el6.x86_64

Comment 2 Ryan McCabe 2013-07-16 11:40:38 EDT
We need at least your cluster.conf and relevant cluster logs from all nodes at the time of the failure to troubleshoot this.
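For reference, the relevant window can be pulled out of each node's syslog with something like the following. This is only a sketch: the `Jul 15 15:4` pattern comes from this report, but the filenames are illustrative, and an inline sample log is used here in place of a real `/var/log/messages` so the commands are self-contained.

```shell
# Sketch: extract the failure window from a node's syslog.
# LOG would normally be /var/log/messages copied from each node; a
# small inline sample stands in for it here.
LOG=sample-messages.log
printf '%s\n' \
  'Jul 15 15:41:14 test kernel: igb: eth7 NIC Link is Down' \
  'Jul 15 15:46:36 test kernel: sd 7:0:0:0: reservation conflict' \
  'Jul 15 23:16:10 test corosync[2888]: Corosync Cluster Engine started' > "$LOG"

# Keep only the Jul 15 15:40-15:49 entries, which bracket the fence
# and the reservation conflicts.
grep 'Jul 15 15:4' "$LOG" > failure-window.log
wc -l < failure-window.log   # prints 2: the third sample line is outside the window
```

Attaching a per-node file produced this way keeps the timelines of both nodes easy to correlate.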
Comment 3 kevin_cai 2013-07-16 21:31:41 EDT
Created attachment 774580 [details]
cluster.conf

cluster.conf
Comment 4 kevin_cai 2013-07-16 21:32:21 EDT
Created attachment 774583 [details]
node 1:/var/log/messages

node 1:/var/log/messages
Comment 5 kevin_cai 2013-07-16 21:32:59 EDT
Created attachment 774584 [details]
node 2:/var/log/messages

node 2:/var/log/messages
Comment 6 kevin_cai 2013-07-16 21:34:22 EDT
I have sent the cluster.conf and /var/log/messages; the failure is in the "Jul 15 15:4" time window.

#zcat corosync.log-20130716.gz
Jul 15 15:41:22 corosync [TOTEM ] A processor failed, forming new configuration.
Jul 15 15:41:24 corosync [QUORUM] Members[1]: 1
Jul 15 15:41:24 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed.
Jul 15 15:41:24 corosync [CPG   ] chosen downlist: sender r(0) ip(12.12.200.6) ; members(old:2 left:1)
Jul 15 15:41:24 corosync [MAIN  ] Completed service synchronization, ready to provide service.
Jul 15 23:16:03 corosync [SERV  ] Unloading all Corosync service engines.
Jul 15 23:16:03 corosync [SERV  ] Service engine unloaded: corosync extended virtual synchrony service
Jul 15 23:16:03 corosync [SERV  ] Service engine unloaded: corosync configuration service
Jul 15 23:16:03 corosync [SERV  ] Service engine unloaded: corosync cluster closed process group service v1.01
Jul 15 23:16:03 corosync [SERV  ] Service engine unloaded: corosync cluster config database access v1.01
Jul 15 23:16:03 corosync [SERV  ] Service engine unloaded: corosync profile loading service
Jul 15 23:16:03 corosync [SERV  ] Service engine unloaded: openais checkpoint service B.01.01
Jul 15 23:16:03 corosync [SERV  ] Service engine unloaded: corosync CMAN membership service 2.90
Jul 15 23:16:03 corosync [SERV  ] Service engine unloaded: corosync cluster quorum service v0.1
Jul 15 23:16:03 corosync [MAIN  ] Corosync Cluster Engine exiting with status 0 at main.c:1864.
Jul 15 23:16:10 corosync [MAIN  ] Corosync Cluster Engine ('1.4.1'): started and ready to provide service.
Jul 15 23:16:10 corosync [MAIN  ] Corosync built-in features: nss dbus rdma snmp
Jul 15 23:16:10 corosync [MAIN  ] Successfully read config from /etc/cluster/cluster.conf
Jul 15 23:16:10 corosync [MAIN  ] Successfully parsed cman config
Jul 15 23:16:10 corosync [TOTEM ] Initializing transport (UDP/IP Multicast).
Jul 15 23:16:10 corosync [TOTEM ] Initializing transmit/receive security: libtomcrypt SOBER128/SHA1HMAC (mode 0).
Jul 15 23:16:10 corosync [TOTEM ] The network interface [12.12.200.6] is now up.
Jul 15 23:16:10 corosync [QUORUM] Using quorum provider quorum_cman
Jul 15 23:16:10 corosync [SERV  ] Service engine loaded: corosync cluster quorum service v0.1
Jul 15 23:16:10 corosync [CMAN  ] CMAN 3.0.12.1 (built Sep 30 2011 03:17:43) started
Jul 15 23:16:10 corosync [SERV  ] Service engine loaded: corosync CMAN membership service 2.90
Jul 15 23:16:10 corosync [SERV  ] Service engine loaded: openais checkpoint service B.01.01
Jul 15 23:16:10 corosync [SERV  ] Service engine loaded: corosync extended virtual synchrony service
Jul 15 23:16:10 corosync [SERV  ] Service engine loaded: corosync configuration service
Jul 15 23:16:10 corosync [SERV  ] Service engine loaded: corosync cluster closed process group service v1.01
Jul 15 23:16:10 corosync [SERV  ] Service engine loaded: corosync cluster config database access v1.01
Jul 15 23:16:10 corosync [SERV  ] Service engine loaded: corosync profile loading service
Jul 15 23:16:10 corosync [QUORUM] Using quorum provider quorum_cman
Jul 15 23:16:10 corosync [SERV  ] Service engine loaded: corosync cluster quorum service v0.1
Jul 15 23:16:10 corosync [MAIN  ] Compatibility mode set to whitetank.  Using V1 and V2 of the synchronization engine.
Jul 15 23:16:10 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed.
Jul 15 23:16:10 corosync [CMAN  ] quorum regained, resuming activity
Comment 7 Fabio Massimo Di Nitto 2013-09-12 08:27:46 EDT
Jul 15 15:46:35 NFJD-PSC-SGM-SV4 fenced[2928]: fencing node mysql_node1

is the key event here; it is what leads to the reservation conflicts on node 2.


Jul 15 15:46:10 NFJD-PSC-SGM-SV3 rgmanager[9549]: [script] Executing /etc/init.d/mysqld status
Jul 15 15:46:36 NFJD-PSC-SGM-SV3 kernel: sd 7:0:0:0: reservation conflict
Jul 15 15:46:36 NFJD-PSC-SGM-SV3 kernel: sd 7:0:0:0: [sdb] Unhandled error code

You need to verify why one node fenced the other.
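To make the mechanism concrete: assuming fence_scsi is the agent in use here (the component and the symptom both point that way), fencing removes the victim node's SCSI-3 persistent-reservation key from the LUN, after which every write from that node fails with exactly the "reservation conflict" seen above. The state can be inspected with sg_persist from sg3_utils. This is a sketch only: /dev/sdb is a hypothetical stand-in for one path of the clustered LUN, and the guard lets it run harmlessly where no such device exists.

```shell
# Sketch, assuming fence_scsi-style SCSI-3 persistent reservations.
# DEV is a hypothetical stand-in for one path of the clustered LUN.
DEV=/dev/sdb
if [ -b "$DEV" ] && command -v sg_persist >/dev/null 2>&1; then
    # List the registered keys; a fenced node's key will be missing.
    sg_persist --in --read-keys "$DEV"
    # Show the active reservation. fence_scsi uses "Write Exclusive,
    # registrants only", so an unregistered node cannot write.
    sg_persist --in --read-reservation "$DEV"
    RESULT=checked
else
    echo "no $DEV block device or sg_persist binary; nothing to check"
    RESULT=skipped
fi
```

If the victim's key is gone but that node is still up and writing (i.e. fencing fired without the node actually being dead), its I/O fails path by path with reservation conflicts, which matches the multipath and EXT4 errors in the log.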
