Bug 585210 - Split brain when I kill aisexec (qdisk, fence_scsi)
Status: CLOSED DUPLICATE of bug 639961
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: cman
Version: 5.5
Hardware: i386 Linux
Priority: low
Severity: medium
Target Milestone: rc
Assigned To: Lon Hohberger
QA Contact: Cluster QE
Reported: 2010-04-23 08:48 EDT by macbogucki
Modified: 2011-01-25 10:10 EST

Doc Type: Bug Fix
Clones: 589131
Last Closed: 2011-01-25 10:10:12 EST


Attachments
cluster.conf (2.15 KB, text/plain)
2010-04-23 08:49 EDT, macbogucki

Description macbogucki 2010-04-23 08:48:52 EDT
Description of problem:

I have a two-node cluster with qdisk and rgmanager. When I kill aisexec on node1 (where the BROKER service is running), I get a split-brain situation: the BROKER service ends up running on both nodes. After upgrading rgmanager to the version from the RHEL 5.5 BETA (rgmanager-2.0.52-3.el5), the split brain no longer occurs, but only because the IP address is still on node1 (rhbz#526647).
I think rgmanager on node1 should handle this situation and stop the BROKER service when aisexec is down (a rough sketch of such a watchdog is included at the end of this comment).
The problem occurs because I use fence_scsi; it would be the same with any SAN-based fencing, e.g. fence_brocade.
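
To make the scenario concrete, here is a minimal reproduction sketch (the exact commands are my assumption of how the failure was triggered; the full configuration is in the attached cluster.conf):

  # on node1, where service BROKER is running:
  killall -9 aisexec                                    # simulate failure of the openais executive
  /opt/webmeth/71_prodBroker/Broker/aw_broker71 status  # BROKER keeps running; rgmanager never stops it

  # on node2, after the 30 sec post_fail_delay and the fence_scsi "success":
  clustat                                               # node2 takes over and starts BROKER as well -> split brain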

---node1---
Mar  5 10:48:52 node1 clurgmgrd: [10813]: <info> Executing /opt/webmeth/71_prodBroker/Broker/aw_broker71 status
Mar  5 10:49:14 node1 fenced[10361]: cluster is down, exiting
Mar  5 10:49:14 node1 gfs_controld[10373]: cluster is down, exiting
Mar  5 10:49:14 node1 dlm_controld[10367]: cluster is down, exiting
Mar  5 10:49:14 node1 kernel: dlm: closing connection to node 2
Mar  5 10:49:14 node1 kernel: dlm: closing connection to node 1
Mar  5 10:49:19 node1 qdiskd[10340]: <err> cman_dispatch: Host is down
Mar  5 10:49:19 node1 qdiskd[10340]: <err> Halting qdisk operations
Mar  5 10:49:25 node1 kernel: dlm: connect from non cluster node
Mar  5 10:49:42 node1 ccsd[10298]: Unable to connect to cluster infrastructure after 30 seconds.
Mar  5 10:50:13 node1 ccsd[10298]: Unable to connect to cluster infrastructure after 60 seconds.
Mar  5 10:50:43 node1 ccsd[10298]: Unable to connect to cluster infrastructure after 90 seconds.
Mar  5 10:51:13 node1 ccsd[10298]: Unable to connect to cluster infrastructure after 120 seconds.
Mar  5 10:51:43 node1 ccsd[10298]: Unable to connect to cluster infrastructure after 150 seconds.
Mar  5 10:52:13 node1 ccsd[10298]: Unable to connect to cluster infrastructure after 180 seconds.
Mar  5 10:52:43 node1 ccsd[10298]: Unable to connect to cluster infrastructure after 210 seconds.
---node1---

---node2---
Mar  5 10:50:47 node1 clurgmgrd[20822]: <info> Waiting for node #1 to be fenced
Mar  5 10:51:11 node1 fenced[8540]: node1 not a cluster member after 30 sec post_fail_delay
Mar  5 10:51:11 node1 fenced[8540]: fencing node "node1"
Mar  5 10:51:11 node1 fenced[8540]: fence "node1" success
Mar  5 10:51:13 node1 clurgmgrd[20822]: <info> Node #1 fenced; continuing
Mar  5 10:51:13 node1 clurgmgrd[20822]: <notice> Taking over service service:BROKER from down member node1
Mar  5 10:51:13 node1 clurgmgrd: [20822]: <info> mounting /dev/mapper/storage0-broker on /opt/webmeth/71_prodBroker/Broker/data
Mar  5 10:51:13 node1 kernel: kjournald starting.  Commit interval 5 seconds
Mar  5 10:51:13 node1 kernel: EXT3 FS on dm-7, internal journal
Mar  5 10:51:13 node1 kernel: EXT3-fs: mounted filesystem with ordered data mode.
Mar  5 10:51:13 node1 clurgmgrd: [20822]: <info> Adding IPv4 address 192.168.33.18/24 to bond0
Mar  5 10:51:13 node1 clurgmgrd: [20822]: <err> IPv4 address collision 192.168.33.18
Mar  5 10:51:13 node1 clurgmgrd[20822]: <notice> start on ip "192.168.33.18/24" returned 1 (generic error)
Mar  5 10:51:13 node1 clurgmgrd[20822]: <warning> #68: Failed to start service:BROKER; return value: 1
Mar  5 10:51:13 node1 clurgmgrd[20822]: <notice> Stopping service service:BROKER
Mar  5 10:51:13 node1 clurgmgrd: [20822]: <info> Executing /opt/webmeth/71_prodBroker/Broker/aw_broker71 stop
Mar  5 10:51:13 node1 clurgmgrd: [20822]: <info> unmounting /opt/webmeth/71_prodBroker/Broker/data
Mar  5 10:51:13 node1 clurgmgrd[20822]: <notice> Service service:BROKER is recovering
Mar  5 10:51:13 node1 clurgmgrd[20822]: <warning> #71: Relocating failed service service:BROKER
Mar  5 10:51:13 node1 clurgmgrd[20822]: <notice> Service service:BROKER is stopped
---node2---

I have tested this with cman from RHEL 5.5 (cman-2.0.115-29.el5) and cman from the RHEL 5.4 BETA (cman-2.0.115-1.el5_4.9).


Version-Release number of selected component (if applicable):

cman-2.0.115-29.el5
cman-2.0.115-1.el5_4.9
rgmanager-2.0.52-3.el5
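
For illustration only, a rough shell sketch of the behaviour I am asking for (this is not part of cman or rgmanager, just a hypothetical local watchdog that stops rgmanager-managed services when aisexec disappears):

  #!/bin/sh
  # Hypothetical watchdog: if the openais executive disappears, stop
  # rgmanager so the local BROKER instance is shut down instead of
  # surviving into a split brain.
  while sleep 5; do
      if ! pidof aisexec >/dev/null 2>&1; then
          logger -t ais-watchdog "aisexec is gone, stopping rgmanager"
          service rgmanager stop
          break
      fi
  done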
Comment 1 macbogucki 2010-04-23 08:49:27 EDT
Created attachment 408598
cluster.conf
Comment 2 Lon Hohberger 2010-04-29 10:47:05 EDT
What state was rgmanager in on the fenced host?
Comment 3 Lon Hohberger 2011-01-25 10:10:12 EST

*** This bug has been marked as a duplicate of bug 639961 ***
