Bug 1244097

Summary: corosync+pacemaker: Virtual IPs have gone down.
Product: [Red Hat Storage] Red Hat Gluster Storage Reporter: Saurabh <saujain>
Component: nfs-ganesha    Assignee: Kaleb KEITHLEY <kkeithle>
Status: CLOSED WORKSFORME QA Contact: storage-qa-internal <storage-qa-internal>
Severity: urgent Docs Contact:
Priority: unspecified    
Version: rhgs-3.1    CC: jthottan, kkeithle, mzywusko, ndevos, nlevinki, rcyriac, sankarshan, skoduri, smohan
Target Milestone: ---    Keywords: Triaged
Target Release: ---   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2016-06-20 11:52:18 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments:
Description Flags
corosync log from rhs-client21
none
pacemaker log from rhs-client21
none
pcsd log from rhs-client21
none
messages log from rhs-client21
none

Description Saurabh 2015-07-17 06:20:18 UTC
Created attachment 1052991 [details]
corosync log from rhs-client21

Description of problem:
To provide HA for the nfs-ganesha service we are using the pacemaker/corosync based clustering mechanism.
This clustering requires Virtual IPs; once the cluster is up and running, the Virtual IPs are displayed in the output of the "ip a" command. In the present setup, however, the Virtual IPs themselves have gone down and are no longer displayed in the "ip a" output.

Example from a working cluster:
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP qlen 1000
    link/ether 00:2a:4a:46:20:05 brd ff:ff:ff:ff:ff:ff
    inet 10.70.46.8/22 brd 10.70.47.255 scope global eth0
    inet 10.70.44.92/32 brd 10.70.47.255 scope global eth0
here "10.70.46.8" is the physical IP
and "10.70.44.92" is the virtual IP

Present (non-working) cluster:
4: eth2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP qlen 1000
    link/ether 00:25:90:93:62:0a brd ff:ff:ff:ff:ff:ff
    inet 10.70.36.45/23 brd 10.70.37.255 scope global eth2
here we only have the physical IP, i.e. "10.70.36.45"; no virtual IP is listed.
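A quick way to tell the two states apart is to check whether a given virtual IP appears in the `ip a` output. This is a minimal sketch: the `has_vip` helper name is mine, and `SAMPLE` is the working-cluster output quoted above; on a live node you would feed it the output of `ip -4 addr show eth0` instead.

```shell
# Sample "ip a" output from the working cluster quoted in this report.
SAMPLE='2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP qlen 1000
    link/ether 00:2a:4a:46:20:05 brd ff:ff:ff:ff:ff:ff
    inet 10.70.46.8/22 brd 10.70.47.255 scope global eth0
    inet 10.70.44.92/32 brd 10.70.47.255 scope global eth0'

# has_vip: $1 = "ip a"-style output, $2 = IP to look for.
# Prints "present" if the address is plumbed on the interface, else "absent".
has_vip() {
    printf '%s\n' "$1" | grep -q "inet $2/" && echo present || echo absent
}

has_vip "$SAMPLE" 10.70.44.92   # the virtual IP: prints "present"
has_vip "$SAMPLE" 10.70.36.45   # not plumbed here: prints "absent"
```

On the broken nodes the same check against the VIP returns "absent", which is exactly the symptom reported.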

In the present setup of 4 physical nodes, HA for nfs-ganesha was configured on three of them, and the problem of the Virtual IP going down is seen on all three. The nfs-ganesha, pcsd, pacemaker and corosync services are all running on those three nodes.


The pcs status:
Cluster name: clamper
Last updated: Fri Jul 17 11:44:19 2015
Last change: Thu Jul 16 20:08:12 2015
Stack: cman
Current DC: rhs-client21.lab.eng.blr.redhat.com - partition WITHOUT quorum
Version: 1.1.11-97629de
3 Nodes configured
12 Resources configured


Online: [ rhs-client21.lab.eng.blr.redhat.com ]
OFFLINE: [ rhs-client23.lab.eng.blr.redhat.com rhs-client36.lab.eng.blr.redhat.com ]

Full list of resources:

 Clone Set: nfs-mon-clone [nfs-mon]
     Stopped: [ rhs-client21.lab.eng.blr.redhat.com rhs-client23.lab.eng.blr.redhat.com rhs-client36.lab.eng.blr.redhat.com ]
 Clone Set: nfs-grace-clone [nfs-grace]
     Stopped: [ rhs-client21.lab.eng.blr.redhat.com rhs-client23.lab.eng.blr.redhat.com rhs-client36.lab.eng.blr.redhat.com ]
 rhs-client21.lab.eng.blr.redhat.com-cluster_ip-1	(ocf::heartbeat:IPaddr):	Stopped 
 rhs-client21.lab.eng.blr.redhat.com-trigger_ip-1	(ocf::heartbeat:Dummy):	Stopped 
 rhs-client23.lab.eng.blr.redhat.com-cluster_ip-1	(ocf::heartbeat:IPaddr):	Stopped 
 rhs-client23.lab.eng.blr.redhat.com-trigger_ip-1	(ocf::heartbeat:Dummy):	Stopped 
 rhs-client36.lab.eng.blr.redhat.com-cluster_ip-1	(ocf::heartbeat:IPaddr):	Stopped 
 rhs-client36.lab.eng.blr.redhat.com-trigger_ip-1	(ocf::heartbeat:Dummy):	Stopped 

Same status on all nodes.

Version-Release number of selected component (if applicable):
glusterfs-3.7.1-10.el6rhs.x86_64
nfs-ganesha-2.2.0-5.el6rhs.x86_64
pacemaker-1.1.12-8.el6.x86_64
corosync-1.4.7-1.el6.x86_64
pcs-0.9.139-9.el6.x86_64

How reproducible:
Seen for the first time; not yet reproduced.

Steps to Reproduce:
1. create a volume of 6x2 type, start it
2. configure nfs-ganesha, export the volume created in step 1
3. mount the root on a client with vers=4
4. enable quota on the volume
5. set a limit of 1TB on the root of the volume
6. cd into the volume in question on the mount-point and start I/O
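The steps above can be spelled out as commands. This is a sketch only: the volume name (testvol), host names (node1..node6) and mount path are illustrative placeholders, and the script merely prints the commands, since running them needs a live RHGS/nfs-ganesha cluster.

```shell
# Print the reproduction commands (step numbers match the list above).
VOL=testvol
CMDS=$(cat <<EOF
# 1. create a 6x2 distribute-replicate volume (12 bricks) and start it
gluster volume create $VOL replica 2 node{1..6}:/bricks/b1 node{1..6}:/bricks/b2
gluster volume start $VOL
# 2. configure nfs-ganesha and export the volume
gluster nfs-ganesha enable
gluster volume set $VOL ganesha.enable on
# 3. mount the root on a client with NFSv4 (use one of the virtual IPs)
mount -t nfs -o vers=4 <virtual-ip>:/ /mnt/ganesha
# 4-5. enable quota and set a 1TB limit on the volume root
gluster volume quota $VOL enable
gluster volume quota $VOL limit-usage / 1TB
# 6. cd into the volume on the mount-point and start I/O
EOF
)
printf '%s\n' "$CMDS"
```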

Actual results:
The mount-point hangs because the Virtual IPs on all nodes are down.
The nfs-ganesha service is running, but HA capability is lost.

Expected results:
The Virtual IPs should be up and running, since for HA the mounts are done via the Virtual IPs.

Additional info:

Comment 2 Saurabh 2015-07-17 06:22:14 UTC
Created attachment 1052992 [details]
pacemaker log from rhs-client21

Comment 3 Saurabh 2015-07-17 06:23:09 UTC
Created attachment 1052993 [details]
pcsd log from rhs-client21

Comment 4 Saurabh 2015-07-17 06:24:09 UTC
Created attachment 1052994 [details]
messages log from rhs-client21

Comment 6 Kaleb KEITHLEY 2015-07-17 12:58:16 UTC
virtIPs were "lost" because there's no quorum, with two of the three nodes offline
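The arithmetic behind this: a cman/corosync cluster grants quorum only to a partition holding a strict majority of the configured votes. With 3 one-vote nodes that majority is 2, so the single online node (rhs-client21) cannot hold quorum and pacemaker stops every resource, including the IPaddr virtual IPs. A minimal sketch of the calculation:

```shell
# Majority-quorum math for the 3-node cluster in the pcs status above.
nodes=3
online=1                              # only rhs-client21 is online
quorum=$(( nodes / 2 + 1 ))           # majority: floor(3/2) + 1 = 2
echo "votes needed for quorum: $quorum"
if [ "$online" -lt "$quorum" ]; then
    echo "partition WITHOUT quorum"   # matches the pcs status line above
fi
```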

What is the reason rhs-client23 and rhs-client36 are offline?

Comment 7 Kaleb KEITHLEY 2015-07-17 16:44:42 UTC
I did not see anything wrong with the config.

I tore down the HA (with ganesha-ha.sh --teardown), cleaned out /etc/cluster/cluster.conf* and /var/lib/pacemaker/cib/* on all three nodes, rebooted all three nodes, changed the HA_CLUSTER to "xclamperx" and manually brought up the HA; then the virtIPs were there and all nodes were online. _No_ other changes were made to the config. I disabled and enabled (with gluster nfs-ganesha enable) and all nodes were online and virtIPs were there.
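A hedged reconstruction of that recovery procedure as a script. The ganesha HA config file location is an assumption (commonly /etc/ganesha/ganesha-ha.conf), and DRY_RUN=1 (the default here) only prints the commands, since the real ones are destructive and need the live cluster nodes.

```shell
# Dry-run sketch of the teardown/cleanup/re-enable sequence from this comment.
DRY_RUN=${DRY_RUN:-1}
run() { if [ "$DRY_RUN" = 1 ]; then echo "+ $*"; else "$@"; fi; }

NODES="rhs-client21 rhs-client23 rhs-client36"

run ganesha-ha.sh --teardown                            # tear down the HA cluster
for n in $NODES; do
    run ssh "$n" 'rm -f /etc/cluster/cluster.conf*'     # residual cman config
    run ssh "$n" 'rm -rf /var/lib/pacemaker/cib/*'      # stale pacemaker CIB
    run ssh "$n" reboot
done
# change HA_CLUSTER in the ganesha HA config (assumed /etc/ganesha/ganesha-ha.conf),
# then bring HA back up:
run gluster nfs-ganesha enable
```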

Then I disabled it one more time, changed the HA_CLUSTER back to "clamper" and enabled, and everything was still as expected.

I did notice that there were residual /etc/cluster/cluster.conf files on rhs-client23 (and perhaps also on rhs-client36) before I manually cleaned them.

The cluster is up now and has its virtIPs. Could you pick up (almost) where you left off, rerun the quorum tests you were doing before, and see what happens?

Thanks,

Comment 10 Red Hat Bugzilla 2023-09-14 03:02:12 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days