Bug 1244097
| Summary: | corosync+pacemaker: Virtual IPs have gone down. |
|---|---|
| Product: | [Red Hat Storage] Red Hat Gluster Storage |
| Component: | nfs-ganesha |
| Version: | rhgs-3.1 |
| Hardware: | x86_64 |
| OS: | Linux |
| Status: | CLOSED WORKSFORME |
| Severity: | urgent |
| Priority: | unspecified |
| Reporter: | Saurabh <saujain> |
| Assignee: | Kaleb KEITHLEY <kkeithle> |
| QA Contact: | storage-qa-internal <storage-qa-internal> |
| CC: | jthottan, kkeithle, mzywusko, ndevos, nlevinki, rcyriac, sankarshan, skoduri, smohan |
| Keywords: | Triaged |
| Target Milestone: | --- |
| Target Release: | --- |
| Doc Type: | Bug Fix |
| Type: | Bug |
| Regression: | --- |
| Mount Type: | --- |
| Last Closed: | 2016-06-20 11:52:18 UTC |

Created attachment 1052992 [details]
pacemaker log from rhs-client21
Created attachment 1052993 [details]
pcsd log from rhs-client21
Created attachment 1052994 [details]
messages log from rhs-client21
virtIPs were "lost" because there is no quorum, with two of the three nodes offline.

What is the reason rhs-client21 and rhs-client36 are offline? I did not see anything wrong with the config.

I tore down the HA (with `ganesha-ha.sh --teardown`), cleaned out /etc/cluster/cluster.conf* and /var/lib/pacemaker/cib/* on all three nodes, rebooted all three nodes, changed the HA_CLUSTER to "xclamperx", and manually brought up the HA; then the virtIPs were there and all nodes were online. _No_ other changes were made to the config. I disabled and enabled (with `gluster nfs-ganesha enable`) and all nodes were online and the virtIPs were there. Then I disabled it one more time, changed the HA_CLUSTER back to "clamper", and enabled, and everything was still as expected. (The sequence I ran is sketched below.)

I did notice that there were residual /etc/cluster/cluster.conf files on rhs-client23 (and perhaps also on rhs-client36) before I manually cleaned them.

The cluster is up now and has its virtIPs. Could you pick up (almost) where you left off, run the quorum tests you were doing before, and see what happens? Thanks.

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days.
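For reference, a condensed sketch of that recovery sequence, under assumptions: the ganesha-ha.sh path, the /etc/ganesha config directory, and the `--setup` invocation for the manual bring-up reflect typical RHGS 3.1 layouts and may differ on your nodes.

```sh
#!/bin/bash
# Tear down the existing nfs-ganesha HA cluster (run on one node).
# Script and config-dir paths are assumptions for a typical RHGS 3.1 install.
/usr/libexec/ganesha/ganesha-ha.sh --teardown /etc/ganesha

# On ALL three cluster nodes: remove stale CMAN and pacemaker state,
# then reboot so no residual cluster configuration survives.
rm -f /etc/cluster/cluster.conf*
rm -rf /var/lib/pacemaker/cib/*
reboot

# After the reboots: change HA_CLUSTER in the HA config (assumed to live
# in /etc/ganesha/ganesha-ha.conf), then bring the HA cluster up manually.
/usr/libexec/ganesha/ganesha-ha.sh --setup /etc/ganesha

# Subsequent disable/enable cycles were driven through the gluster CLI:
gluster nfs-ganesha disable
gluster nfs-ganesha enable
```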
Created attachment 1052991 [details]
corosync log from rhs-client21

Description of problem:
To provide HA for the nfs-ganesha service we use the pacemaker/corosync based clustering mechanism. This clustering requires virtual IPs, and once the cluster is up and running the virtual IPs are displayed in the output of the "ip a" command. On the present setup, however, the virtual IPs have gone down and are no longer displayed in the "ip a" output.

Example from a working cluster:

```
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP qlen 1000
    link/ether 00:2a:4a:46:20:05 brd ff:ff:ff:ff:ff:ff
    inet 10.70.46.8/22 brd 10.70.47.255 scope global eth0
    inet 10.70.44.92/32 brd 10.70.47.255 scope global eth0
```

Here "10.70.46.8" is the physical IP and "10.70.44.92" is the virtual IP.

Present, non-working cluster:

```
4: eth2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP qlen 1000
    link/ether 00:25:90:93:62:0a brd ff:ff:ff:ff:ff:ff
    inet 10.70.36.45/23 brd 10.70.37.255 scope global eth2
```

Here we have only the physical IP, i.e. "10.70.36.45"; no virtual IP is listed.

On the present setup of 4 physical nodes, HA for nfs-ganesha was set up on three nodes, and the problem of the virtual IP going down is seen on all three. Services such as nfs-ganesha, pcsd, pacemaker, and corosync are running on the three nodes.

The pcs status:

```
Cluster name: clamper
Last updated: Fri Jul 17 11:44:19 2015
Last change: Thu Jul 16 20:08:12 2015
Stack: cman
Current DC: rhs-client21.lab.eng.blr.redhat.com - partition WITHOUT quorum
Version: 1.1.11-97629de
3 Nodes configured
12 Resources configured

Online: [ rhs-client21.lab.eng.blr.redhat.com ]
OFFLINE: [ rhs-client23.lab.eng.blr.redhat.com rhs-client36.lab.eng.blr.redhat.com ]

Full list of resources:

 Clone Set: nfs-mon-clone [nfs-mon]
     Stopped: [ rhs-client21.lab.eng.blr.redhat.com rhs-client23.lab.eng.blr.redhat.com rhs-client36.lab.eng.blr.redhat.com ]
 Clone Set: nfs-grace-clone [nfs-grace]
     Stopped: [ rhs-client21.lab.eng.blr.redhat.com rhs-client23.lab.eng.blr.redhat.com rhs-client36.lab.eng.blr.redhat.com ]
 rhs-client21.lab.eng.blr.redhat.com-cluster_ip-1 (ocf::heartbeat:IPaddr): Stopped
 rhs-client21.lab.eng.blr.redhat.com-trigger_ip-1 (ocf::heartbeat:Dummy): Stopped
 rhs-client23.lab.eng.blr.redhat.com-cluster_ip-1 (ocf::heartbeat:IPaddr): Stopped
 rhs-client23.lab.eng.blr.redhat.com-trigger_ip-1 (ocf::heartbeat:Dummy): Stopped
 rhs-client36.lab.eng.blr.redhat.com-cluster_ip-1 (ocf::heartbeat:IPaddr): Stopped
 rhs-client36.lab.eng.blr.redhat.com-trigger_ip-1 (ocf::heartbeat:Dummy): Stopped
```

The status is the same on all nodes.

Version-Release number of selected component (if applicable):

```
glusterfs-3.7.1-10.el6rhs.x86_64
nfs-ganesha-2.2.0-5.el6rhs.x86_64
pacemaker-1.1.12-8.el6.x86_64
corosync-1.4.7-1.el6.x86_64
pcs-0.9.139-9.el6.x86_64
```

How reproducible:
Seen for the first time.

Steps to Reproduce (condensed into a sketch at the end of this report):
1. Create a volume of 6x2 type and start it.
2. Configure nfs-ganesha and export the volume created in step 1.
3. Mount the root on a client with vers=4.
4. Enable quota on the volume.
5. Set a limit of 1TB on the root of the volume.
6. Start I/O on the mount point after cd'ing into the volume in question.

Actual results:
The mount point hangs because the virtual IP on all nodes is down. The nfs-ganesha service is running, but the HA capability is lost.

Expected results:
The virtual IPs need to be up and running, since for HA the mounts are done against the virtual IPs.

Additional info:
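For reference, the steps to reproduce condense to roughly the following sketch; the volume name, host names, brick paths, virtual IP, and mount point are all hypothetical:

```sh
#!/bin/bash
# 1. Create a 6x2 distribute-replicate volume (12 bricks, 6 replica pairs)
#    and start it. Host names and brick paths are hypothetical.
gluster volume create testvol replica 2 \
    rhs-serv{1..6}.example.com:/bricks/b1/testvol \
    rhs-serv{1..6}.example.com:/bricks/b2/testvol
gluster volume start testvol

# 2. Configure nfs-ganesha and export the volume.
gluster nfs-ganesha enable
gluster volume set testvol ganesha.enable on

# 3. On a client, mount the NFSv4 pseudo-root via one of the virtual IPs.
mount -t nfs -o vers=4 10.70.44.92:/ /mnt/ganesha

# 4-5. Enable quota and set a 1 TB limit on the volume root.
gluster volume quota testvol enable
gluster volume quota testvol limit-usage / 1TB

# 6. Start I/O from inside the mounted volume.
cd /mnt/ganesha/testvol && dd if=/dev/zero of=testfile bs=1M count=1024
```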