Bug 1636088

Summary: ocf:glusterfs:volume resource agent for pacemaker fails to stop gluster volume processes like glusterfsd
Product: [Community] GlusterFS Reporter: erik.dobak
Component: unclassified    Assignee: bugs <bugs>
Status: CLOSED UPSTREAM QA Contact:
Severity: medium Docs Contact:
Priority: medium    
Version: mainline    CC: bugs, ndevos, pasik
Target Milestone: ---   
Target Release: ---   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2020-03-12 12:42:28 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description erik.dobak 2018-10-04 12:45:40 UTC
Description of problem:

I am using pacemaker to run glusterfs. After setting it up I tested it with 'crm node standby node01', but got a timeout from the volume agent:
crmd:    error: process_lrm_event:	Result of stop operation for p_volume_gluster on node02: Timed Out | call=559 key=p_volume_gluster_stop_0 timeout=20000ms

When checking the processes with ps -ef I could still see gluster processes running on the node.
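
As a hedged illustration (not output from the reporter's system), a check along these lines is what shows the leftover brick daemons after the failed stop; gv0 is the volume name from the configuration below:

ps -ef | grep '[g]lusterfsd' | grep gv0
# any output here means brick processes for gv0 survived the resource stop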

Version-Release number of selected component (if applicable):

Name        : glusterfs-resource-agents
Arch        : noarch
Version     : 3.12.14
Release     : 1.el6
Size        : 13 k
Repo        : installed
From repo   : centos-gluster312


How reproducible:
configure gluster in pacemaker (2 nodes):

primitive glusterd ocf:glusterfs:glusterd \
	op monitor interval=10 timeout=120s \
	op start timeout=120s interval=0 \
	op stop timeout=120s interval=0

primitive p_volume_gluster ocf:glusterfs:volume \
	params volname=gv0 \
	op stop interval=0 trace_ra=1 \
	op monitor interval=0 timeout=120s \
	op start timeout=120s interval=0

clone cl_glusterd glusterd \
	meta interleave=true clone-max=2 clone-node-max=1 target-role=Started

clone cl_glustervol p_volume_gluster \
	meta interleave=true clone-max=2 clone-node-max=1
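
After loading the configuration, a quick sanity check like the following (a sketch, assuming the crmsh shell and crm_mon are available) confirms that both clones are running before triggering the standby:

crm configure show | grep -E 'glusterd|p_volume_gluster'   # the primitives/clones above
crm_mon -1 | grep -E 'cl_glusterd|cl_glustervol'           # both clone sets should be Started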

Run gluster in the cluster, then put a node on standby.

Steps to Reproduce:
1. start gluster in pacemaker
2. put a node on standby: crm node standby node01
3. wait for the error messages
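
A minimal way to observe the failure (a sketch; the log path is an assumption for a CentOS 6 corosync/pacemaker setup and may differ):

crm_mon -1 -f                                                 # shows the failed stop of p_volume_gluster
grep 'p_volume_gluster_stop' /var/log/cluster/corosync.log   # the Timed Out entry quoted above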

Actual results:
Getting a timeout error for the volume primitive; the brick processes are still running: /usr/sbin/glusterfsd

Expected results:
Gluster should shut down and no errors should appear in corosync.log.

Additional info:
I debugged the volume resource agent (/usr/lib/ocf/resource.d/glusterfs/volume) and found two issues that prevented the agent from stopping the processes.

1. SHORTHOSTNAME=`hostname -s`
On my system only the full hostname was used. I had to change this line to:
SHORTHOSTNAME=`hostname`
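
A hedged illustration of the mismatch (the FQDN is hypothetical, not from the reporter's cluster): when the peer and brick entries are registered under the fully qualified name, the short name never matches them, so the agent finds no local bricks to stop:

hostname          # node01.example.com  (name the bricks are registered under)
hostname -s       # node01              (what the unpatched agent uses)
gluster volume info gv0 | grep "Brick.*$(hostname -s):" || echo "no local bricks matched"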

2. The function volume_getdir() had the wrong path hardcoded:

volume_getdir() {
    local voldir
    voldir="/etc/glusterd/vols/${OCF_RESKEY_volname}"
    [ -d ${voldir} ] || return 1

    echo "${voldir}"
    return 0
}

I had to change /etc/glusterd to /var/lib/glusterd:

volume_getdir() {
    local voldir
    voldir="/var/lib/glusterd/vols/${OCF_RESKEY_volname}"
    [ -d ${voldir} ] || return 1

    echo "${voldir}"
    return 0
}
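
Before patching, a check like this (a sketch; the two candidate paths are simply the ones discussed above, and gv0 is the volume from this report) confirms which layout a given installation actually uses:

for d in /etc/glusterd /var/lib/glusterd; do
    [ -d "${d}/vols/gv0" ] && echo "volume state lives under ${d}/vols/gv0"
done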

I am not sure if this is because I am running CentOS 6; maybe the paths and hostnames differ on CentOS 7.

Comment 1 Shyamsundar 2018-10-23 14:53:58 UTC
Release 3.12 has been EOLed and this bug was still found to be in the NEW state, hence moving the version to mainline so it can be triaged and appropriate action taken.

Comment 2 Amar Tumballi 2019-07-15 05:22:50 UTC
Hi Erik,

We as a community have supported server packages only on CentOS 7 for almost a year now. It would be great to see if upgrading the OS helps resolve the issue. It would also be great if you could update glusterfs to a newer version.

Comment 3 erik.dobak 2019-07-15 05:40:10 UTC
Hi Amar,

I do not have RHEL or CentOS 7 available. It would be great if you could attach the current /usr/lib/ocf/resource.d/glusterfs/volume here so I can check whether the code has changed.

Comment 4 Amar Tumballi 2019-07-15 05:47:57 UTC
Looks like it got fixed with https://review.gluster.org/#/c/glusterfs/+/19799/

Check latest code @ https://github.com/gluster/glusterfs/tree/master/extras/ocf

Comment 5 erik.dobak 2019-07-15 06:01:13 UTC
volume_getpid_dir() did change, but SHORTHOSTNAME is still the same. One would have to test it on CentOS 7; maybe it works fine now.
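
A possible local workaround (a sketch, not the upstream fix; it assumes the bricks are registered under either the short name or the FQDN) would be to fall back to the full hostname when the short name matches no brick of the managed volume:

SHORTHOSTNAME=$(hostname -s)
gluster volume info "${OCF_RESKEY_volname}" | grep -q "Brick.*${SHORTHOSTNAME}:" \
    || SHORTHOSTNAME=$(hostname)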

Comment 6 Kaleb KEITHLEY 2019-11-22 17:44:36 UTC
common-ha is for the .../extras/ganesha/... stuff. It was named that way when we thought it would be rewritten using ctdb for Samba and Ganesha.

Comment 7 Worker Ant 2020-03-12 12:42:28 UTC
This bug has been moved to https://github.com/gluster/glusterfs/issues/930 and will be tracked there from now on. Visit the GitHub issue URL for further details.