Bug 1568593

Summary: ocf:pacemaker:controld: configdir attribute doesn't allow specification of configfs mount point
Product: Red Hat Enterprise Linux 7 Reporter: Reid Wahl <nwahl>
Component: pacemaker Assignee: Ken Gaillot <kgaillot>
Status: CLOSED ERRATA QA Contact: cluster-qe <cluster-qe>
Severity: low Docs Contact:
Priority: medium    
Version: 7.5 CC: abeekhof, cluster-maint, mnovacek, phagara, sbradley
Target Milestone: rc   
Target Release: 7.6   
Hardware: Unspecified   
OS: Linux   
Whiteboard:
Fixed In Version: pacemaker-1.1.18-12.el7 Doc Type: No Doc Update
Doc Text:
Likely zero users are interested in this capability.
Last Closed: 2018-10-30 07:57:56 UTC Type: Bug

Description Reid Wahl 2018-04-17 21:02:53 UTC
Description of problem: ocf:pacemaker:controld: configdir attribute doesn't allow specification of configfs mount point.

/sys/kernel/config is the default mount point, and the configdir parameter is documented as the way to override it.

https://github.com/ClusterLabs/pacemaker/blob/master/extra/resources/controld#L67:
 67 <parameter name="configdir" unique="1">
 68 <longdesc lang="en">
 69 The location where configfs is or should be mounted
 70 </longdesc>
 71 <shortdesc lang="en">Location of configfs</shortdesc>
 72 <content type="string" default="/sys/kernel/config" />
 73 </parameter>


However, the /sys/kernel/config location is hard-coded in at least three places:
  - https://github.com/ClusterLabs/pacemaker/blob/master/extra/resources/controld#L183
  - https://github.com/wferi/dlm/blob/master/dlm_controld/action.c (throughout)
  - https://github.com/wferi/dlm/blob/master/init/dlm.init#L33

configdir was parameterized 10 years ago in the following:
  - https://github.com/ClusterLabs/pacemaker/commit/d7b606c077e38b32936306a98220f4d0575c4ed4

We can trace the first hard-coding of /sys/kernel/config in the resource agent back to the following commit from 5 years ago. (The location and exact invocation have since changed.)
  - https://github.com/ClusterLabs/pacemaker/commit/f30a4d4eb08181f556271a796ef8a63f3c003690


It is not clear that there is a valid use case for the configdir attribute. If there is not, then rather than fixing the hard-coding issues in both the resource agent and dlm, perhaps we can deprecate the attribute.

If we opt to make the parameter usable instead of deprecating and removing it, then we will need to:
  - parameterize the location of the addr_list in the resource agent (easy)
  - pass the parameter into dlm_controld, maybe as "CLUSTER_DIR"
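
As a rough illustration of the first item, the resource agent could derive its DLM configfs paths from the configdir parameter instead of a literal /sys/kernel/config. This is only a sketch under stated assumptions: the function names are hypothetical, the cluster/comms/*/addr_list layout is assumed rather than taken from the linked agent code, and nothing here addresses the dlm_controld side.

```shell
#!/bin/sh
# Sketch only: derive DLM configfs paths from the configdir parameter.
# Function names are hypothetical; they are not part of the agent.

# Print the configfs mount point, falling back to the documented default
# when OCF_RESKEY_configdir is unset or empty.
controld_configdir() {
    printf '%s\n' "${OCF_RESKEY_configdir:-/sys/kernel/config}"
}

# Print the DLM directory below the configured mount point.
controld_dlm_dir() {
    printf '%s/dlm\n' "$(controld_configdir)"
}

# Print the glob the agent could use to read addr_list entries.
# The cluster/comms/*/addr_list layout is an assumption.
controld_addr_list_glob() {
    printf '%s/cluster/comms/*/addr_list\n' "$(controld_dlm_dir)"
}
```

With OCF_RESKEY_configdir=/testmnt, controld_dlm_dir would print /testmnt/dlm, which is the directory the agent already checks for before starting.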


--------------------

Version-Release number of selected component (if applicable):
pacemaker-1.1.18-11.el7.x86_64
upstream master


--------------------

How reproducible: always


--------------------

Steps to Reproduce:

I tried a few different approaches to make the controld resource and the dlm_controld process use the specified mount point for the configdir. None worked.

With /sys/kernel/config still mounted:
    # # Show current mount
    # mount | grep configfs
    configfs on /sys/kernel/config type configfs (rw,relatime)

    # # Create custom mount point on all nodes
    # mkdir /testmnt

    # # Use custom configdir
    # pcs resource update dlm configdir=/testmnt

    # tail /var/log/messages
    ...
    Apr 17 22:22:39 fastvm-rhel-7-4-22 controld(dlm)[16687]: ERROR: /testmnt/dlm not available
    Apr 17 22:22:39 fastvm-rhel-7-4-22 crmd[1154]:  notice: Result of start operation for dlm on fastvm-rhel-7-4-22: 5 (not installed)
    Apr 17 22:22:39 fastvm-rhel-7-4-22 crmd[1154]:  notice: Result of stop operation for dlm on fastvm-rhel-7-4-22: 0 (ok)

    # # Add dlm subdirectory on all nodes and try again
    # mkdir /testmnt/dlm
    # pcs resource cleanup

    # # controld resource is now Started, but DLM is still using /sys/kernel/config
    # pcs resource show | grep -A1 dlm
     Clone Set: dlm-clone [dlm]
         Started: [ fastvm-rhel-7-4-22 fastvm-rhel-7-4-23 ]
    # mount | grep configfs
    configfs on /sys/kernel/config type configfs (rw,relatime)
    # ls /sys/kernel/config/dlm
    cluster
    # ls /testmnt/dlm
    #

    # # Prevent anything from using /sys/kernel/config
    # # Disable controld/clvm resources, unmount /sys/kernel/config on both nodes, update configdir, enable resources 
    # pcs resource disable clvmd
    # pcs resource disable dlm
    # umount /sys/kernel/config
    # pcs resource update dlm configdir=/testmnt
    # pcs resource enable dlm
    # pcs resource enable clvmd

    # # ocf:pacemaker:controld has mounted configfs on /testmnt (the configdir)
    # mount | grep config
    none on /testmnt type configfs (rw,relatime)

    # # However, DLM still depends on /sys/kernel/config being present
    # tail /var/log/messages
    ...
    Apr 17 22:39:10 fastvm-rhel-7-4-22 dlm_controld[3164]: 275 dlm_controld 4.0.7 started
    Apr 17 22:39:10 fastvm-rhel-7-4-22 dlm_controld[3164]: 275 No /sys/kernel/config/dlm, is the dlm loaded?
    Apr 17 22:39:11 fastvm-rhel-7-4-22 crmd[1142]:  notice: Result of start operation for dlm on fastvm-rhel-7-4-22: 7 (not running)
    Apr 17 22:39:11 fastvm-rhel-7-4-22 crmd[1142]:  notice: Result of stop operation for dlm on fastvm-rhel-7-4-22: 0 (ok)

    # # Reboot both nodes. configfs is once again mounted on /sys/kernel/config (by a kmod?), but now there is no dlm subdirectory and the controld resource fails to start.
    # ls /sys/kernel/config/
    # ls /testmnt/dlm
    # tail /var/log/messages
    ...
    Apr 17 22:43:22 fastvm-centos-7-4-22 dlm_controld[1388]: 40 dlm_controld 4.0.7 started
    Apr 17 22:43:22 fastvm-centos-7-4-22 dlm_controld[1388]: 40 Is dlm missing from kernel? No misc devices found.
    Apr 17 22:43:22 fastvm-centos-7-4-22 dlm_controld[1388]: 40 No /sys/kernel/config/dlm, is the dlm loaded?
    Apr 17 22:43:23 fastvm-centos-7-4-22 crmd[1148]:  notice: Result of start operation for dlm on fastvm-centos-7-4-22: 7 (not running)


--------------------

Actual results:

See "Steps to Reproduce."


--------------------

Expected results:

  - configfs mounted on ${OCF_RESKEY_configdir}
  - ocf:pacemaker:controld resource in Started state
  - dlm_controld process running
  - data in ${OCF_RESKEY_configdir}/dlm/cluster
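
The expected results above could be checked with a few small helpers. This is a minimal sketch, not a supported tool: the function names and the optional mounts-file argument are hypothetical, and the process check assumes the daemon is named dlm_controld as in the logs above.

```shell
#!/bin/sh
# Sketch only: check the expected results listed above.
# Function names and the optional mounts-file argument are hypothetical.

# Succeed if configfs is mounted on directory $1.
# $2 is a file in /proc/mounts format (defaults to /proc/mounts).
configfs_mounted_at() {
    awk -v dir="$1" '
        $2 == dir && $3 == "configfs" { found = 1 }
        END { exit !found }
    ' "${2:-/proc/mounts}"
}

# Succeed if the DLM cluster directory exists below configdir $1.
dlm_cluster_dir_present() {
    [ -d "$1/dlm/cluster" ]
}

# Succeed if a dlm_controld process is running.
dlm_controld_running() {
    pgrep -x dlm_controld >/dev/null 2>&1
}
```

For example, configfs_mounted_at /testmnt followed by dlm_cluster_dir_present /testmnt would cover the first and last items for the custom mount point used in the reproduction steps.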


--------------------

Additional info:

N/A

Comment 4 Reid Wahl 2018-04-17 21:07:49 UTC
Disregard CentOS messages. Copy-paste mistake from another trial.

Comment 7 Ken Gaillot 2018-04-18 14:55:33 UTC
I agree that it is not worth the effort to fully implement this, as demand is nonexistent.

My plan for RHEL 7.6 is to have ocf:pacemaker:controld ignore the parameter, always using /sys/kernel/config, and to indicate in the meta-data that it is deprecated and ignored. Then we can remove it in a later RHEL version.

QA: The test is trivial. Check the controld metadata to see whether it says the configdir parameter is deprecated, then try configuring a controld resource with a nonstandard configdir location (without changing the actual mount point). Before the fix, the resource should fail; after the fix, it should work (using the standard mount point).
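
The metadata part of that check could be scripted roughly as follows. This is a sketch only: the function name is hypothetical, and matching on the word "deprecated" in the configdir description is an assumption about how the agent would word it.

```shell
#!/bin/sh
# Sketch only: QA helper for the metadata check described above.
# The function name is hypothetical.

# Read agent description text on stdin (for example the output of
# "pcs resource describe ocf:pacemaker:controld") and succeed if the
# configdir option is described as deprecated.
configdir_marked_deprecated() {
    grep -Eiq '^[[:space:]]*configdir:.*deprecated'
}
```

A possible invocation is pcs resource describe ocf:pacemaker:controld | configdir_marked_deprecated, which should fail before the fix and succeed after it.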

Comment 9 Ken Gaillot 2018-04-21 03:46:48 UTC
fixed upstream by commit b65defeb

Comment 11 Patrik Hagara 2018-07-16 16:37:40 UTC
before:
=======

> [root@virt-257 ~]# rpm -q pacemaker                       
> pacemaker-1.1.18-11.el7.x86_64
> [root@virt-257 ~]# pcs resource describe ocf:pacemaker:controld        
> ocf:pacemaker:controld - DLM Agent for cluster file systems
> 
> This Resource Agent can control the dlm_controld services needed by cluster-aware file systems.
> It assumes that dlm_controld is in your default PATH.
> In most cases, it should be run as an anonymous clone.
> 
> Resource options:
>   args: Any additional options to start the dlm_controld service with
>   configdir: The location where configfs is or should be mounted
>   daemon: The daemon to start - supports gfs_controld(.pcmk) and dlm_controld(.pcmk)
>   allow_stonith_disabled: Allow DLM start-up even if STONITH/fencing is disabled in the cluster. Setting this option to true will cause cluster malfunction and hangs on fail-over for DLM clients that
>                           require fencing (such as GFS2, OCFS2, and cLVM2). This option is advanced use only.
> 
> Default operations:
>   start: interval=0s timeout=90
>   stop: interval=0s timeout=100
>   monitor: interval=10 start-delay=0 timeout=20
> [root@virt-257 ~]# pcs resource create controld ocf:pacemaker:controld allow_stonith_disabled=true configdir=/non/existent/path
> [root@virt-257 ~]# pcs status
>   <snip>
> Full list of resources:
> 
>  controld	(ocf::pacemaker:controld):	Stopped
> 
> Failed Actions:
> * controld_start_0 on virt-257.cluster-qe.lab.eng.brq.redhat.com 'not installed' (5): call=75, status=complete, exitreason='',
>     last-rc-change='Mon Jul 16 18:05:39 2018', queued=0ms, exec=102ms
> * controld_start_0 on virt-261.cluster-qe.lab.eng.brq.redhat.com 'not installed' (5): call=41, status=complete, exitreason='',
>     last-rc-change='Mon Jul 16 18:05:40 2018', queued=0ms, exec=100ms
>   <snip>
> [root@virt-257 ~]# pcs resource update controld configdir=/tmp
> [root@virt-257 ~]# pcs status
>   <snip>
> Full list of resources:
> 
>  controld	(ocf::pacemaker:controld):	Stopped
> 
> Failed Actions:
> * controld_start_0 on virt-257.cluster-qe.lab.eng.brq.redhat.com 'not installed' (5): call=112, status=complete, exitreason='',
>     last-rc-change='Mon Jul 16 18:19:43 2018', queued=1ms, exec=103ms
> * controld_start_0 on virt-261.cluster-qe.lab.eng.brq.redhat.com 'not installed' (5): call=66, status=complete, exitreason='',
>     last-rc-change='Mon Jul 16 18:19:43 2018', queued=0ms, exec=101ms
>   <snip>
> [root@virt-257 ~]# mkdir /tmp/dlm
> [root@virt-257 ~]# pcs resource cleanup
> [root@virt-257 ~]# pcs resource
>  controld	(ocf::pacemaker:controld):	Started virt-257.cluster-qe.lab.eng.brq.redhat.com
> [root@virt-257 ~]# umount /sys/kernel/config
> [root@virt-257 ~]# pcs resource restart controld
> Error: Error performing operation: Timer expired
> 
> Set 'controld' option: id=controld-meta_attributes-target-role set=controld-meta_attributes name=target-role=stopped
> Waiting for 1 resources to stop:
>  * controld
> Deleted 'controld' option: id=controld-meta_attributes-target-role name=target-role
> Waiting for 1 resources to start again:
>  * controld
> Could not complete restart of controld, 1 resources remaining
>  * controld
> [root@virt-257 ~]# pcs status
>   <snip>
> Full list of resources:
> 
>  controld	(ocf::pacemaker:controld):	Stopped
> 
> Failed Actions:
> * controld_start_0 on virt-257.cluster-qe.lab.eng.brq.redhat.com 'not running' (7): call=123, status=complete, exitreason='',
>     last-rc-change='Mon Jul 16 18:25:26 2018', queued=2ms, exec=1144ms
> * controld_start_0 on virt-261.cluster-qe.lab.eng.brq.redhat.com 'not installed' (5): call=73, status=complete, exitreason='',
>     last-rc-change='Mon Jul 16 18:25:27 2018', queued=0ms, exec=121ms
>   <snip>
> [root@virt-257 ~]# mkdir /custom/configfs
> [root@virt-257 ~]# mount -t configfs configfs /custom/configfs
> [root@virt-257 ~]# pcs resource update controld configdir=/custom/configfs
> [root@virt-257 ~]# pcs status
>   <snip>
> Full list of resources:
> 
>  controld	(ocf::pacemaker:controld):	Stopped
> 
> Failed Actions:
> * controld_start_0 on virt-257.cluster-qe.lab.eng.brq.redhat.com 'not running' (7): call=125, status=complete, exitreason='',
>     last-rc-change='Mon Jul 16 18:30:18 2018', queued=1ms, exec=1190ms
> * controld_start_0 on virt-261.cluster-qe.lab.eng.brq.redhat.com 'not installed' (5): call=75, status=complete, exitreason='',
>     last-rc-change='Mon Jul 16 18:30:19 2018', queued=0ms, exec=98ms
>   <snip>
> [root@virt-257 ~]# mount -t configfs configfs /sys/kernel/config
> [root@virt-257 ~]# pcs resource update controld configdir=
> [root@virt-257 ~]# pcs resource
>  controld	(ocf::pacemaker:controld):	Started virt-257.cluster-qe.lab.eng.brq.redhat.com

The controld agent returns a "not installed" error when the cluster attempts to start the resource with a nonexistent config directory. With an invalid but existing config directory (/tmp, where no "dlm" subdirectory exists), the same "not installed" error is returned. Creating a fake "dlm" directory in /tmp allows the resource agent's initial checks to pass, and the resource does start successfully -- but using the wrong configfs path (/sys/kernel/config instead of /tmp). This claim is easily verified by unmounting /sys/kernel/config and forcing a resource restart: the resource then fails to start with "not running". The same error ("not running") is returned when we create a real configfs mount in a non-standard location. Everything returns to normal once we mount configfs in its standard location and remove the configdir resource agent option.
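
The two failure modes seen in the transcript can be told apart by the start return code. The helper below is a sketch for reading such results; the function name and message wording are hypothetical, and the interpretation of each code is inferred from the logs in this report rather than from the agent source.

```shell
#!/bin/sh
# Sketch only: map the start results seen in this report to the stage
# that appears to have failed. The function name and message wording are
# hypothetical; only codes 0, 5 and 7 appear in the output above.
explain_controld_start_rc() {
    case "$1" in
        0) echo "ok: resource started" ;;
        5) echo "not installed: agent check failed, <configdir>/dlm not available" ;;
        7) echo "not running: dlm_controld was launched but is not running" ;;
        *) echo "unhandled return code: $1" ;;
    esac
}
```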



after:
======

> [root@virt-257 ~]# rpm -q pacemaker
> pacemaker-1.1.19-3.el7.x86_64
> [root@virt-257 ~]# pcs resource describe ocf:pacemaker:controld
> ocf:pacemaker:controld - DLM Agent for cluster file systems
> 
> This Resource Agent can control the dlm_controld services needed by cluster-aware file systems.
> It assumes that dlm_controld is in your default PATH.
> In most cases, it should be run as an anonymous clone.
> 
> Resource options:
>   args: Any additional options to start the dlm_controld service with
>   configdir: This parameter is deprecated and ignored
>   daemon: The daemon to start - supports gfs_controld(.pcmk) and dlm_controld(.pcmk)
>   allow_stonith_disabled: Allow DLM start-up even if STONITH/fencing is disabled in the cluster. Setting this option to true will cause cluster malfunction and hangs on fail-over for DLM clients that
>                           require fencing (such as GFS2, OCFS2, and cLVM2). This option is advanced use only.
> 
> Default operations:
>   start: interval=0s timeout=90
>   stop: interval=0s timeout=100
>   monitor: interval=10 start-delay=0 timeout=20
> [root@virt-257 ~]# pcs resource create controld ocf:pacemaker:controld allow_stonith_disabled=true configdir=/non/existent/path
> [root@virt-257 ~]# pcs resource
>  controld	(ocf::pacemaker:controld):	Started virt-257.cluster-qe.lab.eng.brq.redhat.com

No warning is shown when creating the resource, and no deprecation message appears in the cluster logs -- the deprecated configdir option is silently ignored. The controld resource starts successfully using the default configfs path.

Marking verified in 1.1.19-3.el7.

Comment 13 errata-xmlrpc 2018-10-30 07:57:56 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:3055