Bug 1699046

Summary: TLS Everywhere: ceph-nfs ganesha service fails to start after tripleo deployment
Product: Red Hat OpenStack
Reporter: Sadique Puthen <sputhenp>
Component: openstack-tripleo-heat-templates
Assignee: Goutham Pacha Ravi <gouthamr>
Status: CLOSED ERRATA
QA Contact: Jason Grosso <jgrosso>
Severity: high
Priority: high
Version: 13.0 (Queens)
CC: gouthamr, jappleii, mburns, michele, pgrist, tbarron
Target Milestone: z7
Keywords: Triaged, ZStream
Target Release: 13.0 (Queens)
Hardware: Unspecified
OS: Unspecified
Fixed In Version: openstack-tripleo-heat-templates-8.3.1-39.el7ost
Doc Type: Bug Fix
Doc Text:
Cause: The NFS gateway (Ganesha) endpoint for the CephFS back end of the Shared File Systems service (Manila) was misconfigured when deploying with TLS Everywhere.
Consequence: A TLS Everywhere deployment with Manila and the CephFS via NFS back end fails.
Fix: The environment file for the NFS gateway endpoint now uses the virtual IP address rather than the DNS name.
Result: A TLS Everywhere deployment with Manila and the CephFS via NFS back end succeeds.
Last Closed: 2019-07-10 13:05:11 UTC
Type: Bug

Description Sadique Puthen 2019-04-11 15:52:33 UTC
Description of problem:

The ceph-nfs Pacemaker service fails to start after a TripleO deployment. After the deployment, pcs status reports:

# pcs status
 ceph-nfs	(systemd:ceph-nfs@pacemaker):	Started controller-1

Failed Actions:
* ceph-nfs_monitor_60000 on controller-1 'not running' (7): call=359, status=complete, exitreason='',
    last-rc-change='Thu Apr 11 02:59:54 2019', queued=0ms, exec=0ms

# pcs resource show ceph-nfs
 Resource: ceph-nfs (class=systemd type=ceph-nfs@pacemaker)
  Operations: monitor interval=60 timeout=100 (ceph-nfs-monitor-interval-60)
              start interval=0s timeout=200s (ceph-nfs-start-interval-0s)
              stop interval=0s timeout=200s (ceph-nfs-stop-interval-0s)

# systemctl status ceph-nfs@pacemaker

  Process: 672006 ExecStart=/usr/bin/docker run --rm --net=host -v /var/lib/ceph:/var/lib/ceph:z -v /etc/ceph:/etc/ceph:z -v /var/lib/nfs/ganesha:/var/lib/nfs/ganesha:z -v /etc/ganesha:/etc/ganesha:z -v /var/run/ceph:/var/run/ceph:z --privileged -v /var/run/dbus/system_bus_socket:/var/run/dbus/system_bus_socket -v /etc/localtime:/etc/localtime:ro -e CLUSTER=ceph -e CEPH_DAEMON=NFS --name=ceph-nfs-pacemaker 172.16.0.1:8787/rhceph/rhceph-3-rhel7:3-23 (code=exited, status=255)

Apr 11 03:02:10 controller-1.redhat.local docker[672710]: Error response from daemon: No such container: ceph-nfs-pacemaker

If I manually run the docker command shown by systemctl, I get the following error:

2019-04-11 03:05:36  /entrypoint.sh: static: does not generate config
2019-04-11 03:05:37  /entrypoint.sh: SUCCESS
exec: PID 149: spawning /usr/bin/ganesha.nfsd  -F -L STDOUT 
exec: Waiting 149 to quit
11/04/2019 03:05:37 : epoch 5caeaf01 : controller-1.redhat.local : ganesha.nfsd-149[main] main :MAIN :EVENT :ganesha.nfsd Starting: Ganesha Version 2.7.1
11/04/2019 03:05:37 : epoch 5caeaf01 : controller-1.redhat.local : ganesha.nfsd-149[main] nfs_set_param_from_conf :NFS STARTUP :CRIT :Error while parsing core configuration
11/04/2019 03:05:37 : epoch 5caeaf01 : controller-1.redhat.local : ganesha.nfsd-149[main] main :NFS STARTUP :CRIT :Error setting parameters from configuration file.
11/04/2019 03:05:37 : epoch 5caeaf01 : controller-1.redhat.local : ganesha.nfsd-149[main] config_errs_to_log :CONFIG :CRIT :Config File (/etc/ganesha/ganesha.conf:6): Expected an IP address, got a option name or number
11/04/2019 03:05:37 : epoch 5caeaf01 : controller-1.redhat.local : ganesha.nfsd-149[main] config_errs_to_log :CONFIG :CRIT :Config File (/etc/ganesha/ganesha.conf:39): 1 (invalid param value) errors found block NFS_Core_Param
11/04/2019 03:05:37 : epoch 5caeaf01 : controller-1.redhat.local : ganesha.nfsd-149[main] main :NFS STARTUP :FATAL :Fatal errors.  Server exiting...
teardown: managing teardown after SIGCHLD
teardown: Waiting PID 149 to terminate 
teardown: Process 149 is terminated
teardown: Bye Bye, container will die with return code -1
teardown: if you don't want me to die and have access to a shell to debug this situation, next time run me with '-e DEBUG=stayalive'

From /etc/ganesha/ganesha.conf

NFS_Core_Param
{
       Bind_Addr=overcloud.storagenfs.localdomain;
}

# grep storagenfs /etc/hosts
172.16.202.101  overcloud.storagenfs.localdomain

Should this be an IP address?

pcs status:
 ip-172.16.202.101	(ocf::heartbeat:IPaddr2):	Started controller-1

I changed Bind_Addr to the IP address, and Pacemaker can now start the service without errors.

So the FQDN is being used as Bind_Addr in ganesha.conf instead of the IP address. This needs to be fixed to use the IP address even when TLS Everywhere is used.
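
For anyone who wants to check a node for this condition, here is a minimal sketch in Python. It is not part of TripleO or Ganesha; the helper names find_bind_addr and is_ip_literal are hypothetical, and the regular expression only handles the simple one-line Bind_Addr form shown in this report.

```python
import ipaddress
import re

# Hypothetical helper pattern: matches a single-line "Bind_Addr=<value>;" entry.
BIND_ADDR_RE = re.compile(r"^\s*Bind_Addr\s*=\s*([^;\s]+)\s*;", re.MULTILINE)


def find_bind_addr(conf_text):
    """Return the Bind_Addr value from ganesha.conf text, or None if absent."""
    match = BIND_ADDR_RE.search(conf_text)
    return match.group(1) if match else None


def is_ip_literal(value):
    """True if value parses as an IPv4 or IPv6 address literal."""
    try:
        ipaddress.ip_address(value)
    except ValueError:
        return False
    return True


# The snippet from this report, and the same snippet after the manual change.
BROKEN = """NFS_Core_Param
{
       Bind_Addr=overcloud.storagenfs.localdomain;
}
"""
FIXED = BROKEN.replace("overcloud.storagenfs.localdomain", "172.16.202.101")

for label, text in (("broken", BROKEN), ("fixed", FIXED)):
    addr = find_bind_addr(text)
    print(label, addr, is_ip_literal(addr))
```

On a real node the text would come from /etc/ganesha/ganesha.conf rather than from the inline strings used here.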

Templates at https://gitlab.cee.redhat.com/sputhenp/openstack/tree/master/basic/templates

Deploy command at: https://gitlab.cee.redhat.com/sputhenp/openstack/blob/master/basic/templates/overcloud-deploy-tls.sh


Comment 1 Goutham Pacha Ravi 2019-04-11 17:44:26 UTC
Thanks. Confirming this is a bug: Bind_Addr must be a valid IPv4 or IPv6 address [1], not a hostname or FQDN as configured here.

[1] https://github.com/nfs-ganesha/nfs-ganesha/blob/af26bf4/src/config_samples/config.txt#L43
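
As an illustration of the manual workaround only (the shipped fix changes the TripleO environment file, not this lookup), the address to put in Bind_Addr can be read from the hosts entry quoted in the description. A rough sketch, assuming /etc/hosts-style input; parse_hosts is a hypothetical helper name:

```python
def parse_hosts(hosts_text):
    """Map each hostname in /etc/hosts-style text to its first listed address."""
    table = {}
    for line in hosts_text.splitlines():
        # Drop trailing comments and surrounding whitespace.
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        address, *names = line.split()
        for name in names:
            # Keep the first address seen for a name.
            table.setdefault(name, address)
    return table


# The hosts entry quoted in the description of this bug.
HOSTS = "172.16.202.101  overcloud.storagenfs.localdomain\n"

table = parse_hosts(HOSTS)
print(table["overcloud.storagenfs.localdomain"])
```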

Comment 16 Jason Grosso 2019-06-23 13:05:42 UTC
After deploying the OSP 13 z7 build, I see the following status for ceph-nfs-pacemaker:

 ceph-nfs	(systemd:ceph-nfs@pacemaker):	Started controller-0


full output of command 2019-06-20.1



[heat-admin@controller-0 ~]$ sudo pcs status
Cluster name: tripleo_cluster
Stack: corosync
Current DC: controller-2 (version 1.1.19-8.el7_6.5-c3c624ea3d) - partition with quorum
Last updated: Sun Jun 23 13:04:03 2019
Last change: Fri Jun 21 03:38:12 2019 by root via cibadmin on controller-0

12 nodes configured
40 resources configured

Online: [ controller-0 controller-1 controller-2 ]
GuestOnline: [ galera-bundle-0@controller-1 galera-bundle-1@controller-2 galera-bundle-2@controller-0 rabbitmq-bundle-0@controller-1 rabbitmq-bundle-1@controller-2 rabbitmq-bundle-2@controller-0 redis-bundle-0@controller-1 redis-bundle-1@controller-2 redis-bundle-2@controller-0 ]

Full list of resources:

 ip-172.17.5.13	(ocf::heartbeat:IPaddr2):	Started controller-0
 Docker container set: rabbitmq-bundle [192.168.24.1:8787/rhosp13/openstack-rabbitmq:pcmklatest]
   rabbitmq-bundle-0	(ocf::heartbeat:rabbitmq-cluster):	Started controller-1
   rabbitmq-bundle-1	(ocf::heartbeat:rabbitmq-cluster):	Started controller-2
   rabbitmq-bundle-2	(ocf::heartbeat:rabbitmq-cluster):	Started controller-0
 Docker container set: galera-bundle [192.168.24.1:8787/rhosp13/openstack-mariadb:pcmklatest]
   galera-bundle-0	(ocf::heartbeat:galera):	Master controller-1
   galera-bundle-1	(ocf::heartbeat:galera):	Master controller-2
   galera-bundle-2	(ocf::heartbeat:galera):	Master controller-0
 Docker container set: redis-bundle [192.168.24.1:8787/rhosp13/openstack-redis:pcmklatest]
   redis-bundle-0	(ocf::heartbeat:redis):	Master controller-1
   redis-bundle-1	(ocf::heartbeat:redis):	Slave controller-2
   redis-bundle-2	(ocf::heartbeat:redis):	Slave controller-0
 ip-192.168.24.101	(ocf::heartbeat:IPaddr2):	Started controller-1
 ip-10.0.0.101	(ocf::heartbeat:IPaddr2):	Started controller-2
 ip-172.17.1.102	(ocf::heartbeat:IPaddr2):	Started controller-0
 ip-172.17.1.101	(ocf::heartbeat:IPaddr2):	Started controller-1
 ip-172.17.3.101	(ocf::heartbeat:IPaddr2):	Started controller-2
 ip-172.17.4.101	(ocf::heartbeat:IPaddr2):	Started controller-0
 Docker container set: haproxy-bundle [192.168.24.1:8787/rhosp13/openstack-haproxy:pcmklatest]
   haproxy-bundle-docker-0	(ocf::heartbeat:docker):	Started controller-1
   haproxy-bundle-docker-1	(ocf::heartbeat:docker):	Started controller-2
   haproxy-bundle-docker-2	(ocf::heartbeat:docker):	Started controller-0
 ceph-nfs	(systemd:ceph-nfs@pacemaker):	Started controller-0
 Docker container: openstack-cinder-volume [192.168.24.1:8787/rhosp13/openstack-cinder-volume:pcmklatest]
   openstack-cinder-volume-docker-0	(ocf::heartbeat:docker):	Started controller-1
 Docker container: openstack-manila-share [192.168.24.1:8787/rhosp13/openstack-manila-share:pcmklatest]
   openstack-manila-share-docker-0	(ocf::heartbeat:docker):	Started controller-0

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled
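
A verification like the one above could be scripted roughly as follows. This is a sketch, not an official pcs interface: resource_state and has_failed_actions are hypothetical helpers that scrape the human-readable pcs status text in the layout pasted in this report. Checking for a "Failed Actions:" section matters because, in the original failure, the resource line still read "Started" while the monitor action had failed.

```python
import re


def resource_state(pcs_output, resource):
    """Return (state, node) for a top-level resource line, or None if not found."""
    pattern = re.compile(
        r"^\s*" + re.escape(resource) + r"\s+\(\S+\):\s+(\S+)(?:\s+(\S+))?\s*$",
        re.MULTILINE,
    )
    match = pattern.search(pcs_output)
    return (match.group(1), match.group(2)) if match else None


def has_failed_actions(pcs_output):
    """True if the output contains a 'Failed Actions:' section."""
    return "Failed Actions:" in pcs_output


# Two resource lines copied from the pcs status output in this comment.
SAMPLE = (
    " ip-172.17.4.101\t(ocf::heartbeat:IPaddr2):\tStarted controller-0\n"
    " ceph-nfs\t(systemd:ceph-nfs@pacemaker):\tStarted controller-0\n"
)

print(resource_state(SAMPLE, "ceph-nfs"))
print(has_failed_actions(SAMPLE))
```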

Comment 18 errata-xmlrpc 2019-07-10 13:05:11 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:1738