+++ This bug was initially created as a clone of Bug #1784562 +++
Description of problem:
Manila-share failed to start after deploying the overcloud with ceph-nfs and the IPv4 network protocol.
[root@controller-0 ~]# pcs status
Cluster name: tripleo_cluster
Stack: corosync
Current DC: controller-1 (version 2.0.2-3.el8-744a30d655) - partition with quorum
Last updated: Tue Dec 17 17:21:03 2019
Last change: Fri Dec 13 15:27:27 2019 by root via cibadmin on controller-0
12 nodes configured
40 resources configured
Online: [ controller-0 controller-1 controller-2 ]
GuestOnline: [ galera-bundle-0@controller-1 galera-bundle-1@controller-2 galera-bundle-2@controller-0 ovn-dbs-bundle-0@controller-0 ovn-dbs-bundle-1@controller-1 ovn-dbs-bundle-2@controller-2 rabbitmq-bundle-0@controller-1 rabbitmq-bundle-1@controller-2 rabbitmq-bundle-2@controller-0 ]
Full list of resources:
ip-172.17.5.98 (ocf::heartbeat:IPaddr2): Started controller-1
Container bundle set: galera-bundle [undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-mariadb:pcmklatest]
galera-bundle-0 (ocf::heartbeat:galera): Master controller-1
galera-bundle-1 (ocf::heartbeat:galera): Master controller-2
galera-bundle-2 (ocf::heartbeat:galera): Master controller-0
Container bundle set: rabbitmq-bundle [undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-rabbitmq:pcmklatest]
rabbitmq-bundle-0 (ocf::heartbeat:rabbitmq-cluster): Started controller-1
rabbitmq-bundle-1 (ocf::heartbeat:rabbitmq-cluster): Started controller-2
rabbitmq-bundle-2 (ocf::heartbeat:rabbitmq-cluster): Started controller-0
ip-192.168.24.54 (ocf::heartbeat:IPaddr2): Started controller-1
ip-10.0.0.138 (ocf::heartbeat:IPaddr2): Started controller-2
ip-172.17.1.38 (ocf::heartbeat:IPaddr2): Started controller-0
ip-172.17.3.86 (ocf::heartbeat:IPaddr2): Started controller-1
ip-172.17.4.76 (ocf::heartbeat:IPaddr2): Started controller-2
Container bundle set: haproxy-bundle [undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-haproxy:pcmklatest]
haproxy-bundle-podman-0 (ocf::heartbeat:podman): Started controller-0
haproxy-bundle-podman-1 (ocf::heartbeat:podman): Started controller-1
haproxy-bundle-podman-2 (ocf::heartbeat:podman): Started controller-2
Container bundle set: ovn-dbs-bundle [undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-ovn-northd:pcmklatest]
ovn-dbs-bundle-0 (ocf::ovn:ovndb-servers): Master controller-0
ovn-dbs-bundle-1 (ocf::ovn:ovndb-servers): Slave controller-1
ovn-dbs-bundle-2 (ocf::ovn:ovndb-servers): Slave controller-2
ip-172.17.1.91 (ocf::heartbeat:IPaddr2): Started controller-0
ceph-nfs (systemd:ceph-nfs@pacemaker): Started controller-1
Container bundle: openstack-cinder-volume [undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-cinder-volume:pcmklatest]
openstack-cinder-volume-podman-0 (ocf::heartbeat:podman): Started controller-1
Container bundle: openstack-manila-share [undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-manila-share:pcmklatest]
openstack-manila-share-podman-0 (ocf::heartbeat:podman): Started controller-1
Failed Resource Actions:
* ceph-nfs_start_0 on controller-0 'unknown error' (1): call=21559, status=complete, exitreason='',
last-rc-change='Tue Dec 17 15:47:08 2019', queued=0ms, exec=12284ms
* ceph-nfs_monitor_60000 on controller-1 'not running' (7): call=470, status=complete, exitreason='',
last-rc-change='Tue Dec 17 17:20:15 2019', queued=0ms, exec=0ms
Daemon Status:
corosync: active/enabled
pacemaker: active/enabled
pcsd: active/enabled
[root@controller-0 ~]#
Version-Release number of selected component (if applicable):
How reproducible:
100%
Steps to Reproduce:
1. Deploy the overcloud with manila and the cephfs-nfs back end using RHOS_TRUNK-16.0-RHEL-8-20191206.n.1
2. Run sudo pcs status on the controller nodes
Actual results:
The ceph-nfs and manila-share services fail; in this IPv4 deployment the VIP used by ceph-nfs is up.
Expected results:
The ceph-nfs and manila-share services are expected to work.
Additional info:
This BZ is cloned from a BZ for a similar IPv6 deployment but differs from it in that the VIP used by Ceph-NFS is up. The Ceph-NFS service (aka ganesha) is failing even though the VIP is up.
--- Additional comment from Tom Barron on 2019-12-17 18:45:39 UTC ---
The cephfs-nfs (ganesha) container is failing with an address-in-use error on the dbus socket:
[root@controller-0 system]# /usr/bin/podman run --rm --net=host -v /var/lib/ceph:/var/lib/ceph:z -v /etc/ceph:/etc/ceph:z -v /var/lib/nfs/ganesha:/var/lib/nfs/ganesha:z -v /etc/ganesha:/etc/ganesha:z -v /var/run/ceph:/var/run/ceph:z -v /var/log/ceph:/var/log/ceph:z --privileged -v /var/run/dbus/system_bus_socket:/var/run/dbus/system_bus_socket -v /etc/localtime:/etc/localtime:ro -e CLUSTER=ceph -e CEPH_DAEMON=NFS -e CONTAINER_IMAGE=undercloud-0.ctlplane.redhat.local:8787/ceph/rhceph-4.0-rhel8:latest --name=ceph-nfs-pacemaker undercloud-0.ctlplane.redhat.local:8787/ceph/rhceph-4.0-rhel8:latest
2019-12-17 14:29:57 /opt/ceph-container/bin/entrypoint.sh: static: does not generate config
dbus-daemon[85]: Failed to start message bus: Failed to bind socket "/run/dbus/system_bus_socket": Address already in use
This error appears to be triggered by a change in the ceph container image [1] that runs a dedicated dbus daemon inside the container. ceph-ansible, however, deploys ceph-nfs with the dbus socket bind-mounted from the host (as in the command above), and on the host the socket is indeed already in use.
[1] https://github.com/ceph/ceph-container/pull/1517
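As a sanity check (a sketch, not the eventual fix), the same container can be run by hand with the host dbus bind mount dropped, since the image now starts its own dbus daemon. Every flag below is copied from the failing command above; only the -v for /var/run/dbus/system_bus_socket is omitted and the container name is changed:
# assumed check: same invocation as above minus the host dbus socket mount
/usr/bin/podman run --rm --net=host \
  -v /var/lib/ceph:/var/lib/ceph:z \
  -v /etc/ceph:/etc/ceph:z \
  -v /var/lib/nfs/ganesha:/var/lib/nfs/ganesha:z \
  -v /etc/ganesha:/etc/ganesha:z \
  -v /var/run/ceph:/var/run/ceph:z \
  -v /var/log/ceph:/var/log/ceph:z \
  --privileged \
  -v /etc/localtime:/etc/localtime:ro \
  -e CLUSTER=ceph \
  -e CEPH_DAEMON=NFS \
  -e CONTAINER_IMAGE=undercloud-0.ctlplane.redhat.local:8787/ceph/rhceph-4.0-rhel8:latest \
  --name=ceph-nfs-dbus-test \
  undercloud-0.ctlplane.redhat.local:8787/ceph/rhceph-4.0-rhel8:latest
If the in-container dbus binds cleanly in this case, that points squarely at the stale host bind mount in the deployed systemd unit.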
--- Additional comment from Tom Barron on 2019-12-17 19:00:08 UTC ---
OK, the ceph container image used for ceph-nfs (ganesha) and the ceph-ansible used to deploy the overcloud are not synchronized. We have
ceph_image: rhceph-4.0-rhel8
ceph_namespace: docker-registry.upshift.redhat.com/ceph
which is running a dedicated dbus in the container, but
[root@undercloud-0 stack]# yum list installed ceph-ansible
Installed Packages
ceph-ansible.noarch 4.0.5-1.el8cp @rhelosp-ceph-4-tools
which does not have https://github.com/ceph/ceph-ansible/pull/4760 from upstream -- a PR that modifies the ceph-nfs systemd unit template.
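A quick way to confirm whether the installed ceph-ansible already carries that template change is to grep the shipped ceph-nfs systemd template for the host dbus bind mount. The template path below is an assumption based on the usual ceph-ansible role layout, not verified on this system:
# locate the template shipped by the package, then check for the host dbus mount
rpm -ql ceph-ansible | grep ceph-nfs
grep -n system_bus_socket /usr/share/ceph-ansible/roles/ceph-nfs/templates/ceph-nfs.service.j2
If system_bus_socket is still bind-mounted in the template, the installed package predates the PR.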
--- Additional comment from Tom Barron on 2019-12-17 19:14:59 UTC ---
IIUC the motivation for running a dedicated dbus inside the cephfs-nfs container is to enable running that container without '--privileged'. That's good motivation, but the point of using dbus is to enable ganesha consumers to dynamically signal the exports they need so that ganesha can implement them. The manila-share service, running in a separate container, does exactly this.
We don't currently have an alternative means of triggering the required export updates from manila-share, so I think that while synchronizing ceph-ansible with the container image change will enable ceph-nfs to run, it will still be broken in that manila won't be able to make use of it.
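For context, manila's ganesha helper drives those export updates over ganesha's D-Bus export manager with calls roughly of the following form (the export config path and Export_Id here are illustrative). If ganesha listens on a dbus private to its own container while manila-share still talks to the host bus, calls like this can no longer reach it:
# illustrative export add against ganesha's D-Bus export manager
dbus-send --print-reply --system \
  --dest=org.ganesha.nfsd /org/ganesha/nfsd/ExportMgr \
  org.ganesha.nfsd.exportmgr.AddExport \
  string:/etc/ganesha/export.d/share-101.conf \
  string:'EXPORT(Export_Id=101)'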
--- Additional comment from Tom Barron on 2019-12-17 20:07:27 UTC ---
Same issue with the most up-to-date OSP deployment:
(undercloud) [stack@undercloud-0 ~]$ cat core_puddle_version
RHOS_TRUNK-16.0-RHEL-8-20191217.n.1
(undercloud) [stack@undercloud-0 ~]$ sudo yum list installed ceph-ansible
Installed Packages
ceph-ansible.noarch 4.0.6-1.el8cp @rhelosp-ceph-4.0-tools-pending
(undercloud) [stack@undercloud-0 ~]$ ssh heat-admin@192.168.24.25 sudo cat /etc/systemd/system/ceph-nfs@.service
Warning: Permanently added '192.168.24.25' (ECDSA) to the list of known hosts.
[Unit]
Description=NFS-Ganesha file server
Documentation=http://github.com/nfs-ganesha/nfs-ganesha/wiki
After=network.target
[Service]
EnvironmentFile=-/etc/environment
ExecStartPre=-/usr/bin/podman rm ceph-nfs-%i
ExecStartPre=/usr/bin/mkdir -p /etc/ceph /etc/ganesha /var/lib/nfs/ganesha
ExecStart=/usr/bin/podman run --rm --net=host \
-v /var/lib/ceph:/var/lib/ceph:z \
-v /etc/ceph:/etc/ceph:z \
-v /var/lib/nfs/ganesha:/var/lib/nfs/ganesha:z \
-v /etc/ganesha:/etc/ganesha:z \
-v /var/run/ceph:/var/run/ceph:z \
-v /var/log/ceph:/var/log/ceph:z \
--privileged \
-v /var/run/dbus/system_bus_socket:/var/run/dbus/system_bus_socket \
-v /etc/localtime:/etc/localtime:ro \
-e CLUSTER=ceph \
-e CEPH_DAEMON=NFS \
-e CONTAINER_IMAGE=undercloud-0.ctlplane.redhat.local:8787/ceph/rhceph-4.0-rhel8:latest \
\
--name=ceph-nfs-pacemaker \
undercloud-0.ctlplane.redhat.local:8787/ceph/rhceph-4.0-rhel8:latest
ExecStopPost=-/usr/bin/podman stop ceph-nfs-%i
Restart=always
RestartSec=10s
TimeoutStartSec=120
TimeoutStopSec=15
[Install]
WantedBy=multi-user.target
(undercloud) [stack@undercloud-0 ~]$ ssh heat-admin@192.168.24.25 sudo podman ps -a | grep ceph
Warning: Permanently added '192.168.24.25' (ECDSA) to the list of known hosts.
7893ed911009 undercloud-0.ctlplane.redhat.local:8787/ceph/rhceph-4.0-rhel8:latest /opt/ceph-contain... 42 minutes ago Up 42 minutes ago ceph-mds-controller-0
6426df89582c undercloud-0.ctlplane.redhat.local:8787/ceph/rhceph-4.0-rhel8:latest /opt/ceph-contain... 45 minutes ago Up 45 minutes ago ceph-mgr-controller-0
f6f7dad234fa undercloud-0.ctlplane.redhat.local:8787/ceph/rhceph-4.0-rhel8:latest /opt/ceph-contain... About an hour ago Up About an hour ago ceph-mon-controller-0
(undercloud) [stack@undercloud-0 ~]$ ssh heat-admin@192.168.24.25 sudo /usr/bin/podman run --rm --net=host -v /var/lib/ceph:/var/lib/ceph:z -v /etc/ceph:/etc/ceph:z -v /var/lib/nfs/ganesha:/var/lib/nfs/ganesha:z -v /etc/ganesha:/etc/ganesha:z -v /var/run/ceph:/var/run/ceph:z -v /var/log/ceph:/var/log/ceph:z --privileged -v /var/run/dbus/system_bus_socket:/var/run/dbus/system_bus_socket -v /etc/localtime:/etc/localtime:ro -e CLUSTER=ceph -e CEPH_DAEMON=NFS -e CONTAINER_IMAGE=undercloud-0.ctlplane.redhat.local:8787/ceph/rhceph-4.0-rhel8:latest --name=ceph-nfs-pacemaker undercloud-0.ctlplane.redhat.local:8787/ceph/rhceph-4.0-rhel8:latest
Warning: Permanently added '192.168.24.25' (ECDSA) to the list of known hosts.
2019-12-17 20:05:19 /opt/ceph-container/bin/entrypoint.sh: static: does not generate config
dbus-daemon[85]: Failed to start message bus: Failed to bind socket "/run/dbus/system_bus_socket": Address already in use
(undercloud) [stack@undercloud-0 ~]$
--- Additional comment from lkuchlan on 2019-12-18 06:40:55 UTC ---
Switching component to manila, at least for now, since there's nothing to change in THT to fix this (a ceph/container change is probably needed, but we can track the OSP side here).