+++ This bug was initially created as a clone of Bug #1784562 +++
Description of problem:
Manila-share failed to start after deploying the overcloud with ceph-nfs and the IPv4 network protocol.
[root@controller-0 ~]# pcs status
Cluster name: tripleo_cluster
Stack: corosync
Current DC: controller-1 (version 2.0.2-3.el8-744a30d655) - partition with quorum
Last updated: Tue Dec 17 17:21:03 2019
Last change: Fri Dec 13 15:27:27 2019 by root via cibadmin on controller-0
12 nodes configured
40 resources configured
Online: [ controller-0 controller-1 controller-2 ]
GuestOnline: [ galera-bundle-0@controller-1 galera-bundle-1@controller-2 galera-bundle-2@controller-0 ovn-dbs-bundle-0@controller-0 ovn-dbs-bundle-1@controller-1 ovn-dbs-bundle-2@controller-2 rabbitmq-bundle-0@controller-1 rabbitmq-bundle-1@controller-2 rabbitmq-bundle-2@controller-0 ]
Full list of resources:
ip-172.17.5.98 (ocf::heartbeat:IPaddr2): Started controller-1
Container bundle set: galera-bundle [undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-mariadb:pcmklatest]
galera-bundle-0 (ocf::heartbeat:galera): Master controller-1
galera-bundle-1 (ocf::heartbeat:galera): Master controller-2
galera-bundle-2 (ocf::heartbeat:galera): Master controller-0
Container bundle set: rabbitmq-bundle [undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-rabbitmq:pcmklatest]
rabbitmq-bundle-0 (ocf::heartbeat:rabbitmq-cluster): Started controller-1
rabbitmq-bundle-1 (ocf::heartbeat:rabbitmq-cluster): Started controller-2
rabbitmq-bundle-2 (ocf::heartbeat:rabbitmq-cluster): Started controller-0
ip-192.168.24.54 (ocf::heartbeat:IPaddr2): Started controller-1
ip-10.0.0.138 (ocf::heartbeat:IPaddr2): Started controller-2
ip-172.17.1.38 (ocf::heartbeat:IPaddr2): Started controller-0
ip-172.17.3.86 (ocf::heartbeat:IPaddr2): Started controller-1
ip-172.17.4.76 (ocf::heartbeat:IPaddr2): Started controller-2
Container bundle set: haproxy-bundle [undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-haproxy:pcmklatest]
haproxy-bundle-podman-0 (ocf::heartbeat:podman): Started controller-0
haproxy-bundle-podman-1 (ocf::heartbeat:podman): Started controller-1
haproxy-bundle-podman-2 (ocf::heartbeat:podman): Started controller-2
Container bundle set: ovn-dbs-bundle [undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-ovn-northd:pcmklatest]
ovn-dbs-bundle-0 (ocf::ovn:ovndb-servers): Master controller-0
ovn-dbs-bundle-1 (ocf::ovn:ovndb-servers): Slave controller-1
ovn-dbs-bundle-2 (ocf::ovn:ovndb-servers): Slave controller-2
ip-172.17.1.91 (ocf::heartbeat:IPaddr2): Started controller-0
ceph-nfs (systemd:ceph-nfs@pacemaker): Started controller-1
Container bundle: openstack-cinder-volume [undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-cinder-volume:pcmklatest]
openstack-cinder-volume-podman-0 (ocf::heartbeat:podman): Started controller-1
Container bundle: openstack-manila-share [undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-manila-share:pcmklatest]
openstack-manila-share-podman-0 (ocf::heartbeat:podman): Started controller-1
Failed Resource Actions:
* ceph-nfs_start_0 on controller-0 'unknown error' (1): call=21559, status=complete, exitreason='',
last-rc-change='Tue Dec 17 15:47:08 2019', queued=0ms, exec=12284ms
* ceph-nfs_monitor_60000 on controller-1 'not running' (7): call=470, status=complete, exitreason='',
last-rc-change='Tue Dec 17 17:20:15 2019', queued=0ms, exec=0ms
Daemon Status:
corosync: active/enabled
pacemaker: active/enabled
pcsd: active/enabled
[root@controller-0 ~]#
Version-Release number of selected component (if applicable):
How reproducible:
100%
Steps to Reproduce:
1. Deploy the overcloud with manila and the cephfs-nfs back end using RHOS_TRUNK-16.0-RHEL-8-20191206.n.1
2. Run sudo pcs status on the controller nodes
Actual results:
The ceph-nfs and manila-share services fail; in this IPv4 deployment the VIP used by ceph-nfs is up.
Expected results:
The ceph-nfs and manila-share services are expected to work.
Additional info:
This BZ is cloned from a BZ for a similar IPv6 deployment but differs from it in that the VIP used by Ceph-NFS is up. The Ceph-NFS service (aka ganesha) is failing even though the VIP is up.
--- Additional comment from Tom Barron on 2019-12-17 18:45:39 UTC ---
The cephfs-nfs (ganesha) container is failing with an address-in-use error on the dbus socket:
[root@controller-0 system]# /usr/bin/podman run --rm --net=host -v /var/lib/ceph:/var/lib/ceph:z -v /etc/ceph:/etc/ceph:z -v /var/lib/nfs/ganesha:/var/lib/nfs/ganesha:z -v /etc/ganesha:/etc/ganesha:z -v /var/run/ceph:/var/run/ceph:z -v /var/log/ceph:/var/log/ceph:z --privileged -v /var/run/dbus/system_bus_socket:/var/run/dbus/system_bus_socket -v /etc/localtime:/etc/localtime:ro -e CLUSTER=ceph -e CEPH_DAEMON=NFS -e CONTAINER_IMAGE=undercloud-0.ctlplane.redhat.local:8787/ceph/rhceph-4.0-rhel8:latest --name=ceph-nfs-pacemaker undercloud-0.ctlplane.redhat.local:8787/ceph/rhceph-4.0-rhel8:latest
2019-12-17 14:29:57 /opt/ceph-container/bin/entrypoint.sh: static: does not generate config
dbus-daemon[85]: Failed to start message bus: Failed to bind socket "/run/dbus/system_bus_socket": Address already in use
This error appears to be triggered by a change in the ceph container image [1] that runs a dedicated dbus daemon inside the container. ceph-ansible, however, deploys ceph-nfs with the dbus socket bind-mounted from the host (as in the command above), and on the host the socket is indeed already in use.
[1] https://github.com/ceph/ceph-container/pull/1517
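As a sanity check (a sketch, not the eventual fix), the same container can be run by hand with the host dbus bind mount dropped, since the image now starts its own dbus daemon. Every flag below is copied from the failing command above; only the -v for /var/run/dbus/system_bus_socket is omitted and the container name is changed:
# assumed check: same invocation as above minus the host dbus socket mount
/usr/bin/podman run --rm --net=host \
  -v /var/lib/ceph:/var/lib/ceph:z \
  -v /etc/ceph:/etc/ceph:z \
  -v /var/lib/nfs/ganesha:/var/lib/nfs/ganesha:z \
  -v /etc/ganesha:/etc/ganesha:z \
  -v /var/run/ceph:/var/run/ceph:z \
  -v /var/log/ceph:/var/log/ceph:z \
  --privileged \
  -v /etc/localtime:/etc/localtime:ro \
  -e CLUSTER=ceph \
  -e CEPH_DAEMON=NFS \
  -e CONTAINER_IMAGE=undercloud-0.ctlplane.redhat.local:8787/ceph/rhceph-4.0-rhel8:latest \
  --name=ceph-nfs-dbus-test \
  undercloud-0.ctlplane.redhat.local:8787/ceph/rhceph-4.0-rhel8:latest
If the in-container dbus binds cleanly in this case, that points squarely at the stale host bind mount in the deployed systemd unit.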
--- Additional comment from Tom Barron on 2019-12-17 19:00:08 UTC ---
OK, the ceph container image used for ceph-nfs (ganesha) and the ceph-ansible used to deploy the overcloud are not synchronized. We have
ceph_image: rhceph-4.0-rhel8
ceph_namespace: docker-registry.upshift.redhat.com/ceph
which is running a dedicated dbus in the container, but
[root@undercloud-0 stack]# yum list installed ceph-ansible
Installed Packages
ceph-ansible.noarch 4.0.5-1.el8cp @rhelosp-ceph-4-tools
which does not have https://github.com/ceph/ceph-ansible/pull/4760 from upstream -- a PR that modifies the ceph-nfs systemd unit template.
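A quick way to confirm whether the installed ceph-ansible already carries that template change is to grep the shipped ceph-nfs systemd template for the host dbus bind mount. The template path below is an assumption based on the usual ceph-ansible role layout, not verified on this system:
# locate the template shipped by the package, then check for the host dbus mount
rpm -ql ceph-ansible | grep ceph-nfs
grep -n system_bus_socket /usr/share/ceph-ansible/roles/ceph-nfs/templates/ceph-nfs.service.j2
If system_bus_socket is still bind-mounted in the template, the installed package predates the PR.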
--- Additional comment from Tom Barron on 2019-12-17 19:14:59 UTC ---
IIUC the motivation for running a dedicated dbus inside the cephfs-nfs container is to enable running that container without '--privileged'. That's good motivation, but the point of using dbus is to enable ganesha consumers to dynamically signal the exports they need so that ganesha can implement them. The manila-share service, running in a separate container, does exactly this.
We don't currently have an alternative means of triggering the required export updates from manila-share, so I think that while synchronizing ceph-ansible with the container image change will enable ceph-nfs to run, it will still be broken in that manila won't be able to make use of it.
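For context, manila's ganesha helper drives those export updates over ganesha's D-Bus export manager with calls roughly of the following form (the export config path and Export_Id here are illustrative). If ganesha listens on a dbus private to its own container while manila-share still talks to the host bus, calls like this can no longer reach it:
# illustrative export add against ganesha's D-Bus export manager
dbus-send --print-reply --system \
  --dest=org.ganesha.nfsd /org/ganesha/nfsd/ExportMgr \
  org.ganesha.nfsd.exportmgr.AddExport \
  string:/etc/ganesha/export.d/share-101.conf \
  string:'EXPORT(Export_Id=101)'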
--- Additional comment from Tom Barron on 2019-12-17 20:07:27 UTC ---
Same issue with the most up-to-date OSP deployment:
(undercloud) [stack@undercloud-0 ~]$ cat core_puddle_version
RHOS_TRUNK-16.0-RHEL-8-20191217.n.1
(undercloud) [stack@undercloud-0 ~]$ sudo yum list installed ceph-ansible
Installed Packages
ceph-ansible.noarch 4.0.6-1.el8cp @rhelosp-ceph-4.0-tools-pending
(undercloud) [stack@undercloud-0 ~]$ ssh heat-admin@192.168.24.25 sudo cat /etc/systemd/system/ceph-nfs@.service
Warning: Permanently added '192.168.24.25' (ECDSA) to the list of known hosts.
[Unit]
Description=NFS-Ganesha file server
Documentation=http://github.com/nfs-ganesha/nfs-ganesha/wiki
After=network.target
[Service]
EnvironmentFile=-/etc/environment
ExecStartPre=-/usr/bin/podman rm ceph-nfs-%i
ExecStartPre=/usr/bin/mkdir -p /etc/ceph /etc/ganesha /var/lib/nfs/ganesha
ExecStart=/usr/bin/podman run --rm --net=host \
-v /var/lib/ceph:/var/lib/ceph:z \
-v /etc/ceph:/etc/ceph:z \
-v /var/lib/nfs/ganesha:/var/lib/nfs/ganesha:z \
-v /etc/ganesha:/etc/ganesha:z \
-v /var/run/ceph:/var/run/ceph:z \
-v /var/log/ceph:/var/log/ceph:z \
--privileged \
-v /var/run/dbus/system_bus_socket:/var/run/dbus/system_bus_socket \
-v /etc/localtime:/etc/localtime:ro \
-e CLUSTER=ceph \
-e CEPH_DAEMON=NFS \
-e CONTAINER_IMAGE=undercloud-0.ctlplane.redhat.local:8787/ceph/rhceph-4.0-rhel8:latest \
\
--name=ceph-nfs-pacemaker \
undercloud-0.ctlplane.redhat.local:8787/ceph/rhceph-4.0-rhel8:latest
ExecStopPost=-/usr/bin/podman stop ceph-nfs-%i
Restart=always
RestartSec=10s
TimeoutStartSec=120
TimeoutStopSec=15
[Install]
WantedBy=multi-user.target
(undercloud) [stack@undercloud-0 ~]$ ssh heat-admin@192.168.24.25 sudo podman ps -a | grep ceph
Warning: Permanently added '192.168.24.25' (ECDSA) to the list of known hosts.
7893ed911009 undercloud-0.ctlplane.redhat.local:8787/ceph/rhceph-4.0-rhel8:latest /opt/ceph-contain... 42 minutes ago Up 42 minutes ago ceph-mds-controller-0
6426df89582c undercloud-0.ctlplane.redhat.local:8787/ceph/rhceph-4.0-rhel8:latest /opt/ceph-contain... 45 minutes ago Up 45 minutes ago ceph-mgr-controller-0
f6f7dad234fa undercloud-0.ctlplane.redhat.local:8787/ceph/rhceph-4.0-rhel8:latest /opt/ceph-contain... About an hour ago Up About an hour ago ceph-mon-controller-0
(undercloud) [stack@undercloud-0 ~]$ ssh heat-admin@192.168.24.25 sudo /usr/bin/podman run --rm --net=host -v /var/lib/ceph:/var/lib/ceph:z -v /etc/ceph:/etc/ceph:z -v /var/lib/nfs/ganesha:/var/lib/nfs/ganesha:z -v /etc/ganesha:/etc/ganesha:z -v /var/run/ceph:/var/run/ceph:z -v /var/log/ceph:/var/log/ceph:z --privileged -v /var/run/dbus/system_bus_socket:/var/run/dbus/system_bus_socket -v /etc/localtime:/etc/localtime:ro -e CLUSTER=ceph -e CEPH_DAEMON=NFS -e CONTAINER_IMAGE=undercloud-0.ctlplane.redhat.local:8787/ceph/rhceph-4.0-rhel8:latest --name=ceph-nfs-pacemaker undercloud-0.ctlplane.redhat.local:8787/ceph/rhceph-4.0-rhel8:latest
Warning: Permanently added '192.168.24.25' (ECDSA) to the list of known hosts.
2019-12-17 20:05:19 /opt/ceph-container/bin/entrypoint.sh: static: does not generate config
dbus-daemon[85]: Failed to start message bus: Failed to bind socket "/run/dbus/system_bus_socket": Address already in use
(undercloud) [stack@undercloud-0 ~]$
--- Additional comment from lkuchlan on 2019-12-18 06:40:55 UTC ---
Switching component to manila, at least for now, since there's nothing to change in THT to fix this (a ceph/container change is probably needed, but we can track the OSP side here).