Description of problem:

Manila-share failed to start after deploying the overcloud with ceph-nfs and the ipv4 network protocol.

[root@controller-0 ~]# pcs status
Cluster name: tripleo_cluster
Stack: corosync
Current DC: controller-1 (version 2.0.2-3.el8-744a30d655) - partition with quorum
Last updated: Tue Dec 17 17:21:03 2019
Last change: Fri Dec 13 15:27:27 2019 by root via cibadmin on controller-0

12 nodes configured
40 resources configured

Online: [ controller-0 controller-1 controller-2 ]
GuestOnline: [ galera-bundle-0@controller-1 galera-bundle-1@controller-2 galera-bundle-2@controller-0 ovn-dbs-bundle-0@controller-0 ovn-dbs-bundle-1@controller-1 ovn-dbs-bundle-2@controller-2 rabbitmq-bundle-0@controller-1 rabbitmq-bundle-1@controller-2 rabbitmq-bundle-2@controller-0 ]

Full list of resources:

 ip-172.17.5.98 (ocf::heartbeat:IPaddr2): Started controller-1
 Container bundle set: galera-bundle [undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-mariadb:pcmklatest]
   galera-bundle-0 (ocf::heartbeat:galera): Master controller-1
   galera-bundle-1 (ocf::heartbeat:galera): Master controller-2
   galera-bundle-2 (ocf::heartbeat:galera): Master controller-0
 Container bundle set: rabbitmq-bundle [undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-rabbitmq:pcmklatest]
   rabbitmq-bundle-0 (ocf::heartbeat:rabbitmq-cluster): Started controller-1
   rabbitmq-bundle-1 (ocf::heartbeat:rabbitmq-cluster): Started controller-2
   rabbitmq-bundle-2 (ocf::heartbeat:rabbitmq-cluster): Started controller-0
 ip-192.168.24.54 (ocf::heartbeat:IPaddr2): Started controller-1
 ip-10.0.0.138 (ocf::heartbeat:IPaddr2): Started controller-2
 ip-172.17.1.38 (ocf::heartbeat:IPaddr2): Started controller-0
 ip-172.17.3.86 (ocf::heartbeat:IPaddr2): Started controller-1
 ip-172.17.4.76 (ocf::heartbeat:IPaddr2): Started controller-2
 Container bundle set: haproxy-bundle [undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-haproxy:pcmklatest]
   haproxy-bundle-podman-0 (ocf::heartbeat:podman): Started controller-0
   haproxy-bundle-podman-1 (ocf::heartbeat:podman): Started controller-1
   haproxy-bundle-podman-2 (ocf::heartbeat:podman): Started controller-2
 Container bundle set: ovn-dbs-bundle [undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-ovn-northd:pcmklatest]
   ovn-dbs-bundle-0 (ocf::ovn:ovndb-servers): Master controller-0
   ovn-dbs-bundle-1 (ocf::ovn:ovndb-servers): Slave controller-1
   ovn-dbs-bundle-2 (ocf::ovn:ovndb-servers): Slave controller-2
 ip-172.17.1.91 (ocf::heartbeat:IPaddr2): Started controller-0
 ceph-nfs (systemd:ceph-nfs@pacemaker): Started controller-1
 Container bundle: openstack-cinder-volume [undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-cinder-volume:pcmklatest]
   openstack-cinder-volume-podman-0 (ocf::heartbeat:podman): Started controller-1
 Container bundle: openstack-manila-share [undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-manila-share:pcmklatest]
   openstack-manila-share-podman-0 (ocf::heartbeat:podman): Started controller-1

Failed Resource Actions:
* ceph-nfs_start_0 on controller-0 'unknown error' (1): call=21559, status=complete, exitreason='', last-rc-change='Tue Dec 17 15:47:08 2019', queued=0ms, exec=12284ms
* ceph-nfs_monitor_60000 on controller-1 'not running' (7): call=470, status=complete, exitreason='', last-rc-change='Tue Dec 17 17:20:15 2019', queued=0ms, exec=0ms

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled
[root@controller-0 ~]#

Version-Release number of selected component (if applicable):
How reproducible:
100%

Steps to Reproduce:
1. Deploy the overcloud with manila using the cephfs-nfs back end on RHOS_TRUNK-16.0-RHEL-8-20191206.n.1
2. Run 'sudo pcs status' on the controller nodes

Actual results:
ceph-nfs and manila-share fail. (The failed IPv6 VIP from the bug this was cloned from does not apply here; in this IPv4 deployment the VIP is up, per the pcs output above.)

Expected results:
The ceph-nfs and manila-share services work.

Additional info:
This BZ is cloned from a BZ for a similar IPv6 deployment, but differs from it in that the VIP used by ceph-nfs is up. The ceph-nfs service (aka ganesha) is failing even though the VIP is up.
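For triage, the failed start can be inspected directly on the affected controllers (a diagnostic sketch; the resource and unit names are taken from the pcs output above):

  # Show the failure counters pacemaker has recorded for the resource
  pcs resource failcount show ceph-nfs

  # ceph-nfs is a systemd-class resource (systemd:ceph-nfs@pacemaker), so the
  # unit's journal on the failing node carries the underlying error
  journalctl -u ceph-nfs@pacemaker --since today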
The cephfs-nfs (ganesha) container is failing with an address-in-use error on the dbus socket:

[root@controller-0 system]# /usr/bin/podman run --rm --net=host \
  -v /var/lib/ceph:/var/lib/ceph:z \
  -v /etc/ceph:/etc/ceph:z \
  -v /var/lib/nfs/ganesha:/var/lib/nfs/ganesha:z \
  -v /etc/ganesha:/etc/ganesha:z \
  -v /var/run/ceph:/var/run/ceph:z \
  -v /var/log/ceph:/var/log/ceph:z \
  --privileged \
  -v /var/run/dbus/system_bus_socket:/var/run/dbus/system_bus_socket \
  -v /etc/localtime:/etc/localtime:ro \
  -e CLUSTER=ceph \
  -e CEPH_DAEMON=NFS \
  -e CONTAINER_IMAGE=undercloud-0.ctlplane.redhat.local:8787/ceph/rhceph-4.0-rhel8:latest \
  --name=ceph-nfs-pacemaker \
  undercloud-0.ctlplane.redhat.local:8787/ceph/rhceph-4.0-rhel8:latest
2019-12-17 14:29:57  /opt/ceph-container/bin/entrypoint.sh: static: does not generate config
dbus-daemon[85]: Failed to start message bus: Failed to bind socket "/run/dbus/system_bus_socket": Address already in use

This error appears to be triggered by a change in the ceph container image [1] that runs a dedicated dbus daemon inside the container. ceph-ansible, however, deploys ceph-nfs with the dbus socket bind-mounted from the host (as in the command above), and on the host the socket is indeed already in use.

[1] https://github.com/ceph/ceph-container/pull/1517
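The conflict is easy to confirm from the host side (a minimal check, assuming the host's own dbus-daemon holds the socket):

  # List listening unix sockets with their owning processes; the host
  # dbus-daemon already owns /run/dbus/system_bus_socket, and the bind mount
  # exposes that same socket at the path the in-container dbus-daemon binds
  ss -xlp | grep system_bus_socket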
OK, the ceph container image used for ceph-nfs (ganesha) and the ceph-ansible used to deploy the overcloud are not synchronized. We have

  ceph_image: rhceph-4.0-rhel8
  ceph_namespace: docker-registry.upshift.redhat.com/ceph

which runs a dedicated dbus inside the container, but the installed ceph-ansible

[root@undercloud-0 stack]# yum list installed ceph-ansible
Installed Packages
ceph-ansible.noarch    4.0.5-1.el8cp    @rhelosp-ceph-4-tools

does not include https://github.com/ceph/ceph-ansible/pull/4760 from upstream -- a PR that modifies the template for the ceph-nfs systemd unit file.
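To illustrate what synchronizing would mean (this is an assumption about the fix, not the actual patch -- see the upstream PR for the real change): the generated unit file would stop bind-mounting the host's dbus socket so the container's dedicated dbus-daemon can bind /run/dbus/system_bus_socket itself, e.g.:

  # Hypothetical change to the rendered ExecStart, for illustration only
  -  -v /var/run/dbus/system_bus_socket:/var/run/dbus/system_bus_socket \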
IIUC, the motivation for running a dedicated dbus inside the cephfs-nfs container is to allow running that container without '--privileged'. That is good motivation, but the whole point of using dbus here is to let ganesha consumers dynamically signal the exports they need so that ganesha can implement them. The manila-share service, running in a separate container, does exactly this. We currently have no alternative means of triggering the required export updates from manila-share, so I think that while synchronizing ceph-ansible with the container image change will let ceph-nfs run, it will still be broken in that manila won't be able to make use of it.
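To make the dependency concrete, this is roughly how export updates are signaled through ganesha's D-Bus export manager (a sketch; the export file path and Export_Id are hypothetical, and manila's ganesha driver makes the equivalent call programmatically rather than via dbus-send):

  # Ask a running ganesha to load a new export without a restart; this only
  # works if the caller can reach the bus ganesha is actually listening on
  dbus-send --print-reply --system \
    --dest=org.ganesha.nfsd /org/ganesha/nfsd/ExportMgr \
    org.ganesha.nfsd.exportmgr.AddExport \
    string:/etc/ganesha/export.d/share-01.conf \
    string:'EXPORT(Export_Id=101)'

If that bus is private to the ceph-nfs container, manila-share, wired only to the host's system bus, has no way to issue the call.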
Same issue with the most up-to-date OSP deployment:

(undercloud) [stack@undercloud-0 ~]$ cat core_puddle_version
RHOS_TRUNK-16.0-RHEL-8-20191217.n.1
(undercloud) [stack@undercloud-0 ~]$ sudo yum list installed ceph-ansible
Installed Packages
ceph-ansible.noarch    4.0.6-1.el8cp    @rhelosp-ceph-4.0-tools-pending
(undercloud) [stack@undercloud-0 ~]$ ssh heat-admin@192.168.24.25 sudo cat /etc/systemd/system/ceph-nfs@.service
Warning: Permanently added '192.168.24.25' (ECDSA) to the list of known hosts.
[Unit]
Description=NFS-Ganesha file server
Documentation=http://github.com/nfs-ganesha/nfs-ganesha/wiki
After=network.target

[Service]
EnvironmentFile=-/etc/environment
ExecStartPre=-/usr/bin/podman rm ceph-nfs-%i
ExecStartPre=/usr/bin/mkdir -p /etc/ceph /etc/ganesha /var/lib/nfs/ganesha
ExecStart=/usr/bin/podman run --rm --net=host \
  -v /var/lib/ceph:/var/lib/ceph:z \
  -v /etc/ceph:/etc/ceph:z \
  -v /var/lib/nfs/ganesha:/var/lib/nfs/ganesha:z \
  -v /etc/ganesha:/etc/ganesha:z \
  -v /var/run/ceph:/var/run/ceph:z \
  -v /var/log/ceph:/var/log/ceph:z \
  --privileged \
  -v /var/run/dbus/system_bus_socket:/var/run/dbus/system_bus_socket \
  -v /etc/localtime:/etc/localtime:ro \
  -e CLUSTER=ceph \
  -e CEPH_DAEMON=NFS \
  -e CONTAINER_IMAGE=undercloud-0.ctlplane.redhat.local:8787/ceph/rhceph-4.0-rhel8:latest \
  \
  --name=ceph-nfs-pacemaker \
  undercloud-0.ctlplane.redhat.local:8787/ceph/rhceph-4.0-rhel8:latest
ExecStopPost=-/usr/bin/podman stop ceph-nfs-%i
Restart=always
RestartSec=10s
TimeoutStartSec=120
TimeoutStopSec=15

[Install]
WantedBy=multi-user.target

(undercloud) [stack@undercloud-0 ~]$ ssh heat-admin@192.168.24.25 sudo podman ps -a | grep ceph
Warning: Permanently added '192.168.24.25' (ECDSA) to the list of known hosts.
7893ed911009  undercloud-0.ctlplane.redhat.local:8787/ceph/rhceph-4.0-rhel8:latest  /opt/ceph-contain...  42 minutes ago     Up 42 minutes ago     ceph-mds-controller-0
6426df89582c  undercloud-0.ctlplane.redhat.local:8787/ceph/rhceph-4.0-rhel8:latest  /opt/ceph-contain...  45 minutes ago     Up 45 minutes ago     ceph-mgr-controller-0
f6f7dad234fa  undercloud-0.ctlplane.redhat.local:8787/ceph/rhceph-4.0-rhel8:latest  /opt/ceph-contain...  About an hour ago  Up About an hour ago  ceph-mon-controller-0
(undercloud) [stack@undercloud-0 ~]$ ssh heat-admin@192.168.24.25 sudo /usr/bin/podman run --rm --net=host -v /var/lib/ceph:/var/lib/ceph:z -v /etc/ceph:/etc/ceph:z -v /var/lib/nfs/ganesha:/var/lib/nfs/ganesha:z -v /etc/ganesha:/etc/ganesha:z -v /var/run/ceph:/var/run/ceph:z -v /var/log/ceph:/var/log/ceph:z --privileged -v /var/run/dbus/system_bus_socket:/var/run/dbus/system_bus_socket -v /etc/localtime:/etc/localtime:ro -e CLUSTER=ceph -e CEPH_DAEMON=NFS -e CONTAINER_IMAGE=undercloud-0.ctlplane.redhat.local:8787/ceph/rhceph-4.0-rhel8:latest --name=ceph-nfs-pacemaker undercloud-0.ctlplane.redhat.local:8787/ceph/rhceph-4.0-rhel8:latest
Warning: Permanently added '192.168.24.25' (ECDSA) to the list of known hosts.
2019-12-17 20:05:19  /opt/ceph-container/bin/entrypoint.sh: static: does not generate config
dbus-daemon[85]: Failed to start message bus: Failed to bind socket "/run/dbus/system_bus_socket": Address already in use
(undercloud) [stack@undercloud-0 ~]$
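It is also easy to confirm that it is the bind-mounted host socket occupying the path the container's dbus-daemon tries to bind (a quick check; assumes the image has ls on its PATH, and --entrypoint overrides the ceph entrypoint):

  # The mount target resolves to /run/dbus inside the container because
  # /var/run is a symlink to /run in the image
  sudo podman run --rm \
    -v /var/run/dbus/system_bus_socket:/var/run/dbus/system_bus_socket \
    --entrypoint ls \
    undercloud-0.ctlplane.redhat.local:8787/ceph/rhceph-4.0-rhel8:latest \
    -l /run/dbus/system_bus_socket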
*** Bug 1784488 has been marked as a duplicate of this bug. ***
Switching component to manila, at least for now, since there's nothing to change in THT to fix this (probably a ceph/container change is needed, but we can track the OSP side here).
*** Bug 1784749 has been marked as a duplicate of this bug. ***
Bug has been verified.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0313