
Bug 1784562

Summary: [RHOS16] manila-share fails to start due to ceph-nfs stopped with IPv4 network protocol
Product: [Red Hat Storage] Red Hat Ceph Storage
Reporter: Tom Barron <tbarron>
Component: Container
Assignee: Dimitri Savineau <dsavinea>
Status: CLOSED ERRATA
QA Contact: vhariria
Severity: high
Docs Contact:
Priority: high
Version: 4.0
CC: bniver, ccopello, ceph-eng-bugs, dsavinea, ealcaniz, emacchi, gabrioux, gcharot, gfidente, gouthamr, kdreyer, lkuchlan, mburns, pgrist, tbarron, vhariria, vimartin
Target Milestone: rc
Keywords: Regression, TestBlocker, Triaged
Target Release: 4.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: rhceph:ceph-4.0-rhel-8-containers-candidate-98784-20191223222411
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: 1783005
: 1784749 1793690 1801090 (view as bug list)
Environment:
Last Closed: 2020-01-31 14:44:57 UTC
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1575079, 1642481, 1783005, 1790756, 1793690, 1801090, 1802065, 1848582

Description Tom Barron 2019-12-17 17:28:25 UTC
Description of problem:

The manila-share service failed to start after deploying the overcloud with ceph-nfs and the IPv4 network protocol.

[root@controller-0 ~]# pcs status
Cluster name: tripleo_cluster
Stack: corosync
Current DC: controller-1 (version 2.0.2-3.el8-744a30d655) - partition with quorum
Last updated: Tue Dec 17 17:21:03 2019
Last change: Fri Dec 13 15:27:27 2019 by root via cibadmin on controller-0

12 nodes configured
40 resources configured

Online: [ controller-0 controller-1 controller-2 ]
GuestOnline: [ galera-bundle-0@controller-1 galera-bundle-1@controller-2 galera-bundle-2@controller-0 ovn-dbs-bundle-0@controller-0 ovn-dbs-bundle-1@controller-1 ovn-dbs-bundle-2@controller-2 rabbitmq-bundle-0@controller-1 rabbitmq-bundle-1@controller-2 rabbitmq-bundle-2@controller-0 ]

Full list of resources:

 ip-172.17.5.98	(ocf::heartbeat:IPaddr2):	Started controller-1
 Container bundle set: galera-bundle [undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-mariadb:pcmklatest]
   galera-bundle-0	(ocf::heartbeat:galera):	Master controller-1
   galera-bundle-1	(ocf::heartbeat:galera):	Master controller-2
   galera-bundle-2	(ocf::heartbeat:galera):	Master controller-0
 Container bundle set: rabbitmq-bundle [undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-rabbitmq:pcmklatest]
   rabbitmq-bundle-0	(ocf::heartbeat:rabbitmq-cluster):	Started controller-1
   rabbitmq-bundle-1	(ocf::heartbeat:rabbitmq-cluster):	Started controller-2
   rabbitmq-bundle-2	(ocf::heartbeat:rabbitmq-cluster):	Started controller-0
 ip-192.168.24.54	(ocf::heartbeat:IPaddr2):	Started controller-1
 ip-10.0.0.138	(ocf::heartbeat:IPaddr2):	Started controller-2
 ip-172.17.1.38	(ocf::heartbeat:IPaddr2):	Started controller-0
 ip-172.17.3.86	(ocf::heartbeat:IPaddr2):	Started controller-1
 ip-172.17.4.76	(ocf::heartbeat:IPaddr2):	Started controller-2
 Container bundle set: haproxy-bundle [undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-haproxy:pcmklatest]
   haproxy-bundle-podman-0	(ocf::heartbeat:podman):	Started controller-0
   haproxy-bundle-podman-1	(ocf::heartbeat:podman):	Started controller-1
   haproxy-bundle-podman-2	(ocf::heartbeat:podman):	Started controller-2
 Container bundle set: ovn-dbs-bundle [undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-ovn-northd:pcmklatest]
   ovn-dbs-bundle-0	(ocf::ovn:ovndb-servers):	Master controller-0
   ovn-dbs-bundle-1	(ocf::ovn:ovndb-servers):	Slave controller-1
   ovn-dbs-bundle-2	(ocf::ovn:ovndb-servers):	Slave controller-2
 ip-172.17.1.91	(ocf::heartbeat:IPaddr2):	Started controller-0
 ceph-nfs	(systemd:ceph-nfs@pacemaker):	Started controller-1
 Container bundle: openstack-cinder-volume [undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-cinder-volume:pcmklatest]
   openstack-cinder-volume-podman-0	(ocf::heartbeat:podman):	Started controller-1
 Container bundle: openstack-manila-share [undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-manila-share:pcmklatest]
   openstack-manila-share-podman-0	(ocf::heartbeat:podman):	Started controller-1

Failed Resource Actions:
* ceph-nfs_start_0 on controller-0 'unknown error' (1): call=21559, status=complete, exitreason='',
    last-rc-change='Tue Dec 17 15:47:08 2019', queued=0ms, exec=12284ms
* ceph-nfs_monitor_60000 on controller-1 'not running' (7): call=470, status=complete, exitreason='',
    last-rc-change='Tue Dec 17 17:20:15 2019', queued=0ms, exec=0ms

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled
[root@controller-0 ~]# 

Version-Release number of selected component (if applicable):


How reproducible:
100% 

Steps to Reproduce:
1. Deploy overcloud with manila with cephfs-nfs back end using RHOS_TRUNK-16.0-RHEL-8-20191206.n.1

2. run sudo pcs status on controller nodes

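As an aside, the failed actions can be pulled out of a saved `pcs status` transcript with a small helper. This is purely illustrative and not part of the original report; the section layout and patterns are assumptions based on the output above:

```shell
# Illustrative helper: list the resource actions reported under the
# "Failed Resource Actions:" section of a `pcs status` transcript
# read from stdin. The section layout is assumed from the transcript
# shown earlier in this report.
failed_actions() {
    sed -n '/^Failed Resource Actions:/,/^$/p' \
        | grep -o '^\* [A-Za-z0-9_-]*' \
        | cut -d' ' -f2
}
```

For the cluster state above, `pcs status | failed_actions` run on a controller would be expected to print `ceph-nfs_start_0` and `ceph-nfs_monitor_60000`.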

Actual results:
ceph-nfs and manila-share fail (unlike the related IPv6 bug, the VIP itself is up)


Expected results:

The ceph-nfs and manila-share services start and work.


Additional info:

This BZ is cloned from a BZ for a similar IPv6 deployment but differs from it in that the VIP used by ceph-nfs is up. The ceph-nfs service (aka ganesha) fails even though the VIP is up.

Comment 1 Tom Barron 2019-12-17 18:45:39 UTC
The cephfs-nfs (ganesha) container is failing with an "Address already in use" error on the dbus socket:

[root@controller-0 system]# /usr/bin/podman run --rm --net=host   -v /var/lib/ceph:/var/lib/ceph:z   -v /etc/ceph:/etc/ceph:z   -v /var/lib/nfs/ganesha:/var/lib/nfs/ganesha:z   -v /etc/ganesha:/etc/ganesha:z   -v /var/run/ceph:/var/run/ceph:z   -v /var/log/ceph:/var/log/ceph:z     --privileged   -v /var/run/dbus/system_bus_socket:/var/run/dbus/system_bus_socket   -v /etc/localtime:/etc/localtime:ro   -e CLUSTER=ceph   -e CEPH_DAEMON=NFS   -e CONTAINER_IMAGE=undercloud-0.ctlplane.redhat.local:8787/ceph/rhceph-4.0-rhel8:latest      --name=ceph-nfs-pacemaker   undercloud-0.ctlplane.redhat.local:8787/ceph/rhceph-4.0-rhel8:latest
2019-12-17 14:29:57  /opt/ceph-container/bin/entrypoint.sh: static: does not generate config
dbus-daemon[85]: Failed to start message bus: Failed to bind socket "/run/dbus/system_bus_socket": Address already in use

This error appears to be triggered by a change in the ceph container image [1] that runs a dedicated dbus daemon in the container.  ceph-ansible, however, deploys ceph-nfs with the dbus socket bind-mounted from the host (as in the command above).  On the host the socket is indeed already in use.

[1] https://github.com/ceph/ceph-container/pull/1517
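For reference, one way the deployment side could avoid the conflict is to stop bind-mounting the host bus socket, so the container's dedicated dbus-daemon can bind its own copy. The sketch below shows what such a change to the ceph-nfs@.service template might look like; it is a hypothetical illustration, not necessarily what [1] or the corresponding ceph-ansible fix actually does:

```
 ExecStart=/usr/bin/podman run --rm --net=host \
   ...
-    --privileged \
-  -v /var/run/dbus/system_bus_socket:/var/run/dbus/system_bus_socket \
   -v /etc/localtime:/etc/localtime:ro \
   ...
```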

Comment 2 Tom Barron 2019-12-17 19:00:08 UTC
OK, the ceph container image used for ceph-nfs (ganesha) and the ceph-ansible used to deploy the overcloud are not synchronized.  We have

ceph_image: rhceph-4.0-rhel8
ceph_namespace: docker-registry.upshift.redhat.com/ceph

which is running a dedicated dbus in the container, but

[root@undercloud-0 stack]# yum list installed ceph-ansible
Installed Packages
ceph-ansible.noarch                                                                    4.0.5-1.el8cp                                                                     @rhelosp-ceph-4-tools

which does not have https://github.com/ceph/ceph-ansible/pull/4760 from upstream -- a PR that modifies the template for the systemd file for ceph-nfs.
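Since the container image and ceph-ansible have to move together, a quick sanity check is to compare the installed ceph-ansible version against the first version carrying the template fix. A minimal sketch follows; the version values in the example are assumptions for illustration, and on a real undercloud the installed version would come from `rpm -q` and the fixed version from the errata:

```shell
# Compare two version strings with sort -V; succeeds when $1 >= $2.
version_ge() {
    [ "$(printf '%s\n' "$2" "$1" | sort -V | head -n1)" = "$2" ]
}

# Report whether an installed ceph-ansible predates a given fixed version.
# On a real host: installed=$(rpm -q --qf '%{VERSION}' ceph-ansible)
check_ceph_ansible() {
    installed="$1"; fixed="$2"
    if version_ge "$installed" "$fixed"; then
        echo "ceph-ansible $installed is new enough"
    else
        echo "ceph-ansible $installed predates fix ($fixed)"
    fi
}
```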

Comment 3 Tom Barron 2019-12-17 19:14:59 UTC
IIUC the motivation for running a dedicated dbus inside the cephfs-nfs container is to enable running that container without '--privileged'.  That's good motivation, but the point of using dbus is to let ganesha consumers dynamically signal needed exports so that ganesha can apply them.  The manila-share service, running in a separate container, does exactly this.

We don't currently have an alternative means of triggering the required export updates from manila-share, so I think that while synchronizing ceph-ansible with the container image change will enable ceph-nfs to run, it will still be broken in that manila won't be able to make use of it.
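For context on what manila-share needs from the shared socket: ganesha's dynamic-export interface is driven over D-Bus via `org.ganesha.nfsd.exportmgr.AddExport`. A sketch of the kind of call involved is below; the conf path and export id are illustrative, and the dbus-send invocation of course requires a live ganesha on a reachable system bus:

```shell
# Build the AddExport selection expression for a given export id
# (pure string helper, safe to run anywhere).
export_expr() {
    printf 'EXPORT(Export_Id=%s)' "$1"
}

# Ask a running ganesha to load an export block from a config file.
# Requires a reachable system bus exposing ganesha's export manager.
ganesha_add_export() {
    conf="$1"; export_id="$2"
    dbus-send --print-reply --system \
        --dest=org.ganesha.nfsd /org/ganesha/nfsd/ExportMgr \
        org.ganesha.nfsd.exportmgr.AddExport \
        "string:${conf}" "string:$(export_expr "$export_id")"
}
```

This is why a dbus daemon private to the ganesha container defeats the purpose: manila-share, running in its own container, cannot reach a bus that only exists inside the ganesha container.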

Comment 4 Tom Barron 2019-12-17 20:07:27 UTC
Same issue with the most up-to-date OSP deployment:

(undercloud) [stack@undercloud-0 ~]$ cat core_puddle_version 
RHOS_TRUNK-16.0-RHEL-8-20191217.n.1
(undercloud) [stack@undercloud-0 ~]$ sudo yum list installed ceph-ansible
Installed Packages
ceph-ansible.noarch                                                               4.0.6-1.el8cp                                                                @rhelosp-ceph-4.0-tools-pending
(undercloud) [stack@undercloud-0 ~]$ ssh heat-admin.24.25 sudo cat /etc/systemd/system/ceph-nfs@.service
Warning: Permanently added '192.168.24.25' (ECDSA) to the list of known hosts.
[Unit]
Description=NFS-Ganesha file server
Documentation=http://github.com/nfs-ganesha/nfs-ganesha/wiki
After=network.target

[Service]
EnvironmentFile=-/etc/environment
ExecStartPre=-/usr/bin/podman rm ceph-nfs-%i
ExecStartPre=/usr/bin/mkdir -p /etc/ceph /etc/ganesha /var/lib/nfs/ganesha
ExecStart=/usr/bin/podman run --rm --net=host \
  -v /var/lib/ceph:/var/lib/ceph:z \
  -v /etc/ceph:/etc/ceph:z \
  -v /var/lib/nfs/ganesha:/var/lib/nfs/ganesha:z \
  -v /etc/ganesha:/etc/ganesha:z \
  -v /var/run/ceph:/var/run/ceph:z \
  -v /var/log/ceph:/var/log/ceph:z \
    --privileged \
  -v /var/run/dbus/system_bus_socket:/var/run/dbus/system_bus_socket \
  -v /etc/localtime:/etc/localtime:ro \
  -e CLUSTER=ceph \
  -e CEPH_DAEMON=NFS \
  -e CONTAINER_IMAGE=undercloud-0.ctlplane.redhat.local:8787/ceph/rhceph-4.0-rhel8:latest \
   \
  --name=ceph-nfs-pacemaker \
  undercloud-0.ctlplane.redhat.local:8787/ceph/rhceph-4.0-rhel8:latest
ExecStopPost=-/usr/bin/podman stop ceph-nfs-%i
Restart=always
RestartSec=10s
TimeoutStartSec=120
TimeoutStopSec=15

[Install]
WantedBy=multi-user.target
(undercloud) [stack@undercloud-0 ~]$ ssh heat-admin.24.25 sudo podman ps -a | grep ceph
Warning: Permanently added '192.168.24.25' (ECDSA) to the list of known hosts.
7893ed911009  undercloud-0.ctlplane.redhat.local:8787/ceph/rhceph-4.0-rhel8:latest                             /opt/ceph-contain...  42 minutes ago     Up 42 minutes ago                    ceph-mds-controller-0
6426df89582c  undercloud-0.ctlplane.redhat.local:8787/ceph/rhceph-4.0-rhel8:latest                             /opt/ceph-contain...  45 minutes ago     Up 45 minutes ago                    ceph-mgr-controller-0
f6f7dad234fa  undercloud-0.ctlplane.redhat.local:8787/ceph/rhceph-4.0-rhel8:latest                             /opt/ceph-contain...  About an hour ago  Up About an hour ago                 ceph-mon-controller-0
(undercloud) [stack@undercloud-0 ~]$ ssh heat-admin.24.25 sudo /usr/bin/podman run --rm --net=host   -v /var/lib/ceph:/var/lib/ceph:z   -v /etc/ceph:/etc/ceph:z   -v /var/lib/nfs/ganesha:/var/lib/nfs/ganesha:z   -v /etc/ganesha:/etc/ganesha:z   -v /var/run/ceph:/var/run/ceph:z   -v /var/log/ceph:/var/log/ceph:z     --privileged   -v /var/run/dbus/system_bus_socket:/var/run/dbus/system_bus_socket   -v /etc/localtime:/etc/localtime:ro   -e CLUSTER=ceph   -e CEPH_DAEMON=NFS   -e CONTAINER_IMAGE=undercloud-0.ctlplane.redhat.local:8787/ceph/rhceph-4.0-rhel8:latest      --name=ceph-nfs-pacemaker   undercloud-0.ctlplane.redhat.local:8787/ceph/rhceph-4.0-rhel8:latest
Warning: Permanently added '192.168.24.25' (ECDSA) to the list of known hosts.
2019-12-17 20:05:19  /opt/ceph-container/bin/entrypoint.sh: static: does not generate config
dbus-daemon[85]: Failed to start message bus: Failed to bind socket "/run/dbus/system_bus_socket": Address already in use
(undercloud) [stack@undercloud-0 ~]$

Comment 5 lkuchlan 2019-12-18 06:40:55 UTC
*** Bug 1784488 has been marked as a duplicate of this bug. ***

Comment 6 Tom Barron 2019-12-18 13:19:03 UTC
Switching component to manila at least for now since there's nothing to change in THT (tripleo-heat-templates) to fix this (probably a ceph/container change is needed, but we can track the OSP side here).

Comment 7 Giulio Fidente 2019-12-18 15:55:45 UTC
*** Bug 1784749 has been marked as a duplicate of this bug. ***

Comment 22 vhariria 2020-01-28 15:18:54 UTC
Bug has been verified.

Comment 24 errata-xmlrpc 2020-01-31 14:44:57 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0313