Description of problem:

Manila-share failed to start after deploying the overcloud with ceph-nfs and the ipv4 network protocol.

[root@controller-0 ~]# pcs status
Cluster name: tripleo_cluster
Stack: corosync
Current DC: controller-1 (version 2.0.2-3.el8-744a30d655) - partition with quorum
Last updated: Tue Dec 17 17:21:03 2019
Last change: Fri Dec 13 15:27:27 2019 by root via cibadmin on controller-0

12 nodes configured
40 resources configured

Online: [ controller-0 controller-1 controller-2 ]
GuestOnline: [ galera-bundle-0@controller-1 galera-bundle-1@controller-2 galera-bundle-2@controller-0 ovn-dbs-bundle-0@controller-0 ovn-dbs-bundle-1@controller-1 ovn-dbs-bundle-2@controller-2 rabbitmq-bundle-0@controller-1 rabbitmq-bundle-1@controller-2 rabbitmq-bundle-2@controller-0 ]

Full list of resources:

 ip-172.17.5.98 (ocf::heartbeat:IPaddr2): Started controller-1
 Container bundle set: galera-bundle [undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-mariadb:pcmklatest]
   galera-bundle-0 (ocf::heartbeat:galera): Master controller-1
   galera-bundle-1 (ocf::heartbeat:galera): Master controller-2
   galera-bundle-2 (ocf::heartbeat:galera): Master controller-0
 Container bundle set: rabbitmq-bundle [undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-rabbitmq:pcmklatest]
   rabbitmq-bundle-0 (ocf::heartbeat:rabbitmq-cluster): Started controller-1
   rabbitmq-bundle-1 (ocf::heartbeat:rabbitmq-cluster): Started controller-2
   rabbitmq-bundle-2 (ocf::heartbeat:rabbitmq-cluster): Started controller-0
 ip-192.168.24.54 (ocf::heartbeat:IPaddr2): Started controller-1
 ip-10.0.0.138 (ocf::heartbeat:IPaddr2): Started controller-2
 ip-172.17.1.38 (ocf::heartbeat:IPaddr2): Started controller-0
 ip-172.17.3.86 (ocf::heartbeat:IPaddr2): Started controller-1
 ip-172.17.4.76 (ocf::heartbeat:IPaddr2): Started controller-2
 Container bundle set: haproxy-bundle [undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-haproxy:pcmklatest]
   haproxy-bundle-podman-0 (ocf::heartbeat:podman): Started controller-0
   haproxy-bundle-podman-1 (ocf::heartbeat:podman): Started controller-1
   haproxy-bundle-podman-2 (ocf::heartbeat:podman): Started controller-2
 Container bundle set: ovn-dbs-bundle [undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-ovn-northd:pcmklatest]
   ovn-dbs-bundle-0 (ocf::ovn:ovndb-servers): Master controller-0
   ovn-dbs-bundle-1 (ocf::ovn:ovndb-servers): Slave controller-1
   ovn-dbs-bundle-2 (ocf::ovn:ovndb-servers): Slave controller-2
 ip-172.17.1.91 (ocf::heartbeat:IPaddr2): Started controller-0
 ceph-nfs (systemd:ceph-nfs@pacemaker): Started controller-1
 Container bundle: openstack-cinder-volume [undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-cinder-volume:pcmklatest]
   openstack-cinder-volume-podman-0 (ocf::heartbeat:podman): Started controller-1
 Container bundle: openstack-manila-share [undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-manila-share:pcmklatest]
   openstack-manila-share-podman-0 (ocf::heartbeat:podman): Started controller-1

Failed Resource Actions:
* ceph-nfs_start_0 on controller-0 'unknown error' (1): call=21559, status=complete, exitreason='', last-rc-change='Tue Dec 17 15:47:08 2019', queued=0ms, exec=12284ms
* ceph-nfs_monitor_60000 on controller-1 'not running' (7): call=470, status=complete, exitreason='', last-rc-change='Tue Dec 17 17:20:15 2019', queued=0ms, exec=0ms

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled
[root@controller-0 ~]#

Version-Release number of selected component (if applicable):
How reproducible:
100%

Steps to Reproduce:
1. Deploy the overcloud with manila using the cephfs-nfs back end on RHOS_TRUNK-16.0-RHEL-8-20191206.n.1
2. Run 'sudo pcs status' on the controller nodes

Actual results:
ceph-nfs and manila-share fail. (The failed IPv6 VIP from the bug this was cloned from does not apply here; in this IPv4 deployment the VIP is up, per the pcs output above.)

Expected results:
The ceph-nfs and manila-share services work.

Additional info:
This BZ is cloned from a BZ for a similar IPv6 deployment, but differs from it in that the VIP used by ceph-nfs is up. The ceph-nfs service (aka ganesha) is failing even though the VIP is up.
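For triage, the failed start can be inspected directly on the affected controllers (a diagnostic sketch; the resource and unit names are taken from the pcs output above):

  # Show the failure counters pacemaker has recorded for the resource
  pcs resource failcount show ceph-nfs

  # ceph-nfs is a systemd-class resource (systemd:ceph-nfs@pacemaker), so the
  # unit's journal on the failing node carries the underlying error
  journalctl -u ceph-nfs@pacemaker --since today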
The cephfs-nfs (ganesha) container is failing with an address-in-use error on the dbus socket:

[root@controller-0 system]# /usr/bin/podman run --rm --net=host \
  -v /var/lib/ceph:/var/lib/ceph:z \
  -v /etc/ceph:/etc/ceph:z \
  -v /var/lib/nfs/ganesha:/var/lib/nfs/ganesha:z \
  -v /etc/ganesha:/etc/ganesha:z \
  -v /var/run/ceph:/var/run/ceph:z \
  -v /var/log/ceph:/var/log/ceph:z \
  --privileged \
  -v /var/run/dbus/system_bus_socket:/var/run/dbus/system_bus_socket \
  -v /etc/localtime:/etc/localtime:ro \
  -e CLUSTER=ceph \
  -e CEPH_DAEMON=NFS \
  -e CONTAINER_IMAGE=undercloud-0.ctlplane.redhat.local:8787/ceph/rhceph-4.0-rhel8:latest \
  --name=ceph-nfs-pacemaker \
  undercloud-0.ctlplane.redhat.local:8787/ceph/rhceph-4.0-rhel8:latest
2019-12-17 14:29:57  /opt/ceph-container/bin/entrypoint.sh: static: does not generate config
dbus-daemon[85]: Failed to start message bus: Failed to bind socket "/run/dbus/system_bus_socket": Address already in use

This error appears to be triggered by a change in the ceph container image [1] that runs a dedicated dbus daemon inside the container. ceph-ansible, however, deploys ceph-nfs with the dbus socket bind-mounted from the host (as in the command above), and on the host the socket is indeed already in use.

[1] https://github.com/ceph/ceph-container/pull/1517
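The conflict is easy to confirm from the host side (a minimal check, assuming the host's own dbus-daemon holds the socket):

  # List listening unix sockets with their owning processes; the host
  # dbus-daemon already owns /run/dbus/system_bus_socket, and the bind mount
  # exposes that same socket at the path the in-container dbus-daemon binds
  ss -xlp | grep system_bus_socket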
OK, the ceph container image used for ceph-nfs (ganesha) and the ceph-ansible used to deploy the overcloud are not synchronized. We have

  ceph_image: rhceph-4.0-rhel8
  ceph_namespace: docker-registry.upshift.redhat.com/ceph

which runs a dedicated dbus inside the container, but the installed ceph-ansible

[root@undercloud-0 stack]# yum list installed ceph-ansible
Installed Packages
ceph-ansible.noarch    4.0.5-1.el8cp    @rhelosp-ceph-4-tools

does not include https://github.com/ceph/ceph-ansible/pull/4760 from upstream -- a PR that modifies the template for the ceph-nfs systemd unit file.
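To illustrate what synchronizing would mean (this is an assumption about the fix, not the actual patch -- see the upstream PR for the real change): the generated unit file would stop bind-mounting the host's dbus socket so the container's dedicated dbus-daemon can bind /run/dbus/system_bus_socket itself, e.g.:

  # Hypothetical change to the rendered ExecStart, for illustration only
  -  -v /var/run/dbus/system_bus_socket:/var/run/dbus/system_bus_socket \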
IIUC, the motivation for running a dedicated dbus inside the cephfs-nfs container is to allow running that container without '--privileged'. That is good motivation, but the whole point of using dbus here is to let ganesha consumers dynamically signal the exports they need so that ganesha can implement them. The manila-share service, running in a separate container, does exactly this. We currently have no alternative means of triggering the required export updates from manila-share, so I think that while synchronizing ceph-ansible with the container image change will let ceph-nfs run, it will still be broken in that manila won't be able to make use of it.
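To make the dependency concrete, this is roughly how export updates are signaled through ganesha's D-Bus export manager (a sketch; the export file path and Export_Id are hypothetical, and manila's ganesha driver makes the equivalent call programmatically rather than via dbus-send):

  # Ask a running ganesha to load a new export without a restart; this only
  # works if the caller can reach the bus ganesha is actually listening on
  dbus-send --print-reply --system \
    --dest=org.ganesha.nfsd /org/ganesha/nfsd/ExportMgr \
    org.ganesha.nfsd.exportmgr.AddExport \
    string:/etc/ganesha/export.d/share-01.conf \
    string:'EXPORT(Export_Id=101)'

If that bus is private to the ceph-nfs container, manila-share, wired only to the host's system bus, has no way to issue the call.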
Same issue with the most up-to-date OSP deployment:

(undercloud) [stack@undercloud-0 ~]$ cat core_puddle_version
RHOS_TRUNK-16.0-RHEL-8-20191217.n.1
(undercloud) [stack@undercloud-0 ~]$ sudo yum list installed ceph-ansible
Installed Packages
ceph-ansible.noarch    4.0.6-1.el8cp    @rhelosp-ceph-4.0-tools-pending
(undercloud) [stack@undercloud-0 ~]$ ssh heat-admin@192.168.24.25 sudo cat /etc/systemd/system/ceph-nfs@.service
Warning: Permanently added '192.168.24.25' (ECDSA) to the list of known hosts.
[Unit]
Description=NFS-Ganesha file server
Documentation=http://github.com/nfs-ganesha/nfs-ganesha/wiki
After=network.target

[Service]
EnvironmentFile=-/etc/environment
ExecStartPre=-/usr/bin/podman rm ceph-nfs-%i
ExecStartPre=/usr/bin/mkdir -p /etc/ceph /etc/ganesha /var/lib/nfs/ganesha
ExecStart=/usr/bin/podman run --rm --net=host \
  -v /var/lib/ceph:/var/lib/ceph:z \
  -v /etc/ceph:/etc/ceph:z \
  -v /var/lib/nfs/ganesha:/var/lib/nfs/ganesha:z \
  -v /etc/ganesha:/etc/ganesha:z \
  -v /var/run/ceph:/var/run/ceph:z \
  -v /var/log/ceph:/var/log/ceph:z \
  --privileged \
  -v /var/run/dbus/system_bus_socket:/var/run/dbus/system_bus_socket \
  -v /etc/localtime:/etc/localtime:ro \
  -e CLUSTER=ceph \
  -e CEPH_DAEMON=NFS \
  -e CONTAINER_IMAGE=undercloud-0.ctlplane.redhat.local:8787/ceph/rhceph-4.0-rhel8:latest \
  \
  --name=ceph-nfs-pacemaker \
  undercloud-0.ctlplane.redhat.local:8787/ceph/rhceph-4.0-rhel8:latest
ExecStopPost=-/usr/bin/podman stop ceph-nfs-%i
Restart=always
RestartSec=10s
TimeoutStartSec=120
TimeoutStopSec=15

[Install]
WantedBy=multi-user.target

(undercloud) [stack@undercloud-0 ~]$ ssh heat-admin@192.168.24.25 sudo podman ps -a | grep ceph
Warning: Permanently added '192.168.24.25' (ECDSA) to the list of known hosts.
7893ed911009  undercloud-0.ctlplane.redhat.local:8787/ceph/rhceph-4.0-rhel8:latest  /opt/ceph-contain...  42 minutes ago     Up 42 minutes ago     ceph-mds-controller-0
6426df89582c  undercloud-0.ctlplane.redhat.local:8787/ceph/rhceph-4.0-rhel8:latest  /opt/ceph-contain...  45 minutes ago     Up 45 minutes ago     ceph-mgr-controller-0
f6f7dad234fa  undercloud-0.ctlplane.redhat.local:8787/ceph/rhceph-4.0-rhel8:latest  /opt/ceph-contain...  About an hour ago  Up About an hour ago  ceph-mon-controller-0
(undercloud) [stack@undercloud-0 ~]$ ssh heat-admin@192.168.24.25 sudo /usr/bin/podman run --rm --net=host -v /var/lib/ceph:/var/lib/ceph:z -v /etc/ceph:/etc/ceph:z -v /var/lib/nfs/ganesha:/var/lib/nfs/ganesha:z -v /etc/ganesha:/etc/ganesha:z -v /var/run/ceph:/var/run/ceph:z -v /var/log/ceph:/var/log/ceph:z --privileged -v /var/run/dbus/system_bus_socket:/var/run/dbus/system_bus_socket -v /etc/localtime:/etc/localtime:ro -e CLUSTER=ceph -e CEPH_DAEMON=NFS -e CONTAINER_IMAGE=undercloud-0.ctlplane.redhat.local:8787/ceph/rhceph-4.0-rhel8:latest --name=ceph-nfs-pacemaker undercloud-0.ctlplane.redhat.local:8787/ceph/rhceph-4.0-rhel8:latest
Warning: Permanently added '192.168.24.25' (ECDSA) to the list of known hosts.
2019-12-17 20:05:19  /opt/ceph-container/bin/entrypoint.sh: static: does not generate config
dbus-daemon[85]: Failed to start message bus: Failed to bind socket "/run/dbus/system_bus_socket": Address already in use
(undercloud) [stack@undercloud-0 ~]$
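It is also easy to confirm that it is the bind-mounted host socket occupying the path the container's dbus-daemon tries to bind (a quick check; assumes the image has ls on its PATH, and --entrypoint overrides the ceph entrypoint):

  # The mount target resolves to /run/dbus inside the container because
  # /var/run is a symlink to /run in the image
  sudo podman run --rm \
    -v /var/run/dbus/system_bus_socket:/var/run/dbus/system_bus_socket \
    --entrypoint ls \
    undercloud-0.ctlplane.redhat.local:8787/ceph/rhceph-4.0-rhel8:latest \
    -l /run/dbus/system_bus_socket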
*** Bug 1784488 has been marked as a duplicate of this bug. ***
Switching component to manila, at least for now, since there's nothing to change in THT to fix this (probably a ceph/container change is needed, but we can track the OSP side here).
*** Bug 1784749 has been marked as a duplicate of this bug. ***
Bug has been verified.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0313