Description of problem:
After an osp17.0.0 -> osp17.0.1 upgrade, the manila ceph-nfs resource fails to start:

Error: ceph-nfs start on leaf1-controller-0 returned 'error' because 'failed'

"pcs resource cleanup" also did not fix the problem:

  * ip-172.20.9.214 (ocf:heartbeat:IPaddr2): Stopped
  * ceph-nfs (systemd:ceph-nfs@pacemaker): Stopped
  * Container bundle: openstack-manila-share [bgp-undercloud.ctlplane.openstack.lab:8787/rhosp-rhel9/openstack-manila-share:pcmklatest]:
    * openstack-manila-share-podman-0 (ocf:heartbeat:podman): Stopped
  * ip-172.20.12.100 (ocf:heartbeat:IPaddr2): Started leaf1-controller-0
  * Container bundle set: redis-bundle [bgp-undercloud.ctlplane.openstack.lab:8787/rhosp-rhel9/openstack-redis:pcmklatest]:
    * redis-bundle-0 (ocf:heartbeat:redis): Unpromoted leaf1-controller-0
    * redis-bundle-1 (ocf:heartbeat:redis): Promoted leaf1-controller-1
    * redis-bundle-2 (ocf:heartbeat:redis): Unpromoted leaf1-controller-2

Failed Resource Actions:
  * ceph-nfs start on leaf1-controller-0 returned 'error' because 'failed' at Mon Mar 20 19:36:18 2023 after 3.352s
  * ceph-nfs start on leaf1-controller-1 returned 'error' because 'failed' at Mon Mar 20 19:36:25 2023 after 3.322s
  * ceph-nfs start on leaf1-controller-2 returned 'error' because 'failed' at Mon Mar 20 19:36:32 2023 after 3.332s

pacemaker logs:

Mar 20 19:36:32.528 leaf1-controller-0 pacemaker-based [334922] (cib_perform_op) info: + /cib/status/node_state[@id='3']/lrm[@id='3']/lrm_resources/lrm_resource[@id='ceph-nfs']/lrm_rsc_op[@id='ceph-nfs_last_0']: @transition-magic=0:1;148:11:0:373d2006-4ae7-4c49-9bd0-50a85521cf5d, @exit-reason=failed, @call-id=145, @rc-code=1, @op-status=0, @last-rc-change=1679355392, @exec-time=3332
Mar 20 19:36:32.528 leaf1-controller-0 pacemaker-based [334922] (cib_perform_op) info: ++ /cib/status/node_state[@id='3']/lrm[@id='3']/lrm_resources/lrm_resource[@id='ceph-nfs']: <lrm_rsc_op id="ceph-nfs_last_failure_0" operation_key="ceph-nfs_start_0" operation="start" crm-debug-origin="do_update_resource" crm_feature_set="3.13.0" transition-key="148:11:0:373d2006-4ae7-4c49-9bd0-50a85521cf5d" transition-magic="0:1;148:11:0:373d2006-4ae7-4c49-9bd0-50a85521cf5d" exit-reason="failed" on_node="leaf1-controller-2" call-id="145" rc-code="1" op-status="0" interval="0" last-rc-change="1679355392" exec-time="3332" queue-time="0" op-digest="f2317cad3d54cec5d7d7aa7d0bf35cf8"/>
Mar 20 19:36:32.529 leaf1-controller-0 pacemaker-based [334922] (cib_process_request) info: Completed cib_modify operation for section status: OK (rc=0, origin=leaf1-controller-2/crmd/277, version=0.209.41)
Mar 20 19:36:32.530 leaf1-controller-0 pacemaker-attrd [334925] (attrd_peer_update) notice: Setting fail-count-ceph-nfs#start_0[leaf1-controller-2]: (unset) -> INFINITY | from leaf1-controller-2
Mar 20 19:36:32.531 leaf1-controller-0 pacemaker-attrd [334925] (attrd_peer_update) notice: Setting last-failure-ceph-nfs#start_0[leaf1-controller-2]: (unset) -> 1679355392 | from leaf1-controller-2
Mar 20 19:36:32.574 leaf1-controller-0 pacemaker-based [334922] (cib_perform_op) info: Diff: --- 0.209.41 2

I was able to work around it by manually starting the service and invoking "pcs resource cleanup" afterwards:

[root@leaf1-controller-0 manila]# systemctl start ceph-nfs
[root@leaf1-controller-0 manila]# systemctl status ceph-nfs
● ceph-nfs - NFS-Ganesha file server
     Loaded: loaded (/etc/systemd/system/ceph-nfs@.service; disabled; vendor preset: disabled)
     Active: active (running) since Tue 2023-03-21 12:56:17 EDT; 6s ago
       Docs: http://github.com/nfs-ganesha/nfs-ganesha/wiki
    Process: 318032 ExecStartPre=/usr/bin/rm -f //run/ceph-nfs //run/ceph-nfs (code=exited, status=0/SUCCESS)
    Process: 318063 ExecStartPre=/usr/bin/podman rm --storage ceph-nfs-pacemaker (code=exited, status=1/FAILURE)
    Process: 318335 ExecStartPre=/usr/bin/mkdir -p /var/log/ceph /var/log/ganesha (code=exited, status=0/SUCCESS)
    Process: 318353 ExecStartPre=/usr/bin/podman rm ceph-nfs-pacemaker (code=exited, status=1/FAILURE)
    Process: 318754 ExecStartPre=/usr/bin/mkdir -p /etc/ceph /etc/ganesha /var/lib/nfs/ganesha /var/log/ganesha (code=exited, status=0/SUCCESS)
    Process: 318762 ExecStart=/usr/bin/podman run --rm --net=host -d --log-driver journald --conmon-pidfile //run/ceph-nfs --cidfile //run/ceph-nfs -v /var/lib/ceph:/v>
   Main PID: 319193 (conmon)
      Tasks: 1 (limit: 816568)
     Memory: 992.0K
        CPU: 738ms
     CGroup: /system.slice/system-ceph\x2dnfs.slice/ceph-nfs
             └─319193 /usr/bin/conmon --api-version 1 -c a300370ff5d32a4a830199f8515460043007338329f69cd770fc32f17e106855 -u a300370ff5d32a4a830199f8515460043007338329f69cd770fc32f17e106855 -r /usr/bin/crun -b />

Mar 21 12:56:18 leaf1-controller-0 ceph-nfs-pacemaker[319193]: 21/03/2023 12:56:18 : epoch 6419e1b2 : leaf1-controller-0 : ganesha.nfsd-72[main] nfs_Start_threads :THREAD :EVENT :gsh_dbusthread was started succe>
Mar 21 12:56:18 leaf1-controller-0 ceph-nfs-pacemaker[319193]: 21/03/2023 12:56:18 : epoch 6419e1b2 : leaf1-controller-0 : ganesha.nfsd-72[main] nfs_Start_threads :THREAD :EVENT :admin thread

Version-Release number of selected component (if applicable):
osp17.0.0 -> osp17.0.1

How reproducible:
during upgrade

Steps to Reproduce:
1. deploy osp 17.0.0 with manila ceph-nfs enabled
2. upgrade to osp17.0.1
3.

Actual results:
ceph-nfs won't start

Expected results:
ceph-nfs starts

Additional info:

[root@leaf1-controller-1 manila]# sudo cat /etc/systemd/system/ceph-nfs@.service
[Unit]
Description=NFS-Ganesha file server
Documentation=http://github.com/nfs-ganesha/nfs-ganesha/wiki
After=network.target

[Service]
EnvironmentFile=-/etc/environment
ExecStartPre=-/usr/bin/rm -f /%t/%n-pid /%t/%n-cid
ExecStartPre=-/usr/bin/podman rm --storage ceph-nfs-%i
ExecStartPre=-/usr/bin/mkdir -p /var/log/ceph /var/log/ganesha
ExecStartPre=-/usr/bin/podman rm ceph-nfs-%i
ExecStartPre=/usr/bin/mkdir -p /etc/ceph /etc/ganesha /var/lib/nfs/ganesha /var/log/ganesha
ExecStart=/usr/bin/podman run --rm --net=host \
  -d --log-driver journald --conmon-pidfile /%t/%n-pid --cidfile /%t/%n-cid \
  -v /var/lib/ceph:/var/lib/ceph:z \
  -v /var/lib/tripleo-config/ceph:/etc/ceph:z \
  -v /var/lib/nfs/ganesha:/var/lib/nfs/ganesha:z \
  -v /etc/ganesha:/etc/ganesha:z \
  -v /var/run/ceph:/var/run/ceph:z \
  -v /var/log/ceph:/var/log/ceph:z \
  -v /var/log/ganesha:/var/log/ganesha:z \
  --privileged \
  -v /var/run/dbus/system_bus_socket:/var/run/dbus/system_bus_socket \
  -v /etc/localtime:/etc/localtime:ro \
  -e CLUSTER=ceph \
  -e CEPH_DAEMON=NFS \
  -e CONTAINER_IMAGE=bgp-undercloud.ctlplane.openstack.lab:8787/rhceph/rhceph-5-rhel8:latest \
  --name=ceph-nfs-pacemaker \
  bgp-undercloud.ctlplane.openstack.lab:8787/rhceph/rhceph-5-rhel8:latest
ExecStop=-/usr/bin/sh -c "/usr/bin/podman rm -f `cat /%t/%n-cid`"
KillMode=none
Restart=always
RestartSec=10s
TimeoutStartSec=120
TimeoutStopSec=15
Type=forking
PIDFile=/%t/%n-pid

[Install]
WantedBy=multi-user.target
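Note that the unit above bind-mounts /var/run/ceph into the container (-v /var/run/ceph:/var/run/ceph:z) but, unlike /var/log/ceph, /etc/ganesha and the other host paths, none of its ExecStartPre lines create that directory. As a minimal sketch (my own illustration, not the fix delivered in any advisory), an extra pre-start line in the same style as the existing ones would mask the failure:

[Service]
# sketch: pre-create the bind-mount source, mirroring the mkdir lines already
# present in the unit; the leading '-' keeps a failure here non-fatal
ExecStartPre=-/usr/bin/mkdir -p /var/run/ceph

Alternatively, RuntimeDirectory=ceph would have systemd create /run/ceph for the service itself; I have not tested either change against this unit.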
I worked with Goutham Pacha Ravi to further troubleshoot this issue, and we noticed that the reason ceph-nfs was not starting was a missing /var/run/ceph directory on the controllers:

Jun 05 13:04:58 leaf1-controller-2 systemd[1]: 4a49419f381c550ce6224e2ff9113f52d1137cd4ba280dedd286e445772bebba.service: Deactivated successfully.
Jun 05 13:04:58 leaf1-controller-2 haproxy[7175]: 172.20.12.118:33744 [05/Jun/2023:13:04:58.817] neutron neutron/leaf1-controller-1.internalapi.openstack.lab 0/0/0/161/161 200 187 - - ---- 918/22/0/0/0 0/0 "GET /v2.0/networks/73c5fbf9-04f4-4d1e-8409-90c9a95e6ca9?fields=segments HTTP/1.1"
Jun 05 13:04:58 leaf1-controller-2 podman[367072]: Error: no container with name or ID "ceph-nfs-pacemaker" found: no such container
Jun 05 13:04:59 leaf1-controller-2 haproxy[7175]: 172.20.12.118:33744 [05/Jun/2023:13:04:58.981] neutron neutron/leaf1-controller-2.internalapi.openstack.lab 0/0/0/158/158 200 252 - - ---- 918/22/0/0/0 0/0 "GET /v2.0/networks/73c5fbf9-04f4-4d1e-8409-90c9a95e6ca9?fields=provider%3Aphysical_network&fields=provider%3Anetwork_type HTTP/1.1"
Jun 05 13:04:59 leaf1-controller-2 podman[367154]: Error: statfs /var/run/ceph: no such file or directory
Jun 05 13:04:59 leaf1-controller-2 systemd[1]: ceph-nfs: Control process exited, code=exited, status=125/n/a
Jun 05 13:04:59 leaf1-controller-2 systemd[1]: ceph-nfs: Failed with result 'exit-code'.
Jun 05 13:04:59 leaf1-controller-2 systemd[1]: Failed to start Cluster Controlled ceph-nfs@pacemaker.
Jun 05 13:04:59 leaf1-controller-2 runuser[367227]: pam_unix(runuser:session): session opened for user rabbitmq(uid=42439) by (uid=0)
Jun 05 13:05:00 leaf1-controller-2 podman[366953]: 2023-06-05 13:05:00.609997374 -0400 EDT m=+1.879562362 container exec_died af8ef465e63dd4d99b71a5bf3eb808687539bbd73e688cd0a0bc98db020ae5cf (image=bgp-undercloud.ctlplane.openstack.lab:8787/rhosp-rhel9/openstack-swift-object:17.0, name=swift_rsync, health_status=, execID=89b483ab2860a6a21555f5a61ee13acc1d600bf09c08f418ff9bb025bbc8f6b7)
Jun 05 13:05:00 leaf1-controller-2 systemd[1]: af8ef465e63dd4d99b71a5bf3eb808687539bbd73e688cd0a0bc98db020ae5cf.service: Deactivated successfully.
Jun 05 13:05:00 leaf1-controller-2 pacemaker-controld[3248]: notice: Result of start operation for ceph-nfs on leaf1-controller-2: error (failed)
Jun 05 13:05:00 leaf1-controller-2 pacemaker-attrd[3246]: notice: Setting fail-count-ceph-nfs#start_0[leaf1-controller-2]: (unset) -> INFINITY
Jun 05 13:05:00 leaf1-controller-2 pacemaker-attrd[3246]: notice: Setting last-failure-ceph-nfs#start_0[leaf1-controller-2]: (unset) -> 1685984700
Jun 05 13:05:00 leaf1-controller-2 runuser[367227]: pam_unix(runuser:session): session closed for user rabbitmq
Jun 05 13:05:00 leaf1-controller-2 runuser[367357]: pam_unix(runuser:session): session opened for user rabbitmq(uid=42439) by (uid=0)
Jun 05 13:05:01 leaf1-controller-2 pacemaker-controld[3248]: notice: Requesting local execution of stop operation for ceph-nfs on leaf1-controller-2
Jun 05 13:05:01 leaf1-controller-2 systemd[1]: Reloading.

The workaround was to manually create the directory on each controller and then let pacemaker restart the service:

[root@leaf1-controller-0 ~]# mkdir /var/run/ceph
[root@leaf1-controller-0 ~]# pcs resource cleanup
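For anyone hitting the same podman "statfs ... no such file or directory" error, a quick sanity check of the host-side bind-mount sources (a sketch, assuming a root shell on each controller; the directory list is taken from the unit file above):

# verify every host directory the ceph-nfs container expects to bind-mount
for d in /var/run/ceph /var/lib/ceph /var/lib/tripleo-config/ceph /var/lib/nfs/ganesha /etc/ganesha /var/log/ceph /var/log/ganesha; do
    test -d "$d" && echo "ok      $d" || echo "MISSING $d"
done

In the logs above only /var/run/ceph was reported missing.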
Hello,

This issue has occurred again on the same cluster; this time, after rebooting the controllers, /var/run/ceph no longer exists:

[root@leaf1-controller-0 ~]# ls /var/run/ceph
ls: cannot access '/var/run/ceph': No such file or directory
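That is consistent with /var/run being a symlink to /run, a tmpfs that is recreated empty at boot, so anything created there by hand is lost on reboot. A quick way to confirm (sketch):

# /var/run -> /run, and /run is a tmpfs rebuilt at every boot
findmnt -n -o FSTYPE /run    # expected to print: tmpfs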
(In reply to Chris Janiszewski from comment #19)
> Hello,
>
> This issue has occurred again on the same cluster, this time after rebooting
> controllers /var/run/ceph no longer exist:
> [root@leaf1-controller-0 ~]# ls /var/run/ceph
> ls: cannot access '/var/run/ceph': No such file or directory

Ack; since the directory is deleted after a reboot, systemd could be made to handle (re)creating it. Thanks for reporting this.
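Until the fix is available, a reboot-persistent variant of the earlier workaround could use systemd-tmpfiles (a sketch of my own, not the change shipped in the advisory; mode and ownership simply match what a plain "mkdir /var/run/ceph" produced above):

# /etc/tmpfiles.d/ceph-nfs.conf -- have systemd recreate /run/ceph at every boot
d /run/ceph 0755 root root -

# apply it immediately on each controller without rebooting, then let pacemaker retry
systemd-tmpfiles --create /etc/tmpfiles.d/ceph-nfs.conf
pcs resource cleanup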
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Red Hat OpenStack Platform 17.1.2 bug fix and enhancement advisory), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2024:0209