Description of problem:
After an osp17.0.0 -> osp17.0.1 upgrade, the manila ceph-nfs resource fails to start.

Error: ceph-nfs start on leaf1-controller-0 returned 'error' because 'failed'

Running "pcs resource cleanup" also did not fix the problem:

  * ip-172.20.9.214 (ocf:heartbeat:IPaddr2): Stopped
  * ceph-nfs (systemd:ceph-nfs@pacemaker): Stopped
  * Container bundle: openstack-manila-share [bgp-undercloud.ctlplane.openstack.lab:8787/rhosp-rhel9/openstack-manila-share:pcmklatest]:
    * openstack-manila-share-podman-0 (ocf:heartbeat:podman): Stopped
  * ip-172.20.12.100 (ocf:heartbeat:IPaddr2): Started leaf1-controller-0
  * Container bundle set: redis-bundle [bgp-undercloud.ctlplane.openstack.lab:8787/rhosp-rhel9/openstack-redis:pcmklatest]:
    * redis-bundle-0 (ocf:heartbeat:redis): Unpromoted leaf1-controller-0
    * redis-bundle-1 (ocf:heartbeat:redis): Promoted leaf1-controller-1
    * redis-bundle-2 (ocf:heartbeat:redis): Unpromoted leaf1-controller-2

Failed Resource Actions:
  * ceph-nfs start on leaf1-controller-0 returned 'error' because 'failed' at Mon Mar 20 19:36:18 2023 after 3.352s
  * ceph-nfs start on leaf1-controller-1 returned 'error' because 'failed' at Mon Mar 20 19:36:25 2023 after 3.322s
  * ceph-nfs start on leaf1-controller-2 returned 'error' because 'failed' at Mon Mar 20 19:36:32 2023 after 3.332s

pacemaker logs:

Mar 20 19:36:32.528 leaf1-controller-0 pacemaker-based [334922] (cib_perform_op) info: + /cib/status/node_state[@id='3']/lrm[@id='3']/lrm_resources/lrm_resource[@id='ceph-nfs']/lrm_rsc_op[@id='ceph-nfs_last_0']: @transition-magic=0:1;148:11:0:373d2006-4ae7-4c49-9bd0-50a85521cf5d, @exit-reason=failed, @call-id=145, @rc-code=1, @op-status=0, @last-rc-change=1679355392, @exec-time=3332
Mar 20 19:36:32.528 leaf1-controller-0 pacemaker-based [334922] (cib_perform_op) info: ++ /cib/status/node_state[@id='3']/lrm[@id='3']/lrm_resources/lrm_resource[@id='ceph-nfs']: <lrm_rsc_op id="ceph-nfs_last_failure_0" operation_key="ceph-nfs_start_0" operation="start" crm-debug-origin="do_update_resource" crm_feature_set="3.13.0" transition-key="148:11:0:373d2006-4ae7-4c49-9bd0-50a85521cf5d" transition-magic="0:1;148:11:0:373d2006-4ae7-4c49-9bd0-50a85521cf5d" exit-reason="failed" on_node="leaf1-controller-2" call-id="145" rc-code="1" op-status="0" interval="0" last-rc-change="1679355392" exec-time="3332" queue-time="0" op-digest="f2317cad3d54cec5d7d7aa7d0bf35cf8"/>
Mar 20 19:36:32.529 leaf1-controller-0 pacemaker-based [334922] (cib_process_request) info: Completed cib_modify operation for section status: OK (rc=0, origin=leaf1-controller-2/crmd/277, version=0.209.41)
Mar 20 19:36:32.530 leaf1-controller-0 pacemaker-attrd [334925] (attrd_peer_update) notice: Setting fail-count-ceph-nfs#start_0[leaf1-controller-2]: (unset) -> INFINITY | from leaf1-controller-2
Mar 20 19:36:32.531 leaf1-controller-0 pacemaker-attrd [334925] (attrd_peer_update) notice: Setting last-failure-ceph-nfs#start_0[leaf1-controller-2]: (unset) -> 1679355392 | from leaf1-controller-2
Mar 20 19:36:32.574 leaf1-controller-0 pacemaker-based [334922] (cib_perform_op) info: Diff: --- 0.209.41 2

I was able to work around it by manually starting the service and invoking "pcs resource cleanup" afterwards:

[root@leaf1-controller-0 manila]# systemctl start ceph-nfs
[root@leaf1-controller-0 manila]# systemctl status ceph-nfs
● ceph-nfs - NFS-Ganesha file server
     Loaded: loaded (/etc/systemd/system/ceph-nfs@.service; disabled; vendor preset: disabled)
     Active: active (running) since Tue 2023-03-21 12:56:17 EDT; 6s ago
       Docs: http://github.com/nfs-ganesha/nfs-ganesha/wiki
    Process: 318032 ExecStartPre=/usr/bin/rm -f //run/ceph-nfs //run/ceph-nfs (code=exited, status=0/SUCCESS)
    Process: 318063 ExecStartPre=/usr/bin/podman rm --storage ceph-nfs-pacemaker (code=exited, status=1/FAILURE)
    Process: 318335 ExecStartPre=/usr/bin/mkdir -p /var/log/ceph /var/log/ganesha (code=exited, status=0/SUCCESS)
    Process: 318353 ExecStartPre=/usr/bin/podman rm ceph-nfs-pacemaker (code=exited, status=1/FAILURE)
    Process: 318754 ExecStartPre=/usr/bin/mkdir -p /etc/ceph /etc/ganesha /var/lib/nfs/ganesha /var/log/ganesha (code=exited, status=0/SUCCESS)
    Process: 318762 ExecStart=/usr/bin/podman run --rm --net=host -d --log-driver journald --conmon-pidfile //run/ceph-nfs --cidfile //run/ceph-nfs -v /var/lib/ceph:/v>
   Main PID: 319193 (conmon)
      Tasks: 1 (limit: 816568)
     Memory: 992.0K
        CPU: 738ms
     CGroup: /system.slice/system-ceph\x2dnfs.slice/ceph-nfs
             └─319193 /usr/bin/conmon --api-version 1 -c a300370ff5d32a4a830199f8515460043007338329f69cd770fc32f17e106855 -u a300370ff5d32a4a830199f8515460043007338329f69cd770fc32f17e106855 -r /usr/bin/crun -b />

Mar 21 12:56:18 leaf1-controller-0 ceph-nfs-pacemaker[319193]: 21/03/2023 12:56:18 : epoch 6419e1b2 : leaf1-controller-0 : ganesha.nfsd-72[main] nfs_Start_threads :THREAD :EVENT :gsh_dbusthread was started succe>
Mar 21 12:56:18 leaf1-controller-0 ceph-nfs-pacemaker[319193]: 21/03/2023 12:56:18 : epoch 6419e1b2 : leaf1-controller-0 : ganesha.nfsd-72[main] nfs_Start_threads :THREAD :EVENT :admin thread

Version-Release number of selected component (if applicable):
osp17.0.0 -> osp17.0.1

How reproducible:
during upgrade

Steps to Reproduce:
1. deploy osp 17.0.0 with manila ceph-nfs enabled
2. upgrade to osp17.0.1
3.

Actual results:
ceph-nfs won't start

Expected results:
ceph-nfs starts

Additional info:

[root@leaf1-controller-1 manila]# sudo cat /etc/systemd/system/ceph-nfs@.service
[Unit]
Description=NFS-Ganesha file server
Documentation=http://github.com/nfs-ganesha/nfs-ganesha/wiki
After=network.target

[Service]
EnvironmentFile=-/etc/environment
ExecStartPre=-/usr/bin/rm -f /%t/%n-pid /%t/%n-cid
ExecStartPre=-/usr/bin/podman rm --storage ceph-nfs-%i
ExecStartPre=-/usr/bin/mkdir -p /var/log/ceph /var/log/ganesha
ExecStartPre=-/usr/bin/podman rm ceph-nfs-%i
ExecStartPre=/usr/bin/mkdir -p /etc/ceph /etc/ganesha /var/lib/nfs/ganesha /var/log/ganesha
ExecStart=/usr/bin/podman run --rm --net=host \
  -d --log-driver journald --conmon-pidfile /%t/%n-pid --cidfile /%t/%n-cid \
  -v /var/lib/ceph:/var/lib/ceph:z \
  -v /var/lib/tripleo-config/ceph:/etc/ceph:z \
  -v /var/lib/nfs/ganesha:/var/lib/nfs/ganesha:z \
  -v /etc/ganesha:/etc/ganesha:z \
  -v /var/run/ceph:/var/run/ceph:z \
  -v /var/log/ceph:/var/log/ceph:z \
  -v /var/log/ganesha:/var/log/ganesha:z \
  --privileged \
  -v /var/run/dbus/system_bus_socket:/var/run/dbus/system_bus_socket \
  -v /etc/localtime:/etc/localtime:ro \
  -e CLUSTER=ceph \
  -e CEPH_DAEMON=NFS \
  -e CONTAINER_IMAGE=bgp-undercloud.ctlplane.openstack.lab:8787/rhceph/rhceph-5-rhel8:latest \
  --name=ceph-nfs-pacemaker \
  bgp-undercloud.ctlplane.openstack.lab:8787/rhceph/rhceph-5-rhel8:latest
ExecStop=-/usr/bin/sh -c "/usr/bin/podman rm -f `cat /%t/%n-cid`"
KillMode=none
Restart=always
RestartSec=10s
TimeoutStartSec=120
TimeoutStopSec=15
Type=forking
PIDFile=/%t/%n-pid

[Install]
WantedBy=multi-user.target
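For anyone hitting the same failure, a few commands that should show why the systemd unit fails before pacemaker marks the start as failed. This is a suggestion rather than part of the original troubleshooting, and it assumes the templated instance is ceph-nfs@pacemaker, as in the resource definition above:

[root@leaf1-controller-0 ~]# pcs resource failcount show ceph-nfs
[root@leaf1-controller-0 ~]# systemctl status ceph-nfs@pacemaker
[root@leaf1-controller-0 ~]# journalctl -u ceph-nfs@pacemaker -b --no-pager | tail -50

The journal output in particular should include the podman error from the ExecStartPre/ExecStart steps of the unit shown above.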
I worked with Goutham Pacha Ravi to further troubleshoot this issue, and we noticed that the reason ceph-nfs was not starting was the missing /var/run/ceph directory on the controllers:

Jun 05 13:04:58 leaf1-controller-2 systemd[1]: 4a49419f381c550ce6224e2ff9113f52d1137cd4ba280dedd286e445772bebba.service: Deactivated successfully.
Jun 05 13:04:58 leaf1-controller-2 haproxy[7175]: 172.20.12.118:33744 [05/Jun/2023:13:04:58.817] neutron neutron/leaf1-controller-1.internalapi.openstack.lab 0/0/0/161/161 200 187 - - ---- 918/22/0/0/0 0/0 "GET /v2.0/networks/73c5fbf9-04f4-4d1e-8409-90c9a95e6ca9?fields=segments HTTP/1.1"
Jun 05 13:04:58 leaf1-controller-2 podman[367072]: Error: no container with name or ID "ceph-nfs-pacemaker" found: no such container
Jun 05 13:04:59 leaf1-controller-2 haproxy[7175]: 172.20.12.118:33744 [05/Jun/2023:13:04:58.981] neutron neutron/leaf1-controller-2.internalapi.openstack.lab 0/0/0/158/158 200 252 - - ---- 918/22/0/0/0 0/0 "GET /v2.0/networks/73c5fbf9-04f4-4d1e-8409-90c9a95e6ca9?fields=provider%3Aphysical_network&fields=provider%3Anetwork_type HTTP/1.1"
Jun 05 13:04:59 leaf1-controller-2 podman[367154]: Error: statfs /var/run/ceph: no such file or directory
Jun 05 13:04:59 leaf1-controller-2 systemd[1]: ceph-nfs: Control process exited, code=exited, status=125/n/a
Jun 05 13:04:59 leaf1-controller-2 systemd[1]: ceph-nfs: Failed with result 'exit-code'.
Jun 05 13:04:59 leaf1-controller-2 systemd[1]: Failed to start Cluster Controlled ceph-nfs@pacemaker.
Jun 05 13:04:59 leaf1-controller-2 runuser[367227]: pam_unix(runuser:session): session opened for user rabbitmq(uid=42439) by (uid=0)
Jun 05 13:05:00 leaf1-controller-2 podman[366953]: 2023-06-05 13:05:00.609997374 -0400 EDT m=+1.879562362 container exec_died af8ef465e63dd4d99b71a5bf3eb808687539bbd73e688cd0a0bc98db020ae5cf (image=bgp-undercloud.ctlplane.openstack.lab:8787/rhosp-rhel9/openstack-swift-object:17.0, name=swift_rsync, health_status=, execID=89b483ab2860a6a21555f5a61ee13acc1d600bf09c08f418ff9bb025bbc8f6b7)
Jun 05 13:05:00 leaf1-controller-2 systemd[1]: af8ef465e63dd4d99b71a5bf3eb808687539bbd73e688cd0a0bc98db020ae5cf.service: Deactivated successfully.
Jun 05 13:05:00 leaf1-controller-2 pacemaker-controld[3248]: notice: Result of start operation for ceph-nfs on leaf1-controller-2: error (failed)
Jun 05 13:05:00 leaf1-controller-2 pacemaker-attrd[3246]: notice: Setting fail-count-ceph-nfs#start_0[leaf1-controller-2]: (unset) -> INFINITY
Jun 05 13:05:00 leaf1-controller-2 pacemaker-attrd[3246]: notice: Setting last-failure-ceph-nfs#start_0[leaf1-controller-2]: (unset) -> 1685984700
Jun 05 13:05:00 leaf1-controller-2 runuser[367227]: pam_unix(runuser:session): session closed for user rabbitmq
Jun 05 13:05:00 leaf1-controller-2 runuser[367357]: pam_unix(runuser:session): session opened for user rabbitmq(uid=42439) by (uid=0)
Jun 05 13:05:01 leaf1-controller-2 pacemaker-controld[3248]: notice: Requesting local execution of stop operation for ceph-nfs on leaf1-controller-2
Jun 05 13:05:01 leaf1-controller-2 systemd[1]: Reloading.

The workaround was to manually create the directory on each controller and then restart the service via pacemaker:

[root@leaf1-controller-0 ~]# mkdir /var/run/ceph
[root@leaf1-controller-0 ~]# pcs resource cleanup
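A minimal sketch for applying the same workaround to all three controllers in one go. It assumes root ssh between the controllers works and uses the hostnames from this cluster; scoping the cleanup to the ceph-nfs resource is optional:

[root@leaf1-controller-0 ~]# for h in leaf1-controller-0 leaf1-controller-1 leaf1-controller-2; do ssh "$h" 'mkdir -p /var/run/ceph'; done
[root@leaf1-controller-0 ~]# pcs resource cleanup ceph-nfs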
Hello,

This issue has occurred again on the same cluster; this time, after rebooting the controllers, /var/run/ceph no longer exists:

[root@leaf1-controller-0 ~]# ls /var/run/ceph
ls: cannot access '/var/run/ceph': No such file or directory
(In reply to Chris Janiszewski from comment #19)
> Hello,
>
> This issue has occurred again on the same cluster, this time after rebooting
> controllers /var/run/ceph no longer exist:
> [root@leaf1-controller-0 ~]# ls /var/run/ceph
> ls: cannot access '/var/run/ceph': No such file or directory

ack; systemd clears /run (and therefore /var/run, which is a symlink to it) on reboot, which would explain the directory's deletion. Thanks for reporting this.
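Until this is properly fixed, one possible interim mitigation (a sketch, not a validated fix; the tmpfiles.d file name below is arbitrary) is a tmpfiles.d entry on each controller so the directory is recreated at boot:

[root@leaf1-controller-0 ~]# cat > /etc/tmpfiles.d/ceph-nfs-run.conf <<'EOF'
# recreate /run/ceph at boot; ceph-nfs@.service bind-mounts it into the ganesha container
d /run/ceph 0755 root root -
EOF
[root@leaf1-controller-0 ~]# systemd-tmpfiles --create /etc/tmpfiles.d/ceph-nfs-run.conf

The 0755 root:root ownership simply mirrors the manual mkdir used as the workaround above. An alternative would be a drop-in for ceph-nfs@.service adding ExecStartPre=-/usr/bin/mkdir -p /var/run/ceph next to the existing ExecStartPre lines in the unit.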