Bug 2180542 - After upgrade, manila ceph-nfs fails to start with error: ceph-nfs start on leaf1-controller-0 returned 'error' because 'failed'
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: tripleo-ansible
Version: 17.0 (Wallaby)
Hardware: Unspecified
OS: Unspecified
Importance: low medium
Target Milestone: z2
Target Release: 17.1
Assignee: Goutham Pacha Ravi
QA Contact: Alfredo
Docs Contact: Jenny-Anne Lynch
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2023-03-21 17:42 UTC by Chris Janiszewski
Modified: 2024-01-16 14:32 UTC
CC List: 12 users

Fixed In Version: tripleo-ansible-3.3.1-17.1.20230828000817.0ab5533.el9ost, tripleo-ansible-3.3.1-17.1.20231101233745.4d015bf.el8ost
Doc Type: Bug Fix
Doc Text:
This update fixes a bug that caused the `ceph-nfs` service to fail after a reboot of all Controller nodes. The Pacemaker-controlled `ceph-nfs` resource requires a runtime directory to store some process data. Before this update, the directory was created only when you installed or upgraded RHOSP, and a reboot of a Controller node removed it; the `ceph-nfs` service did not recover on that node, and if all Controller nodes were rebooted the service failed permanently. With this update, the directory is created before the `ceph-nfs` service is spawned, so the service continues to run through reboots.
Clone Of:
Environment:
Last Closed: 2024-01-16 14:32:08 UTC
Target Upstream Version:
Embargoed:


Links
OpenStack gerrit 887340 (MERGED): Set RuntimeDirectory in ceph-nfs systemd unit (last updated 2023-08-17 07:07:25 UTC)
Red Hat Issue Tracker OSP-23264 (last updated 2023-03-21 17:42:48 UTC)
Red Hat Product Errata RHBA-2024:0209 (last updated 2024-01-16 14:32:10 UTC)

Description Chris Janiszewski 2023-03-21 17:42:25 UTC
Description of problem:
After an osp17.0.0 -> osp17.0.1 upgrade, the manila ceph-nfs resource failed to start with: Error: ceph-nfs start on leaf1-controller-0 returned 'error' because 'failed'

"pcs resource cleanup" also did not fix the problem:

  * ip-172.20.9.214     (ocf:heartbeat:IPaddr2):         Stopped
  * ceph-nfs    (systemd:ceph-nfs@pacemaker):    Stopped
  * Container bundle: openstack-manila-share [bgp-undercloud.ctlplane.openstack.lab:8787/rhosp-rhel9/openstack-manila-share:pcmklatest]:
    * openstack-manila-share-podman-0   (ocf:heartbeat:podman):  Stopped
  * ip-172.20.12.100    (ocf:heartbeat:IPaddr2):         Started leaf1-controller-0
  * Container bundle set: redis-bundle [bgp-undercloud.ctlplane.openstack.lab:8787/rhosp-rhel9/openstack-redis:pcmklatest]:
    * redis-bundle-0    (ocf:heartbeat:redis):   Unpromoted leaf1-controller-0
    * redis-bundle-1    (ocf:heartbeat:redis):   Promoted leaf1-controller-1
    * redis-bundle-2    (ocf:heartbeat:redis):   Unpromoted leaf1-controller-2

Failed Resource Actions:
  * ceph-nfs start on leaf1-controller-0 returned 'error' because 'failed' at Mon Mar 20 19:36:18 2023 after 3.352s
  * ceph-nfs start on leaf1-controller-1 returned 'error' because 'failed' at Mon Mar 20 19:36:25 2023 after 3.322s
  * ceph-nfs start on leaf1-controller-2 returned 'error' because 'failed' at Mon Mar 20 19:36:32 2023 after 3.332s


pacemaker logs:
Mar 20 19:36:32.528 leaf1-controller-0 pacemaker-based     [334922] (cib_perform_op)    info: +  /cib/status/node_state[@id='3']/lrm[@id='3']/lrm_resources/lrm_resource[@id='ceph-nfs']/lrm_rsc_op[@id='ceph-nfs_last_0']:  @transition-magic=0:1;148:11:0:373d2006-4ae7-4c49-9bd0-50a85521cf5d, @exit-reason=failed, @call-id=145, @rc-code=1, @op-status=0, @last-rc-change=1679355392, @exec-time=3332
Mar 20 19:36:32.528 leaf1-controller-0 pacemaker-based     [334922] (cib_perform_op)    info: ++ /cib/status/node_state[@id='3']/lrm[@id='3']/lrm_resources/lrm_resource[@id='ceph-nfs']:  <lrm_rsc_op id="ceph-nfs_last_failure_0" operation_key="ceph-nfs_start_0" operation="start" crm-debug-origin="do_update_resource" crm_feature_set="3.13.0" transition-key="148:11:0:373d2006-4ae7-4c49-9bd0-50a85521cf5d" transition-magic="0:1;148:11:0:373d2006-4ae7-4c49-9bd0-50a85521cf5d" exit-reason="failed" on_node="leaf1-controller-2" call-id="145" rc-code="1" op-status="0" interval="0" last-rc-change="1679355392" exec-time="3332" queue-time="0" op-digest="f2317cad3d54cec5d7d7aa7d0bf35cf8"/>
Mar 20 19:36:32.529 leaf1-controller-0 pacemaker-based     [334922] (cib_process_request)       info: Completed cib_modify operation for section status: OK (rc=0, origin=leaf1-controller-2/crmd/277, version=0.209.41)
Mar 20 19:36:32.530 leaf1-controller-0 pacemaker-attrd     [334925] (attrd_peer_update)         notice: Setting fail-count-ceph-nfs#start_0[leaf1-controller-2]: (unset) -> INFINITY | from leaf1-controller-2
Mar 20 19:36:32.531 leaf1-controller-0 pacemaker-attrd     [334925] (attrd_peer_update)         notice: Setting last-failure-ceph-nfs#start_0[leaf1-controller-2]: (unset) -> 1679355392 | from leaf1-controller-2
Mar 20 19:36:32.574 leaf1-controller-0 pacemaker-based     [334922] (cib_perform_op)    info: Diff: --- 0.209.41 2


I was able to work around it by manually starting the service and running "pcs resource cleanup" afterwards:

[root@leaf1-controller-0 manila]# systemctl start ceph-nfs                                                                                                                                        
[root@leaf1-controller-0 manila]# systemctl status ceph-nfs                                                                                                                                       
● ceph-nfs - NFS-Ganesha file server                                                                                                                                                              
     Loaded: loaded (/etc/systemd/system/ceph-nfs@.service; disabled; vendor preset: disabled)                                                                                                                      
     Active: active (running) since Tue 2023-03-21 12:56:17 EDT; 6s ago                                                                                                                                             
       Docs: http://github.com/nfs-ganesha/nfs-ganesha/wiki                                                                                                                                                         
    Process: 318032 ExecStartPre=/usr/bin/rm -f //run/ceph-nfs //run/ceph-nfs (code=exited, status=0/SUCCESS)                                                           
    Process: 318063 ExecStartPre=/usr/bin/podman rm --storage ceph-nfs-pacemaker (code=exited, status=1/FAILURE)                                                                                             
    Process: 318335 ExecStartPre=/usr/bin/mkdir -p /var/log/ceph /var/log/ganesha (code=exited, status=0/SUCCESS)                                                                                                   
    Process: 318353 ExecStartPre=/usr/bin/podman rm ceph-nfs-pacemaker (code=exited, status=1/FAILURE)                                                                                                              
    Process: 318754 ExecStartPre=/usr/bin/mkdir -p /etc/ceph /etc/ganesha /var/lib/nfs/ganesha /var/log/ganesha (code=exited, status=0/SUCCESS)                                                                     
    Process: 318762 ExecStart=/usr/bin/podman run --rm --net=host -d --log-driver journald --conmon-pidfile //run/ceph-nfs --cidfile //run/ceph-nfs -v /var/lib/ceph:/v>
   Main PID: 319193 (conmon)                                                                                                                                                                                        
      Tasks: 1 (limit: 816568)                                                                                                                                                                                      
     Memory: 992.0K                                                                                                                                                                                                 
        CPU: 738ms                                                                                                                                                                                                  
     CGroup: /system.slice/system-ceph\x2dnfs.slice/ceph-nfs                                                                                                                                      
             └─319193 /usr/bin/conmon --api-version 1 -c a300370ff5d32a4a830199f8515460043007338329f69cd770fc32f17e106855 -u a300370ff5d32a4a830199f8515460043007338329f69cd770fc32f17e106855 -r /usr/bin/crun -b />
                                                                                                                                                                                                                    
Mar 21 12:56:18 leaf1-controller-0 ceph-nfs-pacemaker[319193]: 21/03/2023 12:56:18 : epoch 6419e1b2 : leaf1-controller-0 : ganesha.nfsd-72[main] nfs_Start_threads :THREAD :EVENT :gsh_dbusthread was started succe>
Mar 21 12:56:18 leaf1-controller-0 ceph-nfs-pacemaker[319193]: 21/03/2023 12:56:18 : epoch 6419e1b2 : leaf1-controller-0 : ganesha.nfsd-72[main] nfs_Start_threads :THREAD :EVENT :admin thread




Version-Release number of selected component (if applicable):
osp17.0.0 -> osp17.0.1

How reproducible:
during upgrade

Steps to Reproduce:
1. deploy osp 17.0.0 with manila ceph-nfs enabled
2. upgrade to osp17.0.1
3.

Actual results:
ceph-nfs won't start


Expected results:
ceph-nfs starts 


Additional info:

[root@leaf1-controller-1 manila]# sudo cat /etc/systemd/system/ceph-nfs@.service
[Unit]
Description=NFS-Ganesha file server
Documentation=http://github.com/nfs-ganesha/nfs-ganesha/wiki
After=network.target

[Service]
EnvironmentFile=-/etc/environment
ExecStartPre=-/usr/bin/rm -f /%t/%n-pid /%t/%n-cid
ExecStartPre=-/usr/bin/podman rm --storage ceph-nfs-%i
ExecStartPre=-/usr/bin/mkdir -p /var/log/ceph /var/log/ganesha
ExecStartPre=-/usr/bin/podman rm ceph-nfs-%i
ExecStartPre=/usr/bin/mkdir -p /etc/ceph /etc/ganesha /var/lib/nfs/ganesha /var/log/ganesha
ExecStart=/usr/bin/podman run --rm --net=host \
  -d --log-driver journald --conmon-pidfile /%t/%n-pid --cidfile /%t/%n-cid \
  -v /var/lib/ceph:/var/lib/ceph:z \
  -v /var/lib/tripleo-config/ceph:/etc/ceph:z \
  -v /var/lib/nfs/ganesha:/var/lib/nfs/ganesha:z \
  -v /etc/ganesha:/etc/ganesha:z \
  -v /var/run/ceph:/var/run/ceph:z \
  -v /var/log/ceph:/var/log/ceph:z \
  -v /var/log/ganesha:/var/log/ganesha:z \
    --privileged \
  -v /var/run/dbus/system_bus_socket:/var/run/dbus/system_bus_socket \
  -v /etc/localtime:/etc/localtime:ro \
  -e CLUSTER=ceph \
  -e CEPH_DAEMON=NFS \
  -e CONTAINER_IMAGE=bgp-undercloud.ctlplane.openstack.lab:8787/rhceph/rhceph-5-rhel8:latest \
  --name=ceph-nfs-pacemaker \
  bgp-undercloud.ctlplane.openstack.lab:8787/rhceph/rhceph-5-rhel8:latest
ExecStop=-/usr/bin/sh -c "/usr/bin/podman rm -f `cat /%t/%n-cid`"
KillMode=none
Restart=always
RestartSec=10s
TimeoutStartSec=120
TimeoutStopSec=15
Type=forking
PIDFile=/%t/%n-pid

[Install]
WantedBy=multi-user.target
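
A note on why the missing directory is fatal: the ExecStart line above bind-mounts /var/run/ceph from the host into the container, but /var/run is a symlink to /run, a tmpfs that comes up empty after every boot, and nothing in the unit recreates the directory. With the podman version in this environment, a bind mount whose source does not exist fails outright, which is exactly the "statfs /var/run/ceph" error captured in comment 13 below. A minimal illustration of the same failure (the image name is only a placeholder; any locally available image behaves the same way):

[root@leaf1-controller-0 ~]# ls -ld /var/run/ceph
ls: cannot access '/var/run/ceph': No such file or directory
[root@leaf1-controller-0 ~]# podman run --rm -v /var/run/ceph:/var/run/ceph:z registry.access.redhat.com/ubi9/ubi:latest true
Error: statfs /var/run/ceph: no such file or directory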

Comment 13 Chris Janiszewski 2023-06-05 17:20:23 UTC
I worked with Goutham Pacha Ravi to further troubleshoot this issue, and we noticed that the reason manila was not starting was the lack of a /var/run/ceph directory on the controllers:

Jun 05 13:04:58 leaf1-controller-2 systemd[1]: 4a49419f381c550ce6224e2ff9113f52d1137cd4ba280dedd286e445772bebba.service: Deactivated successfully.
Jun 05 13:04:58 leaf1-controller-2 haproxy[7175]: 172.20.12.118:33744 [05/Jun/2023:13:04:58.817] neutron neutron/leaf1-controller-1.internalapi.openstack.lab 0/0/0/161/161 200 187 - - ---- 918/22/0/0/0 0/0 "GET /v2.0/networks/73c5fbf9-04f4-4d1e-8409-90c9a95e6ca9?fields=segments HTTP/1.1"
Jun 05 13:04:58 leaf1-controller-2 podman[367072]: Error: no container with name or ID "ceph-nfs-pacemaker" found: no such container
Jun 05 13:04:59 leaf1-controller-2 haproxy[7175]: 172.20.12.118:33744 [05/Jun/2023:13:04:58.981] neutron neutron/leaf1-controller-2.internalapi.openstack.lab 0/0/0/158/158 200 252 - - ---- 918/22/0/0/0 0/0 "GET /v2.0/networks/73c5fbf9-04f4-4d1e-8409-90c9a95e6ca9?fields=provider%3Aphysical_network&fields=provider%3Anetwork_type HTTP/1.1"
Jun 05 13:04:59 leaf1-controller-2 podman[367154]: Error: statfs /var/run/ceph: no such file or directory
Jun 05 13:04:59 leaf1-controller-2 systemd[1]: ceph-nfs: Control process exited, code=exited, status=125/n/a
Jun 05 13:04:59 leaf1-controller-2 systemd[1]: ceph-nfs: Failed with result 'exit-code'.
Jun 05 13:04:59 leaf1-controller-2 systemd[1]: Failed to start Cluster Controlled ceph-nfs@pacemaker.
Jun 05 13:04:59 leaf1-controller-2 runuser[367227]: pam_unix(runuser:session): session opened for user rabbitmq(uid=42439) by (uid=0)
Jun 05 13:05:00 leaf1-controller-2 podman[366953]: 2023-06-05 13:05:00.609997374 -0400 EDT m=+1.879562362 container exec_died af8ef465e63dd4d99b71a5bf3eb808687539bbd73e688cd0a0bc98db020ae5cf (image=bgp-undercloud.ctlplane.openstack.lab:8787/rhosp-rhel9/openstack-swift-object:17.0, name=swift_rsync, health_status=, execID=89b483ab2860a6a21555f5a61ee13acc1d600bf09c08f418ff9bb025bbc8f6b7)
Jun 05 13:05:00 leaf1-controller-2 systemd[1]: af8ef465e63dd4d99b71a5bf3eb808687539bbd73e688cd0a0bc98db020ae5cf.service: Deactivated successfully.
Jun 05 13:05:00 leaf1-controller-2 pacemaker-controld[3248]:  notice: Result of start operation for ceph-nfs on leaf1-controller-2: error (failed)
Jun 05 13:05:00 leaf1-controller-2 pacemaker-attrd[3246]:  notice: Setting fail-count-ceph-nfs#start_0[leaf1-controller-2]: (unset) -> INFINITY
Jun 05 13:05:00 leaf1-controller-2 pacemaker-attrd[3246]:  notice: Setting last-failure-ceph-nfs#start_0[leaf1-controller-2]: (unset) -> 1685984700
Jun 05 13:05:00 leaf1-controller-2 runuser[367227]: pam_unix(runuser:session): session closed for user rabbitmq
Jun 05 13:05:00 leaf1-controller-2 runuser[367357]: pam_unix(runuser:session): session opened for user rabbitmq(uid=42439) by (uid=0)
Jun 05 13:05:01 leaf1-controller-2 pacemaker-controld[3248]:  notice: Requesting local execution of stop operation for ceph-nfs on leaf1-controller-2
Jun 05 13:05:01 leaf1-controller-2 systemd[1]: Reloading.


The workaround was to manually create the directory on each controller and let Pacemaker restart the service via a resource cleanup:
[root@leaf1-controller-0 ~]# mkdir /var/run/ceph
[root@leaf1-controller-0 ~]# pcs resource cleanup
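
Note that the mkdir workaround does not survive another reboot, because /var/run is on tmpfs. One way to have the directory recreated automatically on every boot, independent of the fix shipped with the errata, would be a tmpfiles.d entry; this is only a sketch, the file name is hypothetical and it is not what the errata delivers:

[root@leaf1-controller-0 ~]# cat /etc/tmpfiles.d/ceph-nfs.conf
# Recreate /run/ceph (i.e. /var/run/ceph) with root ownership on every boot
d /run/ceph 0755 root root -
[root@leaf1-controller-0 ~]# systemd-tmpfiles --create /etc/tmpfiles.d/ceph-nfs.conf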

Comment 19 Chris Janiszewski 2023-06-29 15:34:31 UTC
Hello,

This issue has occurred again on the same cluster; this time, after rebooting the controllers, /var/run/ceph no longer exists:
[root@leaf1-controller-0 ~]# ls /var/run/ceph
ls: cannot access '/var/run/ceph': No such file or directory

Comment 22 Goutham Pacha Ravi 2023-06-29 20:30:01 UTC
(In reply to Chris Janiszewski from comment #19)
> Hello,
> 
> This issue has occurred again on the same cluster, this time after rebooting
> controllers /var/run/ceph no longer exist:
> [root@leaf1-controller-0 ~]# ls /var/run/ceph
> ls: cannot access '/var/run/ceph': No such file or directory

Ack; systemd can handle recreating the directory, since it is deleted after a reboot. Thanks for reporting this.
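
Per the linked Gerrit change 887340 ("Set RuntimeDirectory in ceph-nfs systemd unit"), the direction of the fix is to let systemd itself create the runtime directory every time the unit starts. A minimal sketch of that kind of change, expressed here as a drop-in; the drop-in path, directory name, and mode are assumptions, and the merged patch may set the directive directly in the generated unit file:

# /etc/systemd/system/ceph-nfs@.service.d/runtime-dir.conf (hypothetical drop-in path)
[Service]
# systemd creates /run/ceph (i.e. /var/run/ceph) before running ExecStartPre/ExecStart,
# so the bind mount in ExecStart keeps working after controller reboots.
RuntimeDirectory=ceph
RuntimeDirectoryMode=0755

After a "systemctl daemon-reload", a "pcs resource cleanup" would let Pacemaker retry the start with the directory in place.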

Comment 55 errata-xmlrpc 2024-01-16 14:32:08 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat OpenStack Platform 17.1.2 bug fix and enhancement advisory), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2024:0209

