Bug 2180542

Summary: After upgrade, manila ceph-nfs fails to start with error: ceph-nfs start on leaf1-controller-0 returned 'error' because 'failed'
Product: Red Hat OpenStack Reporter: Chris Janiszewski <cjanisze>
Component: tripleo-ansible Assignee: Goutham Pacha Ravi <gouthamr>
Status: POST --- QA Contact: Joe H. Rahme <jhakimra>
Severity: medium Docs Contact: Jenny-Anne Lynch <jelynch>
Priority: low    
Version: 17.0 (Wallaby) CC: alfrgarc, ashrodri, fpantano, gouthamr, jamsmith, jelynch, kgilliga, lkuchlan, mkatari, vhariria
Target Milestone: z2 Keywords: Triaged
Target Release: 17.1   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version:
Doc Type: Known Issue
Doc Text:
The Pacemaker-controlled `ceph-nfs` resource requires a runtime directory to store some process data. The directory is created when you install or upgrade RHOSP. Currently, a reboot of the Controller nodes removes the directory, and the `ceph-nfs` service does not recover when the Controller nodes are rebooted. If all Controller nodes are rebooted, the `ceph-nfs` service fails permanently.

Workaround: If you reboot a Controller node, log into the Controller node and create a `/var/run/ceph` directory:

`$ mkdir -p /var/run/ceph`

Repeat this step on all Controller nodes that have been rebooted. If the `ceph-nfs-pacemaker` service has been marked as failed, after creating the directory, execute the following command from any of the Controller nodes:

`$ pcs resource cleanup`
Story Points: ---
Clone Of:
Environment:
Last Closed:
Type: Bug
Embargoed:

Description Chris Janiszewski 2023-03-21 17:42:25 UTC
Description of problem:
After the osp17.0.0 -> osp17.0.1 upgrade, the manila ceph-nfs resource failed to start. Error: ceph-nfs start on leaf1-controller-0 returned 'error' because 'failed'

"pcs resource cleanup" also did not fix the problem:

  * ip-172.20.9.214     (ocf:heartbeat:IPaddr2):         Stopped
  * ceph-nfs    (systemd:ceph-nfs@pacemaker):    Stopped
  * Container bundle: openstack-manila-share [bgp-undercloud.ctlplane.openstack.lab:8787/rhosp-rhel9/openstack-manila-share:pcmklatest]:
    * openstack-manila-share-podman-0   (ocf:heartbeat:podman):  Stopped
  * ip-172.20.12.100    (ocf:heartbeat:IPaddr2):         Started leaf1-controller-0
  * Container bundle set: redis-bundle [bgp-undercloud.ctlplane.openstack.lab:8787/rhosp-rhel9/openstack-redis:pcmklatest]:
    * redis-bundle-0    (ocf:heartbeat:redis):   Unpromoted leaf1-controller-0
    * redis-bundle-1    (ocf:heartbeat:redis):   Promoted leaf1-controller-1
    * redis-bundle-2    (ocf:heartbeat:redis):   Unpromoted leaf1-controller-2

Failed Resource Actions:
  * ceph-nfs start on leaf1-controller-0 returned 'error' because 'failed' at Mon Mar 20 19:36:18 2023 after 3.352s
  * ceph-nfs start on leaf1-controller-1 returned 'error' because 'failed' at Mon Mar 20 19:36:25 2023 after 3.322s
  * ceph-nfs start on leaf1-controller-2 returned 'error' because 'failed' at Mon Mar 20 19:36:32 2023 after 3.332s


pacemaker logs:
Mar 20 19:36:32.528 leaf1-controller-0 pacemaker-based     [334922] (cib_perform_op)    info: +  /cib/status/node_state[@id='3']/lrm[@id='3']/lrm_resources/lrm_resource[@id='ceph-nfs']/lrm_rsc_op[@id='ceph-nfs_last_0']:  @transition-magic=0:1;148:11:0:373d2006-4ae7-4c49-9bd0-50a85521cf5d, @exit-reason=failed, @call-id=145, @rc-code=1, @op-status=0, @last-rc-change=1679355392, @exec-time=3332
Mar 20 19:36:32.528 leaf1-controller-0 pacemaker-based     [334922] (cib_perform_op)    info: ++ /cib/status/node_state[@id='3']/lrm[@id='3']/lrm_resources/lrm_resource[@id='ceph-nfs']:  <lrm_rsc_op id="ceph-nfs_last_failure_0" operation_key="ceph-nfs_start_0" operation="start" crm-debug-origin="do_update_resource" crm_feature_set="3.13.0" transition-key="148:11:0:373d2006-4ae7-4c49-9bd0-50a85521cf5d" transition-magic="0:1;148:11:0:373d2006-4ae7-4c49-9bd0-50a85521cf5d" exit-reason="failed" on_node="leaf1-controller-2" call-id="145" rc-code="1" op-status="0" interval="0" last-rc-change="1679355392" exec-time="3332" queue-time="0" op-digest="f2317cad3d54cec5d7d7aa7d0bf35cf8"/>
Mar 20 19:36:32.529 leaf1-controller-0 pacemaker-based     [334922] (cib_process_request)       info: Completed cib_modify operation for section status: OK (rc=0, origin=leaf1-controller-2/crmd/277, version=0.209.41)
Mar 20 19:36:32.530 leaf1-controller-0 pacemaker-attrd     [334925] (attrd_peer_update)         notice: Setting fail-count-ceph-nfs#start_0[leaf1-controller-2]: (unset) -> INFINITY | from leaf1-controller-2
Mar 20 19:36:32.531 leaf1-controller-0 pacemaker-attrd     [334925] (attrd_peer_update)         notice: Setting last-failure-ceph-nfs#start_0[leaf1-controller-2]: (unset) -> 1679355392 | from leaf1-controller-2
Mar 20 19:36:32.574 leaf1-controller-0 pacemaker-based     [334922] (cib_perform_op)    info: Diff: --- 0.209.41 2
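
The pacemaker log only records that the start operation failed; the underlying error comes from the systemd unit itself. One way to pull that out of the journal on an affected controller (standard journalctl usage, not taken from the original report):

[root@leaf1-controller-0 ~]# journalctl -b -u ceph-nfs@pacemaker --no-pager | tail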


I was able to work around it by manually starting the service and invoking "pcs resource cleanup" afterwards:

[root@leaf1-controller-0 manila]# systemctl start ceph-nfs                                                                                                                                        
[root@leaf1-controller-0 manila]# systemctl status ceph-nfs                                                                                                                                       
● ceph-nfs - NFS-Ganesha file server                                                                                                                                                              
     Loaded: loaded (/etc/systemd/system/ceph-nfs@.service; disabled; vendor preset: disabled)                                                                                                                      
     Active: active (running) since Tue 2023-03-21 12:56:17 EDT; 6s ago                                                                                                                                             
       Docs: http://github.com/nfs-ganesha/nfs-ganesha/wiki                                                                                                                                                         
    Process: 318032 ExecStartPre=/usr/bin/rm -f //run/ceph-nfs //run/ceph-nfs (code=exited, status=0/SUCCESS)                                                           
    Process: 318063 ExecStartPre=/usr/bin/podman rm --storage ceph-nfs-pacemaker (code=exited, status=1/FAILURE)                                                                                             
    Process: 318335 ExecStartPre=/usr/bin/mkdir -p /var/log/ceph /var/log/ganesha (code=exited, status=0/SUCCESS)                                                                                                   
    Process: 318353 ExecStartPre=/usr/bin/podman rm ceph-nfs-pacemaker (code=exited, status=1/FAILURE)                                                                                                              
    Process: 318754 ExecStartPre=/usr/bin/mkdir -p /etc/ceph /etc/ganesha /var/lib/nfs/ganesha /var/log/ganesha (code=exited, status=0/SUCCESS)                                                                     
    Process: 318762 ExecStart=/usr/bin/podman run --rm --net=host -d --log-driver journald --conmon-pidfile //run/ceph-nfs --cidfile //run/ceph-nfs -v /var/lib/ceph:/v>
   Main PID: 319193 (conmon)                                                                                                                                                                                        
      Tasks: 1 (limit: 816568)                                                                                                                                                                                      
     Memory: 992.0K                                                                                                                                                                                                 
        CPU: 738ms                                                                                                                                                                                                  
     CGroup: /system.slice/system-ceph\x2dnfs.slice/ceph-nfs                                                                                                                                      
             └─319193 /usr/bin/conmon --api-version 1 -c a300370ff5d32a4a830199f8515460043007338329f69cd770fc32f17e106855 -u a300370ff5d32a4a830199f8515460043007338329f69cd770fc32f17e106855 -r /usr/bin/crun -b />
                                                                                                                                                                                                                    
Mar 21 12:56:18 leaf1-controller-0 ceph-nfs-pacemaker[319193]: 21/03/2023 12:56:18 : epoch 6419e1b2 : leaf1-controller-0 : ganesha.nfsd-72[main] nfs_Start_threads :THREAD :EVENT :gsh_dbusthread was started succe>
Mar 21 12:56:18 leaf1-controller-0 ceph-nfs-pacemaker[319193]: 21/03/2023 12:56:18 : epoch 6419e1b2 : leaf1-controller-0 : ganesha.nfsd-72[main] nfs_Start_threads :THREAD :EVENT :admin thread




Version-Release number of selected component (if applicable):
osp17.0.0 -> osp17.0.1

How reproducible:
during upgrade

Steps to Reproduce:
1. Deploy OSP 17.0.0 with manila ceph-nfs enabled.
2. Upgrade to OSP 17.0.1.

Actual results:
ceph-nfs won't start


Expected results:
ceph-nfs starts 


Additional info:

[root@leaf1-controller-1 manila]# sudo cat /etc/systemd/system/ceph-nfs@.service
[Unit]
Description=NFS-Ganesha file server
Documentation=http://github.com/nfs-ganesha/nfs-ganesha/wiki
After=network.target

[Service]
EnvironmentFile=-/etc/environment
ExecStartPre=-/usr/bin/rm -f /%t/%n-pid /%t/%n-cid
ExecStartPre=-/usr/bin/podman rm --storage ceph-nfs-%i
ExecStartPre=-/usr/bin/mkdir -p /var/log/ceph /var/log/ganesha
ExecStartPre=-/usr/bin/podman rm ceph-nfs-%i
ExecStartPre=/usr/bin/mkdir -p /etc/ceph /etc/ganesha /var/lib/nfs/ganesha /var/log/ganesha
ExecStart=/usr/bin/podman run --rm --net=host \
  -d --log-driver journald --conmon-pidfile /%t/%n-pid --cidfile /%t/%n-cid \
  -v /var/lib/ceph:/var/lib/ceph:z \
  -v /var/lib/tripleo-config/ceph:/etc/ceph:z \
  -v /var/lib/nfs/ganesha:/var/lib/nfs/ganesha:z \
  -v /etc/ganesha:/etc/ganesha:z \
  -v /var/run/ceph:/var/run/ceph:z \
  -v /var/log/ceph:/var/log/ceph:z \
  -v /var/log/ganesha:/var/log/ganesha:z \
    --privileged \
  -v /var/run/dbus/system_bus_socket:/var/run/dbus/system_bus_socket \
  -v /etc/localtime:/etc/localtime:ro \
  -e CLUSTER=ceph \
  -e CEPH_DAEMON=NFS \
  -e CONTAINER_IMAGE=bgp-undercloud.ctlplane.openstack.lab:8787/rhceph/rhceph-5-rhel8:latest \
  --name=ceph-nfs-pacemaker \
  bgp-undercloud.ctlplane.openstack.lab:8787/rhceph/rhceph-5-rhel8:latest
ExecStop=-/usr/bin/sh -c "/usr/bin/podman rm -f `cat /%t/%n-cid`"
KillMode=none
Restart=always
RestartSec=10s
TimeoutStartSec=120
TimeoutStopSec=15
Type=forking
PIDFile=/%t/%n-pid

[Install]
WantedBy=multi-user.target
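
Note that the unit above bind-mounts /var/run/ceph into the container (-v /var/run/ceph:/var/run/ceph:z) but, unlike the log and config directories, never creates it in an ExecStartPre step. A minimal sketch of such a step, assuming the unit template is an acceptable place for it (this is not the shipped fix):

ExecStartPre=/usr/bin/mkdir -p /var/run/ceph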

Comment 13 Chris Janiszewski 2023-06-05 17:20:23 UTC
I worked with Goutham Pacha Ravi to further troubleshoot this issue, and we noticed that the reason manila's ceph-nfs was not starting was the lack of a /var/run/ceph directory on the controllers:

Jun 05 13:04:58 leaf1-controller-2 systemd[1]: 4a49419f381c550ce6224e2ff9113f52d1137cd4ba280dedd286e445772bebba.service: Deactivated successfully.
Jun 05 13:04:58 leaf1-controller-2 haproxy[7175]: 172.20.12.118:33744 [05/Jun/2023:13:04:58.817] neutron neutron/leaf1-controller-1.internalapi.openstack.lab 0/0/0/161/161 200 187 - - ---- 918/22/
0/0/0 0/0 "GET /v2.0/networks/73c5fbf9-04f4-4d1e-8409-90c9a95e6ca9?fields=segments HTTP/1.1"
Jun 05 13:04:58 leaf1-controller-2 podman[367072]: Error: no container with name or ID "ceph-nfs-pacemaker" found: no such container
Jun 05 13:04:59 leaf1-controller-2 haproxy[7175]: 172.20.12.118:33744 [05/Jun/2023:13:04:58.981] neutron neutron/leaf1-controller-2.internalapi.openstack.lab 0/0/0/158/158 200 252 - - ---- 918/22/
0/0/0 0/0 "GET /v2.0/networks/73c5fbf9-04f4-4d1e-8409-90c9a95e6ca9?fields=provider%3Aphysical_network&fields=provider%3Anetwork_type HTTP/1.1"
Jun 05 13:04:59 leaf1-controller-2 podman[367154]: Error: statfs /var/run/ceph: no such file or directory
Jun 05 13:04:59 leaf1-controller-2 systemd[1]: ceph-nfs: Control process exited, code=exited, status=125/n/a
Jun 05 13:04:59 leaf1-controller-2 systemd[1]: ceph-nfs: Failed with result 'exit-code'.
Jun 05 13:04:59 leaf1-controller-2 systemd[1]: Failed to start Cluster Controlled ceph-nfs@pacemaker.
Jun 05 13:04:59 leaf1-controller-2 runuser[367227]: pam_unix(runuser:session): session opened for user rabbitmq(uid=42439) by (uid=0)
Jun 05 13:05:00 leaf1-controller-2 podman[366953]: 2023-06-05 13:05:00.609997374 -0400 EDT m=+1.879562362 container exec_died af8ef465e63dd4d99b71a5bf3eb808687539bbd73e688cd0a0bc98db020ae5cf (imag
e=bgp-undercloud.ctlplane.openstack.lab:8787/rhosp-rhel9/openstack-swift-object:17.0, name=swift_rsync, health_status=, execID=89b483ab2860a6a21555f5a61ee13acc1d600bf09c08f418ff9bb025bbc8f6b7)
Jun 05 13:05:00 leaf1-controller-2 systemd[1]: af8ef465e63dd4d99b71a5bf3eb808687539bbd73e688cd0a0bc98db020ae5cf.service: Deactivated successfully.
Jun 05 13:05:00 leaf1-controller-2 pacemaker-controld[3248]:  notice: Result of start operation for ceph-nfs on leaf1-controller-2: error (failed)
Jun 05 13:05:00 leaf1-controller-2 pacemaker-attrd[3246]:  notice: Setting fail-count-ceph-nfs#start_0[leaf1-controller-2]: (unset) -> INFINITY
Jun 05 13:05:00 leaf1-controller-2 pacemaker-attrd[3246]:  notice: Setting last-failure-ceph-nfs#start_0[leaf1-controller-2]: (unset) -> 1685984700
Jun 05 13:05:00 leaf1-controller-2 runuser[367227]: pam_unix(runuser:session): session closed for user rabbitmq
Jun 05 13:05:00 leaf1-controller-2 runuser[367357]: pam_unix(runuser:session): session opened for user rabbitmq(uid=42439) by (uid=0)
Jun 05 13:05:01 leaf1-controller-2 pacemaker-controld[3248]:  notice: Requesting local execution of stop operation for ceph-nfs on leaf1-controller-2
Jun 05 13:05:01 leaf1-controller-2 systemd[1]: Reloading.


The workaround was to manually create the directory on each controller and then have pacemaker restart the service by cleaning up the resource:
[root@leaf1-controller-0 ~]# mkdir /var/run/ceph
[root@leaf1-controller-0 ~]# pcs resource cleanup
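
The same steps have to be repeated on every controller that is missing the directory. A sketch of doing that in one pass, assuming the three controller hostnames from this environment and SSH access between them:

[root@leaf1-controller-0 ~]# for h in leaf1-controller-0 leaf1-controller-1 leaf1-controller-2; do ssh $h 'mkdir -p /var/run/ceph'; done
[root@leaf1-controller-0 ~]# pcs resource cleanup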

Comment 19 Chris Janiszewski 2023-06-29 15:34:31 UTC
Hello,

This issue has occurred again on the same cluster. This time, after rebooting the controllers, /var/run/ceph no longer exists:
[root@leaf1-controller-0 ~]# ls /var/run/ceph
ls: cannot access '/var/run/ceph': No such file or directory

Comment 22 Goutham Pacha Ravi 2023-06-29 20:30:01 UTC
(In reply to Chris Janiszewski from comment #19)
> Hello,
> 
> This issue has occurred again on the same cluster, this time after rebooting
> controllers /var/run/ceph no longer exist:
> [root@leaf1-controller-0 ~]# ls /var/run/ceph
> ls: cannot access '/var/run/ceph': No such file or directory

ack; /var/run lives on a tmpfs, so the directory is removed on reboot, and systemd could handle recreating it. Thanks for reporting this.
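
One way for systemd to recreate the runtime directory at every boot is a tmpfiles.d entry. A sketch, with a drop-in file name chosen for illustration (not necessarily where the eventual fix lands):

[root@leaf1-controller-0 ~]# cat /etc/tmpfiles.d/ceph-nfs.conf
# Recreate the ceph-nfs runtime directory on boot
d /var/run/ceph 0755 root root -
[root@leaf1-controller-0 ~]# systemd-tmpfiles --create /etc/tmpfiles.d/ceph-nfs.conf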