Bug 2034554

Summary: ceph-nfs down in Pacemaker
Product: Red Hat OpenStack
Reporter: Rohini Diwakar <rdiwakar>
Component: openstack-manila
Assignee: Goutham Pacha Ravi <gouthamr>
Status: CLOSED DEFERRED
QA Contact: vhariria
Severity: urgent
Docs Contact: ndeevy <ndeevy>
Priority: urgent
Version: 16.1 (Train)
CC: ashrodri, cardasil, cdasilva, ebarrera, gfidente, gouthamr, jveiraca, openstack-manila-bugs, pgrist, schhabdi, shtiwari, vhariria, vimartin
Target Milestone: ---
Keywords: Triaged
Target Release: ---
Hardware: All
OS: All
Last Closed: 2023-01-27 16:40:16 UTC
Type: Bug

Description Rohini Diwakar 2021-12-21 10:03:18 UTC
Description of problem:

Ceph NFS is stopped in pacemaker
+++
  * ceph-nfs    (systemd:ceph-nfs@pacemaker):   Stopped
+++

ceph-nfs tried to start on all controllers one by one and failed on every attempt:
+++
Migration Summary:
  * Node: xxxxctr001 (1):
    * ceph-nfs: migration-threshold=1000000 fail-count=1000000 last-failure=Wed Nov 17 10:18:15 2021:
  * Node: xxxxctr002 (2):
    * ceph-nfs: migration-threshold=1000000 fail-count=1000000 last-failure=Wed Nov 17 10:18:21 2021:
  * Node: xxxxctr003 (3):
    * ceph-nfs: migration-threshold=1000000 fail-count=1000000 last-failure=Wed Nov 17 10:18:26 2021:
+++

Failed Resource Actions:
+++
  * ceph-nfs_start_0 on xxxxctr001 'error' (1): call=608, status='complete', exitreason='', last-rc-change='2021-11-17 10:18:13 +01:00', queued=0ms, exec=2412ms
  * ceph-nfs_start_0 on xxxxctr002 'error' (1): call=280, status='complete', exitreason='', last-rc-change='2021-11-17 10:18:18 +01:00', queued=0ms, exec=2472ms
  * ceph-nfs_start_0 on xxxxctr003 'error' (1): call=253, status='complete', exitreason='', last-rc-change='2021-11-17 10:18:24 +01:00', queued=0ms, exec=2399ms
+++
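
For reference, the per-node fail counts shown above can be inspected (and, once the underlying problem is fixed, reset) directly with pcs; for example, with the resource id ceph-nfs from the status output:

    ### show the accumulated fail counts for the ceph-nfs resource per node
    # pcs resource failcount show ceph-nfs

    ### reset the counts once the underlying start failure has been addressed
    # pcs resource failcount reset ceph-nfs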

The pacemaker logs also show that ceph-nfs failed to start:
+++
Nov 17 10:18:21 xxxxctr002 pacemaker-controld[6429]: notice: Result of start operation for ceph-nfs on xxxxctr002: 1 (error)
Nov 17 10:18:21 xxxxctr002 pacemaker-attrd[6427]: notice: Setting fail-count-ceph-nfs#start_0[xxxxctr002]: (unset) -> INFINITY
Nov 17 10:18:21 xxxxctr002 pacemaker-attrd[6427]: notice: Setting last-failure-ceph-nfs#start_0[xxxxctr002]: (unset) -> 1637140701
+++

When attempting to restart the resource, we get the following error:
+++
[root@xxxxctr002 ~]# pcs resource restart ceph-nfs
Error: Error performing operation: No such device or address ceph-nfs is not running anywhere and so cannot be restarted
+++
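
Since the resource is Stopped on every node, there is nothing for pcs to restart; the start failures themselves can be examined on a controller through the systemd unit listed in the status output (ceph-nfs@pacemaker). A rough example (adjust the timestamp to the failure window):

    ### review the most recent start attempts of the ceph-nfs unit on this controller
    # journalctl -u ceph-nfs@pacemaker --since "2021-11-17 10:00" --no-pager | tail -n 200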

Version-Release number of selected component (if applicable):
RHOSP16.1


Actual results:
ceph-nfs is in the Stopped state

Expected results:
ceph-nfs should be in the Started state

Comment 7 Paul Grist 2021-12-27 15:07:56 UTC
Can someone add more context here as well?  Is this a new deployment or an existing one that is now failing?

Comment 9 Tom Barron 2021-12-27 17:55:51 UTC
(In reply to Paul Grist from comment #7)
> Can someone add more context here as well?  Is this a new deployment or an
> existing one that is now failing?

It appears to be an existing deployment that was working before: there are old manila-share logs (in /var/log/containers) from when that service was running fine, as late as November 13 and as early as August 11 (no logs exist before that). Since the manila-share service requires the ceph-nfs service, which in turn requires the pacemaker-managed IPaddr2 VIP for the StorageNFS network, we can infer that those were working then, even though they are clearly not working now.

So we need to ask the customer what they changed and what they were trying to achieve by the change.

Currently the StorageNFS IPs on the three controllers do not match the StorageNFS allocation range provided in the case, which may indicate an unsuccessful attempt to change the allocation. Also, the address on the third controller is not from the same CIDR as the one specified for the deployment (the CIDR to which the StorageNFS VIP and the StorageNFS addresses on the other two controllers belong).
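
One way to cross-check this on the environment (a sketch; interface names and pcs syntax vary per deployment and release) is to compare the addresses actually configured on each controller against the IPaddr2 VIP definition held by pacemaker:

    ### on each controller, list the configured IPv4 addresses and pick out the StorageNFS ones
    # ip -o -4 addr show

    ### show the pacemaker resource configuration, including the IPaddr2 VIP for StorageNFS
    ### (older pcs releases use "pcs resource show --full" for the same information)
    # pcs resource config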

Comment 25 Goutham Pacha Ravi 2022-01-12 20:37:21 UTC
Reiterating the log to highlight the ceph-nfs failure:


```

Nov 16 06:59:53 xxxxctr002 podman[147075]: a7631b726e7d74c82d4d1d5814759926e6465c18add66043c5a01417dbf79c77
Nov 16 06:59:53 xxxxctr002 systemd[1]: Started Cluster Controlled ceph-nfs@pacemaker.
Nov 16 06:59:53 xxxxctr002 conmon[147402]: 2021-11-16 06:59:53  /opt/ceph-container/bin/entrypoint.sh: static: does not generate config
Nov 16 06:59:53 xxxxctr002 conmon[20061]: 2021-11-16 06:59:53.490 7f2b790d8700  1 ====== starting new request req=0x7f2c9c59b680 =====
Nov 16 06:59:53 xxxxctr002 conmon[20061]: 2021-11-16 06:59:53.490 7f2b790d8700  1 ====== req done req=0x7f2c9c59b680 op status=0 http_status=200 latency=0s ======
Nov 16 06:59:53 xxxxctr002 conmon[20061]: 2021-11-16 06:59:53.490 7f2b790d8700  1 beast: 0x7f2c9c59b680: 192.168.8.78 - - [2021-11-16 06:59:53.0.490459s] "GET /swift/healthcheck HTTP/1.0" 200 0 - - -
Nov 16 06:59:53 xxxxctr002 conmon[147402]: 2021-11-16 06:59:53  /opt/ceph-container/bin/entrypoint.sh: SUCCESS
Nov 16 06:59:53 xxxxctr002 conmon[147402]: exec: PID 57: spawning /usr/bin/ganesha.nfsd  -F -L STDOUT
Nov 16 06:59:53 xxxxctr002 conmon[147402]: exec: Waiting 57 to quit
---------------------------------------
Nov 16 06:59:53 xxxxctr002 conmon[147402]: 16/11/2021 06:59:53 : epoch 619348d9 : xxxxctr002 : ganesha.nfsd-57[main] main :MAIN :EVENT :ganesha.nfsd Starting: Ganesha Version /builddir/build/BUILD/nfs-ganesha-3.3/src, built at Dec  9 2020 02:43:33 on
Nov 16 06:59:53 xxxxctr002 conmon[147402]: 16/11/2021 06:59:53 : epoch 619348d9 : xxxxctr002 : ganesha.nfsd-57[main] cu_rados_url_fetch :CONFIG :EVENT :cu_rados_url_fetch: Failed reading manila_data/ganesha-export-share-aa068dff-75b1-4e31-99df-c81a9ac7716a Unknown error -2


Nov 16 06:59:53 xxxxctr002 conmon[147402]: 16/11/2021 06:59:53 : epoch 619348d9 : xxxxctr002 : ganesha.nfsd-57[main] main :NFS STARTUP :CRIT :Error (token scan) while parsing (/etc/ganesha/ganesha.conf)  <=================


Nov 16 06:59:53 xxxxctr002 conmon[147402]: 16/11/2021 06:59:53 : epoch 619348d9 : xxxxctr002 : ganesha.nfsd-57[main] config_errs_to_log :CONFIG :CRIT :Config File (rados://manila_data/ganesha-export-index:1): new url (rados://manila_data/ganesha-export-share-aa068dff-75b1-4e31-99df-c81a9ac7716a) open error (Unknown error -2), ignored
Nov 16 06:59:53 xxxxctr002 conmon[147402]: 16/11/2021 06:59:53 : epoch 619348d9 : xxxxctr002 : ganesha.nfsd-57[main] main :NFS STARTUP :FATAL :Fatal errors.  Server exiting...
---------------------------------------
Nov 16 06:59:53 xxxxctr002 conmon[147402]: teardown: managing teardown after SIGCHLD
Nov 16 06:59:53 xxxxctr002 conmon[147402]: teardown: Waiting PID 57 to terminate
Nov 16 06:59:53 xxxxctr002 conmon[147402]:
Nov 16 06:59:53 xxxxctr002 conmon[147402]: teardown: Process 57 is terminated
Nov 16 06:59:53 xxxxctr002 conmon[147402]: teardown: Bye Bye, container will die with return code 0
Nov 16 06:59:53 xxxxctr002 systemd[1]: libpod-a7631b726e7d74c82d4d1d5814759926e6465c18add66043c5a01417dbf79c77.scope: Consumed 348ms CPU time
Nov 16 06:59:53 xxxxctr002 podman[147620]: 2021-11-16 06:59:53.712572988 +0100 CET m=+0.066926769 container died a7631b726e7d74c82d4d1d5814759926e6465c18add66043c5a01417dbf79c77 

```


The root cause here was that a rados export configuration object (rados://manila_data/ganesha-export-share-aa068dff-75b1-4e31-99df-c81a9ac7716a) was deleted, but the reference to it was still retained in the ganesha-export-index. It's possible there's a race condition in the object update routine in manila [1]; we're still verifying that.
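
To check whether more than one export is affected, the index can be scanned for references to objects that no longer exist. A minimal sketch, run from the same ceph client environment as the workaround below, assuming the index references export objects named ganesha-export-share-<id> in the manila_data pool (as seen in the log above):

    ### flag every export object referenced by the index that is missing from the pool
    rados -p manila_data get ganesha-export-index - \
      | grep -oE 'ganesha-export-share-[0-9a-f-]+' \
      | sort -u \
      | while read -r obj; do
          rados -p manila_data stat "$obj" >/dev/null 2>&1 || echo "missing: $obj"
        done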

The workaround for this situation is to edit the ganesha-export-index and remove the offending line:

    ### Perform these steps with a client that has access (network connectivity, permissions) to the ceph cluster. 
    ### For example, by "exec"ing into the ceph-mon container from one of the controller nodes.
    ### To isolate the issue, verify that the URL from the logs has really been deleted:

    $ rados -p manila_data get ganesha-export-share-aa068dff-75b1-4e31-99df-c81a9ac7716a -
    error getting manila_data/ganesha-export-share-aa068dff-75b1-4e31-99df-c81a9ac7716a (2) No such file or directory


    ### copy the contents of the ganesha-export-index into a file
    $ rados -p manila_data get ganesha-export-index saved_exports.txt

    ### edit saved_exports.txt to remove the offending line pertaining to the missing export

    ### put the contents back into ganesha-export-index
    $ rados -p manila_data put ganesha-export-index saved_exports.txt


If pacemaker is still retrying the resource, this change should allow ceph-nfs to start up without errors on the next attempt. However, if pacemaker's retries are exhausted (the fail-count has reached INFINITY, as in the status output above), clear the failed state from one of the controller nodes:

   # pcs resource cleanup
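
The cleanup can also be scoped to this single resource (pcs resource cleanup ceph-nfs). Afterwards, pacemaker should attempt a fresh start and ceph-nfs should be reported as Started on one of the controllers; for example:

    ### confirm that ceph-nfs is running again
    # pcs status | grep ceph-nfs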


[1] https://opendev.org/openstack/manila/src/commit/a8307109395e3535085190c40325cd46dd355c78/manila/share/drivers/ganesha/manager.py#L439-L457

Comment 33 Red Hat Bugzilla 2023-09-18 04:29:34 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days