Bug 2034554
| Summary: | ceph-nfs down in Pacemaker | | |
|---|---|---|---|
| Product: | Red Hat OpenStack | Reporter: | Rohini Diwakar <rdiwakar> |
| Component: | openstack-manila | Assignee: | Goutham Pacha Ravi <gouthamr> |
| Status: | CLOSED DEFERRED | QA Contact: | vhariria |
| Severity: | urgent | Docs Contact: | ndeevy <ndeevy> |
| Priority: | urgent | | |
| Version: | 16.1 (Train) | CC: | ashrodri, cardasil, cdasilva, ebarrera, gfidente, gouthamr, jveiraca, openstack-manila-bugs, pgrist, schhabdi, shtiwari, vhariria, vimartin |
| Target Milestone: | --- | Keywords: | Triaged |
| Target Release: | --- | | |
| Hardware: | All | | |
| OS: | All | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2023-01-27 16:40:16 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Can someone add more context here as well? Is this a new deployment or an existing one that is now failing?

(In reply to Paul Grist from comment #7)
> Can someone add more context here as well? Is this a new deployment or an
> existing one that is now failing?

It appears to be an existing deployment that was working before: there are old manila-share logs (in /var/log/containers) from when that service was running fine, as late as November 13 and as early as August 11 (no logs exist before that). Since the manila-share service requires the ceph-nfs service, and ceph-nfs in turn requires the Pacemaker-managed IPaddr2 VIP for the StorageNFS network, we can infer that all of those were working then, even though they are clearly not working now.

So we need to ask the customer what they changed and what they were trying to achieve with the change. Currently the StorageNFS IPs on the three controllers do not match the allocation range for StorageNFS that was provided in the case, which suggests an unsuccessful attempt to change the allocation. Also, the address on the third controller is not from the same CIDR as the one specified for the deployment (the CIDR to which the StorageNFS VIP and the StorageNFS addresses on the other two controllers belong). A sketch for collecting these addresses for comparison follows the log excerpt below.

Reiterating the log to highlight the ceph-nfs failure:
```
Nov 16 06:59:53 xxxxctr002 podman[147075]: a7631b726e7d74c82d4d1d5814759926e6465c18add66043c5a01417dbf79c77
Nov 16 06:59:53 xxxxctr002 systemd[1]: Started Cluster Controlled ceph-nfs@pacemaker.
Nov 16 06:59:53 xxxxctr002 conmon[147402]: 2021-11-16 06:59:53 /opt/ceph-container/bin/entrypoint.sh: static: does not generate config
Nov 16 06:59:53 xxxxctr002 conmon[20061]: 2021-11-16 06:59:53.490 7f2b790d8700 1 ====== starting new request req=0x7f2c9c59b680 =====
Nov 16 06:59:53 xxxxctr002 conmon[20061]: 2021-11-16 06:59:53.490 7f2b790d8700 1 ====== req done req=0x7f2c9c59b680 op status=0 http_status=200 latency=0s ======
Nov 16 06:59:53 xxxxctr002 conmon[20061]: 2021-11-16 06:59:53.490 7f2b790d8700 1 beast: 0x7f2c9c59b680: 192.168.8.78 - - [2021-11-16 06:59:53.0.490459s] "GET /swift/healthcheck HTTP/1.0" 200 0 - - -
Nov 16 06:59:53 xxxxctr002 conmon[147402]: 2021-11-16 06:59:53 /opt/ceph-container/bin/entrypoint.sh: SUCCESS
Nov 16 06:59:53 xxxxctr002 conmon[147402]: exec: PID 57: spawning /usr/bin/ganesha.nfsd -F -L STDOUT
Nov 16 06:59:53 xxxxctr002 conmon[147402]: exec: Waiting 57 to quit
---------------------------------------
Nov 16 06:59:53 xxxxctr002 conmon[147402]: 16/11/2021 06:59:53 : epoch 619348d9 : xxxxctr002 : ganesha.nfsd-57[main] main :MAIN :EVENT :ganesha.nfsd Starting: Ganesha Version /builddir/build/BUILD/nfs-ganesha-3.3/src, built at Dec 9 2020 02:43:33 on
Nov 16 06:59:53 xxxxctr002 conmon[147402]: 16/11/2021 06:59:53 : epoch 619348d9 : xxxxctr002 : ganesha.nfsd-57[main] cu_rados_url_fetch :CONFIG :EVENT :cu_rados_url_fetch: Failed reading manila_data/ganesha-export-share-aa068dff-75b1-4e31-99df-c81a9ac7716a Unknown error -2
Nov 16 06:59:53 xxxxctr002 conmon[147402]: 16/11/2021 06:59:53 : epoch 619348d9 : xxxxctr002 : ganesha.nfsd-57[main] main :NFS STARTUP :CRIT :Error (token scan) while parsing (/etc/ganesha/ganesha.conf) <=================
Nov 16 06:59:53 xxxxctr002 conmon[147402]: 16/11/2021 06:59:53 : epoch 619348d9 : xxxxctr002 : ganesha.nfsd-57[main] config_errs_to_log :CONFIG :CRIT :Config File (rados://manila_data/ganesha-export-index:1): new url (rados://manila_data/ganesha-export-share-aa068dff-75b1-4e31-99df-c81a9ac7716a) open error (Unknown error -2), ignored
Nov 16 06:59:53 xxxxctr002 conmon[147402]: 16/11/2021 06:59:53 : epoch 619348d9 : xxxxctr002 : ganesha.nfsd-57[main] main :NFS STARTUP :FATAL :Fatal errors. Server exiting...
---------------------------------------
Nov 16 06:59:53 xxxxctr002 conmon[147402]: teardown: managing teardown after SIGCHLD
Nov 16 06:59:53 xxxxctr002 conmon[147402]: teardown: Waiting PID 57 to terminate
Nov 16 06:59:53 xxxxctr002 conmon[147402]:
Nov 16 06:59:53 xxxxctr002 conmon[147402]: teardown: Process 57 is terminated
Nov 16 06:59:53 xxxxctr002 conmon[147402]: teardown: Bye Bye, container will die with return code 0
Nov 16 06:59:53 xxxxctr002 systemd[1]: libpod-a7631b726e7d74c82d4d1d5814759926e6465c18add66043c5a01417dbf79c77.scope: Consumed 348ms CPU time
Nov 16 06:59:53 xxxxctr002 podman[147620]: 2021-11-16 06:59:53.712572988 +0100 CET m=+0.066926769 container died a7631b726e7d74c82d4d1d5814759926e6465c18add66043c5a01417dbf79c77
```
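Following up on the address mismatch noted above, the StorageNFS VIP and the per-controller addresses can be collected for comparison with the intended allocation range roughly like this (a sketch; the resource and network names vary per deployment):

```
### On any controller: list the Pacemaker resources, including the IPaddr2
### VIP that ceph-nfs depends on (the exact resource name varies).
pcs status resources
pcs resource config | grep -B2 -A6 -i ipaddr2

### On each controller: show the configured addresses and compare them with
### the StorageNFS CIDR / allocation range in the deployment's network
### environment files.
ip -o addr show
```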
The root cause here was that a rados export configuration object (rados://manila_data/ganesha-export-share-aa068dff-75b1-4e31-99df-c81a9ac7716a) was deleted, but a reference to it was still retained in the ganesha-export-index. It's possible there's a race condition in the export-index update routine in manila [1]; we're still verifying.
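To illustrate the suspected race, here is a generic sketch of a read-modify-write conflict on the index object, not a transcript of manila's actual code path; "share-X" and "share-Y" are placeholder names:

```
## Worker A: deleting export share-X
rados -p manila_data get ganesha-export-index /tmp/index.A   # (1) read index
# (2) A removes the line referencing ganesha-export-share-X from /tmp/index.A

## Worker B: adding export share-Y, running concurrently
rados -p manila_data get ganesha-export-index /tmp/index.B   # (3) read index
                                                             #     (still lists share-X)
# (4) B appends a line referencing ganesha-export-share-Y to /tmp/index.B

## Worker A finishes first
rados -p manila_data put ganesha-export-index /tmp/index.A   # (5) index no longer lists share-X
rados -p manila_data rm ganesha-export-share-X               # (6) export object deleted

## Worker B overwrites A's update
rados -p manila_data put ganesha-export-index /tmp/index.B   # (7) share-X is back in the index,
                                                             #     but its object is gone
```

On the next ganesha start, the index still points at the deleted object, producing exactly the "open error (Unknown error -2)" seen in the log above.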
The workaround for this situation is to edit the ganesha-export-index and remove the offending line:
```
### Perform these steps from a client that has access (network connectivity,
### permissions) to the Ceph cluster -- for example, by "exec"ing into the
### ceph-mon container on one of the controller nodes.

### To isolate the issue, verify that the URL from the logs really is gone:
$ rados -p manila_data get ganesha-export-share-aa068dff-75b1-4e31-99df-c81a9ac7716a -
error getting manila_data/ganesha-export-share-aa068dff-75b1-4e31-99df-c81a9ac7716a (2) No such file or directory

### Copy the contents of the ganesha-export-index into a file:
$ rados -p manila_data get ganesha-export-index saved_exports.txt

### Edit saved_exports.txt and remove the offending line pertaining to the
### missing export.

### Put the contents back into ganesha-export-index:
$ rados -p manila_data put ganesha-export-index saved_exports.txt
```
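If more than one export might be affected, a rough way to find every dangling reference before editing the index (a sketch; it assumes the index references objects by the ganesha-export-share-<id> naming seen above):

```
### List every export object referenced by the index and flag the ones that
### no longer exist in the pool (adjust pool and name pattern as needed).
rados -p manila_data get ganesha-export-index - \
  | grep -o 'ganesha-export-share-[0-9a-f-]*' \
  | sort -u \
  | while read -r obj; do
      rados -p manila_data stat "$obj" >/dev/null 2>&1 || echo "dangling: $obj"
    done
```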
At this point Pacemaker is likely still retrying the resource in a loop, and with the index fixed ceph-nfs should start up without errors. However, if Pacemaker's retries are exhausted, you can clear the failed state from one of the controller nodes:
```
# pcs resource cleanup
```
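To confirm the recovery afterwards (a sketch; the cleanup can also be limited to the ceph-nfs resource):

```
### Clear only the ceph-nfs failure history, then watch it come back.
# pcs resource cleanup ceph-nfs
# pcs resource failcount show ceph-nfs
# pcs status | grep -A1 ceph-nfs
```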
[1] https://opendev.org/openstack/manila/src/commit/a8307109395e3535085190c40325cd46dd355c78/manila/share/drivers/ganesha/manager.py#L439-L457
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days.
Description of problem:

Ceph NFS is stopped in Pacemaker:

```
* ceph-nfs (systemd:ceph-nfs@pacemaker): Stopped
```

ceph-nfs tried to start on all controllers one by one and failed on every attempt:

```
Migration Summary:
  * Node: xxxxctr001 (1):
    * ceph-nfs: migration-threshold=1000000 fail-count=1000000 last-failure=Wed Nov 17 10:18:15 2021:
  * Node: xxxxctr002 (2):
    * ceph-nfs: migration-threshold=1000000 fail-count=1000000 last-failure=Wed Nov 17 10:18:21 2021:
  * Node: xxxxctr003 (3):
    * ceph-nfs: migration-threshold=1000000 fail-count=1000000 last-failure=Wed Nov 17 10:18:26 2021:

Failed Resource Actions:
  * ceph-nfs_start_0 on xxxxctr001 'error' (1): call=608, status='complete', exitreason='', last-rc-change='2021-11-17 10:18:13 +01:00', queued=0ms, exec=2412ms
  * ceph-nfs_start_0 on xxxxctr002 'error' (1): call=280, status='complete', exitreason='', last-rc-change='2021-11-17 10:18:18 +01:00', queued=0ms, exec=2472ms
  * ceph-nfs_start_0 on xxxxctr003 'error' (1): call=253, status='complete', exitreason='', last-rc-change='2021-11-17 10:18:24 +01:00', queued=0ms, exec=2399ms
```

The pacemaker logs likewise show ceph-nfs failing to start:

```
Nov 17 10:18:21 xxxxctr002 pacemaker-controld[6429]: notice: Result of start operation for ceph-nfs on xxxxctr002: 1 (error)
Nov 17 10:18:21 xxxxctr002 pacemaker-attrd[6427]: notice: Setting fail-count-ceph-nfs#start_0[xxxxctr002]: (unset) -> INFINITY
Nov 17 10:18:21 xxxxctr002 pacemaker-attrd[6427]: notice: Setting last-failure-ceph-nfs#start_0[xxxxctr002]: (unset) -> 1637140701
```

Restarting the resource returns the following error:

```
[root@xxxxctr002 ~]# pcs resource restart ceph-nfs
Error: Error performing operation: No such device or address
ceph-nfs is not running anywhere and so cannot be restarted
```

Version-Release number of selected component (if applicable):
RHOSP 16.1

Actual results:
ceph-nfs is in stopped state

Expected results:
ceph-nfs should be in started state
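For anyone hitting the same symptom, the failing start can be traced from Pacemaker down to the ganesha container roughly like this (a sketch; the unit and container names are taken from the logs above and may differ in other deployments):

```
### Show the resource state and its failure counters.
# pcs status --full
# pcs resource failcount show ceph-nfs

### The underlying systemd unit and container logs carry the real error
### (here: ganesha failing to read its rados-backed export config).
# journalctl -u ceph-nfs@pacemaker --since "1 hour ago"
# podman ps -a | grep -i nfs
```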