Bug 2405397
| Summary: | cephadm crashes and doesn't recover with ganesha-rados-grace tool failed: Failure: -126 | | |
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat Ceph Storage | Reporter: | Omid Yoosefi <omidyoosefi> |
| Component: | Cephadm | Assignee: | Shweta Bhosale <shbhosal> |
| Status: | CLOSED ERRATA | QA Contact: | Manisha Saini <msaini> |
| Severity: | high | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 8.1 | CC: | adking, akane, bkunal, cephqe-warriors, hacharya, shbhosal, spunadik, tserlin, vdas |
| Target Milestone: | --- | | |
| Target Release: | 9.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | ceph-20.1.0-126 | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | | |
| : | 2421434 (view as bug list) | Environment: | |
| Last Closed: | 2026-01-29 07:02:22 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 2421434 | | |
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: Red Hat Ceph Storage 9.0 Security and Enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2026:1536

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days or the product is inactive and locked.
Description of problem:

When provisioning and deprovisioning a large number of NFS daemons with the concurrency changes, the NFS grace tool hits an exception and crashes the cephadm module. After that, the orchestrator will not work until the mgr daemon is restarted or failed over.

Version-Release number of selected component (if applicable):

19.2.1-245.0.hotfix.BYOK.el9cp

How reproducible:

50%

Steps to Reproduce:
1. Provision a large number of NFS daemons using a single spec apply
2. Delete all the daemons
3. Watch the mgr logs or ceph -s for HEALTH_ERR

Actual results:

The cephadm orchestrator stops working and needs a mgr restart/failover to continue.

Expected results:

The cephadm orchestrator handles the error and retries without user intervention.

Additional info:

```
2025-10-21T01:43:57.547+0000 7f77c7153640 0 [cephadm INFO cephadm.services.nfs] Fencing old nfs.r134-4bd98973-2380-4530-aa1e-6c838295028b.2.7.dal1-qz2-sr5-rk025-s28.dgynzp
2025-10-21T01:43:57.547+0000 7f77c7153640 0 log_channel(cephadm) log [INF] : Fencing old nfs.r134-4bd98973-2380-4530-aa1e-6c838295028b.2.7.dal1-qz2-sr5-rk025-s28.dgynzp
2025-10-21T01:43:57.547+0000 7f77c7153640 0 [cephadm INFO cephadm.services.nfs] Removing key for client.nfs.r134-4bd98973-2380-4530-aa1e-6c838295028b.2.7.dal1-qz2-sr5-rk025-s28.dgynzp
2025-10-21T01:43:57.547+0000 7f77c7153640 0 log_channel(cephadm) log [INF] : Removing key for client.nfs.r134-4bd98973-2380-4530-aa1e-6c838295028b.2.7.dal1-qz2-sr5-rk025-s28.dgynzp
2025-10-21T01:43:57.585+0000 7f77c7153640 0 [cephadm INFO cephadm.services.nfs] Fencing old nfs.r134-4bd98973-2380-4530-aa1e-6c838295028b.2.8.dal3-qz2-sr3-rk279-s28.qasdbj
2025-10-21T01:43:57.585+0000 7f77c7153640 0 log_channel(cephadm) log [INF] : Fencing old nfs.r134-4bd98973-2380-4530-aa1e-6c838295028b.2.8.dal3-qz2-sr3-rk279-s28.qasdbj
2025-10-21T01:43:57.585+0000 7f77c7153640 0 [cephadm INFO cephadm.services.nfs] Removing key for client.nfs.r134-4bd98973-2380-4530-aa1e-6c838295028b.2.8.dal3-qz2-sr3-rk279-s28.qasdbj
2025-10-21T01:43:57.585+0000 7f77c7153640 0 log_channel(cephadm) log [INF] : Removing key for client.nfs.r134-4bd98973-2380-4530-aa1e-6c838295028b.2.8.dal3-qz2-sr3-rk279-s28.qasdbj
2025-10-21T01:43:57.620+0000 7f77c7153640 0 [cephadm INFO cephadm.services.nfs] Fencing old nfs.r134-4bd98973-2380-4530-aa1e-6c838295028b.2.9.dal4-qz2-sr1-rk114-s48.isrjtf
2025-10-21T01:43:57.620+0000 7f77c7153640 0 log_channel(cephadm) log [INF] : Fencing old nfs.r134-4bd98973-2380-4530-aa1e-6c838295028b.2.9.dal4-qz2-sr1-rk114-s48.isrjtf
2025-10-21T01:43:57.620+0000 7f77c7153640 0 [cephadm INFO cephadm.services.nfs] Removing key for client.nfs.r134-4bd98973-2380-4530-aa1e-6c838295028b.2.9.dal4-qz2-sr1-rk114-s48.isrjtf
2025-10-21T01:43:57.620+0000 7f77c7153640 0 log_channel(cephadm) log [INF] : Removing key for client.nfs.r134-4bd98973-2380-4530-aa1e-6c838295028b.2.9.dal4-qz2-sr1-rk114-s48.isrjtf
2025-10-21T01:43:58.138+0000 7f77c7153640 0 [cephadm INFO cephadm.services.nfs] Fencing old nfs.r134-4bd98973-2380-4530-aa1e-6c838295028b.4.0.dal2-qz2-sr2-rk089-s28.maofjs
2025-10-21T01:43:58.138+0000 7f77c7153640 0 log_channel(cephadm) log [INF] : Fencing old nfs.r134-4bd98973-2380-4530-aa1e-6c838295028b.4.0.dal2-qz2-sr2-rk089-s28.maofjs
2025-10-21T01:43:58.138+0000 7f77c7153640 0 [cephadm INFO cephadm.services.nfs] Removing key for client.nfs.r134-4bd98973-2380-4530-aa1e-6c838295028b.4.0.dal2-qz2-sr2-rk089-s28.maofjs
2025-10-21T01:43:58.138+0000 7f77c7153640 0 log_channel(cephadm) log [INF] : Removing key for client.nfs.r134-4bd98973-2380-4530-aa1e-6c838295028b.4.0.dal2-qz2-sr2-rk089-s28.maofjs
2025-10-21T01:43:58.144+0000 7f77c7153640 0 [cephadm INFO root] Removing 4 from the ganesha grace table
2025-10-21T01:43:58.144+0000 7f77c7153640 0 log_channel(cephadm) log [INF] : Removing 4 from the ganesha grace table
2025-10-21T01:43:58.150+0000 7f77d582b640 0 log_channel(cluster) log [DBG] : pgmap v38: 6721 pgs: 6721 active+clean; 2.3 TiB data, 9.4 TiB used, 177 TiB / 186 TiB avail; 16 KiB/s rd, 0 B/s wr, 19 op/s
2025-10-21T01:43:58.268+0000 7f77c7153640 0 [cephadm WARNING root] ganesha-rados-grace tool failed: Failure: -126
2025-10-21T01:43:58.269+0000 7f77c7153640 0 log_channel(cephadm) log [WRN] : ganesha-rados-grace tool failed: Failure: -126
2025-10-21T01:43:58.305+0000 7f77c7153640 -1 log_channel(cluster) log [ERR] : Unhandled exception from module 'cephadm' while running on mgr.dal1-qz2-sr5-rk025-s38.vyhojz: grace tool failed: Failure: -126
2025-10-21T01:43:58.305+0000 7f77c7153640 -1 cephadm.serve:
2025-10-21T01:43:58.305+0000 7f77c7153640 -1 Traceback (most recent call last):
  File "/usr/share/ceph/mgr/cephadm/module.py", line 865, in serve
    serve.serve()
  File "/usr/share/ceph/mgr/cephadm/serve.py", line 120, in serve
    if self._apply_all_services():
  File "/usr/share/ceph/mgr/cephadm/serve.py", line 764, in _apply_all_services
    svc.fence_old_ranks(spec, ranking_map, len(daemons))
  File "/usr/share/ceph/mgr/cephadm/services/nfs.py", line 315, in fence_old_ranks
    self.run_grace_tool(cast(NFSServiceSpec, spec), 'remove', nodeid)
  File "/usr/share/ceph/mgr/cephadm/services/nfs.py", line 874, in run_grace_tool
    raise RuntimeError(f'grace tool failed: {result.stderr.decode("utf-8")}')
RuntimeError: grace tool failed: Failure: -126
```
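For reference, Ceph and RADOS interfaces commonly report failures as negative errno values, so `Failure: -126` likely corresponds to errno 126. On Linux that is ENOKEY ("Required key not available"), which would be consistent with the key removal logged immediately before the grace-tool call above. This interpretation is an assumption, not confirmed in the bug; a quick check of the errno mapping (assumes a Linux runtime, since errno names are platform-specific):

```python
import errno
import os

# NOTE: assumption for illustration: "Failure: -126" is treated here as a
# negative errno. On Linux, errno 126 is ENOKEY ("Required key not available").
code = 126
name = errno.errorcode.get(code, "unknown on this platform")
message = os.strerror(code)
print(f"errno {code}: {name}: {message}")
```

If the mapping holds, the grace tool was invoked with credentials that had already been deleted during fencing.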
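The traceback shows the RuntimeError raised in `run_grace_tool` propagating all the way out of the cephadm serve loop, which is what leaves the orchestrator dead until a mgr restart/failover. A minimal sketch of the expected behavior described above, catching the per-daemon failure so one serve pass logs it and a later pass can retry (hypothetical names throughout: `GraceToolError`, `serve_once`, and the simulated `run_grace_tool` are illustrations, not the actual cephadm fix):

```python
import logging

log = logging.getLogger("cephadm-sketch")


class GraceToolError(RuntimeError):
    """Raised when the ganesha-rados-grace helper exits non-zero (simulated)."""


def run_grace_tool(action: str, nodeid: str, fail: bool = False) -> None:
    # Stand-in for the real helper invocation; here it only simulates
    # the failure seen in the logs.
    if fail:
        raise GraceToolError(f"grace tool failed: Failure: -126 ({action} {nodeid})")


def serve_once(tasks: list) -> list:
    """One pass of a serve loop that survives per-task grace-tool failures."""
    errors = []
    for action, nodeid, fail in tasks:
        try:
            run_grace_tool(action, nodeid, fail=fail)
        except GraceToolError as exc:
            # Log and continue instead of letting the exception unwind the
            # whole module thread; the next scheduled pass can retry.
            log.warning("grace tool failed for %s, will retry: %s", nodeid, exc)
            errors.append((nodeid, str(exc)))
    return errors


tasks = [("remove", "4", True), ("remove", "5", False)]
errs = serve_once(tasks)
```

With this containment, one failing nodeid is recorded and skipped while the rest of the pass completes, instead of the whole serve loop dying as in the traceback.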