Bug 1984881
| Summary: | [RADOS] MGR daemon crashed saying - FAILED ceph_assert(pending_service_map.epoch > service_map.epoch) | | |
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat Ceph Storage | Reporter: | Vasishta <vashastr> |
| Component: | RADOS | Assignee: | Neha Ojha <nojha> |
| Status: | CLOSED ERRATA | QA Contact: | skanta |
| Severity: | medium | Docs Contact: | Masauso Lungu <mlungu> |
| Priority: | unspecified | | |
| Version: | 5.0 | CC: | akupczyk, bhubbard, ceph-eng-bugs, glaw, mlungu, nojha, pdhange, pnataraj, rzarzyns, skanta, sseshasa, tserlin, vumrao |
| Target Milestone: | --- | | |
| Target Release: | 6.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | ceph-17.2.3-11.el9cp | Doc Type: | Bug Fix |
| Story Points: | --- | | |
| Clone Of: | | | |
| | 2095062 (view as bug list) | Environment: | |
| Last Closed: | 2023-03-20 18:55:34 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 2095062, 2126050 | | |

Doc Text:

.The Ceph Manager check that deals with the initial service map is now relaxed
Previously, when upgrading a cluster, the newly activated Ceph Manager could receive several `service_map` versions from the previously active Ceph Manager. Due to an incorrect check in the code, the manager daemon crashed when the newly activated manager received a map with a higher version sent by the previously active manager.
With this fix, the check in the Ceph Manager that deals with the initial service map is relaxed so that service maps are validated correctly, and no assertion failure occurs during Ceph Manager failover.
@Neha, I am able to see the same issue while upgrading from 5.0 to 5.1. The upgrade was successful; however, ceph health reports 1 mgr and 2 mon daemon crashes.

Upgraded from 5.0 (ceph version 16.2.0-141.el8cp) to 5.1 (ceph version 16.2.6-20.el8cp).

Ceph health status below:

```
[ceph: root@magna031 /]# ceph health
HEALTH_WARN 3 daemons have recently crashed
[ceph: root@magna031 /]# ceph health detail
HEALTH_WARN 3 daemons have recently crashed
[WRN] RECENT_CRASH: 3 daemons have recently crashed
    mgr.magna006.vxieja crashed on host magna006 at 2021-11-09T16:58:47.494357Z
    mon.magna031 crashed on host magna031 at 2021-11-09T17:28:14.312133Z
    mon.magna032 crashed on host magna032 at 2021-11-09T17:28:14.289100Z
[ceph: root@magna031 /]#
```

Collected crash info and pasted here -> http://pastebin.test.redhat.com/1007149

Will collect mgr and mon logs for the same.
@vikhyat, The setup does not have coredump configured. We need to configure it and reproduce the issue to collect core dumps. I will try to reproduce the issue on the new cluster and collect a core dump if it reproduces.

@Vikhyat, I have attached the crash logs captured from the mgr and mon nodes of the cluster where the issue was seen.

(In reply to Preethi from comment #21)
> @Vikhyat, I have attached the crash logs captured from the mgr and mon nodes of the cluster where the issue was seen.

@Preethi The BZ attachments from comment#15 to comment#20 do not contain a coredump of the mgr daemon. Can you provide the coredump file for the mgr crash, located in the /var/lib/systemd/coredump/ directory? The file name starts with "core.", e.g. for an osd daemon:

```
$ ls -ltr /var/lib/systemd/coredump/
total 55676
-rw-r-----. 1 root root 57010176 Oct  7 22:19 core.ceph-osd.167.fdb8f3e50a094893b7840041c8af5164.7509.1607379543000000.lz4
```

Let us know if you cannot find the coredump in the /var/lib/systemd/coredump/ directory on the node where the mgr daemon crashed. Refer to KCS solution https://access.redhat.com/solutions/3968111; in pre-rhcs-5.x releases we had to edit the ceph-mgr service file to add a privilege to the docker/podman run command to allow coredump generation in a containerized environment. (Refer to section *Second* -- you do not need to trigger coredump generation; when the mgr crashes, you will find the coredump in the /var/lib/systemd/coredump/ directory.)

@Prashant, We have upgraded the cluster to the latest version and there is no core dump on the system now.

*** Bug 2095032 has been marked as a duplicate of this bug. ***

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Red Hat Ceph Storage 6.0 Bug Fix update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2023:1360
Description of problem:
In a cluster upgraded from RHCS 4.x to 5.x, the MGR daemon had crashed when checked later. On the upgraded cluster, only the NFS conf was modified and some IO was performed.

Version-Release number of selected component (if applicable):
"ceph_version": "16.2.0-102.el8cp"

How reproducible:
Once

Steps to Reproduce:
1. Configure 4.x with NFS Ganesha on RGW, perform IO
2. Upgrade the cluster to 5.x
3. Modify the NFS configuration, perform IO

Actual results:

```
"assert_line": 2928,
"assert_msg": "/builddir/build/BUILD/ceph-16.2.0/src/mgr/DaemonServer.cc: In function 'DaemonServer::got_service_map()::<lambda(const ServiceMap&)>' thread 7ffbac41b700 time 2021-07-18T17:14:00.116180+0000\n/builddir/build/BUILD/ceph-16.2.0/src/mgr/DaemonServer.cc: 2928: FAILED ceph_assert(pending_service_map.epoch > service_map.epoch)\n",
```

Expected results:
No crashes

Additional info: