Description of problem:
In a cluster upgraded from RHCS 4.x to 5.x, the MGR daemon was found to have crashed when checked later. On the upgraded cluster, only the NFS configuration was modified and some IO was performed.

Version-Release number of selected component (if applicable):
"ceph_version": "16.2.0-102.el8cp"

How reproducible:
Once

Steps to Reproduce:
1. Configure 4.x with NFS Ganesha on RGW, perform IO
2. Upgrade the cluster to 5.x
3. Modify the NFS configuration, perform IO

Actual results:
"assert_line": 2928,
"assert_msg": "/builddir/build/BUILD/ceph-16.2.0/src/mgr/DaemonServer.cc: In function 'DaemonServer::got_service_map()::<lambda(const ServiceMap&)>' thread 7ffbac41b700 time 2021-07-18T17:14:00.116180+0000\n/builddir/build/BUILD/ceph-16.2.0/src/mgr/DaemonServer.cc: 2928: FAILED ceph_assert(pending_service_map.epoch > service_map.epoch)\n",

Expected results:
No crashes

Additional info:
@Neha, I am able to see the same issue while upgrading from 5.0 to 5.1. The upgrade was successful; however, ceph health reports 1 mgr and 2 mon daemon crashes. Upgraded from 5.0 (ceph version 16.2.0-141.el8cp) to 5.1 (ceph version 16.2.6-20.el8cp).

Ceph health status below:

[ceph: root@magna031 /]# ceph health
HEALTH_WARN 3 daemons have recently crashed
[ceph: root@magna031 /]# ceph health detail
HEALTH_WARN 3 daemons have recently crashed
[WRN] RECENT_CRASH: 3 daemons have recently crashed
    mgr.magna006.vxieja crashed on host magna006 at 2021-11-09T16:58:47.494357Z
    mon.magna031 crashed on host magna031 at 2021-11-09T17:28:14.312133Z
    mon.magna032 crashed on host magna032 at 2021-11-09T17:28:14.289100Z
[ceph: root@magna031 /]#

Collected crash info and pasted here -> http://pastebin.test.redhat.com/1007149

Will collect mgr and mon logs for the same.
@vikhyat, The setup does not have coredumps configured. We need to configure them and reproduce the issue to collect core dumps. I will try to reproduce the issue on a new cluster and collect a core dump if it reproduces.
@Vikhyat, I have attached the crash logs captured from the mgr and mon nodes of the cluster where the issue was seen.
(In reply to Preethi from comment #21)
> @Vikhyat, I have attached crash logs captured from mgr and mon nodes from
> the cluster where issue was seen,

@Preethi The BZ attachments from comment#15 to comment#20 do not contain a coredump of the mgr daemon. Can you provide the coredump file for the mgr crash, located in the /var/lib/systemd/coredump/ directory? The file name starts with "core.", e.g. for an osd daemon:

$ ls -ltr /var/lib/systemd/coredump/
total 55676
-rw-r-----. 1 root root 57010176 Oct 7 22:19 core.ceph-osd.167.fdb8f3e50a094893b7840041c8af5164.7509.1607379543000000.lz4

Let us know if you cannot find the coredump in the /var/lib/systemd/coredump/ directory on the node where the mgr daemon crashed. Refer to KCS solution https://access.redhat.com/solutions/3968111 ; in pre rhcs-5.x releases we had to edit the ceph-mgr service file to add a privilege to the docker/podman run command to allow coredump generation in a containerized environment (refer to section *Second* -- you do not need to trigger coredump generation; when the mgr crashes, you will find the coredump in the /var/lib/systemd/coredump/ directory).
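For reference, one generic way to allow coredump generation for a daemon run under systemd is to raise the core file size limit via a unit drop-in. The fragment below is an illustrative sketch only -- the unit name and path are assumptions, and the exact RHCS containerized procedure is the one in the KCS article above:

```ini
# Illustrative drop-in, e.g. /etc/systemd/system/ceph-mgr@.service.d/coredump.conf
# (the unit name varies by deployment -- check `systemctl list-units 'ceph*'`)
[Service]
# Remove the core file size limit so a full coredump can be written
LimitCORE=infinity
```

After adding a drop-in, run `systemctl daemon-reload` and restart the daemon; a subsequent crash should then leave a `core.*` file under /var/lib/systemd/coredump/, assuming systemd-coredump is the configured core handler on the host.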
@Prashant, We have upgraded the cluster to the latest version, and there is no core dump on the system now.
*** Bug 2095032 has been marked as a duplicate of this bug. ***
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Red Hat Ceph Storage 6.0 Bug Fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2023:1360