
Bug 1984881

Summary: [RADOS] MGR daemon crashed saying - FAILED ceph_assert(pending_service_map.epoch > service_map.epoch)
Product: [Red Hat Storage] Red Hat Ceph Storage
Reporter: Vasishta <vashastr>
Component: RADOS
Assignee: Neha Ojha <nojha>
Status: CLOSED ERRATA
QA Contact: skanta
Severity: medium
Docs Contact: Masauso Lungu <mlungu>
Priority: unspecified
Version: 5.0
CC: akupczyk, bhubbard, ceph-eng-bugs, glaw, mlungu, nojha, pdhange, pnataraj, rzarzyns, skanta, sseshasa, tserlin, vumrao
Target Milestone: ---   
Target Release: 6.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: ceph-17.2.3-11.el9cp
Doc Type: Bug Fix
Doc Text:
.The Ceph Manager check that deals with the initial service map is now relaxed
Previously, when upgrading a cluster, the newly active Ceph Manager would receive several `service_map` versions from the previously active Ceph Manager. Due to an incorrect check in the code, the manager daemon crashed when the newly activated manager received a map with a higher version than its own pending map. With this fix, the check in the Ceph Manager that deals with the initial service map is relaxed so that service maps are handled correctly, and no assertion failure occurs during Ceph Manager failover.
Story Points: ---
Clone Of:
: 2095062 (view as bug list)
Environment:
Last Closed: 2023-03-20 18:55:34 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 2095062, 2126050    

Description Vasishta 2021-07-22 11:55:29 UTC
Description of problem:
In a cluster upgraded from RHCS 4.x to 5.x, the MGR daemon was found to have crashed when checked later.
On the upgraded cluster, only the NFS configuration had been modified and some IO performed.

Version-Release number of selected component (if applicable):
"ceph_version": "16.2.0-102.el8cp"

How reproducible:
Once

Steps to Reproduce:
1. Configure 4.x with NFS Ganesha on RGW, perform IO
2. Upgrade the cluster to 5.x
3. Modify the NFS configuration, perform IO

Actual results:
"assert_line": 2928,
    "assert_msg": "/builddir/build/BUILD/ceph-16.2.0/src/mgr/DaemonServer.cc: In function 'DaemonServer::got_service_map()::<lambda(const ServiceMap&)>' thread 7ffbac41b700 time 2021-07-18T17:14:00.116180+0000\n/builddir/build/BUILD/ceph-16.2.0/src/mgr/DaemonServer.cc: 2928: FAILED ceph_assert(pending_service_map.epoch > service_map.epoch)\n",


Expected results:
No Crashes

Additional info:
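For reference, the failed assertion compares the manager's pending service-map epoch with the epoch of the map it just received; the crash happens when a newly activated mgr gets a map whose epoch is already ahead of its own pending map. Below is a minimal, self-contained C++ sketch of the strict check versus a relaxed one along the lines described in the Doc Text field above. The ServiceMap struct and function names here are simplified stand-ins, not the actual src/mgr/DaemonServer.cc code.

// Illustrative model of the service-map epoch check during mgr failover.
// Simplified stand-in, not the actual Ceph src/mgr/DaemonServer.cc code.
#include <cassert>
#include <cstdint>
#include <iostream>

struct ServiceMap {
  uint64_t epoch = 0;
};

// Strict variant (sketch): only an empty pending map (epoch 0) is treated as
// the initial map, so a newer map arriving right after activation trips the
// assert -- the failure mode reported in this bug.
void got_service_map_strict(ServiceMap& pending, const ServiceMap& received) {
  if (pending.epoch == 0) {
    pending = received;
    pending.epoch = received.epoch + 1;
  } else {
    assert(pending.epoch > received.epoch);  // analogue of the FAILED ceph_assert
  }
}

// Relaxed variant (sketch): any received map that is not older than the
// pending one is adopted as the new baseline instead of asserting.
void got_service_map_relaxed(ServiceMap& pending, const ServiceMap& received) {
  if (pending.epoch <= received.epoch) {
    pending = received;
    pending.epoch = received.epoch + 1;
  }
  // Otherwise the received map is stale and can be ignored.
}

int main() {
  ServiceMap pending;
  pending.epoch = 5;    // epoch the newly active mgr already has pending
  ServiceMap received;
  received.epoch = 7;   // higher epoch sent by the previously active mgr
  got_service_map_relaxed(pending, received);
  std::cout << "pending epoch is now " << pending.epoch << "\n";  // prints 8
  return 0;
}

With the strict variant the same input would abort on the assert, matching the FAILED ceph_assert(pending_service_map.epoch > service_map.epoch) message shown under Actual results.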

Comment 9 Preethi 2021-11-10 06:35:41 UTC
@Neha, I am able to see the same issue while upgrading from 5.0 to 5.1. The upgrade was successful; however, ceph health reports 1 mgr and 2 mon daemon crashes.

Upgraded from 5.x (ceph version 16.2.0-141.el8cp) to 5.1 (ceph version 16.2.6-20.el8cp).


Below is the ceph health status:
[ceph: root@magna031 /]# ceph health
HEALTH_WARN 3 daemons have recently crashed
[ceph: root@magna031 /]# ceph health detail
HEALTH_WARN 3 daemons have recently crashed
[WRN] RECENT_CRASH: 3 daemons have recently crashed
    mgr.magna006.vxieja crashed on host magna006 at 2021-11-09T16:58:47.494357Z
    mon.magna031 crashed on host magna031 at 2021-11-09T17:28:14.312133Z
    mon.magna032 crashed on host magna032 at 2021-11-09T17:28:14.289100Z
[ceph: root@magna031 /]# 


collected crash info and pasted here -> http://pastebin.test.redhat.com/1007149

Will collect mgr and mon logs for the same.

Comment 14 Preethi 2021-11-12 06:21:07 UTC
@vikhyat, the setup does not have coredump configured. We need to configure it and reproduce the issue to collect core dumps. I will try to reproduce the issue on a new cluster and collect a core dump if it reproduces.

Comment 21 Preethi 2021-11-12 10:08:29 UTC
@Vikhyat, I have attached the crash logs captured from the mgr and mon nodes of the cluster where the issue was seen.

Comment 22 Prashant Dhange 2021-11-24 02:09:53 UTC
(In reply to Preethi from comment #21)
> @Vikhyat, I have attached the crash logs captured from the mgr and mon nodes
> of the cluster where the issue was seen.

@Preethi The BZ attachments from comment#15 to comment#20 do not contain a coredump of the mgr daemon. Can you provide the coredump file for the mgr crash, located in the /var/lib/systemd/coredump/ directory? The file name starts with "core.", e.g. for an osd daemon:

$ ls -ltr /var/lib/systemd/coredump/
total 55676
-rw-r-----. 1 root root 57010176 Oct 7 22:19 core.ceph-osd.167.fdb8f3e50a094893b7840041c8af5164.7509.1607379543000000.lz4

Let us know if you cannot find the coredump in the /var/lib/systemd/coredump/ directory on the node where the mgr daemon crashed. Refer to KCS solution https://access.redhat.com/solutions/3968111: in pre rhcs-5.x releases we had to edit the ceph-mgr service file to add a privilege to the docker/podman run command to allow coredump generation in a containerized environment (refer to section *Second* -- you do not need to trigger coredump generation manually; when the mgr crashes you will find the coredump in the /var/lib/systemd/coredump/ directory).

Comment 23 Preethi 2021-11-26 05:40:57 UTC
@Prashant, We have upgraded the cluster to the latest version and there is no core dump on the system now.

Comment 25 Vikhyat Umrao 2022-06-08 22:15:00 UTC
*** Bug 2095032 has been marked as a duplicate of this bug. ***

Comment 50 errata-xmlrpc 2023-03-20 18:55:34 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat Ceph Storage 6.0 Bug Fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2023:1360