Bug 1984881 - [RADOS] MGR daemon crashed saying - FAILED ceph_assert(pending_service_map.epoch > service_map.epoch)
Summary: [RADOS] MGR daemon crashed saying - FAILED ceph_assert(pending_service_map.epoch > service_map.epoch)
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat Storage
Component: RADOS
Version: 5.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: 6.0
Assignee: Neha Ojha
QA Contact: skanta
Docs Contact: Masauso Lungu
URL:
Whiteboard:
Duplicates: 2095032
Depends On:
Blocks: 2095062 2126050
 
Reported: 2021-07-22 11:55 UTC by Vasishta
Modified: 2024-03-26 10:28 UTC
CC List: 13 users

Fixed In Version: ceph-17.2.3-11.el9cp
Doc Type: Bug Fix
Doc Text:
.The Ceph Manager check that deals with the initial service map is now relaxed
Previously, when a cluster was upgraded, the newly activated Ceph Manager could receive several `service_map` versions from the previously active Ceph Manager. Due to an incorrect check in the code, the manager daemon crashed when the newly activated manager received a map with a higher version sent by the previously active manager. With this fix, the check in the Ceph Manager that deals with the initial service map is relaxed to handle such maps correctly, and no assertion failure occurs during Ceph Manager failover. A simplified sketch of the relaxed check follows the header fields below.
Clone Of:
Clones: 2095062
Environment:
Last Closed: 2023-03-20 18:55:34 UTC
Embargoed:
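
The Doc Text above and the linked "relax pending_service_map.epoch > service_map.epoch assert" pull requests describe catching up to newer maps rather than asserting. Below is a minimal, self-contained C++ sketch of that idea only; apart from pending_service_map, service_map.epoch, and got_service_map() (which appear in the assert message), the types and names are simplified stand-ins, not the actual upstream diff.

// Sketch only: minimal stand-ins for the real Ceph types; only the epoch matters here.
#include <cstdint>
#include <iostream>

struct ServiceMap {
  uint64_t epoch = 0;
};

struct MgrSketch {
  ServiceMap pending_service_map;  // map the mgr has staged to publish next

  // Relaxed handling as described in the Doc Text: if the incoming map is not
  // older than what we have staged (e.g. a map published by the previously
  // active mgr that arrives after failover), catch up to it instead of asserting.
  void got_service_map(const ServiceMap& service_map) {
    if (pending_service_map.epoch <= service_map.epoch) {
      pending_service_map = service_map;
      pending_service_map.epoch = service_map.epoch + 1;
    }
    // else: our staged map is already newer; nothing to reconcile.
  }
};

int main() {
  MgrSketch mgr;
  ServiceMap m;
  m.epoch = 5;
  mgr.got_service_map(m);  // initial map -> staged epoch 6
  m.epoch = 7;
  mgr.got_service_map(m);  // late map from the old mgr -> staged epoch 8, no assert
  std::cout << "staged epoch: " << mgr.pending_service_map.epoch << std::endl;
  return 0;
}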




Links
Ceph Project Bug Tracker 51835 (last updated 2021-07-23 22:52:28 UTC)
GitHub ceph/ceph pull 45984, open: mgr: relax "pending_service_map.epoch > service_map.epoch" assert (last updated 2022-06-08 22:17:07 UTC)
GitHub ceph/ceph pull 46738, merged: quincy: mgr: relax "pending_service_map.epoch > service_map.epoch" assert (last updated 2022-08-19 04:01:30 UTC)
Red Hat Product Errata RHBA-2023:1360 (last updated 2023-03-20 18:56:13 UTC)

Description Vasishta 2021-07-22 11:55:29 UTC
Description of problem:
In a cluster upgraded from RHCS 4.x to 5.x, the MGR daemon was found to have crashed when checked later.
On the upgraded cluster, only the NFS configuration was modified and some I/O was performed.

Version-Release number of selected component (if applicable):
"ceph_version": "16.2.0-102.el8cp"

How reproducible:
Once

Steps to Reproduce:
1. Configure RHCS 4.x with NFS Ganesha on RGW and perform I/O.
2. Upgrade the cluster to 5.x.
3. Modify the NFS configuration and perform I/O.

Actual results:
"assert_line": 2928,
    "assert_msg": "/builddir/build/BUILD/ceph-16.2.0/src/mgr/DaemonServer.cc: In function 'DaemonServer::got_service_map()::<lambda(const ServiceMap&)>' thread 7ffbac41b700 time 2021-07-18T17:14:00.116180+0000\n/builddir/build/BUILD/ceph-16.2.0/src/mgr/DaemonServer.cc: 2928: FAILED ceph_assert(pending_service_map.epoch > service_map.epoch)\n",


Expected results:
No Crashes

Additional info:
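For context, below is a minimal, self-contained C++ sketch of how this assertion can fire during mgr failover. Only pending_service_map, service_map.epoch, and the asserted condition come from the backtrace above; the types and the rest of the logic are simplified stand-ins, not the real DaemonServer code.

#include <cassert>
#include <cstdint>

// Stand-in for the real ServiceMap; only the epoch matters for this assert.
struct ServiceMap {
  uint64_t epoch = 0;
};

struct MgrSketch {
  ServiceMap pending_service_map;  // map the mgr has staged to publish next

  // Strict check: assumes any map received after the first one must be
  // older than what we have already staged.
  void got_service_map(const ServiceMap& service_map) {
    if (pending_service_map.epoch == 0) {
      // First map after activation: adopt it and stage the next epoch.
      pending_service_map = service_map;
      pending_service_map.epoch = service_map.epoch + 1;
    } else {
      assert(pending_service_map.epoch > service_map.epoch);
    }
  }
};

int main() {
  MgrSketch mgr;
  ServiceMap m;
  m.epoch = 10;
  mgr.got_service_map(m);  // initial map -> staged epoch 11
  // A map published by the previously active mgr arrives late with an epoch
  // that is not older than the staged one; the strict check aborts here,
  // matching FAILED ceph_assert(pending_service_map.epoch > service_map.epoch).
  m.epoch = 11;
  mgr.got_service_map(m);
  return 0;
}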

Comment 9 Preethi 2021-11-10 06:35:41 UTC
@Neha, I am able to see the same issue while upgrading from 5.0 to 5.1. The upgrade was successful; however, ceph health reports 1 mgr and 2 mon daemon crashes.

Upgraded from 5.0 (ceph version 16.2.0-141.el8cp) to 5.1 (ceph version 16.2.6-20.el8cp).


Below is the ceph health status:
[ceph: root@magna031 /]# ceph health
HEALTH_WARN 3 daemons have recently crashed
[ceph: root@magna031 /]# ceph health detail
HEALTH_WARN 3 daemons have recently crashed
[WRN] RECENT_CRASH: 3 daemons have recently crashed
    mgr.magna006.vxieja crashed on host magna006 at 2021-11-09T16:58:47.494357Z
    mon.magna031 crashed on host magna031 at 2021-11-09T17:28:14.312133Z
    mon.magna032 crashed on host magna032 at 2021-11-09T17:28:14.289100Z
[ceph: root@magna031 /]# 


Collected the crash info and pasted it here -> http://pastebin.test.redhat.com/1007149

Will collect mgr and mon logs for the same.

Comment 14 Preethi 2021-11-12 06:21:07 UTC
@Vikhyat, the setup does not have coredump collection configured. We need to configure it and reproduce the issue to collect core dumps. I will try to reproduce the issue on a new cluster and collect a core dump if it is reproduced.

Comment 21 Preethi 2021-11-12 10:08:29 UTC
@Vikhyat, I have attached the crash logs captured from the mgr and mon nodes of the cluster where the issue was seen.

Comment 22 Prashant Dhange 2021-11-24 02:09:53 UTC
(In reply to Preethi from comment #21)
> @Vikhyat, I have attached the crash logs captured from the mgr and mon nodes
> of the cluster where the issue was seen.

@Preethi The BZ attachments from comment#15 to comment#20 do not contain a coredump of the mgr daemon. Can you provide the coredump file for the mgr crash, located in the /var/lib/systemd/coredump/ directory? The file name starts with "core.", e.g. for an osd daemon:

$ ls -ltr /var/lib/systemd/coredump/
total 55676
-rw-r-----. 1 root root 57010176 Oct 7 22:19 core.ceph-osd.167.fdb8f3e50a094893b7840041c8af5164.7509.1607379543000000.lz4

Let us know if you cannot find the coredump in the /var/lib/systemd/coredump/ directory on the node where the mgr daemon crashed. Refer to KCS solution https://access.redhat.com/solutions/3968111: in pre-RHCS 5.x releases we had to edit the ceph-mgr service file to add a privilege to the docker/podman run command to allow coredump generation in a containerized environment (refer to the *Second* section -- you do not need to trigger coredump generation; when the mgr crashes, you will find the coredump in the /var/lib/systemd/coredump/ directory).

Comment 23 Preethi 2021-11-26 05:40:57 UTC
@Prashant, we have upgraded the cluster to the latest version and there is no core dump on the system now.

Comment 25 Vikhyat Umrao 2022-06-08 22:15:00 UTC
*** Bug 2095032 has been marked as a duplicate of this bug. ***

Comment 50 errata-xmlrpc 2023-03-20 18:55:34 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat Ceph Storage 6.0 Bug Fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2023:1360

