Bug 1984881 - [RADOS] MGR daemon crashed saying - FAILED ceph_assert(pending_service_map.epoch > service_map.epoch)
Summary: [RADOS] MGR daemon crashed saying - FAILED ceph_assert(pending_service_map.epoch > service_map.epoch)
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat Storage
Component: RADOS
Version: 5.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: 6.0
Assignee: Neha Ojha
QA Contact: skanta
Docs Contact: Masauso Lungu
URL:
Whiteboard:
Duplicates: 2095032
Depends On:
Blocks: 2095062 2126050
 
Reported: 2021-07-22 11:55 UTC by Vasishta
Modified: 2024-03-26 10:28 UTC
CC List: 13 users

Fixed In Version: ceph-17.2.3-11.el9cp
Doc Type: Bug Fix
Doc Text:
.The Ceph Manager check that deals with the initial service map is now relaxed
Previously, when a cluster was upgraded, the newly activated Ceph Manager could receive several `service_map` versions from the previously active Ceph Manager. Due to an incorrect check in the code, the manager daemon crashed when the newly activated manager received a map with a higher version sent by the previously active manager. With this fix, the check in the Ceph Manager that deals with the initial service map is relaxed to handle such maps correctly, and no assertion failure occurs during Ceph Manager failover. A simplified sketch of the relaxed check follows the header fields below.
Clone Of:
Clones: 2095062
Environment:
Last Closed: 2023-03-20 18:55:34 UTC
Embargoed:
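
The Doc Text above and the linked "relax pending_service_map.epoch > service_map.epoch assert" pull requests describe catching up to newer maps rather than asserting. Below is a minimal, self-contained C++ sketch of that idea only; apart from pending_service_map, service_map.epoch, and got_service_map() (which appear in the assert message), the types and names are simplified stand-ins, not the actual upstream diff.

// Sketch only: minimal stand-ins for the real Ceph types; only the epoch matters here.
#include <cstdint>
#include <iostream>

struct ServiceMap {
  uint64_t epoch = 0;
};

struct MgrSketch {
  ServiceMap pending_service_map;  // map the mgr has staged to publish next

  // Relaxed handling as described in the Doc Text: if the incoming map is not
  // older than what we have staged (e.g. a map published by the previously
  // active mgr that arrives after failover), catch up to it instead of asserting.
  void got_service_map(const ServiceMap& service_map) {
    if (pending_service_map.epoch <= service_map.epoch) {
      pending_service_map = service_map;
      pending_service_map.epoch = service_map.epoch + 1;
    }
    // else: our staged map is already newer; nothing to reconcile.
  }
};

int main() {
  MgrSketch mgr;
  ServiceMap m;
  m.epoch = 5;
  mgr.got_service_map(m);  // initial map -> staged epoch 6
  m.epoch = 7;
  mgr.got_service_map(m);  // late map from the old mgr -> staged epoch 8, no assert
  std::cout << "staged epoch: " << mgr.pending_service_map.epoch << std::endl;
  return 0;
}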




Links
Ceph Project Bug Tracker 51835 (last updated 2021-07-23 22:52:28 UTC)
GitHub ceph/ceph pull 45984, open: mgr: relax "pending_service_map.epoch > service_map.epoch" assert (last updated 2022-06-08 22:17:07 UTC)
GitHub ceph/ceph pull 46738, merged: quincy: mgr: relax "pending_service_map.epoch > service_map.epoch" assert (last updated 2022-08-19 04:01:30 UTC)
Red Hat Product Errata RHBA-2023:1360 (last updated 2023-03-20 18:56:13 UTC)

Description Vasishta 2021-07-22 11:55:29 UTC
Description of problem:
In a cluster upgraded from RHCS 4.x to 5.x, the MGR daemon was found to have crashed when checked later.
On the upgraded cluster, only the NFS configuration was modified and some I/O was performed.

Version-Release number of selected component (if applicable):
"ceph_version": "16.2.0-102.el8cp"

How reproducible:
Once

Steps to Reproduce:
1. Configure RHCS 4.x with NFS Ganesha on RGW and perform I/O.
2. Upgrade the cluster to 5.x.
3. Modify the NFS configuration and perform I/O.

Actual results:
"assert_line": 2928,
    "assert_msg": "/builddir/build/BUILD/ceph-16.2.0/src/mgr/DaemonServer.cc: In function 'DaemonServer::got_service_map()::<lambda(const ServiceMap&)>' thread 7ffbac41b700 time 2021-07-18T17:14:00.116180+0000\n/builddir/build/BUILD/ceph-16.2.0/src/mgr/DaemonServer.cc: 2928: FAILED ceph_assert(pending_service_map.epoch > service_map.epoch)\n",


Expected results:
No Crashes

Additional info:
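For context, below is a minimal, self-contained C++ sketch of how this assertion can fire during mgr failover. Only pending_service_map, service_map.epoch, and the asserted condition come from the backtrace above; the types and the rest of the logic are simplified stand-ins, not the real DaemonServer code.

#include <cassert>
#include <cstdint>

// Stand-in for the real ServiceMap; only the epoch matters for this assert.
struct ServiceMap {
  uint64_t epoch = 0;
};

struct MgrSketch {
  ServiceMap pending_service_map;  // map the mgr has staged to publish next

  // Strict check: assumes any map received after the first one must be
  // older than what we have already staged.
  void got_service_map(const ServiceMap& service_map) {
    if (pending_service_map.epoch == 0) {
      // First map after activation: adopt it and stage the next epoch.
      pending_service_map = service_map;
      pending_service_map.epoch = service_map.epoch + 1;
    } else {
      assert(pending_service_map.epoch > service_map.epoch);
    }
  }
};

int main() {
  MgrSketch mgr;
  ServiceMap m;
  m.epoch = 10;
  mgr.got_service_map(m);  // initial map -> staged epoch 11
  // A map published by the previously active mgr arrives late with an epoch
  // that is not older than the staged one; the strict check aborts here,
  // matching FAILED ceph_assert(pending_service_map.epoch > service_map.epoch).
  m.epoch = 11;
  mgr.got_service_map(m);
  return 0;
}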

Comment 9 Preethi 2021-11-10 06:35:41 UTC
@Neha, I am able to see the same issue while upgrading from 5.0 to 5.1. The upgrade was successful; however, ceph health reports 1 mgr and 2 mon daemon crashes.

Upgraded from 5.0 (ceph version 16.2.0-141.el8cp) to 5.1 (ceph version 16.2.6-20.el8cp).


Below is the ceph health status:
[ceph: root@magna031 /]# ceph health
HEALTH_WARN 3 daemons have recently crashed
[ceph: root@magna031 /]# ceph health detail
HEALTH_WARN 3 daemons have recently crashed
[WRN] RECENT_CRASH: 3 daemons have recently crashed
    mgr.magna006.vxieja crashed on host magna006 at 2021-11-09T16:58:47.494357Z
    mon.magna031 crashed on host magna031 at 2021-11-09T17:28:14.312133Z
    mon.magna032 crashed on host magna032 at 2021-11-09T17:28:14.289100Z
[ceph: root@magna031 /]# 


Collected the crash info and pasted it here -> http://pastebin.test.redhat.com/1007149

Will collect mgr and mon logs for the same.

Comment 14 Preethi 2021-11-12 06:21:07 UTC
@Vikhyat, the setup does not have coredump collection configured. We need to configure it and reproduce the issue to collect core dumps. I will try to reproduce the issue on a new cluster and collect a core dump if it is reproduced.

Comment 21 Preethi 2021-11-12 10:08:29 UTC
@Vikhyat, I have attached the crash logs captured from the mgr and mon nodes of the cluster where the issue was seen.

Comment 22 Prashant Dhange 2021-11-24 02:09:53 UTC
(In reply to Preethi from comment #21)
> @Vikhyat, I have attached the crash logs captured from the mgr and mon nodes
> of the cluster where the issue was seen.

@Preethi The BZ attachments from comment#15 to comment#20 do not contain a coredump of the mgr daemon. Can you provide the coredump file for the mgr crash, located in the /var/lib/systemd/coredump/ directory? The file name starts with "core.", e.g. for an osd daemon:

$ ls -ltr /var/lib/systemd/coredump/
total 55676
-rw-r-----. 1 root root 57010176 Oct 7 22:19 core.ceph-osd.167.fdb8f3e50a094893b7840041c8af5164.7509.1607379543000000.lz4

Let us know if you cannot find the coredump in the /var/lib/systemd/coredump/ directory on the node where the mgr daemon crashed. Refer to KCS solution https://access.redhat.com/solutions/3968111: in pre-RHCS 5.x releases we had to edit the ceph-mgr service file to add a privilege to the docker/podman run command to allow coredump generation in a containerized environment (refer to the *Second* section -- you do not need to trigger coredump generation; when the mgr crashes, you will find the coredump in the /var/lib/systemd/coredump/ directory).

Comment 23 Preethi 2021-11-26 05:40:57 UTC
@Prashant, we have upgraded the cluster to the latest version and there is no core dump on the system now.

Comment 25 Vikhyat Umrao 2022-06-08 22:15:00 UTC
*** Bug 2095032 has been marked as a duplicate of this bug. ***

Comment 50 errata-xmlrpc 2023-03-20 18:55:34 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat Ceph Storage 6.0 Bug Fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2023:1360

