Bug 2001152 - [Workload-DFG] All three MGR daemons crashed at the same time - in thread_name:safe_timer during adjust_pgs
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat Storage
Component: RADOS
Version: 5.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 5.0z1
Assignee: Neha Ojha
QA Contact: Pawan
Docs Contact: Mary Frances Hull
URL:
Whiteboard:
Depends On:
Blocks: 1959686
 
Reported: 2021-09-03 22:59 UTC by Vikhyat Umrao
Modified: 2021-11-02 16:39 UTC (History)
13 users (show)

Fixed In Version: ceph-16.2.0-120.el8cp
Doc Type: Bug Fix
Doc Text:
.The Ceph Manager no longer crashes during large increases to `pg_num` and `pgp_num`
Previously, the code that adjusts placement groups did not handle large increases to the `pg_num` and `pgp_num` parameters correctly, leading to an integer underflow that could crash the Ceph Manager. With this release, the placement-group adjustment code has been fixed, and large increases to placement groups no longer cause the Ceph Manager to crash.
Clone Of:
Environment:
Last Closed: 2021-11-02 16:39:06 UTC
Embargoed:


Attachments (Terms of Use)


Links
System ID Private Priority Status Summary Last Updated
Ceph Project Bug Tracker 47738 0 None None None 2021-09-03 23:15:27 UTC
Github ceph ceph pull 41862 0 None None None 2021-09-07 23:38:59 UTC
Red Hat Issue Tracker RHCEPH-1441 0 None None None 2021-09-03 23:00:07 UTC
Red Hat Product Errata RHBA-2021:4105 0 None None None 2021-11-02 16:39:34 UTC

Description Vikhyat Umrao 2021-09-03 22:59:17 UTC
Description of problem:
All three MGR daemons crashed at the same time with the same abort message:

- ** Caught signal (Aborted) **  in thread 7f4117eb8700 thread_name:safe_timer

Version-Release number of selected component (if applicable):
RHCS 5 
16.2.0-117.el8cp 



     0> 2021-09-03T20:45:37.923+0000 7f4117eb8700 -1 *** Caught signal (Aborted) **
 in thread 7f4117eb8700 thread_name:safe_timer

 ceph version 16.2.0-117.el8cp (0e34bb74700060ebfaa22d99b7d2cdc037b28a57) pacific (stable)
 1: /lib64/libpthread.so.0(+0x12b20) [0x7f43aca0bb20]
 2: gsignal()
 3: abort()
 4: /usr/bin/ceph-mgr(+0x154588) [0x55d650d08588]
 5: (DaemonServer::adjust_pgs()+0x3f04) [0x55d650dc0c94]
 6: (DaemonServer::tick()+0x103) [0x55d650dc5673]
 7: (Context::complete(int)+0xd) [0x55d650d50c4d]
 8: (SafeTimer::timer_thread()+0x1b7) [0x7f43adf0dc67]
 9: (SafeTimerThread::entry()+0x11) [0x7f43adf0f241]
 10: /lib64/libpthread.so.0(+0x814a) [0x7f43aca0114a]
 11: clone()

Comment 1 Vikhyat Umrao 2021-09-03 23:02:30 UTC
# ceph -s
  cluster:
    id:     08890e38-0cc9-11ec-9c28-bc97e178dd80
    health: HEALTH_WARN
            no active mgr
 
  services:
    mon: 3 daemons, quorum f18-h02-000-r640.rdu2.scalelab.redhat.com,f18-h07-000-r640,f18-h06-000-r640 (age 7h)
    mgr: no daemons active (since 2h)
    osd: 192 osds: 192 up (since 3h), 192 in (since 7h)
    rgw: 8 daemons active (8 hosts, 1 zones)
 
  data:
    pools:   7 pools, 2239 pgs
    objects: 2.50M objects, 9.4 TiB
    usage:   26 TiB used, 335 TiB / 361 TiB avail
    pgs:     2231 active+clean
             7    active+clean+scrubbing+deep
             1    active+clean+scrubbing

Comment 18 errata-xmlrpc 2021-11-02 16:39:06 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat Ceph Storage 5.0 Bug Fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:4105

