Bug 2001152 - [Workload-DFG] All three MGR daemons crashed at the same time - in thread_name:safe_timer during adjust_pgs
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat Storage
Component: RADOS
Version: 5.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 5.0z1
Assignee: Neha Ojha
QA Contact: Pawan
Docs Contact: Mary Frances Hull
URL:
Whiteboard:
Depends On:
Blocks: 1959686
 
Reported: 2021-09-03 22:59 UTC by Vikhyat Umrao
Modified: 2021-11-02 16:39 UTC (History)
13 users (show)

Fixed In Version: ceph-16.2.0-120.el8cp
Doc Type: Bug Fix
Doc Text:
.The Ceph Manager no longer crashes during large increases to `pg_num` and `pgp_num`
Previously, the code that adjusts placement groups did not handle large increases to the `pg_num` and `pgp_num` parameters correctly, leading to an integer underflow that could crash the Ceph Manager. With this release, the placement-group adjustment code has been fixed, and large increases to placement groups no longer cause the Ceph Manager to crash.
Clone Of:
Environment:
Last Closed: 2021-11-02 16:39:06 UTC
Embargoed:


Attachments (Terms of Use)


Links
System ID Private Priority Status Summary Last Updated
Ceph Project Bug Tracker 47738 0 None None None 2021-09-03 23:15:27 UTC
Github ceph ceph pull 41862 0 None None None 2021-09-07 23:38:59 UTC
Red Hat Issue Tracker RHCEPH-1441 0 None None None 2021-09-03 23:00:07 UTC
Red Hat Product Errata RHBA-2021:4105 0 None None None 2021-11-02 16:39:34 UTC

Description Vikhyat Umrao 2021-09-03 22:59:17 UTC
Description of problem:
All three MGR daemons crashed at the same time with the same abort message:

- ** Caught signal (Aborted) **  in thread 7f4117eb8700 thread_name:safe_timer

Version-Release number of selected component (if applicable):
RHCS 5 
16.2.0-117.el8cp 



     0> 2021-09-03T20:45:37.923+0000 7f4117eb8700 -1 *** Caught signal (Aborted) **
 in thread 7f4117eb8700 thread_name:safe_timer

 ceph version 16.2.0-117.el8cp (0e34bb74700060ebfaa22d99b7d2cdc037b28a57) pacific (stable)
 1: /lib64/libpthread.so.0(+0x12b20) [0x7f43aca0bb20]
 2: gsignal()
 3: abort()
 4: /usr/bin/ceph-mgr(+0x154588) [0x55d650d08588]
 5: (DaemonServer::adjust_pgs()+0x3f04) [0x55d650dc0c94]
 6: (DaemonServer::tick()+0x103) [0x55d650dc5673]
 7: (Context::complete(int)+0xd) [0x55d650d50c4d]
 8: (SafeTimer::timer_thread()+0x1b7) [0x7f43adf0dc67]
 9: (SafeTimerThread::entry()+0x11) [0x7f43adf0f241]
 10: /lib64/libpthread.so.0(+0x814a) [0x7f43aca0114a]
 11: clone()

Comment 1 Vikhyat Umrao 2021-09-03 23:02:30 UTC
# ceph -s
  cluster:
    id:     08890e38-0cc9-11ec-9c28-bc97e178dd80
    health: HEALTH_WARN
            no active mgr
 
  services:
    mon: 3 daemons, quorum f18-h02-000-r640.rdu2.scalelab.redhat.com,f18-h07-000-r640,f18-h06-000-r640 (age 7h)
    mgr: no daemons active (since 2h)
    osd: 192 osds: 192 up (since 3h), 192 in (since 7h)
    rgw: 8 daemons active (8 hosts, 1 zones)
 
  data:
    pools:   7 pools, 2239 pgs
    objects: 2.50M objects, 9.4 TiB
    usage:   26 TiB used, 335 TiB / 361 TiB avail
    pgs:     2231 active+clean
             7    active+clean+scrubbing+deep
             1    active+clean+scrubbing

Comment 18 errata-xmlrpc 2021-11-02 16:39:06 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat Ceph Storage 5.0 Bug Fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:4105

