Bug 2001152

Summary: [Workload-DFG] All three MGR daemons crashed on the same time - in thread_name:safe_timer during adjust_pgs
Product: [Red Hat Storage] Red Hat Ceph Storage
Reporter: Vikhyat Umrao <vumrao>
Component: RADOS
Assignee: Neha Ojha <nojha>
Status: CLOSED ERRATA
QA Contact: Pawan <pdhiran>
Severity: high
Docs Contact: Mary Frances Hull <mhull>
Priority: unspecified
Version: 5.0
CC: agunn, akupczyk, bhubbard, ceph-eng-bugs, gsitlani, nojha, pdhiran, rzarzyns, sseshasa, tserlin, twilkins, vimishra, vumrao
Target Milestone: ---
Keywords: CodeChange
Target Release: 5.0z1
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: ceph-16.2.0-120.el8cp
Doc Type: Bug Fix
Doc Text:
.The Ceph Manager no longer crashes during large increases to `pg_num` and `pgp_num`

Previously, the code that adjusts placement groups did not handle large increases to the `pg_num` and `pgp_num` parameters correctly, leading to an integer underflow that could crash the Ceph Manager. With this release, the placement group adjustment code has been fixed. As a result, large increases to placement groups no longer cause the Ceph Manager to crash.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2021-11-02 16:39:06 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1959686

Description Vikhyat Umrao 2021-09-03 22:59:17 UTC
Description of problem:
All three MGR daemons crashed at the same time with the same abort message:

- ** Caught signal (Aborted) **  in thread 7f4117eb8700 thread_name:safe_timer

Version-Release number of selected component (if applicable):
RHCS 5 
16.2.0-117.el8cp 
     0> 2021-09-03T20:45:37.923+0000 7f4117eb8700 -1 *** Caught signal (Aborted) **
 in thread 7f4117eb8700 thread_name:safe_timer

 ceph version 16.2.0-117.el8cp (0e34bb74700060ebfaa22d99b7d2cdc037b28a57) pacific (stable)
 1: /lib64/libpthread.so.0(+0x12b20) [0x7f43aca0bb20]
 2: gsignal()
 3: abort()
 4: /usr/bin/ceph-mgr(+0x154588) [0x55d650d08588]
 5: (DaemonServer::adjust_pgs()+0x3f04) [0x55d650dc0c94]
 6: (DaemonServer::tick()+0x103) [0x55d650dc5673]
 7: (Context::complete(int)+0xd) [0x55d650d50c4d]
 8: (SafeTimer::timer_thread()+0x1b7) [0x7f43adf0dc67]
 9: (SafeTimerThread::entry()+0x11) [0x7f43adf0f241]
 10: /lib64/libpthread.so.0(+0x814a) [0x7f43aca0114a]
 11: clone()

Comment 1 Vikhyat Umrao 2021-09-03 23:02:30 UTC
# ceph -s
  cluster:
    id:     08890e38-0cc9-11ec-9c28-bc97e178dd80
    health: HEALTH_WARN
            no active mgr
 
  services:
    mon: 3 daemons, quorum f18-h02-000-r640.rdu2.scalelab.redhat.com,f18-h07-000-r640,f18-h06-000-r640 (age 7h)
    mgr: no daemons active (since 2h)
    osd: 192 osds: 192 up (since 3h), 192 in (since 7h)
    rgw: 8 daemons active (8 hosts, 1 zones)
 
  data:
    pools:   7 pools, 2239 pgs
    objects: 2.50M objects, 9.4 TiB
    usage:   26 TiB used, 335 TiB / 361 TiB avail
    pgs:     2231 active+clean
             7    active+clean+scrubbing+deep
             1    active+clean+scrubbing

Comment 18 errata-xmlrpc 2021-11-02 16:39:06 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat Ceph Storage 5.0 Bug Fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:4105