Bug 2296500 - Ceph osd aborted in thread 7f71d41a9640 thread_name:safe_timer
Summary: Ceph osd aborted in thread 7f71d41a9640 thread_name:safe_timer
Keywords:
Status: POST
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: ceph
Version: 4.14
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: ODF 4.18.0
Assignee: Mohit Agrawal
QA Contact: Neha Berry
URL:
Whiteboard:
Depends On:
Blocks: 2336611 2383547 2383548
 
Reported: 2024-07-09 08:36 UTC by Prasad Desala
Modified: 2025-07-25 17:14 UTC
CC List: 8 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 2336611 (view as bug list)
Environment:
Last Closed:
Embargoed:




Links
Red Hat Issue Tracker OCSBZM-8630 (last updated 2024-08-26 11:24:17 UTC)

Description Prasad Desala 2024-07-09 08:36:30 UTC
Description of problem (please be as detailed as possible and provide log
snippets):
==================================================================================
Ceph OSD (osd.5) aborted in thread 7f71d41a9640 thread_name:safe_timer during a cluster-full test. The cluster was filled up to the full_ratio limit of 85%, after which the test failed while attempting to establish a connection to Prometheus. The OSD may have crashed during the teardown phase of the test, while the data was being deleted.

sh-5.1$ ceph crash ls
ID                                                                 ENTITY  NEW
2024-07-09T05:50:41.011720Z_16c75204-1166-4892-a9db-9c97b34c8270  osd.5    *   
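
For reference, the full metadata and backtrace for this crash can also be dumped with the standard crash module command, using the ID from the listing above:

sh-5.1$ ceph crash info 2024-07-09T05:50:41.011720Z_16c75204-1166-4892-a9db-9c97b34c8270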


ceph-osd.5 log snippet:
====================
    -4> 2024-07-09T05:50:41.769+0000 7f71d09a2640  5 osd.5 pg_epoch: 648 pg[1.b( v 645'9977 (219'6281,645'9977] local-lis/les=603/604 n=195 ec=32/8 lis/c=603/603 les/c/f=604/604/635 sis=646) [3,2] r=-1 lpr=646 pi=[603,646)/1 crt=645'9977 lcod 645'9976 mlcod 0'0 unknown NOTIFY mbc={}] enter Started/Stray
    -3> 2024-07-09T05:50:41.769+0000 7f71d09a2640  5 osd.5 pg_epoch: 646 pg[1.53( v 645'12435 (557'10010,645'12435] local-lis/les=571/572 n=200 ec=221/8 lis/c=571/571 les/c/f=572/575/635 sis=571) [5,1,2] r=0 lpr=571 crt=645'12435 lcod 645'12434 mlcod 645'12434 active+clean] exit Started/Primary/Active/Clean 48155.259186 18241 0.193717
    -2> 2024-07-09T05:50:41.769+0000 7f71d09a2640  5 osd.5 pg_epoch: 646 pg[1.53( v 645'12435 (557'10010,645'12435] local-lis/les=571/572 n=200 ec=221/8 lis/c=571/571 les/c/f=572/575/635 sis=571) [5,1,2] r=0 lpr=571 crt=645'12435 lcod 645'12434 mlcod 645'12434 active mbc={255={}}] exit Started/Primary/Active 48379.727367 0 0.000000
    -1> 2024-07-09T05:50:41.769+0000 7f71d09a2640  5 osd.5 pg_epoch: 646 pg[1.53( v 645'12435 (557'10010,645'12435] local-lis/les=571/572 n=200 ec=221/8 lis/c=571/571 les/c/f=572/575/635 sis=571) [5,1,2] r=0 lpr=571 crt=645'12435 lcod 645'12434 mlcod 645'12434 active mbc={255={}}] exit Started/Primary 48380.649031 0 0.000000
     0> 2024-07-09T05:50:41.011+0000 7f71d41a9640 -1 *** Caught signal (Aborted) **
 in thread 7f71d41a9640 thread_name:safe_timer

 ceph version 17.2.6-216.el9cp (2787f204195806caaa97db2b363a263d5a9fa156) quincy (stable)
 1: /lib64/libc.so.6(+0x54db0) [0x7f71efe9ddb0]
 2: /lib64/libc.so.6(+0xa154c) [0x7f71efeea54c]
 3: raise()
 4: abort()
 5: /lib64/libc.so.6(+0x2871b) [0x7f71efe7171b]
 6: /lib64/libc.so.6(+0x4dca6) [0x7f71efe96ca6]
 7: ceph-osd(+0x843de8) [0x557239663de8]
 8: (Message::encode(unsigned long, int, bool)+0x2e) [0x5572399fc5ae]
 9: (ProtocolV2::send_message(Message*)+0x241) [0x557239b83991]
 10: (AsyncConnection::send_message(Message*)+0x276) [0x557239b701f6]
 11: (OSDService::send_message_osd_cluster(int, Message*, unsigned int)+0x1d1) [0x55723934c591]
 12: (ReplicatedBackend::issue_op(hobject_t const&, eversion_t const&, unsigned long, osd_reqid_t, eversion_t, eversion_t, hobject_t, hobject_t, std::vector<pg_log_entry_t, std::allocator<pg_log_entry_t> > const&, std::optional<pg_hit_set_history_t>&, ReplicatedBackend::InProgressOp*, ceph::os::Transaction&)+0x772) [0x5572396797b2]
 13: (ReplicatedBackend::submit_transaction(hobject_t const&, object_stat_sum_t const&, eversion_t const&, std::unique_ptr<PGTransaction, std::default_delete<PGTransaction> >&&, eversion_t const&, eversion_t const&, std::vector<pg_log_entry_t, std::allocator<pg_log_entry_t> >&&, std::optional<pg_hit_set_history_t>&, Context*, unsigned long, osd_reqid_t, boost::intrusive_ptr<OpRequest>)+0x652) [0x557239672df2]
 14: (PrimaryLogPG::issue_repop(PrimaryLogPG::RepGather*, PrimaryLogPG::OpContext*)+0x38f) [0x55723948223f]
 15: (PrimaryLogPG::simple_opc_submit(std::unique_ptr<PrimaryLogPG::OpContext, std::default_delete<PrimaryLogPG::OpContext> >)+0x57) [0x5572394873c7]
 16: (PrimaryLogPG::handle_watch_timeout(std::shared_ptr<Watch>)+0xb73) [0x557239489a93]
 17: ceph-osd(+0x5b8d5e) [0x5572393d8d5e]
 18: (CommonSafeTimer<std::mutex>::timer_thread()+0x11a) [0x5572398ddb3a]
 19: ceph-osd(+0xabe3f1) [0x5572398de3f1]
 20: /lib64/libc.so.6(+0x9f802) [0x7f71efee8802]
 21: /lib64/libc.so.6(+0x3f450) [0x7f71efe88450]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- logging levels ---
   0/ 5 none
   0/ 1 lockdep
   0/ 1 context
   1/ 1 crush
   1/ 5 mds
   1/ 5 mds_balancer
   1/ 5 mds_locker
   1/ 5 mds_log
   1/ 5 mds_log_expire
   1/ 5 mds_migrator
   0/ 1 buffer
   0/ 1 timer
   0/ 1 filer
   0/ 1 striper
   0/ 1 objecter
   0/ 5 rados
   0/ 5 rbd
   0/ 5 rbd_mirror
   0/ 5 rbd_replay
   0/ 5 rbd_pwl
   0/ 5 journaler
   0/ 5 objectcacher
   0/ 5 immutable_obj_cache
   0/ 5 client
   1/ 5 osd
   0/ 5 optracker
   0/ 5 objclass
   1/ 3 filestore
   1/ 3 journal
   0/ 0 ms
   1/ 5 mon
   0/10 monc
   1/ 5 paxos
   0/ 5 tp
   1/ 5 auth
   1/ 5 crypto
   1/ 1 finisher
   1/ 1 reserver
   1/ 5 heartbeatmap
   1/ 5 perfcounter
   1/ 5 rgw
   1/ 5 rgw_sync
   1/ 5 rgw_datacache
   1/10 civetweb
   1/ 5 rgw_access
   1/ 5 javaclient
   1/ 5 asok
   1/ 1 throttle
   0/ 0 refs
   1/ 5 compressor
   1/ 5 bluestore
   1/ 5 bluefs
   1/ 3 bdev
   1/ 5 kstore
   4/ 5 rocksdb
   4/ 5 leveldb
   4/ 5 memdb
   1/ 5 fuse
   2/ 5 mgr
   1/ 5 mgrc
   1/ 5 dpdk
   1/ 5 eventtrace
   1/ 5 prioritycache
   0/ 5 test
   0/ 5 cephfs_mirror
   0/ 5 cephsqlite
   0/ 5 seastore
   0/ 5 seastore_onode
   0/ 5 seastore_odata
   0/ 5 seastore_omap
   0/ 5 seastore_tm
   0/ 5 seastore_cleaner
   0/ 5 seastore_lba
   0/ 5 seastore_cache
   0/ 5 seastore_journal
   0/ 5 seastore_device
   0/ 5 alienstore
   1/ 5 mclock
   1/ 5 ceph_exporter
  -2/-2 (syslog threshold)
  99/99 (stderr threshold)
--- pthread ID / name mapping for recent threads ---
  7f71ca195640 / osd_srv_heartbt
  7f71ca996640 / tp_osd_tp
  7f71cb197640 / tp_osd_tp
  7f71cb998640 / tp_osd_tp
  7f71cc199640 / tp_osd_tp
  7f71cc99a640 / tp_osd_tp
  7f71cd19b640 / tp_osd_tp
  7f71cd99c640 / tp_osd_tp
  7f71ce19d640 / tp_osd_tp
  7f71ce99e640 / tp_osd_tp
  7f71cf19f640 / tp_osd_tp
  7f71cf9a0640 / tp_osd_tp
  7f71d01a1640 / tp_osd_tp
  7f71d09a2640 / tp_osd_tp
  7f71d11a3640 / tp_osd_tp
  7f71d19a4640 / tp_osd_tp
  7f71d21a5640 / tp_osd_tp
  7f71d41a9640 / safe_timer
  7f71db1b7640 / ceph-osd
  7f71db9b8640 / cfin
  7f71dc1b9640 / bstore_kv_sync
  7f71dc9ba640 / bstore_kv_final
  7f71dd5d8640 / ms_dispatch
  7f71de5da640 / safe_timer
  7f71df5dc640 / ms_dispatch
  7f71e01c1640 / bstore_mempool
  7f71e39d5640 / bstore_aio
  7f71e59d9640 / safe_timer
  7f71ea1e2640 / safe_timer
  7f71eb3a2640 / io_context_pool
  7f71ecd0a640 / io_context_pool
  7f71ed50b640 / admin_socket
  7f71edd0c640 / msgr-worker-2
  7f71ee50d640 / msgr-worker-1
  7f71eed0e640 / msgr-worker-0
  max_recent     10000
  max_new        10000
  log_file /var/log/ceph/ceph-osd.5.log
--- end dump of recent events ---
2024-07-09T05:51:43.199+0000 7f64499295c0  0 set uid:gid to 167:167 (ceph:ceph)
2024-07-09T05:51:43.199+0000 7f64499295c0  0 ceph version 17.2.6-216.el9cp (2787f204195806caaa97db2b363a263d5a9fa156) quincy (stable), process ceph-osd, pid 39716


Version of all relevant components (if applicable):
OCP: 4.14.31
ODF: v4.14.9-rhodf
Ceph: ceph version 17.2.6-216.el9cp (2787f204195806caaa97db2b363a263d5a9fa156) quincy (stable)

Does this issue impact your ability to continue to work with the product
(please explain in detail what the user impact is)?
Ceph health went into a warning state due to this OSD crash; it can be returned to healthy by archiving the crash.
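For example, archiving can be done with the standard crash module commands (the ID below is the one reported by 'ceph crash ls' above):

sh-5.1$ ceph crash archive 2024-07-09T05:50:41.011720Z_16c75204-1166-4892-a9db-9c97b34c8270
sh-5.1$ ceph crash archive-all    # alternatively, archive every crash still marked as new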

Is there any workaround available to the best of your knowledge?


Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
3

Is this issue reproducible?
It could possibly be a race condition

Can this issue be reproduced from the UI?
N/A

If this is a regression, please provide more details to justify this:


Steps to Reproduce:
===================
The crash was observed while running - https://github.com/red-hat-storage/ocs-ci/blob/master/tests/cross_functional/system_test/test_cluster_full_and_recovery.py
The test performs the steps below:
1. Create PVC1 [FS + RBD]
2. Verify new PVC1 [FS + RBD] is in Bound state
3. Run FIO on PVC1_FS + PVC1_RBD
4. Calculate checksum of PVC1_FS + PVC1_RBD
5. Fill the cluster to the “Full ratio” (usually 85%) with benchmark-operator
6. Verify alerts are seen ["CephClusterCriticallyFull", "CephOSDNearFull"]
7. Respin 'osd', 'mgr', 'mon' pods
8. Create PVC2 [FS + RBD]
9. Verify PVC2 [FS + RBD] are in Pending state
10. Create snapshot from PVC1 [FS + RBD]
11. Verify snapshots are in false state
12. Change Ceph full_ratio from 85% to 95% (see the example below)
13. Delete benchmark-operator PVCs
14. Verify PVC2 [FS + RBD] are moved to Bound state
15. Verify snapshots moved from false state to true state
16. Restore new PVC from snapshot PVC [RBD + FS]
17. Verify checksum of PVC1 equals that of PVC1_RESTORE
18. Change Ceph full_ratio from 95% to 85%

The test failed at step 6 while establishing a connection to Prometheus.
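
For reference, the full_ratio changes in steps 12 and 18 map to the standard Ceph command below (a minimal sketch; the test may drive this through its own helpers, but the underlying command is the same):

sh-5.1$ ceph osd set-full-ratio 0.95    # step 12: raise the full ratio from 0.85 to 0.95
sh-5.1$ ceph osd set-full-ratio 0.85    # step 18: restore the default full ratio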

Actual results:
===============
One of the Ceph OSDs (osd.5) crashed.

Expected results:
==================
No Ceph daemon crashes

Comment 7 Sunil Kumar Acharya 2024-08-26 11:22:42 UTC
Are there any blockers to providing devel ack for this BZ? If not, please provide the devel ack.

Comment 11 Sunil Kumar Acharya 2024-08-28 12:37:53 UTC
Are we blocked on anything to provide devel ack on this BZ?

