(copied from upstream tracker)

In a production environment, we hit a problem: the md_log_replay thread of one standby-replay MDS hangs.

1. The reason:

    line1: while (!journaler->is_readable() &&
    line2:        journaler->get_read_pos() < journaler->get_write_pos() &&
    line3:        !journaler->get_error()) {
    line4:   C_SaferCond readable_waiter;
    line5:   journaler->wait_for_readable(&readable_waiter);
    line6:   r = readable_waiter.wait();
    line7: }

This code is from void MDLog::_replay_thread().

(1) The code enters the while loop, and the "md_log_replay" thread is switched out in favor of the MR_Finisher thread between line3 and line5. (At this point, journaler->get_read_pos() < journaler->get_write_pos() still holds.)

(2) The MR_Finisher thread then runs Journaler::C_Read::finish, which calls ls->_finish_read() -> _assimilate_prefetch().

  a) In _assimilate_prefetch(), journaler->get_write_pos() may be set equal to journaler->get_read_pos().

  b) Because the variable on_readable is still 0 (no waiter has been registered yet), f->complete() is not called:

    if (on_readable) {
      C_OnFinisher *f = on_readable;
      on_readable = 0;
      f->complete(0);
    }

(3) The MR_Finisher thread is then switched back to the md_log_replay thread, which registers its waiter at line5 and hangs at line6 forever, because nothing is left to complete that waiter.
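For illustration only: the hang above is a check-then-wait race, where the condition is tested without a lock and the waiter is registered afterwards. A common way to close this kind of window is to evaluate the condition and go to sleep under the same lock. The sketch below shows that pattern with a standard mutex and condition variable. ToyJournal and the two helper functions are hypothetical names invented for this example; this is not the Ceph Journaler API and not necessarily the change shipped in the advisory.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstdint>
#include <mutex>
#include <thread>

// Hypothetical stand-in for the journaler state; NOT the Ceph Journaler class.
class ToyJournal {
 public:
  ToyJournal(std::uint64_t read_pos, std::uint64_t write_pos)
      : read_pos_(read_pos), write_pos_(write_pos) {
    assert(read_pos <= write_pos);
  }

  // Finisher side: publish the result of a prefetch, then wake any waiter.
  void finish_read(std::uint64_t new_write_pos, bool readable) {
    {
      std::lock_guard<std::mutex> guard(lock_);
      write_pos_ = new_write_pos;
      readable_ = readable;
    }
    cond_.notify_all();
  }

  // Replay side: the predicate is evaluated with lock_ held, and wait()
  // releases the lock only while the thread is blocked, so a state change
  // cannot slip in between the check and the sleep.
  // Returns true if an entry is readable, false if nothing is left to read.
  bool wait_readable_or_done() {
    std::unique_lock<std::mutex> guard(lock_);
    cond_.wait(guard, [this] { return readable_ || read_pos_ >= write_pos_; });
    return readable_;
  }

 private:
  std::mutex lock_;
  std::condition_variable cond_;
  std::uint64_t read_pos_;
  std::uint64_t write_pos_;
  bool readable_ = false;
};

// Interleaving from the report: the finisher runs before the replay side waits.
bool finisher_first_returns_without_hanging() {
  ToyJournal j(10, 20);
  j.finish_read(10, false);           // write_pos becomes equal to read_pos
  return !j.wait_readable_or_done();  // returns at once: nothing left to read
}

// Either ordering of the two threads works: the waiter sees the new state
// immediately, or it blocks first and is woken by finish_read().
bool finisher_later_wakes_waiter() {
  ToyJournal j(10, 20);
  bool readable = false;
  std::thread replay([&] { readable = j.wait_readable_or_done(); });
  j.finish_read(20, true);
  replay.join();
  return readable;
}
```

Both helpers return true. In the first one the finisher has already moved write_pos down to read_pos before the replay side waits, which is the ordering that hangs in the original loop; here the wait returns immediately because the predicate is re-checked under the lock.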
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: Red Hat Ceph Storage 5.3 Bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2023:0980