Bug 2161481 - mds: md_log_replay thread (replay thread) can remain blocked
Summary: mds: md_log_replay thread (replay thread) can remain blocked
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat Storage
Component: CephFS
Version: 5.2
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: 5.3z1
Assignee: Venky Shankar
QA Contact: Hemanth Kumar
URL:
Whiteboard:
Depends On:
Blocks: 2161483
Reported: 2023-01-17 04:53 UTC by Venky Shankar
Modified: 2023-02-28 10:07 UTC (History)

Fixed In Version: ceph-16.2.10-105.el8cp
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 2161483
Environment:
Last Closed: 2023-02-28 10:06:24 UTC
Embargoed:
hyelloji: needinfo-


Links
System                     ID               Private  Priority  Status  Summary  Last Updated
Ceph Project Bug Tracker   57764            0        None      None    None     2023-01-17 04:53:09 UTC
Red Hat Issue Tracker      RHCEPH-5940      0        None      None    None     2023-01-17 04:56:40 UTC
Red Hat Product Errata     RHSA-2023:0980   0        None      None    None     2023-02-28 10:07:26 UTC

Description Venky Shankar 2023-01-17 04:53:10 UTC
(copied from upstream tracker)

In a production environment, we hit a problem: the md_log_replay thread of a standby-replay MDS hung.

1. The reason:

  line1:    while (!journaler->is_readable() &&
  line2:           journaler->get_read_pos() < journaler->get_write_pos() &&
  line3:           !journaler->get_error()) {
  line4:      C_SaferCond readable_waiter;
  line5:      journaler->wait_for_readable(&readable_waiter);
  line6:      r = readable_waiter.wait();
  line7:    }
This code is from void MDLog::_replay_thread().
  (1) Suppose the code enters the while loop and the scheduler switches from this thread ("md_log_replay") to the MR_Finisher thread between line3 and line5. (At this point, journaler->get_read_pos() < journaler->get_write_pos().)
  (2) The MR_Finisher thread then runs Journaler::C_Read::finish(), which calls ls->_finish_read() -> _assimilate_prefetch().
    a) In _assimilate_prefetch(), journaler->get_write_pos() may be set equal to journaler->get_read_pos().
    b) Because the variable on_readable is 0, f->complete() is not called:
        if (on_readable) {
          C_OnFinisher *f = on_readable;
          on_readable = 0;
          f->complete(0);
        }
  (3) When the scheduler switches back from the MR_Finisher thread to the md_log_replay thread, that thread blocks at line6 forever.
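
The interleaving in steps (1) to (3) is a lost-wakeup pattern, and it can be replayed deterministically in a single thread. The sketch below is only a toy model under stated assumptions: ToyJournaler, wait_for_readable_naive(), wait_for_readable_checked() and waiter_woken() are hypothetical names invented for illustration. It is not the Ceph Journaler code, and the "checked" variant is a generic mitigation for this class of race, not necessarily the change shipped in the advisory.

```cpp
#include <cassert>
#include <functional>
#include <utility>

// Hypothetical, simplified stand-in for the journaler state (NOT the Ceph class).
struct ToyJournaler {
  bool readable = false;
  std::function<void(int)> on_readable;  // empty corresponds to "on_readable is 0"

  // Models step (2): the completion path wakes a waiter only if one is registered.
  void assimilate_prefetch() {
    readable = true;
    if (on_readable) {
      auto f = std::move(on_readable);
      on_readable = nullptr;
      f(0);
    }
  }

  // Naive registration: stores the waiter without re-checking the state.
  void wait_for_readable_naive(std::function<void(int)> cb) {
    on_readable = std::move(cb);
  }

  // Checked registration: completes at once if the event already happened.
  void wait_for_readable_checked(std::function<void(int)> cb) {
    if (readable) {
      cb(0);
    } else {
      on_readable = std::move(cb);
    }
  }
};

// Replays the interleaving from the report and returns true if the waiter
// was woken. A return value of false corresponds to blocking forever at line6.
bool waiter_woken(bool checked) {
  ToyJournaler j;
  bool woken = false;

  // (1) The replay thread evaluates the loop condition and decides to wait.
  bool must_wait = !j.readable;

  // (2) The finisher thread runs before any waiter has been registered.
  j.assimilate_prefetch();

  // (3) The replay thread resumes and only now registers its waiter.
  if (must_wait) {
    auto cb = [&woken](int) { woken = true; };
    if (checked) {
      j.wait_for_readable_checked(cb);
    } else {
      j.wait_for_readable_naive(cb);
    }
  }
  return woken;
}
```

In this model, waiter_woken(false) returns false (the completion fired while no waiter was registered, so the later wait never ends), while waiter_woken(true) returns true because the registration re-checks the state it is waiting on.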

Comment 9 errata-xmlrpc 2023-02-28 10:06:24 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: Red Hat Ceph Storage 5.3 Bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2023:0980

