2161481 – mds: md_log_replay thread (replay thread) can remain blocked

Bug 2161481 - mds: md_log_replay thread (replay thread) can remain blocked

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Ceph Storage
Classification:	Red Hat Storage
Component:	CephFS
Sub Component:
Version:	5.2
Hardware:	x86_64
OS:	Linux
Priority:	unspecified
Severity:	medium
Target Milestone:	---
Target Release:	5.3z1
Assignee:	Venky Shankar
QA Contact:	Hemanth Kumar
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	2161483

Reported:	2023-01-17 04:53 UTC by Venky Shankar
Modified:	2023-02-28 10:07 UTC
CC List:	5 users
Fixed In Version:	ceph-16.2.10-105.el8cp
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Clones:	2161483
Environment:
Last Closed:	2023-02-28 10:06:24 UTC
Embargoed:
Dependent Products:
Flags:	hyelloji: needinfo-

Links
System	ID	Priority	Status	Summary	Last Updated
Ceph Project Bug Tracker	57764	None	None	None	2023-01-17 04:53:09 UTC
Red Hat Issue Tracker	RHCEPH-5940	None	None	None	2023-01-17 04:56:40 UTC
Red Hat Product Errata	RHSA-2023:0980	None	None	None	2023-02-28 10:07:26 UTC

Description Venky Shankar 2023-01-17 04:53:10 UTC

(copied from upstream tracker)

In a production environment, we hit a problem: the md_log_replay thread of a standby-replay MDS hung.

1. The reason:

  line1:    while (!journaler->is_readable() &&
  line2:           journaler->get_read_pos() < journaler->get_write_pos() &&
  line3:           !journaler->get_error()) {
  line4:      C_SaferCond readable_waiter;
  line5:      journaler->wait_for_readable(&readable_waiter);
  line6:      r = readable_waiter.wait();
  line7:    }
This code is from MDLog::_replay_thread().
  (1) Suppose this thread ("md_log_replay") enters the while loop and is preempted in favor of the MR_Finisher thread between line3 and line5. At this point journaler->get_read_pos() < journaler->get_write_pos() still holds.
  (2) The MR_Finisher thread then runs Journaler::C_Read::finish(), which calls ls->_finish_read() -> _assimilate_prefetch().
    a) In _assimilate_prefetch(), journaler->get_write_pos() may become equal to journaler->get_read_pos().
    b) Because md_log_replay has not yet reached line5, no waiter is registered: on_readable is still 0, so f->complete() is not called.
        if (on_readable) {
          C_OnFinisher *f = on_readable;
          on_readable = 0;
          f->complete(0);
        }
  (3) When the scheduler switches back from MR_Finisher to md_log_replay, the thread registers its waiter at line5 and then blocks on line6 forever, because the wakeup it is waiting for has already been missed.
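
The interleaving in steps (1) to (3) is a classic lost wakeup. The sketch below replays it on a single thread so the outcome is deterministic. MiniJournaler, wait_for_readable_racy, wait_for_readable_checked and waiter_completed are hypothetical names invented for illustration; this is not Ceph's Journaler and not necessarily the fix that shipped in the erratum. The "checked" variant shows one possible mitigation, assuming the loop condition can be re-evaluated under the same lock that guards on_readable.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <mutex>
#include <utility>

// Hypothetical, heavily simplified model of the journaler state.
// It is NOT Ceph's Journaler; names and fields are assumptions.
struct MiniJournaler {
  std::mutex lock;
  uint64_t read_pos = 0;
  uint64_t write_pos = 100;
  bool readable = false;
  std::function<void(int)> on_readable;  // empty means "no waiter registered"

  // Models the finisher-side path (_finish_read -> _assimilate_prefetch):
  // state changes, and a waiter is completed only if one is registered.
  void assimilate_prefetch(uint64_t new_write_pos) {
    std::function<void(int)> f;
    {
      std::lock_guard<std::mutex> l(lock);
      write_pos = new_write_pos;
      f.swap(on_readable);  // take the waiter, leaving on_readable empty
    }
    if (f) f(0);
  }

  // Racy registration, as in the report: stores the waiter unconditionally.
  void wait_for_readable_racy(std::function<void(int)> cb) {
    std::lock_guard<std::mutex> l(lock);
    assert(!on_readable);  // model assumption: one waiter at a time
    on_readable = std::move(cb);
  }

  // Assumed mitigation: re-check the loop condition under the lock and
  // complete immediately when there is nothing left to wait for.
  void wait_for_readable_checked(std::function<void(int)> cb) {
    {
      std::lock_guard<std::mutex> l(lock);
      if (!readable && read_pos < write_pos) {
        on_readable = std::move(cb);
        return;
      }
    }
    cb(0);
  }
};

// Replays steps (1)-(3) in order. Returns true if the waiter was completed;
// false corresponds to readable_waiter.wait() blocking forever.
bool waiter_completed(bool use_checked) {
  MiniJournaler j;
  bool completed = false;
  auto waiter = [&completed](int) { completed = true; };

  // (1) Replay thread evaluates the while condition and decides to wait.
  bool must_wait = !j.readable && j.read_pos < j.write_pos;
  if (!must_wait) return true;

  // (2) Finisher runs before any waiter is registered: nobody to wake.
  j.assimilate_prefetch(j.read_pos);  // write_pos becomes equal to read_pos

  // (3) Replay thread resumes and registers its waiter.
  if (use_checked) {
    j.wait_for_readable_checked(waiter);
  } else {
    j.wait_for_readable_racy(waiter);
  }
  return completed;
}
```

Calling waiter_completed(false) reproduces the hang condition (the waiter is registered after the only wakeup has passed), while waiter_completed(true) completes the waiter at registration time because the re-check sees that read_pos has caught up with write_pos.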

Comment 9 errata-xmlrpc 2023-02-28 10:06:24 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: Red Hat Ceph Storage 5.3 Bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2023:0980
