Bug 2162020 - gfs2: The gfs2_logd process hangs or stalls, causing performance degradation on the gfs2 filesystem
Summary: gfs2: The gfs2_logd process hangs or stalls, causing performance degradation on the gfs2 filesystem
Keywords:
Status: ASSIGNED
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: kernel
Version: 7.9
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: medium
Target Milestone: rc
Target Release: ---
Assignee: Robert Peterson
QA Contact: cluster-qe
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2023-01-18 15:45 UTC by Shane Bradley
Modified: 2023-08-10 17:08 UTC
CC List: 6 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:
Target Upstream Version:
Embargoed:
bdm: needinfo-




Links
Red Hat Issue Tracker RHELPLAN-145632 (last updated 2023-01-18 16:47:17 UTC)
Red Hat Knowledge Base (Solution) 6994611 (last updated 2023-01-18 15:45:51 UTC)

Comment 7 Robert Peterson 2023-02-08 15:33:58 UTC
I just wanted to provide an update for the customer.

I have analyzed the new data they sent in and it was very helpful. They did everything right.

The bottom line is that the latest failure was very similar to the previous one.
All the hung processes eventually boiled down to the log flush daemon (gfs2_logd) getting stuck in a tight, CPU-bound loop on expradarnetpre01:

root      15948  0.9  0.0      0     0 ?        R    Jan23 192:51 [gfs2_logd]

(All the runs show gfs2_logd in the same tight loop)

I can't tell for sure where or why it's in this tight loop, but it is almost certainly caused by items on gfs2's "active items list" ("ail") that were never written back and removed.
As I said before, we have seen and fixed several issues like this in rhel8 and up, but rhel7 is still lagging behind in its patches.
The test kernel we provided earlier contained 4 patches that seemed most likely to fix the problem. Unfortunately, it looks like they need more.
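To make the failure mode concrete, here is a minimal user-space model of that pattern (an illustrative sketch under stated assumptions; the structure and function names are mine, not gfs2 source). An item on the active items list whose writeback never completes means the drain loop can never make progress, so the daemon spins CPU-bound, matching the 'R'-state gfs2_logd in the ps output above:

#include <stdbool.h>
#include <stdio.h>

struct ail_item {                /* stand-in for an ail list entry */
    const char *name;
    bool written;                /* has writeback completed? */
    struct ail_item *next;
};

/* Remove completed items; return true once the list has drained. */
static bool try_empty_ail(struct ail_item **head)
{
    struct ail_item **pp = head;
    while (*pp) {
        if ((*pp)->written)
            *pp = (*pp)->next;   /* written: remove from the list */
        else
            pp = &(*pp)->next;   /* still pending: leave in place */
    }
    return *head == NULL;
}

int main(void)
{
    /* One item whose writeback never completes. */
    struct ail_item stuck = { "buffer never written", false, NULL };
    struct ail_item *ail = &stuck;
    unsigned long spins = 0;

    /* The flush loop: retry until the list drains.  With a stuck
     * item this would spin forever; cap it so the demo terminates. */
    while (!try_empty_ail(&ail) && ++spins < 1000000)
        ;
    printf("gave up after %lu spins; '%s' still on the ail\n",
           spins, stuck.name);
    return 0;
}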

I've backported two more upstream patches to a new test kernel. The first one is just a simple prerequisite refactoring that doesn't change the logic.
The second patch, "gfs2: Wipe jdata and ail1 in gfs2_journal_wipe, formerly gfs2_meta_wipe" fixed another case in which the gfs2_logd daemon got stuck.
I did not include this in the first test kernel because the patch had more to do with journaled data (as opposed to metadata) in the journal, so I thought it was unlikely to have caused the problem.

I'd like to port another patch or two as well, which I should be able to do today. These are purely instrumentation: they detect when the gfs2_logd daemon gets stuck on the ail list and, if the situation is not resolved in a timely manner, stop trying and dump out the complete ail list so we can tell more about why it got stuck.
This will print messages to the console, dmesg, and the syslog. So if it happens again, we should get more information about the problem in the sos reports.
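As a sketch of what such instrumentation might look like (a hypothetical user-space model with illustrative names; the real patch is kernel code and is not reproduced here), the idea is to track how long the flusher has been retrying and, past a deadline, stop trying and dump the remaining ail entries so the logs show why it got stuck:

#include <stdbool.h>
#include <stdio.h>
#include <time.h>

struct ail_item {
    const char *name;
    bool written;
    struct ail_item *next;
};

static bool ail_empty(const struct ail_item *head)
{
    for (; head; head = head->next)
        if (!head->written)
            return false;
    return true;
}

/* Dump every entry still on the list (stands in for the kernel's
 * console/dmesg/syslog messages described above). */
static void dump_ail(const struct ail_item *head)
{
    for (; head; head = head->next)
        fprintf(stderr, "ail: item '%s' written=%d\n",
                head->name, head->written);
}

int main(void)
{
    struct ail_item stuck = { "hypothetical jdata block", false, NULL };
    time_t start = time(NULL);
    const double deadline = 2.0;    /* seconds; arbitrary for the demo */

    while (!ail_empty(&stuck)) {
        if (difftime(time(NULL), start) > deadline) {
            fprintf(stderr, "ail1 flush stuck, dumping list:\n");
            dump_ail(&stuck);       /* emit the diagnostics */
            break;                  /* stop trying, as described */
        }
    }
    return 0;
}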

I'm also going to reevaluate all the other patches we did between rhel7 and rhel8 to make sure we didn't forget any others they might need. I plan to do this today.
Then, of course, the new test kernel will need to be re-tested, which may take a day or two.
Hopefully we can provide a new test kernel by the end of this week or early next week, but of course, there are no guarantees.

The good news is that we're pretty sure this is not a new problem, so we know how to debug it and have fixed it before.

Comment 11 Robert Peterson 2023-02-09 20:05:33 UTC
I built a new rhel7 test kernel with several additional patches.
I tested it with xfstests and the QE group's cluster coherency test, and it passed.
It's interesting to note that xfstests deadlocks on a stock 3.10.0-1160.85.1 kernel, but passes with my test kernel.
I asked Nate in the QE group to run regression tests.
The test kernel is called 3.10.0-1160.84.1.el7.case03393117b and it contains the following patches from upstream:

---------------[ origin/main..main.bz2162020 ]---------------
20fc877f007 Bob Peterson         gfs2: Only set PageChecked if we have a transaction
6ac5a43e507 Bob Peterson         GFS2: gfs2_free_extlen can return an extent that is too long
b784e16a3b7 Bob Peterson         GFS2: Only set PageChecked for jdata pages
147c7ce222f Bob Peterson         gfs2: instrumentation wrt ail1 stuck
6df82ce48fa Bob Peterson         gfs2: initialize transaction tr_ailX_lists earlier
d92a57f6ec8 Bob Peterson         gfs2: Wipe jdata and ail1 in gfs2_journal_wipe, formerly gfs2_meta_wipe
17a4a160df5 Bob Peterson         GFS2: Refactor gfs2_remove_from_journal
6efd9093e9c Bob Peterson         gfs2: Fix case in which ail writes are done to jdata holes
7db64ea20f6 Bob Peterson         gfs2: In gfs2_ail1_start_one unplug the IO when needed
84c2eb65ebc Bob Peterson         gfs2: Don't get stuck with I/O plugged in gfs2_ail1_flush
f7698b28e1d Bob Peterson         Revert "GFS2: Re-add a call to log_flush_wait when flushing the journal"
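For context on two of the patches above ("In gfs2_ail1_start_one unplug the IO when needed" and "Don't get stuck with I/O plugged in gfs2_ail1_flush"): block-layer plugging batches submitted I/O in the submitting task, so sleeping on that I/O before unplugging means waiting on requests that were never dispatched. The following is a minimal user-space model of that hazard (the plug structure and helpers are illustrative stand-ins, not the kernel API):

#include <stdio.h>

struct plug {
    int queued;                 /* I/Os batched but not yet issued */
    int in_flight;              /* I/Os actually sent to the device */
};

static void submit_io(struct plug *p)
{
    p->queued++;                /* held in the plug, not dispatched */
}

static void finish_plug(struct plug *p)
{
    p->in_flight += p->queued;  /* dispatch the whole batch */
    p->queued = 0;
}

static void wait_for_io(const struct plug *p)
{
    /* Completions can only arrive for dispatched I/O. */
    if (p->queued > 0)
        puts("stuck: waiting on I/O still held in the plug");
    else
        printf("ok: %d I/Os completed\n", p->in_flight);
}

int main(void)
{
    struct plug p = { 0, 0 };

    submit_io(&p);
    wait_for_io(&p);            /* the bug: wait before unplugging */

    finish_plug(&p);            /* the fix: unplug first... */
    wait_for_io(&p);            /* ...then wait */
    return 0;
}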

