
Bug 2162020

Summary: gfs2: The gfs2_logd process hangs or stalls, which causes performance degradation on the gfs2 filesystem
Product: Red Hat Enterprise Linux 7
Reporter: Shane Bradley <sbradley>
Component: kernel
Assignee: Robert Peterson <rpeterso>
kernel sub component: GFS/GFS2
QA Contact: cluster-qe <cluster-qe>
Status: CLOSED MIGRATED
Docs Contact:
Severity: medium
Priority: high
CC: bdm, bmarson, gfs2-maint, rpeterso, sbradley, swachira
Version: 7.9
Keywords: MigratedToJIRA
Target Milestone: rc
Flags: pm-rhel: mirror+
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2023-09-25 11:06:21 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Comment 7 Robert Peterson 2023-02-08 15:33:58 UTC
I just wanted to provide an update for the customer.

I have analyzed the new data they sent in and it was very helpful. They did everything right.

The bottom line is that the latest failure was very similar to the previous one.
All the hung processes eventually boiled down to the log flush daemon getting stuck in a tight CPU-bound loop on expradarnetpre01:

root      15948  0.9  0.0      0     0 ?        R    Jan23 192:51 [gfs2_logd]

(All the runs show gfs2_logd in the same tight loop)
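
For triage, a check like the one above can be scripted. The following is only a minimal sketch: it assumes `ps aux`-style output with the TIME column in minutes:seconds form, as in the line quoted above. The helper names and the one-hour CPU threshold are hypothetical, chosen purely for illustration.

```python
def parse_ps_line(line):
    """Split one `ps aux`-style line into the fields used below."""
    # ps aux columns: USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
    fields = line.split(None, 10)
    # Assumption: TIME is in MMM:SS form, as in the sample line.
    minutes, seconds = fields[9].split(":")
    return {
        "pid": int(fields[1]),
        "stat": fields[7],
        "cpu_seconds": int(minutes) * 60 + int(seconds),
        "command": fields[10].strip(),
    }


def looks_like_spinning_logd(proc, min_cpu_seconds=3600):
    """Heuristic: a runnable gfs2_logd thread that has accumulated a lot of CPU time.

    The threshold is an arbitrary illustrative value, not a documented limit.
    """
    return (
        proc["command"] == "[gfs2_logd]"
        and proc["stat"].startswith("R")
        and proc["cpu_seconds"] >= min_cpu_seconds
    )


sample = "root      15948  0.9  0.0      0     0 ?        R    Jan23 192:51 [gfs2_logd]"
proc = parse_ps_line(sample)
print(proc["pid"], proc["cpu_seconds"], looks_like_spinning_logd(proc))
# prints: 15948 11571 True
```

A single sample in state R is not conclusive by itself; the diagnosis above rests on every run showing the daemon in the same state.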

I can't tell for sure where or why it's in this tight loop, but the cause is almost certainly items on gfs2's "active items list" ("ail") that were never written and removed.
As I said before, we have seen and fixed several issues like this in rhel8 and up, but rhel7 is still lagging behind in its patches.
The test kernel we provided earlier contained the 4 patches that seemed most likely to fix the problem. Unfortunately, it looks like more are needed.

I've backported two more upstream patches to a new test kernel. The first one is just a simple prerequisite refactoring that doesn't change the logic.
The second patch, "gfs2: Wipe jdata and ail1 in gfs2_journal_wipe, formerly gfs2_meta_wipe", fixed another case in which the gfs2_logd daemon got stuck.
I did not include this in the first test kernel because the patch had more to do with journaled data (as opposed to metadata) in the journal, so I thought it was unlikely to have caused the problem.

I'd like to port another patch or two as well, which I should be able to do today. This addition is purely instrumentation: it detects when the gfs2_logd daemon gets stuck on the ail list and, if the stall is not resolved in a timely manner, stops trying and dumps out the complete ail list so we can tell more about why it got stuck.
This will print messages to the console, dmesg, and the syslog. So if it happens again, we should get more information about the problem in the sos reports.
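
The idea behind that instrumentation can be modelled in user space. This is a hedged sketch only, not the kernel patch: `flush_ail`, `try_write`, the timeout values, and the string items standing in for ail entries are all hypothetical names chosen for illustration.

```python
import time


def flush_ail(ail, try_write, timeout_s=1.0, poll_s=0.01, clock=time.monotonic):
    """Try to empty `ail`; give up and dump it if nothing completes within timeout_s.

    ail        -- list of pending items (a stand-in for the real ail list)
    try_write  -- callable(item) -> True once the item has been written
    Returns (ok, leftover); ok is False when the flush was abandoned.
    """
    last_progress = clock()
    while ail:
        before = len(ail)
        # Keep only the items that still could not be written.
        ail[:] = [item for item in ail if not try_write(item)]
        if len(ail) < before:
            last_progress = clock()  # something completed, reset the stall timer
        elif clock() - last_progress > timeout_s:
            # Stalled for too long: stop retrying and dump what is left.
            for item in ail:
                print(f"ail stuck: {item!r}")
            return False, list(ail)
        else:
            time.sleep(poll_s)  # avoid spinning while waiting for progress
    return True, []


# "bh-2" models an item that never gets written.
ok, leftover = flush_ail(["bh-1", "bh-2", "bh-3"], lambda item: item != "bh-2", timeout_s=0.05)
print(ok, leftover)
```

The point of the design is that a bounded wait plus a dump turns a silent spin into a diagnosable event.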

I'm also going to reevaluate all the other patches we did between rhel7 and rhel8 to make sure we didn't forget any others they might need. I plan to do this today.
Then, of course, the new test kernel will need to be re-tested, which may take a day or two.
Hopefully we can provide a new test kernel by the end of this week or early next week, but of course, there are no guarantees.

The good news is that we're pretty sure this is not a new problem, so we know how to debug it and have fixed it before.

Comment 11 Robert Peterson 2023-02-09 20:05:33 UTC
I built a new rhel7 test kernel with several additional patches.
I tested it with xfstests and the QE group's cluster coherency test, and it passed.
It's interesting to note that xfstests deadlocks on a stock 3.10.0-1160.85.1 kernel, but passes with my test kernel.
I asked Nate in the QE group to run regression tests.
The test kernel is called 3.10.0-1160.84.1.el7.case03393117b and it contains the following patches from upstream:

---------------[ origin/main..main.bz2162020 ]---------------
20fc877f007 Bob Peterson         gfs2: Only set PageChecked if we have a transaction
6ac5a43e507 Bob Peterson         GFS2: gfs2_free_extlen can return an extent that is too long
b784e16a3b7 Bob Peterson         GFS2: Only set PageChecked for jdata pages
147c7ce222f Bob Peterson         gfs2: instrumentation wrt ail1 stuck
6df82ce48fa Bob Peterson         gfs2: initialize transaction tr_ailX_lists earlier
d92a57f6ec8 Bob Peterson         gfs2: Wipe jdata and ail1 in gfs2_journal_wipe, formerly gfs2_meta_wipe
17a4a160df5 Bob Peterson         GFS2: Refactor gfs2_remove_from_journal
6efd9093e9c Bob Peterson         gfs2: Fix case in which ail writes are done to jdata holes
7db64ea20f6 Bob Peterson         gfs2: In gfs2_ail1_start_one unplug the IO when needed
84c2eb65ebc Bob Peterson         gfs2: Don't get stuck with I/O plugged in gfs2_ail1_flush
f7698b28e1d Bob Peterson         Revert "GFS2: Re-add a call to log_flush_wait when flushing the journal"

Comment 83 RHEL Program Management 2023-09-25 11:05:59 UTC
Issue migration from Bugzilla to Jira is in process at this time. This will be the last message in Jira copied from the Bugzilla bug.

Comment 84 RHEL Program Management 2023-09-25 11:06:21 UTC
This BZ has been automatically migrated to the issues.redhat.com Red Hat Issue Tracker. All future work related to this report will be managed there.

Due to differences in account names between systems, some fields were not replicated. Be sure to add yourself to the Jira issue's "Watchers" field to continue receiving updates, and add others to the "Need Info From" field to continue requesting information.

To find the migrated issue, look in the "Links" section for a direct link to the new issue location. The issue key will have an icon of 2 footprints next to it, and begin with "RHEL-" followed by an integer.  You can also find this issue by visiting https://issues.redhat.com/issues/?jql= and searching the "Bugzilla Bug" field for this BZ's number, e.g. a search like:

"Bugzilla Bug" = 1234567

In the event you have trouble locating or viewing this issue, you can file an issue by sending mail to rh-issues. You can also visit https://access.redhat.com/articles/7032570 for general account information.
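
The search described in the comment above can also be opened as a direct URL. A minimal sketch, assuming the `jql=` parameter accepts a percent-encoded query string; the function name is hypothetical, while the base URL and the "Bugzilla Bug" field name are taken from the comment itself.

```python
from urllib.parse import quote

# Base search URL as given in the migration comment.
JIRA_SEARCH = "https://issues.redhat.com/issues/?jql="


def migrated_issue_search_url(bz_number):
    """Build a search URL for the Jira issue migrated from a given Bugzilla bug."""
    jql = f'"Bugzilla Bug" = {int(bz_number)}'
    return JIRA_SEARCH + quote(jql)


url = migrated_issue_search_url(2162020)
print(url)
# prints: https://issues.redhat.com/issues/?jql=%22Bugzilla%20Bug%22%20%3D%202162020
```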