Bug 2162020
| Summary: | gfs2: The gfs2_logd process hangs or stalls, which causes performance degradation on the gfs2 filesystem | ||
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 7 | Reporter: | Shane Bradley <sbradley> |
| Component: | kernel | Assignee: | Robert Peterson <rpeterso> |
| kernel sub component: | GFS/GFS2 | QA Contact: | cluster-qe <cluster-qe> |
| Status: | ASSIGNED --- | Docs Contact: | |
| Severity: | medium | ||
| Priority: | high | CC: | bdm, bmarson, gfs2-maint, rpeterso, sbradley, swachira |
| Version: | 7.9 | Flags: | bdm: needinfo-, bdm: needinfo-, bdm: needinfo- |
| Target Milestone: | rc | ||
| Target Release: | --- | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | Doc Type: | If docs needed, set a value | |
| Doc Text: | Story Points: | --- | |
| Clone Of: | Environment: | ||
| Last Closed: | Type: | Bug | |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | Category: | --- | |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
I built a new rhel7 test kernel with several additional patches. I tested it with xfstests and the QE group's cluster coherency test, and it passed. It's interesting to note that xfstests deadlocks on a stock 3.10.0-1160.85.1 kernel, but passes with my test kernel. I asked Nate in the QE group to run regression tests.

The test kernel is called 3.10.0-1160.84.1.el7.case03393117b and it contains the following patches from upstream:

---------------[ origin/main..main.bz2162020 ]---------------
20fc877f007 Bob Peterson gfs2: Only set PageChecked if we have a transaction
6ac5a43e507 Bob Peterson GFS2: gfs2_free_extlen can return an extent that is too long
b784e16a3b7 Bob Peterson GFS2: Only set PageChecked for jdata pages
147c7ce222f Bob Peterson gfs2: instrumentation wrt ail1 stuck
6df82ce48fa Bob Peterson gfs2: initialize transaction tr_ailX_lists earlier
d92a57f6ec8 Bob Peterson gfs2: Wipe jdata and ail1 in gfs2_journal_wipe, formerly gfs2_meta_wipe
17a4a160df5 Bob Peterson GFS2: Refactor gfs2_remove_from_journal
6efd9093e9c Bob Peterson gfs2: Fix case in which ail writes are done to jdata holes
7db64ea20f6 Bob Peterson gfs2: In gfs2_ail1_start_one unplug the IO when needed
84c2eb65ebc Bob Peterson gfs2: Don't get stuck with I/O plugged in gfs2_ail1_flush
f7698b28e1d Bob Peterson Revert "GFS2: Re-add a call to log_flush_wait when flushing the journal"
I just wanted to provide an update for the customer. I have analyzed the new data they sent in and it was very helpful. They did everything right.

The bottom line is that the latest failure was very similar to the previous one. All the hung processes eventually boiled down to the log flush daemon getting stuck in a tight cpu-bound loop on expradarnetpre01:

root 15948 0.9 0.0 0 0 ? R Jan23 192:51 [gfs2_logd]

(All the runs show gfs2_logd in the same tight loop.)

I can't tell for sure where or why it's in this tight loop, but it's almost certainly due to items on gfs2's "active items list" ("ail") that never got written and removed. As I said before, we have seen and fixed several issues like this in rhel8 and up, but rhel7 is still lagging behind in its patches.

The test kernel we provided earlier contained 4 patches that seemed most likely to fix the problem. Unfortunately, it looks like they need more. I've backported two more upstream patches to a new test kernel. The first one is just a simple prerequisite refactoring that doesn't change the logic. The second patch, "gfs2: Wipe jdata and ail1 in gfs2_journal_wipe, formerly gfs2_meta_wipe", fixed another case in which the gfs2_logd daemon got stuck. I did not include this in the first test kernel because the patch had more to do with journaled data (as opposed to metadata) in the journal, so I thought it was unlikely to have caused the problem.

I'd like to port another patch or two as well, which I should be able to do today. The additional patch is purely instrumentation: it detects when the gfs2_logd daemon gets stuck on the ail list and, if the stall is not resolved in a timely manner, stops trying and dumps out the complete ail list so we can tell more about why it got stuck. This will print messages to the console, dmesg, and the syslog. So if it happens again, we should get more information about the problem in the sos reports.
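To illustrate the idea behind the instrumentation described above, here is a minimal user-space sketch in Python. It is not the kernel patch: the real code is C inside gfs2, and every name here (AilItem, StallReport, flush_ail_with_watchdog, write_one, timeout_s) is hypothetical and chosen only for the illustration. It models a flush loop that keeps retrying the items left on an ail list, and once a time limit is exceeded, gives up and dumps whatever is still on the list.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class AilItem:
    """Hypothetical stand-in for one buffer sitting on the ail list."""
    block: int
    written: bool = False


@dataclass
class StallReport:
    """Result of one flush attempt (illustrative, not a kernel structure)."""
    stalled: bool
    passes: int
    leftover: List[AilItem] = field(default_factory=list)


def flush_ail_with_watchdog(
    ail: List[AilItem],
    write_one: Callable[[AilItem], bool],
    timeout_s: float = 1.0,
    clock: Callable[[], float] = time.monotonic,
    log: Callable[[str], None] = print,
) -> StallReport:
    """Retry writing ail items; stop and dump the list if we exceed timeout_s."""
    start = clock()
    passes = 0
    while ail:
        passes += 1
        # Iterate over a copy so completed items can be removed safely.
        for item in list(ail):
            if write_one(item):
                item.written = True
                ail.remove(item)
        # Without this check, an item that never completes keeps the loop
        # spinning forever, which is the tight-loop symptom seen in gfs2_logd.
        if ail and clock() - start > timeout_s:
            log(f"ail flush stuck after {passes} passes, {len(ail)} item(s) left")
            for item in ail:
                log(f"  stuck item: block={item.block} written={item.written}")
            return StallReport(True, passes, list(ail))
    return StallReport(False, passes)
```

Calling flush_ail_with_watchdog with a write_one callback that always fails for one particular block returns a report with stalled set to True and that block in leftover, rather than looping indefinitely. The timeout value and the fields that get dumped are assumptions made for this sketch; the actual patch decides both for itself.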
I'm also going to reevaluate all the other patches we did between rhel7 and rhel8 to make sure we didn't forget any others they might need. I plan to do this today. Then, of course, the new test kernel will need to be re-tested, which may take a day or two. Hopefully we can provide a new test kernel by the end of this week or early next week, but of course, there are no guarantees. The good news is that we're pretty sure this is not a new problem, so we know how to debug it and have fixed it before.