Bug 813137
| Field | Value |
|---|---|
| Summary | [xfs/xfstests 273] heavy cp workload hang |
| Product | Red Hat Enterprise Linux 6 |
| Reporter | Eryu Guan <eguan> |
| Component | kernel |
| Assignee | Brian Foster <bfoster> |
| Status | CLOSED ERRATA |
| QA Contact | Boris Ranto <branto> |
| Severity | high |
| Docs Contact | |
| Priority | urgent |
| Version | 6.3 |
| CC | akrherz, bfoster, branto, dchinner, dhoward, esandeen, fhrbata, garima_kochhar, gireesha_us, jcpunk, jwest, kernel-eus-qe, leo, narebeestjes, pasteur, rhinduja, rwheeler, torel |
| Target Milestone | rc |
| Keywords | ZStream |
| Target Release | --- |
| Hardware | x86_64 |
| OS | Linux |
| Whiteboard | |
| Fixed In Version | kernel-2.6.32-288.el6 |
| Doc Type | Bug Fix |
| Doc Text | Various race conditions that led to indefinite log reservation hangs due to xfsaild "idle" mode occurred in the XFS file system. This could lead to certain tasks becoming unresponsive; for example, the cp utility could become unresponsive under heavy workload. This update improves the Active Item List (AIL) pushing logic in xfsaild. Also, the log reservation algorithm and its interactions with xfsaild have been improved. As a result, the aforementioned problems no longer occur in this scenario. |
| Story Points | --- |
| Clone Of | |
| Environment | |
| Last Closed | 2013-02-21 06:07:57 UTC |
| Type | Bug |
| Regression | --- |
| Mount Type | --- |
| Documentation | --- |
| CRM | |
| Verified Versions | |
| Category | --- |
| oVirt Team | --- |
| RHEL 7.3 requirements from Atomic Host | |
| Cloudforms Team | --- |
| Target Upstream Version | |
| Embargoed | |
| Bug Depends On | |
| Bug Blocks | 860787, 969150, 972578 |
| Attachments | Full console log (577872); Patch against upstream XFS to cause xfsaild idling (581261) |
Description
Eryu Guan
2012-04-17 03:28:57 UTC
Created attachment 577872 [details]
Full console log
Reproduced on 6.2 GA kernel, so it's not a regression.

Can you see if it is reproducible on the 6.1 kernel? That will tell me if it was introduced in 6.2 or not, and hence whether it is a known problem or something new....

I'm running 273 on the 6.1 kernel (2.6.32-131.0.15.el6); it seems to take much longer (5397s) to finish one iteration than it takes on the 6.2 or 6.3 kernel (295s). I'll leave it running the whole weekend and update results next Monday.

273 takes less than 2 hours (about 6000s) to finish one iteration on rhel6.1, and it doesn't finish 100 iterations within two days (the whole weekend). So far I see no hung tasks, only poor performance.

FWIW, I was messing around with this and I've reproduced 2/2 times on a local rhel6.3 VM (2.6.32-264.el6, x86_64, 8 cpu, 20GB vd). Looking through the associated area of code, I stuck an xlog_grant_push_ail() call in xlog_grant_log_space() before the first sleep and got through 100 iterations without a hang. I'm trying to learn some of this code, so it isn't totally clear if this is appropriate, fixing a symptom vs. a problem, otherwise wildly broken, etc. One thing I did notice is that xlog_grant_log_space() seems to follow a similar pattern as xlog_regrant_write_log_space(), and the latter function includes a bit of logic to wake up any queued tasks and potentially push the ail before the first sleep (though the purpose of xlog_regrant_write_log_space() is not clear to me). I could try adding similar logic to xlog_grant_log_space() if that sounds reasonable. (FWIW, it seems that upstream is somewhat different, but does include looping ail push logic down in this codepath via xlog_grant_head_wait().)

(In reply to comment #6)
> 273 takes less than 2 hours (about 6000s) to finish one iteration on rhel6.1,
> it doesn't finish 100 iterations within two days (the whole weekend). So far I
> see no hung tasks but only poor performance.

Ok, that is kind of what I expected. It's related to the speed-ups in the code introduced in the 6.2 kernel.

(In reply to comment #7)
> FWIW, I was messing around with this and I've reproduced 2/2 times on a local
> rhel6.3 VM (2.6.32-264.el6, x86_64, 8 cpu, 20GB vd). Looking through the
> associated area of code, I stuck a xlog_grant_push_ail() call in
> xlog_grant_log_space() before the first sleep and got through 100 iterations
> without a hang.

That just papers over the issue by adding more push calls that can break the deadlock. It doesn't avoid the deadlock at all, and from previous experience it isn't a reliable workaround for such log space hangs, either.

> I'm trying to learn some of this code so it isn't totally clear if this is
> appropriate, fixing a symptom vs. a problem, otherwise wildly broken,

It's a symptom, unfortunately. There are other problems in the code that have been recently fixed, but I think this one will be relatively easy to fix with a backport.

> etc. One thing I did notice is that xlog_grant_log_space() seems to follow a
> similar pattern as xlog_regrant_write_log_space(), and the latter function
> includes a bit of logic to wake up any queued tasks and potentially push the
> ail before the first sleep (though the purpose of xlog_regrant_write_log_space()
> is not clear to me..).

It's for permanent (rolling) transactions that already have a multiple transaction log space reservation through xlog_grant_log_space(). The regrant code is just moving the reservation for the next transaction in the series from the reserve head to the write head so it can take place normally.
> I could try adding similar logic to xlog_grant_log_space() if
> that sounds reasonable. (FWIW, it seems that upstream is somewhat different,
> but does include looping ail push logic down in this codepath via
> xlog_grant_head_wait().)

The problem is much more likely the one fixed by this patch set that was added to the 3.0.x stable kernel series to solve the similar log space hangs that people were reporting upstream:

http://oss.sgi.com/archives/xfs/2011-10/msg00235.html

The "revert to kthread" patch is not necessary, because RHEL6 doesn't have the workqueue infrastructure that this patch is reverting from and hence still has a thread for xfsaild. So I'd start by backporting this series and see if that solves the problem for you.

Cheers,

Dave.

(In reply to comment #9)
> It's for permanent (rolling) transactions that already have a multiple
> transaction log space reservation through xlog_grant_log_space(). The regrant
> code is just moving the reservation for the next transaction in the series
> from the reserve head to the write head so it can take place normally.

Thanks for the comments and explanation.

> The problem is much more likely the one fixed by this patch set that was added
> to the 3.0.x stable kernel series to solve the similar log space hangs that
> people were reporting upstream.
> ...
> So I'd start by backporting this series and see if that solves the problem
> for you.

I did a quick backport of the core bits here (minus the kthread patch) to the latest rhel6 tree and ran a 100-loop test clean overnight. I'm running it once more just to be sure. If it passes I'll pull in the rest (the tracepoints, mainly), run through a full xfstests and post a patch. Thanks again for pointing me in the right direction here!

Sure enough, not long after I post and start the second test, I reproduce the hang. ;) This patchset seems to make it harder to reproduce, anyway. I think I'll give this test a whirl on an upstream kernel...

I ran this 100-loop sequence against 3.4.0-rc4+ 3 or 4 times over the weekend and couldn't reproduce a hang. After reading/comparing some of this code with upstream, I noticed one obvious difference: upstream xfsaild never appears to sleep. My obvious question here: is this intentional?

Specifically, the tout return of xfsaild_push() has changed as this code has gone from kthread->workqueue->kthread, and the current upstream code returns 50 in the !count case (as opposed to 0 as in rhel6).

I'll attach a patch against upstream that makes this behave more like rhel for reference...
Created attachment 581261 [details]
Patch against upstream XFS to cause xfsaild idling

This is in reference to comment 12, particularly to illustrate the difference in xfsaild behavior between rhel6 and upstream (3.4.0-rc5+).

(In reply to comment #12)
> I ran this 100 loop sequence against 3.4.0-rc4+ 3 or 4 times over the weekend
> and couldn't reproduce a hang. After reading/comparing some of this code with
> upstream, I noticed one obvious difference: upstream xfsaild never appears to
> sleep. My obvious question here.. is this intentional?

The idling was removed when reverting the workqueue code back to a kthread. Yes, it was intentional, and I forgot about that little detail of the revert. Well done on spotting it. :)

> Specifically... the tout return of xfsaild_push() has changed as this code has
> gone from kthread->workqueue->kthread, and the current upstream code returns 50
> in the !count case (as opposed to 0 as in rhel6).
>
> I'll attach a patch against upstream that makes this behave more like rhel for
> reference...

Did idling cause upstream to fail?

As it is, I forgot about another, more recent upstream patch that is probably relevant here:

9f9c19e xfs: fix the logspace waiting algorithm

Can you backport that one to your series as well, and see if that fixes the rarer hang you are now seeing?

(In reply to comment #14)
> The idling was removed when reverting the workqueue code back to a kthread.
> Yes, it was intentional, and I forgot about that little detail of the revert.
> Well done on spotting it. :)
Thanks. :) Out of curiosity, what is the particular purpose for that behavior?

> Did idling cause upstream to fail?

Interestingly, it did not. I ran that test overnight and even commented out one or two of the "extra" (as compared to rhel6) ail push calls and didn't reproduce a hang. I suppose that's good in that upstream isn't apparently depending on the non-idle behavior to survive this test.

> As it is, I forgot about another, more recent upstream patch that is probably
> relevant here:
>
> 9f9c19e xfs: fix the logspace waiting algorithm
>
> Can you backport that one to your series as well, and see if that fixes the
> rarer hang you are now seeing?

Ah... thanks. This looks like the origin of the largest difference I see between rhel6 and upstream (i.e., the waiting algorithm and the cleanup to provide some commonality between log grant/regrant). This looks rather promising from what I've seen so far. I'll give it a shot... thanks again.

(In reply to comment #15)
> Thanks. :) Out of curiosity, what is the particular purpose for that behavior?

To prevent excessive wakeups preventing long CPU idle sleep times, i.e. to allow the CPU to spend more time in deep power saving modes.

> Interestingly, it did not.

That's good to know - I've been considering reintroducing the idling upstream given all the fixes we've done recently...

(In reply to comment #16)
> To prevent excessive wakeups preventing long CPU idle sleep times, i.e. to
> allow the CPU to spend more time in deep power saving modes.

This is why the idle functionality exists, no? I could have phrased that question better, but I'm wondering why the no-idle behavior was introduced...

As an update, I backported 9f9c19e to rhel6 and was still able to reproduce the hang. Doing a side-by-side walkthrough of the wait/wake code, the logic after this change seems to be relatively equivalent between upstream and rhel6. I started looking at the xfsaild_push() logic and came across 670ce93f. I included that change as well, got through two 100-loop runs without an issue and eventually reproduced the hang in a 3rd run. :(

I may try the no-idle change at least for reference, or perhaps adding some tracepoints to get more intuition on what's happening. In the meantime let me know if you think of anything else to test...

Since RHEL 6.3 External Beta has begun, and this bug remains unresolved, it has been rejected as it is not proposed as exception or blocker. Red Hat invites you to ask your support representative to propose this request, if appropriate and relevant, in the next release of Red Hat Enterprise Linux.

(In reply to comment #17)
> This is why the idle functionality exists, no? I could have phrased that
> question better, but I'm wondering why the no-idle behavior was introduced...

The idling code was relatively new, and we reverted it because we weren't sure whether it was having a watchdog effect on AIL pushing.
> As an update, I backported 9f9c19e to rhel6 and was still able to reproduce the
> hang. Doing a side-by-side walk through of the wait/wake code, the logic after
> this change seems to be relatively equivalent between upstream and rhel6. I
> started looking at the xfsaild_push() logic and came across 670ce93f. I
> included that change as well, got through two 100-loop runs without an issue
> and eventually reproduced the hang in a 3rd run. :(

So, that means you have backported these patches?

9f9c19e xfs: fix the logspace waiting algorithm
670ce93 xfs: reduce the number of log forces from tail pushing
17b3847 xfs: force the log if we encounter pinned buffers in .iop_pushbuf
bc6e588 xfs: do not update xa_last_pushed_lsn for locked items

Which means that the fix might be in these 4:

e44f411 xfs: set cursor in xfs_ail_splice() even when AIL was empty
af3e402 xfs: convert AIL cursors to use struct list_head
16b5902 xfs: remove confusing ail cursor wrapper
1d8c95a xfs: use a cursor for bulk AIL insertion

e44f411 is a fix for a real bug (and could possibly be the cause of the problem), while the other 3 are fixes for excessive CPU usage during log IO completion when inserting items into the AIL. That could be a contributing factor, but I never saw that issue cause a hang....

> I may try the no-idle change at least for reference, or perhaps adding some
> tracepoints to get more intuition on what's happening. In the meantime let me
> know if you think of anything else to test...

Backporting:

9e4c109 xfs: add AIL pushing tracepoints

will give you some tracepoints...

(In reply to comment #19)
> So, that means you have backported these patches?
>
> 9f9c19e xfs: fix the logspace waiting algorithm
> 670ce93 xfs: reduce the number of log forces from tail pushing
> 17b3847 xfs: force the log if we encounter pinned buffers in .iop_pushbuf
> bc6e588 xfs: do not update xa_last_pushed_lsn for locked items

Correct. I've been doing a little debugging that I'll describe below with 670ce93 reverted, because it doesn't seem to fix the problem but makes it harder to reproduce...

> Which means that the fix might be in these 4:
>
> e44f411 xfs: set cursor in xfs_ail_splice() even when AIL was empty
> af3e402 xfs: convert AIL cursors to use struct list_head
> 16b5902 xfs: remove confusing ail cursor wrapper
> 1d8c95a xfs: use a cursor for bulk AIL insertion

Thanks, I'll give these a shot...

> Backporting:
>
> 9e4c109 xfs: add AIL pushing tracepoints
>
> will give you some tracepoints...

Thanks again. I've hacked in a couple tracepoints to help me understand the paths I've been digging into, and so far I can use them to try and describe 2 hangs that I'm reproducing. The tracepoints are at the end of xfsaild_push(), so I can see the end result of a particular push, and in xfs_trans_ail_push(), so I can see when we do or do not actually execute a wake. This is with the backported patches listed above (save 670ce93).
1.) The first hang I combined with a dump of the in-memory ailp->xa_ail log item list using crash to see what was going on. What I see is a sequence like the following:

- xfsaild() is working with a target of 21474840310, hits the stuck > 100 heuristic and reschedules, setting the last_lsn pointer to 21474837942.
- Concurrently, the xa_target value is updated to 21474840950.
- xfsaild restarts with the updated target, handles 0 items and goes into idle mode, never to return.
- Using crash, I see a list with li_lsn entries larger than the last shown target (expected) preceded by a bunch of entries with li_lsn == 17179889014, which goes a bit further back in my tracepoint log.

I was thinking that perhaps the push mechanism skipped past these inadvertently across an xa_target update, causing it to think that even though it hit a target, everything previous was complete (???). TBH, I don't have a complete understanding of the log mechanism at the moment... so perhaps this is more likely related to your proposed ail list fixes..?

In any event, I'm able to avoid this hang by adding a little hack to xfsaild_push() to never enter idle mode on an xfsaild_push() sequence that doesn't start with last_lsn == 0. IOW, if we hit the target and last_lsn != 0, set last_lsn = 0 and schedule one more time.

2.) With the above "fix," I hit a hang and ended up diagnosing it to a slightly different cause. What I see in my run is that although we issue quite a few xfs_trans_ail_push() calls through the updated wait/wake algorithm, this function only calls xfsaild_wakeup() when the corresponding threshold_lsn moves the target_lsn forward (which implies we expected xfsaild, once woken, to reschedule itself intelligently until complete). It looks like I neglected to save my trace log from this one, so the high-level sequence of events from memory is as follows:

- xfsaild() is running.
- xfsaild_wakeup() runs, updates ailp->xa_target and issues a wake_up_process(). Via tracepoint, wake_up_process() returns 0 to indicate that xfsaild is already active.
- xfsaild_push() exits into idle mode.
- xfs_trans_ail_push() calls continue to occur for a period of time. None make it to xfsaild_wakeup() because they do not push xa_target forward, and I end up hung.

I can avoid this hang by setting a flag (stored in ailp) based on the return value of wake_up_process() and using that to make sure that xfs_trans_ail_push() calls continue to result in wake_up_process() calls until the thread activation actually occurs (adding a reset of this flag in xfsaild when it pulls a new xa_target might be better to avoid spurious wakes, but I'm just hacking around at the moment)...

--- Summary

So far I've run two successful 100-loop sequences with both of these changes, and started a 3rd this morning. Previously I hadn't got through a single 100-loop sequence with this combination of backports, so I'm a bit more confident that I've at least captured the characteristics of the hangs here. If this 3rd run succeeds, I plan to undo the first "fix," verify I reproduce, and get into testing the AIL patches you've proposed.

With regard to hang 2... On inspection, it looks like upstream has the same logic to limit wakes in the case where xa_target is not moved forward, so I'm a little curious why I never reproduced that one. At the same time, upstream does include changes like 670ce93 which made this significantly harder for me to reproduce (and could technically be a different hang). If I get a chance, perhaps I'll try reverting that on an upstream kernel and see if the hang occurs...
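For readers following along, the lost-wakeup pattern in hang 2 is easier to see outside the kernel. Below is a small user-space analogue, not the RHEL6 or upstream XFS code: the worker thread stands in for xfsaild, push() for the xfs_trans_ail_push()/xfsaild_wakeup() path, and wake_pending for the flag stored in ailp that the comment describes. All names and timings here are illustrative only.

```c
/*
 * User-space analogue (NOT kernel code) of the lost-wakeup problem in
 * hang 2 above: a push request that arrives while the worker is still
 * busy must be remembered, so the worker never goes idle with work
 * pending.
 */
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;
static bool wake_pending;   /* set by push(), cleared by the worker */
static bool done;

static void *worker(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&lock);
    for (;;) {
        /*
         * Only go "idle" if no push request arrived while we were busy.
         * Without this predicate, a wakeup issued while the worker was
         * already running (wake_up_process() returning 0 in the trace
         * above) is simply lost and the worker sleeps forever.
         */
        while (!wake_pending && !done)
            pthread_cond_wait(&cond, &lock);
        if (done)
            break;
        wake_pending = false;           /* consume the request...   */
        pthread_mutex_unlock(&lock);
        printf("worker: doing one push pass\n");
        usleep(10000);                  /* ...then do one push pass */
        pthread_mutex_lock(&lock);
    }
    pthread_mutex_unlock(&lock);
    return NULL;
}

static void push(void)
{
    pthread_mutex_lock(&lock);
    wake_pending = true;            /* remember the request even if...  */
    pthread_cond_signal(&cond);     /* ...the worker is already running */
    pthread_mutex_unlock(&lock);
}

int main(void)
{
    pthread_t t;
    int i;

    pthread_create(&t, NULL, worker, NULL);
    for (i = 0; i < 5; i++) {
        push();
        usleep(5000);
    }
    pthread_mutex_lock(&lock);
    done = true;
    pthread_cond_signal(&cond);
    pthread_mutex_unlock(&lock);
    pthread_join(t, NULL);
    return 0;
}
```

Build with `gcc -pthread`. In this analogue, dropping the wake_pending predicate would mean any push issued while the worker is mid-pass is lost and the worker then waits indefinitely, which is the user-space equivalent of the hang described above.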
(In reply to comment #20)
> Thanks again. I've hacked in a couple tracepoints to help me understand the
> paths I've been digging into and so far I can use them to try and describe 2
> hangs that I'm reproducing. The tracepoints are at the end of xfsaild_push(),
> so I can see the end result of a particular push, and in xfs_trans_ail_push(),
> so I can see when we do or do not actually execute a wake. This is with the
> backported patches listed above (save 670ce93).

Ok.

> 1.) The first hang I combined with a dump of the in memory ailp->xa_ail log
> item list using crash to see what was going on. What I see is a sequence like
> the following:
>
> - xfsaild() is working with a target of 21474840310, hits the stuck > 100
> heuristic and reschedules setting the last_lsn pointer to 21474837942.
> - Concurrently, the xa_target value is updated to 21474840950.
> - xfsaild restarts with the updated target, handles 0 items and goes into idle
> mode never to return.

Ah, numbers. I love numbers - I can debug code with numbers. ;)

FWIW, log sequence numbers make a lot more sense when printed in hex, e.g. 21474837942 = 0x5000005B6, which tells me the cycle number (upper 32 bits) is 5, and the sector offset (lower 32 bits) is 0x5b6. When you then compare it with 21474840950 = 0x500001176, you can see that only the sector offset has changed, and by how much. XFS uses encodings like this everywhere, so printing stuff in hex is often much more insightful than printing in decimal....

Anyway, what this means is that xfs_trans_ail_cursor_first() never found an entry with a last_lsn >= 21474837942, so as far as it is concerned, the push is complete. It sets last_lsn = 0, returns 0 and hence idles.

The upstream code currently assumes that even when we reach our target, there is more to do. That's the first problem with the RHEL6 code - idle should only be entered when the AIL is empty, not when we have reached our push target.

> - Using crash, I see a list with li_lsn entries larger than the last shown
> target (expected) preceded by a bunch of entries with li_lsn == 17179889014,
> which goes a bit further back in my tracepoint log.

Yes, that is exactly what will happen if we hit this case.

> I was thinking that perhaps the push mechanism skipped past these inadvertently
> across an xa_target update, causing it to think even though it hit a target
> that everything previous was complete (???). TBH, I don't have a complete
> understanding of the log mechanism at the moment... so perhaps this is more
> likely related to your proposed ail list fixes..?

No, that's not fixed in any of the commits I listed - it was caught by the removal of the idling logic. IOWs, Christoph was right in his suspicion that the idling logic was contributing to the problems upstream - we were just never able to confirm it until right now.

> In any event, I'm able to avoid this hang by adding a little hack to
> xfsaild_push() to never enter idle mode on an xfsaild_push() sequence that
> doesn't start with last_lsn == 0. IOW, if we hit the target and last_lsn != 0,
> set last_lsn = 0 and schedule one more time.

Well, this is handled upstream simply by handling the !count/reached target/empty cursor cases all the same - sleep for 50ms, set last_lsn = 0 to restart the push. That is, there is no idle case at all.
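To make the hex decoding above easy to reproduce, here is a tiny stand-alone helper. It is an illustration rather than XFS code (the kernel's CYCLE_LSN()/BLOCK_LSN() macros perform the same split); the cycle_lsn()/block_lsn() names are made up for this example, and the input values are the LSNs quoted in this exchange.

```c
/*
 * Stand-alone illustration of the LSN encoding described above: the upper
 * 32 bits of an XFS log sequence number are the log cycle, the lower 32
 * bits the block offset within that cycle.
 */
#include <inttypes.h>
#include <stdio.h>

static uint32_t cycle_lsn(uint64_t lsn) { return (uint32_t)(lsn >> 32); }
static uint32_t block_lsn(uint64_t lsn) { return (uint32_t)lsn; }

int main(void)
{
    /* The LSNs quoted in the hang-1 trace above. */
    uint64_t lsns[] = { 21474840310ULL,     /* push target                */
                        21474837942ULL,     /* last_lsn at reschedule     */
                        21474840950ULL,     /* updated xa_target          */
                        17179889014ULL };   /* stale li_lsn seen in crash */
    size_t i;

    for (i = 0; i < sizeof(lsns) / sizeof(lsns[0]); i++)
        printf("%llu = 0x%09" PRIx64 "  cycle %u  block 0x%x\n",
               (unsigned long long)lsns[i], lsns[i],
               cycle_lsn(lsns[i]), block_lsn(lsns[i]));
    return 0;
}
```

Running it shows that the first three values all sit in cycle 5 with only the block offset differing (0x5B6, 0xEF6, 0x1176), while the stale items are back in cycle 4, which matches the observation made above.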
Idling should only occur if the AIL is empty, so the idle logic in RHEL6 is completely broken and needs to be fixed.

> 2.) With the above "fix," I hit a hang and ended up diagnosing it to a slightly
> different cause. What I see in my run is that although we issue quite a few
> xfs_trans_ail_push() calls through the updated wait/wake algorithm, this
> function only calls xfsaild_wakeup() when the corresponding threshold_lsn moves
> the target_lsn forward (which implies we expected xfsaild, once woken, to
> reschedule itself intelligently until complete).
> ...
> - xfs_trans_ail_push() calls continue to occur for a period of time. None make
> it to xfsaild_wakeup() because they do not push xa_target forward and I end up
> hung.

Right, that is also fixed by not having an idle mode.

> I can avoid this hang by setting a flag (stored in ailp) based on the return
> value of wake_up_process() and using that to make sure that
> xfs_trans_ail_push() calls continue to result in wake_up_process() calls
> until the thread activation actually occurs (adding a reset of this flag in
> xfsaild when it pulls a new xa_target might be better to avoid spurious wakes,
> but I'm just hacking around at the moment)...

I think killing the idle mode is the first thing that is necessary. With just the basic backports you have, can you simply change the initial value of tout in xfsaild_push to 50, and see if that avoids the problems you have seen so far?

BTW, Brian - good work in debugging this :)

Cheers,

Dave.
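A minimal sketch of the policy difference being discussed, under the assumptions stated in this thread: with idle mode, "pushed nothing" maps to tout = 0 ("sleep until explicitly woken"), while the no-idle behaviour returns 50 ("recheck in 50ms"). This models only the decision, not the actual xfsaild_push() code; push_timeout() and the 20/10 values are illustrative, and only the 0/50 returns and the "stuck > 100" threshold come from the discussion above.

```c
/*
 * Model of the xfsaild sleep decision discussed in this thread.  NOT the
 * RHEL6 or upstream xfsaild_push() implementation, just the policy
 * difference: with idle mode, "pushed nothing" depends on a later wakeup;
 * without it, the thread always repolls.
 */
#include <stdbool.h>
#include <stdio.h>

/* Returned sleep time in ms; 0 means "go idle until explicitly woken". */
static int push_timeout(int items_pushed, int stuck, bool no_idle)
{
    if (stuck > 100)        /* the "stuck > 100" heuristic from the trace */
        return 20;          /* back off briefly and restart (illustrative) */
    if (items_pushed == 0)
        /*
         * RHEL6: nothing pushed => idle (0) and rely on a wakeup.
         * Upstream / suggested fix: return 50 and repoll, so a missed
         * or filtered wakeup cannot strand the thread.
         */
        return no_idle ? 50 : 0;
    return 10;              /* made progress: come back soon (illustrative) */
}

int main(void)
{
    /* Hang-1 shape: new target picked up, but 0 items handled. */
    printf("idle-mode policy: tout = %d (sleeps until a wakeup that may never arrive)\n",
           push_timeout(0, 0, false));
    printf("no-idle policy:   tout = %d ms (rechecks the AIL on its own)\n",
           push_timeout(0, 0, true));
    return 0;
}
```

Dave's point that idling should only ever happen with an empty AIL would amount to an extra ail-empty check in the "nothing pushed" branch; the series that eventually shipped for RHEL6 includes an "xfs: disable xfsaild idle mode" patch (see the changelog excerpt further down).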
(In reply to comment #21)
> FWIW, log sequence numbers make a lot more sense when printed in hex. e.g
> 21474837942 = 0x5000005B6 which tells me the cycle number (upper 32 bits) is 5,
> and the sector offset (lower 32 bits) is 0x5b6. When you then compare it with
> 21474840950 = 0x500001176, you can see that only the sector offset has changed,
> and by how much. XFS uses encodings like this everywhere, so printing stuff in
> hex is often much more insightful than printing in decimal....

Good to know, thanks...

> Anyway, what this means is that xfs_trans_ail_cursor_first() never found an
> entry with a last_lsn >= 21474837942, so as far as it is concerned, the push is
> complete. It sets last_lsn = 0, returns 0 and hence idles.
>
> The upstream code currently assumes that even when we reach our target, there
> is more to do. That's the first problem with the RHEL6 code - idle should only
> be entered when the AIL is empty, not when we have reached our push target.

Ok, that clears that up...

> I think killing the idle mode is the first thing that is necessary. With just
> the basic backports you have, can you simply change the initial value of tout
> in xfsaild_push to 50, and see if that avoids the problems you have seen so
> far?

Yep, I'll give that a shot. If that proves sufficient to resolve the hang, how much of the backport stuff would you recommend we include into rhel6? My debug branch is kind of botched with the various hacks I've been making, so I need to reconstruct it anyways.
(In reply to comment #22)
> I'll be initially testing the no-idle-mode change in a rhel6 branch that
> effectively includes the following:
>
> bc6e588a8971 xfs: do not update xa_last_pushed_lsn for locked items
> 17b38471c3c0 xfs: force the log if we encounter pinned buffers in .iop_pushbuf
> 9e4c109ac822 xfs: add AIL pushing tracepoints
> 9e7004e741de xfs: exit AIL push work correctly when AIL is empty
>   ^^^ I think the no-idle-mode change is just a cherry pick from this one

Not really. That was a fix for an idling condition when the AIL pushing was done via a workqueue. The workqueue code had different idle cases, and it was protected by a flag which had a race condition in clearing it. When the code was reverted back to a kthread, that flag was removed and the original idling behaviour was not restored. Hence the idling logic was removed by selective code reversion, not due to any specific patch.

> 9f9c19ec1a59 xfs: fix the logspace waiting algorithm
>
> ... but it sounds like technically the no-idle change alone could address the
> problem. FWIW, the aforementioned backports were relatively straightforward and
> I have no issue including them if appropriate.

Given that they are in the 3.0.x stable kernel, and XFS in RHEL6 is only slightly behind 3.0.x, they probably should be backported....

> > BTW, Brian - good work in debugging this :)
>
> Thanks for pointing me in the right direction! :)
>
> BTW, I'm pretty sure I tested upstream with a change that re-enabled idle mode
> and never hit this problem through a weekend test run. Now that I have a bit
> more context, I think some of the performance related changes I've tried
> testing make this significantly more difficult to reproduce.

It would seem that way - upstream still appears to have at least one bug that results in a log space hang that I'm trying to understand at the moment.....

> Once this bug is resolved, is there any interest/value in pursuing upstream
> fixes to the idle mode logic such that it could be re-enabled in the future?

Yes, there is. Laptop users will get noticeably longer battery life if we don't wake up the CPU 20x/sec per filesystem....

Cheers,

Dave.

I've run through the 273 100-loop test probably 8-10 times now, through the end of last week and over the weekend, and can no longer reproduce hangs. I ran a full xfstests today and observed no regressions. The patchset has been posted for rhel6.4:

http://post-office.corp.redhat.com/archives/rhkernel-list/2012-May/msg00840.html

This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux release for currently deployed products. This request is not yet committed for inclusion in a release.

Patch(es) available on kernel-2.6.32-288.el6

Clearing rh private, nothing secret here that I can see.

We're looking for a RHEL 6.3 z-stream fix for this bug.

Sorry to make noise, but the async 6.3 z-stream errata just released, kernel-2.6.32-279.14.1, does not mention this bug nor bugzilla #860787 in the rpm changelog? The erratum text mentions #860787, though. I am not clever enough to figure out if the patch is in the source or not, sorry. Did the patch make it?

Thank you
rpm -q --changelog -p kernel-2.6.32-279.14.1.el6.x86_64.rpm | grep 860787

- [fs] xfs: disable xfsaild idle mode (Brian Foster) [860787 813137]
- [fs] xfs: fix the logspace waiting algorithm (Brian Foster) [860787 813137]
- [fs] xfs: add AIL pushing tracepoints (Brian Foster) [860787 813137]
- [fs] xfs: force the log if we encounter pinned buffers in .iop_pushbuf (Brian Foster) [860787 813137]
- [fs] xfs: do not update xa_last_pushed_lsn for locked items (Brian Foster) [860787 813137]

Patches are in; the changelog also contains info about this fix.

Thank you, I was looking here: https://rhn.redhat.com/network/software/packages/change_log.pxt?pid=734552 and it stops at "[fs] udf: Avoid run away". I should not have assumed this was correct. Sorry for the noise...

Ref. Daryl's question. I started seeing this with the upgrade to 2.6.32-279.14.1. On an idle node loadavg = 1. Note I did not see this on 2.6.32-279.5.2.el6.x86_64.

top - 16:30:53 up 11 days, 18:03, 1 user, load average: 1.03, 1.02, 1.00
Tasks: 331 total, 1 running, 330 sleeping, 0 stopped, 0 zombie
Cpu(s): 0.0%us, 0.1%sy, 0.0%ni, 99.9%id, 0.1%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 49551592k total, 951012k used, 48600580k free, 9664k buffers
Swap: 15999992k total, 16556k used, 15983436k free, 101316k cached

PID  USER PR NI VIRT RES SHR S %CPU %MEM TIME+    COMMAND
3304 root 20 0  0    0   0   D 0.3  0.0  11:40.47 xfsaild/sda5

Is this fixed or not in 2.6.32-279.14.1?

Just to add a *me too* on 2.6.32-279.14.1: xfsaild goes into deadlock state. Unmounting the XFS filesystem causes it to return to "normal" and the load average to go back to zero, but a few hours after remounting, the issue is back. I can't tell if anything bad is happening though.

Tore and Daryl,

The load average issue has been reported and is known. It is in fact a side effect of this fix, but it is considered to be generally harmless. We do have bug 883905 filed to track and fix this in rhel going forward.

Daryl, you referred to xfsaild going into a "deadlocked" state. By that I assume you mean top shows xfsaild in D state (for IO wait), not that your filesystem has deadlocked. Is that an accurate assumption? In any event, if your issue is as described in bug 883905, I would appreciate it if we could take further comments over there. Thanks. :)

Hi, bug 883905 is private, so it is hard to say :) Yes, top shows the 'D' state as is indicated by comment 36. The filesystem appears to be a-ok.

(In reply to comment #39)
> Hi, bug 883905 is private, so it is hard to say :)

Oops! That should be fixed now.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHSA-2013-0496.html

*** Bug 970111 has been marked as a duplicate of this bug. ***