Summary: [RHEL6:CGROUPS] regression in res_counter_uncharge_locked
Product: Red Hat Enterprise Linux 6
Component: kernel
Version: 6.2
Status: CLOSED NOTABUG
Severity: high
Priority: medium
Reporter: Travis Gummels <tgummels>
Assignee: Johannes Weiner <jweiner>
QA Contact: Red Hat Kernel QE team <kernel-qe>
CC: aarcange, ddumas, leiwang, lwang, lwoodman, riel, woodard
Target Milestone: rc
Hardware: All
OS: Linux
Doc Type: Bug Fix
Last Closed: 2011-11-04 09:21:21 UTC
Attachments: [patch] memcg: add memcg sanity checks at allocating and freeing pages (attachment 531156)
Description: Travis Gummels, 2011-10-25 19:27:02 UTC
Can you reproduce this bug without Lustre? If the backtrace with zap_huge_pmd is always the first one you get, it is indicative of the problem being elsewhere. That code path notices that the number of pages being subtracted from the cgroup is greater than the number of pages currently accounted to it, which happens if somebody (else) left the counter in an inconsistent state earlier.
Looks like more uncharges took place than charges in this case:
void res_counter_uncharge_locked(struct res_counter *counter, unsigned long val)
{
	if (WARN_ON(counter->usage < val))
		val = counter->usage;

	counter->usage -= val;
}
Or perhaps some pages allocated to the cgroup are never charged when they are allocated, yet they are uncharged when they are freed?
Larry
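To make the failure mode concrete, here is a minimal standalone C sketch (an illustration, not kernel code or code from this report) of the check quoted above: an uncharge larger than the tracked usage trips the warning, e.g. usage=0 with val=4096 as reported later in this bug.

#include <stdio.h>

/* Simplified stand-in for the kernel's res_counter; only usage is modeled. */
struct res_counter {
	unsigned long usage;
};

/* Mirrors the logic of res_counter_uncharge_locked() quoted above. */
static void uncharge_locked(struct res_counter *counter, unsigned long val)
{
	if (counter->usage < val) {
		/* This is where the kernel's WARN_ON fires. */
		printf("WARNING: counter->usage = %lu, val = %lu\n",
		       counter->usage, val);
		val = counter->usage;
	}
	counter->usage -= val;
}

int main(void)
{
	struct res_counter memcg = { .usage = 0 };

	/* Uncharging one 4 KiB page that was never charged (or whose
	 * charge was already dropped) reproduces the reported warning. */
	uncharge_locked(&memcg, 4096);
	printf("usage after uncharge: %lu\n", memcg.usage);
	return 0;
}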
Some of the things that we discovered as of today:
- We didn't see the problem with 131.12.1; the problem first appeared when they moved to the 206 kernel. They didn't do enough testing on the kernels between 131.12.1 and 206 to see if the problem occurs there. We are in the process of trying to test using NFS on the 202 kernel. According to Linda it may have something to do with the patches that went into 182.
- It doesn't seem to occur when we have THP turned off. It ran all night and didn't happen.

Ben, how long does it take to reproduce this problem? Also, let's test the very latest (-214) kernel just to be sure the problem hasn't been fixed since -206. Quite a few changes have gone in between -206 and -214.
Larry

The program that is triggering it right now is IOR: http://sourceforge.net/projects/ior-sio/ We are trying to come up with a better reproducer but have yet to find the magic combination. It takes extensive testing; the frequency is about 1-2% and it takes a while to manifest (more than seconds but less than hours) on a 16-core Sandy Bridge. We have yet to be able to reproduce the problem using NFS. At the moment, this doesn't tell us much with any certainty. Our NFS infrastructure is much less capable than our Lustre infrastructure, so testing is going much more slowly and therefore makes different demands on the VM subsystem. For example, with NFS runs the page cache seems to stay roughly constant, but with Lustre the size of the page cache grows considerably during the IOR run.

- The problem does reproduce on the 202 kernel.
- As I said before, the problem only manifests with THP enabled.
- It took about 30 minutes to hit.
- So far we have only been able to reproduce it with Lustre but not with NFS. There is no code in Lustre that touches any part of cgroups, or memcgs in particular. We think that the difference is the comparative capacity of the Lustre file system vs. the NFS file systems. It is immediately obvious that Lustre runs much faster, and that puts more demands on the system's page cache. Overnight we are going to do a run with a fairly small subset of nodes against the NFS server; by not overloading the NFS server it should perform well enough to put sufficient demand on the VM to trigger the problem.
- One of the nodes that hit the problem was running IOR on just 1 node at the time. The test that is failing scales up from 1-256 procs, all within one memcg, running on a 16-processor Sandy Bridge machine. We don't know how many threads were actually running at that time. There should only be one other memcg on the node at the time.

We haven't proven this yet. We think that it is the rootcg which is being uncharged and causing the warning. Tomorrow we will continue testing this. To the best of our knowledge, root_mem_cgroup->res should always == 0. What we are seeing, though, is: counter->usage = 0, val=4096 and counter->usage = 0, val=2097152.

Ben, if we use the "caveman approach" in debugging this problem and do a binary search, we should be able to find which kernel version caused this problem in 6 or 7 tries (each test halves the range of candidate builds between -131 and -206, so roughly log2 of that range). At 30 minutes to hit the problem, that's only about 3 or 4 hours of runtime. Since the problem occurs in -202 but not -131, I'd start with -165 and do a binary search. The faster we find the exact cause of this problem, the more likely the fix will make 6.2.
Larry

<neb> is there any way to automate the lustre module build so that it isn't so difficult for you to make different kernels.
<grondo> it is automated
<neb> the guys working on it understand it is a pain but they really would love a bisection between 202 and 131.12.1 that you were running.
<grondo> the problem is that the lustre package has a BuildRequires of kernel-devel
<neb> so you need kernel-devel for all the other kernels.
<neb> I can get you that trivially.
<grondo> ok, there are actually multiple problems with that
<grondo> 1st, lustre requires its own kernel patches
<grondo> so I'd have to apply the lustre patches to every git bisect result
<grondo> and likely they would not apply cleanly, so that wouldn't be automatic
<grondo> 2nd, our automated build system uses mock, and thus when pulling in kernel-devel, it always just grabs what yum thinks is the "latest" rpm
<grondo> First I'd like to see if I can reproduce the problem on a standalone node
<grondo> then I can see if I can reproduce against just NFS.
<grondo> If so, then that solves the lustre problem, and likely we can just use straight RHEL kernels

Every iteration divides the problem in half, so just a few tests really zero in on the problem.
Larry

<grondo> ok, hit the issue on 165
<grondo> I'd better retest 131

OK, it is starting to look like it isn't a regression but possibly a problem with our previous testing methodologies. When we retested with 131.12.1 and let it run overnight, we were able to reproduce the problem. We currently think that it could be one of three different things:
1) A difference between unit testing and integration testing - for example, changes in the timing and depth of the cgroups as we built up our software stack. The job launcher's support of cgroups was being coded, evolving, and tested as we were gearing up for this release. It could be that our use of cgroups changed over that period of time sufficiently that our previous testing was not directly applicable. Or maybe we tested Lustre independently of our job launcher's support of cgroups.
2) Changes in the hardware making the race conditions that trigger this problem more likely. Much of our testing was on the previous generation of processors, and Sandy Bridges are just now becoming commonly available. Consequently, the increase in core count may have made it more likely for us to see this.
3) The scale of the test clusters. We are currently rolling out a very large Sandy Bridge cluster, and cluster-wide testing makes these hard-to-hit problems happen more frequently.
We are going to continue brainstorming about the possible cause of this failure in testing and also start looking into doing a bisection to isolate the patch that is actually at fault. Updates throughout the day.

Either way, "WARN_ON(counter->usage < val)" should never fire. The only way that could happen is if some leak occurred or there was some corruption within the counter->usage field.
Larry

Created attachment 531156 [details]
[patch] memcg: add memcg sanity checks at allocating and freeing pages
If a page is freed without being uncharged and reused as an anonymous page, it will get uncharged when it is unmapped again. Although you wouldn't be able to remove the cgroup until the page is finally uncharged, at least the charge-uncharge is always balanced eventually.
However, if the page is reused as a transparent huge page, the uncharge is bigger than the charge and this will eventually manifest in the warning you see.
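A schematic illustration of that size mismatch (an editorial sketch, not code from the report): a base page carries a 4 KiB charge, while a transparent huge page is uncharged as a whole 2 MiB, which matches the val=2097152 in the reported warnings.

#include <stdio.h>

#define PAGE_SIZE  4096UL
#define HPAGE_SIZE (512UL * PAGE_SIZE)	/* 2 MiB transparent huge page */

int main(void)
{
	unsigned long charged = 0, uncharged = 0;

	/* A page-cache page is charged to the memcg when it is added... */
	charged += PAGE_SIZE;

	/* ...but is freed by a path that skips the uncharge, so the stale
	 * 4 KiB charge sticks to the page frame.
	 *
	 * If the frame is reused as a regular anonymous page, it is
	 * eventually uncharged by PAGE_SIZE and the books balance.
	 * If it is reused inside a transparent huge page, the eventual
	 * uncharge covers the whole 2 MiB: */
	uncharged += HPAGE_SIZE;

	printf("charged   = %lu bytes\n", charged);
	printf("uncharged = %lu bytes\n", uncharged);
	printf("imbalance = %lu bytes -> res_counter underflow, WARN_ON fires\n",
	       uncharged - charged);
	return 0;
}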
Please try reproducing with the debug kernel build (CONFIG_DEBUG_VM enabled) and the attached patch applied. This should catch a missed uncharge upon freeing the page - before it is reused - and thus help pin down the faulty party.
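The general idea behind such a check, sketched very roughly below (an editorial guess at the approach, not the contents of attachment 531156): verify at free time that the page no longer carries a memcg charge, so the caller that forgot to uncharge is caught at the point of the leak rather than much later when the frame is reused.

#include <stdio.h>

/* Toy model of a page frame: 'charged' stands in for the per-page memcg
 * state that a real sanity check would inspect in the kernel. */
struct page {
	int charged;
};

/* Sanity check at free time: a page handed back to the allocator must not
 * carry a leftover charge. In the kernel this would be a warning or a
 * bad-page report pointing at the buggy free path. */
static int free_page_checked(struct page *page)
{
	if (page->charged) {
		fprintf(stderr, "BUG: freeing page that is still charged\n");
		return -1;
	}
	return 0;
}

int main(void)
{
	struct page p = { .charged = 1 };	/* charged when added to the page cache */

	/* A correct truncate path would uncharge first: p.charged = 0;
	 * The buggy path frees the page while it is still charged, and the
	 * check fires here instead of at some later, unrelated uncharge. */
	return free_page_checked(&p) ? 1 : 0;
}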
Johannes, thank you. This is exactly the kind of patch we needed. We will begin testing with it immediately.

We found one bug that matched your description.
<grondo> the issue Brian found is that truncate_complete_page is not exported from the kernel
<grondo> so Lustre has to use its own internal version
<grondo> which is wrong (from an older kernel)
<grondo> so does not call remove_from_page_cache
<grondo> Trying a kernel with EXPORT_SYMBOL(truncate_complete_page)
<grondo> which will allow lustre to use the kernel version
Testing with a fixed version now and we'll see if we uncover any more problems.

(In reply to comment #14)
> If a page is freed without being uncharged and reused as an anonymous page, it
> will get uncharged when it is unmapped again. Although you wouldn't be able to
> remove the cgroup until the page is finally uncharged, at least the
> charge-uncharge is always balanced eventually.
>
> However, if the page is reused as a transparent huge page, the uncharge is
> bigger than the charge and this will eventually manifest in the warning you
> see.
In the end it was this description, and how it interacted with THP, that allowed us to find the bug by inspection before we could even get a kernel built with the patch. LLNL is very grateful for the assistance, even though the problem ended up being in their out-of-tree code.

That is good to hear. I will push this debugging patch into RHEL as we have it upstream and it has proved useful yet another time.