Bug 748991 - [RHEL6:CGROUPS] regression in res_counter_uncharge_locked
Summary: [RHEL6:CGROUPS] regression in res_counter_uncharge_locked
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: kernel
Version: 6.2
Hardware: All
OS: Linux
Priority: medium
Severity: high
Target Milestone: rc
Assignee: Johannes Weiner
QA Contact: Red Hat Kernel QE team
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2011-10-25 19:27 UTC by Travis Gummels
Modified: 2015-09-01 03:51 UTC
CC List: 7 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2011-11-04 09:21:21 UTC
Target Upstream Version:


Attachments
[patch] memcg: add memcg sanity checks at allocating and freeing pages (4.98 KB, patch)
2011-11-01 15:40 UTC, Johannes Weiner

Description Travis Gummels 2011-10-25 19:27:02 UTC
Description of problem:

LLNL has hit a possible cgroups regression: a WARN_ON fires in res_counter_uncharge_locked().

Version-Release number of selected component (if applicable):
RHEL 6 Update 2 Snapshot 2

How reproducible:
100%

Steps to Reproduce:

1. Under investigation.

Actual results:

As per LLNL:

2011-10-24 09:47:58 WARNING: at kernel/res_counter.c:72
res_counter_uncharge_locked+0x37/0x40() (Tainted: G        W  ---------------- 
)
2011-10-24 09:47:58 Hardware name: appro-512x
2011-10-24 09:47:58 Modules linked in: xt_owner nf_conntrack_ipv4
nf_defrag_ipv4 xt_state nf_conntrack lmv(U) mgc(U) lustre(U) lov(U) osc(U)
lquota(U) mdc(U) fid(U) fld(U) ptlrpc(U) obdclass(U) lvfs(U) ko2iblnd(U)
lnet(U) libcfs(U) acpi_cpufreq freq_table mperf ipt_LOG xt_multiport
iptable_filter ip_tables rds_rdma rds ib_ipoib rdma_ucm ib_ucm ib_uverbs
ib_umad rdma_cm ib_cm iw_cm ib_addr ib_sa dm_mirror dm_region_hash dm_log
dm_mod vhost_net macvtap macvlan tun kvm uinput sg sd_mod crc_t10dif microcode
isci libsas scsi_transport_sas ahci sb_edac edac_core iTCO_wdt
iTCO_vendor_support i2c_i801 i2c_core ioatdma ib_qib(U) ib_mad ib_core wmi ipv6
nfs lockd fscache nfs_acl auth_rpcgss sunrpc igb dca [last unloaded:
cpufreq_ondemand]
2011-10-24 09:47:59 Pid: 41013, comm: IOR Tainted: G        W  ----------------
  2.6.32-206.1chaos.ch5.x86_64 #1
2011-10-24 09:47:59 Call Trace:
2011-10-24 09:47:59  [<ffffffff81069af7>] ? warn_slowpath_common+0x87/0xc0
2011-10-24 09:47:59  [<ffffffff81069b4a>] ? warn_slowpath_null+0x1a/0x20
2011-10-24 09:47:59  [<ffffffff810c6287>] ?
res_counter_uncharge_locked+0x37/0x40
2011-10-24 09:47:59  [<ffffffff810c62d7>] ? res_counter_uncharge+0x47/0x70
2011-10-24 09:47:59  [<ffffffff8116ad65>] ?
__mem_cgroup_uncharge_common+0xd5/0x2c0
2011-10-24 09:47:59  [<ffffffff8116af9f>] ? mem_cgroup_uncharge_page+0x2f/0x40
2011-10-24 09:47:59  [<ffffffff8114498f>] ? page_remove_rmap+0x3f/0xa0
2011-10-24 09:47:59  [<ffffffff81171fa8>] ? zap_huge_pmd+0xa8/0x1c0
2011-10-24 09:47:59  [<ffffffff811396eb>] ? unmap_vmas+0x7cb/0xbf0
2011-10-24 09:47:59  [<ffffffff811411c1>] ? unmap_region+0x91/0x130
2011-10-24 09:47:59  [<ffffffff81141886>] ? do_munmap+0x2b6/0x3a0
2011-10-24 09:47:59  [<ffffffff811419c6>] ? sys_munmap+0x56/0x80
2011-10-24 09:47:59  [<ffffffff8100b0b2>] ? system_call_fastpath+0x16/0x1b
2011-10-24 09:47:59 ---[ end trace b3809c246ab31181 ]---

Eventually, stacks that involve lustre start getting printed.  My initial guess
is that lustre is a victim here, but it is certainly possible that it is doing
something wrong with page accounting...

2011-10-24 10:27:51 WARNING: at kernel/res_counter.c:72
res_counter_uncharge_locked+0x37/0x40() (Tainted: G        W  ---------------- 
)
2011-10-24 10:27:51 Hardware name: appro-512x
2011-10-24 10:27:51 Modules linked in: xt_owner nf_conntrack_ipv4
nf_defrag_ipv4 xt_state nf_conntrack lmv(U) mgc(U) lustre(U) lov(U) osc(U)
lquota(U) mdc
(U) fid(U) fld(U) ptlrpc(U) obdclass(U) lvfs(U) ko2iblnd(U) lnet(U) libcfs(U)
acpi_cpufreq freq_table mperf ipt_LOG xt_multiport iptable_filter ip_tables 
rds_rdma rds ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm
ib_addr ib_sa dm_mirror dm_region_hash dm_log dm_mod vhost_net macvtap macvlan
 tun kvm uinput sg sd_mod crc_t10dif microcode isci libsas scsi_transport_sas
ahci sb_edac edac_core iTCO_wdt iTCO_vendor_support i2c_i801 i2c_core ioatdm
a ib_qib(U) ib_mad ib_core wmi ipv6 nfs lockd fscache nfs_acl auth_rpcgss
sunrpc igb dca [last unloaded: cpufreq_ondemand]
2011-10-24 10:27:51 Pid: 41011, comm: IOR Tainted: G        W  ----------------
  2.6.32-206.1chaos.ch5.x86_64 #1
2011-10-24 10:27:51 Call Trace:
2011-10-24 10:27:51  [<ffffffff81069af7>] ? warn_slowpath_common+0x87/0xc0
2011-10-24 10:27:51  [<ffffffff81069b4a>] ? warn_slowpath_null+0x1a/0x20
2011-10-24 10:27:51  [<ffffffff810c6287>] ?
res_counter_uncharge_locked+0x37/0x40
2011-10-24 10:27:51  [<ffffffff810c62d7>] ? res_counter_uncharge+0x47/0x70
2011-10-24 10:27:51  [<ffffffff81167f75>] ?
__mem_cgroup_cancel_charge+0x35/0x70
2011-10-24 10:27:51  [<ffffffff81167fdb>] ? mem_cgroup_cancel_charge+0x2b/0x80
2011-10-24 10:27:51  [<ffffffff8116a6e2>] ?
__mem_cgroup_commit_charge+0x112/0x120
2011-10-24 10:27:51  [<ffffffff8116bea5>] ? mem_cgroup_charge_common+0xa5/0xd0
2011-10-24 10:27:51  [<ffffffff8116bff8>] ? mem_cgroup_cache_charge+0x128/0x140
2011-10-24 10:27:51  [<ffffffffa085787e>] ? osc_page_is_under_lock+0x7e/0x1c0
[osc]
2011-10-24 10:27:51  [<ffffffff8111120a>] ? add_to_page_cache_locked+0x4a/0x140
2011-10-24 10:27:51  [<ffffffff8111132c>] ? add_to_page_cache_lru+0x2c/0x80
2011-10-24 10:27:51  [<ffffffff81110c7e>] ? find_get_page+0x1e/0xa0
2011-10-24 10:27:51  [<ffffffff8111254b>] ? grab_cache_page_nowait+0x8b/0xc0
2011-10-24 10:27:51  [<ffffffffa0912b2b>] ? ll_readahead+0x95b/0x1290 [lustre]
2011-10-24 10:27:51  [<ffffffffa0936e3a>] ? ccc_page_is_under_lock+0x1aa/0x200
[lustre]
2011-10-24 10:27:51  [<ffffffffa0910f78>] ? ras_update+0x58/0xe30 [lustre]
2011-10-24 10:27:51  [<ffffffffa08ae87e>] ? lov_page_stripe+0x3e/0x150 [lov]
2011-10-24 10:27:52  [<ffffffffa093cdd5>] ? vvp_io_read_page+0x385/0x3c0
[lustre]
2011-10-24 10:27:52  [<ffffffffa060d475>] ? cl_io_read_page+0x95/0x1a0
[obdclass]
2011-10-24 10:27:52  [<ffffffffa0601ac9>] ? cl_page_assume+0xe9/0x250
[obdclass]
2011-10-24 10:27:52  [<ffffffffa09139d8>] ? ll_readpage+0x98/0x1f0 [lustre]
2011-10-24 10:27:52  [<ffffffff81110dd7>] ? __lock_page_killable+0x67/0x70
2011-10-24 10:27:52  [<ffffffff81090b90>] ? wake_bit_function+0x0/0x50
2011-10-24 10:27:52  [<ffffffff8111277c>] ? generic_file_aio_read+0x1fc/0x700
2011-10-24 10:27:52  [<ffffffffa0935892>] ? ccc_object_size_unlock+0x42/0x50
[lustre]
2011-10-24 10:27:52  [<ffffffffa093d3eb>] ? vvp_io_read_start+0x13b/0x3d0
[lustre]
2011-10-24 10:27:52  [<ffffffffa0606985>] ? cl_wait+0xb5/0x280 [obdclass]
2011-10-24 10:27:52  [<ffffffffa060ab88>] ? cl_io_start+0x68/0x170 [obdclass]
2011-10-24 10:27:52  [<ffffffffa060f700>] ? cl_io_loop+0x110/0x1c0 [obdclass]
2011-10-24 10:27:52  [<ffffffffa04dc3a2>] ? cfs_hash_lookup+0x82/0xa0 [libcfs]
2011-10-24 10:27:52  [<ffffffffa08e4947>] ? ll_file_io_generic+0x3c7/0x580
[lustre]
2011-10-24 10:27:52  [<ffffffffa04da494>] ? cfs_hash_dual_bd_unlock+0x34/0x60
[libcfs]
2011-10-24 10:27:52  [<ffffffffa05fe759>] ? cl_env_get+0x29/0x350 [obdclass]
2011-10-24 10:27:52  [<ffffffffa08eb5ac>] ? ll_file_aio_read+0x13c/0x310
[lustre]
2011-10-24 10:27:52  [<ffffffffa05fe8ce>] ? cl_env_get+0x19e/0x350 [obdclass]
2011-10-24 10:27:52  [<ffffffffa05fcf9f>] ? cl_env_put+0x1af/0x2e0 [obdclass]
2011-10-24 10:27:52  [<ffffffffa08eb8f1>] ? ll_file_read+0x171/0x310 [lustre]
2011-10-24 10:27:52  [<ffffffff8100ba0e>] ? common_interrupt+0xe/0x13
2011-10-24 10:27:52  [<ffffffff81177cc5>] ? vfs_read+0xb5/0x1a0
2011-10-24 10:27:52  [<ffffffff81177e01>] ? sys_read+0x51/0x90
2011-10-24 10:27:52  [<ffffffff8100b0b2>] ? system_call_fastpath+0x16/0x1b
2011-10-24 10:27:52 ---[ end trace b3809c246ab32bc9 ]---


Expected results:


Additional info:

There is a big concern that this is a major regression and that RH needs to start testing this bit of code by exercising the creation and deletion of memcgs.

Comment 1 Rik van Riel 2011-10-25 21:53:32 UTC
Can you reproduce this bug without lustre?

If the backtrace with zap_huge_pmd is always the first one you get, it is indicative of the problem being elsewhere. 

That code path notices that the number of pages being subtracted from the cgroup is more than the number of pages currently accounted to it, which happens if somebody (else) left the counter in an inconsistent state earlier.

Comment 2 Larry Woodman 2011-10-26 02:47:55 UTC
Looks like more uncharges took place than charges in this case:


void res_counter_uncharge_locked(struct res_counter *counter, unsigned long val)
{
        if (WARN_ON(counter->usage < val))
                val = counter->usage;

        counter->usage -= val;
}
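
For reference, the uncharge path in the traces above reaches this function through res_counter_uncharge(), which in kernels of this vintage walks up the counter hierarchy with each counter's lock held. Roughly (paraphrased from the upstream 2.6.32-era source, not checked against the exact -206 build):

void res_counter_uncharge(struct res_counter *counter, unsigned long val)
{
        unsigned long flags;
        struct res_counter *c;

        local_irq_save(flags);
        /* apply the uncharge to this counter and every parent */
        for (c = counter; c != NULL; c = c->parent) {
                spin_lock(&c->lock);
                res_counter_uncharge_locked(c, val);
                spin_unlock(&c->lock);
        }
        local_irq_restore(flags);
}

So the uncharge is applied at every level of the res_counter hierarchy, and an imbalance at any level can trip the WARN_ON above.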

Or perhaps not all pages allocated to the cgroup are properly charged when they are allocated, yet they are properly uncharged when they are freed?

Larry

Comment 3 Ben Woodard 2011-10-26 17:52:36 UTC
Some of the things that we discovered as of today:

- we didn't see the problem with 131.12.1; the problem first appeared when they moved to the 206 kernel. They didn't do enough testing on the kernels between 131.12.1 and 206 to see if the problem occurs there. We are in the process of trying to test using NFS on the 202 kernel. According to Linda, it may have something to do with the patches that went into 182.

- it doesn't seem to occur when we have THP turned off. A test ran all night in that configuration and the problem didn't happen.

Comment 4 Larry Woodman 2011-10-26 18:12:37 UTC
Ben, how long does it take to reproduce this problem?  Also, let's test the very latest (-214) kernel just to be sure the problem hasn't been fixed since -206.  Quite a few changes have gone in between -206 and -214.

Larry

Comment 5 Ben Woodard 2011-10-26 18:49:18 UTC
The program that is triggering it right now is IOR (http://sourceforge.net/projects/ior-sio/). We are trying to come up with a better reproducer but have yet to find the magic combination. It takes extensive testing: the failure frequency is about 1-2%, and it takes a while to manifest (more than seconds but less than hours) on a 16-core Sandy Bridge.

We have yet to be able to reproduce the problem using NFS. At the moment, this doesn't tell us much with any certainty. Our NFS infrastructure is much less capable than our Lustre infrastructure, so testing is going much more slowly and makes different demands on the VM subsystem. For example, with NFS runs the page cache seems to stay roughly constant, but with Lustre the page cache grows considerably during the IOR run.

Comment 6 Ben Woodard 2011-10-26 23:53:14 UTC
- The problem does reproduce on the 202 kernel.

- As I said before, the problem only manifests with THP enabled.

- It took about 30 minutes to hit.

- So far we have only been able to reproduce it with lustre but not with NFS.
There is no code in lustre that touches any part of cgroups, or memcgs in particular. We think the difference is in the comparative capacity of the lustre file system vs. the NFS file systems. It is immediately obvious that lustre runs much faster, which puts more demands on the system's page cache. Overnight we are going to do a run with a fairly small subset of nodes against the NFS server. Hopefully that will avoid overloading the NFS server, so that it performs well enough to put sufficient demand on the VM to trigger the problem.

- One of the nodes that hit the problem was running IOR on just that one node at the time. The failing test scales up from 1 to 256 processes, all within one memcg, on a 16-processor Sandy Bridge machine. We don't know how many threads were actually running at that time.
There should only have been one other memcg on the node at the time.

Comment 7 Ben Woodard 2011-10-27 00:45:17 UTC
We haven't proven this yet, but we think it is the root cgroup (rootcg) that is being uncharged and causing the warning. Tomorrow we will continue testing this. To the best of our knowledge, root_mem_cgroup->res usage should always be 0. What we are seeing, though, is:
counter->usage = 0, val = 4096
and
counter->usage = 0, val = 2097152
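
(For reference, those values line up with the THP correlation noted in comment 3: 4096 bytes is a single base page, and 2097152 bytes = 512 x 4096 = 2 MiB, the size of a transparent huge page on x86_64.)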

Comment 8 Larry Woodman 2011-10-27 03:28:57 UTC
Ben, if we use the "caveman approach" in debugging this problem and do a binary search, we should be able to find which kernel version caused this problem in 6 or 7 tries.  At 30 minutes to hit the problem, that's only about 3 or 4 hours of runtime.  Since the problem occurs in -202 but not -131, I'd start with -165 and do a binary search.  The faster we find the exact cause of this problem, the more likely the fix will make 6.2.
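
(For reference: there are on the order of 75 candidate build numbers between -131 and -206, and ceil(log2(75)) = 7, which is where the 6-or-7-tries estimate comes from.)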

Larry

Comment 9 Ben Woodard 2011-10-27 19:20:07 UTC
<neb> is there any way to automate the lustre module build so that it isn't so difficult for you to make different kernels.
<grondo> it is automated
<neb> the guys working on it understand it is a pain but they really would love a bisection between 202 and 131.12.1 that you were running.
<grondo> the problem is that the lustre package has a BuildRequires of kernel-devel
<neb> so you need kernel devel for all the other kernels.
<neb> I can get you that trivially.
<grondo> ok, there are actually multiple problems with that
<grondo> 1st, lustre requires its own kernel patches
<grondo> so I'd have to apply the lustre patches to every git bisect result
<grondo> and likely they would not apply cleanly, so that wouldn't be automatic
<grondo> 2nd, our automated build system uses mock, and thus when pulling in kernel-devel, it always just grabs what yum thinks is the "latest" rpm
<grondo> First I'd like to see if I can reproduce the problem on a standalone node
<grondo> then I can see if I can reproduce against just NFS.
<grondo> If so, then that solves the lustre problem, and likely we can just use straight RHEL kernels

Comment 10 Larry Woodman 2011-10-27 20:01:18 UTC
Every iteration divides the problem in half, so just a couple/few tests really zero in on the problem.

Larry

Comment 11 Ben Woodard 2011-10-27 23:43:16 UTC
<grondo> ok, hit the issue on 165
<grondo> I'd better retest 131

Comment 12 Ben Woodard 2011-10-28 14:10:03 UTC
OK, it is starting to look like this isn't a regression but possibly a problem with our previous testing methodology. When we retested with 131.12.1 and let it run overnight, we were able to reproduce the problem.

We currently think it could be one of three different things:
1) a difference between unit testing and integration testing
 - for example, changes in the timing and depth of the cgroups as we built up our software stack. The job launcher's support for cgroups was being coded, evolving, and tested as we were gearing up for this release. It could be that our use of cgroups changed over that period of time enough that our previous testing was not directly applicable.
 - Or maybe we tested lustre independently of our job launcher's support for cgroups
2) changes in the hardware making the race condition that triggers this problem more likely. Much of our testing was on the previous generation of processors, and Sandy Bridge machines are just now becoming commonly available. Consequently, the increase in core count may have made it more likely for us to see this.
3) the scale of the test clusters. We are currently rolling out a very large Sandy Bridge cluster, and cluster-wide testing makes these hard-to-hit problems happen more frequently.

We are going to continue brainstorming about the possible cause of this failure in testing and also start looking into doing a bisection to isolate the patch that is actually at fault. Updates throughout the day.

Comment 13 Larry Woodman 2011-10-28 14:40:13 UTC
Either way, "WARN_ON(counter->usage < val)" should never fire.  The only way that could happen is if some leak occurred or there was some corruption of the counter->usage field.

Larry

Comment 14 Johannes Weiner 2011-11-01 15:40:54 UTC
Created attachment 531156 [details]
[patch] memcg: add memcg sanity checks at allocating and freeing pages

If a page is freed without being uncharged and reused as an anonymous page, it will get uncharged when it is unmapped again.  Although you wouldn't be able to remove the cgroup until the page is finally uncharged, at least the charge-uncharge is always balanced eventually.

However, if the page is reused as a transparent huge page, the uncharge is bigger than the charge and this will eventually manifest in the warning you see.

Please try reproducing with the debug kernel build (CONFIG_DEBUG_VM enabled) and the attached patch applied.  This should catch a missed uncharge when the page is freed - before it is reused - and thus help pin down the faulty party.
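
The attached patch is the authoritative change; purely as an illustration of the idea (not the actual patch), a check of the following shape in the page-free path would flag a page that still carries a live memcg charge when it goes back to the allocator:

/*
 * Illustrative sketch only, not the attached patch: verify at free time
 * that the page no longer carries a memcg charge.  A page freed while
 * still charged is the scenario described above.  Uses the page_cgroup
 * helpers (linux/page_cgroup.h) available in kernels of this era.
 */
static void memcg_check_freed_page(struct page *page)
{
        struct page_cgroup *pc = lookup_page_cgroup(page);

        if (pc)
                VM_BUG_ON(PageCgroupUsed(pc));
}

For actual testing, please use the attached patch rather than a sketch like this.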

Comment 15 Ben Woodard 2011-11-01 16:51:10 UTC
Johannes,

Thank you. This is exactly the kind of patch we needed. We will begin testing with it immediately.

Comment 16 Ben Woodard 2011-11-01 21:08:18 UTC
We found one bug that matched your description.

<grondo> the issue Brian found is that truncate_complete_page is not exported from kernel
<grondo> so Lustre has to use its own internal version
<grondo> which is wrong (from an older kernel)
<grondo> so does not call remove_from_page_cache
<grondo> Trying a kernel with EXPORT_SYMBOL(truncate_complete_page)
<grondo> which will allow lustre to use the kernel version

Testing with a fixed version now and we'll see if we uncover any more problems.
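
For context on why that matters: in kernels of this era the memcg uncharge for a page-cache page is done in remove_from_page_cache(), so a private truncate path that never calls it would free the page while it is still charged - exactly the pattern described in comment 14.  Roughly (paraphrased, not the exact RHEL6 source):

void remove_from_page_cache(struct page *page)
{
        struct address_space *mapping = page->mapping;

        BUG_ON(!PageLocked(page));

        spin_lock_irq(&mapping->tree_lock);
        __remove_from_page_cache(page);         /* drops radix-tree entry, stats */
        spin_unlock_irq(&mapping->tree_lock);
        mem_cgroup_uncharge_cache_page(page);   /* the uncharge a stale copy skips */
}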

Comment 18 Ben Woodard 2011-11-02 14:22:55 UTC
(In reply to comment #14) 
> If a page is freed without being uncharged and reused as an anonymous page, it
> will get uncharged when it is unmapped again.  Although you wouldn't be able to
> remove the cgroup until the page is finally uncharged, at least the
> charge-uncharge is always balanced eventually.
> 
> However, if the page is reused as a transparent huge page, the uncharge is
> bigger than the charge and this will eventually manifest in the warning you
> see.

In the end it was this description, and how it interacted with THP, that allowed us to find the bug by inspection before we could even get a kernel built with the patch.

LLNL is very grateful for the assistance even though the problem ended up being in their out-of-tree code.

Comment 19 Johannes Weiner 2011-11-04 09:21:21 UTC
That is good to hear.  I will push this debugging patch into RHEL as we have it upstream and it has proved useful yet another time.

