Bug 1795049

Summary: RHEL8: Latency issue on Kubernetes / k8s / OpenShift
Product: Red Hat Enterprise Linux 8 Reporter: Jesper Brouer <jbrouer>
Component: kernelAssignee: Waiman Long <llong>
kernel sub component: Control Groups QA Contact: Chao Ye <cye>
Status: CLOSED ERRATA Docs Contact: Jaroslav Klech <jklech>
Severity: urgent    
Priority: unspecified CC: akamra, arozansk, daolivei, dshaks, jbrouer, jklech, jmaxwell, juri.lelli, lgoncalv, llong, longman, mtosatti, perobins, rkrawitz, smeisner, williams
Version: 8.2Flags: pm-rhel: mirror+
Target Milestone: rc   
Target Release: 8.2   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: kernel-4.18.0-176.el8 Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
: 1796920 (view as bug list) Environment:
Last Closed: 2020-04-28 16:38:01 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1742931, 1796920    

Description Jesper Brouer 2020-01-26 17:42:52 UTC
Description of problem:

GitHub engineers have written a blogpost[1], that describe network stalls (of 100 ms) on Kubernetes. I have looked at the details, and I conclude that RHEL8 and OpenShift is vulnerable to this issue.

 [1] https://github.blog/2019-11-21-debugging-network-stalls-on-kubernetes/

This bugzilla is a request to backport upstream kernel commits that fixes this issue. The fixes are related to kernel mem cgroup (mm/memcontrol.c) accounting.

The blogpost is long, so a summary:
 The gist is: Network latency can jump to 100ms in a k8s cluster. This is cause by kernel doing too much work (in memcg_stat_show) that stalls softirq, which network depend on.  In their case it was cAdvisor that was triggering this by reading /sys/fs/cgroup/memory/memory.stat.

Comment 1 Jesper Brouer 2020-01-26 18:08:09 UTC
This BZ is focused on backporting the specific fixes.

BUT the problem is more general, and it can (likely will) happen in other situations in the future.
The root-cause is that our RHEL8 kernel isn't preemptable, which allowed this long starvation.
(Cc. some people from Real Time team)

GitHub's kernel is compiled with CONFIG_PREEMPT_VOLUNTARY=y like RHEL8.
Thus, an alternative fix is to compile RHEL8 with CONFIG_PREEMPT=y.

Comment 2 Jesper Brouer 2020-01-26 18:13:45 UTC
Easy way to test/spot if system is having this issue:

 $ time cat /sys/fs/cgroup/memory/memory.stat

Reports[2] says it takes 0.376s on the affected machine, and 0.002s on an unaffected machine.

[2] https://github.com/google/cadvisor/issues/1774#issuecomment-401910711

Comment 4 Jesper Brouer 2020-01-26 19:51:49 UTC
I believe these 4 patches from Johannes fixes the issue.
Kept the full patch desc as it describe the problem precisely.

commit 205b20cc5a99cdf197c32f4dbee2b09c699477f0
Author: Johannes Weiner <hannes>
Date:   Tue May 14 15:47:06 2019 -0700

    mm: memcontrol: make cgroup stats and events query API explicitly local
    
    Patch series "mm: memcontrol: memory.stat cost & correctness".
    
    The cgroup memory.stat file holds recursive statistics for the entire
    subtree.  The current implementation does this tree walk on-demand
    whenever the file is read.  This is giving us problems in production.
    
    1. The cost of aggregating the statistics on-demand is high.  A lot of
       system service cgroups are mostly idle and their stats don't change
       between reads, yet we always have to check them.  There are also always
       some lazily-dying cgroups sitting around that are pinned by a handful
       of remaining page cache; the same applies to them.
    
       In an application that periodically monitors memory.stat in our
       fleet, we have seen the aggregation consume up to 5% CPU time.
    
    2. When cgroups die and disappear from the cgroup tree, so do their
       accumulated vm events.  The result is that the event counters at
       higher-level cgroups can go backwards and confuse some of our
       automation, let alone people looking at the graphs over time.
    
    To address both issues, this patch series changes the stat
    implementation to spill counts upwards when the counters change.
    
    The upward spilling is batched using the existing per-cpu cache.  In a
    sparse file stress test with 5 level cgroup nesting, the additional cost
    of the flushing was negligible (a little under 1% of CPU at 100% CPU
    utilization, compared to the 5% of reading memory.stat during regular
    operation).
    
    This patch (of 4):
    
    memcg_page_state(), lruvec_page_state(), memcg_sum_events() are
    currently returning the state of the local memcg or lruvec, not the
    recursive state.
    
    In practice there is a demand for both versions, although the callers
    that want the recursive counts currently sum them up by hand.
    
    Per default, cgroups are considered recursive entities and generally we
    expect more users of the recursive counters, with the local counts being
    special cases.  To reflect that in the name, add a _local suffix to the
    current implementations.
    
    The following patch will re-incarnate these functions with recursive
    semantics, but with an O(1) implementation.
    
    [hannes: fix bisection hole]
      Link: http://lkml.kernel.org/r/20190417160347.GC23013@cmpxchg.org
    Link: http://lkml.kernel.org/r/20190412151507.2769-2-hannes@cmpxchg.org
    Signed-off-by: Johannes Weiner <hannes>
    Reviewed-by: Shakeel Butt <shakeelb>
    Reviewed-by: Roman Gushchin <guro>
    Cc: Michal Hocko <mhocko>
    Signed-off-by: Andrew Morton <akpm>
    Signed-off-by: Linus Torvalds <torvalds>

commit db9adbcbe740e0986b575dd56aad834ce9e9b5d3
Author: Johannes Weiner <hannes>
Date:   Tue May 14 15:47:09 2019 -0700

    mm: memcontrol: move stat/event counting functions out-of-line
    
    These are getting too big to be inlined in every callsite.  They were
    stolen from vmstat.c, which already out-of-lines them, and they have
    only been growing since.  The callsites aren't that hot, either.
    
    Move __mod_memcg_state()
         __mod_lruvec_state() and
         __count_memcg_events() out of line and add kerneldoc comments.
    
    Link: http://lkml.kernel.org/r/20190412151507.2769-3-hannes@cmpxchg.org
    Signed-off-by: Johannes Weiner <hannes>
    Reviewed-by: Shakeel Butt <shakeelb>
    Reviewed-by: Roman Gushchin <guro>
    Cc: Michal Hocko <mhocko>
    Signed-off-by: Andrew Morton <akpm>
    Signed-off-by: Linus Torvalds <torvalds>

commit 42a300353577ccc17ecc627b8570a89fa1678bec
Author: Johannes Weiner <hannes>
Date:   Tue May 14 15:47:12 2019 -0700

    mm: memcontrol: fix recursive statistics correctness & scalabilty
    
    Right now, when somebody needs to know the recursive memory statistics
    and events of a cgroup subtree, they need to walk the entire subtree and
    sum up the counters manually.
    
    There are two issues with this:
    
    1. When a cgroup gets deleted, its stats are lost. The state counters
       should all be 0 at that point, of course, but the events are not.
       When this happens, the event counters, which are supposed to be
       monotonic, can go backwards in the parent cgroups.
    
    2. During regular operation, we always have a certain number of lazily
       freed cgroups sitting around that have been deleted, have no tasks,
       but have a few cache pages remaining. These groups' statistics do not
       change until we eventually hit memory pressure, but somebody
       watching, say, memory.stat on an ancestor has to iterate those every
       time.
    
    This patch addresses both issues by introducing recursive counters at
    each level that are propagated from the write side when stats change.
    
    Upward propagation happens when the per-cpu caches spill over into the
    local atomic counter.  This is the same thing we do during charge and
    uncharge, except that the latter uses atomic RMWs, which are more
    expensive; stat changes happen at around the same rate.  In a sparse
    file test (page faults and reclaim at maximum CPU speed) with 5 cgroup
    nesting levels, perf shows __mod_memcg_page state at ~1%.
    
    Link: http://lkml.kernel.org/r/20190412151507.2769-4-hannes@cmpxchg.org
    Signed-off-by: Johannes Weiner <hannes>
    Reviewed-by: Shakeel Butt <shakeelb>
    Reviewed-by: Roman Gushchin <guro>
    Cc: Michal Hocko <mhocko>
    Signed-off-by: Andrew Morton <akpm>
    Signed-off-by: Linus Torvalds <torvalds>

commit def0fdae813dbbbbb588bfc5f52856be2e842b35
Author: Johannes Weiner <hannes>
Date:   Tue May 14 15:47:15 2019 -0700

    mm: memcontrol: fix NUMA round-robin reclaim at intermediate level
    
    When a cgroup is reclaimed on behalf of a configured limit, reclaim
    needs to round-robin through all NUMA nodes that hold pages of the memcg
    in question.  However, when assembling the mask of candidate NUMA nodes,
    the code only consults the *local* cgroup LRU counters, not the
    recursive counters for the entire subtree.  Cgroup limits are frequently
    configured against intermediate cgroups that do not have memory on their
    own LRUs.  In this case, the node mask will always come up empty and
    reclaim falls back to scanning only the current node.
    
    If a cgroup subtree has some memory on one node but the processes are
    bound to another node afterwards, the limit reclaim will never age or
    reclaim that memory anymore.
    
    To fix this, use the recursive LRU counts for a cgroup subtree to
    determine which nodes hold memory of that cgroup.
    
    The code has been broken like this forever, so it doesn't seem to be a
    problem in practice.  I just noticed it while reviewing the way the LRU
    counters are used in general.
    
    Link: http://lkml.kernel.org/r/20190412151507.2769-5-hannes@cmpxchg.org
    Signed-off-by: Johannes Weiner <hannes>
    Reviewed-by: Shakeel Butt <shakeelb>
    Reviewed-by: Roman Gushchin <guro>
    Cc: Michal Hocko <mhocko>
    Signed-off-by: Andrew Morton <akpm>
    Signed-off-by: Linus Torvalds <torvalds>

Comment 7 Jesper Brouer 2020-01-30 14:33:55 UTC
I've created a bpftrace script that can be used for detecting IRQ-to-softirq latency spikes, which can be caused by this issue.

https://github.com/xdp-project/xdp-project/blob/master/areas/latency/softirq_net_latency.bt

Comment 8 Jesper Brouer 2020-01-30 15:03:48 UTC
Using the bpftrace script in comment #7, I have verified that a RHEL8 kernel does experience latency spikes corresponding to the time it takes to cat/read file /sys/fs/cgroup/memory/memory.stat.

I created 20000 memory cgroups as described in bug 1793534 comment #8:

 $ for i in `seq 1 20000`; do sudo mkdir /sys/fs/cgroup/memory/example$i; echo $i; done

After this it takes approx 44 ms (or 44000 usecs) to read /sys/fs/cgroup/memory/memory.stat.
Reading this file constantly like:

 while : ; do time taskset -c 0 cat /sys/fs/cgroup/memory/memory.stat > /dev/null ; done

 [...]
 real	0m0,044s
 user	0m0,000s
 sys	0m0,044s


For the latency spike to occur, the NIC softirq and process reading the memory.stat file must run on the same CPU. Which is why 'taskset' is used in above to restrict process to run on CPU-0. Like-wise I've configured CPU-0 to handle the IRQ for this NIC.  (Hint: restricted NIC to single RX-queue via cmd:  ethtool --set-channels mlx5p1 combined 1).

Test is sending 1 million UDP packets.


If nobody is reading memory.stat latency histogram on CPU-0 is:


 $ sudo ./softirq_net_latency.bt 5000
 Attaching 5 probes...
 Tracing softirq latency ... Hit Ctrl-C to end.
  - Will report on latency above 5000 usecs (= 5 ms)
 [...]

 @latency_usecs[0]: 
 [0]                69465 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
 [1]                 1721 |@                                                   |
 [2, 4)               176 |                                                    |
 [4, 8)                11 |                                                    |
 [8, 16)                4 |                                                    |
 [16, 32)               2 |                                                    |
 [32, 64)               1 |                                                    |

 
If the while-loop is reading memory.stat:

 @latency_usecs[0]: 
 [0]                 4082 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
 [1]                  338 |@@@@                                                |
 [2, 4)               121 |@                                                   |
 [4, 8)                57 |                                                    |
 [8, 16)                4 |                                                    |
 [16, 32)               0 |                                                    |
 [32, 64)               3 |                                                    |
 [64, 128)              0 |                                                    |
 [128, 256)             0 |                                                    |
 [256, 512)             1 |                                                    |
 [512, 1K)              5 |                                                    |
 [1K, 2K)               1 |                                                    |
 [2K, 4K)               0 |                                                    |
 [4K, 8K)               0 |                                                    |
 [8K, 16K)              0 |                                                    |
 [16K, 32K)             4 |                                                    |
 [32K, 64K)            76 |                                                    |

There is also a lot of High IRQ-to-softirq latency reports:

 sudo ./softirq_net_latency.bt 5000
 Attaching 5 probes...
 Tracing softirq latency ... Hit Ctrl-C to end.
  - Will report on latency above 5000 usecs (= 5 ms)
 High IRQ-to-softirq latency: 17033 usec (17 ms) on CPU:0
 High IRQ-to-softirq latency: 41906 usec (41 ms) on CPU:0
 High IRQ-to-softirq latency: 42523 usec (42 ms) on CPU:0
 High IRQ-to-softirq latency: 42118 usec (42 ms) on CPU:0
 High IRQ-to-softirq latency: 41765 usec (41 ms) on CPU:0
 High IRQ-to-softirq latency: 42315 usec (42 ms) on CPU:0
 High IRQ-to-softirq latency: 42028 usec (42 ms) on CPU:0
 High IRQ-to-softirq latency: 42531 usec (42 ms) on CPU:0
 [...]

This clearly show that reading memory.stat is the cause of these latency spikes.

Comment 10 Waiman Long 2020-01-30 19:53:02 UTC
(In reply to Jesper Brouer from comment #8)
> Using the bpftrace script in comment #7, I have verified that a RHEL8 kernel
> does experience latency spikes corresponding to the time it takes to
> cat/read file /sys/fs/cgroup/memory/memory.stat.
> 
> I created 20000 memory cgroups as described in bug 1793534 comment #8:
> 
>  $ for i in `seq 1 20000`; do sudo mkdir /sys/fs/cgroup/memory/example$i;
> echo $i; done
> 
> After this it takes approx 44 ms (or 44000 usecs) to read
> /sys/fs/cgroup/memory/memory.stat.
> Reading this file constantly like:
> 
>  while : ; do time taskset -c 0 cat /sys/fs/cgroup/memory/memory.stat >
> /dev/null ; done
> 
>  [...]
>  real	0m0,044s
>  user	0m0,000s
>  sys	0m0,044s
> 
> 
> For the latency spike to occur, the NIC softirq and process reading the
> memory.stat file must run on the same CPU. Which is why 'taskset' is used in
> above to restrict process to run on CPU-0. Like-wise I've configured CPU-0
> to handle the IRQ for this NIC.  (Hint: restricted NIC to single RX-queue
> via cmd:  ethtool --set-channels mlx5p1 combined 1).
> 
> Test is sending 1 million UDP packets.

I have built a brew kernel with additional preemption point in memory.stat read path. You can download it from

https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=26132571

I know little about network testing. Would you mind running your test with this kernel to see if it helps to reduce the latency spike?

Thanks,
Longman

Comment 11 Jesper Brouer 2020-01-31 09:03:58 UTC
(In reply to Waiman Long from comment #10)

> I have built a brew kernel with additional preemption point in memory.stat
> read path. You can download it from
> 
> https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=26132571
> 
> I know little about network testing. Would you mind running your test with
> this kernel to see if it helps to reduce the latency spike?

I've installed the kernel, but the directory /sys/fs/cgroup/memory/ is missing.

 [root@broadwell ~]# uname -a
 Linux broadwell 4.18.0-174.el8.bz1795049_memcg.x86_64 #1 SMP Thu Jan 30 18:15:44 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux

 [root@broadwell ~]# cd /sys/fs/cgroup/memory/
 -bash: cd: /sys/fs/cgroup/memory/: No such file or directory

What am I missing?

Hint:
 $ mount | grep cgroup
 cgroup2 on /sys/fs/cgroup type cgroup2 (rw,nosuid,nodev,noexec,relatime,nsdelegate)


Followup questions:

Question(1): How do I re-enabled cgroup v1 to test this?

Question(2): Is this regression/issue only cgroup v1 related?

Question(3): Have OpenShift on RHEL8 been fully transisioned to cgroup v2 ?

Comment 12 Jesper Brouer 2020-01-31 09:18:16 UTC
(In reply to Jesper Brouer from comment #11)
> Followup questions:
> 
> Question(1): How do I re-enabled cgroup v1 to test this?

I've resorted to boot the kernel with boot param: systemd.unified_cgroup_hierarchy=0

Full kexec cmd:

  sudo kexec -l --initrd=/boot/initramfs-4.18.0-174.el8.bz1795049_memcg.x86_64.img \
 '--append=root=UUID=09e8c15f-80d2-47e3-8e73-d3fdfcf33eef ro rhgb console=tty0 console=ttyS1,115200n8 LANG=en_US.UTF-8 systemd.unified_cgroup_hierarchy=0' \
  -- /boot/vmlinuz-4.18.0-174.el8.bz1795049_memcg.x86_64

Now the cgroup v1 dirs are mounted:

 $ mount | grep cgroup
 tmpfs on /sys/fs/cgroup type tmpfs (ro,nosuid,nodev,noexec,mode=755)
 cgroup2 on /sys/fs/cgroup/unified type cgroup2 (rw,nosuid,nodev,noexec,relatime,nsdelegate)
 cgroup on /sys/fs/cgroup/systemd type cgroup (rw,nosuid,nodev,noexec,relatime,xattr,name=systemd)
 cgroup on /sys/fs/cgroup/net_cls,net_prio type cgroup (rw,nosuid,nodev,noexec,relatime,net_cls,net_prio)
 cgroup on /sys/fs/cgroup/cpu,cpuacct type cgroup (rw,nosuid,nodev,noexec,relatime,cpu,cpuacct)
 cgroup on /sys/fs/cgroup/freezer type cgroup (rw,nosuid,nodev,noexec,relatime,freezer)
 cgroup on /sys/fs/cgroup/cpuset type cgroup (rw,nosuid,nodev,noexec,relatime,cpuset)
 cgroup on /sys/fs/cgroup/devices type cgroup (rw,nosuid,nodev,noexec,relatime,devices)
 cgroup on /sys/fs/cgroup/memory type cgroup (rw,nosuid,nodev,noexec,relatime,memory)
 cgroup on /sys/fs/cgroup/hugetlb type cgroup (rw,nosuid,nodev,noexec,relatime,hugetlb)
 cgroup on /sys/fs/cgroup/pids type cgroup (rw,nosuid,nodev,noexec,relatime,pids)
 cgroup on /sys/fs/cgroup/rdma type cgroup (rw,nosuid,nodev,noexec,relatime,rdma)
 cgroup on /sys/fs/cgroup/blkio type cgroup (rw,nosuid,nodev,noexec,relatime,blkio)
 cgroup on /sys/fs/cgroup/perf_event type cgroup (rw,nosuid,nodev,noexec,relatime,perf_event)

Comment 13 Jesper Brouer 2020-01-31 09:22:10 UTC
First assessment is that this newer kernel source have less slowdown when reading: /sys/fs/cgroup/memory/memory.stat

I've created 20000 mem cgroups via cmdline:

 for i in `seq 1 20000`; \
  do sudo mkdir /sys/fs/cgroup/memory/example$i ; \
  echo $i; \
 done

Before it took 44 ms to read memory.stat, now it takes 12 ms.

 $ time cat /sys/fs/cgroup/memory/memory.stat > /dev/null

 real	0m0,013s
 user	0m0,001s
 sys	0m0,012s

Comment 15 Jesper Brouer 2020-01-31 09:42:07 UTC
I repeated the test from comment #8 and used my latency tool[1].

Summary (TL;DR): success
 - The latency is reduced to latency spikes around 83 usec (single worstcase 124 usec).
 - The expected latency was 10 ms (10000 usec)

[1] https://github.com/xdp-project/xdp-project/blob/master/areas/latency/softirq_net_latency.bt

Latency histogram:

 latency_usecs[0]: 
 [0]                17215 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
 [1]                  486 |@                                                   |
 [2, 4)              1923 |@@@@@                                               |
 [4, 8)                16 |                                                    |
 [8, 16)                4 |                                                    |
 [16, 32)               0 |                                                    |
 [32, 64)            5115 |@@@@@@@@@@@@@@@                                     |
 [64, 128)            856 |@@                                                  |

The while loop used for reading memory.stat file (time is 10 ms):

 while : ; do time taskset -c 0 cat /sys/fs/cgroup/memory/memory.stat > /dev/null ; done

 [...]
 real	0m0,010s
 user	0m0,000s
 sys	0m0,010s

Another observation is that there are now zero packet drops, before the packet drops were significant.

Comment 16 Jesper Brouer 2020-01-31 09:45:48 UTC
(In reply to Waiman Long from comment #10)

> I know little about network testing. Would you mind running your test with
> this kernel to see if it helps to reduce the latency spike?

Sure always eager to help, see comment #15.  (Update to clear needinfo)

Comment 18 Waiman Long 2020-01-31 13:35:41 UTC
(In reply to Jesper Brouer from comment #11)
> (In reply to Waiman Long from comment #10)
> 
> > I have built a brew kernel with additional preemption point in memory.stat
> > read path. You can download it from
> > 
> > https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=26132571
> > 
> > I know little about network testing. Would you mind running your test with
> > this kernel to see if it helps to reduce the latency spike?
> 
> I've installed the kernel, but the directory /sys/fs/cgroup/memory/ is
> missing.
> 
>  [root@broadwell ~]# uname -a
>  Linux broadwell 4.18.0-174.el8.bz1795049_memcg.x86_64 #1 SMP Thu Jan 30
> 18:15:44 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
> 
>  [root@broadwell ~]# cd /sys/fs/cgroup/memory/
>  -bash: cd: /sys/fs/cgroup/memory/: No such file or directory
> 
> What am I missing?
> 
> Hint:
>  $ mount | grep cgroup
>  cgroup2 on /sys/fs/cgroup type cgroup2
> (rw,nosuid,nodev,noexec,relatime,nsdelegate)
> 
> 
> Followup questions:
> 
> Question(1): How do I re-enabled cgroup v1 to test this?
> 
> Question(2): Is this regression/issue only cgroup v1 related?
> 
> Question(3): Have OpenShift on RHEL8 been fully transisioned to cgroup v2 ?

The kernel change is simple enough that it shouldn't cause this side effect. Probably there are some config issue in your system that causes this.

-Longman

Comment 21 Herton R. Krzesinski 2020-02-05 13:11:45 UTC
Patch(es) available on kernel-4.18.0-176.el8

Comment 28 errata-xmlrpc 2020-04-28 16:38:01 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:1769