Bug 1405347 - High IO read load on RHEL 7.3 from thin_ls
Summary: High IO read load on RHEL 7.3 from thin_ls
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Node
Version: 3.3.1
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: Seth Jennings
QA Contact: Mike Fiedler
URL:
Whiteboard: aos-scalability-35
Duplicates: 1397593 (view as bug list)
Depends On:
Blocks: 1267746 1417265
 
Reported: 2016-12-16 09:20 UTC by Vladislav Walek
Modified: 2020-09-10 10:03 UTC
CC List: 16 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Fixes an issue where I/O could be saturated on a node by the collection of per-container disk stats from a thin pool with a large amount of metadata. The fix disables the collection of these statistics until an efficient way to collect them can be found.
Clone Of:
Clones: 1417265 (view as bug list)
Environment:
Last Closed: 2017-04-12 19:18:38 UTC
Target Upstream Version:
Embargoed:


Attachments


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2017:0884 0 normal SHIPPED_LIVE Red Hat OpenShift Container Platform 3.5 RPM Release Advisory 2017-04-12 22:50:07 UTC

Description Vladislav Walek 2016-12-16 09:20:42 UTC
Description of problem:

Since upgrading from RHEL 7.2 to RHEL 7.3 we have seen a huge increase in I/O read load on the Docker volume group. After tracing the issue down we found that it was caused by thin_ls, which is used by cAdvisor. After removing thin_ls the I/O read load went back to normal.

Version-Release number of selected component (if applicable):

OpenShift 3.3.1.7 on RHEL 7.3

How reproducible:


Actual results:


Expected results:


Additional info:

Comment 1 Javier Ramirez 2016-12-19 09:20:32 UTC
More information provided by the customer

We used the blktrace(8) utility to trace the source of the I/O operations and confirmed that the requests originate from thin_ls(8). RHBZ#1315452 (https://bugzilla.redhat.com/show_bug.cgi?id=1315452) is where its inclusion in RHEL 7.3 was requested, specifically for OpenShift.

cAdvisor, the component of Kubernetes and OpenShift that collects per-container statistics, makes use of thin_ls when it is available. Its thin-pool watcher calls thin_ls every 15 seconds (https://github.com/openshift/origin/blob/release-1.3/vendor/github.com/google/cadvisor/devicemapper/thin_pool_watcher.go#L50).
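
For illustration, a minimal sketch of that polling pattern (this is not the actual cadvisor code; the device path and the handling around the call are simplified):

// Simplified sketch of the thin-pool watcher loop: invoke thin_ls on a
// fixed ticker and log its per-device output. The real watcher parses the
// output into per-container usage and does additional bookkeeping.
package main

import (
	"log"
	"os/exec"
	"time"
)

func main() {
	ticker := time.NewTicker(15 * time.Second) // refresh period described above
	defer ticker.Stop()
	for range ticker.C {
		// Same style of invocation reported later in this bug (comment 7).
		out, err := exec.Command("thin_ls", "--no-headers", "-m",
			"-o", "DEV,EXCLUSIVE_BYTES",
			"/dev/mapper/vg_docker-docker--pool_tmeta").Output()
		if err != nil {
			log.Printf("thin_ls failed: %v", err)
			continue
		}
		log.Printf("thin_ls output:\n%s", out)
	}
}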

Here's how you can reproduce our observations.

$ oc version
oc v3.3.1.7
kubernetes v1.3.0+52492b4
features: Basic-Auth GSSAPI Kerberos SPNEGO

# Docker storage configuration
$ cat /etc/sysconfig/docker-storage
DOCKER_STORAGE_OPTIONS="--storage-driver devicemapper --storage-opt dm.fs=xfs --storage-opt dm.thinpooldev=/dev/mapper/vg_docker-docker--pool --storage-opt dm.use_deferred_removal=true --storage-opt dm.use_deferred_deletion=true "

# Trace read operations over 10 seconds
$ blktrace -d /dev/mapper/vg_docker-docker--pool_tmeta -w 10 -a read

# Parse trace and summarize by process. The "dm-*" parameter depends on the system setup. Look at the filenames generated by blktrace. 
$ blkparse -s -i dm-47 | less

In my reproduction shortly before writing this, the statistics for thin_ls were as follows, averaging about 326 read IOPS over the 10-second trace (note that only reads were traced):

thin_ls (33526)
 Reads Queued:       8,713,   34,852KiB  Writes Queued:           0,        0KiB
 Read Dispatches:        0,        0KiB  Write Dispatches:        0,        0KiB
 Reads Requeued:         0               Writes Requeued:         0
 Reads Completed:    3,261,   13,044KiB  Writes Completed:        0,        0KiB
 Read Merges:            0,        0KiB  Write Merges:            0,        0KiB

The bumps in I/O every ~15 seconds can also be seen using this command (iostat is from the sysstat package, stdbuf is used to prevent buffering in the pipe):

$ stdbuf -oL bash -c "iostat -x -N -d -y -m 3 vg_docker-docker--pool_tmeta | tr -s ' ' | cut -d ' ' -f1-5 | grep vg_docker-docker--pool_tmeta"

The workaround consists of replacing thin_ls with a no-op executable (in practice made persistent via /etc/fstab):

$ mount --bind /bin/true /usr/sbin/thin_ls

Once the bind mount is in place, the read operations for vg_docker-docker--pool_tmeta are usually zero or close to it. It is our understanding, based on a reading of the cAdvisor code, that this workaround has no negative consequences compared to not having thin_ls at all, as was the case on RHEL 7.2.
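
For reference, a persistent variant of the bind mount could look like the following /etc/fstab entry (illustrative; exact options are a matter of local convention):

/bin/true  /usr/sbin/thin_ls  none  bind  0 0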

Comment 3 Andy Goldstein 2017-01-12 13:48:26 UTC
Maybe something changed in the devicemapper and/or kernel code between RHEL 7.2 and 7.3 that has created this performance regression. Joe, any ideas?

Comment 6 Joe Thornber 2017-01-13 15:33:09 UTC
Could you tell me exactly how thin_ls is being called please?

Some of the --format fields require reading all the mappings.

Comment 7 Andy Goldstein 2017-01-13 15:35:15 UTC
thin_ls --no-headers -m -o DEV,EXCLUSIVE_BYTES <deviceName>

Comment 8 Joe Thornber 2017-01-13 15:55:52 UTC
ok, so EXCLUSIVE_BYTES will require reading all the metadata.  Could you tell me how long it takes to execute on 7.2 and 7.3 with identical metadata please?  Also thin_ls --version on both please.

(I've looked at the diff between 0.6.2 and 0.6.3 which is quite small and can't see any reason for a perf difference).

Comment 9 Andy Goldstein 2017-01-13 16:51:56 UTC
I mistakenly thought that thin_ls was running in the 7.2 setup and the 7.3 setup post-upgrade, but I have since learned that the kernel version used when the OS was 7.2 was too old to run thin_ls. Therefore, the issue isn't a performance regression in thin_ls.

What thin_ls does is read the thin pool's metadata in its entirety so it can determine which blocks are exclusive to each container. This means that if the metadata is 300 MB, OpenShift is going to invoke thin_ls every 15 seconds to read and process 300 MB of thin pool metadata.
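
For a rough sense of scale (using the 300 MB figure above): re-reading 300 MB every 15 seconds works out to a sustained ~20 MB/s of raw reads from the thin pool metadata device, before any normal container I/O is counted.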

We should probably reduce the frequency at which OpenShift invokes thin_ls, and possibly make it configurable.

Comment 10 Jean Abraham 2017-01-13 16:56:40 UTC
Would it be possible to make the frequency of the read configurable? 

https://github.com/openshift/origin/blob/release-1.3/vendor/github.com/google/cadvisor/devicemapper/thin_pool_watcher.go#L50

Comment 11 Andy Goldstein 2017-01-13 17:00:14 UTC
If we wanted it to be something in the OpenShift node config that trickles down to cadvisor, that would be rather involved, as it would require changes to both kubernetes and cadvisor. It's not impossible, just not a quick fix.

A quicker fix might be to have the cadvisor code look for an environment variable and use that to override the 15 second default.
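
For illustration, the environment-variable route could be as small as something like this (sketch only; the variable name is hypothetical, not an existing cadvisor setting):

// Sketch: allow an env var to override the hard-coded 15s refresh period.
package main

import (
	"fmt"
	"os"
	"time"
)

func refreshPeriod() time.Duration {
	period := 15 * time.Second // current default
	if v := os.Getenv("THIN_POOL_WATCHER_PERIOD"); v != "" { // hypothetical name
		if d, err := time.ParseDuration(v); err == nil {
			period = d
		}
	}
	return period
}

func main() {
	fmt.Println("thin-pool refresh period:", refreshPeriod())
}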

In the long term, if we can move users to read-only containers, we could avoid calling thin_ls entirely, but we're not there yet.

Comment 12 Jean Abraham 2017-01-13 17:05:34 UTC
Having the option to reduce the disk usage via a configurable parameter would be good, even if we have to take the quicker route.

Comment 13 Paul Morie 2017-01-17 20:29:43 UTC
I think reducing the frequency of the thin_ls invocations is the best route given our current situation.  Note that reducing the frequency of invocation will mean that disk eviction is less responsive, since cadvisor will be updated less frequently with the usage information from each container.

I don't necessarily think that looking at an env var for the thin_ls frequency would be much less work than adding a first-class option.  We would still have to bump the cadvisor dependency in OpenShift, and we should make this capability available in upstream k8s as well.

Comment 14 Mike Snitzer 2017-01-18 17:23:17 UTC
(In reply to Paul Morie from comment #13)
> I think reducing the frequency of the thin_ls invocations is the best route
> for this with our current situation.  Note that reducing the frequency of
> invocation will mean that disk eviction is less responsive since cadvisor
> will be updated less frequently with the usage information from each
> container.

OK, but comment #9 says the current frequency is every 15 seconds, so effectively you're _constantly_ running thin_ls right now.  Every 15 seconds is certainly excessive.

You'd do well to benchmark how long a single iteration of thin_ls takes in the worst case for your setup and then use at least double that as the interval (better still, consider the expected rate of change and adjust the interval accordingly).

Comment 15 Seth Jennings 2017-01-30 18:27:11 UTC
The proposed solution is to increase the minimum refresh interval and dynamically grow the interval based on thin_ls run durations if needed.

cadvisor PR:
https://github.com/google/cadvisor/pull/1583
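
Roughly, the idea amounts to the following sketch (illustrative only, not the code in the PR): time each refresh and stretch the interval when a thin_ls run takes long relative to it.

// Sketch: grow the refresh interval when thin_ls runs are slow, so the tool
// is never effectively running back-to-back.
package main

import "time"

func watchThinPool(minInterval time.Duration, refresh func()) {
	interval := minInterval
	for {
		start := time.Now()
		refresh() // wraps the thin_ls invocation and parsing of its output
		if elapsed := time.Since(start); elapsed > interval/2 {
			interval = 2 * elapsed // back off after a slow run
		}
		time.Sleep(interval)
	}
}

func main() {
	// A stand-in refresh that just sleeps; the real one would call thin_ls.
	watchThinPool(time.Minute, func() { time.Sleep(2 * time.Second) })
}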

Comment 16 Eric Paris 2017-02-02 21:10:17 UTC
@seth can you post all the information you know about `thin_ls never returns` to https://bugzilla.redhat.com/show_bug.cgi?id=1417265 so they can start fixing thin_ls?

We need to work around the problem here, but we need to give the thin_ls team data to fix the root problem.

Comment 17 Seth Jennings 2017-02-03 22:12:11 UTC
After looking into this today, I made a realization that I think makes thin_ls unworkable here.

thin_ls does on the order of hundreds of I/O operations per MB of thin pool metadata.  The game-changing realization is that these are raw _uncached_ block device I/Os.  This means that the data is read _every_ time thin_ls is called.  In an IOPS-restricted environment there is no safe way to use this.
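
To put rough numbers on that (using the ~300 MB metadata example from comment 9 and "hundreds of I/Os per MB" as stated above): a single invocation issues on the order of tens of thousands of uncached block reads, so on a volume provisioned at, say, 1,000 IOPS one thin_ls run can consume the entire I/O budget for tens of seconds, and at a 15-second refresh it is effectively always running.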

The current plan (confirming consensus now) is to disable thin_ls code paths entirely.

This will disable per-container disk stats and affect the pod eviction selection under disk pressure.

Comment 19 Mike Snitzer 2017-02-06 20:00:33 UTC
(In reply to Seth Jennings from comment #17)
> After looking into this today, I made a realization that I think makes
> thin_ls unworkable here.
> 
> thin_ls does on the order of 100s of I/O operations per MB of thin pool
> metadata.  The game-changing realization is that these are raw _uncached_
> block device I/Os.  This means that the data is read _every_ time thin_ls is
> called.  In IOPS restricted environment there is no safe way to use this.

What is your definition of "safe"!?

> The current plan (confirming consensus now) is to disable thin_ls code paths
> entirely.
> 
> This will disable per-container disk stats and effect the pod eviction
> selection under disk pressure.

This makes _zero_ sense.  This is like saying: I'm not using fsck to repair my filesystem because it needs to read the device directly.

By using thin_ls you're able to get an understanding of container usage on a per-container (per-thin) basis.  To get that insight you need to get access to the metadata that the kernel is actively using.  You need to read that metadata from disk each time because userspace cannot cache the associated metadata blocks because the kernel is managing that metadata completely independent of userspace.

Comment 20 Seth Jennings 2017-02-06 21:28:30 UTC
(In reply to Mike Snitzer from comment #19)
> (In reply to Seth Jennings from comment #17)
> > After looking into this today, I made a realization that I think makes
> > thin_ls unworkable here.
> > 
> > thin_ls does on the order of 100s of I/O operations per MB of thin pool
> > metadata.  The game-changing realization is that these are raw _uncached_
> > block device I/Os.  This means that the data is read _every_ time thin_ls is
> > called.  In IOPS restricted environment there is no safe way to use this.
> 
> What is your definition of "safe"!?

In this case, safe is "regardless of metadata size and IOPS limit, we can expect thin_ls to return in a reasonable amount of time".  We can't run a command that can max out IOPS on the instance for minutes.

> 
> > The current plan (confirming consensus now) is to disable thin_ls code paths
> > entirely.
> > 
> > This will disable per-container disk stats and effect the pod eviction
> > selection under disk pressure.
> 
> This makes _zero_ sense.  This is like saying: I'm not using fsck to repair
> my filesystem because it needs to read the device directly.
> 
> By using thin_ls you're able to get an understanding of container usage on a
> per-container (per-thin) basis.  To get that insight you need to get access
> to the metadata that the kernel is actively using.  You need to read that
> metadata from disk each time because userspace cannot cache the associated
> metadata blocks because the kernel is managing that metadata completely
> independent of userspace.

I understand that given the design of devicemapper, there is likely not a solution here.  The data must be read and can't be cached.  It isn't a criticism of thin_ls or devicemapper.  It is just a reality that happens to wreck our use case; namely, gathering disk usage stats on a per-container basis on a 1m interval.

Comment 22 Troy Dawson 2017-02-10 22:44:55 UTC
This has been merged into OCP and is in OCP v3.5.0.19 or newer.

Comment 24 Eric Paris 2017-02-14 19:46:01 UTC
*** Bug 1397593 has been marked as a duplicate of this bug. ***

Comment 25 Mike Fiedler 2017-02-24 13:08:36 UTC
Verified on 3.5.0.32.  thin_ls no longer shows up in CPU profiles for node vertical scalability runs.  In earlier 3.4 and 3.5 builds, it was present.

Comment 30 errata-xmlrpc 2017-04-12 19:18:38 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:0884

