Description of problem:
Since upgrading from RHEL 7.2 to RHEL 7.3 we saw a huge increase in I/O read load on the Docker volume group. After tracing this issue down we found that it was caused by thin_ls, which is used by cAdvisor. After removing thin_ls the I/O read load went back to normal.

Version-Release number of selected component (if applicable):
OpenShift 3.3.1.7 on RHEL 7.3

How reproducible:

Actual results:

Expected results:

Additional info:
More information provided by the customer:

We used the blktrace(8) utility to trace the source of I/O operations, and were able to confirm that the requests originate from thin_ls(8). RHBZ#1315452 (https://bugzilla.redhat.com/show_bug.cgi?id=1315452) is where its inclusion in Red Hat 7.3 was requested, specifically for OpenShift. cAdvisor, a component of Kubernetes and OpenShift that collects statistics on containers, makes use of thin_ls when it is available. The so-called thin-pool watcher calls thin_ls every 15 seconds (https://github.com/openshift/origin/blob/release-1.3/vendor/github.com/google/cadvisor/devicemapper/thin_pool_watcher.go#L50).

Here is how you can reproduce our observations.

$ oc version
oc v3.3.1.7
kubernetes v1.3.0+52492b4
features: Basic-Auth GSSAPI Kerberos SPNEGO

# Docker storage configuration
$ cat /etc/sysconfig/docker-storage
DOCKER_STORAGE_OPTIONS="--storage-driver devicemapper --storage-opt dm.fs=xfs --storage-opt dm.thinpooldev=/dev/mapper/vg_docker-docker--pool --storage-opt dm.use_deferred_removal=true --storage-opt dm.use_deferred_deletion=true "

# Trace read operations over 10 seconds
$ blktrace -d /dev/mapper/vg_docker-docker--pool_tmeta -w 10 -a read

# Parse the trace and summarize by process. The "dm-*" parameter depends on
# the system setup; look at the filenames generated by blktrace.
$ blkparse -s -i dm-47 | less

In my reproduction shortly before this writing, the statistics for thin_ls were as follows, averaging about 326 read IOPS (note that only reads were traced):

thin_ls (33526)
 Reads Queued:       8,713,   34,852KiB   Writes Queued:           0,        0KiB
 Read Dispatches:        0,        0KiB   Write Dispatches:        0,        0KiB
 Reads Requeued:         0                Writes Requeued:         0
 Reads Completed:    3,261,   13,044KiB   Writes Completed:        0,        0KiB
 Read Merges:            0,        0KiB   Write Merges:            0,        0KiB

The bumps in I/O every ~15 seconds can also be seen using this command (iostat is from the sysstat package; stdbuf is used to prevent buffering in the pipe):

$ stdbuf -oL bash -c "iostat -x -N -d -y -m 3 vg_docker-docker--pool_tmeta | tr -s ' ' | cut -d ' ' -f1-5 | grep vg_docker-docker--pool_tmeta"

The workaround consists of replacing thin_ls with a no-op executable (in practice implemented via /etc/fstab):

$ mount --bind /bin/true /usr/sbin/thin_ls

Once mounted, the read operations for vg_docker-docker--pool_tmeta are usually zero or close to it. It is our understanding, based on a reading of the cAdvisor code, that this workaround has no negative consequences compared to not having thin_ls at all, as was the case with Red Hat 7.2.
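For reference, the persistent form of the bind-mount workaround mentioned above would be an /etc/fstab entry along these lines (the paths match the mount command above; this is a sketch of the standard bind-mount fstab syntax, so verify against fstab(5) on your system):

```
# /etc/fstab -- neutralize thin_ls by bind-mounting /bin/true over it
/bin/true  /usr/sbin/thin_ls  none  bind  0 0
```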
Maybe something changed in the devicemapper and/or kernel code between RHEL 7.2 and 7.3 that has created this performance regression. Joe, any ideas?
Could you tell me exactly how thin_ls is being called please? Some of the --format fields require reading all the mappings.
thin_ls --no-headers -m -o DEV,EXCLUSIVE_BYTES <deviceName>
OK, so EXCLUSIVE_BYTES will require reading all the metadata. Could you tell me how long it takes to execute on 7.2 and 7.3 with identical metadata, please? Also thin_ls --version on both, please. (I've looked at the diff between 0.6.2 and 0.6.3, which is quite small, and can't see any reason for a perf difference.)
I mistakenly thought that thin_ls was running in both the 7.2 setup and the post-upgrade 7.3 setup, but I have since learned that the kernel version used when the OS was 7.2 was too old to run thin_ls. Therefore, the issue isn't a performance regression in thin_ls.

What thin_ls does is read the thin pool's metadata in its entirety so it can determine which blocks are exclusive to each container. This means that if the metadata is 300MB, OpenShift is going to invoke thin_ls every 15 seconds to read and process 300MB of thin pool metadata.

We probably should reduce the frequency at which OpenShift invokes thin_ls, and possibly make it configurable.
Would it be possible to make the frequency of the read configurable? https://github.com/openshift/origin/blob/release-1.3/vendor/github.com/google/cadvisor/devicemapper/thin_pool_watcher.go#L50
If we wanted it to be something in the OpenShift node config that trickles down to cadvisor, that would be rather involved, as it would require changes to both kubernetes and cadvisor. It's not impossible, just not a quick fix. A quicker fix might be to have the cadvisor code look for an environment variable and use that to override the 15 second default. In the long term, if we can move users to read-only containers, we could avoid calling thin_ls entirely, but we're not there yet.
Having the option to reduce the disk usage via a configurable parameter would be good, even if we have to take the quicker route.
I think reducing the frequency of the thin_ls invocations is the best route for this with our current situation. Note that reducing the frequency of invocation will mean that disk eviction is less responsive since cadvisor will be updated less frequently with the usage information from each container. I don't necessarily think that looking at an env var for the thin_ls frequency would be much less work than adding a first-class option. We would still have to bump the cadvisor dependency in OpenShift, and we should make this capability available in upstream k8s as well.
(In reply to Paul Morie from comment #13)
> I think reducing the frequency of the thin_ls invocations is the best route
> for this with our current situation. Note that reducing the frequency of
> invocation will mean that disk eviction is less responsive since cadvisor
> will be updated less frequently with the usage information from each
> container.

OK, but comment #9 says the current frequency is every 15 seconds. So effectively you're _constantly_ running thin_ls right now. Every 15 seconds is certainly excessive. You'd do well to benchmark how long a single iteration of thin_ls takes in the worst case possible for your setup and then double that (at a minimum; better still, consider the expected rate of change and adjust the interval accordingly).
Proposed solution is to increase the minimum refresh interval and dynamically grow the interval based on thin_ls run durations if needed. cadvisor PR: https://github.com/google/cadvisor/pull/1583
@seth can you post all the information you know about `thin_ls never returns` to https://bugzilla.redhat.com/show_bug.cgi?id=1417265 so they can start fixing thin_ls? We need to work around the problem here, but we need to give the thin_ls team data to fix the root problem.
After looking into this today, I made a realization that I think makes thin_ls unworkable here.

thin_ls does on the order of 100s of I/O operations per MB of thin pool metadata. The game-changing realization is that these are raw _uncached_ block device I/Os. This means that the metadata is read _every_ time thin_ls is called. In an IOPS-restricted environment there is no safe way to use this.

The current plan (confirming consensus now) is to disable the thin_ls code paths entirely. This will disable per-container disk stats and affect pod eviction selection under disk pressure.
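To put rough numbers on the IOPS concern: at hundreds of I/O operations per MB of metadata, a single full scan can saturate a throttled device for minutes. A back-of-the-envelope sketch (the 300 ops/MB and 500 IOPS figures are illustrative assumptions, not measurements from this report):

```go
package main

import "fmt"

// scanSeconds estimates how long one full thin_ls metadata scan takes
// when the block device is IOPS-limited: total reads divided by the
// device's IOPS cap. All inputs are illustrative assumptions.
func scanSeconds(metadataMB, opsPerMB, iopsLimit float64) float64 {
	return metadataMB * opsPerMB / iopsLimit
}

func main() {
	// 300 MB of metadata at ~300 ops/MB on a volume capped at 500 IOPS:
	fmt.Printf("%.0f seconds\n", scanSeconds(300, 300, 500)) // → 180 seconds
}
```

Three minutes of maxed-out IOPS per invocation, repeated on a 15-second timer, means the device would effectively never leave saturation, which is the scenario described above.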
OSE 3.3 PR: https://github.com/openshift/ose/pull/603 OSE 3.4 PR: https://github.com/openshift/ose/pull/604 ORIGIN PR: https://github.com/openshift/origin/pull/12822
(In reply to Seth Jennings from comment #17) > After looking into this today, I made a realization that I think makes > thin_ls unworkable here. > > thin_ls does on the order of 100s of I/O operations per MB of thin pool > metadata. The game-changing realization is that these are raw _uncached_ > block device I/Os. This means that the data is read _every_ time thin_ls is > called. In IOPS restricted environment there is no safe way to use this. What is your definition of "safe"!? > The current plan (confirming consensus now) is to disable thin_ls code paths > entirely. > > This will disable per-container disk stats and effect the pod eviction > selection under disk pressure. This makes _zero_ sense. This is like saying: I'm not using fsck to repair my filesystem because it needs to read the device directly. By using thin_ls you're able to get an understanding of container usage on a per-container (per-thin) basis. To get that insight you need to get access to the metadata that the kernel is actively using. You need to read that metadata from disk each time because userspace cannot cache the associated metadata blocks because the kernel is managing that metadata completely independent of userspace.
(In reply to Mike Snitzer from comment #19) > (In reply to Seth Jennings from comment #17) > > After looking into this today, I made a realization that I think makes > > thin_ls unworkable here. > > > > thin_ls does on the order of 100s of I/O operations per MB of thin pool > > metadata. The game-changing realization is that these are raw _uncached_ > > block device I/Os. This means that the data is read _every_ time thin_ls is > > called. In IOPS restricted environment there is no safe way to use this. > > What is your definition of "safe"!? In this case, safe is "regardless of metadata size and IOPS limit, we can expect thin_ls to return in a reasonable amount of time". We can't run a command that can max out IOPS on the instance for minutes. > > > The current plan (confirming consensus now) is to disable thin_ls code paths > > entirely. > > > > This will disable per-container disk stats and effect the pod eviction > > selection under disk pressure. > > This makes _zero_ sense. This is like saying: I'm not using fsck to repair > my filesystem because it needs to read the device directly. > > By using thin_ls you're able to get an understanding of container usage on a > per-container (per-thin) basis. To get that insight you need to get access > to the metadata that the kernel is actively using. You need to read that > metadata from disk each time because userspace cannot cache the associated > metadata blocks because the kernel is managing that metadata completely > independent of userspace. I understand that given the design of devicemapper, there is likely not a solution here. The data must be read and can't be cached. It isn't a criticism of thin_ls or devicemapper. It is just a reality that happens to wreck our use case; namely, gathering disk usage stats on a per-container basis on a 1m interval.
This has been merged into OCP and is in OCP v3.5.0.19 or newer.
*** Bug 1397593 has been marked as a duplicate of this bug. ***
Verified on 3.5.0.32. thin_ls no longer shows up in CPU profiles for node vertical scalability runs. In earlier 3.4 and 3.5 builds, it was present.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2017:0884