Description of problem:
We have implemented eviction on one of our clusters, but it appears to fail to list the docker images at the point when eviction should kick in. Could the load already be too high for docker to list its images, causing the eviction manager to fail? Logs and configuration files are attached.

Version-Release number of selected component (if applicable):
oc v3.4.1.2
kubernetes v1.4.0+776c994
features: Basic-Auth GSSAPI Kerberos SPNEGO
The error in question is: Feb 15 11:49:47 ip-172-31-8-225.ec2.internal atomic-openshift-node[945]: E0215 11:49:47.151424 945 eviction_manager.go:162] eviction manager: unexpected err: failed ImageStats: failed to list docker images - operation timeout: context deadline exceeded
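As a rough gauge of how often the node is hitting this, the journal can be counted directly (a sketch; it assumes the node service logs to journald under the atomic-openshift-node unit, as in the message above):

# Count timeout occurrences over the last 8 hours on the affected node.
journalctl -u atomic-openshift-node --since "8 hours ago" | grep -c 'failed to list docker images - operation timeout'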
For context, the general flow is as follows:

1. The eviction manager runs every 10s to verify the state of the node.
2. At each interval, it invokes the kubelet /stats/summary API endpoint to collect information on the signals it tracks.
3. To satisfy that request, the kubelet asks the container runtime to ListImages in order to determine the total storage in bytes used by images.

The request to the underlying container runtime to list images is made via a client whose timeout is configured via the kubelet argument --runtime-request-timeout. It defaults to 2m0s, and the supplied configuration does not appear to override it. From the log output, it appears that this timeout was reached 4 times over the course of 8 hrs. If the log output had been set to "-v 3", I would be able to tell whether the operation ever succeeded.

For the node in question, are you able to run the following without hitting the timeout?

$ oc get --raw /api/v1/nodes/<nodename>/proxy/stats/summary
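For example, a rough end-to-end timing of that call, to compare against the 2m0s runtime-request-timeout (a sketch; replace <nodename> with the affected node and run with cluster-admin credentials):

time oc get --raw /api/v1/nodes/<nodename>/proxy/stats/summary > /dev/null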
It may also be possible that the call to list images is being impacted by excessive invocations of thin_ls, which was fixed in https://github.com/openshift/ose/pull/604. That fix is not available in v3.4.1.2, but it does appear to be available in v3.4.1.7. Is it possible to try to reproduce with the later version so we can rule out thin_ls causing the delay?
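As a rough way to see whether thin_ls is being invoked repeatedly while the node is under load (a sketch; cAdvisor shells out to thin_ls on devicemapper-backed nodes, and the sampling window here is arbitrary):

# Sample for two minutes; frequent matches suggest repeated thin pool scans.
for i in $(seq 1 12); do pgrep -a thin_ls; sleep 10; done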
Derek, attaching the increased logging from two nodes that have been running short of memory over the weekend. The upgrade in question is planned, but we do not have a date yet.
We have upgraded to v3.4.1.8, but we still see the problem. Logs to follow.
For awareness, cAdvisor/kubelet does the following every 10s to calculate memory.available for the node:

#!/usr/bin/env bash
#
# This script reproduces what kubelet/cAdvisor does
# to calculate memory.available relative to the root cgroup.
# The major change is that it excludes total_inactive_file memory.

# current memory usage
memory_capacity_in_kb=$(cat /proc/meminfo | grep MemTotal | awk '{print $2}')
memory_capacity_in_bytes=$((memory_capacity_in_kb * 1024))
memory_usage_in_bytes=$(cat /sys/fs/cgroup/memory/memory.usage_in_bytes)
memory_total_inactive_file=$(cat /sys/fs/cgroup/memory/memory.stat | grep total_inactive_file | awk '{print $2}')

memory_working_set=$((memory_usage_in_bytes - memory_total_inactive_file))

memory_available_in_bytes=$((memory_capacity_in_bytes - memory_working_set))
memory_available_in_kb=$((memory_available_in_bytes / 1024))
memory_available_in_mb=$((memory_available_in_kb / 1024))

echo "memory.capacity_in_bytes $memory_capacity_in_bytes"
echo "memory.usage_in_bytes $memory_usage_in_bytes"
echo "memory.total_inactive_file $memory_total_inactive_file"
echo "memory.working_set $memory_working_set"
echo "memory.available_in_bytes $memory_available_in_bytes"
echo "memory.available_in_kb $memory_available_in_kb"
echo "memory.available_in_mb $memory_available_in_mb"

I would like to do the following:

1. Know the eviction configuration (see the command sketch after this list).
2. Set a --hard-threshold that will trigger before your alerts.
3. Update your alerting to use memory.available_in_bytes as defined above.
4. Note that eviction runs roughly every 10s.
5. Note that skew with cAdvisor could cause its stats to be stale by up to ~10s.
6. Ideally, get new rpms that contain https://github.com/openshift/ose/pull/648 to help us know whether eviction is missing alerting windows based on its logged observations of the system.
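For item 1, something like the following should surface the current settings (a sketch that assumes the default OpenShift 3.x node configuration path):

grep -E -A2 'eviction|system-reserved' /etc/origin/node/node-config.yaml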
Updated script to handle the case where usage is less than total_inactive_file, so the working set does not go negative:

#!/usr/bin/env bash
#
# This script reproduces what kubelet/cAdvisor does
# to calculate memory.available relative to the root cgroup.
# The major change is that it excludes total_inactive_file memory.

# current memory usage
memory_capacity_in_kb=$(cat /proc/meminfo | grep MemTotal | awk '{print $2}')
memory_capacity_in_bytes=$((memory_capacity_in_kb * 1024))
memory_usage_in_bytes=$(cat /sys/fs/cgroup/memory/memory.usage_in_bytes)
memory_total_inactive_file=$(cat /sys/fs/cgroup/memory/memory.stat | grep total_inactive_file | awk '{print $2}')

memory_working_set=$memory_usage_in_bytes
if [ "$memory_working_set" -lt "$memory_total_inactive_file" ]; then
    memory_working_set=0
else
    memory_working_set=$((memory_usage_in_bytes - memory_total_inactive_file))
fi

memory_available_in_bytes=$((memory_capacity_in_bytes - memory_working_set))
memory_available_in_kb=$((memory_available_in_bytes / 1024))
memory_available_in_mb=$((memory_available_in_kb / 1024))

echo "memory.capacity_in_bytes $memory_capacity_in_bytes"
echo "memory.usage_in_bytes $memory_usage_in_bytes"
echo "memory.total_inactive_file $memory_total_inactive_file"
echo "memory.working_set $memory_working_set"
echo "memory.available_in_bytes $memory_available_in_bytes"
echo "memory.available_in_kb $memory_available_in_kb"
echo "memory.available_in_mb $memory_available_in_mb"
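To track a node over time, the script can simply be sampled on the same cadence the eviction manager uses (assuming it is saved as memory-available.sh and made executable):

# Re-run every 10s, matching the eviction manager's housekeeping interval.
watch -n 10 ./memory-available.sh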
(In reply to Derek Carr from comment #20)
(...)
> I would like to do the following:
>
> 1. know the eviction configuration

system-reserved:
- cpu=500m,memory=1Gi
eviction-soft:
- memory.available<1Gi
eviction-soft-grace-period:
- memory.available=1m

> 2. set a --hard-threshold that will trigger before your alerts

I can set the hard threshold to 1G and the soft threshold to 2G. OK?

> 3. update your alerting to use the memory.available_in_bytes as defined above

Will investigate how to do this, as it looks like zabbix only reports memory including inactive. I ran your script on one of the nodes with low memory and high load (25+) and it reported memory.available_in_mb as 1452, but we still saw a performance impact.
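For item 3, one way to feed that number to the monitoring system is to parse the script's output (a sketch; the install path is an assumption, and wiring it into e.g. a Zabbix user parameter is left to the agent configuration):

# Emit just memory.available_in_bytes for consumption by an alerting agent.
/usr/local/bin/memory-available.sh | awk '/^memory.available_in_bytes/ {print $2}'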
Created attachment 1264831 [details]
Script to dump memory profiles

Requires socat and curl to be installed. Optionally uses jq and the Go tools.
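A quick pre-flight check before running it (a sketch; go is only needed for the optional analysis):

for tool in socat curl jq go; do
  command -v "$tool" >/dev/null 2>&1 || echo "missing: $tool"
done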
Can this be reproduced with the json-file log driver instead of journald?
Production clusters are unable to change the log driver. json-file is the log driver of choice for operations for the foreseeable future. Any problems found with the json-file logger must be resolved.
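For reference, the active log driver on a node can be confirmed with something like the following (kept grep-based so it works on the docker version shipped with this release):

docker info 2>/dev/null | grep -i 'logging driver'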
Please provide a core file from the docker daemon during an event. Thanks
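One possible way to capture that while the event is occurring (a sketch; it assumes gdb is installed, which provides gcore, and that docker runs as a systemd service):

# Resolve the daemon's main PID from systemd, then dump a core alongside it.
DOCKER_PID=$(systemctl show -p MainPID docker | cut -d= -f2)
gcore -o /tmp/docker-daemon-core "$DOCKER_PID"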
Hi folks, this has come up in a ticket that my team is working on, and we'd like to ask about the status of this issue. Is it still outstanding? -lisa
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days