Description of problem:
Currently, there is only one metric for image pull times: `kubelet_runtime_operations_duration_seconds_bucket{operation_type="pull_image"}`. However, its largest bucket is 10 seconds, so it is not possible to get any detail about the duration of image pulls that exceed 10 seconds, which is the case for almost all pulls. This in turn makes it impossible to investigate issues where pulls seem to take too long (we have regular occurrences of pull times > 15 mins for images of less than 500m in size pulled from the cluster registry). I've done a strawman PR to upstream that might help (but might cause other issues): https://github.com/kubernetes/kubernetes/pull/96054

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
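To illustrate why a 10-second top bucket hides long pulls: Prometheus' `histogram_quantile` interpolates linearly within a bucket, and when the requested quantile falls into the `+Inf` bucket it is capped at the upper bound of the largest finite bucket. The following is a minimal Python sketch of that estimation (a simplified illustration, not the actual Prometheus implementation), fed with the old bucket layout from the description:

```python
def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative (le, count) histogram buckets,
    mimicking Prometheus-style linear interpolation within a bucket.
    buckets: sorted list of (upper_bound, cumulative_count); last bound is +Inf."""
    total = buckets[-1][1]
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if bound == float("inf"):
                # Quantile falls in the +Inf bucket: the estimate is capped
                # at the largest finite bucket bound.
                return prev_bound
            # Linear interpolation inside the bucket.
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count
    return prev_bound

# Old bucket layout: 11 of 14 observed pulls landed above le="10", so any
# high quantile is reported as just 10 seconds -- no detail beyond that.
old = [(0.005, 0), (0.01, 0), (0.025, 0), (0.05, 0), (0.1, 0), (0.25, 0),
       (0.5, 0), (1.0, 0), (2.5, 0), (5.0, 2), (10.0, 3), (float("inf"), 14)]
print(histogram_quantile(0.99, old))  # capped at 10.0
```

So even if most pulls took 15 minutes, the computed p99 would still read 10 seconds, which is exactly the blind spot this bug is about.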
Left a comment to tweak the buckets, but looks good to me.
checked on version: 4.7.0-0.nightly-2020-11-22-204912

$ kubectl get --raw /api/v1/nodes/ip-10-0-143-238.us-east-2.compute.internal/proxy/metrics | grep -i "kubelet_runtime_operations_duration_seconds_bucket" | grep -i "pull_image"
kubelet_runtime_operations_duration_seconds_bucket{operation_type="pull_image",le="0.005"} 0
kubelet_runtime_operations_duration_seconds_bucket{operation_type="pull_image",le="0.01"} 0
kubelet_runtime_operations_duration_seconds_bucket{operation_type="pull_image",le="0.025"} 0
kubelet_runtime_operations_duration_seconds_bucket{operation_type="pull_image",le="0.05"} 0
kubelet_runtime_operations_duration_seconds_bucket{operation_type="pull_image",le="0.1"} 0
kubelet_runtime_operations_duration_seconds_bucket{operation_type="pull_image",le="0.25"} 0
kubelet_runtime_operations_duration_seconds_bucket{operation_type="pull_image",le="0.5"} 0
kubelet_runtime_operations_duration_seconds_bucket{operation_type="pull_image",le="1"} 0
kubelet_runtime_operations_duration_seconds_bucket{operation_type="pull_image",le="2.5"} 0
kubelet_runtime_operations_duration_seconds_bucket{operation_type="pull_image",le="5"} 2
kubelet_runtime_operations_duration_seconds_bucket{operation_type="pull_image",le="10"} 3
kubelet_runtime_operations_duration_seconds_bucket{operation_type="pull_image",le="+Inf"} 14

compared with results on version: 4.6.0

$ kubectl get --raw /api/v1/nodes/ip-10-0-133-243.us-east-2.compute.internal/proxy/metrics | grep -i "kubelet_runtime_operations_duration_seconds_bucket" | grep -i "pull_image"
kubelet_runtime_operations_duration_seconds_bucket{operation_type="pull_image",le="0.005"} 0
kubelet_runtime_operations_duration_seconds_bucket{operation_type="pull_image",le="0.01"} 0
kubelet_runtime_operations_duration_seconds_bucket{operation_type="pull_image",le="0.025"} 0
kubelet_runtime_operations_duration_seconds_bucket{operation_type="pull_image",le="0.05"} 0
kubelet_runtime_operations_duration_seconds_bucket{operation_type="pull_image",le="0.1"} 0
kubelet_runtime_operations_duration_seconds_bucket{operation_type="pull_image",le="0.25"} 0
kubelet_runtime_operations_duration_seconds_bucket{operation_type="pull_image",le="0.5"} 0
kubelet_runtime_operations_duration_seconds_bucket{operation_type="pull_image",le="1"} 0
kubelet_runtime_operations_duration_seconds_bucket{operation_type="pull_image",le="2.5"} 0
kubelet_runtime_operations_duration_seconds_bucket{operation_type="pull_image",le="5"} 0
kubelet_runtime_operations_duration_seconds_bucket{operation_type="pull_image",le="10"} 0
kubelet_runtime_operations_duration_seconds_bucket{operation_type="pull_image",le="+Inf"} 16

Hi @aaleman, are the results as expected in version 4.7.0-0.nightly-2020-11-22-204912?
@Minli no, that doesn't look like the change is there; it should also have buckets for le=120, le=300, and le=720.
According to Comment 7, this issue is not fixed.
I've manually built a kubelet from upstream commit c1f36fa6f28d3618c03b65799bc3f58007624e5f and deployed it to a cluster; it does have the new buckets:

$ kubectl get --raw /api/v1/nodes/lpt/proxy/metrics | grep -i "kubelet_runtime_operations_duration_seconds_bucket" | grep -i "pull_image"
kubelet_runtime_operations_duration_seconds_bucket{operation_type="pull_image",le="0.005"} 0
kubelet_runtime_operations_duration_seconds_bucket{operation_type="pull_image",le="0.0125"} 0
kubelet_runtime_operations_duration_seconds_bucket{operation_type="pull_image",le="0.03125"} 0
kubelet_runtime_operations_duration_seconds_bucket{operation_type="pull_image",le="0.078125"} 0
kubelet_runtime_operations_duration_seconds_bucket{operation_type="pull_image",le="0.1953125"} 0
kubelet_runtime_operations_duration_seconds_bucket{operation_type="pull_image",le="0.48828125"} 0
kubelet_runtime_operations_duration_seconds_bucket{operation_type="pull_image",le="1.220703125"} 0
kubelet_runtime_operations_duration_seconds_bucket{operation_type="pull_image",le="3.0517578125"} 0
kubelet_runtime_operations_duration_seconds_bucket{operation_type="pull_image",le="7.62939453125"} 1
kubelet_runtime_operations_duration_seconds_bucket{operation_type="pull_image",le="19.073486328125"} 1
kubelet_runtime_operations_duration_seconds_bucket{operation_type="pull_image",le="47.6837158203125"} 1
kubelet_runtime_operations_duration_seconds_bucket{operation_type="pull_image",le="119.20928955078125"} 1
kubelet_runtime_operations_duration_seconds_bucket{operation_type="pull_image",le="298.0232238769531"} 1
kubelet_runtime_operations_duration_seconds_bucket{operation_type="pull_image",le="745.0580596923828"} 1
kubelet_runtime_operations_duration_seconds_bucket{operation_type="pull_image",le="+Inf"} 1

So the issue seems to be that either our downstream pick breaks the change (which is unlikely) or that for some reason this didn't end up in that nightly.
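The `le` values in the dump above look like a geometric series with factor 2.5 starting at 0.005 (an inference from the observed bounds, not a statement about the upstream code). A quick Python sketch reproduces them:

```python
def exponential_buckets(start, factor, count):
    """Geometric bucket upper bounds, in the spirit of Prometheus-style
    exponential buckets: start, start*factor, start*factor**2, ..."""
    bounds = []
    value = start
    for _ in range(count):
        bounds.append(value)
        value *= factor
    return bounds

# Matches the 14 finite bounds observed above, from 0.005 up to ~745 seconds.
bounds = exponential_buckets(0.005, 2.5, 14)
```

With a top finite bucket around 745 seconds, even the 15-minute pulls from the original report become distinguishable in the histogram.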
I don't know how to check for a given change in a given nightly, so someone who knows more details about that needs to check.
checked on version 4.7.0-0.nightly-2020-11-25-114114: 3 out of 3 workers get expected results, but only 1 out of 3 masters gets expected results; the other 2 do not.

$ oc get node
NAME                                         STATUS   ROLES    AGE   VERSION
ip-10-0-129-193.us-east-2.compute.internal   Ready    worker   29h   v1.19.2+ad738ba
ip-10-0-152-116.us-east-2.compute.internal   Ready    master   29h   v1.19.2+ad738ba
ip-10-0-167-166.us-east-2.compute.internal   Ready    master   29h   v1.19.2+ad738ba
ip-10-0-182-84.us-east-2.compute.internal    Ready    worker   29h   v1.19.2+ad738ba
ip-10-0-202-254.us-east-2.compute.internal   Ready    master   29h   v1.19.2+ad738ba
ip-10-0-221-71.us-east-2.compute.internal    Ready    worker   29h   v1.19.2+ad738ba

worker:
$ kubectl get --raw /api/v1/nodes/ip-10-0-129-193.us-east-2.compute.internal/proxy/metrics | grep -i "kubelet_runtime_operations_duration_seconds_bucket" | grep -i "pull_image"
kubelet_runtime_operations_duration_seconds_bucket{operation_type="pull_image",le="0.005"} 0
kubelet_runtime_operations_duration_seconds_bucket{operation_type="pull_image",le="0.0125"} 0
kubelet_runtime_operations_duration_seconds_bucket{operation_type="pull_image",le="0.03125"} 0
kubelet_runtime_operations_duration_seconds_bucket{operation_type="pull_image",le="0.078125"} 0
kubelet_runtime_operations_duration_seconds_bucket{operation_type="pull_image",le="0.1953125"} 0
kubelet_runtime_operations_duration_seconds_bucket{operation_type="pull_image",le="0.48828125"} 0
kubelet_runtime_operations_duration_seconds_bucket{operation_type="pull_image",le="1.220703125"} 133
kubelet_runtime_operations_duration_seconds_bucket{operation_type="pull_image",le="3.0517578125"} 360
kubelet_runtime_operations_duration_seconds_bucket{operation_type="pull_image",le="7.62939453125"} 362
kubelet_runtime_operations_duration_seconds_bucket{operation_type="pull_image",le="19.073486328125"} 362
kubelet_runtime_operations_duration_seconds_bucket{operation_type="pull_image",le="47.6837158203125"} 362
kubelet_runtime_operations_duration_seconds_bucket{operation_type="pull_image",le="119.20928955078125"} 362
kubelet_runtime_operations_duration_seconds_bucket{operation_type="pull_image",le="298.0232238769531"} 363
kubelet_runtime_operations_duration_seconds_bucket{operation_type="pull_image",le="745.0580596923828"} 363
kubelet_runtime_operations_duration_seconds_bucket{operation_type="pull_image",le="+Inf"} 363

master#1: unexpected
$ kubectl get --raw /api/v1/nodes/ip-10-0-152-116.us-east-2.compute.internal/proxy/metrics | grep -i "kubelet_runtime_operations_duration_seconds_bucket" | grep -i "pull_image"
$

master#2: expected
$ kubectl get --raw /api/v1/nodes/ip-10-0-167-166.us-east-2.compute.internal/proxy/metrics | grep -i "kubelet_runtime_operations_duration_seconds_bucket" | grep -i "pull_image"
kubelet_runtime_operations_duration_seconds_bucket{operation_type="pull_image",le="0.005"} 0
kubelet_runtime_operations_duration_seconds_bucket{operation_type="pull_image",le="0.0125"} 0
kubelet_runtime_operations_duration_seconds_bucket{operation_type="pull_image",le="0.03125"} 0
kubelet_runtime_operations_duration_seconds_bucket{operation_type="pull_image",le="0.078125"} 0
kubelet_runtime_operations_duration_seconds_bucket{operation_type="pull_image",le="0.1953125"} 0
kubelet_runtime_operations_duration_seconds_bucket{operation_type="pull_image",le="0.48828125"} 0
kubelet_runtime_operations_duration_seconds_bucket{operation_type="pull_image",le="1.220703125"} 0
kubelet_runtime_operations_duration_seconds_bucket{operation_type="pull_image",le="3.0517578125"} 0
kubelet_runtime_operations_duration_seconds_bucket{operation_type="pull_image",le="7.62939453125"} 1
kubelet_runtime_operations_duration_seconds_bucket{operation_type="pull_image",le="19.073486328125"} 3
kubelet_runtime_operations_duration_seconds_bucket{operation_type="pull_image",le="47.6837158203125"} 5
kubelet_runtime_operations_duration_seconds_bucket{operation_type="pull_image",le="119.20928955078125"} 5
kubelet_runtime_operations_duration_seconds_bucket{operation_type="pull_image",le="298.0232238769531"} 5
kubelet_runtime_operations_duration_seconds_bucket{operation_type="pull_image",le="745.0580596923828"} 5
kubelet_runtime_operations_duration_seconds_bucket{operation_type="pull_image",le="+Inf"} 5

master#3: unexpected
$ kubectl get --raw /api/v1/nodes/ip-10-0-202-254.us-east-2.compute.internal/proxy/metrics | grep -i "kubelet_runtime_operations_duration_seconds_bucket" | grep -i "pull_image"
$

@Peter Hunt, do you know why?
Hi @Ryan Phillips, what do you think about the issue I hit in Comment 11?
Tested again on 4.7.0-0.nightly-2020-11-29-133728: 1 out of 3 masters didn't get expected results.

$ oc get node
NAME                                         STATUS   ROLES    AGE   VERSION
ip-10-0-133-193.us-east-2.compute.internal   Ready    master   29h   v1.19.2+ad738ba
ip-10-0-138-5.us-east-2.compute.internal     Ready    worker   29h   v1.19.2+ad738ba
ip-10-0-172-210.us-east-2.compute.internal   Ready    master   29h   v1.19.2+ad738ba
ip-10-0-183-199.us-east-2.compute.internal   Ready    worker   29h   v1.19.2+ad738ba
ip-10-0-192-221.us-east-2.compute.internal   Ready    master   29h   v1.19.2+ad738ba
ip-10-0-219-122.us-east-2.compute.internal   Ready    worker   29h   v1.19.2+ad738ba

$ kubectl get --raw /api/v1/nodes/ip-10-0-192-221.us-east-2.compute.internal/proxy/metrics | grep -i "kubelet_runtime_operations_duration_seconds_bucket" | grep -i "pull_image"
$
This should be in the tree and working.
@minli we have a long running cluster that is currently on 4.7.0-rc.1 and it does have the metric on all masters:

$ for node in $(k get node|rg master|awk '{ print $1 }'); do echo "Node: $node"; kubectl get --raw /api/v1/nodes/$node/proxy/metrics | grep -i "kubelet_runtime_operations_duration_seconds_bucket" | grep -i "pull_image"; done
Node: ip-10-0-140-81.ec2.internal
kubelet_runtime_operations_duration_seconds_bucket{operation_type="pull_image",le="0.005"} 0
kubelet_runtime_operations_duration_seconds_bucket{operation_type="pull_image",le="0.0125"} 0
kubelet_runtime_operations_duration_seconds_bucket{operation_type="pull_image",le="0.03125"} 0
kubelet_runtime_operations_duration_seconds_bucket{operation_type="pull_image",le="0.078125"} 0
kubelet_runtime_operations_duration_seconds_bucket{operation_type="pull_image",le="0.1953125"} 0
kubelet_runtime_operations_duration_seconds_bucket{operation_type="pull_image",le="0.48828125"} 0
kubelet_runtime_operations_duration_seconds_bucket{operation_type="pull_image",le="1.220703125"} 1
kubelet_runtime_operations_duration_seconds_bucket{operation_type="pull_image",le="3.0517578125"} 24
kubelet_runtime_operations_duration_seconds_bucket{operation_type="pull_image",le="7.62939453125"} 40
kubelet_runtime_operations_duration_seconds_bucket{operation_type="pull_image",le="19.073486328125"} 53
kubelet_runtime_operations_duration_seconds_bucket{operation_type="pull_image",le="47.6837158203125"} 53
kubelet_runtime_operations_duration_seconds_bucket{operation_type="pull_image",le="119.20928955078125"} 53
kubelet_runtime_operations_duration_seconds_bucket{operation_type="pull_image",le="298.0232238769531"} 53
kubelet_runtime_operations_duration_seconds_bucket{operation_type="pull_image",le="745.0580596923828"} 53
kubelet_runtime_operations_duration_seconds_bucket{operation_type="pull_image",le="+Inf"} 53
Node: ip-10-0-159-123.ec2.internal
kubelet_runtime_operations_duration_seconds_bucket{operation_type="pull_image",le="0.005"} 0
kubelet_runtime_operations_duration_seconds_bucket{operation_type="pull_image",le="0.0125"} 0
kubelet_runtime_operations_duration_seconds_bucket{operation_type="pull_image",le="0.03125"} 0
kubelet_runtime_operations_duration_seconds_bucket{operation_type="pull_image",le="0.078125"} 0
kubelet_runtime_operations_duration_seconds_bucket{operation_type="pull_image",le="0.1953125"} 0
kubelet_runtime_operations_duration_seconds_bucket{operation_type="pull_image",le="0.48828125"} 0
kubelet_runtime_operations_duration_seconds_bucket{operation_type="pull_image",le="1.220703125"} 3
kubelet_runtime_operations_duration_seconds_bucket{operation_type="pull_image",le="3.0517578125"} 24
kubelet_runtime_operations_duration_seconds_bucket{operation_type="pull_image",le="7.62939453125"} 38
kubelet_runtime_operations_duration_seconds_bucket{operation_type="pull_image",le="19.073486328125"} 48
kubelet_runtime_operations_duration_seconds_bucket{operation_type="pull_image",le="47.6837158203125"} 55
kubelet_runtime_operations_duration_seconds_bucket{operation_type="pull_image",le="119.20928955078125"} 55
kubelet_runtime_operations_duration_seconds_bucket{operation_type="pull_image",le="298.0232238769531"} 55
kubelet_runtime_operations_duration_seconds_bucket{operation_type="pull_image",le="745.0580596923828"} 55
kubelet_runtime_operations_duration_seconds_bucket{operation_type="pull_image",le="+Inf"} 55
Node: ip-10-0-175-171.ec2.internal
kubelet_runtime_operations_duration_seconds_bucket{operation_type="pull_image",le="0.005"} 0
kubelet_runtime_operations_duration_seconds_bucket{operation_type="pull_image",le="0.0125"} 0
kubelet_runtime_operations_duration_seconds_bucket{operation_type="pull_image",le="0.03125"} 0
kubelet_runtime_operations_duration_seconds_bucket{operation_type="pull_image",le="0.078125"} 0
kubelet_runtime_operations_duration_seconds_bucket{operation_type="pull_image",le="0.1953125"} 0
kubelet_runtime_operations_duration_seconds_bucket{operation_type="pull_image",le="0.48828125"} 0
kubelet_runtime_operations_duration_seconds_bucket{operation_type="pull_image",le="1.220703125"} 0
kubelet_runtime_operations_duration_seconds_bucket{operation_type="pull_image",le="3.0517578125"} 0
kubelet_runtime_operations_duration_seconds_bucket{operation_type="pull_image",le="7.62939453125"} 0
kubelet_runtime_operations_duration_seconds_bucket{operation_type="pull_image",le="19.073486328125"} 0
kubelet_runtime_operations_duration_seconds_bucket{operation_type="pull_image",le="47.6837158203125"} 15
kubelet_runtime_operations_duration_seconds_bucket{operation_type="pull_image",le="119.20928955078125"} 15
kubelet_runtime_operations_duration_seconds_bucket{operation_type="pull_image",le="298.0232238769531"} 15
kubelet_runtime_operations_duration_seconds_bucket{operation_type="pull_image",le="745.0580596923828"} 15
kubelet_runtime_operations_duration_seconds_bucket{operation_type="pull_image",le="+Inf"} 15

I would assume that you didn't see it on all masters before because the larger buckets are only visible once something actually took that long (i.e. before at least one pull took 745 seconds or more, that bucket won't appear). Hope that helps!
Verified on 4.7.0-0.nightly-2021-02-09-224509, I now see the metrics.

$ oc get nodes -o wide
NAME   STATUS   ROLES   AGE   VERSION   INTERNAL-IP   EXTERNAL-IP   OS-IMAGE   KERNEL-VERSION   CONTAINER-RUNTIME
aiyengar-oc47-1102-hqddg-master-0.c.openshift-qe.internal   Ready   master   21h   v1.20.0+ba45583   10.0.0.4   <none>   Red Hat Enterprise Linux CoreOS 47.83.202102090044-0 (Ootpa)   4.18.0-240.10.1.el8_3.x86_64   cri-o://1.20.0-0.rhaos4.7.git8921e00.el8.51
aiyengar-oc47-1102-hqddg-master-1.c.openshift-qe.internal   Ready   master   21h   v1.20.0+ba45583   10.0.0.5   <none>   Red Hat Enterprise Linux CoreOS 47.83.202102090044-0 (Ootpa)   4.18.0-240.10.1.el8_3.x86_64   cri-o://1.20.0-0.rhaos4.7.git8921e00.el8.51
aiyengar-oc47-1102-hqddg-master-2.c.openshift-qe.internal   Ready   master   21h   v1.20.0+ba45583   10.0.0.3   <none>   Red Hat Enterprise Linux CoreOS 47.83.202102090044-0 (Ootpa)   4.18.0-240.10.1.el8_3.x86_64   cri-o://1.20.0-0.rhaos4.7.git8921e00.el8.51
aiyengar-oc47-1102-hqddg-worker-a-bc27t.c.openshift-qe.internal   Ready   worker   20h   v1.20.0+ba45583   10.0.32.2   <none>   Red Hat Enterprise Linux CoreOS 47.83.202102090044-0 (Ootpa)   4.18.0-240.10.1.el8_3.x86_64   cri-o://1.20.0-0.rhaos4.7.git8921e00.el8.51
aiyengar-oc47-1102-hqddg-worker-b-zz7vd.c.openshift-qe.internal   Ready   worker   20h   v1.20.0+ba45583   10.0.32.3   <none>   Red Hat Enterprise Linux CoreOS 47.83.202102090044-0 (Ootpa)   4.18.0-240.10.1.el8_3.x86_64   cri-o://1.20.0-0.rhaos4.7.git8921e00.el8.51
aiyengar-oc47-1102-hqddg-worker-c-dwkmc.c.openshift-qe.internal   Ready   worker   20h   v1.20.0+ba45583   10.0.32.4   <none>   Red Hat Enterprise Linux CoreOS 47.83.202102090044-0 (Ootpa)   4.18.0-240.10.1.el8_3.x86_64   cri-o://1.20.0-0.rhaos4.7.git8921e00.el8.51

$ kubectl get --raw /api/v1/nodes/aiyengar-oc47-1102-hqddg-master-0.c.openshift-qe.internal/proxy/metrics | grep -i "kubelet_runtime_operations_duration_seconds_bucket" | grep -i "pull_image"
kubelet_runtime_operations_duration_seconds_bucket{operation_type="pull_image",le="0.005"} 0
kubelet_runtime_operations_duration_seconds_bucket{operation_type="pull_image",le="0.0125"} 0
kubelet_runtime_operations_duration_seconds_bucket{operation_type="pull_image",le="0.03125"} 0
kubelet_runtime_operations_duration_seconds_bucket{operation_type="pull_image",le="0.078125"} 0
kubelet_runtime_operations_duration_seconds_bucket{operation_type="pull_image",le="0.1953125"} 0
kubelet_runtime_operations_duration_seconds_bucket{operation_type="pull_image",le="0.48828125"} 0
kubelet_runtime_operations_duration_seconds_bucket{operation_type="pull_image",le="1.220703125"} 0
kubelet_runtime_operations_duration_seconds_bucket{operation_type="pull_image",le="3.0517578125"} 4
kubelet_runtime_operations_duration_seconds_bucket{operation_type="pull_image",le="7.62939453125"} 25
kubelet_runtime_operations_duration_seconds_bucket{operation_type="pull_image",le="19.073486328125"} 49
kubelet_runtime_operations_duration_seconds_bucket{operation_type="pull_image",le="47.6837158203125"} 60
kubelet_runtime_operations_duration_seconds_bucket{operation_type="pull_image",le="119.20928955078125"} 60
kubelet_runtime_operations_duration_seconds_bucket{operation_type="pull_image",le="298.0232238769531"} 60
kubelet_runtime_operations_duration_seconds_bucket{operation_type="pull_image",le="745.0580596923828"} 60
kubelet_runtime_operations_duration_seconds_bucket{operation_type="pull_image",le="+Inf"} 60
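For anyone re-verifying dumps like the ones in this bug, a quick sanity check is to parse the `le`/count pairs and confirm the cumulative counts never decrease. A small Python sketch (the `parse_buckets` helper is hypothetical, written here just for illustration):

```python
import re

def parse_buckets(metrics_text, operation_type="pull_image"):
    """Extract (le, cumulative_count) pairs for one operation_type from a
    Prometheus text-format dump like the kubelet output above."""
    pattern = re.compile(
        r'kubelet_runtime_operations_duration_seconds_bucket'
        r'\{operation_type="%s",le="([^"]+)"\} (\d+)' % operation_type)
    pairs = []
    for line in metrics_text.splitlines():
        m = pattern.search(line)
        if m:
            le = float("inf") if m.group(1) == "+Inf" else float(m.group(1))
            pairs.append((le, int(m.group(2))))
    return pairs

# A few lines taken from the verified dump above.
sample = '''
kubelet_runtime_operations_duration_seconds_bucket{operation_type="pull_image",le="7.62939453125"} 25
kubelet_runtime_operations_duration_seconds_bucket{operation_type="pull_image",le="19.073486328125"} 49
kubelet_runtime_operations_duration_seconds_bucket{operation_type="pull_image",le="+Inf"} 60
'''
pairs = parse_buckets(sample)
# Histogram buckets are cumulative, so counts must be non-decreasing.
assert all(a[1] <= b[1] for a, b in zip(pairs, pairs[1:]))
```

An empty result from `parse_buckets` would correspond to the "unexpected" masters seen earlier in this bug, where the metric was absent entirely.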
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2020:5633