Description of problem:

cAdvisor is not fetching container metadata on some nodes in the cluster, while on other nodes it works fine. cAdvisor was configured on port 4194 on the nodes using [1].

[1] https://access.redhat.com/solutions/3325151

Even after configuring [1] successfully, some nodes still do not expose the container metadata in the metrics. On the problematic nodes, the curl command gives output as below:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# curl 127.0.0.1:4194/metrics | grep container_cpu_user_seconds_total
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
# HELP container_cpu_user_seconds_total Cumulative user cpu time consumed in seconds.
# TYPE container_cpu_user_seconds_total counter
container_cpu_user_seconds_total{container_name="",id="/",image="",name="",namespace="",pod_name=""} 31780.82
container_cpu_user_seconds_total{container_name="",id="/kubepods.slice",image="",name="",namespace="",pod_name=""} 6750.53
container_cpu_user_seconds_total{container_name="",id="/kubepods.slice/kubepods-besteffort.slice",image="",name="",namespace="",pod_name=""} 2123.13
container_cpu_user_seconds_total{container_name="",id="/kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-pod015aa554_a7b2_11e9_9eb1_fa163ee00fc2.slice",image="",name="",namespace="",pod_name=""} 214.56
container_cpu_user_seconds_total{container_name="",id="/kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-pod5796438e_a464_11e9_b038_fa163ee00fc2.slice",image="",name="",namespace="",pod_name=""} 1684.44
container_cpu_user_seconds_total{container_name="",id="/kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-pod968d5540_a7af_11e9_9eb1_fa163ee00fc2.slice",image="",name="",namespace="",pod_name=""} 222.77
container_cpu_user_seconds_total{container_name="",id="/kubepods.slice/kubepods-burstable.slice",image="",name="",namespace="",pod_name=""} 4627.38
container_cpu_user_seconds_total{container_name="",id="/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod579aa9c8_a464_11e9_b038_fa163ee00fc2.slice",image="",name="",namespace="",pod_name=""} 2729.73
container_cpu_user_seconds_total{container_name="",id="/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod57a32e91_a464_11e9_b038_fa163ee00fc2.slice",image="",name="",namespace="",pod_name=""} 1196.94
container_cpu_user_seconds_total{container_name="",id="/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod5be99987_a465_11e9_b038_fa163ee00fc2.slice",image="",name="",namespace="",pod_name=""} 698.66
container_cpu_user_seconds_total{container_name="",id="/system.slice",image="",name="",namespace="",pod_name=""} 13962.65
container_cpu_user_seconds_total{container_name="",id="/system.slice/NetworkManager.service",image="",name="",namespace="",pod_name=""} 0
container_cpu_user_seconds_total{container_name="",id="/system.slice/auditd.service",image="",name="",namespace="",pod_name=""} 0
container_cpu_user_seconds_total{container_name="",id="/system.slice/chronyd.service",image="",name="",namespace="",pod_name=""} 0
container_cpu_user_seconds_total{container_name="",id="/system.slice/cloud-config.service",image="",name="",namespace="",pod_name=""} 0
container_cpu_user_seconds_total{container_name="",id="/system.slice/cloud-final.service",image="",name="",namespace="",pod_name=""} 0
container_cpu_user_seconds_total{container_name="",id="/system.slice/cloud-init-local.service",image="",name="",namespace="",pod_name=""} 0
container_cpu_user_seconds_total{container_name="",id="/system.slice/cloud-init.service",image="",name="",namespace="",pod_name=""} 0
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

# On non-problematic nodes the output looks like the following, and the container metadata is visible:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
container_cpu_user_seconds_total{container_name="POD",id="/kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-pod015aa554_a7b2_11e9_9eb1_fa163ee00fc2.slice/docker-5153293b66167d228c0e9d213af8d80ecec7e389fbbd2c45964758c815f5aa07.scope",image="registry.redhat.io/openshift3/ose-pod:v3.11.117",name="k8s_POD_mediawiki-1-ttm6t_apb_015aa554-a7b2-11e9-9eb1-fa163ee00fc2_0",namespace="apb",pod_name="mediawiki-1-ttm6t"} 0.16
container_cpu_user_seconds_total{container_name="POD",id="/kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-pod5796438e_a464_11e9_b038_fa163ee00fc2.slice/docker-5b05ff460addcf54fd85e84ac2bf0a601af7a20373a18c4e2a6ae509c7885c69.scope",image="registry.redhat.io/openshift3/ose-pod:v3.11.117",name="k8s_POD_sync-l2h2n_openshift-node_5796438e-a464-11e9-b038-fa163ee00fc2_0",namespace="openshift-node",pod_name="sync-l2h2n"} 0.36
container_cpu_user_seconds_total{container_name="POD",id="/kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-pod968d5540_a7af_11e9_9eb1_fa163ee00fc2.slice/docker-12e9b33c57add056e4f249c9a4205e154b6f4de7dea4467263c5344efd189c25.scope",image="registry.redhat.io/openshift3/ose-pod:v3.11.117",name="k8s_POD_mediawiki-1-mkl26_default_968d5540-a7af-11e9-9eb1-fa163ee00fc2_0",namespace="default",pod_name="mediawiki-1-mkl26"} 0.16
container_cpu_user_seconds_total{container_name="POD",id="/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod579aa9c8_a464_11e9_b038_fa163ee00fc2.slice/docker-de3757137eada453dc87134d1902b234ea5744b1310c78090d52798f5f14bc3d.scope",image="registry.redhat.io/openshift3/ose-pod:v3.11.117",name="k8s_POD_ovs-h8l7k_openshift-sdn_579aa9c8-a464-11e9-b038-fa163ee00fc2_0",namespace="openshift-sdn",pod_name="ovs-h8l7k"} 0.33
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Version-Release number of selected component (if applicable):
3.9.60

How reproducible:
Not reproduced when tried on 3.9.85.

Steps to Reproduce:
1. Configure [1] on all the nodes in the cluster, then check whether every node exposes the container metrics with metadata, using the command below (a scripted version of this check is sketched after the Expected results section):
# curl 127.0.0.1:4194/metrics | grep container_cpu_user_seconds_total

Actual results:
cAdvisor on some nodes does not show the container metadata.

Expected results:
cAdvisor should work properly on every node; why do some nodes have this problem?
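A hedged sketch of scripting the check from the Steps to Reproduce across several nodes. The node names wrk01/wrk05 are placeholders, and it assumes SSH access from a bastion host and cAdvisor bound to 127.0.0.1:4194 on every node as per [1]:
~~~
# Illustrative only: count metric series that carry per-container (docker scope) metadata.
# A node affected by this issue reports 0; healthy nodes report a positive number.
for node in wrk01 wrk05; do    # placeholder node names
    count=$(ssh "$node" \
        "curl -s 127.0.0.1:4194/metrics | grep container_cpu_user_seconds_total | grep -c 'docker-.*\.scope'")
    echo "$node: $count container-level series with metadata"
done
~~~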
Additional info:
Some ambiguous log messages appear in the atomic-openshift-node logs on both the problematic and the non-problematic nodes:

======================== worker 5: cAdvisor metrics do not work as expected
Jul 13 07:33:27 wrk05 atomic-openshift-node[89551]: E0713 07:33:27.520145   89551 cadvisor_stats_provider.go:355] Partial failure issuing cadvisor.ContainerInfoV2: partial failures: ["/system.slice/run-10516.scope": RecentStats: unable to find data for container /system.slice/run-10516.scope]
Jul 13 07:33:38 wrk05 atomic-openshift-node[89551]: E0713 07:33:38.267128   89551 cadvisor_stats_provider.go:355] Partial failure issuing cadvisor.ContainerInfoV2: partial failures: ["/system.slice/run-10516.scope": RecentStats: unable to find data for container /system.slice/run-10516.scope]
Jul 18 09:26:51 wrk05 atomic-openshift-node[89551]: E0718 09:26:51.659522   89551 cadvisor_stats_provider.go:355] Partial failure issuing cadvisor.ContainerInfoV2: partial failures: ["/system.slice/run-1426.scope": RecentStats: unable to find data for container /system.slice/run-1426.scope]
======================== worker 1: cAdvisor metrics work as expected
Jul 15 16:19:05 wrk01 atomic-openshift-node[55676]: E0715 16:19:05.833055   55676 cadvisor_stats_provider.go:355] Partial failure issuing cadvisor.ContainerInfoV2: partial failures: ["/kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-pod4325a2e5_a71c_11e9_aec8_fa163e4f0e94.slice/docker-56e84d4582cc4e038a96e4c17fb1b942752e5cbc88706f28a171fa2585639913.scope": RecentStats: unable to find data for container /kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-pod4325a2e5_a71c_11e9_aec8_fa163e4f0e94.slice/docker-56e84d4582cc4e038a96e4c17fb1b942752e5cbc88706f28a171fa2585639913.scope]
Jul 15 07:19:05 wrk01 atomic-openshift-node[55676]: E0715 07:19:05.716278   55676 cadvisor_stats_provider.go:355] Partial failure issuing cadvisor.ContainerInfoV2: partial failures: ["/kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-podd2f5a79b_a6d0_11e9_aec8_fa163e4f0e94.slice/docker-df810656e9745361c72a6854d9d0bd36ec250f757547e712b6dee8ef8c597039.scope": RecentStats: unable to find data for container /kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-podd2f5a79b_a6d0_11e9_aec8_fa163e4f0e94.slice/docker-df810656e9745361c72a6854d9d0bd36ec250f757547e712b6dee8ef8c597039.scope]
Jul 15 23:19:16 wrk01 atomic-openshift-node[55676]: E0715 23:19:16.635754   55676 cadvisor_stats_provider.go:355] Partial failure issuing cadvisor.ContainerInfoV2: partial failures: ["/system.slice/run-76196.scope": RecentStats: unable to find data for container /system.slice/run-76196.scope]
================
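A hedged example of pulling these messages straight from the node logs (assumes systemd journald and the atomic-openshift-node unit name shown above; adjust the time window as needed):
~~~
# List recent cAdvisor partial-failure messages reported by the node service.
journalctl -u atomic-openshift-node --since "24 hours ago" \
    | grep 'Partial failure issuing cadvisor.ContainerInfoV2'
~~~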
This is likely due to a race between docker and the kubelet starting [1]. The docker root is set up to be a symlink pointing from the default docker path (/var/lib/docker) to the default libcontainers path (/var/lib/containers/docker) [2,3].

A potential workaround is to restart the kubelet after docker has started on the system, though this should not be needed if the link exists. Does the /var/lib/docker -> /var/lib/containers/docker link exist on your system?

[1] https://github.com/google/cadvisor/issues/1932
[2] https://github.com/openshift/openshift-ansible/blob/release-3.9/roles/container_runtime/defaults/main.yml#L53
[3] https://github.com/openshift/openshift-ansible/blob/release-3.9/roles/container_runtime/tasks/common/setup_docker_symlink.yml#L8-L43
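To answer the symlink question, a minimal check (an illustrative sketch, run as root on the affected node; the expected target comes from the openshift-ansible defaults in [2]):
~~~
# Report whether /var/lib/docker is a symlink and, if so, where it points.
# Expected on a correctly provisioned node: /var/lib/docker -> /var/lib/containers/docker
if [ -L /var/lib/docker ]; then
    echo "/var/lib/docker -> $(readlink -f /var/lib/docker)"
else
    echo "/var/lib/docker is NOT a symlink (plain directory or missing)"
fi
~~~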
(In reply to Ryan Phillips from comment #1)
> This is likely due to a race between docker and the kubelet starting [1].
> The docker root is set up to be a symlink pointing from the default docker
> path (/var/lib/docker) to the default libcontainers path
> (/var/lib/containers/docker) [2,3].
>
> A potential workaround is to restart the kubelet after docker has started on
> the system, though this should not be needed if the link exists.

-- Restarted the node service, but no luck.

> Does the /var/lib/docker -> /var/lib/containers/docker link exist on your system?

-- No, the symlink is not present there:

# ll /var/lib/
drwx--x--x. 11 root root 139 Jul 27 21:50 docker

> [1] https://github.com/google/cadvisor/issues/1932
> [2] https://github.com/openshift/openshift-ansible/blob/release-3.9/roles/container_runtime/defaults/main.yml#L53
> [3] https://github.com/openshift/openshift-ansible/blob/release-3.9/roles/container_runtime/tasks/common/setup_docker_symlink.yml#L8-L43
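For reference, a rough sketch of how such a symlink could be created by hand. This only mirrors the general idea of the openshift-ansible task in [3]; the paths, service names, and ordering here are assumptions, so consult the actual task before touching a production node:
~~~
# ILLUSTRATIVE ONLY - verify against setup_docker_symlink.yml [3] before running anything.
# Assumes the docker root can be moved while docker and the node service are stopped.
systemctl stop atomic-openshift-node
systemctl stop docker

mkdir -p /var/lib/containers
mv /var/lib/docker /var/lib/containers/docker       # relocate the existing docker root
ln -s /var/lib/containers/docker /var/lib/docker    # link the old path back to it

systemctl start docker
systemctl start atomic-openshift-node
~~~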
Is this setup a CRI-O or a Docker install? Can you provide the output of `docker info | grep -i root`?
This is a Docker install.

>> Can you provide the output of `docker info | grep -i root`?

-- Docker Root Dir: /var/lib/docker

Let me know if you require any other data.

Thank you,
Best,
Shivkumar Ople
This appears to be an environmental issue where docker/cadvisor and the kubelet have gotten out of sync somehow. There is very little information on how a node might get into this state.

Since you tried restarting the kubelet, is this a machine where we can try 're-initializing' docker? Stopping docker, removing all the docker files, restarting docker, then restarting the kubelet? The following link shows someone had some success with this [1].

1. https://github.com/google/cadvisor/issues/1932#issue-315493628
Hello Ryan,

Could you specify exactly which docker files to remove, and from which location? I couldn't see any specific files that need to be removed in the GitHub link [1].

[1] https://github.com/google/cadvisor/issues/1932#issue-315493628
We need to:

1. stop and uninstall docker
2. remove /var/lib/docker
3. reinstall and start docker

After these three steps, rebooting the node might be needed as well.
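For illustration, a hedged sketch of those steps as shell commands. Assumptions: an RPM-based node where docker was installed via yum, and the node service named atomic-openshift-node as in the logs above; removing /var/lib/docker destroys local images and containers, so schedule accordingly:
~~~
# ILLUSTRATIVE SKETCH of the re-initialization steps above - adapt to your environment.
systemctl stop atomic-openshift-node          # stop the kubelet first
systemctl stop docker
yum remove -y docker                          # 1. stop and uninstall docker

rm -rf /var/lib/docker                        # 2. remove the docker root

yum install -y docker                         # 3. reinstall and start docker
systemctl start docker
systemctl start atomic-openshift-node

# Reboot the node if the metrics still lack container metadata:
# systemctl reboot
~~~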
(In reply to Ryan Phillips from comment #7)
> We need to:
>
> 1. stop and uninstall docker
> 2. remove /var/lib/docker
> 3. reinstall and start docker
>
> After these three steps, rebooting the node might be needed as well.

-- This would impact the customer's operations. Is there something else we could try that is less disruptive?
I am going to bump the cadvisor dependency, since there are a number of fixes in 3.10. [1] should alleviate the empty strings reported.

1. https://github.com/google/cadvisor/pull/1871