Bug 1733571

Summary: cadvisor not working as expected on some of the nodes
Product: OpenShift Container Platform
Component: Node
Version: 3.9.0
Target Release: 3.11.z
Hardware: Unspecified
OS: Linux
Reporter: Shivkumar Ople <sople>
Assignee: Ryan Phillips <rphillips>
QA Contact: Sunil Choudhary <schoudha>
CC: aos-bugs, jokerman, nagrawal, prdeshpa, rphillips, vjaypurk
Severity: urgent
Priority: unspecified
Status: CLOSED INSUFFICIENT_DATA
Type: Bug
Last Closed: 2020-05-13 22:17:33 UTC

Description Shivkumar Ople 2019-07-26 14:56:09 UTC
Description of problem:


cAdvisor is not fetching container metadata for some nodes in the cluster, while on other nodes it works fine.

cAdvisor was exposed on port 4194 on the nodes, following [1].

[1] https://access.redhat.com/solutions/3325151

Even after successfully applying [1], some nodes show no container metadata in the metrics.
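
For reference, a minimal sketch of the kind of change [1] makes, assuming the usual kubeletArguments mechanism in /etc/origin/node/node-config.yaml (the keys and steps below are assumed, not copied from [1]):

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Hedged sketch (assumed, not taken from [1]): enable the kubelet's read-only
# cAdvisor port via kubeletArguments in /etc/origin/node/node-config.yaml, e.g.
#
#   kubeletArguments:
#     cadvisor-port:
#       - "4194"
#
# then restart the node service and verify the port answers locally:
systemctl restart atomic-openshift-node
ss -lntp | grep 4194
curl -s 127.0.0.1:4194/metrics | head
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~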


On the problematic nodes, the curl command gives output like the following; note the empty container_name, image, name, namespace, and pod_name labels:

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

# curl 127.0.0.1:4194/metrics | grep container_cpu_user_seconds_total
  
# HELP container_cpu_user_seconds_total Cumulative user cpu time consumed in seconds.
# TYPE container_cpu_user_seconds_total counter
container_cpu_user_seconds_total{container_name="",id="/",image="",name="",namespace="",pod_name=""} 31780.82
container_cpu_user_seconds_total{container_name="",id="/kubepods.slice",image="",name="",namespace="",pod_name=""} 6750.53
container_cpu_user_seconds_total{container_name="",id="/kubepods.slice/kubepods-besteffort.slice",image="",name="",namespace="",pod_name=""} 2123.13
container_cpu_user_seconds_total{container_name="",id="/kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-pod015aa554_a7b2_11e9_9eb1_fa163ee00fc2.slice",image="",name="",namespace="",pod_name=""} 214.56
container_cpu_user_seconds_total{container_name="",id="/kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-pod5796438e_a464_11e9_b038_fa163ee00fc2.slice",image="",name="",namespace="",pod_name=""} 1684.44
container_cpu_user_seconds_total{container_name="",id="/kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-pod968d5540_a7af_11e9_9eb1_fa163ee00fc2.slice",image="",name="",namespace="",pod_name=""} 222.77
container_cpu_user_seconds_total{container_name="",id="/kubepods.slice/kubepods-burstable.slice",image="",name="",namespace="",pod_name=""} 4627.38
container_cpu_user_seconds_total{container_name="",id="/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod579aa9c8_a464_11e9_b038_fa163ee00fc2.slice",image="",name="",namespace="",pod_name=""} 2729.73
container_cpu_user_seconds_total{container_name="",id="/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod57a32e91_a464_11e9_b038_fa163ee00fc2.slice",image="",name="",namespace="",pod_name=""} 1196.94
container_cpu_user_seconds_total{container_name="",id="/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod5be99987_a465_11e9_b038_fa163ee00fc2.slice",image="",name="",namespace="",pod_name=""} 698.66
container_cpu_user_seconds_total{container_name="",id="/system.slice",image="",name="",namespace="",pod_name=""} 13962.65
container_cpu_user_seconds_total{container_name="",id="/system.slice/NetworkManager.service",image="",name="",namespace="",pod_name=""} 0
container_cpu_user_seconds_total{container_name="",id="/system.slice/auditd.service",image="",name="",namespace="",pod_name=""} 0
container_cpu_user_seconds_total{container_name="",id="/system.slice/chronyd.service",image="",name="",namespace="",pod_name=""} 0
container_cpu_user_seconds_total{container_name="",id="/system.slice/cloud-config.service",image="",name="",namespace="",pod_name=""} 0
container_cpu_user_seconds_total{container_name="",id="/system.slice/cloud-final.service",image="",name="",namespace="",pod_name=""} 0
container_cpu_user_seconds_total{container_name="",id="/system.slice/cloud-init-local.service",image="",name="",namespace="",pod_name=""} 0
container_cpu_user_seconds_total{container_name="",id="/system.slice/cloud-init.service",image="",name="",namespace="",pod_name=""} 0

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~


For non-problematic nodes the output is like the one below; the container metadata is present:

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
container_cpu_user_seconds_total{container_name="POD",id="/kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-pod015aa554_a7b2_11e9_9eb1_fa163ee00fc2.slice/docker-5153293b66167d228c0e9d213af8d80ecec7e389fbbd2c45964758c815f5aa07.scope",image="registry.redhat.io/openshift3/ose-pod:v3.11.117",name="k8s_POD_mediawiki-1-ttm6t_apb_015aa554-a7b2-11e9-9eb1-fa163ee00fc2_0",namespace="apb",pod_name="mediawiki-1-ttm6t"} 0.16
container_cpu_user_seconds_total{container_name="POD",id="/kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-pod5796438e_a464_11e9_b038_fa163ee00fc2.slice/docker-5b05ff460addcf54fd85e84ac2bf0a601af7a20373a18c4e2a6ae509c7885c69.scope",image="registry.redhat.io/openshift3/ose-pod:v3.11.117",name="k8s_POD_sync-l2h2n_openshift-node_5796438e-a464-11e9-b038-fa163ee00fc2_0",namespace="openshift-node",pod_name="sync-l2h2n"} 0.36
container_cpu_user_seconds_total{container_name="POD",id="/kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-pod968d5540_a7af_11e9_9eb1_fa163ee00fc2.slice/docker-12e9b33c57add056e4f249c9a4205e154b6f4de7dea4467263c5344efd189c25.scope",image="registry.redhat.io/openshift3/ose-pod:v3.11.117",name="k8s_POD_mediawiki-1-mkl26_default_968d5540-a7af-11e9-9eb1-fa163ee00fc2_0",namespace="default",pod_name="mediawiki-1-mkl26"} 0.16
container_cpu_user_seconds_total{container_name="POD",id="/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod579aa9c8_a464_11e9_b038_fa163ee00fc2.slice/docker-de3757137eada453dc87134d1902b234ea5744b1310c78090d52798f5f14bc3d.scope",image="registry.redhat.io/openshift3/ose-pod:v3.11.117",name="k8s_POD_ovs-h8l7k_openshift-sdn_579aa9c8-a464-11e9-b038-fa163ee00fc2_0",namespace="openshift-sdn",pod_name="ovs-h8l7k"} 0.33

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~



Version-Release number of selected component (if applicable):

3.9.60


How reproducible:

Not reproduced when tried on 3.9.85



Steps to Reproduce:

1. Configure [1] on all the nodes in the cluster, then check whether every node reports the container metrics correctly with the command below (a quick pass/fail check follows it).

  # curl 127.0.0.1:4194/metrics | grep container_cpu_user_seconds_total
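
As a quick triage aid (an assumed helper, not part of the original report), counting the series that carry a non-empty pod_name label distinguishes affected from healthy nodes:

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Count container_cpu_user_seconds_total series with a non-empty pod_name label.
# 0  -> container metadata missing (problematic node)
# >0 -> metadata present (healthy node)
curl -s 127.0.0.1:4194/metrics \
  | grep container_cpu_user_seconds_total \
  | grep -c 'pod_name="[^"]'
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~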


Actual results:

cAdvisor on some nodes is not showing the container metadata.

Expected results:

cAdvisor should expose container metadata on every node; the question is why only some nodes are affected.


Additional info:

I could see some ambiguous log messages, and they appear in the atomic-openshift-node logs of both nodes (affected and unaffected).

========================
worker 5: cAdvisor metrics do not work as expected


Jul 13 07:33:27 wrk05  atomic-openshift-node[89551]: E0713 07:33:27.520145   89551 cadvisor_stats_provider.go:355] Partial failure issuing cadvisor.ContainerInfoV2: partial failures: ["/system.slice/run-10516.scope": RecentStats: unable to find data for container /system.slice/run-10516.scope]


Jul 13 07:33:38 wrk05  atomic-openshift-node[89551]: E0713 07:33:38.267128   89551 cadvisor_stats_provider.go:355] Partial failure issuing cadvisor.ContainerInfoV2: partial failures: ["/system.slice/run-10516.scope": RecentStats: unable to find data for container /system.slice/run-10516.scope]


Jul 18 09:26:51 wrk05  atomic-openshift-node[89551]: E0718 09:26:51.659522   89551 cadvisor_stats_provider.go:355] Partial failure issuing cadvisor.ContainerInfoV2: partial failures: ["/system.slice/run-1426.scope": RecentStats: unable to find data for container /system.slice/run-1426.scope]




========================

worker 1: cAdvisor metrics work as expected

Jul 15 16:19:05 wrk01  atomic-openshift-node[55676]: E0715 16:19:05.833055   55676 cadvisor_stats_provider.go:355] Partial failure issuing cadvisor.ContainerInfoV2: partial failures: ["/kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-pod4325a2e5_a71c_11e9_aec8_fa163e4f0e94.slice/docker-56e84d4582cc4e038a96e4c17fb1b942752e5cbc88706f28a171fa2585639913.scope": RecentStats: unable to find data for container /kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-pod4325a2e5_a71c_11e9_aec8_fa163e4f0e94.slice/docker-56e84d4582cc4e038a96e4c17fb1b942752e5cbc88706f28a171fa2585639913.scope]


Jul 15 07:19:05 wrk01 atomic-openshift-node[55676]: E0715 07:19:05.716278   55676 cadvisor_stats_provider.go:355] Partial failure issuing cadvisor.ContainerInfoV2: partial failures: ["/kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-podd2f5a79b_a6d0_11e9_aec8_fa163e4f0e94.slice/docker-df810656e9745361c72a6854d9d0bd36ec250f757547e712b6dee8ef8c597039.scope": RecentStats: unable to find data for container /kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-podd2f5a79b_a6d0_11e9_aec8_fa163e4f0e94.slice/docker-df810656e9745361c72a6854d9d0bd36ec250f757547e712b6dee8ef8c597039.scope]


Jul 15 23:19:16 wrk01 atomic-openshift-node[55676]: E0715 23:19:16.635754   55676 cadvisor_stats_provider.go:355] Partial failure issuing cadvisor.ContainerInfoV2: partial failures: ["/system.slice/run-76196.scope": RecentStats: unable to find data for container /system.slice/run-76196.scope]


================

Comment 1 Ryan Phillips 2019-07-29 20:11:45 UTC
This is likely due to a race between docker and the kubelet starting [1]. The docker root is set up as a symlink pointing from the default docker path (/var/lib/docker) to the default libcontainers path (/var/lib/containers/docker) [2,3].

A potential workaround is to restart the kubelet after docker has started on the system, though this should not be needed if the link exists. Does the /var/lib/docker -> /var/lib/containers/docker link exist on your system?

[1] https://github.com/google/cadvisor/issues/1932
[2] https://github.com/openshift/openshift-ansible/blob/release-3.9/roles/container_runtime/defaults/main.yml#L53
[3] https://github.com/openshift/openshift-ansible/blob/release-3.9/roles/container_runtime/tasks/common/setup_docker_symlink.yml#L8-L43
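
If it helps with triage, the checks below (a sketch based on the hypothesis above; the commands are assumed, not taken from the linked playbooks) show whether the symlink exists and whether docker came up before the kubelet:

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Is the docker root the symlink the installer normally creates?
ls -ld /var/lib/docker        # expected: /var/lib/docker -> /var/lib/containers/docker
docker info 2>/dev/null | grep -i 'root dir'

# Did docker start before the kubelet? Compare the unit start times.
systemctl show -p ActiveEnterTimestamp docker atomic-openshift-node

# Workaround mentioned above: restart the kubelet once docker is up.
systemctl restart atomic-openshift-node
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~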

Comment 2 Shivkumar Ople 2019-07-30 06:39:29 UTC
(In reply to Ryan Phillips from comment #1)
> This is likely due to a race between docker and the kubelet starting [1].
> The docker root is set up as a symlink pointing from the default docker
> path (/var/lib/docker) to the default libcontainers path
> (/var/lib/containers/docker) [2,3].
> 
> A potential workaround is to restart the kubelet after docker has started on

    -- Restarted node service, but no luck.

> the system, though this should not be needed if the link exists. Does the
> /var/lib/docker -> /var/lib/containers/docker link exist on your system?

   -- No, the symlink is not present there:
  
   # ll /var/lib/
    drwx--x--x. 11 root    root     139 Jul 27 21:50 docker


> 
> [1] https://github.com/google/cadvisor/issues/1932
> [2]
> https://github.com/openshift/openshift-ansible/blob/release-3.9/roles/
> container_runtime/defaults/main.yml#L53
> [3]
> https://github.com/openshift/openshift-ansible/blob/release-3.9/roles/
> container_runtime/tasks/common/setup_docker_symlink.yml#L8-L43

Comment 3 Ryan Phillips 2019-07-30 14:12:21 UTC
Is this setup a CRI-O install or a docker install?

Can you provide the output of `docker info | grep -i root`?

Comment 4 Shivkumar Ople 2019-07-31 05:51:12 UTC
This is a Docker install.

>> Can you provide the output of `docker info | grep -i root`?

   -- Docker Root Dir: /var/lib/docker

Let me know if you require any other data.

Thank you,

Best,
Shivkumar Ople

Comment 5 Ryan Phillips 2019-08-06 18:57:53 UTC
This appears to be an environmental issue where docker/cadvisor and the kubelet have gotten out of sync somehow. There is very little information on how a node might get into this state. Since you already tried restarting the kubelet, is this a machine where we can try 're-initializing' docker: stopping docker, removing all the docker files, restarting docker, then restarting the kubelet? The following link shows that someone had some success with this approach. [1]

1. https://github.com/google/cadvisor/issues/1932#issue-315493628

Comment 6 Shivkumar Ople 2019-08-07 07:15:09 UTC
Hello Ryan,

Could you specify which docker files to remove exactly, from which location?

In the GitHub link, I couldn't see which specific files need to be removed.

[1] https://github.com/google/cadvisor/issues/1932#issue-315493628

Comment 7 Ryan Phillips 2019-08-07 15:59:35 UTC
We need to:

1. stop and uninstall docker
2. remove /var/lib/docker
3. reinstall and start docker

After these three steps, rebooting the node might be needed as well.
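
A sketch of those three steps on a docker-based node (the package name, drain flags, and <node> placeholder are assumed for this environment; adjust as needed, and drain the node first to limit the impact):

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
oc adm drain <node> --ignore-daemonsets --delete-local-data   # <node> is a placeholder
systemctl stop atomic-openshift-node
systemctl stop docker
yum remove -y docker                                          # step 1: stop and uninstall docker
rm -rf /var/lib/docker                                        # step 2: remove /var/lib/docker
yum install -y docker                                         # step 3: reinstall and start docker
systemctl start docker
systemctl start atomic-openshift-node
# Reboot the node if container metadata still does not appear in the metrics.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~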

Comment 8 Shivkumar Ople 2019-08-09 12:12:37 UTC
(In reply to Ryan Phillips from comment #7)
> We need to:
> 
> 1. stop and uninstall docker
> 2. remove /var/lib/docker
> 3. reinstall and start docker
> 
> After these three steps, rebooting the node might be needed as well.

 -- This would impact the customer's operations. Is there something else we could try that is less disruptive?

Comment 9 Ryan Phillips 2019-08-12 17:18:53 UTC
I am going to bump the cAdvisor dependency, since there are a number of fixes in 3.10. [1] should alleviate the empty label values reported here.

1. https://github.com/google/cadvisor/pull/1871