Bug 1733571
Summary: | cadvisor not working as expected on some of the nodes | ||
---|---|---|---|
Product: | OpenShift Container Platform | Reporter: | Shivkumar Ople <sople> |
Component: | Node | Assignee: | Ryan Phillips <rphillips> |
Status: | CLOSED INSUFFICIENT_DATA | QA Contact: | Sunil Choudhary <schoudha> |
Severity: | urgent | Docs Contact: | |
Priority: | unspecified | ||
Version: | 3.9.0 | CC: | aos-bugs, jokerman, nagrawal, prdeshpa, rphillips, vjaypurk |
Target Milestone: | --- | ||
Target Release: | 3.11.z | ||
Hardware: | Unspecified | ||
OS: | Linux | ||
Whiteboard: | |||
Fixed In Version: | Doc Type: | If docs needed, set a value | |
Doc Text: | Story Points: | --- | |
Clone Of: | Environment: | ||
Last Closed: | 2020-05-13 22:17:33 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: |
Description
Shivkumar Ople
2019-07-26 14:56:09 UTC
This is likely due to a race between docker and the kubelet starting [1]. The docker root is set up to be a symlink pointing from the default docker path (/var/lib/docker) to the default libcontainers path (/var/lib/containers/docker) [2,3].

A potential workaround is to restart the kubelet after docker has started on the system, though this should not be needed if the link exists. Does the /var/lib/docker -> /var/lib/containers/docker link exist on your system?

[1] https://github.com/google/cadvisor/issues/1932
[2] https://github.com/openshift/openshift-ansible/blob/release-3.9/roles/container_runtime/defaults/main.yml#L53
[3] https://github.com/openshift/openshift-ansible/blob/release-3.9/roles/container_runtime/tasks/common/setup_docker_symlink.yml#L8-L43

(In reply to Ryan Phillips from comment #1)
> This is likely due to a race between docker and the kubelet starting [1].
> The docker root is set up to be a symlink pointing from the default docker
> path (/var/lib/docker) to the default libcontainers path
> (/var/lib/containers/docker) [2,3].
>
> A potential workaround is to restart the kubelet after docker has started on
-- Restarted node service, but no luck.
> the system, though this should not be needed if the link exists. Does the
> /var/lib/docker -> /var/lib/containers/docker link exist on your system?
-- No, Symlink is not present there.
# ll /var/lib/
drwx--x--x. 11 root root 139 Jul 27 21:50 docker
>
> [1] https://github.com/google/cadvisor/issues/1932
> [2] https://github.com/openshift/openshift-ansible/blob/release-3.9/roles/container_runtime/defaults/main.yml#L53
> [3] https://github.com/openshift/openshift-ansible/blob/release-3.9/roles/container_runtime/tasks/common/setup_docker_symlink.yml#L8-L43

Is this setup a crio or docker install? Can you provide the output of `docker info | grep -i root`?

This is a Docker install.
>> Can you provide the output of `docker info | grep -i root`?
-- Docker Root Dir: /var/lib/docker
Let me know if you require any other data.
Thank you,
Best,
Shivkumar Ople
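
For anyone triaging a similar node, the two checks requested in this thread (does the /var/lib/docker symlink exist, and what root directory does docker report) could be sketched roughly as below. The helper name `describe_docker_root` is hypothetical and not part of any OpenShift or docker tooling; it only inspects the filesystem and assumes the default path from this bug.

```shell
# Hypothetical helper: report whether a docker root path is a symlink,
# a plain directory, or missing. Defaults to the path discussed in this bug.
describe_docker_root() {
    target="${1:-/var/lib/docker}"
    if [ -L "$target" ]; then
        # A symlink: show where it points.
        echo "symlink -> $(readlink "$target")"
    elif [ -d "$target" ]; then
        echo "directory"
    else
        echo "missing"
    fi
}

describe_docker_root /var/lib/docker

# The daemon's own view, as requested in the thread, can be compared with:
#   docker info | grep -i root
```

On the node in this report the helper would be expected to print `directory`, matching the `ll /var/lib/` output quoted in the thread, since the symlink is absent.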
This appears to be an environmental issue where docker/cadvisor and kubelet have gotten out of sync somehow. There is very little information on how a node might get into this state.

Since you tried restarting the kubelet, is this a machine where we can try 're-initializing' docker? That is: stopping docker, removing all the docker files, restarting docker, then restarting the kubelet? The following link shows someone had some success with this. [1]

[1] https://github.com/google/cadvisor/issues/1932#issue-315493628

Hello Ryan,

Could you specify exactly which docker files to remove, and from which location? The GitHub link does not mention any specific files that need to be removed.

[1] https://github.com/google/cadvisor/issues/1932#issue-315493628

We need to:

1. stop and uninstall docker
2. remove /var/lib/docker
3. reinstall and start docker

After these three steps, rebooting the node might be needed as well.

(In reply to Ryan Phillips from comment #7)
> We need to:
>
> 1. stop and uninstall docker
> 2. remove /var/lib/docker
> 3. reinstall and start docker
>
> After these three steps, rebooting the node might be needed as well.
-- This would impact the customer's operations. Is there something else we could try that is less disruptive?

I am going to bump the cadvisor dependency, since there are a number of fixes in 3.10. [1] should alleviate the empty strings reported.

[1] https://github.com/google/cadvisor/pull/1871
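
As an illustration only, the three re-initialization steps listed in comment #7 might be scripted along the following lines. This is a sketch under stated assumptions, not a procedure confirmed in this bug: the package name `docker`, the unit names `docker` and `atomic-openshift-node`, and the choice to stop the node service first are all assumptions to verify on the node. The functions `run` and `reinit_docker` are hypothetical, and the script defaults to a dry run that only prints what it would do.

```shell
# DRY_RUN=1 (the default here) only prints each command; DRY_RUN=0 executes it.
DRY_RUN="${DRY_RUN:-1}"

# Hypothetical wrapper: print the command in dry-run mode, otherwise run it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

# Hypothetical function following the steps from comment #7.
# Unit and package names below are assumptions; check them before use.
reinit_docker() {
    run systemctl stop atomic-openshift-node   # assumed node (kubelet) unit
    run systemctl stop docker                  # step 1: stop docker ...
    run yum remove -y docker                   # ... and uninstall it
    run rm -rf /var/lib/docker                 # step 2: remove /var/lib/docker
    run yum install -y docker                  # step 3: reinstall ...
    run systemctl start docker                 # ... and start docker
    run systemctl start atomic-openshift-node  # bring the node service back
}
```

Calling `reinit_docker` with the default `DRY_RUN=1` prints the planned commands without touching the system. Since removing /var/lib/docker deletes all local images and containers, and the thread notes this would impact the customer's operations, the dry-run output is worth reviewing before running with `DRY_RUN=0`. A reboot may still be needed afterwards, as comment #7 mentions.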