Bug 1568583 - failed to create a sandbox for pod "ovs-rpljr": Error response from daemon: lstat /var/lib/docker/overlay2/09a5b1f6274d724b06bde4ba9f93bd7ac7254c56dfcf72b5989c47806de6e47c: no such file or directory
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Containers
Version: 3.10.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: 3.10.0
Assignee: Antonio Murdaca
QA Contact: DeShuai Ma
URL:
Whiteboard: aos-scalability-310
Duplicates: 1570163
Depends On:
Blocks:
 
Reported: 2018-04-17 20:01 UTC by Mike Fiedler
Modified: 2018-04-24 15:37 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-04-24 15:36:01 UTC
Target Upstream Version:


Attachments
Node/docker logs (71.78 KB, application/x-gzip)
2018-04-17 20:01 UTC, Mike Fiedler
Docker/atomic-openshift-node + journalctl -xe logs (92.89 KB, application/x-xz)
2018-04-24 06:21 UTC, jmencak

Description Mike Fiedler 2018-04-17 20:01:07 UTC
Description of problem:

After installing 3.10.0-0.22.0 on AWS (cloudprovider enabled), all of the nodes are NotReady. Rebooting fixes them; restarting atomic-openshift-node and docker does not.

Starting this issue with Networking - please move as needed.

03179    8355 cni.go:171] Unable to update cni config: No networks found in /etc/cni/net.d
03370    8355 kubelet.go:2125] Container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: cni config uninitialized
05042    8355 cni.go:171] Unable to update cni config: No networks found in /etc/cni/net.d
05216    8355 kubelet.go:2125] Container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: cni config uninitialized
06824    8355 cni.go:171] Unable to update cni config: No networks found in /etc/cni/net.d
06981    8355 kubelet.go:2125] Container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: cni config uninitialized



Version-Release number of selected component (if applicable): 3.10.0-0.22.0


How reproducible: Always


Steps to Reproduce:
1. AWS install (see inventory below) - 1 master, 1 infra, 2 computes
2. After the install oc get nodes shows computes as NotReady
3. Rebooting the compute fixes the issue

Full node/docker logs attached.

Actual results:

Nodes NotReady after install

Expected results:

Nodes ready and schedulable

Additional info:

Inventory with credentials redacted:

[OSEv3:children]
masters
nodes

etcd





[OSEv3:vars]

#The following parameters is used by post-actions
iaas_name=AWS
use_rpm_playbook=false
openshift_playbook_rpm_repos=[{'id': 'aos-playbook-rpm', 'name': 'aos-playbook-rpm', 'baseurl': 'http://download.eng.bos.redhat.com/rcm-guest/puddles/RHAOS/AtomicOpenShift/3.10/latest/x86_64/os', 'enabled': 1, 'gpgcheck': 0}]




update_is_images_url=registry.reg-aws.openshift.com:443











#The following parameters is used by openshift-ansible
ansible_ssh_user=root




openshift_cloudprovider_kind=aws

openshift_cloudprovider_aws_access_key=<redacted>


openshift_cloudprovider_aws_secret_key=<redacted>












openshift_master_default_subdomain_enable=true
openshift_master_default_subdomain=apps.0417-ezg.qe.rhcloud.com




openshift_auth_type=allowall

openshift_master_identity_providers=[{'name': 'allow_all', 'login': 'true', 'challenge': 'true', 'kind': 'AllowAllPasswordIdentityProvider'}]




 openshift_node_labels="{'region': 'primary', 'zone': 'default'}"



openshift_release=v3.10
openshift_deployment_type=openshift-enterprise
openshift_cockpit_deployer_prefix=registry.reg-aws.openshift.com:443/openshift3/
oreg_url=registry.reg-aws.openshift.com:443/openshift3/ose-${component}:${version}
oreg_auth_user=<redacted>
oreg_auth_password=<redacted>
openshift_docker_additional_registries=registry.reg-aws.openshift.com:443
openshift_docker_insecure_registries=registry.reg-aws.openshift.com:443
openshift_service_catalog_image_prefix=registry.reg-aws.openshift.com:443/openshift3/ose-
ansible_service_broker_image_prefix=registry.reg-aws.openshift.com:443/openshift3/ose-
ansible_service_broker_image_tag=v3.10
template_service_broker_selector={"region": "infra"}
template_service_broker_prefix=registry.reg-aws.openshift.com:443/openshift3/ose-
template_service_broker_version=v3.10
openshift_web_console_prefix=registry.reg-aws.openshift.com:443/openshift3/ose-
openshift_enable_service_catalog=true
osm_cockpit_plugins=['cockpit-kubernetes']
osm_use_cockpit=false
openshift_docker_options=--log-opt max-size=100M --log-opt max-file=3 --signature-verification=false
use_cluster_metrics=true
openshift_master_cluster_method=native
openshift_master_dynamic_provisioning_enabled=true
openshift_hosted_router_registryurl=registry.reg-aws.openshift.com:443/openshift3/ose-${component}:${version}
openshift_hosted_registry_registryurl=registry.reg-aws.openshift.com:443/openshift3/ose-${component}:${version}
osm_default_node_selector=region=primary
openshift_registry_selector="region=infra,zone=default"
openshift_hosted_router_selector="region=infra,zone=default"
openshift_disable_check=disk_availability,memory_availability,package_availability,docker_image_availability,docker_storage,package_version
osm_host_subnet_length=9
openshift_node_kubelet_args={"pods-per-core": ["0"], "max-pods": ["510"]}
debug_level=2
openshift_set_hostname=true
openshift_override_hostname_check=true
os_sdn_network_plugin_name=redhat/openshift-ovs-networkpolicy
openshift_hosted_router_replicas=1
openshift_hosted_registry_storage_kind=object
openshift_hosted_registry_storage_provider=s3
openshift_hosted_registry_storage_s3_accesskey=<redacted>
openshift_hosted_registry_storage_s3_secretkey=<redacted>
openshift_hosted_registry_storage_s3_bucket=aoe-svt-test
openshift_hosted_registry_storage_s3_region=us-west-2
openshift_hosted_registry_replicas=1
openshift_hosted_prometheus_deploy=true
openshift_prometheus_image_prefix=registry.reg-aws.openshift.com:443/openshift3/
openshift_prometheus_image_version=v3.10
openshift_prometheus_proxy_image_prefix=registry.reg-aws.openshift.com:443/openshift3/
openshift_prometheus_proxy_image_version=v3.10
openshift_prometheus_alertmanager_image_prefix=registry.reg-aws.openshift.com:443/openshift3/
openshift_prometheus_alertmanager_image_version=v3.10
openshift_prometheus_alertbuffer_image_prefix=registry.reg-aws.openshift.com:443/openshift3/
openshift_prometheus_alertbuffer_image_version=v3.10
openshift_prometheus_node_selector={"region": "infra"}
openshift_metrics_install_metrics=false
openshift_metrics_image_prefix=registry.reg-aws.openshift.com:443/openshift3/
openshift_metrics_image_version=v3.10
openshift_metrics_cassandra_storage_type=dynamic
openshift_metrics_cassandra_pvc_size=25Gi
openshift_logging_install_logging=false
openshift_logging_image_prefix=registry.reg-aws.openshift.com:443/openshift3/
openshift_logging_image_version=v3.10
openshift_logging_storage_kind=dynamic
openshift_logging_es_pvc_size=50Gi
openshift_logging_es_pvc_dynamic=true
openshift_logging_curator_nodeselector={"region": "infra"}
openshift_logging_kibana_nodeselector={"region": "infra"}
openshift_logging_es_nodeselector={"region": "infra"}
openshift_clusterid=mffiedler
containerized=false
openshift_use_system_containers=false
openshift_use_crio=false




[lb]


[etcd]
ec2-54-191-146-101.us-west-2.compute.amazonaws.com  openshift_public_hostname=ec2-54-191-146-101.us-west-2.compute.amazonaws.com


[masters]
ec2-54-191-146-101.us-west-2.compute.amazonaws.com  openshift_public_hostname=ec2-54-191-146-101.us-west-2.compute.amazonaws.com



[nodes]
ec2-54-191-146-101.us-west-2.compute.amazonaws.com  openshift_public_hostname=ec2-54-191-146-101.us-west-2.compute.amazonaws.com openshift_node_labels="{'region': 'master', 'zone': 'default'}" openshift_schedulable=true

ec2-54-68-62-115.us-west-2.compute.amazonaws.com  openshift_public_hostname=ec2-54-68-62-115.us-west-2.compute.amazonaws.com openshift_node_labels="{'region': 'infra', 'zone': 'default'}"

ec2-54-68-62-115.us-west-2.compute.amazonaws.com  openshift_public_hostname=ec2-54-68-62-115.us-west-2.compute.amazonaws.com openshift_node_labels="{'region': 'infra', 'zone': 'default'}"


ec2-34-214-63-108.us-west-2.compute.amazonaws.com  openshift_public_hostname=ec2-34-214-63-108.us-west-2.compute.amazonaws.com openshift_node_labels="{'region': 'primary', 'zone': 'default'}"
ec2-34-214-102-13.us-west-2.compute.amazonaws.com  openshift_public_hostname=ec2-34-214-102-13.us-west-2.compute.amazonaws.com openshift_node_labels="{'region': 'primary', 'zone': 'default'}"

Comment 1 Mike Fiedler 2018-04-17 20:01:34 UTC
Created attachment 1423249 [details]
Node/docker logs

Comment 2 Scott Dodson 2018-04-17 20:03:49 UTC
This also seems to happen during upgrades from 3.9 on any host other than the first master. I'm looking at it, but would love assistance from Networking or Clayton.

Comment 3 Mike Fiedler 2018-04-17 20:05:23 UTC
Possibly related to https://github.com/openshift/openshift-ansible/issues/7967

Comment 4 Mike Fiedler 2018-04-17 20:06:23 UTC
Let me know what I can gather.  I have a good reproducer now.

Comment 5 jmencak 2018-04-18 06:05:01 UTC
I wasn't able to reproduce this on a small 8-node KVM environment on a single bare-metal host, but I'm seeing this on an OpenStack environment now too.  This seems to affect only a certain percentage of nodes.

[openshift@master-0 ~]$ oc get nodes
NAME                                STATUS     ROLES           AGE       VERSION
app-node-0.scale-ci.example.com     Ready      compute         13m       v1.10.0+b81c8f8
app-node-1.scale-ci.example.com     Ready      compute         13m       v1.10.0+b81c8f8
cns-0.scale-ci.example.com          Ready      compute         13m       v1.10.0+b81c8f8
cns-1.scale-ci.example.com          Ready      compute         13m       v1.10.0+b81c8f8
cns-2.scale-ci.example.com          Ready      compute         13m       v1.10.0+b81c8f8
infra-node-0.scale-ci.example.com   NotReady   compute,infra   13m       v1.10.0+b81c8f8
infra-node-1.scale-ci.example.com   NotReady   compute,infra   13m       v1.10.0+b81c8f8
infra-node-2.scale-ci.example.com   NotReady   compute,infra   13m       v1.10.0+b81c8f8
master-0.scale-ci.example.com       Ready      master          15m       v1.10.0+b81c8f8
master-1.scale-ci.example.com       Ready      master          15m       v1.10.0+b81c8f8
master-2.scale-ci.example.com       Ready      master          15m       v1.10.0+b81c8f8

root@infra-node-0: /home/openshift # find /etc/cni/
/etc/cni/
/etc/cni/net.d

/etc/cni/net.d is empty on all infra nodes which are NotReady.

Test blocker for scale testing.
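
For anyone triaging a larger cluster, the affected nodes can be pulled out of the `oc get nodes` table with a few lines of Python. This is only a sketch: the function name `not_ready_nodes` is made up for illustration, and it assumes the default column layout shown above (NAME first, STATUS second).

```python
def not_ready_nodes(oc_output):
    """Return names of nodes whose STATUS column does not include 'Ready'.

    Assumes the default `oc get nodes` layout: a header row, then one row
    per node with NAME in column 1 and STATUS in column 2.
    """
    names = []
    for line in oc_output.strip().splitlines()[1:]:  # skip the header row
        fields = line.split()
        # STATUS may be compound, e.g. "Ready,SchedulingDisabled"
        if len(fields) >= 2 and "Ready" not in fields[1].split(","):
            names.append(fields[0])
    return names


# Rows copied from the listing above.
SAMPLE = """\
NAME                                STATUS     ROLES           AGE       VERSION
app-node-0.scale-ci.example.com     Ready      compute         13m       v1.10.0+b81c8f8
infra-node-0.scale-ci.example.com   NotReady   compute,infra   13m       v1.10.0+b81c8f8
master-0.scale-ci.example.com       Ready      master          15m       v1.10.0+b81c8f8
"""

print(not_ready_nodes(SAMPLE))
```

The text could equally come from `subprocess.run(["oc", "get", "nodes"], ...)`; the sample string is used here so the sketch runs without a cluster.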

Comment 6 jmencak 2018-04-18 07:41:20 UTC
Rebooting infra-node-0.scale-ci.example.com made the node Ready; the remaining infra nodes are still NotReady after a 1h wait.

[openshift@master-0 ~]$ oc get nodes
NAME                                STATUS     ROLES           AGE       VERSION
app-node-0.scale-ci.example.com     Ready      compute         1h        v1.10.0+b81c8f8
app-node-1.scale-ci.example.com     Ready      compute         1h        v1.10.0+b81c8f8
cns-0.scale-ci.example.com          Ready      compute         1h        v1.10.0+b81c8f8
cns-1.scale-ci.example.com          Ready      compute         1h        v1.10.0+b81c8f8
cns-2.scale-ci.example.com          Ready      compute         1h        v1.10.0+b81c8f8
infra-node-0.scale-ci.example.com   Ready      compute,infra   1h        v1.10.0+b81c8f8
infra-node-1.scale-ci.example.com   NotReady   compute,infra   1h        v1.10.0+b81c8f8
infra-node-2.scale-ci.example.com   NotReady   compute,infra   1h        v1.10.0+b81c8f8
master-0.scale-ci.example.com       Ready      master          1h        v1.10.0+b81c8f8
master-1.scale-ci.example.com       Ready      master          1h        v1.10.0+b81c8f8
master-2.scale-ci.example.com       Ready      master          1h        v1.10.0+b81c8f8

Comment 7 Dan Williams 2018-04-18 18:13:46 UTC
The SDN itself creates the /etc/cni/net.d config file once it has started up and is ready. So if that file is not being created, then either the SDN daemonset has not been started correctly, or it has been started but is unable to come up correctly.

We'd need docker container logs from the SDN daemonset container to debug further, which should be available with kubectl/oc or 'docker logs'.
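
A quick way to confirm the symptom on a node is to check whether the CNI config directory contains any config file. The sketch below is illustrative only: `cni_configs` is a hypothetical helper, and the extension list (.conf, .conflist, .json) is an assumption based on what the CNI library normally loads.

```python
import os
import tempfile

# Assumption: file extensions the CNI config loader normally considers.
CNI_EXTENSIONS = (".conf", ".conflist", ".json")


def cni_configs(conf_dir="/etc/cni/net.d"):
    """Return the CNI config file names in conf_dir (empty if none or missing)."""
    try:
        entries = sorted(os.listdir(conf_dir))
    except FileNotFoundError:
        return []
    return [name for name in entries if name.endswith(CNI_EXTENSIONS)]


# Demo against a temporary directory so it runs anywhere.
with tempfile.TemporaryDirectory() as demo_dir:
    print(cni_configs(demo_dir))  # empty directory: the NotReady case
    open(os.path.join(demo_dir, "example.conf"), "w").close()  # hypothetical file name
    print(cni_configs(demo_dir))  # a config is now present
```

On an affected node, calling `cni_configs()` with the default path would be expected to return an empty list, matching the "No networks found in /etc/cni/net.d" message.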

Comment 8 Dan Williams 2018-04-18 18:29:37 UTC
My bad, node logs are attached.  Error seems to be:

Apr 17 19:32:34 ip-172-31-20-200.us-west-2.compute.internal atomic-openshift-node[8355]: E0417 19:32:34.028801    8355 pod_workers.go:186] Error syncing pod 108bb4c8-4276-11e8-b3df-029c7b97694a ("sdn-fjh78_openshift-sdn(108bb4c8-4276-11e8-b3df-029c7b97694a)"), skipping: failed to "CreatePodSandbox" for "sdn-fjh78_openshift-sdn(108bb4c8-4276-11e8-b3df-029c7b97694a)" with CreatePodSandboxError: "CreatePodSandbox for pod \"sdn-fjh78_openshift-sdn(108bb4c8-4276-11e8-b3df-029c7b97694a)\" failed: rpc error: code = Unknown desc = failed to create a sandbox for pod \"sdn-fjh78\": Error response from daemon: lstat /var/lib/docker/overlay2/09a5b1f6274d724b06bde4ba9f93bd7ac7254c56dfcf72b5989c47806de6e47c: no such file or directory"
Apr 17 19:32:34 ip-172-31-20-200.us-west-2.compute.internal atomic-openshift-node[8355]: E0417 19:32:34.029317    8355 remote_runtime.go:92] RunPodSandbox from runtime service failed: rpc error: code = Unknown desc = failed to create a sandbox for pod "ovs-rpljr": Error response from daemon: lstat /var/lib/docker/overlay2/09a5b1f6274d724b06bde4ba9f93bd7ac7254c56dfcf72b5989c47806de6e47c: no such file or directory

And this continues over and over and over.  I don't think this is a network problem, it's more of a runtime/node problem.  Not only the network daemonset has this issue.
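
To see which pods and which overlay2 directories are involved across a whole journal dump, the error lines can be matched with a regular expression. This is a rough sketch, not a tool from this bug: the pattern is derived only from the two log lines quoted above, and `missing_layers` is a made-up name.

```python
import re

# Matches both the plain form ("ovs-rpljr") and the escaped form (\"sdn-fjh78\")
# seen in the two log lines quoted above.
LSTAT_RE = re.compile(
    r'failed to create a sandbox for pod \\?"(?P<pod>[^"\\]+)\\?": '
    r'Error response from daemon: lstat (?P<path>\S+): no such file or directory'
)


def missing_layers(log_text):
    """Return a sorted list of unique (pod, missing_path) pairs found in log_text."""
    return sorted({(m.group("pod"), m.group("path")) for m in LSTAT_RE.finditer(log_text)})


# Second log line from above, with the host/unit prefix trimmed.
SAMPLE = (
    'E0417 19:32:34.029317    8355 remote_runtime.go:92] RunPodSandbox from runtime '
    'service failed: rpc error: code = Unknown desc = failed to create a sandbox for pod '
    '"ovs-rpljr": Error response from daemon: lstat '
    '/var/lib/docker/overlay2/09a5b1f6274d724b06bde4ba9f93bd7ac7254c56dfcf72b5989c47806de6e47c'
    ': no such file or directory'
)

for pod, path in missing_layers(SAMPLE):
    print(pod, path)
```

If the same overlay2 path shows up for several different pods, that points at a shared image layer missing from docker storage rather than at any one pod.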

Comment 9 jmencak 2018-04-24 06:21:44 UTC
Created attachment 1425810 [details]
Docker/atomic-openshift-node + journalctl -xe logs

Still hitting this with:

[openshift@master-0 ~]$ oc version
oc v3.10.0-0.27.0
kubernetes v1.10.0+b81c8f8
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server https://master-0.scale-ci.example.com:8443
openshift v3.10.0-0.27.0
kubernetes v1.10.0+b81c8f8

[cloud-user@ansible-host openshift-ansible]$ git describe
openshift-ansible-3.10.0-0.27.0-4-gd0a1341

[openshift@master-0 ~]$ oc get nodes
NAME                                STATUS     ROLES           AGE       VERSION
app-node-0.scale-ci.example.com     Ready      compute         7h        v1.10.0+b81c8f8
app-node-1.scale-ci.example.com     Ready      compute         7h        v1.10.0+b81c8f8
cns-0.scale-ci.example.com          Ready      compute         7h        v1.10.0+b81c8f8
cns-1.scale-ci.example.com          Ready      compute         7h        v1.10.0+b81c8f8
cns-2.scale-ci.example.com          Ready      compute         7h        v1.10.0+b81c8f8
infra-node-0.scale-ci.example.com   NotReady   compute,infra   7h        v1.10.0+b81c8f8
infra-node-1.scale-ci.example.com   NotReady   compute,infra   7h        v1.10.0+b81c8f8
infra-node-2.scale-ci.example.com   NotReady   compute,infra   7h        v1.10.0+b81c8f8
master-0.scale-ci.example.com       Ready      master          7h        v1.10.0+b81c8f8
master-1.scale-ci.example.com       Ready      master          7h        v1.10.0+b81c8f8
master-2.scale-ci.example.com       Ready      master          7h        v1.10.0+b81c8f8

Issues seem to be mostly on OpenStack installs.  I was lucky only once so far out of 4 installs.  Haven't hit this so far on a KVM mini-cluster (did ~8 installs).

Comment 10 Mike Fiedler 2018-04-24 11:43:45 UTC
Marking this as urgent, it is blocking deployment in the scalability lab.

Comment 11 Antonio Murdaca 2018-04-24 11:49:53 UTC
Adding Vivek.

Vivek, I wasn't able to narrow this down, but I believe it's a race where the container root directory isn't created (or mounted) in time when we do an lstat. I wasn't able to reproduce it on Fedora either, and I'm trying RHEL now.

Comment 12 Mike Fiedler 2018-04-24 12:12:47 UTC
The most reliable reproducer of this is openshift-ansible install.  During the majority (75%) of installs this issue is hit twice.

1.  when the control-plane components try to start - etcd, master-api, master-controllers.   The static pods never start due to this issue.   We've been rebooting to get around it - atomic-openshift-node restart does not work for sure - I will try docker restart

2. the install normally completes successfully and ends with all compute nodes in NotReady due to this.  Again reboots seem to fix it.

Additionally, I believe we see it randomly during container creation while doing cluster testing, but I do not yet have a reliable reproducer outside of install.

Comment 13 Scott Dodson 2018-04-24 12:26:13 UTC
Is https://bugzilla.redhat.com/show_bug.cgi?id=1570163 a dupe of this?

Comment 14 Vivek Goyal 2018-04-24 13:32:30 UTC
- Is a node already in a state where this problem is visible? I want to login into that node. Provide me information to login.

- Also what's the docker version being used. Provide "docker info" output and output of "rpm -aq | grep docker"

Comment 16 jmencak 2018-04-24 13:45:25 UTC
(In reply to Vivek Goyal from comment #14)
> - Is a node already in a state where this problem is visible? I want to
> login into that node. Provide me information to login.
Vivek, please ping me (jmencak) on #aos-scalability, I'll let you log in.

> - Also what's the docker version being used. Provide "docker info" output
> and output of "rpm -aq | grep docker"
[openshift@master-0 ~]$ rpm -aq | grep docker-1.13
docker-1.13.1-62.gitc6c9b51.el7.x86_64

Comment 17 Mike Fiedler 2018-04-24 13:51:30 UTC
Client:
 Version:         1.13.1
 API version:     1.26
 Package version: docker-1.13.1-62.gitc6c9b51.el7.x86_64
 Go version:      go1.9.2
 Git commit:      c6c9b51/1.13.1
 Built:           Wed Apr 11 01:22:02 2018
 OS/Arch:         linux/amd64

Server:
 Version:         1.13.1
 API version:     1.26 (minimum version 1.12)
 Package version: docker-1.13.1-62.gitc6c9b51.el7.x86_64
 Go version:      go1.9.2
 Git commit:      c6c9b51/1.13.1
 Built:           Wed Apr 11 01:22:02 2018
 OS/Arch:         linux/amd64
 Experimental:    false


Will ping you with system details

Comment 18 Mike Fiedler 2018-04-24 15:36:01 UTC
The root cause of this issue is that the system test golden images mount the docker storage xfs filesystem at /var/lib/docker/overlay2, and have done so since docker 1.12 and OCP 3.6.

Starting in docker 1.13, that mountpoint is unmounted when docker stops (https://bugzilla.redhat.com/show_bug.cgi?id=1568450) which openshift-ansible does during the install while restarting docker.   This causes docker to start using the rootfs for /var/lib/docker/overlay2 and results in a mix of images between the external filesystem and rootfs.   Some images previously available are no longer available after docker restart, resulting in this issue.

Vivek has documented a best practice for creating the docker overlay2 filesystem here:  https://bugzilla.redhat.com/show_bug.cgi?id=1568450#c7.   It would be great if that ends up in product doc - or at least a restriction not to mount anything at /var/lib/docker/overlay2.

Following the best practice, this problem no longer occurs.  Resolving this as NOTABUG
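
To check whether a host or golden image has the problematic layout, one option is to look for a filesystem mounted directly at /var/lib/docker/overlay2 in /proc/mounts. The following is a minimal sketch under that assumption; `mounts_at` is a hypothetical helper and the device names in the sample are invented for the demo.

```python
OVERLAY2_DIR = "/var/lib/docker/overlay2"


def mounts_at(path, mounts_text):
    """Return (device, fstype) for every /proc/mounts entry mounted exactly at path.

    /proc/mounts lines have the form: device mountpoint fstype options dump pass
    """
    hits = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 3 and fields[1] == path:
            hits.append((fields[0], fields[2]))
    return hits


# Invented sample in /proc/mounts format; device names are hypothetical.
SAMPLE = """\
/dev/xvda2 / xfs rw,relatime 0 0
/dev/xvdb1 /var/lib/docker/overlay2 xfs rw,relatime 0 0
tmpfs /run tmpfs rw,nosuid,nodev 0 0
"""

print(mounts_at(OVERLAY2_DIR, SAMPLE))
```

On a real host the text would come from `open("/proc/mounts").read()`. A non-empty result means something is mounted at the overlay2 directory itself, which is the configuration described above as the root cause; an empty result while docker is stopped does not prove the layout is safe, since the mount may simply have been removed when docker stopped.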

Comment 19 Mike Fiedler 2018-04-24 15:37:37 UTC
*** Bug 1570163 has been marked as a duplicate of this bug. ***

