Description of problem: After installing 3.10.0-0.22.0 on AWS (cloud provider enabled), all of the nodes are NotReady. Rebooting fixes them; restarting atomic-openshift-node and docker does not. Starting this issue with Networking - please move as needed.

03179 8355 cni.go:171] Unable to update cni config: No networks found in /etc/cni/net.d
03370 8355 kubelet.go:2125] Container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: cni config uninitialized
05042 8355 cni.go:171] Unable to update cni config: No networks found in /etc/cni/net.d
05216 8355 kubelet.go:2125] Container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: cni config uninitialized
06824 8355 cni.go:171] Unable to update cni config: No networks found in /etc/cni/net.d
06981 8355 kubelet.go:2125] Container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: cni config uninitialized

Version-Release number of selected component (if applicable): 3.10.0-0.22.0

How reproducible: Always

Steps to Reproduce:
1. AWS install (see inventory below) - 1 master, 1 infra, 2 computes
2. After the install, oc get nodes shows the computes as NotReady
3. Rebooting a compute fixes the issue

Full node/docker logs attached.

Actual results: Nodes NotReady after install

Expected results: Nodes Ready and schedulable

Additional info: Inventory with credentials redacted:

[OSEv3:children]
masters
nodes
etcd

[OSEv3:vars]
# The following parameters are used by post-actions
iaas_name=AWS
use_rpm_playbook=false
openshift_playbook_rpm_repos=[{'id': 'aos-playbook-rpm', 'name': 'aos-playbook-rpm', 'baseurl': 'http://download.eng.bos.redhat.com/rcm-guest/puddles/RHAOS/AtomicOpenShift/3.10/latest/x86_64/os', 'enabled': 1, 'gpgcheck': 0}]
update_is_images_url=registry.reg-aws.openshift.com:443

# The following parameters are used by openshift-ansible
ansible_ssh_user=root
openshift_cloudprovider_kind=aws
openshift_cloudprovider_aws_access_key=<redacted>
openshift_cloudprovider_aws_secret_key=<redacted>
openshift_master_default_subdomain_enable=true
openshift_master_default_subdomain=apps.0417-ezg.qe.rhcloud.com
openshift_auth_type=allowall
openshift_master_identity_providers=[{'name': 'allow_all', 'login': 'true', 'challenge': 'true', 'kind': 'AllowAllPasswordIdentityProvider'}]
openshift_node_labels="{'region': 'primary', 'zone': 'default'}"
openshift_release=v3.10
openshift_deployment_type=openshift-enterprise
openshift_cockpit_deployer_prefix=registry.reg-aws.openshift.com:443/openshift3/
oreg_url=registry.reg-aws.openshift.com:443/openshift3/ose-${component}:${version}
oreg_auth_user=<redacted>
oreg_auth_password=<redacted>
openshift_docker_additional_registries=registry.reg-aws.openshift.com:443
openshift_docker_insecure_registries=registry.reg-aws.openshift.com:443
openshift_service_catalog_image_prefix=registry.reg-aws.openshift.com:443/openshift3/ose-
ansible_service_broker_image_prefix=registry.reg-aws.openshift.com:443/openshift3/ose-
ansible_service_broker_image_tag=v3.10
template_service_broker_selector={"region": "infra"}
template_service_broker_prefix=registry.reg-aws.openshift.com:443/openshift3/ose-
template_service_broker_version=v3.10
openshift_web_console_prefix=registry.reg-aws.openshift.com:443/openshift3/ose-
openshift_enable_service_catalog=true
osm_cockpit_plugins=['cockpit-kubernetes']
osm_use_cockpit=false
openshift_docker_options=--log-opt max-size=100M --log-opt max-file=3 --signature-verification=false
use_cluster_metrics=true
openshift_master_cluster_method=native
openshift_master_dynamic_provisioning_enabled=true
openshift_hosted_router_registryurl=registry.reg-aws.openshift.com:443/openshift3/ose-${component}:${version}
openshift_hosted_registry_registryurl=registry.reg-aws.openshift.com:443/openshift3/ose-${component}:${version}
osm_default_node_selector=region=primary
openshift_registry_selector="region=infra,zone=default"
openshift_hosted_router_selector="region=infra,zone=default"
openshift_disable_check=disk_availability,memory_availability,package_availability,docker_image_availability,docker_storage,package_version
osm_host_subnet_length=9
openshift_node_kubelet_args={"pods-per-core": ["0"], "max-pods": ["510"]}
debug_level=2
openshift_set_hostname=true
openshift_override_hostname_check=true
os_sdn_network_plugin_name=redhat/openshift-ovs-networkpolicy
openshift_hosted_router_replicas=1
openshift_hosted_registry_storage_kind=object
openshift_hosted_registry_storage_provider=s3
openshift_hosted_registry_storage_s3_accesskey=<redacted>
openshift_hosted_registry_storage_s3_secretkey=<redacted>
openshift_hosted_registry_storage_s3_bucket=aoe-svt-test
openshift_hosted_registry_storage_s3_region=us-west-2
openshift_hosted_registry_replicas=1
openshift_hosted_prometheus_deploy=true
openshift_prometheus_image_prefix=registry.reg-aws.openshift.com:443/openshift3/
openshift_prometheus_image_version=v3.10
openshift_prometheus_proxy_image_prefix=registry.reg-aws.openshift.com:443/openshift3/
openshift_prometheus_proxy_image_version=v3.10
openshift_prometheus_alertmanager_image_prefix=registry.reg-aws.openshift.com:443/openshift3/
openshift_prometheus_alertmanager_image_version=v3.10
openshift_prometheus_alertbuffer_image_prefix=registry.reg-aws.openshift.com:443/openshift3/
openshift_prometheus_alertbuffer_image_version=v3.10
openshift_prometheus_node_selector={"region": "infra"}
openshift_metrics_install_metrics=false
openshift_metrics_image_prefix=registry.reg-aws.openshift.com:443/openshift3/
openshift_metrics_image_version=v3.10
openshift_metrics_cassandra_storage_type=dynamic
openshift_metrics_cassandra_pvc_size=25Gi
openshift_logging_install_logging=false
openshift_logging_image_prefix=registry.reg-aws.openshift.com:443/openshift3/
openshift_logging_image_version=v3.10
openshift_logging_storage_kind=dynamic
openshift_logging_es_pvc_size=50Gi
openshift_logging_es_pvc_dynamic=true
openshift_logging_curator_nodeselector={"region": "infra"}
openshift_logging_kibana_nodeselector={"region": "infra"}
openshift_logging_es_nodeselector={"region": "infra"}
openshift_clusterid=mffiedler
containerized=false
openshift_use_system_containers=false
openshift_use_crio=false

[lb]

[etcd]
ec2-54-191-146-101.us-west-2.compute.amazonaws.com openshift_public_hostname=ec2-54-191-146-101.us-west-2.compute.amazonaws.com

[masters]
ec2-54-191-146-101.us-west-2.compute.amazonaws.com openshift_public_hostname=ec2-54-191-146-101.us-west-2.compute.amazonaws.com

[nodes]
ec2-54-191-146-101.us-west-2.compute.amazonaws.com openshift_public_hostname=ec2-54-191-146-101.us-west-2.compute.amazonaws.com openshift_node_labels="{'region': 'master', 'zone': 'default'}" openshift_schedulable=true
ec2-54-68-62-115.us-west-2.compute.amazonaws.com openshift_public_hostname=ec2-54-68-62-115.us-west-2.compute.amazonaws.com openshift_node_labels="{'region': 'infra', 'zone': 'default'}"
ec2-34-214-63-108.us-west-2.compute.amazonaws.com openshift_public_hostname=ec2-34-214-63-108.us-west-2.compute.amazonaws.com openshift_node_labels="{'region': 'primary', 'zone': 'default'}"
ec2-34-214-102-13.us-west-2.compute.amazonaws.com openshift_public_hostname=ec2-34-214-102-13.us-west-2.compute.amazonaws.com openshift_node_labels="{'region': 'primary', 'zone': 'default'}"
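For triage, a minimal sketch of the checks behind steps 2-3 above (assumes the RPM install's atomic-openshift-node systemd unit, as seen in the attached logs):

# Confirm which nodes are NotReady
oc get nodes
# On an affected node, look for the CNI errors quoted above
journalctl -u atomic-openshift-node --no-pager | grep -E 'cni.go|NetworkPluginNotReady'
# The SDN writes its CNI config here once it is up; an empty dir matches the errors above
ls -l /etc/cni/net.d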
Created attachment 1423249 [details] Node/docker logs
This also seems to happen during upgrades from 3.9 on any host other than the first master. I'm looking at it, but would love assistance from Networking or Clayton.
Possibly related to https://github.com/openshift/openshift-ansible/issues/7967
Let me know what I can gather. I have a good reproducer now.
I wasn't able to reproduce this on a small 8-node KVM environment on a single bare-metal host, but I'm seeing this on an OpenStack environment now too. This seems to affect only a certain percentage of nodes.

[openshift@master-0 ~]$ oc get nodes
NAME                                STATUS     ROLES           AGE   VERSION
app-node-0.scale-ci.example.com     Ready      compute         13m   v1.10.0+b81c8f8
app-node-1.scale-ci.example.com     Ready      compute         13m   v1.10.0+b81c8f8
cns-0.scale-ci.example.com          Ready      compute         13m   v1.10.0+b81c8f8
cns-1.scale-ci.example.com          Ready      compute         13m   v1.10.0+b81c8f8
cns-2.scale-ci.example.com          Ready      compute         13m   v1.10.0+b81c8f8
infra-node-0.scale-ci.example.com   NotReady   compute,infra   13m   v1.10.0+b81c8f8
infra-node-1.scale-ci.example.com   NotReady   compute,infra   13m   v1.10.0+b81c8f8
infra-node-2.scale-ci.example.com   NotReady   compute,infra   13m   v1.10.0+b81c8f8
master-0.scale-ci.example.com       Ready      master          15m   v1.10.0+b81c8f8
master-1.scale-ci.example.com       Ready      master          15m   v1.10.0+b81c8f8
master-2.scale-ci.example.com       Ready      master          15m   v1.10.0+b81c8f8

root@infra-node-0: /home/openshift # find /etc/cni/
/etc/cni/
/etc/cni/net.d

/etc/cni/net.d is empty on all infra nodes which are NotReady. Test blocker for scale testing.
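A quick way to confirm the empty CNI config dir across the infra nodes (a sketch, assuming ssh access from the master; hostnames as in the output above):

for h in infra-node-0 infra-node-1 infra-node-2; do
  echo "== $h =="
  ssh $h.scale-ci.example.com 'ls -la /etc/cni/net.d'
done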
A reboot of infra-node-0.scale-ci.example.com made the node Ready; the remaining infra nodes are still NotReady after a 1h wait.

[openshift@master-0 ~]$ oc get nodes
NAME                                STATUS     ROLES           AGE   VERSION
app-node-0.scale-ci.example.com     Ready      compute         1h    v1.10.0+b81c8f8
app-node-1.scale-ci.example.com     Ready      compute         1h    v1.10.0+b81c8f8
cns-0.scale-ci.example.com          Ready      compute         1h    v1.10.0+b81c8f8
cns-1.scale-ci.example.com          Ready      compute         1h    v1.10.0+b81c8f8
cns-2.scale-ci.example.com          Ready      compute         1h    v1.10.0+b81c8f8
infra-node-0.scale-ci.example.com   Ready      compute,infra   1h    v1.10.0+b81c8f8
infra-node-1.scale-ci.example.com   NotReady   compute,infra   1h    v1.10.0+b81c8f8
infra-node-2.scale-ci.example.com   NotReady   compute,infra   1h    v1.10.0+b81c8f8
master-0.scale-ci.example.com       Ready      master          1h    v1.10.0+b81c8f8
master-1.scale-ci.example.com       Ready      master          1h    v1.10.0+b81c8f8
master-2.scale-ci.example.com       Ready      master          1h    v1.10.0+b81c8f8
The SDN itself creates the /etc/cni/net.d config file once it has started up and is ready. So if that file is not being created, then either the SDN daemonset has not been started correctly, or it has been started but is unable to run correctly. We'd need the container logs from the SDN daemonset containers to debug further; they should be available with kubectl/oc or 'docker logs'.
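For example (a sketch; <node> and the pod/container names are placeholders to fill in from the output):

# From a master: find the SDN pod scheduled on the affected node and pull its logs
oc -n openshift-sdn get pods -o wide | grep <node>
oc -n openshift-sdn logs <sdn-pod-name>

# Or directly on the affected node, if the container was ever created
docker ps -a | grep -E 'sdn|ovs'
docker logs <container-id>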
My bad, node logs are attached. The error seems to be:

Apr 17 19:32:34 ip-172-31-20-200.us-west-2.compute.internal atomic-openshift-node[8355]: E0417 19:32:34.028801 8355 pod_workers.go:186] Error syncing pod 108bb4c8-4276-11e8-b3df-029c7b97694a ("sdn-fjh78_openshift-sdn(108bb4c8-4276-11e8-b3df-029c7b97694a)"), skipping: failed to "CreatePodSandbox" for "sdn-fjh78_openshift-sdn(108bb4c8-4276-11e8-b3df-029c7b97694a)" with CreatePodSandboxError: "CreatePodSandbox for pod \"sdn-fjh78_openshift-sdn(108bb4c8-4276-11e8-b3df-029c7b97694a)\" failed: rpc error: code = Unknown desc = failed to create a sandbox for pod \"sdn-fjh78\": Error response from daemon: lstat /var/lib/docker/overlay2/09a5b1f6274d724b06bde4ba9f93bd7ac7254c56dfcf72b5989c47806de6e47c: no such file or directory"

Apr 17 19:32:34 ip-172-31-20-200.us-west-2.compute.internal atomic-openshift-node[8355]: E0417 19:32:34.029317 8355 remote_runtime.go:92] RunPodSandbox from runtime service failed: rpc error: code = Unknown desc = failed to create a sandbox for pod "ovs-rpljr": Error response from daemon: lstat /var/lib/docker/overlay2/09a5b1f6274d724b06bde4ba9f93bd7ac7254c56dfcf72b5989c47806de6e47c: no such file or directory

And this repeats over and over. I don't think this is a network problem; it's more of a runtime/node problem. The network daemonset is not the only one hitting this issue.
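To see what the lstat failure refers to on an affected node (a sketch; the layer hash is the one from the log lines above):

# Does the layer directory the daemon wants actually exist?
ls -ld /var/lib/docker/overlay2/09a5b1f6274d724b06bde4ba9f93bd7ac7254c56dfcf72b5989c47806de6e47c

# Is a separate filesystem (still) mounted at the docker storage paths?
findmnt /var/lib/docker
findmnt /var/lib/docker/overlay2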
Created attachment 1425810 [details] Docker/atomic-openshift-node + journalctl -xe logs

Still hitting this with:

[openshift@master-0 ~]$ oc version
oc v3.10.0-0.27.0
kubernetes v1.10.0+b81c8f8
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server https://master-0.scale-ci.example.com:8443
openshift v3.10.0-0.27.0
kubernetes v1.10.0+b81c8f8

[cloud-user@ansible-host openshift-ansible]$ git describe
openshift-ansible-3.10.0-0.27.0-4-gd0a1341

[openshift@master-0 ~]$ oc get nodes
NAME                                STATUS     ROLES           AGE   VERSION
app-node-0.scale-ci.example.com     Ready      compute         7h    v1.10.0+b81c8f8
app-node-1.scale-ci.example.com     Ready      compute         7h    v1.10.0+b81c8f8
cns-0.scale-ci.example.com          Ready      compute         7h    v1.10.0+b81c8f8
cns-1.scale-ci.example.com          Ready      compute         7h    v1.10.0+b81c8f8
cns-2.scale-ci.example.com          Ready      compute         7h    v1.10.0+b81c8f8
infra-node-0.scale-ci.example.com   NotReady   compute,infra   7h    v1.10.0+b81c8f8
infra-node-1.scale-ci.example.com   NotReady   compute,infra   7h    v1.10.0+b81c8f8
infra-node-2.scale-ci.example.com   NotReady   compute,infra   7h    v1.10.0+b81c8f8
master-0.scale-ci.example.com       Ready      master          7h    v1.10.0+b81c8f8
master-1.scale-ci.example.com       Ready      master          7h    v1.10.0+b81c8f8
master-2.scale-ci.example.com       Ready      master          7h    v1.10.0+b81c8f8

The issue seems to hit mostly OpenStack installs; I was lucky only once so far out of 4 installs. I haven't hit this so far on a KVM mini-cluster (did ~8 installs).
Marking this as urgent, it is blocking deployment in the scalability lab.
Adding Vivek. Vivek, I wasn't able to narrow this down, and I believe it's a race where the container root directory isn't created (or mounted) in time when we do an lstat. I wasn't able to reproduce on Fedora either, and I'm trying RHEL now.
The most reliable reproducer of this is an openshift-ansible install. During the majority (~75%) of installs this issue is hit twice:

1. When the control-plane components try to start (etcd, master-api, master-controllers). The static pods never start due to this issue. We've been rebooting to get around it; an atomic-openshift-node restart definitely does not work, and I will try a docker restart.
2. The install normally completes successfully and ends with all compute nodes in NotReady due to this. Again, reboots seem to fix it.

Additionally, I believe we see it randomly during container creation while doing cluster testing, but I do not yet have a reliable reproducer outside of install.
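For reference, the workarounds discussed so far, as run on an affected node (a sketch; unit names match the RPM install seen in the attached logs):

systemctl restart atomic-openshift-node   # does not clear the condition
systemctl restart docker                  # also reported not to help in the original description
systemctl reboot                          # so far the only thing that reliably brings the node back to Ready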
Is https://bugzilla.redhat.com/show_bug.cgi?id=1570163 a dupe of this?
- Is a node already in a state where this problem is visible? I want to log into that node. Provide me information to log in.
- Also, what docker version is being used? Provide "docker info" output and the output of "rpm -aq | grep docker".
(In reply to Vivek Goyal from comment #14)
> - Is a node already in a state where this problem is visible? I want to
> login into that node. Provide me information to login.

Vivek, please ping me (jmencak) on #aos-scalability, I'll let you log in.

> - Also what's the docker version being used. Provide "docker info" output
> and output of "rpm -aq | grep docker"

[openshift@master-0 ~]$ rpm -aq | grep docker-1.13
docker-1.13.1-62.gitc6c9b51.el7.x86_64
Client:
 Version:         1.13.1
 API version:     1.26
 Package version: docker-1.13.1-62.gitc6c9b51.el7.x86_64
 Go version:      go1.9.2
 Git commit:      c6c9b51/1.13.1
 Built:           Wed Apr 11 01:22:02 2018
 OS/Arch:         linux/amd64

Server:
 Version:         1.13.1
 API version:     1.26 (minimum version 1.12)
 Package version: docker-1.13.1-62.gitc6c9b51.el7.x86_64
 Go version:      go1.9.2
 Git commit:      c6c9b51/1.13.1
 Built:           Wed Apr 11 01:22:02 2018
 OS/Arch:         linux/amd64
 Experimental:    false

Will ping you with system details.
The root cause of this issue is that the system test golden images mount the docker storage XFS filesystem at /var/lib/docker/overlay2, and have done so since docker 1.12 and OCP 3.6. Starting in docker 1.13, that mountpoint is unmounted when docker stops (https://bugzilla.redhat.com/show_bug.cgi?id=1568450), and openshift-ansible restarts docker during the install. After the restart, docker starts using the rootfs for /var/lib/docker/overlay2, resulting in a mix of image layers split between the external filesystem and the rootfs. Some images that were previously available are no longer available after the docker restart, which produces the lstat failures seen above.

Vivek has documented a best practice for creating the docker overlay2 filesystem here: https://bugzilla.redhat.com/show_bug.cgi?id=1568450#c7. It would be great if that ends up in the product docs - or at least a restriction not to mount anything at /var/lib/docker/overlay2. Following the best practice, this problem no longer occurs.

Resolving this as NOTABUG.
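The linked comment is the authoritative reference; purely to illustrate the layout it argues for - a dedicated filesystem mounted at /var/lib/docker rather than at /var/lib/docker/overlay2 - something like this (a sketch; the device name is an example):

# One-time setup on a fresh node, before docker is started.
# (ftype=1 is required for overlay2 on RHEL 7 XFS.)
mkfs.xfs -n ftype=1 /dev/xvdb1
mkdir -p /var/lib/docker
echo '/dev/xvdb1 /var/lib/docker xfs defaults 0 0' >> /etc/fstab
mount /var/lib/docker

With the mount one level up, stopping docker no longer pulls the filesystem out from under the overlay2 layer store.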
*** Bug 1570163 has been marked as a duplicate of this bug. ***