Description of problem:

New nodes created occasionally remain unready forever.

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1. Scale a MachineSet.
2. Wait for the nodes to join. See a node remaining unready forever.

Actual results:

Node remains unready:

  Ready  False  Wed, 10 Apr 2019 20:42:45 +0200  Wed, 10 Apr 2019 20:40:35 +0200  KubeletNotReady  runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: cni config uninitialized

Expected results:

Node goes ready.

Additional info:
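For reference, a quick way to spot nodes stuck like this is to filter the Ready condition out of the node objects with jq. This is a sketch: on a live cluster you would pipe `oc get nodes -o json` into the filter; here a hand-written sample node (modeled on the condition shown above) stands in so the filter itself is runnable.

```shell
# Sample node object standing in for `oc get nodes -o json` output.
cat <<'EOF' > /tmp/sample-nodes.json
{
  "items": [
    {
      "metadata": {"name": "ip-10-0-131-166.ec2.internal"},
      "status": {
        "conditions": [
          {
            "type": "Ready",
            "status": "False",
            "reason": "KubeletNotReady",
            "message": "runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: cni config uninitialized"
          }
        ]
      }
    }
  ]
}
EOF
# Print name and reason for every node whose Ready condition is not "True".
jq -r '.items[]
  | select(.status.conditions[] | select(.type == "Ready") | .status != "True")
  | "\(.metadata.name)\t\(.status.conditions[] | select(.type == "Ready") | .reason)"' \
  /tmp/sample-nodes.json
```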
Some discussion of this in bug 1591752, although I'm not sure whether this is a new issue or that one coming back. Example job [1]. We've seen it 19 times in the past ~24 hours (in ~4% of all failed *-e2e-aws* jobs).

[1]: https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/openshift_machine-api-operator/277/pull-ci-openshift-machine-api-operator-master-e2e-aws-operator/755
Definitely not good. Can you post links to any e2e jobs that fail due to this? Keep in mind that this state is normal for short periods of time as nodes come up and reboot.

https://github.com/openshift/machine-config-operator/pull/604 added a tmpfiles change to remove this file on boot via systemd-tmpfiles, so that's a likely cause. It "shouldn't" cause this, of course, but computers being what they are...

cc Phil, who wrote that change.
Some debugging notes from the above e2e run:

The not-ready node is ip-10-0-131-166.ec2.internal. Its SDN pod is sdn-qldcw. However, that pod has never run. It has 3 events:

- Successfully assigned openshift-sdn/sdn-qldcw to ip-10-0-131-166.ec2.internal
- Failed create pod sandbox: rpc error: code = Unknown desc = error creating pod sandbox with name "k8s_sdn-qldcw_openshift-sdn_7d590162-5bb4-11e9-a45a-12239d9bef42_0": Manifest does not match provided manifest digest sha256:d61921e97f2850bcff984990dae93e9e4ada5b34e2c0f70985e84b91d4090880
- Created pod: sdn-qldcw (this is from the DaemonSet controller and not meaningful)

So the question is: why did starting the SDN pod fail?
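For anyone repeating this triage: the events above can be pulled out of the event stream and filtered down to the failure. A sketch, assuming `jq` and (on a live cluster) something like `oc -n openshift-sdn get events -o json --field-selector involvedObject.name=sdn-qldcw`; a sample event list stands in here so the filter is runnable as written.

```shell
# Sample event list standing in for the live `oc get events -o json` output.
cat <<'EOF' > /tmp/sample-events.json
{
  "items": [
    {"reason": "Scheduled", "message": "Successfully assigned openshift-sdn/sdn-qldcw to ip-10-0-131-166.ec2.internal"},
    {"reason": "FailedCreatePodSandBox", "message": "Failed create pod sandbox: rpc error: code = Unknown desc = Manifest does not match provided manifest digest"},
    {"reason": "SuccessfulCreate", "message": "Created pod: sdn-qldcw"}
  ]
}
EOF
# Keep only failure events and print their messages.
jq -r '.items[] | select(.reason | test("Failed")) | .message' /tmp/sample-events.json
```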
Created attachment 1554466 [details]
Error matches from the past 24 hours

$ jq '. | keys' errors.json
[
  "https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/openshift_cluster-api-provider-aws/188/pull-ci-openshift-cluster-api-provider-aws-master-e2e-aws-operator/415",
  "https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/openshift_cluster-api-provider-aws/188/pull-ci-openshift-cluster-api-provider-aws-master-e2e-aws-operator/416",
  "https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/openshift_cluster-api-provider-aws/188/pull-ci-openshift-cluster-api-provider-aws-master-e2e-aws-operator/418",
  "https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/openshift_cluster-api-provider-aws/188/pull-ci-openshift-cluster-api-provider-aws-master-e2e-aws-operator/420",
  "https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/openshift_cluster-api-provider-aws/192/pull-ci-openshift-cluster-api-provider-aws-master-e2e-aws-operator/419",
  "https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/openshift_cluster-autoscaler-operator/86/pull-ci-openshift-cluster-autoscaler-operator-master-e2e-aws-operator/260",
  "https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/openshift_machine-api-operator/261/pull-ci-openshift-machine-api-operator-master-e2e-aws-operator/750",
  "https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/openshift_machine-api-operator/261/pull-ci-openshift-machine-api-operator-master-e2e-aws-operator/753",
  "https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/openshift_machine-api-operator/261/pull-ci-openshift-machine-api-operator-master-e2e-aws-operator/760",
  "https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/openshift_machine-api-operator/261/pull-ci-openshift-machine-api-operator-master-e2e-aws-operator/762",
  "https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/openshift_machine-api-operator/275/pull-ci-openshift-machine-api-operator-master-e2e-aws-operator/749",
  "https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/openshift_machine-api-operator/275/pull-ci-openshift-machine-api-operator-master-e2e-aws-operator/752",
  "https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/openshift_machine-api-operator/276/pull-ci-openshift-machine-api-operator-master-e2e-aws-operator/748",
  "https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/openshift_machine-api-operator/276/pull-ci-openshift-machine-api-operator-master-e2e-aws-operator/751",
  "https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/openshift_machine-api-operator/276/pull-ci-openshift-machine-api-operator-master-e2e-aws-operator/754",
  "https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/openshift_machine-api-operator/276/pull-ci-openshift-machine-api-operator-master-e2e-aws-operator/764",
  "https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/openshift_machine-api-operator/277/pull-ci-openshift-machine-api-operator-master-e2e-aws-operator/755"
]

The schema for the full file is: job URI -> build-log regexp -> matched build-log content. Generated with [1,2].

[1]: https://github.com/wking/openshift-release/tree/deck-d3/d3
[2]: curl -s 'http://localhost:8000/search?q=E.*Network+plugin+returns+error:+cni+config+uninitialized' | jq . >errors.json
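To make the schema concrete, here is a runnable sketch with a tiny hand-written sample in place of the real errors.json (the job URI and matched log line below are illustrative, not taken from the attachment):

```shell
# Sample of the errors.json schema: job URI -> build-log regexp -> matched content.
cat <<'EOF' > /tmp/errors-sample.json
{
  "https://openshift-gce-devel.appspot.com/build/example-job/1": {
    "E.*Network plugin returns error: cni config uninitialized": "Network plugin returns error: cni config uninitialized"
  }
}
EOF
# Top-level keys are the job URIs, as in the `jq '. | keys'` invocation above.
jq '. | keys' /tmp/errors-sample.json
# Count how many jobs matched the search.
jq 'keys | length' /tmp/errors-sample.json
```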
Almost certainly caused by bug 1669096, which is now fixed. Will let this sleep for a few days and see if it recurs.
Casey (comment 2): I wrote the change; it fixes bug 1654044. On a new node the file should not exist, so PR 604 shouldn't matter.
Looks like bug 1669096 was supposed to have been fixed almost 2 months ago in CRI-O 1.12.5-6. I can easily reproduce this on a cluster created yesterday with newer CRI-O. All pods on the node are stuck in ContainerCreating or Init:0/2 status and have the same error:

  Failed create pod sandbox: rpc error: code = Unknown desc = error creating pod sandbox with name "k8s_sdn-f82k7_openshift-sdn_b2c67192-5c65-11e9-b425-0293ee4fe3b8_0": Manifest does not match provided manifest digest sha256:d61921e97f2850bcff984990dae93e9e4ada5b34e2c0f70985e84b91d4090880

That includes pods unrelated to SDN. Some info from the node:

  Kernel Version: 4.18.0-80.el8.x86_64
  OS Image: Red Hat Enterprise Linux CoreOS 410.8.20190408.1 (Ootpa)
  Operating System: linux
  Architecture: amd64
  Container Runtime Version: cri-o://1.13.4-3.rhaos4.1.git30006b3.el8
  Kubelet Version: v1.13.4+1ad602308
  Kube-Proxy Version: v1.13.4+1ad602308
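A sketch of how to enumerate the stuck pods on the affected node, assuming `jq` and (live) something like `oc get pods --all-namespaces -o json --field-selector spec.nodeName=<node>`; a sample pod list stands in so the filter is runnable here.

```shell
# Sample pod list standing in for the live `oc get pods -o json` output.
cat <<'EOF' > /tmp/sample-pods.json
{
  "items": [
    {"metadata": {"namespace": "openshift-sdn", "name": "sdn-f82k7"},
     "status": {"containerStatuses": [{"state": {"waiting": {"reason": "ContainerCreating"}}}]}},
    {"metadata": {"namespace": "default", "name": "healthy-pod"},
     "status": {"containerStatuses": [{"state": {"running": {}}}]}}
  ]
}
EOF
# Print namespace/name for every pod with a container waiting in ContainerCreating.
jq -r '.items[]
  | select([.status.containerStatuses[]?.state.waiting.reason] | any(. == "ContainerCreating"))
  | "\(.metadata.namespace)/\(.metadata.name)"' /tmp/sample-pods.json
```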
Bug 1698253 was the recent CRI-O issue.