Description of problem:
I am trying to upgrade a CRI-O runtime cluster. It has the following nodes:

1 lb
3 masters
3 etcd (separate)
2 infra
5 compute

The cluster is loaded with 2000 projects; each project has:

5 build configs
10 builds
5 deployment configs
20 secrets
5 routes

One of the nodes does not become Ready during the upgrade, and it shows the following error:

Jun 15 12:45:12 ip-172-31-60-77.us-west-2.compute.internal atomic-openshift-node[54555]: ERROR: logging before flag.Parse: E0615 12:45:12.318979 54555 kubelet.go:2147] Container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: cni config uninitialized
Jun 15 12:45:12 ip-172-31-60-77.us-west-2.compute.internal atomic-openshift-node[54555]: ERROR: logging before flag.Parse: E0615 12:45:12.937747 54555 pod_workers.go:186] Error syncing pod 84639f6d-6fe4-11e8-9f48-0265e7a43eb8 ("dockergc-rklnx_default(84639f6d-6fe4-11e8-9f48-0265e7a43eb8)"), skipping: network is not ready: [runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: cni config uninitialized]
Jun 15 12:45:17 ip-172-31-60-77.us-west-2.compute.internal atomic-openshift-node[54555]: ERROR: logging before flag.Parse: E0615 12:45:17.319803 54555 kubelet.go:2147] Container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: cni config uninitialized
Jun 15 12:45:22 ip-172-31-60-77.us-west-2.compute.internal atomic-openshift-node[54555]: ERROR: logging before flag.Parse: E0615 12:45:22.320564 54555 kubelet.go:2147] Container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: cni config uninitialized

The cluster is in the following state:

ip-172-31-13-29.us-west-2.compute.internal    Ready                         compute   21h   v1.10.0+b81c8f8
ip-172-31-2-176.us-west-2.compute.internal    Ready                         infra     21h   v1.10.0+b81c8f8
ip-172-31-2-190.us-west-2.compute.internal    Ready                         infra     21h   v1.10.0+b81c8f8
ip-172-31-21-242.us-west-2.compute.internal   Ready                         compute   21h   v1.9.1+a0ce1bc657
ip-172-31-3-179.us-west-2.compute.internal    Ready                         master    21h   v1.10.0+b81c8f8
ip-172-31-35-193.us-west-2.compute.internal   Ready                         compute   21h   v1.10.0+b81c8f8
ip-172-31-42-21.us-west-2.compute.internal    Ready                         compute   21h   v1.9.1+a0ce1bc657
ip-172-31-45-73.us-west-2.compute.internal    Ready                         master    21h   v1.10.0+b81c8f8
ip-172-31-50-124.us-west-2.compute.internal   Ready                         master    21h   v1.10.0+b81c8f8
ip-172-31-60-77.us-west-2.compute.internal    NotReady,SchedulingDisabled   compute   21h   v1.10.0+b81c8f8

Version-Release number of selected component (if applicable):
openshift v3.10.0-0.67.0

Steps to Reproduce:
1. Create a 3.9 CRI-O runtime cluster
2. Upgrade to a 3.10 cluster

Actual results:
Upgrade fails

Expected results:
Upgrade should complete

Additional info:
Will attach ansible logs and node journal logs.
Created attachment 1451924 [details] ansible log with -vvv
Created attachment 1451925 [details] node logs
Casey: That error is probably not the root cause. It shows up until the sdn startup completes... so something else is preventing it from completing.
Note: Upgrade completes if I run it again.
It looks like the openshift-sdn pod is crashlooping. Can you either get the termination message from the apiserver or from docker logs directly?
Ah, now that the node is upgraded, I don't know that we'll be able to debug. If you're lucky, you can grab the docker logs of the crashlooped pods.
I will try to reproduce it and update the BZ.
Vikas set me up with another node in the failed state. CRI-O had decided it no longer had a valid CNI configuration:

Jun 15 17:21:40 ip-172-31-32-59.us-west-2.compute.internal systemd[1]: Starting Open Container Initiative Daemon...
Jun 15 17:21:40 ip-172-31-32-59.us-west-2.compute.internal crio[17222]: time="2018-06-15 17:21:40.343834130Z" level=warning msg="hooks path: "/etc/containers/oci/hooks.d" does not exist"
Jun 15 17:21:40 ip-172-31-32-59.us-west-2.compute.internal crio[17222]: time="2018-06-15 17:21:40.343965562Z" level=error msg="error updating cni config: Missing CNI default network"
Jun 15 17:21:40 ip-172-31-32-59.us-west-2.compute.internal crio[17222]: time="2018-06-15 17:21:40.343975143Z" level=error msg="Missing CNI default network"
Jun 15 17:21:40 ip-172-31-32-59.us-west-2.compute.internal crio[17222]: time="2018-06-15 17:21:40.344374186Z" level=error msg="no such file or directory"
Jun 15 17:21:40 ip-172-31-32-59.us-west-2.compute.internal crio[17222]: time="2018-06-15 17:21:40.344434293Z" level=error msg="CNI monitoring error <nil>"
Jun 15 17:21:40 ip-172-31-32-59.us-west-2.compute.internal systemd[1]: Started Open Container Initiative Daemon.

I checked that there was a valid CNI configuration in /etc/cni/net.d and restarted CRI-O. The node is now Ready.

It looks like the upgrade process is somehow racing with CRI-O to parse the network configuration, and CRI-O, unlike the kubelet, gives up if parsing fails.
*** Bug 1592781 has been marked as a duplicate of this bug. ***
I wonder if this is the problem:
https://github.com/openshift/openshift-ansible/blob/master/roles/container_runtime/tasks/package_crio.yml#L66

Ansible shouldn't be writing that file.
Should this go to the upgrade component?
(In reply to Vikas Laad from comment #11)
> Should this go to the upgrade component?

No, let's leave it in networking. They have the expertise to determine the exact steps that need to happen for a successful upgrade, and if they need help implementing those in ansible we can help with that.
The cri-o container runtime upgrade is blocked by this bug. Please double-confirm the decision not to fix it in v3.10. Thanks.
Sorry, I didn't realize cri-o was supported in 3.9. Yes, this needs to be resolved for 3.10.0 :-(
Just to summarize my findings (in this bug and bug 1592781):

Both machines were in the "network not ready" state. crio and the sdn pod were running, and the SDN pod had written /etc/cni/net.d/80-(something).conf. However, crio was still reporting "network not ready" to the kubelet.

The only thing I did to fix this was restart crio. It immediately reported a ready status to the kubelet, and the node was back in action.

So my analysis is that there's something wrong with crio's CNI driver.
We'd need to see some CRIO log output to debug this further...
Ideally at debug level too.
Mrunal: FYI in case you have any ideas about what might be going on.
The attachment on bug 1592781 has the cri-o upgrade log (the instances are gone). Hope it helps.
Created attachment 1453643 [details]
journal logs with crio in debug

Attached node logs with crio at debug log level.
This appears to be a bug in cri-o's vendored ocicni that has been fixed already upstream in ocicni. crio will need to grab ocicni commit b27166bd or just revendor to latest ocicni. See https://github.com/cri-o/ocicni/pull/21
To elaborate, when CRIO restarts we see:

Jun 21 22:32:06 ip-172-31-61-158.us-west-2.compute.internal crio[125075]: time="2018-06-21 22:32:06.288512456Z" level=error msg="error updating cni config: Missing CNI default network"
Jun 21 22:32:06 ip-172-31-61-158.us-west-2.compute.internal crio[125075]: time="2018-06-21 22:32:06.288524548Z" level=error msg="Missing CNI default network"
Jun 21 22:32:06 ip-172-31-61-158.us-west-2.compute.internal crio[125075]: time="2018-06-21 22:32:06.288862594Z" level=error msg="no such file or directory"
Jun 21 22:32:06 ip-172-31-61-158.us-west-2.compute.internal crio[125075]: time="2018-06-21 22:32:06.288960523Z" level=debug msg="CNI monitoring event "": "
Jun 21 22:32:06 ip-172-31-61-158.us-west-2.compute.internal crio[125075]: time="2018-06-21 22:32:06.288983165Z" level=error msg="CNI monitoring error <nil>"

Apparently the Go inotify watcher code will happily return 'nil' errors, which ocicni's directory monitor previously treated as an error, causing it to cease watching /etc/cni/net.d. After the "CNI monitoring error" statement, CRIO will not pick up any new files written to /etc/cni/net.d, including the openshift-sdn one that would allow the node to become ready.
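The failure mode can be sketched in Go. This is a simplified illustration, not the actual ocicni code; the function names and channel plumbing are invented for the example. The point is that a watcher's error channel can deliver a nil value (for instance, the zero value when the channel is closed), and a loop that treats any receive as fatal stops watching the config directory forever:

```go
package main

import "fmt"

// monitorBuggy mimics the pre-fix behavior: the first receive from the
// error channel ends monitoring, even when the received value is nil.
func monitorBuggy(errCh <-chan error) (sawNilError, keptWatching bool) {
	err, ok := <-errCh
	if ok && err == nil {
		sawNilError = true // a nil "error" was actually delivered
	}
	// buggy: any receive, nil or not, stops the watch on /etc/cni/net.d
	return sawNilError, false
}

// monitorFixed mimics the post-fix behavior: nil errors are ignored and
// the loop keeps watching; only a real error ends monitoring.
func monitorFixed(errCh <-chan error) (keptWatching bool) {
	for err := range errCh {
		if err == nil {
			continue // ignore spurious nil errors
		}
		return false // a genuine error ends monitoring
	}
	return true
}

func main() {
	ch := make(chan error, 1)
	ch <- nil // the watcher delivers a nil "error"
	close(ch)
	sawNil, watching := monitorBuggy(ch)
	fmt.Println(sawNil, watching) // true false

	ch2 := make(chan error, 1)
	ch2 <- nil
	close(ch2)
	fmt.Println(monitorFixed(ch2)) // true
}
```

With the buggy loop, any new CNI config file written after the nil error is never noticed, which matches the symptom that only a crio restart (re-reading the directory from scratch) made the node Ready again.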
I will
I am working on vendoring in latest ocicni. Thanks!
Opened https://github.com/kubernetes-incubator/cri-o/pull/1642/files

We will cut a new cri-o release after that merges.
Upgrade failed.

cri-o-1.10.4-1.rhaos3.10.gitebaa77a.el7.x86_64
openshift-ansible-3.10.10-1.git.248.0bb6b58.el7.noarch.rpm

{
    "lastHeartbeatTime": "2018-07-02T08:03:17Z",
    "lastTransitionTime": "2018-07-02T07:59:43Z",
    "message": "runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: cni config uninitialized",
    "reason": "KubeletNotReady",
    "status": "False",
    "type": "Ready"
},

[root@wmengug75crio22-master-etcd-1 ~]# oc get node -o wide
NAME                             STATUS     ROLES     AGE   VERSION             EXTERNAL-IP       OS-IMAGE                                      KERNEL-VERSION          CONTAINER-RUNTIME
wmengug75crio22-master-etcd-1    Ready      master    54m   v1.10.0+b81c8f8     104.197.34.22     Red Hat Enterprise Linux Server 7.5 (Maipo)   3.10.0-862.el7.x86_64   cri-o://1.10.4
wmengug75crio22-master-etcd-2    NotReady   master    54m   v1.10.0+b81c8f8     35.232.174.34     Red Hat Enterprise Linux Server 7.5 (Maipo)   3.10.0-862.el7.x86_64   cri-o://1.10.4
wmengug75crio22-master-etcd-3    Ready      master    54m   v1.9.1+a0ce1bc657   35.193.201.112    Red Hat Enterprise Linux Server 7.5 (Maipo)   3.10.0-862.el7.x86_64   cri-o://1.9.12
wmengug75crio22-node-primary-1   Ready      compute   54m   v1.9.1+a0ce1bc657   35.202.213.206    Red Hat Enterprise Linux Server 7.5 (Maipo)   3.10.0-862.el7.x86_64   cri-o://1.9.12
wmengug75crio22-node-primary-2   Ready      compute   54m   v1.9.1+a0ce1bc657   104.197.231.240   Red Hat Enterprise Linux Server 7.5 (Maipo)   3.10.0-862.el7.x86_64   cri-o://1.9.12
wmengug75crio22-node-primary-3   Ready      compute   54m   v1.9.1+a0ce1bc657   35.226.94.240     Red Hat Enterprise Linux Server 7.5 (Maipo)   3.10.0-862.el7.x86_64   cri-o://1.9.12
wmengug75crio22-nrri-1           Ready      <none>    54m   v1.9.1+a0ce1bc657   35.224.255.164    Red Hat Enterprise Linux Server 7.5 (Maipo)   3.10.0-862.el7.x86_64   cri-o://1.9.12
wmengug75crio22-nrri-2           Ready      <none>    54m   v1.9.1+a0ce1bc657   35.232.14.94      Red Hat Enterprise Linux Server 7.5 (Maipo)   3.10.0-862.el7.x86_64   cri-o://1.9.12
Fix is in https://github.com/kubernetes-incubator/cri-o/releases/tag/v1.10.5

We need a new build and a re-test, please (the build is probably already available).
Fixed.

cri-o-1.10.5-1.rhaos3.10.git4c81456.el7.x86_64
openshift-ansible-3.10.10-1.git.248.0bb6b58.el7.noarch.rpm

{
    "lastHeartbeatTime": "2018-07-03T09:03:03Z",
    "lastTransitionTime": "2018-07-03T09:03:03Z",
    "message": "kubelet is posting ready status",
    "reason": "KubeletReady",
    "status": "True",
    "type": "Ready"
},

Kernel Version: 3.10.0-862.el7.x86_64
Operating System: Red Hat Enterprise Linux Server 7.5 (Maipo)

Although the upgrade failed due to bug 1596818, that is not a crio problem.