Bug 1591752 - Upgrade - Network plugin returns error: cni config uninitialized with crio runtime
Summary: Upgrade - Network plugin returns error: cni config uninitialized with crio runtime
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Containers
Version: 3.10.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 3.10.0
Assignee: Mrunal Patel
QA Contact: Weihua Meng
URL:
Whiteboard:
Duplicates: 1592781
Depends On:
Blocks:
 
Reported: 2018-06-15 12:54 UTC by Vikas Laad
Modified: 2018-12-20 21:47 UTC
CC List: 13 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-12-20 21:37:52 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
ansible log with -vvv (14.44 MB, text/plain)
2018-06-15 13:10 UTC, Vikas Laad
node logs (16.72 MB, application/zip)
2018-06-15 13:10 UTC, Vikas Laad
journal logs with crio in debug (2.28 MB, application/zip)
2018-06-22 02:49 UTC, Vikas Laad

Description Vikas Laad 2018-06-15 12:54:22 UTC
Description of problem:
I am trying to upgrade a CRI-O runtime cluster; it has the following nodes:
1 lb
3 masters
3 etcd (separate)
2 infra
5 compute

The cluster is loaded with 2000 projects; each project has:
5 build configs
10 builds
5 deployment configs
20 secrets
5 routes

One of the nodes does not become ready during the upgrade, and it has the following error:
Jun 15 12:45:12 ip-172-31-60-77.us-west-2.compute.internal atomic-openshift-node[54555]: ERROR: logging before flag.Parse: E0615 12:45:12.318979   54555 kubelet.go:2147] Container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: cni config uninitialized
Jun 15 12:45:12 ip-172-31-60-77.us-west-2.compute.internal atomic-openshift-node[54555]: ERROR: logging before flag.Parse: E0615 12:45:12.937747   54555 pod_workers.go:186] Error syncing pod 84639f6d-6fe4-11e8-9f48-0265e7a43eb8 ("dockergc-rklnx_default(84639f6d-6fe4-11e8-9f48-0265e7a43eb8)"), skipping: network is not ready: [runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: cni config uninitialized]
Jun 15 12:45:17 ip-172-31-60-77.us-west-2.compute.internal atomic-openshift-node[54555]: ERROR: logging before flag.Parse: E0615 12:45:17.319803   54555 kubelet.go:2147] Container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: cni config uninitialized
Jun 15 12:45:22 ip-172-31-60-77.us-west-2.compute.internal atomic-openshift-node[54555]: ERROR: logging before flag.Parse: E0615 12:45:22.320564   54555 kubelet.go:2147] Container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: cni config uninitialized

The cluster is in the following state:
ip-172-31-13-29.us-west-2.compute.internal    Ready                         compute   21h       v1.10.0+b81c8f8
ip-172-31-2-176.us-west-2.compute.internal    Ready                         infra     21h       v1.10.0+b81c8f8
ip-172-31-2-190.us-west-2.compute.internal    Ready                         infra     21h       v1.10.0+b81c8f8
ip-172-31-21-242.us-west-2.compute.internal   Ready                         compute   21h       v1.9.1+a0ce1bc657
ip-172-31-3-179.us-west-2.compute.internal    Ready                         master    21h       v1.10.0+b81c8f8
ip-172-31-35-193.us-west-2.compute.internal   Ready                         compute   21h       v1.10.0+b81c8f8
ip-172-31-42-21.us-west-2.compute.internal    Ready                         compute   21h       v1.9.1+a0ce1bc657
ip-172-31-45-73.us-west-2.compute.internal    Ready                         master    21h       v1.10.0+b81c8f8
ip-172-31-50-124.us-west-2.compute.internal   Ready                         master    21h       v1.10.0+b81c8f8
ip-172-31-60-77.us-west-2.compute.internal    NotReady,SchedulingDisabled   compute   21h       v1.10.0+b81c8f8


Version-Release number of selected component (if applicable):
openshift v3.10.0-0.67.0

Steps to Reproduce:
1. create 3.9 crio runtime cluster
2. upgrade to 3.10 cluster

Actual results:
Upgrade fails

Expected results:
Upgrade should complete

Additional info:
I will attach the ansible logs and node journal logs.

Comment 1 Vikas Laad 2018-06-15 13:10:18 UTC
Created attachment 1451924 [details]
ansible log with -vvv

Comment 2 Vikas Laad 2018-06-15 13:10:59 UTC
Created attachment 1451925 [details]
node logs

Comment 3 Ben Bennett 2018-06-15 13:28:08 UTC
Casey: That error is probably not the root cause.  It shows up until the sdn startup completes... so something else is preventing it from completing.

Comment 4 Vikas Laad 2018-06-15 15:04:49 UTC
Note: Upgrade completes if I run it again.

Comment 5 Casey Callendrello 2018-06-15 15:09:33 UTC
It looks like the openshift-sdn pod is crashlooping. Can you either get the termination message from the apiserver or from docker logs directly?

Comment 6 Casey Callendrello 2018-06-15 15:52:04 UTC
Ah, now that the node is upgraded, I don't know that we'll be able to debug. If you're lucky, you can grab the docker logs of the crashlooped pods.

Comment 7 Vikas Laad 2018-06-15 17:11:20 UTC
I will try to reproduce it and update the bz.

Comment 8 Casey Callendrello 2018-06-18 14:48:47 UTC
Vikas set me up with another node in the failed state. CRI-O had decided it no longer had a valid CNI configuration:

Jun 15 17:21:40 ip-172-31-32-59.us-west-2.compute.internal systemd[1]: Starting Open Container Initiative Daemon...
Jun 15 17:21:40 ip-172-31-32-59.us-west-2.compute.internal crio[17222]: time="2018-06-15 17:21:40.343834130Z" level=warning msg="hooks path: "/etc/containers/oci/hooks.d" does not exist"
Jun 15 17:21:40 ip-172-31-32-59.us-west-2.compute.internal crio[17222]: time="2018-06-15 17:21:40.343965562Z" level=error msg="error updating cni config: Missing CNI default network"
Jun 15 17:21:40 ip-172-31-32-59.us-west-2.compute.internal crio[17222]: time="2018-06-15 17:21:40.343975143Z" level=error msg="Missing CNI default network"
Jun 15 17:21:40 ip-172-31-32-59.us-west-2.compute.internal crio[17222]: time="2018-06-15 17:21:40.344374186Z" level=error msg="no such file or directory"
Jun 15 17:21:40 ip-172-31-32-59.us-west-2.compute.internal crio[17222]: time="2018-06-15 17:21:40.344434293Z" level=error msg="CNI monitoring error <nil>"
Jun 15 17:21:40 ip-172-31-32-59.us-west-2.compute.internal systemd[1]: Started Open Container Initiative Daemon.

I checked that there was a valid CNI configuration in /etc/cni/net.d, and restarted crio. The node is now Ready.

It looks like the upgrade process is somehow racing with CRI-O to parse the network configuration, and CRI-O, unlike the kubelet, gives up if parsing fails.
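
For context, the "Missing CNI default network" errors in the crio log above correspond to the runtime scanning its CNI configuration directory and finding no usable network. The following is a minimal Go sketch of that kind of check, simplified for illustration only (it is not the actual ocicni implementation; the directory path and file extensions are assumptions based on the logs above):

package main

import (
	"fmt"
	"os"
	"path/filepath"
	"sort"
)

// findDefaultNetwork is a simplified stand-in for the check that logs
// "Missing CNI default network": scan the conf directory for CNI config
// files and treat the first one (lexicographically) as the default network.
func findDefaultNetwork(confDir string) (string, error) {
	var files []string
	for _, pattern := range []string{"*.conf", "*.conflist", "*.json"} {
		matches, err := filepath.Glob(filepath.Join(confDir, pattern))
		if err != nil {
			return "", err
		}
		files = append(files, matches...)
	}
	if len(files) == 0 {
		// This is the state crio was stuck in: no config yet, so no default network.
		return "", fmt.Errorf("Missing CNI default network")
	}
	sort.Strings(files)
	return files[0], nil // e.g. the config file written by openshift-sdn
}

func main() {
	conf, err := findDefaultNetwork("/etc/cni/net.d")
	if err != nil {
		fmt.Fprintln(os.Stderr, "error updating cni config:", err)
		os.Exit(1)
	}
	fmt.Println("default CNI network config:", conf)
}

The check itself is harmless at startup; the real problem, analyzed later in this bug, is that crio stopped re-running it when new config files appeared.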

Comment 9 Casey Callendrello 2018-06-19 13:37:35 UTC
*** Bug 1592781 has been marked as a duplicate of this bug. ***

Comment 10 Casey Callendrello 2018-06-19 13:54:47 UTC
I wonder if this is the problem: https://github.com/openshift/openshift-ansible/blob/master/roles/container_runtime/tasks/package_crio.yml#L66

Ansible shouldn't be writing that file.

Comment 11 Vikas Laad 2018-06-19 13:57:22 UTC
Should this go to upgrade component ?

Comment 13 Scott Dodson 2018-06-19 14:07:29 UTC
(In reply to Vikas Laad from comment #11)
> Should this go to upgrade component ?

No, let's leave it in networking. They have the expertise to assert the exact steps that need to happen to upgrade successfully, and if they need help implementing those in Ansible then we can help with that.

Comment 14 Weihua Meng 2018-06-20 08:06:09 UTC
The cri-o container runtime upgrade is blocked by this bug.

Please double-confirm the decision not to fix this in v3.10.

Thanks

Comment 15 Ben Bennett 2018-06-20 19:13:43 UTC
Sorry, I didn't realize cri-o was supported in 3.9.  Yes, this needs to be resolved for 3.10.0 :-(

Comment 16 Casey Callendrello 2018-06-20 21:08:55 UTC
Just to summarize my findings (in this and 1592781):

Both machines were in "network not ready" state. crio and the sdn pod were running. The SDN pod had written /etc/cni/net.d/80-(something).conf. However, crio was reporting "network not ready" to the kubelet.

The only thing I did to fix this was restart crio. It immediately reported a ready status to the kubelet, and the node was back in action.

So my analysis is that there's something wrong with crio's cni driver.

Comment 17 Dan Williams 2018-06-21 17:01:24 UTC
We'd need to see some CRIO log output to debug this further...

Comment 18 Dan Williams 2018-06-21 17:01:39 UTC
Ideally at debug level too.

Comment 19 Ben Bennett 2018-06-21 17:59:16 UTC
Mrunal: FYI in case you have any ideas about what might be going on.

Comment 20 Weihua Meng 2018-06-22 02:35:08 UTC
The attachment on bug 1592781 has the cri-o upgrade log (the instances are gone); hope it helps.

Comment 21 Vikas Laad 2018-06-22 02:49:52 UTC
Created attachment 1453643 [details]
journal logs with crio in debug

Attached node logs with crio at debug log level.

Comment 22 Dan Williams 2018-06-25 15:16:47 UTC
This appears to be a bug in cri-o's vendored ocicni that has already been fixed upstream. crio will need to grab ocicni commit b27166bd or just revendor to the latest ocicni. See https://github.com/cri-o/ocicni/pull/21

Comment 23 Dan Williams 2018-06-25 15:19:11 UTC
To elaborate, when CRIO restarts we see:

Jun 21 22:32:06 ip-172-31-61-158.us-west-2.compute.internal crio[125075]: time="2018-06-21 22:32:06.288512456Z" level=error msg="error updating cni config: Missing CNI default network"
Jun 21 22:32:06 ip-172-31-61-158.us-west-2.compute.internal crio[125075]: time="2018-06-21 22:32:06.288524548Z" level=error msg="Missing CNI default network"
Jun 21 22:32:06 ip-172-31-61-158.us-west-2.compute.internal crio[125075]: time="2018-06-21 22:32:06.288862594Z" level=error msg="no such file or directory"
Jun 21 22:32:06 ip-172-31-61-158.us-west-2.compute.internal crio[125075]: time="2018-06-21 22:32:06.288960523Z" level=debug msg="CNI monitoring event "": "
Jun 21 22:32:06 ip-172-31-61-158.us-west-2.compute.internal crio[125075]: time="2018-06-21 22:32:06.288983165Z" level=error msg="CNI monitoring error <nil>"

Apparently the Go inotify watcher code will happily return 'nil' errors, which ocicni's directory monitor would previously treat as a real error and cease watching /etc/cni/net.d. Thus, after the "CNI monitoring error" statement, CRIO will not pick up any new files written to /etc/cni/net.d, including the openshift-sdn one that would allow the node to become ready.
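
For illustration, here is a minimal Go sketch of the kind of monitoring loop the fix amounts to, assuming an fsnotify-style watcher (this is not the actual ocicni code): a nil value received on the error channel has to be skipped rather than treated as fatal, so the watch on /etc/cni/net.d survives and later config files are picked up.

package main

import (
	"log"

	"github.com/fsnotify/fsnotify"
)

// monitorConfDir watches a CNI configuration directory and invokes reload
// whenever its contents change. The bug described above is equivalent to
// returning from this loop when a nil error arrives on watcher.Errors;
// the fix is to skip nil values and keep watching.
func monitorConfDir(confDir string, reload func()) error {
	watcher, err := fsnotify.NewWatcher()
	if err != nil {
		return err
	}
	defer watcher.Close()

	if err := watcher.Add(confDir); err != nil {
		return err
	}

	for {
		select {
		case ev := <-watcher.Events:
			// A newly written .conf/.conflist (e.g. the one openshift-sdn
			// drops into /etc/cni/net.d) triggers a reload so the default
			// network can be found.
			log.Printf("CNI monitoring event %v", ev)
			reload()
		case err := <-watcher.Errors:
			if err == nil {
				// Spurious nil error: ignore it instead of giving up,
				// otherwise no later config changes are ever seen.
				continue
			}
			log.Printf("CNI monitoring error %v", err)
			return err
		}
	}
}

func main() {
	// Watch the directory crio reads its CNI configuration from and log reloads.
	err := monitorConfDir("/etc/cni/net.d", func() {
		log.Println("reloading CNI network configuration")
	})
	log.Fatal(err)
}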

Comment 24 Mrunal Patel 2018-06-25 16:37:26 UTC
I will

Comment 25 Mrunal Patel 2018-06-25 16:38:07 UTC
I am working on vendoring in latest ocicni. Thanks!

Comment 26 Mrunal Patel 2018-06-25 17:10:40 UTC
Opened - https://github.com/kubernetes-incubator/cri-o/pull/1642/files
We will cut a new cri-o release after that merges.

Comment 27 Weihua Meng 2018-07-02 08:22:08 UTC
Upgrade failed.
cri-o-1.10.4-1.rhaos3.10.gitebaa77a.el7.x86_64
openshift-ansible-3.10.10-1.git.248.0bb6b58.el7.noarch.rpm

                        {
                            "lastHeartbeatTime": "2018-07-02T08:03:17Z", 
                            "lastTransitionTime": "2018-07-02T07:59:43Z", 
                            "message": "runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: cni config uninitialized", 
                            "reason": "KubeletNotReady", 
                            "status": "False", 
                            "type": "Ready"
                        }, 


[root@wmengug75crio22-master-etcd-1 ~]# oc get node -o wide
NAME                             STATUS     ROLES     AGE       VERSION             EXTERNAL-IP       OS-IMAGE                                      KERNEL-VERSION          CONTAINER-RUNTIME
wmengug75crio22-master-etcd-1    Ready      master    54m       v1.10.0+b81c8f8     104.197.34.22     Red Hat Enterprise Linux Server 7.5 (Maipo)   3.10.0-862.el7.x86_64   cri-o://1.10.4
wmengug75crio22-master-etcd-2    NotReady   master    54m       v1.10.0+b81c8f8     35.232.174.34     Red Hat Enterprise Linux Server 7.5 (Maipo)   3.10.0-862.el7.x86_64   cri-o://1.10.4
wmengug75crio22-master-etcd-3    Ready      master    54m       v1.9.1+a0ce1bc657   35.193.201.112    Red Hat Enterprise Linux Server 7.5 (Maipo)   3.10.0-862.el7.x86_64   cri-o://1.9.12
wmengug75crio22-node-primary-1   Ready      compute   54m       v1.9.1+a0ce1bc657   35.202.213.206    Red Hat Enterprise Linux Server 7.5 (Maipo)   3.10.0-862.el7.x86_64   cri-o://1.9.12
wmengug75crio22-node-primary-2   Ready      compute   54m       v1.9.1+a0ce1bc657   104.197.231.240   Red Hat Enterprise Linux Server 7.5 (Maipo)   3.10.0-862.el7.x86_64   cri-o://1.9.12
wmengug75crio22-node-primary-3   Ready      compute   54m       v1.9.1+a0ce1bc657   35.226.94.240     Red Hat Enterprise Linux Server 7.5 (Maipo)   3.10.0-862.el7.x86_64   cri-o://1.9.12
wmengug75crio22-nrri-1           Ready      <none>    54m       v1.9.1+a0ce1bc657   35.224.255.164    Red Hat Enterprise Linux Server 7.5 (Maipo)   3.10.0-862.el7.x86_64   cri-o://1.9.12
wmengug75crio22-nrri-2           Ready      <none>    54m       v1.9.1+a0ce1bc657   35.232.14.94      Red Hat Enterprise Linux Server 7.5 (Maipo)   3.10.0-862.el7.x86_64   cri-o://1.9.12

Comment 29 Antonio Murdaca 2018-07-02 11:03:37 UTC
Fix is in https://github.com/kubernetes-incubator/cri-o/releases/tag/v1.10.5
We need a new build and a re-test, please (the build is probably already available).

Comment 31 Weihua Meng 2018-07-03 09:26:01 UTC
Fixed.
cri-o-1.10.5-1.rhaos3.10.git4c81456.el7.x86_64
openshift-ansible-3.10.10-1.git.248.0bb6b58.el7.noarch.rpm

                        {
                            "lastHeartbeatTime": "2018-07-03T09:03:03Z", 
                            "lastTransitionTime": "2018-07-03T09:03:03Z", 
                            "message": "kubelet is posting ready status", 
                            "reason": "KubeletReady", 
                            "status": "True", 
                            "type": "Ready"
                        }, 

Kernel Version: 3.10.0-862.el7.x86_64
Operating System: Red Hat Enterprise Linux Server 7.5 (Maipo)

Although the upgrade failed due to bug 1596818, it is not a crio problem.

