Bug 1753801
| Summary: | 4.2 to 4.2 upgrade caused crio to error with 'error loading cached network config' | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Ryan Phillips <rphillips> |
| Component: | Networking | Assignee: | Dan Williams <dcbw> |
| Networking sub component: | openshift-sdn | QA Contact: | zhaozhanqi <zzhao> |
| Status: | CLOSED DUPLICATE | Docs Contact: | |
| Severity: | urgent | | |
| Priority: | urgent | CC: | aos-bugs, bbennett, ccoleman, cdc, danw, dcbw, jokerman, mcambria, mpatel, yinzhou |
| Version: | 4.2.0 | Keywords: | Reopened |
| Target Milestone: | --- | | |
| Target Release: | 4.2.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2019-09-23 20:14:43 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Ryan Phillips
2019-09-19 22:36:27 UTC
The errors appear to be from the code that was merged yesterday to get in the ocicni fixes: https://github.com/cri-o/cri-o/pull/2800. Those errors are really more like warnings: they are only true errors when the cached state was expected to be present. The latest change added the new state, so when upgrading to the latest there is nothing to read yet, and the old behavior is expected.

But that doesn't explain why node ip-10-0-128-148.ec2.internal didn't become ready.

(In reply to Ben Bennett from comment #2)
> But that doesn't explain why node ip-10-0-128-148.ec2.internal didn't become
> ready.

I think it does; you were looking at the sdn-c9g8w logs, but note that that pod started a few minutes *after* e2e-aws-upgrade declared defeat.

https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_insights-operator/16/pull-ci-openshift-insights-operator-master-e2e-aws-upgrade/95/build-log.txt:

Sep 19 18:05:16.404: INFO: Unexpected error occurred: Cluster did not complete upgrade: timed out waiting for the condition: Cluster operator kube-controller-manager is reporting a failure: NodeControllerDegraded: The master node(s) "ip-10-0-128-148.ec2.internal" not ready

https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_insights-operator/16/pull-ci-openshift-insights-operator-master-e2e-aws-upgrade/95/artifacts/e2e-aws-upgrade/pods/openshift-sdn_sdn-c9g8w_sdn.log:

I0919 18:08:44.009567 97623 node.go:145] Initializing SDN node of type "redhat/openshift-ovs-networkpolicy" with configured hostname "ip-10-0-128-148.ec2.internal" (IP "10.0.128.148")

So the failure there is irrelevant; it happened while the cluster was in the process of being shut down. I think the cri-o bug is the only bug here.
The bug Dan is referring to is https://bugzilla.redhat.com/show_bug.cgi?id=1753988

*** This bug has been marked as a duplicate of bug 1753988 ***

I think I found the issue: https://github.com/cri-o/ocicni/pull/62

*** This bug has been marked as a duplicate of bug 1753988 ***

*** This bug has been marked as a duplicate of bug 1754434 ***

Unmarking as duplicate. Sorry for the spam. The networking on the node failed to start.

Dupe of https://bugzilla.redhat.com/show_bug.cgi?id=1754638

*** This bug has been marked as a duplicate of bug 1754638 ***