1698624 – New nodes occasionally remain unready forever: cni config uninitialized

Keywords:
Status:	CLOSED NOTABUG
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Networking
Sub Component:
Version:	4.1.0
Hardware:	Unspecified
OS:	Unspecified
Priority:	low
Severity:	low
Target Milestone:	---
Target Release:	---
Assignee:	Casey Callendrello
QA Contact:	zhaozhanqi
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2019-04-10 19:13 UTC by Alberto
Modified:	2019-08-26 17:22 UTC
CC List:	4 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2019-08-26 17:22:38 UTC
Target Upstream Version:
Embargoed:

Attachments
Error matches from the past 24 hours (480.03 KB, application/json) 2019-04-11 09:43 UTC, W. Trevor King

Description Alberto 2019-04-10 19:13:36 UTC

Description of problem:
Newly created nodes occasionally remain unready forever.

Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1. Scale a machineSet
2. Wait for the nodes to join. Observe that a node remains unready forever.

Actual results:
Node remains unready

  Ready            False   Wed, 10 Apr 2019 20:42:45 +0200   Wed, 10 Apr 2019 20:40:35 +0200   KubeletNotReady              runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: cni config uninitialized
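
A minimal sketch of how the condition above could be checked programmatically, assuming the node object is available as JSON (for example from `oc get node <name> -o json`). The function name `stuck_on_cni` and the 10-minute grace period are illustrative assumptions, not part of any tool; the grace period exists because this state is expected for a short time while a node boots.

```python
from datetime import datetime, timedelta, timezone

CNI_MARKER = "cni config uninitialized"


def stuck_on_cni(node, now, grace=timedelta(minutes=10)):
    """Return True if the node's Ready condition has been not-True with the
    CNI-uninitialized message for longer than `grace` (an assumed threshold)."""
    for cond in node.get("status", {}).get("conditions", []):
        if cond.get("type") != "Ready":
            continue
        if cond.get("status") == "True":
            return False
        if CNI_MARKER not in cond.get("message", ""):
            return False
        # lastTransitionTime is an RFC 3339 UTC timestamp in the node JSON.
        since = datetime.strptime(
            cond["lastTransitionTime"], "%Y-%m-%dT%H:%M:%SZ"
        ).replace(tzinfo=timezone.utc)
        return now - since > grace
    return False


# Sample node built from the condition reported above (times converted to UTC).
node = {
    "status": {
        "conditions": [
            {
                "type": "Ready",
                "status": "False",
                "reason": "KubeletNotReady",
                "message": "runtime network not ready: NetworkReady=false "
                "reason:NetworkPluginNotReady message:Network plugin "
                "returns error: cni config uninitialized",
                "lastTransitionTime": "2019-04-10T18:40:35Z",
            }
        ]
    }
}

print(stuck_on_cni(node, datetime.now(timezone.utc)))
```

With a short elapsed time the function reports the node as still within the normal start-up window; only after the grace period does it flag the node as stuck.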


Expected results:
Node becomes Ready

Additional info:

Comment 1 W. Trevor King 2019-04-10 19:19:31 UTC

Some discussion of this in bug 1591752, although I'm not sure if this is a new issue or that one coming back.  Example job [1].  We've seen it 19 times in the past ~24 hours (in ~4% of all failed *-e2e-aws* jobs).

[1]: https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/openshift_machine-api-operator/277/pull-ci-openshift-machine-api-operator-master-e2e-aws-operator/755

Comment 2 Casey Callendrello 2019-04-11 09:20:12 UTC

Definitely not good. Can you put links to any e2e jobs that fail due to this? Keep in mind, this state is normal for short periods of time as nodes come up and reboot.

https://github.com/openshift/machine-config-operator/pull/604 added a tmpfiles change to remove this file on boot via systemd-tmpfiles, so that's a likely cause. That "shouldn't" cause it, of course, but computers being what they are...

cc phil, who wrote that change.

Comment 3 Casey Callendrello 2019-04-11 09:39:37 UTC

Some debugging notes from the above e2e run:

The not-ready node is ip-10-0-131-166.ec2.internal. Its SDN pod is sdn-qldcw. However, that pod has never run. It has 3 events:

- Successfully assigned openshift-sdn/sdn-qldcw to ip-10-0-131-166.ec2.internal
- Failed create pod sandbox: rpc error: code = Unknown desc = error creating pod sandbox with name \"k8s_sdn-qldcw_openshift-sdn_7d590162-5bb4-11e9-a45a-12239d9bef42_0\": Manifest does not match provided manifest digest sha256:d61921e97f2850bcff984990dae93e9e4ada5b34e2c0f70985e84b91d4090880
- Created pod: sdn-qldcw (this is from the DaemonSet controller and not meaningful)

So the question is: why did starting the SDN pod fail?

Comment 4 W. Trevor King 2019-04-11 09:43:28 UTC

Created attachment 1554466
Error matches from the past 24 hours

$ jq '. | keys' errors.json 
[
  "https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/openshift_cluster-api-provider-aws/188/pull-ci-openshift-cluster-api-provider-aws-master-e2e-aws-operator/415",
  "https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/openshift_cluster-api-provider-aws/188/pull-ci-openshift-cluster-api-provider-aws-master-e2e-aws-operator/416",
  "https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/openshift_cluster-api-provider-aws/188/pull-ci-openshift-cluster-api-provider-aws-master-e2e-aws-operator/418",
  "https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/openshift_cluster-api-provider-aws/188/pull-ci-openshift-cluster-api-provider-aws-master-e2e-aws-operator/420",
  "https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/openshift_cluster-api-provider-aws/192/pull-ci-openshift-cluster-api-provider-aws-master-e2e-aws-operator/419",
  "https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/openshift_cluster-autoscaler-operator/86/pull-ci-openshift-cluster-autoscaler-operator-master-e2e-aws-operator/260",
  "https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/openshift_machine-api-operator/261/pull-ci-openshift-machine-api-operator-master-e2e-aws-operator/750",
  "https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/openshift_machine-api-operator/261/pull-ci-openshift-machine-api-operator-master-e2e-aws-operator/753",
  "https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/openshift_machine-api-operator/261/pull-ci-openshift-machine-api-operator-master-e2e-aws-operator/760",
  "https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/openshift_machine-api-operator/261/pull-ci-openshift-machine-api-operator-master-e2e-aws-operator/762",
  "https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/openshift_machine-api-operator/275/pull-ci-openshift-machine-api-operator-master-e2e-aws-operator/749",
  "https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/openshift_machine-api-operator/275/pull-ci-openshift-machine-api-operator-master-e2e-aws-operator/752",
  "https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/openshift_machine-api-operator/276/pull-ci-openshift-machine-api-operator-master-e2e-aws-operator/748",
  "https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/openshift_machine-api-operator/276/pull-ci-openshift-machine-api-operator-master-e2e-aws-operator/751",
  "https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/openshift_machine-api-operator/276/pull-ci-openshift-machine-api-operator-master-e2e-aws-operator/754",
  "https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/openshift_machine-api-operator/276/pull-ci-openshift-machine-api-operator-master-e2e-aws-operator/764",
  "https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/openshift_machine-api-operator/277/pull-ci-openshift-machine-api-operator-master-e2e-aws-operator/755"
]

The schema for the full file is job URI -> build-log regexp -> matched build-log content.  Generated with [1,2].

[1]: https://github.com/wking/openshift-release/tree/deck-d3/d3
[2]: curl -s 'http://localhost:8000/search?q=E.*Network+plugin+returns+error:+cni+config+uninitialized' | jq . >errors.json
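
As a rough sketch of how a file with this schema could be summarized, the following counts matching jobs per repository and pull request. It relies only on the top-level keys (job URIs); the shape of the nested values is not assumed, so the sample uses empty placeholders. The names `JOB_RE` and `jobs_per_pull` are illustrative.

```python
import re
from collections import Counter

# Matches the tail of a job URI: .../pr-logs/pull/<repo>/<pr>/<job>/<build>
JOB_RE = re.compile(
    r"/pr-logs/pull/(?P<repo>[^/]+)/(?P<pr>\d+)/(?P<job>[^/]+)/(?P<build>\d+)$"
)


def jobs_per_pull(errors):
    """Count failing jobs per (repository, pull request) from the URI keys."""
    counts = Counter()
    for uri in errors:
        match = JOB_RE.search(uri)
        if match:
            counts[(match["repo"], int(match["pr"]))] += 1
    return counts


BASE = "https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/"
JOB = "pull-ci-openshift-machine-api-operator-master-e2e-aws-operator"

# Three of the URIs listed above; the nested values are placeholders.
sample = {
    BASE + "openshift_machine-api-operator/276/" + JOB + "/748": {},
    BASE + "openshift_machine-api-operator/276/" + JOB + "/751": {},
    BASE + "openshift_machine-api-operator/277/" + JOB + "/755": {},
}

for (repo, pr), n in sorted(jobs_per_pull(sample).items()):
    print(repo, pr, n)
```

To run it against the real attachment, replace `sample` with the result of `json.load` on errors.json.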

Comment 5 Casey Callendrello 2019-04-11 09:53:32 UTC

Almost certainly caused by bug 1669096, which is now fixed. I will let this sit for a few days and see whether it recurs.

Comment 6 Phil Cameron 2019-04-11 13:47:56 UTC

Casey, re comment 2: I wrote the change. It fixes bug 1654044. On a new node the file should not exist, so PR 604 shouldn't matter.

Comment 7 Brad Ison 2019-04-11 14:56:10 UTC

Looks like bug 1669096 was supposed to have been fixed almost 2 months ago in CRI-O 1.12.5-6. I can easily reproduce this on a cluster created yesterday with newer CRI-O. All pods on the node are stuck in ContainerCreating or Init:0/2 status and have the same error:

Failed create pod sandbox: rpc error: code = Unknown desc = error creating pod sandbox with name "k8s_sdn-f82k7_openshift-sdn_b2c67192-5c65-11e9-b425-0293ee4fe3b8_0": Manifest does not match provided manifest digest sha256:d61921e97f2850bcff984990dae93e9e4ada5b34e2c0f70985e84b91d4090880

That includes pods unrelated to SDN.
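
One way to confirm that every affected pod fails on the same image is to group the sandbox-failure messages by digest. The sketch below is illustrative only: `pods_by_digest` is a hypothetical helper, and it assumes sandbox names follow the `k8s_<pod>_<namespace>_<uid>_<attempt>` pattern seen in the messages quoted in this bug.

```python
import re
from collections import defaultdict

DIGEST_RE = re.compile(
    r"Manifest does not match provided manifest digest (sha256:[0-9a-f]+)"
)
# Pod and namespace names cannot contain underscores, so "_" is a safe separator.
SANDBOX_RE = re.compile(r"k8s_([^_]+)_([^_]+)_")


def pods_by_digest(messages):
    """Map each failing manifest digest to the set of namespace/pod names."""
    groups = defaultdict(set)
    for msg in messages:
        digest = DIGEST_RE.search(msg)
        sandbox = SANDBOX_RE.search(msg)
        if digest and sandbox:
            pod, namespace = sandbox.group(1), sandbox.group(2)
            groups[digest.group(1)].add(f"{namespace}/{pod}")
    return dict(groups)


DIGEST = "sha256:d61921e97f2850bcff984990dae93e9e4ada5b34e2c0f70985e84b91d4090880"

# The two event messages quoted in this bug.
messages = [
    "Failed create pod sandbox: rpc error: code = Unknown desc = error creating "
    'pod sandbox with name "k8s_sdn-qldcw_openshift-sdn_'
    '7d590162-5bb4-11e9-a45a-12239d9bef42_0": Manifest does not match provided '
    "manifest digest " + DIGEST,
    "Failed create pod sandbox: rpc error: code = Unknown desc = error creating "
    'pod sandbox with name "k8s_sdn-f82k7_openshift-sdn_'
    'b2c67192-5c65-11e9-b425-0293ee4fe3b8_0": Manifest does not match provided '
    "manifest digest " + DIGEST,
]

for digest, pods in pods_by_digest(messages).items():
    print(digest, sorted(pods))
```

A single digest shared by unrelated pods would point at one common image (such as the sandbox image) rather than at any individual workload.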

Some info from the node:

 Kernel Version:                          4.18.0-80.el8.x86_64
 OS Image:                                Red Hat Enterprise Linux CoreOS 410.8.20190408.1 (Ootpa)                                                                                   
 Operating System:                        linux
 Architecture:                            amd64
 Container Runtime Version:               cri-o://1.13.4-3.rhaos4.1.git30006b3.el8
 Kubelet Version:                         v1.13.4+1ad602308
 Kube-Proxy Version:                      v1.13.4+1ad602308

Comment 8 W. Trevor King 2019-04-11 15:09:01 UTC

Bug 1698253 was the recent CRI-O issue.
