Bug 1886842 - Bootstrap failed to complete: awaiting 3 nodes: No CNI configuration file in /etc/kubernetes/cni/net.d/
Summary: Bootstrap failed to complete: awaiting 3 nodes: No CNI configuration file in /etc/kubernetes/cni/net.d/
Keywords:
Status: CLOSED DUPLICATE of bug 1886834
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.6
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 4.7.0
Assignee: Douglas Smith
QA Contact: Anurag saxena
URL:
Whiteboard:
Duplicates: 1886840 (view as bug list)
Depends On:
Blocks:
 
Reported: 2020-10-09 14:03 UTC by Jing Zhang
Modified: 2020-10-12 21:24 UTC (History)
CC List: 5 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
operator install console operator install monitoring
Last Closed: 2020-10-12 21:24:01 UTC
Target Upstream Version:
Embargoed:



Description Jing Zhang 2020-10-09 14:03:09 UTC
test:
operator install console 

is failing frequently in CI, see search results:
https://search.ci.openshift.org/?maxAge=168h&context=1&type=bug%2Bjunit&name=&maxMatches=5&maxBytes=20971520&groupBy=job&search=operator+install+console


FIXME: Replace this paragraph with a particular job URI from the search results to ground discussion. A given test may fail for several reasons, and this bug should be scoped to one of those reasons. Ideally you'd pick a job showing the most common reason, but since that's hard to determine, you may also choose a job at random. Release-gating jobs (release-openshift-...) should be preferred over presubmits (pull-ci-...) because they are closer to the released product and less likely to have in-flight code changes that complicate analysis.

FIXME: Provide a snippet of the test failure or error from the job log

Comment 1 W. Trevor King 2020-10-10 04:07:15 UTC
Two FIXMEs still in comment 0.  The bug was reported against 4.7, but I see no 4.7 OCP release jobs failing with this (although there are some 4.7 OKD jobs failing with this):

$ w3m -dump -cols 200 'https://search.ci.openshift.org/?search=operator+install+console&maxAge=168h&type=junit&groupBy=job' | grep 'failures match' | sort
periodic-ci-kube-reporting-metering-operator-master-metering-periodic-aws - 7 runs, 100% failed, 14% of failures match
periodic-ci-kube-reporting-metering-operator-release-4.7-metering-periodic-aws - 7 runs, 100% failed, 14% of failures match
periodic-ci-kubernetes-conformance-k8s - 7 runs, 29% failed, 100% of failures match
...
promote-release-openshift-machine-os-content-e2e-aws-4.5 - 435 runs, 6% failed, 4% of failures match
promote-release-openshift-machine-os-content-e2e-aws-4.6 - 445 runs, 7% failed, 6% of failures match
promote-release-openshift-okd-machine-os-content-e2e-gcp-4.6 - 82 runs, 51% failed, 64% of failures match
...
release-openshift-ocp-installer-e2e-aws-4.4 - 12 runs, 17% failed, 50% of failures match
release-openshift-ocp-installer-e2e-aws-csi-4.5 - 7 runs, 71% failed, 20% of failures match
release-openshift-ocp-installer-e2e-aws-mirrors-4.4 - 7 runs, 57% failed, 25% of failures match
release-openshift-ocp-installer-e2e-aws-ovn-4.6 - 27 runs, 78% failed, 14% of failures match
release-openshift-ocp-installer-e2e-aws-serial-4.1 - 7 runs, 14% failed, 100% of failures match
release-openshift-ocp-installer-e2e-aws-serial-4.4 - 18 runs, 61% failed, 9% of failures match
release-openshift-ocp-installer-e2e-aws-upi-4.5 - 20 runs, 40% failed, 13% of failures match
release-openshift-ocp-installer-e2e-azure-4.4 - 12 runs, 42% failed, 20% of failures match
release-openshift-ocp-installer-e2e-azure-ovn-4.6 - 27 runs, 56% failed, 27% of failures match
release-openshift-ocp-installer-e2e-azure-serial-4.6 - 28 runs, 18% failed, 20% of failures match
release-openshift-ocp-installer-e2e-gcp-4.6 - 31 runs, 13% failed, 25% of failures match
release-openshift-ocp-installer-e2e-gcp-ovn-4.6 - 27 runs, 52% failed, 29% of failures match
release-openshift-ocp-installer-e2e-gcp-rt-4.4 - 14 runs, 100% failed, 79% of failures match
release-openshift-ocp-installer-e2e-gcp-rt-4.5 - 3 runs, 100% failed, 100% of failures match
release-openshift-ocp-installer-e2e-gcp-serial-4.5 - 20 runs, 30% failed, 33% of failures match
release-openshift-ocp-installer-e2e-metal-4.6 - 31 runs, 35% failed, 9% of failures match
release-openshift-ocp-installer-e2e-metal-serial-4.5 - 20 runs, 35% failed, 43% of failures match
release-openshift-ocp-installer-e2e-metal-serial-4.6 - 28 runs, 32% failed, 33% of failures match
release-openshift-ocp-installer-e2e-openstack-4.4 - 14 runs, 57% failed, 13% of failures match
release-openshift-ocp-installer-e2e-openstack-4.5 - 14 runs, 50% failed, 14% of failures match
release-openshift-ocp-installer-e2e-openstack-serial-4.4 - 15 runs, 67% failed, 20% of failures match
release-openshift-ocp-installer-e2e-openstack-serial-4.5 - 14 runs, 50% failed, 43% of failures match
release-openshift-ocp-installer-e2e-ovirt-4.5 - 38 runs, 50% failed, 11% of failures match
release-openshift-ocp-installer-e2e-ovirt-4.6 - 67 runs, 61% failed, 10% of failures match
release-openshift-okd-installer-e2e-aws-4.5 - 16 runs, 25% failed, 25% of failures match
release-openshift-okd-installer-e2e-aws-4.6 - 51 runs, 88% failed, 60% of failures match
release-openshift-origin-installer-e2e-aws-ovn-network-stress-4.5 - 21 runs, 14% failed, 33% of failures match
release-openshift-origin-installer-e2e-aws-sdn-network-stress-4.4 - 21 runs, 10% failed, 50% of failures match
release-openshift-origin-installer-e2e-aws-serial-4.6 - 40 runs, 25% failed, 10% of failures match
release-openshift-origin-installer-e2e-azure-4.6 - 44 runs, 50% failed, 9% of failures match
release-openshift-origin-installer-e2e-azure-shared-vpc-4.5 - 6 runs, 67% failed, 25% of failures match
release-openshift-origin-installer-e2e-gcp-4.5 - 25 runs, 16% failed, 25% of failures match
release-openshift-origin-installer-e2e-gcp-4.7 - 54 runs, 35% failed, 47% of failures match
release-openshift-origin-installer-e2e-gcp-serial-4.4 - 11 runs, 82% failed, 11% of failures match
release-openshift-origin-installer-e2e-gcp-shared-vpc-4.4 - 7 runs, 43% failed, 33% of failures match
release-openshift-origin-installer-launch-aws - 207 runs, 50% failed, 4% of failures match
release-openshift-origin-installer-launch-gcp - 453 runs, 41% failed, 5% of failures match

The last 24h have been better:

$ w3m -dump -cols 200 'https://search.ci.openshift.org/?search=operator+install+console&maxAge=24h&type=junit&groupBy=job&name=release-openshift-ocp-' | grep 'failures match' | sort
release-openshift-ocp-installer-e2e-aws-ovn-4.6 - 4 runs, 75% failed, 33% of failures match
release-openshift-ocp-installer-e2e-aws-upi-4.5 - 6 runs, 33% failed, 50% of failures match
release-openshift-ocp-installer-e2e-gcp-ovn-4.6 - 4 runs, 100% failed, 50% of failures match
release-openshift-ocp-installer-e2e-gcp-rt-4.4 - 2 runs, 100% failed, 100% of failures match
release-openshift-ocp-installer-e2e-gcp-serial-4.5 - 6 runs, 50% failed, 67% of failures match
release-openshift-ocp-installer-e2e-metal-serial-4.5 - 6 runs, 33% failed, 50% of failures match
release-openshift-ocp-installer-e2e-openstack-serial-4.5 - 2 runs, 100% failed, 50% of failures match
release-openshift-ocp-installer-e2e-ovirt-4.5 - 7 runs, 29% failed, 50% of failures match
release-openshift-ocp-installer-e2e-ovirt-4.6 - 10 runs, 80% failed, 25% of failures match

Also note that this is really a post-test state check, not a post-install state check [1].  Picking on release-openshift-ocp-installer-e2e-gcp-ovn-4.6, here are some jobs:

$ curl -s 'https://search.ci.openshift.org/search?search=operator+install+console&maxAge=24h&type=junit&groupBy=job&name=release-openshift-ocp-installer-e2e-gcp-ovn-4.6' | jq -r 'keys[]'
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-gcp-ovn-4.6/1314410856292814848
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-gcp-ovn-4.6/1314697955512422400

The latter blew up before bootstrap-complete:

level=error msg="Cluster operator network Degraded is True with RolloutHung: DaemonSet \"openshift-ovn-kubernetes/ovnkube-node\" rollout is not making progress - last change 2020-10-09T22:56:39Z"
level=info msg="Cluster operator network Progressing is True with Deploying: DaemonSet \"openshift-multus/network-metrics-daemon\" is not available (awaiting 3 nodes)\nDaemonSet \"openshift-multus/multus-admission-controller\" is waiting for other operators to become ready\nDaemonSet \"openshift-ovn-kubernetes/ovnkube-node\" is not available (awaiting 3 nodes)"
level=info msg="Cluster operator network Available is False with Startup: The network is starting up"
level=info msg="Pulling debug logs from the bootstrap machine"
level=info msg="Bootstrap gather logs captured here \"/tmp/artifacts/installer/log-bundle-20201009232434.tar.gz\""
level=fatal msg="Bootstrap failed to complete: failed to wait for bootstrapping to complete: timed out waiting for the condition" 

And that "awaiting 3 nodes" thing is because the control-plane kubelets are mad:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-gcp-ovn-4.6/1314697955512422400/artifacts/e2e-gcp/nodes.json | jq -r '.items[].status.conditions[] | select(.type == "Ready") | .status + " " + .reason + ": " + .message'
False KubeletNotReady: runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: No CNI configuration file in /etc/kubernetes/cni/net.d/. Has your network provider started?
False KubeletNotReady: runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: No CNI configuration file in /etc/kubernetes/cni/net.d/. Has your network provider started?
False KubeletNotReady: runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: No CNI configuration file in /etc/kubernetes/cni/net.d/. Has your network provider started?

I dunno whose fault _that_ is, but I'm pretty sure it's not the console's. Sending it over to the node folks.
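
For anyone reproducing this against a live cluster instead of from CI artifacts, the same readiness check is roughly (a sketch, assuming oc access and jq installed):

$ # same jq filter as the nodes.json check above, run against the live API
$ oc get nodes -o json | jq -r '.items[].status.conditions[] | select(.type == "Ready") | .status + " " + .reason + ": " + .message'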

[1]: https://github.com/openshift/release/pull/12298

Comment 2 W. Trevor King 2020-10-10 04:17:54 UTC
*** Bug 1886840 has been marked as a duplicate of this bug. ***

Comment 3 Peter Hunt 2020-10-12 16:47:32 UTC
If SDN doesn't put CNI configs in /etc/kubernetes/cni/net.d, then CRI-O can make no forward progress. Moving to networking to investigate what is going wrong.
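
A quick way to confirm that state on an affected node (a sketch; <node-name> is a placeholder):

$ # an empty or missing directory here is what makes the kubelet report NetworkPluginNotReady
$ oc debug node/<node-name> -- chroot /host ls -l /etc/kubernetes/cni/net.d/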

Comment 5 Dave Cain 2020-10-12 19:44:48 UTC
Seeing similar behavior in 4.6.0-rc.2, RHCOS version 46.82.202010091720-0.

Mine is a bare-metal UPI install with a 3-node combined controller/worker deployment. Journalctl output on the control-plane nodes shows a similar NetworkPluginNotReady message. Looking at the openshift-ovn-kubernetes namespace, the ovnkube-node pods are unable to get subnet annotations.

oc get pods -n openshift-ovn-kubernetes
NAME                   READY   STATUS             RESTARTS   AGE
ovnkube-master-9jwnx   6/6     Running            1          63m
ovnkube-master-ffjbc   6/6     Running            0          63m
ovnkube-master-mxffv   6/6     Running            0          63m
ovnkube-node-bkjzf     2/3     CrashLoopBackOff   9          63m
ovnkube-node-vdmt7     2/3     Error              9          63m
ovnkube-node-xck78     2/3     CrashLoopBackOff   9          63m
ovs-node-5vkk5         1/1     Running            0          63m
ovs-node-jfbk4         1/1     Running            0          63m
ovs-node-mptdm         1/1     Running            0          63m

I1012 19:40:11.113574   73249 node.go:193] Waiting for node master2.sandbox.lab to start, no annotation found on node for subnet: node "master2.sandbox.lab" has no "k8s.ovn.org/node-subnets" annotation
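
The missing annotation is easy to confirm directly; a sketch, reusing the node name from the log line above:

$ # prints the subnet annotation, or "annotation missing" if it is unset
$ oc get node master2.sandbox.lab -o json | jq -r '.metadata.annotations["k8s.ovn.org/node-subnets"] // "annotation missing"'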

Comment 6 W. Trevor King 2020-10-12 20:47:23 UTC
There is a suspicion that this is a dup of bug 1886834.

Comment 7 Douglas Smith 2020-10-12 21:22:40 UTC
Peter's got it right: if there's no CNI configuration dropped, that's a strong sign that the default network provider (OVN or openshift-sdn) didn't complete properly and never wrote its configuration to disk.

Taking a look at this ovnkube log from the runs provided above: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-gcp-ovn-4.6/1314410856292814848/artifacts/e2e-gcp/pods/openshift-ovn-kubernetes_ovnkube-master-8mnrn_kube-rbac-proxy.log

I can see that OVN-Kubernetes is complaining that `ovn-master-metrics-cert not mounted after 20 minutes`.

This is directly related to bug 1886834, which addresses always mounting the certs share, as completed in: https://github.com/openshift/cluster-network-operator/pull/834/files
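
To check whether a given build already carries that change, one can list the volumes on the master DaemonSet and look for a metrics-cert entry; a sketch, assuming the DaemonSet is named ovnkube-master as the pod names above suggest:

$ # look for a metrics cert volume in the output
$ oc -n openshift-ovn-kubernetes get daemonset ovnkube-master -o json | jq -r '.spec.template.spec.volumes[].name'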

I believe this should be considered a dupe.

Comment 8 Douglas Smith 2020-10-12 21:24:01 UTC

*** This bug has been marked as a duplicate of bug 1886834 ***

