Bug 1698950 - Nightly builds >= 4.0.0-0.nightly-2019-04-11-001944 failing during bootstrap with kube-controller-manager in CrashLoopBackoff
Summary: Nightly builds >= 4.0.0-0.nightly-2019-04-11-001944 failing during bootstrap ...
Keywords:
Status: CLOSED DUPLICATE of bug 1700504
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.1.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: 4.1.0
Assignee: Casey Callendrello
QA Contact: Meng Bo
URL:
Whiteboard:
Depends On:
Blocks: 1694226 1698672
 
Reported: 2019-04-11 13:25 UTC by Mike Fiedler
Modified: 2019-05-06 13:00 UTC
CC List: 13 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-04-16 20:59:40 UTC
Target Upstream Version:


Attachments
bootstrap journal (88.71 KB, application/gzip)
2019-04-11 13:25 UTC, Mike Fiedler
install log (551.83 KB, text/plain)
2019-04-11 13:31 UTC, Mike Fiedler
install-gather artifacts (410.80 KB, application/gzip)
2019-04-12 16:26 UTC, Mike Fiedler
install-gather artifacts redux (2.53 MB, application/gzip)
2019-04-12 16:46 UTC, Mike Fiedler

Description Mike Fiedler 2019-04-11 13:25:21 UTC
Created attachment 1554567 [details]
bootstrap journal

Description of problem:

Bootstrap failures in nightly builds of 4.0.0-0.nightly-2019-04-11-001944 and later:


Apr 11 13:16:34 ip-10-0-4-9 openshift.sh[1349]: error: unable to recognize "./99_openshift-machineconfig_master.yaml": no matches for kind "MachineConfig" in version "machineconfiguration.openshift.io/v1"                                                                                                                                                              
Apr 11 13:16:34 ip-10-0-4-9 openshift.sh[1349]: kubectl create --filename ./99_openshift-machineconfig_master.yaml failed. Retrying in 5 seconds...
Apr 11 13:16:35 ip-10-0-4-9 hyperkube[1564]: I0411 13:16:35.421593    1564 kubelet_node_status.go:276] Setting node annotation to enable volume controller attach/detach
Apr 11 13:16:35 ip-10-0-4-9 hyperkube[1564]: I0411 13:16:35.423526    1564 kubelet_node_status.go:429] Recording NodeHasSufficientDisk event message for node ip-10-0-4-9
Apr 11 13:16:35 ip-10-0-4-9 hyperkube[1564]: I0411 13:16:35.423562    1564 kubelet_node_status.go:429] Recording NodeHasSufficientMemory event message for node ip-10-0-4-9
Apr 11 13:16:35 ip-10-0-4-9 hyperkube[1564]: I0411 13:16:35.423582    1564 kubelet_node_status.go:429] Recording NodeHasNoDiskPressure event message for node ip-10-0-4-9
Apr 11 13:16:35 ip-10-0-4-9 hyperkube[1564]: I0411 13:16:35.423602    1564 kubelet_node_status.go:429] Recording NodeHasSufficientPID event message for node ip-10-0-4-9
Apr 11 13:16:38 ip-10-0-4-9 bootkube.sh[1348]: Error: error while checking pod status: timed out waiting for the condition
Apr 11 13:16:38 ip-10-0-4-9 bootkube.sh[1348]: Tearing down temporary bootstrap control plane...
Apr 11 13:16:38 ip-10-0-4-9 hyperkube[1564]: I0411 13:16:38.205127    1564 kubelet.go:1900] SyncLoop (REMOVE, "file"): "bootstrap-cluster-version-operator-ip-10-0-4-9_openshift-cluster-version(f8698915187f859edaee2a7489f8e225)"                                                                                                                                       
Apr 11 13:16:38 ip-10-0-4-9 hyperkube[1564]: I0411 13:16:38.205242    1564 kubelet_pods.go:1095] Killing unwanted pod "bootstrap-cluster-version-operator-ip-10-0-4-9"
Apr 11 13:16:38 ip-10-0-4-9 hyperkube[1564]: I0411 13:16:38.206243    1564 kubelet.go:1900] SyncLoop (REMOVE, "file"): "bootstrap-kube-apiserver-ip-10-0-4-9_kube-system(f236f5d0930b38f71dbb0d8fcce9701f)"                                                                                                                                                               
Apr 11 13:16:38 ip-10-0-4-9 hyperkube[1564]: I0411 13:16:38.206291    1564 kubelet.go:1900] SyncLoop (REMOVE, "file"): "bootstrap-kube-controller-manager-ip-10-0-4-9_kube-system(a84629aa5f5248ee1d36066ca5d97293)"                                                                                                                                                      
Apr 11 13:16:38 ip-10-0-4-9 hyperkube[1564]: I0411 13:16:38.206347    1564 kubelet_pods.go:1095] Killing unwanted pod "bootstrap-kube-controller-manager-ip-10-0-4-9"
Apr 11 13:16:38 ip-10-0-4-9 hyperkube[1564]: I0411 13:16:38.206472    1564 kubelet_pods.go:1095] Killing unwanted pod "bootstrap-kube-apiserver-ip-10-0-4-9"
Apr 11 13:16:38 ip-10-0-4-9 hyperkube[1564]: I0411 13:16:38.206593    1564 kubelet.go:1900] SyncLoop (REMOVE, "file"): "bootstrap-kube-scheduler-ip-10-0-4-9_kube-system(c1d86eaa0cf05e166565c5b35a7959bf)"                                                                                                                                                               
Apr 11 13:16:38 ip-10-0-4-9 hyperkube[1564]: I0411 13:16:38.206662    1564 kubelet_pods.go:1095] Killing unwanted pod "bootstrap-kube-scheduler-ip-10-0-4-9"
Apr 11 13:16:38 ip-10-0-4-9 bootkube.sh[1348]: Error: error while checking pod status: timed out waiting for the condition
Apr 11 13:16:38 ip-10-0-4-9 systemd[1]: libpod-537a39aa41ce3be842baa983b1b47b3b03a430ec788a2901f0e4d08c9ef0dcb2.scope: Consumed 1.190s CPU time
Apr 11 13:16:38 ip-10-0-4-9 systemd[1]: crio-564cc112fdc83e56a0986ab76f8491d6a24cb7a0e8d4e95a96bf9ebc4ad8212c.scope: Consumed 27.922s CPU time
Apr 11 13:16:38 ip-10-0-4-9 bootkube.sh[1348]: unable to find container etcd-signer: no container with name or ID etcd-signer found: no such container
Apr 11 13:16:38 ip-10-0-4-9 systemd[1]: bootkube.service: Main process exited, code=exited, status=125/n/a
Apr 11 13:16:38 ip-10-0-4-9 systemd[1]: bootkube.service: Failed with result 'exit-code'.


Version-Release number of selected component (if applicable): 4.0.0-0.nightly-2019-04-11-001944


How reproducible: Always


Install log and bootstrap journal attached

Comment 1 Mike Fiedler 2019-04-11 13:31:01 UTC
Created attachment 1554569 [details]
install log

Comment 2 Antonio Murdaca 2019-04-11 14:48:33 UTC
Abhinav, this looks closely related to https://github.com/openshift/installer/pull/1150. Do you have any idea? I'm trying to understand why the kind isn't recognized.
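
For anyone digging into the "no matches for kind" error, a quick sanity check is whether the MachineConfig CRD has actually been registered yet on the bootstrap API server. The commands below are a sketch (not taken from this report) and assume a working kubeconfig on the bootstrap node:

    # Does the API server know about the machineconfiguration.openshift.io group yet?
    oc api-resources --api-group=machineconfiguration.openshift.io
    # Is the CRD present and Established?
    oc get crd machineconfigs.machineconfiguration.openshift.io -o yaml | grep -A 2 Established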

Comment 3 Antonio Murdaca 2019-04-11 15:34:34 UTC
This is not an MCO error. As noted by Abhinav on Slack, that MCO error is transient and reconciles once the MCO is up and running. Something else is making the bootstrap fail:

Apr 11 13:17:33 ip-10-0-4-9 hyperkube[1564]: E0411 13:17:33.476898    1564 pod_workers.go:186] Error syncing pod a84629aa5f5248ee1d36066ca5d97293 ("bootstrap-kube-controller-manager-ip-10-0-4-9_kube-system(a84629aa5f5248ee1d36066ca5d97293)"), skipping: failed to "StartContainer" for "kube-controller-manager" with CrashLoopBackOff: "Back-off 10s restarting failed container=kube-controller-manager pod=bootstrap-kube-controller-manager-ip-10-0-4-9_kube-system(a84629aa5f5248ee1d36066ca5d97293)"

Apr 11 13:17:04 ip-10-0-4-9 bootkube.sh[11076]:         Pod Status:          kube-apiserver        DoesNotExist
Apr 11 13:17:04 ip-10-0-4-9 bootkube.sh[11076]:         Pod Status:openshift-kube-scheduler/openshift-kube-scheduler        DoesNotExist
Apr 11 13:17:04 ip-10-0-4-9 bootkube.sh[11076]:         Pod Status: kube-controller-manager        DoesNotExist
Apr 11 13:17:04 ip-10-0-4-9 bootkube.sh[11076]:         Pod Status:openshift-cluster-version/cluster-version-operator        Ready
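
Since the bootstrap control plane runs as static pods, the crash-looping container's output can usually be pulled straight from the bootstrap node with crictl. This is a sketch that assumes CRI-O is the runtime (which the journal above suggests); <container-id> is a placeholder for the ID printed by the first command:

    # Find the exited kube-controller-manager container on the bootstrap node.
    sudo crictl ps -a --name kube-controller-manager
    # Dump the last lines of its log to see why it is crash looping.
    sudo crictl logs --tail=100 <container-id>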

Comment 4 Antonio Murdaca 2019-04-11 15:46:10 UTC
Not an MCO BZ; a bad PR was merged to master: https://github.com/openshift/library-go/pull/346

The PR with the fix has merged as well, so closing this as it should be fixed.

Comment 8 David Eads 2019-04-12 15:49:49 UTC
We need the logs from the kube-controller-manager.  If your cluster did not bootstrap, the installer team can help walk you through gathering them.

There is some possibility that listing pods in the openshift-kube-controller-manager namespace will include a long enough snippet, but that is just luck.
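
If the bootstrap API server is still answering, something along these lines might capture that snippet. These are hypothetical commands, not a documented procedure; the kubeconfig path is an assumption and may differ on the bootstrap node:

    export KUBECONFIG=/etc/kubernetes/kubeconfig   # adjust to the actual admin kubeconfig path
    oc get pods -n openshift-kube-controller-manager -o wide
    oc get events -n openshift-kube-controller-manager --sort-by=.lastTimestamp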

Comment 9 Mike Fiedler 2019-04-12 16:26:58 UTC
Created attachment 1554851 [details]
install-gather artifacts

Data collected by install-gather.sh (https://github.com/openshift/installer/pull/1561)

Note: there was no /tmp/artifacts directory on any of the systems, and the script complained about that.

Comment 10 Mike Fiedler 2019-04-12 16:46:50 UTC
Created attachment 1554852 [details]
install-gather artifacts redux

Re-ran with installer-master-gather.sh in /usr/local/bin. It seems to have gathered much more data. The script still terminated with an error, but that's an issue for that tool.

                                                                                                                                                        
+ tar cz -C /tmp/artifacts .
tar: /tmp/artifacts: Cannot open: Not a directory
tar: Error is not recoverable: exiting now
+ echo 'Log bundle written to ~/log-bundle.tar.gz'
Log bundle written to ~/log-bundle.tar.gz
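
The tar failure looks like /tmp/artifacts not being a directory when the script runs. A possible workaround (an assumption about the script's behavior, not verified) is to pre-create the staging directory on each node before re-running the gather:

    # Ensure the gather script has a real directory to stage artifacts into.
    sudo rm -rf /tmp/artifacts && sudo mkdir -p /tmp/artifacts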

Comment 11 David Eads 2019-04-12 17:31:52 UTC
The kube-controller-manager-operator container never started. It never started because:

    "message": "(combined from similar events): Failed create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_kube-controller-manager-operator-54c6d769f4-2rnzn_openshift-kube-controller-manager-operator_1f508e1a-5d25-11e9-b0ba-024239aadba4_0(19273c65ace39e557bcf7097e457a358ca52e4cb193388eb3bda13d5143c955d): netplugin failed but error parsing its diagnostic message \"\": unexpected end of JSON input",


This persisted even though the pod ovs-r8hcs was running.
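
To dig into why sandbox creation keeps failing even with ovs-r8hcs running, checking the SDN pods and the CRI-O journal on that node is one option. This is a sketch; the container name and unit name are assumptions based on a default openshift-sdn deployment, and <sdn-pod-on-that-node> is a placeholder:

    oc -n openshift-sdn get pods -o wide
    oc -n openshift-sdn logs <sdn-pod-on-that-node> -c sdn --tail=200
    # On the affected node itself:
    sudo journalctl -u crio | grep -i "pod network sandbox"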

Comment 12 Dan Winship 2019-04-12 17:59:42 UTC
From sdn.log it looks like RBAC permissions are messed up:

E0412 16:38:53.713745    2002 reflector.go:125] k8s.io/client-go/informers/factory.go:132: Failed to list *v1.Pod: pods is forbidden: User "system:serviceaccount:openshift-sdn:sdn" cannot list resource "pods" in API group "" at the cluster scope: RBAC: [clusterrole.rbac.authorization.k8s.io "system:discovery" not found, clusterrole.rbac.authorization.k8s.io "system:openshift:public-info-viewer" not found, clusterrole.rbac.authorization.k8s.io "system:basic-user" not found, clusterrole.rbac.authorization.k8s.io "openshift-sdn" not found, clusterrole.rbac.authorization.k8s.io "system:discovery" not found]

(Not sure exactly why that translates into the CNI plugin returning "" rather than a valid error message.)
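
One way to confirm whether the missing cluster roles are the real problem (a sketch, assuming cluster-admin credentials still work against the bootstrap API server):

    # Do the roles named in the error actually exist?
    oc get clusterrole openshift-sdn system:basic-user system:discovery system:openshift:public-info-viewer
    # Can the SDN service account list pods cluster-wide?
    oc auth can-i list pods --all-namespaces --as=system:serviceaccount:openshift-sdn:sdn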

Comment 13 Dan Winship 2019-04-12 18:51:53 UTC
Actually, looking at the full SDN log, it starts up at 13:16 and everything seems fine. No errors about anything. (Then at 13:17 we get the first "failed to create pod network sandbox" error, which doesn't correspond to anything in the SDN log.)

Then there's a round of "Unexpected EOF" and "connection refused" errors every 20 minutes after that. It's only 3 hours later (after many rounds of apiserver disconnects and reconnects) that it gets the RBAC errors, which might be a result of the test timing out and starting to tear things down?

Comment 14 W. Trevor King 2019-04-13 05:53:30 UTC
I feel like this and 1698672 may be the same issue.

Comment 15 Sudha Ponnaganti 2019-04-15 18:20:06 UTC

*** This bug has been marked as a duplicate of bug 1698672 ***

Comment 21 Erica von Buelow 2019-04-16 13:45:11 UTC
I'm confused: what is the error this BZ is tracking, and did a change go in that we suspect addressed it? I'm moving this to MODIFIED since it seems something fixed the issue. Please assign it to the responsible component, change the status, or mark it as a duplicate as appropriate.

Comment 22 Mike Fiedler 2019-04-16 14:11:29 UTC
Re: comment 21. There are two separate BZs: 1) this one and 2) https://bugzilla.redhat.com/show_bug.cgi?id=1698672. At one point this was thought to be a duplicate of bug 1698672, but in fixing that bug @deads2k decided this bug is a separate issue. No fix has been delivered for this bug. In comment 18, @eparis asks whether this problem is reproducible in a build with the fix for bug 1698672. That is unknown, as there have been no installable nightly builds since 10-Apr.

Comment 23 Eric Paris 2019-04-16 14:15:23 UTC
The problems are reported in comments #12 and #13. No patches have been made to address the fact that the SDN is requesting resources such as pods as system:serviceaccount:openshift-sdn:sdn and is being rejected.

I do not believe that #672 is related in any way.

My question is about reproducibility. If this is not common it should not be a beta blocker. If this is common it should be. In either case, that is unrelated to #672. This is not fixed and moving to MODIFIED without reason is not an acceptable next step.

Comment 24 Erica von Buelow 2019-04-16 16:44:54 UTC
(In reply to Eric Paris from comment #23) I confirmed what was reported in comment #13: the RBAC errors start after the cluster starts getting torn down, unrelated to whatever the original issue was. Moving back to Networking.

Comment 27 Mike Fiedler 2019-04-16 19:00:43 UTC
4.1.0-0.nightly-2019-04-16-171602 has a bootstrap timeout failure as well, but the symptoms look different.   Tracked in https://bugzilla.redhat.com/show_bug.cgi?id=1700504

Comment 28 Dan Winship 2019-04-16 20:59:40 UTC

*** This bug has been marked as a duplicate of bug 1700504 ***

