Created attachment 1554567 [details]
bootstrap journal

Description of problem:
Bootstrap failures in nightly builds of 4.0.0-0.nightly-2019-04-11-001944 and later:

Apr 11 13:16:34 ip-10-0-4-9 openshift.sh[1349]: error: unable to recognize "./99_openshift-machineconfig_master.yaml": no matches for kind "MachineConfig" in version "machineconfiguration.openshift.io/v1"
Apr 11 13:16:34 ip-10-0-4-9 openshift.sh[1349]: kubectl create --filename ./99_openshift-machineconfig_master.yaml failed. Retrying in 5 seconds...
Apr 11 13:16:35 ip-10-0-4-9 hyperkube[1564]: I0411 13:16:35.421593 1564 kubelet_node_status.go:276] Setting node annotation to enable volume controller attach/detach
Apr 11 13:16:35 ip-10-0-4-9 hyperkube[1564]: I0411 13:16:35.423526 1564 kubelet_node_status.go:429] Recording NodeHasSufficientDisk event message for node ip-10-0-4-9
Apr 11 13:16:35 ip-10-0-4-9 hyperkube[1564]: I0411 13:16:35.423562 1564 kubelet_node_status.go:429] Recording NodeHasSufficientMemory event message for node ip-10-0-4-9
Apr 11 13:16:35 ip-10-0-4-9 hyperkube[1564]: I0411 13:16:35.423582 1564 kubelet_node_status.go:429] Recording NodeHasNoDiskPressure event message for node ip-10-0-4-9
Apr 11 13:16:35 ip-10-0-4-9 hyperkube[1564]: I0411 13:16:35.423602 1564 kubelet_node_status.go:429] Recording NodeHasSufficientPID event message for node ip-10-0-4-9
Apr 11 13:16:38 ip-10-0-4-9 bootkube.sh[1348]: Error: error while checking pod status: timed out waiting for the condition
Apr 11 13:16:38 ip-10-0-4-9 bootkube.sh[1348]: Tearing down temporary bootstrap control plane...
Apr 11 13:16:38 ip-10-0-4-9 hyperkube[1564]: I0411 13:16:38.205127 1564 kubelet.go:1900] SyncLoop (REMOVE, "file"): "bootstrap-cluster-version-operator-ip-10-0-4-9_openshift-cluster-version(f8698915187f859edaee2a7489f8e225)"
Apr 11 13:16:38 ip-10-0-4-9 hyperkube[1564]: I0411 13:16:38.205242 1564 kubelet_pods.go:1095] Killing unwanted pod "bootstrap-cluster-version-operator-ip-10-0-4-9"
Apr 11 13:16:38 ip-10-0-4-9 hyperkube[1564]: I0411 13:16:38.206243 1564 kubelet.go:1900] SyncLoop (REMOVE, "file"): "bootstrap-kube-apiserver-ip-10-0-4-9_kube-system(f236f5d0930b38f71dbb0d8fcce9701f)"
Apr 11 13:16:38 ip-10-0-4-9 hyperkube[1564]: I0411 13:16:38.206291 1564 kubelet.go:1900] SyncLoop (REMOVE, "file"): "bootstrap-kube-controller-manager-ip-10-0-4-9_kube-system(a84629aa5f5248ee1d36066ca5d97293)"
Apr 11 13:16:38 ip-10-0-4-9 hyperkube[1564]: I0411 13:16:38.206347 1564 kubelet_pods.go:1095] Killing unwanted pod "bootstrap-kube-controller-manager-ip-10-0-4-9"
Apr 11 13:16:38 ip-10-0-4-9 hyperkube[1564]: I0411 13:16:38.206472 1564 kubelet_pods.go:1095] Killing unwanted pod "bootstrap-kube-apiserver-ip-10-0-4-9"
Apr 11 13:16:38 ip-10-0-4-9 hyperkube[1564]: I0411 13:16:38.206593 1564 kubelet.go:1900] SyncLoop (REMOVE, "file"): "bootstrap-kube-scheduler-ip-10-0-4-9_kube-system(c1d86eaa0cf05e166565c5b35a7959bf)"
Apr 11 13:16:38 ip-10-0-4-9 hyperkube[1564]: I0411 13:16:38.206662 1564 kubelet_pods.go:1095] Killing unwanted pod "bootstrap-kube-scheduler-ip-10-0-4-9"
Apr 11 13:16:38 ip-10-0-4-9 bootkube.sh[1348]: Error: error while checking pod status: timed out waiting for the condition
Apr 11 13:16:38 ip-10-0-4-9 systemd[1]: libpod-537a39aa41ce3be842baa983b1b47b3b03a430ec788a2901f0e4d08c9ef0dcb2.scope: Consumed 1.190s CPU time
Apr 11 13:16:38 ip-10-0-4-9 systemd[1]: crio-564cc112fdc83e56a0986ab76f8491d6a24cb7a0e8d4e95a96bf9ebc4ad8212c.scope: Consumed 27.922s CPU time
Apr 11 13:16:38 ip-10-0-4-9 bootkube.sh[1348]: unable to find container etcd-signer: no container with name or ID etcd-signer found: no such container
Apr 11 13:16:38 ip-10-0-4-9 systemd[1]: bootkube.service: Main process exited, code=exited, status=125/n/a
Apr 11 13:16:38 ip-10-0-4-9 systemd[1]: bootkube.service: Failed with result 'exit-code'.

Version-Release number of selected component (if applicable):
4.0.0-0.nightly-2019-04-11-001944

How reproducible:
Always

Install log and bootstrap journal attached.
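For the MachineConfig error above, a hedged way to confirm whether the CRD had been registered yet at that point (these commands are illustrative and were not run as part of this report; they assume the installer's auth/kubeconfig is usable and the install directory path is a placeholder):

  export KUBECONFIG=<install-dir>/auth/kubeconfig
  oc get crd machineconfigs.machineconfiguration.openshift.io
  oc api-resources --api-group=machineconfiguration.openshift.io

If the CRD is not yet present, the "no matches for kind" error is expected and openshift.sh simply keeps retrying, as the "Retrying in 5 seconds..." line shows.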
Created attachment 1554569 [details]
install log
Abhinav, this looks closely related to https://github.com/openshift/installer/pull/1150; do you have any idea? I'm trying to understand why the kind isn't recognized.
This is not an MCO error. As Abhinav noted on Slack, that MCO error is transient and reconciles once the MCO is up and running. Something else is making the bootstrap fail:

Apr 11 13:17:33 ip-10-0-4-9 hyperkube[1564]: E0411 13:17:33.476898 1564 pod_workers.go:186] Error syncing pod a84629aa5f5248ee1d36066ca5d97293 ("bootstrap-kube-controller-manager-ip-10-0-4-9_kube-system(a84629aa5f5248ee1d36066ca5d97293)"), skipping: failed to "StartContainer" for "kube-controller-manager" with CrashLoopBackOff: "Back-off 10s restarting failed container=kube-controller-manager pod=bootstrap-kube-controller-manager-ip-10-0-4-9_kube-system(a84629aa5f5248ee1d36066ca5d97293)"

Apr 11 13:17:04 ip-10-0-4-9 bootkube.sh[11076]: Pod Status: kube-apiserver DoesNotExist
Apr 11 13:17:04 ip-10-0-4-9 bootkube.sh[11076]: Pod Status:openshift-kube-scheduler/openshift-kube-scheduler DoesNotExist
Apr 11 13:17:04 ip-10-0-4-9 bootkube.sh[11076]: Pod Status: kube-controller-manager DoesNotExist
Apr 11 13:17:04 ip-10-0-4-9 bootkube.sh[11076]: Pod Status:openshift-cluster-version/cluster-version-operator Ready
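A hedged way to see why the bootstrap kube-controller-manager is crash-looping, assuming SSH access to the bootstrap node (commands are illustrative; the container ID is a placeholder):

  journalctl -u bootkube.service --no-pager | tail -100
  sudo crictl ps -a --name kube-controller-manager
  sudo crictl logs <container-id>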
Not an MCO BZ; a bad PR was merged to master. https://github.com/openshift/library-go/pull/346 has merged as well, so closing this as it should be fixed.
We need the logs from the kube-controller-manager. If your cluster did not bootstrap, the installer team can help walk you through gathering them. There is some possibility that listing pods in the openshift-kube-controller-manager namespace will include a long enough snippet, but that is just luck.
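For reference, a hedged sketch of how those logs could be pulled once the API is reachable (assumes a working kubeconfig; the pod name and the operator deployment name are assumptions, not taken from this report):

  oc -n openshift-kube-controller-manager get pods -o wide
  oc -n openshift-kube-controller-manager logs <kube-controller-manager-pod> --previous
  oc -n openshift-kube-controller-manager-operator logs deploy/kube-controller-manager-operator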
Created attachment 1554851 [details]
install-gather artifacts

Data collected by install-gather.sh (https://github.com/openshift/installer/pull/1561).

Note: there was no /tmp/artifacts on any of the systems and the script complained about that.
Created attachment 1554852 [details]
install-gather artifacts redux

Re-ran with installer-master-gather.sh in /usr/local/bin. It seems to have collected much more data, but the script still terminated with an error; that's an issue for that tool, though:

+ tar cz -C /tmp/artifacts .
tar: /tmp/artifacts: Cannot open: Not a directory
tar: Error is not recoverable: exiting now
+ echo 'Log bundle written to ~/log-bundle.tar.gz'
Log bundle written to ~/log-bundle.tar.gz
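Judging by the tar error, the script seems to expect /tmp/artifacts to already exist as a directory. A hedged workaround (not verified against the script itself) is to pre-create it on each host before re-running the gather:

  sudo mkdir -p /tmp/artifacts
  # then re-run /usr/local/bin/installer-master-gather.sh as before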
The kube-controller-manager-operator container never started. It never started because:

"message": "(combined from similar events): Failed create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_kube-controller-manager-operator-54c6d769f4-2rnzn_openshift-kube-controller-manager-operator_1f508e1a-5d25-11e9-b0ba-024239aadba4_0(19273c65ace39e557bcf7097e457a358ca52e4cb193388eb3bda13d5143c955d): netplugin failed but error parsing its diagnostic message \"\": unexpected end of JSON input",

This persisted even though the pod ovs-r8hcs is running.
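A hedged set of checks for the sandbox failure, assuming SSH access to the affected node and a working kubeconfig (illustrative only, not part of the original triage):

  oc -n openshift-sdn get pods -o wide
  ls /etc/cni/net.d/    # is the SDN's CNI config present on the node?
  sudo journalctl -u crio.service | grep -i 'pod network sandbox' | tail -20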
From sdn.log it looks like RBAC permissions are messed up:

E0412 16:38:53.713745 2002 reflector.go:125] k8s.io/client-go/informers/factory.go:132: Failed to list *v1.Pod: pods is forbidden: User "system:serviceaccount:openshift-sdn:sdn" cannot list resource "pods" in API group "" at the cluster scope: RBAC: [clusterrole.rbac.authorization.k8s.io "system:discovery" not found, clusterrole.rbac.authorization.k8s.io "system:openshift:public-info-viewer" not found, clusterrole.rbac.authorization.k8s.io "system:basic-user" not found, clusterrole.rbac.authorization.k8s.io "openshift-sdn" not found, clusterrole.rbac.authorization.k8s.io "system:discovery" not found]

(Not sure exactly why that translates into the CNI plugin returning "" rather than a valid error message.)
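A hedged way to confirm whether those clusterroles are actually missing, as opposed to the apiserver being unreachable (illustrative; assumes a working kubeconfig):

  oc get clusterrole system:discovery system:basic-user openshift-sdn
  oc get clusterrolebinding | grep -i sdn
  oc auth can-i list pods --as=system:serviceaccount:openshift-sdn:sdn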
Actually, looking at the full SDN log, it starts up at 13:16 and everything seems fine; there are no errors about anything. (Then at 13:17 we get the first "failed to create pod network sandbox" error, which doesn't correspond to anything in the SDN log.) After that there's a round of "Unexpected EOF" and "connection refused" errors every 20 minutes. It's only 3 hours later (after many rounds of apiserver disconnects and reconnects) that it gets the RBAC errors. Could that be a result of the test timing out and starting to tear things down?
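To make that timeline easier to check from the gathered bundle, a hedged grep over the attached sdn.log (file name as referenced above) would be something like:

  grep -n 'Unexpected EOF\|connection refused\|forbidden' sdn.log | head -40

which should show whether the "forbidden" RBAC errors only appear after the apiserver disconnects begin.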
I feel like this and bug 1698672 may be the same issue.
*** This bug has been marked as a duplicate of bug 1698672 ***
I'm confused: what error is this BZ tracking, and did a change go in that we suspect addressed it? I'm moving this to MODIFIED since it seems something fixed the issue. Please assign it to the responsible component, change the status, or mark it as a duplicate as appropriate.
Re: comment 21. There are two separate BZs: 1) this one and 2) https://bugzilla.redhat.com/show_bug.cgi?id=1698672. At one point this was thought to be a duplicate of bug 1698672, but in fixing that bug @deads2k decided this one is a separate issue. No fix has been delivered for this bug. In comment 18 @eparis asks whether this problem is reproducible in a build with the fix for bug 1698672. That is unknown, as there have been no installable nightly builds since 10-Apr.
The problems are reported in comments #12 and #13. No patches have been made to address the fact that the sdn is requesting things like pods as system:serviceaccount:openshift-sdn:sdn and is being rejected. I do not believe that #672 is related in any way. My question is about reproducibility. If this is not common it should not be a beta blocker. If this is common it should be. In either case, that is unrelated to #672. This is not fixed and moving to MODIFIED without reason is not an acceptable next step.
(In reply to Eric Paris from comment #23) I confirmed what was reported in #13 - the RBAC errors start after the cluster starts getting torn down, unrelated to whatever the original issue was. Moving back to networking.
4.1.0-0.nightly-2019-04-16-171602 has a bootstrap timeout failure as well, but the symptoms look different. Tracked in https://bugzilla.redhat.com/show_bug.cgi?id=1700504
*** This bug has been marked as a duplicate of bug 1700504 ***