Description of problem:
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-serial-4.1/1123

Run template e2e-aws-serial - e2e-aws-serial container setup (48m13s):

```
Lease acquired, installing...
Installing from release registry.svc.ci.openshift.org/ocp/release:4.1.0-0.ci-2019-09-23-152435
level=warning msg="Found override for ReleaseImage. Please be warned, this is not advised"
level=info msg="Consuming \"Install Config\" from target directory"
level=info msg="Creating infrastructure resources..."
level=info msg="Waiting up to 30m0s for the Kubernetes API at https://api.ci-op-zk3sp66f-52f51.origin-ci-int-aws.dev.rhcloud.com:6443..."
level=info msg="API v1.13.4+1b9415b up"
level=info msg="Waiting up to 30m0s for bootstrapping to complete..."
level=info msg="Make sure ssh-agent is running, env SSH_AUTH_SOCK is set to the ssh-agent's UNIX socket and your private key is added to the agent."
level=info msg="Use the following commands to gather logs from the cluster"
level=info msg="ssh -A core.96.89 '/usr/local/bin/installer-gather.sh 10.0.135.85 10.0.146.56 10.0.136.255'"
level=info msg="scp core.96.89:~/log-bundle.tar.gz ."
level=fatal msg="failed to wait for bootstrapping to complete: timed out waiting for the condition"
```
Haven't found duplicates of this issue, and have seen it only once on https://prow.svc.ci.openshift.org/?job=release-*-4.1*&state=failure
> https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-serial-4.1/1123/artifacts/e2e-aws-serial/installer/bootstrap-logs.tar

```
Sep 23 16:07:34 ip-10-0-3-67 bootkube.sh[10027]: Pod Status:openshift-kube-apiserver/kube-apiserver Ready
Sep 23 16:07:34 ip-10-0-3-67 bootkube.sh[10027]: Pod Status:openshift-kube-scheduler/openshift-kube-scheduler Ready
Sep 23 16:07:34 ip-10-0-3-67 bootkube.sh[10027]: Pod Status:openshift-kube-controller-manager/kube-controller-manager DoesNotExist
Sep 23 16:07:34 ip-10-0-3-67 bootkube.sh[10027]: Pod Status:openshift-cluster-version/cluster-version-operator Ready
```

kube-controller-manager is failing to start.
The reason KCM has not started:

```
53579:Sep 23 16:18:01 ip-10-0-146-56 hyperkube[865]: I0923 16:18:01.767561     865 status_manager.go:382] Ignoring same status for pod "installer-2-ip-10-0-146-56.ec2.internal_openshift-kube-controller-manager(b66a6c1b-de19-11e9-876e-1265dd9ea47e)", status: {Phase:Pending Conditions:[{Type:Initialized Status:True LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2019-09-23 15:49:22 +0000 UTC Reason: Message:} {Type:Ready Status:False LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2019-09-23 15:49:22 +0000 UTC Reason:ContainersNotReady Message:containers with unready status: [installer]} {Type:ContainersReady Status:False LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2019-09-23 15:49:22 +0000 UTC Reason:ContainersNotReady Message:containers with unready status: [installer]} {Type:PodScheduled Status:True LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2019-09-23 15:49:22 +0000 UTC Reason: Message:}] Message: Reason: NominatedNodeName: HostIP:10.0.146.56 PodIP: StartTime:2019-09-23 15:49:22 +0000 UTC InitContainerStatuses:[] ContainerStatuses:[{Name:installer State:{Waiting:&ContainerStateWaiting{Reason:ContainerCreating,Message:,} Running:nil Terminated:nil} LastTerminationState:{Waiting:nil Running:nil Terminated:nil} Ready:false RestartCount:0 Image:registry.svc.ci.openshift.org/ocp/4.1-2019-09-23-152435@sha256:67228781c681955d25f48108303bfcc6f4cffc36d3cc49e7300ee686cfbdd049 ImageID: ContainerID:}] QOSClass:Burstable}
53611:Sep 23 16:18:02 ip-10-0-146-56 hyperkube[865]: E0923 16:18:02.404219     865 kuberuntime_manager.go:661] createPodSandbox for pod "installer-2-ip-10-0-146-56.ec2.internal_openshift-kube-controller-manager(b66a6c1b-de19-11e9-876e-1265dd9ea47e)" failed: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_installer-2-ip-10-0-146-56.ec2.internal_openshift-kube-controller-manager_b66a6c1b-de19-11e9-876e-1265dd9ea47e_0(ace53600d279ea5747408599dd0f91384d533bd8c3e1664ce4840ffbd172fe36): netplugin failed but error parsing its diagnostic message "": unexpected end of JSON input
```

Side note: retrying the bootkube.sh service is not a good idea, as it produces confusing outcomes and makes debugging hard. If we want to retry, we should increase the bootstrap timeout, but I would vote for not retrying and giving it just one shot to make things right.
That sort of error tends to mean that the network plugin itself is segfaulting. Bah. Is there any way we could get the journal for the bootstrap node? Without that, we're stuck.
Looks like another instance of it (from Azure), but there are no useful logs in the test run - https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-4.3/196
Fixed in https://github.com/openshift/containernetworking-plugins/pull/14 months ago
This bug should be fixed. Verified on 4.3.0-0.nightly-2019-11-13-233341.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:0062
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days