Bug 1754631
| Summary: | [build-cop] install CI failures due to segfaulting cni plugin | ||
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Lokesh Mandvekar <lsm5> |
| Component: | Networking | Assignee: | Lokesh Mandvekar <lsm5> |
| Networking sub component: | ovn-kubernetes | QA Contact: | zhaozhanqi <zzhao> |
| Status: | CLOSED ERRATA | Docs Contact: | |
| Severity: | medium | ||
| Priority: | medium | CC: | aos-bugs, bbennett, cdc, gblomqui, mfojtik |
| Version: | 4.1.z | ||
| Target Milestone: | --- | ||
| Target Release: | 4.3.0 | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | Doc Type: | If docs needed, set a value | |
| Doc Text: | Story Points: | --- | |
| Clone Of: | Environment: | ||
| Last Closed: | 2020-01-23 11:06:54 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | Category: | --- | |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
|
Description
Lokesh Mandvekar
2019-09-23 18:48:53 UTC
Haven't found duplicates of this issue, and seeing this only once on https://prow.svc.ci.openshift.org/?job=release-*-4.1*&state=failure
> https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-serial-4.1/1123/artifacts/e2e-aws-serial/installer/bootstrap-logs.tar
```
Sep 23 16:07:34 ip-10-0-3-67 bootkube.sh[10027]: Pod Status:openshift-kube-apiserver/kube-apiserver Ready
Sep 23 16:07:34 ip-10-0-3-67 bootkube.sh[10027]: Pod Status:openshift-kube-scheduler/openshift-kube-scheduler Ready
Sep 23 16:07:34 ip-10-0-3-67 bootkube.sh[10027]: Pod Status:openshift-kube-controller-manager/kube-controller-manager DoesNotExist
Sep 23 16:07:34 ip-10-0-3-67 bootkube.sh[10027]: Pod Status:openshift-cluster-version/cluster-version-operator Ready
```
kube-controller-manager is failing to start. The reason KCM is not starting:
```
Sep 23 16:18:01 ip-10-0-146-56 hyperkube[865]: I0923 16:18:01.767561 865 status_manager.go:382] Ignoring same status for pod "installer-2-ip-10-0-146-56.ec2.internal_openshift-kube-controller-manager(b66a6c1b-de19-11e9-876e-1265dd9ea47e)", status: {Phase:Pending Conditions:[{Type:Initialized Status:True LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2019-09-23 15:49:22 +0000 UTC Reason: Message:} {Type:Ready Status:False LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2019-09-23 15:49:22 +0000 UTC Reason:ContainersNotReady Message:containers with unready status: [installer]} {Type:ContainersReady Status:False LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2019-09-23 15:49:22 +0000 UTC Reason:ContainersNotReady Message:containers with unready status: [installer]} {Type:PodScheduled Status:True LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2019-09-23 15:49:22 +0000 UTC Reason: Message:}] Message: Reason: NominatedNodeName: HostIP:10.0.146.56 PodIP: StartTime:2019-09-23 15:49:22 +0000 UTC InitContainerStatuses:[] ContainerStatuses:[{Name:installer State:{Waiting:&ContainerStateWaiting{Reason:ContainerCreating,Message:,} Running:nil Terminated:nil} LastTerminationState:{Waiting:nil Running:nil Terminated:nil} Ready:false RestartCount:0 Image:registry.svc.ci.openshift.org/ocp/4.1-2019-09-23-152435@sha256:67228781c681955d25f48108303bfcc6f4cffc36d3cc49e7300ee686cfbdd049 ImageID: ContainerID:}] QOSClass:Burstable}

Sep 23 16:18:02 ip-10-0-146-56 hyperkube[865]: E0923 16:18:02.404219 865 kuberuntime_manager.go:661] createPodSandbox for pod "installer-2-ip-10-0-146-56.ec2.internal_openshift-kube-controller-manager(b66a6c1b-de19-11e9-876e-1265dd9ea47e)" failed: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_installer-2-ip-10-0-146-56.ec2.internal_openshift-kube-controller-manager_b66a6c1b-de19-11e9-876e-1265dd9ea47e_0(ace53600d279ea5747408599dd0f91384d533bd8c3e1664ce4840ffbd172fe36): netplugin failed but error parsing its diagnostic message "": unexpected end of JSON input
```
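To illustrate why a crashed plugin surfaces as this exact error: on failure a CNI plugin is supposed to print a JSON error object, which the runtime then decodes; a segfaulting plugin emits nothing, so the runtime ends up unmarshalling an empty byte slice. The sketch below is hypothetical (not the actual CRI-O/ocicni code); `parseDiagnostic` and the simplified `cniError` struct are stand-ins for the runtime-side parsing.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// cniError mirrors (simplified) the error object a CNI plugin is
// supposed to print as JSON on failure, per the CNI spec.
type cniError struct {
	Code uint   `json:"code"`
	Msg  string `json:"msg"`
}

// parseDiagnostic is a hypothetical stand-in for the runtime-side code
// that decodes a failed plugin's output. A segfaulting plugin writes no
// output at all, so the JSON decoder sees an empty input.
func parseDiagnostic(pluginOutput []byte) error {
	var e cniError
	if err := json.Unmarshal(pluginOutput, &e); err != nil {
		// This is the path hit in the bug: empty output -> JSON error.
		return fmt.Errorf("netplugin failed but error parsing its diagnostic message %q: %v",
			string(pluginOutput), err)
	}
	return fmt.Errorf("netplugin failed: code %d: %s", e.Code, e.Msg)
}

func main() {
	// Normal failure: the plugin printed a proper JSON error object.
	fmt.Println(parseDiagnostic([]byte(`{"code":7,"msg":"no IP ranges"}`)))
	// Segfaulting plugin: no output at all, reproducing the log message above.
	fmt.Println(parseDiagnostic(nil))
}
```

So "unexpected end of JSON input" never comes from the plugin itself; it is the runtime's JSON decoder complaining that the plugin's diagnostic output was empty, which is why the log message quotes an empty string.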
Side note: retrying the bootkube.sh service is not a good idea, as it produces odd outcomes and makes debugging harder. If we want to retry, we should increase the bootstrap timeout instead, but I would vote for not retrying and just giving it one shot to get things right.
That sort of error tends to mean that the network plugin itself is segfaulting. Bah. Is there any way we could get the journal for the bootstrap node? Without that, we're stuck.

Looks like another instance of it (from Azure), but there are no useful logs in the test run - https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-4.3/196

Fixed in https://github.com/openshift/containernetworking-plugins/pull/14 months ago; this bug should be fixed.

Verified this bug on 4.3.0-0.nightly-2019-11-13-233341

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0062

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days