Bug 1754631 - [build-cop] install CI failures due to segfaulting cni plugin
Summary: [build-cop] install CI failures due to segfaulting cni plugin
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.1.z
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 4.3.0
Assignee: Lokesh Mandvekar
QA Contact: zhaozhanqi
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2019-09-23 18:48 UTC by Lokesh Mandvekar
Modified: 2023-09-14 05:43 UTC (History)
CC: 5 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-01-23 11:06:54 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2020:0062 0 None None None 2020-01-23 11:07:14 UTC

Description Lokesh Mandvekar 2019-09-23 18:48:53 UTC
Description of problem:

https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-serial-4.1/1123

Run template e2e-aws-serial - e2e-aws-serial container setup 	48m13s
Lease acquired, installing...
Installing from release registry.svc.ci.openshift.org/ocp/release:4.1.0-0.ci-2019-09-23-152435
level=warning msg="Found override for ReleaseImage. Please be warned, this is not advised"
level=info msg="Consuming \"Install Config\" from target directory"
level=info msg="Creating infrastructure resources..."
level=info msg="Waiting up to 30m0s for the Kubernetes API at https://api.ci-op-zk3sp66f-52f51.origin-ci-int-aws.dev.rhcloud.com:6443..."
level=info msg="API v1.13.4+1b9415b up"
level=info msg="Waiting up to 30m0s for bootstrapping to complete..."
level=info msg="Make sure ssh-agent is running, env SSH_AUTH_SOCK is set to the ssh-agent's UNIX socket and your private key is added to the agent."
level=info msg="Use the following commands to gather logs from the cluster"
level=info msg="ssh -A core.96.89 '/usr/local/bin/installer-gather.sh 10.0.135.85 10.0.146.56 10.0.136.255'"
level=info msg="scp core.96.89:~/log-bundle.tar.gz ."
level=fatal msg="failed to wait for bootstrapping to complete: timed out waiting for the condition"

Comment 1 Lokesh Mandvekar 2019-09-23 18:51:02 UTC
I haven't found duplicates of this issue; so far I've seen it only once on https://prow.svc.ci.openshift.org/?job=release-*-4.1*&state=failure

Comment 2 Abhinav Dahiya 2019-09-23 19:10:47 UTC
> https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-serial-4.1/1123/artifacts/e2e-aws-serial/installer/bootstrap-logs.tar

```
Sep 23 16:07:34 ip-10-0-3-67 bootkube.sh[10027]:         Pod Status:openshift-kube-apiserver/kube-apiserver        Ready
Sep 23 16:07:34 ip-10-0-3-67 bootkube.sh[10027]:         Pod Status:openshift-kube-scheduler/openshift-kube-scheduler        Ready
Sep 23 16:07:34 ip-10-0-3-67 bootkube.sh[10027]:         Pod Status:openshift-kube-controller-manager/kube-controller-manager        DoesNotExist
Sep 23 16:07:34 ip-10-0-3-67 bootkube.sh[10027]:         Pod Status:openshift-cluster-version/cluster-version-operator        Ready
```

kube-controller-manager is failing to start.

Comment 3 Michal Fojtik 2019-09-24 08:58:14 UTC
The reason KCM (kube-controller-manager) does not start is:

53579:Sep 23 16:18:01 ip-10-0-146-56 hyperkube[865]: I0923 16:18:01.767561     865 status_manager.go:382] Ignoring same status for pod "installer-2-ip-10-0-146-56.ec2.internal_openshift-kube-controller-manager(b66a6c1b-de19-11e9-876e-1265dd9ea47e)", status: {Phase:Pending Conditions:[{Type:Initialized Status:True LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2019-09-23 15:49:22 +0000 UTC Reason: Message:} {Type:Ready Status:False LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2019-09-23 15:49:22 +0000 UTC Reason:ContainersNotReady Message:containers with unready status: [installer]} {Type:ContainersReady Status:False LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2019-09-23 15:49:22 +0000 UTC Reason:ContainersNotReady Message:containers with unready status: [installer]} {Type:PodScheduled Status:True LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2019-09-23 15:49:22 +0000 UTC Reason: Message:}] Message: Reason: NominatedNodeName: HostIP:10.0.146.56 PodIP: StartTime:2019-09-23 15:49:22 +0000 UTC InitContainerStatuses:[] ContainerStatuses:[{Name:installer State:{Waiting:&ContainerStateWaiting{Reason:ContainerCreating,Message:,} Running:nil Terminated:nil} LastTerminationState:{Waiting:nil Running:nil Terminated:nil} Ready:false RestartCount:0 Image:registry.svc.ci.openshift.org/ocp/4.1-2019-09-23-152435@sha256:67228781c681955d25f48108303bfcc6f4cffc36d3cc49e7300ee686cfbdd049 ImageID: ContainerID:}] QOSClass:Burstable}

53611:Sep 23 16:18:02 ip-10-0-146-56 hyperkube[865]: E0923 16:18:02.404219     865 kuberuntime_manager.go:661] createPodSandbox for pod "installer-2-ip-10-0-146-56.ec2.internal_openshift-kube-controller-manager(b66a6c1b-de19-11e9-876e-1265dd9ea47e)" failed: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_installer-2-ip-10-0-146-56.ec2.internal_openshift-kube-controller-manager_b66a6c1b-de19-11e9-876e-1265dd9ea47e_0(ace53600d279ea5747408599dd0f91384d533bd8c3e1664ce4840ffbd172fe36): netplugin failed but error parsing its diagnostic message "": unexpected end of JSON input

Side note: retrying the bootkube.sh service is not a great idea, as it produces confusing outcomes and makes debugging harder. If we want to retry, we should increase the bootstrap timeout instead, but I would vote for not retrying at all and giving it one shot to make things right.

Comment 4 Casey Callendrello 2019-09-24 14:24:42 UTC
That sort of error tends to mean that the network plugin itself is segfaulting. Bah.

Is there any way we could get the journal for the bootstrap node? Without that, we're stuck.

Comment 5 Ben Bennett 2019-10-22 12:59:42 UTC
Looks like another instance of it (from Azure), but there are no useful logs in the test run - https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-4.3/196

Comment 10 Casey Callendrello 2019-11-15 16:32:33 UTC
Fixed in https://github.com/openshift/containernetworking-plugins/pull/14 months ago

Comment 12 zhaozhanqi 2019-11-18 02:37:51 UTC
This bug should be fixed now. Verified on 4.3.0-0.nightly-2019-11-13-233341.

Comment 14 errata-xmlrpc 2020-01-23 11:06:54 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0062

Comment 15 Red Hat Bugzilla 2023-09-14 05:43:40 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days

