https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-serial-4.1/479

Failed to install within 30m because cluster-monitoring-operator reported failures; the logs contained:

```
E1030 18:27:21.290374 1 operator.go:264] sync "openshift-monitoring/cluster-monitoring-config" failed: running task Updating kube-state-metrics failed: reconciling kube-state-metrics Deployment failed: updating deployment object failed: waiting for DeploymentRollout of kube-state-metrics: deployment kube-state-metrics is not ready. status: (replicas: 1, updated: 1, ready: 0, unavailable: 1)
```

Possibly this is a higher-level issue, but we are seeing a fair number of these in the CI environments on releases other than 4.1.
The container stayed in the creating state for a long time. Looking at the kubelet logs, this is the last mention I can see:

```
Oct 30 18:30:34 ip-10-0-128-194 hyperkube[1020]: E1030 18:30:34.336694 1020 pod_workers.go:190] Error syncing pod 41377c1e-fb3f-11e9-896b-12518b39b236 ("kube-state-metrics-8448596686-xwv9j_openshift-monitoring(41377c1e-fb3f-11e9-896b-12518b39b236)"), skipping: failed to "CreatePodSandbox" for "kube-state-metrics-8448596686-xwv9j_openshift-monitoring(41377c1e-fb3f-11e9-896b-12518b39b236)" with CreatePodSandboxError: "CreatePodSandbox for pod \"kube-state-metrics-8448596686-xwv9j_openshift-monitoring(41377c1e-fb3f-11e9-896b-12518b39b236)\" failed: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_kube-state-metrics-8448596686-xwv9j_openshift-monitoring_41377c1e-fb3f-11e9-896b-12518b39b236_0(38f55bf30ca57a833c5475daf0fd93627a18647b7dc468a3569b84f958bc2a1b): netplugin failed but error parsing its diagnostic message \"\": unexpected end of JSON input"
```

Talking to the networking team to find out where to go next.
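The empty diagnostic message ("") in that error is consistent with the plugin dying before it could write anything: the runtime tries to decode the plugin's (empty) output as JSON and fails. A minimal sketch of that failure mode — not the actual ocicni/libcni code, just the standard-library behaviour it relies on:

```go
package main

import (
	"encoding/json"
	"fmt"
)

func main() {
	// A CNI plugin is expected to report errors as a JSON object on stdout.
	// If the plugin process crashes before writing anything, the runtime is
	// left decoding an empty byte slice, which fails exactly like the kubelet
	// log above: "unexpected end of JSON input".
	var cniError struct {
		Code uint   `json:"code"`
		Msg  string `json:"msg"`
	}
	output := []byte("") // what a segfaulted plugin leaves behind
	if err := json.Unmarshal(output, &cniError); err != nil {
		fmt.Println(err) // unexpected end of JSON input
	}
}
```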
After some further poking with the networking team, it seems that the CNI plugin segfaulted. For example:

```
Oct 30 18:30:37 ip-10-0-128-194 systemd-coredump[120350]: Process 120341 (loopback) of user 0 dumped core.

Stack trace of thread 120341:
#0  0x00007f8fa1d470d3 _dl_relocate_object (/usr/lib64/ld-2.28.so)
#1  0x00007f8fa1d3f1af dl_main (/usr/lib64/ld-2.28.so)
#2  0x00007f8fa1d54b00 _dl_sysdep_start (/usr/lib64/ld-2.28.so)
#3  0x00007f8fa1d3d0f8 _dl_start (/usr/lib64/ld-2.28.so)
#4  0x00007f8fa1d3c038 _start (/usr/lib64/ld-2.28.so)
```

This is apparently a known issue tracked in https://bugzilla.redhat.com/show_bug.cgi?id=1725832. Therefore closing as a dupe, but I will add the artifact link in a comment there for further investigation.

*** This bug has been marked as a duplicate of bug 1725832 ***
Sorry, I acted a bit too fast: while it looks like a similar root cause, this is a different version, so I am reopening and assigning to networking.
Where can I get a copy of the loopback binary mentioned in comment 2? Thanks.
I'm going to disable cgo on 4.1, because there's no need for us to use it. Florian, if you still want a copy of the binary, I can give it to you. Note that this is for OpenShift 4.1, which doesn't have the Go compiler fixes.
(In reply to Casey Callendrello from comment #6)
> I'm going to disable cgo on 4.1, because there's no need for us to use it.
> Florian, if you still want a copy of the binary, I can give it to you. Note
> that this is for OpenShift 4.1, which doesn't have the Go compiler fixes.

Yes, I would still like a copy of the binary, both versions if possible (with and without cgo). I want to make sure that this isn't the result of a GNU toolchain bug.
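For context on why disabling cgo is expected to help: the ld.so frames in the stack trace in comment 2 only appear because a cgo-enabled Go binary is dynamically linked and goes through the dynamic loader's relocation at startup, while a CGO_ENABLED=0 build of loopback is statically linked and never enters that code path. A small sketch for telling the two builds apart (the binary path below is only an example, not where CI necessarily puts it):

```go
package main

import (
	"debug/elf"
	"fmt"
	"os"
)

func main() {
	// Usage: go run . /var/lib/cni/bin/loopback   (path is an example only)
	f, err := elf.Open(os.Args[1])
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	defer f.Close()

	// A dynamically linked binary (what cgo produces by default) carries a
	// PT_INTERP segment naming ld.so; a CGO_ENABLED=0 build does not.
	for _, p := range f.Progs {
		if p.Type == elf.PT_INTERP {
			fmt.Println("dynamically linked (starts via ld.so, as in the stack trace)")
			return
		}
	}
	fmt.Println("statically linked (no dynamic loader involved)")
}
```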
Another option to fix this is https://github.com/cri-o/ocicni/pull/71, though that requires a revendor of ocicni into podman, then into CRI-O, and then into OpenShift.