Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1767178

Summary: failed to create pods due to loopback coredump
Product: OpenShift Container Platform
Reporter: Clayton Coleman <ccoleman>
Component: Networking
Assignee: Alexander Constantinescu <aconstan>
Networking sub component: openshift-sdn
QA Contact: zhaozhanqi <zzhao>
Status: CLOSED DEFERRED
Docs Contact:
Severity: high
Priority: unspecified
CC: aconstan, alegrand, anpicker, bbennett, dcbw, erooth, fbranczy, fweimer, kakkoyun, lcosic, mloibl, pkrupa, surbania, tstellar
Version: 4.1.z
Keywords: Reopened
Target Milestone: ---
Target Release: 4.1.z
Hardware: Unspecified
OS: Unspecified
Whiteboard: SDN-CI-IMPACT
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
: 1834247 (view as bug list)
Environment:
Last Closed: 2020-05-19 11:43:09 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 1725832, 1834247, 1834249
Bug Blocks:

Description Clayton Coleman 2019-10-30 20:47:40 UTC
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-serial-4.1/479

The cluster failed to install within 30 minutes because the cluster-monitoring-operator reported failures; the logs contained:

E1030 18:27:21.290374       1 operator.go:264] sync "openshift-monitoring/cluster-monitoring-config" failed: running task Updating kube-state-metrics failed: reconciling kube-state-metrics Deployment failed: updating deployment object failed: waiting for DeploymentRollout of kube-state-metrics: deployment kube-state-metrics is not ready. status: (replicas: 1, updated: 1, ready: 0, unavailable: 1)

Possibly this is a higher-level issue, but we are seeing a fair number of these in the CI environments on releases other than 4.1.

Comment 1 Frederic Branczyk 2019-10-31 09:35:48 UTC
The container stayed in the creating state for a long time. Looking at the kubelet logs, I can see that this is the last mention:

```
Oct 30 18:30:34 ip-10-0-128-194 hyperkube[1020]: E1030 18:30:34.336694    1020 pod_workers.go:190] Error syncing pod 41377c1e-fb3f-11e9-896b-12518b39b236 ("kube-state-metrics-8448596686-xwv9j_openshift-monitoring(41377c1e-fb3f-11e9-896b-12518b39b236)"), skipping: failed to "CreatePodSandbox" for "kube-state-metrics-8448596686-xwv9j_openshift-monitoring(41377c1e-fb3f-11e9-896b-12518b39b236)" with CreatePodSandboxError: "CreatePodSandbox for pod \"kube-state-metrics-8448596686-xwv9j_openshift-monitoring(41377c1e-fb3f-11e9-896b-12518b39b236)\" failed: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_kube-state-metrics-8448596686-xwv9j_openshift-monitoring_41377c1e-fb3f-11e9-896b-12518b39b236_0(38f55bf30ca57a833c5475daf0fd93627a18647b7dc468a3569b84f958bc2a1b): netplugin failed but error parsing its diagnostic message \"\": unexpected end of JSON input"
```

Talking to networking team to find out where to go next.

Comment 2 Frederic Branczyk 2019-10-31 09:51:13 UTC
After some further poking with the networking team, it seems that the CNI plugin segfaulted. For example:

```
Oct 30 18:30:37 ip-10-0-128-194 systemd-coredump[120350]: Process 120341 (loopback) of user 0 dumped core.
                                                          
                                                          Stack trace of thread 120341:
                                                          #0  0x00007f8fa1d470d3 _dl_relocate_object (/usr/lib64/ld-2.28.so)
                                                          #1  0x00007f8fa1d3f1af dl_main (/usr/lib64/ld-2.28.so)
                                                          #2  0x00007f8fa1d54b00 _dl_sysdep_start (/usr/lib64/ld-2.28.so)
                                                          #3  0x00007f8fa1d3d0f8 _dl_start (/usr/lib64/ld-2.28.so)
                                                          #4  0x00007f8fa1d3c038 _start (/usr/lib64/ld-2.28.so)
```

This is apparently a known issue tracked in https://bugzilla.redhat.com/show_bug.cgi?id=1725832. Therefore I am closing this as a duplicate, but I will post the artifact link there for further investigation.

*** This bug has been marked as a duplicate of bug 1725832 ***

Comment 3 Frederic Branczyk 2019-10-31 10:12:53 UTC
Sorry, I acted a bit too fast: while it looks like a similar root cause, this occurrence is on a different version, so I am reopening and assigning to networking.

Comment 4 Florian Weimer 2019-10-31 14:29:34 UTC
Where can I get a copy of the loopback binary mentioned in comment 2? Thanks.

Comment 6 Casey Callendrello 2019-11-13 12:58:01 UTC
I'm going to disable cgo on 4.1, because there's no need for us to use it. Florian, if you still want a copy of the binary, I can give it to you. Note that this is for OpenShift 4.1, which doesn't have the Go compiler fixes.
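For context: the stack trace in comment 2 crashes inside ld.so during relocation, and disabling cgo makes the Go toolchain emit a statically linked binary that never runs through the glibc dynamic loader at all. A sketch of the kind of build change involved (illustrative flags and paths, not the actual OpenShift build scripts):

```shell
# Build the loopback CNI plugin without cgo so the resulting binary is
# statically linked and does not exercise ld.so at pod-sandbox creation.
CGO_ENABLED=0 go build -o loopback ./plugins/main/loopback

# Confirm the binary no longer depends on the dynamic loader;
# on glibc systems ldd should report "not a dynamic executable".
ldd ./loopback
```

The `./plugins/main/loopback` path assumes a containernetworking/plugins-style source layout.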

Comment 7 Florian Weimer 2019-11-13 13:00:58 UTC
(In reply to Casey Callendrello from comment #6)
> I'm going to disable cgo on 4.1, because there's no need for us to use it.
> Florian, if you still want a copy of the binary, I can give it to you. Note
> that this is for OpenShift 4.1, which doesn't have the Go compiler fixes.

Yes, I would still like a copy of the binary, both versions if possible (with and without cgo). I want to make sure that this isn't the result of a GNU toolchain bug.

Comment 9 Dan Williams 2020-04-08 17:09:32 UTC
Another option to fix this is https://github.com/cri-o/ocicni/pull/71, though that requires revendoring ocicni into podman, then into CRI-O, and then into OpenShift.
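For each layer of that chain, the revendoring step would look roughly like a standard Go modules dependency bump (illustrative commands; the exact revision carrying the ocicni fix is an assumption):

```shell
# In each downstream repo (podman, then CRI-O, then OpenShift),
# bump the vendored ocicni to a revision containing the fix:
go get github.com/cri-o/ocicni@master   # pick a commit with PR 71 merged
go mod tidy
go mod vendor                           # refresh the vendor/ tree
git add go.mod go.sum vendor/
git commit -m "vendor: bump cri-o/ocicni"
```

Each bump then has to land and be released before the next repo in the chain can consume it, which is why this route is slower than the cgo change above.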