https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-serial-4.1/479

Failed to install within 30m because cluster-monitoring-operator reported failures; the logs contained:

```
E1030 18:27:21.290374 1 operator.go:264] sync "openshift-monitoring/cluster-monitoring-config" failed: running task Updating kube-state-metrics failed: reconciling kube-state-metrics Deployment failed: updating deployment object failed: waiting for DeploymentRollout of kube-state-metrics: deployment kube-state-metrics is not ready. status: (replicas: 1, updated: 1, ready: 0, unavailable: 1)
```

Possibly this is a higher-level issue, but we are seeing a fair number of these in the CI environments on releases other than 4.1.
The container stayed in the creating state for a long time. Looking at the kubelet logs, this is the last mention I can see:

```
Oct 30 18:30:34 ip-10-0-128-194 hyperkube[1020]: E1030 18:30:34.336694 1020 pod_workers.go:190] Error syncing pod 41377c1e-fb3f-11e9-896b-12518b39b236 ("kube-state-metrics-8448596686-xwv9j_openshift-monitoring(41377c1e-fb3f-11e9-896b-12518b39b236)"), skipping: failed to "CreatePodSandbox" for "kube-state-metrics-8448596686-xwv9j_openshift-monitoring(41377c1e-fb3f-11e9-896b-12518b39b236)" with CreatePodSandboxError: "CreatePodSandbox for pod \"kube-state-metrics-8448596686-xwv9j_openshift-monitoring(41377c1e-fb3f-11e9-896b-12518b39b236)\" failed: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_kube-state-metrics-8448596686-xwv9j_openshift-monitoring_41377c1e-fb3f-11e9-896b-12518b39b236_0(38f55bf30ca57a833c5475daf0fd93627a18647b7dc468a3569b84f958bc2a1b): netplugin failed but error parsing its diagnostic message \"\": unexpected end of JSON input"
```

Talking to the networking team to find out where to go next.
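The empty diagnostic message ("") in that error is consistent with the plugin dying before it could write anything: the runtime tries to decode the plugin's (empty) output as JSON and fails. A minimal sketch of that failure mode — not the actual ocicni/libcni code, just the standard-library behaviour it relies on:

```go
package main

import (
	"encoding/json"
	"fmt"
)

func main() {
	// A CNI plugin is expected to report errors as a JSON object on stdout.
	// If the plugin process crashes before writing anything, the runtime is
	// left decoding an empty byte slice, which fails exactly like the kubelet
	// log above: "unexpected end of JSON input".
	var cniError struct {
		Code uint   `json:"code"`
		Msg  string `json:"msg"`
	}
	output := []byte("") // what a segfaulted plugin leaves behind
	if err := json.Unmarshal(output, &cniError); err != nil {
		fmt.Println(err) // unexpected end of JSON input
	}
}
```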
After some further poking with the networking team, it seems that the CNI plugin segfaulted. For example:

```
Oct 30 18:30:37 ip-10-0-128-194 systemd-coredump[120350]: Process 120341 (loopback) of user 0 dumped core.

Stack trace of thread 120341:
#0  0x00007f8fa1d470d3 _dl_relocate_object (/usr/lib64/ld-2.28.so)
#1  0x00007f8fa1d3f1af dl_main (/usr/lib64/ld-2.28.so)
#2  0x00007f8fa1d54b00 _dl_sysdep_start (/usr/lib64/ld-2.28.so)
#3  0x00007f8fa1d3d0f8 _dl_start (/usr/lib64/ld-2.28.so)
#4  0x00007f8fa1d3c038 _start (/usr/lib64/ld-2.28.so)
```

This is apparently a known issue tracked in https://bugzilla.redhat.com/show_bug.cgi?id=1725832. Therefore closing as a dupe, but I will add the artifact link in a comment there for further investigation.

*** This bug has been marked as a duplicate of bug 1725832 ***
Sorry, I acted a bit too fast: while it looks like a similar root cause, this is a different version, so I am reopening and assigning to networking.
Where can I get a copy of the loopback binary mentioned in comment 2? Thanks.
I'm going to disable cgo on 4.1, because there's no need for us to use it. Florian, if you still want a copy of the binary, I can give it to you. Note that this is for OpenShift 4.1, which doesn't have the Go compiler fixes.
(In reply to Casey Callendrello from comment #6)
> I'm going to disable cgo on 4.1, because there's no need for us to use it.
> Florian, if you still want a copy of the binary, I can give it to you. Note
> that this is for OpenShift 4.1, which doesn't have the Go compiler fixes.

Yes, I would still like a copy of the binary, both versions if possible (with and without cgo). I want to make sure that this isn't the result of a GNU toolchain bug.
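For context on why disabling cgo is expected to help: the ld.so frames in the stack trace in comment 2 only appear because a cgo-enabled Go binary is dynamically linked and goes through the dynamic loader's relocation at startup, while a CGO_ENABLED=0 build of loopback is statically linked and never enters that code path. A small sketch for telling the two builds apart (the binary path below is only an example, not where CI necessarily puts it):

```go
package main

import (
	"debug/elf"
	"fmt"
	"os"
)

func main() {
	// Usage: go run . /var/lib/cni/bin/loopback   (path is an example only)
	f, err := elf.Open(os.Args[1])
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	defer f.Close()

	// A dynamically linked binary (what cgo produces by default) carries a
	// PT_INTERP segment naming ld.so; a CGO_ENABLED=0 build does not.
	for _, p := range f.Progs {
		if p.Type == elf.PT_INTERP {
			fmt.Println("dynamically linked (starts via ld.so, as in the stack trace)")
			return
		}
	}
	fmt.Println("statically linked (no dynamic loader involved)")
}
```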
Another option to fix this is https://github.com/cri-o/ocicni/pull/71, though that requires a revendor of ocicni into podman, then into CRI-O, and then into OpenShift.