Description of problem:

4.1.0-0.nightly-2019-04-16-171602 install timed out waiting for bootstrap. Opening a new bug instead of updating https://bugzilla.redhat.com/show_bug.cgi?id=1698950 as that one has gotten confusing. Plus, this looks different: I don't see the controller-manager repeatedly crashlooping. In this one, kube-controller-manager repeats this error every 30 seconds:

2019-04-16T18:32:30.189488211+00:00 stderr F E0416 18:32:30.188392 1 resource_quota_controller.go:445] failed to sync resource monitors: [couldn't start monitor for resource "machine.openshift.io/v1beta1, Resource=machines": unable to monitor quota for resource "machine.openshift.io/v1beta1, Resource=machines", couldn't start monitor for resource "operators.coreos.com/v1alpha1, Resource=installplans": unable to monitor quota for resource "operators.coreos.com/v1alpha1, Resource=installplans", couldn't start monitor for resource "operator.openshift.io/v1, Resource=ingresscontrollers": unable to monitor quota for resource "operator.openshift.io/v1, Resource=ingresscontrollers", couldn't start monitor for resource "machine.openshift.io/v1beta1, Resource=machinedeployments": unable to monitor quota for resource "machine.openshift.io/v1beta1, Resource=machinedeployments", couldn't start monitor for resource "operators.coreos.com/v1, Resource=catalogsourceconfigs": unable to monitor quota for resource "operators.coreos.com/v1, Resource=catalogsourceconfigs", couldn't start monitor for resource "operators.coreos.com/v1alpha1, Resource=clusterserviceversions": unable to monitor quota for resource "operators.coreos.com/v1alpha1, Resource=clusterserviceversions", couldn't start monitor for resource "healthchecking.openshift.io/v1alpha1, Resource=machinehealthchecks": unable to monitor quota for resource "healthchecking.openshift.io/v1alpha1, Resource=machinehealthchecks", couldn't start monitor for resource "machineconfiguration.openshift.io/v1, Resource=mcoconfigs": unable to monitor quota for resource "machineconfiguration.openshift.io/v1, Resource=mcoconfigs", couldn't start monitor for resource "operators.coreos.com/v1, Resource=operatorsources": unable to monitor quota for resource "operators.coreos.com/v1, Resource=operatorsources", couldn't start monitor for resource "operators.coreos.com/v1alpha1, Resource=catalogsources": unable to monitor quota for resource "operators.coreos.com/v1alpha1, Resource=catalogsources", couldn't start monitor for resource "machine.openshift.io/v1beta1, Resource=machinesets": unable to monitor quota for resource "machine.openshift.io/v1beta1, Resource=machinesets", couldn't start monitor for resource "operators.coreos.com/v1, Resource=operatorgroups": unable to monitor quota for resource "operators.coreos.com/v1, Resource=operatorgroups", couldn't start monitor for resource "operators.coreos.com/v1alpha1, Resource=subscriptions": unable to monitor quota for resource "operators.coreos.com/v1alpha1, Resource=subscriptions", couldn't start monitor for resource "tuned.openshift.io/v1, Resource=tuneds": unable to monitor quota for resource "tuned.openshift.io/v1, Resource=tuneds", couldn't start monitor for resource "autoscaling.openshift.io/v1alpha1, Resource=machineautoscalers": unable to monitor quota for resource "autoscaling.openshift.io/v1alpha1, Resource=machineautoscalers", couldn't start monitor for resource "k8s.cni.cncf.io/v1, Resource=network-attachment-definitions": unable to monitor quota for resource "k8s.cni.cncf.io/v1, Resource=network-attachment-definitions", couldn't start monitor for resource "cloudcredential.openshift.io/v1, Resource=credentialsrequests": unable to monitor quota for resource "cloudcredential.openshift.io/v1, Resource=credentialsrequests"]

until it finally gave up:

2019-04-16T18:33:02.307052026+00:00 stderr F I0416 18:33:02.307013 1 secure_serving.go:180] Stopped listening on [::]:10257
2019-04-16T18:33:02.307131810+00:00 stderr F E0416 18:33:02.307013 1 controllermanager.go:282] leaderelection lost

kube-apiserver repeatedly emits this watch-failure error message every second:

2019-04-16T18:36:34.145210286+00:00 stderr F E0416 18:36:34.145158 1 reflector.go:125] github.com/openshift/client-go/oauth/informers/externalversions/factory.go:101: Failed to list *v1.OAuthClient: the server could not find the requested resource (get oauthclients.oauth.openshift.io)
2019-04-16T18:36:34.146385697+00:00 stderr F E0416 18:36:34.146345 1 reflector.go:125] github.com/openshift/client-go/security/informers/externalversions/factory.go:101: Failed to list *v1.SecurityContextConstraints: the server could not find the requested resource (get securitycontextconstraints.security.openshift.io)
2019-04-16T18:36:34.194165342+00:00 stderr F E0416 18:36:34.194116 1 reflector.go:125] github.com/openshift/client-go/user/informers/externalversions/factory.go:101: Failed to list *v1.Group: the server could not find the requested resource (get groups.user.openshift.io)
2019-04-16T18:36:35.146328588+00:00 stderr F E0416 18:36:35.146262 1 reflector.go:125] github.com/openshift/client-go/oauth/informers/externalversions/factory.go:101: Failed to list *v1.OAuthClient: the server could not find the requested resource (get oauthclients.oauth.openshift.io)
2019-04-16T18:36:35.147301341+00:00 stderr F E0416 18:36:35.147266 1 reflector.go:125] github.com/openshift/client-go/security/informers/externalversions/factory.go:101: Failed to list *v1.SecurityContextConstraints: the server could not find the requested resource (get securitycontextconstraints.security.openshift.io)
2019-04-16T18:36:35.195006869+00:00 stderr F E0416 18:36:35.194948 1 reflector.go:125] github.com/openshift/client-go/user/informers/externalversions/factory.go:101: Failed to list *v1.Group: the server could not find the requested resource (get groups.user.openshift.io)
2019-04-16T18:36:36.147353312+00:00 stderr F E0416 18:36:36.147302 1 reflector.go:125] github.com/openshift/client-go/oauth/informers/externalversions/factory.go:101: Failed to list *v1.OAuthClient: the server could not find the requested resource (get oauthclients.oauth.openshift.io)
2019-04-16T18:36:36.148206728+00:00 stderr F E0416 18:36:36.148137 1 reflector.go:125] github.com/openshift/client-go/security/informers/externalversions/factory.go:101: Failed to list *v1.SecurityContextConstraints: the server could not find the requested resource (get securitycontextconstraints.security.openshift.io)
2019-04-16T18:36:36.196023983+00:00 stderr F E0416 18:36:36.195967 1 reflector.go:125] github.com/openshift/client-go/user/informers/externalversions/factory.go:101: Failed to list *v1.Group: the server could not find the requested resource (get groups.user.openshift.io)

Version-Release number of selected component (if applicable):
4.1.0-0.nightly-2019-04-16-171602

How reproducible:
1/1 so far; I will kick off another install.

Steps to Reproduce:
1. Install 4.1.0-0.nightly-2019-04-16-171602

I will attach the output from the installer-gather.sh tool from the installer repo.
Created attachment 1555634 [details]
journal and pod logs for bootstrap

installer-gather.sh collected the journals and pod logs for the bootstrap machine, but it hung scp-ing to master-0, so I killed it. The bootstrap is what failed in any case.
Not sure this is a Master-component issue, but I don't know where else to start it.
The network operator is reporting Available=true, but the events report the network as unavailable for the next hour, and it never appears to resolve. See this message in the events log:

"message": "(combined from similar events): Failed create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_kube-apiserver-operator-54445cfc84-d9hgk_openshift-kube-apiserver-operator_8ef80eed-6070-11e9-89fe-02da49078548_0(879d028f8de6ec1c5e167401e915a01978f937846af39c3b0b7a3400a79c74ee): netplugin failed but error parsing its diagnostic message \"\": unexpected end of JSON input",

I got here from clusteroperator.json -> pods.json (the kube-apiserver-operator pod is stuck in ContainerCreating) -> events.json, searching by the kube-apiserver-operator namespace to see why the pod isn't running. Moving to the SDN component for further debugging.
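For reference, that events.json lookup boils down to a namespace filter; something along these lines (the jq expression is only an illustration of the search, not a command taken from the gather) pulls out the relevant events:

$ jq -r '.items[] | select(.metadata.namespace == "openshift-kube-apiserver-operator") | .lastTimestamp + " " + .reason + " " + .message' events.json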
Confirming this seems reproducible. 2/2 so far.
*** Bug 1698950 has been marked as a duplicate of this bug. ***
[root@ip-10-0-6-58 core]# oc get nodes
NAME                                         STATUS   ROLES    AGE    VERSION
ip-10-0-136-206.us-east-2.compute.internal   Ready    master   106m   v1.13.4+d4ce02c1d
ip-10-0-152-192.us-east-2.compute.internal   Ready    master   106m   v1.13.4+d4ce02c1d
ip-10-0-169-118.us-east-2.compute.internal   Ready    master   106m   v1.13.4+d4ce02c1d
[root@ip-10-0-6-58 core]# grep server /etc/kubernetes/kubeconfig
    server: https://api.mffiedler-137.qe.devcluster.openshift.com:6443
[root@ip-10-0-6-58 core]# host api.mffiedler-137.qe.devcluster.openshift.com
api.mffiedler-137.qe.devcluster.openshift.com has address 10.0.148.90
api.mffiedler-137.qe.devcluster.openshift.com has address 10.0.172.190
api.mffiedler-137.qe.devcluster.openshift.com has address 10.0.143.109

Note that the IPs do not match. None of the nodes in the "oc get nodes" output are running an apiserver, and if I go to them and run "nc -l -k -p 6443", they start receiving connections from various IPs; but if I then try to curl api.mffiedler-137.qe.devcluster.openshift.com:6443, those connections *don't* show up in any of the 3 "nc" processes. It looks like the CI cluster apiserver load balancing is screwed up, and clusters are getting pointed at the apiservers of other clusters.
Moving to Installer, since we set up the Kubernetes-API load balancer, and a QE cluster is independent of our CI-only Test Infrastructure.
> Note that the IPs do not match.

api.* DNS resolves to the network load balancer's listeners. For example (thanks Mike for running this in openshift-qe):

$ aws ec2 describe-network-interfaces --filters Name=addresses.private-ip-address,Values=10.0.143.109 --output json | jq '.NetworkInterfaces[] | {InterfaceType, Description}' /tmp/x.json
{
  "InterfaceType": "network_load_balancer",
  "Description": "ELB net/mffiedler-137-l5r6b-int/61e43ba000a7aedf"
}

> None of the nodes in the "oc get nodes" output are running an apiserver...

This is a problem ;).

> ... but if I then try to curl api.mffiedler-137.qe.devcluster.openshift.com:6443, those connections *don't* show up in any of the 3 "nc" processes.

Because the load balancer targets are waiting for 200-399 responses to /readyz requests [1] (which is probably part of the traffic you *do* see). So at this point I don't know enough to distinguish between "Master issue" and "Networking issue", but I'm pretty sure it's not an installer-broke-the-load-balancer issue. Moving back to Networking on a "you touched it last" basis ;).

[1]: https://github.com/openshift/installer/blob/84922b5b92503c4e2854ed628aa56625b000f9cd/data/data/aws/vpc/master-elb.tf#L49
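(As an aside, you can imitate that health check by hand from a node with the kubeconfig; the node IP below is taken from the "oc get nodes" output above, and the curl flags are just an illustration rather than the exact probe the NLB sends:

$ curl -k -s -o /dev/null -w '%{http_code}\n' https://10.0.136.206:6443/readyz

Anything outside 200-399, or a connection failure, means that target stays out of rotation.)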
> "message": "(combined from similar events): Failed create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_kube-apiserver-operator-54445cfc84-d9hgk_openshift-kube-apiserver-operator_8ef80eed-6070-11e9-89fe-02da49078548_0(879d028f8de6ec1c5e167401e915a01978f937846af39c3b0b7a3400a79c74ee): netplugin failed but error parsing its diagnostic message \"\": unexpected end of JSON input", On this point, I've pulled events.json from the past 24 hours of CI failures, and I don't see 'netplugin' mentioned in any of them. This could be an OKD/CI vs. OCP/nightly thing.
Looking at the gathered logs, cluster-bootstrap is waiting on a production Kubernetes API server:

$ grep -r 'Pod Status' bootstrap/journals/bootkube.log
Apr 16 17:53:41 ip-10-0-4-87 bootkube.sh[1510]: Pod Status:openshift-cluster-version/cluster-version-operator Pending
Apr 16 17:53:41 ip-10-0-4-87 bootkube.sh[1510]: Pod Status: kube-apiserver DoesNotExist
Apr 16 17:53:41 ip-10-0-4-87 bootkube.sh[1510]: Pod Status:openshift-kube-scheduler/openshift-kube-scheduler DoesNotExist
Apr 16 17:53:41 ip-10-0-4-87 bootkube.sh[1510]: Pod Status: kube-controller-manager DoesNotExist
Apr 16 17:54:11 ip-10-0-4-87 bootkube.sh[1510]: Pod Status:openshift-kube-scheduler/openshift-kube-scheduler DoesNotExist
Apr 16 17:54:11 ip-10-0-4-87 bootkube.sh[1510]: Pod Status: kube-controller-manager DoesNotExist
Apr 16 17:54:11 ip-10-0-4-87 bootkube.sh[1510]: Pod Status:openshift-cluster-version/cluster-version-operator Ready
Apr 16 17:54:11 ip-10-0-4-87 bootkube.sh[1510]: Pod Status: kube-apiserver DoesNotExist
Apr 16 18:13:07 ip-10-0-4-87 bootkube.sh[11418]: Pod Status: kube-controller-manager DoesNotExist
Apr 16 18:13:07 ip-10-0-4-87 bootkube.sh[11418]: Pod Status:openshift-cluster-version/cluster-version-operator Ready
Apr 16 18:13:07 ip-10-0-4-87 bootkube.sh[11418]: Pod Status: kube-apiserver DoesNotExist
Apr 16 18:13:07 ip-10-0-4-87 bootkube.sh[11418]: Pod Status:openshift-kube-scheduler/openshift-kube-scheduler DoesNotExist
Apr 16 18:33:28 ip-10-0-4-87 bootkube.sh[18732]: Pod Status: kube-apiserver DoesNotExist
Apr 16 18:33:28 ip-10-0-4-87 bootkube.sh[18732]: Pod Status:openshift-kube-scheduler/openshift-kube-scheduler DoesNotExist
Apr 16 18:33:28 ip-10-0-4-87 bootkube.sh[18732]: Pod Status: kube-controller-manager DoesNotExist
Apr 16 18:33:28 ip-10-0-4-87 bootkube.sh[18732]: Pod Status:openshift-cluster-version/cluster-version-operator Ready

The kube-apiserver-operator has been launched, but it is waiting on control-plane nodes for scheduling the production kube-apiserver Pods, and then it hits the netplugin thing:

$ jq -r '.items | sort_by(.firstTimestamp)[] | .firstTimestamp + " " + .lastTimestamp + " " + .metadata.namespace + " " + .message' resources/events.json | grep kube-apiserver
2019-04-16T17:53:36Z 2019-04-16T17:53:36Z openshift-kube-apiserver-operator no nodes available to schedule pods
2019-04-16T17:53:36Z 2019-04-16T17:53:36Z openshift-kube-apiserver-operator Created pod: kube-apiserver-operator-54445cfc84-d9hgk
2019-04-16T17:53:36Z 2019-04-16T17:53:36Z openshift-kube-apiserver-operator Scaled up replica set kube-apiserver-operator-54445cfc84 to 1
2019-04-16T17:53:54Z 2019-04-16T17:53:54Z openshift-kube-apiserver-operator Successfully assigned openshift-kube-apiserver-operator/kube-apiserver-operator-54445cfc84-d9hgk to ip-10-0-148-62.us-east-2.compute.internal
2019-04-16T17:53:54Z 2019-04-16T17:54:24Z openshift-kube-apiserver-operator network is not ready: [runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: cni config uninitialized]
2019-04-16T17:54:33Z 2019-04-16T17:54:33Z openshift-kube-apiserver-operator Failed create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_kube-apiserver-operator-54445cfc84-d9hgk_openshift-kube-apiserver-operator_8ef80eed-6070-11e9-89fe-02da49078548_0(9b810470ff1731b9ec3789bdd5244b5df80efc15b00a4c6a0ae2dd428c6258f8): netplugin failed but error parsing its diagnostic message "": unexpected end of JSON input
...
Network logs are sparse, but just grepping for 'network':

$ jq -r '.items | sort_by(.firstTimestamp)[] | .firstTimestamp + " " + .lastTimestamp + " " + .metadata.namespace + " " + .message' resources/events.json | grep network
2019-04-16T17:53:32Z 2019-04-16T17:53:33Z openshift-network-operator Failed to create new replica set "network-operator-5d7c98d4f9": replicasets.apps "network-operator-5d7c98d4f9" is forbidden: cannot set blockOwnerDeletion in this case because cannot find RESTMapping for APIVersion apps/v1 Kind Deployment: no matches for kind "Deployment" in version "apps/v1"
2019-04-16T17:53:34Z 2019-04-16T17:53:36Z openshift-network-operator no nodes available to schedule pods
2019-04-16T17:53:34Z 2019-04-16T17:53:34Z openshift-network-operator Created pod: network-operator-5d7c98d4f9-94m7m
2019-04-16T17:53:34Z 2019-04-16T17:53:34Z openshift-network-operator Scaled up replica set network-operator-5d7c98d4f9 to 1
2019-04-16T17:53:54Z 2019-04-16T17:54:24Z openshift-apiserver-operator network is not ready: [runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: cni config uninitialized]
2019-04-16T17:53:54Z 2019-04-16T17:54:24Z openshift-cloud-credential-operator network is not ready: [runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: cni config uninitialized]
2019-04-16T17:53:54Z 2019-04-16T17:54:24Z openshift-controller-manager-operator network is not ready: [runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: cni config uninitialized]
2019-04-16T17:53:54Z 2019-04-16T17:54:23Z openshift-dns-operator network is not ready: [runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: cni config uninitialized]
2019-04-16T17:53:54Z 2019-04-16T17:54:24Z openshift-kube-apiserver-operator network is not ready: [runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: cni config uninitialized]
2019-04-16T17:53:54Z 2019-04-16T17:54:24Z openshift-kube-controller-manager-operator network is not ready: [runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: cni config uninitialized]
2019-04-16T17:53:54Z 2019-04-16T17:54:24Z openshift-kube-scheduler-operator network is not ready: [runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: cni config uninitialized]
2019-04-16T17:53:54Z 2019-04-16T17:53:54Z openshift-network-operator Successfully assigned openshift-network-operator/network-operator-5d7c98d4f9-94m7m to ip-10-0-148-62.us-east-2.compute.internal
2019-04-16T17:53:54Z 2019-04-16T17:54:24Z openshift-operator-lifecycle-manager network is not ready: [runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: cni config uninitialized]
2019-04-16T17:53:54Z 2019-04-16T17:54:24Z openshift-operator-lifecycle-manager network is not ready: [runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: cni config uninitialized]
2019-04-16T17:53:54Z 2019-04-16T17:54:24Z openshift-service-ca-operator network is not ready: [runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: cni config uninitialized]
2019-04-16T17:53:59Z 2019-04-16T17:53:59Z openshift-network-operator Pulling image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:352ab091c0db86561b0ed790da2c7b350a831debce9e510f28430bca30a7f5f7"
2019-04-16T17:54:02Z 2019-04-16T17:54:02Z openshift-network-operator Successfully pulled image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:352ab091c0db86561b0ed790da2c7b350a831debce9e510f28430bca30a7f5f7"
2019-04-16T17:54:02Z 2019-04-16T17:54:02Z openshift-network-operator Created container network-operator
2019-04-16T17:54:02Z 2019-04-16T17:54:02Z openshift-network-operator Started container network-operator
2019-04-16T17:54:32Z 2019-04-16T17:54:32Z openshift-dns-operator Failed create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_dns-operator-5db5484f75-p8hjm_openshift-dns-operator_8f3421e5-6070-11e9-89fe-02da49078548_0(ba10c9f42943bfe6fd5546c30b8ac4572b8e61e21cf7b627cbc5d8143e585ece): netplugin failed but error parsing its diagnostic message "": unexpected end of JSON input
2019-04-16T17:54:33Z 2019-04-16T17:54:33Z openshift-apiserver-operator Failed create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_openshift-apiserver-operator-7ddc48bd49-wfgsd_openshift-apiserver-operator_8f48001d-6070-11e9-89fe-02da49078548_0(c831ab7f5b09ad31581779fe2427bb40b0fbb8d570f3569ea2aee565f7371f7f): netplugin failed but error parsing its diagnostic message "": unexpected end of JSON input
...

Logs from the network-operator container would be interesting. Dunno if this cluster is still around?
Created attachment 1555707 [details]
network-operator logs

[root@ip-10-0-6-58 core]# KUBECONFIG=/opt/openshift/auth/kubeconfig oc get pods -n openshift-network-operator
NAME                                READY   STATUS    RESTARTS   AGE
network-operator-5d7c98d4f9-s9v9f   1/1     Running   0          5h27m
[root@ip-10-0-6-58 core]# KUBECONFIG=/opt/openshift/auth/kubeconfig oc logs -n openshift-network-operator network-operator-5d7c98d4f9-s9v9f
Error from server: Get https://ip-10-0-136-206.us-east-2.compute.internal:10250/containerLogs/openshift-network-operator/network-operator-5d7c98d4f9-s9v9f/network-operator: remote error: tls: internal error

Grr.

[root@ip-10-0-6-58 core]# KUBECONFIG=/opt/openshift/auth/kubeconfig oc get pods -n openshift-network-operator --output yaml
...
    nodeName: ip-10-0-136-206.us-east-2.compute.internal
...
    containerStatuses:
    - containerID: cri-o://841c76321c8b60520cb18fa0aeb9b3cd10540edbf1a2fbce9e17b5ed03aedf33
...
[root@ip-10-0-6-58 core]# ssh core.compute.internal sudo crictl logs 841c76321c8b60520cb18fa0aeb9b3cd10540edbf1a2fbce9e17b5ed03aedf33 >/tmp/network-operator.log

And then I SCPed that log back to my local host and attached it here.
Looks like I may need to gather openshift-sdn and openshift-multus Pods too...
Created attachment 1555725 [details]
Full installer-gather.sh tarball

Ah, here's a full gather, run as root on the bootstrap machine so I'd get SSH access to the control-plane machines, because QE put the SSH key in ~root/.ssh/id_rsa.
Thanks @wking. The original machine I ran the gather on had a borked SSH security-group config.
From that last gather:

$ jq -r '.items | sort_by(.firstTimestamp)[] | .firstTimestamp + " " + .lastTimestamp + " " + .metadata.namespace + " " + .message' resources/events.json | grep netplugin | head -n1
2019-04-16T19:16:07Z 2019-04-17T01:13:48Z openshift-controller-manager-operator (combined from similar events): Failed create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_openshift-controller-manager-operator-77d4cf868-6rf62_openshift-controller-manager-operator_b5ba10e2-607b-11e9-b5e1-026ab13ffd30_0(cd01f58366fc3df03149d9ff93f1f666e9df3db21be83a8e6ba457d511f6d7e7): netplugin failed but error parsing its diagnostic message "": unexpected end of JSON input

although the cluster was pretty old by the time I gathered, so older events may have been garbage-collected by then.

$ for LOG in control-plane/*/journals/kubelet.log; do grep netplugin "${LOG}" | head -n1; done
Apr 16 19:14:20 ip-10-0-136-206 hyperkube[1160]: E0416 19:14:20.853089 1160 remote_runtime.go:96] RunPodSandbox from runtime service failed: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_openshift-controller-manager-operator-77d4cf868-6rf62_openshift-controller-manager-operator_b5ba10e2-607b-11e9-b5e1-026ab13ffd30_0(648c402af9812358cab2721462079ed9b7d26cbcdeed3c66dbd5e140350f296d): netplugin failed but error parsing its diagnostic message "": unexpected end of JSON input
$ for LOG in control-plane/*/journals/kubelet.log; do head -n1 "${LOG}"; done
-- Logs begin at Tue 2019-04-16 19:05:06 UTC, end at Wed 2019-04-17 01:15:30 UTC. --
-- Logs begin at Tue 2019-04-16 19:05:00 UTC, end at Wed 2019-04-17 01:15:37 UTC. --
-- Logs begin at Tue 2019-04-16 19:05:04 UTC, end at Wed 2019-04-17 01:15:38 UTC. --

So maybe this only affects the one machine?

$ grep -c netplugin control-plane/*/journals/kubelet.log
control-plane/ip-10-0-136-206.us-east-2.compute.internal/journals/kubelet.log:71316
control-plane/ip-10-0-152-192.us-east-2.compute.internal/journals/kubelet.log:0
control-plane/ip-10-0-169-118.us-east-2.compute.internal/journals/kubelet.log:0
$ cat control-plane/ip-10-0-136-206.us-east-2.compute.internal/containers/sdn.log
...
I0416 19:14:08.463010 2567 service.go:319] Adding new service port "openshift-operator-lifecycle-manager/olm-operator-metrics:https-metrics" at 172.30.143.133:8081/TCP
I0416 19:14:24.119874 2567 roundrobin.go:310] LoadBalancerRR: Setting endpoints for kube-system/etcd:etcd-metrics to [10.0.169.118:9979]
...

So nothing right in the 19:14:20 vicinity.

$ cat control-plane/ip-10-0-136-206.us-east-2.compute.internal/containers/sdn-controller.log
I0416 19:14:07.084838 1 leaderelection.go:205] attempting to acquire leader lease openshift-sdn/openshift-network-controller...
E0416 20:14:05.499453 1 leaderelection.go:270] error retrieving resource lock openshift-sdn/openshift-network-controller: Get https://api.mffiedler-137.qe.devcluster.openshift.com:6443/api/v1/namespaces/openshift-sdn/configmaps/openshift-network-controller: dial tcp 10.0.172.190:6443: connect: connection refused
...

So the Kubernetes API flakes out slightly before the netplugin fireworks go off at 19:14:20.
In fact, here's all the error-level stuff from that node before then:

$ grep -rh '^E' control-plane/ip-10-0-136-206.us-east-2.compute.internal/containers/ | sort | head -n15
E0416 19:13:51.849701 1 leaderelection.go:252] error retrieving resource lock openshift-cluster-version/version: Get https://127.0.0.1:6443/api/v1/namespaces/openshift-cluster-version/configmaps/version: dial tcp 127.0.0.1:6443: connect: connection refused
E0416 19:14:08.309251 2567 cniserver.go:149] failed to remove old pod info socket: remove /var/run/openshift-sdn: device or resource busy
E0416 19:14:08.309307 2567 cniserver.go:152] failed to remove contents of socket directory: remove /var/run/openshift-sdn: device or resource busy
E0416 19:14:37.608046 1 leaderelection.go:252] error retrieving resource lock openshift-cluster-version/version: Get https://127.0.0.1:6443/api/v1/namespaces/openshift-cluster-version/configmaps/version: dial tcp 127.0.0.1:6443: connect: connection refused
E0416 19:15:22.895889 1 leaderelection.go:252] error retrieving resource lock openshift-cluster-version/version: Get https://127.0.0.1:6443/api/v1/namespaces/openshift-cluster-version/configmaps/version: dial tcp 127.0.0.1:6443: connect: connection refused
E0416 19:16:17.622212 1 leaderelection.go:252] error retrieving resource lock openshift-cluster-version/version: Get https://127.0.0.1:6443/api/v1/namespaces/openshift-cluster-version/configmaps/version: dial tcp 127.0.0.1:6443: connect: connection refused
E0416 19:16:49.985838 1 leaderelection.go:252] error retrieving resource lock openshift-cluster-version/version: Get https://127.0.0.1:6443/api/v1/namespaces/openshift-cluster-version/configmaps/version: dial tcp 127.0.0.1:6443: connect: connection refused
E0416 19:17:25.621141 1 leaderelection.go:252] error retrieving resource lock openshift-cluster-version/version: Get https://127.0.0.1:6443/api/v1/namespaces/openshift-cluster-version/configmaps/version: dial tcp 127.0.0.1:6443: connect: connection refused
E0416 19:17:59.112996 1 leaderelection.go:252] error retrieving resource lock openshift-cluster-version/version: Get https://127.0.0.1:6443/api/v1/namespaces/openshift-cluster-version/configmaps/version: dial tcp 127.0.0.1:6443: connect: connection refused
E0416 19:18:39.946436 1 leaderelection.go:252] error retrieving resource lock openshift-cluster-version/version: Get https://127.0.0.1:6443/api/v1/namespaces/openshift-cluster-version/configmaps/version: dial tcp 127.0.0.1:6443: connect: connection refused
E0416 19:19:28.494871 1 leaderelection.go:252] error retrieving resource lock openshift-cluster-version/version: Get https://127.0.0.1:6443/api/v1/namespaces/openshift-cluster-version/configmaps/version: dial tcp 127.0.0.1:6443: connect: connection refused
E0416 19:20:27.786604 1 leaderelection.go:252] error retrieving resource lock openshift-cluster-version/version: Get https://127.0.0.1:6443/api/v1/namespaces/openshift-cluster-version/configmaps/version: dial tcp 127.0.0.1:6443: connect: connection refused
E0416 19:21:05.500710 1 leaderelection.go:252] error retrieving resource lock openshift-cluster-version/version: Get https://127.0.0.1:6443/api/v1/namespaces/openshift-cluster-version/configmaps/version: dial tcp 127.0.0.1:6443: connect: connection refused
E0416 19:21:49.204925 1 leaderelection.go:252] error retrieving resource lock openshift-cluster-version/version: Get https://127.0.0.1:6443/api/v1/namespaces/openshift-cluster-version/configmaps/version: dial tcp 127.0.0.1:6443: connect: connection refused
E0416 19:22:30.656059 1 leaderelection.go:252] error retrieving resource lock openshift-cluster-version/version: Get https://127.0.0.1:6443/api/v1/namespaces/openshift-cluster-version/configmaps/version: dial tcp 127.0.0.1:6443: connect: connection refused

Dunno about that "device or resource busy" stuff, maybe it's important? Also, it looks like the CVO is expecting a local Kubernetes API and not getting it. Maybe it should be using the load balancer by this point? Anyhow, that's enough digging for today, I'll wade back in tomorrow ;).
Can anyone in QE help with the questions Casey has? He is in Berlin, and we will not have a build unless this is resolved tonight.
re: comment 17

> I've been able to run the installer on AWS successfully several times today

Is that using CI/OKD or nightly OCP builds? That seems to be a factor. The nightly-build e2e tests are failing with the same error in non-QE environments. Try this installer:

oc adm release extract --tools registry.svc.ci.openshift.org/ocp/release:4.1.0-0.nightly-2019-04-17-073737
Found the bug:

Apr 17 14:25:07 ip-10-0-138-13 crio[1072]: /var/lib/cni/bin/multus: error while loading shared libraries: libcrypto.so.10: cannot open shared object file: No such file or directory

The OKD / ART builds of multus are breaking the (normally static!) CNI plugin.
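For anyone checking other nodes or builds, the quick test (illustrative commands and output, not taken from this cluster) is whether the binary is dynamically linked at all:

$ ldd /var/lib/cni/bin/multus
        libcrypto.so.10 => not found
        ...

A correctly built (static) plugin should instead report "not a dynamic executable".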
Get whoever's working on bug 1642158 over here ;).
Short-term fix would be COPYing those libraries over from the Go-builder image when assembling the multus image.
We re-ran the QE run in the openshift-dev infrastructure. Found the multus bug. However, I would also like to observe that I'm seeing none of the weird etcd certificate errors... so I'd also like to know what causes those.
I suspect disabling cgo will fix this.
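For reference, that fix usually amounts to a fully static Go build plus an ldd check; the package path and output name here are hypothetical, not the actual multus build invocation:

$ CGO_ENABLED=0 go build -o bin/multus ./cmd/multus   # hypothetical entrypoint path
$ ldd bin/multus
        not a dynamic executable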
Doug is working on this. https://github.com/openshift/multus-cni/pull/12
Created attachment 1556004 [details]
installer-gather.sh from a new cluster with a static multus
Dan turned up host-local, loopback, and openshift-sdn in /var/lib/cni/bin/ on the control-plane machines, all still dynamically linked against libcrypto.so.10. So we probably need static builds of those too.
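A quick sweep along these lines (an illustrative one-liner to run on a control-plane node, not something from the gather) lists every plugin in that directory that still links against libcrypto:

$ for f in /var/lib/cni/bin/*; do ldd "$f" 2>/dev/null | grep -q libcrypto && echo "$f"; done

Statically linked binaries produce no library listing from ldd, so only the dynamically linked offenders get printed.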
Thanks for the update, Trevor. Earlier in the day I made a modification to use `CGO_ENABLED=0` on those in this PR: https://github.com/openshift/containernetworking-plugins/pull/6. However, that does not include `-tag no_openssl`, so I'm going to add that now, too.
Added `-tag no_openssl` to the build for the reference CNI plugins, and ran `ldd ./bin/macvlan` (for example) to verify that they're no longer dynamic executables, in this PR: https://github.com/openshift/containernetworking-plugins/pull/7
> However, I would also like to observe that I'm seeing none of the weird etcd certificate errors...

Sam says this touches bug 1698456, although the details are beyond me ;).
And bug 1699456 is exactly the etcd issue. Not sure how it's related to bug 1698456; they may just be different symptoms of the same underlying issue(s).
I'm still seeing:

openshift-machine-api 0s Warning FailedCreatePodSandBox pod/cluster-autoscaler-operator-6bc5984cf9-b9rt7 (combined from similar events): Failed create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_cluster-autoscaler-operator-6bc5984cf9-b9rt7_openshift-machine-api_f0cd9004-6161-11e9-8c61-02de036535da_0(fb74a3d01a921da1eda36818ef8ea5fab4dc83750389b92ba982df4103e0aa3f): netplugin failed but error parsing its diagnostic message "": unexpected end of JSON input
Found that the openshift-sdn plugin, which also includes and builds the loopback and host-local CNI plugins, was still dynamically linked. Clayton Coleman opened this PR to update openshift-sdn: https://github.com/openshift/origin/pull/22595
Douglas, would this PR fix the root cause of this issue?
Sudha -- my understanding is that the openshift-sdn binary (as well as the loopback and host-local plugins) is dynamically linked before this PR and exhibits the same behavior described in comment #26, so the PR will at least fix those components, which I think is the next logical step. I just checked in on that PR, and it has been triggered to run against CI again.
https://github.com/openshift/origin/pull/22595 has finally merged. Now we can try again.
*** Bug 1693247 has been marked as a duplicate of this bug. ***
Verified on 4.1.0-0.nightly-2019-04-18-170154. The build installs successfully and passes basic smoke testing.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0758