Bug 1700504 - 4.1.0-0.nightly-2019-04-16-171602 install fails - timeout waiting for bootstrap complete with repeated errors in controller-manager and apiserver logs
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.1.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: 4.1.0
Assignee: Douglas Smith
QA Contact: Meng Bo
URL:
Whiteboard: openshift-scalability-41
Duplicates: 1693247 1698950 (view as bug list)
Depends On:
Blocks: 1694226 1696074 1698672
 
Reported: 2019-04-16 18:41 UTC by Mike Fiedler
Modified: 2019-06-04 10:47 UTC (History)
17 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-06-04 10:47:40 UTC
Target Upstream Version:


Attachments (Terms of Use)
journal and pod logs for bootstrap (494.60 KB, application/gzip)
2019-04-16 18:49 UTC, Mike Fiedler
network-operator logs (891.04 KB, text/plain)
2019-04-17 00:46 UTC, W. Trevor King
Full installer-gather.sh tarball (3.66 MB, application/gzip)
2019-04-17 01:18 UTC, W. Trevor King
installer-gather.sh from a new cluster with a static multus (778.40 KB, application/gzip)
2019-04-17 20:06 UTC, W. Trevor King


Links
System ID Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2019:0758 None None None 2019-06-04 10:47:50 UTC
Github openshift containernetworking-plugins pull 7 'None' 'closed' 'adds -tags no_openssl to each build' 2019-12-05 23:34:58 UTC
Github openshift multus-cni pull 12 'None' 'closed' 'add CGO_ENABLED=0 to build' 2019-12-05 23:34:58 UTC
Github openshift origin pull 22595 'None' 'closed' 'Disable openssl linking in Origin binaries' 2019-12-05 23:34:58 UTC

Description Mike Fiedler 2019-04-16 18:41:02 UTC
Description of problem:

4.1.0-0.nightly-2019-04-16-171602 install timed out waiting for bootstrap. Opening a new bug instead of updating https://bugzilla.redhat.com/show_bug.cgi?id=1698950, as that one has become confusing. Also, this looks different: I don't see the controller-manager repeatedly crashlooping.

In this one kube-controller-manager repeats this error every 30 seconds:

2019-04-16T18:32:30.189488211+00:00 stderr F E0416 18:32:30.188392       1 resource_quota_controller.go:445] failed to sync resource monitors: [couldn't start monitor for resource "machine.openshift.io/v1beta1, Resource=machines": unable to monitor quota for resource "machine.openshift.io/v1beta1, Resource=machines", couldn't start monitor for resource "operators.coreos.com/v1alpha1, Resource=installplans": unable to monitor quota for resource "operators.coreos.com/v1alpha1, Resource=installplans", couldn't start monitor for resource "operator.openshift.io/v1, Resource=ingresscontrollers": unable to monitor quota for resource "operator.openshift.io/v1, Resource=ingresscontrollers", couldn't start monitor for resource "machine.openshift.io/v1beta1, Resource=machinedeployments": unable to monitor quota for resource "machine.openshift.io/v1beta1, Resource=machinedeployments", couldn't start monitor for resource "operators.coreos.com/v1, Resource=catalogsourceconfigs": unable to monitor quota for resource "operators.coreos.com/v1, Resource=catalogsourceconfigs", couldn't start monitor for resource "operators.coreos.com/v1alpha1, Resource=clusterserviceversions": unable to monitor quota for resource "operators.coreos.com/v1alpha1, Resource=clusterserviceversions", couldn't start monitor for resource "healthchecking.openshift.io/v1alpha1, Resource=machinehealthchecks": unable to monitor quota for resource "healthchecking.openshift.io/v1alpha1, Resource=machinehealthchecks", couldn't start monitor for resource "machineconfiguration.openshift.io/v1, Resource=mcoconfigs": unable to monitor quota for resource "machineconfiguration.openshift.io/v1, Resource=mcoconfigs", couldn't start monitor for resource "operators.coreos.com/v1, Resource=operatorsources": unable to monitor quota for resource "operators.coreos.com/v1, Resource=operatorsources", couldn't start monitor for resource "operators.coreos.com/v1alpha1, Resource=catalogsources": unable to monitor quota for resource 
"operators.coreos.com/v1alpha1, Resource=catalogsources", couldn't start monitor for resource "machine.openshift.io/v1beta1, Resource=machinesets": unable to monitor quota for resource "machine.openshift.io/v1beta1, Resource=machinesets", couldn't start monitor for resource "operators.coreos.com/v1, Resource=operatorgroups": unable to monitor quota for resource "operators.coreos.com/v1, Resource=operatorgroups", couldn't start monitor for resource "operators.coreos.com/v1alpha1, Resource=subscriptions": unable to monitor quota for resource "operators.coreos.com/v1alpha1, Resource=subscriptions", couldn't start monitor for resource "tuned.openshift.io/v1, Resource=tuneds": unable to monitor quota for resource "tuned.openshift.io/v1, Resource=tuneds", couldn't start monitor for resource "autoscaling.openshift.io/v1alpha1, Resource=machineautoscalers": unable to monitor quota for resource "autoscaling.openshift.io/v1alpha1, Resource=machineautoscalers", couldn't start monitor for resource "k8s.cni.cncf.io/v1, Resource=network-attachment-definitions": unable to monitor quota for resource "k8s.cni.cncf.io/v1, Resource=network-attachment-definitions", couldn't start monitor for resource "cloudcredential.openshift.io/v1, Resource=credentialsrequests": unable to monitor quota for resource "cloudcredential.openshift.io/v1, Resource=credentialsrequests"]

This repeated until it finally gave up:

2019-04-16T18:33:02.307052026+00:00 stderr F I0416 18:33:02.307013       1 secure_serving.go:180] Stopped listening on [::]:10257
2019-04-16T18:33:02.307131810+00:00 stderr F E0416 18:33:02.307013       1 controllermanager.go:282] leaderelection lost

kube-apiserver emits these watch-failure errors every second:

2019-04-16T18:36:34.145210286+00:00 stderr F E0416 18:36:34.145158       1 reflector.go:125] github.com/openshift/client-go/oauth/informers/externalversions/factory.go:101: Failed to list *v1.OAuthClient: the server could not find the requested resource (get oauthclients.oauth.openshift.io)
2019-04-16T18:36:34.146385697+00:00 stderr F E0416 18:36:34.146345       1 reflector.go:125] github.com/openshift/client-go/security/informers/externalversions/factory.go:101: Failed to list *v1.SecurityContextConstraints: the server could not find the requested resource (get securitycontextconstraints.security.openshift.io)
2019-04-16T18:36:34.194165342+00:00 stderr F E0416 18:36:34.194116       1 reflector.go:125] github.com/openshift/client-go/user/informers/externalversions/factory.go:101: Failed to list *v1.Group: the server could not find the requested resource (get groups.user.openshift.io)
2019-04-16T18:36:35.146328588+00:00 stderr F E0416 18:36:35.146262       1 reflector.go:125] github.com/openshift/client-go/oauth/informers/externalversions/factory.go:101: Failed to list *v1.OAuthClient: the server could not find the requested resource (get oauthclients.oauth.openshift.io)
2019-04-16T18:36:35.147301341+00:00 stderr F E0416 18:36:35.147266       1 reflector.go:125] github.com/openshift/client-go/security/informers/externalversions/factory.go:101: Failed to list *v1.SecurityContextConstraints: the server could not find the requested resource (get securitycontextconstraints.security.openshift.io)
2019-04-16T18:36:35.195006869+00:00 stderr F E0416 18:36:35.194948       1 reflector.go:125] github.com/openshift/client-go/user/informers/externalversions/factory.go:101: Failed to list *v1.Group: the server could not find the requested resource (get groups.user.openshift.io)
2019-04-16T18:36:36.147353312+00:00 stderr F E0416 18:36:36.147302       1 reflector.go:125] github.com/openshift/client-go/oauth/informers/externalversions/factory.go:101: Failed to list *v1.OAuthClient: the server could not find the requested resource (get oauthclients.oauth.openshift.io)
2019-04-16T18:36:36.148206728+00:00 stderr F E0416 18:36:36.148137       1 reflector.go:125] github.com/openshift/client-go/security/informers/externalversions/factory.go:101: Failed to list *v1.SecurityContextConstraints: the server could not find the requested resource (get securitycontextconstraints.security.openshift.io)
2019-04-16T18:36:36.196023983+00:00 stderr F E0416 18:36:36.195967       1 reflector.go:125] github.com/openshift/client-go/user/informers/externalversions/factory.go:101: Failed to list *v1.Group: the server could not find the requested resource (get groups.user.openshift.io)

Version-Release number of selected component (if applicable): 4.1.0-0.nightly-2019-04-16-171602


How reproducible: 1/1 so far. I will kick off another install.


Steps to Reproduce:
1. Install 4.1.0-0.nightly-2019-04-16-171602


I will attach the output from the installer-gather.sh tool from the installer repo.

Comment 1 Mike Fiedler 2019-04-16 18:49:45 UTC
Created attachment 1555634 [details]
journal and pod logs for bootstrap

installer-gather captured the journals and pod logs for the bootstrap node but hung while scp-ing to master-0, so I killed it. The bootstrap node is what failed in any case.

Comment 2 Mike Fiedler 2019-04-16 18:54:08 UTC
Not sure this belongs to the Master component, but I don't know where else to start.

Comment 3 David Eads 2019-04-16 19:21:31 UTC
The network operator is reporting Available=true, but the events report the network as unavailable for the next hour, and it never appears to resolve.

See this in the events log:

           "message": "(combined from similar events): Failed create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_kube-apiserver-operator-54445cfc84-d9hgk_openshift-kube-apiserver-operator_8ef80eed-6070-11e9-89fe-02da49078548_0(879d028f8de6ec1c5e167401e915a01978f937846af39c3b0b7a3400a79c74ee): netplugin failed but error parsing its diagnostic message \"\": unexpected end of JSON input",

I got here from clusteroperator.json -> pods.json (the kube-apiserver-operator pod says ContainerCreating) -> events.json, searching by the kube-apiserver-operator namespace to see why the pod isn't running.
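That events.json triage step can be sketched with jq. The real file comes from installer-gather.sh; a two-event stand-in (with abbreviated messages) keeps this example self-contained:

```shell
# Stand-in for the gathered resources/events.json (contents illustrative):
cat > events.json <<'EOF'
{"items": [
  {"metadata": {"namespace": "openshift-kube-apiserver-operator"},
   "lastTimestamp": "2019-04-16T17:54:33Z",
   "message": "Failed create pod sandbox: netplugin failed"},
  {"metadata": {"namespace": "openshift-dns-operator"},
   "lastTimestamp": "2019-04-16T17:54:32Z",
   "message": "network is not ready"}
]}
EOF
# Filter events to the namespace whose pod is stuck, oldest first:
jq -r '.items[]
       | select(.metadata.namespace == "openshift-kube-apiserver-operator")
       | .lastTimestamp + " " + .message' events.json
```

On a real gather you would point the same filter at resources/events.json.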

Moving to the sdn for further debugging.

Comment 4 Mike Fiedler 2019-04-16 19:37:09 UTC
Confirming this seems reproducible.   2/2 so far.

Comment 5 Dan Winship 2019-04-16 20:59:40 UTC
*** Bug 1698950 has been marked as a duplicate of this bug. ***

Comment 6 Dan Winship 2019-04-16 21:08:34 UTC
[root@ip-10-0-6-58 core]# oc get nodes
NAME                                         STATUS   ROLES    AGE    VERSION
ip-10-0-136-206.us-east-2.compute.internal   Ready    master   106m   v1.13.4+d4ce02c1d
ip-10-0-152-192.us-east-2.compute.internal   Ready    master   106m   v1.13.4+d4ce02c1d
ip-10-0-169-118.us-east-2.compute.internal   Ready    master   106m   v1.13.4+d4ce02c1d

[root@ip-10-0-6-58 core]# grep server /etc/kubernetes/kubeconfig 
    server: https://api.mffiedler-137.qe.devcluster.openshift.com:6443

[root@ip-10-0-6-58 core]# host api.mffiedler-137.qe.devcluster.openshift.com
api.mffiedler-137.qe.devcluster.openshift.com has address 10.0.148.90
api.mffiedler-137.qe.devcluster.openshift.com has address 10.0.172.190
api.mffiedler-137.qe.devcluster.openshift.com has address 10.0.143.109


Note that the IPs do not match. None of the nodes in the "oc get nodes" output are running an apiserver, and if I go to them and run "nc -l -k -p 6443", they start receiving connections from various IPs, but if I then try to curl api.mffiedler-137.qe.devcluster.openshift.com:6443, those connections *don't* show up in any of the 3 "nc" processes.

It looks like the CI cluster apiserver load balancing is screwed up, and clusters are getting pointed at the apiservers of other clusters.

Comment 7 W. Trevor King 2019-04-16 21:50:01 UTC
Moving to Installer, since we set up the Kubernetes-API load balancer, and a QE cluster is independent of our CI-only Test Infrastructure.

Comment 8 W. Trevor King 2019-04-16 22:18:10 UTC
> Note that the IPs do not match.

api.* DNS resolves to the network load balancer's listeners.  For example (thanks Mike for running this in openshift-qe):

$ aws ec2 describe-network-interfaces --filters Name=addresses.private-ip-address,Values=10.0.143.109 --output json | jq '.NetworkInterfaces[] | {InterfaceType, Description}'
{
  "InterfaceType": "network_load_balancer",
  "Description": "ELB net/mffiedler-137-l5r6b-int/61e43ba000a7aedf"
}

> None of the nodes in the "oc get nodes" output are running an apiserver...

This is a problem ;).

> ... but if I then try to curl api.mffiedler-137.qe.devcluster.openshift.com:6443, those connections *don't* show up in any of the 3 "nc" processes.

Because the load balancer targets are waiting for 200-399 responses to /readyz requests [1] (which is probably part of the traffic you *do* see).

So at this point I don't know enough to distinguish between "Master issue" and "Networking issue", but I'm pretty sure it's not an installer-broke-the-load-balancer issue.  Moving back to Networking on a "you touched it last" basis ;).

[1]: https://github.com/openshift/installer/blob/84922b5b92503c4e2854ed628aa56625b000f9cd/data/data/aws/vpc/master-elb.tf#L49
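For context, a hedged sketch of what that target group looks like in the installer's Terraform (paraphrasing [1]; resource name and values here are illustrative, not copied from the installer):

```terraform
# Illustrative only -- see [1] for the authoritative definition.
resource "aws_lb_target_group" "api_internal" {
  port     = 6443
  protocol = "TCP"
  vpc_id   = var.vpc_id # placeholder

  health_check {
    protocol = "HTTPS"
    path     = "/readyz" # kube-apiserver readiness endpoint
    port     = 6443
  }
}
```

A target only starts receiving client traffic once /readyz answers with a 200-399 status, which is consistent with the nc listeners seeing probe traffic while the curl connections never arrive.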

Comment 9 W. Trevor King 2019-04-16 22:20:38 UTC
>            "message": "(combined from similar events): Failed create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_kube-apiserver-operator-54445cfc84-d9hgk_openshift-kube-apiserver-operator_8ef80eed-6070-11e9-89fe-02da49078548_0(879d028f8de6ec1c5e167401e915a01978f937846af39c3b0b7a3400a79c74ee): netplugin failed but error parsing its diagnostic message \"\": unexpected end of JSON input",

On this point, I've pulled events.json from the past 24 hours of CI failures, and I don't see 'netplugin' mentioned in any of them.  This could be an OKD/CI vs. OCP/nightly thing.

Comment 10 W. Trevor King 2019-04-16 23:32:00 UTC
Looking at the gathered logs, cluster-bootstrap is waiting on a production Kubernetes API server:

  $ grep -r 'Pod Status' bootstrap/journals/bootkube.log
  Apr 16 17:53:41 ip-10-0-4-87 bootkube.sh[1510]:         Pod Status:openshift-cluster-version/cluster-version-operator        Pending
  Apr 16 17:53:41 ip-10-0-4-87 bootkube.sh[1510]:         Pod Status:          kube-apiserver        DoesNotExist
  Apr 16 17:53:41 ip-10-0-4-87 bootkube.sh[1510]:         Pod Status:openshift-kube-scheduler/openshift-kube-scheduler        DoesNotExist
  Apr 16 17:53:41 ip-10-0-4-87 bootkube.sh[1510]:         Pod Status: kube-controller-manager        DoesNotExist
  Apr 16 17:54:11 ip-10-0-4-87 bootkube.sh[1510]:         Pod Status:openshift-kube-scheduler/openshift-kube-scheduler        DoesNotExist
  Apr 16 17:54:11 ip-10-0-4-87 bootkube.sh[1510]:         Pod Status: kube-controller-manager        DoesNotExist
  Apr 16 17:54:11 ip-10-0-4-87 bootkube.sh[1510]:         Pod Status:openshift-cluster-version/cluster-version-operator        Ready
  Apr 16 17:54:11 ip-10-0-4-87 bootkube.sh[1510]:         Pod Status:          kube-apiserver        DoesNotExist
  Apr 16 18:13:07 ip-10-0-4-87 bootkube.sh[11418]:         Pod Status: kube-controller-manager        DoesNotExist
  Apr 16 18:13:07 ip-10-0-4-87 bootkube.sh[11418]:         Pod Status:openshift-cluster-version/cluster-version-operator        Ready
  Apr 16 18:13:07 ip-10-0-4-87 bootkube.sh[11418]:         Pod Status:          kube-apiserver        DoesNotExist
  Apr 16 18:13:07 ip-10-0-4-87 bootkube.sh[11418]:         Pod Status:openshift-kube-scheduler/openshift-kube-scheduler        DoesNotExist
  Apr 16 18:33:28 ip-10-0-4-87 bootkube.sh[18732]:         Pod Status:          kube-apiserver        DoesNotExist
  Apr 16 18:33:28 ip-10-0-4-87 bootkube.sh[18732]:         Pod Status:openshift-kube-scheduler/openshift-kube-scheduler        DoesNotExist
  Apr 16 18:33:28 ip-10-0-4-87 bootkube.sh[18732]:         Pod Status: kube-controller-manager        DoesNotExist
  Apr 16 18:33:28 ip-10-0-4-87 bootkube.sh[18732]:         Pod Status:openshift-cluster-version/cluster-version-operator        Ready

The kube-apiserver-operator has been launched, but it is waiting on control-plane nodes to schedule the production kube-apiserver Pods, and then it hits the netplugin error:

  $ jq -r '.items | sort_by(.firstTimestamp)[] | .firstTimestamp + " " + .lastTimestamp + " " + .metadata.namespace + " " + .message' resources/events.json  | grep kube-apiserver
  2019-04-16T17:53:36Z 2019-04-16T17:53:36Z openshift-kube-apiserver-operator no nodes available to schedule pods
  2019-04-16T17:53:36Z 2019-04-16T17:53:36Z openshift-kube-apiserver-operator Created pod: kube-apiserver-operator-54445cfc84-d9hgk
  2019-04-16T17:53:36Z 2019-04-16T17:53:36Z openshift-kube-apiserver-operator Scaled up replica set kube-apiserver-operator-54445cfc84 to 1
  2019-04-16T17:53:54Z 2019-04-16T17:53:54Z openshift-kube-apiserver-operator Successfully assigned openshift-kube-apiserver-operator/kube-apiserver-operator-54445cfc84-d9hgk to ip-10-0-148-62.us-east-2.compute.internal
  2019-04-16T17:53:54Z 2019-04-16T17:54:24Z openshift-kube-apiserver-operator network is not ready: [runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: cni config uninitialized]
  2019-04-16T17:54:33Z 2019-04-16T17:54:33Z openshift-kube-apiserver-operator Failed create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_kube-apiserver-operator-54445cfc84-d9hgk_openshift-kube-apiserver-operator_8ef80eed-6070-11e9-89fe-02da49078548_0(9b810470ff1731b9ec3789bdd5244b5df80efc15b00a4c6a0ae2dd428c6258f8): netplugin failed but error parsing its diagnostic message "": unexpected end of JSON input
  ...

Network logs are sparse, but just grepping for 'network':

  $ jq -r '.items | sort_by(.firstTimestamp)[] | .firstTimestamp + " " + .lastTimestamp + " " + .metadata.namespace + " " + .message' resources/events.json | grep network
  2019-04-16T17:53:32Z 2019-04-16T17:53:33Z openshift-network-operator Failed to create new replica set "network-operator-5d7c98d4f9": replicasets.apps "network-operator-5d7c98d4f9" is forbidden: cannot set blockOwnerDeletion in this case because cannot find RESTMapping for APIVersion apps/v1 Kind Deployment: no matches for kind "Deployment" in version "apps/v1"
  2019-04-16T17:53:34Z 2019-04-16T17:53:36Z openshift-network-operator no nodes available to schedule pods
  2019-04-16T17:53:34Z 2019-04-16T17:53:34Z openshift-network-operator Created pod: network-operator-5d7c98d4f9-94m7m
  2019-04-16T17:53:34Z 2019-04-16T17:53:34Z openshift-network-operator Scaled up replica set network-operator-5d7c98d4f9 to 1
  2019-04-16T17:53:54Z 2019-04-16T17:54:24Z openshift-apiserver-operator network is not ready: [runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: cni config uninitialized]
  2019-04-16T17:53:54Z 2019-04-16T17:54:24Z openshift-cloud-credential-operator network is not ready: [runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: cni config uninitialized]
  2019-04-16T17:53:54Z 2019-04-16T17:54:24Z openshift-controller-manager-operator network is not ready: [runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: cni config uninitialized]
  2019-04-16T17:53:54Z 2019-04-16T17:54:23Z openshift-dns-operator network is not ready: [runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: cni config uninitialized]
  2019-04-16T17:53:54Z 2019-04-16T17:54:24Z openshift-kube-apiserver-operator network is not ready: [runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: cni config uninitialized]
  2019-04-16T17:53:54Z 2019-04-16T17:54:24Z openshift-kube-controller-manager-operator network is not ready: [runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: cni config uninitialized]
  2019-04-16T17:53:54Z 2019-04-16T17:54:24Z openshift-kube-scheduler-operator network is not ready: [runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: cni config uninitialized]
  2019-04-16T17:53:54Z 2019-04-16T17:53:54Z openshift-network-operator Successfully assigned openshift-network-operator/network-operator-5d7c98d4f9-94m7m to ip-10-0-148-62.us-east-2.compute.internal
  2019-04-16T17:53:54Z 2019-04-16T17:54:24Z openshift-operator-lifecycle-manager network is not ready: [runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: cni config uninitialized]
  2019-04-16T17:53:54Z 2019-04-16T17:54:24Z openshift-operator-lifecycle-manager network is not ready: [runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: cni config uninitialized]
  2019-04-16T17:53:54Z 2019-04-16T17:54:24Z openshift-service-ca-operator network is not ready: [runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: cni config uninitialized]
  2019-04-16T17:53:59Z 2019-04-16T17:53:59Z openshift-network-operator Pulling image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:352ab091c0db86561b0ed790da2c7b350a831debce9e510f28430bca30a7f5f7"
  2019-04-16T17:54:02Z 2019-04-16T17:54:02Z openshift-network-operator Successfully pulled image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:352ab091c0db86561b0ed790da2c7b350a831debce9e510f28430bca30a7f5f7"
  2019-04-16T17:54:02Z 2019-04-16T17:54:02Z openshift-network-operator Created container network-operator
  2019-04-16T17:54:02Z 2019-04-16T17:54:02Z openshift-network-operator Started container network-operator
  2019-04-16T17:54:32Z 2019-04-16T17:54:32Z openshift-dns-operator Failed create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_dns-operator-5db5484f75-p8hjm_openshift-dns-operator_8f3421e5-6070-11e9-89fe-02da49078548_0(ba10c9f42943bfe6fd5546c30b8ac4572b8e61e21cf7b627cbc5d8143e585ece): netplugin failed but error parsing its diagnostic message "": unexpected end of JSON input
  2019-04-16T17:54:33Z 2019-04-16T17:54:33Z openshift-apiserver-operator Failed create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_openshift-apiserver-operator-7ddc48bd49-wfgsd_openshift-apiserver-operator_8f48001d-6070-11e9-89fe-02da49078548_0(c831ab7f5b09ad31581779fe2427bb40b0fbb8d570f3569ea2aee565f7371f7f): netplugin failed but error parsing its diagnostic message "": unexpected end of JSON input
  ...

Logs from the network-operator container would be interesting.  Dunno if this cluster is still around?

Comment 12 W. Trevor King 2019-04-17 00:46:53 UTC
Created attachment 1555707 [details]
network-operator logs

[root@ip-10-0-6-58 core]# KUBECONFIG=/opt/openshift/auth/kubeconfig oc get pods -n openshift-network-operator
NAME                                READY   STATUS    RESTARTS   AGE
network-operator-5d7c98d4f9-s9v9f   1/1     Running   0          5h27m
[root@ip-10-0-6-58 core]# KUBECONFIG=/opt/openshift/auth/kubeconfig oc logs -n openshift-network-operator network-operator-5d7c98d4f9-s9v9f
Error from server: Get https://ip-10-0-136-206.us-east-2.compute.internal:10250/containerLogs/openshift-network-operator/network-operator-5d7c98d4f9-s9v9f/network-operator: remote error: tls: internal error

Grr.

[root@ip-10-0-6-58 core]# KUBECONFIG=/opt/openshift/auth/kubeconfig oc get pods -n openshift-network-operator --output yaml
...
    nodeName: ip-10-0-136-206.us-east-2.compute.internal
...
    containerStatuses:
    - containerID: cri-o://841c76321c8b60520cb18fa0aeb9b3cd10540edbf1a2fbce9e17b5ed03aedf33
...

[root@ip-10-0-6-58 core]# ssh core@ip-10-0-136-206.us-east-2.compute.internal sudo crictl logs 841c76321c8b60520cb18fa0aeb9b3cd10540edbf1a2fbce9e17b5ed03aedf33  >/tmp/network-operator.log

And then I SCPed that log back to my local host and attached it here.

Comment 13 W. Trevor King 2019-04-17 00:56:25 UTC
Looks like I may need to gather openshift-sdn and openshift-multus Pods too...

Comment 14 W. Trevor King 2019-04-17 01:18:50 UTC
Created attachment 1555725 [details]
Full installer-gather.sh tarball

Ah, here's a full gather, run as root on the bootstrap machine so I'd get SSH access to the control-plane machines, because QE put the SSH key in ~root/.ssh/id_rsa.

Comment 15 Mike Fiedler 2019-04-17 02:01:49 UTC
Thanks @wking. The original machine I ran the gather on had the borked SSH security-group config.

Comment 16 W. Trevor King 2019-04-17 04:23:33 UTC
From that last gather:

$ jq -r '.items | sort_by(.firstTimestamp)[] | .firstTimestamp + " " + .lastTimestamp + " " + .metadata.namespace + " " + .message' resources/events.json | grep netplugin | head -n1
2019-04-16T19:16:07Z 2019-04-17T01:13:48Z openshift-controller-manager-operator (combined from similar events): Failed create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_openshift-controller-manager-operator-77d4cf868-6rf62_openshift-controller-manager-operator_b5ba10e2-607b-11e9-b5e1-026ab13ffd30_0(cd01f58366fc3df03149d9ff93f1f666e9df3db21be83a8e6ba457d511f6d7e7): netplugin failed but error parsing its diagnostic message "": unexpected end of JSON input

although the cluster was pretty old by the time I gathered, so older events may have been garbage-collected by then.

$ for LOG in control-plane/*/journals/kubelet.log; do grep netplugin "${LOG}" | head -n1; done
Apr 16 19:14:20 ip-10-0-136-206 hyperkube[1160]: E0416 19:14:20.853089    1160 remote_runtime.go:96] RunPodSandbox from runtime service failed: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_openshift-controller-manager-operator-77d4cf868-6rf62_openshift-controller-manager-operator_b5ba10e2-607b-11e9-b5e1-026ab13ffd30_0(648c402af9812358cab2721462079ed9b7d26cbcdeed3c66dbd5e140350f296d): netplugin failed but error parsing its diagnostic message "": unexpected end of JSON input
$ for LOG in control-plane/*/journals/kubelet.log; do head -n1 "${LOG}"; done
-- Logs begin at Tue 2019-04-16 19:05:06 UTC, end at Wed 2019-04-17 01:15:30 UTC. --
-- Logs begin at Tue 2019-04-16 19:05:00 UTC, end at Wed 2019-04-17 01:15:37 UTC. --
-- Logs begin at Tue 2019-04-16 19:05:04 UTC, end at Wed 2019-04-17 01:15:38 UTC. --

So maybe this only affects the one machine?

$ grep -c netplugin control-plane/*/journals/kubelet.log
control-plane/ip-10-0-136-206.us-east-2.compute.internal/journals/kubelet.log:71316
control-plane/ip-10-0-152-192.us-east-2.compute.internal/journals/kubelet.log:0
control-plane/ip-10-0-169-118.us-east-2.compute.internal/journals/kubelet.log:0

$ cat control-plane/ip-10-0-136-206.us-east-2.compute.internal/containers/sdn.log 
...
I0416 19:14:08.463010    2567 service.go:319] Adding new service port "openshift-operator-lifecycle-manager/olm-operator-metrics:https-metrics" at 172.30.143.133:8081/TCP
I0416 19:14:24.119874    2567 roundrobin.go:310] LoadBalancerRR: Setting endpoints for kube-system/etcd:etcd-metrics to [10.0.169.118:9979]
...

So nothing right in the 19:14:20 vicinity.

$ cat control-plane/ip-10-0-136-206.us-east-2.compute.internal/containers/sdn-controller.log 
I0416 19:14:07.084838       1 leaderelection.go:205] attempting to acquire leader lease  openshift-sdn/openshift-network-controller...
E0416 20:14:05.499453       1 leaderelection.go:270] error retrieving resource lock openshift-sdn/openshift-network-controller: Get https://api.mffiedler-137.qe.devcluster.openshift.com:6443/api/v1/namespaces/openshift-sdn/configmaps/openshift-network-controller: dial tcp 10.0.172.190:6443: connect: connection refused
...

So the Kubernetes API flakes out slightly before the netplugin fireworks go off at 19:14:20.  In fact, here's all the error-level stuff from that node before then:

$ grep -rh '^E' control-plane/ip-10-0-136-206.us-east-2.compute.internal/containers/ | sort | head -n15
E0416 19:13:51.849701       1 leaderelection.go:252] error retrieving resource lock openshift-cluster-version/version: Get https://127.0.0.1:6443/api/v1/namespaces/openshift-cluster-version/configmaps/version: dial tcp 127.0.0.1:6443: connect: connection refused
E0416 19:14:08.309251    2567 cniserver.go:149] failed to remove old pod info socket: remove /var/run/openshift-sdn: device or resource busy
E0416 19:14:08.309307    2567 cniserver.go:152] failed to remove contents of socket directory: remove /var/run/openshift-sdn: device or resource busy
E0416 19:14:37.608046       1 leaderelection.go:252] error retrieving resource lock openshift-cluster-version/version: Get https://127.0.0.1:6443/api/v1/namespaces/openshift-cluster-version/configmaps/version: dial tcp 127.0.0.1:6443: connect: connection refused
E0416 19:15:22.895889       1 leaderelection.go:252] error retrieving resource lock openshift-cluster-version/version: Get https://127.0.0.1:6443/api/v1/namespaces/openshift-cluster-version/configmaps/version: dial tcp 127.0.0.1:6443: connect: connection refused
E0416 19:16:17.622212       1 leaderelection.go:252] error retrieving resource lock openshift-cluster-version/version: Get https://127.0.0.1:6443/api/v1/namespaces/openshift-cluster-version/configmaps/version: dial tcp 127.0.0.1:6443: connect: connection refused
E0416 19:16:49.985838       1 leaderelection.go:252] error retrieving resource lock openshift-cluster-version/version: Get https://127.0.0.1:6443/api/v1/namespaces/openshift-cluster-version/configmaps/version: dial tcp 127.0.0.1:6443: connect: connection refused
E0416 19:17:25.621141       1 leaderelection.go:252] error retrieving resource lock openshift-cluster-version/version: Get https://127.0.0.1:6443/api/v1/namespaces/openshift-cluster-version/configmaps/version: dial tcp 127.0.0.1:6443: connect: connection refused
E0416 19:17:59.112996       1 leaderelection.go:252] error retrieving resource lock openshift-cluster-version/version: Get https://127.0.0.1:6443/api/v1/namespaces/openshift-cluster-version/configmaps/version: dial tcp 127.0.0.1:6443: connect: connection refused
E0416 19:18:39.946436       1 leaderelection.go:252] error retrieving resource lock openshift-cluster-version/version: Get https://127.0.0.1:6443/api/v1/namespaces/openshift-cluster-version/configmaps/version: dial tcp 127.0.0.1:6443: connect: connection refused
E0416 19:19:28.494871       1 leaderelection.go:252] error retrieving resource lock openshift-cluster-version/version: Get https://127.0.0.1:6443/api/v1/namespaces/openshift-cluster-version/configmaps/version: dial tcp 127.0.0.1:6443: connect: connection refused
E0416 19:20:27.786604       1 leaderelection.go:252] error retrieving resource lock openshift-cluster-version/version: Get https://127.0.0.1:6443/api/v1/namespaces/openshift-cluster-version/configmaps/version: dial tcp 127.0.0.1:6443: connect: connection refused
E0416 19:21:05.500710       1 leaderelection.go:252] error retrieving resource lock openshift-cluster-version/version: Get https://127.0.0.1:6443/api/v1/namespaces/openshift-cluster-version/configmaps/version: dial tcp 127.0.0.1:6443: connect: connection refused
E0416 19:21:49.204925       1 leaderelection.go:252] error retrieving resource lock openshift-cluster-version/version: Get https://127.0.0.1:6443/api/v1/namespaces/openshift-cluster-version/configmaps/version: dial tcp 127.0.0.1:6443: connect: connection refused
E0416 19:22:30.656059       1 leaderelection.go:252] error retrieving resource lock openshift-cluster-version/version: Get https://127.0.0.1:6443/api/v1/namespaces/openshift-cluster-version/configmaps/version: dial tcp 127.0.0.1:6443: connect: connection refused

Dunno about that "device or resource busy" stuff, maybe it's important?  Also, looks like the CVO is expecting a local Kubernetes API and not getting it.  Maybe it should be using the load balancer by this point?  Anyhow, that's enough digging for today, I'll wade back in tomorrow ;).

Comment 20 Sudha Ponnaganti 2019-04-17 13:00:53 UTC
Can anyone in QE help with the questions Casey has asked? He is in Berlin, and we will not have a build unless this is resolved tonight.

Comment 23 Mike Fiedler 2019-04-17 13:35:13 UTC
re:  comment 17

> I've been able to run the installer on AWS successfully several times today

Is that using CI/OKD or nightly OCP builds? That seems to be a factor. The nightly build e2e tests are failing with the same error in non-QE environments.

Try this installer:  oc adm release extract --tools registry.svc.ci.openshift.org/ocp/release:4.1.0-0.nightly-2019-04-17-073737

Comment 26 Casey Callendrello 2019-04-17 14:27:12 UTC
Found the bug.

Apr 17 14:25:07 ip-10-0-138-13 crio[1072]: /var/lib/cni/bin/multus: error while loading shared libraries: libcrypto.so.10: cannot open shared object file: No such file or directory

The OKD / ART builds of multus are breaking the (normally static!) CNI plugin.
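A quick way to spot binaries in this state is to walk the plugin directory with ldd (a sketch; the /var/lib/cni/bin path comes from the crio error above, and CNI_BIN_DIR is just an overridable placeholder for it):

```shell
# Flag any CNI plugin binary that still links libcrypto dynamically.
# CNI_BIN_DIR defaults to the path shown in the crio error above.
CNI_BIN_DIR=${CNI_BIN_DIR:-/var/lib/cni/bin}
for bin in "$CNI_BIN_DIR"/*; do
  [ -f "$bin" ] || continue
  # ldd lists the shared objects a dynamic executable needs;
  # a fully static binary produces no such list.
  if ldd "$bin" 2>/dev/null | grep -q 'libcrypto'; then
    echo "still dynamic (libcrypto): $bin"
  fi
done
```

On a host with the broken multus build, this should print the offending binaries; on a fixed host it prints nothing.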

Comment 27 W. Trevor King 2019-04-17 14:29:05 UTC
Get whoever's working on bug 1642158 over here ;).

Comment 28 W. Trevor King 2019-04-17 14:31:45 UTC
Short-term fix would be COPYing those libraries over from the Go-builder image when assembling the multus image.

Comment 29 Casey Callendrello 2019-04-17 14:34:45 UTC
We re-ran the QE run in the openshift-dev infrastructure. Found the multus bug. However, I would also like to observe that I'm seeing none of the weird etcd certificate errors... so I'd also like to know what causes those.

Comment 30 Casey Callendrello 2019-04-17 15:44:36 UTC
I suspect disabling cgo will fix this.

Comment 31 Casey Callendrello 2019-04-17 16:04:29 UTC
Doug is working on this.

https://github.com/openshift/multus-cni/pull/12

Comment 34 W. Trevor King 2019-04-17 20:06:33 UTC
Created attachment 1556004 [details]
installer-gather.sh from a new cluster with a static multus

Comment 35 W. Trevor King 2019-04-17 20:11:34 UTC
Dan turned up host-local, loopback, and openshift-sdn binaries in /var/lib/cni/bin/ on the control-plane machines that are still dynamically linked against libcrypto.so.10.  So we probably need static builds of those too.

Comment 36 Douglas Smith 2019-04-17 20:17:16 UTC
Thanks for the update Trevor, I did make a modification to use `CGO_ENABLED=0` on those earlier in the day in this PR @ https://github.com/openshift/containernetworking-plugins/pull/6

...However, that's not including `-tags no_openssl`, so I'm going to add that now, too.

Comment 37 Douglas Smith 2019-04-17 20:24:38 UTC
Added `-tags no_openssl` to the build for the reference CNI plugins, and ran `ldd ./bin/macvlan` (for example) to verify that they're no longer dynamic executables, in this PR @ https://github.com/openshift/containernetworking-plugins/pull/7

Comment 38 W. Trevor King 2019-04-17 21:47:27 UTC
> However, I would also like to observe that I'm seeing none of the weird etcd certificate errors...

Sam says this touches bug 1698456, although the details are beyond me ;).

Comment 39 W. Trevor King 2019-04-17 21:52:09 UTC
And bug 1699456 is exactly the etcd issue.  Not sure how it's related to bug 1698456; they may just be different symptoms of the same underlying issue(s).

Comment 41 Peter Ruan 2019-04-17 22:57:27 UTC
I'm still seeing 

openshift-machine-api   0s    Warning   FailedCreatePodSandBox   pod/cluster-autoscaler-operator-6bc5984cf9-b9rt7   (combined from similar events): Failed create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_cluster-autoscaler-operator-6bc5984cf9-b9rt7_openshift-machine-api_f0cd9004-6161-11e9-8c61-02de036535da_0(fb74a3d01a921da1eda36818ef8ea5fab4dc83750389b92ba982df4103e0aa3f): netplugin failed but error parsing its diagnostic message "": unexpected end of JSON input
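That empty diagnostic (`""`: unexpected end of JSON input) is consistent with the loader failure: a CNI plugin reports its result or error as JSON on stdout, but a binary killed by the dynamic loader writes only to stderr and exits, leaving the runtime to parse an empty string. A small simulation (fake_plugin is hypothetical, standing in for the broken binary):

```shell
# Simulate a plugin that dies in the dynamic loader before it can emit
# its JSON result: the loader error goes to stderr, stdout stays empty.
fake_plugin() {
  echo 'error while loading shared libraries: libcrypto.so.10' >&2
  exit 127
}

# The runtime captures stdout only, so the "diagnostic message" is "".
msg=$(fake_plugin 2>/dev/null || true)
printf 'diagnostic message: "%s"\n' "$msg"   # prints: diagnostic message: ""
```

So seeing this event after the multus fix suggests some other plugin binary (here, likely openshift-sdn or its bundled plugins) was still dynamically linked.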

Comment 42 Douglas Smith 2019-04-17 23:40:08 UTC
Found that the openshift-sdn plugin, which also includes and builds the loopback and host-local CNI plugins, was still dynamically linked.

Clayton Coleman opened this PR for updating openshift-sdn: https://github.com/openshift/origin/pull/22595

Comment 43 Sudha Ponnaganti 2019-04-18 11:07:08 UTC
Douglas - Would this PR fix the root cause for this issue?

Comment 44 Douglas Smith 2019-04-18 12:02:27 UTC
Sudha -- I understand the openshift-sdn binary (as well as the loopback and host-local plugins) is dynamically linked before this PR and will exhibit the same behavior described in comment #26, so the PR will at least fix these components, which I think is the next logical step. I just checked in on that PR and it's been triggered to run against CI again.

Comment 45 Casey Callendrello 2019-04-18 12:31:11 UTC
https://github.com/openshift/origin/pull/22595 has finally merged. Now we can try again.

Comment 46 Casey Callendrello 2019-04-18 13:02:58 UTC
*** Bug 1693247 has been marked as a duplicate of this bug. ***

Comment 47 Mike Fiedler 2019-04-18 19:08:14 UTC
Verified on 4.1.0-0.nightly-2019-04-18-170154. The build installs successfully and passes basic smoke testing.

Comment 49 errata-xmlrpc 2019-06-04 10:47:40 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0758

