Bug 1696774

Summary: Machine-os-content has not been promoted in the last 32 hours
Product: OpenShift Container Platform
Reporter: Clayton Coleman <ccoleman>
Component: Networking
Assignee: Casey Callendrello <cdc>
Status: CLOSED ERRATA
QA Contact: Meng Bo <bmeng>
Severity: urgent
Priority: unspecified
Version: 4.1.0
CC: amurdaca, aos-bugs, bbreard, danw, dcbw, dustymabe, imcleod, jligon, lxia, mifiedle, nstielau, sjenning, walters, wsun, zzhao
Keywords: Reopened, TestBlocker
Target Milestone: ---
Target Release: 4.1.0
Hardware: Unspecified
OS: Unspecified
Type: Bug
Last Closed: 2019-06-04 10:47:08 UTC

Comment 1 Colin Walters 2019-04-05 15:35:27 UTC
https://jira.coreos.com/browse/RHCOS-120

Comment 4 Colin Walters 2019-04-06 12:44:55 UTC
Looks like we have a new failure mode, dying during install:

 https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-promote-openshift-machine-os-content-e2e-aws-4.0/91/?log#log

One thing I notice at a quick glance is that the workers aren't coming up.
And the machine-api-operator logs are empty... not sure why yet.

Comment 5 Colin Walters 2019-04-07 20:09:22 UTC
To make this easier to debug, I created a release image: registry.svc.ci.openshift.org/rhcos/release:latest

You can pass it to the installer as usual.

Looks like openshift-apiserver is in CrashLoopBackOff:

Events:
  Type     Reason     Age                      From                            Message
  ----     ------     ----                     ----                            -------
  Normal   Pulled     7m19s (x29 over 6h24m)   kubelet, osiris-5rh5d-master-0  Container image "registry.svc.ci.openshift.org/openshift/origin-v4.0@sha256:ecacd82ccd9f2631fb5e184fc22cf79b3efe60896f29ed0e011f7f4a8589d6ae" already present on machine
  Normal   Created    7m18s (x29 over 6h24m)   kubelet, osiris-5rh5d-master-0  Created container openshift-apiserver
  Normal   Started    7m17s (x29 over 6h24m)   kubelet, osiris-5rh5d-master-0  Started container openshift-apiserver
  Warning  Unhealthy  6m57s (x57 over 6h24m)   kubelet, osiris-5rh5d-master-0  Readiness probe failed: Get https://10.88.0.43:8443/healthz: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
  Warning  BackOff    2m10s (x566 over 6h23m)  kubelet, osiris-5rh5d-master-0  Back-off restarting failed container

$ oc logs pods/apiserver-b44dk
I0407 20:02:44.505348       1 clientca.go:92] [0] "/var/run/configmaps/aggregator-client-ca/ca-bundle.crt" client-ca certificate: "aggregator-signer" [] issuer="<self>" (2019-04-07 13:26:11 +0000 UTC to 2019-04-08 13:26:11 +0000 UTC (now=2019-04-07 20:02:44.505327761 +0000 UTC))
I0407 20:02:44.505780       1 clientca.go:92] [0] "/var/run/configmaps/client-ca/ca-bundle.crt" client-ca certificate: "kube-ca" [] issuer="<self>" (2019-04-07 13:26:08 +0000 UTC to 2029-04-04 13:26:08 +0000 UTC (now=2019-04-07 20:02:44.505770769 +0000 UTC))
I0407 20:02:44.505812       1 clientca.go:92] [1] "/var/run/configmaps/client-ca/ca-bundle.crt" client-ca certificate: "admin-kubeconfig-signer" [] issuer="<self>" (2019-04-07 13:26:08 +0000 UTC to 2029-04-04 13:26:08 +0000 UTC (now=2019-04-07 20:02:44.50579267 +0000 UTC))
I0407 20:02:44.505822       1 clientca.go:92] [2] "/var/run/configmaps/client-ca/ca-bundle.crt" client-ca certificate: "kubelet-signer" [] issuer="<self>" (2019-04-07 13:26:13 +0000 UTC to 2019-04-08 13:26:13 +0000 UTC (now=2019-04-07 20:02:44.505817243 +0000 UTC))
I0407 20:02:44.505831       1 clientca.go:92] [3] "/var/run/configmaps/client-ca/ca-bundle.crt" client-ca certificate: "kube-control-plane-signer" [] issuer="<self>" (2019-04-07 13:26:13 +0000 UTC to 2020-04-06 13:26:13 +0000 UTC (now=2019-04-07 20:02:44.505826169 +0000 UTC))
I0407 20:02:44.505842       1 clientca.go:92] [4] "/var/run/configmaps/client-ca/ca-bundle.crt" client-ca certificate: "system:kube-apiserver" [client] groups=[kube-master] issuer="kube-apiserver-to-kubelet-signer" (2019-04-07 13:26:13 +0000 UTC to 2020-04-06 13:26:14 +0000 UTC (now=2019-04-07 20:02:44.505836365 +0000 UTC))
I0407 20:02:44.505865       1 clientca.go:92] [5] "/var/run/configmaps/client-ca/ca-bundle.crt" client-ca certificate: "kubelet-bootstrap-kubeconfig-signer" [] issuer="<self>" (2019-04-07 13:26:09 +0000 UTC to 2029-04-04 13:26:09 +0000 UTC (now=2019-04-07 20:02:44.505860046 +0000 UTC))
I0407 20:02:44.505874       1 clientca.go:92] [6] "/var/run/configmaps/client-ca/ca-bundle.crt" client-ca certificate: "kube-csr-signer_@1554644021" [] issuer="kubelet-signer" (2019-04-07 13:33:40 +0000 UTC to 2019-04-08 13:26:13 +0000 UTC (now=2019-04-07 20:02:44.505869248 +0000 UTC))
I0407 20:02:44.508408       1 audit.go:362] Using audit backend: ignoreErrors<log>
I0407 20:02:44.517343       1 plugins.go:84] Registered admission plugin "NamespaceLifecycle"
I0407 20:02:44.517359       1 plugins.go:84] Registered admission plugin "Initializers"
I0407 20:02:44.517364       1 plugins.go:84] Registered admission plugin "ValidatingAdmissionWebhook"
I0407 20:02:44.517368       1 plugins.go:84] Registered admission plugin "MutatingAdmissionWebhook"
I0407 20:02:44.518190       1 glog.go:79] Admission plugin "project.openshift.io/ProjectRequestLimit" is not configured so it will be disabled.
I0407 20:02:44.518358       1 glog.go:79] Admission plugin "scheduling.openshift.io/PodNodeConstraints" is not configured so it will be disabled.
I0407 20:02:44.518754       1 plugins.go:158] Loaded 5 mutating admission controller(s) successfully in the following order: NamespaceLifecycle,build.openshift.io/BuildConfigSecretInjector,image.openshift.io/ImageLimitRange,image.openshift.io/ImagePolicy,MutatingAdmissionWebhook.
I0407 20:02:44.518766       1 plugins.go:161] Loaded 8 validating admission controller(s) successfully in the following order: OwnerReferencesPermissionEnforcement,build.openshift.io/BuildConfigSecretInjector,build.openshift.io/BuildByStrategy,image.openshift.io/ImageLimitRange,image.openshift.io/ImagePolicy,quota.openshift.io/ClusterResourceQuota,ValidatingAdmissionWebhook,ResourceQuota.
I0407 20:02:44.524149       1 clientconn.go:551] parsed scheme: ""
I0407 20:02:44.524165       1 clientconn.go:557] scheme "" not registered, fallback to default scheme
I0407 20:02:44.524209       1 resolver_conn_wrapper.go:116] ccResolverWrapper: sending new addresses to cc: [{etcd.kube-system.svc:2379 0  <nil>}]
I0407 20:02:44.524289       1 balancer_v1_wrapper.go:125] balancerWrapper: got update addr from Notify: [{etcd.kube-system.svc:2379 <nil>}]
F0407 20:03:04.524406       1 storage_decorator.go:57] Unable to create storage backend: config (&{etcd3 openshift.io [https://etcd.kube-system.svc:2379] /var/run/secrets/etcd-client/tls.key /var/run/secrets/etcd-client/tls.crt /var/run/configmaps/etcd-serving-ca/ca-bundle.crt true {0xc420bfa990 0xc420bfaa20} <nil> 5m0s 1m0s}), err (context deadline exceeded)

Comment 13 Colin Walters 2019-04-08 16:45:12 UTC
Not sure about cause and effect here, but I noticed:

```
oc get csr
NAME        AGE   REQUESTOR                                                                   CONDITION
csr-2bgjq   11m   system:node:ip-10-0-157-133.us-east-2.compute.internal                      Pending
csr-55sdk   11m   system:node:ip-10-0-137-197.us-east-2.compute.internal                      Pending
csr-9bjw6   20m   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Approved,Issued
csr-d76nz   20m   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Approved,Issued
csr-k8t4r   20m   system:node:ip-10-0-137-197.us-east-2.compute.internal                      Pending
csr-qj28q   20m   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Approved,Issued
csr-qzd8x   11m   system:node:ip-10-0-164-115.us-east-2.compute.internal                      Pending
csr-vvvmx   20m   system:node:ip-10-0-157-133.us-east-2.compute.internal                      Pending
csr-x5rwh   19m   system:node:ip-10-0-164-115.us-east-2.compute.internal                      Pending
```

And:

```
oc -n openshift-service-ca logs deploy/service-serving-cert-signer
Error from server: Get https://ip-10-0-157-133.us-east-2.compute.internal:10250/containerLogs/openshift-service-ca/service-serving-cert-signer-5788d67844-jlnts/service-serving-cert-signer-controller: remote error: tls: internal error
```

I just tried `oc get csr | xargs oc adm certificate approve`.
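As an aside, piping the raw `oc get csr` table through xargs also feeds the header row and the AGE/REQUESTOR/CONDITION columns to the approve command. A more targeted sketch (an assumption, not the exact command used above) filters to the names of Pending CSRs first; on a live cluster the approve step would then be `xargs oc adm certificate approve`. Here the filter is demonstrated against a slice of the table captured in this comment:

```shell
# Sample rows taken from the `oc get csr` output above.
csr_table='NAME        AGE   REQUESTOR                                                                   CONDITION
csr-2bgjq   11m   system:node:ip-10-0-157-133.us-east-2.compute.internal                      Pending
csr-9bjw6   20m   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Approved,Issued
csr-k8t4r   20m   system:node:ip-10-0-137-197.us-east-2.compute.internal                      Pending'

# Keep only rows whose last column is exactly "Pending" and print the CSR name.
# On a cluster: oc get csr | awk '$NF == "Pending" { print $1 }' | xargs oc adm certificate approve
pending=$(printf '%s\n' "$csr_table" | awk '$NF == "Pending" { print $1 }')
printf '%s\n' "$pending"
```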

Comment 21 Colin Walters 2019-04-08 18:51:30 UTC
Dan Winship [2:48 PM]
`10.88.0.0/16 dev cni0 proto kernel scope link src 10.88.0.1`
no part of openshift creates a `cni0`

Comment 22 Colin Walters 2019-04-08 18:58:58 UTC
For reference, to make the release payload here, I did:

oc adm release new --from-image-stream=origin-v4.0 -n openshift --to-image registry.svc.ci.openshift.org/rhcos/release:latest machine-os-content=registry.svc.ci.openshift.org/rhcos/machine-os-content@sha256:e808a43cd9ef88cbf1c805179ff7672c03907aeff184bc1e902c82d9ab5aa9c8

Then: xokdinst launch -p aws -I registry.svc.ci.openshift.org/rhcos/release:latest horus
(This uses my installer wrapper; with plain openshift-install you need to do:
 env OPENSHIFT_INSTALL_RELEASE_IMAGE_OVERRIDE=registry.svc.ci.openshift.org/rhcos/release:latest openshift-install ...)

Now if, e.g., fixes for this land in the kubelet (or anything host-side), we'll need to re-push machine-os-content and rerun the above.
Or if fixes land in some of the SDN pods, we can generate a custom payload that overrides those images too to test it.
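The override pattern generalizes: `oc adm release new` accepts multiple `name=pullspec` pairs, so an SDN image override could be added alongside machine-os-content. A hypothetical sketch follows; the `sdn` tag name and the `SDN_PULLSPEC` value are assumptions (substitute a real digest before running), and since the actual push needs access to the CI registry, this script only assembles and prints the command:

```shell
# Hypothetical sketch: build a custom payload that overrides two components.
# SDN_PULLSPEC is a placeholder, not a real digest.
REGISTRY=registry.svc.ci.openshift.org
MOSC_PULLSPEC="$REGISTRY/rhcos/machine-os-content@sha256:e808a43cd9ef88cbf1c805179ff7672c03907aeff184bc1e902c82d9ab5aa9c8"
SDN_PULLSPEC="REPLACE_WITH_REAL_SDN_DIGEST"   # placeholder (assumption)

# Same oc adm release new invocation as above, with one extra name=pullspec pair.
cmd="oc adm release new --from-image-stream=origin-v4.0 -n openshift \
--to-image $REGISTRY/rhcos/release:latest \
machine-os-content=$MOSC_PULLSPEC \
sdn=$SDN_PULLSPEC"

printf '%s\n' "$cmd"
```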

Comment 24 Seth Jennings 2019-04-08 20:35:50 UTC
*** Bug 1697213 has been marked as a duplicate of this bug. ***

Comment 25 Colin Walters 2019-04-08 21:54:53 UTC
I am quite confident https://github.com/openshift/machine-config-operator/pull/608 (which just merged) will fix this.  But we also have a CRI-O fix coming that should work.

Comment 26 Colin Walters 2019-04-09 12:59:51 UTC
OK, the latest promotion job is now failing in e2e:

https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-promote-openshift-machine-os-content-e2e-aws-4.0/99/

Failing tests:

[Feature:Prometheus][Conformance] Prometheus when installed on the cluster should start and expose a secured proxy and unsecured metrics [Suite:openshift/conformance/parallel/minimal]
[sig-storage] In-tree Volumes [Driver: hostPathSymlink] [Testpattern: Inline-volume (ext3)] volumes should allow exec of files on the volume [Suite:openshift/conformance/parallel] [Suite:k8s]

Writing JUnit report to /tmp/artifacts/junit/junit_e2e_20190409-001133.xml

error: 2 fail, 569 pass, 547 skip (26m22s)

Those look like the usual flakes? Going to ask for retests.

Comment 28 Mike Fiedler 2019-04-10 18:25:18 UTC
The kubelet is still 1.12.4 in the 4.0.0-0.nightly-2019-04-10-141956 build.

Moving this to MODIFIED until the fix is available in a build; then it can go to ON_QA.

Comment 29 Mike Fiedler 2019-04-10 18:26:01 UTC
Adding TestBlocker since it blocks Kube 1.13 regression testing.

Comment 30 Seth Jennings 2019-04-10 19:28:40 UTC
I just installed a cluster and it is running the 1.13 kubelet:

$ oc get nodes
NAME                                         STATUS   ROLES    AGE   VERSION
ip-10-0-129-243.us-west-1.compute.internal   Ready    worker   25h   v1.13.4+1ad602308
ip-10-0-131-69.us-west-1.compute.internal    Ready    master   25h   v1.13.4+1ad602308
ip-10-0-139-220.us-west-1.compute.internal   Ready    worker   25h   v1.13.4+1ad602308
ip-10-0-143-227.us-west-1.compute.internal   Ready    master   25h   v1.13.4+1ad602308
ip-10-0-156-146.us-west-1.compute.internal   Ready    worker   25h   v1.13.4+1ad602308
ip-10-0-158-107.us-west-1.compute.internal   Ready    master   25h   v1.13.4+1ad602308

$ oc get clusterversion
NAME      VERSION                           AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.0.0-0.alpha-2019-04-09-164546   True        False         25h     Cluster version is 4.0.0-0.alpha-2019-04-09-164546

Comment 31 Mike Fiedler 2019-04-11 00:00:46 UTC
Re: comment 30, did you install an ART-produced OCP build or a CI OKD build?

Comment 32 Seth Jennings 2019-04-11 04:27:45 UTC
OKD CI build

Comment 34 zhaozhanqi 2019-04-19 03:26:57 UTC
Verified this bug on 4.1.0-0.nightly-2019-04-18-170154

Comment 36 errata-xmlrpc 2019-06-04 10:47:08 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0758