Bug 1696774

Summary: Machine-os-content has not been promoted in the last 32 hours
Product: OpenShift Container Platform
Reporter: Clayton Coleman <ccoleman>
Component: Networking
Assignee: Casey Callendrello <cdc>
Status: CLOSED ERRATA
QA Contact: Meng Bo <bmeng>
Severity: urgent
Priority: unspecified
Version: 4.1.0
CC: amurdaca, aos-bugs, bbreard, danw, dcbw, dustymabe, imcleod, jligon, lxia, mifiedle, nstielau, sjenning, walters, wsun, zzhao
Keywords: Reopened, TestBlocker
Target Milestone: ---
Target Release: 4.1.0
Hardware: Unspecified
OS: Unspecified
Type: Bug
Last Closed: 2019-06-04 10:47:08 UTC

Comment 1 Colin Walters 2019-04-05 15:35:27 UTC
https://jira.coreos.com/browse/RHCOS-120

Comment 4 Colin Walters 2019-04-06 12:44:55 UTC
Looks like we have a new failure mode, dying during install:

 https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-promote-openshift-machine-os-content-e2e-aws-4.0/91/?log#log

One thing I notice at a quick glance is that the workers aren't coming up.
And the machine-api-operator logs are empty... not sure why yet.

Comment 5 Colin Walters 2019-04-07 20:09:22 UTC
To make this easier to debug, I created a release image: registry.svc.ci.openshift.org/rhcos/release:latest

You can pass it to the installer as usual.

Looks like openshift-apiserver is in CrashLoopBackOff:

Events:
  Type     Reason     Age                      From                            Message
  ----     ------     ----                     ----                            -------
  Normal   Pulled     7m19s (x29 over 6h24m)   kubelet, osiris-5rh5d-master-0  Container image "registry.svc.ci.openshift.org/openshift/origin-v4.0@sha256:ecacd82ccd9f2631fb5e184fc22cf79b3efe60896f29ed0e011f7f4a8589d6ae" already present on machine
  Normal   Created    7m18s (x29 over 6h24m)   kubelet, osiris-5rh5d-master-0  Created container openshift-apiserver
  Normal   Started    7m17s (x29 over 6h24m)   kubelet, osiris-5rh5d-master-0  Started container openshift-apiserver
  Warning  Unhealthy  6m57s (x57 over 6h24m)   kubelet, osiris-5rh5d-master-0  Readiness probe failed: Get https://10.88.0.43:8443/healthz: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
  Warning  BackOff    2m10s (x566 over 6h23m)  kubelet, osiris-5rh5d-master-0  Back-off restarting failed container

$ oc logs pods/apiserver-b44dk
I0407 20:02:44.505348       1 clientca.go:92] [0] "/var/run/configmaps/aggregator-client-ca/ca-bundle.crt" client-ca certificate: "aggregator-signer" [] issuer="<self>" (2019-04-07 13:26:11 +0000 UTC to 2019-04-08 13:26:11 +0000 UTC (now=2019-04-07 20:02:44.505327761 +0000 UTC))
I0407 20:02:44.505780       1 clientca.go:92] [0] "/var/run/configmaps/client-ca/ca-bundle.crt" client-ca certificate: "kube-ca" [] issuer="<self>" (2019-04-07 13:26:08 +0000 UTC to 2029-04-04 13:26:08 +0000 UTC (now=2019-04-07 20:02:44.505770769 +0000 UTC))
I0407 20:02:44.505812       1 clientca.go:92] [1] "/var/run/configmaps/client-ca/ca-bundle.crt" client-ca certificate: "admin-kubeconfig-signer" [] issuer="<self>" (2019-04-07 13:26:08 +0000 UTC to 2029-04-04 13:26:08 +0000 UTC (now=2019-04-07 20:02:44.50579267 +0000 UTC))
I0407 20:02:44.505822       1 clientca.go:92] [2] "/var/run/configmaps/client-ca/ca-bundle.crt" client-ca certificate: "kubelet-signer" [] issuer="<self>" (2019-04-07 13:26:13 +0000 UTC to 2019-04-08 13:26:13 +0000 UTC (now=2019-04-07 20:02:44.505817243 +0000 UTC))
I0407 20:02:44.505831       1 clientca.go:92] [3] "/var/run/configmaps/client-ca/ca-bundle.crt" client-ca certificate: "kube-control-plane-signer" [] issuer="<self>" (2019-04-07 13:26:13 +0000 UTC to 2020-04-06 13:26:13 +0000 UTC (now=2019-04-07 20:02:44.505826169 +0000 UTC))
I0407 20:02:44.505842       1 clientca.go:92] [4] "/var/run/configmaps/client-ca/ca-bundle.crt" client-ca certificate: "system:kube-apiserver" [client] groups=[kube-master] issuer="kube-apiserver-to-kubelet-signer" (2019-04-07 13:26:13 +0000 UTC to 2020-04-06 13:26:14 +0000 UTC (now=2019-04-07 20:02:44.505836365 +0000 UTC))
I0407 20:02:44.505865       1 clientca.go:92] [5] "/var/run/configmaps/client-ca/ca-bundle.crt" client-ca certificate: "kubelet-bootstrap-kubeconfig-signer" [] issuer="<self>" (2019-04-07 13:26:09 +0000 UTC to 2029-04-04 13:26:09 +0000 UTC (now=2019-04-07 20:02:44.505860046 +0000 UTC))
I0407 20:02:44.505874       1 clientca.go:92] [6] "/var/run/configmaps/client-ca/ca-bundle.crt" client-ca certificate: "kube-csr-signer_@1554644021" [] issuer="kubelet-signer" (2019-04-07 13:33:40 +0000 UTC to 2019-04-08 13:26:13 +0000 UTC (now=2019-04-07 20:02:44.505869248 +0000 UTC))
I0407 20:02:44.508408       1 audit.go:362] Using audit backend: ignoreErrors<log>
I0407 20:02:44.517343       1 plugins.go:84] Registered admission plugin "NamespaceLifecycle"
I0407 20:02:44.517359       1 plugins.go:84] Registered admission plugin "Initializers"
I0407 20:02:44.517364       1 plugins.go:84] Registered admission plugin "ValidatingAdmissionWebhook"
I0407 20:02:44.517368       1 plugins.go:84] Registered admission plugin "MutatingAdmissionWebhook"
I0407 20:02:44.518190       1 glog.go:79] Admission plugin "project.openshift.io/ProjectRequestLimit" is not configured so it will be disabled.
I0407 20:02:44.518358       1 glog.go:79] Admission plugin "scheduling.openshift.io/PodNodeConstraints" is not configured so it will be disabled.
I0407 20:02:44.518754       1 plugins.go:158] Loaded 5 mutating admission controller(s) successfully in the following order: NamespaceLifecycle,build.openshift.io/BuildConfigSecretInjector,image.openshift.io/ImageLimitRange,image.openshift.io/ImagePolicy,MutatingAdmissionWebhook.
I0407 20:02:44.518766       1 plugins.go:161] Loaded 8 validating admission controller(s) successfully in the following order: OwnerReferencesPermissionEnforcement,build.openshift.io/BuildConfigSecretInjector,build.openshift.io/BuildByStrategy,image.openshift.io/ImageLimitRange,image.openshift.io/ImagePolicy,quota.openshift.io/ClusterResourceQuota,ValidatingAdmissionWebhook,ResourceQuota.
I0407 20:02:44.524149       1 clientconn.go:551] parsed scheme: ""
I0407 20:02:44.524165       1 clientconn.go:557] scheme "" not registered, fallback to default scheme
I0407 20:02:44.524209       1 resolver_conn_wrapper.go:116] ccResolverWrapper: sending new addresses to cc: [{etcd.kube-system.svc:2379 0  <nil>}]
I0407 20:02:44.524289       1 balancer_v1_wrapper.go:125] balancerWrapper: got update addr from Notify: [{etcd.kube-system.svc:2379 <nil>}]
F0407 20:03:04.524406       1 storage_decorator.go:57] Unable to create storage backend: config (&{etcd3 openshift.io [https://etcd.kube-system.svc:2379] /var/run/secrets/etcd-client/tls.key /var/run/secrets/etcd-client/tls.crt /var/run/configmaps/etcd-serving-ca/ca-bundle.crt true {0xc420bfa990 0xc420bfaa20} <nil> 5m0s 1m0s}), err (context deadline exceeded)

Comment 13 Colin Walters 2019-04-08 16:45:12 UTC
Not sure about cause and effect here, but I noticed:

```
oc get csr
NAME        AGE   REQUESTOR                                                                   CONDITION
csr-2bgjq   11m   system:node:ip-10-0-157-133.us-east-2.compute.internal                      Pending
csr-55sdk   11m   system:node:ip-10-0-137-197.us-east-2.compute.internal                      Pending
csr-9bjw6   20m   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Approved,Issued
csr-d76nz   20m   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Approved,Issued
csr-k8t4r   20m   system:node:ip-10-0-137-197.us-east-2.compute.internal                      Pending
csr-qj28q   20m   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Approved,Issued
csr-qzd8x   11m   system:node:ip-10-0-164-115.us-east-2.compute.internal                      Pending
csr-vvvmx   20m   system:node:ip-10-0-157-133.us-east-2.compute.internal                      Pending
csr-x5rwh   19m   system:node:ip-10-0-164-115.us-east-2.compute.internal                      Pending
```

And:

```
oc -n openshift-service-ca logs deploy/service-serving-cert-signer
Error from server: Get https://ip-10-0-157-133.us-east-2.compute.internal:10250/containerLogs/openshift-service-ca/service-serving-cert-signer-5788d67844-jlnts/service-serving-cert-signer-controller: remote error: tls: internal error
```

I just tried `oc get csr | xargs oc adm certificate approve`.
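As an aside, piping the raw `oc get csr` table through xargs also feeds the header row and the AGE/REQUESTOR/CONDITION columns to the approve command. A more targeted sketch (an assumption, not the exact command used above) filters to the names of Pending CSRs first; on a live cluster the approve step would then be `xargs oc adm certificate approve`. Here the filter is demonstrated against a slice of the table captured in this comment:

```shell
# Sample rows taken from the `oc get csr` output above.
csr_table='NAME        AGE   REQUESTOR                                                                   CONDITION
csr-2bgjq   11m   system:node:ip-10-0-157-133.us-east-2.compute.internal                      Pending
csr-9bjw6   20m   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Approved,Issued
csr-k8t4r   20m   system:node:ip-10-0-137-197.us-east-2.compute.internal                      Pending'

# Keep only rows whose last column is exactly "Pending" and print the CSR name.
# On a cluster: oc get csr | awk '$NF == "Pending" { print $1 }' | xargs oc adm certificate approve
pending=$(printf '%s\n' "$csr_table" | awk '$NF == "Pending" { print $1 }')
printf '%s\n' "$pending"
```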

Comment 21 Colin Walters 2019-04-08 18:51:30 UTC
Dan Winship [2:48 PM]
`10.88.0.0/16 dev cni0 proto kernel scope link src 10.88.0.1`
no part of openshift creates a `cni0`

Comment 22 Colin Walters 2019-04-08 18:58:58 UTC
For reference, to make the release payload here, I did:

oc adm release new --from-image-stream=origin-v4.0 -n openshift --to-image registry.svc.ci.openshift.org/rhcos/release:latest machine-os-content=registry.svc.ci.openshift.org/rhcos/machine-os-content@sha256:e808a43cd9ef88cbf1c805179ff7672c03907aeff184bc1e902c82d9ab5aa9c8

Then: xokdinst launch -p aws -I registry.svc.ci.openshift.org/rhcos/release:latest horus
(This uses my installer wrapper; with plain openshift-install you need to do:
 env OPENSHIFT_INSTALL_RELEASE_IMAGE_OVERRIDE=registry.svc.ci.openshift.org/rhcos/release:latest openshift-install ...)

Now if, e.g., fixes for this land in the kubelet (or anything host-side), we'll need to re-push machine-os-content and rerun the above.
Or if fixes land in some of the SDN pods, we can generate a custom payload that overrides those images too to test it.
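The override pattern generalizes: `oc adm release new` accepts multiple `name=pullspec` pairs, so an SDN image override could be added alongside machine-os-content. A hypothetical sketch follows; the `sdn` tag name and the `SDN_PULLSPEC` value are assumptions (substitute a real digest before running), and since the actual push needs access to the CI registry, this script only assembles and prints the command:

```shell
# Hypothetical sketch: build a custom payload that overrides two components.
# SDN_PULLSPEC is a placeholder, not a real digest.
REGISTRY=registry.svc.ci.openshift.org
MOSC_PULLSPEC="$REGISTRY/rhcos/machine-os-content@sha256:e808a43cd9ef88cbf1c805179ff7672c03907aeff184bc1e902c82d9ab5aa9c8"
SDN_PULLSPEC="REPLACE_WITH_REAL_SDN_DIGEST"   # placeholder (assumption)

# Same oc adm release new invocation as above, with one extra name=pullspec pair.
cmd="oc adm release new --from-image-stream=origin-v4.0 -n openshift \
--to-image $REGISTRY/rhcos/release:latest \
machine-os-content=$MOSC_PULLSPEC \
sdn=$SDN_PULLSPEC"

printf '%s\n' "$cmd"
```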

Comment 24 Seth Jennings 2019-04-08 20:35:50 UTC
*** Bug 1697213 has been marked as a duplicate of this bug. ***

Comment 25 Colin Walters 2019-04-08 21:54:53 UTC
I am quite confident https://github.com/openshift/machine-config-operator/pull/608 (which just merged) will fix this.  But we also have a CRI-O fix coming that should work.

Comment 26 Colin Walters 2019-04-09 12:59:51 UTC
OK, the latest promotion job is now failing in e2e:

https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-promote-openshift-machine-os-content-e2e-aws-4.0/99/

Failing tests:

[Feature:Prometheus][Conformance] Prometheus when installed on the cluster should start and expose a secured proxy and unsecured metrics [Suite:openshift/conformance/parallel/minimal]
[sig-storage] In-tree Volumes [Driver: hostPathSymlink] [Testpattern: Inline-volume (ext3)] volumes should allow exec of files on the volume [Suite:openshift/conformance/parallel] [Suite:k8s]

Writing JUnit report to /tmp/artifacts/junit/junit_e2e_20190409-001133.xml

error: 2 fail, 569 pass, 547 skip (26m22s)

Those look like the usual flakes? Going to ask for retests.

Comment 28 Mike Fiedler 2019-04-10 18:25:18 UTC
The kubelet is still 1.12.4 in the 4.0.0-0.nightly-2019-04-10-141956 build.

Moving this to MODIFIED until the fix is available in a build; then it can go to ON_QA.

Comment 29 Mike Fiedler 2019-04-10 18:26:01 UTC
Adding TestBlocker since it blocks Kube 1.13 regression testing.

Comment 30 Seth Jennings 2019-04-10 19:28:40 UTC
I just installed a cluster and it is running the 1.13 kubelet:

$ oc get nodes
NAME                                         STATUS   ROLES    AGE   VERSION
ip-10-0-129-243.us-west-1.compute.internal   Ready    worker   25h   v1.13.4+1ad602308
ip-10-0-131-69.us-west-1.compute.internal    Ready    master   25h   v1.13.4+1ad602308
ip-10-0-139-220.us-west-1.compute.internal   Ready    worker   25h   v1.13.4+1ad602308
ip-10-0-143-227.us-west-1.compute.internal   Ready    master   25h   v1.13.4+1ad602308
ip-10-0-156-146.us-west-1.compute.internal   Ready    worker   25h   v1.13.4+1ad602308
ip-10-0-158-107.us-west-1.compute.internal   Ready    master   25h   v1.13.4+1ad602308

$ oc get clusterversion
NAME      VERSION                           AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.0.0-0.alpha-2019-04-09-164546   True        False         25h     Cluster version is 4.0.0-0.alpha-2019-04-09-164546

Comment 31 Mike Fiedler 2019-04-11 00:00:46 UTC
Re: comment 30, did you install an ART-produced OCP build or a CI OKD build?

Comment 32 Seth Jennings 2019-04-11 04:27:45 UTC
OKD CI build

Comment 34 zhaozhanqi 2019-04-19 03:26:57 UTC
Verified this bug on 4.1.0-0.nightly-2019-04-18-170154

Comment 36 errata-xmlrpc 2019-06-04 10:47:08 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0758