https://openshift-gce-devel.appspot.com/builds/origin-ci-test/logs/release-promote-openshift-machine-os-content-e2e-aws-4.0/ has failed 3 times in a row, failing to bring up a cluster. This blocks grabbing the newer kubelet. We did get the bootkube logs, so someone needs to debug https://gcsweb-ci.svc.ci.openshift.org/gcs/origin-ci-test/logs/release-promote-openshift-machine-os-content-e2e-aws-4.0/88/artifacts/e2e-aws/bootstrap/
https://jira.coreos.com/browse/RHCOS-120
Looks like we have a new failure mode, dying during install: https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-promote-openshift-machine-os-content-e2e-aws-4.0/91/?log#log One thing I notice at a quick glance is that the workers aren't coming up. And the machine-api-operator logs are empty... not sure why yet.
In order to make this easier to debug, I created a release image: registry.svc.ci.openshift.org/rhcos/release:latest, which you can pass to the installer as usual.

Looks like openshift-apiserver is in CrashLoopBackOff:

```
Events:
  Type     Reason     Age                      From                            Message
  ----     ------     ----                     ----                            -------
  Normal   Pulled     7m19s (x29 over 6h24m)   kubelet, osiris-5rh5d-master-0  Container image "registry.svc.ci.openshift.org/openshift/origin-v4.0@sha256:ecacd82ccd9f2631fb5e184fc22cf79b3efe60896f29ed0e011f7f4a8589d6ae" already present on machine
  Normal   Created    7m18s (x29 over 6h24m)   kubelet, osiris-5rh5d-master-0  Created container openshift-apiserver
  Normal   Started    7m17s (x29 over 6h24m)   kubelet, osiris-5rh5d-master-0  Started container openshift-apiserver
  Warning  Unhealthy  6m57s (x57 over 6h24m)   kubelet, osiris-5rh5d-master-0  Readiness probe failed: Get https://10.88.0.43:8443/healthz: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
  Warning  BackOff    2m10s (x566 over 6h23m)  kubelet, osiris-5rh5d-master-0  Back-off restarting failed container
```

```
$ oc logs pods/apiserver-b44dk
I0407 20:02:44.505348       1 clientca.go:92] [0] "/var/run/configmaps/aggregator-client-ca/ca-bundle.crt" client-ca certificate: "aggregator-signer" [] issuer="<self>" (2019-04-07 13:26:11 +0000 UTC to 2019-04-08 13:26:11 +0000 UTC (now=2019-04-07 20:02:44.505327761 +0000 UTC))
I0407 20:02:44.505780       1 clientca.go:92] [0] "/var/run/configmaps/client-ca/ca-bundle.crt" client-ca certificate: "kube-ca" [] issuer="<self>" (2019-04-07 13:26:08 +0000 UTC to 2029-04-04 13:26:08 +0000 UTC (now=2019-04-07 20:02:44.505770769 +0000 UTC))
I0407 20:02:44.505812       1 clientca.go:92] [1] "/var/run/configmaps/client-ca/ca-bundle.crt" client-ca certificate: "admin-kubeconfig-signer" [] issuer="<self>" (2019-04-07 13:26:08 +0000 UTC to 2029-04-04 13:26:08 +0000 UTC (now=2019-04-07 20:02:44.50579267 +0000 UTC))
I0407 20:02:44.505822       1 clientca.go:92] [2] "/var/run/configmaps/client-ca/ca-bundle.crt" client-ca certificate: "kubelet-signer" [] issuer="<self>" (2019-04-07 13:26:13 +0000 UTC to 2019-04-08 13:26:13 +0000 UTC (now=2019-04-07 20:02:44.505817243 +0000 UTC))
I0407 20:02:44.505831       1 clientca.go:92] [3] "/var/run/configmaps/client-ca/ca-bundle.crt" client-ca certificate: "kube-control-plane-signer" [] issuer="<self>" (2019-04-07 13:26:13 +0000 UTC to 2020-04-06 13:26:13 +0000 UTC (now=2019-04-07 20:02:44.505826169 +0000 UTC))
I0407 20:02:44.505842       1 clientca.go:92] [4] "/var/run/configmaps/client-ca/ca-bundle.crt" client-ca certificate: "system:kube-apiserver" [client] groups=[kube-master] issuer="kube-apiserver-to-kubelet-signer" (2019-04-07 13:26:13 +0000 UTC to 2020-04-06 13:26:14 +0000 UTC (now=2019-04-07 20:02:44.505836365 +0000 UTC))
I0407 20:02:44.505865       1 clientca.go:92] [5] "/var/run/configmaps/client-ca/ca-bundle.crt" client-ca certificate: "kubelet-bootstrap-kubeconfig-signer" [] issuer="<self>" (2019-04-07 13:26:09 +0000 UTC to 2029-04-04 13:26:09 +0000 UTC (now=2019-04-07 20:02:44.505860046 +0000 UTC))
I0407 20:02:44.505874       1 clientca.go:92] [6] "/var/run/configmaps/client-ca/ca-bundle.crt" client-ca certificate: "kube-csr-signer_@1554644021" [] issuer="kubelet-signer" (2019-04-07 13:33:40 +0000 UTC to 2019-04-08 13:26:13 +0000 UTC (now=2019-04-07 20:02:44.505869248 +0000 UTC))
I0407 20:02:44.508408       1 audit.go:362] Using audit backend: ignoreErrors<log>
I0407 20:02:44.517343       1 plugins.go:84] Registered admission plugin "NamespaceLifecycle"
I0407 20:02:44.517359       1 plugins.go:84] Registered admission plugin "Initializers"
I0407 20:02:44.517364       1 plugins.go:84] Registered admission plugin "ValidatingAdmissionWebhook"
I0407 20:02:44.517368       1 plugins.go:84] Registered admission plugin "MutatingAdmissionWebhook"
I0407 20:02:44.518190       1 glog.go:79] Admission plugin "project.openshift.io/ProjectRequestLimit" is not configured so it will be disabled.
I0407 20:02:44.518358       1 glog.go:79] Admission plugin "scheduling.openshift.io/PodNodeConstraints" is not configured so it will be disabled.
I0407 20:02:44.518754       1 plugins.go:158] Loaded 5 mutating admission controller(s) successfully in the following order: NamespaceLifecycle,build.openshift.io/BuildConfigSecretInjector,image.openshift.io/ImageLimitRange,image.openshift.io/ImagePolicy,MutatingAdmissionWebhook.
I0407 20:02:44.518766       1 plugins.go:161] Loaded 8 validating admission controller(s) successfully in the following order: OwnerReferencesPermissionEnforcement,build.openshift.io/BuildConfigSecretInjector,build.openshift.io/BuildByStrategy,image.openshift.io/ImageLimitRange,image.openshift.io/ImagePolicy,quota.openshift.io/ClusterResourceQuota,ValidatingAdmissionWebhook,ResourceQuota.
I0407 20:02:44.524149       1 clientconn.go:551] parsed scheme: ""
I0407 20:02:44.524165       1 clientconn.go:557] scheme "" not registered, fallback to default scheme
I0407 20:02:44.524209       1 resolver_conn_wrapper.go:116] ccResolverWrapper: sending new addresses to cc: [{etcd.kube-system.svc:2379 0 <nil>}]
I0407 20:02:44.524289       1 balancer_v1_wrapper.go:125] balancerWrapper: got update addr from Notify: [{etcd.kube-system.svc:2379 <nil>}]
F0407 20:03:04.524406       1 storage_decorator.go:57] Unable to create storage backend: config (&{etcd3 openshift.io [https://etcd.kube-system.svc:2379] /var/run/secrets/etcd-client/tls.key /var/run/secrets/etcd-client/tls.crt /var/run/configmaps/etcd-serving-ca/ca-bundle.crt true {0xc420bfa990 0xc420bfaa20} <nil> 5m0s 1m0s}), err (context deadline exceeded)
```
Not sure about cause and effect here, but I noticed:

```
oc get csr
NAME        AGE   REQUESTOR                                                                    CONDITION
csr-2bgjq   11m   system:node:ip-10-0-157-133.us-east-2.compute.internal                       Pending
csr-55sdk   11m   system:node:ip-10-0-137-197.us-east-2.compute.internal                       Pending
csr-9bjw6   20m   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper    Approved,Issued
csr-d76nz   20m   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper    Approved,Issued
csr-k8t4r   20m   system:node:ip-10-0-137-197.us-east-2.compute.internal                       Pending
csr-qj28q   20m   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper    Approved,Issued
csr-qzd8x   11m   system:node:ip-10-0-164-115.us-east-2.compute.internal                       Pending
csr-vvvmx   20m   system:node:ip-10-0-157-133.us-east-2.compute.internal                       Pending
csr-x5rwh   19m   system:node:ip-10-0-164-115.us-east-2.compute.internal                       Pending
```

And:

```
oc -n openshift-service-ca logs deploy/service-serving-cert-signer
Error from server: Get https://ip-10-0-157-133.us-east-2.compute.internal:10250/containerLogs/openshift-service-ca/service-serving-cert-signer-5788d67844-jlnts/service-serving-cert-signer-controller: remote error: tls: internal error
```

I just tried a `oc get csr | xargs oc adm certificate approve`
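As written, that pipes the whole `oc get csr` table (headers, AGE, and CONDITION columns) into `xargs`. A slightly more targeted sketch that approves only the still-pending requests, assuming the usual go-template support in `oc get`:

```
# Approve only CSRs that have no status yet (i.e. still Pending).
oc get csr -o go-template='{{range .items}}{{if not .status}}{{.metadata.name}}{{"\n"}}{{end}}{{end}}' \
  | xargs --no-run-if-empty oc adm certificate approve
```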
Dan Winship [2:48 PM]
`10.88.0.0/16 dev cni0 proto kernel scope link src 10.88.0.1`
no part of openshift creates a `cni0`
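For what it's worth, 10.88.0.0/16 on a `cni0` bridge matches the default bridge CNI config shipped with cri-o/podman rather than the OpenShift SDN. A quick sketch for checking what dropped that config on an affected node (standard CNI paths assumed):

```
# On the node: list the CNI configs present and look at the bridge itself.
ls -l /etc/cni/net.d/
ip addr show cni0
```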
For reference, in order to make the release payload here, I did:

```
oc adm release new --from-image-stream=origin-v4.0 -n openshift \
  --to-image registry.svc.ci.openshift.org/rhcos/release:latest \
  machine-os-content=registry.svc.ci.openshift.org/rhcos/machine-os-content@sha256:e808a43cd9ef88cbf1c805179ff7672c03907aeff184bc1e902c82d9ab5aa9c8
```

Then:

```
xokdinst launch -p aws -I registry.svc.ci.openshift.org/rhcos/release:latest horus
```

(That uses my installer wrapper; with plain openshift-install you need to do: `env OPENSHIFT_INSTALL_RELEASE_IMAGE_OVERRIDE=registry.svc.ci.openshift.org/rhcos/release:latest openshift-install ...`)

Now if e.g. some fixes land in the kubelet (or anything host side) for this, we'll need to re-push machine-os-content and rerun the above. Or if fixes land in some of the SDN pods, we can generate a custom payload that overrides those too to test it.
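If we do need to override the SDN (or any other component) as well, the same `oc adm release new` invocation can take extra name=pullspec overrides. A sketch only; the "sdn" component name and the pullspec below are placeholders, and the actual tag names can be listed with `oc adm release info`:

```
# See which component tags exist in the current payload.
oc adm release info registry.svc.ci.openshift.org/rhcos/release:latest

# Rebuild the payload overriding machine-os-content plus a hypothetical "sdn" component.
oc adm release new --from-image-stream=origin-v4.0 -n openshift \
  --to-image registry.svc.ci.openshift.org/rhcos/release:latest \
  machine-os-content=registry.svc.ci.openshift.org/rhcos/machine-os-content@sha256:e808a43cd9ef88cbf1c805179ff7672c03907aeff184bc1e902c82d9ab5aa9c8 \
  sdn=<custom-sdn-image-pullspec>
```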
https://github.com/openshift/machine-config-operator/pull/608
*** Bug 1697213 has been marked as a duplicate of this bug. ***
I am quite confident https://github.com/openshift/machine-config-operator/pull/608 (which just merged) will fix this. But we also have a cri-o fix coming that should help.
OK, latest promotion job is failing in e2e now: https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-promote-openshift-machine-os-content-e2e-aws-4.0/99/

```
Failing tests:

[Feature:Prometheus][Conformance] Prometheus when installed on the cluster should start and expose a secured proxy and unsecured metrics [Suite:openshift/conformance/parallel/minimal]
[sig-storage] In-tree Volumes [Driver: hostPathSymlink] [Testpattern: Inline-volume (ext3)] volumes should allow exec of files on the volume [Suite:openshift/conformance/parallel] [Suite:k8s]

Writing JUnit report to /tmp/artifacts/junit/junit_e2e_20190409-001133.xml
error: 2 fail, 569 pass, 547 skip (26m22s)
```

Those look like usual flakes? Going to ask for retests.
https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-promote-openshift-machine-os-content-e2e-aws-4.0/101/ passed.
The kubelet is still 1.12.4 in the 4.0.0-0.nightly-2019-04-10-141956 build. Moving this to MODIFIED until the new kubelet is available in a build; then it can go to ON_QA.
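For anyone re-checking this on a build, a one-liner sketch to see the kubelet version each node reports (plain Kubernetes `nodeInfo`, nothing OpenShift-specific):

```
# Print node name and kubelet version side by side.
oc get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.nodeInfo.kubeletVersion}{"\n"}{end}'
```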
Adding TestBlocker since it blocks Kube 1.13 regression testing
I just installed a cluster and it is running the 1.13 kubelet:

```
$ oc get nodes
NAME                                         STATUS   ROLES    AGE   VERSION
ip-10-0-129-243.us-west-1.compute.internal   Ready    worker   25h   v1.13.4+1ad602308
ip-10-0-131-69.us-west-1.compute.internal    Ready    master   25h   v1.13.4+1ad602308
ip-10-0-139-220.us-west-1.compute.internal   Ready    worker   25h   v1.13.4+1ad602308
ip-10-0-143-227.us-west-1.compute.internal   Ready    master   25h   v1.13.4+1ad602308
ip-10-0-156-146.us-west-1.compute.internal   Ready    worker   25h   v1.13.4+1ad602308
ip-10-0-158-107.us-west-1.compute.internal   Ready    master   25h   v1.13.4+1ad602308

$ oc get clusterversion
NAME      VERSION                           AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.0.0-0.alpha-2019-04-09-164546   True        False         25h     Cluster version is 4.0.0-0.alpha-2019-04-09-164546
```
re: comment 30 — did you install an ART-produced OCP build, or a CI OKD build?
OKD CI build
Verified this bug on 4.1.0-0.nightly-2019-04-18-170154
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2019:0758