Description of problem:

Job consistently fails to get past install/setup:
https://testgrid.k8s.io/redhat-openshift-ocp-release-4.3-informing#release-openshift-ocp-installer-e2e-aws-upi-4.3

bootkube logs seem to imply etcd container issues, so starting there:
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-upi-4.3/673

```
Jan 06 16:43:21 ip-10-0-12-116 podman[4914]: 2020-01-06 16:43:21.832024747 +0000 UTC m=+5.510706773 container died f6288c17846a3df0ce0b902c2f63f7713c61c092917398ae028b504ac8fd476c (image=quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:37f8801d5daf6fa247dc680842baf4acd422209ba9b2fadbe308be9883b94f1b, name=etcdctl)
Jan 06 16:43:21 ip-10-0-12-116 podman[4914]: 2020-01-06 16:43:21.897275326 +0000 UTC m=+5.575957342 container remove f6288c17846a3df0ce0b902c2f63f7713c61c092917398ae028b504ac8fd476c (image=quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:37f8801d5daf6fa247dc680842baf4acd422209ba9b2fadbe308be9883b94f1b, name=etcdctl)
Jan 06 16:43:21 ip-10-0-12-116 bootkube.sh[2596]: etcdctl failed. Retrying in 5 seconds...
Jan 06 16:43:27 ip-10-0-12-116 podman[4976]: 2020-01-06 16:43:27.025709101 +0000 UTC m=+0.096469920 container create 5fa060dceb6b64a7f55ebda6cefd71e4d31a0cb9c17822c20465c8028c3bd81f (image=quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:37f8801d5daf6fa247dc680842baf4acd422209ba9b2fadbe308be9883b94f1b, name=etcdctl)
Jan 06 16:43:27 ip-10-0-12-116 podman[4976]: 2020-01-06 16:43:27.378398937 +0000 UTC m=+0.449159867 container init 5fa060dceb6b64a7f55ebda6cefd71e4d31a0cb9c17822c20465c8028c3bd81f (image=quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:37f8801d5daf6fa247dc680842baf4acd422209ba9b2fadbe308be9883b94f1b, name=etcdctl)
Jan 06 16:43:27 ip-10-0-12-116 podman[4976]: 2020-01-06 16:43:27.397308889 +0000 UTC m=+0.468069805 container start 5fa060dceb6b64a7f55ebda6cefd71e4d31a0cb9c17822c20465c8028c3bd81f (image=quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:37f8801d5daf6fa247dc680842baf4acd422209ba9b2fadbe308be9883b94f1b, name=etcdctl)
Jan 06 16:43:27 ip-10-0-12-116 podman[4976]: 2020-01-06 16:43:27.397451396 +0000 UTC m=+0.468212391 container attach 5fa060dceb6b64a7f55ebda6cefd71e4d31a0cb9c17822c20465c8028c3bd81f (image=quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:37f8801d5daf6fa247dc680842baf4acd422209ba9b2fadbe308be9883b94f1b, name=etcdctl)
Jan 06 16:43:32 ip-10-0-12-116 bootkube.sh[2596]: {"level":"warn","ts":"2020-01-06T16:43:32.408Z","caller":"clientv3/retry_interceptor.go:61","msg":"retrying of unary invoker failed","target":"endpoint://client-8f32bb10-505e-4a13-ad4e-2cddb61b4752/etcd-2.ci-op-xb9xh1j8-193b7.origin-ci-int-aws.dev.rhcloud.com:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest connection error: connection error: desc = \"transport: Error while dialing dial tcp 10.0.94.88:2379: connect: connection refused\""}
```
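The failing member's IP is right there in the error. A quick, hedged way to confirm from the bootstrap node that nothing is listening on that member at all (etcd's /health endpoint normally wants client certs, but a refused connection is visible either way) would be something like:

```
# Probe the etcd member bootkube keeps retrying against (IP taken from the
# error above). "connection refused" rather than a TLS/auth error means no
# etcd static pod is listening on 2379 at all.
curl -kv https://10.0.94.88:2379/health
```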
I made a PR against release-4.3 [1] and kicked off the aws-upi tests. I was able to get onto the bootstrap node during the test and poke around. machine-config-server was reporting:

```
I0106 21:18:27.668381 1 api.go:97] Pool master requested by 10.0.52.33:37836
I0106 21:18:27.668416 1 bootstrap_server.go:62] reading file "/etc/mcs/bootstrap/machine-pools/master.yaml"
I0106 21:18:27.668990 1 bootstrap_server.go:82] reading file "/etc/mcs/bootstrap/machine-configs/rendered-master-0a2207c7f7dbb237c412503c01ed8b25.yaml"
```

followed by many:

```
2020/01/06 21:46:20 http: TLS handshake error from 10.0.52.33:12815: EOF
2020/01/06 21:46:21 http: TLS handshake error from 10.0.52.33:31445: EOF
2020/01/06 21:46:21 http: TLS handshake error from 10.0.52.33:2509: EOF
2020/01/06 21:46:22 http: TLS handshake error from 10.0.52.33:7812: EOF
2020/01/06 21:46:22 http: TLS handshake error from 10.0.52.33:51679: EOF
2020/01/06 21:46:23 http: TLS handshake error from 10.0.52.33:11177: EOF
2020/01/06 21:46:23 http: TLS handshake error from 10.0.52.33:37076: EOF
```

I have seen this before on Azure, where the master compute's BIOS clock was not being synced via NTP. Because of the clock drift, the master nodes failed to fetch their Ignition config: from the node's point of view the serving certs were not yet valid. The etcdctl errors appear because master compute is not booting, so the etcd static pods it would host never exist to connect to. I will talk with folks about ownership, but this is not on etcd.
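If someone can get a shell on an affected master, a minimal sketch for checking the clock-drift theory (assuming standard RHCOS tooling, i.e. systemd and chrony, is present):

```
# Is the system clock actually synchronized, and how far off is it?
timedatectl                 # check the "System clock synchronized" line
chronyc tracking            # offset and skew relative to the NTP sources
date -u                     # compare directly against the cert's notBefore below
```

A clock sitting before the cert's notBefore would make every Ignition fetch fail TLS verification, which matches the machine-config-server handshake errors above.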
The cert itself was showing a notBefore date of Jan 6 21:01:36 2020 GMT. I was unable to get the console of the failing master compute to verify its time.

```
$ openssl x509 -in test.crt -text -noout
Certificate:
    Data:
        Version: 3 (0x2)
        Serial Number: 4156550144402622604 (0x39af08558037888c)
        Signature Algorithm: sha256WithRSAEncryption
        Issuer: OU = openshift, CN = root-ca
        Validity
            Not Before: Jan 6 21:01:36 2020 GMT
            Not After : Jan 3 21:01:38 2030 GMT
        Subject: CN = api-int.ci-op-b1g6xzk0-eadb4.origin-ci-int-aws.dev.rhcloud.com
```
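The same check can be run against the live endpoint rather than a saved test.crt. This is only a sketch: the hostname is the CN from the cert above, and the port assumes the machine-config-server's Ignition endpoint (22623) rather than the Kube API on 6443.

```
# Grab the serving cert straight off the wire, print its validity window,
# then compare against the local clock.
echo | openssl s_client -connect api-int.ci-op-b1g6xzk0-eadb4.origin-ci-int-aws.dev.rhcloud.com:22623 2>/dev/null \
  | openssl x509 -noout -startdate -enddate
date -u
```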
[1] should be part of Go 1.14, due Feb 1 [2], which will help debug time divergence in Ignition logs, etc. [1]: https://github.com/golang/go/pull/34646 [2]: https://github.com/golang/go/milestone/95
Created attachment 1650218 [details]
system-log-from-failed-master

I think this might be closer to the root cause, taken from the failed compute node:

```
[   93.118191] systemd[1]: Dependency failed for CoreOS: Mount (subsequent) /sysroot.
[   93.125087] systemd[1]: ignition-ostree-mount-subsequent-sysroot.service: Job ignition-ostree-mount-subsequent-sysroot.service/start failed with result 'dependency'.
[   93.137402] systemd[1]: dev-disk-by\x2dlabel-root.device: Job dev-disk-by\x2dlabel-root.device/start failed with result 'timeout'.
[   93.147018] systemd[1]: Starting OSTree Prepare OS/...
[   93.152507] ostree-prepare-root[1025]: ostree-prepare-root: Couldn't find specified OSTree root '/sysroot//ostree/boot.0/rhcos/973e5ed2d11572ad1c70712cdea80b5b73247682143deb51c3971adbac3fcedd/0': No such file or directory
[   93.168624] systemd[1]: ostree-prepare-root.service: Main process exited, code=exited, status=1/FAILURE
[   93.177113] systemd[1]: ostree-prepare-root.service: Failed with result 'exit-code'.
[   93.184188] systemd[1]: Failed to start OSTree Prepare OS/.
[   93.189377] systemd[1]: ostree-prepare-root.service: Triggering OnFailure= dependencies.
[   93.197090] systemd[1]: Stopped target Basic System.
[   93.201688] systemd[1]: Stopped target System Initialization.
```
Last I looked, the AWS UPI job had a hand-rolled mapping of AMIs, so it's likely there's some sort of issue with pivoting from the UPI job's AMI to current 4.3.

The real failure here is timing out waiting for the root partition: it looks like the system has a root= kernel argument, but no partition with that label exists.

I think I know what's going on - if those UPI job AMIs are old, then indeed things would break since https://gitlab.cee.redhat.com/coreos/redhat-coreos/merge_requests/773

IOW, the UPI jobs *need* to use the installer as the source of truth - which, yes, already came up in https://github.com/openshift/release/pull/6248
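If anyone does get console access to a stuck master, a hedged way to confirm the root-device theory from the initramfs emergency shell (assuming lsblk/blkid are available there):

```
# What device does the kernel command line say the root filesystem is on?
tr ' ' '\n' < /proc/cmdline | grep '^root='

# Does any block device actually carry that label?
lsblk -o NAME,LABEL,FSTYPE,SIZE
blkid
```

If root= points at /dev/disk/by-label/root (matching the dev-disk-by\x2dlabel-root.device timeout in the attached log) and no device shows that label, that's the mismatch.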
OK, right - until we conditionalize the AMIs, or do the *right thing* and land https://github.com/openshift/installer/pull/2092, this is known broken. Reassigning to the installer. The RHCOS team *will not* support users (including CI jobs) using arbitrary pinned AMIs, particularly *across major releases*.
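For anyone triaging similar runs, a quick way to see what a node actually booted versus what the target release expects (standard RHCOS commands, nothing job-specific):

```
# On any node that does come up, report the booted RHCOS/OSTree deployment.
rpm-ostree status
cat /etc/os-release        # VERSION / OSTREE_VERSION fields
```

If that build predates the redhat-coreos change linked above, the pinned AMI is too old to boot current 4.3.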
https://github.com/openshift/release/pull/6586
Master is 4.4, so setting the target release to that. If we need to pick this back, can we create a clone, etc.? Does this job need to be reworked to pull the current AMI from the installer? Is a JIRA card going to be created for that?
> Does this job need to be reworked to pull the current AMI from the installer?

There is no installer API exposing its suggested AMI. More on that in bug 1782063 and [1].

[1]: https://github.com/openshift/release/pull/6248
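As a stopgap until there is a proper interface, the installer branch does pin its RHCOS build (including per-region AMIs) in data/data/rhcos.json. A sketch of scraping that, assuming the "amis"-keyed-by-region layout holds on the branch in question:

```
# Pull the AMI the installer branch pins for a given region. The JSON layout
# ("amis" -> region -> "hvm") is an assumption about the 4.3-era rhcos.json.
BRANCH=release-4.3
REGION=us-east-1
curl -s "https://raw.githubusercontent.com/openshift/installer/${BRANCH}/data/data/rhcos.json" \
  | jq -r --arg region "$REGION" '.amis[$region].hvm'
```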
https://github.com/openshift/release/pull/6835 picks the right AMI.

There are still several e2e tests that appear to be persistent failures, but installation has succeeded in the majority of the jobs since the original PR bumping the image merged, and the follow-up will ensure the image is correct for the foreseeable future. Expect the build cop to follow up on specific test failures.