Bug 1788240 - e2e-aws-upi-4.3 job consistently failing
Summary: e2e-aws-upi-4.3 job consistently failing
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Installer
Version: 4.3.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: 4.4.0
Assignee: Abhinav Dahiya
QA Contact: Johnny Liu
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-01-06 19:27 UTC by Ben Parees
Modified: 2020-03-16 16:13 UTC (History)
11 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-01-30 20:00:32 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
system-log-from-failed-master (63.02 KB, text/plain)
2020-01-06 23:19 UTC, Sam Batschelet


Links
System ID Private Priority Status Summary Last Updated
Github openshift release pull 6586 0 None closed templates/aws-upi: Bump RHCOS to latest release-4.3 2020-09-23 11:20:09 UTC
Github openshift release pull 6835 0 None closed Bug 1782063: aws/upi: Figure out AMI from rhcos.json 2020-09-23 11:20:13 UTC

Description Ben Parees 2020-01-06 19:27:49 UTC
Description of problem:
Job consistently fails to get past install/setup:

https://testgrid.k8s.io/redhat-openshift-ocp-release-4.3-informing#release-openshift-ocp-installer-e2e-aws-upi-4.3


bootkube logs seem to imply etcd container issues, so starting there:

https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-upi-4.3/673

Jan 06 16:43:21 ip-10-0-12-116 podman[4914]: 2020-01-06 16:43:21.832024747 +0000 UTC m=+5.510706773 container died f6288c17846a3df0ce0b902c2f63f7713c61c092917398ae028b504ac8fd476c (image=quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:37f8801d5daf6fa247dc680842baf4acd422209ba9b2fadbe308be9883b94f1b, name=etcdctl)
Jan 06 16:43:21 ip-10-0-12-116 podman[4914]: 2020-01-06 16:43:21.897275326 +0000 UTC m=+5.575957342 container remove f6288c17846a3df0ce0b902c2f63f7713c61c092917398ae028b504ac8fd476c (image=quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:37f8801d5daf6fa247dc680842baf4acd422209ba9b2fadbe308be9883b94f1b, name=etcdctl)
Jan 06 16:43:21 ip-10-0-12-116 bootkube.sh[2596]: etcdctl failed. Retrying in 5 seconds...
Jan 06 16:43:27 ip-10-0-12-116 podman[4976]: 2020-01-06 16:43:27.025709101 +0000 UTC m=+0.096469920 container create 5fa060dceb6b64a7f55ebda6cefd71e4d31a0cb9c17822c20465c8028c3bd81f (image=quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:37f8801d5daf6fa247dc680842baf4acd422209ba9b2fadbe308be9883b94f1b, name=etcdctl)
Jan 06 16:43:27 ip-10-0-12-116 podman[4976]: 2020-01-06 16:43:27.378398937 +0000 UTC m=+0.449159867 container init 5fa060dceb6b64a7f55ebda6cefd71e4d31a0cb9c17822c20465c8028c3bd81f (image=quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:37f8801d5daf6fa247dc680842baf4acd422209ba9b2fadbe308be9883b94f1b, name=etcdctl)
Jan 06 16:43:27 ip-10-0-12-116 podman[4976]: 2020-01-06 16:43:27.397308889 +0000 UTC m=+0.468069805 container start 5fa060dceb6b64a7f55ebda6cefd71e4d31a0cb9c17822c20465c8028c3bd81f (image=quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:37f8801d5daf6fa247dc680842baf4acd422209ba9b2fadbe308be9883b94f1b, name=etcdctl)
Jan 06 16:43:27 ip-10-0-12-116 podman[4976]: 2020-01-06 16:43:27.397451396 +0000 UTC m=+0.468212391 container attach 5fa060dceb6b64a7f55ebda6cefd71e4d31a0cb9c17822c20465c8028c3bd81f (image=quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:37f8801d5daf6fa247dc680842baf4acd422209ba9b2fadbe308be9883b94f1b, name=etcdctl)
Jan 06 16:43:32 ip-10-0-12-116 bootkube.sh[2596]: {"level":"warn","ts":"2020-01-06T16:43:32.408Z","caller":"clientv3/retry_interceptor.go:61","msg":"retrying of unary invoker failed","target":"endpoint://client-8f32bb10-505e-4a13-ad4e-2cddb61b4752/etcd-2.ci-op-xb9xh1j8-193b7.origin-ci-int-aws.dev.rhcloud.com:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest connection error: connection error: desc = \"transport: Error while dialing dial tcp 10.0.94.88:2379: connect: connection refused\""}

Comment 1 Sam Batschelet 2020-01-06 21:54:06 UTC
I made a PR against release-4.3 [1] and kicked off the aws-upi tests.

I was able to get on the bootstrap node during the test and snooped around.


machine-config-server was reporting:
```
I0106 21:18:27.668381       1 api.go:97] Pool master requested by 10.0.52.33:37836
I0106 21:18:27.668416       1 bootstrap_server.go:62] reading file "/etc/mcs/bootstrap/machine-pools/master.yaml"
I0106 21:18:27.668990       1 bootstrap_server.go:82] reading file "/etc/mcs/bootstrap/machine-configs/rendered-master-0a2207c7f7dbb237c412503c01ed8b25.yaml"
```

followed by many lines like:
```
2020/01/06 21:46:20 http: TLS handshake error from 10.0.52.33:12815: EOF
2020/01/06 21:46:21 http: TLS handshake error from 10.0.52.33:31445: EOF
2020/01/06 21:46:21 http: TLS handshake error from 10.0.52.33:2509: EOF
2020/01/06 21:46:22 http: TLS handshake error from 10.0.52.33:7812: EOF
2020/01/06 21:46:22 http: TLS handshake error from 10.0.52.33:51679: EOF
2020/01/06 21:46:23 http: TLS handshake error from 10.0.52.33:11177: EOF
2020/01/06 21:46:23 http: TLS handshake error from 10.0.52.33:37076: EOF
```

I have seen this before on Azure, where the master nodes' BIOS clock was not being synced with NTP. Because of the time drift, the master nodes failed to fetch their Ignition config: the serving certs were not yet valid from the nodes' point of view.

The reason we see the etcdctl errors is that the master nodes are not booting, so the etcd static pods never exist to connect to. I will talk with folks about ownership, but this is not on etcd.
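
A minimal sketch of checking the time-drift hypothesis from the bootstrap node, assuming api-int resolves there and the machine-config-server is on its standard port 22623 (the cluster domain below is a placeholder):

```
# Node clock, UTC.
$ date -u
# Validity window of the cert the machine-config-server is actually serving.
$ echo | openssl s_client -connect api-int.<cluster-domain>:22623 2>/dev/null \
    | openssl x509 -noout -startdate -enddate
# A master whose clock is earlier than notBefore treats the cert as not yet valid,
# aborts the handshake, and the MCS logs TLS handshake EOF errors like the ones above.
```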

Comment 2 Sam Batschelet 2020-01-06 22:05:40 UTC
The cert itself showed a Not Before date of Jan  6 21:01:36 2020 GMT. I was unable to get the console of the failing master node to verify its time.

$ openssl x509 -in test.crt -text -noout
Certificate:
    Data:
        Version: 3 (0x2)
        Serial Number: 4156550144402622604 (0x39af08558037888c)
        Signature Algorithm: sha256WithRSAEncryption
        Issuer: OU = openshift, CN = root-ca
        Validity
            Not Before: Jan  6 21:01:36 2020 GMT
            Not After : Jan  3 21:01:38 2030 GMT
        Subject: CN = api-int.ci-op-b1g6xzk0-eadb4.origin-ci-int-aws.dev.rhcloud.com
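
One way to have gotten that console without node access would have been the EC2 console-output API; a sketch, with a placeholder instance ID:

```
# Fetch the serial console output AWS captured for the failed master
# (placeholder instance ID; needs credentials for the CI AWS account).
$ aws ec2 get-console-output --instance-id i-0123456789abcdef0 --output text
```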

Comment 3 W. Trevor King 2020-01-06 23:07:43 UTC
[1] should be part of Go 1.14, due Feb 1 [2], which will help debug time divergence in Ignition logs, etc.

[1]: https://github.com/golang/go/pull/34646
[2]: https://github.com/golang/go/milestone/95

Comment 4 Sam Batschelet 2020-01-06 23:19:58 UTC
Created attachment 1650218 [details]
system-log-from-failed-master

I think this might be closer to the root cause; the log below was taken from the failed compute node.

```
[   93.118191] systemd[1]: Dependency failed for CoreOS: Mount (subsequent) /sysroot.
[   93.125087] systemd[1]: ignition-ostree-mount-subsequent-sysroot.service: Job ignition-ostree-mount-subsequent-sysroot.service/start failed with result 'dependency'.
[   93.137402] systemd[1]: dev-disk-by\x2dlabel-root.device: Job dev-disk-by\x2dlabel-root.device/start failed with result 'timeout'.
[   93.147018] systemd[1]: Starting OSTree Prepare OS/...
[   93.152507] ostree-prepare-root[1025]: ostree-prepare-root: Couldn't find specified OSTree root '/sysroot//ostree/boot.0/rhcos/973e5ed2d11572ad1c70712cdea80b5b73247682143deb51c3971adbac3fcedd/0': No such file or directory
[   93.168624] systemd[1]: ostree-prepare-root.service: Main process exited, code=exited, status=1/FAILURE
[   93.177113] systemd[1]: ostree-prepare-root.service: Failed with result 'exit-code'.
[   93.184188] systemd[1]: Failed to start OSTree Prepare OS/.
[   93.189377] systemd[1]: ostree-prepare-root.service: Triggering OnFailure= dependencies.
[   93.197090] systemd[1]: Stopped target Basic System.
[   93.201688] systemd[1]: Stopped target System Initialization.
```

Comment 6 Colin Walters 2020-01-07 15:05:16 UTC
Last I looked, the AWS UPI job had a hand-rolled mapping of AMIs. It's likely there's some issue with pivoting from the UPI job's AMI to current 4.3.

The real failure here is the timeout waiting for the root partition; it looks like the system has a root= kernel argument, but doesn't have a partition with that name?
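
A minimal sketch of what one could check from the initramfs emergency shell (or the serial console) to confirm that; the `root` label is an assumption taken from the `dev-disk-by\x2dlabel-root.device` unit in the attached log:

```
# Which root= argument did the node boot with?
$ cat /proc/cmdline
# Is there any block device carrying that label?
$ lsblk -o NAME,LABEL,FSTYPE
$ ls -l /dev/disk/by-label/
# If nothing shows LABEL=root, the dev-disk-by\x2dlabel-root.device job times out
# and ostree-prepare-root fails, exactly as in the attached system log.
```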

I think I know - if those UPI job AMIs are old, then indeed things would break since https://gitlab.cee.redhat.com/coreos/redhat-coreos/merge_requests/773

IOW, the UPI jobs *need* to use the installer as the source of truth - which, yes, already came up in

https://github.com/openshift/release/pull/6248

Comment 7 Colin Walters 2020-01-07 15:12:38 UTC
OK right - until we conditionalize the AMIs, or do the *right thing* and land https://github.com/openshift/installer/pull/2092, this is known to be broken.

Reassigning to installer.  The RHCOS team *will not* support users (including CI jobs) using arbitrary pinned AMIs, particularly *across release majors*.

Comment 8 Colin Walters 2020-01-07 15:28:17 UTC
https://github.com/openshift/release/pull/6586

Comment 9 Stephen Cuppett 2020-01-07 16:39:20 UTC
master is 4.4, so setting the target release to that. If we need to pick this back to 4.3, we can create a clone, etc. Does this job need to be reworked to pull the current AMI from the installer? Is a JIRA card going to get created for that?

Comment 10 W. Trevor King 2020-01-07 17:28:15 UTC
> Does this job need to be reworked to pull the current AMI from the installer?

There is no installer API exposing its suggested AMI.  More on that in bug 1782063 and [1].

[1]: https://github.com/openshift/release/pull/6248
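
A minimal sketch of pulling the AMI from the installer's pinned RHCOS metadata, assuming the release-4.3 layout of data/data/rhcos.json with an `amis` map keyed by region (the path and keys here are assumptions, not necessarily what the eventual fix does verbatim):

```
# Read the AMI the 4.3 installer branch pins for a given region.
$ curl -s https://raw.githubusercontent.com/openshift/installer/release-4.3/data/data/rhcos.json \
    | jq -r '.amis["us-east-1"].hvm'
```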

Comment 11 Scott Dodson 2020-01-30 20:00:32 UTC
https://github.com/openshift/release/pull/6835 picks the right AMI

There are still several e2e tests that appear to be persistent failures, but installation has succeeded in the majority of jobs since the original PR bumping the image merged, and the follow-up will ensure the image stays correct for the foreseeable future.

Expect build cop to follow up on specific test failures.

