8% of the past 12h ^release-.*4.3$ failures included messages like [1]:

error retrieving resource lock.*etcdserver: request timed out

Seems spread around the different job types [2,3,4]. Not sure if we need to bump our disks, or if FIPS, compact, and Azure all have different problems, or what.

[1]: https://search.svc.ci.openshift.org/chart?name=^release-.*-4.3$&search=error%20retrieving%20resource%20lock.*etcdserver:%20request%20timed%20out
[2]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-fips-4.3/553
[3]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-azure-upgrade-4.3/191
[4]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-compact-4.3/40
Created attachment 1640730 [details] etcd_disk_wal_fsync_duration_seconds_bucket

p99 on etcd_disk_wal_fsync_duration_seconds_bucket via https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-compact-4.3/40/artifacts/e2e-aws/metrics/prometheus.tar
The above shows etcd consistently over the 10ms p99 baseline for the e2e-aws-compact-4.3 test, which does point to resource issues. I am not yet sure exactly what the test does, but it is concerning.
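For context on that 10ms baseline: the p99 comes from the cumulative etcd_disk_wal_fsync_duration_seconds_bucket histogram, and the quantile estimate is linear interpolation within the first bucket whose cumulative count reaches the target rank, which is roughly what Prometheus's histogram_quantile() does. A minimal sketch (the bucket values below are made up, not from this job):

```python
# Estimate a quantile from Prometheus-style cumulative histogram buckets,
# interpolating linearly inside the bucket that contains the target rank.
def histogram_quantile(q, buckets):
    """buckets: sorted list of (upper_bound_seconds, cumulative_count)."""
    total = buckets[-1][1]
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            # Linear interpolation between the bucket's lower and upper bound.
            frac = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * frac
        prev_bound, prev_count = bound, count
    return buckets[-1][0]

# Hypothetical WAL-fsync buckets: upper bound (s) -> cumulative observations.
buckets = [(0.001, 100), (0.002, 400), (0.004, 700), (0.008, 900),
           (0.016, 980), (0.032, 1000)]
p99 = histogram_quantile(0.99, buckets)  # lands in the 16ms-32ms bucket
```

With these made-up buckets the p99 comes out at 24ms, well over the 10ms baseline, which is the shape of the problem the attached graph shows.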
Created attachment 1640732 [details] etcd_server_heartbeat_send_failures_total_aws-compact

This shows a slow member with many more heartbeat send failures. This member was also located in a different zone (us-east-1b vs. us-east-1a). I understand that this is the default, but it is worth considering when trying to understand the issue.

"beta.kubernetes.io/instance-type": "m4.xlarge",
"failure-domain.beta.kubernetes.io/region": "us-east-1",
"failure-domain.beta.kubernetes.io/zone": "us-east-1b",
"kubernetes.io/hostname": "ip-10-0-148-94",
We believe a significant portion of this increased latency occurred at or around the time disk encryption landed in master (some believe it was https://github.com/openshift/installer/pull/2666). Compact had a few examples of etcd timeout before that, but then transitioned to always red after that, so that may be related.
I watched a random AWS EBS volume through a CI run, and it stayed at 100% Burst Balance the whole time, so this is probably not a case of running out of burst credits when it happens on AWS [1,2]. On the other hand, a sample size of one is not great, and I don't know whether that job was running 4.3/4.4 code (although odds are good that it was).

[1]: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ebs-volume-types.html#IOcredit
[2]: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ebs-volume-types.html#monitoring_burstbucket
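For reference on the burst-credit mechanics in those docs: gp2 credits accrue at the volume's baseline rate (3 IOPS per GiB, minimum 100) and drain one per I/O when the workload bursts above it, so a full bucket at 100% Burst Balance drains in a predictable time. A rough model; the bucket size and baseline formula are from the AWS docs, while the volume size and workload below are hypothetical:

```python
# Rough model of the gp2 burst bucket: credits accrue at the baseline rate
# and are spent one per I/O, so sustained load above baseline drains the
# bucket at (workload - baseline) credits per second.
BUCKET = 5_400_000  # initial/maximum I/O credits per gp2 volume (AWS docs)

def burst_seconds(volume_gib, workload_iops):
    """Seconds a full bucket lasts at a steady workload."""
    baseline = max(100, 3 * volume_gib)  # gp2 baseline: 3 IOPS/GiB, min 100
    if workload_iops <= baseline:
        return float("inf")  # earning >= spending: the bucket never drains
    return BUCKET / (workload_iops - baseline)

# Hypothetical 120 GiB root volume (baseline 360 IOPS) pinned at 3000 IOPS:
secs = burst_seconds(120, 3000)  # ~34 minutes of burst
```

The point being: a CI run that never dips below 100% Burst Balance is either staying at or below baseline IOPS, or bursting only briefly, which is why the observation above argues against credit exhaustion.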
(In reply to Clayton Coleman from comment #5)
> We believe a significant portion of this increased latency occurred at or
> around the time disk encryption landed in master (some believe it was
> https://github.com/openshift/installer/pull/2666). Compact had a few
> examples of etcd timeout before that, but then transitioned to always red
> after that, so that may be related.

After testing, we believe the disk encryption is a red herring. To test this:
- downloaded the 2019-12-05 disk image
- copied the disk and converted the encrypted root filesystem to unencrypted, which created a bare XFS filesystem
- mounted the null-cipher and bare XFS root disks
- ran FIO against both the null-cipher LUKS and bare XFS setups

The tests showed very similar results between both runs [1]. Further, the null-cipher LUKS disk setup is only present on new 4.3 clusters; clusters upgraded from 4.2 to 4.3 have root in a non-LUKS container. Finally, the release payload [2] pre-dates the change that enabled LUKS in [3]. It's true that the LUKS code was committed and merged, but it did not land until 20 November.

[1] https://gist.github.com/darkmuggle/34ff7f19d6e7fe261ed8988526dc8ead
[2] https://github.com/openshift/installer/pull/2666 https://github.com/openshift/installer/commit/68cc5772e5917bf2b8862d0bec5fe1e86c57c9a3#diff-a60e9ad59535dfd89b1bf063339ada3c
[3] https://github.com/openshift/installer/commit/3c0562153762a16c1255ed32a6835674f8eeaaab#diff-a60e9ad59535dfd89b1bf063339ada3c
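For anyone wanting a quick local sanity check alongside the FIO runs, a crude write+fsync latency probe approximates what etcd's WAL experiences, since the WAL does exactly this kind of small synchronous append. This is a rough sketch, nowhere near as rigorous as fio's sync engines:

```python
# Minimal fsync-latency probe: time a series of small write+fsync cycles
# against a temp file and report the p99, in seconds.
import os
import tempfile
import time

def fsync_p99(iterations=200, size=512):
    latencies = []
    fd, path = tempfile.mkstemp()
    try:
        for _ in range(iterations):
            start = time.monotonic()
            os.write(fd, b"\0" * size)
            os.fsync(fd)  # force the write through to stable storage
            latencies.append(time.monotonic() - start)
    finally:
        os.close(fd)
        os.unlink(path)
    latencies.sort()
    return latencies[int(len(latencies) * 0.99)]

p99_seconds = fsync_p99()
```

Running this on the null-cipher LUKS mount and the bare XFS mount side by side would give a cheap second opinion on the FIO conclusion; anything consistently over ~10ms would be in the territory etcd complains about.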
I'm setting this to 4.3 for now; this is ~10% of CI failures. We need to find the root cause and revert whatever introduced it. We need to bisect to find the source.
Sam found etcd leader elections to be a good signal for this degradation, and example bracketing jobs are [1,2]:

$ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-4.3/338/artifacts/e2e-aws/must-gather/quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-1bf1ff6b6328b67f69d2cc50d217b80d51a28267b12a7dab46a3aceac26c99be/host_service_logs/masters/kubelet_service.log | grep -c 'etcdserver: leader changed'
0
$ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-4.3/349/artifacts/e2e-aws/must-gather/quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-1bf1ff6b6328b67f69d2cc50d217b80d51a28267b12a7dab46a3aceac26c99be/host_service_logs/masters/kubelet_service.log | grep -c 'etcdserver: leader changed'
41

There's still a reasonably large gap between those runs (the intervening jobs all failed), so we still have some work to do to zero in on the cause.

[1]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-4.3/338
[2]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-4.3/349
If we suspect RHCOS, here's the change between those two jobs:

$ for i in 338 349; do COMMIT=$(curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-4.3/$i/artifacts/release-images-latest/release-images-latest | jq -r '.spec.tags[] | select(.name == "machine-os-content").from.name'); echo "$COMMIT $i"; done
quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:e3154258679cab548b1c82d159f98e9863b76ce60932a4f1b25433924cdc55be 338
quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:1a3754d7cc780286c6803c2855ec5b2a59f6280dc403d3f23e2e28b514432cdc 349
$ for HASH in e3154258679cab548b1c82d159f98e9863b76ce60932a4f1b25433924cdc55be 1a3754d7cc780286c6803c2855ec5b2a59f6280dc403d3f23e2e28b514432cdc; do oc image info --output json "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:${HASH}" | jq -r .config.config.Labels.version; done
43.81.201911190320.0
43.81.201911200632.0

Comparing the installer between the two builds:

$ for i in 338 349; do COMMIT=$(curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-4.3/$i/artifacts/release-images-latest/release-images-latest | jq -r '.spec.tags[] | select(.name == "installer").annotations["io.openshift.build.commit.id"]'); echo "$COMMIT $i"; done
b13dcbfdc1eb30de6576cb7f73072856200aa028 338
0ebc529458c4754c07bbf454caa8a8f7515a5d13 349

So 43.81.201911081536.0 -> 43.81.201911192044.0 [1] for the bootimage.
So the machine-os-content changes were: $ ./differ.py --first-endpoint art --first-version 43.81.201911081536.0 --second-endpoint art --second-version 43.81.201911200632.0 { "diff": { "coreos-installer": { "43.81.201911200632.0": "coreos-installer.0.0.10.rhaos4.3.gita284cdb.el8.noarch", "43.81.201911081536.0": "coreos-installer.0.0.9.rhaos4.2.git2fcf441.el8.noarch" }, "cri-o": { "43.81.201911200632.0": "cri-o.0.1.16.0.2.rhaos4.3.git8b6bfb5.el8.x86_64", "43.81.201911081536.0": "cri-o.0.1.16.0.0.6.dev.rhaos4.3.git9ad059b.el8.x86_64" }, "podman-manpages": { "43.81.201911200632.0": "podman-manpages.0.1.6.3.1.el8.noarch", "43.81.201911081536.0": "podman-manpages.0.1.6.2.1.el8.noarch" }, "machine-config-daemon": { "43.81.201911200632.0": "machine-config-daemon.0.4.3.0.201911192301.git.1.21e74c7.el8.x86_64", "43.81.201911081536.0": "machine-config-daemon.0.4.3.0.201911080552.git.1.68ea0bf.el8.x86_64" }, "podman": { "43.81.201911200632.0": "podman.0.1.6.3.1.el8.x86_64", "43.81.201911081536.0": "podman.0.1.6.2.1.el8.x86_64" }, "kernel-modules": { "43.81.201911200632.0": "kernel-modules.0.4.18.0.147.0.3.el8_1.x86_64", "43.81.201911081536.0": "kernel-modules.0.4.18.0.147.el8.x86_64" }, "hardlink": { "43.81.201911200632.0": "Not present", "43.81.201911081536.0": "hardlink.1.1.3.6.el8.x86_64" }, "openshift-clients": { "43.81.201911200632.0": "openshift-clients.0.4.3.0.201911190123.git.1.60c4109.el8.x86_64", "43.81.201911081536.0": "openshift-clients.0.4.3.0.201911080552.git.1.bd7390d.el8.x86_64" }, "dracut-network": { "43.81.201911200632.0": "dracut-network.0.049.27.git20190906.el8_1.1.x86_64", "43.81.201911081536.0": "dracut-network.0.049.57.git20191031.el8.x86_64" }, "kernel-modules-extra": { "43.81.201911200632.0": "kernel-modules-extra.0.4.18.0.147.0.3.el8_1.x86_64", "43.81.201911081536.0": "kernel-modules-extra.0.4.18.0.147.el8.x86_64" }, "kernel": { "43.81.201911200632.0": "kernel.0.4.18.0.147.0.3.el8_1.x86_64", "43.81.201911081536.0": "kernel.0.4.18.0.147.el8.x86_64" }, 
"coreos-installer-dracut": { "43.81.201911200632.0": "coreos-installer-dracut.0.0.10.rhaos4.3.gita284cdb.el8.noarch", "43.81.201911081536.0": "coreos-installer-dracut.0.0.9.rhaos4.2.git2fcf441.el8.noarch" }, "kernel-core": { "43.81.201911200632.0": "kernel-core.0.4.18.0.147.0.3.el8_1.x86_64", "43.81.201911081536.0": "kernel-core.0.4.18.0.147.el8.x86_64" }, "container-selinux": { "43.81.201911200632.0": "container-selinux.2.2.121.0.1.el8.noarch", "43.81.201911081536.0": "container-selinux.2.2.107.2.module+el8.1.0+4081+b29780af.noarch" }, "dracut": { "43.81.201911200632.0": "dracut.0.049.27.git20190906.el8_1.1.x86_64", "43.81.201911081536.0": "dracut.0.049.57.git20191031.el8.x86_64" }, "pigz": { "43.81.201911200632.0": "Not present", "43.81.201911081536.0": "pigz.0.2.4.2.el8.x86_64" }, "runc": { "43.81.201911200632.0": "runc.0.1.0.0.62.rc8.rhaos4.1.git3cbe540.el8.x86_64", "43.81.201911081536.0": "runc.0.1.0.0.62.rc8.el8.x86_64" }, "openshift-hyperkube": { "43.81.201911200632.0": "openshift-hyperkube.0.4.3.0.201911162201.git.0.ba68fd5.el8.x86_64", "43.81.201911081536.0": "openshift-hyperkube.0.4.3.0.201911080552.git.0.16ab709.el8.x86_64" }, "microcode_ctl": { "43.81.201911200632.0": "microcode_ctl.4.20190618.1.20191112.1.el8_1.x86_64", "43.81.201911081536.0": "microcode_ctl.4.20190618.1.20190918.2.el8_1.x86_64" } }, "sources": { "43.81.201911200632.0": "https://releases-rhcos-art.cloud.privileged.psi.redhat.com/storage/releases/rhcos-4.3/43.81.201911200632.0/x86_64/commitmeta.json", "43.81.201911081536.0": "https://releases-rhcos-art.cloud.privileged.psi.redhat.com/storage/releases/rhcos-4.3/43.81.201911081536.0/x86_64/commitmeta.json" } } and the bootimage changes were: $ ./differ.py --first-endpoint art --first-version 43.81.201911081536.0 --second-endpoint art --second-version 43.81.201911192044.0 { "sources": { "43.81.201911081536.0": 
"https://releases-rhcos-art.cloud.privileged.psi.redhat.com/storage/releases/rhcos-4.3/43.81.201911081536.0/x86_64/commitmeta.json", "43.81.201911192044.0": "https://releases-rhcos-art.cloud.privileged.psi.redhat.com/storage/releases/rhcos-4.3/43.81.201911192044.0/x86_64/commitmeta.json" }, "diff": { "dracut": { "43.81.201911081536.0": "dracut.0.049.57.git20191031.el8.x86_64", "43.81.201911192044.0": "dracut.0.049.27.git20190906.el8_1.1.x86_64" }, "kernel-core": { "43.81.201911081536.0": "kernel-core.0.4.18.0.147.el8.x86_64", "43.81.201911192044.0": "kernel-core.0.4.18.0.147.0.3.el8_1.x86_64" }, "cri-o": { "43.81.201911081536.0": "cri-o.0.1.16.0.0.6.dev.rhaos4.3.git9ad059b.el8.x86_64", "43.81.201911192044.0": "cri-o.0.1.16.0.2.rhaos4.3.git8b6bfb5.el8.x86_64" }, "kernel-modules": { "43.81.201911081536.0": "kernel-modules.0.4.18.0.147.el8.x86_64", "43.81.201911192044.0": "kernel-modules.0.4.18.0.147.0.3.el8_1.x86_64" }, "kernel": { "43.81.201911081536.0": "kernel.0.4.18.0.147.el8.x86_64", "43.81.201911192044.0": "kernel.0.4.18.0.147.0.3.el8_1.x86_64" }, "hardlink": { "43.81.201911081536.0": "hardlink.1.1.3.6.el8.x86_64", "43.81.201911192044.0": "Not present" }, "kernel-modules-extra": { "43.81.201911081536.0": "kernel-modules-extra.0.4.18.0.147.el8.x86_64", "43.81.201911192044.0": "kernel-modules-extra.0.4.18.0.147.0.3.el8_1.x86_64" }, "machine-config-daemon": { "43.81.201911081536.0": "machine-config-daemon.0.4.3.0.201911080552.git.1.68ea0bf.el8.x86_64", "43.81.201911192044.0": "machine-config-daemon.0.4.3.0.201911191042.git.1.3af21e9.el8.x86_64" }, "dracut-network": { "43.81.201911081536.0": "dracut-network.0.049.57.git20191031.el8.x86_64", "43.81.201911192044.0": "dracut-network.0.049.27.git20190906.el8_1.1.x86_64" }, "microcode_ctl": { "43.81.201911081536.0": "microcode_ctl.4.20190618.1.20190918.2.el8_1.x86_64", "43.81.201911192044.0": "microcode_ctl.4.20190618.1.20191112.1.el8_1.x86_64" }, "coreos-installer-dracut": { "43.81.201911081536.0": 
"coreos-installer-dracut.0.0.9.rhaos4.2.git2fcf441.el8.noarch", "43.81.201911192044.0": "coreos-installer-dracut.0.0.10.rhaos4.3.gita284cdb.el8.noarch" }, "podman": { "43.81.201911081536.0": "podman.0.1.6.2.1.el8.x86_64", "43.81.201911192044.0": "podman.0.1.6.3.1.el8.x86_64" }, "openshift-clients": { "43.81.201911081536.0": "openshift-clients.0.4.3.0.201911080552.git.1.bd7390d.el8.x86_64", "43.81.201911192044.0": "openshift-clients.0.4.3.0.201911190123.git.1.60c4109.el8.x86_64" }, "podman-manpages": { "43.81.201911081536.0": "podman-manpages.0.1.6.2.1.el8.noarch", "43.81.201911192044.0": "podman-manpages.0.1.6.3.1.el8.noarch" }, "runc": { "43.81.201911081536.0": "runc.0.1.0.0.62.rc8.el8.x86_64", "43.81.201911192044.0": "runc.0.1.0.0.62.rc8.rhaos4.1.git3cbe540.el8.x86_64" }, "pigz": { "43.81.201911081536.0": "pigz.0.2.4.2.el8.x86_64", "43.81.201911192044.0": "Not present" }, "coreos-installer": { "43.81.201911081536.0": "coreos-installer.0.0.9.rhaos4.2.git2fcf441.el8.noarch", "43.81.201911192044.0": "coreos-installer.0.0.10.rhaos4.3.gita284cdb.el8.noarch" }, "openshift-hyperkube": { "43.81.201911081536.0": "openshift-hyperkube.0.4.3.0.201911080552.git.0.16ab709.el8.x86_64", "43.81.201911192044.0": "openshift-hyperkube.0.4.3.0.201911162201.git.0.ba68fd5.el8.x86_64" } } } Along with the package changes, there were internal RHCOS changes (e.g. to systemd unit files and such) that are harder to dump here :p. [1]: https://github.com/openshift/installer/compare/b13dcbfdc1eb30de6576cb7f73072856200aa028...0ebc529458c4754c07bbf454caa8a8f7515a5d13#diff-a60e9ad59535dfd89b1bf063339ada3cL131
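differ.py itself isn't included here; assuming it works from each build's commitmeta.json package list, the shape of its diff output can be sketched by comparing two {package: version} maps. The package data below is illustrative, taken from a pair of entries in the diff above, and the helper name is hypothetical:

```python
# Sketch of the package-diff idea behind differ.py's output: report every
# package whose version differs between two builds, marking packages that
# exist on only one side as "Not present".
def diff_builds(first_version, first_pkgs, second_version, second_pkgs):
    diff = {}
    for name in sorted(set(first_pkgs) | set(second_pkgs)):
        a = first_pkgs.get(name, "Not present")
        b = second_pkgs.get(name, "Not present")
        if a != b:
            diff[name] = {first_version: a, second_version: b}
    return diff

old = {"kernel": "kernel.0.4.18.0.147.0.2.el8_1.x86_64",
       "podman": "podman.0.1.6.2.1.el8.x86_64"}
new = {"kernel": "kernel.0.4.18.0.147.0.3.el8_1.x86_64",
       "podman": "podman.0.1.6.2.1.el8.x86_64"}
diff = diff_builds("43.81.201911131545.0", old, "43.81.201911131833.0", new)
# Only "kernel" appears in the diff; the unchanged podman is dropped.
```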
Wait, machine-os-content was 43.81.201911190320.0 -> 43.81.201911200632.0, not 43.81.201911081536.0 -> 43.81.201911200632.0. I can't seem to find commitmeta.json for 43.81.201911190320.0:

$ for RHCOS in 43.81.201911190320.0 43.81.201911200632.0 43.81.201911081536.0 43.81.201911192044.0; do echo -n "${RHCOS} "; curl -sI "https://releases-rhcos-art.cloud.privileged.psi.redhat.com/storage/releases/rhcos-4.3/${RHCOS}/x86_64/commitmeta.json" | head -n1; done
43.81.201911190320.0 HTTP/1.1 403 Forbidden
43.81.201911200632.0 HTTP/1.1 200 OK
43.81.201911081536.0 HTTP/1.1 200 OK
43.81.201911192044.0 HTTP/1.1 200 OK
I'm also going to punt this over to RHCOS, although at this point it's really "anybody with any ideas, please help!" ;)
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-4.3/338 - Nov 19 12:24 - 13:44 UTC
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-4.3/349 - Nov 20 22:59 - Nov 21 00:13 UTC
https://github.com/openshift/openshift-apiserver/pull/44 merged Nov 20 14:17 UTC
> Wait, machine-os-content was 43.81.201911190320.0 -> 43.81.201911200632.0, not 43.81.201911081536.0 -> 43.81.201911200632.0. I can't seem to find commitmeta.json for 43.81.201911190320.0:

Very likely https://jira.coreos.com/browse/GRPA-946
The kernel bump included a change to disable TSX by default, but re-enabling it [1] did not resolve the issue:

$ curl -s https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_installer/2806/pull-ci-openshift-installer-release-4.3-e2e-aws/99/artifacts/e2e-aws/must-gather/registry-svc-ci-openshift-org-ci-op-71pgjv9y-stable-sha256-64c63eedf863406fbc6c7515026f909a7221472cf70283708fb7010dd5e6139e/host_service_logs/masters/kubelet_service.log | grep -c 'etcdserver: leader changed'
15
$ curl -s https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_installer/2806/pull-ci-openshift-installer-release-4.3-e2e-aws/99/artifacts/e2e-aws/must-gather/registry-svc-ci-openshift-org-ci-op-71pgjv9y-stable-sha256-64c63eedf863406fbc6c7515026f909a7221472cf70283708fb7010dd5e6139e/namespaces/openshift-etcd/pods/etcd-member-ip-10-0-140-4.ec2.internal/etcd-member/etcd-member/logs/current.log | grep -c 'took too long'
339

And cluster:capacity_cpu_sockets_hyperthread_enabled:sum on the gathered metrics [2] gives three each for control-plane and compute pools.

[1]: https://github.com/openshift/installer/pull/2806
[2]: https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_installer/2806/pull-ci-openshift-installer-release-4.3-e2e-aws/99/artifacts/e2e-aws/metrics/prometheus.tar
Tried tsx_async_abort=off too; still lots of 'leader changed' [1]:

$ curl -s https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_installer/2806/pull-ci-openshift-installer-release-4.3-e2e-aws/102/artifacts/e2e-aws/must-gather/registry-svc-ci-openshift-org-ci-op-ntbzylts-stable-sha256-64c63eedf863406fbc6c7515026f909a7221472cf70283708fb7010dd5e6139e/host_service_logs/masters/kubelet_service.log | grep -c 'etcdserver: leader changed'
123

Sam is checking out BFQ scheduling over in [2], since Vadim isn't seeing this on a recent FCOS, and Fedora defaults to BFQ (bug 1738828).

[1]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/openshift_installer/2806/pull-ci-openshift-installer-release-4.3-e2e-aws/102
[2]: https://github.com/openshift/machine-config-operator/pull/1138
Created attachment 1644623 [details] prom-history.sh for max(etcd_server_leader_changes_seen_total)

Better brackets on e2e-aws release promotion jobs via max(etcd_server_leader_changes_seen_total). Shows a hard cut from 1-2 leader changes in the older runs to 5-16 in the newer runs. Data from around the transition:

$ for RESULT in */result.json; do VALUE="$(jq '[.data.result[].values[][1] | tonumber] | max' "${RESULT}")"; END_TIME="$(jq -r '.timestamp | todateiso8601' "${RESULT/result/finished}")"; if test -n "${VALUE}"; then echo "${END_TIME} ${VALUE}"; fi; done 2>/dev/null | sort | grep '2019-11-2[01]'
2019-11-20T00:48:22Z 1
2019-11-20T04:25:50Z 1
2019-11-20T06:56:34Z 1
2019-11-20T08:38:31Z 1
2019-11-20T10:57:17Z 1
2019-11-20T15:49:17Z 1
2019-11-20T18:06:57Z 1
2019-11-20T21:31:27Z 10
2019-11-21T00:13:58Z 5
2019-11-21T04:22:07Z 11
2019-11-21T08:09:19Z 8
2019-11-21T14:01:32Z 8
2019-11-21T18:29:48Z 10
2019-11-21T23:35:53Z 14

And job numbers for that transition:

$ for x in 3[34][0-9]; do FIN=$(jq -r '.timestamp | todateiso8601' $x/finished.json); echo "$FIN $x"; done | grep '2019-11-20T18:06:57Z\|2019-11-20T21:31:27Z'
2019-11-20T18:06:57Z 347
2019-11-20T21:31:27Z 348

Repeating the comment 11 analysis for the tighter bracket:

$ for i in 347 348; do COMMIT=$(curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-4.3/$i/artifacts/release-images-latest/release-images-latest | jq -r '.spec.tags[] | select(.name == "machine-os-content").from.name'); echo "$COMMIT $i"; done
quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:1a3754d7cc780286c6803c2855ec5b2a59f6280dc403d3f23e2e28b514432cdc 347
quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:1a3754d7cc780286c6803c2855ec5b2a59f6280dc403d3f23e2e28b514432cdc 348

So _not_ a machine-os-content change.
Comparing commits feeding the release image:

$ JQ='[.spec.tags[] | .name + " " + .annotations["io.openshift.build.source-location"] + "/commit/" + .annotations["io.openshift.build.commit.id"]] | sort[]'
$ diff -U0 <(curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-4.3/347/artifacts/release-images-latest/release-images-latest | jq -r "${JQ}") <(curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-4.3/348/artifacts/release-images-latest/release-images-latest | jq -r "${JQ}")
--- /dev/fd/63 2019-12-12 21:43:12.074842301 -0800
+++ /dev/fd/62 2019-12-12 21:43:12.074842301 -0800
@@ -3 +3 @@
-baremetal-installer https://github.com/openshift/installer/commit/e9b4862c83da4ea4bdb113e19ef2c82f49fb4499
+baremetal-installer https://github.com/openshift/installer/commit/5db8dbcddee17c8cd1451a5d5a186d0cf17e2d6a
@@ -36 +36 @@
-console https://github.com/openshift/console/commit/fbcfd5d8b1aa21ad85af955eb3e0d2a7af70cce2
+console https://github.com/openshift/console/commit/fcc9619ff91bae830ebe81ae630540c4922a251b
@@ -47 +47 @@
-hyperkube https://github.com/openshift/ose/commit/ba68fd59d82c52fc219757a2d8aff35bdb0adc87
+hyperkube https://github.com/eparis/origin/commit/88a02d205da4eba2e72dbdd1af5a63901ddcc766
@@ -49,2 +49,2 @@
-installer https://github.com/openshift/installer/commit/e9b4862c83da4ea4bdb113e19ef2c82f49fb4499
-installer-artifacts https://github.com/openshift/installer/commit/e9b4862c83da4ea4bdb113e19ef2c82f49fb4499
+installer https://github.com/openshift/installer/commit/5db8dbcddee17c8cd1451a5d5a186d0cf17e2d6a
+installer-artifacts https://github.com/openshift/installer/commit/5db8dbcddee17c8cd1451a5d5a186d0cf17e2d6a
@@ -71 +71 @@
-machine-config-operator https://github.com/openshift/machine-config-operator/commit/2e560b2f4847939ba2db3424cc4b71cbbabccc6d
+machine-config-operator https://github.com/openshift/machine-config-operator/commit/e9033544f42191376529efdb7a67deb5427efd1d
@@ -79 +79 @@
-openshift-apiserver https://github.com/openshift/openshift-apiserver/commit/29b6805e14572a1da2fe0cd0d13d7038b127f600
+openshift-apiserver https://github.com/openshift/openshift-apiserver/commit/2e7b3f676eaafde27984f2408fb737d3a1ce5e59

So [1] (just [2]), [3] (just [4]), [5] (just [6]), very outside chance at the console changes [7], and then whatever happened in origin around Eric's [8] (ba68fd59d82c has not made it into origin). Looks like:

$ git log --format='%h%d %s' ba68fd59d..88a02d20 | grep origin/pr/
88a02d205d (origin/pr/24181) bump(golang.org/x/sys) as it is needed for compatability with golang.org/x/crypto
d70b19d4d6 (origin/pr/24166) bump(*): ghodss/yaml and library-go
ce930dd4ee (origin/pr/24167) Fix cmd tests to be more resilient to changes
177cafee87 (origin/pr/24148) Bug 1771335: Verify builds can use pull secret
626640c289 (origin/pr/24159) UPSTREAM: 85223: apiextensions: fix items+type logic in API due to broken go-openapi validation
9cdbaf442f (origin/pr/24150) UPSTREAM: 85308: support SNI with IPs
b34aeceb3c (origin/pr/24138) Stop excluding ClusterMachineApproverDown from firing check

installer#2666 bootimages are probably still the primary suspect. I'll go see if there's a smoking gun in its presubmit jobs.
[1]: https://github.com/openshift/installer/compare/e9b4862c83da4ea4bdb113e19ef2c82f49fb4499...5db8dbcddee17c8cd1451a5d5a186d0cf17e2d6a
[2]: https://github.com/openshift/installer/pull/2666
[3]: https://github.com/openshift/openshift-apiserver/compare/29b6805e14572a1da2fe0cd0d13d7038b127f600...2e7b3f676eaafde27984f2408fb737d3a1ce5e59
[4]: https://github.com/openshift/openshift-apiserver/pull/44
[5]: https://github.com/openshift/machine-config-operator/compare/2e560b2f4847939ba2db3424cc4b71cbbabccc6d...e9033544f42191376529efdb7a67deb5427efd1d
[6]: https://github.com/openshift/machine-config-operator/pull/1257
[7]: https://github.com/openshift/console/compare/fbcfd5d8b1aa21ad85af955eb3e0d2a7af70cce2...fcc9619ff91bae830ebe81ae630540c4922a251b
[8]: https://github.com/openshift/origin/pull/24181
Checking installer#2666, it only ran four e2e-aws presubmits [1]. Here are the installer commits (there may have also been drift in non-installer release components):

$ for i in 8586 8601 8645 8653; do curl -s https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_installer/2666/pull-ci-openshift-installer-master-e2e-aws/$i/build-log.txt | head -n1; done
2019/11/13 19:01:01 Resolved source https://github.com/openshift/installer to master@9f62449a, merging: #2666 12d45011 @stbenjam
2019/11/14 20:29:04 Resolved source https://github.com/openshift/installer to master@36b066c7, merging: #2666 20d993ea @stbenjam
2019/11/20 00:38:15 Resolved source https://github.com/openshift/installer to master@de10297f, merging: #2666 3c056215 @stbenjam
2019/11/20 13:17:51 Resolved source https://github.com/openshift/installer to master@e9b4862c, merging: #2666 3c056215 @stbenjam

Running max(etcd_server_leader_changes_seen_total) on those jobs using the analysis script from comment 20:

$ for RESULT in */result.json; do VALUE="$(jq '[.data.result[].values[][1] | tonumber] | max' "${RESULT}")"; END_TIME="$(jq -r '.timestamp | todateiso8601' "${RESULT/result/finished}")"; if test -n "${VALUE}"; then echo "${END_TIME} ${VALUE}"; fi; done 2>/dev/null | sort
2019-11-13T20:27:03Z 3
2019-11-14T21:52:17Z 7
2019-11-20T02:03:11Z 9
2019-11-20T14:38:36Z 9

Looks pretty bad, and possibly getting worse with the later commits. This confirms the RHCOS assignment as far as I'm concerned. Still not clear on what it was about the RHCOS bootimage bump that would break etcd so badly.
Getting RHCOS versions: $ for COMMIT in 12d45011 20d993ea 3c056215; do curl -s https://raw.githubusercontent.com/openshift/installer/$COMMIT/data/data/rhcos.json | jq -r .buildid; done 43.81.201911131545.0 43.81.201911131833.0 43.81.201911192044.0 Assuming the 3->7 bump was the bulk of the problem: $ ./differ.py --first-endpoint art --first-version 43.81.201911131545.0 --second-endpoint art --second-version 43.81.201911131833.0 { "diff": { "kernel-core": { "43.81.201911131833.0": "kernel-core.0.4.18.0.147.0.3.el8_1.x86_64", "43.81.201911131545.0": "kernel-core.0.4.18.0.147.0.2.el8_1.x86_64" }, "kernel": { "43.81.201911131833.0": "kernel.0.4.18.0.147.0.3.el8_1.x86_64", "43.81.201911131545.0": "kernel.0.4.18.0.147.0.2.el8_1.x86_64" }, "kernel-modules": { "43.81.201911131833.0": "kernel-modules.0.4.18.0.147.0.3.el8_1.x86_64", "43.81.201911131545.0": "kernel-modules.0.4.18.0.147.0.2.el8_1.x86_64" }, "kernel-modules-extra": { "43.81.201911131833.0": "kernel-modules-extra.0.4.18.0.147.0.3.el8_1.x86_64", "43.81.201911131545.0": "kernel-modules-extra.0.4.18.0.147.0.2.el8_1.x86_64" }, "openshift-hyperkube": { "43.81.201911131833.0": "openshift-hyperkube.0.4.3.0.201911131246.git.0.015ef3e.el8.x86_64", "43.81.201911131545.0": "openshift-hyperkube.0.4.3.0.201911130601.git.0.65d6168.el8.x86_64" } }, "sources": { "43.81.201911131833.0": "https://releases-rhcos-art.cloud.privileged.psi.redhat.com/storage/releases/rhcos-4.3/43.81.201911131833.0/x86_64/commitmeta.json", "43.81.201911131545.0": "https://releases-rhcos-art.cloud.privileged.psi.redhat.com/storage/releases/rhcos-4.3/43.81.201911131545.0/x86_64/commitmeta.json" } } Really feels like a kernel thing, although there's still a chance this is a hyperkube-load thing. And I'm still not clear on how a bootimage change is having such an impact on presumably post-pivot cluster behavior (e2e success). 
[1]: https://gcsweb-ci.svc.ci.openshift.org/gcs/origin-ci-test/pr-logs/pull/openshift_installer/2666/pull-ci-openshift-installer-master-e2e-aws/
There's some suspicion that the issue may be AWS bumping its microcode for the TSX thing, in which case we won't be able to turn it off. Arguing against that is the fact that 4.2 AWS CI from today is seeing 1 for max(etcd_server_leader_changes_seen_total). But to get another angle on it, I checked our 4.3 nightly promotion for e2e-azure from the window where e2e-aws saw the leader-change spike:

$ for i in $(seq 590); do mkdir -p "${i}"; curl "https://storage.googleapis.com/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-4.3/${i}/finished.json" > "${i}/finished.json"; done
$ for x in */finished.json; do FIN=$(jq -r '.timestamp | todateiso8601' $x); PASS=$(jq -r .passed $x); echo "$FIN $x $PASS"; done 2>/dev/null | sort | grep 2019-11-20
2019-11-20T01:11:39Z 435/finished.json false
2019-11-20T04:42:41Z 436/finished.json false
2019-11-20T07:18:41Z 437/finished.json false
2019-11-20T09:19:00Z 438/finished.json false
2019-11-20T11:38:38Z 439/finished.json false
2019-11-20T16:16:38Z 440/finished.json false
2019-11-20T18:32:20Z 441/finished.json false
2019-11-20T22:02:34Z 442/finished.json false
$ ...
prom-history.sh stuff to pull down and pull max(etcd_server_leader_changes_seen_total) from prometheus.tar for [1] through [2]:

$ for RESULT in */result.json; do VALUE="$(jq '[.data.result[].values[][1] | tonumber] | max' "${RESULT}")"; END_TIME="$(jq -r '.timestamp | todateiso8601' "${RESULT/result/finished}")"; if test -n "${VALUE}"; then echo "${END_TIME} ${VALUE} ${RESULT}"; fi; done 2>/dev/null | sort
2019-11-19T06:49:15Z 6 430/result.json
2019-11-19T07:31:25Z 6 431/result.json
2019-11-19T14:22:38Z 4 432/result.json
2019-11-19T18:41:35Z 2 433/result.json
2019-11-19T22:11:09Z 6 434/result.json
2019-11-20T01:11:39Z 5 435/result.json
2019-11-20T04:42:41Z 5 436/result.json
2019-11-20T07:18:41Z 5 437/result.json
2019-11-20T09:19:00Z 7 438/result.json
2019-11-20T11:38:38Z 7 439/result.json
2019-11-20T16:16:38Z 4 440/result.json
2019-11-20T18:32:20Z 5 441/result.json
2019-11-20T22:02:34Z 7 442/result.json
2019-11-21T00:58:54Z 8 443/result.json
2019-11-21T04:45:17Z 7 444/result.json
2019-11-21T08:01:35Z null 445/result.json
2019-11-21T14:53:06Z 12 446/result.json
2019-11-21T17:38:53Z null 447/result.json
2019-11-21T23:25:15Z null 448/result.json
2019-11-22T01:12:43Z 7 449/result.json
2019-11-22T04:26:48Z 8 450/result.json

Not a great signal, but maybe a weak trend upwards.
And checking metal in a different directory:

$ for i in $(seq 921); do mkdir -p "${i}"; curl https://storage.googleapis.com/origin-ci-test/logs/release-openshift-ocp-installer-e2e-metal-4.3/${i}/finished.json > "${i}/finished.json"; done
$ for x in */finished.json; do FIN=$(jq -r '.timestamp | todateiso8601' $x); PASS=$(jq -r .passed $x); echo "$FIN $x $PASS"; done 2>/dev/null | sort | grep 2019-11-20
2019-11-20T00:53:32Z 765/finished.json false
2019-11-20T04:19:05Z 766/finished.json false
2019-11-20T07:03:42Z 767/finished.json false
2019-11-20T08:40:29Z 768/finished.json false
2019-11-20T10:59:18Z 769/finished.json false
2019-11-20T16:19:49Z 770/finished.json false
2019-11-20T18:05:16Z 771/finished.json false
2019-11-20T20:52:47Z 772/finished.json false
$ ...

prom-history.sh stuff to pull down prometheus.tar for [3] through [4]:

$ ls -l */prometheus.tar
-rw-r--r--. 1 trking trking 67365811 Dec 13 09:43 760/prometheus.tar
...
-rw-r--r--. 1 trking trking 24107212 Dec 13 09:44 770/prometheus.tar
-rw-r--r--. 1 trking trking 0 Dec 13 09:44 771/prometheus.tar
...
-rw-r--r--. 1 trking trking 0 Dec 13 09:44 780/prometheus.tar

Checking [5]:

level=fatal msg="waiting for Kubernetes API: context deadline exceeded"

so no metal results after that. But running the prom-history.sh logic to analyze the ones we got:

$ for RESULT in */result.json; do VALUE="$(jq '[.data.result[].values[][1] | tonumber] | max' "${RESULT}")"; END_TIME="$(jq -r '.timestamp | todateiso8601' "${RESULT/result/finished}")"; if test -n "${VALUE}"; then echo "${END_TIME} ${VALUE}"; fi; done
2019-11-19T07:08:52Z 2
2019-11-19T07:09:51Z 1
2019-11-19T14:06:39Z 1
2019-11-19T18:50:36Z 1
2019-11-19T21:56:06Z 1
2019-11-20T00:53:32Z 1
2019-11-20T04:19:05Z 1
2019-11-20T07:03:42Z 2
2019-11-20T08:40:29Z 1
2019-11-20T10:59:18Z 1
2019-11-20T16:19:49Z 1

So that's looking solid. Dunno if we've had any successful metal runs since (Packet was having some internal issue and we lost signal), but I'll look...
[1]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-4.3/430
[2]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-4.3/450
[3]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-metal-4.3/760
[4]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-metal-4.3/780
[5]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-metal-4.3/771
Looking at CI builds for e2e-aws testing, from [1] through [2]:

$ for i in $(seq 1460 1480); do mkdir -p "${i}"; curl "https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-4.3/${i}/finished.json" > "${i}/finished.json"; curl "https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-4.3/${i}/artifacts/e2e-aws/metrics/prometheus.tar" > "${i}/prometheus.tar"; done
$ ... prom-history.sh stuff to pull max(etcd_server_leader_changes_seen_total) from prometheus.tar
$ for RESULT in */result.json; do VALUE="$(jq '[.data.result[].values[][1] | tonumber] | max' "${RESULT}")"; END_TIME="$(jq -r '.timestamp | todateiso8601' "${RESULT/result/finished}")"; if test -n "${VALUE}"; then echo "${END_TIME} ${VALUE} ${RESULT}"; fi; done 2>/dev/null | sort
2019-11-16T21:51:19Z 1 1460/result.json
2019-11-17T17:07:25Z 1 1461/result.json
2019-11-18T17:05:41Z 1 1462/result.json
2019-11-18T17:23:47Z 1 1463/result.json
2019-11-19T17:23:57Z 1 1465/result.json
2019-11-19T17:30:56Z 1 1464/result.json
2019-11-20T17:27:30Z 1 1466/result.json
2019-11-20T17:27:53Z 1 1467/result.json
2019-11-21T17:27:59Z 8 1468/result.json
2019-11-21T21:33:21Z 1 1469/result.json
2019-11-22T21:30:12Z 9 1470/result.json
2019-11-23T17:08:17Z 1 1471/result.json
2019-11-24T17:09:03Z 8 1472/result.json
2019-11-25T17:16:40Z 7 1473/result.json
2019-11-26T16:24:59Z 1 1474/result.json
2019-11-27T16:28:11Z 7 1475/result.json
2019-11-27T16:58:08Z 1 1476/result.json
2019-11-28T16:50:50Z 8 1477/result.json
2019-11-28T20:18:40Z null 1478/result.json
2019-11-28T21:08:47Z null 1479/result.json
2019-11-29T15:44:19Z null 1480/result.json

(empty/corrupt prometheus.tar for those last three). Very strange that it was flapping 1<->high after 2019-11-21T17:27:59Z. The 2019-11-20T17:27:53Z -> 2019-11-21T17:27:59Z window matches up well with the 2019-11-20T18:06:57Z -> 2019-11-20T21:31:27Z window from e2e-aws nightlies from comment 20.
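The jq pipeline doing the heavy lifting above takes a Prometheus query result and emits the maximum sample value across all returned series. A minimal sketch against a synthetic result.json (the series and sample values are invented for illustration; the real files come out of the prom-history.sh query step):

```shell
# Hypothetical Prometheus range-query response with two series.
cat > /tmp/result-demo.json <<'EOF'
{"data":{"result":[
  {"values":[[1574208000,"1"],[1574208060,"3"]]},
  {"values":[[1574208000,"2"],[1574208060,"8"]]}
]}}
EOF

# Flatten every [timestamp, value] pair across all series, keep just the
# values (Prometheus returns them as strings), and take the max.
VALUE=$(jq '[.data.result[].values[][1] | tonumber] | max' /tmp/result-demo.json)
echo "$VALUE"   # 8
```

The `2>/dev/null` in the real loop swallows jq errors from the runs whose prometheus.tar was empty or corrupt, which is why those rows show up as null rather than breaking the loop.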
Looking at machine-os-content:

$ for i in $(seq 1460 1477); do PULLSPEC=$(curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-4.3/$i/artifacts/release-images-latest/release-images-latest | jq -r '.spec.tags[] | select(.name == "machine-os-content").from.name'); echo -n "$PULLSPEC $i "; oc image info --output json "${PULLSPEC}" | jq -r .config.config.Labels.version; done
registry.svc.ci.openshift.org/ocp/4.3-2019-11-16-153916@sha256:ddc632cd9bd33291f6f57b05f615e6db85a7d9022a654096f6a21a0c10839b9a 1460 error: image does not exist
registry.svc.ci.openshift.org/origin/4.3-2019-11-17-154428@sha256:03858607139c63bd541891b90ad0a72220bbb6a4d0889bcb8fa3a484e4bd72fd 1461 error: image does not exist
registry.svc.ci.openshift.org/ocp/4.3-2019-11-18-140404@sha256:2989b16f5e904423af3292cd884f449277950c7e5d6d1d90954dc657bfb3c854 1462 error: image does not exist
registry.svc.ci.openshift.org/origin/4.3-2019-11-18-160156@sha256:32a963604cd306d5207438e8a148b856e0766bae5b53e46e561c10ef691f8ed3 1463 error: image does not exist
registry.svc.ci.openshift.org/ocp/4.3-2019-11-19-095016@sha256:e3154258679cab548b1c82d159f98e9863b76ce60932a4f1b25433924cdc55be 1464 error: image does not exist
registry.svc.ci.openshift.org/origin/4.3-2019-11-19-160737@sha256:a576f9703d7b2ac990b58b9ebccf4ada350366f2c6726b3c98e0a7725930b0d0 1465 error: image does not exist
registry.svc.ci.openshift.org/ocp/4.3-2019-11-20-134433@sha256:1a3754d7cc780286c6803c2855ec5b2a59f6280dc403d3f23e2e28b514432cdc 1466 error: image does not exist
registry.svc.ci.openshift.org/origin/4.3-2019-11-20-160525@sha256:7264695c700dd766ec9d2c737a57e90bacc8f2af9d68cc4a2db32be0feaa472e 1467 error: image does not exist
registry.svc.ci.openshift.org/ocp/4.3-2019-11-21-103638@sha256:1a3754d7cc780286c6803c2855ec5b2a59f6280dc403d3f23e2e28b514432cdc 1468 43.81.201911200632.0
registry.svc.ci.openshift.org/origin/4.3-2019-11-21-195816@sha256:0bb9819f10189ff116838ea703c2cec9049136ff474854ef436c52a8bd4a6c12 1469 error: image does not exist
registry.svc.ci.openshift.org/ocp/4.3-2019-11-22-122829@sha256:1a3754d7cc780286c6803c2855ec5b2a59f6280dc403d3f23e2e28b514432cdc 1470 43.81.201911200632.0
registry.svc.ci.openshift.org/origin/4.3-2019-11-23-155208@sha256:384624be66fae6263e4a2a4ae5d5c16ab3fc59ab15b50a2ec9a41d6d8dd66676 1471 error: image does not exist
registry.svc.ci.openshift.org/ocp/4.3-2019-11-22-122829@sha256:1a3754d7cc780286c6803c2855ec5b2a59f6280dc403d3f23e2e28b514432cdc 1472 43.81.201911200632.0
registry.svc.ci.openshift.org/ocp/4.3-2019-11-22-122829@sha256:1a3754d7cc780286c6803c2855ec5b2a59f6280dc403d3f23e2e28b514432cdc 1473 43.81.201911200632.0
registry.svc.ci.openshift.org/origin/4.3-2019-11-26-150235@sha256:c6c62bf72d02e6bb43b0d4133e77f85427198f575dcbf4a43293c3e8771c61cc 1474 error: image does not exist
registry.svc.ci.openshift.org/ocp/4.3-2019-11-22-122829@sha256:1a3754d7cc780286c6803c2855ec5b2a59f6280dc403d3f23e2e28b514432cdc 1475 43.81.201911200632.0
registry.svc.ci.openshift.org/origin/4.3-2019-11-27-153431@sha256:e6f2be92fe3c225b0a5609a04e9942a41138297b24ba91a946dff7de29d01e40 1476 error: image does not exist
registry.svc.ci.openshift.org/ocp/4.3-2019-11-22-122829@sha256:1a3754d7cc780286c6803c2855ec5b2a59f6280dc403d3f23e2e28b514432cdc 1477 43.81.201911200632.0

Looks like we were flapping between CI and ART sources?
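The per-build pullspec lookup above is just a jq select over the release image stream's tags. A sketch against a synthetic release-images-latest fragment (the registry names and digests here are placeholders, not real images):

```shell
# Hypothetical release-images-latest fragment; real ones list many more tags.
cat > /tmp/release-images-demo.json <<'EOF'
{"spec":{"tags":[
  {"name":"installer","from":{"name":"registry.example.com/ocp@sha256:aaa"}},
  {"name":"machine-os-content","from":{"name":"registry.example.com/ocp@sha256:bbb"}}
]}}
EOF

# Pick out the machine-os-content pullspec, exactly as in the loop above.
PULLSPEC=$(jq -r '.spec.tags[] | select(.name == "machine-os-content").from.name' /tmp/release-images-demo.json)
echo "$PULLSPEC"   # registry.example.com/ocp@sha256:bbb
```

The same select pattern with `.name == "installer"` and a different trailing path is what pulls the installer commit ID later in this thread.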
Joining the leader change data with the machine-os-content data:

$ for i in $(seq 1460 1477); do PULLSPEC=$(curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-4.3/$i/artifacts/release-images-latest/release-images-latest | jq -r '.spec.tags[] | select(.name == "machine-os-content").from.name'); VALUE="$(jq '[.data.result[].values[][1] | tonumber] | max' "${i}/result.json")"; END_TIME="$(jq -r '.timestamp | todateiso8601' "${i}/finished.json")"; echo "${END_TIME} ${VALUE} ${i} ${PULLSPEC}"; done
2019-11-16T21:51:19Z 1 1460 registry.svc.ci.openshift.org/ocp/4.3-2019-11-16-153916@sha256:ddc632cd9bd33291f6f57b05f615e6db85a7d9022a654096f6a21a0c10839b9a
2019-11-17T17:07:25Z 1 1461 registry.svc.ci.openshift.org/origin/4.3-2019-11-17-154428@sha256:03858607139c63bd541891b90ad0a72220bbb6a4d0889bcb8fa3a484e4bd72fd
2019-11-18T17:05:41Z 1 1462 registry.svc.ci.openshift.org/ocp/4.3-2019-11-18-140404@sha256:2989b16f5e904423af3292cd884f449277950c7e5d6d1d90954dc657bfb3c854
2019-11-18T17:23:47Z 1 1463 registry.svc.ci.openshift.org/origin/4.3-2019-11-18-160156@sha256:32a963604cd306d5207438e8a148b856e0766bae5b53e46e561c10ef691f8ed3
2019-11-19T17:30:56Z 1 1464 registry.svc.ci.openshift.org/ocp/4.3-2019-11-19-095016@sha256:e3154258679cab548b1c82d159f98e9863b76ce60932a4f1b25433924cdc55be
2019-11-19T17:23:57Z 1 1465 registry.svc.ci.openshift.org/origin/4.3-2019-11-19-160737@sha256:a576f9703d7b2ac990b58b9ebccf4ada350366f2c6726b3c98e0a7725930b0d0
2019-11-20T17:27:30Z 1 1466 registry.svc.ci.openshift.org/ocp/4.3-2019-11-20-134433@sha256:1a3754d7cc780286c6803c2855ec5b2a59f6280dc403d3f23e2e28b514432cdc
2019-11-20T17:27:53Z 1 1467 registry.svc.ci.openshift.org/origin/4.3-2019-11-20-160525@sha256:7264695c700dd766ec9d2c737a57e90bacc8f2af9d68cc4a2db32be0feaa472e
2019-11-21T17:27:59Z 8 1468 registry.svc.ci.openshift.org/ocp/4.3-2019-11-21-103638@sha256:1a3754d7cc780286c6803c2855ec5b2a59f6280dc403d3f23e2e28b514432cdc
2019-11-21T21:33:21Z 1 1469 registry.svc.ci.openshift.org/origin/4.3-2019-11-21-195816@sha256:0bb9819f10189ff116838ea703c2cec9049136ff474854ef436c52a8bd4a6c12
2019-11-22T21:30:12Z 9 1470 registry.svc.ci.openshift.org/ocp/4.3-2019-11-22-122829@sha256:1a3754d7cc780286c6803c2855ec5b2a59f6280dc403d3f23e2e28b514432cdc
2019-11-23T17:08:17Z 1 1471 registry.svc.ci.openshift.org/origin/4.3-2019-11-23-155208@sha256:384624be66fae6263e4a2a4ae5d5c16ab3fc59ab15b50a2ec9a41d6d8dd66676
2019-11-24T17:09:03Z 8 1472 registry.svc.ci.openshift.org/ocp/4.3-2019-11-22-122829@sha256:1a3754d7cc780286c6803c2855ec5b2a59f6280dc403d3f23e2e28b514432cdc
2019-11-25T17:16:40Z 7 1473 registry.svc.ci.openshift.org/ocp/4.3-2019-11-22-122829@sha256:1a3754d7cc780286c6803c2855ec5b2a59f6280dc403d3f23e2e28b514432cdc
2019-11-26T16:24:59Z 1 1474 registry.svc.ci.openshift.org/origin/4.3-2019-11-26-150235@sha256:c6c62bf72d02e6bb43b0d4133e77f85427198f575dcbf4a43293c3e8771c61cc
2019-11-27T16:28:11Z 7 1475 registry.svc.ci.openshift.org/ocp/4.3-2019-11-22-122829@sha256:1a3754d7cc780286c6803c2855ec5b2a59f6280dc403d3f23e2e28b514432cdc
2019-11-27T16:58:08Z 1 1476 registry.svc.ci.openshift.org/origin/4.3-2019-11-27-153431@sha256:e6f2be92fe3c225b0a5609a04e9942a41138297b24ba91a946dff7de29d01e40
2019-11-28T16:50:50Z 8 1477 registry.svc.ci.openshift.org/ocp/4.3-2019-11-22-122829@sha256:1a3754d7cc780286c6803c2855ec5b2a59f6280dc403d3f23e2e28b514432cdc

So that pretty clearly gets AWS-side microcode off the hook and makes this an ART vs. CI machine-os-content thing (or maybe there are other ART vs. CI differences between these images as well).

[1]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-4.3/1460
[2]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-4.3/1480
(In reply to W. Trevor King from comment #20)
> [3]: https://github.com/openshift/openshift-apiserver/compare/29b6805e14572a1da2fe0cd0d13d7038b127f600...2e7b3f676eaafde27984f2408fb737d3a1ce5e59
> [4]: https://github.com/openshift/openshift-apiserver/pull/44

David opened [1] to try reverting this, but dropping [2] into PromeCIeus [3] shows max(etcd_server_leader_changes_seen_total) at 11. So the bump to 1.17 is probably not what triggered this.

[1]: https://github.com/openshift/openshift-apiserver/pull/61
[2]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/openshift_openshift-apiserver/61/pull-ci-openshift-openshift-apiserver-master-e2e-aws/217
[3]: https://promecieus.svc.ci.openshift.org/
(In reply to W. Trevor King from comment #23)
> 2019-11-20T17:27:30Z 1 1466 registry.svc.ci.openshift.org/ocp/4.3-2019-11-20-134433@sha256:1a3754d7cc780286c6803c2855ec5b2a59f6280dc403d3f23e2e28b514432cdc
> 2019-11-20T17:27:53Z 1 1467 registry.svc.ci.openshift.org/origin/4.3-2019-11-20-160525@sha256:7264695c700dd766ec9d2c737a57e90bacc8f2af9d68cc4a2db32be0feaa472e
> 2019-11-21T17:27:59Z 8 1468 registry.svc.ci.openshift.org/ocp/4.3-2019-11-21-103638@sha256:1a3754d7cc780286c6803c2855ec5b2a59f6280dc403d3f23e2e28b514432cdc

$ for i in $(seq 1466 1468); do COMMIT=$(curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-4.3/$i/artifacts/release-images-latest/release-images-latest | jq -r '.spec.tags[] | select(.name == "installer").annotations["io.openshift.build.commit.id"]'); git --no-pager log -1 --date=short --format='%h %cd%d %s' "${COMMIT}"; done
e9b4862c8 2019-11-20 Merge pull request #2667 from jcpowermac/issue-2634
23f40299a 2019-11-12 Merge pull request #2548 from vrutkovs/okd
0ebc52945 2019-11-20 Merge pull request #2691 from pierreprinetti/os_upi_docs_fix

So for some reason the runs using the OCP/ART machine-os-content are using recent installer code and flipping into the broken state with [1] (which contains [2,3]). The runs using the origin machine-os-content are fine, running code from the week before (probably for both the installer and other operators and such). The diff of release-images-latest across the break looks a lot like comment 20. Checking release.openshift.io/from-image-stream in their release-images-latest, we have:

* origin 1466: ocp/4.3-2019-11-20-134433
* origin 1468: ocp/4.3-2019-11-21-103638
* ocp 347: ocp/4.3-art-latest-2019-11-20-165519
* ocp 348: ocp/4.3-art-latest-2019-11-20-195514

so comment 20 is still giving us the tightest bracket.
[1]: https://github.com/openshift/installer/compare/e9b4862c8...0ebc52945
[2]: https://github.com/openshift/installer/pull/2666
[3]: https://github.com/openshift/installer/pull/2691
Back to 1 for max(etcd_server_leader_changes_seen_total) by rolling back the AMI to the pre-installer#2666 state while keeping a modern hyperkube [1]. Comment 21 has some mumbling about the different prerelease jobs (which ran on three different AMIs) for #2666, and [2] suggests it's not the kernel change, but it's probably best to just run a naive bisection over the RHCOS builds skipped in the #2666 bump (43.81.201911081536.0 -> 43.81.201911192044.0), because I have no idea how this AMI change has the effect it does ;)

[1]: https://github.com/openshift/installer/pull/2815#issuecomment-565664333
[2]: https://github.com/openshift/installer/pull/2817#issuecomment-565658441
This issue has been pinned definitively to the "null cipher" LUKS container. Work is ongoing on the RHCOS side to investigate possible fixes.
The root cause is that routing I/O through dm-crypt, even with the null cipher, imposes a significant throughput/bandwidth penalty. With the fix, the root filesystem for the null-cipher case is instead activated by mapping a dm-linear target at the offset of the root filesystem, bypassing dm-crypt entirely. Fixes are committed and working their way through CI now. Calling this release pending for now.
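To illustrate the two activation paths: the sketch below is an assumption-laden approximation, not the shipped coreos-cryptfs implementation — the device name, the LUKS1-style "Payload offset" parsing, and the exact dmsetup invocation are all hypothetical.

```shell
# Slow path: open the LUKS container through dm-crypt even though
# cipher_null does no real cryptographic work -- every bio still
# traverses the dm-crypt layer:
#   cryptsetup luksOpen /dev/xvda4 coreos-luks-root
#
# Fast path (the fix): skip dm-crypt and map a dm-linear target starting
# at the LUKS payload offset, so I/O goes straight to the partition.
# Assuming a LUKS1 header, which reports the payload offset in 512-byte
# sectors:
#   OFFSET=$(cryptsetup luksDump /dev/xvda4 | awk '/Payload offset/ {print $3}')
#   SECTORS=$(( $(blockdev --getsz /dev/xvda4) - OFFSET ))
#   dmsetup create coreos-luks-root-nocrypt \
#     --table "0 ${SECTORS} linear /dev/xvda4 ${OFFSET}"
```

The `coreos-luks-root-nocrypt` dm device visible in the lsblk output later in this thread is consistent with the linear-mapping approach.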
"Change has landed and is in the pipe" corresponds to MODIFIED. It still needs QE sign-off to get to VERIFIED, and then a release candidate build (or maybe a GA candidate build?) to get to RELEASE_PENDING.
Latest 4.3 nightly is [1] with:

$ oc image info -o json $(oc adm release info --image-for=machine-os-content registry.svc.ci.openshift.org/ocp/release:4.3.0-0.nightly-2019-12-18-195156) | jq -r '.config.config.Labels.version'
43.81.201912181720.0

Does that have the fix?

[1]: https://openshift-release.svc.ci.openshift.org/#4.3.0-0.nightly
It looks like it does. 43.81.201912181720.0's build metadata [0] points to the commit [1] with the fix. Does this fix also require a boot-image bump?

[0] https://releases-rhcos-art.cloud.privileged.psi.redhat.com/storage/releases/rhcos-4.3/43.81.201912181720.0/x86_64/meta.json
[1] https://gitlab.cee.redhat.com/coreos/redhat-coreos/commit/25a9b3ddbdb6f1a90604b75b142fe38fe173086f
It's supposed to work without a bootimage bump, so ON_QA is appropriate. And we can punt back if it turns out to not work ;)
[1] has 2 for max(etcd_server_leader_changes_seen_total). That's >1, but still lower than the 5+ we were seeing when the bug was bad. And we were seeing 2 occasionally before the break (comment 20).

[1]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-4.3/509
4.3.0-0.nightly-2019-12-18-215416 is the first accepted release with the fix. [0] has 1 for max(etcd_server_leader_changes_seen_total). I ran some basic tests with LUKS encryption and with no encryption against RHCOS build 43.81.201912181720.0, and it looks good. I'm standing up a cluster now and will do a few more checks tomorrow before closing this as verified.

[0] https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-4.3/510
Final checks look good. The fix is in the build, and it only has 1 for max(etcd_server_leader_changes_seen_total). Closing as verified.

$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.3.0-0.nightly-2019-12-18-215416   True        False         10h     Cluster version is 4.3.0-0.nightly-2019-12-18-215416
$ oc get nodes
NAME                           STATUS   ROLES    AGE   VERSION
ip-10-0-136-167.ec2.internal   Ready    worker   11h   v1.16.2
ip-10-0-139-193.ec2.internal   Ready    master   11h   v1.16.2
ip-10-0-146-132.ec2.internal   Ready    worker   11h   v1.16.2
ip-10-0-157-48.ec2.internal    Ready    master   11h   v1.16.2
ip-10-0-163-220.ec2.internal   Ready    worker   11h   v1.16.2
ip-10-0-170-74.ec2.internal    Ready    master   11h   v1.16.2
$ oc debug node/ip-10-0-139-193.ec2.internal
Starting pod/ip-10-0-139-193ec2internal-debug ...
To use host binaries, run `chroot /host`
If you don't see a command prompt, try pressing enter.
sh-4.2# chroot /host
sh-4.4# journalctl -b 0 | grep coreos-cryptfs
Dec 19 02:31:33 localhost coreos-cryptfs[1013]: coreos-cryptfs: /dev/xvda4 uses the 'cipher_null-ecb'
Dec 19 02:31:33 localhost coreos-cryptfs[1013]: coreos-cryptfs: /dev/xvda4 is not encrypted. Device will be mounted as device-mapper linear target.
Dec 19 02:31:33 localhost coreos-cryptfs[1013]: NOCHANGE: partition 4 is size 250607583. it cannot be grown
sh-4.4# lsblk
NAME                         MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
xvda                         202:0    0   120G  0 disk
|-xvda1                      202:1    0   384M  0 part /boot
|-xvda2                      202:2    0   127M  0 part /boot/efi
|-xvda3                      202:3    0     1M  0 part
`-xvda4                      202:4    0 119.5G  0 part
  `-coreos-luks-root-nocrypt 253:0    0 119.5G  0 dm   /sysroot
*** Bug 1771741 has been marked as a duplicate of this bug. ***
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:0062