Bug 1775878 - ~10% of CI jobs fail: Some tests fail with 'etcdserver: request timed out'
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: RHCOS
Version: 4.3.0
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: 4.3.0
Assignee: Ben Howard
QA Contact: Michael Nguyen
URL:
Whiteboard:
: 1771741
Depends On:
Blocks: 1771741 1776402 1779810
 
Reported: 2019-11-23 06:32 UTC by W. Trevor King
Modified: 2023-09-07 21:05 UTC
14 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-01-23 11:13:48 UTC
Target Upstream Version:
Embargoed:


Attachments
etcd_disk_wal_fsync_duration_seconds_bucket (75.89 KB, image/png)
2019-11-29 18:54 UTC, Sam Batschelet
etcd_server_heartbeat_send_failures_total_aws-compact (61.83 KB, image/png)
2019-11-29 21:01 UTC, Sam Batschelet
prom-history.sh for max(etcd_server_leader_changes_seen_total) (2.83 KB, text/plain)
2019-12-13 05:59 UTC, W. Trevor King


Links:
Red Hat Product Errata RHBA-2020:0062 (Last Updated: 2020-01-23 11:14:20 UTC)

Description W. Trevor King 2019-11-23 06:32:33 UTC
8% of the past 12h ^release-.*-4.3$ failures included messages like [1]:

  error retrieving resource lock.*etcdserver: request timed out

Seems spread around the different job types [2,3,4].  Not sure if we need to bump our disks, or if FIPS, compact, and Azure all have different problems, or what.

[1]: https://search.svc.ci.openshift.org/chart?name=^release-.*-4.3$&search=error%20retrieving%20resource%20lock.*etcdserver:%20request%20timed%20out
[2]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-fips-4.3/553
[3]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-azure-upgrade-4.3/191
[4]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-compact-4.3/40
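
For anyone re-checking a run by hand, a minimal sketch (the helper name is mine, not from this bug; the GCS layout matches the Prow jobs linked above):

```shell
# Hypothetical helper: count the timeout signature in one run's build log,
# using the same GCS layout as the Prow jobs linked above.
count_etcd_timeouts() {
  local job=$1 run=$2
  curl -s "https://storage.googleapis.com/origin-ci-test/logs/${job}/${run}/build-log.txt" |
    grep -c 'error retrieving resource lock.*etcdserver: request timed out'
}
# e.g.: count_etcd_timeouts release-openshift-ocp-installer-e2e-aws-fips-4.3 553
```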

Comment 2 Sam Batschelet 2019-11-29 18:54:33 UTC
Created attachment 1640730 [details]
etcd_disk_wal_fsync_duration_seconds_bucket

p99 on etcd_disk_wal_fsync_duration_seconds_bucket via https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-compact-4.3/40/artifacts/e2e-aws/metrics/prometheus.tar

Comment 3 Sam Batschelet 2019-11-29 18:57:11 UTC
The above shows etcd consistently over the 10ms p99 baseline for the e2e-aws-compact-4.3 job, which does point to resource issues. I am unsure exactly what the test does at the moment, but it is concerning.
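
For reference, a sketch of the query behind that graph (the Prometheus address is an assumption; point it at an instance restored locally from the job's prometheus.tar):

```shell
# The 10ms p99 baseline check as a PromQL query over the fsync histogram.
P99_QUERY='histogram_quantile(0.99, rate(etcd_disk_wal_fsync_duration_seconds_bucket[5m]))'
p99_fsync() {
  curl -sG "${1:-http://localhost:9090}/api/v1/query" \
    --data-urlencode "query=${P99_QUERY}" |
    jq -r '.data.result[] | .metric.instance + " " + .value[1]'
}
# Values above 0.01 (10ms) are over the baseline discussed above.
```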

Comment 4 Sam Batschelet 2019-11-29 21:01:11 UTC
Created attachment 1640732 [details]
etcd_server_heartbeat_send_failures_total_aws-compact

This shows a slow member with many more heartbeat send failures. This member was also located in a different zone (us-east-1b vs. us-east-1a). I understand that this is the default, but it is worth considering when trying to understand the issue.


"beta.kubernetes.io/instance-type": "m4.xlarge",
"failure-domain.beta.kubernetes.io/region": "us-east-1",
"failure-domain.beta.kubernetes.io/zone": "us-east-1b",
"kubernetes.io/hostname": "ip-10-0-148-94",

Comment 5 Clayton Coleman 2019-12-03 20:28:27 UTC
We believe a significant portion of this increased latency appeared at or around the time disk encryption landed in master (some believe it was https://github.com/openshift/installer/pull/2666).  Compact had a few examples of etcd timeouts before that, but transitioned to always-red afterwards, so the two may be related.

Comment 7 W. Trevor King 2019-12-10 21:35:52 UTC
I watched a random AWS EBS volume through a CI run, and it stayed at 100% Burst Balance the whole time, so this is probably not running out of burst credits when it happens on AWS [1,2].  On the other hand, a sample size of one is not great, and I don't know if that job was running 4.3/4.4 code (although odds are good that it was)

[1]: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ebs-volume-types.html#IOcredit
[2]: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ebs-volume-types.html#monitoring_burstbucket
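
If someone wants to automate that Burst Balance check rather than watching the console, a sketch using CloudWatch (the volume ID is a placeholder; requires the aws CLI and credentials for the CI account):

```shell
# Hypothetical helper: minimum BurstBalance for an EBS volume over the last
# two hours.  A minimum of 100 means the volume never dipped into its credits.
ebs_min_burst_balance() {
  local volume=$1  # e.g. vol-0123456789abcdef0 (placeholder)
  aws cloudwatch get-metric-statistics \
    --namespace AWS/EBS --metric-name BurstBalance \
    --dimensions Name=VolumeId,Value="${volume}" \
    --start-time "$(date -u -d '-2 hours' +%Y-%m-%dT%H:%M:%SZ)" \
    --end-time "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
    --period 300 --statistics Minimum \
    --query 'min_by(Datapoints, &Minimum).Minimum'
}
```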

Comment 8 Ben Howard 2019-12-10 22:00:43 UTC
(In reply to Clayton Coleman from comment #5)
> We believe a significant portion of this increased latency occurred at or
> around the time disk encryption landed in master (some believe it was
> https://github.com/openshift/installer/pull/2666).  Compact had a few
> examples of etcd timeout before that, but then transitioned to always red
> after that, so that may be related.

After testing, we believe the disk encryption is a red herring. To test this, I:
- downloaded the 2019-12-05 disk image
- copied the disk and converted the encrypted root filesystem to unencrypted, creating a bare XFS filesystem
- mounted the null-cipher and bare XFS root disks
- ran FIO against both the null-cipher LUKS and the bare XFS filesystems

The tests showed very similar results between both test runs [1].

Further, the null-cipher LUKS disk setup is only present on new 4.3 clusters; clusters upgraded from 4.2 to 4.3 have their root filesystem in a non-LUKS container.

Finally, the release payload [2] pre-dates the change where LUKS was enabled in [3]. The LUKS code was committed and merged earlier, but it did not land until 20 November.

[1] https://gist.github.com/darkmuggle/34ff7f19d6e7fe261ed8988526dc8ead
[2] https://github.com/openshift/installer/pull/2666
    https://github.com/openshift/installer/commit/68cc5772e5917bf2b8862d0bec5fe1e86c57c9a3#diff-a60e9ad59535dfd89b1bf063339ada3c
[3] https://github.com/openshift/installer/commit/3c0562153762a16c1255ed32a6835674f8eeaaab#diff-a60e9ad59535dfd89b1bf063339ada3c
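
The exact FIO invocation is in the gist [1]; for anyone reproducing, a commonly used etcd-style fsync probe (my sketch, not necessarily the one from [1]) looks like:

```shell
# Etcd-style fsync probe (sketch).  Run it once per mount point and compare
# fio's reported fdatasync p99 against the 10ms guidance.  Paths are placeholders.
etcd_fsync_probe() {
  local dir=$1  # e.g. /mnt/luks-null or /mnt/bare-xfs (placeholders)
  fio --name=etcd-probe --directory="${dir}" \
      --rw=write --ioengine=sync --fdatasync=1 \
      --size=22m --bs=2300
}
```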

Comment 9 Clayton Coleman 2019-12-11 15:55:55 UTC
I'm setting this to 4.3 for now; this is ~10% of CI failures.  We need to find the root cause and revert whatever introduced it, which means bisecting to find the source.
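
A sketch of what that bisect could look like over nightly run numbers (the job name is an assumption, and with a ~10% flake rate each point really needs several samples, so this is illustrative only):

```shell
# Illustrative bisect over CI run numbers using finished.json's "passed" field.
# One sample per run is not reliable for a ~10% flake; repeat runs in practice.
first_bad_run() {
  local job=$1 good=$2 bad=$3 mid
  while [ $((bad - good)) -gt 1 ]; do
    mid=$(( (good + bad) / 2 ))
    if curl -s "https://storage.googleapis.com/origin-ci-test/logs/${job}/${mid}/finished.json" |
         jq -e '.passed' >/dev/null; then
      good=$mid
    else
      bad=$mid
    fi
  done
  echo "${bad}"
}
# e.g.: first_bad_run release-openshift-ocp-installer-e2e-aws-4.3 338 349
```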

Comment 11 W. Trevor King 2019-12-12 00:34:36 UTC
If we suspect RHCOS, here's the change between those two jobs:

$ for i in 338 349; do COMMIT=$(curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-4.3/$i/artifacts/release-images-latest/release-images-latest | jq -r '.spec.tags[] | select(.name == "machine-os-content").from.name'); echo "$COMMIT $i"; done
quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:e3154258679cab548b1c82d159f98e9863b76ce60932a4f1b25433924cdc55be 338
quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:1a3754d7cc780286c6803c2855ec5b2a59f6280dc403d3f23e2e28b514432cdc 349
$ for HASH in e3154258679cab548b1c82d159f98e9863b76ce60932a4f1b25433924cdc55be 1a3754d7cc780286c6803c2855ec5b2a59f6280dc403d3f23e2e28b514432cdc; do oc image info --output json "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:${HASH}" | jq -r .config.config.Labels.version; done
43.81.201911190320.0
43.81.201911200632.0

Comparing the installer between the two builds:

$ for i in 338 349; do COMMIT=$(curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-4.3/$i/artifacts/release-images-latest/release-images-latest | jq -r '.spec.tags[] | select(.name == "installer").annotations["io.openshift.build.commit.id"]'); echo "$COMMIT $i"; done
b13dcbfdc1eb30de6576cb7f73072856200aa028 338
0ebc529458c4754c07bbf454caa8a8f7515a5d13 349

So 43.81.201911081536.0 -> 43.81.201911192044.0 [1] for the bootimage.

So the machine-os-content changes were:

$ ./differ.py --first-endpoint art --first-version 43.81.201911081536.0 --second-endpoint art --second-version 43.81.201911200632.0
{
    "diff": {
        "coreos-installer": {
            "43.81.201911200632.0": "coreos-installer.0.0.10.rhaos4.3.gita284cdb.el8.noarch",
            "43.81.201911081536.0": "coreos-installer.0.0.9.rhaos4.2.git2fcf441.el8.noarch"
        },
        "cri-o": {
            "43.81.201911200632.0": "cri-o.0.1.16.0.2.rhaos4.3.git8b6bfb5.el8.x86_64",
            "43.81.201911081536.0": "cri-o.0.1.16.0.0.6.dev.rhaos4.3.git9ad059b.el8.x86_64"
        },
        "podman-manpages": {
            "43.81.201911200632.0": "podman-manpages.0.1.6.3.1.el8.noarch",
            "43.81.201911081536.0": "podman-manpages.0.1.6.2.1.el8.noarch"
        },
        "machine-config-daemon": {
            "43.81.201911200632.0": "machine-config-daemon.0.4.3.0.201911192301.git.1.21e74c7.el8.x86_64",
            "43.81.201911081536.0": "machine-config-daemon.0.4.3.0.201911080552.git.1.68ea0bf.el8.x86_64"
        },
        "podman": {
            "43.81.201911200632.0": "podman.0.1.6.3.1.el8.x86_64",
            "43.81.201911081536.0": "podman.0.1.6.2.1.el8.x86_64"
        },
        "kernel-modules": {
            "43.81.201911200632.0": "kernel-modules.0.4.18.0.147.0.3.el8_1.x86_64",
            "43.81.201911081536.0": "kernel-modules.0.4.18.0.147.el8.x86_64"
        },
        "hardlink": {
            "43.81.201911200632.0": "Not present",
            "43.81.201911081536.0": "hardlink.1.1.3.6.el8.x86_64"
        },
        "openshift-clients": {
            "43.81.201911200632.0": "openshift-clients.0.4.3.0.201911190123.git.1.60c4109.el8.x86_64",
            "43.81.201911081536.0": "openshift-clients.0.4.3.0.201911080552.git.1.bd7390d.el8.x86_64"
        },
        "dracut-network": {
            "43.81.201911200632.0": "dracut-network.0.049.27.git20190906.el8_1.1.x86_64",
            "43.81.201911081536.0": "dracut-network.0.049.57.git20191031.el8.x86_64"
        },
        "kernel-modules-extra": {
            "43.81.201911200632.0": "kernel-modules-extra.0.4.18.0.147.0.3.el8_1.x86_64",
            "43.81.201911081536.0": "kernel-modules-extra.0.4.18.0.147.el8.x86_64"
        },
        "kernel": {
            "43.81.201911200632.0": "kernel.0.4.18.0.147.0.3.el8_1.x86_64",
            "43.81.201911081536.0": "kernel.0.4.18.0.147.el8.x86_64"
        },
        "coreos-installer-dracut": {
            "43.81.201911200632.0": "coreos-installer-dracut.0.0.10.rhaos4.3.gita284cdb.el8.noarch",
            "43.81.201911081536.0": "coreos-installer-dracut.0.0.9.rhaos4.2.git2fcf441.el8.noarch"
        },
        "kernel-core": {
            "43.81.201911200632.0": "kernel-core.0.4.18.0.147.0.3.el8_1.x86_64",
            "43.81.201911081536.0": "kernel-core.0.4.18.0.147.el8.x86_64"
        },
        "container-selinux": {
            "43.81.201911200632.0": "container-selinux.2.2.121.0.1.el8.noarch",
            "43.81.201911081536.0": "container-selinux.2.2.107.2.module+el8.1.0+4081+b29780af.noarch"
        },
        "dracut": {
            "43.81.201911200632.0": "dracut.0.049.27.git20190906.el8_1.1.x86_64",
            "43.81.201911081536.0": "dracut.0.049.57.git20191031.el8.x86_64"
        },
        "pigz": {
            "43.81.201911200632.0": "Not present",
            "43.81.201911081536.0": "pigz.0.2.4.2.el8.x86_64"
        },
        "runc": {
            "43.81.201911200632.0": "runc.0.1.0.0.62.rc8.rhaos4.1.git3cbe540.el8.x86_64",
            "43.81.201911081536.0": "runc.0.1.0.0.62.rc8.el8.x86_64"
        },
        "openshift-hyperkube": {
            "43.81.201911200632.0": "openshift-hyperkube.0.4.3.0.201911162201.git.0.ba68fd5.el8.x86_64",
            "43.81.201911081536.0": "openshift-hyperkube.0.4.3.0.201911080552.git.0.16ab709.el8.x86_64"
        },
        "microcode_ctl": {
            "43.81.201911200632.0": "microcode_ctl.4.20190618.1.20191112.1.el8_1.x86_64",
            "43.81.201911081536.0": "microcode_ctl.4.20190618.1.20190918.2.el8_1.x86_64"
        }
    },
    "sources": {
        "43.81.201911200632.0": "https://releases-rhcos-art.cloud.privileged.psi.redhat.com/storage/releases/rhcos-4.3/43.81.201911200632.0/x86_64/commitmeta.json",
        "43.81.201911081536.0": "https://releases-rhcos-art.cloud.privileged.psi.redhat.com/storage/releases/rhcos-4.3/43.81.201911081536.0/x86_64/commitmeta.json"
    }
}

and the bootimage changes were:

$ ./differ.py --first-endpoint art --first-version 43.81.201911081536.0 --second-endpoint art --second-version 43.81.201911192044.0
{
    "sources": {
        "43.81.201911081536.0": "https://releases-rhcos-art.cloud.privileged.psi.redhat.com/storage/releases/rhcos-4.3/43.81.201911081536.0/x86_64/commitmeta.json",
        "43.81.201911192044.0": "https://releases-rhcos-art.cloud.privileged.psi.redhat.com/storage/releases/rhcos-4.3/43.81.201911192044.0/x86_64/commitmeta.json"
    },
    "diff": {
        "dracut": {
            "43.81.201911081536.0": "dracut.0.049.57.git20191031.el8.x86_64",
            "43.81.201911192044.0": "dracut.0.049.27.git20190906.el8_1.1.x86_64"
        },
        "kernel-core": {
            "43.81.201911081536.0": "kernel-core.0.4.18.0.147.el8.x86_64",
            "43.81.201911192044.0": "kernel-core.0.4.18.0.147.0.3.el8_1.x86_64"
        },
        "cri-o": {
            "43.81.201911081536.0": "cri-o.0.1.16.0.0.6.dev.rhaos4.3.git9ad059b.el8.x86_64",
            "43.81.201911192044.0": "cri-o.0.1.16.0.2.rhaos4.3.git8b6bfb5.el8.x86_64"
        },
        "kernel-modules": {
            "43.81.201911081536.0": "kernel-modules.0.4.18.0.147.el8.x86_64",
            "43.81.201911192044.0": "kernel-modules.0.4.18.0.147.0.3.el8_1.x86_64"
        },
        "kernel": {
            "43.81.201911081536.0": "kernel.0.4.18.0.147.el8.x86_64",
            "43.81.201911192044.0": "kernel.0.4.18.0.147.0.3.el8_1.x86_64"
        },
        "hardlink": {
            "43.81.201911081536.0": "hardlink.1.1.3.6.el8.x86_64",
            "43.81.201911192044.0": "Not present"
        },
        "kernel-modules-extra": {
            "43.81.201911081536.0": "kernel-modules-extra.0.4.18.0.147.el8.x86_64",
            "43.81.201911192044.0": "kernel-modules-extra.0.4.18.0.147.0.3.el8_1.x86_64"
        },
        "machine-config-daemon": {
            "43.81.201911081536.0": "machine-config-daemon.0.4.3.0.201911080552.git.1.68ea0bf.el8.x86_64",
            "43.81.201911192044.0": "machine-config-daemon.0.4.3.0.201911191042.git.1.3af21e9.el8.x86_64"
        },
        "dracut-network": {
            "43.81.201911081536.0": "dracut-network.0.049.57.git20191031.el8.x86_64",
            "43.81.201911192044.0": "dracut-network.0.049.27.git20190906.el8_1.1.x86_64"
        },
        "microcode_ctl": {
            "43.81.201911081536.0": "microcode_ctl.4.20190618.1.20190918.2.el8_1.x86_64",
            "43.81.201911192044.0": "microcode_ctl.4.20190618.1.20191112.1.el8_1.x86_64"
        },
        "coreos-installer-dracut": {
            "43.81.201911081536.0": "coreos-installer-dracut.0.0.9.rhaos4.2.git2fcf441.el8.noarch",
            "43.81.201911192044.0": "coreos-installer-dracut.0.0.10.rhaos4.3.gita284cdb.el8.noarch"
        },
        "podman": {
            "43.81.201911081536.0": "podman.0.1.6.2.1.el8.x86_64",
            "43.81.201911192044.0": "podman.0.1.6.3.1.el8.x86_64"
        },
        "openshift-clients": {
            "43.81.201911081536.0": "openshift-clients.0.4.3.0.201911080552.git.1.bd7390d.el8.x86_64",
            "43.81.201911192044.0": "openshift-clients.0.4.3.0.201911190123.git.1.60c4109.el8.x86_64"
        },
        "podman-manpages": {
            "43.81.201911081536.0": "podman-manpages.0.1.6.2.1.el8.noarch",
            "43.81.201911192044.0": "podman-manpages.0.1.6.3.1.el8.noarch"
        },
        "runc": {
            "43.81.201911081536.0": "runc.0.1.0.0.62.rc8.el8.x86_64",
            "43.81.201911192044.0": "runc.0.1.0.0.62.rc8.rhaos4.1.git3cbe540.el8.x86_64"
        },
        "pigz": {
            "43.81.201911081536.0": "pigz.0.2.4.2.el8.x86_64",
            "43.81.201911192044.0": "Not present"
        },
        "coreos-installer": {
            "43.81.201911081536.0": "coreos-installer.0.0.9.rhaos4.2.git2fcf441.el8.noarch",
            "43.81.201911192044.0": "coreos-installer.0.0.10.rhaos4.3.gita284cdb.el8.noarch"
        },
        "openshift-hyperkube": {
            "43.81.201911081536.0": "openshift-hyperkube.0.4.3.0.201911080552.git.0.16ab709.el8.x86_64",
            "43.81.201911192044.0": "openshift-hyperkube.0.4.3.0.201911162201.git.0.ba68fd5.el8.x86_64"
        }
    }
}

Along with the package changes, there were internal RHCOS changes (e.g. to systemd unit files and such) that are harder to dump here :p.

[1]: https://github.com/openshift/installer/compare/b13dcbfdc1eb30de6576cb7f73072856200aa028...0ebc529458c4754c07bbf454caa8a8f7515a5d13#diff-a60e9ad59535dfd89b1bf063339ada3cL131

Comment 12 W. Trevor King 2019-12-12 00:40:01 UTC
Wait, machine-os-content was 43.81.201911190320.0 -> 43.81.201911200632.0, not 43.81.201911081536.0 -> 43.81.201911200632.0.  I can't seem to find commitmeta.json for 43.81.201911190320.0:

$ for RHCOS in 43.81.201911190320.0 43.81.201911200632.0 43.81.201911081536.0 43.81.201911192044.0; do echo -n "${RHCOS} "; curl -sI "https://releases-rhcos-art.cloud.privileged.psi.redhat.com/storage/releases/rhcos-4.3/${RHCOS}/x86_64/commitmeta.json" | head -n1; done
43.81.201911190320.0 HTTP/1.1 403 Forbidden
43.81.201911200632.0 HTTP/1.1 200 OK
43.81.201911081536.0 HTTP/1.1 200 OK
43.81.201911192044.0 HTTP/1.1 200 OK

Comment 13 W. Trevor King 2019-12-12 00:56:00 UTC
I'm also going to punt this over to RHCOS, although at this point it's really "anybody with any ideas, please help!" ;)

Comment 16 Colin Walters 2019-12-12 14:41:59 UTC
> Wait, machine-os-content was 43.81.201911190320.0 -> 43.81.201911200632.0, not 43.81.201911081536.0 -> 43.81.201911200632.0.  I can't seem to find commitmeta.json for 43.81.201911190320.0:

Very likely https://jira.coreos.com/browse/GRPA-946

Comment 20 W. Trevor King 2019-12-13 05:59:48 UTC
Created attachment 1644623 [details]
prom-history.sh for max(etcd_server_leader_changes_seen_total)

Better brackets on e2e-aws release-promotion jobs via max(etcd_server_leader_changes_seen_total).  This shows a hard cut from 1-2 leader changes in the older runs to 5-16 in the newer runs.  Data from around the transition:

$ for RESULT in */result.json; do VALUE="$(jq '[.data.result[].values[][1] | tonumber] | max' "${RESULT}")"; END_TIME="$(jq -r '.timestamp | todateiso8601' "${RESULT/result/finished}")"; if test -n "${VALUE}"; then echo "${END_TIME} ${VALUE}"; fi; done 2>/dev/null | sort | grep '2019-11-2[01]'
2019-11-20T00:48:22Z 1
2019-11-20T04:25:50Z 1
2019-11-20T06:56:34Z 1
2019-11-20T08:38:31Z 1
2019-11-20T10:57:17Z 1
2019-11-20T15:49:17Z 1
2019-11-20T18:06:57Z 1
2019-11-20T21:31:27Z 10
2019-11-21T00:13:58Z 5
2019-11-21T04:22:07Z 11
2019-11-21T08:09:19Z 8
2019-11-21T14:01:32Z 8
2019-11-21T18:29:48Z 10
2019-11-21T23:35:53Z 14

And job numbers for that transition:

$ for x in 3[34][0-9]; do FIN=$(jq -r '.timestamp | todateiso8601' $x/finished.json); echo "$FIN $x"; done | grep '2019-11-20T18:06:57Z\|2019-11-20T21:31:27Z'
2019-11-20T18:06:57Z 347
2019-11-20T21:31:27Z 348

Repeating the comment 11 analysis for the tighter bracket:

$ for i in 347 348; do COMMIT=$(curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-4.3/$i/artifacts/release-images-latest/release-images-latest | jq -r '.spec.tags[] | select(.name == "machine-os-content").from.name'); echo "$COMMIT $i"; done
quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:1a3754d7cc780286c6803c2855ec5b2a59f6280dc403d3f23e2e28b514432cdc 347
quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:1a3754d7cc780286c6803c2855ec5b2a59f6280dc403d3f23e2e28b514432cdc 348

So _not_ a machine-os-content change.  Comparing commits feeding the release image:

$ JQ='[.spec.tags[] | .name + " " + .annotations["io.openshift.build.source-location"] + "/commit/" + .annotations["io.openshift.build.commit.id"]] | sort[]'
$ diff -U0 <(curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-4.3/347/artifacts/release-images-latest/release-images-latest | jq -r "${JQ}") <(curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-4.3/348/artifacts/release-images-latest/release-images-latest | jq -r "${JQ}")
--- /dev/fd/63	2019-12-12 21:43:12.074842301 -0800
+++ /dev/fd/62	2019-12-12 21:43:12.074842301 -0800
@@ -3 +3 @@
-baremetal-installer https://github.com/openshift/installer/commit/e9b4862c83da4ea4bdb113e19ef2c82f49fb4499
+baremetal-installer https://github.com/openshift/installer/commit/5db8dbcddee17c8cd1451a5d5a186d0cf17e2d6a
@@ -36 +36 @@
-console https://github.com/openshift/console/commit/fbcfd5d8b1aa21ad85af955eb3e0d2a7af70cce2
+console https://github.com/openshift/console/commit/fcc9619ff91bae830ebe81ae630540c4922a251b
@@ -47 +47 @@
-hyperkube https://github.com/openshift/ose/commit/ba68fd59d82c52fc219757a2d8aff35bdb0adc87
+hyperkube https://github.com/eparis/origin/commit/88a02d205da4eba2e72dbdd1af5a63901ddcc766
@@ -49,2 +49,2 @@
-installer https://github.com/openshift/installer/commit/e9b4862c83da4ea4bdb113e19ef2c82f49fb4499
-installer-artifacts https://github.com/openshift/installer/commit/e9b4862c83da4ea4bdb113e19ef2c82f49fb4499
+installer https://github.com/openshift/installer/commit/5db8dbcddee17c8cd1451a5d5a186d0cf17e2d6a
+installer-artifacts https://github.com/openshift/installer/commit/5db8dbcddee17c8cd1451a5d5a186d0cf17e2d6a
@@ -71 +71 @@
-machine-config-operator https://github.com/openshift/machine-config-operator/commit/2e560b2f4847939ba2db3424cc4b71cbbabccc6d
+machine-config-operator https://github.com/openshift/machine-config-operator/commit/e9033544f42191376529efdb7a67deb5427efd1d
@@ -79 +79 @@
-openshift-apiserver https://github.com/openshift/openshift-apiserver/commit/29b6805e14572a1da2fe0cd0d13d7038b127f600
+openshift-apiserver https://github.com/openshift/openshift-apiserver/commit/2e7b3f676eaafde27984f2408fb737d3a1ce5e59

So [1] (just [2]), [3] (just [4]), [5] (just [6]), a very outside chance of the console changes [7], and then whatever happened in origin around Eric's [8] (ba68fd59d82c has not made it into origin).  Looks like:

$ git log --format='%h%d %s' ba68fd59d..88a02d20 | grep origin/pr/
88a02d205d (origin/pr/24181) bump(golang.org/x/sys) as it is needed for compatability with golang.org/x/crypto
d70b19d4d6 (origin/pr/24166) bump(*): ghodss/yaml and library-go
ce930dd4ee (origin/pr/24167) Fix cmd tests to be more resilient to changes
177cafee87 (origin/pr/24148) Bug 1771335: Verify builds can use pull secret
626640c289 (origin/pr/24159) UPSTREAM: 85223: apiextensions: fix items+type logic in API due to broken go-openapi validation
9cdbaf442f (origin/pr/24150) UPSTREAM: 85308: support SNI with IPs
b34aeceb3c (origin/pr/24138) Stop excluding ClusterMachineApproverDown from firing check

installer#2666 bootimages are probably still the primary suspect.  I'll go see if there's a smoking gun in its presubmit jobs.

[1]: https://github.com/openshift/installer/compare/e9b4862c83da4ea4bdb113e19ef2c82f49fb4499...5db8dbcddee17c8cd1451a5d5a186d0cf17e2d6a
[2]: https://github.com/openshift/installer/pull/2666
[3]: https://github.com/openshift/openshift-apiserver/compare/29b6805e14572a1da2fe0cd0d13d7038b127f600...2e7b3f676eaafde27984f2408fb737d3a1ce5e59
[4]: https://github.com/openshift/openshift-apiserver/pull/44
[5]: https://github.com/openshift/machine-config-operator/compare/2e560b2f4847939ba2db3424cc4b71cbbabccc6d...e9033544f42191376529efdb7a67deb5427efd1d
[6]: https://github.com/openshift/machine-config-operator/pull/1257
[7]: https://github.com/openshift/console/compare/fbcfd5d8b1aa21ad85af955eb3e0d2a7af70cce2...fcc9619ff91bae830ebe81ae630540c4922a251b
[8]: https://github.com/openshift/origin/pull/24181

Comment 21 W. Trevor King 2019-12-13 06:32:07 UTC
Checking installer#2666, it only ran four e2e-aws presubmits [1].  Here are the installer commits (there may have also been drift in non-installer release components):

$ for i in 8586 8601 8645 8653; do curl -s https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_installer/2666/pull-ci-openshift-installer-master-e2e-aws/$i/build-log.txt | head -n1; done
2019/11/13 19:01:01 Resolved source https://github.com/openshift/installer to master@9f62449a, merging: #2666 12d45011 @stbenjam
2019/11/14 20:29:04 Resolved source https://github.com/openshift/installer to master@36b066c7, merging: #2666 20d993ea @stbenjam
2019/11/20 00:38:15 Resolved source https://github.com/openshift/installer to master@de10297f, merging: #2666 3c056215 @stbenjam
2019/11/20 13:17:51 Resolved source https://github.com/openshift/installer to master@e9b4862c, merging: #2666 3c056215 @stbenjam

Running max(etcd_server_leader_changes_seen_total) on those jobs using the analysis script from comment 20:

$ for RESULT in */result.json; do VALUE="$(jq '[.data.result[].values[][1] | tonumber] | max' "${RESULT}")"; END_TIME="$(jq -r '.timestamp | todateiso8601' "${RESULT/result/finished}")"; if test -n "${VALUE}"; then echo "${END_TIME} ${VALUE}"; fi; done 2>/dev/null | sort
2019-11-13T20:27:03Z 3
2019-11-14T21:52:17Z 7
2019-11-20T02:03:11Z 9
2019-11-20T14:38:36Z 9

Looks pretty bad, and possibly getting worse with the later commits.  This confirms the RHCOS assignment as far as I'm concerned.  Still not clear what it is about the RHCOS bootimage bump that would break etcd so badly.  Getting RHCOS versions:

$ for COMMIT in 12d45011 20d993ea 3c056215; do curl -s https://raw.githubusercontent.com/openshift/installer/$COMMIT/data/data/rhcos.json | jq -r .buildid; done
43.81.201911131545.0
43.81.201911131833.0
43.81.201911192044.0

Assuming the 3->7 bump was the bulk of the problem:

$ ./differ.py --first-endpoint art --first-version 43.81.201911131545.0 --second-endpoint art --second-version 43.81.201911131833.0
{
    "diff": {
        "kernel-core": {
            "43.81.201911131833.0": "kernel-core.0.4.18.0.147.0.3.el8_1.x86_64",
            "43.81.201911131545.0": "kernel-core.0.4.18.0.147.0.2.el8_1.x86_64"
        },
        "kernel": {
            "43.81.201911131833.0": "kernel.0.4.18.0.147.0.3.el8_1.x86_64",
            "43.81.201911131545.0": "kernel.0.4.18.0.147.0.2.el8_1.x86_64"
        },
        "kernel-modules": {
            "43.81.201911131833.0": "kernel-modules.0.4.18.0.147.0.3.el8_1.x86_64",
            "43.81.201911131545.0": "kernel-modules.0.4.18.0.147.0.2.el8_1.x86_64"
        },
        "kernel-modules-extra": {
            "43.81.201911131833.0": "kernel-modules-extra.0.4.18.0.147.0.3.el8_1.x86_64",
            "43.81.201911131545.0": "kernel-modules-extra.0.4.18.0.147.0.2.el8_1.x86_64"
        },
        "openshift-hyperkube": {
            "43.81.201911131833.0": "openshift-hyperkube.0.4.3.0.201911131246.git.0.015ef3e.el8.x86_64",
            "43.81.201911131545.0": "openshift-hyperkube.0.4.3.0.201911130601.git.0.65d6168.el8.x86_64"
        }
    },
    "sources": {
        "43.81.201911131833.0": "https://releases-rhcos-art.cloud.privileged.psi.redhat.com/storage/releases/rhcos-4.3/43.81.201911131833.0/x86_64/commitmeta.json",
        "43.81.201911131545.0": "https://releases-rhcos-art.cloud.privileged.psi.redhat.com/storage/releases/rhcos-4.3/43.81.201911131545.0/x86_64/commitmeta.json"
    }
}

Really feels like a kernel thing, although there's still a chance this is a hyperkube-load thing.  And I'm still not clear on how a bootimage change is having such an impact on presumably post-pivot cluster behavior (e2e success).

[1]: https://gcsweb-ci.svc.ci.openshift.org/gcs/origin-ci-test/pr-logs/pull/openshift_installer/2666/pull-ci-openshift-installer-master-e2e-aws/
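
One way to sanity-check the kernel angle on a live cluster (my sketch, not something run in this bug) would be to confirm which kernel each node actually booted, since post-pivot nodes should be on the machine-os-content kernel rather than the bootimage one:

```shell
# Hypothetical check: per-node kernel versions as reported by the kubelet.
node_kernels() {
  oc get nodes -o custom-columns=NAME:.metadata.name,KERNEL:.status.nodeInfo.kernelVersion
}
```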

Comment 22 W. Trevor King 2019-12-13 17:54:28 UTC
There's some suspicion that the issue may be AWS bumping its microcode for the TSX thing, in which case we won't be able to turn it off.  Arguing against that is the fact that 4.2 AWS CI from today is seeing 1 for max(etcd_server_leader_changes_seen_total).  But to get another angle on it, I checked our 4.3 nightly promotion for e2e-azure from the window where e2e-aws saw the leader-change spike:

$ for i in $(seq 590); do mkdir -p "${i}"; curl "https://storage.googleapis.com/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-4.3/${i}/finished.json" > "${i}/finished.json"; done
$ for x in */finished.json; do FIN=$(jq -r '.timestamp | todateiso8601' $x); PASS=$(jq -r .passed $x); echo "$FIN $x $PASS"; done 2>/dev/null | sort | grep 2019-11-20
2019-11-20T01:11:39Z 435/finished.json false
2019-11-20T04:42:41Z 436/finished.json false
2019-11-20T07:18:41Z 437/finished.json false
2019-11-20T09:19:00Z 438/finished.json false
2019-11-20T11:38:38Z 439/finished.json false
2019-11-20T16:16:38Z 440/finished.json false
2019-11-20T18:32:20Z 441/finished.json false
2019-11-20T22:02:34Z 442/finished.json false
$ ... prom-history.sh stuff to pull down and pull max(etcd_server_leader_changes_seen_total) from prometheus.tar for [1] through [2]
$ for RESULT in */result.json; do VALUE="$(jq '[.data.result[].values[][1] | tonumber] | max' "${RESULT}")"; END_TIME="$(jq -r '.timestamp | todateiso8601' "${RESULT/result/finished}")"; if test -n "${VALUE}"; then echo "${END_TIME} ${VALUE} ${RESULT}"; fi; done 2>/dev/null | sort
2019-11-19T06:49:15Z 6 430/result.json
2019-11-19T07:31:25Z 6 431/result.json
2019-11-19T14:22:38Z 4 432/result.json
2019-11-19T18:41:35Z 2 433/result.json
2019-11-19T22:11:09Z 6 434/result.json
2019-11-20T01:11:39Z 5 435/result.json
2019-11-20T04:42:41Z 5 436/result.json
2019-11-20T07:18:41Z 5 437/result.json
2019-11-20T09:19:00Z 7 438/result.json
2019-11-20T11:38:38Z 7 439/result.json
2019-11-20T16:16:38Z 4 440/result.json
2019-11-20T18:32:20Z 5 441/result.json
2019-11-20T22:02:34Z 7 442/result.json
2019-11-21T00:58:54Z 8 443/result.json
2019-11-21T04:45:17Z 7 444/result.json
2019-11-21T08:01:35Z null 445/result.json
2019-11-21T14:53:06Z 12 446/result.json
2019-11-21T17:38:53Z null 447/result.json
2019-11-21T23:25:15Z null 448/result.json
2019-11-22T01:12:43Z 7 449/result.json
2019-11-22T04:26:48Z 8 450/result.json

Not a great signal, but maybe a weak trend upwards.  And checking metal in a different directory:

$ for i in $(seq 921); do mkdir -p "${i}"; curl https://storage.googleapis.com/origin-ci-test/logs/release-openshift-ocp-installer-e2e-metal-4.3/${i}/finished.json > "${i}/finished.json"; done
$ for x in */finished.json; do FIN=$(jq -r '.timestamp | todateiso8601' $x); PASS=$(jq -r .passed $x); echo "$FIN $x $PASS"; done 2>/dev/null | sort | grep 2019-11-20
2019-11-20T00:53:32Z 765/finished.json false
2019-11-20T04:19:05Z 766/finished.json false
2019-11-20T07:03:42Z 767/finished.json false
2019-11-20T08:40:29Z 768/finished.json false
2019-11-20T10:59:18Z 769/finished.json false
2019-11-20T16:19:49Z 770/finished.json false
2019-11-20T18:05:16Z 771/finished.json false
2019-11-20T20:52:47Z 772/finished.json false
$ ... prom-history.sh stuff to pull down prometheus.tar for [3] through [4]
$ ls -l */prometheus.tar
-rw-r--r--. 1 trking trking 67365811 Dec 13 09:43 760/prometheus.tar
...
-rw-r--r--. 1 trking trking 24107212 Dec 13 09:44 770/prometheus.tar
-rw-r--r--. 1 trking trking        0 Dec 13 09:44 771/prometheus.tar
...
-rw-r--r--. 1 trking trking        0 Dec 13 09:44 780/prometheus.tar

Checking [5]:

  level=fatal msg="waiting for Kubernetes API: context deadline exceeded" 

so no metal results after that.  But running the prom-history.sh logic to analyze the ones we got:

$ for RESULT in */result.json; do VALUE="$(jq '[.data.result[].values[][1] | tonumber] | max' "${RESULT}")"; END_TIME="$(jq -r '.timestamp | todateiso8601' "${RESULT/result/finished}")"; if test -n "${VALUE}"; then echo "${END_TIME} ${VALUE}"; fi; done
2019-11-19T07:08:52Z 2
2019-11-19T07:09:51Z 1
2019-11-19T14:06:39Z 1
2019-11-19T18:50:36Z 1
2019-11-19T21:56:06Z 1
2019-11-20T00:53:32Z 1
2019-11-20T04:19:05Z 1
2019-11-20T07:03:42Z 2
2019-11-20T08:40:29Z 1
2019-11-20T10:59:18Z 1
2019-11-20T16:19:49Z 1

So that's looking solid.  Dunno if we've had any successful metal runs since (Packet was having some internal issue and we lost signal), but I'll look...

[1]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-4.3/430
[2]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-4.3/450
[3]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-metal-4.3/760
[4]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-metal-4.3/780
[5]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-metal-4.3/771

Comment 23 W. Trevor King 2019-12-13 19:18:03 UTC
Looking at CI builds for e2e-aws testing, from [1] through [2]:

$ for i in $(seq 1460 1480); do mkdir -p "${i}"; curl "https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-4.3/${i}/finished.json" > "${i}/finished.json"; curl "https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-4.3/${i}/artifacts/e2e-aws/metrics/prometheus.tar" > "${i}/prometheus.tar"; done
$ ... prom-history.sh stuff to pull down and pull max(etcd_server_leader_changes_seen_total) from prometheus.tar
$ for RESULT in */result.json; do VALUE="$(jq '[.data.result[].values[][1] | tonumber] | max' "${RESULT}")"; END_TIME="$(jq -r '.timestamp | todateiso8601' "${RESULT/result/finished}")"; if test -n "${VALUE}"; then echo "${END_TIME} ${VALUE} ${RESULT}"; fi; done 2>/dev/null | sort
2019-11-16T21:51:19Z 1 1460/result.json
2019-11-17T17:07:25Z 1 1461/result.json
2019-11-18T17:05:41Z 1 1462/result.json
2019-11-18T17:23:47Z 1 1463/result.json
2019-11-19T17:23:57Z 1 1465/result.json
2019-11-19T17:30:56Z 1 1464/result.json
2019-11-20T17:27:30Z 1 1466/result.json
2019-11-20T17:27:53Z 1 1467/result.json
2019-11-21T17:27:59Z 8 1468/result.json
2019-11-21T21:33:21Z 1 1469/result.json
2019-11-22T21:30:12Z 9 1470/result.json
2019-11-23T17:08:17Z 1 1471/result.json
2019-11-24T17:09:03Z 8 1472/result.json
2019-11-25T17:16:40Z 7 1473/result.json
2019-11-26T16:24:59Z 1 1474/result.json
2019-11-27T16:28:11Z 7 1475/result.json
2019-11-27T16:58:08Z 1 1476/result.json
2019-11-28T16:50:50Z 8 1477/result.json
2019-11-28T20:18:40Z null 1478/result.json
2019-11-28T21:08:47Z null 1479/result.json
2019-11-29T15:44:19Z null 1480/result.json

(empty/corrupt prometheus.tar for those last three).  Very strange that it was flapping 1<->high after 2019-11-21T17:27:59Z.  The 2019-11-20T17:27:53Z -> 2019-11-21T17:27:59Z window matches up well with the 2019-11-20T18:06:57Z -> 2019-11-20T21:31:27Z window from e2e-aws nightlies from comment 20.  Looking at machine-os-content:

$ for i in $(seq 1460 1477); do PULLSPEC=$(curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-4.3/$i/artifacts/release-images-latest/release-images-latest | jq -r '.spec.tags[] | select(.name == "machine-os-content").from.name'); echo -n "$PULLSPEC $i "; oc image info --output json "${PULLSPEC}" | jq -r .config.config.Labels.version; done
registry.svc.ci.openshift.org/ocp/4.3-2019-11-16-153916@sha256:ddc632cd9bd33291f6f57b05f615e6db85a7d9022a654096f6a21a0c10839b9a 1460 error: image does not exist
registry.svc.ci.openshift.org/origin/4.3-2019-11-17-154428@sha256:03858607139c63bd541891b90ad0a72220bbb6a4d0889bcb8fa3a484e4bd72fd 1461 error: image does not exist
registry.svc.ci.openshift.org/ocp/4.3-2019-11-18-140404@sha256:2989b16f5e904423af3292cd884f449277950c7e5d6d1d90954dc657bfb3c854 1462 error: image does not exist
registry.svc.ci.openshift.org/origin/4.3-2019-11-18-160156@sha256:32a963604cd306d5207438e8a148b856e0766bae5b53e46e561c10ef691f8ed3 1463 error: image does not exist
registry.svc.ci.openshift.org/ocp/4.3-2019-11-19-095016@sha256:e3154258679cab548b1c82d159f98e9863b76ce60932a4f1b25433924cdc55be 1464 error: image does not exist
registry.svc.ci.openshift.org/origin/4.3-2019-11-19-160737@sha256:a576f9703d7b2ac990b58b9ebccf4ada350366f2c6726b3c98e0a7725930b0d0 1465 error: image does not exist
registry.svc.ci.openshift.org/ocp/4.3-2019-11-20-134433@sha256:1a3754d7cc780286c6803c2855ec5b2a59f6280dc403d3f23e2e28b514432cdc 1466 error: image does not exist
registry.svc.ci.openshift.org/origin/4.3-2019-11-20-160525@sha256:7264695c700dd766ec9d2c737a57e90bacc8f2af9d68cc4a2db32be0feaa472e 1467 error: image does not exist
registry.svc.ci.openshift.org/ocp/4.3-2019-11-21-103638@sha256:1a3754d7cc780286c6803c2855ec5b2a59f6280dc403d3f23e2e28b514432cdc 1468 43.81.201911200632.0
registry.svc.ci.openshift.org/origin/4.3-2019-11-21-195816@sha256:0bb9819f10189ff116838ea703c2cec9049136ff474854ef436c52a8bd4a6c12 1469 error: image does not exist
registry.svc.ci.openshift.org/ocp/4.3-2019-11-22-122829@sha256:1a3754d7cc780286c6803c2855ec5b2a59f6280dc403d3f23e2e28b514432cdc 1470 43.81.201911200632.0
registry.svc.ci.openshift.org/origin/4.3-2019-11-23-155208@sha256:384624be66fae6263e4a2a4ae5d5c16ab3fc59ab15b50a2ec9a41d6d8dd66676 1471 error: image does not exist
registry.svc.ci.openshift.org/ocp/4.3-2019-11-22-122829@sha256:1a3754d7cc780286c6803c2855ec5b2a59f6280dc403d3f23e2e28b514432cdc 1472 43.81.201911200632.0
registry.svc.ci.openshift.org/ocp/4.3-2019-11-22-122829@sha256:1a3754d7cc780286c6803c2855ec5b2a59f6280dc403d3f23e2e28b514432cdc 1473 43.81.201911200632.0
registry.svc.ci.openshift.org/origin/4.3-2019-11-26-150235@sha256:c6c62bf72d02e6bb43b0d4133e77f85427198f575dcbf4a43293c3e8771c61cc 1474 error: image does not exist
registry.svc.ci.openshift.org/ocp/4.3-2019-11-22-122829@sha256:1a3754d7cc780286c6803c2855ec5b2a59f6280dc403d3f23e2e28b514432cdc 1475 43.81.201911200632.0
registry.svc.ci.openshift.org/origin/4.3-2019-11-27-153431@sha256:e6f2be92fe3c225b0a5609a04e9942a41138297b24ba91a946dff7de29d01e40 1476 error: image does not exist
registry.svc.ci.openshift.org/ocp/4.3-2019-11-22-122829@sha256:1a3754d7cc780286c6803c2855ec5b2a59f6280dc403d3f23e2e28b514432cdc 1477 43.81.201911200632.0

Looks like we were flapping between CI and ART sources? Joining the leader change data with the machine-os-content data:

$ for i in $(seq 1460 1477); do PULLSPEC=$(curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-4.3/$i/artifacts/release-images-latest/release-images-latest | jq -r '.spec.tags[] | select(.name == "machine-os-content").from.name'); VALUE="$(jq '[.data.result[].values[][1] | tonumber] | max' "${i}/result.json")"; END_TIME="$(jq -r '.timestamp | todateiso8601' "${i}/finished.json")"; echo "${END_TIME} ${VALUE} ${i} ${PULLSPEC}"; done
2019-11-16T21:51:19Z 1 1460 registry.svc.ci.openshift.org/ocp/4.3-2019-11-16-153916@sha256:ddc632cd9bd33291f6f57b05f615e6db85a7d9022a654096f6a21a0c10839b9a
2019-11-17T17:07:25Z 1 1461 registry.svc.ci.openshift.org/origin/4.3-2019-11-17-154428@sha256:03858607139c63bd541891b90ad0a72220bbb6a4d0889bcb8fa3a484e4bd72fd
2019-11-18T17:05:41Z 1 1462 registry.svc.ci.openshift.org/ocp/4.3-2019-11-18-140404@sha256:2989b16f5e904423af3292cd884f449277950c7e5d6d1d90954dc657bfb3c854
2019-11-18T17:23:47Z 1 1463 registry.svc.ci.openshift.org/origin/4.3-2019-11-18-160156@sha256:32a963604cd306d5207438e8a148b856e0766bae5b53e46e561c10ef691f8ed3
2019-11-19T17:30:56Z 1 1464 registry.svc.ci.openshift.org/ocp/4.3-2019-11-19-095016@sha256:e3154258679cab548b1c82d159f98e9863b76ce60932a4f1b25433924cdc55be
2019-11-19T17:23:57Z 1 1465 registry.svc.ci.openshift.org/origin/4.3-2019-11-19-160737@sha256:a576f9703d7b2ac990b58b9ebccf4ada350366f2c6726b3c98e0a7725930b0d0
2019-11-20T17:27:30Z 1 1466 registry.svc.ci.openshift.org/ocp/4.3-2019-11-20-134433@sha256:1a3754d7cc780286c6803c2855ec5b2a59f6280dc403d3f23e2e28b514432cdc
2019-11-20T17:27:53Z 1 1467 registry.svc.ci.openshift.org/origin/4.3-2019-11-20-160525@sha256:7264695c700dd766ec9d2c737a57e90bacc8f2af9d68cc4a2db32be0feaa472e
2019-11-21T17:27:59Z 8 1468 registry.svc.ci.openshift.org/ocp/4.3-2019-11-21-103638@sha256:1a3754d7cc780286c6803c2855ec5b2a59f6280dc403d3f23e2e28b514432cdc
2019-11-21T21:33:21Z 1 1469 registry.svc.ci.openshift.org/origin/4.3-2019-11-21-195816@sha256:0bb9819f10189ff116838ea703c2cec9049136ff474854ef436c52a8bd4a6c12
2019-11-22T21:30:12Z 9 1470 registry.svc.ci.openshift.org/ocp/4.3-2019-11-22-122829@sha256:1a3754d7cc780286c6803c2855ec5b2a59f6280dc403d3f23e2e28b514432cdc
2019-11-23T17:08:17Z 1 1471 registry.svc.ci.openshift.org/origin/4.3-2019-11-23-155208@sha256:384624be66fae6263e4a2a4ae5d5c16ab3fc59ab15b50a2ec9a41d6d8dd66676
2019-11-24T17:09:03Z 8 1472 registry.svc.ci.openshift.org/ocp/4.3-2019-11-22-122829@sha256:1a3754d7cc780286c6803c2855ec5b2a59f6280dc403d3f23e2e28b514432cdc
2019-11-25T17:16:40Z 7 1473 registry.svc.ci.openshift.org/ocp/4.3-2019-11-22-122829@sha256:1a3754d7cc780286c6803c2855ec5b2a59f6280dc403d3f23e2e28b514432cdc
2019-11-26T16:24:59Z 1 1474 registry.svc.ci.openshift.org/origin/4.3-2019-11-26-150235@sha256:c6c62bf72d02e6bb43b0d4133e77f85427198f575dcbf4a43293c3e8771c61cc
2019-11-27T16:28:11Z 7 1475 registry.svc.ci.openshift.org/ocp/4.3-2019-11-22-122829@sha256:1a3754d7cc780286c6803c2855ec5b2a59f6280dc403d3f23e2e28b514432cdc
2019-11-27T16:58:08Z 1 1476 registry.svc.ci.openshift.org/origin/4.3-2019-11-27-153431@sha256:e6f2be92fe3c225b0a5609a04e9942a41138297b24ba91a946dff7de29d01e40
2019-11-28T16:50:50Z 8 1477 registry.svc.ci.openshift.org/ocp/4.3-2019-11-22-122829@sha256:1a3754d7cc780286c6803c2855ec5b2a59f6280dc403d3f23e2e28b514432cdc

So that pretty clearly gets AWS-side microcode off the hook and makes this an ART vs. CI machine-os-content thing (or maybe there are other ART vs. CI differences between these images as well).

[1]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-4.3/1460
[2]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-4.3/1480

Comment 24 W. Trevor King 2019-12-13 19:47:58 UTC
(In reply to W. Trevor King from comment #20)
> [3]:
> https://github.com/openshift/openshift-apiserver/compare/
> 29b6805e14572a1da2fe0cd0d13d7038b127f600...
> 2e7b3f676eaafde27984f2408fb737d3a1ce5e59
> [4]: https://github.com/openshift/openshift-apiserver/pull/44

David opened [1] to try reverting this, but dropping [2] into PromeCIeus [3] shows max(etcd_server_leader_changes_seen_total) at 11.  So the bump to 1.17 is probably not what triggered this.

[1]: https://github.com/openshift/openshift-apiserver/pull/61
[2]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/openshift_openshift-apiserver/61/pull-ci-openshift-openshift-apiserver-master-e2e-aws/217
[3]: https://promecieus.svc.ci.openshift.org/

Comment 25 W. Trevor King 2019-12-13 20:16:25 UTC
(In reply to W. Trevor King from comment #23)
> 2019-11-20T17:27:30Z 1 1466 registry.svc.ci.openshift.org/ocp/4.3-2019-11-20-134433@sha256:1a3754d7cc780286c6803c2855ec5b2a59f6280dc403d3f23e2e28b514432cdc
> 2019-11-20T17:27:53Z 1 1467 registry.svc.ci.openshift.org/origin/4.3-2019-11-20-160525@sha256:7264695c700dd766ec9d2c737a57e90bacc8f2af9d68cc4a2db32be0feaa472e
> 2019-11-21T17:27:59Z 8 1468 registry.svc.ci.openshift.org/ocp/4.3-2019-11-21-103638@sha256:1a3754d7cc780286c6803c2855ec5b2a59f6280dc403d3f23e2e28b514432cdc

$ for i in $(seq 1466 1468); do COMMIT=$(curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-4.3/$i/artifacts/release-images-latest/release-images-latest | jq -r '.spec.tags[] | select(.name == "installer").annotations["io.openshift.build.commit.id"]'); git --no-pager log -1 --date=short --format='%h %cd%d %s' "${COMMIT}"; done
e9b4862c8 2019-11-20 Merge pull request #2667 from jcpowermac/issue-2634
23f40299a 2019-11-12 Merge pull request #2548 from vrutkovs/okd
0ebc52945 2019-11-20 Merge pull request #2691 from pierreprinetti/os_upi_docs_fix

So for some reason the runs using the OCP/ART machine-os-content are using recent installer code and flipping into the broken state with [1] (contains [2,3]).  The runs using the origin machine-os-content are fine, running code from the week before (probably for both the installer and other operators and such).  Diff for release-images-latest for the break looks a lot like comment 20.  Checking release.openshift.io/from-image-stream in their release-images-latest, we have:

* origin 1466: ocp/4.3-2019-11-20-134433
* origin 1468: ocp/4.3-2019-11-21-103638
* ocp 347: ocp/4.3-art-latest-2019-11-20-165519
* ocp 348: ocp/4.3-art-latest-2019-11-20-195514

so comment 20 is still giving us the tightest bracket.

[1]: https://github.com/openshift/installer/compare/e9b4862c8...0ebc52945
[2]: https://github.com/openshift/installer/pull/2666
[3]: https://github.com/openshift/installer/pull/2691

Comment 26 W. Trevor King 2019-12-14 03:49:34 UTC
Back to 1 for max(etcd_server_leader_changes_seen_total) after rolling back the AMI to the pre-installer#2666 state while keeping a modern hyperkube [1].  Comment 21 has some mumbling about the different prerelease jobs (which ran on three different AMIs) for that #2666, and [2] suggests it's not the kernel change, but probably best to just run a naive bisection over the RHCOS builds skipped in the #2666 bump (43.81.201911081536.0 -> 43.81.201911192044.0), because I have no idea how this AMI change has the effect it does ;).

[1]: https://github.com/openshift/installer/pull/2815#issuecomment-565664333
[2]: https://github.com/openshift/installer/pull/2817#issuecomment-565658441

Comment 27 Colin Walters 2019-12-17 19:35:26 UTC
This issue has been pinned definitively to the "null cipher" LUKS container.  Work is ongoing on the RHCOS side to investigate possible fixes.

Comment 28 Ben Howard 2019-12-18 19:04:45 UTC
The root cause is that the null cipher, when activated through dm-crypt, imposes a significant throughput/bandwidth penalty.

For the null-cipher case, the fix activates the root filesystem by mapping a dm-linear target at the filesystem's offset within the partition. Fixes are committed and working their way through CI now.
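Roughly, the dm-linear approach boils down to building a device-mapper table line like the following; the device name, sector offset, and partition size here are illustrative stand-ins, not the actual RHCOS values:

```shell
# Build a dm-linear table line: "<start> <size> linear <device> <offset>",
# all in 512-byte sectors. A real system would hand this to
# `dmsetup create coreos-luks-root-nocrypt --table "$table"` (needs root),
# so only the table construction is sketched here.
dev=/dev/xvda4
offset=32768                  # example: payload offset inside the LUKS container
total_sectors=250607583       # example: whole-partition size in sectors
size=$(( total_sectors - offset ))
table="0 $size linear $dev $offset"
echo "$table"
# → 0 250574815 linear /dev/xvda4 32768
```

Since I/O to a linear target is just an offset remap, this sidesteps the dm-crypt data path entirely.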

Calling this release pending for now.

Comment 29 W. Trevor King 2019-12-18 19:23:58 UTC
"change has landed and is in the pipe" is MODIFIED.  Still needs QE to sign off on it to get to VERIFIED, and then have a release candidate built (or maybe a GA candidate build?) to get to RELEASE_PENDING.

Comment 31 W. Trevor King 2019-12-18 21:53:40 UTC
Latest 4.3 nightly is [1] with:

$ oc image info -o json $(oc adm release info --image-for=machine-os-content registry.svc.ci.openshift.org/ocp/release:4.3.0-0.nightly-2019-12-18-195156) | jq -r '.config.config.Labels.version'
43.81.201912181720.0

Does that have the fix?
 
[1]: https://openshift-release.svc.ci.openshift.org/#4.3.0-0.nightly

Comment 32 Michael Nguyen 2019-12-18 22:23:18 UTC
It looks like it does.  43.81.201912181720.0's build json[0] points to the commit[1] with the fix. Does this fix require a boot image bump also?

[0] https://releases-rhcos-art.cloud.privileged.psi.redhat.com/storage/releases/rhcos-4.3/43.81.201912181720.0/x86_64/meta.json
[1] https://gitlab.cee.redhat.com/coreos/redhat-coreos/commit/25a9b3ddbdb6f1a90604b75b142fe38fe173086f

Comment 33 W. Trevor King 2019-12-18 22:26:26 UTC
It's supposed to work without a bootimage bump, so ON_QA is appropriate.  And we can punt back if it turns out to not work ;)

Comment 34 W. Trevor King 2019-12-18 22:29:28 UTC
[1] has 2 for max(etcd_server_leader_changes_seen_total).  That's >1, but still lower than the 5+ we were seeing when the bug was bad.  And we were seeing 2 occasionally before the break (comment 20).

[1]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-4.3/509

Comment 35 Michael Nguyen 2019-12-19 02:29:54 UTC
4.3.0-0.nightly-2019-12-18-215416 is the first accepted release with the fix.  [0] has 1 for max(etcd_server_leader_changes_seen_total).  I ran some basic tests with LUKS encryption and with no encryption against the RHCOS build 43.81.201912181720.0 and it looks good.  I'm standing up a cluster now and will do a few more checks tomorrow before closing this as verified.

[0] https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-4.3/510

Comment 36 Michael Nguyen 2019-12-19 13:49:01 UTC
Final checks look good.  The fix is in the build and it only has 1 for max(etcd_server_leader_changes_seen_total).  Closing as verified.

$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.3.0-0.nightly-2019-12-18-215416   True        False         10h     Cluster version is 4.3.0-0.nightly-2019-12-18-215416

$ oc get nodes
NAME                           STATUS   ROLES    AGE   VERSION
ip-10-0-136-167.ec2.internal   Ready    worker   11h   v1.16.2
ip-10-0-139-193.ec2.internal   Ready    master   11h   v1.16.2
ip-10-0-146-132.ec2.internal   Ready    worker   11h   v1.16.2
ip-10-0-157-48.ec2.internal    Ready    master   11h   v1.16.2
ip-10-0-163-220.ec2.internal   Ready    worker   11h   v1.16.2
ip-10-0-170-74.ec2.internal    Ready    master   11h   v1.16.2
$ oc debug node/ip-10-0-139-193.ec2.internal
Starting pod/ip-10-0-139-193ec2internal-debug ...
To use host binaries, run `chroot /host`
If you don't see a command prompt, try pressing enter.
sh-4.2# chroot /host
sh-4.4# journalctl -b 0 | grep coreos-cryptfs
Dec 19 02:31:33 localhost coreos-cryptfs[1013]: coreos-cryptfs: /dev/xvda4 uses the 'cipher_null-ecb'
Dec 19 02:31:33 localhost coreos-cryptfs[1013]: coreos-cryptfs: /dev/xvda4 is not encrypted. Device will be mounted as device-mapper linear target.
Dec 19 02:31:33 localhost coreos-cryptfs[1013]: NOCHANGE: partition 4 is size 250607583. it cannot be grown
sh-4.4# lsblk
NAME                         MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
xvda                         202:0    0   120G  0 disk 
|-xvda1                      202:1    0   384M  0 part /boot
|-xvda2                      202:2    0   127M  0 part /boot/efi
|-xvda3                      202:3    0     1M  0 part 
`-xvda4                      202:4    0 119.5G  0 part 
  `-coreos-luks-root-nocrypt 253:0    0 119.5G  0 dm   /sysroot

Comment 37 Greg Blomquist 2019-12-20 15:15:01 UTC
*** Bug 1771741 has been marked as a duplicate of this bug. ***

Comment 39 errata-xmlrpc 2020-01-23 11:13:48 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0062

