Bug 1788321 - [Feature:DisasterRecovery][Disruptive] [dr-quorum-restore] Cluster should restore itself after quorum loss
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Etcd
Version: 4.3.0
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: high
Target Milestone: ---
Target Release: 4.5.0
Assignee: Sam Batschelet
QA Contact: ge liu
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-01-06 22:44 UTC by Ben Parees
Modified: 2020-06-30 18:56 UTC
CC: 7 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
The disruptive tests are internal CI tooling; end users do not need to know about these fixes to our CI tests.
Clone Of:
Environment:
Last Closed: 2020-05-13 21:27:32 UTC
Target Upstream Version:
Embargoed:




Links:
Red Hat Product Errata RHBA-2020:0062 (last updated 2020-01-23 11:20:31 UTC)

Description Ben Parees 2020-01-06 22:44:03 UTC
Description of problem:
The test is consistently failing in the e2e-aws-disruptive-4.3 CI job.

Example run:
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-disruptive-4.3/99


fail [github.com/openshift/origin/test/extended/dr/common.go:303]: Unexpected error:
    failed running "sudo -i /bin/bash -x /usr/local/bin/etcd-snapshot-restore.sh /root/assets/backup/snapshot.db etcd-member-ip-10-0-149-192.ec2.internal=https://etcd-1.ci-op-14v1qhjk-2770b.origin-ci-int-aws.dev.rhcloud.com:2380": <nil> (exit code 1, stderr + set -o errexit
    + set -o pipefail
    + [[ 0 -ne 0 ]]
    + '[' /root/assets/backup/snapshot.db == '' ']'
    + '[' etcd-member-ip-10-0-149-192.ec2.internal=https://etcd-1.ci-op-14v1qhjk-2770b.origin-ci-int-aws.dev.rhcloud.com:2380 == '' ']'
    + BACKUP_FILE=/root/assets/backup/snapshot.db
    + INITIAL_CLUSTER=etcd-member-ip-10-0-149-192.ec2.internal=https://etcd-1.ci-op-14v1qhjk-2770b.origin-ci-int-aws.dev.rhcloud.com:2380
    + ASSET_DIR=./assets
    + CONFIG_FILE_DIR=/etc/kubernetes
    + MANIFEST_DIR=/etc/kubernetes/manifests
    + MANIFEST_STOPPED_DIR=./assets/manifests-stopped
    + RUN_ENV=/run/etcd/environment
    + ETCDCTL=./assets/bin/etcdctl
    + ETCD_DATA_DIR=/var/lib/etcd
    + ETCD_MANIFEST=/etc/kubernetes/manifests/etcd-member.yaml
    + ETCD_STATIC_RESOURCES=/etc/kubernetes/static-pod-resources/etcd-member
    + STOPPED_STATIC_PODS=./assets/tmp/stopped-static-pods
    + '[' '!' -f /root/assets/backup/snapshot.db ']'
    + echo 'etcd snapshot /root/assets/backup/snapshot.db does not exist.'
    + exit 1
    )
occurred
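
For context, the restore script never gets past its argument validation: the bash -x trace above shows it reaching the snapshot-existence check, printing "etcd snapshot /root/assets/backup/snapshot.db does not exist." and exiting 1. That means the snapshot file was never written (or was written somewhere else) before the test invoked the restore. Below is a minimal sketch of that guard, reconstructed from the trace; the variable names and the [ ! -f ... ] check appear verbatim in the trace, while the positional-argument handling shown here is an assumption:

    #!/bin/bash
    # Sketch of the failing guard in etcd-snapshot-restore.sh,
    # reconstructed from the bash -x trace in this report.
    set -o errexit
    set -o pipefail

    BACKUP_FILE="$1"       # e.g. /root/assets/backup/snapshot.db
    INITIAL_CLUSTER="$2"   # e.g. etcd-member-...=https://etcd-1...:2380

    # This is the check the trace fails on: if the snapshot file is
    # missing, print a message and exit 1 before any restore work begins.
    if [ ! -f "${BACKUP_FILE}" ]; then
        echo "etcd snapshot ${BACKUP_FILE} does not exist."
        exit 1
    fi

To confirm this on an affected master, one could check the exact path the test passes to the script, e.g. run: sudo test -f /root/assets/backup/snapshot.db && echo present || echo missing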

Comment 11 errata-xmlrpc 2020-01-23 11:20:00 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0062

Comment 12 Aniket Bhat 2020-02-28 19:40:37 UTC
Seeing this as buildcop again today (02/28/2020). Seen in CI: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-disruptive-4.3/152#1:build-log.txt%3A2237

Comment 13 Ben Parees 2020-03-10 22:09:26 UTC
This is the top failure cause for our release-informing e2e-aws-disruptive job:
https://testgrid.k8s.io/redhat-openshift-ocp-release-4.3-informing#release-openshift-origin-installer-e2e-aws-disruptive-4.3&sort-by-flakiness=

Comment 20 errata-xmlrpc 2020-05-13 21:27:32 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0062

