Bug 1766792 - e2e-aws-scaleup-rhel7 constantly failing
Summary: e2e-aws-scaleup-rhel7 constantly failing
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Installer
Version: 4.2.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Target Release: 4.3.0
Assignee: Russell Teague
QA Contact: Russell Teague
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2019-10-29 22:03 UTC by Kirsten Garrison
Modified: 2019-11-18 13:28 UTC
CC: 3 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-11-18 13:27:56 UTC
Target Upstream Version:
Embargoed:




Links
Github openshift/openshift-ansible pull 12009 (closed): Bug 1766792: Restrict new scaleup machinesets to 1 replica (last updated 2020-12-24 21:03:03 UTC)

Description Kirsten Garrison 2019-10-29 22:03:51 UTC
Description of problem:
This job seems to be basically broken. Looking at the job history, it has passed only 17 of its last 168 runs. Can we eliminate these tests until the job is actually able to run correctly? It doesn't seem like an efficient use of our resources right now.

For ref: https://prow.svc.ci.openshift.org/job-history/origin-ci-test/pr-logs/directory/pull-ci-openshift-machine-config-operator-master-e2e-aws-scaleup-rhel7?buildId=


How reproducible:
Look at most of its CI runs.

Actual results:
It fails almost every run.

Expected results:
It should generally be passing unless there is a reason for it to fail.

Comment 1 Scott Dodson 2019-10-30 13:51:54 UTC
We should fix the test rather than disabling it. The success rate was much higher before 4.3 branching, so it's very likely that something in the product has actually regressed, on top of this job's higher-than-normal failure rate compared to pure RHCOS clusters.

Comment 2 Russell Teague 2019-11-01 20:46:47 UTC
An issue was found with the CentOS AMI that was being used for scaleup. The AMI was switched to the latest RHEL image [1], and additional repos [2] were added to provide the required packages (a sketch of what such a repo definition looks like follows the links below).


[1] https://github.com/openshift/release/pull/5735
[2] https://github.com/openshift/release/pull/5742
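
For anyone curious what "additional repos" amounts to on the host side: enabling an extra repo on a RHEL 7 scaleup node is just a .repo file under /etc/yum.repos.d/. A minimal sketch, with a hypothetical repo id, name, and baseurl (the actual values are in [2]):

  $ sudo tee /etc/yum.repos.d/scaleup-extras.repo <<'EOF'
  # Hypothetical example repo; the real repos added for CI are defined in [2]
  [scaleup-extras]
  name=Extra packages required for RHEL 7 scaleup
  baseurl=https://mirror.example.com/rhel7/extras/
  enabled=1
  gpgcheck=0
  EOF
  $ sudo yum makecache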

Comment 3 Russell Teague 2019-11-08 20:08:08 UTC
The rhel7 jobs are no longer failing constantly after the above PRs merged. However, they are failing fairly regularly on "[Feature:Prometheus][Conformance] Prometheus when installed on the cluster shouldn't report any alerts in firing state apart from Watchdog [Suite:openshift/conformance/parallel/minimal]". Looking into why.
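
The failing check can be reproduced by hand against the in-cluster Prometheus. A rough sketch, assuming curl is available in the prometheus-k8s-0 container (the query mirrors what the conformance test asserts, i.e. no firing alerts other than Watchdog):

  $ oc -n openshift-monitoring exec prometheus-k8s-0 -c prometheus -- \
      curl -s http://localhost:9090/api/v1/query \
      --data-urlencode 'query=ALERTS{alertstate="firing",alertname!="Watchdog"}'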

Comment 4 Russell Teague 2019-11-08 21:01:48 UTC
Sample of alerts firing:

ALERTS{alertname="MachineWithNoRunningPhase", alertstate="firing", endpoint="https", exported_namespace="openshift-machine-api", instance="10.130.0.16:8443", job="machine-api-operator", name="ci-op-yksgk3g1-881e8-mfmqf-worker-us-east-1a-centos-g9lvg", namespace="openshift-machine-api", phase="Provisioned", pod="machine-api-operator-858d4c987-8nb6j", service="machine-api-operator", severity="critical", spec_provider_id="aws:///us-east-1a/i-06377d87cdacbe120"} => 1 @[1573241866.713]

ALERTS{alertname="MachineWithoutValidNode", alertstate="firing", endpoint="https", exported_namespace="openshift-machine-api", instance="10.130.0.16:8443", job="machine-api-operator", name="ci-op-yksgk3g1-881e8-mfmqf-worker-us-east-1a-centos-g9lvg", namespace="openshift-machine-api", phase="Provisioned", pod="machine-api-operator-858d4c987-8nb6j", service="machine-api-operator", severity="critical", spec_provider_id="aws:///us-east-1a/i-06377d87cdacbe120"} => 1

Comment 5 Russell Teague 2019-11-08 21:48:41 UTC
The machineset we copy from us-east-1a now has replicas: 2. The scaleup playbooks only expected one replica, so they don't configure the second machine to become a node. I need to update the scaleup playbooks to account for this.
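
The proper fix is in the scaleup playbooks (PR 12009 above restricts new scaleup machinesets to 1 replica). As a manual stopgap, the copied machineset can be scaled back down; a sketch with a placeholder machineset name:

  $ oc -n openshift-machine-api scale machineset <scaleup-machineset> --replicas=1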

