Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1766792

Summary: e2e-aws-scaleup-rhel7 constantly failing
Product: OpenShift Container Platform
Reporter: Kirsten Garrison <kgarriso>
Component: Installer
Assignee: Russell Teague <rteague>
Installer sub component: openshift-ansible
QA Contact: Russell Teague <rteague>
Status: CLOSED CURRENTRELEASE
Severity: unspecified
Priority: unspecified
CC: gpei, rteague, sdodson
Version: 4.2.0
Target Milestone: ---
Target Release: 4.3.0
Hardware: Unspecified
OS: Unspecified
Last Closed: 2019-11-18 13:27:56 UTC
Type: Bug

Description Kirsten Garrison 2019-10-29 22:03:51 UTC
Description of problem:
This job appears to be basically broken. Looking at the history, it has passed only 17 of the last 168 runs. Can we eliminate these tests until the job is actually able to run correctly? It doesn't seem like an efficient use of our resources right now.

For ref: https://prow.svc.ci.openshift.org/job-history/origin-ci-test/pr-logs/directory/pull-ci-openshift-machine-config-operator-master-e2e-aws-scaleup-rhel7?buildId=


How reproducible:
Look at most of its CI runs.

Actual results:
It always fails

Expected results:
It should generally be passing unless there is a reason for it to fail.

Comment 1 Scott Dodson 2019-10-30 13:51:54 UTC
We should fix the test rather than disabling it. Success rate was much higher before 4.3 branching, so it's very likely that something in the product has actually regressed, in addition to the higher-than-normal failure rate of this job versus pure RHCOS clusters.

Comment 2 Russell Teague 2019-11-01 20:46:47 UTC
An issue was found with the CentOS AMI that was being used for scaleup. The AMI was switched to the latest RHEL image [1], and additional repos [2] have been added to provide required packages.


[1] https://github.com/openshift/release/pull/5735
[2] https://github.com/openshift/release/pull/5742

Comment 3 Russell Teague 2019-11-08 20:08:08 UTC
The rhel7 jobs are no longer failing constantly after the above PRs merged. However, they are failing fairly regularly on "[Feature:Prometheus][Conformance] Prometheus when installed on the cluster shouldn't report any alerts in firing state apart from Watchdog [Suite:openshift/conformance/parallel/minimal]". Looking into why.

Comment 4 Russell Teague 2019-11-08 21:01:48 UTC
Sample of alerts firing:

ALERTS{alertname="MachineWithNoRunningPhase", alertstate="firing", endpoint="https", exported_namespace="openshift-machine-api", instance="10.130.0.16:8443", job="machine-api-operator", name="ci-op-yksgk3g1-881e8-mfmqf-worker-us-east-1a-centos-g9lvg", namespace="openshift-machine-api", phase="Provisioned", pod="machine-api-operator-858d4c987-8nb6j", service="machine-api-operator", severity="critical", spec_provider_id="aws:///us-east-1a/i-06377d87cdacbe120"} => 1 @[1573241866.713]

ALERTS{alertname="MachineWithoutValidNode", alertstate="firing", endpoint="https", exported_namespace="openshift-machine-api", instance="10.130.0.16:8443", job="machine-api-operator", name="ci-op-yksgk3g1-881e8-mfmqf-worker-us-east-1a-centos-g9lvg", namespace="openshift-machine-api", phase="Provisioned", pod="machine-api-operator-858d4c987-8nb6j", service="machine-api-operator", severity="critical", spec_provider_id="aws:///us-east-1a/i-06377d87cdacbe120"} => 1
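(For reference, a query along these lines would surface such alerts in Prometheus; this is a sketch, and the exact exclusion logic the conformance test uses may differ.)

```promql
# Firing alerts other than the always-firing Watchdog
ALERTS{alertstate="firing", alertname!="Watchdog"}
```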

Comment 5 Russell Teague 2019-11-08 21:48:41 UTC
The machineset we copy from us-east-1a now has replicas: 2. The scaleup playbooks only expected one replica, so they don't configure the other machine to be a node. I need to update the scaleup playbooks to account for this.
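(Roughly the shape of the copied MachineSet; this excerpt is illustrative, and the name is a placeholder, not taken from the failing job.)

```yaml
apiVersion: machine.openshift.io/v1beta1
kind: MachineSet
metadata:
  name: <cluster-id>-worker-us-east-1a
  namespace: openshift-machine-api
spec:
  replicas: 2   # scaleup playbooks assumed 1; the second machine is never configured as a node
```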