Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1766792

Summary: e2e-aws-scaleup-rhel7 constantly failing
Product: OpenShift Container Platform
Reporter: Kirsten Garrison <kgarriso>
Component: Installer
Assignee: Russell Teague <rteague>
Installer sub component: openshift-ansible
QA Contact: Russell Teague <rteague>
Status: CLOSED CURRENTRELEASE
Severity: unspecified
Priority: unspecified
CC: gpei, rteague, sdodson
Version: 4.2.0
Target Milestone: ---
Target Release: 4.3.0
Hardware: Unspecified
OS: Unspecified
Last Closed: 2019-11-18 13:27:56 UTC
Type: Bug

Description Kirsten Garrison 2019-10-29 22:03:51 UTC
Description of problem:
This job appears to be basically broken. Looking at the history, it has passed only 17 of the last 168 runs. Can we eliminate these tests until the job is actually able to run correctly? It doesn't seem like an efficient use of our resources right now.

For ref: https://prow.svc.ci.openshift.org/job-history/origin-ci-test/pr-logs/directory/pull-ci-openshift-machine-config-operator-master-e2e-aws-scaleup-rhel7?buildId=


How reproducible:
Look at most of its CI runs.

Actual results:
It always fails

Expected results:
It should generally be passing unless there is a reason for it to fail.

Comment 1 Scott Dodson 2019-10-30 13:51:54 UTC
We should fix the test rather than disabling it. Success rate was much higher before 4.3 branching, so it's very likely that something in the product has actually regressed, in addition to the higher-than-normal failure rate of this job versus pure RHCOS clusters.

Comment 2 Russell Teague 2019-11-01 20:46:47 UTC
An issue was found with the CentOS AMI that was being used for scaleup. The AMI was switched to the latest RHEL image [1], and additional repos [2] have been added to provide required packages.


[1] https://github.com/openshift/release/pull/5735
[2] https://github.com/openshift/release/pull/5742

Comment 3 Russell Teague 2019-11-08 20:08:08 UTC
The rhel7 jobs are no longer failing constantly after the above PRs merged. However, they are failing fairly regularly on "[Feature:Prometheus][Conformance] Prometheus when installed on the cluster shouldn't report any alerts in firing state apart from Watchdog [Suite:openshift/conformance/parallel/minimal]". Looking into why.

Comment 4 Russell Teague 2019-11-08 21:01:48 UTC
Sample of alerts firing:

ALERTS{alertname="MachineWithNoRunningPhase", alertstate="firing", endpoint="https", exported_namespace="openshift-machine-api", instance="10.130.0.16:8443", job="machine-api-operator", name="ci-op-yksgk3g1-881e8-mfmqf-worker-us-east-1a-centos-g9lvg", namespace="openshift-machine-api", phase="Provisioned", pod="machine-api-operator-858d4c987-8nb6j", service="machine-api-operator", severity="critical", spec_provider_id="aws:///us-east-1a/i-06377d87cdacbe120"} => 1 @[1573241866.713]

ALERTS{alertname="MachineWithoutValidNode", alertstate="firing", endpoint="https", exported_namespace="openshift-machine-api", instance="10.130.0.16:8443", job="machine-api-operator", name="ci-op-yksgk3g1-881e8-mfmqf-worker-us-east-1a-centos-g9lvg", namespace="openshift-machine-api", phase="Provisioned", pod="machine-api-operator-858d4c987-8nb6j", service="machine-api-operator", severity="critical", spec_provider_id="aws:///us-east-1a/i-06377d87cdacbe120"} => 1
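(For reference, a query along these lines would surface such alerts in Prometheus; this is a sketch, and the exact exclusion logic the conformance test uses may differ.)

```promql
# Firing alerts other than the always-firing Watchdog
ALERTS{alertstate="firing", alertname!="Watchdog"}
```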

Comment 5 Russell Teague 2019-11-08 21:48:41 UTC
The machineset we copy from us-east-1a now has replicas: 2. The scaleup playbooks only expected one replica, so they don't configure the other machine to be a node. I need to update the scaleup playbooks to account for this.
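(Roughly the shape of the copied MachineSet; this excerpt is illustrative, and the name is a placeholder, not taken from the failing job.)

```yaml
apiVersion: machine.openshift.io/v1beta1
kind: MachineSet
metadata:
  name: <cluster-id>-worker-us-east-1a
  namespace: openshift-machine-api
spec:
  replicas: 2   # scaleup playbooks assumed 1; the second machine is never configured as a node
```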