Description of problem: This job appears to be basically broken. Looking at the history, only 17 of the last 168 runs have passed. Can we eliminate these tests until the job is actually able to run correctly? It does not seem like an efficient use of our resources right now.

For reference: https://prow.svc.ci.openshift.org/job-history/origin-ci-test/pr-logs/directory/pull-ci-openshift-machine-config-operator-master-e2e-aws-scaleup-rhel7?buildId=

How reproducible: Look at most of its CI runs.

Actual results: It fails the vast majority of the time.

Expected results: It should generally pass unless there is a real product or test reason for it to fail.
We should fix the test rather than disabling it. The success rate was much higher before the 4.3 branching, so it is very likely that something in the product has actually regressed, on top of this job's higher-than-normal failure rate compared with pure RHCOS clusters.
An issue was found with the CentOS AMI that was being used for scaleup. The AMI was switched to the latest RHEL image [1], and additional repos [2] were added to provide the required packages.

[1] https://github.com/openshift/release/pull/5735
[2] https://github.com/openshift/release/pull/5742
The rhel7 jobs are no longer failing constantly now that the above PRs have merged. However, they are still failing fairly regularly on "[Feature:Prometheus][Conformance] Prometheus when installed on the cluster shouldn't report any alerts in firing state apart from Watchdog [Suite:openshift/conformance/parallel/minimal]". Looking into why.
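For reference, that conformance test essentially checks that no alerts other than Watchdog are in the firing state. A rough way to reproduce the check by hand (a sketch only; it assumes access to the cluster's Prometheus query UI or API in openshift-monitoring) is to run a PromQL query such as:

    ALERTS{alertstate="firing", alertname!="Watchdog"}

Any series returned by that query, like the sample below, is enough to fail the test.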
Sample of alerts firing:

ALERTS{alertname="MachineWithNoRunningPhase", alertstate="firing", endpoint="https", exported_namespace="openshift-machine-api", instance="10.130.0.16:8443", job="machine-api-operator", name="ci-op-yksgk3g1-881e8-mfmqf-worker-us-east-1a-centos-g9lvg", namespace="openshift-machine-api", phase="Provisioned", pod="machine-api-operator-858d4c987-8nb6j", service="machine-api-operator", severity="critical", spec_provider_id="aws:///us-east-1a/i-06377d87cdacbe120"} => 1 @[1573241866.713]

ALERTS{alertname="MachineWithoutValidNode", alertstate="firing", endpoint="https", exported_namespace="openshift-machine-api", instance="10.130.0.16:8443", job="machine-api-operator", name="ci-op-yksgk3g1-881e8-mfmqf-worker-us-east-1a-centos-g9lvg", namespace="openshift-machine-api", phase="Provisioned", pod="machine-api-operator-858d4c987-8nb6j", service="machine-api-operator", severity="critical", spec_provider_id="aws:///us-east-1a/i-06377d87cdacbe120"} => 1
The machineset we copy from us-east-1a now has replicas: 2. The scaleup playbooks only expected one replica, so they never configure the second machine as a node, which leaves it stuck in the Provisioned phase and triggers the alerts above. I need to update the scaleup playbooks to account for this.
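A quick way to confirm this state on an affected cluster (a sketch, assuming the standard machine-api objects in the openshift-machine-api namespace) is to compare the machineset replica counts with the machines that actually got linked to nodes:

    # Show desired vs. ready replicas for each machineset
    oc get machinesets -n openshift-machine-api

    # List each machine with its phase and linked node; a machine with a
    # phase of "Provisioned" and no nodeRef is one the scaleup playbooks
    # never configured as a node
    oc get machines -n openshift-machine-api \
      -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.phase}{"\t"}{.status.nodeRef.name}{"\n"}{end}'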