Bug 1766792
| Summary: | e2e-aws-scaleup-rhel7 constantly failing | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Kirsten Garrison <kgarriso> |
| Component: | Installer | Assignee: | Russell Teague <rteague> |
| Installer sub component: | openshift-ansible | QA Contact: | Russell Teague <rteague> |
| Status: | CLOSED CURRENTRELEASE | Docs Contact: | |
| Severity: | unspecified | | |
| Priority: | unspecified | CC: | gpei, rteague, sdodson |
| Version: | 4.2.0 | | |
| Target Milestone: | --- | | |
| Target Release: | 4.3.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2019-11-18 13:27:56 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Kirsten Garrison
2019-10-29 22:03:51 UTC
We should fix the test rather than disabling it. The success rate was much higher before 4.3 branching, so it's very likely that something in the product has actually regressed, in addition to the higher-than-normal failure rate of this job versus pure RHCOS clusters.

An issue was found with the CentOS AMI that was being used for scaleup. The AMI was switched to the latest RHEL image [1], and additional repos [2] have been added to provide the required packages.

[1] https://github.com/openshift/release/pull/5735
[2] https://github.com/openshift/release/pull/5742

The rhel7 jobs are no longer failing constantly after the above PRs merged. However, they are failing fairly regularly on "[Feature:Prometheus][Conformance] Prometheus when installed on the cluster shouldn't report any alerts in firing state apart from Watchdog [Suite:openshift/conformance/parallel/minimal]". Looking into why. Sample of alerts firing:
```
ALERTS{alertname="MachineWithNoRunningPhase", alertstate="firing", endpoint="https", exported_namespace="openshift-machine-api", instance="10.130.0.16:8443", job="machine-api-operator", name="ci-op-yksgk3g1-881e8-mfmqf-worker-us-east-1a-centos-g9lvg", namespace="openshift-machine-api", phase="Provisioned", pod="machine-api-operator-858d4c987-8nb6j", service="machine-api-operator", severity="critical", spec_provider_id="aws:///us-east-1a/i-06377d87cdacbe120"} => 1 @[1573241866.713]
ALERTS{alertname="MachineWithoutValidNode", alertstate="firing", endpoint="https", exported_namespace="openshift-machine-api", instance="10.130.0.16:8443", job="machine-api-operator", name="ci-op-yksgk3g1-881e8-mfmqf-worker-us-east-1a-centos-g9lvg", namespace="openshift-machine-api", phase="Provisioned", pod="machine-api-operator-858d4c987-8nb6j", service="machine-api-operator", severity="critical", spec_provider_id="aws:///us-east-1a/i-06377d87cdacbe120"} => 1
```
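For illustration, the check the conformance test performs can be sketched as below: collect alerts in firing state and fail if anything other than the always-firing Watchdog appears. This is a minimal sketch, not the actual test code; the dicts mirror the labels on the two alerts quoted above, reduced to the fields the check needs.

```python
# Sketch of the Prometheus conformance check: no alert other than
# Watchdog may be in firing state. Sample data mirrors the alerts above.

firing_alerts = [
    {"alertname": "MachineWithNoRunningPhase", "alertstate": "firing", "severity": "critical"},
    {"alertname": "MachineWithoutValidNode", "alertstate": "firing", "severity": "critical"},
    {"alertname": "Watchdog", "alertstate": "firing", "severity": "none"},
]

def unexpected_firing(alerts):
    """Return firing alerts other than the always-firing Watchdog."""
    return [a for a in alerts
            if a["alertstate"] == "firing" and a["alertname"] != "Watchdog"]

# The two machine-api alerts above would fail the test:
print([a["alertname"] for a in unexpected_firing(firing_alerts)])
```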
The machineset we copy from us-east-1a now has replicas: 2. The scaleup playbooks only expected one machine, so they don't configure the second machine as a node. I need to update the scaleup playbooks to account for this.
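For context, a MachineSet with two replicas looks roughly like the fragment below (the metadata names are hypothetical, not taken from the failing job). The point is that the scaleup playbooks need to enumerate every machine the set creates rather than assuming a single one.

```yaml
# Illustrative MachineSet fragment (names are hypothetical).
# The scaleup playbooks previously assumed a single replica, so only
# one machine created by the set was configured as a node.
apiVersion: machine.openshift.io/v1beta1
kind: MachineSet
metadata:
  name: worker-us-east-1a-centos      # hypothetical name
  namespace: openshift-machine-api
spec:
  replicas: 2   # both machines must be handled by the scaleup playbooks
```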