Bug 1974283
| Summary: | [oVirt] 4.7 -> 4.8 upgrade test "Cluster should remain functional during upgrade [Disruptive]" fails due to "AggregatedAPIErrors" | ||
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Gal Zaidman <gzaidman> |
| Component: | kube-apiserver | Assignee: | Stefan Schimanski <sttts> |
| Status: | CLOSED INSUFFICIENT_DATA | QA Contact: | Ke Wang <kewang> |
| Severity: | low | Docs Contact: | |
| Priority: | low | ||
| Version: | 4.9 | CC: | aos-bugs, mfojtik, sttts, xxia |
| Target Milestone: | --- | Flags: | gzaidman: needinfo? |
| Target Release: | --- | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | LifecycleStale | ||
| Fixed In Version: | Doc Type: | If docs needed, set a value | |
| Doc Text: | Story Points: | --- | |
| Clone Of: | Environment: | ||
| Last Closed: | 2021-08-18 12:04:57 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | Category: | --- | |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
Looking at 4.8->4.9 upgrade jobs it seems that the issue is gone, but still remains on 4.7->4.8 jobs.

This bug hasn't had any activity in the last 30 days. Maybe the problem got resolved, was a duplicate of something else, or became less pressing for some reason - or maybe it's still relevant but just hasn't been looked at yet. As such, we're marking this bug as "LifecycleStale" and decreasing the severity/priority. If you have further information on the current state of the bug, please update it, otherwise this bug can be closed in about 7 days. The information can be, for example, that the problem still occurs, that you still want the feature, that more information is needed, or that the bug is (for whatever reason) no longer relevant. Additionally, you can add LifecycleFrozen into Keywords if you think this bug should never be marked as stale. Please consult with bug assignee before you do that.

As there is nobody triaging this and it is not proven to be a general problem rather than oVirt and/or storage related, closing.
Description of problem:

On oVirt CI we are seeing the upgrade fail due to "AggregatedAPIErrors" alerts. The alerts that fire differ between runs; the job dashboard is in [1]. Some examples:

Job [2]:
"""
alert AggregatedAPIErrors fired for 30 seconds with labels: {name="v1beta1.metrics.k8s.io", namespace="default", severity="warning"}
"""

Job [3]:
"""
alert AggregatedAPIErrors fired for 120 seconds with labels: {name="v1.packages.operators.coreos.com", namespace="default", severity="warning"}
alert AggregatedAPIErrors fired for 30 seconds with labels: {name="v1.oauth.openshift.io", namespace="default", severity="warning"}
alert AggregatedAPIErrors fired for 90 seconds with labels: {name="v1.build.openshift.io", namespace="default", severity="warning"}
alert AggregatedAPIErrors fired for 90 seconds with labels: {name="v1.project.openshift.io", namespace="default", severity="warning"}
"""

We had a couple of green jobs before 16.6. Those jobs used an 800GB NFS storage domain for the VMs, which is a bit better than the current 500GB iSCSI one, but even with the better storage we saw more failed jobs than passing jobs and still hit "AggregatedAPIErrors", so I'm not sure that storage is the only reason.

I loaded the job metrics into Prometheus and ran:
- histogram_quantile(0.99, rate(etcd_disk_wal_fsync_duration_seconds_bucket[5m]))
- histogram_quantile(0.99, rate(etcd_disk_backend_commit_duration_seconds_bucket[5m]))

Although there is a spike at the end of the job, the results look OK overall and there is not much difference between green and red jobs. Also, on 4.6->4.7 we were very stable with 300GB iSCSI storage, so I wonder whether anything changed that would now require much stronger storage.
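To compare which aggregated APIs fire across runs, the alert lines quoted above could be tallied per API service. The following is only a minimal sketch: it assumes the job output keeps the exact line format shown in the Job [3] excerpt, and the names `summarize_alerts`, `ALERT_RE`, `LABEL_RE`, and `JOB_3_OUTPUT` are hypothetical helpers, not part of any OpenShift or CI tooling.

```python
import re
from collections import defaultdict

# Assumed line format, taken from the excerpts in this report:
#   alert <AlertName> fired for <N> seconds with labels: {key="value", ...}
ALERT_RE = re.compile(
    r'alert (?P<alert>\S+) fired for (?P<seconds>\d+) seconds '
    r'with labels: \{(?P<labels>[^}]*)\}'
)
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def summarize_alerts(log_text):
    """Return {alert name: {API service name: total seconds firing}}."""
    totals = defaultdict(lambda: defaultdict(int))
    for match in ALERT_RE.finditer(log_text):
        labels = dict(LABEL_RE.findall(match.group("labels")))
        api_service = labels.get("name", "<unknown>")
        totals[match.group("alert")][api_service] += int(match.group("seconds"))
    # Convert the nested defaultdicts to plain dicts for easier comparison.
    return {alert: dict(per_api) for alert, per_api in totals.items()}


# Alert lines copied from the Job [3] excerpt in this report.
JOB_3_OUTPUT = '''
alert AggregatedAPIErrors fired for 120 seconds with labels: {name="v1.packages.operators.coreos.com", namespace="default", severity="warning"}
alert AggregatedAPIErrors fired for 30 seconds with labels: {name="v1.oauth.openshift.io", namespace="default", severity="warning"}
alert AggregatedAPIErrors fired for 90 seconds with labels: {name="v1.build.openshift.io", namespace="default", severity="warning"}
alert AggregatedAPIErrors fired for 90 seconds with labels: {name="v1.project.openshift.io", namespace="default", severity="warning"}
'''

if __name__ == "__main__":
    for alert, per_api in summarize_alerts(JOB_3_OUTPUT).items():
        # Print the longest-firing API services first.
        for api_service, seconds in sorted(per_api.items(), key=lambda kv: -kv[1]):
            print(f"{alert}\t{api_service}\t{seconds}s")
```

Running the same function over the output of several green and red jobs would show whether the same API services recur or whether the failures are spread evenly, which may help separate a storage-wide slowdown from a problem in one aggregated API.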
[1] https://testgrid.k8s.io/redhat-openshift-ocp-release-4.8-informing#periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-ovirt-upgrade&sort-by-failures=&width=20
[2] https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-ovirt-upgrade/1406371256583852032
[3] https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-ovirt-upgrade/1406733759566319616