Bug 1974283 - [oVirt] 4.7 -> 4.8 upgrade test "Cluster should remain functional during upgrade [Disruptive]" fails due to "AggregatedAPIErrors" [NEEDINFO]
Keywords:
Status: CLOSED INSUFFICIENT_DATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: kube-apiserver
Version: 4.9
Hardware: Unspecified
OS: Unspecified
Priority: low
Severity: low
Target Milestone: ---
Target Release: ---
Assignee: Stefan Schimanski
QA Contact: Ke Wang
URL:
Whiteboard: LifecycleStale
Depends On:
Blocks:
 
Reported: 2021-06-21 09:41 UTC by Gal Zaidman
Modified: 2021-08-18 12:04 UTC
CC List: 4 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-08-18 12:04:57 UTC
Target Upstream Version:
Embargoed:
gzaidman: needinfo?



Description Gal Zaidman 2021-06-21 09:41:55 UTC
Description of problem:

On oVirt CI we are seeing the upgrade fail due to "AggregatedAPIErrors" alerts.
The affected aggregated APIs differ between runs; see the job dashboard at [1].
Some examples:
Job[2]:
"""
alert AggregatedAPIErrors fired for 30 seconds with labels: {name="v1beta1.metrics.k8s.io", namespace="default", severity="warning"}
"""

Job[3]:
"""
alert AggregatedAPIErrors fired for 120 seconds with labels: {name="v1.packages.operators.coreos.com", namespace="default", severity="warning"}
alert AggregatedAPIErrors fired for 30 seconds with labels: {name="v1.oauth.openshift.io", namespace="default", severity="warning"}
alert AggregatedAPIErrors fired for 90 seconds with labels: {name="v1.build.openshift.io", namespace="default", severity="warning"}
alert AggregatedAPIErrors fired for 90 seconds with labels: {name="v1.project.openshift.io", namespace="default", severity="warning"}
"""

We had a couple of green jobs before 16.6; those jobs used an 800GB NFS storage domain for the VMs, which is a bit better than the current 500GB iSCSI one. But even with the better storage we saw more failed jobs than passing ones and still hit "AggregatedAPIErrors", so I'm not sure storage is the only reason.

I loaded the job metrics into Prometheus and ran:
- histogram_quantile(0.99, rate(etcd_disk_wal_fsync_duration_seconds_bucket[5m]))
- histogram_quantile(0.99, rate(etcd_disk_backend_commit_duration_seconds_bucket[5m]))

and although there is a spike at the end of the job, it looks OK overall and there is not much difference between green and red jobs.
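
In case anyone wants to reproduce this, a sketch of running the same queries against a Prometheus instance loaded with the gathered job metrics, via the HTTP API (the localhost:9090 address is just an example):
"""
# p99 etcd WAL fsync latency
curl -s 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=histogram_quantile(0.99, rate(etcd_disk_wal_fsync_duration_seconds_bucket[5m]))'
# p99 etcd backend commit latency
curl -s 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=histogram_quantile(0.99, rate(etcd_disk_backend_commit_duration_seconds_bucket[5m]))'
"""
The /api/v1/query_range endpoint can be used instead to graph the values over the whole run rather than a single instant.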

Also, on 4.6 -> 4.7 we were very stable with 300GB iSCSI storage, so I wonder whether anything changed such that much stronger storage is now required.

[1] https://testgrid.k8s.io/redhat-openshift-ocp-release-4.8-informing#periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-ovirt-upgrade&sort-by-failures=&width=20
[2] https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-ovirt-upgrade/1406371256583852032
[3] https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-ovirt-upgrade/1406733759566319616

Comment 3 Gal Zaidman 2021-07-19 07:28:26 UTC
Looking at 4.8 -> 4.9 upgrade jobs, it seems the issue is gone, but it still remains on 4.7 -> 4.8 jobs.

Comment 4 Michal Fojtik 2021-08-18 07:53:25 UTC
This bug hasn't had any activity in the last 30 days. Maybe the problem got resolved, was a duplicate of something else, or became less pressing for some reason - or maybe it's still relevant but just hasn't been looked at yet. As such, we're marking this bug as "LifecycleStale" and decreasing the severity/priority. If you have further information on the current state of the bug, please update it, otherwise this bug can be closed in about 7 days. The information can be, for example, that the problem still occurs, that you still want the feature, that more information is needed, or that the bug is (for whatever reason) no longer relevant. Additionally, you can add LifecycleFrozen into Keywords if you think this bug should never be marked as stale. Please consult with bug assignee before you do that.

Comment 5 Stefan Schimanski 2021-08-18 12:04:57 UTC
As nobody is triaging this and it has not been proven to be a general problem rather than something oVirt- and/or storage-related, closing.

