Bug 1920610 - e2e-aws-4.7-cnv consistently failing on Hyperconverged Cluster Operator
Summary: e2e-aws-4.7-cnv consistently failing on Hyperconverged Cluster Operator
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Container Native Virtualization (CNV)
Classification: Red Hat
Component: Installation
Version: 4.9.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Target Release: 2.6.0
Assignee: Nico Schieder
QA Contact: Inbar Rose
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2021-01-26 17:54 UTC by jamo luhrsen
Modified: 2021-03-10 11:24 UTC
4 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-03-10 11:23:40 UTC
Target Upstream Version:
Embargoed:


Attachments


Links
System ID Private Priority Status Summary Last Updated
Github openshift release pull 15315 0 None closed 4.7 hotfix for failing HCO tests 2021-02-18 19:07:14 UTC
Red Hat Product Errata RHSA-2021:0799 0 None None None 2021-03-10 11:24:48 UTC

Description jamo luhrsen 2021-01-26 17:54:40 UTC
Description of problem:

This failure showed up while monitoring CI signal for the 4.7 release and appears to be a new,
consistent failure since roughly 2021-01-05. There has been one passing job since then, but 9 of 10
jobs failed the same way.

job:
https://testgrid.k8s.io/redhat-openshift-ocp-release-4.7-informing#canary-release-openshift-origin-installer-e2e-aws-4.7-cnv

There is some discussion in this Slack thread:
https://coreos.slack.com/archives/C01CQA76KMX/p1611618946032100

but to also capture that info here:


The build log shows this error 35 times:

{"level":"error","ts":1611609679.651495,"logger":"controller_hyperconverged","msg":"Failed to update HCO Status","Request.Namespace":"kubevirt-hyperconverged","Request.Name":"kubevirt-hyperconverged","error":"Operation cannot be fulfilled on hyperconvergeds.hco.kubevirt.io \"kubevirt-hyperconverged\": the object has been modified; please apply your changes to the latest version and try again","stacktrace":"github.com/go-logr/zapr.(*zapLogger).Error\n\t/go/src/github.com/kubevirt/hyperconverged-cluster-operator/vendor/github.com/go-logr/zapr/zapr.go:132\ngithub.com/kubevirt/hyperconverged-cluster-operator/pkg/controller/hyperconverged.(*ReconcileHyperConverged).Reconcile\n\t/go/src/github.com/kubevirt/hyperconverged-cluster-operator/pkg/controller/hyperconverged/hyperconverged_controller.go:258\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/src/github.com/kubevirt/hyperconverged-cluster-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:263\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/src/github.com/kubevirt/hyperconverged-cluster-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:235\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func1.1\n\t/go/src/github.com/kubevirt/hyperconverged-cluster-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:198\nk8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext.func1\n\t/go/src/github.com/kubevirt/hyperconverged-cluster-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:185\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1\n\t/go/src/github.com/kubevirt/hyperconverged-cluster-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:155\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil\n\t/go/src/github.com/kubevirt/hyperconverged-cluster-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:156\nk8s.io/apimachinery/pkg/util/wait.JitterUntil\n\t/go/src/github.com/kubevirt/hyperconverged-cluster-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133\nk8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext\n\t/go/src/github.com/kubevirt/hyperconverged-cluster-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:185\nk8s.io/apimachinery/pkg/util/wait.UntilWithContext\n\t/go/src/github.com/kubevirt/hyperconverged-cluster-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:99"}
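For context, the "the object has been modified; please apply your changes to the latest version and try again" message is the Kubernetes API server's optimistic-concurrency conflict, not a data error: the operator read the HyperConverged object, another writer updated it in the meantime, and the status update then carried a stale resourceVersion. A minimal sketch of that mechanism and the usual re-read-and-retry remedy (in Python for brevity; FakeAPIServer and update_with_retry are invented names for illustration, loosely modeled on client-go's retry.RetryOnConflict, and are not the actual HCO code):

```python
class Conflict(Exception):
    """Stands in for a 409 Conflict from the Kubernetes API server."""

class FakeAPIServer:
    """Toy stand-in for the API server's optimistic concurrency control."""
    def __init__(self):
        self.version = 1          # plays the role of resourceVersion
        self.status = {}

    def get(self):
        return self.version, dict(self.status)

    def update_status(self, version, status):
        if version != self.version:   # stale resourceVersion -> 409
            raise Conflict("the object has been modified; please apply "
                           "your changes to the latest version and try again")
        self.status = status
        self.version += 1

def update_with_retry(server, mutate, attempts=5):
    """Re-read and retry on conflict, in the spirit of retry.RetryOnConflict."""
    for _ in range(attempts):
        version, status = server.get()
        mutate(status)
        try:
            server.update_status(version, status)
            return True
        except Conflict:
            continue              # another writer won the race; re-read
    return False

server = FakeAPIServer()
v, _ = server.get()
server.update_status(v, {"phase": "Deploying"})     # another controller writes first

try:
    server.update_status(v, {"phase": "Deployed"})  # our version is now stale
except Conflict as e:
    print("conflict:", e)

ok = update_with_retry(server, lambda s: s.update(phase="Deployed"))
print(ok, server.status)   # -> True {'phase': 'Deployed'}
```

An occasional conflict that a retry absorbs is normal controller behavior; the concern in this bug is the volume (35 occurrences) coinciding with the test failures.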


-  failure to set NetworkAddonsConfig cluster status [0], but it appears to eventually succeed and is probably working as expected (as described in the doc linked in that thread)

-  lots of controller-runtime.healthz failures, though those may be expected [1]

-  the nmstatectl.py script is failing [2] with a JSON schema validation error: ValidationError: {'name': 'vxlan0', 'type': 'ovs-port', 'state': 'down', 'ipv4': {'enabled': False}, 'ipv6': {'enabled': False}, 'lldp': {'enabled': False}} is not valid under any of the given schemas\n\nFailed validating 'oneOf' in schema['properties']['interfaces']['items']['allOf'][5]:

[0] https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/canary-release-openshi[…]95f4c-4z6tj_cluster-network-addons-operator.log
[1] https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/canary-release-openshi[…]fbc6b-b2slk_hyperconverged-cluster-operator.log
[2] https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/canary-release-openshi[…]erged_nmstate-handler-nfwbh_nmstate-handler.log
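On the nmstate ValidationError in [2]: JSON Schema's oneOf keyword requires an instance to match exactly one subschema, so an interface dict whose fields match none of the per-type subschemas fails with "is not valid under any of the given schemas". A rough, hypothetical miniature of that check (the subschemas below are invented for illustration and are not nmstate's real ones):

```python
def matches(instance, schema):
    # A "schema" here is just required key -> expected value pairs.
    return all(instance.get(k) == v for k, v in schema.items())

def one_of(instance, subschemas):
    # Mimics JSON Schema oneOf: exactly one subschema must match.
    hits = sum(matches(instance, s) for s in subschemas)
    if hits == 0:
        raise ValueError(f"{instance!r} is not valid under any of the given schemas")
    if hits > 1:
        raise ValueError(f"{instance!r} is valid under more than one schema")
    return True

# The interface dict from the nmstate log, abbreviated:
iface = {"name": "vxlan0", "type": "ovs-port", "state": "down"}

# Invented per-type subschemas, standing in for nmstate's real ones:
subschemas = [{"type": "vxlan"}, {"type": "ethernet"}]

try:
    one_of(iface, subschemas)
except ValueError as e:
    print("ValidationError:", e)   # type "ovs-port" matches neither subschema
```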


Version-Release number of selected component (if applicable):

4.7

I do not see this trouble in 4.6:
https://testgrid.k8s.io/redhat-openshift-ocp-release-4.6-informing#canary-release-openshift-origin-installer-e2e-aws-4.6-cnv


How reproducible:

10 of 11 runs have failed, but one failure was infra-related, so 9 of 10 runs hit this problem.

Comment 1 jamo luhrsen 2021-01-26 17:55:55 UTC
I don't really have a good handle on which component to use for this bug. I assigned it to Networking/openshift-sdn because I saw some networking-related logs
in the errors I found. It may need to be moved to another component if someone knows better.

Comment 2 jamo luhrsen 2021-01-26 23:40:40 UTC
@

Comment 3 jamo luhrsen 2021-01-26 23:43:39 UTC
Current theory from @

Comment 4 jamo luhrsen 2021-01-26 23:44:45 UTC
Current theory from @stirabos is that a regression was introduced with https://github.com/kubevirt/hyperconverged-cluster-operator/pull/1047; they will try to fix it ASAP.

Comment 5 Petr Horáček 2021-02-01 16:13:56 UTC
Moving to installation based on https://bugzilla.redhat.com/show_bug.cgi?id=1920610#c4

Comment 6 Pan Ousley 2021-02-12 16:32:50 UTC
@stirabos Does this require a release note for 2.6? It came up in my search because of the requires_release_note? flag but it looks like it's fixed. Thanks!

Comment 7 Nico Schieder 2021-02-15 09:56:06 UTC
(In reply to Pan Ousley from comment #6)
> @stirabos Does this require a release note for 2.6? It came up in
> my search because of the requires_release_note? flag but it looks like it's
> fixed. Thanks!

Hi Pan,
this was just a small regression hotfix.
We don't think this needs a release note for 2.6.

But thanks for checking back! :)

Comment 8 Pan Ousley 2021-02-18 19:07:48 UTC
Thanks, Nico!

Comment 11 errata-xmlrpc 2021-03-10 11:23:40 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Virtualization 2.6.0 security and bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:0799

