1967388 – 4.7 network operator degrades pushing v1alpha1 FlowSchema to 4.8 API-servers

Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Networking
Version:	4.7
Hardware:	Unspecified
OS:	Unspecified
Priority:	urgent
Severity:	urgent
Target Milestone:	---
Target Release:	4.7.z
Assignee:	Alexander Constantinescu
QA Contact:	Ying Wang
Duplicates (1):	1971835
Depends On:	1913399

Reported:	2021-06-03 04:43 UTC by W. Trevor King
Modified:	2021-06-29 04:20 UTC
CC List:	2 users
Last Closed:	2021-06-29 04:19:45 UTC

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Github	openshift cluster-network-operator pull 1118	0	None	open	[release-4.7] Bug 1967388: promote flowcontrol to v1beta1	2021-06-04 13:59:58 UTC
Red Hat Product Errata	RHBA-2021:2502	0	None	None	None	2021-06-29 04:20:08 UTC

Description W. Trevor King 2021-06-03 04:43:08 UTC

This is seen most dramatically in [1], where a stuck monitoring operator extends the period during which the 4.7 network operator overlaps with 4.8 API servers.  The network operator condition is:

  Operator upgrade network	0s
  Failed to upgrade network, operator was degraded (ApplyOperatorConfig): Error while updating operator configuration: could not apply (flowcontrol.apiserver.k8s.io/v1alpha1, Kind=FlowSchema) /openshift-ovn-kubernetes: could not create (flowcontrol.apiserver.k8s.io/v1alpha1, Kind=FlowSchema) /openshift-ovn-kubernetes: the server could not find the requested resource

The Kube API-server completed its update at 04:45:43Z:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-ovn-upgrade/1399935465968111616/artifacts/e2e-aws-ovn-upgrade/openshift-e2e-test/artifacts/e2e.log | grep 'clusteroperator/.*versions:'
Jun 02 04:30:36.290 I clusteroperator/etcd versions: raw-internal 4.7.13 -> 4.8.0-0.ci-2021-06-01-222341
Jun 02 04:34:31.419 I clusteroperator/etcd versions: operator 4.7.13 -> 4.8.0-0.ci-2021-06-01-222341, etcd 4.7.13 -> 4.8.0-0.ci-2021-06-01-222341
Jun 02 04:34:53.817 I clusteroperator/kube-apiserver versions: raw-internal 4.7.13 -> 4.8.0-0.ci-2021-06-01-222341
Jun 02 04:45:43.728 I clusteroperator/kube-apiserver versions: kube-apiserver 1.20.0-beta.2 -> 1.21.1, operator 4.7.13 -> 4.8.0-0.ci-2021-06-01-222341
...

Which is pretty much when the network operator started complaining:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-ovn-upgrade/1399935465968111616/artifacts/e2e-aws-ovn-upgrade/openshift-e2e-test/artifacts/e2e.log | grep clusteroperator/network
Jun 02 04:45:40.173 - 8147s E clusteroperator/network condition/Degraded status/True reason/Error while updating operator configuration: could not apply (flowcontrol.apiserver.k8s.io/v1alpha1, Kind=FlowSchema) /openshift-ovn-kubernetes: could not create (flowcontrol.apiserver.k8s.io/v1alpha1, Kind=FlowSchema) /openshift-ovn-kubernetes: the server could not find the requested resource
Jun 02 04:45:40.173 E clusteroperator/network condition/Degraded status/True reason/ApplyOperatorConfig changed: Error while updating operator configuration: could not apply (flowcontrol.apiserver.k8s.io/v1alpha1, Kind=FlowSchema) /openshift-ovn-kubernetes: could not create (flowcontrol.apiserver.k8s.io/v1alpha1, Kind=FlowSchema) /openshift-ovn-kubernetes: the server could not find the requested resource
[bz-Networking] clusteroperator/network should not change condition/Degraded

The 4.8 network operator moved to v1beta1 in bug 1913399 [2].  I see some discussion of 4.7 handling in [3], but I don't see this new-API-server vs. old-network-operator angle discussed.  Ideally, the 4.7 network operator would attempt to write either v1alpha1 or v1beta1 and, if the server does not recognize that version, fall back to the other.  That way the 4.7 network operator would be compatible with 4.6, 4.7, and 4.8 Kube API-servers.

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-ovn-upgrade/1399935465968111616
[2]: https://github.com/openshift/cluster-network-operator/pull/937
[3]: https://github.com/openshift/cluster-network-operator/pull/920#discussion_r552553510
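The fallback idea above could be sketched roughly as follows.  This is only an illustration of the try-one-version-then-the-other logic, not the network operator's code and not necessarily the fix that shipped.  The names `apply_with_fallback`, `ResourceNotFound`, and `fake_server` are hypothetical, and the fake server stands in for a real API client.

```python
GROUP = "flowcontrol.apiserver.k8s.io"


class ResourceNotFound(Exception):
    """Hypothetical stand-in for 'the server could not find the requested resource'."""


def apply_with_fallback(apply, manifest, versions=("v1beta1", "v1alpha1")):
    """Try each API version in order and return the first one the server accepts.

    `apply` is an assumed callable that raises ResourceNotFound when the
    server does not serve the object's apiVersion.
    """
    last_err = ResourceNotFound("no versions tried")
    for version in versions:
        candidate = dict(manifest, apiVersion=f"{GROUP}/{version}")
        try:
            apply(candidate)
            return version
        except ResourceNotFound as err:
            # Version not recognized by this server; try the next one.
            last_err = err
    raise last_err


def fake_server(served_versions):
    """Build a fake `apply` that only accepts the given versions (illustration only)."""
    def apply(obj):
        version = obj["apiVersion"].rsplit("/", 1)[1]
        if version not in served_versions:
            raise ResourceNotFound(obj["apiVersion"])
    return apply


flow_schema = {"kind": "FlowSchema", "metadata": {"name": "openshift-ovn-kubernetes"}}
# A server that only serves v1beta1 (as the 4.8 API-server in this bug appears to):
newer = apply_with_fallback(fake_server({"v1beta1"}), flow_schema)
# A server assumed to serve only v1alpha1:
older = apply_with_fallback(fake_server({"v1alpha1"}), flow_schema)
```

Any error other than not-found is deliberately left to propagate, so the fallback does not mask unrelated failures.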

Comment 1 W. Trevor King 2021-06-03 04:47:38 UTC

Seems popular:

$ w3m -dump -cols 200 'https://search.ci.openshift.org/?maxAge=48h&type=junit&search=flowcontrol.apiserver.k8s.io/v1alpha1.*FlowSchema.*the+server+could+not+find+the+requested+resource' | grep 'failures match' | sort
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-from-stable-4.7-from-stable-4.6-e2e-aws-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-compact-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-ovn-upgrade (all) - 37 runs, 100% failed, 95% of failures match = 95% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-ovn-upgrade-rollback (all) - 2 runs, 100% failed, 50% of failures match = 50% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-upgrade (all) - 34 runs, 94% failed, 103% of failures match = 97% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-azure-ovn-upgrade (all) - 8 runs, 100% failed, 50% of failures match = 50% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-azure-upgrade (all) - 2 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-gcp-ovn-upgrade (all) - 8 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-gcp-upgrade (all) - 2 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-ovirt-upgrade (all) - 5 runs, 100% failed, 80% of failures match = 80% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-vsphere-upgrade (all) - 2 runs, 100% failed, 50% of failures match = 50% impact
periodic-ci-openshift-release-master-ci-4.9-upgrade-from-stable-4.8-from-stable-4.7-e2e-aws-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-nightly-4.8-upgrade-from-stable-4.7-e2e-aws-upgrade (all) - 12 runs, 100% failed, 92% of failures match = 92% impact
periodic-ci-openshift-release-master-nightly-4.8-upgrade-from-stable-4.7-e2e-metal-ipi-upgrade (all) - 12 runs, 100% failed, 67% of failures match = 67% impact
pull-ci-openshift-ovn-kubernetes-master-4.8-upgrade-from-stable-4.7-e2e-aws-ovn-upgrade (all) - 34 runs, 88% failed, 97% of failures match = 85% impact
rehearse-18937-periodic-ci-openshift-release-master-ci-4.9-upgrade-from-stable-4.8-from-stable-4.7-e2e-aws-upgrade (all) - 2 runs, 100% failed, 50% of failures match = 50% impact
release-openshift-ocp-installer-upgrade-remote-libvirt-ppc64le-4.7-to-4.8 (all) - 4 runs, 100% failed, 100% of failures match = 100% impact
release-openshift-ocp-installer-upgrade-remote-libvirt-s390x-4.7-to-4.8 (all) - 4 runs, 100% failed, 100% of failures match = 100% impact
release-openshift-origin-installer-e2e-aws-upgrade (all) - 22 runs, 23% failed, 20% of failures match = 5% impact
release-openshift-origin-installer-launch-aws (all) - 86 runs, 49% failed, 2% of failures match = 1% impact
release-openshift-origin-installer-launch-gcp (all) - 220 runs, 32% failed, 3% of failures match = 1% impact
release-openshift-origin-installer-old-rhcos-e2e-aws-4.8 (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
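For reading the lines above: the "impact" figure appears to be the failure rate multiplied by the match rate, rounded to a whole percent.  That is an inference from the numbers listed here, not something the search tool's output states, and the helper name `impact` is made up for this sketch.

```python
def impact(failed_pct, match_pct):
    """Assumed relationship: impact = failed% * match% / 100, rounded to a whole percent."""
    return round(failed_pct * match_pct / 100)


# Rows taken from the listing above, as (failed %, failures-match %, reported impact %).
rows = [
    (100, 95, 95),
    (94, 103, 97),
    (88, 97, 85),
    (23, 20, 5),
    (49, 2, 1),
]
consistent = all(impact(failed, match) == reported for failed, match, reported in rows)
```

Under that assumption, the odd-looking "103% of failures match" row still works out to the reported 97% impact.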

Comment 2 W. Trevor King 2021-06-03 04:50:05 UTC

And just confirming that all of those^ are the network operator:

$ curl -s 'https://search.ci.openshift.org/search?maxAge=48h&type=junit&search=flowcontrol.apiserver.k8s.io/v1alpha1.*FlowSchema.*the+server+could+not+find+the+requested+resource' | jq -r 'to_entries[].value | to_entries[].value[].context[]' | sed -n 's|.*v1alpha1.*Kind=FlowSchema) /\([^:]*\):.*could not find the requested resource.*|\1|p' | sort | uniq -c
    154 openshift-ovn-kubernetes
    155 openshift-sdn
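For anyone who wants to repeat that tally without jq and sed, here is a rough Python equivalent of the `sed ... | sort | uniq -c` stage.  It assumes the context lines have already been pulled out of the search JSON into a list of strings; the function name `tally_flowschemas` is made up for this sketch.

```python
import re
from collections import Counter

# Same expression as the sed pattern above: capture the FlowSchema name between
# "Kind=FlowSchema) /" and the following colon.
PATTERN = re.compile(
    r".*v1alpha1.*Kind=FlowSchema\) /([^:]*):.*could not find the requested resource"
)


def tally_flowschemas(context_lines):
    """Count matching lines per FlowSchema name; non-matching lines are skipped."""
    counts = Counter()
    for line in context_lines:
        match = PATTERN.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts
```

Calling `tally_flowschemas(lines).most_common()` gives the per-name counts sorted by frequency.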

Comment 4 Alexander Constantinescu 2021-06-04 13:59:36 UTC

Marking this as a blocker for 4.8 since it leads to a degraded network operator and a failed upgrade. I already have a PR up: https://github.com/openshift/cluster-network-operator/pull/1118. It will need to land in 4.7, but I need the API server team's input on that.

Comment 5 Alexander Constantinescu 2021-06-07 08:10:24 UTC

Setting the target release to 4.7 and lowering the urgency since upgrades are not really blocked by this bug. I've confirmed the behavior with the API server and CVO teams.

The CNO does go degraded due to this error, which blocks reconciliation. However, the CVO should eventually force the CNO to upgrade to its 4.8 version, at which point the CNO pushes the right version of this resource and unblocks itself. The only concern is a general upgrade that hangs, in which case the CVO would not force-update the CNO.

In any case, the fix needs to land in 4.7, so the target release needs to change.

Comment 8 Ying Wang 2021-06-15 02:33:29 UTC

Tried upgrading from 4.7.0-0.nightly-2021-06-10-082247 to 4.8.0-0.nightly-2021-06-11-024306 with both SDN and OVN. Both upgrades succeeded without the network operator going degraded.

https://mastern-jenkins-csb-openshift-qe.apps.ocp4.prod.psi.redhat.com/job/upgrade_CI/15078/console

https://mastern-jenkins-csb-openshift-qe.apps.ocp4.prod.psi.redhat.com/job/upgrade_CI/15079/console

Comment 9 Alexander Constantinescu 2021-06-16 13:56:51 UTC

*** Bug 1971835 has been marked as a duplicate of this bug. ***

Comment 10 OpenShift Automated Release Tooling 2021-06-17 12:29:08 UTC

OpenShift engineering has decided not to ship Red Hat OpenShift Container Platform 4.7.17 due to a regression: https://bugzilla.redhat.com/show_bug.cgi?id=1973006. All the fixes that were part of 4.7.17 will now be part of 4.7.18, which is planned to be available in the candidate channel on June 23 2021 and in the fast channel on June 28th.

Comment 14 errata-xmlrpc 2021-06-29 04:19:45 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.7.18 bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:2502
