Bug 1906936

Summary:	unable to rollback to 4.6 while upgrading to 4.7; unable to update prometheusrule
Product:	OpenShift Container Platform	Reporter:	jamo luhrsen <jluhrsen>
Component:	Networking	Assignee:	Juan Luis de Sousa-Valadas <jdesousa>
Networking sub component:	ovn-kubernetes	QA Contact:	Anurag saxena <anusaxen>
Status:	CLOSED DUPLICATE	Docs Contact:
Severity:	high
Priority:	high	CC:	aconstan, alegrand, anpicker, erooth, kakkoyun, lcosic, lmohanty, mloibl, pkrupa, surbania, wking, xxia, yanyang
Version:	4.7	Keywords:	TestBlocker, Upgrades
Target Milestone:	---
Target Release:	4.7.0
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2021-01-15 15:29:05 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:

Description jamo luhrsen 2020-12-11 20:41:47 UTC

Description of problem:

The 4.6->4.7->4.6 upgrade rollback job fails during the rollback to 4.6. The cluster
version log [0] shows the rollback in progress:

     "history": [
                    {
                        "completionTime": null,
                        "image": "registry.svc.ci.openshift.org/ci-op-fj8z0c6x/release@sha256:6ddbf56b7f9776c0498f23a54b65a06b3b846c1012200c5609c4bb716b6bdcdf",
                        "startedTime": "2020-12-11T15:06:56Z",
                        "state": "Partial",
                        "verified": false,
                        "version": "4.6.8"
                    },
                    {
                        "completionTime": "2020-12-11T15:06:56Z",
                        "image": "registry.svc.ci.openshift.org/ci-op-fj8z0c6x/release@sha256:91bd93a846989e40440c7ac93f566c4cb9bdd13878efbad829424dae34b091bd",
                        "startedTime": "2020-12-11T14:06:22Z",
                        "state": "Partial",
                        "verified": false,
                        "version": "4.7.0-0.ci-2020-12-10-144849"
                    },
                    {
                        "completionTime": "2020-12-11T14:00:22Z",
                        "image": "registry.svc.ci.openshift.org/ci-op-fj8z0c6x/release@sha256:6ddbf56b7f9776c0498f23a54b65a06b3b846c1012200c5609c4bb716b6bdcdf",
                        "startedTime": "2020-12-11T13:29:28Z",
                        "state": "Completed",
                        "verified": false,
                        "version": "4.6.8"
                    }
                ],


and the failure cause is also shown:

                    {
                        "lastTransitionTime": "2020-12-11T16:17:03Z",
                        "message": "Could not update prometheusrule \"openshift-cluster-version/cluster-version-operator\" (9 of 617): the server is reporting an internal error",
                        "reason": "UpdatePayloadClusterError",
                        "status": "True",
                        "type": "Failing"
                    },
                    {
                        "lastTransitionTime": "2020-12-11T14:06:22Z",
                        "message": "Unable to apply 4.6.8: the control plane is reporting an internal error",
                        "reason": "UpdatePayloadClusterError",
                        "status": "True",
                        "type": "Progressing"
                    }


[0]   https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-rollback-4.6-to-4.7/1337381727240196096/artifacts/e2e-aws-upgrade/gather-extra/clusterversion.json
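
A minimal sketch of how the two excerpts above can be read programmatically. It assumes the "history" and "conditions" lists sit under the same status object and that history is ordered newest first, as in the excerpt. The function name summarize_status is hypothetical, and the sample data is trimmed to the fields the function reads.

```python
def summarize_status(status):
    """Return (rollback_detected, failing_messages) for a ClusterVersion status dict."""
    # Assumption: history is newest first, so index 0 is the current target.
    versions = [entry["version"] for entry in status.get("history", [])]
    # A rollback shows up as the newest target matching an older entry,
    # with a different version in between.
    rollback_detected = (
        len(versions) >= 3
        and versions[0] == versions[2]
        and versions[0] != versions[1]
    )
    failing_messages = [
        cond["message"]
        for cond in status.get("conditions", [])
        if cond.get("type") == "Failing" and cond.get("status") == "True"
    ]
    return rollback_detected, failing_messages


# Values copied from the excerpts above.
status = {
    "history": [
        {"version": "4.6.8", "state": "Partial", "completionTime": None},
        {
            "version": "4.7.0-0.ci-2020-12-10-144849",
            "state": "Partial",
            "completionTime": "2020-12-11T15:06:56Z",
        },
        {
            "version": "4.6.8",
            "state": "Completed",
            "completionTime": "2020-12-11T14:00:22Z",
        },
    ],
    "conditions": [
        {
            "type": "Failing",
            "status": "True",
            "reason": "UpdatePayloadClusterError",
            "message": 'Could not update prometheusrule "openshift-cluster-version/cluster-version-operator" (9 of 617): the server is reporting an internal error',
        },
        {
            "type": "Progressing",
            "status": "True",
            "reason": "UpdatePayloadClusterError",
            "message": "Unable to apply 4.6.8: the control plane is reporting an internal error",
        },
    ],
}

rollback_detected, failing_messages = summarize_status(status)
print("rollback detected:", rollback_detected)
for message in failing_messages:
    print("failing:", message)
```

To run this against the real clusterversion.json from [0], the status object would first have to be located inside the file. That layout is not shown here, so the lookup is left out.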



Version-Release number of selected component (if applicable):

rollback from 4.7 to 4.6


How reproducible:

Unsure; the failing job was only recently updated to gather logs.


Steps to Reproduce:

run this job:

  https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-rollback-4.6-to-4.7



Additional info:

originally this was discussed in https://bugzilla.redhat.com/show_bug.cgi?id=1893348, but that is now closed as it was really
tracking the fact that we did not collect logs in this failing 4.6->4.7->4.6 job.

also, two slack threads that may or may not help:
  https://coreos.slack.com/archives/C01CQA76KMX/p1607713778276800
  https://coreos.slack.com/archives/C0VMT03S5/p1607715542396800

Comment 1 Sergiusz Urbaniak 2020-12-14 08:29:09 UTC

Looking at https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-rollback-4.6-to-4.7/1337381727240196096/artifacts/e2e-aws-upgrade/gather-extra/pods/openshift-cluster-version_cluster-version-operator-77f99b4fb7-xdn4l_cluster-version-operator.log

reveals issues with the network:

I1211 16:17:01.844709       1 sync_worker.go:869] Update error 9 of 617: UpdatePayloadClusterError Could not update prometheusrule "openshift-cluster-version/cluster-version-operator" (9 of 617): the server is reporting an internal error (*errors.StatusError: Internal error occurred: failed calling webhook "prometheusrules.openshift.io": Post "https://prometheus-operator.openshift-monitoring.svc:8080/admission-prometheusrules/validate?timeout=5s": dial tcp 10.129.0.63:8080: connect: no route to host)
E1211 16:17:01.844735       1 sync_worker.go:348] unable to synchronize image (waiting 2m50.956499648s): Could not update prometheusrule "openshift-cluster-version/cluster-version-operator" (9 of 617): the server is reporting an internal error
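
The relevant part of the first log line is the webhook call that failed. A rough sketch for pulling the webhook name, target URL, dialed address and connect error out of such a line follows. The regular expression and the name parse_webhook_failure are illustrative assumptions fitted to this one message, not a format the cluster-version operator guarantees.

```python
import re

# Assumption: the error text has the shape seen in the log line above.
WEBHOOK_RE = re.compile(
    r'failed calling webhook "(?P<webhook>[^"]+)": '
    r'Post "(?P<url>[^"]+)": '
    r'dial tcp (?P<addr>[\d.]+:\d+): connect: (?P<error>[^)]+)\)'
)


def parse_webhook_failure(line):
    """Return a dict with webhook, url, addr and error, or None if no match."""
    match = WEBHOOK_RE.search(line)
    return match.groupdict() if match else None


# Tail of the I1211 log line above, with the wrapped pieces rejoined.
LINE = (
    '(*errors.StatusError: Internal error occurred: failed calling webhook '
    '"prometheusrules.openshift.io": Post '
    '"https://prometheus-operator.openshift-monitoring.svc:8080/'
    'admission-prometheusrules/validate?timeout=5s": '
    'dial tcp 10.129.0.63:8080: connect: no route to host)'
)

info = parse_webhook_failure(LINE)
print(info["webhook"], info["addr"], info["error"])
```

The connect error is "no route to host" rather than a refusal or an HTTP error from prometheus-operator, which fits the fact that the prometheus-operator log shows nothing wrong.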

Hence reassigning to the network team to confirm.

fwiw prometheus-operator does not reveal any issues: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-rollback-4.6-to-4.7/1337381727240196096/artifacts/e2e-aws-upgrade/gather-extra/pods/openshift-monitoring_prometheus-operator-5986f78f55-tc8s6_prometheus-operator.log

Comment 2 Juan Luis de Sousa-Valadas 2020-12-15 17:38:39 UTC

Minor update:
I confirm this is a network issue. Communication is broken because the hostsubnets.network.openshift.io CRD is deleted, which causes the whole SDN to delete all the node-to-node flows. That is why node-to-node communication fails in the overlay while the rest works.

I lost the logs from the middle of the update, so I'm not certain about any of this, but my guess at what happens is:
1- Operator deletes the hostsubnet CRD
2- Because the hostsubnet CRD is being deleted, all hostsubnets are wiped first
3- Because all hostsubnets are wiped, the SDN pods delete the node-to-node OVS flows
4- Everything that needs pod-to-pod communication on the overlay also stops working
5- CNO deletes the SDN daemonset (not that it really matters at this point, but I'm left without SDN pod logs)
6- Because everything is broken, CNO is apparently unable to recreate the daemonsets.
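
One way to check step 1 on a live cluster is to ask whether the CRD still exists. The following is a minimal sketch, assuming the oc client is on PATH and logged in to the affected cluster. The name crd_exists and the injectable run parameter are hypothetical conveniences, and a non-zero exit status is treated as "not found" even though it could also mean the API server was unreachable.

```python
import subprocess

CRD_NAME = "hostsubnets.network.openshift.io"


def crd_exists(name=CRD_NAME, run=subprocess.run):
    """Return True if `oc get crd <name>` exits with status 0."""
    # Assumption: any non-zero exit status means the CRD is absent.
    result = run(["oc", "get", "crd", name], capture_output=True, text=True)
    return result.returncode == 0
```

Calling crd_exists() during the rollback and getting False would be consistent with the guess above, but it would not prove the rest of the chain.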

Comment 3 Juan Luis de Sousa-Valadas 2021-01-12 15:18:40 UTC

*** Bug 1913620 has been marked as a duplicate of this bug. ***

Comment 5 Juan Luis de Sousa-Valadas 2021-01-14 15:33:08 UTC

We just need https://github.com/openshift/cluster-network-operator/pull/945 merged and a backport of it.
I tested it manually and the downgrade works just fine.

Comment 6 Juan Luis de Sousa-Valadas 2021-01-15 15:29:05 UTC

The symptoms are not the same, but this bug has the same root cause as bug 1916601, and PR https://github.com/openshift/cluster-network-operator/pull/953 will fix it.

*** This bug has been marked as a duplicate of bug 1916601 ***