Bug 1906936

Summary: unable to rollback to 4.6 while upgrading to 4.7; unable to update prometheusrule
Product: OpenShift Container Platform Reporter: jamo luhrsen <jluhrsen>
Component: Networking Assignee: Juan Luis de Sousa-Valadas <jdesousa>
Networking sub component: ovn-kubernetes QA Contact: Anurag saxena <anusaxen>
Status: CLOSED DUPLICATE Docs Contact:
Severity: high    
Priority: high CC: aconstan, alegrand, anpicker, erooth, kakkoyun, lcosic, lmohanty, mloibl, pkrupa, surbania, wking, xxia, yanyang
Version: 4.7 Keywords: TestBlocker, Upgrades
Target Milestone: ---   
Target Release: 4.7.0   
Hardware: Unspecified   
OS: Unspecified   
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2021-01-15 15:29:05 UTC Type: Bug

Description jamo luhrsen 2020-12-11 20:41:47 UTC
Description of problem:

upgrade rollback 4.6->4.7->4.6 fails when rolling back to 4.6. The cluster
version log [0] shows the upgrade rollback process happening:

     "history": [
                        "completionTime": null,
                        "image": "registry.svc.ci.openshift.org/ci-op-fj8z0c6x/release@sha256:6ddbf56b7f9776c0498f23a54b65a06b3b846c1012200c5609c4bb716b6bdcdf",
                        "startedTime": "2020-12-11T15:06:56Z",
                        "state": "Partial",
                        "verified": false,
                        "version": "4.6.8"
                        "completionTime": "2020-12-11T15:06:56Z",
                        "image": "registry.svc.ci.openshift.org/ci-op-fj8z0c6x/release@sha256:91bd93a846989e40440c7ac93f566c4cb9bdd13878efbad829424dae34b091bd",
                        "startedTime": "2020-12-11T14:06:22Z",
                        "state": "Partial",
                        "verified": false,
                        "version": "4.7.0-0.ci-2020-12-10-144849"
                        "completionTime": "2020-12-11T14:00:22Z",
                        "image": "registry.svc.ci.openshift.org/ci-op-fj8z0c6x/release@sha256:6ddbf56b7f9776c0498f23a54b65a06b3b846c1012200c5609c4bb716b6bdcdf",
                        "startedTime": "2020-12-11T13:29:28Z",
                        "state": "Completed",
                        "verified": false,
                        "version": "4.6.8"

and the failure cause is also shown:

                        "lastTransitionTime": "2020-12-11T16:17:03Z",
                        "message": "Could not update prometheusrule \"openshift-cluster-version/cluster-version-operator\" (9 of 617): the server is reporting an internal error",
                        "reason": "UpdatePayloadClusterError",
                        "status": "True",
                        "type": "Failing"
                        "lastTransitionTime": "2020-12-11T14:06:22Z",
                        "message": "Unable to apply 4.6.8: the control plane is reporting an internal error",
                        "reason": "UpdatePayloadClusterError",
                        "status": "True",
                        "type": "Progressing"

[0]   https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-rollback-4.6-to-4.7/1337381727240196096/artifacts/e2e-aws-upgrade/gather-extra/clusterversion.json
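
A minimal sketch of how the two excerpts above can be read programmatically. The values are copied from the quoted clusterversion.json fragments; grouping them under a single STATUS dictionary and the helper names (update_path, stuck_update, failing_messages) are assumptions made for illustration, not part of any OpenShift tooling.

```python
# Values copied from the clusterversion.json excerpts above. Grouping them
# into one dictionary is an assumption about the full file's layout.
STATUS = {
    "history": [
        {"version": "4.6.8", "state": "Partial",
         "startedTime": "2020-12-11T15:06:56Z", "completionTime": None},
        {"version": "4.7.0-0.ci-2020-12-10-144849", "state": "Partial",
         "startedTime": "2020-12-11T14:06:22Z",
         "completionTime": "2020-12-11T15:06:56Z"},
        {"version": "4.6.8", "state": "Completed",
         "startedTime": "2020-12-11T13:29:28Z",
         "completionTime": "2020-12-11T14:00:22Z"},
    ],
    "conditions": [
        {"type": "Failing", "status": "True",
         "reason": "UpdatePayloadClusterError",
         "message": 'Could not update prometheusrule '
                    '"openshift-cluster-version/cluster-version-operator" '
                    '(9 of 617): the server is reporting an internal error'},
        {"type": "Progressing", "status": "True",
         "reason": "UpdatePayloadClusterError",
         "message": "Unable to apply 4.6.8: the control plane is reporting "
                    "an internal error"},
    ],
}


def update_path(status):
    """Versions in the order they were applied (the excerpt is newest-first)."""
    return [entry["version"] for entry in reversed(status["history"])]


def stuck_update(status):
    """Return the history entry that never completed, or None."""
    for entry in status["history"]:
        if entry["completionTime"] is None:
            return entry
    return None


def failing_messages(status):
    """Messages of every condition with type Failing and status True."""
    return [c["message"] for c in status["conditions"]
            if c["type"] == "Failing" and c["status"] == "True"]


if __name__ == "__main__":
    print(" -> ".join(update_path(STATUS)))
    print("stuck on:", stuck_update(STATUS)["version"])
    for message in failing_messages(STATUS):
        print("failing:", message)
```

Read this way, the history shows the 4.6.8 -> 4.7 -> 4.6.8 path, with the final 4.6.8 entry still Partial and without a completion time.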

Version-Release number of selected component (if applicable):

rollback from 4.7 to 4.6

How reproducible:

unsure, as the job that failed has just recently been improved to gather logs.

Steps to Reproduce:

run this job:


Additional info:

originally this was discussed in https://bugzilla.redhat.com/show_bug.cgi?id=1893348, but that is now closed as it was really
tracking the fact that we did not collect logs in this failing 4.6->4.7->4.6 job.

also, two slack threads that may or may not help:

Comment 1 Sergiusz Urbaniak 2020-12-14 08:29:09 UTC
Looking at https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-rollback-4.6-to-4.7/1337381727240196096/artifacts/e2e-aws-upgrade/gather-extra/pods/openshift-cluster-version_cluster-version-operator-77f99b4fb7-xdn4l_cluster-version-operator.log

reveals issues with the network:

I1211 16:17:01.844709       1 sync_worker.go:869] Update error 9 of 617: UpdatePayloadClusterError Could not update prometheusrule "openshift-cluster-version/cluster-version-operator" (9 of 617): the server is reporting an internal error (*errors.StatusError: Internal error occurred: failed calling webhook "prometheusrules.openshift.io": Post "https://prometheus-operator.openshift-monitoring.svc:8080/admission-prometheusrules/validate?timeout=5s": dial tcp connect: no route to host)
E1211 16:17:01.844735       1 sync_worker.go:348] unable to synchronize image (waiting 2m50.956499648s): Could not update prometheusrule "openshift-cluster-version/cluster-version-operator" (9 of 617): the server is reporting an internal error

Hence reassigning to the network team to assess.

fwiw prometheus-operator does not reveal any issues: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-rollback-4.6-to-4.7/1337381727240196096/artifacts/e2e-aws-upgrade/gather-extra/pods/openshift-monitoring_prometheus-operator-5986f78f55-tc8s6_prometheus-operator.log
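
A minimal sketch for pulling the webhook name, target service, and dial error out of the CVO log line quoted in this comment, which makes it easier to see that the failure lies between the API server and the service rather than in prometheus-operator itself. The regular expression and the function name parse_webhook_failure are assumptions written for this one log format. The split of the host into service and namespace relies on the usual `<service>.<namespace>.svc` naming of in-cluster services.

```python
import re
from urllib.parse import urlsplit

# The first CVO log line from this comment, rejoined into a single string.
LOG_LINE = (
    'I1211 16:17:01.844709       1 sync_worker.go:869] Update error 9 of 617: '
    'UpdatePayloadClusterError Could not update prometheusrule '
    '"openshift-cluster-version/cluster-version-operator" (9 of 617): '
    'the server is reporting an internal error (*errors.StatusError: '
    'Internal error occurred: failed calling webhook '
    '"prometheusrules.openshift.io": Post '
    '"https://prometheus-operator.openshift-monitoring.svc:8080'
    '/admission-prometheusrules/validate?timeout=5s": '
    'dial tcp connect: no route to host)'
)

# Hypothetical pattern, tailored to the message shape seen in this log only.
WEBHOOK_RE = re.compile(
    r'failed calling webhook "(?P<webhook>[^"]+)": '
    r'Post "(?P<url>[^"]+)": (?P<cause>.+?)\)?$'
)


def parse_webhook_failure(line):
    """Return webhook, service, namespace, port, and cause, or None."""
    match = WEBHOOK_RE.search(line)
    if match is None:
        return None
    target = urlsplit(match.group("url"))
    # In-cluster service hosts look like <service>.<namespace>.svc
    service, namespace = target.hostname.split(".")[:2]
    return {
        "webhook": match.group("webhook"),
        "service": service,
        "namespace": namespace,
        "port": target.port,
        "cause": match.group("cause"),
    }


if __name__ == "__main__":
    for key, value in parse_webhook_failure(LOG_LINE).items():
        print(f"{key}: {value}")
```

The "no route to host" cause on a service address points at the pod network rather than at the webhook server.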

Comment 2 Juan Luis de Sousa-Valadas 2020-12-15 17:38:39 UTC
Minor update:
I confirm this is a network issue. Communication is broken because the hostsubnets.network.openshift.io CRD is deleted, which causes the whole SDN to delete all the node-to-node flows. That is why node-to-node communication fails in the overlay while the rest works.

I lost the logs in the middle of the update, so I'm not certain about any of this, but my guess at what happens is:
1- The operator deletes the hostsubnet CRD
2- Deleting the hostsubnet CRD first wipes all hostsubnets
3- Because all hostsubnets are wiped, the SDN pods delete the OVS flows for those hostsubnets
4- Everything that needs pod-to-pod communication on the overlay also stops working
5- CNO deletes the SDN daemonset (not that it really matters at this point, but it leaves me without SDN pod logs)
6- Because everything is broken, CNO is apparently unable to recreate the daemonsets.

Comment 3 Juan Luis de Sousa-Valadas 2021-01-12 15:18:40 UTC
*** Bug 1913620 has been marked as a duplicate of this bug. ***

Comment 5 Juan Luis de Sousa-Valadas 2021-01-14 15:33:08 UTC
We just need https://github.com/openshift/cluster-network-operator/pull/945 merged and a backport of it.
I tested it manually and the downgrade works just fine.

Comment 6 Juan Luis de Sousa-Valadas 2021-01-15 15:29:05 UTC
The bug symptoms are not the same, but it has the same root cause, and PR https://github.com/openshift/cluster-network-operator/pull/953 will fix it.

*** This bug has been marked as a duplicate of bug 1916601 ***