Description of problem:

Upgrade rollback 4.6 -> 4.7 -> 4.6 fails when rolling back to 4.6.

The cluster version log [0] shows the upgrade rollback process happening:

  "history": [
    {
      "completionTime": null,
      "image": "registry.svc.ci.openshift.org/ci-op-fj8z0c6x/release@sha256:6ddbf56b7f9776c0498f23a54b65a06b3b846c1012200c5609c4bb716b6bdcdf",
      "startedTime": "2020-12-11T15:06:56Z",
      "state": "Partial",
      "verified": false,
      "version": "4.6.8"
    },
    {
      "completionTime": "2020-12-11T15:06:56Z",
      "image": "registry.svc.ci.openshift.org/ci-op-fj8z0c6x/release@sha256:91bd93a846989e40440c7ac93f566c4cb9bdd13878efbad829424dae34b091bd",
      "startedTime": "2020-12-11T14:06:22Z",
      "state": "Partial",
      "verified": false,
      "version": "4.7.0-0.ci-2020-12-10-144849"
    },
    {
      "completionTime": "2020-12-11T14:00:22Z",
      "image": "registry.svc.ci.openshift.org/ci-op-fj8z0c6x/release@sha256:6ddbf56b7f9776c0498f23a54b65a06b3b846c1012200c5609c4bb716b6bdcdf",
      "startedTime": "2020-12-11T13:29:28Z",
      "state": "Completed",
      "verified": false,
      "version": "4.6.8"
    }
  ],

and the failure cause is also shown:

  {
    "lastTransitionTime": "2020-12-11T16:17:03Z",
    "message": "Could not update prometheusrule \"openshift-cluster-version/cluster-version-operator\" (9 of 617): the server is reporting an internal error",
    "reason": "UpdatePayloadClusterError",
    "status": "True",
    "type": "Failing"
  },
  {
    "lastTransitionTime": "2020-12-11T14:06:22Z",
    "message": "Unable to apply 4.6.8: the control plane is reporting an internal error",
    "reason": "UpdatePayloadClusterError",
    "status": "True",
    "type": "Progressing"
  }

[0] https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-rollback-4.6-to-4.7/1337381727240196096/artifacts/e2e-aws-upgrade/gather-extra/clusterversion.json

Version-Release number of selected component (if applicable):

Rollback from 4.7 to 4.6.

How reproducible:

Unsure, as the failing job has only recently been improved to gather logs.

Steps to Reproduce:

Run this job: https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-rollback-4.6-to-4.7

Additional info:

This was originally discussed in https://bugzilla.redhat.com/show_bug.cgi?id=1893348, but that bug is now closed because it was really tracking the fact that we did not collect logs in this failing 4.6 -> 4.7 -> 4.6 job.

Two Slack threads that may or may not help:
https://coreos.slack.com/archives/C01CQA76KMX/p1607713778276800
https://coreos.slack.com/archives/C0VMT03S5/p1607715542396800
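For quick triage of runs like this one, here is a minimal sketch (Python 3, standard library only) that summarizes the update history and the degraded conditions from a downloaded copy of the clusterversion.json at [0]. The local filename and the List-vs-single-object handling are assumptions about the gather-extra artifact layout; this is just an illustration, not CI tooling:

  #!/usr/bin/env python3
  # Sketch: summarize update history and degraded conditions from a gathered
  # clusterversion.json. The filename is an assumption; point it at your copy.
  import json

  with open("clusterversion.json") as f:
      cv = json.load(f)

  # gather-extra may dump a List object; fall back to a single ClusterVersion.
  for obj in cv.get("items", [cv]):
      status = obj.get("status", {})
      print("== update history (newest first) ==")
      for entry in status.get("history", []):
          print(f'{entry["version"]:<32} {entry["state"]:<10} '
                f'started={entry["startedTime"]} completed={entry["completionTime"]}')
      print("== conditions of interest ==")
      for cond in status.get("conditions", []):
          if cond["type"] in ("Failing", "Progressing") and cond["status"] == "True":
              print(f'{cond["type"]}: {cond.get("reason")}: {cond.get("message")}')

Against this run's artifact it should print the three history entries and the Failing/Progressing conditions quoted above.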
Looking at https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-rollback-4.6-to-4.7/1337381727240196096/artifacts/e2e-aws-upgrade/gather-extra/pods/openshift-cluster-version_cluster-version-operator-77f99b4fb7-xdn4l_cluster-version-operator.log reveals issues with the network:

  I1211 16:17:01.844709       1 sync_worker.go:869] Update error 9 of 617: UpdatePayloadClusterError Could not update prometheusrule "openshift-cluster-version/cluster-version-operator" (9 of 617): the server is reporting an internal error (*errors.StatusError: Internal error occurred: failed calling webhook "prometheusrules.openshift.io": Post "https://prometheus-operator.openshift-monitoring.svc:8080/admission-prometheusrules/validate?timeout=5s": dial tcp 10.129.0.63:8080: connect: no route to host)
  E1211 16:17:01.844735       1 sync_worker.go:348] unable to synchronize image (waiting 2m50.956499648s): Could not update prometheusrule "openshift-cluster-version/cluster-version-operator" (9 of 617): the server is reporting an internal error

Hence reassigning to the network team to assess.

FWIW, the prometheus-operator log does not reveal any issues:
https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-rollback-4.6-to-4.7/1337381727240196096/artifacts/e2e-aws-upgrade/gather-extra/pods/openshift-monitoring_prometheus-operator-5986f78f55-tc8s6_prometheus-operator.log
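To make the symptom concrete: the CVO failure is a plain TCP "no route to host" to the prometheus-operator admission webhook pod. Below is a minimal sketch of the same check (Python 3, standard library), assuming it is run from inside the cluster overlay (for example from an oc debug pod on another node); the IP and port are taken from the log line above:

  #!/usr/bin/env python3
  # Sketch: reproduce the connectivity check behind the webhook failure above.
  # Only meaningful when run from inside the cluster overlay (e.g. an oc debug
  # pod on a different node). IP/port come from the CVO log line.
  import socket

  POD_IP, PORT = "10.129.0.63", 8080  # prometheus-operator admission webhook pod

  try:
      with socket.create_connection((POD_IP, PORT), timeout=5):
          print(f"TCP connect to {POD_IP}:{PORT} succeeded; overlay path looks fine")
  except OSError as err:
      # On the broken cluster this is expected to fail the same way the CVO
      # call did: "connect: no route to host".
      print(f"TCP connect to {POD_IP}:{PORT} failed: {err}")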
Minor update: I can confirm this is a network issue. Communication is broken because the hostsubnets.network.openshift.io CRD is deleted, which causes the SDN to delete all of its node-to-node flows; that is why node-to-node communication over the overlay fails while everything else keeps working. I lost the logs from the middle of the update and I'm not certain about any of this, but my guess at the sequence (see the sketch below) is:

1- The operator deletes the hostsubnet CRD.
2- Because the hostsubnet CRD is being deleted, all the hostsubnet objects are wiped first.
3- Because all the hostsubnets are wiped, the SDN pods delete the OVS flows related to them.
4- Everything that needs pod-to-pod communication over the overlay stops working as well.
5- CNO deletes the SDN daemonset (not that it really matters at this point, but it leaves me without SDN pod logs).
6- Because everything is broken, CNO is apparently unable to recreate the daemonsets.
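For anyone hitting this live, a quick way to confirm the state described above is to check whether the hostsubnets CRD and the HostSubnet objects still exist. This is only a diagnostic sketch, not CNO code; it assumes the kubernetes Python client and a kubeconfig with read access:

  #!/usr/bin/env python3
  # Sketch: check whether the hostsubnets CRD and HostSubnet objects are still
  # present. Assumes the kubernetes Python client and a working kubeconfig.
  from kubernetes import client, config
  from kubernetes.client.rest import ApiException

  config.load_kube_config()

  try:
      client.ApiextensionsV1Api().read_custom_resource_definition(
          "hostsubnets.network.openshift.io")
      print("hostsubnets CRD is present")
  except ApiException as err:
      if err.status == 404:
          print("hostsubnets CRD is gone - matches the failure mode above")
      else:
          raise

  try:
      subnets = client.CustomObjectsApi().list_cluster_custom_object(
          group="network.openshift.io", version="v1", plural="hostsubnets")
      print(f'{len(subnets.get("items", []))} HostSubnet objects found')
  except ApiException as err:
      print(f"listing HostSubnets failed: {err.status} {err.reason}")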
*** Bug 1913620 has been marked as a duplicate of this bug. ***
We just need https://github.com/openshift/cluster-network-operator/pull/945 merged and a backport of it. I tested it manually and the downgrade works just fine.
The symptoms are not identical, but this bug has the same root cause, and PR https://github.com/openshift/cluster-network-operator/pull/953 will fix it. *** This bug has been marked as a duplicate of bug 1916601 ***