Bug 1906936 - unable to rollback to 4.6 while upgrading to 4.7; unable to update prometheusrule
Keywords:
Status: CLOSED DUPLICATE of bug 1916601
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.7
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.7.0
Assignee: Juan Luis de Sousa-Valadas
QA Contact: Anurag saxena
URL:
Whiteboard:
Duplicates: 1913620
Depends On:
Blocks:
 
Reported: 2020-12-11 20:41 UTC by jamo luhrsen
Modified: 2021-01-15 15:29 UTC
CC List: 13 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-01-15 15:29:05 UTC
Target Upstream Version:
Embargoed:



Description jamo luhrsen 2020-12-11 20:41:47 UTC
Description of problem:

The upgrade rollback 4.6->4.7->4.6 fails when rolling back to 4.6. The gathered
ClusterVersion status [0] shows the rollback in progress:

     "history": [
                    {
                        "completionTime": null,
                        "image": "registry.svc.ci.openshift.org/ci-op-fj8z0c6x/release@sha256:6ddbf56b7f9776c0498f23a54b65a06b3b846c1012200c5609c4bb716b6bdcdf",
                        "startedTime": "2020-12-11T15:06:56Z",
                        "state": "Partial",
                        "verified": false,
                        "version": "4.6.8"
                    },
                    {
                        "completionTime": "2020-12-11T15:06:56Z",
                        "image": "registry.svc.ci.openshift.org/ci-op-fj8z0c6x/release@sha256:91bd93a846989e40440c7ac93f566c4cb9bdd13878efbad829424dae34b091bd",
                        "startedTime": "2020-12-11T14:06:22Z",
                        "state": "Partial",
                        "verified": false,
                        "version": "4.7.0-0.ci-2020-12-10-144849"
                    },
                    {
                        "completionTime": "2020-12-11T14:00:22Z",
                        "image": "registry.svc.ci.openshift.org/ci-op-fj8z0c6x/release@sha256:6ddbf56b7f9776c0498f23a54b65a06b3b846c1012200c5609c4bb716b6bdcdf",
                        "startedTime": "2020-12-11T13:29:28Z",
                        "state": "Completed",
                        "verified": false,
                        "version": "4.6.8"
                    }
                ],


and the failure cause is also shown:

                    {
                        "lastTransitionTime": "2020-12-11T16:17:03Z",
                        "message": "Could not update prometheusrule \"openshift-cluster-version/cluster-version-operator\" (9 of 617): the server is reporting an internal error",
                        "reason": "UpdatePayloadClusterError",
                        "status": "True",
                        "type": "Failing"
                    },
                    {
                        "lastTransitionTime": "2020-12-11T14:06:22Z",
                        "message": "Unable to apply 4.6.8: the control plane is reporting an internal error",
                        "reason": "UpdatePayloadClusterError",
                        "status": "True",
                        "type": "Progressing"
                    }


[0]   https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-rollback-4.6-to-4.7/1337381727240196096/artifacts/e2e-aws-upgrade/gather-extra/clusterversion.json
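For anyone triaging a similar run against a live cluster, the same history and conditions can be pulled directly from the ClusterVersion object (named "version" by default); a minimal sketch:

  # full ClusterVersion object, same data as the gathered clusterversion.json
  oc get clusterversion version -o json

  # just the upgrade/rollback history
  oc get clusterversion version -o jsonpath='{.status.history}'

  # just the conditions (Failing / Progressing)
  oc get clusterversion version -o jsonpath='{.status.conditions}'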



Version-Release number of selected component (if applicable):

rollback from 4.7 to 4.6


How reproducible:

Unsure, as the failing job has only recently been improved to gather logs.


Steps to Reproduce:

run this job:

  https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-rollback-4.6-to-4.7



Additional info:

Originally this was discussed in https://bugzilla.redhat.com/show_bug.cgi?id=1893348, but that bug is now closed because it was really
tracking the fact that we did not collect logs in this failing 4.6->4.7->4.6 job.

Also, two Slack threads that may or may not help:
  https://coreos.slack.com/archives/C01CQA76KMX/p1607713778276800
  https://coreos.slack.com/archives/C0VMT03S5/p1607715542396800

Comment 1 Sergiusz Urbaniak 2020-12-14 08:29:09 UTC
Looking at https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-rollback-4.6-to-4.7/1337381727240196096/artifacts/e2e-aws-upgrade/gather-extra/pods/openshift-cluster-version_cluster-version-operator-77f99b4fb7-xdn4l_cluster-version-operator.log

reveals issues with the network:

I1211 16:17:01.844709       1 sync_worker.go:869] Update error 9 of 617: UpdatePayloadClusterError Could not update prometheusrule "openshift-cluster-version/cluster-version-operator" (9 of 617): the server is reporting an internal error (*errors.StatusError: Internal error occurred: failed calling webhook "prometheusrules.openshift.io": Post "https://prometheus-operator.openshift-monitoring.svc:8080/admission-prometheusrules/validate?timeout=5s": dial tcp 10.129.0.63:8080: connect: no route to host)
E1211 16:17:01.844735       1 sync_worker.go:348] unable to synchronize image (waiting 2m50.956499648s): Could not update prometheusrule "openshift-cluster-version/cluster-version-operator" (9 of 617): the server is reporting an internal error

Hence reassigning to the network team to assess.

FWIW, the prometheus-operator log does not reveal any issues: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-rollback-4.6-to-4.7/1337381727240196096/artifacts/e2e-aws-upgrade/gather-extra/pods/openshift-monitoring_prometheus-operator-5986f78f55-tc8s6_prometheus-operator.log
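For context, the failing call is the prometheusrules.openshift.io validating webhook, which is served by the prometheus-operator service in openshift-monitoring (per the URL in the log above). A minimal sketch of checks that separate a webhook/endpoint problem from a broader overlay-networking problem; the webhook configuration name is assumed from the webhook name in the log, and <some-pod> is a placeholder:

  # webhook registration and the service it points at
  # (configuration name assumed to match the webhook name from the log)
  oc get validatingwebhookconfiguration prometheusrules.openshift.io -o yaml

  # does the backing service still have endpoints?
  oc -n openshift-monitoring get endpoints prometheus-operator

  # try the webhook URL from another pod on the overlay
  # (<some-pod> is a placeholder; its image must ship curl)
  # any HTTP response means the overlay path works; "no route to host" reproduces the failure above
  oc -n openshift-monitoring exec <some-pod> -- \
    curl -sk https://prometheus-operator.openshift-monitoring.svc:8080/admission-prometheusrules/validate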

Comment 2 Juan Luis de Sousa-Valadas 2020-12-15 17:38:39 UTC
Minor update:
I confirm this is a network issue. Communication is broken because the hostsubnets.network.openshift.io CRD is deleted, which causes the SDN to delete all the node-to-node flows; that is why node-to-node communication fails in the overlay while everything else works.

I lost the logs in the middle of the update and I'm not certain about anything, but my guess at what happens is (a quick check is sketched after this list):
1- The operator deletes the hostsubnet CRD
2- Because the hostsubnet CRD is deleted, all HostSubnet objects are wiped before it goes away
3- Because all hostsubnets are wiped, the SDN pods delete the OVS flows related to them
4- Everything that needs pod-to-pod communication on the overlay also stops working
5- CNO deletes the SDN daemonset (not that it really matters at this point, but it leaves me without SDN pod logs)
6- Because everything is broken, CNO is apparently unable to recreate the daemonsets.
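A quick way to check steps 1, 2 and 5 of that theory on a live cluster is to see whether the CRD, the per-node HostSubnet objects and the SDN daemonset still exist. A minimal sketch; the openshift-sdn namespace and the daemonset name "sdn" are assumed from a default openshift-sdn deployment:

  # is the CRD still registered?
  oc get crd hostsubnets.network.openshift.io

  # are the per-node HostSubnet objects still present?
  oc get hostsubnets

  # is the SDN daemonset still there? (step 5 above; namespace/name assumed for a default install)
  oc -n openshift-sdn get daemonset sdn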

Comment 3 Juan Luis de Sousa-Valadas 2021-01-12 15:18:40 UTC
*** Bug 1913620 has been marked as a duplicate of this bug. ***

Comment 5 Juan Luis de Sousa-Valadas 2021-01-14 15:33:08 UTC
We just need https://github.com/openshift/cluster-network-operator/pull/945 merged and a backport of it.
I tested it manually and the downgrade works just fine.

Comment 6 Juan Luis de Sousa-Valadas 2021-01-15 15:29:05 UTC
The symptoms of this bug are not the same, but it has the same root cause, and PR https://github.com/openshift/cluster-network-operator/pull/953 will fix it.

*** This bug has been marked as a duplicate of bug 1916601 ***

