Bug 1906936
Summary: | unable to rollback to 4.6 while upgrading to 4.7; unable to update prometheusrule | | |
---|---|---|---|
Product: | OpenShift Container Platform | Reporter: | jamo luhrsen <jluhrsen> |
Component: | Networking | Assignee: | Juan Luis de Sousa-Valadas <jdesousa> |
Networking sub component: | ovn-kubernetes | QA Contact: | Anurag saxena <anusaxen> |
Status: | CLOSED DUPLICATE | Docs Contact: | |
Severity: | high | | |
Priority: | high | CC: | aconstan, alegrand, anpicker, erooth, kakkoyun, lcosic, lmohanty, mloibl, pkrupa, surbania, wking, xxia, yanyang |
Version: | 4.7 | Keywords: | TestBlocker, Upgrades |
Target Milestone: | --- | | |
Target Release: | 4.7.0 | | |
Hardware: | Unspecified | | |
OS: | Unspecified | | |
Whiteboard: | | | |
Fixed In Version: | | Doc Type: | If docs needed, set a value |
Doc Text: | | Story Points: | --- |
Clone Of: | | Environment: | |
Last Closed: | 2021-01-15 15:29:05 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | | Category: | --- |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | | | |
Description
jamo luhrsen
2020-12-11 20:41:47 UTC
Looking at https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-rollback-4.6-to-4.7/1337381727240196096/artifacts/e2e-aws-upgrade/gather-extra/pods/openshift-cluster-version_cluster-version-operator-77f99b4fb7-xdn4l_cluster-version-operator.log reveals issues with the network:

I1211 16:17:01.844709 1 sync_worker.go:869] Update error 9 of 617: UpdatePayloadClusterError Could not update prometheusrule "openshift-cluster-version/cluster-version-operator" (9 of 617): the server is reporting an internal error (*errors.StatusError: Internal error occurred: failed calling webhook "prometheusrules.openshift.io": Post "https://prometheus-operator.openshift-monitoring.svc:8080/admission-prometheusrules/validate?timeout=5s": dial tcp 10.129.0.63:8080: connect: no route to host)
E1211 16:17:01.844735 1 sync_worker.go:348] unable to synchronize image (waiting 2m50.956499648s): Could not update prometheusrule "openshift-cluster-version/cluster-version-operator" (9 of 617): the server is reporting an internal error

Hence reassigning to the network team to assess.

FWIW, prometheus-operator does not reveal any issues: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-rollback-4.6-to-4.7/1337381727240196096/artifacts/e2e-aws-upgrade/gather-extra/pods/openshift-monitoring_prometheus-operator-5986f78f55-tc8s6_prometheus-operator.log

Minor update: I confirm this is a network issue. Communication is broken because the hostsubnets.network.openshift.io CRD is deleted, which causes the whole SDN to delete all the node-to-node flows. That is why node-to-node communication fails in the overlay while the rest works.
I lost the logs from the middle of the update, so I'm not certain about anything, but my guess is that this is what happens:

1- The operator deletes the hostsubnet CRD.
2- Because the hostsubnet CRD is being deleted, all hostsubnets are wiped before the CRD itself is removed.
3- Because all hostsubnets are wiped, the SDN pods delete the related OVS flows.
4- Everything that needs pod-to-pod communication on the overlay also stops working.
5- CNO deletes the SDN daemonset (not that it really matters at this point, but it leaves me without SDN pod logs).
6- Because everything is broken, CNO is apparently unable to recreate the daemonsets.

*** Bug 1913620 has been marked as a duplicate of this bug. ***

We just need https://github.com/openshift/cluster-network-operator/pull/945 merged and a backport of it. I tested it manually and the downgrade works just fine.

The bug symptoms are not the same, but it has the same root cause, and PR https://github.com/openshift/cluster-network-operator/pull/953 will fix it.

*** This bug has been marked as a duplicate of bug 1916601 ***
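
For anyone triaging similar rollback runs: the cluster-version-operator log line quoted in the description carries the three facts that pointed at the network (the failing webhook, the pod IP and port being dialed, and the connect error). A minimal Python sketch for pulling those out of a saved log follows. The helper name `find_webhook_failures` and the regular expression are illustrative assumptions, not part of any OpenShift tooling, and the pattern is only fitted to the message shape seen in this report.

```python
import re

# Hypothetical pattern, fitted only to the error format quoted in this bug.
WEBHOOK_FAILURE = re.compile(
    r'failed calling webhook "(?P<webhook>[^"]+)": '
    r'Post "(?P<url>[^"]+)": '
    r'dial tcp (?P<addr>[\d.]+:\d+): connect: (?P<reason>[^)\n]+)'
)


def find_webhook_failures(log_text):
    """Return one dict per webhook dial failure found in log_text."""
    return [m.groupdict() for m in WEBHOOK_FAILURE.finditer(log_text)]


# The first log line from the description, with the line-wrap breaks removed.
SAMPLE = (
    'I1211 16:17:01.844709 1 sync_worker.go:869] Update error 9 of 617: '
    'UpdatePayloadClusterError Could not update prometheusrule '
    '"openshift-cluster-version/cluster-version-operator" (9 of 617): '
    'the server is reporting an internal error (*errors.StatusError: '
    'Internal error occurred: failed calling webhook '
    '"prometheusrules.openshift.io": Post '
    '"https://prometheus-operator.openshift-monitoring.svc:8080/'
    'admission-prometheusrules/validate?timeout=5s": '
    'dial tcp 10.129.0.63:8080: connect: no route to host)'
)

if __name__ == "__main__":
    for failure in find_webhook_failures(SAMPLE):
        # Print the webhook name, the dialed address, and the connect error.
        print(failure["webhook"], failure["addr"], failure["reason"])
```

A "no route to host" against a pod IP such as 10.129.0.63, while the target pod's own log looks healthy, is consistent with the overlay flows having been removed rather than with a fault in the webhook itself.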