Bug 1940207
| Summary: | 4.7->4.6 rollbacks stuck on prometheusrules admission webhook "no route to host" | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | W. Trevor King <wking> |
| Component: | Networking | Assignee: | Federico Paolinelli <fpaoline> |
| Networking sub component: | openshift-sdn | QA Contact: | zhaozhanqi <zzhao> |
| Status: | CLOSED ERRATA | Docs Contact: | |
| Severity: | urgent | | |
| Priority: | urgent | CC: | aconstan, alegrand, anpicker, ccoleman, crawford, erooth, fpaoline, jerzhang, kakkoyun, lcosic, pkrupa, spasquie, surbania |
| Version: | 4.6 | Keywords: | Upgrades |
| Target Milestone: | --- | | |
| Target Release: | 4.8.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | No Doc Update |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | | |
| : | 1947477 (view as bug list) | Environment: | |
| Last Closed: | 2021-07-27 22:53:56 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1947477, 1950261 | | |
Description
W. Trevor King
2021-03-17 19:49:59 UTC
Looking at the endpoint and pod resources, nothing seems wrong:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.7-upgrade-from-stable-4.6-e2e-aws-upgrade-rollback/1371929264202452992/artifacts/e2e-aws-upgrade-rollback/gather-extra/artifacts/endpoints.json | jq '.items | map(. | select(.metadata.name == "prometheus-operator") | .subsets)'
[
  [
    {
      "addresses": [
        {
          "ip": "10.129.0.38",
          "nodeName": "ip-10-0-159-237.ec2.internal",
          "targetRef": {
            "kind": "Pod",
            "name": "prometheus-operator-5d47c59c7c-mw8cq",
            "namespace": "openshift-monitoring",
            "resourceVersion": "80128",
            "uid": "a0bf1d06-8008-4337-9201-4ed7db053432"
          }
        }
      ],
      "ports": [
        {
          "name": "web",
          "port": 8080,
          "protocol": "TCP"
        },
        {
          "name": "https",
          "port": 8443,
          "protocol": "TCP"
        }
      ]
    }
  ]
]

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.7-upgrade-from-stable-4.6-e2e-aws-upgrade-rollback/1371929264202452992/artifacts/e2e-aws-upgrade-rollback/gather-extra/artifacts/pods.json | jq '.items | map(. | select(.metadata.name == "prometheus-operator-5d47c59c7c-mw8cq") | .metadata.name + " " + .status.podIP + " " + .spec.nodeName)'
[
  "prometheus-operator-5d47c59c7c-mw8cq 10.129.0.38 ip-10-0-159-237.ec2.internal"
]

The prometheus-operator logs [1] show lots of connect errors with the same "no route to host" issue:

E0317 00:44:22.308929 1 reflector.go:178] github.com/coreos/prometheus-operator/pkg/informers/informers.go:75: Failed to list *v1.PrometheusRule: Get "https://172.30.0.1:443/apis/monitoring.coreos.com/v1/namespaces/openshift-cluster-samples-operator/prometheusrules?resourceVersion=83959": dial tcp 172.30.0.1:443: connect: no route to host
E0317 00:44:22.308944 1 reflector.go:178] github.com/coreos/prometheus-operator/pkg/informers/informers.go:75: Failed to list *v1.ConfigMap: Get "https://172.30.0.1:443/api/v1/namespaces/openshift-controller-manager-operator/configmaps?labelSelector=prometheus-name&resourceVersion=84656": dial tcp 172.30.0.1:443: connect: no route to host
E0317 00:44:22.308965 1 reflector.go:178] github.com/coreos/prometheus-operator/pkg/informers/informers.go:75: Failed to list *v1.PrometheusRule: Get "https://172.30.0.1:443/apis/monitoring.coreos.com/v1/namespaces/openshift-authentication-operator/prometheusrules?resourceVersion=83959": dial tcp 172.30.0.1:443: connect: no route to host
E0317 00:44:22.308987 1 reflector.go:178] github.com/coreos/prometheus-operator/pkg/informers/informers.go:75: Failed to list *v1.ConfigMap: Get "https://172.30.0.1:443/api/v1/namespaces/openshift-ingress/configmaps?labelSelector=prometheus-name&resourceVersion=84656": dial tcp 172.30.0.1:443: connect: no route to host
E0317 00:44:22.309002 1 reflector.go:178] github.com/coreos/prometheus-operator/pkg/informers/informers.go:75: Failed to list *v1.ServiceMonitor: Get "https://172.30.0.1:443/apis/monitoring.coreos.com/v1/namespaces/openshift-multus/servicemonitors?resourceVersion=84164": dial tcp 172.30.0.1:443: connect: no route to host

Same goes for the cluster-monitoring-operator [2]:

E0317 00:43:30.787516 1 reflector.go:127] github.com/openshift/cluster-monitoring-operator/pkg/operator/operator.go:197: Failed to watch *v1.ConfigMap: failed to list *v1.ConfigMap: Get "https://172.30.0.1:443/api/v1/namespaces/kube-system/configmaps?resourceVersion=84838": dial tcp 172.30.0.1:443: connect: no route to host
E0317 00:43:48.707537 1 reflector.go:127] github.com/openshift/cluster-monitoring-operator/pkg/operator/operator.go:194: Failed to watch *v1.ConfigMap: failed to list *v1.ConfigMap: Get "https://172.30.0.1:443/api/v1/namespaces/openshift-monitoring/configmaps?resourceVersion=84501": dial tcp 172.30.0.1:443: connect: no route to host
E0317 00:43:51.779555 1 reflector.go:127] github.com/openshift/cluster-monitoring-operator/pkg/operator/operator.go:197: Failed to watch *v1.ConfigMap: failed to list *v1.ConfigMap: Get "https://172.30.0.1:443/api/v1/namespaces/openshift-config-managed/configmaps?resourceVersion=84501": dial tcp 172.30.0.1:443: connect: no route to host
E0317 00:44:06.371587 1 reflector.go:127] github.com/openshift/cluster-monitoring-operator/pkg/operator/operator.go:197: Failed to watch *v1.ConfigMap: failed to list *v1.ConfigMap: Get "https://172.30.0.1:443/api/v1/namespaces/kube-system/configmaps?resourceVersion=84838": dial tcp 172.30.0.1:443: connect: no route to host
E0317 00:44:09.444569 1 reflector.go:127] github.com/openshift/cluster-monitoring-operator/pkg/operator/operator.go:197: Failed to watch *v1.ConfigMap: failed to list *v1.ConfigMap: Get "https://172.30.0.1:443/api/v1/namespaces/openshift-user-workload-monitoring/configmaps?resourceVersion=84501": dial tcp 172.30.0.1:443: connect: no route to host
E0317 00:44:12.579576 1 reflector.go:127] github.com/openshift/cluster-monitoring-operator/pkg/operator/operator.go:197: Failed to watch *v1.ConfigMap: failed to list *v1.ConfigMap: Get "https://172.30.0.1:443/api/v1/namespaces/openshift-config/configmaps?resourceVersion=84501": dial tcp
172.30.0.1:443: connect: no route to host

Based on the current information, I'm reassigning to the Networking component.

[1] https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.7-upgrade-from-stable-4.6-e2e-aws-upgrade-rollback/1371929264202452992/artifacts/e2e-aws-upgrade-rollback/gather-extra/artifacts/pods/openshift-monitoring_prometheus-operator-5d47c59c7c-mw8cq_prometheus-operator.log
[2] https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.7-upgrade-from-stable-4.6-e2e-aws-upgrade-rollback/1371929264202452992/artifacts/e2e-aws-upgrade-rollback/gather-extra/artifacts/pods/openshift-monitoring_cluster-monitoring-operator-655bc474db-97dgn_cluster-monitoring-operator.log

I think I found the reason. The 4.6 ovs pod relies [1] on a sentinel file dropped by the MCO [2] to determine whether ovs is already running on the host. That file is no longer created in 4.7. When we roll back, the CNO is rolled back before the MCO, so it runs against the 4.7 version of that systemd unit, which no longer creates the sentinel file [3]. The result is two instances of ovs running on the node at the same time, which is likely what causes the errors I am seeing in the ovs logs. I am testing the rollback manually right now.
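The race described above can be sketched as a sentinel-file check. This is an illustrative sketch only: the sentinel path and function name below are hypothetical, and the real logic lives in the sdn-ovs entrypoint and the MCO systemd unit referenced in this comment.

```shell
# Illustrative sketch of the sentinel-file handshake between the MCO systemd
# unit (which drops the file in 4.6) and the sdn-ovs pod (which checks it).
# SENTINEL path and ovs_owner are hypothetical names, not the actual code.
SENTINEL="${SENTINEL:-/tmp/ovs-configured-on-host}"

ovs_owner() {
  # 4.6 sdn-ovs behavior: if the MCO unit left the sentinel, ovs is already
  # running on the host, so the pod must not start a second instance.
  if [ -f "$SENTINEL" ]; then
    echo "host"
  else
    echo "container"
  fi
}

rm -f "$SENTINEL"   # 4.7 MCO unit: sentinel never created
ovs_owner           # prints "container": the pod starts its own ovs
touch "$SENTINEL"   # 4.6 MCO unit: sentinel present
ovs_owner           # prints "host": the pod defers to the host ovs
rm -f "$SENTINEL"
```

Under this sketch, a rolled-back 4.6 CNO paired with a still-4.7 MCO unit never sees the sentinel, so the pod starts a second ovs next to the one already running on the host, matching the symptom described above.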
[1] https://github.com/openshift/cluster-network-operator/blob/bb19869f526665792d4e42effee98afc4688e766/bindata/network/openshift-sdn/sdn-ovs.yaml#L55
[2] https://github.com/openshift/machine-config-operator/blob/0d140929e3758f5bac3e50c561b467fada11a1ed/templates/common/_base/files/configure-ovs-network.yaml#L17
[3] https://github.com/openshift/machine-config-operator/blob/6c42eaa4d333d2c575540eec7dc866e7cce527d7/templates/common/_base/files/configure-ovs-network.yaml#L7

And here is the list of operators (from the last failed job):

omg get clusteroperators
NAME                                       VERSION                        AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.6.23                         True        False         True       45m
baremetal                                  4.7.0-0.ci-2021-03-30-013032   True        False         False      2h40m
cloud-credential                           4.6.23                         True        False         False      1h43m
cluster-autoscaler                         4.6.23                         True        False         False      1h36m
config-operator                            4.6.23                         True        False         False      3h38m
console                                    4.6.23                         True        False         False      1h24m
csi-snapshot-controller                    4.6.23                         True        False         False      1h43m
dns                                        4.7.0-0.ci-2021-03-30-013032   True        False         True       1h34m
etcd                                       4.6.23                         True        False         False      2h4m
image-registry                             4.6.23                         True        True          False      1h23m
ingress                                    4.6.23                         True        False         False      1h39m
insights                                   4.6.23                         True        False         False      3h32m
kube-apiserver                             4.6.23                         True        False         False      1h50m
kube-controller-manager                    4.6.23                         True        False         False      1h49m
kube-scheduler                             4.6.23                         True        False         False      1h48m
kube-storage-version-migrator              4.6.23                         True        False         False      1h43m
machine-api                                4.6.23                         True        False         True       1h18m
machine-approver                           4.6.23                         True        False         False      3h37m
machine-config                             4.7.0-0.ci-2021-03-30-013032   True        False         False      2h9m
marketplace                                4.6.23                         True        False         False      1h42m
monitoring                                 4.6.23                         True        False         False      1h37m
network                                    4.6.23                         True        False         False      1h24m
node-tuning                                4.6.23                         True        False         False      1h42m
openshift-apiserver                        4.6.23                         True        False         False      1h44m
openshift-controller-manager               4.6.23                         True        False         False      1h42m
openshift-samples                          4.6.23                         True        False         False      1h39m
operator-lifecycle-manager                 4.6.23                         True        False         False      1h34m
operator-lifecycle-manager-catalog         4.6.23                         True        True          False      1h39m
operator-lifecycle-manager-packageserver   4.6.23                         False       True          False      10m
service-ca                                 4.6.23                         True        False         False      1h38m
storage                                    4.6.23                         True        False         False      1h23m

I tested the fix; sdn works with it, but the rollback is then stopped by the MCO with:

lastSyncError: 'pool master has not progressed to latest configuration: controller version mismatch for rendered-master-cb2db7df54e993c796b76a2242b3e08a expected d5dc2b519aed5b3ed6a6ab9e7f70f33740f9f8af has b5723620cfe40e2e4e8cbdcb105d6ae534be1753: pool is degraded because rendering fails with "": "Failed to render configuration for pool master: parsing Ignition config failed: unknown version. Supported spec versions: 2.2, 3.0, 3.1", retrying'
master: 'pool is degraded because rendering fails with "": "Failed to render configuration for pool master: parsing Ignition config failed: unknown version. Supported spec versions: 2.2, 3.0, 3.1"'
worker: 'pool is degraded because rendering fails with "": "Failed to render configuration for pool worker: parsing Ignition config failed: unknown version. Supported spec versions: 2.2, 3.0, 3.1"'

I think that's another bug in the MCO that will block rollbacks. I'm not sure how to handle it from a bug-tracking perspective; it won't show up until the network error is fixed.

> Not sure how to handle that from a bug tracking perspective; it won't show up until the network error is fixed.
Separate bug filed after this one gets to MODIFIED or later makes sense to me.
(In reply to W. Trevor King from comment #6)
> > Not sure how to handle that from a bug tracking perspective; it won't show up until the network error is fixed.
>
> Separate bug filed after this one gets to MODIFIED or later makes sense to me.

Sounds good to me.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438
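As an aid for triaging similar reports, "no route to host" storms like the ones quoted in the description can be summarized by counting failed list calls per resource type. A minimal sketch, using a trimmed stand-in for the real operator log:

```shell
# Count "no route to host" list failures per watched resource type.
# The log excerpt below is a shortened stand-in for the real operator log.
cat > /tmp/operator.log <<'EOF'
E0317 00:44:22.308929 1 reflector.go:178] informers.go:75: Failed to list *v1.PrometheusRule: Get "https://172.30.0.1:443/...": dial tcp 172.30.0.1:443: connect: no route to host
E0317 00:44:22.308944 1 reflector.go:178] informers.go:75: Failed to list *v1.ConfigMap: Get "https://172.30.0.1:443/...": dial tcp 172.30.0.1:443: connect: no route to host
E0317 00:44:22.308965 1 reflector.go:178] informers.go:75: Failed to list *v1.PrometheusRule: Get "https://172.30.0.1:443/...": dial tcp 172.30.0.1:443: connect: no route to host
EOF

# Extract the resource type from each failing line, then tally.
grep 'no route to host' /tmp/operator.log \
  | sed -n 's/.*Failed to list \(\*v1\.[A-Za-z]*\).*/\1/p' \
  | sort | uniq -c | sort -rn
# prints counts: 2 *v1.PrometheusRule, 1 *v1.ConfigMap (whitespace may vary)
```

A skew toward one resource type can point at a misbehaving informer; a uniform spread across all types, as in this bug, points at node-level networking instead.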