4.6->4.7->4.6 update jobs are permafailing [1]. Picking a recent job [2], the build log shows the step timing out:

{"component":"entrypoint","file":"prow/entrypoint/run.go:165","func":"k8s.io/test-infra/prow/entrypoint.Options.ExecuteProcess","level":"error","msg":"Process did not finish before 3h0m0s timeout","severity":"error","time":"2021-03-17T00:37:49Z"}

ClusterVersion is stuck early [3]:

{
  "lastTransitionTime": "2021-03-16T21:41:26Z",
  "message": "Working towards 4.6.21: 1% complete",
  "status": "True",
  "type": "Progressing"
},

Checking the ClusterOperators:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.7-upgrade-from-stable-4.6-e2e-aws-upgrade-rollback/1371929264202452992/artifacts/e2e-aws-upgrade-rollback/gather-extra/artifacts/clusteroperators.json | jq -r '.items[] | (.status.versions[] | select(.name == "operator").version) + " " + .metadata.name' | sort
...
4.6.21 storage
4.7.0-0.ci-2021-03-15-164837 baremetal
4.7.0-0.ci-2021-03-15-164837 machine-config

baremetal is new in 4.7, so that's why it hasn't moved. machine-config is in the process of rolling us back to 4.6. Rolling nodes gives us a new cluster-version operator pod, and we don't preserve completion percentage across CVO restarts, which is why the CVO is only claiming 1%. Here's where the CVO is stuck:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.7-upgrade-from-stable-4.6-e2e-aws-upgrade-rollback/1371929264202452992/artifacts/e2e-aws-upgrade-rollback/gather-extra/artifacts/pods/openshift-cluster-version_cluster-version-operator-798b6d49d9-rg5lp_cluster-version-operator.log | grep 'Running sync.*state\|Result of work' | tail -n5
I0317 00:31:42.787489 1 sync_worker.go:513] Running sync registry.build02.ci.openshift.org/ci-op-t7wzziyv/release@sha256:6ae80e777c206b7314732aff542be105db892bf0e114a6757cb9e34662b8f891 (force=true) on generation 3 in state Updating at attempt 13
I0317 00:31:42.922011 1 task_graph.go:555] Result of work: []
I0317 00:37:24.701243 1 task_graph.go:555] Result of work: [Could not update prometheusrule "openshift-cluster-version/cluster-version-operator" (9 of 619): the server is reporting an internal error]
I0317 00:40:18.223604 1 sync_worker.go:513] Running sync registry.build02.ci.openshift.org/ci-op-t7wzziyv/release@sha256:6ae80e777c206b7314732aff542be105db892bf0e114a6757cb9e34662b8f891 (force=true) on generation 3 in state Updating at attempt 14
I0317 00:40:18.365749 1 task_graph.go:555] Result of work: []

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.7-upgrade-from-stable-4.6-e2e-aws-upgrade-rollback/1371929264202452992/artifacts/e2e-aws-upgrade-rollback/gather-extra/artifacts/pods/openshift-cluster-version_cluster-version-operator-798b6d49d9-rg5lp_cluster-version-operator.log | grep 'error running apply for prometheusrule.*cluster-version-operator' | tail -n1
E0317 00:43:58.615624 1 task.go:81] error running apply for prometheusrule "openshift-cluster-version/cluster-version-operator" (9 of 619): Internal error occurred: failed calling webhook "prometheusrules.openshift.io": Post "https://prometheus-operator.openshift-monitoring.svc:8080/admission-prometheusrules/validate?timeout=5s": dial tcp 10.129.0.38:8080: connect: no route to host

I'm not clear on the "no route to host" bit, but possibly something about the webhook was new in 4.7,
and is not getting removed by the 4.6 monitoring components?

[1]: https://testgrid.k8s.io/redhat-openshift-ocp-release-4.7-informing#periodic-ci-openshift-release-master-ci-4.7-upgrade-from-stable-4.6-e2e-aws-upgrade-rollback
[2]: https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.7-upgrade-from-stable-4.6-e2e-aws-upgrade-rollback/1371929264202452992#1:build-log.txt%3A168
[3]: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.7-upgrade-from-stable-4.6-e2e-aws-upgrade-rollback/1371929264202452992/artifacts/e2e-aws-upgrade-rollback/gather-extra/artifacts/clusterversion.json
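As a follow-up check on a live cluster in this state (not something captured in the gathered artifacts above), here is a minimal sketch of how one could confirm where the 4.7-era webhook registration is pointing after the rollback; the webhook name comes from the error log, and the service fields are the standard admissionregistration.k8s.io/v1 schema:

$ oc get validatingwebhookconfigurations -o json | jq -r '.items[] | .metadata.name as $n | .webhooks[]? | select(.name == "prometheusrules.openshift.io") | $n + " -> " + .clientConfig.service.namespace + "/" + .clientConfig.service.name + ":" + ((.clientConfig.service.port // 443) | tostring)'

Comparing the registered service/port against the rolled-back prometheus-operator's endpoints would show whether the registration itself is stale, or whether this is purely a connectivity problem.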
Looking at the endpoint and pod resources, nothing seems wrong:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.7-upgrade-from-stable-4.6-e2e-aws-upgrade-rollback/1371929264202452992/artifacts/e2e-aws-upgrade-rollback/gather-extra/artifacts/endpoints.json | jq '.items | map(. | select(.metadata.name == "prometheus-operator") | .subsets)'
[
  [
    {
      "addresses": [
        {
          "ip": "10.129.0.38",
          "nodeName": "ip-10-0-159-237.ec2.internal",
          "targetRef": {
            "kind": "Pod",
            "name": "prometheus-operator-5d47c59c7c-mw8cq",
            "namespace": "openshift-monitoring",
            "resourceVersion": "80128",
            "uid": "a0bf1d06-8008-4337-9201-4ed7db053432"
          }
        }
      ],
      "ports": [
        {
          "name": "web",
          "port": 8080,
          "protocol": "TCP"
        },
        {
          "name": "https",
          "port": 8443,
          "protocol": "TCP"
        }
      ]
    }
  ]
]

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.7-upgrade-from-stable-4.6-e2e-aws-upgrade-rollback/1371929264202452992/artifacts/e2e-aws-upgrade-rollback/gather-extra/artifacts/pods.json | jq '.items | map(. | select(.metadata.name == "prometheus-operator-5d47c59c7c-mw8cq") | .metadata.name + " " + .status.podIP + " " + .spec.nodeName)'
[
  "prometheus-operator-5d47c59c7c-mw8cq 10.129.0.38 ip-10-0-159-237.ec2.internal"
]

The prometheus-operator logs [1] show lots of connect errors with the same "no route to host" issue:

E0317 00:44:22.308929 1 reflector.go:178] github.com/coreos/prometheus-operator/pkg/informers/informers.go:75: Failed to list *v1.PrometheusRule: Get "https://172.30.0.1:443/apis/monitoring.coreos.com/v1/namespaces/openshift-cluster-samples-operator/prometheusrules?resourceVersion=83959": dial tcp 172.30.0.1:443: connect: no route to host
E0317 00:44:22.308944 1 reflector.go:178] github.com/coreos/prometheus-operator/pkg/informers/informers.go:75: Failed to list *v1.ConfigMap: Get "https://172.30.0.1:443/api/v1/namespaces/openshift-controller-manager-operator/configmaps?labelSelector=prometheus-name&resourceVersion=84656": dial tcp 172.30.0.1:443: connect: no route to host
E0317 00:44:22.308965 1 reflector.go:178] github.com/coreos/prometheus-operator/pkg/informers/informers.go:75: Failed to list *v1.PrometheusRule: Get "https://172.30.0.1:443/apis/monitoring.coreos.com/v1/namespaces/openshift-authentication-operator/prometheusrules?resourceVersion=83959": dial tcp 172.30.0.1:443: connect: no route to host
E0317 00:44:22.308987 1 reflector.go:178] github.com/coreos/prometheus-operator/pkg/informers/informers.go:75: Failed to list *v1.ConfigMap: Get "https://172.30.0.1:443/api/v1/namespaces/openshift-ingress/configmaps?labelSelector=prometheus-name&resourceVersion=84656": dial tcp 172.30.0.1:443: connect: no route to host
E0317 00:44:22.309002 1 reflector.go:178] github.com/coreos/prometheus-operator/pkg/informers/informers.go:75: Failed to list *v1.ServiceMonitor: Get "https://172.30.0.1:443/apis/monitoring.coreos.com/v1/namespaces/openshift-multus/servicemonitors?resourceVersion=84164": dial tcp 172.30.0.1:443: connect: no route to host

Same goes for cluster-monitoring-operator [2]:

E0317 00:43:30.787516 1 reflector.go:127] github.com/openshift/cluster-monitoring-operator/pkg/operator/operator.go:197: Failed to watch *v1.ConfigMap: failed to list *v1.ConfigMap: Get "https://172.30.0.1:443/api/v1/namespaces/kube-system/configmaps?resourceVersion=84838": dial tcp 172.30.0.1:443: connect: no route to host
E0317 00:43:48.707537 1 reflector.go:127] github.com/openshift/cluster-monitoring-operator/pkg/operator/operator.go:194: Failed to watch *v1.ConfigMap: failed to list *v1.ConfigMap: Get "https://172.30.0.1:443/api/v1/namespaces/openshift-monitoring/configmaps?resourceVersion=84501": dial tcp 172.30.0.1:443: connect: no route to host
E0317 00:43:51.779555 1 reflector.go:127] github.com/openshift/cluster-monitoring-operator/pkg/operator/operator.go:197: Failed to watch *v1.ConfigMap: failed to list *v1.ConfigMap: Get "https://172.30.0.1:443/api/v1/namespaces/openshift-config-managed/configmaps?resourceVersion=84501": dial tcp 172.30.0.1:443: connect: no route to host
E0317 00:44:06.371587 1 reflector.go:127] github.com/openshift/cluster-monitoring-operator/pkg/operator/operator.go:197: Failed to watch *v1.ConfigMap: failed to list *v1.ConfigMap: Get "https://172.30.0.1:443/api/v1/namespaces/kube-system/configmaps?resourceVersion=84838": dial tcp 172.30.0.1:443: connect: no route to host
E0317 00:44:09.444569 1 reflector.go:127] github.com/openshift/cluster-monitoring-operator/pkg/operator/operator.go:197: Failed to watch *v1.ConfigMap: failed to list *v1.ConfigMap: Get "https://172.30.0.1:443/api/v1/namespaces/openshift-user-workload-monitoring/configmaps?resourceVersion=84501": dial tcp 172.30.0.1:443: connect: no route to host
E0317 00:44:12.579576 1 reflector.go:127] github.com/openshift/cluster-monitoring-operator/pkg/operator/operator.go:197: Failed to watch *v1.ConfigMap: failed to list *v1.ConfigMap: Get "https://172.30.0.1:443/api/v1/namespaces/openshift-config/configmaps?resourceVersion=84501": dial tcp 172.30.0.1:443: connect: no route to host

Based on the current information, I'm reassigning to the Networking component.

[1] https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.7-upgrade-from-stable-4.6-e2e-aws-upgrade-rollback/1371929264202452992/artifacts/e2e-aws-upgrade-rollback/gather-extra/artifacts/pods/openshift-monitoring_prometheus-operator-5d47c59c7c-mw8cq_prometheus-operator.log
[2] https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.7-upgrade-from-stable-4.6-e2e-aws-upgrade-rollback/1371929264202452992/artifacts/e2e-aws-upgrade-rollback/gather-extra/artifacts/pods/openshift-monitoring_cluster-monitoring-operator-655bc474db-97dgn_cluster-monitoring-operator.log
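Since both the webhook call and the operators' apiserver calls fail with "no route to host", one quick follow-up is to see whether the affected pods all sit on the same node. A sketch against the already-gathered pods.json (not something run as part of this triage; the fields are the standard pod schema):

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.7-upgrade-from-stable-4.6-e2e-aws-upgrade-rollback/1371929264202452992/artifacts/e2e-aws-upgrade-rollback/gather-extra/artifacts/pods.json | jq -r '.items[] | select(.metadata.namespace == "openshift-monitoring") | (.spec.nodeName // "-") + " " + (.status.podIP // "-") + " " + .metadata.name' | sort

If everything that cannot reach 172.30.0.1 lands on the same node, that points at node-local networking rather than a cluster-wide outage.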
I think I found the reason. The 4.6 ovs pod relies [1] on a file dropped by the MCO [2] to determine whether ovs is already running on the host. That file is no longer created in 4.7. When we roll back, the CNO is rolled back before the MCO, so the 4.6 ovs pod starts while the host is still running the 4.7 version of that systemd unit, which no longer creates the sentinel file [3]. The result is two instances of ovs running on the node at the same time, which are likely the cause of the errors I am seeing in the ovs logs. I am testing the rollback manually right now.

[1] https://github.com/openshift/cluster-network-operator/blob/bb19869f526665792d4e42effee98afc4688e766/bindata/network/openshift-sdn/sdn-ovs.yaml#L55
[2] https://github.com/openshift/machine-config-operator/blob/0d140929e3758f5bac3e50c561b467fada11a1ed/templates/common/_base/files/configure-ovs-network.yaml#L17
[3] https://github.com/openshift/machine-config-operator/blob/6c42eaa4d333d2c575540eec7dc866e7cce527d7/templates/common/_base/files/configure-ovs-network.yaml#L7
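For anyone trying to confirm the duplicate-ovs theory on an affected node, a minimal sketch (the node name is a placeholder and the host unit name is an assumption; this is not output from the job):

$ oc debug node/<node-name> -- chroot /host pgrep -a ovs-vswitchd
$ oc debug node/<node-name> -- chroot /host systemctl is-active ovs-vswitchd

Seeing two ovs-vswitchd processes (one from the host systemd unit, one started by the sdn ovs pod) while the systemd unit reports active would match the theory above.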
And here is the list of operators (from the last failed job):

omg get clusteroperators
NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE
authentication 4.6.23 True False True 45m
baremetal 4.7.0-0.ci-2021-03-30-013032 True False False 2h40m
cloud-credential 4.6.23 True False False 1h43m
cluster-autoscaler 4.6.23 True False False 1h36m
config-operator 4.6.23 True False False 3h38m
console 4.6.23 True False False 1h24m
csi-snapshot-controller 4.6.23 True False False 1h43m
dns 4.7.0-0.ci-2021-03-30-013032 True False True 1h34m
etcd 4.6.23 True False False 2h4m
image-registry 4.6.23 True True False 1h23m
ingress 4.6.23 True False False 1h39m
insights 4.6.23 True False False 3h32m
kube-apiserver 4.6.23 True False False 1h50m
kube-controller-manager 4.6.23 True False False 1h49m
kube-scheduler 4.6.23 True False False 1h48m
kube-storage-version-migrator 4.6.23 True False False 1h43m
machine-api 4.6.23 True False True 1h18m
machine-approver 4.6.23 True False False 3h37m
machine-config 4.7.0-0.ci-2021-03-30-013032 True False False 2h9m
marketplace 4.6.23 True False False 1h42m
monitoring 4.6.23 True False False 1h37m
network 4.6.23 True False False 1h24m
node-tuning 4.6.23 True False False 1h42m
openshift-apiserver 4.6.23 True False False 1h44m
openshift-controller-manager 4.6.23 True False False 1h42m
openshift-samples 4.6.23 True False False 1h39m
operator-lifecycle-manager 4.6.23 True False False 1h34m
operator-lifecycle-manager-catalog 4.6.23 True True False 1h39m
operator-lifecycle-manager-packageserver 4.6.23 False True False 10m
service-ca 4.6.23 True False False 1h38m
storage 4.6.23 True False False 1h23m
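For reference, a jq sketch that pulls just the stragglers out of a downloaded clusteroperators.json, in the same style as comment 0 (the local filename and the 4.6.23 rollback target are assumptions based on the table above):

$ jq -r '.items[] | select((.status.versions[] | select(.name == "operator").version) != "4.6.23") | .metadata.name' clusteroperators.json

For the table above that would list baremetal, dns, and machine-config.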
I tested the fix; sdn works with it, but the rollback is then stopped by the MCO with:

lastSyncError: 'pool master has not progressed to latest configuration: controller version mismatch for rendered-master-cb2db7df54e993c796b76a2242b3e08a expected d5dc2b519aed5b3ed6a6ab9e7f70f33740f9f8af has b5723620cfe40e2e4e8cbdcb105d6ae534be1753: pool is degraded because rendering fails with "": "Failed to render configuration for pool master: parsing Ignition config failed: unknown version. Supported spec versions: 2.2, 3.0, 3.1", retrying'
master: 'pool is degraded because rendering fails with "": "Failed to render configuration for pool master: parsing Ignition config failed: unknown version. Supported spec versions: 2.2, 3.0, 3.1"'
worker: 'pool is degraded because rendering fails with "": "Failed to render configuration for pool worker: parsing Ignition config failed: unknown version. Supported spec versions: 2.2, 3.0, 3.1"'

I think that's another MCO bug that will block rollbacks. Not sure how to handle that from a bug-tracking perspective; it won't show up until the network error is fixed.
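A sketch of how one might identify which MachineConfig carries the Ignition spec version the rolled-back MCO cannot parse (not something captured in this job; .spec.config.ignition.version is the standard MachineConfig field):

$ oc get machineconfigs -o json | jq -r '.items[] | .metadata.name + " " + (.spec.config.ignition.version // "none")'

Any entry reporting a spec version outside 2.2/3.0/3.1 would be the one tripping the "unknown version" rendering error above.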
> Not sure how to handle that from a bug-tracking perspective; it won't show up until the network error is fixed.

A separate bug, filed after this one gets to MODIFIED or later, makes sense to me.
(In reply to W. Trevor King from comment #6)
> > Not sure how to handle that from a bug-tracking perspective; it won't show up until the network error is fixed.
>
> A separate bug, filed after this one gets to MODIFIED or later, makes sense to me.

Sounds good to me.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438