Bug 1940207

Summary: 4.7->4.6 rollbacks stuck on prometheusrules admission webhook "no route to host"
Product: OpenShift Container Platform
Reporter: W. Trevor King <wking>
Component: Networking
Networking sub component: openshift-sdn
Assignee: Federico Paolinelli <fpaoline>
QA Contact: zhaozhanqi <zzhao>
Status: CLOSED ERRATA
Severity: urgent
Priority: urgent
CC: aconstan, alegrand, anpicker, ccoleman, crawford, erooth, fpaoline, jerzhang, kakkoyun, lcosic, pkrupa, spasquie, surbania
Version: 4.6
Keywords: Upgrades
Target Release: 4.8.0
Hardware: Unspecified
OS: Unspecified
Doc Type: No Doc Update
Type: Bug
Clones: 1947477
Bug Blocks: 1947477, 1950261
Last Closed: 2021-07-27 22:53:56 UTC

Description W. Trevor King 2021-03-17 19:49:59 UTC
4.6->4.7->4.6 update jobs are permafailing [1].  Picking a recent job [2], all the build log shows is the step timing out:

 
 {"component":"entrypoint","file":"prow/entrypoint/run.go:165","func":"k8s.io/test-infra/prow/entrypoint.Options.ExecuteProcess","level":"error","msg":"Process did not finish before 3h0m0s timeout","severity":"error","time":"2021-03-17T00:37:49Z"}

ClusterVersion is stuck early [3]:

                    {
                        "lastTransitionTime": "2021-03-16T21:41:26Z",
                        "message": "Working towards 4.6.21: 1% complete",
                        "status": "True",
                        "type": "Progressing"
                    },
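
For reference, that Progressing condition can be pulled straight out of the gathered clusterversion.json [3]; this is just a convenience command in the same style as the ones below, assuming the usual gather-extra layout:

  $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.7-upgrade-from-stable-4.6-e2e-aws-upgrade-rollback/1371929264202452992/artifacts/e2e-aws-upgrade-rollback/gather-extra/artifacts/clusterversion.json | jq '.items[].status.conditions[] | select(.type == "Progressing")'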

Checking the ClusterOperators:

  $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.7-upgrade-from-stable-4.6-e2e-aws-upgrade-rollback/1371929264202452992/artifacts/e2e-aws-upgrade-rollback/gather-extra/artifacts/clusteroperators.json | jq -r '.items[] | (.status.versions[] | select(.name == "operator").version) + " " + .metadata.name' | sort
  ...
  4.6.21 storage
  4.7.0-0.ci-2021-03-15-164837 baremetal
  4.7.0-0.ci-2021-03-15-164837 machine-config
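
The same artifact can be filtered for operators still reporting a 4.7 operator version (same jq as above, just piped through grep; added here as a convenience):

  $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.7-upgrade-from-stable-4.6-e2e-aws-upgrade-rollback/1371929264202452992/artifacts/e2e-aws-upgrade-rollback/gather-extra/artifacts/clusteroperators.json | jq -r '.items[] | (.status.versions[] | select(.name == "operator").version) + " " + .metadata.name' | grep '^4\.7'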

baremetal is new in 4.7, which is why it hasn't moved.  machine-config is still in the middle of rolling us back to 4.6, and rolling the nodes gives us a new cluster-version operator pod; we don't preserve completion percentage across CVO restarts, which is why the CVO is only claiming 1%.  Here's where the CVO is stuck:

  $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.7-upgrade-from-stable-4.6-e2e-aws-upgrade-rollback/1371929264202452992/artifacts/e2e-aws-upgrade-rollback/gather-extra/artifacts/pods/openshift-cluster-version_cluster-version-operator-798b6d49d9-rg5lp_cluster-version-operator.log | grep 'Running sync.*state\|Result of work' | tail -n5
  I0317 00:31:42.787489       1 sync_worker.go:513] Running sync registry.build02.ci.openshift.org/ci-op-t7wzziyv/release@sha256:6ae80e777c206b7314732aff542be105db892bf0e114a6757cb9e34662b8f891 (force=true) on generation 3 in state Updating at attempt 13
  I0317 00:31:42.922011       1 task_graph.go:555] Result of work: []
  I0317 00:37:24.701243       1 task_graph.go:555] Result of work: [Could not update prometheusrule "openshift-cluster-version/cluster-version-operator" (9 of 619): the server is reporting an internal error]
  I0317 00:40:18.223604       1 sync_worker.go:513] Running sync registry.build02.ci.openshift.org/ci-op-t7wzziyv/release@sha256:6ae80e777c206b7314732aff542be105db892bf0e114a6757cb9e34662b8f891 (force=true) on generation 3 in state Updating at attempt 14
  I0317 00:40:18.365749       1 task_graph.go:555] Result of work: []
  $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.7-upgrade-from-stable-4.6-e2e-aws-upgrade-rollback/1371929264202452992/artifacts/e2e-aws-upgrade-rollback/gather-extra/artifacts/pods/openshift-cluster-version_cluster-version-operator-798b6d49d9-rg5lp_cluster-version-operator.log | grep 'error running apply for prometheusrule.*cluster-version-operator' | tail -n1
  E0317 00:43:58.615624       1 task.go:81] error running apply for prometheusrule "openshift-cluster-version/cluster-version-operator" (9 of 619): Internal error occurred: failed calling webhook "prometheusrules.openshift.io": Post "https://prometheus-operator.openshift-monitoring.svc:8080/admission-prometheusrules/validate?timeout=5s": dial tcp 10.129.0.38:8080: connect: no route to host

I'm not clear on the "no route to host" bit, but possibly something about the webhook is new in 4.7 and is not being removed by the 4.6 monitoring components?
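
One way to check how that webhook is registered and which service it targets would be something like the following (a diagnostic suggestion, not something that was run as part of this triage):

  $ oc get validatingwebhookconfigurations -o yaml | grep -B2 -A8 'prometheusrules.openshift.io'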

[1]: https://testgrid.k8s.io/redhat-openshift-ocp-release-4.7-informing#periodic-ci-openshift-release-master-ci-4.7-upgrade-from-stable-4.6-e2e-aws-upgrade-rollback
[2]: https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.7-upgrade-from-stable-4.6-e2e-aws-upgrade-rollback/1371929264202452992#1:build-log.txt%3A168
[3]: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.7-upgrade-from-stable-4.6-e2e-aws-upgrade-rollback/1371929264202452992/artifacts/e2e-aws-upgrade-rollback/gather-extra/artifacts/clusterversion.json

Comment 1 Simon Pasquier 2021-03-18 14:22:55 UTC
Looking at the endpoint and pod resources, nothing seems wrong:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.7-upgrade-from-stable-4.6-e2e-aws-upgrade-rollback/1371929264202452992/artifacts/e2e-aws-upgrade-rollback/gather-extra/artifacts/endpoints.json |jq '.items | map(. | select(.metadata.name == "prometheus-operator") | .subsets)'

[
  [
    {
      "addresses": [
        {
          "ip": "10.129.0.38",
          "nodeName": "ip-10-0-159-237.ec2.internal",
          "targetRef": {
            "kind": "Pod",
            "name": "prometheus-operator-5d47c59c7c-mw8cq",
            "namespace": "openshift-monitoring",
            "resourceVersion": "80128",
            "uid": "a0bf1d06-8008-4337-9201-4ed7db053432"
          }
        }
      ],
      "ports": [
        {
          "name": "web",
          "port": 8080,
          "protocol": "TCP"
        },
        {
          "name": "https",
          "port": 8443,
          "protocol": "TCP"
        }
      ]
    }
  ]
]

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.7-upgrade-from-stable-4.6-e2e-aws-upgrade-rollback/1371929264202452992/artifacts/e2e-aws-upgrade-rollback/gather-extra/artifacts/pods.json | jq '.items | map(. | select(.metadata.name == "prometheus-operator-5d47c59c7c-mw8cq") | .metadata.name + " " + .status.podIP + " " + .spec.nodeName)'
[
  "prometheus-operator-5d47c59c7c-mw8cq 10.129.0.38 ip-10-0-159-237.ec2.internal"
]

The prometheus-operator logs [1] show lots of connect errors with the same "no route to host" issue: 

E0317 00:44:22.308929       1 reflector.go:178] github.com/coreos/prometheus-operator/pkg/informers/informers.go:75: Failed to list *v1.PrometheusRule: Get "https://172.30.0.1:443/apis/monitoring.coreos.com/v1/namespaces/openshift-cluster-samples-operator/prometheusrules?resourceVersion=83959": dial tcp 172.30.0.1:443: connect: no route to host
E0317 00:44:22.308944       1 reflector.go:178] github.com/coreos/prometheus-operator/pkg/informers/informers.go:75: Failed to list *v1.ConfigMap: Get "https://172.30.0.1:443/api/v1/namespaces/openshift-controller-manager-operator/configmaps?labelSelector=prometheus-name&resourceVersion=84656": dial tcp 172.30.0.1:443: connect: no route to host
E0317 00:44:22.308965       1 reflector.go:178] github.com/coreos/prometheus-operator/pkg/informers/informers.go:75: Failed to list *v1.PrometheusRule: Get "https://172.30.0.1:443/apis/monitoring.coreos.com/v1/namespaces/openshift-authentication-operator/prometheusrules?resourceVersion=83959": dial tcp 172.30.0.1:443: connect: no route to host
E0317 00:44:22.308987       1 reflector.go:178] github.com/coreos/prometheus-operator/pkg/informers/informers.go:75: Failed to list *v1.ConfigMap: Get "https://172.30.0.1:443/api/v1/namespaces/openshift-ingress/configmaps?labelSelector=prometheus-name&resourceVersion=84656": dial tcp 172.30.0.1:443: connect: no route to host
E0317 00:44:22.309002       1 reflector.go:178] github.com/coreos/prometheus-operator/pkg/informers/informers.go:75: Failed to list *v1.ServiceMonitor: Get "https://172.30.0.1:443/apis/monitoring.coreos.com/v1/namespaces/openshift-multus/servicemonitors?resourceVersion=84164": dial tcp 172.30.0.1:443: connect: no route to host

Same goes for cluster-monitoring-operator [2]:

E0317 00:43:30.787516       1 reflector.go:127] github.com/openshift/cluster-monitoring-operator/pkg/operator/operator.go:197: Failed to watch *v1.ConfigMap: failed to list *v1.ConfigMap: Get "https://172.30.0.1:443/api/v1/namespaces/kube-system/configmaps?resourceVersion=84838": dial tcp 172.30.0.1:443: connect: no route to host
E0317 00:43:48.707537       1 reflector.go:127] github.com/openshift/cluster-monitoring-operator/pkg/operator/operator.go:194: Failed to watch *v1.ConfigMap: failed to list *v1.ConfigMap: Get "https://172.30.0.1:443/api/v1/namespaces/openshift-monitoring/configmaps?resourceVersion=84501": dial tcp 172.30.0.1:443: connect: no route to host
E0317 00:43:51.779555       1 reflector.go:127] github.com/openshift/cluster-monitoring-operator/pkg/operator/operator.go:197: Failed to watch *v1.ConfigMap: failed to list *v1.ConfigMap: Get "https://172.30.0.1:443/api/v1/namespaces/openshift-config-managed/configmaps?resourceVersion=84501": dial tcp 172.30.0.1:443: connect: no route to host
E0317 00:44:06.371587       1 reflector.go:127] github.com/openshift/cluster-monitoring-operator/pkg/operator/operator.go:197: Failed to watch *v1.ConfigMap: failed to list *v1.ConfigMap: Get "https://172.30.0.1:443/api/v1/namespaces/kube-system/configmaps?resourceVersion=84838": dial tcp 172.30.0.1:443: connect: no route to host
E0317 00:44:09.444569       1 reflector.go:127] github.com/openshift/cluster-monitoring-operator/pkg/operator/operator.go:197: Failed to watch *v1.ConfigMap: failed to list *v1.ConfigMap: Get "https://172.30.0.1:443/api/v1/namespaces/openshift-user-workload-monitoring/configmaps?resourceVersion=84501": dial tcp 172.30.0.1:443: connect: no route to host
E0317 00:44:12.579576       1 reflector.go:127] github.com/openshift/cluster-monitoring-operator/pkg/operator/operator.go:197: Failed to watch *v1.ConfigMap: failed to list *v1.ConfigMap: Get "https://172.30.0.1:443/api/v1/namespaces/openshift-config/configmaps?resourceVersion=84501": dial tcp 172.30.0.1:443: connect: no route to host
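
Both components are failing to reach the kube-apiserver service IP (172.30.0.1) from that node, which points at a node-level networking problem rather than anything specific to the webhook itself. A quick way to confirm could be something like the following from a debug shell on the node (note this checks from the host network; a pod-network check would need a test pod scheduled on the same node):

  $ oc debug node/ip-10-0-159-237.ec2.internal -- curl -sk --max-time 5 https://172.30.0.1:443/version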

Based on the current information, I'm reassigning to the Networking component.

[1] https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.7-upgrade-from-stable-4.6-e2e-aws-upgrade-rollback/1371929264202452992/artifacts/e2e-aws-upgrade-rollback/gather-extra/artifacts/pods/openshift-monitoring_prometheus-operator-5d47c59c7c-mw8cq_prometheus-operator.log
[2] https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.7-upgrade-from-stable-4.6-e2e-aws-upgrade-rollback/1371929264202452992/artifacts/e2e-aws-upgrade-rollback/gather-extra/artifacts/pods/openshift-monitoring_cluster-monitoring-operator-655bc474db-97dgn_cluster-monitoring-operator.log

Comment 3 Federico Paolinelli 2021-03-31 16:26:34 UTC
I think I found the reason.

The 4.6 ovs pod relies [1] on a file dropped by MCO [2] to understand if ovs is running on the host or not.
That file is not created anymore in 4.7.

When we roll back, CNO is rolled back before MCO, so it starts with the 4.7 version of that systemd unit, which no longer creates the sentinel file [3].
The result is two instances of ovs running on the node at the same time, which likely causes the errors I am seeing in the ovs logs.  I am testing the rollback manually right now.
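
As a rough sketch of the pattern described above (the sentinel path and the start command are illustrative placeholders, not the exact contents of the linked manifests):

  # 4.6 ovs container entrypoint, simplified
  if [ -f /run/ovs-configured-by-host ]; then   # hypothetical sentinel dropped by the MCO systemd unit
    echo "ovs is managed by the host; not starting it in the pod"
    sleep infinity
  else
    # No sentinel: assume ovs is not running on the host and start it in the pod.
    # After the rollback the 4.7 unit no longer drops the sentinel, so this branch
    # runs even though host ovs is active, leaving two ovs instances on the node.
    /usr/share/openvswitch/scripts/ovs-ctl start --system-id=random
  fi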



[1] https://github.com/openshift/cluster-network-operator/blob/bb19869f526665792d4e42effee98afc4688e766/bindata/network/openshift-sdn/sdn-ovs.yaml#L55
[2] https://github.com/openshift/machine-config-operator/blob/0d140929e3758f5bac3e50c561b467fada11a1ed/templates/common/_base/files/configure-ovs-network.yaml#L17
[3] https://github.com/openshift/machine-config-operator/blob/6c42eaa4d333d2c575540eec7dc866e7cce527d7/templates/common/_base/files/configure-ovs-network.yaml#L7

Comment 4 Federico Paolinelli 2021-03-31 16:27:26 UTC
And here is the list of operators (from the last failed job):

omg get clusteroperators           
NAME                                      VERSION                       AVAILABLE  PROGRESSING  DEGRADED  SINCE
authentication                            4.6.23                        True       False        True      45m
baremetal                                 4.7.0-0.ci-2021-03-30-013032  True       False        False     2h40m
cloud-credential                          4.6.23                        True       False        False     1h43m
cluster-autoscaler                        4.6.23                        True       False        False     1h36m
config-operator                           4.6.23                        True       False        False     3h38m
console                                   4.6.23                        True       False        False     1h24m
csi-snapshot-controller                   4.6.23                        True       False        False     1h43m
dns                                       4.7.0-0.ci-2021-03-30-013032  True       False        True      1h34m
etcd                                      4.6.23                        True       False        False     2h4m
image-registry                            4.6.23                        True       True         False     1h23m
ingress                                   4.6.23                        True       False        False     1h39m
insights                                  4.6.23                        True       False        False     3h32m
kube-apiserver                            4.6.23                        True       False        False     1h50m
kube-controller-manager                   4.6.23                        True       False        False     1h49m
kube-scheduler                            4.6.23                        True       False        False     1h48m
kube-storage-version-migrator             4.6.23                        True       False        False     1h43m
machine-api                               4.6.23                        True       False        True      1h18m
machine-approver                          4.6.23                        True       False        False     3h37m
machine-config                            4.7.0-0.ci-2021-03-30-013032  True       False        False     2h9m
marketplace                               4.6.23                        True       False        False     1h42m
monitoring                                4.6.23                        True       False        False     1h37m
network                                   4.6.23                        True       False        False     1h24m
node-tuning                               4.6.23                        True       False        False     1h42m
openshift-apiserver                       4.6.23                        True       False        False     1h44m
openshift-controller-manager              4.6.23                        True       False        False     1h42m
openshift-samples                         4.6.23                        True       False        False     1h39m
operator-lifecycle-manager                4.6.23                        True       False        False     1h34m
operator-lifecycle-manager-catalog        4.6.23                        True       True         False     1h39m
operator-lifecycle-manager-packageserver  4.6.23                        False      True         False     10m
service-ca                                4.6.23                        True       False        False     1h38m
storage                                   4.6.23                        True       False        False     1h23m

Comment 5 Federico Paolinelli 2021-04-06 14:37:04 UTC
I tested the fix; sdn works with it, but the rollback is then stopped by MCO with:

    lastSyncError: 'pool master has not progressed to latest configuration: controller version mismatch for rendered-master-cb2db7df54e993c796b76a2242b3e08a expected d5dc2b519aed5b3ed6a6ab9e7f70f33740f9f8af has b5723620cfe40e2e4e8cbdcb105d6ae534be1753: pool is degraded because rendering fails with "": "Failed to render configuration for pool master: parsing Ignition config failed: unknown version. Supported spec versions: 2.2, 3.0, 3.1", retrying'
    master: 'pool is degraded because rendering fails with "": "Failed to render configuration for pool master: parsing Ignition config failed: unknown version. Supported spec versions: 2.2, 3.0, 3.1"'
    worker: 'pool is degraded because rendering fails with "": "Failed to render configuration for pool worker: parsing Ignition config failed: unknown version. Supported spec versions: 2.2, 3.0, 3.1"'
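
The same degraded state should be visible directly on the pools with something like the following (field paths per the MachineConfigPool API; shown as an example, not output from this run):

  $ oc get machineconfigpool master -o jsonpath='{.status.conditions[?(@.type=="Degraded")].message}{"\n"}'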

I think that's another bug on MCO that will block rollbacks.

Not sure how to handle that from a bug-tracking perspective; it won't show up until the network error is fixed.

Comment 6 W. Trevor King 2021-04-06 18:12:51 UTC
> Not sure how to handle that from a bug-tracking perspective; it won't show up until the network error is fixed.

Separate bug filed after this one gets to MODIFIED or later makes sense to me.

Comment 7 Federico Paolinelli 2021-04-07 07:34:32 UTC
(In reply to W. Trevor King from comment #6)
> Not sure how to handle that from a bug-tracking perspective; it won't show up until the network error is fixed.
> 
> Separate bug filed after this one gets to MODIFIED or later makes sense to
> me.

Sounds good to me

Comment 21 errata-xmlrpc 2021-07-27 22:53:56 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438