Bug 1940207 - 4.7->4.6 rollbacks stuck on prometheusrules admission webhook "no route to host"
Summary: 4.7->4.6 rollbacks stuck on prometheusrules admission webhook "no route to host"
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.6
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: 4.8.0
Assignee: Federico Paolinelli
QA Contact: zhaozhanqi
URL:
Whiteboard:
Depends On:
Blocks: 1947477 1950261
 
Reported: 2021-03-17 19:49 UTC by W. Trevor King
Modified: 2021-07-27 22:54 UTC
CC: 13 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Clones: 1947477 (view as bug list)
Environment:
Last Closed: 2021-07-27 22:53:56 UTC
Target Upstream Version:




Links
System ID Private Priority Status Summary Last Updated
Github openshift machine-config-operator pull 2506 0 None open Bug 1940207: create the ovs-config-executed file to signal ovs is running on the host 2021-04-06 14:33:26 UTC
Red Hat Product Errata RHSA-2021:2438 0 None None None 2021-07-27 22:54:17 UTC

Description W. Trevor King 2021-03-17 19:49:59 UTC
4.6->4.7->4.6 update jobs are permafailing [1].  Picking a recent job [2], the build log shows the step timing out:

 
 {"component":"entrypoint","file":"prow/entrypoint/run.go:165","func":"k8s.io/test-infra/prow/entrypoint.Options.ExecuteProcess","level":"error","msg":"Process did not finish before 3h0m0s timeout","severity":"error","time":"2021-03-17T00:37:49Z"}

ClusterVersion is stuck early [3]:

                    {
                        "lastTransitionTime": "2021-03-16T21:41:26Z",
                        "message": "Working towards 4.6.21: 1% complete",
                        "status": "True",
                        "type": "Progressing"
                    },
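
(That condition comes straight from [3]; for reference, it can be pulled with something like the following, assuming the usual gather-extra list format:)

  $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.7-upgrade-from-stable-4.6-e2e-aws-upgrade-rollback/1371929264202452992/artifacts/e2e-aws-upgrade-rollback/gather-extra/artifacts/clusterversion.json | jq '.items[].status.conditions[] | select(.type == "Progressing")'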

Checking the ClusterOperators:

  $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.7-upgrade-from-stable-4.6-e2e-aws-upgrade-rollback/1371929264202452992/artifacts/e2e-aws-upgrade-rollback/gather-extra/artifacts/clusteroperators.json | jq -r '.items[] | (.status.versions[] | select(.name == "operator").version) + " " + .metadata.name' | sort
  ...
  4.6.21 storage
  4.7.0-0.ci-2021-03-15-164837 baremetal
  4.7.0-0.ci-2021-03-15-164837 machine-config

baremetal is new in 4.7, so it is expected to still report the 4.7 version.  machine-config is in the process of rolling us back to 4.6; rolling the nodes gives us a new cluster-version operator pod, and we don't preserve completion percentage across CVO restarts, which is why the CVO is only claiming 1%.  Here's where the CVO is stuck:

  $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.7-upgrade-from-stable-4.6-e2e-aws-upgrade-rollback/1371929264202452992/artifacts/e2e-aws-upgrade-rollback/gather-extra/artifacts/pods/openshift-cluster-version_cluster-version-operator-798b6d49d9-rg5lp_cluster-version-operator.log | grep 'Running sync.*state\|Result of work' | tail -n5
  I0317 00:31:42.787489       1 sync_worker.go:513] Running sync registry.build02.ci.openshift.org/ci-op-t7wzziyv/release@sha256:6ae80e777c206b7314732aff542be105db892bf0e114a6757cb9e34662b8f891 (force=true) on generation 3 in state Updating at attempt 13
  I0317 00:31:42.922011       1 task_graph.go:555] Result of work: []
  I0317 00:37:24.701243       1 task_graph.go:555] Result of work: [Could not update prometheusrule "openshift-cluster-version/cluster-version-operator" (9 of 619): the server is reporting an internal error]
  I0317 00:40:18.223604       1 sync_worker.go:513] Running sync registry.build02.ci.openshift.org/ci-op-t7wzziyv/release@sha256:6ae80e777c206b7314732aff542be105db892bf0e114a6757cb9e34662b8f891 (force=true) on generation 3 in state Updating at attempt 14
  I0317 00:40:18.365749       1 task_graph.go:555] Result of work: []
  $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.7-upgrade-from-stable-4.6-e2e-aws-upgrade-rollback/1371929264202452992/artifacts/e2e-aws-upgrade-rollback/gather-extra/artifacts/pods/openshift-cluster-version_cluster-version-operator-798b6d49d9-rg5lp_cluster-version-operator.log | grep 'error running apply for prometheusrule.*cluster-version-operator' | tail -n1
  E0317 00:43:58.615624       1 task.go:81] error running apply for prometheusrule "openshift-cluster-version/cluster-version-operator" (9 of 619): Internal error occurred: failed calling webhook "prometheusrules.openshift.io": Post "https://prometheus-operator.openshift-monitoring.svc:8080/admission-prometheusrules/validate?timeout=5s": dial tcp 10.129.0.38:8080: connect: no route to host

I'm not clear on the "no route to host" bit, but possibly the webhook configuration is new in 4.7 and is not getting removed by the 4.6 monitoring components?
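
One way to check that theory (a hypothetical spot-check against a live cluster; the gathered artifacts may not include webhook configurations) would be to list the ValidatingWebhookConfigurations whose webhooks target the prometheus-operator service named in the error:

  $ oc get validatingwebhookconfigurations -o json | jq -r '.items[] | select([.webhooks[].clientConfig.service.name] | index("prometheus-operator")) | .metadata.name'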

[1]: https://testgrid.k8s.io/redhat-openshift-ocp-release-4.7-informing#periodic-ci-openshift-release-master-ci-4.7-upgrade-from-stable-4.6-e2e-aws-upgrade-rollback
[2]: https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.7-upgrade-from-stable-4.6-e2e-aws-upgrade-rollback/1371929264202452992#1:build-log.txt%3A168
[3]: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.7-upgrade-from-stable-4.6-e2e-aws-upgrade-rollback/1371929264202452992/artifacts/e2e-aws-upgrade-rollback/gather-extra/artifacts/clusterversion.json

Comment 1 Simon Pasquier 2021-03-18 14:22:55 UTC
Looking at the endpoint and pod resources, nothing seems wrong:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.7-upgrade-from-stable-4.6-e2e-aws-upgrade-rollback/1371929264202452992/artifacts/e2e-aws-upgrade-rollback/gather-extra/artifacts/endpoints.json |jq '.items | map(. | select(.metadata.name == "prometheus-operator") | .subsets)'

[
  [
    {
      "addresses": [
        {
          "ip": "10.129.0.38",
          "nodeName": "ip-10-0-159-237.ec2.internal",
          "targetRef": {
            "kind": "Pod",
            "name": "prometheus-operator-5d47c59c7c-mw8cq",
            "namespace": "openshift-monitoring",
            "resourceVersion": "80128",
            "uid": "a0bf1d06-8008-4337-9201-4ed7db053432"
          }
        }
      ],
      "ports": [
        {
          "name": "web",
          "port": 8080,
          "protocol": "TCP"
        },
        {
          "name": "https",
          "port": 8443,
          "protocol": "TCP"
        }
      ]
    }
  ]
]

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.7-upgrade-from-stable-4.6-e2e-aws-upgrade-rollback/1371929264202452992/artifacts/e2e-aws-upgrade-rollback/gather-extra/artifacts/pods.json | jq '.items | map(. | select(.metadata.name == "prometheus-operator-5d47c59c7c-mw8cq") | .metadata.name + " " + .status.podIP + " " + .spec.nodeName)'
[
  "prometheus-operator-5d47c59c7c-mw8cq 10.129.0.38 ip-10-0-159-237.ec2.internal"
]

The prometheus-operator logs [1] show lots of connect errors with the same "no route to host" issue: 

E0317 00:44:22.308929       1 reflector.go:178] github.com/coreos/prometheus-operator/pkg/informers/informers.go:75: Failed to list *v1.PrometheusRule: Get "https://172.30.0.1:443/apis/monitoring.coreos.com/v1/namespaces/openshift-cluster-samples-operator/prometheusrules?resourceVersion=83959": dial tcp 172.30.0.1:443: connect: no route to host
E0317 00:44:22.308944       1 reflector.go:178] github.com/coreos/prometheus-operator/pkg/informers/informers.go:75: Failed to list *v1.ConfigMap: Get "https://172.30.0.1:443/api/v1/namespaces/openshift-controller-manager-operator/configmaps?labelSelector=prometheus-name&resourceVersion=84656": dial tcp 172.30.0.1:443: connect: no route to host
E0317 00:44:22.308965       1 reflector.go:178] github.com/coreos/prometheus-operator/pkg/informers/informers.go:75: Failed to list *v1.PrometheusRule: Get "https://172.30.0.1:443/apis/monitoring.coreos.com/v1/namespaces/openshift-authentication-operator/prometheusrules?resourceVersion=83959": dial tcp 172.30.0.1:443: connect: no route to host
E0317 00:44:22.308987       1 reflector.go:178] github.com/coreos/prometheus-operator/pkg/informers/informers.go:75: Failed to list *v1.ConfigMap: Get "https://172.30.0.1:443/api/v1/namespaces/openshift-ingress/configmaps?labelSelector=prometheus-name&resourceVersion=84656": dial tcp 172.30.0.1:443: connect: no route to host
E0317 00:44:22.309002       1 reflector.go:178] github.com/coreos/prometheus-operator/pkg/informers/informers.go:75: Failed to list *v1.ServiceMonitor: Get "https://172.30.0.1:443/apis/monitoring.coreos.com/v1/namespaces/openshift-multus/servicemonitors?resourceVersion=84164": dial tcp 172.30.0.1:443: connect: no route to host

Same goes for cluster-monitoring-operator [2]:

E0317 00:43:30.787516       1 reflector.go:127] github.com/openshift/cluster-monitoring-operator/pkg/operator/operator.go:197: Failed to watch *v1.ConfigMap: failed to list *v1.ConfigMap: Get "https://172.30.0.1:443/api/v1/namespaces/kube-system/configmaps?resourceVersion=84838": dial tcp 172.30.0.1:443: connect: no route to host
E0317 00:43:48.707537       1 reflector.go:127] github.com/openshift/cluster-monitoring-operator/pkg/operator/operator.go:194: Failed to watch *v1.ConfigMap: failed to list *v1.ConfigMap: Get "https://172.30.0.1:443/api/v1/namespaces/openshift-monitoring/configmaps?resourceVersion=84501": dial tcp 172.30.0.1:443: connect: no route to host
E0317 00:43:51.779555       1 reflector.go:127] github.com/openshift/cluster-monitoring-operator/pkg/operator/operator.go:197: Failed to watch *v1.ConfigMap: failed to list *v1.ConfigMap: Get "https://172.30.0.1:443/api/v1/namespaces/openshift-config-managed/configmaps?resourceVersion=84501": dial tcp 172.30.0.1:443: connect: no route to host
E0317 00:44:06.371587       1 reflector.go:127] github.com/openshift/cluster-monitoring-operator/pkg/operator/operator.go:197: Failed to watch *v1.ConfigMap: failed to list *v1.ConfigMap: Get "https://172.30.0.1:443/api/v1/namespaces/kube-system/configmaps?resourceVersion=84838": dial tcp 172.30.0.1:443: connect: no route to host
E0317 00:44:09.444569       1 reflector.go:127] github.com/openshift/cluster-monitoring-operator/pkg/operator/operator.go:197: Failed to watch *v1.ConfigMap: failed to list *v1.ConfigMap: Get "https://172.30.0.1:443/api/v1/namespaces/openshift-user-workload-monitoring/configmaps?resourceVersion=84501": dial tcp 172.30.0.1:443: connect: no route to host
E0317 00:44:12.579576       1 reflector.go:127] github.com/openshift/cluster-monitoring-operator/pkg/operator/operator.go:197: Failed to watch *v1.ConfigMap: failed to list *v1.ConfigMap: Get "https://172.30.0.1:443/api/v1/namespaces/openshift-config/configmaps?resourceVersion=84501": dial tcp 172.30.0.1:443: connect: no route to host
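
Both operators are failing to reach 172.30.0.1 (the kubernetes service ClusterIP), not just the webhook endpoint. A rough way to gauge how widespread this is (illustrative; the same approach works for [2]) is to count the matching lines in the gathered log [1]:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.7-upgrade-from-stable-4.6-e2e-aws-upgrade-rollback/1371929264202452992/artifacts/e2e-aws-upgrade-rollback/gather-extra/artifacts/pods/openshift-monitoring_prometheus-operator-5d47c59c7c-mw8cq_prometheus-operator.log | grep -c 'no route to host'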

Based on the current information, I'm reassigning to the Networking component.

[1] https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.7-upgrade-from-stable-4.6-e2e-aws-upgrade-rollback/1371929264202452992/artifacts/e2e-aws-upgrade-rollback/gather-extra/artifacts/pods/openshift-monitoring_prometheus-operator-5d47c59c7c-mw8cq_prometheus-operator.log
[2] https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.7-upgrade-from-stable-4.6-e2e-aws-upgrade-rollback/1371929264202452992/artifacts/e2e-aws-upgrade-rollback/gather-extra/artifacts/pods/openshift-monitoring_cluster-monitoring-operator-655bc474db-97dgn_cluster-monitoring-operator.log

Comment 3 Federico Paolinelli 2021-03-31 16:26:34 UTC
I think I found the reason.

The 4.6 ovs pod relies [1] on a file dropped by the MCO [2] to understand whether ovs is already running on the host.
That file is not created anymore in 4.7.

When we roll back, CNO is rolled back before MCO, so the 4.6 ovs pod starts while the node still has the 4.7 version of that systemd unit, which no longer creates the sentinel file [3].
The result is two instances of ovs running on the node at the same time, which is likely what causes the errors I am seeing in the ovs logs (a rough sketch of the check follows the links below). I am testing the rollback manually right now.



[1] https://github.com/openshift/cluster-network-operator/blob/bb19869f526665792d4e42effee98afc4688e766/bindata/network/openshift-sdn/sdn-ovs.yaml#L55
[2] https://github.com/openshift/machine-config-operator/blob/0d140929e3758f5bac3e50c561b467fada11a1ed/templates/common/_base/files/configure-ovs-network.yaml#L17
[3] https://github.com/openshift/machine-config-operator/blob/6c42eaa4d333d2c575540eec7dc866e7cce527d7/templates/common/_base/files/configure-ovs-network.yaml#L7
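
As a rough sketch of the mechanism (illustrative only; the real paths and commands are in the manifests linked above, and the sentinel file name is taken from the title of the attached PR):

# 4.6 MCO systemd unit: after configuring OVS on the host, drop a sentinel file
# (the exact path here is an assumption for illustration).
touch /run/ovs-config-executed

# 4.6 sdn ovs container entrypoint: decide whether OVS is already running on the host.
if [ -f /run/ovs-config-executed ]; then
  echo "OVS is managed by the host; not starting the containerized OVS"
  sleep infinity
else
  echo "no sentinel file; starting OVS inside the container"
  /usr/share/openvswitch/scripts/ovs-ctl start --system-id=random
fi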

Comment 4 Federico Paolinelli 2021-03-31 16:27:26 UTC
And here is the list of operators (from the last failed job):

omg get clusteroperators           
NAME                                      VERSION                       AVAILABLE  PROGRESSING  DEGRADED  SINCE
authentication                            4.6.23                        True       False        True      45m
baremetal                                 4.7.0-0.ci-2021-03-30-013032  True       False        False     2h40m
cloud-credential                          4.6.23                        True       False        False     1h43m
cluster-autoscaler                        4.6.23                        True       False        False     1h36m
config-operator                           4.6.23                        True       False        False     3h38m
console                                   4.6.23                        True       False        False     1h24m
csi-snapshot-controller                   4.6.23                        True       False        False     1h43m
dns                                       4.7.0-0.ci-2021-03-30-013032  True       False        True      1h34m
etcd                                      4.6.23                        True       False        False     2h4m
image-registry                            4.6.23                        True       True         False     1h23m
ingress                                   4.6.23                        True       False        False     1h39m
insights                                  4.6.23                        True       False        False     3h32m
kube-apiserver                            4.6.23                        True       False        False     1h50m
kube-controller-manager                   4.6.23                        True       False        False     1h49m
kube-scheduler                            4.6.23                        True       False        False     1h48m
kube-storage-version-migrator             4.6.23                        True       False        False     1h43m
machine-api                               4.6.23                        True       False        True      1h18m
machine-approver                          4.6.23                        True       False        False     3h37m
machine-config                            4.7.0-0.ci-2021-03-30-013032  True       False        False     2h9m
marketplace                               4.6.23                        True       False        False     1h42m
monitoring                                4.6.23                        True       False        False     1h37m
network                                   4.6.23                        True       False        False     1h24m
node-tuning                               4.6.23                        True       False        False     1h42m
openshift-apiserver                       4.6.23                        True       False        False     1h44m
openshift-controller-manager              4.6.23                        True       False        False     1h42m
openshift-samples                         4.6.23                        True       False        False     1h39m
operator-lifecycle-manager                4.6.23                        True       False        False     1h34m
operator-lifecycle-manager-catalog        4.6.23                        True       True         False     1h39m
operator-lifecycle-manager-packageserver  4.6.23                        False      True         False     10m
service-ca                                4.6.23                        True       False        False     1h38m
storage                                   4.6.23                        True       False        False     1h23m

Comment 5 Federico Paolinelli 2021-04-06 14:37:04 UTC
I tested the fix; sdn works with it, but the rollback is stopped by MCO with:

    lastSyncError: 'pool master has not progressed to latest configuration: controller version mismatch for rendered-master-cb2db7df54e993c796b76a2242b3e08a expected d5dc2b519aed5b3ed6a6ab9e7f70f33740f9f8af has b5723620cfe40e2e4e8cbdcb105d6ae534be1753: pool is degraded because rendering fails with "": "Failed to render configuration for pool master: parsing Ignition config failed: unknown version. Supported spec versions: 2.2, 3.0, 3.1", retrying'
    master: 'pool is degraded because rendering fails with "": "Failed to render configuration for pool master: parsing Ignition config failed: unknown version. Supported spec versions: 2.2, 3.0, 3.1"'
    worker: 'pool is degraded because rendering fails with "": "Failed to render configuration for pool worker: parsing Ignition config failed: unknown version. Supported spec versions: 2.2, 3.0, 3.1"'
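
That looks like the 4.6 machine-config controller refusing rendered configs produced with the newer Ignition spec used by the 4.7 MCO (presumably 3.2, which is missing from the supported list above). One hypothetical way to confirm which spec version each rendered config carries on a live cluster:

  $ oc get machineconfig -o custom-columns=NAME:.metadata.name,IGNITION:.spec.config.ignition.version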

I think that's another bug on MCO that will block rollbacks.

Not sure how to handle that from a bug tracking perspective; it won't show up until the network error is fixed.

Comment 6 W. Trevor King 2021-04-06 18:12:51 UTC
> Not sure how to handle that from a bug tracking perspective; it won't show up until the network error is fixed.

Filing a separate bug after this one gets to MODIFIED or later makes sense to me.

Comment 7 Federico Paolinelli 2021-04-07 07:34:32 UTC
(In reply to W. Trevor King from comment #6)
> > Not sure how to handle that from a bug tracking perspective; it won't show up until the network error is fixed.
> 
> Filing a separate bug after this one gets to MODIFIED or later makes sense to me.

Sounds good to me.

Comment 21 errata-xmlrpc 2021-07-27 22:53:56 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438

