Bug 1812223 - Upgrade from 4.2.22 to 4.3.5 failing due to OCM resync failure.
Summary: Upgrade from 4.2.22 to 4.3.5 failing due to OCM resync failure.
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Monitoring
Version: 4.3.z
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: low
Target Milestone: ---
Target Release: 4.5.0
Assignee: Lili Cosic
QA Contact: Junqi Zhao
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-03-10 19:20 UTC by Simon
Modified: 2020-05-06 08:13 UTC
CC List: 13 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-05-06 08:13:28 UTC
Target Upstream Version:
Embargoed:



Description Simon 2020-03-10 19:20:11 UTC
Description of problem:
Upgrade from 4.2.22 to 4.3.5 is failing because of issues with OCM.

How reproducible:
1 attempt so far

Steps to Reproduce:
1. Upgrade OCP 4.2.22 to 4.3.5
2. Check clusteroperators
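A minimal sketch of these two steps as commands (the channel name below is an assumption; use whichever channel carries 4.3.5):
```
# Point the ClusterVersion at a channel that includes 4.3.5
oc patch clusterversion version --type merge -p '{"spec":{"channel":"stable-4.3"}}'

# Request the update and watch overall progress
oc adm upgrade --to=4.3.5
oc get clusterversion

# Step 2: verify every operator reaches the target version
oc get clusteroperators
```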

Actual results:
The machine-config operator is not updated.

Expected results:
All operators are updated


Additional info:

```
oc get clusteroperators.config.openshift.io machine-config -o yaml

apiVersion: config.openshift.io/v1
kind: ClusterOperator
metadata:
  creationTimestamp: "2020-03-10T14:57:51Z"
  generation: 1
  name: machine-config
  resourceVersion: "182792"
  selfLink: /apis/config.openshift.io/v1/clusteroperators/machine-config
  uid: 83b3f3d2-62df-11ea-a4f9-06f81ba6124a
spec: {}
status:
  conditions:
  - lastTransitionTime: "2020-03-10T15:24:21Z"
    message: Cluster not available for 4.2.22
    status: "False"
    type: Available
  - lastTransitionTime: "2020-03-10T14:58:53Z"
    message: Cluster version is 4.2.22
    status: "False"
    type: Progressing
  - lastTransitionTime: "2020-03-10T15:24:21Z"
    message: 'Failed to resync 4.2.22 because: timed out waiting for the condition
      during syncRequiredMachineConfigPools: error pool master is not ready, retrying.
      Status: (pool degraded: false total: 3, ready 0, updated: 0, unavailable: 1)'
    reason: RequiredPoolsFailed
    status: "True"
    type: Degraded
  - lastTransitionTime: "2020-03-10T14:58:53Z"
    reason: AsExpected
    status: "True"
    type: Upgradeable
  extension:
    lastSyncError: 'error pool master is not ready, retrying. Status: (pool degraded:
      false total: 3, ready 0, updated: 0, unavailable: 1)'
    master: 0 (ready 0) out of 3 nodes are updating to latest configuration rendered-master-35cd0d33b5c46b6fb4cde7a1b2fa131a
    worker: 0 (ready 0) out of 3 nodes are updating to latest configuration rendered-worker-3011b7012f42ff93c169237b4b5f4098
  relatedObjects:
  - group: ""
    name: openshift-machine-config-operator
    resource: namespaces
  - group: machineconfiguration.openshift.io
    name: master
    resource: machineconfigpools
  - group: machineconfiguration.openshift.io
    name: worker
    resource: machineconfigpools
  - group: machineconfiguration.openshift.io
    name: machine-config-controller
    resource: controllerconfigs
  versions:
  - name: operator
    version: 4.2.22
```
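To dig into why machine-config reports RequiredPoolsFailed, the usual next step is to look at the pools and nodes directly; a rough sketch:
```
# Pool-level view: machine counts, updated/ready/degraded
oc get machineconfigpools
oc describe machineconfigpool master

# Node status and the operator's own conditions
oc get nodes
oc describe clusteroperator machine-config
```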

Comment 1 Kirsten Garrison 2020-03-10 19:22:52 UTC
Please provide a must-gather from the cluster.

Comment 3 Kirsten Garrison 2020-03-10 20:44:03 UTC
Logging in with the kubeconfig, I see:
```
$ oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.2.22    True        True          3h37m   Unable to apply 4.3.5: some cluster configuration is invalid
```

It looks like the apiserver only just came up, and several operators are still not updated:
```
$ oc get clusteroperators
NAME                                       VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.3.5     True        False         False      5h20m
cloud-credential                           4.3.5     True        False         False      5h35m
cluster-autoscaler                         4.3.5     True        False         False      5h25m
console                                    4.3.5     True        False         False      3h26m
dns                                        4.2.22    True        False         False      5h34m
image-registry                             4.3.5     True        False         False      5h25m
ingress                                    4.3.5     True        False         False      3h29m
insights                                   4.3.5     True        False         False      5h35m
kube-apiserver                             4.3.5     True        True          False      5h32m
kube-controller-manager                    4.3.5     True        True          False      5h32m
kube-scheduler                             4.3.5     True        True          False      5h33m
machine-api                                4.3.5     True        False         False      5h35m
machine-config                             4.2.22    False       False         True       5h9m
marketplace                                4.3.5     True        False         False      3h29m
monitoring                                 4.2.22    False       True          True       3h27m
network                                    4.2.22    True        False         False      5h34m
node-tuning                                4.3.5     True        False         False      3h30m
openshift-apiserver                        4.3.5     True        False         False      61s
openshift-controller-manager               4.3.5     True        False         False      5h32m
openshift-samples                          4.3.5     True        False         False      3h29m
operator-lifecycle-manager                 4.3.5     True        False         False      5h34m
operator-lifecycle-manager-catalog         4.3.5     True        False         False      5h34m
operator-lifecycle-manager-packageserver   4.3.5     True        False         False      3h29m
service-ca                                 4.3.5     True        False         False      5h35m
service-catalog-apiserver                  4.3.5     True        False         False      5h31m
service-catalog-controller-manager         4.3.5     True        False         False      5h31m
storage                                    4.3.5     True        False         False      3h30m
```
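A quick way to narrow that table down (simple sketch, assuming 4.3.5 is the target version):
```
# Operators not yet reporting the 4.3.5 version (also drops the header line)
oc get clusteroperators | grep -v 4.3.5

# Operators that are unavailable or degraded (AVAILABLE is column 3, DEGRADED is column 5)
oc get clusteroperators --no-headers | awk '$3=="False" || $5=="True"'
```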

Comment 4 Kirsten Garrison 2020-03-10 20:45:08 UTC
DNS has a bunch of issues:
```
$ oc logs -f -n openshift-dns dns-default-648hl -c dns

2020-03-10T20:42:55.105Z [ERROR] plugin/errors: 2 registry.redhat.io. AAAA: read udp 10.131.0.3:50379->10.0.0.2:53: i/o timeout
2020-03-10T20:42:55.576Z [ERROR] plugin/errors: 2 registry-1.docker.io. AAAA: read udp 10.131.0.3:35305->10.0.0.2:53: i/o timeout
2020-03-10T20:42:59.137Z [ERROR] plugin/errors: 2 skordas0310a-tvsgm-image-registry-us-west-2-pftudglhqoowtphlcu.s3.us-west-2.amazonaws.com. AAAA: read udp 10.131.0.3:56060->10.0.0.2:53: i/o timeout
2020-03-10T20:42:59.204Z [ERROR] plugin/errors: 2 quay.io. AAAA: read udp 10.131.0.3:38440->10.0.0.2:53: i/o timeout
2020-03-10T20:43:01.206Z [ERROR] plugin/errors: 2 quay.io. A: read udp 10.131.0.3:42072->10.0.0.2:53: i/o timeout
2020-03-10T20:43:01.374Z [ERROR] plugin/errors: 2 registry-1.docker.io.us-west-2.compute.internal. A: read udp 10.131.0.3:33374->10.0.0.2:53: i/o timeout
2020-03-10T20:43:01.959Z [ERROR] plugin/errors: 2 skordas0310a-tvsgm-image-registry-us-west-2-pftudglhqoowtphlcu.s3.us-west-2.amazonaws.com.us-west-2.compute.internal. A: read udp 10.131.0.3:41918->10.0.0.2:53: i/o timeout
2020-03-10T20:43:05.962Z [ERROR] plugin/errors: 2 skordas0310a-tvsgm-image-registry-us-west-2-pftudglhqoowtphlcu.s3.us-west-2.amazonaws.com. AAAA: read udp 10.131.0.3:60062->10.0.0.2:53: i/o timeout
2020-03-10T20:43:07.464Z [ERROR] plugin/errors: 2 registry.redhat.io. A: read udp 10.131.0.3:55397->10.0.0.2:53: i/o timeout
2020-03-10T20:43:07.965Z [ERROR] plugin/errors: 2 skordas0310a-tvsgm-image-registry-us-west-2-pftudglhqoowtphlcu.s3.us-west-2.amazonaws.com. AAAA: read udp 10.131.0.3:60919->10.0.0.2:53: i/o timeout
2020-03-10T20:43:08.004Z [ERROR] plugin/errors: 2 registry-1.docker.io.us-west-2.compute.internal. AAAA: read udp 10.131.0.3:54397->10.0.0.2:53: i/o timeout
2020-03-10T20:43:10.002Z [ERROR] plugin/errors: 2 skordas0310a-tvsgm-image-registry-us-west-2-pftudglhqoowtphlcu.s3.us-west-2.amazonaws.com.us-west-2.compute.internal. AAAA: read udp 10.131.0.3:40866->10.0.0.2:53: i/o timeout
2020-03-10T20:43:10.008Z [ERROR] plugin/errors: 2 registry-1.docker.io. A: read udp 10.131.0.3:51763->10.0.0.2:53: i/o timeout
2020-03-10T20:43:12.004Z [ERROR] plugin/errors: 2 skordas0310a-tvsgm-image-registry-us-west-2-pftudglhqoowtphlcu.s3.us-west-2.amazonaws.com.us-west-2.compute.internal. AAAA: read udp 10.131.0.3:55778->10.0.0.2:53: i/o timeout
2020-03-10T20:43:12.339Z [ERROR] plugin/errors: 2 auth.docker.io.us-west-2.compute.internal. AAAA: read udp 10.131.0.3:56428->10.0.0.2:53: i/o timeout
2020-03-10T20:43:12.339Z [ERROR] plugin/errors: 2 auth.docker.io.us-west-2.compute.internal. A: read udp 10.131.0.3:60564->10.0.0.2:53: i/o timeout
2020-03-10T20:43:14.343Z [ERROR] plugin/errors: 2 auth.docker.io. AAAA: read udp 10.131.0.3:52000->10.0.0.2:53: i/o timeout
2020-03-10T20:43:15.478Z [ERROR] plugin/errors: 2 registry.redhat.io. A: read udp 10.131.0.3:35092->10.0.0.2:53: i/o timeout
2020-03-10T20:43:16.683Z [ERROR] plugin/errors: 2 registry-1.docker.io. AAAA: read udp 10.131.0.3:49069->10.0.0.2:53: i/o timeout
2020-03-10T20:43:17.559Z [ERROR] plugin/errors: 2 registry.redhat.io. A: read udp 10.131.0.3:60137->10.0.0.2:53: i/o timeout
2020-03-10T20:43:19.671Z [ERROR] plugin/errors: 2 registry.redhat.io.us-west-2.compute.internal. AAAA: read udp 10.131.0.3:53410->10.0.0.2:53: i/o timeout
2020-03-10T20:43:23.019Z [ERROR] plugin/errors: 2 skordas0310a-tvsgm-image-registry-us-west-2-pftudglhqoowtphlcu.s3.us-west-2.amazonaws.com.us-west-2.compute.internal. A: read udp 10.131.0.3:41911->10.0.0.2:53: i/o timeout
2020-03-10T20:43:23.769Z [ERROR] plugin/errors: 2 registry.redhat.io. AAAA: read udp 10.131.0.3:47555->10.0.0.2:53: i/o timeout
2020-03-10T20:43:23.769Z [ERROR] plugin/errors: 2 registry.redhat.io. A: read udp 10.131.0.3:35956->10.0.0.2:53: i/o timeout
I0310 20:43:25.778065       1 trace.go:82] Trace[1721636195]: "Reflector github.com/coredns/coredns/plugin/kubernetes/controller.go:264 ListAndWatch" (started: 2020-03-10 20:42:55.777677807 +0000 UTC m=+19561.595638782) (total time: 30.000355098s):
Trace[1721636195]: [30.000355098s] [30.000355098s] END
E0310 20:43:25.778085       1 reflector.go:126] github.com/coredns/coredns/plugin/kubernetes/controller.go:264: Failed to list *v1.Service: Get https://172.30.0.1:443/api/v1/services?limit=500&resourceVersion=0: dial tcp 172.30.0.1:443: i/o timeout
I0310 20:43:25.779056       1 trace.go:82] Trace[1458976464]: "Reflector github.com/coredns/coredns/plugin/kubernetes/controller.go:266 ListAndWatch" (started: 2020-03-10 20:42:55.778744221 +0000 UTC m=+19561.596705164) (total time: 30.000290889s):
Trace[1458976464]: [30.000290889s] [30.000290889s] END
E0310 20:43:25.779072       1 reflector.go:126] github.com/coredns/coredns/plugin/kubernetes/controller.go:266: Failed to list *v1.Endpoints: Get https://172.30.0.1:443/api/v1/endpoints?limit=500&resourceVersion=0: dial tcp 172.30.0.1:443: i/o timeout
I0310 20:43:25.780079       1 trace.go:82] Trace[1908220677]: "Reflector github.com/coredns/coredns/plugin/kubernetes/controller.go:271 ListAndWatch" (started: 2020-03-10 20:42:55.779812037 +0000 UTC m=+19561.597773304) (total time: 30.000252627s):
Trace[1908220677]: [30.000252627s] [30.000252627s] END
E0310 20:43:25.780091       1 reflector.go:126] github.com/coredns/coredns/plugin/kubernetes/controller.go:271: Failed to list *v1.Namespace: Get https://172.30.0.1:443/api/v1/namespaces?limit=500&resourceVersion=0: dial tcp 172.30.0.1:443: i/o timeout
2020-03-10T20:43:27.090Z [ERROR] plugin/errors: 2 skordas0310a-tvsgm-image-registry-us-west-2-pftudglhqoowtphlcu.s3.us-west-2.amazonaws.com.us-west-2.compute.internal. AAAA: read udp 10.131.0.3:60623->10.0.0.2:53: i/o timeout
2020-03-10T20:43:27.863Z [ERROR] plugin/errors: 2 registry.redhat.io.us-west-2.compute.internal. AAAA: read udp 10.131.0.3:33351->10.0.0.2:53: i/o timeout
2020-03-10T20:43:35.055Z [ERROR] plugin/errors: 2 skordas0310a-tvsgm-image-registry-us-west-2-pftudglhqoowtphlcu.s3.us-west-2.amazonaws.com. AAAA: read udp 10.131.0.3:56889->10.0.0.2:53: i/o timeout
2020-03-10T20:43:35.055Z [ERROR] plugin/errors: 2 skordas0310a-tvsgm-image-registry-us-west-2-pftudglhqoowtphlcu.s3.us-west-2.amazonaws.com. A: read udp 10.131.0.3:37185->10.0.0.2:53: i/o timeout
2020-03-10T20:43:35.573Z [ERROR] plugin/errors: 2 registry-1.docker.io.us-west-2.compute.internal. A: read udp 10.131.0.3:37963->10.0.0.2:53: i/o timeout
2020-03-10T20:43:37.057Z [ERROR] plugin/errors: 2 skordas0310a-tvsgm-image-registry-us-west-2-pftudglhqoowtphlcu.s3.us-west-2.amazonaws.com. A: read udp 10.131.0.3:55006->10.0.0.2:53: i/o timeout
2020-03-10T20:43:44.078Z [ERROR] plugin/errors: 2 skordas0310a-tvsgm-image-registry-us-west-2-pftudglhqoowtphlcu.s3.us-west-2.amazonaws.com.us-west-2.compute.internal. AAAA: read udp 10.131.0.3:54601->10.0.0.2:53: i/o timeout
2020-03-10T20:43:46.299Z [ERROR] plugin/errors: 2 registry.redhat.io. AAAA: read udp 10.131.0.3:47394->10.0.0.2:53: i/o timeout
```
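In case it helps narrow this down, a few checks of the DNS stack itself (sketch only):
```
# DNS operator status and the per-node DNS pods
oc get clusteroperator dns
oc -n openshift-dns get pods -o wide

# Tail another DNS pod via the daemonset to see whether the upstream timeouts are cluster-wide
oc -n openshift-dns logs ds/dns-default -c dns --tail=50
```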

Comment 5 Kirsten Garrison 2020-03-10 21:20:37 UTC
I don't think this is an MCO-driven failure, given that the CVO error is `Unable to apply 4.3.5: some cluster configuration is invalid` and we can see that the apiserver took a very long time to come up, DNS is not updated, the network operator is not updated, and a must-gather hasn't been run...

```
machine.openshift.io/machine: openshift-machine-api/skordas0310a-tvsgm-master-0
    machineconfiguration.openshift.io/currentConfig: rendered-master-dda6182ea623f68d853aa998601e892b
    machineconfiguration.openshift.io/desiredConfig: rendered-master-35cd0d33b5c46b6fb4cde7a1b2fa131a
    machineconfiguration.openshift.io/state: Working
    ovn_host_subnet: 10.129.0.0/23
    volumes.kubernetes.io/controller-managed-attach-detach: "true"
```
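The annotations above show this master still on the old rendered config (currentConfig != desiredConfig). One way to check that per node (sketch; <master-node> is a placeholder for the node backing skordas0310a-tvsgm-master-0):
```
# Compare current vs. desired MachineConfig on the stuck master
# (dots inside annotation keys have to be escaped in jsonpath)
oc get node <master-node> -o jsonpath='{.metadata.annotations.machineconfiguration\.openshift\.io/currentConfig}{"\n"}{.metadata.annotations.machineconfiguration\.openshift\.io/desiredConfig}{"\n"}'
```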

I think I found the reason why your cluster configuration is invalid:
```
$ oc logs -f -n openshift-monitoring cluster-monitoring-operator-698c7bc7df-57nf6 

E0310 21:14:52.025657       1 operator.go:271] Syncing "openshift-monitoring/cluster-monitoring-config" failed
E0310 21:14:52.028551       1 operator.go:272] sync "openshift-monitoring/cluster-monitoring-config" failed: running task Updating Alertmanager failed: reconciling Alertmanager object failed: updating Alertmanager object failed: Alertmanager.monitoring.coreos.com "main" is invalid: []: Invalid value: map[string]interface {}{"apiVersion":"monitoring.coreos.com/v1", "kind":"Alertmanager", "metadata":map[string]interface {}{"creationTimestamp":"2020-03-10T15:07:26Z", "generation":3, "labels":map[string]interface {}{"alertmanager":"main"}, "name":"main", "namespace":"openshift-monitoring", "resourceVersion":"13855", "uid":"dab3f19e-62e0-11ea-aff7-0a50e3591b0a"}, "spec":map[string]interface {}{"affinity":map[string]interface {}{"podAntiAffinity":map[string]interface {}{"preferredDuringSchedulingIgnoredDuringExecution":[]interface {}{map[string]interface {}{"podAffinityTerm":map[string]interface {}{"labelSelector":map[string]interface {}{"matchExpressions":[]interface {}{map[string]interface {}{"key":"alertmanager", "operator":"In", "values":[]interface {}{"main"}}}}, "namespaces":[]interface {}{"openshift-monitoring"}, "topologyKey":"kubernetes.io/hostname"}, "weight":100}}}}, "baseImage":"openshift/prometheus-alertmanager", "containers":[]interface {}{map[string]interface {}{"args":[]interface {}{"-provider=openshift", "-https-address=:9095", "-http-address=", "-email-domain=*", "-upstream=http://localhost:9093", "-openshift-sar={\"resource\": \"namespaces\", \"verb\": \"get\"}", "-openshift-delegate-urls={\"/\": {\"resource\": \"namespaces\", \"verb\": \"get\"}}", "-tls-cert=/etc/tls/private/tls.crt", "-tls-key=/etc/tls/private/tls.key", "-client-secret-file=/var/run/secrets/kubernetes.io/serviceaccount/token", "-cookie-secret-file=/etc/proxy/secrets/session_secret", "-openshift-service-account=alertmanager-main", "-openshift-ca=/etc/pki/tls/cert.pem", "-openshift-ca=/var/run/secrets/kubernetes.io/serviceaccount/ca.crt", "-skip-auth-regex=^/metrics"}, "env":[]interface {}{map[string]interface {}{"name":"HTTP_PROXY"}, map[string]interface {}{"name":"HTTPS_PROXY"}, map[string]interface {}{"name":"NO_PROXY"}}, "image":"quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:e2cc56931552bee158ed77d2f0a7e75202aba9204612f370d08630ba3992ec77", "name":"alertmanager-proxy", "ports":[]interface {}{map[string]interface {}{"containerPort":9095, "name":"web"}}, "resources":map[string]interface {}{}, "terminationMessagePolicy":"FallbackToLogsOnError", "volumeMounts":[]interface {}{map[string]interface {}{"mountPath":"/etc/tls/private", "name":"secret-alertmanager-main-tls"}, map[string]interface {}{"mountPath":"/etc/proxy/secrets", "name":"secret-alertmanager-main-proxy"}, map[string]interface {}{"mountPath":"/etc/pki/ca-trust/extracted/pem/", "name":"alertmanager-trusted-ca-bundle", "readOnly":true}}}}, "externalUrl":"https://alertmanager-main-openshift-monitoring.apps.skordas0310a.perf-testing.devcluster.openshift.com/", "image":"quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:4b5db4653b80fb2c8eb2c17ba466921e34c473ebfa144b2ac8943996fada7685", "listenLocal":true, "nodeSelector":map[string]interface {}{"kubernetes.io/os":"linux"}, "priorityClassName":"system-cluster-critical", "replicas":3, "resources":map[string]interface {}{}, "secrets":[]interface {}{"alertmanager-main-tls", "alertmanager-main-proxy"}, "securityContext":map[string]interface {}{}, "serviceAccountName":"alertmanager-main", "version":"v0.15.2", "volumeMounts":[]interface {}{map[string]interface 
{}{"mountPath":"/etc/pki/ca-trust/extracted/pem/", "name":"alertmanager-trusted-ca-bundle", "readOnly":true}}, "volumes":[]interface {}{map[string]interface {}{"configMap":map[string]interface {}{"items":[]interface {}{map[string]interface {}{"key":"ca-bundle.crt", "path":"tls-ca-bundle.pem"}}, "name":"alertmanager-trusted-ca-bundle-39man1pbaa8jq", "optional":true}, "name":"alertmanager-trusted-ca-bundle"}}}}: validation failure list:
type in spec.volumes.configMap is required
```
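For readability, the spec.volumes stanza of the rejected Alertmanager object, reconstructed from the error dump above (not a literal manifest), is roughly:
```
spec:
  volumes:
  - name: alertmanager-trusted-ca-bundle
    configMap:
      name: alertmanager-trusted-ca-bundle-39man1pbaa8jq
      optional: true
      items:
      - key: ca-bundle.crt
        path: tls-ca-bundle.pem
```
This is the stanza that trips the `type in spec.volumes.configMap is required` validation.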

Comment 6 Kirsten Garrison 2020-03-10 21:21:57 UTC
FTR, I was able to upgrade successfully on a vanilla AWS cluster, so this failure might have something to do with the reporter's setup.

Passing this over to Monitoring; maybe they can advise on this error.

Comment 8 Vadim Rutkovsky 2020-03-12 06:47:24 UTC
Removed UpgradeBlocker, as we haven't hit this in any of the test upgrades yet.

Comment 10 Lili Cosic 2020-04-23 05:34:03 UTC
> is potentially a transient log error during upgrades. Rationale: we rely on k8s 1.17 API validations in 4.3 code. So if the control plane is still at k8s 1.16 the above error occurs and goes away once control plane upgrades to k8s 1.17.

Yes, agreed. This isn't a problem on our side; if anything, the validation fix should be backported upstream. Related upstream issue: https://github.com/kubernetes-sigs/kubefed/issues/469

I suggest we close this, as it's a transient error and we reconcile eventually.
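If this shows up again, a quick way to confirm whether the control plane was still on the older Kubernetes level when the error appeared (sketch):
```
# Kubernetes level reported by the apiserver the client is talking to
oc version

# Rollout state of the kube-apiserver during the upgrade
oc get clusteroperator kube-apiserver
```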

Comment 11 Lili Cosic 2020-05-06 08:13:28 UTC
Closing as per the last comment: transient errors can happen, but the cluster eventually reconciles, so this is not a bug.

