Description of problem:
Upgrade from 4.2.22 to 4.3.5 is failing because of issues with OCM.

How reproducible:
1 attempt so far

Steps to Reproduce:
1. Upgrade OCP 4.2.22 to 4.3.5
2. Check clusteroperators

Actual results:
The machine-config operator is not updated.

Expected results:
All operators are updated.

Additional info:
```
$ oc get clusteroperators.config.openshift.io machine-config -o yaml
apiVersion: config.openshift.io/v1
kind: ClusterOperator
metadata:
  creationTimestamp: "2020-03-10T14:57:51Z"
  generation: 1
  name: machine-config
  resourceVersion: "182792"
  selfLink: /apis/config.openshift.io/v1/clusteroperators/machine-config
  uid: 83b3f3d2-62df-11ea-a4f9-06f81ba6124a
spec: {}
status:
  conditions:
  - lastTransitionTime: "2020-03-10T15:24:21Z"
    message: Cluster not available for 4.2.22
    status: "False"
    type: Available
  - lastTransitionTime: "2020-03-10T14:58:53Z"
    message: Cluster version is 4.2.22
    status: "False"
    type: Progressing
  - lastTransitionTime: "2020-03-10T15:24:21Z"
    message: 'Failed to resync 4.2.22 because: timed out waiting for the condition
      during syncRequiredMachineConfigPools: error pool master is not ready, retrying.
      Status: (pool degraded: false total: 3, ready 0, updated: 0, unavailable: 1)'
    reason: RequiredPoolsFailed
    status: "True"
    type: Degraded
  - lastTransitionTime: "2020-03-10T14:58:53Z"
    reason: AsExpected
    status: "True"
    type: Upgradeable
  extension:
    lastSyncError: 'error pool master is not ready, retrying. Status: (pool degraded:
      false total: 3, ready 0, updated: 0, unavailable: 1)'
    master: 0 (ready 0) out of 3 nodes are updating to latest configuration rendered-master-35cd0d33b5c46b6fb4cde7a1b2fa131a
    worker: 0 (ready 0) out of 3 nodes are updating to latest configuration rendered-worker-3011b7012f42ff93c169237b4b5f4098
  relatedObjects:
  - group: ""
    name: openshift-machine-config-operator
    resource: namespaces
  - group: machineconfiguration.openshift.io
    name: master
    resource: machineconfigpools
  - group: machineconfiguration.openshift.io
    name: worker
    resource: machineconfigpools
  - group: machineconfiguration.openshift.io
    name: machine-config-controller
    resource: controllerconfigs
  versions:
  - name: operator
    version: 4.2.22
```
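For reference, a minimal sketch of the reproduction, assuming the update was requested through the CLI (the exact mechanism the reporter used is not stated):

```
# Hedged sketch: request the update and watch progress (the actual channel/command used may differ)
$ oc adm upgrade --to=4.3.5     # ask the CVO to move to 4.3.5
$ oc get clusterversion         # overall upgrade status
$ oc get clusteroperators       # which operators have reached 4.3.5
```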
Please provide a must-gather from the cluster.
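For completeness, the standard way to collect that (the destination directory is just an example):

```
$ oc adm must-gather --dest-dir=./must-gather
```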
Logging in with the kubeconfig I see:
```
$ oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.2.22    True        True          3h37m   Unable to apply 4.3.5: some cluster configuration is invalid
```
It looks like the openshift-apiserver only just came up (Available for 61s), and several operators are not updated:
```
$ oc get clusteroperators
NAME                                       VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.3.5     True        False         False      5h20m
cloud-credential                           4.3.5     True        False         False      5h35m
cluster-autoscaler                         4.3.5     True        False         False      5h25m
console                                    4.3.5     True        False         False      3h26m
dns                                        4.2.22    True        False         False      5h34m
image-registry                             4.3.5     True        False         False      5h25m
ingress                                    4.3.5     True        False         False      3h29m
insights                                   4.3.5     True        False         False      5h35m
kube-apiserver                             4.3.5     True        True          False      5h32m
kube-controller-manager                    4.3.5     True        True          False      5h32m
kube-scheduler                             4.3.5     True        True          False      5h33m
machine-api                                4.3.5     True        False         False      5h35m
machine-config                             4.2.22    False       False         True       5h9m
marketplace                                4.3.5     True        False         False      3h29m
monitoring                                 4.2.22    False       True          True       3h27m
network                                    4.2.22    True        False         False      5h34m
node-tuning                                4.3.5     True        False         False      3h30m
openshift-apiserver                        4.3.5     True        False         False      61s
openshift-controller-manager               4.3.5     True        False         False      5h32m
openshift-samples                          4.3.5     True        False         False      3h29m
operator-lifecycle-manager                 4.3.5     True        False         False      5h34m
operator-lifecycle-manager-catalog         4.3.5     True        False         False      5h34m
operator-lifecycle-manager-packageserver   4.3.5     True        False         False      3h29m
service-ca                                 4.3.5     True        False         False      5h35m
service-catalog-apiserver                  4.3.5     True        False         False      5h31m
service-catalog-controller-manager         4.3.5     True        False         False      5h31m
storage                                    4.3.5     True        False         False      3h30m
```
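A quick way to see which operators are still reporting the old version, and the state of the machine config pools behind the machine-config degradation (simple filters; adjust the version string as needed):

```
$ oc get clusteroperators | grep 4.2.22      # operators still on the old payload
$ oc get machineconfigpools                  # master/worker pool progress
```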
DNS has a bunch of issues:
```
$ oc logs -f -n openshift-dns dns-default-648hl -c dns
2020-03-10T20:42:55.105Z [ERROR] plugin/errors: 2 registry.redhat.io. AAAA: read udp 10.131.0.3:50379->10.0.0.2:53: i/o timeout
2020-03-10T20:42:55.576Z [ERROR] plugin/errors: 2 registry-1.docker.io. AAAA: read udp 10.131.0.3:35305->10.0.0.2:53: i/o timeout
2020-03-10T20:42:59.137Z [ERROR] plugin/errors: 2 skordas0310a-tvsgm-image-registry-us-west-2-pftudglhqoowtphlcu.s3.us-west-2.amazonaws.com. AAAA: read udp 10.131.0.3:56060->10.0.0.2:53: i/o timeout
2020-03-10T20:42:59.204Z [ERROR] plugin/errors: 2 quay.io. AAAA: read udp 10.131.0.3:38440->10.0.0.2:53: i/o timeout
2020-03-10T20:43:01.206Z [ERROR] plugin/errors: 2 quay.io. A: read udp 10.131.0.3:42072->10.0.0.2:53: i/o timeout
2020-03-10T20:43:01.374Z [ERROR] plugin/errors: 2 registry-1.docker.io.us-west-2.compute.internal. A: read udp 10.131.0.3:33374->10.0.0.2:53: i/o timeout
2020-03-10T20:43:01.959Z [ERROR] plugin/errors: 2 skordas0310a-tvsgm-image-registry-us-west-2-pftudglhqoowtphlcu.s3.us-west-2.amazonaws.com.us-west-2.compute.internal. A: read udp 10.131.0.3:41918->10.0.0.2:53: i/o timeout
2020-03-10T20:43:05.962Z [ERROR] plugin/errors: 2 skordas0310a-tvsgm-image-registry-us-west-2-pftudglhqoowtphlcu.s3.us-west-2.amazonaws.com. AAAA: read udp 10.131.0.3:60062->10.0.0.2:53: i/o timeout
2020-03-10T20:43:07.464Z [ERROR] plugin/errors: 2 registry.redhat.io. A: read udp 10.131.0.3:55397->10.0.0.2:53: i/o timeout
2020-03-10T20:43:07.965Z [ERROR] plugin/errors: 2 skordas0310a-tvsgm-image-registry-us-west-2-pftudglhqoowtphlcu.s3.us-west-2.amazonaws.com. AAAA: read udp 10.131.0.3:60919->10.0.0.2:53: i/o timeout
2020-03-10T20:43:08.004Z [ERROR] plugin/errors: 2 registry-1.docker.io.us-west-2.compute.internal. AAAA: read udp 10.131.0.3:54397->10.0.0.2:53: i/o timeout
2020-03-10T20:43:10.002Z [ERROR] plugin/errors: 2 skordas0310a-tvsgm-image-registry-us-west-2-pftudglhqoowtphlcu.s3.us-west-2.amazonaws.com.us-west-2.compute.internal. AAAA: read udp 10.131.0.3:40866->10.0.0.2:53: i/o timeout
2020-03-10T20:43:10.008Z [ERROR] plugin/errors: 2 registry-1.docker.io. A: read udp 10.131.0.3:51763->10.0.0.2:53: i/o timeout
2020-03-10T20:43:12.004Z [ERROR] plugin/errors: 2 skordas0310a-tvsgm-image-registry-us-west-2-pftudglhqoowtphlcu.s3.us-west-2.amazonaws.com.us-west-2.compute.internal. AAAA: read udp 10.131.0.3:55778->10.0.0.2:53: i/o timeout
2020-03-10T20:43:12.339Z [ERROR] plugin/errors: 2 auth.docker.io.us-west-2.compute.internal. AAAA: read udp 10.131.0.3:56428->10.0.0.2:53: i/o timeout
2020-03-10T20:43:12.339Z [ERROR] plugin/errors: 2 auth.docker.io.us-west-2.compute.internal. A: read udp 10.131.0.3:60564->10.0.0.2:53: i/o timeout
2020-03-10T20:43:14.343Z [ERROR] plugin/errors: 2 auth.docker.io. AAAA: read udp 10.131.0.3:52000->10.0.0.2:53: i/o timeout
2020-03-10T20:43:15.478Z [ERROR] plugin/errors: 2 registry.redhat.io. A: read udp 10.131.0.3:35092->10.0.0.2:53: i/o timeout
2020-03-10T20:43:16.683Z [ERROR] plugin/errors: 2 registry-1.docker.io. AAAA: read udp 10.131.0.3:49069->10.0.0.2:53: i/o timeout
2020-03-10T20:43:17.559Z [ERROR] plugin/errors: 2 registry.redhat.io. A: read udp 10.131.0.3:60137->10.0.0.2:53: i/o timeout
2020-03-10T20:43:19.671Z [ERROR] plugin/errors: 2 registry.redhat.io.us-west-2.compute.internal. AAAA: read udp 10.131.0.3:53410->10.0.0.2:53: i/o timeout
2020-03-10T20:43:23.019Z [ERROR] plugin/errors: 2 skordas0310a-tvsgm-image-registry-us-west-2-pftudglhqoowtphlcu.s3.us-west-2.amazonaws.com.us-west-2.compute.internal. A: read udp 10.131.0.3:41911->10.0.0.2:53: i/o timeout
2020-03-10T20:43:23.769Z [ERROR] plugin/errors: 2 registry.redhat.io. AAAA: read udp 10.131.0.3:47555->10.0.0.2:53: i/o timeout
2020-03-10T20:43:23.769Z [ERROR] plugin/errors: 2 registry.redhat.io. A: read udp 10.131.0.3:35956->10.0.0.2:53: i/o timeout
I0310 20:43:25.778065 1 trace.go:82] Trace[1721636195]: "Reflector github.com/coredns/coredns/plugin/kubernetes/controller.go:264 ListAndWatch" (started: 2020-03-10 20:42:55.777677807 +0000 UTC m=+19561.595638782) (total time: 30.000355098s):
Trace[1721636195]: [30.000355098s] [30.000355098s] END
E0310 20:43:25.778085 1 reflector.go:126] github.com/coredns/coredns/plugin/kubernetes/controller.go:264: Failed to list *v1.Service: Get https://172.30.0.1:443/api/v1/services?limit=500&resourceVersion=0: dial tcp 172.30.0.1:443: i/o timeout
I0310 20:43:25.779056 1 trace.go:82] Trace[1458976464]: "Reflector github.com/coredns/coredns/plugin/kubernetes/controller.go:266 ListAndWatch" (started: 2020-03-10 20:42:55.778744221 +0000 UTC m=+19561.596705164) (total time: 30.000290889s):
Trace[1458976464]: [30.000290889s] [30.000290889s] END
E0310 20:43:25.779072 1 reflector.go:126] github.com/coredns/coredns/plugin/kubernetes/controller.go:266: Failed to list *v1.Endpoints: Get https://172.30.0.1:443/api/v1/endpoints?limit=500&resourceVersion=0: dial tcp 172.30.0.1:443: i/o timeout
I0310 20:43:25.780079 1 trace.go:82] Trace[1908220677]: "Reflector github.com/coredns/coredns/plugin/kubernetes/controller.go:271 ListAndWatch" (started: 2020-03-10 20:42:55.779812037 +0000 UTC m=+19561.597773304) (total time: 30.000252627s):
Trace[1908220677]: [30.000252627s] [30.000252627s] END
E0310 20:43:25.780091 1 reflector.go:126] github.com/coredns/coredns/plugin/kubernetes/controller.go:271: Failed to list *v1.Namespace: Get https://172.30.0.1:443/api/v1/namespaces?limit=500&resourceVersion=0: dial tcp 172.30.0.1:443: i/o timeout
2020-03-10T20:43:27.090Z [ERROR] plugin/errors: 2 skordas0310a-tvsgm-image-registry-us-west-2-pftudglhqoowtphlcu.s3.us-west-2.amazonaws.com.us-west-2.compute.internal. AAAA: read udp 10.131.0.3:60623->10.0.0.2:53: i/o timeout
2020-03-10T20:43:27.863Z [ERROR] plugin/errors: 2 registry.redhat.io.us-west-2.compute.internal. AAAA: read udp 10.131.0.3:33351->10.0.0.2:53: i/o timeout
2020-03-10T20:43:35.055Z [ERROR] plugin/errors: 2 skordas0310a-tvsgm-image-registry-us-west-2-pftudglhqoowtphlcu.s3.us-west-2.amazonaws.com. AAAA: read udp 10.131.0.3:56889->10.0.0.2:53: i/o timeout
2020-03-10T20:43:35.055Z [ERROR] plugin/errors: 2 skordas0310a-tvsgm-image-registry-us-west-2-pftudglhqoowtphlcu.s3.us-west-2.amazonaws.com. A: read udp 10.131.0.3:37185->10.0.0.2:53: i/o timeout
2020-03-10T20:43:35.573Z [ERROR] plugin/errors: 2 registry-1.docker.io.us-west-2.compute.internal. A: read udp 10.131.0.3:37963->10.0.0.2:53: i/o timeout
2020-03-10T20:43:37.057Z [ERROR] plugin/errors: 2 skordas0310a-tvsgm-image-registry-us-west-2-pftudglhqoowtphlcu.s3.us-west-2.amazonaws.com. A: read udp 10.131.0.3:55006->10.0.0.2:53: i/o timeout
2020-03-10T20:43:44.078Z [ERROR] plugin/errors: 2 skordas0310a-tvsgm-image-registry-us-west-2-pftudglhqoowtphlcu.s3.us-west-2.amazonaws.com.us-west-2.compute.internal. AAAA: read udp 10.131.0.3:54601->10.0.0.2:53: i/o timeout
2020-03-10T20:43:46.299Z [ERROR] plugin/errors: 2 registry.redhat.io. AAAA: read udp 10.131.0.3:47394->10.0.0.2:53: i/o timeout
```
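All of these lookups are timing out against 10.0.0.2:53, i.e. the upstream resolver CoreDNS forwards to from the node's /etc/resolv.conf (on AWS that address is typically the VPC resolver). A rough way to check that from an affected node (node name is a placeholder, and the tooling available on the host may vary):

```
# Hedged sketch: confirm what the node-level resolver is and whether name resolution works at all
$ oc debug node/<node-name> -- chroot /host cat /etc/resolv.conf
$ oc debug node/<node-name> -- chroot /host getent hosts quay.io
```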
I don't think this is an MCO-driven failure, given that the CVO error is `Unable to apply 4.3.5: some cluster configuration is invalid`, the apiserver took a really long time to come up, DNS and network are not updated, and a must-gather hasn't been run. For reference, the master node's annotations:
```
machine.openshift.io/machine: openshift-machine-api/skordas0310a-tvsgm-master-0
machineconfiguration.openshift.io/currentConfig: rendered-master-dda6182ea623f68d853aa998601e892b
machineconfiguration.openshift.io/desiredConfig: rendered-master-35cd0d33b5c46b6fb4cde7a1b2fa131a
machineconfiguration.openshift.io/state: Working
ovn_host_subnet: 10.129.0.0/23
volumes.kubernetes.io/controller-managed-attach-detach: "true"
```
I think I found the reason why your cluster configuration is invalid:
```
$ oc logs -f -n openshift-monitoring cluster-monitoring-operator-698c7bc7df-57nf6
E0310 21:14:52.025657 1 operator.go:271] Syncing "openshift-monitoring/cluster-monitoring-config" failed
E0310 21:14:52.028551 1 operator.go:272] sync "openshift-monitoring/cluster-monitoring-config" failed: running task Updating Alertmanager failed: reconciling Alertmanager object failed: updating Alertmanager object failed: Alertmanager.monitoring.coreos.com "main" is invalid: []: Invalid value: map[string]interface {}{"apiVersion":"monitoring.coreos.com/v1", "kind":"Alertmanager", "metadata":map[string]interface {}{"creationTimestamp":"2020-03-10T15:07:26Z", "generation":3, "labels":map[string]interface {}{"alertmanager":"main"}, "name":"main", "namespace":"openshift-monitoring", "resourceVersion":"13855", "uid":"dab3f19e-62e0-11ea-aff7-0a50e3591b0a"}, "spec":map[string]interface {}{"affinity":map[string]interface {}{"podAntiAffinity":map[string]interface {}{"preferredDuringSchedulingIgnoredDuringExecution":[]interface {}{map[string]interface {}{"podAffinityTerm":map[string]interface {}{"labelSelector":map[string]interface {}{"matchExpressions":[]interface {}{map[string]interface {}{"key":"alertmanager", "operator":"In", "values":[]interface {}{"main"}}}}, "namespaces":[]interface {}{"openshift-monitoring"}, "topologyKey":"kubernetes.io/hostname"}, "weight":100}}}}, "baseImage":"openshift/prometheus-alertmanager", "containers":[]interface {}{map[string]interface {}{"args":[]interface {}{"-provider=openshift", "-https-address=:9095", "-http-address=", "-email-domain=*", "-upstream=http://localhost:9093", "-openshift-sar={\"resource\": \"namespaces\", \"verb\": \"get\"}", "-openshift-delegate-urls={\"/\": {\"resource\": \"namespaces\", \"verb\": \"get\"}}", "-tls-cert=/etc/tls/private/tls.crt", "-tls-key=/etc/tls/private/tls.key", "-client-secret-file=/var/run/secrets/kubernetes.io/serviceaccount/token", "-cookie-secret-file=/etc/proxy/secrets/session_secret", "-openshift-service-account=alertmanager-main", "-openshift-ca=/etc/pki/tls/cert.pem", "-openshift-ca=/var/run/secrets/kubernetes.io/serviceaccount/ca.crt", "-skip-auth-regex=^/metrics"}, "env":[]interface {}{map[string]interface {}{"name":"HTTP_PROXY"}, map[string]interface {}{"name":"HTTPS_PROXY"}, map[string]interface {}{"name":"NO_PROXY"}}, "image":"quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:e2cc56931552bee158ed77d2f0a7e75202aba9204612f370d08630ba3992ec77", "name":"alertmanager-proxy", "ports":[]interface {}{map[string]interface {}{"containerPort":9095, "name":"web"}}, "resources":map[string]interface {}{}, "terminationMessagePolicy":"FallbackToLogsOnError", "volumeMounts":[]interface {}{map[string]interface {}{"mountPath":"/etc/tls/private",
"name":"secret-alertmanager-main-tls"}, map[string]interface {}{"mountPath":"/etc/proxy/secrets", "name":"secret-alertmanager-main-proxy"}, map[string]interface {}{"mountPath":"/etc/pki/ca-trust/extracted/pem/", "name":"alertmanager-trusted-ca-bundle", "readOnly":true}}}}, "externalUrl":"https://alertmanager-main-openshift-monitoring.apps.skordas0310a.perf-testing.devcluster.openshift.com/", "image":"quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:4b5db4653b80fb2c8eb2c17ba466921e34c473ebfa144b2ac8943996fada7685", "listenLocal":true, "nodeSelector":map[string]interface {}{"kubernetes.io/os":"linux"}, "priorityClassName":"system-cluster-critical", "replicas":3, "resources":map[string]interface {}{}, "secrets":[]interface {}{"alertmanager-main-tls", "alertmanager-main-proxy"}, "securityContext":map[string]interface {}{}, "serviceAccountName":"alertmanager-main", "version":"v0.15.2", "volumeMounts":[]interface {}{map[string]interface {}{"mountPath":"/etc/pki/ca-trust/extracted/pem/", "name":"alertmanager-trusted-ca-bundle", "readOnly":true}}, "volumes":[]interface {}{map[string]interface {}{"configMap":map[string]interface {}{"items":[]interface {}{map[string]interface {}{"key":"ca-bundle.crt", "path":"tls-ca-bundle.pem"}}, "name":"alertmanager-trusted-ca-bundle-39man1pbaa8jq", "optional":true}, "name":"alertmanager-trusted-ca-bundle"}}}}: validation failure list: type in spec.volumes.configMap is required ```
FTR, I was able to upgrade successfully on a vanilla AWS cluster, so something about this failure may be specific to the reporter's setup. Passing this over to monitoring; maybe they can advise on this error.
Removed UpgradeBlocker, as we haven't hit this in any of the test upgrades yet.
> is potentially a transient log error during upgrades. Rationale: we rely on k8s 1.17 API validations in 4.3 code. So if the control plane is still at k8s 1.16, the above error occurs, and it goes away once the control plane upgrades to k8s 1.17.

Yes, agreed; it's not a problem on our side. If anything, the validation problem should be backported. Related upstream issue -> https://github.com/kubernetes-sigs/kubefed/issues/469

I suggest we close this, as it's a transient error and we reconcile eventually.
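For anyone hitting the same error mid-upgrade, a quick way to check whether the control plane has finished moving to the newer Kubernetes level (at which point the error should clear on the next reconcile); these are standard commands, nothing specific to this cluster:

```
$ oc get clusteroperator kube-apiserver                    # version and Progressing status of the apiserver rollout
$ oc get nodes -l node-role.kubernetes.io/master -o wide   # kubelet version per control-plane node
$ oc version                                               # reported server (Kubernetes) version
```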
Closing as per the last comment: transient errors can happen, but it eventually reconciles, so it's not a bug.