1. Bug Overview:

a) Description of bug report: [RHOCP4.4] Unable to upgrade OCP 4.3.19 to OCP 4.4 in disconnected env

b) Bug Description:

In a disconnected environment, upgrading OCP 4.3.19 to OCP 4.4 gets stuck. It appears to be caused by nodes being upgraded before kube-apiserver and the other master components are upgraded. We have already tried to upgrade OCP 4.3 to OCP 4.4 in disconnected environments on both AWS and bare metal multiple times, and all attempts failed. (We already filed Case 02653914 for AWS.)

Version information:

$ oc version
Client Version: 4.3.19
Server Version: 4.3.19
Kubernetes Version: v1.16.2

2. Bug Details:

a) Architectures: 64-bit Intel EM64T/AMD64 x86_64
b) Bugzilla Dependencies:
c) Drivers or hardware dependencies:
d) Upstream acceptance information:
e) External links:
f) Severity (H,M,L): H
g) How reproducible: Always
h) Steps to Reproduce:

Step 1. Create an OCP 4.3.19 cluster in a disconnected environment following the document below.
https://access.redhat.com/documentation/en-us/openshift_container_platform/4.3/html/installing_on_bare_metal/installing-on-bare-metal#installing-restricted-networks-bare-metal

Step 2. Mirror the images for v4.4.4 to the local mirror registry.

$ export OCP_RELEASE=4.4.4-x86_64
$ export LOCAL_REGISTRY=<local mirror registry>
$ export LOCAL_REPOSITORY=ocp4/openshift4
$ export PRODUCT_REPO='openshift-release-dev'
$ export LOCAL_SECRET_JSON=/ocp4/pull-secret.json
$ export RELEASE_NAME="ocp-release"
$ oc adm -a ${LOCAL_SECRET_JSON} release mirror \
    --from=quay.io/${PRODUCT_REPO}/${RELEASE_NAME}:${OCP_RELEASE} \
    --to=${LOCAL_REGISTRY}/${LOCAL_REPOSITORY} \
    --to-release-image=${LOCAL_REGISTRY}/${LOCAL_REPOSITORY}:${OCP_RELEASE}

Step 3. Upgrade the cluster by using the "oc adm upgrade" command.
$ export RELEASE=4.4.4-x86_64
$ oc adm release info quay.io/openshift-release-dev/ocp-release:$RELEASE | grep Pull
Pull From: quay.io/openshift-release-dev/ocp-release@sha256:baa687f29b0ac155d8f4c6914056d36d68f343feb9c1e82b46eef95819d00be5
$ oc adm upgrade --to-image=quay.io/openshift-release-dev/ocp-release@sha256:baa687f29b0ac155d8f4c6914056d36d68f343feb9c1e82b46eef95819d00be5 --allow-explicit-upgrade --force=true
Updating to release image quay.io/openshift-release-dev/ocp-release@sha256:baa687f29b0ac155d8f4c6914056d36d68f343feb9c1e82b46eef95819d00be5

i) Actual results:

Upgrading gets stuck partway through.

$ oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.3.19    True        False         28m     Error while reconciling 4.4.4: some resources could not be updated

$ oc get co
NAME                                       VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.4.4     True        False         False      20h
cloud-credential                           4.3.19    True        False         False      20h
cluster-autoscaler                         4.4.4     True        False         False      20h
console                                    4.3.19    True        False         False      43m
csi-snapshot-controller                    4.4.4     True        False         False      12m
dns                                        4.4.4     True        False         False      20h
image-registry                             4.4.4     True        False         False      35m
ingress                                    4.3.19    True        False         False      51m
insights                                   4.3.19    True        False         False      20h
kube-apiserver                             4.3.19    True        False         True       20h
kube-controller-manager                    4.3.19    True        False         True       20h
kube-scheduler                             4.3.19    True        False         True       20h
machine-api                                4.4.4     True        False         False      20h
machine-config                             4.3.19    False       True          True       3m35s
marketplace                                4.4.4     True        False         False      13m
monitoring                                 4.4.4     True        False         False      11m
network                                    4.3.19    True        False         False      20h
node-tuning                                4.4.4     True        False         False      20m
openshift-apiserver                        4.4.4     True        False         True       17m
openshift-controller-manager               4.4.4     True        False         False      20h
openshift-samples                          4.3.19    True        False         False      20h
operator-lifecycle-manager                 4.3.19    True        False         False      20h
operator-lifecycle-manager-catalog         4.3.19    True        False         False      20h
operator-lifecycle-manager-packageserver   4.3.19    True        False         False      13m
service-ca                                 4.4.4     True        False         False      20h
service-catalog-apiserver                  4.4.4     True        False         False      20h
service-catalog-controller-manager         4.4.4     True        False         False      20h
storage                                    4.4.4     True        False         False

Some nodes are stuck at NotReady. Some nodes have already been updated to OCP 4.4 (v1.17.1), even though kube-apiserver and the other master components have not been updated yet.

$ oc get nodes
NAME                STATUS     ROLES          AGE   VERSION
host-172-25-69-12   NotReady   master         22h   v1.17.1
host-172-25-69-18   Ready      infra,worker   21h   v1.16.2
host-172-25-69-27   NotReady   infra,worker   21h   v1.16.2
host-172-25-69-28   NotReady   worker         22h   v1.17.1
host-172-25-69-29   Ready      master         22h   v1.16.2
host-172-25-69-35   Ready      master         22h   v1.16.2
host-172-25-69-36   Ready      worker         22h   v1.16.2
host-172-25-69-39   NotReady   infra,worker   21h   v1.17.1

j) Expected results: The upgrade process finishes with no problems.

k) Additional information:

According to the kubelet logs, it failed to initialize CSINode.

[core@host-172-25-69-39 ~]$ systemctl status kubelet
...(snip)...
May 21 04:30:53 host-172-25-69-12 hyperkube[207013]: I0521 04:30:53.866160 207013 nodeinfomanager.go:402] Failed to publish CSINode: the server could not find the requested resource
May 21 04:30:53 host-172-25-69-12 hyperkube[207013]: I0521 04:30:53.912220 207013 nodeinfomanager.go:402] Failed to publish CSINode: the server could not find the requested resource
May 21 04:30:53 host-172-25-69-12 hyperkube[207013]: I0521 04:30:53.985799 207013 nodeinfomanager.go:402] Failed to publish CSINode: the server could not find the requested resource
May 21 04:30:54 host-172-25-69-12 hyperkube[207013]: I0521 04:30:54.266762 207013 nodeinfomanager.go:402] Failed to publish CSINode: the server could not find the requested resource
May 21 04:30:54 host-172-25-69-12 hyperkube[207013]: E0521 04:30:54.266833 207013 csi_plugin.go:273] Failed to initialize CSINode: error updating CSINode annotation: timed out waiting for the condition; caused by: the server could not find the requested resource

We know the CSI migration feature was introduced in OCP 4.4 (v1.17.1), so I disabled it in kubelet.conf as a test.

[core@host-172-25-69-39 ~]$ vi /etc/kubernetes/kubelet.conf
...(snip)...
featureGates:
  CSIMigration: false   <=== Added this entry in kubelet.conf
...(snip)...

After that, the node's status changed to Ready.

$ oc get nodes
NAME                STATUS     ROLES          AGE   VERSION
host-172-25-69-12   NotReady   master         22h   v1.17.1
host-172-25-69-18   Ready      infra,worker   21h   v1.16.2
host-172-25-69-27   NotReady   infra,worker   21h   v1.16.2
host-172-25-69-28   NotReady   worker         22h   v1.17.1
host-172-25-69-29   Ready      master         22h   v1.16.2
host-172-25-69-35   Ready      master         22h   v1.16.2
host-172-25-69-36   Ready      worker         22h   v1.16.2
host-172-25-69-39   Ready      infra,worker   21h   v1.17.1   <=== The status was changed

So the reason the node became NotReady is that it was upgraded before the master components were updated. Upgrading nodes should be blocked until the master components have been updated.

3. Business impact:

All of our Japanese customers use OpenShift 4 in disconnected environments. Red Hat asks them to upgrade their clusters frequently, but their clusters can be broken by an upgrade due to this issue. It is a very big problem.

4. Primary contact at Red Hat, email, phone (chat)

mfuruta

5. Primary contact at Partner, email, phone (chat)

mas-hatada.nec.com (Masaki Hatada)
The command to update seems wrong, as you should be using the local registry instead of quay.io for that, i.e. oc adm upgrade --allow-explicit-upgrade --force=true --to-image=LOCAL_REGISTRY/REPOSITORY@sha256:<sha256sum>. Can you check if this fixes your issue?
Dear Lalatendu,

> The command to update seems wrong as you should be using the local registry instead of quay.io for that i.e oc adm upgrade --allow-explicit-upgrade --to-image --force=true LOCAL_REGISTRY/REPOSITORY@sha256:<sha256sum>. Can you check if this fixes your issue?

We don't think our command line is wrong, because we used the same command line when testing upgrades on an OCP 4.3 AWS disconnected env in order to get a Support Exception from Red Hat. And thanks to ImageContentSourcePolicy, the quay.io URL is redirected to our private registry automatically, so there is no problem from the technical side either.

Even if our command line were wrong, we cannot understand why that would change the order in which the Operators are upgraded. Could you explain the relation between the wrong order and our command line?

[1] Upgrades and order
https://github.com/openshift/cluster-version-operator/blob/master/docs/dev/upgrades.md

Best Regards,
Masaki Hatada
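For readers following along, the redirect described above is configured with an ImageContentSourcePolicy of roughly this shape. This is a sketch only: the metadata name is arbitrary, and the mirror repository is assumed to match the Step 2 mirroring variables.

```yaml
apiVersion: operator.openshift.io/v1alpha1
kind: ImageContentSourcePolicy
metadata:
  name: release-mirror        # arbitrary name, for illustration
spec:
  repositoryDigestMirrors:
  - mirrors:
    - <local mirror registry>/ocp4/openshift4   # LOCAL_REGISTRY/LOCAL_REPOSITORY from Step 2
    source: quay.io/openshift-release-dev/ocp-release
```

With this in place, by-digest pulls of quay.io/openshift-release-dev/ocp-release@sha256:... are transparently attempted against the mirror first, which is why the quay.io pullspec can still "work" on cluster nodes in a disconnected environment.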
[1] explains how the CVO decides the order. It seems that the order information is generated from the file names of the manifests stored in the release image. The OCP upgrade process did start with our command line, so we think the CVO must be getting that information from the image stored in our private registry...
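As a sketch of what [1] describes: the leading 0000_NN_ prefix of each manifest file name acts as a run level, and manifests are grouped and ordered by it. With a few sample file names (taken from the 4.4.4 manifest listing later in this thread), the grouping can be reproduced like this:

```shell
#!/bin/sh
# Sample manifest file names from the 4.4.4 release image; the 0000_NN_
# prefix is the CVO run level.
manifests='0000_80_machine-config-operator_06_clusteroperator.yaml
0000_20_kube-apiserver-operator_07_clusteroperator.yaml
0000_50_cluster-csi-snapshot-controller-operator_08_clusteroperator.yaml
0000_12_etcd-operator_07_clusteroperator.yaml'

# Sort by run level: lower levels (etcd, kube-apiserver) are applied
# before higher ones (machine-config, which rolls out new kubelets).
out=$(echo "$manifests" | sort | sed 's/^0000_\([0-9]*\)_\([a-z-]*\).*/level \1: \2/')
echo "$out"
```

This is only the file-name convention, not the CVO's full graph logic, but it shows why machine-config (level 80) is expected to run after kube-apiserver (level 20).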
Right, ImageContentSourcePolicy will take care of the registry URLs in the oc command.
If you can attach a must-gather [0] to your case, that would be helpful. In the absence of one, here's a bit of a walkthrough for a 4.3.19 -> 4.4.4 CI job [1].

* 16:04Z, master-0, master-1, and master-2 come up with v1.16.2 kubelets [5].
* 17:03Z, update triggered [2].
* 17:04Z, etcd operator levels at 4.4.4 [4].
* 17:07Z, kube-apiserver levels at 4.4.4 [4].
* 17:11Z, openshift-apiserver levels at 4.4.4 [4].
* 17:17Z, csi-snapshot-controller levels at 4.4.4 [4].
* 17:19Z, authentication operator levels at 4.4.4 [4].
* 17:31Z, etcd operator finishes progressing [9].
* 17:32:03Z, openshift-apiserver goes Degraded=True, APIServerDeploymentDegraded: 1 of 3 requested instances are unavailable [4].
* 17:32:13Z, openshift-apiserver goes Degraded=False.
* 17:36Z, machine-config begins its transition to 4.4.4 [4].
* 17:42Z, openshift-apiserver goes Degraded=True, APIServerDeploymentDegraded: 1 of 3 requested instances are unavailable [4].
* 17:44Z, master-1 and worker-eastus21 ready again [3], both on v1.17.1 kubelets [5,6].
* 17:48Z, master-0 ready again [3] on the v1.17.1 kubelet [5].
* 17:49Z, worker-eastus22 ready again [3] on the v1.17.1 kubelet [6].
* 17:53Z, master-2 ready again [3] on the v1.17.1 kubelet [5].
* 17:53Z, machine-config levels at 4.4.4 [4].
* 17:55Z, worker-eastus23 ready again [3] on the v1.17.1 kubelet [6].
* 17:58Z, update complete [2].

And checking the operator order [7], we can see that kube-apiserver is at level 20 in 4.4.4's manifest graph, the CSI snapshot controller is at level 50, and the machine-config operator that rolls out the new kubelet binaries along with the rest of machine-os-content is at level 80 (more on manifest graphs in [8]).

Dumping all the operator conditions [9], there is some strangeness. I haven't dug into why node-tuning goes Available=True as late as 17:57Z, or why csi-snapshot-controller sets Progressing=False at 17:51Z.
But as you point out, the expected behavior is definitely to update the control plane components before we roll out the new kubelets, and if you want us to figure out why that didn't happen for your cluster, we'll probably need some additional logs (ideally a must-gather).

[0]: https://docs.openshift.com/container-platform/4.4/support/gathering-cluster-data.html#about-must-gather_gathering-cluster-data

[1]: https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-launch-azure/388

[2]: $ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-launch-azure/388/artifacts/launch/clusterversion.json | jq -r '.items[].status.history[] | .startedTime + " " + (.completionTime // "-") + " " + .state + " " + .version'
2020-05-13T17:03:15Z 2020-05-13T17:58:48Z Completed 4.4.4
2020-05-13T16:06:09Z 2020-05-13T16:44:00Z Completed 4.3.19

[3]: $ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-launch-azure/388/artifacts/launch/nodes.json | jq -r '.items[] | .ready = ([.status.conditions[] | select(.type == "Ready")][0]) | .ready.lastTransitionTime + " " + .ready.status + " " + .metadata.name' | sort
2020-05-13T17:44:15Z True ci-ln-6bdji0t-002ac-jw4hp-master-1
2020-05-13T17:44:29Z True ci-ln-6bdji0t-002ac-jw4hp-worker-eastus21-8fwhr
2020-05-13T17:48:35Z True ci-ln-6bdji0t-002ac-jw4hp-master-0
2020-05-13T17:49:25Z True ci-ln-6bdji0t-002ac-jw4hp-worker-eastus22-dlssq
2020-05-13T17:53:07Z True ci-ln-6bdji0t-002ac-jw4hp-master-2
2020-05-13T17:55:27Z True ci-ln-6bdji0t-002ac-jw4hp-worker-eastus23-8jr2j

[4]: $ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-launch-azure/388/artifacts/launch/events.json | jq -r '[.items[] | .timePrefix = if .firstTimestamp == null or .firstTimestamp == "null" then .eventTime else .firstTimestamp + " - " + .lastTimestamp + " (" + (.count | tostring) + ")" end] | sort_by(.timePrefix)[] | .timePrefix + " " + .metadata.namespace + " " + .message'
...
2020-05-13T17:04:08Z - 2020-05-13T17:04:08Z (1) openshift-etcd-operator clusteroperator/etcd version "operator" changed from "" to "4.4.4"
...
2020-05-13T17:07:57Z - 2020-05-13T17:07:57Z (1) openshift-kube-apiserver-operator clusteroperator/kube-apiserver version "operator" changed from "4.3.19" to "4.4.4"
...
2020-05-13T17:11:23Z - 2020-05-13T17:11:23Z (2) openshift-apiserver-operator clusteroperator/openshift-apiserver version "operator" changed from "4.3.19" to "4.4.4"
...
2020-05-13T17:17:25Z - 2020-05-13T17:17:25Z (2) openshift-cluster-storage-operator clusteroperator/csi-snapshot-controller version "operator" changed from "" to "4.4.4"
...
2020-05-13T17:19:29Z - 2020-05-13T17:19:29Z (2) openshift-authentication-operator clusteroperator/authentication version "operator" changed from "4.3.19" to "4.4.4"
...
2020-05-13T17:32:03Z - 2020-05-13T17:32:03Z (1) openshift-apiserver-operator Status for clusteroperator/openshift-apiserver changed: Degraded changed from False to True ("APIServerDeploymentDegraded: 1 of 3 requested instances are unavailable")
...
2020-05-13T17:32:13Z - 2020-05-13T17:32:13Z (1) openshift-apiserver-operator Status for clusteroperator/openshift-apiserver changed: Degraded changed from True to False ("")
...
2020-05-13T17:36:11Z - 2020-05-13T17:36:11Z (1) default clusteroperator/machine-config-operator started a version change from [{operator 4.3.19}] to [{operator 4.4.4}]
...
2020-05-13T17:42:34Z - 2020-05-13T17:42:40Z (3) openshift-apiserver-operator (combined from similar events): Status for clusteroperator/openshift-apiserver changed: Degraded changed from False to True ("APIServerDeploymentDegraded: 1 of 3 requested instances are unavailable"),Progressing message changed from "OperatorConfigProgressing: openshiftapiserveroperatorconfigs/instance: observed generation is 9, desired generation is 10.\nAPIServerDeploymentProgressing: deployment/apiserver.openshift-operator: observed generation is 5, desired generation is 6." to "APIServerDeploymentProgressing: deployment/apiserver.openshift-operator: observed generation is 5, desired generation is 6."
...
2020-05-13T17:53:53Z - 2020-05-13T17:53:53Z (1) default clusteroperator/machine-config-operator version changed from [{operator 4.3.19}] to [{operator 4.4.4}]
...

[5]: $ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-launch-azure/388/artifacts/launch/nodes/masters-journal | gunzip | grep 'v1\.1[67]' | sort
May 13 16:04:11.897824 ci-ln-6bdji0t-002ac-jw4hp-master-1 hyperkube[1430]: I0513 16:04:11.897795 1430 server.go:418] Version: v1.16.2
May 13 16:04:12.260155 ci-ln-6bdji0t-002ac-jw4hp-master-2 hyperkube[1435]: I0513 16:04:12.260123 1435 server.go:418] Version: v1.16.2
May 13 16:04:27.466254 ci-ln-6bdji0t-002ac-jw4hp-master-0 hyperkube[1435]: I0513 16:04:27.466202 1435 server.go:418] Version: v1.16.2
May 13 17:44:15.059318 ci-ln-6bdji0t-002ac-jw4hp-master-1 hyperkube[1444]: I0513 17:44:15.059247 1444 server.go:424] Version: v1.17.1
May 13 17:48:35.575490 ci-ln-6bdji0t-002ac-jw4hp-master-0 hyperkube[1455]: I0513 17:48:35.575423 1455 server.go:424] Version: v1.17.1
May 13 17:53:06.963297 ci-ln-6bdji0t-002ac-jw4hp-master-2 hyperkube[1447]: I0513 17:53:06.963249 1447 server.go:424] Version: v1.17.1

[6]: $ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-launch-azure/388/artifacts/launch/nodes/workers-journal | gunzip | grep 'v1\.1[67]' | sort
May 13 16:30:56.604435 ci-ln-6bdji0t-002ac-jw4hp-worker-eastus21-8fwhr hyperkube[1332]: I0513 16:30:56.604406 1332 server.go:418] Version: v1.16.2
May 13 16:39:59.592251 ci-ln-6bdji0t-002ac-jw4hp-worker-eastus23-8jr2j hyperkube[1342]: I0513 16:39:59.592232 1342 server.go:418] Version: v1.16.2
May 13 16:40:05.287868 ci-ln-6bdji0t-002ac-jw4hp-worker-eastus22-dlssq hyperkube[1350]: I0513 16:40:05.287842 1350 server.go:418] Version: v1.16.2
May 13 17:44:18.742095 ci-ln-6bdji0t-002ac-jw4hp-worker-eastus21-8fwhr hyperkube[1351]: I0513 17:44:18.742008 1351 server.go:424] Version: v1.17.1
May 13 17:49:15.002727 ci-ln-6bdji0t-002ac-jw4hp-worker-eastus22-dlssq hyperkube[1360]: I0513 17:49:15.002690 1360 server.go:424] Version: v1.17.1
May 13 17:55:16.467479 ci-ln-6bdji0t-002ac-jw4hp-worker-eastus23-8jr2j hyperkube[1347]: I0513 17:55:16.467199 1347 server.go:424] Version: v1.17.1

[7]: $ oc adm release extract --to=manifests quay.io/openshift-release-dev/ocp-release:4.4.4-x86_64
$ grep -rl 'kind: ClusterOperator' manifests | sort
manifests/0000_12_etcd-operator_07_clusteroperator.yaml
manifests/0000_20_kube-apiserver-operator_07_clusteroperator.yaml
manifests/0000_25_kube-controller-manager-operator_07_clusteroperator.yaml
manifests/0000_25_kube-scheduler-operator_07_clusteroperator.yaml
manifests/0000_30_machine-api-operator_12_clusteroperator.yaml
manifests/0000_30_openshift-apiserver-operator_08_clusteroperator.yaml
manifests/0000_40_kube-storage-version-migrator-operator_09_clusteroperator.yaml
manifests/0000_50_cloud-credential-operator_10_cluster-operator.yaml
manifests/0000_50_cluster-authentication-operator_08_clusteroperator.yaml
manifests/0000_50_cluster-autoscaler-operator_08_clusteroperator.yaml
manifests/0000_50_cluster-csi-snapshot-controller-operator_08_clusteroperator.yaml
manifests/0000_50_cluster-image-registry-operator_08-clusteroperator.yaml
manifests/0000_50_cluster-ingress-operator_03-cluster-operator.yaml
manifests/0000_50_cluster_monitoring_operator_06-clusteroperator.yaml
manifests/0000_50_cluster-node-tuning-operator_05-clusteroperator.yaml
manifests/0000_50_cluster-openshift-controller-manager-operator_10_clusteroperator.yaml
manifests/0000_50_cluster-samples-operator_07-clusteroperator.yaml
manifests/0000_50_cluster-storage-operator_03-cluster-operator.yaml
manifests/0000_50_cluster-svcat-apiserver-operator_08_cluster-operator.yaml
manifests/0000_50_cluster-svcat-controller-manager-operator_10_cluster-operator.yaml
manifests/0000_50_console-operator_08-clusteroperator.yaml
manifests/0000_50_insights-operator_07-cluster-operator.yaml
manifests/0000_50_olm_99-operatorstatus.yaml
manifests/0000_50_operator-marketplace_10_clusteroperator.yaml
manifests/0000_50_service-ca-operator_07_clusteroperator.yaml
manifests/0000_70_cluster-network-operator_04_clusteroperator.yaml
manifests/0000_70_dns-operator_03-cluster-operator.yaml
manifests/0000_80_machine-config-operator_06_clusteroperator.yaml

[8]: https://github.com/openshift/cluster-version-operator/blob/master/docs/user/reconciliation.md#manifest-graph

[9]: $ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-launch-azure/388/artifacts/launch/clusteroperators.json | jq -r '[.items[] | .metadata.name as $name | .status.conditions[] | .lastTransitionTime + " " + $name + " " + .type + "=" + .status + " " + (.reason // "-") + ": " + (.message // "-")] | sort[]'
...
2020-05-13T17:17:21Z csi-snapshot-controller Degraded=False AsExpected: -
2020-05-13T17:17:26Z cloud-credential Progressing=False ReconcilingComplete: 3 of 3 credentials requests provisioned and reconciled.
2020-05-13T17:17:26Z csi-snapshot-controller Upgradeable=True AsExpected: -
...
2020-05-13T17:31:11Z etcd Progressing=False AsExpected: NodeInstallerProgressing: 3 nodes are at revision 3 EtcdMembersProgressing: all members have started
2020-05-13T17:32:32Z kube-scheduler Progressing=False AsExpected: NodeInstallerProgressing: 3 nodes are at revision 7
2020-05-13T17:41:30Z ingress Degraded=False NoIngressControllersDegraded: -
2020-05-13T17:43:19Z monitoring Available=True RollOutDone: Successfully rolled out the stack.
2020-05-13T17:43:19Z monitoring Degraded=False -: -
2020-05-13T17:43:19Z monitoring Progressing=False -: -
2020-05-13T17:45:52Z service-ca Progressing=False AsExpected: Progressing: All service-ca-operator deployments updated
2020-05-13T17:45:57Z dns Degraded=False AsExpected: All desired DNS DaemonSets available and operand Namespace exists
2020-05-13T17:46:40Z csi-snapshot-controller Available=True AsExpected: -
2020-05-13T17:49:41Z authentication Progressing=False AsExpected: -
2020-05-13T17:49:45Z console Available=True AsExpected: -
2020-05-13T17:49:46Z operator-lifecycle-manager-packageserver Available=True -: -
2020-05-13T17:49:46Z operator-lifecycle-manager-packageserver Progressing=False -: Deployed version 0.14.2
2020-05-13T17:50:23Z marketplace Available=True OperatorAvailable: Available release version: 4.4.4
2020-05-13T17:50:23Z marketplace Progressing=False OperatorAvailable: Successfully progressed to release version: 4.4.4
2020-05-13T17:50:23Z marketplace Upgradeable=True OperatorAvailable: Marketplace is upgradeable
2020-05-13T17:51:46Z kube-storage-version-migrator Available=True AsExpected: -
2020-05-13T17:51:51Z openshift-apiserver Progressing=False AsExpected: -
2020-05-13T17:51:58Z ingress Available=True -: desired and current number of IngressControllers are equal
2020-05-13T17:51:58Z ingress Progressing=False -: desired and current number of IngressControllers are equal
2020-05-13T17:51:59Z csi-snapshot-controller Progressing=False AsExpected: -
2020-05-13T17:52:47Z image-registry Available=True Ready: The registry is ready
2020-05-13T17:52:47Z image-registry Progressing=False Ready: The registry is ready
2020-05-13T17:53:54Z machine-config Progressing=False -: Cluster version is 4.4.4
2020-05-13T17:54:27Z etcd Degraded=False AsExpected: NodeControllerDegraded: All master nodes are ready EtcdMembersDegraded: No unhealthy members found
2020-05-13T17:54:59Z openshift-samples Available=True -: Samples installation successful at 4.4.4
2020-05-13T17:54:59Z openshift-samples Degraded=False -: -
2020-05-13T17:54:59Z openshift-samples Progressing=True -: Samples processing to 4.4.4
2020-05-13T17:55:25Z openshift-apiserver Degraded=False AsExpected: -
2020-05-13T17:56:17Z dns Progressing=False AsExpected: Desired and available number of DNS DaemonSets are equal
2020-05-13T17:56:21Z network Progressing=False -: -
2020-05-13T17:57:03Z node-tuning Progressing=False AsExpected: Cluster version is "4.4.4"
2020-05-13T17:57:04Z node-tuning Available=True AsExpected: Cluster has deployed "4.4.4"
2020-05-13T18:02:16Z monitoring Upgradeable=True -: -
Created attachment 1691677 [details]
Logs for both the successful and unsuccessful env

Dear Trevor,

Thank you for your update.

> If you can attach a must-gather [0] to your case, that would be helpful. In the absence of one, here's a bit of a walkthrough for a 4.3.19 -> 4.4.4 CI job [1].

Unfortunately, must-gather gets an error, as we reported at https://access.redhat.com/support/cases/#/case/02653914?commentId=a0a2K00000UzrmKQAR .

> you should be using the local registry instead of quay.io for that i.e oc adm upgrade --allow-explicit-upgrade --to-image --force=true LOCAL_REGISTRY/REPOSITORY@sha256:<sha256sum>.

We attempted to upgrade with the above command line as a test. Somehow, it succeeded. We don't know why; it might have succeeded accidentally. We are currently testing it on another of our environments.

We uploaded here the results of the following commands from both the successful and unsuccessful env. Please let us know if you want other information.

- oc version
- oc get co
- oc get clusterversion -o yaml
- oc logs <cluster-version-operator pod>

Best Regards,
Masaki Hatada
> > you should be using the local registry instead of quay.io for that i.e oc adm upgrade --allow-explicit-upgrade --to-image --force=true LOCAL_REGISTRY/REPOSITORY@sha256:<sha256sum>.
>
> We attempted to upgrade with the above command line for test.
> Somehow, it succeeded.
>
> We don't know why it succeeded. It might have succeeded accidentally.
> We are currently testing it on our other env.

We attempted to upgrade with --to-image=LOCAL_REGISTRY/REPOSITORY@sha256:<sha256sum>, but it failed. So the success of the previous attempt seems to have been a coincidence...

Currently we can complete an upgrade from OCP 4.3 to OCP 4.4 in a NON-disconnected env; on the other hand, it fails with high probability in a disconnected env. Does Red Hat have any idea about this issue and how we should investigate it?

---

$ oc get co
NAME                                       VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.4.4     True        False         False      25h
cloud-credential                           4.3.19    True        False         False      25h
cluster-autoscaler                         4.4.4     True        False         False      25h
console                                    4.3.19    True        False         False      23h
dns                                        4.4.4     True        False         False      25h
etcd                                       4.4.4     True        False         True       15m
image-registry                             4.4.4     True        False         False      23h
ingress                                    4.3.19    True        False         False      4h13m
insights                                   4.3.19    True        False         False      25h
kube-apiserver                             4.3.19    True        False         True       25h
kube-controller-manager                    4.3.19    True        False         True       25h
kube-scheduler                             4.3.19    True        False         True       25h
machine-api                                4.4.4     True        False         False      25h
machine-config                             4.3.19    False       True          True       22h
marketplace                                4.4.4     True        False         False      23h
monitoring                                 4.4.4     True        False         False      138m
network                                    4.3.19    True        False         False      25h
node-tuning                                4.4.4     True        False         False      23h
openshift-apiserver                        4.4.4     True        False         True       22h
openshift-controller-manager               4.4.4     True        False         False      25h
openshift-samples                          4.3.19    True        False         False      25h
operator-lifecycle-manager                 4.3.19    True        False         False      25h
operator-lifecycle-manager-catalog         4.3.19    True        False         False      25h
operator-lifecycle-manager-packageserver   4.3.19    True        False         False      6h58m
service-ca                                 4.4.4     True        False         False      25h
service-catalog-apiserver                  4.4.4     True        False         False      25h
service-catalog-controller-manager         4.4.4     True        False         False      25h
storage                                    4.4.4     True        False         False      23h

$ oc get clusterversion -o yaml
apiVersion: v1
items:
- apiVersion: config.openshift.io/v1
  kind: ClusterVersion
  metadata:
    creationTimestamp: "2020-05-25T03:48:21Z"
    generation: 2
    name: version
    resourceVersion: "696811"
    selfLink: /apis/config.openshift.io/v1/clusterversions/version
    uid: f9d80fa8-727c-40d3-8855-8b0dc0c2435d
  spec:
    channel: stable-4.3
    clusterID: 0c741002-21b9-4f18-83d8-8c05a6b556c2
    desiredUpdate:
      force: true
      image: matsuo-lbdns.nec.test:5000/ocp4/openshift4@sha256:baa687f29b0ac155d8f4c6914056d36d68f343feb9c1e82b46eef95819d00be5
      version: ""
    upstream: https://api.openshift.com/api/upgrades_info/v1/graph
  status:
    availableUpdates: null
    conditions:
    - lastTransitionTime: "2020-05-25T04:05:41Z"
      message: Done applying 4.3.19
      status: "True"
      type: Available
    - lastTransitionTime: "2020-05-26T05:25:10Z"
      status: "False"
      type: Failing
    - lastTransitionTime: "2020-05-25T06:19:15Z"
      message: 'Working towards 4.4.4: 11% complete'
      status: "True"
      type: Progressing
    - lastTransitionTime: "2020-05-25T03:48:34Z"
      message: 'Unable to retrieve available updates: Get https://api.openshift.com/api/upgrades_info/v1/graph?arch=amd64&channel=stable-4.3&id=0c741002-21b9-4f18-83d8-8c05a6b556c2&version=4.4.4: dial tcp 52.5.215.228:443: connect: connection timed out'
      reason: RemoteFailed
      status: "False"
      type: RetrievedUpdates
    desired:
      force: true
      image: matsuo-lbdns.nec.test:5000/ocp4/openshift4@sha256:baa687f29b0ac155d8f4c6914056d36d68f343feb9c1e82b46eef95819d00be5
      version: 4.4.4
    history:
    - completionTime: null
      image: matsuo-lbdns.nec.test:5000/ocp4/openshift4@sha256:baa687f29b0ac155d8f4c6914056d36d68f343feb9c1e82b46eef95819d00be5
      startedTime: "2020-05-25T05:52:26Z"
      state: Partial
      verified: false
      version: 4.4.4
    - completionTime: "2020-05-25T04:05:41Z"
      image: quay.io/openshift-release-dev/ocp-release@sha256:28c451bfb531e5af71332a933073a2d97673ed157150bacb014a4358501f05ad
      startedTime: "2020-05-25T03:48:34Z"
      state: Completed
      verified: false
      version: 4.3.19
    observedGeneration: 2
    versionHash: Q1yJVFYRV4A=
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""
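The stuck state is visible in that history: the 4.4.4 entry stays Partial with completionTime null. For anyone triaging similar output without jq available, the history can be summarized with plain awk; the inline sample below is shaped like the status.history entries above.

```shell
#!/bin/sh
# Inline sample shaped like status.history from 'oc get clusterversion -o yaml'.
history='- completionTime: null
  startedTime: "2020-05-25T05:52:26Z"
  state: Partial
  version: 4.4.4
- completionTime: "2020-05-25T04:05:41Z"
  startedTime: "2020-05-25T03:48:34Z"
  state: Completed
  version: 4.3.19'

# One line per history entry: version, state, completionTime.
# "null" for completionTime means the update never finished.
summary=$(echo "$history" | awk '
  { gsub(/"/, "") }                 # drop YAML quoting
  /completionTime:/ { ct = $NF }
  /state:/          { state = $2 }
  /version:/        { print $2, state, ct }
')
echo "$summary"
```

A Partial entry with a null completionTime, as in the first line of the summary, is the signature of an update that started but never converged.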
Dear Red Hat,

According to the cluster-version-operator log, it seems that at least 3 upgrade threads were running at the same time.

I0524 16:11:29.481202 1 sync_worker.go:621] Running sync for namespace "openshift-cluster-storage-operator" (307 of 573)
I0524 16:11:29.481223 1 sync_worker.go:621] Running sync for credentialsrequest "openshift-cloud-credential-operator/openshift-machine-api-aws" (99 of 573)
I0524 16:11:29.487623 1 sync_worker.go:621] Running sync for clusterrolebinding "cluster-storage-operator" (308 of 573)
I0524 16:11:29.490376 1 sync_worker.go:621] Running sync for clusterrole "cluster-storage-operator" (309 of 573)
I0524 16:11:29.493510 1 sync_worker.go:621] Running sync for rolebinding "openshift-cluster-storage-operator/cluster-storage-operator" (310 of 573)
I0524 16:11:29.497488 1 sync_worker.go:621] Running sync for role "openshift-cluster-storage-operator/cluster-storage-operator" (311 of 573)
I0524 16:11:29.500029 1 sync_worker.go:621] Running sync for serviceaccount "openshift-cluster-storage-operator/cluster-storage-operator" (312 of 573)
I0524 16:11:29.503096 1 sync_worker.go:621] Running sync for deployment "openshift-cluster-storage-operator/cluster-storage-operator" (313 of 573)
I0524 16:11:29.685004 1 sync_worker.go:621] Running sync for clusteroperator "storage" (314 of 573)
I0524 16:11:29.685135 1 sync_worker.go:621] Running sync for namespace "openshift-controller-manager-operator" (270 of 573)
I0524 16:11:29.784595 1 sync_worker.go:621] Running sync for namespace "openshift-controller-manager" (271 of 573)
I0524 16:11:29.884781 1 sync_worker.go:621] Running sync for customresourcedefinition "openshiftcontrollermanagers.operator.openshift.io" (272 of 573)
I0524 16:11:29.985774 1 sync_worker.go:621] Running sync for openshiftcontrollermanager "cluster" (273 of 573)
I0524 16:11:31.785411 1 sync_worker.go:621] Running sync for configmap "openshift-controller-manager-operator/openshift-controller-manager-operator-config" (274 of 573)
I0524 16:11:31.836388 1 sync_worker.go:621] Running sync for credentialsrequest "openshift-cloud-credential-operator/openshift-machine-api-azure" (100 of 573)

Thread 1 ... 307 -> 308 -> ... -> 314
Thread 2 ... 270 -> 271 -> ... -> 274
Thread 3 ... 99 -> 100

And it seems that Thread 1 (which was processing machine-config) and Thread 3 (which was processing openshift-apiserver) failed at the same time.

E0524 16:17:14.530902 1 task.go:77] error running apply for clusteroperator "machine-config" (475 of 573): Cluster operator machine-config is still updating
I0524 16:17:14.530965 1 task_graph.go:590] Running 20 on 0
I0524 16:17:14.530982 1 task_graph.go:583] Canceled worker 0
E0524 16:17:14.530902 1 task.go:77] error running apply for clusteroperator "openshift-apiserver" (139 of 573): Cluster operator openshift-apiserver is reporting a failure: APIServerDeploymentDegraded: 1 of 3 requested instances are unavailable

So currently we imagine:

- The order is not wrong, but multiple upgrade threads are somehow running at the same time.
- The CVO tries to access the internet to get the image signature. In a disconnected env, it takes time until the timeout happens, so an additional upgrade thread might be kicked off by the CVO before the timeout occurs.
- If the upgrade target were the same minor version (4.3.y), the upgrade would not fail even if multiple upgrade threads ran, because the kubelet can still run before the api-server and other master components are upgraded. But if the target is a different minor version, the upgrade fails.

We currently have no evidence, but at least it is true that multiple upgrade threads were running at the same time. We'd like Red Hat to check this point.

Best Regards,
Masaki Hatada
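The interleaving above can be made visible by grouping the "Running sync" manifest indexes into runs: consecutive indexes belong to one worker walking its portion of the graph. A rough sketch, using a few of the log lines above as input (this is simple index grouping for illustration, not the CVO's actual worker assignment):

```shell
#!/bin/sh
# A few "Running sync" lines from the CVO log above.
log='Running sync for namespace "openshift-cluster-storage-operator" (307 of 573)
Running sync for credentialsrequest "openshift-cloud-credential-operator/openshift-machine-api-aws" (99 of 573)
Running sync for clusterrolebinding "cluster-storage-operator" (308 of 573)
Running sync for namespace "openshift-controller-manager-operator" (270 of 573)
Running sync for namespace "openshift-controller-manager" (271 of 573)
Running sync for credentialsrequest "openshift-cloud-credential-operator/openshift-machine-api-azure" (100 of 573)'

# Extract the "(N of 573)" index from each line, then split the sequence
# into runs of consecutive indexes; each run roughly corresponds to one
# parallel sync worker.
workers=$(echo "$log" | sed 's/.*(\([0-9]*\) of 573).*/\1/' | awk '
  { for (i = 1; i <= runs; i++)              # attach to an existing run
      if ($1 == last[i] + 1) {               # if the index continues it
        last[i] = $1; seq[i] = seq[i] " -> " $1; next
      }
    runs++; last[runs] = $1; seq[runs] = $1  # otherwise start a new run
  }
  END { for (i = 1; i <= runs; i++) print "worker " i ": " seq[i] }
')
echo "$workers"
```

Three interleaved runs fall out of even this small sample, matching the three "threads" identified above.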
Created attachment 1692291 [details] qe cvo log
From comment 8:

> - The order is not wrong, but multiple upgrade threads are running at the same time somehow

Yes, the CVO spawns several workers to walk parallel portions of the manifest graph simultaneously [1]. We still expect prerequisite graph nodes to block all subsequent graph nodes.

> - CVO tries to access internet for getting image signature. In disconnected env it takes a time until timeout happens. So an additional upgrade thread might be kicked from CVO before timeout happened

From comment 10:

> Created attachment 1692291 [details]

The logs are from the 4.3.19 CVO:

$ grep ClusterVersionOperator cvo.log
I0526 13:09:44.984520 1 start.go:19] ClusterVersionOperator v4.3.19-202005041055-dirty
I0526 13:11:25.565022 1 cvo.go:332] Starting ClusterVersionOperator with minimum reconcile period 2m52.525702462s

You can see the request to go to 4.4.4 (forcing the update):

$ grep -i desired cvo.log | head -n1
I0526 13:11:25.665927 1 cvo.go:468] Desired version from spec is v1.Update{Version:"", Image:"quay.io/openshift-release-dev/ocp-release@sha256:baa687f29b0ac155d8f4c6914056d36d68f343feb9c1e82b46eef95819d00be5", Force:true}

$ oc adm release info quay.io/openshift-release-dev/ocp-release@sha256:baa687f29b0ac155d8f4c6914056d36d68f343feb9c1e82b46eef95819d00be5 | head -n3
Name: 4.4.4
Digest: sha256:baa687f29b0ac155d8f4c6914056d36d68f343feb9c1e82b46eef95819d00be5
Created: 2020-05-13T15:41:16Z

The CVO seems to be attempting to walk signatures with two layers of timeouts:

$ grep 'unable to load signature' cvo.log
I0526 13:13:36.123153 1 verify.go:404] unable to load signature: Get https://storage.googleapis.com/openshift-release/official/signatures/openshift/release/sha256=baa687f29b0ac155d8f4c6914056d36d68f343feb9c1e82b46eef95819d00be5/signature-1: dial tcp 172.217.4.48:443: connect: connection timed out
I0526 13:15:47.195128 1 verify.go:404] unable to load signature: Get https://storage.googleapis.com/openshift-release/official/signatures/openshift/release/sha256=baa687f29b0ac155d8f4c6914056d36d68f343feb9c1e82b46eef95819d00be5/signature-2: dial tcp 172.217.9.80:443: connect: connection timed out
I0526 13:17:10.718027 1 verify.go:404] unable to load signature: Get https://storage.googleapis.com/openshift-release/official/signatures/openshift/release/sha256=baa687f29b0ac155d8f4c6914056d36d68f343feb9c1e82b46eef95819d00be5/signature-3: context deadline exceeded
I0526 13:19:44.764143 1 verify.go:404] unable to load signature: Get https://storage.googleapis.com/openshift-release/official/signatures/openshift/release/sha256=baa687f29b0ac155d8f4c6914056d36d68f343feb9c1e82b46eef95819d00be5/signature-1: dial tcp 172.217.4.48:443: connect: connection timed out
I0526 13:21:55.835063 1 verify.go:404] unable to load signature: Get https://storage.googleapis.com/openshift-release/official/signatures/openshift/release/sha256=baa687f29b0ac155d8f4c6914056d36d68f343feb9c1e82b46eef95819d00be5/signature-2: dial tcp 172.217.4.48:443: connect: connection timed out
I0526 13:23:18.233801 1 verify.go:404] unable to load signature: Get https://storage.googleapis.com/openshift-release/official/signatures/openshift/release/sha256=baa687f29b0ac155d8f4c6914056d36d68f343feb9c1e82b46eef95819d00be5/signature-3: context deadline exceeded

That's... fine. My initial reaction was "it should just give up on that host when signature-1 times out," but on second thought you could have a sig server where signature-1 timed out but signature-2 or some later one did not. I've spun off bug 1840343 about limiting denial-of-service sorts of things in situations like this.
Back to our update:

$ grep -A2 signature-3 cvo.log | head -n3
I0526 13:17:10.718027 1 verify.go:404] unable to load signature: Get https://storage.googleapis.com/openshift-release/official/signatures/openshift/release/sha256=baa687f29b0ac155d8f4c6914056d36d68f343feb9c1e82b46eef95819d00be5/signature-3: context deadline exceeded
W0526 13:17:10.718079 1 updatepayload.go:100] An image was retrieved from "quay.io/openshift-release-dev/ocp-release@sha256:baa687f29b0ac155d8f4c6914056d36d68f343feb9c1e82b46eef95819d00be5" that failed verification: The update cannot be verified: context deadline exceeded
E0526 13:17:10.739803 1 sync_worker.go:329] unable to synchronize image (waiting 21.565712806s): Unable to download and prepare the update: timed out waiting for the condition

So the bug seems to be that the sig-fetching timeouts are consuming our whole available timeout, and we leave no time for the preconditions. We should address this by extending the sync-fetching Context timeout, and then setting smaller sub-timeouts for post-fetch processing like signature retrieval and update precondition checks.

[1]: https://github.com/openshift/cluster-version-operator/blob/master/docs/user/reconciliation.md#synchronizing-the-graph
In the meantime, you should be able to work around this issue by creating a signature ConfigMap locally, as described in the in-flight docs PR [1]. That way the CVO wouldn't bother attempting to hit the external signature stores, and you'd have plenty of time for the other update precondition checks. As a bonus, you'd have verification that the release image you were updating to was indeed signed by Red Hat, and you wouldn't need to use --force.

[1]: https://github.com/openshift/openshift-docs/pull/21993
From comment 6:

> Created attachment 1691677 [details]
> Logs_for both of successful and unsuccessful env

Ah, double-checking these against my guess from comment 13:

$ grep -A2 signature-3 Logs_for_both_environments/*.log
Logs_for_both_environments/unsuccessful_oc_logs_cluster_version_operator.log:I0524 16:08:36.496037 1 verify.go:404] unable to load signature: Get https://storage.googleapis.com/openshift-release/official/signatures/openshift/release/sha256=039a4ef7c128a049ccf916a1d68ce93e8f5494b44d5a75df60c85e9e7191dacc/signature-3: context deadline exceeded
Logs_for_both_environments/unsuccessful_oc_logs_cluster_version_operator.log-W0524 16:08:36.496081 1 updatepayload.go:100] An image was retrieved from "quay.io/openshift-release-dev/ocp-release@sha256:039a4ef7c128a049ccf916a1d68ce93e8f5494b44d5a75df60c85e9e7191dacc" that failed verification: The update cannot be verified: context deadline exceeded
Logs_for_both_environments/unsuccessful_oc_logs_cluster_version_operator.log-I0524 16:08:36.496154 1 payload.go:210] Loading updatepayload from "/etc/cvo/updatepayloads/umpiuvwSTU7_qyH_OBk4PQ"

So that looks the same. But now I'm noticing that after the signature timeout we continue to load the payload in both of these cases.
$ grep -i payload cvo.log
I0526 13:09:44.995515 1 payload.go:210] Loading updatepayload from "/"
I0526 13:11:25.666448 1 sync_worker.go:477] Loading payload
W0526 13:17:10.718079 1 updatepayload.go:100] An image was retrieved from "quay.io/openshift-release-dev/ocp-release@sha256:baa687f29b0ac155d8f4c6914056d36d68f343feb9c1e82b46eef95819d00be5" that failed verification: The update cannot be verified: context deadline exceeded
I0526 13:17:33.182208 1 sync_worker.go:477] Loading payload
W0526 13:23:18.233849 1 updatepayload.go:100] An image was retrieved from "quay.io/openshift-release-dev/ocp-release@sha256:baa687f29b0ac155d8f4c6914056d36d68f343feb9c1e82b46eef95819d00be5" that failed verification: The update cannot be verified: context deadline exceeded
I0526 13:23:18.234022 1 payload.go:210] Loading updatepayload from "/etc/cvo/updatepayloads/mUcauUwti1y9EInjIaBdzg"
I0526 13:23:18.687185 1 sync_worker.go:539] Payload loaded from quay.io/openshift-release-dev/ocp-release@sha256:baa687f29b0ac155d8f4c6914056d36d68f343feb9c1e82b46eef95819d00be5 with hash 3PF9oWfsPLw=

$ grep -i payload Logs_for_both_environments/unsuccessful_oc_logs_cluster_version_operator.log
I0524 16:02:50.780437 1 payload.go:210] Loading updatepayload from "/"
I0524 16:02:51.444412 1 sync_worker.go:477] Loading payload
W0524 16:08:36.496081 1 updatepayload.go:100] An image was retrieved from "quay.io/openshift-release-dev/ocp-release@sha256:039a4ef7c128a049ccf916a1d68ce93e8f5494b44d5a75df60c85e9e7191dacc" that failed verification: The update cannot be verified: context deadline exceeded
I0524 16:08:36.496154 1 payload.go:210] Loading updatepayload from "/etc/cvo/updatepayloads/umpiuvwSTU7_qyH_OBk4PQ"
I0524 16:08:36.950963 1 sync_worker.go:539] Payload loaded from quay.io/openshift-release-dev/ocp-release@sha256:039a4ef7c128a049ccf916a1d68ce93e8f5494b44d5a75df60c85e9e7191dacc with hash h110xMINmng=

That makes sense because in all cases we started applying some target-version stuff and didn't just block on the signature. So that invalidated my "stuck getting signatures" hypothesis from comment 13.

Comparing the logs again:

$ grep ClusterVersionOperator cvo.log Logs_for_both_environments/*.log
cvo.log:I0526 13:09:44.984520 1 start.go:19] ClusterVersionOperator v4.3.19-202005041055-dirty
cvo.log:I0526 13:11:25.565022 1 cvo.go:332] Starting ClusterVersionOperator with minimum reconcile period 2m52.525702462s
Logs_for_both_environments/unsuccessful_oc_logs_cluster_version_operator.log:I0524 16:02:50.774740 1 start.go:19] ClusterVersionOperator v4.3.19-202005041055-dirty
Logs_for_both_environments/unsuccessful_oc_logs_cluster_version_operator.log:I0524 16:02:51.342076 1 cvo.go:332] Starting ClusterVersionOperator with minimum reconcile period 2m52.525702462s

Why are folks attaching logs from the 4.3.19 CVO if that CVO is happily triggering a pivot to a 4.4 CVO, and then having that 4.4 CVO get stuck trying to herd the cluster from 4.3 to 4.4?
Dear Trevor,

Thank you for commenting.

(In reply to W. Trevor King from comment #14)
> In the meantime, you should be able to work around this issue by creating a
> signature ConfigMap locally, as described in the in-flight [1]. That way
> the CVO wouldn't bother attempting to hit the external signature stores, and
> you'd have plenty of time for the other update precondition checks. As a
> bonus, you'd have verification that the release image you were updating to
> was indeed signed by Red Hat and you woudn't need to use --force.
>
> [1]: https://github.com/openshift/openshift-docs/pull/21993

We had thought that the feature could only be used with OCP 4.4 or later, but that seems to be wrong, since the feature was introduced in OCP 4.3:

https://github.com/openshift/cluster-version-operator/commit/bc98508b7621696e4eee57061c5bf66820c4ccfe#diff-4f12874ad5cd49216fb65c33c0fa4519

We will test it and share the result later.

Best Regards,
Masaki Hatada
Today I tried the upgrade against the same cluster configuration twice; the only difference is the image given to the --to-image option.

One uses the local registry instead of quay.io:

oc adm upgrade --allow-explicit-upgrade --to-image ec2-18-216-81-116.us-east-2.compute.amazonaws.com:5000/openshift-release-dev/ocp-release@sha256:baa687f29b0ac155d8f4c6914056d36d68f343feb9c1e82b46eef95819d00be5 --force

The other uses the quay.io image path:

oc adm upgrade --allow-explicit-upgrade --to-image quay.io/openshift-release-dev/ocp-release@sha256:baa687f29b0ac155d8f4c6914056d36d68f343feb9c1e82b46eef95819d00be5 --force

Interestingly, they give totally different results: the local registry one passed, while the quay.io one failed.

[root@preserve-jialiu-ansible jialiu-v4-installer-autotest]# oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.3.19    True        False         160m    Cluster version is 4.4.4

[root@preserve-jialiu-ansible jialiu-v4-installer-autotest]# oc get po -n openshift-cluster-version
NAME                                        READY   STATUS      RESTARTS   AGE
cluster-version-operator-6bd6f794d9-dd25n   1/1     Running     0          3h36m
version--mm75v-nnjlp                        0/1     Completed   0          170m

[root@preserve-jialiu-ansible jialiu-v4-installer-autotest]# oc get co
NAME                                       VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.4.4     True        False         False      3h53m
cloud-credential                           4.3.19    True        False         False      3h50m
cluster-autoscaler                         4.4.4     True        False         False      4h21m
console                                    4.3.19    True        False         False      3h45m
csi-snapshot-controller                    4.4.4     True        False         False      156m
dns                                        4.4.4     True        False         False      4h25m
image-registry                             4.4.4     True        False         False      3h39m
ingress                                    4.3.19    True        False         False      159m
insights                                   4.3.19    True        False         False      4h22m
kube-apiserver                             4.3.19    True        False         True       4h24m
kube-controller-manager                    4.3.19    True        False         True       4h23m
kube-scheduler                             4.3.19    True        False         True       4h23m
machine-api                                4.4.4     True        False         False      4h26m
machine-config                             4.3.19    False       True          True       149m
marketplace                                4.4.4     True        False         False      166m
monitoring                                 4.4.4     True        False         False      3h48m
network                                    4.3.19    True        False         False      4h26m
node-tuning                                4.4.4     True        False         False      164m
openshift-apiserver                        4.4.4     True        False         True       159m
openshift-controller-manager               4.4.4     True        False         False      4h24m
openshift-samples                          4.3.19    True        False         False      4h12m
operator-lifecycle-manager                 4.3.19    True        False         False      4h22m
operator-lifecycle-manager-catalog         4.3.19    True        False         False      4h22m
operator-lifecycle-manager-packageserver   4.3.19    True        False         False      3h39m
service-ca                                 4.4.4     True        False         False      4h26m
service-catalog-apiserver                  4.4.4     True        False         False      4h22m
service-catalog-controller-manager         4.4.4     True        False         False      4h22m
storage                                    4.4.4     True        False         False      169m

[root@preserve-jialiu-ansible jialiu-v4-installer-autotest]# oc get node
NAME                                        STATUS     ROLES    AGE     VERSION
ip-10-0-54-209.us-east-2.compute.internal   Ready      worker   4h22m   v1.16.2
ip-10-0-56-228.us-east-2.compute.internal   NotReady   master   4h30m   v1.16.2
ip-10-0-60-34.us-east-2.compute.internal    Ready      master   4h30m   v1.16.2
ip-10-0-69-74.us-east-2.compute.internal    Ready      master   4h30m   v1.16.2
ip-10-0-71-125.us-east-2.compute.internal   NotReady   worker   4h22m   v1.16.2
ip-10-0-74-136.us-east-2.compute.internal   Ready      worker   4h22m   v1.16.2

From the upgrade progress log, the CVO seems to suddenly report "Cluster version is 4.4.4" [2] without ever emitting a message like "Working towards 4.4.4: N% complete" [1], but the cluster never actually completes the upgrade.

[1]: https://mastern-jenkins-csb-openshift-qe.cloud.paas.psi.redhat.com/job/installer-auto-test/1661/console
[2]: https://mastern-jenkins-csb-openshift-qe.cloud.paas.psi.redhat.com/job/installer-auto-test/1665/console
Created attachment 1692594 [details] qe cvo log 2
Went through the QE logs and found a lot of similarity; however, the API server does not go down. Still not sure why we are seeing this.
Ok, we have a better lead on this issue after looking at some local reproducers and then taking a closer look at the CVO logs attached in comment 6. From the attached unsuccessful_oc_logs_cluster_version_operator.log:

I0524 16:02:50.774740 1 start.go:19] ClusterVersionOperator v4.3.19-202005041055-dirty
...
I0524 16:02:51.342076 1 cvo.go:332] Starting ClusterVersionOperator with minimum reconcile period 2m52.525702462s
...
I0524 16:02:51.444381 1 sync_worker.go:471] Running sync quay.io/openshift-release-dev/ocp-release@sha256:039a4ef7c128a049ccf916a1d68ce93e8f5494b44d5a75df60c85e9e7191dacc (force=true) on generation 2 in state Updating at attempt 0
...
I0524 16:08:36.950963 1 sync_worker.go:539] Payload loaded from quay.io/openshift-release-dev/ocp-release@sha256:039a4ef7c128a049ccf916a1d68ce93e8f5494b44d5a75df60c85e9e7191dacc with hash h110xMINmng=
...
I0524 16:08:36.953283 1 task_graph.go:611] Result of work: [update was cancelled at 0 of 573]
...
I0524 16:11:29.479262 1 sync_worker.go:471] Running sync quay.io/openshift-release-dev/ocp-release@sha256:039a4ef7c128a049ccf916a1d68ce93e8f5494b44d5a75df60c85e9e7191dacc (force=true) on generation 2 in state Reconciling at attempt 0
...

So the 4.3.19 CVO loads the 4.4.3 (in that case) manifests, begins updating to them, immediately hits a cancel/timeout [1], and then mistakenly decides that it successfully completed the update and starts Reconciling. We're still not clear on exactly what the mistake is. In the meantime, reconciliation's shuffled, flattened manifest graph can do bad things like updating the kubelets before updating the Kubernetes API server. Raising to urgent while we work on bottoming this out.

[1]: https://github.com/openshift/cluster-version-operator/pull/372 improves the logging here, but the 5m45s duration between 2:51 and 8:36 roughly matches 2 * 2m52s [2].

[2]: https://github.com/openshift/cluster-version-operator/blob/86b9bdba55a85e2e071603916db4c43b481e7588/pkg/cvo/sync_worker.go#L296
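The suspected failure mode, a cancelled update pass being mistaken for a completed one, comes down to a state transition that doesn't distinguish "graph fully applied" from "graph walk aborted". A minimal sketch (illustrative only, not the CVO's actual code; the state names are borrowed from its logs):

```go
package main

import (
	"context"
	"fmt"
)

type state string

const (
	updating    state = "Updating"
	reconciling state = "Reconciling"
)

// applyGraph walks the manifest tasks, stopping early if cancelled.
// It returns how many tasks it completed.
func applyGraph(ctx context.Context, total int) int {
	done := 0
	for i := 0; i < total; i++ {
		select {
		case <-ctx.Done():
			return done // "update was cancelled at 0 of 573"
		default:
			done++
		}
	}
	return done
}

// nextState decides the post-sync state. The buggy variant transitions to
// Reconciling whenever the walk returns; the fixed variant only does so
// when every task was actually applied.
func nextState(completed, total int, checkCompletion bool) state {
	if checkCompletion && completed < total {
		return updating // retry the update instead of reconciling
	}
	return reconciling
}

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	cancel() // the timeout fired before any task ran, as in the log above

	completed := applyGraph(ctx, 573)
	fmt.Println("buggy:", nextState(completed, 573, false))
	fmt.Println("fixed:", nextState(completed, 573, true))
}
```

Once in Reconciling, the flat, shuffled graph loses the ordering guarantees of the update walk, which is exactly what lets kubelets race ahead of the API server.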
PR submitted. We should backport through 4.2, when we started supporting restricted-network flows, because the timed-out signature retrievals plus the forced updates common there are what make tripping this race more likely. Here's a full impact statement, now that we understand the issue:

Who is impacted? If we have to block upgrade edges based on this issue, which edges would need blocking?

All customers upgrading out of a CVO that does not contain the patch are potentially affected, but the chance of tripping the race is very small except for restricted-network users who are forcing updates. The impact when the race trips is also small for patch-level bumps, so the main concern is restricted-network users who are performing minor bumps like 4.2->4.3.

What is the impact? Is it serious enough to warrant blocking edges?

The CVO enters reconciliation mode on the target version, attempting to apply a flat, shuffled manifest graph. All kinds of terrible things could happen, like the machine-config operator trying to roll out the newer machine-os-content and its 4.4 hyperkube binary before rolling out prerequisites like the 4.4 kube-apiserver operator. That one will make manifest application sticky, but it would not surprise me if you could find some terrible ordering that might brick a cluster.

How involved is remediation (even moderately serious impacts might be acceptable if they are easy to mitigate)?

Admins must update to a CVO that is not vulnerable to the race. Using an unforced update (e.g. by copying in a signature ConfigMap [1] for 4.3.12+ or 4.4.0+) would help reduce the likelihood of tripping the race. Using a patch-level update would reduce the impact if the race trips anyway.

[1]: https://github.com/openshift/openshift-docs/pull/21993
Are clusters that hit this bug permanently wedged or is there a chance a subsequent attempt avoids the race?
> Are clusters that hit this bug permanently wedged or is there a chance a subsequent attempt avoids the race?

Once you trip the race and move from UpdatingPayload to ReconcilingPayload, that CVO will not go back to updating. You can re-target your update with 'oc adm upgrade ...' and that will get you back into UpdatingPayload mode. But while the CVO was running between the race trip and your update, it could have been doing all sorts of things as it tried to push out the flattened, shuffled manifest graph. Recovering a cluster that has hit this bug is going to be hard, and will probably involve a case-by-case review of its current state to try to determine a next-hop update target that is as close as possible to what the cluster is currently running, and that also accounts for which directions and orders operators can transition in. The worst case short of a bricked cluster would be having to turn the CVO off entirely and push manifests one at a time on its behalf, to slowly unwind any components that had become too tangled up.
I've marked all comments as private which seemed to contain confidential information. Can you please allow us to make this bug public so that we can reference it directly? Comment 7 may need to be made private as it references hostnames but I believe those to be test systems which of course are not Internet facing.
Created attachment 1693220 [details]
Upgrade OCP without force option

(In reply to W. Trevor King from comment #14)
> In the meantime, you should be able to work around this issue by creating a
> signature ConfigMap locally, as described in the in-flight [1]. That way
> the CVO wouldn't bother attempting to hit the external signature stores, and
> you'd have plenty of time for the other update precondition checks. As a
> bonus, you'd have verification that the release image you were updating to
> was indeed signed by Red Hat and you woudn't need to use --force.
>
> [1]: https://github.com/openshift/openshift-docs/pull/21993

We attempted to upgrade OCP4.3.19 to OCP4.4.3 on an AWS disconnected env in the above way, and it worked. (Please see the attached file for details)

We are still not sure whether it succeeded by coincidence, so currently we are trying the workaround on our vSphere disconnected env.

(In reply to Scott Dodson from comment #27)
> I've marked all comments as private which seemed to contain confidential
> information. Can you please allow us to make this bug public so that we can
> reference it directly? Comment 7 may need to be made private as it
> references hostnames but I believe those to be test systems which of course
> are not Internet facing.

Kii-san will handle your request. But we would like to know in detail why you want to make this Bugzilla public.

Best Regards,
Masaki Hatada
> Admin must update to a CVO that is not vulnerable to the race. Using an unforced update (e.g. by copying in a signature ConfigMap [1] for 4.3.12+ or 4.4.0+) would help reduce the likelihood of tripping the race. Using a patch-level update would reduce the impact if the race trips anyway.
>
> [1]: https://github.com/openshift/openshift-docs/pull/21993

We'll ask our customer to use the signature ConfigMap approach when upgrading their cluster in a disconnected env, but we also hope that Red Hat writes the version information (4.3.12+ or 4.4.0+) in the doc.

BTW, the signature ConfigMap seems to be usable from 4.3.11. (Although we have never actually tested with this version)

$ podman run --rm --entrypoint /usr/bin/strings $(curl -s https://mirror.openshift.com/pub/openshift-v4/clients/ocp/4.3.11/release.txt | grep cluster-version-operator | awk '{print $2}') /usr/bin/cluster-version-operator | grep verification-signatures >/dev/null && echo found || echo not found
found

$ podman run --rm --entrypoint /usr/bin/strings $(curl -s https://mirror.openshift.com/pub/openshift-v4/clients/ocp/4.3.10/release.txt | grep cluster-version-operator | awk '{print $2}') /usr/bin/cluster-version-operator | grep verification-signatures >/dev/null && echo found || echo not found
not found

Best Regards,
Masaki Hatada
> But we would like to know the detail why you want to make this Bugzilla public. The race fixed by the in-flight PR can have, as you saw, pretty serious consequences on cluster health. We want to be able to link Red Hat support and other users to this bug so they can see things like the impact statement in comment 24. Having a public issue would also allow folks to find the bug (and impact statement and fix) by searching if they trip over the same CSIMigration kube-API-server/kubelet version mismatch symptoms.
(In reply to W. Trevor King from comment #30)
> > But we would like to know the detail why you want to make this Bugzilla public.
>
> The race fixed by the in-flight PR can have, as you saw, pretty serious
> consequences on cluster health. We want to be able to link Red Hat support
> and other users to this bug so they can see things like the impact statement
> in comment 24. Having a public issue would also allow folks to find the bug
> (and impact statement and fix) by searching if they trip over the same
> CSIMigration kube-API-server/kubelet version mismatch symptoms.

Thank you for your explanation. We understand the reason why, and we agree with making this issue known to the public.

Before making this bugzilla public, we have a question about your request. Is there any plan to make an article regarding this issue in the Red Hat knowledge base? We think an article in your knowledge base is the suitable way to publicize it.

At this point, as you said, there is no confidential information in this bugzilla. But we are concerned that we might accidentally upload confidential information when we discuss this issue in the future. Also, most of the bugzillas we report are marked with the "private" flag and the "NEC confidential" group, and we don't want to make an exceptional case.
(In reply to Keiichi Kii (NEC) from comment #31)
> Before making this bugzilla public, we have a question for your request.
> Is there any plan to make an article regarding this issue in Red Hat knowledge base?
> We think making the article in your knowledge base is suitable to publicize.

Possibly; that decision is up to the support organization to make.

> At this point, as you said, there are no confidential information in this
> bugzilla.
> But we concern we might have an accident to upload our confidential
> information when we discuss this issue in the future.
> And most of bugzillas we reported are marked as "private" flag and "NEC
> confidential" group.
> We don't want to make an exceptional case.

We link public GitHub PRs back to public bugs, and those bugs end up in product release notes. It's really best for us if our bugs are public. If we don't make this public, we'll create a new bug that restates the problem (now that it's believed to be understood) with no NEC-specific information, and this bug will be closed as a duplicate.
> We link public GitHub PRs back to public bugs and those bugs end up in
> product release notes. It's really best for us if our bugs are public.
> If we don't make this public we'll create a new bug that restates the
> problem now that it's believed to be understood, with no NEC specific
> information, this bug will be closed as a duplicate.

Thank you for the explanation again. Please make this bugzilla public. We will internally handle this bug as an exceptional case.
(In reply to Masaki Hatada from comment #28)
> We attempted to upgrade OCP4.3.19 to OCP4.4.3 on AWS disconnected env with
> the above way, and it worked.
> (Please see attached file for detail)
>
> We are still not sure if it succeeded as a coincidence. So currently we are
> trying the workaround on our vSphere disconnected env.

Sorry, I forgot to share the result from the vSphere disconnected env. We tested upgrading from OCP4.3.18 to OCP4.4.3 on that env with the workaround, and it worked.

Best Regards,
Masaki Hatada
Cloned back to bug 1843526 for 4.5. Will continue to backport further as QE verifies the fix in master-ward releases.
Verified this bug with 4.6.0-0.ci-2020-06-04-011434. Passed.

Set up a pure disconnected cluster with 4.5.0-0.nightly-2020-06-04-001344, then trigger an upgrade with --force to 4.6.0-0.ci-2020-06-04-011434.

[root@preserve-jialiu-ansible ~]# oc get clusterversion
NAME      VERSION                        AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.6.0-0.ci-2020-06-04-011434   True        False         32m     Cluster version is 4.6.0-0.ci-2020-06-04-011434

[root@preserve-jialiu-ansible ~]# oc get co
NAME                                       VERSION                        AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.6.0-0.ci-2020-06-04-011434   True        False         False      11h
cloud-credential                           4.6.0-0.ci-2020-06-04-011434   True        False         False      11h
cluster-autoscaler                         4.6.0-0.ci-2020-06-04-011434   True        False         False      12h
config-operator                            4.6.0-0.ci-2020-06-04-011434   True        False         False      8h
console                                    4.6.0-0.ci-2020-06-04-011434   True        False         False      38m
csi-snapshot-controller                    4.6.0-0.ci-2020-06-04-011434   True        False         False      42m
dns                                        4.6.0-0.ci-2020-06-04-011434   True        False         False      12h
etcd                                       4.6.0-0.ci-2020-06-04-011434   True        False         False      9h
image-registry                             4.6.0-0.ci-2020-06-04-011434   True        False         False      12h
ingress                                    4.6.0-0.ci-2020-06-04-011434   True        False         False      67m
insights                                   4.6.0-0.ci-2020-06-04-011434   True        False         False      12h
kube-apiserver                             4.6.0-0.ci-2020-06-04-011434   True        False         False      12h
kube-controller-manager                    4.6.0-0.ci-2020-06-04-011434   True        False         False      12h
kube-scheduler                             4.6.0-0.ci-2020-06-04-011434   True        False         False      12h
kube-storage-version-migrator              4.6.0-0.ci-2020-06-04-011434   True        False         False      42m
machine-api                                4.6.0-0.ci-2020-06-04-011434   True        False         False      12h
machine-approver                           4.6.0-0.ci-2020-06-04-011434   True        False         False      9h
machine-config                             4.6.0-0.ci-2020-06-04-011434   True        False         False      12h
marketplace                                4.6.0-0.ci-2020-06-04-011434   True        False         False      40m
monitoring                                 4.6.0-0.ci-2020-06-04-011434   True        False         False      7h33m
network                                    4.6.0-0.ci-2020-06-04-011434   True        False         False      12h
node-tuning                                4.6.0-0.ci-2020-06-04-011434   True        False         False      67m
openshift-apiserver                        4.6.0-0.ci-2020-06-04-011434   True        False         False      7h51m
openshift-controller-manager               4.6.0-0.ci-2020-06-04-011434   True        False         False      12h
openshift-samples                          4.6.0-0.ci-2020-06-04-011434   True        False         False      59m
operator-lifecycle-manager                 4.6.0-0.ci-2020-06-04-011434   True        False         False      12h
operator-lifecycle-manager-catalog         4.6.0-0.ci-2020-06-04-011434   True        False         False      12h
operator-lifecycle-manager-packageserver   4.6.0-0.ci-2020-06-04-011434   True        False         False      37m
service-ca                                 4.6.0-0.ci-2020-06-04-011434   True        False         False      12h
storage                                    4.6.0-0.ci-2020-06-04-011434   True        False         False      67m

[root@preserve-jialiu-ansible ~]# oc get node
NAME                                        STATUS   ROLES    AGE   VERSION
ip-10-0-50-112.us-east-2.compute.internal   Ready    master   12h   v1.18.3+a637491
ip-10-0-50-64.us-east-2.compute.internal    Ready    master   12h   v1.18.3+a637491
ip-10-0-62-65.us-east-2.compute.internal    Ready    worker   12h   v1.18.3+a637491
ip-10-0-67-224.us-east-2.compute.internal   Ready    master   12h   v1.18.3+a637491
ip-10-0-75-234.us-east-2.compute.internal   Ready    worker   12h   v1.18.3+a637491
ip-10-0-78-237.us-east-2.compute.internal   Ready    worker   12h   v1.18.3+a637491
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:4196
Removing UpgradeBlocker from this older bug, to remove it from the suspect queue described in [1]. If you feel like this bug still needs to be a suspect, please add keyword again. [1]: https://github.com/openshift/enhancements/pull/475