Bug 1838497 - [RHOCP4.4] Unable to upgrade OCP4.3.19 to OCP4.4 in disconnected env: CVO enters reconciling mode without applying any manifests in update mode
Summary: [RHOCP4.4] Unable to upgrade OCP4.3.19 to OCP4.4 in disconnected env: CVO enters reconciling mode without applying any manifests in update mode
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cluster Version Operator
Version: 4.3.z
Hardware: x86_64
OS: Linux
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: 4.6.0
Assignee: W. Trevor King
QA Contact: Johnny Liu
URL:
Whiteboard:
Depends On:
Blocks: 1780703 1723620 1843526
 
Reported: 2020-05-21 09:26 UTC by Masaki Hatada
Modified: 2023-10-06 20:11 UTC
CC List: 10 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: The Cluster Version Operator had a race where it would consider a timed-out update reconciliation cycle as a successful update. The race was very rare, except for restricted-network clusters where the operator timed out attempting to fetch release image signatures. Consequence: The Cluster Version Operator would enter its shuffled-manifest reconciliation mode, possibly breaking the cluster if the manifests were applied in an order that the components could not handle. Fix: The Cluster Version Operator now treats those timed-out updates as failures. Result: The Cluster Version Operator no longer enters reconciling mode before the update succeeds.
Clone Of:
Environment:
Last Closed: 2020-10-27 16:00:22 UTC
Target Upstream Version:
Embargoed:


Attachments
qe cvo log (91.02 KB, application/gzip) - 2020-05-26 14:25 UTC, Johnny Liu
qe cvo log 2 (1.80 MB, application/gzip) - 2020-05-27 08:11 UTC, Johnny Liu


Links
Github openshift cluster-version-operator pull 372 (closed): Bug 1838497: pkg/cvo/sync_worker: Do not treat "All errors were context errors..." as success - 2021-01-27 03:16:35 UTC
Red Hat Product Errata RHBA-2020:4196 - 2020-10-27 16:00:42 UTC

Description Masaki Hatada 2020-05-21 09:26:40 UTC
1. Bug Overview:
a) Description of bug report: 

  [RHOCP4.4] Unable to upgrade OCP4.3.19 to OCP4.4 in disconnected env

b) Bug Description: 

  In a disconnected environment, upgrading OCP 4.3.19 to OCP 4.4 gets stuck.
  It appears to be caused by nodes being upgraded before kube-apiserver and the other master components are upgraded.

  We have already tried to upgrade OCP 4.3 to OCP 4.4 in disconnected environments on both AWS and bare metal multiple times, and it failed in every trial.
  (We already filed Case 02653914 for AWS.)

  Version information:
    $ oc version
    Client Version: 4.3.19
    Server Version: 4.3.19
    Kubernetes Version: v1.16.2

2. Bug Details:

a) Architectures: 64-bit Intel EM64T/AMD64
  x86_64

b) Bugzilla Dependencies:

c) Drivers or hardware dependencies:

d) Upstream acceptance information: 

e) External links:

f) Severity (H,M,L):
  H

g) How reproducible: 
  Always

h) Steps to Reproduce: 

  Step1. Create OCP 4.3.19 cluster in disconnected environment following the document below.

    https://access.redhat.com/documentation/en-us/openshift_container_platform/4.3/html/installing_on_bare_metal/installing-on-bare-metal#installing-restricted-networks-bare-metal

  Step2. Mirror images for v4.4.4 to local mirror registry.

    $ export OCP_RELEASE=4.4.4-x86_64
    $ export LOCAL_REGISTRY=<local mirror registry>
    $ export LOCAL_REPOSITORY=ocp4/openshift4
    $ export PRODUCT_REPO='openshift-release-dev'
    $ export LOCAL_SECRET_JSON=/ocp4/pull-secret.json
    $ export RELEASE_NAME="ocp-release" 

  $ oc adm -a ${LOCAL_SECRET_JSON} release mirror \
     --from=quay.io/${PRODUCT_REPO}/${RELEASE_NAME}:${OCP_RELEASE} \
     --to=${LOCAL_REGISTRY}/${LOCAL_REPOSITORY} \
     --to-release-image=${LOCAL_REGISTRY}/${LOCAL_REPOSITORY}:${OCP_RELEASE}

  Step3. Upgrade cluster by using "oc adm upgrade" command.

    $ export RELEASE=4.4.4-x86_64

    $ oc adm release info quay.io/openshift-release-dev/ocp-release:$RELEASE | grep Pull
    Pull From: quay.io/openshift-release-dev/ocp-release@sha256:baa687f29b0ac155d8f4c6914056d36d68f343feb9c1e82b46eef95819d00be5

    $ oc adm upgrade --to-image=quay.io/openshift-release-dev/ocp-release@sha256:baa687f29b0ac155d8f4c6914056d36d68f343feb9c1e82b46eef95819d00be5 --allow-explicit-upgrade --force=true
    Updating to release image quay.io/openshift-release-dev/ocp-release@sha256:baa687f29b0ac155d8f4c6914056d36d68f343feb9c1e82b46eef95819d00be5

i) Actual results:

  Upgrading gets stuck on the way.

    $ oc get clusterversion
    NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
    version   4.3.19    True        False         28m     Error while reconciling 4.4.4: some resources could not be updated
  
    $ oc get co
    NAME                                       VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE
    authentication                             4.4.4     True        False         False      20h
    cloud-credential                           4.3.19    True        False         False      20h
    cluster-autoscaler                         4.4.4     True        False         False      20h
    console                                    4.3.19    True        False         False      43m
    csi-snapshot-controller                    4.4.4     True        False         False      12m
    dns                                        4.4.4     True        False         False      20h
    image-registry                             4.4.4     True        False         False      35m
    ingress                                    4.3.19    True        False         False      51m
    insights                                   4.3.19    True        False         False      20h
    kube-apiserver                             4.3.19    True        False         True       20h
    kube-controller-manager                    4.3.19    True        False         True       20h
    kube-scheduler                             4.3.19    True        False         True       20h
    machine-api                                4.4.4     True        False         False      20h
    machine-config                             4.3.19    False       True          True       3m35s
    marketplace                                4.4.4     True        False         False      13m
    monitoring                                 4.4.4     True        False         False      11m
    network                                    4.3.19    True        False         False      20h
    node-tuning                                4.4.4     True        False         False      20m
    openshift-apiserver                        4.4.4     True        False         True       17m
    openshift-controller-manager               4.4.4     True        False         False      20h
    openshift-samples                          4.3.19    True        False         False      20h
    operator-lifecycle-manager                 4.3.19    True        False         False      20h
    operator-lifecycle-manager-catalog         4.3.19    True        False         False      20h
    operator-lifecycle-manager-packageserver   4.3.19    True        False         False      13m
    service-ca                                 4.4.4     True        False         False      20h
    service-catalog-apiserver                  4.4.4     True        False         False      20h
    service-catalog-controller-manager         4.4.4     True        False         False      20h
    storage                                    4.4.4     True        False         False

  Some nodes are stuck at NotReady.
  Some nodes have already been updated to OCP 4.4 (v1.17.1), even though kube-apiserver and the other master components have not been updated yet.

    $ oc get nodes
    NAME                STATUS     ROLES          AGE   VERSION
    host-172-25-69-12   NotReady   master         22h   v1.17.1
    host-172-25-69-18   Ready      infra,worker   21h   v1.16.2
    host-172-25-69-27   NotReady   infra,worker   21h   v1.16.2
    host-172-25-69-28   NotReady   worker         22h   v1.17.1
    host-172-25-69-29   Ready      master         22h   v1.16.2
    host-172-25-69-35   Ready      master         22h   v1.16.2
    host-172-25-69-36   Ready      worker         22h   v1.16.2
    host-172-25-69-39   NotReady   infra,worker   21h   v1.17.1

j) Expected results: 

  Upgrade process finishes with no problem.

k) Additional information

  From the logs of kubelet, it failed to initialize CSINode.

    [core@host-172-25-69-39 ~]$ systemctl status kubelet
    ...(snip)...
    May 21 04:30:53 host-172-25-69-12 hyperkube[207013]: I0521 04:30:53.866160  207013 nodeinfomanager.go:402] Failed to publish CSINode: the server could not find the requested resource
    May 21 04:30:53 host-172-25-69-12 hyperkube[207013]: I0521 04:30:53.912220  207013 nodeinfomanager.go:402] Failed to publish CSINode: the server could not find the requested resource
    May 21 04:30:53 host-172-25-69-12 hyperkube[207013]: I0521 04:30:53.985799  207013 nodeinfomanager.go:402] Failed to publish CSINode: the server could not find the requested resource
    May 21 04:30:54 host-172-25-69-12 hyperkube[207013]: I0521 04:30:54.266762  207013 nodeinfomanager.go:402] Failed to publish CSINode: the server could not find the requested resource
    May 21 04:30:54 host-172-25-69-12 hyperkube[207013]: E0521 04:30:54.266833  207013 csi_plugin.go:273] Failed to initialize CSINode: error updating CSINode annotation: timed out waiting for the condition; caused by: the server could not find the requested resource

  We know the CSI migration feature was introduced in OCP 4.4 (v1.17.1), so I tried disabling it in kubelet.conf.

    [core@host-172-25-69-39 ~]$ vi /etc/kubernetes/kubelet.conf
    ...(snip)...
    featureGates:
      CSIMigration: false <=== Added this entry in kubelet.conf
    ...(snip)...

  After that, the node's status was changed to Ready.

    $ oc get nodes
    NAME                STATUS     ROLES          AGE   VERSION
    host-172-25-69-12   NotReady   master         22h   v1.17.1
    host-172-25-69-18   Ready      infra,worker   21h   v1.16.2
    host-172-25-69-27   NotReady   infra,worker   21h   v1.16.2
    host-172-25-69-28   NotReady   worker         22h   v1.17.1
    host-172-25-69-29   Ready      master         22h   v1.16.2
    host-172-25-69-35   Ready      master         22h   v1.16.2
    host-172-25-69-36   Ready      worker         22h   v1.16.2
    host-172-25-69-39   Ready      infra,worker   21h   v1.17.1 <=== The status was changed

  So the reason the node became NotReady is that it was upgraded before the master components were updated.
  Upgrading nodes should be blocked until the master components have been updated.

3. Business impact:

  All of our Japanese customers use OpenShift 4 in disconnected environments.
  Red Hat asks them to upgrade their clusters frequently, but with this issue an upgrade could break their clusters.
  This is a very serious problem.

4. Primary contact at Red Hat, email, phone (chat)
  mfuruta

5. Primary contact at Partner, email, phone (chat)
  mas-hatada.nec.com (Masaki Hatada)

Comment 1 Lalatendu Mohanty 2020-05-21 12:21:41 UTC
The command to update seems wrong as you should be using the local registry instead of quay.io for that i.e oc adm upgrade --allow-explicit-upgrade --to-image --force=true LOCAL_REGISTRY/REPOSITORY@sha256:<sha256sum>. Can you check if this fixes your issue?

Comment 2 Masaki Hatada 2020-05-21 13:26:45 UTC
Dear Lalatendu,

> The command to update seems wrong as you should be using the local registry instead of quay.io for that i.e oc adm upgrade --allow-explicit-upgrade --to-image --force=true LOCAL_REGISTRY/REPOSITORY@sha256:<sha256sum>. Can you check if this fixes your issue?

We don't think our command line is wrong, because we used the same command line when testing the upgrade on an OCP 4.3 AWS disconnected environment to obtain a Support Exception from Red Hat.
Also, thanks to the ImageContentSourcePolicy, the quay.io URL is redirected to our private registry automatically, so there is no problem from the technical side either.

Even if our command line were wrong, we cannot understand why it would change the order in which Operators are upgraded.
Could you explain the relation between the wrong ordering and our command line?

[1] Upgrades and order
    https://github.com/openshift/cluster-version-operator/blob/master/docs/dev/upgrades.md

Best Regards,
Masaki Hatada

Comment 3 Masaki Hatada 2020-05-21 13:28:32 UTC
[1] explains how the CVO decides the order. It seems that the ordering information is generated from the file names stored in the image.
The OCP upgrade process did start with our command line, so we think the CVO must be getting that information from the image stored in our private registry...

Comment 4 Lalatendu Mohanty 2020-05-22 17:24:24 UTC
Right, ImageContentSourcePolicy will take care of the registry URLs in the oc command.

Comment 5 W. Trevor King 2020-05-22 20:07:02 UTC
If you can attach a must-gather [0] to your case, that would be helpful.  In the absence of one, here's a bit of a walkthrough for a 4.3.19 -> 4.4.4 CI job [1].

* 16:04Z, master-0, master-1, and master-2 come up with v1.16.2 kubelets [5].
* 17:03Z, update triggered [2].
* 17:04Z, etcd operator levels at 4.4.4 [4].
* 17:07Z, kube-apiserver levels at 4.4.4 [4].
* 17:11Z, openshift-apiserver levels at 4.4.4 [4].
* 17:17Z, csi-snapshot-controller levels at 4.4.4 [4].
* 17:19Z, authentication operator levels at 4.4.4 [4].
* 17:31Z, etcd operator finishes progressing [9].
* 17:32:03Z, openshift-apiserver goes Degraded=True, APIServerDeploymentDegraded: 1 of 3 requested instances are unavailable [4]
* 17:32:13Z, openshift-apiserver goes Degraded=False
* 17:36Z, machine-config begins transition to 4.4.4 [4].
* 17:42Z, openshift-apiserver goes Degraded=True, APIServerDeploymentDegraded: 1 of 3 requested instances are unavailable [4]
* 17:44Z, master-1 and worker-eastus21 ready again [3], both on v1.17.1 kubelets [5,6].
* 17:48Z, master-0 ready again [3] on the v1.17.1 kubelet [5].
* 17:49Z, worker-eastus22 ready again [3] on the v1.17.1 kubelet [6].
* 17:53Z, master-2 ready again [3] on the v1.17.1 kubelet [5].
* 17:53Z, machine-config levels at 4.4.4 [4].
* 17:55Z, worker-eastus23 ready again [3] on the v1.17.1 kubelet [6].
* 17:58Z, update complete [2].

And checking the operator order [7], we can see that kube-apiserver is at level 20 in 4.4.4's manifest graph, the CSI snapshot controller is level 50, and the machine-config operator that rolls out the new kubelet binaries along with other machine-os-content is at level 80 (more on manifest graphs in [8]).
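
For anyone mapping operator names to those levels: the run level is just the second underscore-delimited field of the manifest filename.  A throwaway sketch in Go (not the CVO's actual parsing code) that pulls it out of names like the ones in [7]:

package main

import (
    "fmt"
    "strconv"
    "strings"
)

// runLevel extracts the two-digit run level from a release manifest
// filename such as "0000_80_machine-config-operator_06_clusteroperator.yaml".
func runLevel(name string) (int, error) {
    parts := strings.Split(name, "_")
    if len(parts) < 2 {
        return 0, fmt.Errorf("unexpected manifest name %q", name)
    }
    return strconv.Atoi(parts[1])
}

func main() {
    for _, name := range []string{
        "0000_20_kube-apiserver-operator_07_clusteroperator.yaml",
        "0000_50_cluster-csi-snapshot-controller-operator_08_clusteroperator.yaml",
        "0000_80_machine-config-operator_06_clusteroperator.yaml",
    } {
        level, _ := runLevel(name)
        fmt.Printf("%s -> run level %d\n", name, level)
    }
}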

Dumping all the operator conditions [9], there is some strangeness.  I haven't dug into why node-tuning goes Available=True way late at 17:57Z.  Or why csi-snapshot-controller sets Progressing=False at 17:51Z.  But as you point out, the expected behavior is definitely to update the control plane components before we roll out the new kubelets, and if you want us to figure out why that didn't happen for your cluster, we'll probably need some additional logs (ideally a must-gather).

[0]: https://docs.openshift.com/container-platform/4.4/support/gathering-cluster-data.html#about-must-gather_gathering-cluster-data
[1]: https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-launch-azure/388
[2]: $ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-launch-azure/388/artifacts/launch/clusterversion.json | jq -r '.items[].status.history[] | .startedTime + " " + (.completionTime // "-") + " " + .state + " " + .version'
     2020-05-13T17:03:15Z 2020-05-13T17:58:48Z Completed 4.4.4
     2020-05-13T16:06:09Z 2020-05-13T16:44:00Z Completed 4.3.19
[3]: $ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-launch-azure/388/artifacts/launch/nodes.json | jq -r '.items[] | .ready = ([.status.conditions[] | select(.type == "Ready")][0]) | .ready.lastTransitionTime + " " + .ready.status + " " + .metadata.name' | sort
     2020-05-13T17:44:15Z True ci-ln-6bdji0t-002ac-jw4hp-master-1
     2020-05-13T17:44:29Z True ci-ln-6bdji0t-002ac-jw4hp-worker-eastus21-8fwhr
     2020-05-13T17:48:35Z True ci-ln-6bdji0t-002ac-jw4hp-master-0
     2020-05-13T17:49:25Z True ci-ln-6bdji0t-002ac-jw4hp-worker-eastus22-dlssq
     2020-05-13T17:53:07Z True ci-ln-6bdji0t-002ac-jw4hp-master-2
     2020-05-13T17:55:27Z True ci-ln-6bdji0t-002ac-jw4hp-worker-eastus23-8jr2j
[4]: $ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-launch-azure/388/artifacts/launch/events.json | jq -r '[.items[] | .timePrefix = if .firstTimestamp == null or .firstTimestamp == "null" then .eventTime else .firstTimestamp + " - " + .lastTimestamp + " (" + (.count | tostring) + ")" end] | sort_by(.timePrefix)[] | .timePrefix + " " + .metadata.namespace + " " + .message'
     ...
     2020-05-13T17:04:08Z - 2020-05-13T17:04:08Z (1) openshift-etcd-operator clusteroperator/etcd version "operator" changed from "" to "4.4.4"
     ...
     2020-05-13T17:07:57Z - 2020-05-13T17:07:57Z (1) openshift-kube-apiserver-operator clusteroperator/kube-apiserver version "operator" changed from "4.3.19" to "4.4.4"
     ...
     2020-05-13T17:11:23Z - 2020-05-13T17:11:23Z (2) openshift-apiserver-operator clusteroperator/openshift-apiserver version "operator" changed from "4.3.19" to "4.4.4"
     ...
     2020-05-13T17:17:25Z - 2020-05-13T17:17:25Z (2) openshift-cluster-storage-operator clusteroperator/csi-snapshot-controller version "operator" changed from "" to "4.4.4"
     ...
     2020-05-13T17:19:29Z - 2020-05-13T17:19:29Z (2) openshift-authentication-operator clusteroperator/authentication version "operator" changed from "4.3.19" to "4.4.4"
     ...
     2020-05-13T17:32:03Z - 2020-05-13T17:32:03Z (1) openshift-apiserver-operator Status for clusteroperator/openshift-apiserver changed: Degraded changed from False to True ("APIServerDeploymentDegraded: 1 of 3 requested instances are unavailable")
     ...
     2020-05-13T17:32:13Z - 2020-05-13T17:32:13Z (1) openshift-apiserver-operator Status for clusteroperator/openshift-apiserver changed: Degraded changed from True to False ("")
     ...
     2020-05-13T17:36:11Z - 2020-05-13T17:36:11Z (1) default clusteroperator/machine-config-operator started a version change from [{operator 4.3.19}] to [{operator 4.4.4}]
     ...
     2020-05-13T17:42:34Z - 2020-05-13T17:42:40Z (3) openshift-apiserver-operator (combined from similar events): Status for clusteroperator/openshift-apiserver changed: Degraded changed from False to True ("APIServerDeploymentDegraded: 1 of 3 requested instances are unavailable"),Progressing message changed from "OperatorConfigProgressing: openshiftapiserveroperatorconfigs/instance: observed generation is 9, desired generation is 10.\nAPIServerDeploymentProgressing: deployment/apiserver.openshift-operator: observed generation is 5, desired generation is 6." to "APIServerDeploymentProgressing: deployment/apiserver.openshift-operator: observed generation is 5, desired generation is 6."
     ...
     2020-05-13T17:53:53Z - 2020-05-13T17:53:53Z (1) default clusteroperator/machine-config-operator version changed from [{operator 4.3.19}] to [{operator 4.4.4}]
     ...
[5]: $ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-launch-azure/388/artifacts/launch/nodes/masters-journal | gunzip | grep 'v1\.1[67]' | sort
     May 13 16:04:11.897824 ci-ln-6bdji0t-002ac-jw4hp-master-1 hyperkube[1430]: I0513 16:04:11.897795    1430 server.go:418] Version: v1.16.2
     May 13 16:04:12.260155 ci-ln-6bdji0t-002ac-jw4hp-master-2 hyperkube[1435]: I0513 16:04:12.260123    1435 server.go:418] Version: v1.16.2
     May 13 16:04:27.466254 ci-ln-6bdji0t-002ac-jw4hp-master-0 hyperkube[1435]: I0513 16:04:27.466202    1435 server.go:418] Version: v1.16.2
     May 13 17:44:15.059318 ci-ln-6bdji0t-002ac-jw4hp-master-1 hyperkube[1444]: I0513 17:44:15.059247    1444 server.go:424] Version: v1.17.1
     May 13 17:48:35.575490 ci-ln-6bdji0t-002ac-jw4hp-master-0 hyperkube[1455]: I0513 17:48:35.575423    1455 server.go:424] Version: v1.17.1
     May 13 17:53:06.963297 ci-ln-6bdji0t-002ac-jw4hp-master-2 hyperkube[1447]: I0513 17:53:06.963249    1447 server.go:424] Version: v1.17.1
[6]: $ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-launch-azure/388/artifacts/launch/nodes/workers-journal | gunzip | grep 'v1\.1[67]' | sort
     May 13 16:30:56.604435 ci-ln-6bdji0t-002ac-jw4hp-worker-eastus21-8fwhr hyperkube[1332]: I0513 16:30:56.604406    1332 server.go:418] Version: v1.16.2
     May 13 16:39:59.592251 ci-ln-6bdji0t-002ac-jw4hp-worker-eastus23-8jr2j hyperkube[1342]: I0513 16:39:59.592232    1342 server.go:418] Version: v1.16.2
     May 13 16:40:05.287868 ci-ln-6bdji0t-002ac-jw4hp-worker-eastus22-dlssq hyperkube[1350]: I0513 16:40:05.287842    1350 server.go:418] Version: v1.16.2
     May 13 17:44:18.742095 ci-ln-6bdji0t-002ac-jw4hp-worker-eastus21-8fwhr hyperkube[1351]: I0513 17:44:18.742008    1351 server.go:424] Version: v1.17.1
     May 13 17:49:15.002727 ci-ln-6bdji0t-002ac-jw4hp-worker-eastus22-dlssq hyperkube[1360]: I0513 17:49:15.002690    1360 server.go:424] Version: v1.17.1
     May 13 17:55:16.467479 ci-ln-6bdji0t-002ac-jw4hp-worker-eastus23-8jr2j hyperkube[1347]: I0513 17:55:16.467199    1347 server.go:424] Version: v1.17.1
[7]: $ oc adm release extract --to=manifests quay.io/openshift-release-dev/ocp-release:4.4.4-x86_64
     $ grep -rl 'kind: ClusterOperator' manifests | sort
     manifests/0000_12_etcd-operator_07_clusteroperator.yaml
     manifests/0000_20_kube-apiserver-operator_07_clusteroperator.yaml
     manifests/0000_25_kube-controller-manager-operator_07_clusteroperator.yaml
     manifests/0000_25_kube-scheduler-operator_07_clusteroperator.yaml
     manifests/0000_30_machine-api-operator_12_clusteroperator.yaml
     manifests/0000_30_openshift-apiserver-operator_08_clusteroperator.yaml
     manifests/0000_40_kube-storage-version-migrator-operator_09_clusteroperator.yaml
     manifests/0000_50_cloud-credential-operator_10_cluster-operator.yaml
     manifests/0000_50_cluster-authentication-operator_08_clusteroperator.yaml
     manifests/0000_50_cluster-autoscaler-operator_08_clusteroperator.yaml
     manifests/0000_50_cluster-csi-snapshot-controller-operator_08_clusteroperator.yaml
     manifests/0000_50_cluster-image-registry-operator_08-clusteroperator.yaml
     manifests/0000_50_cluster-ingress-operator_03-cluster-operator.yaml
     manifests/0000_50_cluster_monitoring_operator_06-clusteroperator.yaml
     manifests/0000_50_cluster-node-tuning-operator_05-clusteroperator.yaml
     manifests/0000_50_cluster-openshift-controller-manager-operator_10_clusteroperator.yaml
     manifests/0000_50_cluster-samples-operator_07-clusteroperator.yaml
     manifests/0000_50_cluster-storage-operator_03-cluster-operator.yaml
     manifests/0000_50_cluster-svcat-apiserver-operator_08_cluster-operator.yaml
     manifests/0000_50_cluster-svcat-controller-manager-operator_10_cluster-operator.yaml
     manifests/0000_50_console-operator_08-clusteroperator.yaml
     manifests/0000_50_insights-operator_07-cluster-operator.yaml
     manifests/0000_50_olm_99-operatorstatus.yaml
     manifests/0000_50_operator-marketplace_10_clusteroperator.yaml
     manifests/0000_50_service-ca-operator_07_clusteroperator.yaml
     manifests/0000_70_cluster-network-operator_04_clusteroperator.yaml
     manifests/0000_70_dns-operator_03-cluster-operator.yaml
     manifests/0000_80_machine-config-operator_06_clusteroperator.yaml
[8]: https://github.com/openshift/cluster-version-operator/blob/master/docs/user/reconciliation.md#manifest-graph
[9]: $ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-launch-azure/388/artifacts/launch/clusteroperators.json | jq -r '[.items[] | .metadata.name as $name | .status.conditions[] | .lastTransitionTime + " " + $name + " " + .type + "=" + .status + " " + (.reason // "-") + ": " + (.message // "-")] | sort[]'
     ...
     2020-05-13T17:17:21Z csi-snapshot-controller Degraded=False AsExpected: -
     2020-05-13T17:17:26Z cloud-credential Progressing=False ReconcilingComplete: 3 of 3 credentials requests provisioned and reconciled.
     2020-05-13T17:17:26Z csi-snapshot-controller Upgradeable=True AsExpected: -
     ...
     2020-05-13T17:31:11Z etcd Progressing=False AsExpected: NodeInstallerProgressing: 3 nodes are at revision 3
     EtcdMembersProgressing: all members have started
     2020-05-13T17:32:32Z kube-scheduler Progressing=False AsExpected: NodeInstallerProgressing: 3 nodes are at revision 7
     2020-05-13T17:41:30Z ingress Degraded=False NoIngressControllersDegraded: -
     2020-05-13T17:43:19Z monitoring Available=True RollOutDone: Successfully rolled out the stack.
     2020-05-13T17:43:19Z monitoring Degraded=False -: -
     2020-05-13T17:43:19Z monitoring Progressing=False -: -
     2020-05-13T17:45:52Z service-ca Progressing=False AsExpected: Progressing: All service-ca-operator deployments updated
     2020-05-13T17:45:57Z dns Degraded=False AsExpected: All desired DNS DaemonSets available and operand Namespace exists
     2020-05-13T17:46:40Z csi-snapshot-controller Available=True AsExpected: -
     2020-05-13T17:49:41Z authentication Progressing=False AsExpected: -
     2020-05-13T17:49:45Z console Available=True AsExpected: -
     2020-05-13T17:49:46Z operator-lifecycle-manager-packageserver Available=True -: -
     2020-05-13T17:49:46Z operator-lifecycle-manager-packageserver Progressing=False -: Deployed version 0.14.2
     2020-05-13T17:50:23Z marketplace Available=True OperatorAvailable: Available release version: 4.4.4
     2020-05-13T17:50:23Z marketplace Progressing=False OperatorAvailable: Successfully progressed to release version: 4.4.4
     2020-05-13T17:50:23Z marketplace Upgradeable=True OperatorAvailable: Marketplace is upgradeable
     2020-05-13T17:51:46Z kube-storage-version-migrator Available=True AsExpected: -
     2020-05-13T17:51:51Z openshift-apiserver Progressing=False AsExpected: -
     2020-05-13T17:51:58Z ingress Available=True -: desired and current number of IngressControllers are equal
     2020-05-13T17:51:58Z ingress Progressing=False -: desired and current number of IngressControllers are equal
     2020-05-13T17:51:59Z csi-snapshot-controller Progressing=False AsExpected: -
     2020-05-13T17:52:47Z image-registry Available=True Ready: The registry is ready
     2020-05-13T17:52:47Z image-registry Progressing=False Ready: The registry is ready
     2020-05-13T17:53:54Z machine-config Progressing=False -: Cluster version is 4.4.4
     2020-05-13T17:54:27Z etcd Degraded=False AsExpected: NodeControllerDegraded: All master nodes are ready
     EtcdMembersDegraded: No unhealthy members found
     2020-05-13T17:54:59Z openshift-samples Available=True -: Samples installation successful at 4.4.4
     2020-05-13T17:54:59Z openshift-samples Degraded=False -: -
     2020-05-13T17:54:59Z openshift-samples Progressing=True -: Samples processing to 4.4.4
     2020-05-13T17:55:25Z openshift-apiserver Degraded=False AsExpected: -
     2020-05-13T17:56:17Z dns Progressing=False AsExpected: Desired and available number of DNS DaemonSets are equal
     2020-05-13T17:56:21Z network Progressing=False -: -
     2020-05-13T17:57:03Z node-tuning Progressing=False AsExpected: Cluster version is "4.4.4"
     2020-05-13T17:57:04Z node-tuning Available=True AsExpected: Cluster has deployed "4.4.4"
     2020-05-13T18:02:16Z monitoring Upgradeable=True -: -

Comment 6 Masaki Hatada 2020-05-25 03:21:57 UTC
Created attachment 1691677 [details]
Logs for both of successful and unsuccessful env

Dear Trevor,

Thank you for your update.

> If you can attach a must-gather [0] to your case, that would be helpful.  In the absence of one, here's a bit of a walkthrough for a 4.3.19 -> 4.4.4 CI job [1].

Unfortunately, must-gather fails with an error, as we reported at https://access.redhat.com/support/cases/#/case/02653914?commentId=a0a2K00000UzrmKQAR .

> you should be using the local registry instead of quay.io for that i.e oc adm upgrade --allow-explicit-upgrade --to-image --force=true LOCAL_REGISTRY/REPOSITORY@sha256:<sha256sum>. 

We attempted to upgrade with the above command line for test.
Somehow, it succeeded.

We don't know why it succeeded. It might have succeeded accidentally.
We are currently testing it on our other env.

We have uploaded the output of the following commands from both the successful and unsuccessful environments.
Please let us know if you need any other information.

- oc version
- oc get co
- oc get clusterversion -o yaml
- oc logs <cluster-version-operator pod>

Best Regards,
Masaki Hatada

Comment 7 Masaki Hatada 2020-05-26 05:40:02 UTC
> > you should be using the local registry instead of quay.io for that i.e oc adm upgrade --allow-explicit-upgrade --to-image --force=true LOCAL_REGISTRY/REPOSITORY@sha256:<sha256sum>. 
> 
> We attempted to upgrade with the above command line for test.
> Somehow, it succeeded.
> 
> We don't know why it succeeded. It might have succeeded accidentally.
> We are currently testing it on our other env.

We attempted to upgrade with --to-image=LOCAL_REGISTRY/REPOSITORY@sha256:<sha256sum>, but it failed.
So the previous success seems to have been a coincidence...

Currently we can complete the OCP 4.3 to OCP 4.4 upgrade in a NON-disconnected env; on the other hand, it fails with high probability in a disconnected env.

Does Red Hat have any idea about this issue and how we should investigate it?

---

$ oc get co
NAME                                       VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.4.4     True        False         False      25h
cloud-credential                           4.3.19    True        False         False      25h
cluster-autoscaler                         4.4.4     True        False         False      25h
console                                    4.3.19    True        False         False      23h
dns                                        4.4.4     True        False         False      25h
etcd                                       4.4.4     True        False         True       15m
image-registry                             4.4.4     True        False         False      23h
ingress                                    4.3.19    True        False         False      4h13m
insights                                   4.3.19    True        False         False      25h
kube-apiserver                             4.3.19    True        False         True       25h
kube-controller-manager                    4.3.19    True        False         True       25h
kube-scheduler                             4.3.19    True        False         True       25h
machine-api                                4.4.4     True        False         False      25h
machine-config                             4.3.19    False       True          True       22h
marketplace                                4.4.4     True        False         False      23h
monitoring                                 4.4.4     True        False         False      138m
network                                    4.3.19    True        False         False      25h
node-tuning                                4.4.4     True        False         False      23h
openshift-apiserver                        4.4.4     True        False         True       22h
openshift-controller-manager               4.4.4     True        False         False      25h
openshift-samples                          4.3.19    True        False         False      25h
operator-lifecycle-manager                 4.3.19    True        False         False      25h
operator-lifecycle-manager-catalog         4.3.19    True        False         False      25h
operator-lifecycle-manager-packageserver   4.3.19    True        False         False      6h58m
service-ca                                 4.4.4     True        False         False      25h
service-catalog-apiserver                  4.4.4     True        False         False      25h
service-catalog-controller-manager         4.4.4     True        False         False      25h
storage                                    4.4.4     True        False         False      23h

$ oc get clusterversion -o yaml
apiVersion: v1
items:
- apiVersion: config.openshift.io/v1
  kind: ClusterVersion
  metadata:
    creationTimestamp: "2020-05-25T03:48:21Z"
    generation: 2
    name: version
    resourceVersion: "696811"
    selfLink: /apis/config.openshift.io/v1/clusterversions/version
    uid: f9d80fa8-727c-40d3-8855-8b0dc0c2435d
  spec:
    channel: stable-4.3
    clusterID: 0c741002-21b9-4f18-83d8-8c05a6b556c2
    desiredUpdate:
      force: true
      image: matsuo-lbdns.nec.test:5000/ocp4/openshift4@sha256:baa687f29b0ac155d8f4c6914056d36d68f343feb9c1e82b46eef95819d00be5
      version: ""
    upstream: https://api.openshift.com/api/upgrades_info/v1/graph
  status:
    availableUpdates: null
    conditions:
    - lastTransitionTime: "2020-05-25T04:05:41Z"
      message: Done applying 4.3.19
      status: "True"
      type: Available
    - lastTransitionTime: "2020-05-26T05:25:10Z"
      status: "False"
      type: Failing
    - lastTransitionTime: "2020-05-25T06:19:15Z"
      message: 'Working towards 4.4.4: 11% complete'
      status: "True"
      type: Progressing
    - lastTransitionTime: "2020-05-25T03:48:34Z"
      message: 'Unable to retrieve available updates: Get https://api.openshift.com/api/upgrades_info/v1/graph?arch=amd64&channel=stable-4.3&id=0c741002-21b9-4f18-83d8-8c05a6b556c2&version=4.4.4:
        dial tcp 52.5.215.228:443: connect: connection timed out'
      reason: RemoteFailed
      status: "False"
      type: RetrievedUpdates
    desired:
      force: true
      image: matsuo-lbdns.nec.test:5000/ocp4/openshift4@sha256:baa687f29b0ac155d8f4c6914056d36d68f343feb9c1e82b46eef95819d00be5
      version: 4.4.4
    history:
    - completionTime: null
      image: matsuo-lbdns.nec.test:5000/ocp4/openshift4@sha256:baa687f29b0ac155d8f4c6914056d36d68f343feb9c1e82b46eef95819d00be5
      startedTime: "2020-05-25T05:52:26Z"
      state: Partial
      verified: false
      version: 4.4.4
    - completionTime: "2020-05-25T04:05:41Z"
      image: quay.io/openshift-release-dev/ocp-release@sha256:28c451bfb531e5af71332a933073a2d97673ed157150bacb014a4358501f05ad
      startedTime: "2020-05-25T03:48:34Z"
      state: Completed
      verified: false
      version: 4.3.19
    observedGeneration: 2
    versionHash: Q1yJVFYRV4A=
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""

Comment 8 Masaki Hatada 2020-05-26 12:32:57 UTC
Dear Red Hat,

According to the log of cluster-version-operator, it seems that at least 3 upgrade threads were running at the same time.

 I0524 16:11:29.481202       1 sync_worker.go:621] Running sync for namespace "openshift-cluster-storage-operator" (307 of 573)
 I0524 16:11:29.481223       1 sync_worker.go:621] Running sync for credentialsrequest "openshift-cloud-credential-operator/openshift-machine-api-aws" (99 of 573)
 I0524 16:11:29.487623       1 sync_worker.go:621] Running sync for clusterrolebinding "cluster-storage-operator" (308 of 573)
 I0524 16:11:29.490376       1 sync_worker.go:621] Running sync for clusterrole "cluster-storage-operator" (309 of 573)
 I0524 16:11:29.493510       1 sync_worker.go:621] Running sync for rolebinding "openshift-cluster-storage-operator/cluster-storage-operator" (310 of 573)
 I0524 16:11:29.497488       1 sync_worker.go:621] Running sync for role "openshift-cluster-storage-operator/cluster-storage-operator" (311 of 573)
 I0524 16:11:29.500029       1 sync_worker.go:621] Running sync for serviceaccount "openshift-cluster-storage-operator/cluster-storage-operator" (312 of 573)
 I0524 16:11:29.503096       1 sync_worker.go:621] Running sync for deployment "openshift-cluster-storage-operator/cluster-storage-operator" (313 of 573)
 I0524 16:11:29.685004       1 sync_worker.go:621] Running sync for clusteroperator "storage" (314 of 573)
 I0524 16:11:29.685135       1 sync_worker.go:621] Running sync for namespace "openshift-controller-manager-operator" (270 of 573)
 I0524 16:11:29.784595       1 sync_worker.go:621] Running sync for namespace "openshift-controller-manager" (271 of 573)
 I0524 16:11:29.884781       1 sync_worker.go:621] Running sync for customresourcedefinition "openshiftcontrollermanagers.operator.openshift.io" (272 of 573)
 I0524 16:11:29.985774       1 sync_worker.go:621] Running sync for openshiftcontrollermanager "cluster" (273 of 573)
 I0524 16:11:31.785411       1 sync_worker.go:621] Running sync for configmap "openshift-controller-manager-operator/openshift-controller-manager-operator-config" (274 of 573)
 I0524 16:11:31.836388       1 sync_worker.go:621] Running sync for credentialsrequest "openshift-cloud-credential-operator/openshift-machine-api-azure" (100 of 573)

 Thread1 ... 307 -> 308 -> ... -> 314
 Thread2 ... 270 -> 271 -> ... -> 274
 Thread3 ... 99 -> 100

And it seems that Thread1 (which was processing machine-config) and Thread3 (which was processing openshift-apiserver) failed at the same time.

  E0524 16:17:14.530902       1 task.go:77] error running apply for clusteroperator "machine-config" (475 of 573): Cluster operator machine-config is still updating
  I0524 16:17:14.530965       1 task_graph.go:590] Running 20 on 0
  I0524 16:17:14.530982       1 task_graph.go:583] Canceled worker 0
  E0524 16:17:14.530902       1 task.go:77] error running apply for clusteroperator "openshift-apiserver" (139 of 573): Cluster operator openshift-apiserver is reporting a failure: APIServerDeploymentDegraded: 1 of 3 requested instances are unavailable

So currently we imagine:
- The order is not wrong, but multiple upgrade threads are somehow running at the same time
- The CVO tries to access the internet to get the image signature. In a disconnected environment it takes a while for the timeout to happen, so an additional upgrade thread might be kicked off by the CVO before the timeout happened
- If the upgrade target were the same minor version (4.3.y), the upgrade would not fail even if multiple upgrade threads ran, because the kubelet would still be able to run before the api-server and other master components were upgraded. But if the target is a different minor version, the upgrade fails

We have no evidence yet, but at least it is true that multiple upgrade threads were running at the same time.

We'd like Red Hat to check this point.

Best Regards,
Masaki Hatada

Comment 10 Johnny Liu 2020-05-26 14:25:56 UTC
Created attachment 1692291 [details]
qe cvo log

Comment 13 W. Trevor King 2020-05-26 18:46:33 UTC
From comment 8:
> - The order is not wrong, but multiple upgrade threads are somehow running at the same time

Yes, the CVO spawns several workers to walk parallel portions of the manifest graph simultaneously [1].  We still expect prerequisite graph nodes to block all subsequent graph nodes.
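
As a rough illustration of that model (a toy sketch in Go, not the actual pkg/payload/task_graph.go; the real graph is richer than plain run levels): a few workers apply manifests within the same chunk in parallel, but a later run level is not started until everything in the earlier level has finished.

package main

import (
    "fmt"
    "sync"
)

// applyLevel fans the tasks of one run level out to a small worker pool
// and only returns once every task in that level is done, so the next
// level cannot start early.
func applyLevel(level int, tasks []string, workers int) {
    var wg sync.WaitGroup
    ch := make(chan string)
    for i := 0; i < workers; i++ {
        wg.Add(1)
        go func(worker int) {
            defer wg.Done()
            for task := range ch {
                fmt.Printf("worker %d: level %d: applying %s\n", worker, level, task)
            }
        }(i)
    }
    for _, t := range tasks {
        ch <- t
    }
    close(ch)
    wg.Wait() // block subsequent levels on this one
}

func main() {
    levels := map[int][]string{
        20: {"kube-apiserver"},
        50: {"csi-snapshot-controller", "authentication", "ingress"},
        80: {"machine-config"},
    }
    for _, level := range []int{20, 50, 80} {
        applyLevel(level, levels[level], 2)
    }
}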

> - The CVO tries to access the internet to get the image signature. In a disconnected environment it takes a while for the timeout to happen, so an additional upgrade thread might be kicked off by the CVO before the timeout happened

From comment 10:
> Created attachment 1692291 [details]

Logs are from the 4.3.19 CVO:

$ grep ClusterVersionOperator cvo.log 
I0526 13:09:44.984520       1 start.go:19] ClusterVersionOperator v4.3.19-202005041055-dirty
I0526 13:11:25.565022       1 cvo.go:332] Starting ClusterVersionOperator with minimum reconcile period 2m52.525702462s

You can see the request to go to 4.4.4 (forcing the update):

$ grep -i desired cvo.log | head -n1
I0526 13:11:25.665927       1 cvo.go:468] Desired version from spec is v1.Update{Version:"", Image:"quay.io/openshift-release-dev/ocp-release@sha256:baa687f29b0ac155d8f4c6914056d36d68f343feb9c1e82b46eef95819d00be5", Force:true}
$ oc adm release info quay.io/openshift-release-dev/ocp-release@sha256:baa687f29b0ac155d8f4c6914056d36d68f343feb9c1e82b46eef95819d00be5 | head -n3
Name:      4.4.4
Digest:    sha256:baa687f29b0ac155d8f4c6914056d36d68f343feb9c1e82b46eef95819d00be5
Created:   2020-05-13T15:41:16Z

CVO seems to be attempting to walk signatures with two layers of timeouts:

$ grep 'unable to load signature' cvo.log 
I0526 13:13:36.123153       1 verify.go:404] unable to load signature: Get https://storage.googleapis.com/openshift-release/official/signatures/openshift/release/sha256=baa687f29b0ac155d8f4c6914056d36d68f343feb9c1e82b46eef95819d00be5/signature-1: dial tcp 172.217.4.48:443: connect: connection timed out
I0526 13:15:47.195128       1 verify.go:404] unable to load signature: Get https://storage.googleapis.com/openshift-release/official/signatures/openshift/release/sha256=baa687f29b0ac155d8f4c6914056d36d68f343feb9c1e82b46eef95819d00be5/signature-2: dial tcp 172.217.9.80:443: connect: connection timed out
I0526 13:17:10.718027       1 verify.go:404] unable to load signature: Get https://storage.googleapis.com/openshift-release/official/signatures/openshift/release/sha256=baa687f29b0ac155d8f4c6914056d36d68f343feb9c1e82b46eef95819d00be5/signature-3: context deadline exceeded
I0526 13:19:44.764143       1 verify.go:404] unable to load signature: Get https://storage.googleapis.com/openshift-release/official/signatures/openshift/release/sha256=baa687f29b0ac155d8f4c6914056d36d68f343feb9c1e82b46eef95819d00be5/signature-1: dial tcp 172.217.4.48:443: connect: connection timed out
I0526 13:21:55.835063       1 verify.go:404] unable to load signature: Get https://storage.googleapis.com/openshift-release/official/signatures/openshift/release/sha256=baa687f29b0ac155d8f4c6914056d36d68f343feb9c1e82b46eef95819d00be5/signature-2: dial tcp 172.217.4.48:443: connect: connection timed out
I0526 13:23:18.233801       1 verify.go:404] unable to load signature: Get https://storage.googleapis.com/openshift-release/official/signatures/openshift/release/sha256=baa687f29b0ac155d8f4c6914056d36d68f343feb9c1e82b46eef95819d00be5/signature-3: context deadline exceeded

That's... fine.  My initial reaction was "it should just give up on that host when signature-1 times out", but on second thought you could have a sig server where signature-1 timed out but signature-2 or some later one did not.  I've spun off bug 1840343 about limiting denial-of-service sorts of things in situations like this.

Back to our update:

$ grep -A2 signature-3 cvo.log  | head -n3
I0526 13:17:10.718027       1 verify.go:404] unable to load signature: Get https://storage.googleapis.com/openshift-release/official/signatures/openshift/release/sha256=baa687f29b0ac155d8f4c6914056d36d68f343feb9c1e82b46eef95819d00be5/signature-3: context deadline exceeded
W0526 13:17:10.718079       1 updatepayload.go:100] An image was retrieved from "quay.io/openshift-release-dev/ocp-release@sha256:baa687f29b0ac155d8f4c6914056d36d68f343feb9c1e82b46eef95819d00be5" that failed verification: The update cannot be verified: context deadline exceeded
E0526 13:17:10.739803       1 sync_worker.go:329] unable to synchronize image (waiting 21.565712806s): Unable to download and prepare the update: timed out waiting for the condition

So the bug seems to be that the sig-fetching timeouts are consuming our whole available timeout, and we leave no time for preconditions.  We should address this by extending the sync-fetching Context timeout, and then setting smaller sub-timeouts for post-fetch processing like signature retrieval and update precondition checks.
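
Hand-waving at the shape of that fix (a sketch in Go, not the real sync_worker code; durations are made up and scaled down for illustration): keep one outer deadline for the whole sync attempt, but box the signature fetch into its own shorter deadline so a hung signature store cannot consume the time budgeted for the checks that follow.

package main

import (
    "context"
    "fmt"
    "time"
)

func syncOnce(ctx context.Context) error {
    // Outer budget for the whole "download and prepare the update" step.
    ctx, cancel := context.WithTimeout(ctx, 4*time.Second)
    defer cancel()

    // Give signature retrieval only a slice of that budget so that
    // timing-out stores (the restricted-network case) leave room for
    // the precondition checks that come afterwards.
    sigCtx, sigCancel := context.WithTimeout(ctx, 1*time.Second)
    defer sigCancel()
    if err := fetchSignatures(sigCtx); err != nil {
        fmt.Println("signature fetch failed:", err) // a forced update continues anyway
    }

    return runPreconditions(ctx) // still has most of the outer budget left
}

func fetchSignatures(ctx context.Context) error {
    select {
    case <-time.After(10 * time.Second): // simulate a hung signature store
        return nil
    case <-ctx.Done():
        return ctx.Err()
    }
}

func runPreconditions(ctx context.Context) error {
    select {
    case <-time.After(500 * time.Millisecond):
        return nil
    case <-ctx.Done():
        return ctx.Err()
    }
}

func main() {
    fmt.Println("sync result:", syncOnce(context.Background()))
}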

[1]: https://github.com/openshift/cluster-version-operator/blob/master/docs/user/reconciliation.md#synchronizing-the-graph

Comment 14 W. Trevor King 2020-05-26 18:50:48 UTC
In the meantime, you should be able to work around this issue by creating a signature ConfigMap locally, as described in the in-flight docs PR [1].  That way the CVO wouldn't bother attempting to hit the external signature stores, and you'd have plenty of time for the other update precondition checks.  As a bonus, you'd have verification that the release image you were updating to was indeed signed by Red Hat, and you wouldn't need to use --force.

[1]: https://github.com/openshift/openshift-docs/pull/21993
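
For reference, the shape I believe the docs PR describes (treat the namespace, label, and key format below as my assumptions until those docs merge) is a ConfigMap in openshift-config-managed carrying the release.openshift.io/verification-signatures label, with binaryData keys of the form sha256-<release digest>-<index> holding the signature files you download on a connected machine.  A quick Go sketch that just prints that YAML:

package main

import (
    "fmt"

    corev1 "k8s.io/api/core/v1"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "sigs.k8s.io/yaml"
)

func main() {
    digest := "baa687f29b0ac155d8f4c6914056d36d68f343feb9c1e82b46eef95819d00be5"
    cm := corev1.ConfigMap{
        TypeMeta: metav1.TypeMeta{APIVersion: "v1", Kind: "ConfigMap"},
        ObjectMeta: metav1.ObjectMeta{
            Name:      "release-signature-4.4.4", // name is arbitrary
            Namespace: "openshift-config-managed",
            Labels: map[string]string{
                // label the CVO uses to discover signature ConfigMaps (assumption; see the docs PR)
                "release.openshift.io/verification-signatures": "",
            },
        },
        // Key format: sha256-<release digest>-<index>.  The value is the raw
        // signature file fetched from the public signature store; it is
        // base64-encoded under binaryData when marshalled.
        BinaryData: map[string][]byte{
            "sha256-" + digest + "-1": []byte("<contents of signature-1 go here>"),
        },
    }
    out, err := yaml.Marshal(cm)
    if err != nil {
        panic(err)
    }
    fmt.Print(string(out))
}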

Comment 15 W. Trevor King 2020-05-26 21:11:57 UTC
From comment 6:
> Created attachment 1691677 [details]
> Logs for both of successful and unsuccessful env

Ah, double-checking these against my guess from comment 13:

$ grep -A2 signature-3 Logs_for_both_environments/*.log
Logs_for_both_environments/unsuccessful_oc_logs_cluster_version_operator.log:I0524 16:08:36.496037       1 verify.go:404] unable to load signature: Get https://storage.googleapis.com/openshift-release/official/signatures/openshift/release/sha256=039a4ef7c128a049ccf916a1d68ce93e8f5494b44d5a75df60c85e9e7191dacc/signature-3: context deadline exceeded
Logs_for_both_environments/unsuccessful_oc_logs_cluster_version_operator.log-W0524 16:08:36.496081       1 updatepayload.go:100] An image was retrieved from "quay.io/openshift-release-dev/ocp-release@sha256:039a4ef7c128a049ccf916a1d68ce93e8f5494b44d5a75df60c85e9e7191dacc" that failed verification: The update cannot be verified: context deadline exceeded
Logs_for_both_environments/unsuccessful_oc_logs_cluster_version_operator.log-I0524 16:08:36.496154       1 payload.go:210] Loading updatepayload from "/etc/cvo/updatepayloads/umpiuvwSTU7_qyH_OBk4PQ"

So that looks the same.  But now I'm noticing that after the signature timeout we continue to load the payload in both of these cases.

$ grep -i payload cvo.log 
I0526 13:09:44.995515       1 payload.go:210] Loading updatepayload from "/"
I0526 13:11:25.666448       1 sync_worker.go:477] Loading payload
W0526 13:17:10.718079       1 updatepayload.go:100] An image was retrieved from "quay.io/openshift-release-dev/ocp-release@sha256:baa687f29b0ac155d8f4c6914056d36d68f343feb9c1e82b46eef95819d00be5" that failed verification: The update cannot be verified: context deadline exceeded
I0526 13:17:33.182208       1 sync_worker.go:477] Loading payload
W0526 13:23:18.233849       1 updatepayload.go:100] An image was retrieved from "quay.io/openshift-release-dev/ocp-release@sha256:baa687f29b0ac155d8f4c6914056d36d68f343feb9c1e82b46eef95819d00be5" that failed verification: The update cannot be verified: context deadline exceeded
I0526 13:23:18.234022       1 payload.go:210] Loading updatepayload from "/etc/cvo/updatepayloads/mUcauUwti1y9EInjIaBdzg"
I0526 13:23:18.687185       1 sync_worker.go:539] Payload loaded from quay.io/openshift-release-dev/ocp-release@sha256:baa687f29b0ac155d8f4c6914056d36d68f343feb9c1e82b46eef95819d00be5 with hash 3PF9oWfsPLw=
$ grep -i payload Logs_for_both_environments/unsuccessful_oc_logs_cluster_version_operator.log
I0524 16:02:50.780437       1 payload.go:210] Loading updatepayload from "/"
I0524 16:02:51.444412       1 sync_worker.go:477] Loading payload
W0524 16:08:36.496081       1 updatepayload.go:100] An image was retrieved from "quay.io/openshift-release-dev/ocp-release@sha256:039a4ef7c128a049ccf916a1d68ce93e8f5494b44d5a75df60c85e9e7191dacc" that failed verification: The update cannot be verified: context deadline exceeded
I0524 16:08:36.496154       1 payload.go:210] Loading updatepayload from "/etc/cvo/updatepayloads/umpiuvwSTU7_qyH_OBk4PQ"
I0524 16:08:36.950963       1 sync_worker.go:539] Payload loaded from quay.io/openshift-release-dev/ocp-release@sha256:039a4ef7c128a049ccf916a1d68ce93e8f5494b44d5a75df60c85e9e7191dacc with hash h110xMINmng=

That makes sense because in all cases we started applying some target-version stuff and didn't just block on the signature.  So that invalidated my "stuck getting signatures" hypothesis from comment 13.  Comparing the logs again:

$ grep ClusterVersionOperator cvo.log Logs_for_both_environments/*.log
cvo.log:I0526 13:09:44.984520       1 start.go:19] ClusterVersionOperator v4.3.19-202005041055-dirty
cvo.log:I0526 13:11:25.565022       1 cvo.go:332] Starting ClusterVersionOperator with minimum reconcile period 2m52.525702462s
Logs_for_both_environments/unsuccessful_oc_logs_cluster_version_operator.log:I0524 16:02:50.774740       1 start.go:19] ClusterVersionOperator v4.3.19-202005041055-dirty
Logs_for_both_environments/unsuccessful_oc_logs_cluster_version_operator.log:I0524 16:02:51.342076       1 cvo.go:332] Starting ClusterVersionOperator with minimum reconcile period 2m52.525702462s

Why are folks attaching logs from the 4.3.19 CVO if that CVO is happily triggering a pivot to a 4.4 CVO, and then having that 4.4 CVO get stuck trying to herd the cluster from 4.3 to 4.4?

Comment 16 Masaki Hatada 2020-05-27 02:40:00 UTC
Dear Trevor,

Thank you for commenting.

(In reply to W. Trevor King from comment #14)
> In the meantime, you should be able to work around this issue by creating a
> signature ConfigMap locally, as described in the in-flight [1].  That way
> the CVO wouldn't bother attempting to hit the external signature stores, and
> you'd have plenty of time for the other update precondition checks.  As a
> bonus, you'd have verification that the release image you were updating to
> was indeed signed by Red Hat and you woudn't need to use --force.
> 
> [1]: https://github.com/openshift/openshift-docs/pull/21993

We had thought that the feature could only be used with OCP 4.4 or later.
But that seems to be wrong, since the feature was introduced in OCP 4.3 by https://github.com/openshift/cluster-version-operator/commit/bc98508b7621696e4eee57061c5bf66820c4ccfe#diff-4f12874ad5cd49216fb65c33c0fa4519

We will test it and share the result later.

Best Regards,
Masaki Hatada

Comment 17 Johnny Liu 2020-05-27 08:06:30 UTC
Today I tried the upgrade against the same cluster configuration twice; the only difference was the image given to the --to-image option.
One used the local registry instead of quay.io, such as: oc adm upgrade --allow-explicit-upgrade --to-image ec2-18-216-81-116.us-east-2.compute.amazonaws.com:5000/openshift-release-dev/ocp-release@sha256:baa687f29b0ac155d8f4c6914056d36d68f343feb9c1e82b46eef95819d00be5 --force
The other used the quay.io image path, such as: oc adm upgrade --allow-explicit-upgrade --to-image quay.io/openshift-release-dev/ocp-release@sha256:baa687f29b0ac155d8f4c6914056d36d68f343feb9c1e82b46eef95819d00be5 --force

Interestingly, they gave totally different results: the local registry one passed, while the quay.io one failed.

[root@preserve-jialiu-ansible jialiu-v4-installer-autotest]# oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.3.19    True        False         160m    Cluster version is 4.4.4
[root@preserve-jialiu-ansible jialiu-v4-installer-autotest]# oc get po -n openshift-cluster-version
NAME                                        READY   STATUS      RESTARTS   AGE
cluster-version-operator-6bd6f794d9-dd25n   1/1     Running     0          3h36m
version--mm75v-nnjlp                        0/1     Completed   0          170m
[root@preserve-jialiu-ansible jialiu-v4-installer-autotest]# oc get co
NAME                                       VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.4.4     True        False         False      3h53m
cloud-credential                           4.3.19    True        False         False      3h50m
cluster-autoscaler                         4.4.4     True        False         False      4h21m
console                                    4.3.19    True        False         False      3h45m
csi-snapshot-controller                    4.4.4     True        False         False      156m
dns                                        4.4.4     True        False         False      4h25m
image-registry                             4.4.4     True        False         False      3h39m
ingress                                    4.3.19    True        False         False      159m
insights                                   4.3.19    True        False         False      4h22m
kube-apiserver                             4.3.19    True        False         True       4h24m
kube-controller-manager                    4.3.19    True        False         True       4h23m
kube-scheduler                             4.3.19    True        False         True       4h23m
machine-api                                4.4.4     True        False         False      4h26m
machine-config                             4.3.19    False       True          True       149m
marketplace                                4.4.4     True        False         False      166m
monitoring                                 4.4.4     True        False         False      3h48m
network                                    4.3.19    True        False         False      4h26m
node-tuning                                4.4.4     True        False         False      164m
openshift-apiserver                        4.4.4     True        False         True       159m
openshift-controller-manager               4.4.4     True        False         False      4h24m
openshift-samples                          4.3.19    True        False         False      4h12m
operator-lifecycle-manager                 4.3.19    True        False         False      4h22m
operator-lifecycle-manager-catalog         4.3.19    True        False         False      4h22m
operator-lifecycle-manager-packageserver   4.3.19    True        False         False      3h39m
service-ca                                 4.4.4     True        False         False      4h26m
service-catalog-apiserver                  4.4.4     True        False         False      4h22m
service-catalog-controller-manager         4.4.4     True        False         False      4h22m
storage                                    4.4.4     True        False         False      169m
[root@preserve-jialiu-ansible jialiu-v4-installer-autotest]# oc get node
NAME                                        STATUS     ROLES    AGE     VERSION
ip-10-0-54-209.us-east-2.compute.internal   Ready      worker   4h22m   v1.16.2
ip-10-0-56-228.us-east-2.compute.internal   NotReady   master   4h30m   v1.16.2
ip-10-0-60-34.us-east-2.compute.internal    Ready      master   4h30m   v1.16.2
ip-10-0-69-74.us-east-2.compute.internal    Ready      master   4h30m   v1.16.2
ip-10-0-71-125.us-east-2.compute.internal   NotReady   worker   4h22m   v1.16.2
ip-10-0-74-136.us-east-2.compute.internal   Ready      worker   4h22m   v1.16.2


From the upgrade progress logs, it seems like the CVO suddenly reports "Cluster version is 4.4.4" [2] without any messages like "Working towards 4.4.4: N% complete" [1], but the cluster never actually completes the update.

[1]: https://mastern-jenkins-csb-openshift-qe.cloud.paas.psi.redhat.com/job/installer-auto-test/1661/console
[2]: https://mastern-jenkins-csb-openshift-qe.cloud.paas.psi.redhat.com/job/installer-auto-test/1665/console

Comment 18 Johnny Liu 2020-05-27 08:11:01 UTC
Created attachment 1692594 [details]
qe cvo log 2

Comment 19 Lalatendu Mohanty 2020-05-27 15:45:57 UTC
Went through the QE logs and found a lot of similarity in them; however, the API server does not go down. Still not sure why we are seeing this.

Comment 23 W. Trevor King 2020-05-28 08:39:16 UTC
Ok, we have a better lead on this issue after looking at some local reproducers and then taking a closer look at the CVO logs attached in comment 6.  From the attached unsuccessful_oc_logs_cluster_version_operator.log:

I0524 16:02:50.774740       1 start.go:19] ClusterVersionOperator v4.3.19-202005041055-dirty
...
I0524 16:02:51.342076       1 cvo.go:332] Starting ClusterVersionOperator with minimum reconcile period 2m52.525702462s
...
I0524 16:02:51.444381       1 sync_worker.go:471] Running sync quay.io/openshift-release-dev/ocp-release@sha256:039a4ef7c128a049ccf916a1d68ce93e8f5494b44d5a75df60c85e9e7191dacc (force=true) on generation 2 in state Updating at attempt 0
...
I0524 16:08:36.950963       1 sync_worker.go:539] Payload loaded from quay.io/openshift-release-dev/ocp-release@sha256:039a4ef7c128a049ccf916a1d68ce93e8f5494b44d5a75df60c85e9e7191dacc with hash h110xMINmng=
...
I0524 16:08:36.953283       1 task_graph.go:611] Result of work: [update was cancelled at 0 of 573]
...
I0524 16:11:29.479262       1 sync_worker.go:471] Running sync quay.io/openshift-release-dev/ocp-release@sha256:039a4ef7c128a049ccf916a1d68ce93e8f5494b44d5a75df60c85e9e7191dacc (force=true) on generation 2 in state Reconciling at attempt 0
...

So the 4.3.19 CVO loads the 4.4.3 (in that case) manifests, begins updating to them, immediately hits a cancel/timeout [1], and then decides (mistakenly) that it successfully completed the update and starts Reconciling.  We're still not clear on exactly what the mistake is.

In the meantime, reconciliation's shuffled, flattened manifest graph can do bad things like updating the kubelets before updating the Kubernetes API server.  Raising to urgent while we work on bottoming this out.

[1]: https://github.com/openshift/cluster-version-operator/pull/372 will improve the logging here, but the 5m45s duration between 16:02:51 and 16:08:36 roughly matches 2 * 2m52s [2], i.e. two of the minimum reconcile periods logged at startup.
[2]: https://github.com/openshift/cluster-version-operator/blob/86b9bdba55a85e2e071603916db4c43b481e7588/pkg/cvo/sync_worker.go#L296
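
For anyone checking whether a cluster has hit the same pattern, the telltale combination is a "Result of work: [update was cancelled ...]" line followed shortly by a sync running "in state Reconciling" against the same target image, without the cluster ever reporting the usual "Working towards ..." progress.  A rough way to pull the relevant lines out of the CVO logs (assuming the standard deployment name and namespace; adjust if yours differ):

  $ oc -n openshift-cluster-version logs deployment/cluster-version-operator \
      | grep -E 'Result of work|Running sync .* in state'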

Comment 24 W. Trevor King 2020-05-28 12:46:48 UTC
PR submitted.  We should backport through 4.2, when we started supporting restricted-network flows, because the timed-out signature retrievals plus forced updates that are common there are what make tripping this race more likely.  Here's a full impact statement, now that we understand the issue:

Who is impacted?  If we have to block upgrade edges based on this issue, which edges would need blocking?
  All customers upgrading out of a CVO that does not contain the patch are potentially affected, but the chance of tripping the race is very small except for restricted-network users who are forcing updates.
  The impact when the race trips is also small for patch-level bumps, so the main concern is restricted-network users who are performing minor bumps like 4.2->4.3.
What is the impact?  Is it serious enough to warrant blocking edges?
  The CVO enters reconciliation mode on the target version, attempting to apply a flat, shuffled manifest graph.  All kinds of terrible things could happen, like the machine-config operator trying to roll out the newer machine-os-content and its 4.4 hyperkube binary before rolling out prerequisites like the 4.4 kube-apiserver operator.  That one will make manifest application sticky, but it would not surprise me if you could find some terrible ordering that might brick a cluster.
How involved is remediation (even moderately serious impacts might be acceptable if they are easy to mitigate)?
  Admin must update to a CVO that is not vulnerable to the race.  Using an unforced update (e.g. by copying in a signature ConfigMap [1] for 4.3.12+ or 4.4.0+) would help reduce the likelihood of tripping the race.  Using a patch-level update would reduce the impact if the race trips anyway.

[1]: https://github.com/openshift/openshift-docs/pull/21993
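
As a concrete illustration of the unforced-update mitigation (the authoritative procedure is the docs change at [1]; treat the signature URL, ConfigMap label, and key format below as assumptions taken from that in-flight documentation, and the ConfigMap name as arbitrary), the idea is to fetch Red Hat's signature for the target release digest and publish it where the CVO looks for locally mirrored signatures:

  $ DIGEST=039a4ef7c128a049ccf916a1d68ce93e8f5494b44d5a75df60c85e9e7191dacc    # sha256 of the target release image
  $ curl -sLo signature-1 \
      "https://mirror.openshift.com/pub/openshift-v4/signatures/openshift/release/sha256=${DIGEST}/signature-1"
  $ oc -n openshift-config-managed create configmap release-image-signature \
      --from-file=sha256-${DIGEST}-1=signature-1
  $ oc -n openshift-config-managed label configmap release-image-signature \
      release.openshift.io/verification-signatures=

With the signature available locally, the update no longer needs --force, so the CVO is not left burning its sync deadline timing out against unreachable external signature stores.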

Comment 25 Brenton Leanhardt 2020-05-28 12:51:30 UTC
Are clusters that hit this bug permanently wedged or is there a chance a subsequent attempt avoids the race?

Comment 26 W. Trevor King 2020-05-28 13:11:06 UTC
> Are clusters that hit this bug permanently wedged or is there a chance a subsequent attempt avoids the race?

Once you trip the race and move from UpdatingPayload to ReconcilingPayload, that CVO will not go back to updating.  You can re-target your update with 'oc adm upgrade ...' and that will get you back into UpdatingPayload mode.  But while the CVO was running between the race trip and your update, it could have been doing all sorts of things as it tried to push out the flattened, shuffled manifest graph.  Recovering a cluster that has hit this bug is going to be hard, and will probably involve a case-by-case review of its current state to try to determine a next-hop update target that is as close as possible to what the cluster is currently running, and that also accounts for which directions and orders operators can transition in.  Worst case short of a bricked cluster would be having to turn the CVO off entirely, and push manifests one at a time on its behalf to slowly unwind any components that had been too tangled up.
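
For concreteness, a sketch of the two recovery levers mentioned above, with placeholder pullspecs (which digest to re-target to is exactly the case-by-case judgement described; flag spellings are per current oc, so check 'oc adm upgrade -h' on your client version):

  # Re-target the update to move the CVO from ReconcilingPayload back to UpdatingPayload:
  $ oc adm upgrade --allow-explicit-upgrade --force \
      --to-image=<mirror-registry>/ocp4/openshift4@sha256:<target-digest>

  # Worst case, stop the CVO entirely before pushing manifests by hand:
  $ oc -n openshift-cluster-version scale deployment/cluster-version-operator --replicas=0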

Comment 27 Scott Dodson 2020-05-28 13:31:48 UTC
I've marked as private all comments which seemed to contain confidential information. Can you please allow us to make this bug public so that we can reference it directly? Comment 7 may need to be made private as it references hostnames, but I believe those are test systems which, of course, are not Internet facing.

Comment 28 Masaki Hatada 2020-05-29 04:34:12 UTC
Created attachment 1693220 [details]
Upgrade OCP without force option

(In reply to W. Trevor King from comment #14)
> In the meantime, you should be able to work around this issue by creating a
> signature ConfigMap locally, as described in the in-flight [1].  That way
> the CVO wouldn't bother attempting to hit the external signature stores, and
> you'd have plenty of time for the other update precondition checks.  As a
> bonus, you'd have verification that the release image you were updating to
> was indeed signed by Red Hat and you wouldn't need to use --force.
> 
> [1]: https://github.com/openshift/openshift-docs/pull/21993

We attempted to upgrade OCP4.3.19 to OCP4.4.3 on an AWS disconnected env using the above method, and it worked.
(Please see the attached file for details.)

We are still not sure whether it succeeded by coincidence, so we are currently trying the same workaround on our vSphere disconnected env.

(In reply to Scott Dodson from comment #27)
> I've marked all comments as private which seemed to contain confidential
> information. Can you please allow us to make this bug public so that we can
> reference it directly? Comment 7 may need to be made private as it
> references hostnames but I believe those to be test systems which of course
> are not Internet facing.

Kii-san will handle your request.
But we would like to know in detail why you want to make this Bugzilla public.

Best Regards,
Masaki Hatada

Comment 29 Masaki Hatada 2020-05-29 04:48:58 UTC
> Admin must update to a CVO that is not vulnerable to the race.  Using an unforced update (e.g. by copying in a signature ConfigMap [1] for 4.3.12+ or 4.4.0+) would help reduce the likelihood of tripping the race.  Using a patch-level update would reduce the impact if the race trips anyway.
>
> [1]: https://github.com/openshift/openshift-docs/pull/21993

We'll ask our customer to use the signature ConfigMap approach when upgrading their cluster in a disconnected env,
but we also hope that Red Hat documents the version requirement (4.3.12+ or 4.4.0+) in the doc.

BTW, the signature ConfigMap seems to be usable from 4.3.11.
(Although we have not actually tested with this version.)

  $ podman run --rm --entrypoint /usr/bin/strings $(curl -s https://mirror.openshift.com/pub/openshift-v4/clients/ocp/4.3.11/release.txt | grep  cluster-version-operator  | awk '{print $2}')  /usr/bin/cluster-version-operator | grep verification-signatures >/dev/null && echo found || echo not found
  found

  $ podman run --rm --entrypoint /usr/bin/strings $(curl -s https://mirror.openshift.com/pub/openshift-v4/clients/ocp/4.3.10/release.txt | grep  cluster-version-operator  | awk '{print $2}')  /usr/bin/cluster-version-operator | grep verification-signatures >/dev/null && echo found || echo not found
  not found

Best Regards,
Masaki Hatada

Comment 30 W. Trevor King 2020-05-29 05:09:17 UTC
> But we would like to know in detail why you want to make this Bugzilla public.

The race fixed by the in-flight PR can have, as you saw, pretty serious consequences on cluster health.  We want to be able to link Red Hat support and other users to this bug so they can see things like the impact statement in comment 24.  Having a public issue would also allow folks to find the bug (and impact statement and fix) by searching if they trip over the same CSIMigration kube-API-server/kubelet version mismatch symptoms.

Comment 31 Keiichi Kii (NEC) 2020-05-29 19:25:45 UTC
(In reply to W. Trevor King from comment #30)
> > But we would like to know in detail why you want to make this Bugzilla public.
> 
> The race fixed by the in-flight PR can have, as you saw, pretty serious
> consequences on cluster health.  We want to be able to link Red Hat support
> and other users to this bug so they can see things like the impact statement
> in comment 24.  Having a public issue would also allow folks to find the bug
> (and impact statement and fix) by searching if they trip over the same
> CSIMigration kube-API-server/kubelet version mismatch symptoms.

Thank you for your explanation. We understand the reason why and agree with making
this issue known to the public.

Before making this Bugzilla public, we have a question about your request.
Is there any plan to create an article regarding this issue in the Red Hat knowledge base?
We think an article in your knowledge base would be a suitable way to publicize it.

At this point, as you said, there is no confidential information in this Bugzilla.
But we are concerned that we might accidentally upload confidential information when we discuss this issue in the future.
And most of the Bugzillas we have reported are marked with the "private" flag and the "NEC confidential" group.
We don't want to make an exception.

Comment 32 Scott Dodson 2020-05-29 19:49:23 UTC
(In reply to Keiichi Kii (NEC) from comment #31)

> Before making this Bugzilla public, we have a question about your request.
> Is there any plan to create an article regarding this issue in the Red Hat knowledge base?
> We think an article in your knowledge base would be a suitable way to publicize it.

Possibly; that decision is up to the support organization to make.

> At this point, as you said, there is no confidential information in this
> Bugzilla.
> But we are concerned that we might accidentally upload confidential
> information when we discuss this issue in the future.
> And most of the Bugzillas we have reported are marked with the "private"
> flag and the "NEC confidential" group.
> We don't want to make an exception.

We link public GitHub PRs back to public bugs, and those bugs end up in product release notes. It's really best for us if our bugs are public.
If we don't make this public, we'll create a new bug that restates the problem now that it's believed to be understood, with no NEC-specific information, and this bug will be closed as a duplicate.

Comment 33 Keiichi Kii (NEC) 2020-05-29 21:07:48 UTC
> We link public GitHub PRs back to public bugs, and those bugs end up in
> product release notes. It's really best for us if our bugs are public.
> If we don't make this public, we'll create a new bug that restates the
> problem now that it's believed to be understood, with no NEC-specific
> information, and this bug will be closed as a duplicate.

Thank you for the explanation again.
Please make this Bugzilla public. We will handle this bug internally as an exceptional case.

Comment 34 Masaki Hatada 2020-06-02 08:27:36 UTC
(In reply to Masaki Hatada from comment #28)
> We attempted to upgrade OCP4.3.19 to OCP4.4.3 on AWS disconnected env with
> the above way, and it worked.
> (Please see attached file for detail)
> 
> We are still not sure if it succeeded as a coincidence. So currently we are
> trying the workaround on our vSphere disconnected env.

Sorry, I forgot to share the result from the vSphere disconnected env.
We tested upgrading from OCP4.3.18 to OCP4.4.3 on that env with the workaround, and it worked.

Best Regards,
Masaki Hatada

Comment 35 W. Trevor King 2020-06-03 13:29:13 UTC
Cloned back to bug 1843526 for 4.5.  Will continue to backport further as QE verifies the fix in master-ward releases.

Comment 38 Johnny Liu 2020-06-04 15:11:20 UTC
Verified this bug with 4.6.0-0.ci-2020-06-04-011434. Passed.

Set up a pure disconnected cluster with 4.5.0-0.nightly-2020-06-04-001344, then triggered an upgrade with --force to 4.6.0-0.ci-2020-06-04-011434.


[root@preserve-jialiu-ansible ~]# oc get clusterversion
NAME      VERSION                        AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.6.0-0.ci-2020-06-04-011434   True        False         32m     Cluster version is 4.6.0-0.ci-2020-06-04-011434
[root@preserve-jialiu-ansible ~]# oc get co
NAME                                       VERSION                        AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.6.0-0.ci-2020-06-04-011434   True        False         False      11h
cloud-credential                           4.6.0-0.ci-2020-06-04-011434   True        False         False      11h
cluster-autoscaler                         4.6.0-0.ci-2020-06-04-011434   True        False         False      12h
config-operator                            4.6.0-0.ci-2020-06-04-011434   True        False         False      8h
console                                    4.6.0-0.ci-2020-06-04-011434   True        False         False      38m
csi-snapshot-controller                    4.6.0-0.ci-2020-06-04-011434   True        False         False      42m
dns                                        4.6.0-0.ci-2020-06-04-011434   True        False         False      12h
etcd                                       4.6.0-0.ci-2020-06-04-011434   True        False         False      9h
image-registry                             4.6.0-0.ci-2020-06-04-011434   True        False         False      12h
ingress                                    4.6.0-0.ci-2020-06-04-011434   True        False         False      67m
insights                                   4.6.0-0.ci-2020-06-04-011434   True        False         False      12h
kube-apiserver                             4.6.0-0.ci-2020-06-04-011434   True        False         False      12h
kube-controller-manager                    4.6.0-0.ci-2020-06-04-011434   True        False         False      12h
kube-scheduler                             4.6.0-0.ci-2020-06-04-011434   True        False         False      12h
kube-storage-version-migrator              4.6.0-0.ci-2020-06-04-011434   True        False         False      42m
machine-api                                4.6.0-0.ci-2020-06-04-011434   True        False         False      12h
machine-approver                           4.6.0-0.ci-2020-06-04-011434   True        False         False      9h
machine-config                             4.6.0-0.ci-2020-06-04-011434   True        False         False      12h
marketplace                                4.6.0-0.ci-2020-06-04-011434   True        False         False      40m
monitoring                                 4.6.0-0.ci-2020-06-04-011434   True        False         False      7h33m
network                                    4.6.0-0.ci-2020-06-04-011434   True        False         False      12h
node-tuning                                4.6.0-0.ci-2020-06-04-011434   True        False         False      67m
openshift-apiserver                        4.6.0-0.ci-2020-06-04-011434   True        False         False      7h51m
openshift-controller-manager               4.6.0-0.ci-2020-06-04-011434   True        False         False      12h
openshift-samples                          4.6.0-0.ci-2020-06-04-011434   True        False         False      59m
operator-lifecycle-manager                 4.6.0-0.ci-2020-06-04-011434   True        False         False      12h
operator-lifecycle-manager-catalog         4.6.0-0.ci-2020-06-04-011434   True        False         False      12h
operator-lifecycle-manager-packageserver   4.6.0-0.ci-2020-06-04-011434   True        False         False      37m
service-ca                                 4.6.0-0.ci-2020-06-04-011434   True        False         False      12h
storage                                    4.6.0-0.ci-2020-06-04-011434   True        False         False      67m
[root@preserve-jialiu-ansible ~]# oc get node
NAME                                        STATUS   ROLES    AGE   VERSION
ip-10-0-50-112.us-east-2.compute.internal   Ready    master   12h   v1.18.3+a637491
ip-10-0-50-64.us-east-2.compute.internal    Ready    master   12h   v1.18.3+a637491
ip-10-0-62-65.us-east-2.compute.internal    Ready    worker   12h   v1.18.3+a637491
ip-10-0-67-224.us-east-2.compute.internal   Ready    master   12h   v1.18.3+a637491
ip-10-0-75-234.us-east-2.compute.internal   Ready    worker   12h   v1.18.3+a637491
ip-10-0-78-237.us-east-2.compute.internal   Ready    worker   12h   v1.18.3+a637491

Comment 41 errata-xmlrpc 2020-10-27 16:00:22 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196

Comment 42 W. Trevor King 2021-04-05 17:47:28 UTC
Removing UpgradeBlocker from this older bug, to remove it from the suspect queue described in [1].  If you feel this bug still needs to be a suspect, please add the keyword again.

[1]: https://github.com/openshift/enhancements/pull/475

