Bug 1872742

Summary: Unable to apply upgrade in disconnected env - from 4.5.7 to 4.6 nightly with force flag
Product: OpenShift Container Platform
Component: Unknown
Version: 4.6
Target Release: 4.6.0
Hardware: Unspecified
OS: Linux
Severity: medium
Priority: urgent
Status: CLOSED DUPLICATE
Type: Bug
Reporter: Polina Rabinovich <prabinov>
Assignee: Sudha Ponnaganti <sponnaga>
QA Contact: Jianwei Hou <jhou>
CC: aos-bugs, brad, eparis, jima, jokerman, omichael, prabinov, smiron, wking, yanyang
Keywords: TestBlocker, Triaged, Upgrades
Bug Blocks: 1865308
Last Closed: 2020-09-22 16:51:09 UTC
Attachments: cluster-version-operator logs

Description Polina Rabinovich 2020-08-26 14:23:00 UTC
Description of problem:

I tried to upgrade my cluster (disconnected env) from version 4.5.7 to a 4.6 nightly using the force flag.
I let the upgrade run for some time, but it always seems to get stuck at 79% with this error message:

[kni@provisionhost-0-0 ~]$ oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.5.7     True        True          51m     Unable to apply 4.6.0-0.nightly-2020-08-26-032807: an unknown error has occurred: MultipleErrors

Operators condition:

[kni@provisionhost-0-0 ~]$ oc get co
NAME                                       VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.6.0-0.nightly-2020-08-26-032807   False       True          True       18m
cloud-credential                           4.6.0-0.nightly-2020-08-26-032807   True        False         False      138m
cluster-autoscaler                         4.6.0-0.nightly-2020-08-26-032807   True        False         False      111m
config-operator                            4.6.0-0.nightly-2020-08-26-032807   True        False         False      111m
console                                    4.6.0-0.nightly-2020-08-26-032807   True        False         True       26m
csi-snapshot-controller                    4.6.0-0.nightly-2020-08-26-032807   True        False         False      57m
dns                                        4.5.7                               True        False         False      121m
etcd                                       4.6.0-0.nightly-2020-08-26-032807   True        False         False      120m
image-registry                             4.6.0-0.nightly-2020-08-26-032807   True        False         False      57m
ingress                                    4.6.0-0.nightly-2020-08-26-032807   True        False         False      28m
insights                                   4.6.0-0.nightly-2020-08-26-032807   True        False         False      118m
kube-apiserver                             4.6.0-0.nightly-2020-08-26-032807   True        False         False      120m
kube-controller-manager                    4.6.0-0.nightly-2020-08-26-032807   True        False         False      119m
kube-scheduler                             4.6.0-0.nightly-2020-08-26-032807   True        False         False      119m
kube-storage-version-migrator              4.6.0-0.nightly-2020-08-26-032807   True        False         False      57m
machine-api                                4.6.0-0.nightly-2020-08-26-032807   True        False         False      110m
machine-approver                           4.6.0-0.nightly-2020-08-26-032807   True        False         False      120m
machine-config                             4.5.7                               True        False         False      121m
marketplace                                4.6.0-0.nightly-2020-08-26-032807   True        False         False      27m
monitoring                                 4.6.0-0.nightly-2020-08-26-032807   False       True          True       7m47s
network                                    4.6.0-0.nightly-2020-08-26-032807   True        False         False      122m
node-tuning                                4.6.0-0.nightly-2020-08-26-032807   True        False         False      28m
openshift-apiserver                        4.6.0-0.nightly-2020-08-26-032807   False       False         False      9m26s
openshift-controller-manager               4.6.0-0.nightly-2020-08-26-032807   True        False         False      117m
openshift-samples                          4.6.0-0.nightly-2020-08-26-032807   True        False         False      19m
operator-lifecycle-manager                 4.6.0-0.nightly-2020-08-26-032807   True        False         False      121m
operator-lifecycle-manager-catalog         4.6.0-0.nightly-2020-08-26-032807   True        False         False      121m
operator-lifecycle-manager-packageserver   4.6.0-0.nightly-2020-08-26-032807   False       True          False      8s
service-ca                                 4.6.0-0.nightly-2020-08-26-032807   True        False         False      122m
storage                                    4.6.0-0.nightly-2020-08-26-032807   True        False         False      28m


Version-Release number of the following components:
4.5.7 (upgrading to 4.6.0-0.nightly-2020-08-26-032807)

How reproducible:
100%

Steps to Reproduce:

1. Deploy a disconnected cluster with an IPv6 provisioning network and an IPv4 baremetal network,
   with version 4.5.7: https://openshift-release.apps.ci.l2s4.p1.openshiftapps.com/releasetag/4.5.7

2. Upgrade the cluster to a 4.6 nightly (I used this one: https://openshift-release.apps.ci.l2s4.p1.openshiftapps.com/releasetag/4.6.0-0.nightly-2020-08-26-032807):

$ oc adm upgrade --to-image registry.ocp-edge-cluster-0.qe.lab.redhat.com:5000/localimages/local-release-image:4.6.0-0.nightly-2020-08-26-032807 --allow-explicit-upgrade --force
 

Actual results:
The cluster fails to upgrade (with the force flag) to the 4.6 nightly.

Expected results:
The upgrade from 4.5.7 to the 4.6 nightly completes successfully.

Comment 1 Beth White 2020-09-01 16:38:24 UTC
*** Bug 1874093 has been marked as a duplicate of this bug. ***

Comment 2 Kiran Thyagaraja 2020-09-09 13:09:17 UTC
It would be useful to see detailed output from the failed ClusterOperators. Also, what does 'oc get clusterversion -o yaml' show? If there are ImagePullBackOff errors, it would be worth figuring out why the image pull is failing (auth problems, etc.).
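
For reference, gathering the details requested here might look like the following (standard oc invocations; 'authentication' is just an example operator taken from the table in comment 0):

  # Overall update state, including conditions and history:
  oc get clusterversion -o yaml
  # Drill into a Degraded operator's conditions:
  oc describe clusteroperator authentication
  # Scan for image-pull failures across all namespaces:
  oc get pods --all-namespaces | grep -Ei 'imagepullbackoff|errimagepull'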

Comment 3 Polina Rabinovich 2020-09-10 12:53:39 UTC
(In reply to Kiran Thyagaraja from comment #2)
> It would be useful to see a detailed output of the failed ClusterOperators.
> Also what does the 'oc get clusterversion -o yaml' show? If there are
> imagepullbackoff errors, it would be worth trying to figure out why the
> image pull is failing? Auth problems, etc.

I tried to reproduce the bug, this time upgrading from 4.5.8 to 4.6.0-fc.4 (with the force flag). The failure is different, but still related to the upgrade not working well. I have now added a must-gather to the bug, along with the output of 'oc get clusterversion -o yaml'; I hope it helps:

[kni@provisionhost-0-0 ~]$ oc get clusterversion -o yaml
apiVersion: v1
items:
- apiVersion: config.openshift.io/v1
  kind: ClusterVersion
  metadata:
    creationTimestamp: "2020-09-10T06:51:14Z"
    generation: 2
    managedFields:
    - apiVersion: config.openshift.io/v1
      fieldsType: FieldsV1
      fieldsV1:
        f:spec:
          .: {}
          f:channel: {}
          f:clusterID: {}
          f:upstream: {}
      manager: cluster-bootstrap
      operation: Update
      time: "2020-09-10T06:51:14Z"
    - apiVersion: config.openshift.io/v1
      fieldsType: FieldsV1
      fieldsV1:
        f:spec:
          f:desiredUpdate:
            .: {}
            f:force: {}
            f:image: {}
            f:version: {}
      manager: oc
      operation: Update
      time: "2020-09-10T09:36:37Z"
    - apiVersion: config.openshift.io/v1
      fieldsType: FieldsV1
      fieldsV1:
        f:status:
          .: {}
          f:availableUpdates: {}
          f:conditions: {}
          f:desired:
            .: {}
            f:force: {}
            f:image: {}
            f:version: {}
          f:history: {}
          f:observedGeneration: {}
          f:versionHash: {}
      manager: cluster-version-operator
      operation: Update
      time: "2020-09-10T11:55:29Z"
    name: version
    resourceVersion: "217904"
    selfLink: /apis/config.openshift.io/v1/clusterversions/version
    uid: 627ac525-045d-4cce-a1c7-11576a5fbb25
  spec:
    channel: stable-4.5
    clusterID: 29d8707c-1679-4a4e-a09d-c9ffdd28252d
    desiredUpdate:
      force: true
      image: registry.ocp-edge-cluster-0.qe.lab.redhat.com:5000/localimages/local-release-image:4.6.0-fc.4
      version: ""
    upstream: https://api.openshift.com/api/upgrades_info/v1/graph
  status:
    availableUpdates: null
    conditions:
    - lastTransitionTime: "2020-09-10T07:50:07Z"
      message: Done applying 4.5.8
      status: "True"
      type: Available
    - lastTransitionTime: "2020-09-10T11:55:29Z"
      status: "False"
      type: Failing
    - lastTransitionTime: "2020-09-10T09:36:44Z"
      message: 'Working towards 4.6.0-fc.4: 1% complete'
      status: "True"
      type: Progressing
    - lastTransitionTime: "2020-09-10T06:51:21Z"
      message: 'Unable to retrieve available updates: Get "https://api.openshift.com/api/upgrades_info/v1/graph?arch=amd64&channel=stable-4.5&id=29d8707c-1679-4a4e-a09d-c9ffdd28252d&version=4.6.0-fc.4":
        dial tcp 52.5.215.228:443: connect: connection timed out'
      reason: RemoteFailed
      status: "False"
      type: RetrievedUpdates
    - lastTransitionTime: "2020-09-10T10:10:34Z"
      message: 'Cluster operator cloud-credential cannot be upgraded between minor
        versions: Parent credential secret kube-system/aws-creds must be restored
        prior to upgrade'
      reason: CredentialsRootSecretMissing
      status: "False"
      type: Upgradeable
    desired:
      image: registry.ocp-edge-cluster-0.qe.lab.redhat.com:5000/localimages/local-release-image:4.6.0-fc.4
      version: 4.6.0-fc.4
    history:
    - completionTime: null
      image: registry.ocp-edge-cluster-0.qe.lab.redhat.com:5000/localimages/local-release-image:4.6.0-fc.4
      startedTime: "2020-09-10T09:36:44Z"
      state: Partial
      verified: false
      version: 4.6.0-fc.4
    - completionTime: "2020-09-10T07:50:07Z"
      image: registry.svc.ci.openshift.org/ocp/release@sha256:ae61753ad8c8a26ed67fa233eea578194600d6c72622edab2516879cfbf019fd
      startedTime: "2020-09-10T06:51:21Z"
      state: Completed
      verified: false
      version: 4.5.8
    observedGeneration: 2
    versionHash: Yqy8fQV18YE=
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""


must-gather:
https://drive.google.com/drive/folders/1oBFXrggyZRfx1v3jpet0yHOc7vNwCa61?usp=sharing

Comment 4 Kiran Thyagaraja 2020-09-17 13:29:40 UTC
Notice the status condition that says 'Unable to retrieve available updates:'. It looks like the disconnected IPv6 cluster cannot reach addresses outside the cluster, either because the route is wrong or non-functional. I have seen this in my deployment as well. IPv4 works fine IMO.
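
A quick way to confirm that from a node (a sketch; run from a master over SSH; in a properly disconnected environment you would expect this to fail):

  # Does the node have any route to the upstream update service?
  curl -sv --connect-timeout 10 https://api.openshift.com/api/upgrades_info/v1/graph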

Comment 5 Kiran Thyagaraja 2020-09-17 14:42:19 UTC
Here is an example where I try to manually pull an image on an IPv6 cluster from one of the master nodes:


[core@master-1 ~]$ sudo podman pull quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:e55e2ed25c29bb54f00c82159b9ef04f43b387fc0c2f114f597b757205052f07
Trying to pull quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:e55e2ed25c29bb54f00c82159b9ef04f43b387fc0c2f114f597b757205052f07...
  Get https://quay.io/v2/: dial tcp 52.0.92.170:443: connect: network is unreachable
Error: error pulling image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:e55e2ed25c29bb54f00c82159b9ef04f43b387fc0c2f114f597b757205052f07": unable to pull quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:e55e2ed25c29bb54f00c82159b9ef04f43b387fc0c2f114f597b757205052f07: unable to pull image: Error initializing source docker://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:e55e2ed25c29bb54f00c82159b9ef04f43b387fc0c2f114f597b757205052f07: (Mirrors also failed: [virthost.ostest.test.metalkube.org:5000/localimages/local-release-image@sha256:e55e2ed25c29bb54f00c82159b9ef04f43b387fc0c2f114f597b757205052f07: Error reading manifest sha256:e55e2ed25c29bb54f00c82159b9ef04f43b387fc0c2f114f597b757205052f07 in virthost.ostest.test.metalkube.org:5000/localimages/local-release-image: unauthorized: authentication required]): quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:e55e2ed25c29bb54f00c82159b9ef04f43b387fc0c2f114f597b757205052f07: error pinging docker registry quay.io: Get https://quay.io/v2/: dial tcp 52.0.92.170:443: connect: network is unreachable

Clearly the route to the internet is not available. So when the cluster tries to upgrade from the internet, ignoring the local mirror, we'll see failures.
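
Worth noting: in the output above, the local mirror itself also failed, with 'unauthorized: authentication required'. One way to separate routing problems from auth problems is to retry the pull against the mirror directly with the node's pull secret (a sketch; it assumes the pull secret is at the kubelet's standard /var/lib/kubelet/config.json path):

  # Pull from the local mirror explicitly, supplying the node's pull secret:
  sudo podman pull --authfile /var/lib/kubelet/config.json \
    virthost.ostest.test.metalkube.org:5000/localimages/local-release-image@sha256:e55e2ed25c29bb54f00c82159b9ef04f43b387fc0c2f114f597b757205052f07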

Comment 7 W. Trevor King 2020-09-17 20:38:58 UTC
(In reply to Polina Rabinovich from comment #0)
> 2.Upgrade the cluster to version 4.6 nightly ( i used this one:
> https://openshift-release.apps.ci.l2s4.p1.openshiftapps.com/releasetag/4.6.0-
> 0.nightly-2020-08-26-032807
> 
> $oc adm upgrade --to-image
> registry.ocp-edge-cluster-0.qe.lab.redhat.com:5000/localimages/local-release-
> image:4.6.0-0.nightly-2020-08-26-032807 --allow-explicit-upgrade --force

Forcing is not recommended [1].  If your motivation was the lack of an available, trusted signature on the target release, you should:

* Stick to nightlies which have been mirrored to Quay [2] or use a feature candidate [3], and
* Mirror in the target release's signature [4].

[1]: https://github.com/openshift/oc/pull/387
[2]: https://mirror.openshift.com/pub/openshift-v4/x86_64/clients/ocp-dev-preview/latest-4.6/
[3]: https://mirror.openshift.com/pub/openshift-v4/x86_64/clients/ocp-dev-preview/4.6.0-fc.5/
[4]: https://docs.openshift.com/container-platform/4.5/updating/updating-restricted-network-cluster.html#updating-restricted-network-image-signature-configmap
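
As a rough sketch of what step [4] produces (the name, digest, and signature values below are placeholders, not from this bug), the mirrored signature ends up in a ConfigMap along these lines:

  oc apply -f - <<'EOF'
  apiVersion: v1
  kind: ConfigMap
  metadata:
    name: release-image-signature
    namespace: openshift-config-managed
    labels:
      release.openshift.io/verification-signatures: ""
  binaryData:
    # Key is sha256-<release digest>-1; value is the base64-encoded signature
    # blob downloaded from mirror.openshift.com per [4]:
    sha256-<release-digest>-1: <base64-signature>
  EOF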

(In reply to Kiran Thyagaraja from comment #4)
> If you notice that there is a status component that says: 'Unable to
> retrieve available updates...

--allow-explicit-upgrade tells 'oc' not to worry about targets that are not in availableUpdates, and --to-image means that you don't need availableUpdates to convert a version to a pullspec.  And all of this happens as part of requesting the update.  By the time the cluster-version operator is actually reconciling the cluster to the 4.6 target, further availableUpdates issues are completely irrelevant.  Whatever is sticking the update partway is something else.
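
In other words, with the signature mirrored in, the same request should work without --force, as long as the target is pinned by digest so the signature can be verified (a sketch; the digest is a placeholder):

  oc adm upgrade \
    --to-image registry.ocp-edge-cluster-0.qe.lab.redhat.com:5000/localimages/local-release-image@sha256:<release-digest> \
    --allow-explicit-upgrade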

Comment 8 Polina Rabinovich 2020-09-21 06:33:43 UTC
(In reply to W. Trevor King from comment #7)
> Forcing is not recommended [1].  If your motivation was the lack of an
> available, trusted signature on the target release, you should: [...]


We opened a bug (https://bugzilla.redhat.com/show_bug.cgi?id=1874093) without the force flag, but it was closed as a duplicate of this one. So the problem happens whether or not we use the force flag.

Comment 9 W. Trevor King 2020-09-21 16:13:14 UTC
From comment 3's must gather, the ClusterVersion had [1]:

  - lastTransitionTime: "2020-09-10T09:36:44Z"
    message: 'Working towards 4.6.0-fc.4: 1% complete'
    status: "True"
    type: Progressing

namespaces/openshift-cluster-version/pods/cluster-version-operator-579fd8968b-k5f59 [2] had pod YAML showing the CVO was running, but no pod logs to say what it was up to.  Can you attach CVO pod logs from the hung update?

[1]: https://drive.google.com/drive/folders/1_WDGsXgMq40w5WTlf7fbkyCI5JK5meAW?usp=sharing
[2]: https://drive.google.com/drive/folders/1oHqBopZt7SEmhVdryNlZP7_Ij8ABpjfK?usp=sharing
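
For what it's worth, one way to capture those logs (standard oc; redirect to a file and attach it):

  oc -n openshift-cluster-version logs deployment/cluster-version-operator \
    > cluster-version-operator-logs.txt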

Comment 10 Kiran Thyagaraja 2020-09-21 16:22:57 UTC
Hi Polina, can you get us the cluster-version-operator logs from your failed upgrade?

Comment 11 Polina Rabinovich 2020-09-22 13:07:49 UTC
Created attachment 1715701 [details]
cluster-version-operator logs

I attached the cluster-version-operator logs.

Comment 12 Polina Rabinovich 2020-09-22 13:10:56 UTC
Hi Kiran, I attached cluster-version-operator logs from failed upgrade.

[kni@provisionhost-0-0 ~]$ oc get pods -n openshift-cluster-version 
NAME                                        READY   STATUS      RESTARTS   AGE
cluster-version-operator-579fd8968b-zgxs7   1/1     Running     0          77m
version--l5dj5-wm28h                        0/1     Completed   0          78m

Comment 13 Kiran Thyagaraja 2020-09-22 16:33:04 UTC
Looks like a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1875237.

Comment 14 Brad P. Crochet 2020-09-22 16:51:09 UTC

*** This bug has been marked as a duplicate of bug 1875237 ***

Comment 15 W. Trevor King 2020-09-22 21:07:06 UTC
Feeding comment 11's logs into the CVO's log-explainer [1]:

  $ ~/src/openshift/cluster-version-operator/hack/log-explainer.py <cluster-version-operator-logs.txt 
  WARNING:root:not finished: clusteroperator network: Cluster operator network is still updating

Analyzed logs show you're still forcing, despite comment 7 trying to explain why you should not need to force.  The analyzed logs also show that Attempt 4 (12:16 - 12:22) made it as far as waiting on the network operator.  Attempt 5 made it almost as far and got stuck on operator-lifecycle-manager-packageserver.  Later attempts, starting at 12:34, all blew up trying to push the openshift-cluster-version/cluster-version-operator PrometheusRule, with errors like:

  error running apply for prometheusrule "openshift-cluster-version/cluster-version-operator" (9 of 602): Put "https://api-int.ocp-edge-cluster-0.qe.lab.redhat.com:6443/apis/monitoring.coreos.com/v1/namespaces/openshift-cluster-version/prometheusrules/cluster-version-operator": context deadline exceeded

I don't see any existing bugs mentioning that error message.  The error message itself is not much to go on, and without a full must-gather containing both CVO logs and ClusterOperator YAML, it's hard to string things together.  But if these CVO logs are from a situation similar to your initial comment 0, where monitoring was degraded, it might make sense to dig into why monitoring is sad, to see if it's the source of the PrometheusRule server issues.
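
A possible starting point for that digging (standard oc commands; a sketch, not a definitive diagnosis):

  # Why is the monitoring ClusterOperator degraded?
  oc get clusteroperator monitoring -o yaml
  # Are the monitoring pods healthy, and what does the operator say?
  oc -n openshift-monitoring get pods
  oc -n openshift-monitoring logs deployment/cluster-monitoring-operator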

I'm not clear on the connection with bug 1875237.

[1]: https://github.com/openshift/cluster-version-operator/pull/452