Description of problem:

I tried to upgrade my cluster (disconnected env) from version 4.5.7 to a 4.6 nightly using the force flag. I let the upgrade run for some time, but it always gets stuck at 79% with this error message:

[kni@provisionhost-0-0 ~]$ oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.5.7     True        True          51m     Unable to apply 4.6.0-0.nightly-2020-08-26-032807: an unknown error has occurred: MultipleErrors

Operators condition:

[kni@provisionhost-0-0 ~]$ oc get co
NAME                                       VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.6.0-0.nightly-2020-08-26-032807   False       True          True       18m
cloud-credential                           4.6.0-0.nightly-2020-08-26-032807   True        False         False      138m
cluster-autoscaler                         4.6.0-0.nightly-2020-08-26-032807   True        False         False      111m
config-operator                            4.6.0-0.nightly-2020-08-26-032807   True        False         False      111m
console                                    4.6.0-0.nightly-2020-08-26-032807   True        False         True       26m
csi-snapshot-controller                    4.6.0-0.nightly-2020-08-26-032807   True        False         False      57m
dns                                        4.5.7                               True        False         False      121m
etcd                                       4.6.0-0.nightly-2020-08-26-032807   True        False         False      120m
image-registry                             4.6.0-0.nightly-2020-08-26-032807   True        False         False      57m
ingress                                    4.6.0-0.nightly-2020-08-26-032807   True        False         False      28m
insights                                   4.6.0-0.nightly-2020-08-26-032807   True        False         False      118m
kube-apiserver                             4.6.0-0.nightly-2020-08-26-032807   True        False         False      120m
kube-controller-manager                    4.6.0-0.nightly-2020-08-26-032807   True        False         False      119m
kube-scheduler                             4.6.0-0.nightly-2020-08-26-032807   True        False         False      119m
kube-storage-version-migrator              4.6.0-0.nightly-2020-08-26-032807   True        False         False      57m
machine-api                                4.6.0-0.nightly-2020-08-26-032807   True        False         False      110m
machine-approver                           4.6.0-0.nightly-2020-08-26-032807   True        False         False      120m
machine-config                             4.5.7                               True        False         False      121m
marketplace                                4.6.0-0.nightly-2020-08-26-032807   True        False         False      27m
monitoring                                 4.6.0-0.nightly-2020-08-26-032807   False       True          True       7m47s
network                                    4.6.0-0.nightly-2020-08-26-032807   True        False         False      122m
node-tuning                                4.6.0-0.nightly-2020-08-26-032807   True        False         False      28m
openshift-apiserver                        4.6.0-0.nightly-2020-08-26-032807   False       False         False      9m26s
openshift-controller-manager               4.6.0-0.nightly-2020-08-26-032807   True        False         False      117m
openshift-samples                          4.6.0-0.nightly-2020-08-26-032807   True        False         False      19m
operator-lifecycle-manager                 4.6.0-0.nightly-2020-08-26-032807   True        False         False      121m
operator-lifecycle-manager-catalog         4.6.0-0.nightly-2020-08-26-032807   True        False         False      121m
operator-lifecycle-manager-packageserver   4.6.0-0.nightly-2020-08-26-032807   False       True          False      8s
service-ca                                 4.6.0-0.nightly-2020-08-26-032807   True        False         False      122m
storage                                    4.6.0-0.nightly-2020-08-26-032807   True        False         False      28m

Version-Release number of the following components:
rpm -q openshift-ansible
rpm -q ansible
ansible --version

How reproducible:
100%

Steps to Reproduce:
1. Deploy a disconnected cluster with an IPv6 provisioning network and an IPv4 baremetal network, with version 4.5.7: https://openshift-release.apps.ci.l2s4.p1.openshiftapps.com/releasetag/4.5.7
2. Upgrade the cluster to a 4.6 nightly (I used this one: https://openshift-release.apps.ci.l2s4.p1.openshiftapps.com/releasetag/4.6.0-0.nightly-2020-08-26-032807):

$ oc adm upgrade --to-image registry.ocp-edge-cluster-0.qe.lab.redhat.com:5000/localimages/local-release-image:4.6.0-0.nightly-2020-08-26-032807 --allow-explicit-upgrade --force

Actual results:
The cluster fails to upgrade (with the force flag) to the 4.6 nightly.

Expected results:
The upgrade from 4.5.7 to the 4.6 nightly completes successfully.
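Additional info:
Since this is a disconnected environment, the target release has to be mirrored into the local registry before running the upgrade. The exact command is not captured here; the mirroring step looks roughly like the following, where the --from pullspec is illustrative and the --to values match the pullspec used in step 2:

$ oc adm release mirror \
    --from=registry.svc.ci.openshift.org/ocp/release:4.6.0-0.nightly-2020-08-26-032807 \
    --to=registry.ocp-edge-cluster-0.qe.lab.redhat.com:5000/localimages/local-release-image \
    --to-release-image=registry.ocp-edge-cluster-0.qe.lab.redhat.com:5000/localimages/local-release-image:4.6.0-0.nightly-2020-08-26-032807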
*** Bug 1874093 has been marked as a duplicate of this bug. ***
It would be useful to see detailed output from the failed ClusterOperators. Also, what does 'oc get clusterversion -o yaml' show? If there are ImagePullBackOff errors, it would be worth figuring out why the image pull is failing (auth problems, etc.).
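For example, something like the following would capture that detail; the operator names here are just the ones showing Degraded=True in the table from comment 0:

$ oc get clusterversion -o yaml
$ oc get co authentication -o yaml
$ oc get co monitoring -o yaml
$ oc get co console -o yaml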
(In reply to Kiran Thyagaraja from comment #2)
> It would be useful to see detailed output from the failed ClusterOperators.
> Also, what does 'oc get clusterversion -o yaml' show? If there are
> ImagePullBackOff errors, it would be worth figuring out why the image pull
> is failing (auth problems, etc.).

I tried to reproduce the bug; this time it was an upgrade from 4.5.8 to 4.6.0-fc.4 (with the force flag). The symptom is different, but it is still related to the upgrade not working. I have now added a 'must-gather' to the bug, and also the output of 'oc get clusterversion -o yaml'; I hope it helps:

[kni@provisionhost-0-0 ~]$ oc get clusterversion -o yaml
apiVersion: v1
items:
- apiVersion: config.openshift.io/v1
  kind: ClusterVersion
  metadata:
    creationTimestamp: "2020-09-10T06:51:14Z"
    generation: 2
    managedFields:
    - apiVersion: config.openshift.io/v1
      fieldsType: FieldsV1
      fieldsV1:
        f:spec:
          .: {}
          f:channel: {}
          f:clusterID: {}
          f:upstream: {}
      manager: cluster-bootstrap
      operation: Update
      time: "2020-09-10T06:51:14Z"
    - apiVersion: config.openshift.io/v1
      fieldsType: FieldsV1
      fieldsV1:
        f:spec:
          f:desiredUpdate:
            .: {}
            f:force: {}
            f:image: {}
            f:version: {}
      manager: oc
      operation: Update
      time: "2020-09-10T09:36:37Z"
    - apiVersion: config.openshift.io/v1
      fieldsType: FieldsV1
      fieldsV1:
        f:status:
          .: {}
          f:availableUpdates: {}
          f:conditions: {}
          f:desired:
            .: {}
            f:force: {}
            f:image: {}
            f:version: {}
          f:history: {}
          f:observedGeneration: {}
          f:versionHash: {}
      manager: cluster-version-operator
      operation: Update
      time: "2020-09-10T11:55:29Z"
    name: version
    resourceVersion: "217904"
    selfLink: /apis/config.openshift.io/v1/clusterversions/version
    uid: 627ac525-045d-4cce-a1c7-11576a5fbb25
  spec:
    channel: stable-4.5
    clusterID: 29d8707c-1679-4a4e-a09d-c9ffdd28252d
    desiredUpdate:
      force: true
      image: registry.ocp-edge-cluster-0.qe.lab.redhat.com:5000/localimages/local-release-image:4.6.0-fc.4
      version: ""
    upstream: https://api.openshift.com/api/upgrades_info/v1/graph
  status:
    availableUpdates: null
    conditions:
    - lastTransitionTime: "2020-09-10T07:50:07Z"
      message: Done applying 4.5.8
      status: "True"
      type: Available
    - lastTransitionTime: "2020-09-10T11:55:29Z"
      status: "False"
      type: Failing
    - lastTransitionTime: "2020-09-10T09:36:44Z"
      message: 'Working towards 4.6.0-fc.4: 1% complete'
      status: "True"
      type: Progressing
    - lastTransitionTime: "2020-09-10T06:51:21Z"
      message: 'Unable to retrieve available updates: Get "https://api.openshift.com/api/upgrades_info/v1/graph?arch=amd64&channel=stable-4.5&id=29d8707c-1679-4a4e-a09d-c9ffdd28252d&version=4.6.0-fc.4": dial tcp 52.5.215.228:443: connect: connection timed out'
      reason: RemoteFailed
      status: "False"
      type: RetrievedUpdates
    - lastTransitionTime: "2020-09-10T10:10:34Z"
      message: 'Cluster operator cloud-credential cannot be upgraded between minor versions: Parent credential secret kube-system/aws-creds must be restored prior to upgrade'
      reason: CredentialsRootSecretMissing
      status: "False"
      type: Upgradeable
    desired:
      image: registry.ocp-edge-cluster-0.qe.lab.redhat.com:5000/localimages/local-release-image:4.6.0-fc.4
      version: 4.6.0-fc.4
    history:
    - completionTime: null
      image: registry.ocp-edge-cluster-0.qe.lab.redhat.com:5000/localimages/local-release-image:4.6.0-fc.4
      startedTime: "2020-09-10T09:36:44Z"
      state: Partial
      verified: false
      version: 4.6.0-fc.4
    - completionTime: "2020-09-10T07:50:07Z"
      image: registry.svc.ci.openshift.org/ocp/release@sha256:ae61753ad8c8a26ed67fa233eea578194600d6c72622edab2516879cfbf019fd
      startedTime: "2020-09-10T06:51:21Z"
      state: Completed
      verified: false
      version: 4.5.8
    observedGeneration: 2
    versionHash: Yqy8fQV18YE=
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""

must-gather: https://drive.google.com/drive/folders/1oBFXrggyZRfx1v3jpet0yHOc7vNwCa61?usp=sharing
Notice that there is a status condition that says 'Unable to retrieve available updates:'. It looks like the disconnected IPv6 cluster is not able to reach addresses outside the cluster, either because the route is wrong or because it is non-functional. I have seen this in my deployment as well. IPv4 works fine, as far as I can tell.
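Something like the following, run from one of the nodes, would show whether outbound routing works at all (the node name and timeout here are just illustrative):

[core@master-0 ~]$ curl -v --max-time 10 "https://api.openshift.com/api/upgrades_info/v1/graph?channel=stable-4.5"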
Here is an example. I try to manually pull an image on an IPv6 cluster from one of the master nodes:

[core@master-1 ~]$ sudo podman pull quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:e55e2ed25c29bb54f00c82159b9ef04f43b387fc0c2f114f597b757205052f07
Trying to pull quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:e55e2ed25c29bb54f00c82159b9ef04f43b387fc0c2f114f597b757205052f07...
  Get https://quay.io/v2/: dial tcp 52.0.92.170:443: connect: network is unreachable
Error: error pulling image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:e55e2ed25c29bb54f00c82159b9ef04f43b387fc0c2f114f597b757205052f07": unable to pull quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:e55e2ed25c29bb54f00c82159b9ef04f43b387fc0c2f114f597b757205052f07: unable to pull image: Error initializing source docker://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:e55e2ed25c29bb54f00c82159b9ef04f43b387fc0c2f114f597b757205052f07: (Mirrors also failed: [virthost.ostest.test.metalkube.org:5000/localimages/local-release-image@sha256:e55e2ed25c29bb54f00c82159b9ef04f43b387fc0c2f114f597b757205052f07: Error reading manifest sha256:e55e2ed25c29bb54f00c82159b9ef04f43b387fc0c2f114f597b757205052f07 in virthost.ostest.test.metalkube.org:5000/localimages/local-release-image: unauthorized: authentication required]): quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:e55e2ed25c29bb54f00c82159b9ef04f43b387fc0c2f114f597b757205052f07: error pinging docker registry quay.io: Get https://quay.io/v2/: dial tcp 52.0.92.170:443: connect: network is unreachable

Clearly the route to the internet is not available. So, when the cluster tries to upgrade from the internet, ignoring the local mirror, we'll see failures.
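Two things are worth separating in that output: quay.io is unreachable (expected in a disconnected setup), and the local mirror itself answered "unauthorized: authentication required". The latter may just be because podman run by hand does not use the cluster pull secret; re-running with --authfile /var/lib/kubelet/config.json (the path crio/kubelet use on RHCOS) would rule that out. For the cluster-side redirection it is also worth confirming the ImageContentSourcePolicy covers the release repositories. A typical ICSP for this kind of mirror looks roughly like this (the object name is illustrative; the mirror/source values are taken from the output above):

apiVersion: operator.openshift.io/v1alpha1
kind: ImageContentSourcePolicy
metadata:
  name: local-release-mirror
spec:
  repositoryDigestMirrors:
  - mirrors:
    - virthost.ostest.test.metalkube.org:5000/localimages/local-release-image
    source: quay.io/openshift-release-dev/ocp-v4.0-art-dev
  - mirrors:
    - virthost.ostest.test.metalkube.org:5000/localimages/local-release-image
    source: quay.io/openshift-release-dev/ocp-release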
(In reply to Polina Rabinovich from comment #0)
> 2. Upgrade the cluster to a 4.6 nightly (I used this one:
> https://openshift-release.apps.ci.l2s4.p1.openshiftapps.com/releasetag/4.6.0-0.nightly-2020-08-26-032807):
>
> $ oc adm upgrade --to-image registry.ocp-edge-cluster-0.qe.lab.redhat.com:5000/localimages/local-release-image:4.6.0-0.nightly-2020-08-26-032807 --allow-explicit-upgrade --force

Forcing is not recommended [1]. If your motivation was the lack of an available, trusted signature on the target release, you should:

* Stick to nightlies which have been mirrored to Quay [2] or use a feature candidate [3], and
* Mirror in the target release's signature [4] (a sketch of that ConfigMap is below).

[1]: https://github.com/openshift/oc/pull/387
[2]: https://mirror.openshift.com/pub/openshift-v4/x86_64/clients/ocp-dev-preview/latest-4.6/
[3]: https://mirror.openshift.com/pub/openshift-v4/x86_64/clients/ocp-dev-preview/4.6.0-fc.5/
[4]: https://docs.openshift.com/container-platform/4.5/updating/updating-restricted-network-cluster.html#updating-restricted-network-image-signature-configmap

(In reply to Kiran Thyagaraja from comment #4)
> If you notice that there is a status component that says: 'Unable to
> retrieve available updates...

--allow-explicit-upgrade tells 'oc' not to worry about targets that are not in availableUpdates, and --to-image means that you don't need availableUpdates to convert a version to a pullspec. All of this happens as part of requesting the update. By the time the cluster-version operator is actually reconciling the cluster to the 4.6 target, further availableUpdates issues are completely irrelevant. Whatever is sticking the update part way is something else.
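For reference, the signature ConfigMap described in [4] ends up looking roughly like this (the ConfigMap name, digest, and signature payload here are placeholders; the real values come from the mirrored release):

apiVersion: v1
kind: ConfigMap
metadata:
  name: signature-sha256-<release-image-digest>
  namespace: openshift-config-managed
  labels:
    release.openshift.io/verification-signatures: ""
binaryData:
  sha256-<release-image-digest>-1: <base64-encoded signature file>

Once that is applied, the CVO can verify the mirrored release and --force is no longer needed to get past signature verification.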
(In reply to W. Trevor King from comment #7)
> Forcing is not recommended [1]. If your motivation was the lack of an
> available, trusted signature on the target release, you should:
> [...]
> Whatever is sticking the update part way is something else.

We opened bug 1874093 (https://bugzilla.redhat.com/show_bug.cgi?id=1874093) without the force flag, but it was closed as a duplicate of this one. So the problem happens whether or not we use the force flag.
From comment 3's must-gather, the ClusterVersion had [1]:

  - lastTransitionTime: "2020-09-10T09:36:44Z"
    message: 'Working towards 4.6.0-fc.4: 1% complete'
    status: "True"
    type: Progressing

namespaces/openshift-cluster-version/pods/cluster-version-operator-579fd8968b-k5f59 [2] had pod YAML showing the CVO was running, but no pod logs to say what it was up to. Can you attach CVO pod logs from the hung update?

[1]: https://drive.google.com/drive/folders/1_WDGsXgMq40w5WTlf7fbkyCI5JK5meAW?usp=sharing
[2]: https://drive.google.com/drive/folders/1oHqBopZt7SEmhVdryNlZP7_Ij8ABpjfK?usp=sharing
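Something along these lines should capture them (the deployment name is the standard one; the pod name is the one from [2] and may have been replaced by now):

$ oc -n openshift-cluster-version logs deployment/cluster-version-operator > cvo.log
$ oc -n openshift-cluster-version logs cluster-version-operator-579fd8968b-k5f59 > cvo.log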
Hi Polina, Can you get us cluster-version-operator logs from your failed upgrade?
Created attachment 1715701 [details]
cluster-version-operator logs

I attached the cluster-version-operator logs.
Hi Kiran,

I attached the cluster-version-operator logs from the failed upgrade.

[kni@provisionhost-0-0 ~]$ oc get pods -n openshift-cluster-version
NAME                                        READY   STATUS      RESTARTS   AGE
cluster-version-operator-579fd8968b-zgxs7   1/1     Running     0          77m
version--l5dj5-wm28h                        0/1     Completed   0          78m
Looks like a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1875237.
*** This bug has been marked as a duplicate of bug 1875237 ***
Feeding comment 11's logs into the CVO's log-explainer [1]:

$ ~/src/openshift/cluster-version-operator/hack/log-explainer.py <cluster-version-operator-logs.txt
WARNING:root:not finished: clusteroperator network: Cluster operator network is still updating

The analyzed logs show you're still forcing, despite comment 7 trying to explain why you should not need to force. They also show that attempt 4 (12:16 - 12:22) made it as far as waiting on the network operator. Attempt 5 made it almost as far and stuck on operator-lifecycle-manager-packageserver. Later attempts, starting at 12:34, all blew up trying to push the openshift-cluster-version/cluster-version-operator PrometheusRule, with errors like:

  error running apply for prometheusrule "openshift-cluster-version/cluster-version-operator" (9 of 602): Put "https://api-int.ocp-edge-cluster-0.qe.lab.redhat.com:6443/apis/monitoring.coreos.com/v1/namespaces/openshift-cluster-version/prometheusrules/cluster-version-operator": context deadline exceeded

I don't see any existing bugs mentioning that error message. That error message is not much to go on, and without a full must-gather containing both CVO logs and ClusterOperator YAML, it's hard to string things together. But if these CVO logs are from a situation similar to your initial comment 0, where monitoring was degraded, it might make sense to dig into why monitoring is sad to see if it's the source of the PrometheusRule server issues. I'm not clear on the connection with bug 1875237.

[1]: https://github.com/openshift/cluster-version-operator/pull/452
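For anyone picking this up, a reasonable first pass at the monitoring angle (generic triage commands, nothing specific to this must-gather) would be:

$ oc get co monitoring -o yaml
$ oc -n openshift-monitoring get pods
$ oc -n openshift-cluster-version get prometheusrule cluster-version-operator -o yaml

If the last command also hangs or times out, that points at the API-server side of the PrometheusRule path rather than at the CVO itself.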