+++ This bug was initially created as a clone of Bug #1835112 +++

Description of problem:

When upgrading the zVM cluster from 4.2.29 to 4.3.18, the upgrade fails with a message that the openshift-samples cluster operator was not rolled out. The log of the operator container shows the following errors:

time="2020-05-13T06:27:05Z" level=info msg="watch event tests not part of operators inventory"
time="2020-05-13T06:28:33Z" level=info msg="Spec is valid because this operator has not processed a config yet"
time="2020-05-13T06:28:33Z" level=info msg="error reading in content : open /opt/openshift/operator/ocp-s390x: no such file or directory"
time="2020-05-13T06:28:33Z" level=info msg="CRDUPDATE file list err update"
time="2020-05-13T06:28:36Z" level=error msg="unable to sync: open /opt/openshift/operator/ocp-s390x: no such file or directory, requeuing"

When rsh'ing into the pod, only the directory /opt/openshift/operator/ocp-x86_64 is present.

Version-Release number of selected component (if applicable):
4.3.18

How reproducible:

Steps to Reproduce:
1. Start upgrade from 4.2.29 to 4.3.18

Actual results:
Upgrade fails

Expected results:
Expect upgrade to work

Additional info:

Deployment YAML for cluster-samples-operator:

kind: Deployment
apiVersion: apps/v1
metadata:
  name: cluster-samples-operator
  namespace: openshift-cluster-samples-operator
  selfLink: >-
    /apis/apps/v1/namespaces/openshift-cluster-samples-operator/deployments/cluster-samples-operator
  uid: c4a61f65-586c-11ea-b504-02462c000005
  resourceVersion: '46056999'
  generation: 6
  creationTimestamp: '2020-02-26T07:51:16Z'
  annotations:
    deployment.kubernetes.io/revision: '4'
spec:
  replicas: 1
  selector:
    matchLabels:
      name: cluster-samples-operator
  template:
    metadata:
      creationTimestamp: null
      labels:
        name: cluster-samples-operator
    spec:
      nodeSelector:
        node-role.kubernetes.io/master: ''
      restartPolicy: Always
      serviceAccountName: cluster-samples-operator
      schedulerName: default-scheduler
      terminationGracePeriodSeconds: 30
      securityContext: {}
      containers:
        - resources:
            requests:
              cpu: 10m
          terminationMessagePath: /dev/termination-log
          name: cluster-samples-operator
          command:
            - cluster-samples-operator
          env:
            - name: WATCH_NAMESPACE
              valueFrom:
                fieldRef:
                  apiVersion: v1
                  fieldPath: metadata.namespace
            - name: OPERATOR_NAME
              value: cluster-samples-operator
            - name: RELEASE_VERSION
              value: 4.3.18
            - name: IMAGE_JENKINS
              value: >-
                quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:7688ecdcb88ff3b29abf0180da08cd26e42d285151cb399c0e3af160c1b2305e
            - name: IMAGE_AGENT_NODEJS
              value: >-
                quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:50fa2dbf44ac1ab0487b6c69a2eb7f3513325a99dda91799af306c35a9f39ac4
            - name: IMAGE_AGENT_MAVEN
              value: >-
                quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:4c052eb17a15cb35c1253d4fbd849959047b06b478e628e4601609c1b25bd178
          ports:
            - name: metrics
              containerPort: 60000
              protocol: TCP
          imagePullPolicy: IfNotPresent
          volumeMounts:
            - name: samples-operator-tls
              mountPath: /etc/secrets
          terminationMessagePolicy: File
          image: >-
            quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:7b6920adfb6ca19d9105d292feac51867eba3ee46825c0a4d187024c4695e790
        - name: cluster-samples-operator-watch
          image: >-
            quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:7b6920adfb6ca19d9105d292feac51867eba3ee46825c0a4d187024c4695e790
          command:
            - cluster-samples-operator-watch
            - file-watcher-watchdog
          args:
            - '--namespace=openshift-cluster-samples-operator'
            - '--process-name=cluster-samples-operator'
            - '--termination-grace-period=30s'
            - '--files=/etc/secrets/tls.crt,/etc/secrets/tls.key'
          resources:
            requests:
              cpu: 10m
          terminationMessagePath: /dev/termination-log
          terminationMessagePolicy: File
          imagePullPolicy: IfNotPresent
      serviceAccount: cluster-samples-operator
      volumes:
        - name: samples-operator-tls
          secret:
            secretName: samples-operator-tls
            defaultMode: 420
      dnsPolicy: ClusterFirst
      tolerations:
        - key: node-role.kubernetes.io/master
          operator: Exists
          effect: NoSchedule
        - key: node.kubernetes.io/unreachable
          operator: Exists
          effect: NoExecute
          tolerationSeconds: 120
        - key: node.kubernetes.io/not-ready
          operator: Exists
          effect: NoExecute
          tolerationSeconds: 120
      priorityClassName: system-cluster-critical
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 25%
      maxSurge: 25%
  revisionHistoryLimit: 10
  progressDeadlineSeconds: 600
status:
  observedGeneration: 6
  replicas: 1
  updatedReplicas: 1
  readyReplicas: 1
  availableReplicas: 1
  conditions:
    - type: Progressing
      status: 'True'
      lastUpdateTime: '2020-05-12T12:51:11Z'
      lastTransitionTime: '2020-02-26T07:51:16Z'
      reason: NewReplicaSetAvailable
      message: >-
        ReplicaSet "cluster-samples-operator-64fc49d87" has successfully
        progressed.
    - type: Available
      status: 'True'
      lastUpdateTime: '2020-05-12T14:05:30Z'
      lastTransitionTime: '2020-05-12T14:05:30Z'
      reason: MinimumReplicasAvailable
      message: Deployment has minimum availability.
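For reference, the missing-content symptom above can be confirmed directly on an affected cluster. A minimal sketch (the namespace and the name=cluster-samples-operator label are taken from the deployment spec above; oc rsh against the deployment drops into its first container, which here is the operator itself):

  # find the samples operator pod
  oc -n openshift-cluster-samples-operator get pods -l name=cluster-samples-operator

  # list the per-arch content directories inside the pod; on the affected
  # s390x cluster only ocp-x86_64 is present, matching the
  # "open /opt/openshift/operator/ocp-s390x: no such file or directory" error
  oc -n openshift-cluster-samples-operator rsh deployment/cluster-samples-operator \
    ls /opt/openshift/operator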
--- Additional comment from gmontero on 2020-05-14 18:24:26 UTC ---

Yep, based on the data provided with this bug, I now see there is an upgrade-specific (vs. initial install) error path on s390/ppc64le. It stems from samples bootstrapping as removed on those platforms while payload imagestreams like tests and must-gather still come into the samples operator.

Now, a couple of notes:

1) A process reminder: per the OCP process, I would need to fix this in 4.5 first, then 4.4.z, then 4.3.z, so it will take a bit.

2) Jeremy Poulin and Renin Jose are in the process of trying to validate the s390 image in early 4.5 payloads to see if we can finally include some samples. Not sure yet if/when that would move back to 4.4 and 4.3, but if it did, it would obviate the need for the fix. I've cc:ed Jeremy on this bug and will send a needinfo to him for comment.

But in the interim, I'll start on 1).

--- Additional comment from wking on 2020-05-14 19:07:06 UTC ---

We're asking the following questions to evaluate whether or not this bug warrants blocking an upgrade edge from either the previous X.Y or X.Y.Z. The ultimate goal is to avoid delivering an update which introduces new risk or reduces cluster functionality in any way. Sample answers are provided to give more context, and the UpgradeBlocker flag has been added to this bug. It will be removed if the assessment indicates that this should not block upgrade edges.

Who is impacted? If we have to block upgrade edges based on this issue, which edges would need blocking?
  example: 100% of customers upgrading from 4.2 to 4.3 running s390/ppc64le. Is there also an impact from 4.2 -> 4.2, 4.3 -> 4.4, etc.?

What is the impact? Is it serious enough to warrant blocking edges?
  example: Samples sticks on arch-specific bug, CVO sticks on samples, update hangs.

How involved is remediation (even moderately serious impacts might be acceptable if they are easy to mitigate)?
  example: Clearing the attempted update resolves the issue. There is no other remediation procedure.

--- Additional comment from gmontero on 2020-05-14 19:23:53 UTC ---

(In reply to W. Trevor King from comment #2)
> We're asking the following questions to evaluate whether or not this bug
> warrants blocking an upgrade edge from either the previous X.Y or X.Y.Z. The
> ultimate goal is to avoid delivering an update which introduces new risk or
> reduces cluster functionality in any way. Sample answers are provided to
> give more context and the UpgradeBlocker flag has been added to this bug. It
> will be removed if the assessment indicates that this should not block
> upgrade edges.
>
> Who is impacted? If we have to block upgrade edges based on this issue,
> which edges would need blocking?
>   example: 100% of customers upgrading from 4.2 to 4.3 running s390/ppc64le.

I believe this is 100% of customers upgrading from 4.2 to 4.3 running s390/ppc64le.

> Is there also an impact from 4.2 -> 4.2, 4.3 -> 4.4, etc.?
> What is the impact? Is it serious enough to warrant blocking edges?
>   example: Samples sticks on arch-specific bug, CVO sticks on samples,
>   update hangs.

I would expect these results on 4.2 -> 4.2 and 4.3 -> 4.4. We are currently working with the multiarch team to get samples vetted on s390 and perhaps ppc64le for 4.5, but that is most likely several weeks out for ppc64le and maybe a few days to a week for s390x, so I would not expect this for 4.4 -> 4.5. A discussion on backporting content to 4.4 or 4.3 could then occur, though it is not a given that would be agreed upon.

> How involved is remediation (even moderately serious impacts might be
> acceptable if they are easy to mitigate)?
>   example: Clearing the attempted update resolves the issue. There is no
>   other remediation procedure.

Running `oc delete configs.samples cluster` should reset the samples operator; when it comes back up, it will treat things like an initial install and should bootstrap as removed, without misguided attempts to read non-existent content.

--- Additional comment from gmontero on 2020-05-14 19:36:28 UTC ---

Also, on my needinfo to Jeremy (though if anybody on cc: knows, please feel free to chime in): are 4.3 -> 4.4 upgrade tests on s390x coming anytime soon?

--- Additional comment from gmontero on 2020-05-14 19:44:17 UTC ---

Lastly, to the originator and QA contact: can either of you reproduce the upgrade issue, then run `oc delete configs.samples cluster` and observe the result? Ultimately, an `oc get clusteroperator openshift-samples -o yaml` should confirm the reset worked and that samples is available==true, progressing==false, degraded==false, with the version set, like you would get on an initial install.

--- Additional comment from bdonahue on 2020-05-14 20:19:48 UTC ---

This would only affect s390. 4.2 was not released on ppc64.
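Putting the comments above together, the workaround on an affected cluster amounts to the following sketch (the commands and the expected condition values are taken from the comments above; the jsonpath variant is just standard oc usage for convenience):

  # reset the samples operator config; on restart the operator treats this
  # like an initial install and should bootstrap as removed on s390x
  oc delete configs.samples cluster

  # confirm the reset worked: available==true, progressing==false,
  # degraded==false, and the version set, as on an initial install
  oc get clusteroperator openshift-samples -o yaml

  # or inspect just the conditions
  oc get clusteroperator openshift-samples -o jsonpath='{.status.conditions}'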
OK, had a chat with Jeremy Poulin from multi-arch on what to do with this one. Details:

1) There is no viable x86 approximation for verifying this that OpenShift QE can do, like we've done with some other multi-arch samples bugs.

2) It is not a given yet that we backport to 4.4 the actual samples we are in the process of getting into 4.5, so the call made with the 4.5 version of this bug should not be applied for 4.4 (this bug only occurs if there are no samples for an arch in the payload).

3) However, 4.4 for multi-arch is currently blocked by some issues seen with libvirt deployments; Jeremy estimates at least a week for that to get sorted out.

4) The 4.2 to 4.3 upgrade is the highest priority they have.

5) The 4.3 to 4.4 upgrade is without question on tap and will be covered when 4.4 is ready.

6) So he suggests, and I agree, moving this to VERIFIED so we can get the 4.3.z pick going, and they can actually verify the 4.2 to 4.3 upgrade issue previously reported.

I'll give ART / elliot some more time to move this bz from MODIFIED to ON_QA before verifying, but if that is not done by EOB today, I'll force the issue.
Need https://bugzilla.redhat.com/show_bug.cgi?id=1841185 to fix the general test case needed to complete the test suite for this PR. Will cherry-pick the 4.5 version to 4.4.z once that test case is fixed.
There is not a viable x86 verification, and multi-arch is not ready to do this via 4.3 vs. 4.4 yet. That said, Dennis Gilmore verified these changes on a 4.2 to 4.3 upgrade with a manually built payload; once we get this into the 4.3.z nightlies, he can finally verify for real. Marking VERIFIED to get the pipeline moving.
Sorry, that was David Benoit in comment #8, not Dennis Gilmore.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:2445
Removing UpgradeBlocker from this older bug, to remove it from the suspect queue described in [1]. If you feel this bug still needs to be a suspect, please add the keyword again.

[1]: https://github.com/openshift/enhancements/pull/475