Bug 1835996 - s390x/ppc64le: Failed to upgrade Cluster from 4.2.29 to 4.3.18: unable to sync: open /opt/openshift/operator/ocp-s390x: no such file or directory
Summary: s390x/ppc64le: Failed to upgrade Cluster from 4.2.29 to 4.3.18: unable to syn...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Samples
Version: 4.3.z
Hardware: s390x
OS: Other
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.3.z
Assignee: Gabe Montero
QA Contact: David Benoit
URL:
Whiteboard:
Depends On: 1835995
Blocks:
 
Reported: 2020-05-14 21:07 UTC by OpenShift BugZilla Robot
Modified: 2021-04-05 17:36 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: On upgrade, the samples operator incorrectly assumes there is samples content for s390x in the payload.
Consequence: The samples operator incorrectly reports a degraded condition because of this and does not finish its upgrade, which bubbles up to the overall upgrade.
Fix: 1) Code changes were introduced so the operator does not attempt to retrieve samples content in the payload for s390x during upgrade; 2) as a workaround, a cluster admin can run `oc delete config.samples cluster` to reset the samples operator and get it out of the degraded state.
Result: s390x upgrades to 4.4 should not fail.
Clone Of:
Environment:
Last Closed: 2020-06-17 20:27:11 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Github openshift cluster-samples-operator pull 273 0 None closed Bug 1835996: avoid listing file system content for unsupported architures 2020-06-15 18:14:15 UTC
Github openshift cluster-samples-operator pull 285 0 None closed Bug 1835996: ensure s390/ppc64le platforms bootstrap as removed following upgrade 2020-06-15 18:14:15 UTC
Red Hat Product Errata RHBA-2020:2436 0 None None None 2020-06-17 20:28:16 UTC

Description OpenShift BugZilla Robot 2020-05-14 21:07:28 UTC
+++ This bug was initially created as a clone of Bug #1835112 +++

Description of problem:
When upgrading the zVM cluster from 4.2.29 to 4.3.18, the upgrade fails with a message that the OpenShift cluster samples operator was not rolled out.
The container log shows the following error:


time="2020-05-13T06:27:05Z" level=info msg="watch event tests not part of operators inventory"
time="2020-05-13T06:28:33Z" level=info msg="Spec is valid because this operator has not processed a config yet"
time="2020-05-13T06:28:33Z" level=info msg="error reading in content : open /opt/openshift/operator/ocp-s390x: no such file or directory"
time="2020-05-13T06:28:33Z" level=info msg="CRDUPDATE file list err update"
time="2020-05-13T06:28:36Z" level=error msg="unable to sync: open /opt/openshift/operator/ocp-s390x: no such file or directory, requeuing"

After an rsh into the pod, the only directory present is /opt/openshift/operator/ocp-x86_64
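
The mismatch between the path in the log and the directory actually present in the pod can be illustrated with a minimal Go sketch. It assumes a per-architecture layout of `<root>/ocp-<arch>` as seen in the log, and it treats a missing directory as "no samples for this architecture" instead of a sync error. The helper names `contentDir`, `listSampleContent`, and `demo` are hypothetical and are not taken from the operator's source.

```go
package main

import (
	"errors"
	"fmt"
	"io/fs"
	"os"
	"path/filepath"
)

// contentDir mirrors the path layout seen in the log: <root>/ocp-<arch>.
func contentDir(root, arch string) string {
	return filepath.Join(root, "ocp-"+arch)
}

// listSampleContent returns the entry names under the per-architecture
// directory. A missing directory yields an empty list and a nil error,
// so the caller does not requeue forever on an unsupported architecture.
// Hypothetical helper, not the operator's actual function.
func listSampleContent(root, arch string) ([]string, error) {
	entries, err := os.ReadDir(contentDir(root, arch))
	if errors.Is(err, fs.ErrNotExist) {
		return nil, nil
	}
	if err != nil {
		return nil, err
	}
	names := make([]string, 0, len(entries))
	for _, e := range entries {
		names = append(names, e.Name())
	}
	return names, nil
}

// demo builds a throwaway tree that contains only ocp-x86_64, as observed
// in the pod, and reports how many entries each architecture yields.
func demo() (int, int, error) {
	root, err := os.MkdirTemp("", "operator")
	if err != nil {
		return 0, 0, err
	}
	defer os.RemoveAll(root)

	sample := filepath.Join(contentDir(root, "x86_64"), "ruby")
	if err := os.MkdirAll(sample, 0o755); err != nil {
		return 0, 0, err
	}
	x86, err := listSampleContent(root, "x86_64")
	if err != nil {
		return 0, 0, err
	}
	s390x, err := listSampleContent(root, "s390x")
	if err != nil {
		return 0, 0, err
	}
	return len(x86), len(s390x), nil
}

func main() {
	x86, s390x, err := demo()
	if err != nil {
		fmt.Println("unexpected error:", err)
		return
	}
	fmt.Printf("x86_64 entries: %d, s390x entries: %d\n", x86, s390x)
}
```

Without the `fs.ErrNotExist` check, the read would surface an error of the same shape as the one logged above (`open .../ocp-s390x: no such file or directory`).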

Version-Release number of selected component (if applicable):
4.3.18

How reproducible:


Steps to Reproduce:
1. Start the upgrade from 4.2.29 to 4.3.18

Actual results:
Upgrade fails

Expected results:
Expect the upgrade to work

Additional info:

Deployment-YAML cluster-samples-operator:
kind: Deployment
apiVersion: apps/v1
metadata:
  name: cluster-samples-operator
  namespace: openshift-cluster-samples-operator
  selfLink: >-
    /apis/apps/v1/namespaces/openshift-cluster-samples-operator/deployments/cluster-samples-operator
  uid: c4a61f65-586c-11ea-b504-02462c000005
  resourceVersion: '46056999'
  generation: 6
  creationTimestamp: '2020-02-26T07:51:16Z'
  annotations:
    deployment.kubernetes.io/revision: '4'
spec:
  replicas: 1
  selector:
    matchLabels:
      name: cluster-samples-operator
  template:
    metadata:
      creationTimestamp: null
      labels:
        name: cluster-samples-operator
    spec:
      nodeSelector:
        node-role.kubernetes.io/master: ''
      restartPolicy: Always
      serviceAccountName: cluster-samples-operator
      schedulerName: default-scheduler
      terminationGracePeriodSeconds: 30
      securityContext: {}
      containers:
        - resources:
            requests:
              cpu: 10m
          terminationMessagePath: /dev/termination-log
          name: cluster-samples-operator
          command:
            - cluster-samples-operator
          env:
            - name: WATCH_NAMESPACE
              valueFrom:
                fieldRef:
                  apiVersion: v1
                  fieldPath: metadata.namespace
            - name: OPERATOR_NAME
              value: cluster-samples-operator
            - name: RELEASE_VERSION
              value: 4.3.18
            - name: IMAGE_JENKINS
              value: >-
                quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:7688ecdcb88ff3b29abf0180da08cd26e42d285151cb399c0e3af160c1b2305e
            - name: IMAGE_AGENT_NODEJS
              value: >-
                quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:50fa2dbf44ac1ab0487b6c69a2eb7f3513325a99dda91799af306c35a9f39ac4
            - name: IMAGE_AGENT_MAVEN
              value: >-
                quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:4c052eb17a15cb35c1253d4fbd849959047b06b478e628e4601609c1b25bd178
          ports:
            - name: metrics
              containerPort: 60000
              protocol: TCP
          imagePullPolicy: IfNotPresent
          volumeMounts:
            - name: samples-operator-tls
              mountPath: /etc/secrets
          terminationMessagePolicy: File
          image: >-
            quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:7b6920adfb6ca19d9105d292feac51867eba3ee46825c0a4d187024c4695e790
        - name: cluster-samples-operator-watch
          image: >-
            quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:7b6920adfb6ca19d9105d292feac51867eba3ee46825c0a4d187024c4695e790
          command:
            - cluster-samples-operator-watch
            - file-watcher-watchdog
          args:
            - '--namespace=openshift-cluster-samples-operator'
            - '--process-name=cluster-samples-operator'
            - '--termination-grace-period=30s'
            - '--files=/etc/secrets/tls.crt,/etc/secrets/tls.key'
          resources:
            requests:
              cpu: 10m
          terminationMessagePath: /dev/termination-log
          terminationMessagePolicy: File
          imagePullPolicy: IfNotPresent
      serviceAccount: cluster-samples-operator
      volumes:
        - name: samples-operator-tls
          secret:
            secretName: samples-operator-tls
            defaultMode: 420
      dnsPolicy: ClusterFirst
      tolerations:
        - key: node-role.kubernetes.io/master
          operator: Exists
          effect: NoSchedule
        - key: node.kubernetes.io/unreachable
          operator: Exists
          effect: NoExecute
          tolerationSeconds: 120
        - key: node.kubernetes.io/not-ready
          operator: Exists
          effect: NoExecute
          tolerationSeconds: 120
      priorityClassName: system-cluster-critical
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 25%
      maxSurge: 25%
  revisionHistoryLimit: 10
  progressDeadlineSeconds: 600
status:
  observedGeneration: 6
  replicas: 1
  updatedReplicas: 1
  readyReplicas: 1
  availableReplicas: 1
  conditions:
    - type: Progressing
      status: 'True'
      lastUpdateTime: '2020-05-12T12:51:11Z'
      lastTransitionTime: '2020-02-26T07:51:16Z'
      reason: NewReplicaSetAvailable
      message: >-
        ReplicaSet "cluster-samples-operator-64fc49d87" has successfully
        progressed.
    - type: Available
      status: 'True'
      lastUpdateTime: '2020-05-12T14:05:30Z'
      lastTransitionTime: '2020-05-12T14:05:30Z'
      reason: MinimumReplicasAvailable
      message: Deployment has minimum availability.

--- Additional comment from gmontero on 2020-05-14 18:24:26 UTC ---

Yep, there is an upgrade-specific (vs. initial install) error path with s390/ppc64le that I now see based on the data provided with the bug.
It stems from bootstrapping as removed for those platforms while payload imagestreams like tests and must-gather still come into
the samples operator.

Now, a couple of notes:
1) a process reminder, per the OCP process, I would need to fix this in 4.5 first, then 4.4.z, then 4.3.z; so it will take a bit
2) Jeremy Poulin and Renin Jose are in the process of trying to validate the s390 image in early 4.5 payloads to see if we can include
some samples finally ... not sure yet if/when that would move back to 4.4 and 4.3, but if it did, it would obviate the need for the fix.

I've cc:ed Jeremy on this bug and will send a needinfo to him for comment.

But in the interim, I'll start on 1)

--- Additional comment from wking on 2020-05-14 19:07:06 UTC ---

We're asking the following questions to evaluate whether or not this bug warrants blocking an upgrade edge from either the previous X.Y or X.Y.Z. The ultimate goal is to avoid delivering an update which introduces new risk or reduces cluster functionality in any way. Sample answers are provided to give more context and the UpgradeBlocker flag has been added to this bug. It will be removed if the assessment indicates that this should not block upgrade edges.

Who is impacted?  If we have to block upgrade edges based on this issue, which edges would need blocking?
  example: 100% of customers upgrading from 4.2 to 4.3 running s390/ppc64le.  Is there also an impact from 4.2 -> 4.2, 4.3 -> 4.4, etc.?
What is the impact?  Is it serious enough to warrant blocking edges?
  example: Samples sticks on arch-specific bug, CVO sticks on samples, update hangs.
How involved is remediation (even moderately serious impacts might be acceptable if they are easy to mitigate)?
  example: Clearing the attempted update resolves the issue.  There is no other remediation procedure.

--- Additional comment from gmontero on 2020-05-14 19:23:53 UTC ---

(In reply to W. Trevor King from comment #2)
> We're asking the following questions to evaluate whether or not this bug
> warrants blocking an upgrade edge from either the previous X.Y or X.Y.Z. The
> ultimate goal is to avoid delivering an update which introduces new risk or
> reduces cluster functionality in any way. Sample answers are provided to
> give more context and the UpgradeBlocker flag has been added to this bug. It
> will be removed if the assessment indicates that this should not block
> upgrade edges.
> 
> Who is impacted?  If we have to block upgrade edges based on this issue,
> which edges would need blocking?
>   example: 100% of customers upgrading from 4.2 to 4.3 running s390/ppc64le.

I believe this is 100% of customers upgrading from 4.2 to 4.3 running s390/ppc64le.

> Is there also an impact from 4.2 -> 4.2, 4.3 -> 4.4, etc.?
> What is the impact?  Is it serious enough to warrant blocking edges?
>   example: Samples sticks on arch-specific bug, CVO sticks on samples,
> update hangs.

I would expect these results on 4.2 -> 4.2, 4.3 -> 4.4.

We are currently working with the multiarch team to get samples vetted 
on s390 and perhaps ppc64le for 4.5.  But most likely that is several
weeks out for ppc64le and maybe a few days to a week for s390x.  So 
I would not expect this for 4.4 -> 4.5.

Then, a discussion on backporting content to 4.4 or 4.3 could occur, though
it is not a given that would be agreed upon.

> How involved is remediation (even moderately serious impacts might be
> acceptable if they are easy to mitigate)?
>   example: Clearing the attempted update resolves the issue.  There is no
> other remediation procedure.

running `oc delete configs.samples cluster` should reset the samples operator;
when it comes back up, it will treat things like an initial install and should 
bootstrap as removed, without misguided attempts to read non-existent content.

--- Additional comment from gmontero on 2020-05-14 19:36:28 UTC ---

Also, on my needinfo to Jeremy (though if anybody on cc: knows, please feel free to chime in): are there 4.3 -> 4.4 upgrade tests on s390x coming anytime soon?

--- Additional comment from gmontero on 2020-05-14 19:44:17 UTC ---

Lastly, to the originator and QA contact: can either of you reproduce the upgrade issue,
then run `oc delete configs.samples cluster` and observe the result?

Ultimately, an `oc get clusteroperator openshift-samples -o yaml` should confirm that the reset worked and that samples reports available==true, progressing==false, degraded==false, and a version set,
like you would get on an initial install.
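
The check described here can be sketched in Go as a small predicate over the conditions reported for the operator. This is only an illustration: the `condition` type and the `resetLooksComplete` function are hypothetical stand-ins, and the sketch assumes the condition types are named `Available`, `Progressing`, and `Degraded` with string statuses `True` and `False`.

```go
package main

import "fmt"

// condition is a simplified, hypothetical stand-in for one entry of the
// clusteroperator's status.conditions list.
type condition struct {
	Type   string
	Status string
}

// resetLooksComplete reports whether the conditions match the state
// described in the comment: Available=True, Progressing=False,
// Degraded=False, and a non-empty version.
func resetLooksComplete(conds []condition, version string) bool {
	want := map[string]string{
		"Available":   "True",
		"Progressing": "False",
		"Degraded":    "False",
	}
	seen := map[string]bool{}
	for _, c := range conds {
		expected, tracked := want[c.Type]
		if !tracked {
			continue // ignore condition types this check does not care about
		}
		if c.Status != expected {
			return false
		}
		seen[c.Type] = true
	}
	// All three conditions must be present, and a version must be set.
	return len(seen) == len(want) && version != ""
}

func main() {
	healthy := []condition{
		{Type: "Available", Status: "True"},
		{Type: "Progressing", Status: "False"},
		{Type: "Degraded", Status: "False"},
	}
	fmt.Println(resetLooksComplete(healthy, "4.3.18"))
}
```

A missing condition or an empty version is treated as "not yet complete", which errs on the side of not declaring success too early.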

--- Additional comment from bdonahue on 2020-05-14 20:19:48 UTC ---

   This would only affect s390. 4.2 was not released on ppc64.

Comment 1 Gabe Montero 2020-05-15 17:27:44 UTC
OK, when https://github.com/openshift/cluster-samples-operator/pull/273 is cherry-pick approved by Scott Dodson or the next appointed OpenShift patch manager, and subsequently merges, the multiarch team will
have a 4.3.z samples operator commit level to build a new payload from and retry the 4.2 to 4.3 upgrade on s390x (and ppc64le as well if that ever arises).

Comment 2 Gabe Montero 2020-05-21 18:05:59 UTC
OK the 4.3.z PR has merged.

The ball is now in the multiarch team's court.  When they can build a payload off of the associated commit from https://github.com/openshift/cluster-samples-operator/pull/273,
Barry can verify.

Comment 3 Barry Donahue 2020-05-21 18:54:23 UTC
David Benoit will verify

Comment 7 Gabe Montero 2020-05-22 16:41:50 UTC
David's test uncovered an additional issue beyond what was previously fixed, stemming from the fact that we did not bootstrap as removed in 4.2.x for samples on non-x86.

He and I are now iterating in his test environment over additional changes I've made to get the upgrade working.
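
The idea of bootstrapping as removed after an upgrade can be sketched roughly as follows. This is an assumption-laden illustration, not the change made in the pull requests: the state names `Managed` and `Removed`, the architecture strings, and the helpers `hasPayloadSamples` and `stateAfterUpgrade` are all hypothetical, and the sketch assumes that only x86_64 carries samples content in the payload.

```go
package main

import "fmt"

// Hypothetical state names used only for this sketch.
const (
	stateManaged = "Managed"
	stateRemoved = "Removed"
)

// hasPayloadSamples encodes the assumption of this sketch: only x86_64
// (reported by Go as "amd64") carries samples content in the payload.
func hasPayloadSamples(arch string) bool {
	return arch == "amd64" || arch == "x86_64"
}

// stateAfterUpgrade returns the management state the samples config should
// carry once an upgrade completes. Architectures without payload content
// are forced to Removed, even if an older release left them in another
// state; supported architectures keep whatever state they already had.
func stateAfterUpgrade(arch, current string) string {
	if !hasPayloadSamples(arch) {
		return stateRemoved
	}
	return current
}

func main() {
	for _, arch := range []string{"amd64", "s390x", "ppc64le"} {
		fmt.Printf("%s: %s\n", arch, stateAfterUpgrade(arch, stateManaged))
	}
}
```

Forcing the state on every upgrade pass, instead of only at first install, is what covers clusters that came from a release which did not bootstrap as removed.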

Comment 9 Gabe Montero 2020-05-29 02:57:39 UTC
waiting for dependency/4.4.z version https://bugzilla.redhat.com/show_bug.cgi?id=1835995 to merge and then will start cherrypick

Comment 12 Gabe Montero 2020-06-06 23:29:13 UTC
OK 4.3.0-0.nightly-2020-06-06-164811 for sure has this fix.

Comment 13 David Benoit 2020-06-08 21:05:14 UTC
Fix is verified in 4.3.0-0.nightly-s390x-2020-06-06-130744.

Comment 15 errata-xmlrpc 2020-06-17 20:27:11 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2436

Comment 16 W. Trevor King 2021-04-05 17:36:52 UTC
Removing UpgradeBlocker from this older bug, to remove it from the suspect queue described in [1].  If you feel like this bug still needs to be a suspect, please add keyword again.

[1]: https://github.com/openshift/enhancements/pull/475

