Bug 1905489 - Latest OCP 4.7 on Z builds fail to complete installation as SRO operator does not install
Summary: Latest OCP 4.7 on Z builds fail to complete installation as SRO operator does not install
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Special Resource Operator
Version: 4.7
Hardware: s390x
OS: Linux
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.7.0
Assignee: Andy McCrae
QA Contact: Michael Nguyen
URL:
Whiteboard:
Depends On:
Blocks: ocp-47-z-tracker
 
Reported: 2020-12-08 13:32 UTC by krmoser
Modified: 2020-12-15 14:13 UTC
CC List: 12 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-12-15 14:13:50 UTC
Target Upstream Version:
Embargoed:



Description krmoser 2020-12-08 13:32:27 UTC
Description of problem:
From the 4.7.0-0.nightly-s390x-2020-12-04-114524 OCP 4.7 build to the current 4.7.0-0.nightly-s390x-2020-12-07-232930 build (and most likely beyond), the OCP 4.7 builds do not complete installation as the special-resource-operator (SRO) does not install.

This issue has been seen with both z/VM and KVM hypervisors for OCP 4.7 on Z.


Version-Release number of selected component (if applicable):
OCP 4.7

How reproducible:
Easily reproducible for any OCP 4.7 on Z public mirror build from 4.7.0-0.nightly-s390x-2020-12-04-114524 to 4.7.0-0.nightly-s390x-2020-12-07-232930.

Steps to Reproduce:
1. Attempt to install any of the public mirror builds from 4.7.0-0.nightly-s390x-2020-12-04-114524 to 4.7.0-0.nightly-s390x-2020-12-07-232930.

Actual results:
The OCP 4.7 installation never completes.

Expected results:
The OCP 4.7 installation completes successfully.

Additional info:
Here is the output from the "oc get clusterversion", "oc get nodes", and "oc get co" commands more than 2 hours after the install started. An OCP 4.7 install usually completes in 20-25 minutes or less, but here the install never completes because the "special-resource-operator" never even starts to install.
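
For reference, these are the standard oc client commands used to capture the status shown below:

$ oc get clusterversion
$ oc get nodes
$ oc get co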


NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version             False       True          128m    Unable to apply 4.7.0-0.nightly-s390x-2020-12-07-232930: the cluster operator special-resource-operator has not yet successfully rolled out

NAME                                          STATUS   ROLES    AGE    VERSION
master-0.pok-96.ocptest.pok.stglabs.ibm.com   Ready    master   127m   v1.19.2+ad738ba
master-1.pok-96.ocptest.pok.stglabs.ibm.com   Ready    master   127m   v1.19.2+ad738ba
master-2.pok-96.ocptest.pok.stglabs.ibm.com   Ready    master   127m   v1.19.2+ad738ba
worker-0.pok-96.ocptest.pok.stglabs.ibm.com   Ready    worker   119m   v1.19.2+ad738ba
worker-1.pok-96.ocptest.pok.stglabs.ibm.com   Ready    worker   119m   v1.19.2+ad738ba


NAME                                       VERSION                                   AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.7.0-0.nightly-s390x-2020-12-07-232930   True        False         False      110m
baremetal                                  4.7.0-0.nightly-s390x-2020-12-07-232930   True        False         False      126m
cloud-credential                           4.7.0-0.nightly-s390x-2020-12-07-232930   True        False         False      126m
cluster-autoscaler                         4.7.0-0.nightly-s390x-2020-12-07-232930   True        False         False      125m
config-operator                            4.7.0-0.nightly-s390x-2020-12-07-232930   True        False         False      126m
console                                    4.7.0-0.nightly-s390x-2020-12-07-232930   True        False         False      115m
csi-snapshot-controller                    4.7.0-0.nightly-s390x-2020-12-07-232930   True        False         False      125m
dns                                        4.7.0-0.nightly-s390x-2020-12-07-232930   True        False         False      125m
etcd                                       4.7.0-0.nightly-s390x-2020-12-07-232930   True        False         False      124m
image-registry                             4.7.0-0.nightly-s390x-2020-12-07-232930   True        False         False      119m
ingress                                    4.7.0-0.nightly-s390x-2020-12-07-232930   True        False         False      118m
insights                                   4.7.0-0.nightly-s390x-2020-12-07-232930   True        False         False      105m
kube-apiserver                             4.7.0-0.nightly-s390x-2020-12-07-232930   True        False         False      122m
kube-controller-manager                    4.7.0-0.nightly-s390x-2020-12-07-232930   True        False         False      124m
kube-scheduler                             4.7.0-0.nightly-s390x-2020-12-07-232930   True        False         False      124m
kube-storage-version-migrator              4.7.0-0.nightly-s390x-2020-12-07-232930   True        False         False      118m
machine-api                                4.7.0-0.nightly-s390x-2020-12-07-232930   True        False         False      125m
machine-approver                           4.7.0-0.nightly-s390x-2020-12-07-232930   True        False         False      125m
machine-config                             4.7.0-0.nightly-s390x-2020-12-07-232930   True        False         False      125m
marketplace                                4.7.0-0.nightly-s390x-2020-12-07-232930   True        False         False      124m
monitoring                                 4.7.0-0.nightly-s390x-2020-12-07-232930   True        False         False      117m
network                                    4.7.0-0.nightly-s390x-2020-12-07-232930   True        False         False      117m
node-tuning                                4.7.0-0.nightly-s390x-2020-12-07-232930   True        False         False      125m
openshift-apiserver                        4.7.0-0.nightly-s390x-2020-12-07-232930   True        False         False      119m
openshift-controller-manager               4.7.0-0.nightly-s390x-2020-12-07-232930   True        False         False      123m
openshift-samples                          4.7.0-0.nightly-s390x-2020-12-07-232930   True        False         False      118m
operator-lifecycle-manager                 4.7.0-0.nightly-s390x-2020-12-07-232930   True        False         False      125m
operator-lifecycle-manager-catalog         4.7.0-0.nightly-s390x-2020-12-07-232930   True        False         False      125m
operator-lifecycle-manager-packageserver   4.7.0-0.nightly-s390x-2020-12-07-232930   True        False         False      119m
service-ca                                 4.7.0-0.nightly-s390x-2020-12-07-232930   True        False         False      126m
special-resource-operator
storage                                    4.7.0-0.nightly-s390x-2020-12-07-232930   True        False         False      126m

Comment 1 Rafael Fonseca 2020-12-08 13:56:02 UTC
One of the issues is that the operator hasn't been built multi-arch:

$ oc get pods -A | grep Crash
openshift-sro                                      special-resource-controller-manager-57f8bff587-lxzhl     0/2     CrashLoopBackOff   14         33m

$ oc logs special-resource-controller-manager-57f8bff587-lxzhl -n openshift-sro --all-containers
standard_init_linux.go:219: exec user process caused: exec format error
standard_init_linux.go:219: exec user process caused: exec format error
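
A quick way to confirm the architecture mismatch behind the "exec format error" is to inspect the image's reported platform (a hedged example; <image-pullspec> is a placeholder for the SRO image reference from the pod spec):

$ oc image info <image-pullspec>
# For the broken image this should report Arch: amd64 rather than s390x.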

This was fixed by Andy McCrae in PR https://github.com/openshift/special-resource-operator/pull/6, but it seems that fix hasn't made its way into the installer image yet.

There are other issues with the operator that affect all arches and are currently being discussed and fixed.

Comment 2 Dan Li 2020-12-08 14:05:02 UTC
Confirmed with the reporter that this issue is a blocker+.

Comment 3 Micah Abbott 2020-12-08 14:40:44 UTC
@Andy since you've already been playing in this space, I'm going to assign it to you for first looks.  Maybe it should be under the Multi-Arch component?

Comment 4 Andy McCrae 2020-12-08 16:01:58 UTC
This should be fixed - I'll double-check the builds, but we fixed this with: https://github.com/openshift/special-resource-operator/commit/5f2cb4aff31207dcb82f4e0b9df5bc6700e99165#diff-dd2c0eb6ea5cfc6c4bd4eac30934e2d5746747af48fef6da689e85b752f39557

I'll do some tests just to make sure, but the issue was that the SRO is new and wasn't yet set up to be built multi-arch (it had x86_64/linux hardcoded). That has been fixed and backported, so newer builds should include the fix.
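
For context, a minimal sketch of the kind of build change involved (illustrative only; the actual change is in the commit linked above):

# Before: target platform hardcoded, so the binary only runs on x86_64
$ GOOS=linux GOARCH=amd64 go build -o manager main.go
# After: build for the host platform, so s390x builders produce s390x binaries
$ GOOS=$(go env GOOS) GOARCH=$(go env GOARCH) go build -o manager main.go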

Comment 5 Rafael Fonseca 2020-12-08 16:04:58 UTC
I tried today and it was not fixed yet.

Comment 6 krmoser 2020-12-08 19:21:01 UTC
Folks,

1. We tested with the OCP 4.7.0-0.nightly-s390x-2020-12-08-141200 build and it seems that this issue is fixed.

2. The "special-resource-operator" cluster operator is no longer listed when the "oc get co" command is issued, and the cluster AVAILABLE status does become True with the successful installation of the OCP cluster.

Here is an example:

NAME      VERSION                                   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.7.0-0.nightly-s390x-2020-12-08-141200   True        False         3m20s   Cluster version is 4.7.0-0.nightly-s390x-2020-12-08-141200

NAME                                          STATUS   ROLES    AGE   VERSION
master-0.pok-96.ocptest.pok.stglabs.ibm.com   Ready    master   28m   v1.19.2+ad738ba
master-1.pok-96.ocptest.pok.stglabs.ibm.com   Ready    master   28m   v1.19.2+ad738ba
master-2.pok-96.ocptest.pok.stglabs.ibm.com   Ready    master   28m   v1.19.2+ad738ba
worker-0.pok-96.ocptest.pok.stglabs.ibm.com   Ready    worker   20m   v1.19.2+ad738ba
worker-1.pok-96.ocptest.pok.stglabs.ibm.com   Ready    worker   20m   v1.19.2+ad738ba


NAME                                       VERSION                                   AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.7.0-0.nightly-s390x-2020-12-08-141200   True        False         False      11m
baremetal                                  4.7.0-0.nightly-s390x-2020-12-08-141200   True        False         False      27m
cloud-credential                           4.7.0-0.nightly-s390x-2020-12-08-141200   True        False         False      27m
cluster-autoscaler                         4.7.0-0.nightly-s390x-2020-12-08-141200   True        False         False      26m
config-operator                            4.7.0-0.nightly-s390x-2020-12-08-141200   True        False         False      27m
console                                    4.7.0-0.nightly-s390x-2020-12-08-141200   True        False         False      16m
csi-snapshot-controller                    4.7.0-0.nightly-s390x-2020-12-08-141200   True        False         False      27m
dns                                        4.7.0-0.nightly-s390x-2020-12-08-141200   True        False         False      26m
etcd                                       4.7.0-0.nightly-s390x-2020-12-08-141200   True        False         False      25m
image-registry                             4.7.0-0.nightly-s390x-2020-12-08-141200   True        False         False      21m
ingress                                    4.7.0-0.nightly-s390x-2020-12-08-141200   True        False         False      18m
insights                                   4.7.0-0.nightly-s390x-2020-12-08-141200   True        False         False      6m10s
kube-apiserver                             4.7.0-0.nightly-s390x-2020-12-08-141200   True        False         False      25m
kube-controller-manager                    4.7.0-0.nightly-s390x-2020-12-08-141200   True        False         False      25m
kube-scheduler                             4.7.0-0.nightly-s390x-2020-12-08-141200   True        False         False      24m
kube-storage-version-migrator              4.7.0-0.nightly-s390x-2020-12-08-141200   True        False         False      19m
machine-api                                4.7.0-0.nightly-s390x-2020-12-08-141200   True        False         False      27m
machine-approver                           4.7.0-0.nightly-s390x-2020-12-08-141200   True        False         False      26m
machine-config                             4.7.0-0.nightly-s390x-2020-12-08-141200   True        False         False      25m
marketplace                                4.7.0-0.nightly-s390x-2020-12-08-141200   True        False         False      25m
monitoring                                 4.7.0-0.nightly-s390x-2020-12-08-141200   True        False         False      18m
network                                    4.7.0-0.nightly-s390x-2020-12-08-141200   True        False         False      18m
node-tuning                                4.7.0-0.nightly-s390x-2020-12-08-141200   True        False         False      26m
openshift-apiserver                        4.7.0-0.nightly-s390x-2020-12-08-141200   True        False         False      21m
openshift-controller-manager               4.7.0-0.nightly-s390x-2020-12-08-141200   True        False         False      25m
openshift-samples                          4.7.0-0.nightly-s390x-2020-12-08-141200   True        False         False      20m
operator-lifecycle-manager                 4.7.0-0.nightly-s390x-2020-12-08-141200   True        False         False      26m
operator-lifecycle-manager-catalog         4.7.0-0.nightly-s390x-2020-12-08-141200   True        False         False      26m
operator-lifecycle-manager-packageserver   4.7.0-0.nightly-s390x-2020-12-08-141200   True        False         False      11m
service-ca                                 4.7.0-0.nightly-s390x-2020-12-08-141200   True        False         False      27m
storage                                    4.7.0-0.nightly-s390x-2020-12-08-141200   True        False         False      27m


3. Just to confirm: is it expected that the "special-resource-operator" is no longer listed in the "oc get co" command output?

Thank you,
Kyle

Comment 7 Rafael Fonseca 2020-12-08 19:50:31 UTC
I still see the operator on 4.7.0-0.nightly-s390x-2020-12-08-174134:

$ oc adm release info --pullspecs registry.svc.ci.openshift.org/ocp-s390x/release-s390x:4.7.0-0.nightly-s390x-2020-12-08-174134 | grep special
  special-resource-operator                      quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:db19bf6617957ae94c17426b545ff867f5966d817d191ae5ffcdb1c2ab48890c

$ oc get pods -A | grep -i Crash
openshift-sro                                      special-resource-controller-manager-79c4bdb869-rxwl8      0/2     CrashLoopBackOff    8          10m

$ oc logs special-resource-controller-manager-79c4bdb869-rxwl8 -n openshift-sro --all-containers
standard_init_linux.go:219: exec user process caused: exec format error
standard_init_linux.go:219: exec user process caused: exec format error

So not yet solved for all installer images.

As far as I know, yes, the expected outcome is for this operator to be removed for the time being.

Comment 8 Andy McCrae 2020-12-09 13:40:18 UTC
I'll follow up on this. The issue is that the operator pulls in 2 images:

gcr.io/kubebuilder/kube-rbac-proxy:v0.5.0
quay.io/openshift-psap/special-resource-operator:conditions

Neither of these is built for additional architectures, so they fail with 'exec format error'; they aren't built as part of the OCP release.
It looks like the SRO will be removed from the release, so we should be fine for 4.7, but I'll follow up to ensure this gets resolved if it is included in 4.8+.
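
One way to verify the architecture support of those images is to inspect their manifests directly (skopeo output format may vary; an amd64-only result explains the "exec format error" on s390x):

$ skopeo inspect docker://gcr.io/kubebuilder/kube-rbac-proxy:v0.5.0 | grep -i architecture
$ skopeo inspect docker://quay.io/openshift-psap/special-resource-operator:conditions | grep -i architecture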

Comment 9 krmoser 2020-12-09 18:03:57 UTC
Folks,

As a follow-up to comments 6 and 7, our continued OCP 4.7 build testing with the 2 latest public mirror OCP 4.7 builds indicates that this issue is not yet resolved.

1. OCP 4.7 builds 4.7.0-0.nightly-s390x-2020-12-08-174134 and 4.7.0-0.nightly-s390x-2020-12-09-160115 both fail with the same "the cluster operator special-resource-operator has not yet successfully rolled out" issue described above.

Thank you,
Kyle

Comment 10 Dan Li 2020-12-14 22:02:03 UTC
Hi Andy, following up on this bug. I remember we discussed that this bug was being resolved. Do you know if it is fixed in the latest build?

Comment 11 Andy McCrae 2020-12-15 14:13:50 UTC
Sorry for the delay - it has been resolved. There was an issue with removing the SRO from the multi-arch releases, but since the 4.7.0-0.nightly-s390x-2020-12-09-183623 build the SRO has not been included (and won't be included in 4.7 for any architecture).
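
To confirm on a given nightly, the same check from comment 7 can be reused (with <release-pullspec> as a placeholder for the release image) and should now return no output:

$ oc adm release info --pullspecs <release-pullspec> | grep special-resource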

I'll close this bug out. There is still some work to do on the SRO (across all architectures); we may not see any further issues, but if we do (in 4.8+) we can address them in a new bz.

Andy

