Created attachment 1878707 [details] FBC content and dockerfiles Description of problem: Testing build of FBC catalog image for mutli-arch clusters. Using docker buildx to emulate each architecture which we have been doing for a long time to build sqlite backed catalogs. The following dockerfile was used to build the catalog for ppc FROM --platform=linux/ppc64le registry.redhat.io/openshift4/ose-operator-registry:v4.10.0 AS builder FROM --platform=linux/ppc64le registry.redhat.io/ubi8/ubi-minimal LABEL operators.operatorframework.io.index.configs.v1=/configs LABEL ibm.operator.catalog.version=000-v1.24-fbc-20220510.113121-42AD3F05B ### Required OpenShift Labels LABEL name="IBM Operator Catalog" maintainer="jdockter.com" vendor="IBM" version="000-v1.24-fbc-20220510.113121-42AD3F05B" opm_version="v1.21.0" release="20220510.113121-42AD3F05B" summary="This is the IBM Operator Catalog for use with Red Hat OpenShift 4." description="This catalog contains operators for IBM product offerings." COPY ppc64le /configs COPY --from=builder /bin/opm/ /bin/opm COPY --from=builder /bin/grpc_health_probe /bin/grpc_health_probe ### add licenses to this directory RUN mkdir /licenses COPY licenses/LICENSE /licenses WORKDIR /tmp ARG user=1001 RUN microdnf clean all && chmod -vR g=u /etc/passwd USER ${user} ENTRYPOINT ["/bin/opm"] CMD ["serve", "/configs"] I've copied the manifest list and associated images to public location as 'icr.io/cpopen/ibm-operator-catalog:fbc-latest' The catalog source below can be used to test. apiVersion: operators.coreos.com/v1alpha1 kind: CatalogSource metadata: name: ibm-operator-catalog namespace: openshift-marketplace spec: displayName: IBM Operator Catalog publisher: IBM sourceType: grpc image: icr.io/cpopen/ibm-operator-catalog:fbc-latest Version-Release number of selected component (if applicable): How reproducible: All the time based off of dockerfile above. Tested on ppc64le OCP 4.6.53 and OCP 4.10.12 and get same errors. Steps to Reproduce: 1. Untar catalog.tar.gz attached containing 'catalog-test' directory of FBC content and dockerfiles for each architecture 2. Run build with docker buildx or other build tools to emulate ppc64le build, example command docker buildx build --no-cache --pull --push --platform linux/ppc64le -f index-fbc-v4.11-linux.amd64.dockerfile -t <some-registry.com/fbc-image:latest> . 3. Create catalog source with 'some-registry.com/fbc-image:latest' image referenced above Actual results: $ oc project Using project "openshift-marketplace" on server "https://api.catalog-tester-p.cp.fyre.ibm.com:6443". $ oc get pods NAME READY STATUS RESTARTS AGE certified-operators-t48h6 1/1 Running 0 55m community-operators-htmxl 1/1 Running 0 53m ibm-operator-catalog-xjncd 0/1 Running 5 3m40s marketplace-operator-775c64b5d9-cqlmn 1/1 Running 0 48m redhat-marketplace-pvhjl 1/1 Running 0 53m redhat-operators-fm2x5 1/1 Running 0 48m $ oc describe pod/ibm-operator-catalog-xjncd ... Events: Type Reason Age From Message ---- ------ ---- ---- ------- Normal Scheduled 2m9s default-scheduler Successfully assigned openshift-marketplace/ibm-operator-catalog-xjncd to worker2.catalog-tester-p.cp.fyre.ibm.com Normal AddedInterface 2m6s multus Add eth0 [10.254.16.24/22] Normal Pulled 2m4s kubelet Successfully pulled image "icr.io/cpopen/ibm-operator-catalog:fbc-latest" in 1.369921548s Warning Unhealthy 113s kubelet Liveness probe failed: timeout: failed to connect service ":50051" within 1s Normal Pulled 92s kubelet Successfully pulled image "icr.io/cpopen/ibm-operator-catalog:fbc-latest" in 1.412117469s Normal Started 91s (x2 over 2m4s) kubelet Started container registry-server Warning Unhealthy 91s kubelet Readiness probe errored: rpc error: code = NotFound desc = container is not created or running: checking if PID of 4b0edaf071f99c244afbe47e650be7f9e3416ad93044eab0566a055ab277f426 is running failed: container process not found Warning Unhealthy 60s (x5 over 110s) kubelet Readiness probe failed: timeout: failed to connect service ":50051" within 1s Normal Pulling 53s (x3 over 2m6s) kubelet Pulling image "icr.io/cpopen/ibm-operator-catalog:fbc-latest" Warning Unhealthy 53s (x5 over 103s) kubelet Liveness probe failed: command timed out Normal Killing 53s (x2 over 93s) kubelet Container registry-server failed liveness probe, will be restarted Normal Pulled 52s kubelet Successfully pulled image "icr.io/cpopen/ibm-operator-catalog:fbc-latest" in 1.304201103s Normal Created 51s (x3 over 2m4s) kubelet Created container registry-server Expected results: On amd64 and s390x clusters I see the registry-server container start up with a log message of time="2022-05-10T21:53:27Z" level=info msg="serving registry" configs=/configs port=50051 Additional info:
Any update on this bug?
Hi Jon, Above you listed the build command as docker buildx build --no-cache --pull --push --platform linux/ppc64le -f index-fbc-v4.11-linux.amd64.dockerfile -t <some-registry.com/fbc-image:latest> . which specifies the platform as linux/ppc64lr, but references the dockerfile as index-fbc-v4.11-linux.amd64.dockerfile, which is the amd64 one. Is that intentional? I note that when I run that command, I am able to run the resulting image, but when both platform and dockerfile agree on arch, then it (correctly) tells me I'm on the wrong architecture and fails. Cheers, -j
Sorry that was just a typo, should be: docker buildx build --no-cache --pull --push --platform linux/ppc64le -f index-fbc-v4.11-linux.ppc64le.dockerfile -t <some-registry.com/fbc-image:latest> .
Note...the dockerfile should be pulling the linux/ppc64le version of the FROM images FROM --platform=linux/ppc64le registry.redhat.io/openshift4/ose-operator-registry:v4.10.0 AS builder FROM --platform=linux/ppc64le registry.redhat.io/ubi8/ubi-minimal
Setting is a non-blocker since FBCs aren't being widely used yet
Sorry not sure I follow. What is the next step in getting a resolution for this?
I'm waiting on a ticket to get access to the infrastructure that I need to reproduce/investigate. Since you confirm that the arch conflict typo was only on bug submission and not in use, my working theory is that /bin/grpc_health_probe is not the correct architecture for the ppc64le target due to a failure in the tooling. If you have the image available on that arch, can you verify? -j
That didn't pan out. I was able to examine the executable and confirmed the hypothesis was incorrect. % file /tmp/grpc_health_probe /tmp/grpc_health_probe: ELF 64-bit LSB executable, 64-bit PowerPC or cisco 7500, OpenPOWER ELF V2 ABI, version 1 (SYSV), statically linked, Go BuildID=rcGh5Z2pwZi2Vx--ZkCY/GlMuqpgNUuhookqhBxWq/aOJcwvkRSvzEgSnY8PHI/gbtmk8lNXOOfkt_07pmM, not stripped So still waiting on the infrastructure to support a deeper dive. :(
Also tried build with the following dockerfile and it did not work as well # The base image is expected to contain # /bin/opm (with a serve subcommand) and /bin/grpc_health_probe FROM --platform=linux/ppc64le quay.io/operator-framework/opm:latest # Configure the entrypoint and command ENTRYPOINT ["/bin/opm"] CMD ["serve", "/configs"] # Copy declarative config root into image at /configs ADD ppc64le /configs # Set DC-specific label for the location of the DC root directory # in the image LABEL operators.operatorframework.io.index.configs.v1=/configs Same Readiness/Liveness probe error
Reached out in slack, but capturing here, too. Noted that the FBC dockerfile does not contain the "EXPOSE 50051" line that the SQLite dockerfile has. This port is needed for grpc communication. Please add it to your FBC dockerfile and try.
i.e. % cat index-fbc-v4.11-linux.ppc64le.dockerfile FROM --platform=linux/ppc64le registry.redhat.io/openshift4/ose-operator-registry:v4.10.0 AS builder FROM --platform=linux/ppc64le registry.redhat.io/ubi8/ubi-minimal LABEL operators.operatorframework.io.index.configs.v1=/configs LABEL ibm.operator.catalog.version=000-v1.24-fbc-20220510.113121-42AD3F05B ### Required OpenShift Labels LABEL name="IBM Operator Catalog" maintainer="jdockter.com" vendor="IBM" version="000-v1.24-fbc-20220510.113121-42AD3F05B" opm_version="v1.21.0" release="20220510.113121-42AD3F05B" summary="This is the IBM Operator Catalog for use with Red Hat OpenShift 4." description="This catalog contains operators for IBM product offerings." COPY ppc64le /configs COPY --from=builder /bin/opm/ /bin/opm COPY --from=builder /bin/grpc_health_probe /bin/grpc_health_probe ### add licenses to this directory RUN mkdir /licenses COPY licenses/LICENSE /licenses WORKDIR /tmp ARG user=1001 RUN microdnf clean all && chmod -vR g=u /etc/passwd EXPOSE 50051 USER ${user} ENTRYPOINT ["/bin/opm"] CMD ["serve", "/configs"]
Yes I had tried that as well, same error. Also to note that is not required in the upstream opm dockerfile listed, nor did I need it called out for the amd64 or s390x dockerfile of which both work.
if I describe the pod I see this Containers: registry-server: Container ID: cri-o://97115188b277fbf5be0cc3e681315c8a43ebdef68b9abed51d596e39b3220a62 Image: cp.stg.icr.io/cp/ibm-operator-catalog@sha256:faa255413fdface26958bddc19bc2aa4ca063f8145d2879fee75da907126681b Image ID: cp.stg.icr.io/cp/ibm-operator-catalog@sha256:faa255413fdface26958bddc19bc2aa4ca063f8145d2879fee75da907126681b Port: 50051/TCP Host Port: 0/TCP State: Running Started: Wed, 25 May 2022 11:18:41 -0500 Last State: Terminated Reason: Error Exit Code: 2 Started: Wed, 25 May 2022 11:18:02 -0500 Finished: Wed, 25 May 2022 11:18:40 -0500 Ready: False Restart Count: 1 Requests: cpu: 10m memory: 50Mi Liveness: exec [grpc_health_probe -addr=:50051] delay=10s timeout=5s period=10s #success=1 #failure=3 Readiness: exec [grpc_health_probe -addr=:50051] delay=5s timeout=5s period=10s #success=1 #failure=3 Environment: <none>
Captured catalog pod yaml, removed readiness and liveness probe, and was able to get the registry to start after about 15 or 20 seconds waiting time="2022-05-25T21:23:35Z" level=info msg="serving registry" configs=/configs port=50051
The readiness and liveness probe were the problems. What is the next step to get a fix for this?
Recording the updates here. As Jon indicated above, the pod readiness check fails at 5s. From observation, it appears that the pod takes 15-20s to get ready, so disabling those probes works around his issue. The only log from the pod was the 'serving registry' message above. Diving a little deeper, just an `opm validate` operation takes 14s to complete on the build (x86-64) hardware in this environment. This is in stark contrast to my own tests on x86-64 MacOS, x86-64 Fedora 36, x86-64 ubuntu 18LTS guest on Fedora 36 which all completed in less than half a second. My ppc64le (emulated via qemu on Fedora 36) testing took much longer, at 6s, but still not in the same realm as observed for this issue. We thought that perhaps the tooling versions were responsible for the poor performance, but is inconsistent with the testing results, tabulated at the end of this message. While I don't dispute that the environment is problematic, OCS is behaving properly given the default constraints. It doesn't seem appropriate to shift the timeout to fit the 15-20s observed window. It also doesn't make sense to consider this a release blocker, given that we cannot come close to reproducing the scenario, even under adverse conditions (emulating the hardware in question). It definitely makes sense to spend some time working to reduce the readiness time of `opm serve`. I suggest that we retitle this BZ "optimize opm serve readiness pipeline for FBC" and change the priority/severity to MEDIUM/MEDIUM, clearing any blocking flags. TESTING RESULTS ----------------- Jon's validation: ----------------- $ uname -a && time opm validate ppc64le/ Linux coc-devops-builder1.fyre.ibm.com 4.15.0-176-generic #185-Ubuntu SMP Tue Mar 29 17:40:04 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux real 0m14.040s user 0m14.202s sys 0m0.339s $ opm version Version: version.Version{OpmVersion:"v1.22.0", GitCommit:"fd85a98c", BuildDate:"2022-05-12T20:53:46Z", GoOs:"linux", GoArch:"amd64"} ---------------------------------------------- ppc64le simulated environment, for replication ---------------------------------------------- bash-4.4$ uname -a && time /bin/opm validate /configs/datapower-operator/ Linux 8cd82bcd85db 5.17.9-300.fc36.ppc64le #1 SMP Wed May 18 14:50:24 UTC 2022 ppc64le ppc64le ppc64le GNU/Linux real 0m4.325s user 0m5.014s sys 0m0.531s bash-4.4$ /bin/opm version Version: version.Version{OpmVersion:"1cb0c9a57", GitCommit:"1cb0c9a57affcc6d471b483ab34b627430677f09", BuildDate:"2022-04-21T14:15:12Z", GoOs:"linux", GoArch:"ppc64le"} ---------------------------- MacOS 12.4 (Monterey) x86-64 ---------------------------- % uname -a && time ./bin/opm validate ~/devel/test-index/ppc64le Darwin Jordan-MBP.local 21.5.0 Darwin Kernel Version 21.5.0: Tue Apr 26 21:08:22 PDT 2022; root:xnu-8020.121.3~4/RELEASE_X86_64 x86_64 ./bin/opm validate ~/devel/test-index/ppc64le 0.37s user 0.04s system 88% cpu 0.457 total % ./bin/opm version Version: version.Version{OpmVersion:"v1.21.0-7-g8ef648f8", GitCommit:"8ef648f8", BuildDate:"2022-05-26T17:58:21Z", GoOs:"darwin", GoArch:"amd64"} ---------------------------------- Fedora 36 x86-64 (i7-10850H intel) ---------------------------------- $ uname -a && time ./bin/opm validate ~/devel/test-index/ppc64le/ && ./bin/opm version Linux hatbox 5.17.6-300.fc36.x86_64 #1 SMP PREEMPT Mon May 9 15:47:11 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux real 0m0.302s user 0m0.322s sys 0m0.020s Version: version.Version{OpmVersion:"v1.21.0-11-g0d111cd8", GitCommit:"0d111cd8", BuildDate:"2022-05-26T18:53:35Z", GoOs:"linux", GoArch:"amd64"} -------------------------------------------------- Ubuntu 18.04.6 LTS (kvm guest on Fedora 36, above) -------------------------------------------------- $ uname -a && time ./bin/opm validate ~/devel/test-index/ppc64le/ Linux ubuntu-18-lts-vm 4.15.0-180-generic #189-Ubuntu SMP Wed May 18 14:13:57 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux real 0m0.325s user 0m0.334s sys 0m0.012s $ ./bin/opm version Version: version.Version{OpmVersion:"v1.22.1-3-g14368169", GitCommit:"14368169", BuildDate:"2022-05-26T20:21:42Z", GoOs:"linux", GoArch:"amd64"} $ go version go version go1.18.2 linux/amd64
Just to note again, we do NOT see this issue when we install an FBC backed catalog image, which is virtually the same configs directory across architectures, for amd64 and s390x. This very much seems isolated to ppc64le, which I tested both on OCP 4.6.x and 4.10.x clusters.
Be curious if the 4.11 FBC catalogs for RedHat also have this issue on a ppc64le cluster?
@jkeister Is this a potential duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=2093288?
This BZ came first as an early-adopter identified the issue, but it took us some time to realize the nature of the issue. We spun https://bugzilla.redhat.com/show_bug.cgi?id=2092464 to update the marketplace catalogs so we could get another source of diagnostic info to inform this investigation. That had to be reverted, and Vu chose to open a new BZ for the downstream fix instead of using this BZ. I don't want to lose the record of this, but all the remaining work is being done in https://bugzilla.redhat.com/show_bug.cgi?id=2093288 When that is closed, we will also close this one.
Looks like the related bug is going to merge soon. I'm marking this as a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=2093288. *** This bug has been marked as a duplicate of bug 2093288 ***