Bug 2084213
| Summary: | FBC catalog liveness and readiness probe failed to connect service :50051 | ||||||
|---|---|---|---|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | jdockter | ||||
| Component: | OLM | Assignee: | jkeister | ||||
| OLM sub component: | OLM | QA Contact: | Jian Zhang <jiazha> | ||||
| Status: | CLOSED DUPLICATE | Docs Contact: | |||||
| Severity: | high | ||||||
| Priority: | high | CC: | jkeister, krizza, tflannag | ||||
| Version: | 4.10 | ||||||
| Target Milestone: | --- | ||||||
| Target Release: | 4.11.0 | ||||||
| Hardware: | ppc64le | ||||||
| OS: | Linux | ||||||
| Whiteboard: | |||||||
| Fixed In Version: | Doc Type: | If docs needed, set a value | |||||
| Doc Text: | Story Points: | --- | |||||
| Clone Of: | Environment: | ||||||
| Last Closed: | 2022-06-13 19:53:01 UTC | Type: | Bug | ||||
| Regression: | --- | Mount Type: | --- | ||||
| Documentation: | --- | CRM: | |||||
| Verified Versions: | Category: | --- | |||||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||
| Cloudforms Team: | --- | Target Upstream Version: | |||||
| Embargoed: | |||||||
Description
jdockter
2022-05-11 16:39:35 UTC
Any update on this bug?

Hi Jon,

Above you listed the build command as

docker buildx build --no-cache --pull --push --platform linux/ppc64le -f index-fbc-v4.11-linux.amd64.dockerfile -t <some-registry.com/fbc-image:latest> .

which specifies the platform as linux/ppc64le but references the dockerfile index-fbc-v4.11-linux.amd64.dockerfile, which is the amd64 one. Is that intentional? I note that when I run that command, I am able to run the resulting image, but when both platform and dockerfile agree on arch, it (correctly) tells me I'm on the wrong architecture and fails.

Cheers,
-j

Sorry, that was just a typo. It should be:

docker buildx build --no-cache --pull --push --platform linux/ppc64le -f index-fbc-v4.11-linux.ppc64le.dockerfile -t <some-registry.com/fbc-image:latest> .

Note: the dockerfile should be pulling the linux/ppc64le version of the FROM images:

FROM --platform=linux/ppc64le registry.redhat.io/openshift4/ose-operator-registry:v4.10.0 AS builder
FROM --platform=linux/ppc64le registry.redhat.io/ubi8/ubi-minimal

Setting this as a non-blocker since FBCs aren't being widely used yet.

Sorry, not sure I follow. What is the next step in getting a resolution for this?

I'm waiting on a ticket to get access to the infrastructure that I need to reproduce and investigate. Since you confirm that the arch conflict typo was only in the bug submission and not in use, my working theory is that /bin/grpc_health_probe is not the correct architecture for the ppc64le target due to a failure in the tooling. If you have the image available on that arch, can you verify?
-j

That didn't pan out. I was able to examine the executable and confirmed the hypothesis was incorrect.
% file /tmp/grpc_health_probe
/tmp/grpc_health_probe: ELF 64-bit LSB executable, 64-bit PowerPC or cisco 7500, OpenPOWER ELF V2 ABI, version 1 (SYSV), statically linked, Go BuildID=rcGh5Z2pwZi2Vx--ZkCY/GlMuqpgNUuhookqhBxWq/aOJcwvkRSvzEgSnY8PHI/gbtmk8lNXOOfkt_07pmM, not stripped

So still waiting on the infrastructure to support a deeper dive. :(

I also tried building with the following dockerfile, and it did not work either:

# The base image is expected to contain
# /bin/opm (with a serve subcommand) and /bin/grpc_health_probe
FROM --platform=linux/ppc64le quay.io/operator-framework/opm:latest

# Configure the entrypoint and command
ENTRYPOINT ["/bin/opm"]
CMD ["serve", "/configs"]

# Copy declarative config root into image at /configs
ADD ppc64le /configs

# Set DC-specific label for the location of the DC root directory
# in the image
LABEL operators.operatorframework.io.index.configs.v1=/configs

Same Readiness/Liveness probe error.

Reached out in Slack, but capturing here, too. I noted that the FBC dockerfile does not contain the "EXPOSE 50051" line that the SQLite dockerfile has. This port is needed for gRPC communication. Please add it to your FBC dockerfile and try, i.e.
% cat index-fbc-v4.11-linux.ppc64le.dockerfile
FROM --platform=linux/ppc64le registry.redhat.io/openshift4/ose-operator-registry:v4.10.0 AS builder
FROM --platform=linux/ppc64le registry.redhat.io/ubi8/ubi-minimal
LABEL operators.operatorframework.io.index.configs.v1=/configs
LABEL ibm.operator.catalog.version=000-v1.24-fbc-20220510.113121-42AD3F05B
### Required OpenShift Labels
LABEL name="IBM Operator Catalog" maintainer="jdockter.com" vendor="IBM" version="000-v1.24-fbc-20220510.113121-42AD3F05B" opm_version="v1.21.0" release="20220510.113121-42AD3F05B" summary="This is the IBM Operator Catalog for use with Red Hat OpenShift 4." description="This catalog contains operators for IBM product offerings."
COPY ppc64le /configs
COPY --from=builder /bin/opm/ /bin/opm
COPY --from=builder /bin/grpc_health_probe /bin/grpc_health_probe
### add licenses to this directory
RUN mkdir /licenses
COPY licenses/LICENSE /licenses
WORKDIR /tmp
ARG user=1001
RUN microdnf clean all && chmod -vR g=u /etc/passwd
EXPOSE 50051
USER ${user}
ENTRYPOINT ["/bin/opm"]
CMD ["serve", "/configs"]
Yes, I had tried that as well; same error. Also note that EXPOSE is not required in the upstream opm dockerfile listed, nor did I need it for the amd64 or s390x dockerfiles, both of which work.

If I describe the pod, I see this:
Containers:
  registry-server:
    Container ID:   cri-o://97115188b277fbf5be0cc3e681315c8a43ebdef68b9abed51d596e39b3220a62
    Image:          cp.stg.icr.io/cp/ibm-operator-catalog@sha256:faa255413fdface26958bddc19bc2aa4ca063f8145d2879fee75da907126681b
    Image ID:       cp.stg.icr.io/cp/ibm-operator-catalog@sha256:faa255413fdface26958bddc19bc2aa4ca063f8145d2879fee75da907126681b
    Port:           50051/TCP
    Host Port:      0/TCP
    State:          Running
      Started:      Wed, 25 May 2022 11:18:41 -0500
    Last State:     Terminated
      Reason:       Error
      Exit Code:    2
      Started:      Wed, 25 May 2022 11:18:02 -0500
      Finished:     Wed, 25 May 2022 11:18:40 -0500
    Ready:          False
    Restart Count:  1
    Requests:
      cpu:          10m
      memory:       50Mi
    Liveness:       exec [grpc_health_probe -addr=:50051] delay=10s timeout=5s period=10s #success=1 #failure=3
    Readiness:      exec [grpc_health_probe -addr=:50051] delay=5s timeout=5s period=10s #success=1 #failure=3
    Environment:    <none>
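The probe settings in the describe output can be turned into a rough timeline. The sketch below is an idealized model, not the kubelet's actual scheduler: it assumes the first probe fires exactly at the initial delay and then once per period, and it ignores jitter, the 5s probe timeout, and the termination grace period. The `Probe` class and the function names are hypothetical and exist only for illustration; the numeric settings are copied from the Liveness and Readiness lines above.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Probe:
    """Probe settings as printed by the pod description (seconds)."""
    initial_delay_s: int
    period_s: int
    failure_threshold: int


def probe_times(probe: Probe, horizon_s: int) -> List[int]:
    """Idealized probe start times: at the initial delay, then every period."""
    return list(range(probe.initial_delay_s, horizon_s + 1, probe.period_s))


def liveness_restart_time(probe: Probe, ready_after_s: int,
                          horizon_s: int = 600) -> Optional[int]:
    """Time of the probe that triggers a restart, or None if the server
    answers before failure_threshold consecutive probes have failed."""
    failures = 0
    for t in probe_times(probe, horizon_s):
        if t >= ready_after_s:
            return None  # first successful probe; the model stops here
        failures += 1
        if failures >= probe.failure_threshold:
            return t
    return None


# Values from the describe output: delay, period, #failure.
liveness = Probe(initial_delay_s=10, period_s=10, failure_threshold=3)
readiness = Probe(initial_delay_s=5, period_s=10, failure_threshold=3)

if __name__ == "__main__":
    print("readiness probes in first 30s:", probe_times(readiness, 30))
    for startup in (15, 20, 35):
        print(f"startup {startup}s -> restart at",
              liveness_restart_time(liveness, startup))
```

Under these assumptions the third consecutive liveness failure lands about 30s after container start. For comparison, the Last State timestamps above show the previous container running from 11:18:02 to 11:18:40, about 38 seconds.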
I captured the catalog pod YAML, removed the readiness and liveness probes, and was able to get the registry to start after about 15 or 20 seconds of waiting:

time="2022-05-25T21:23:35Z" level=info msg="serving registry" configs=/configs port=50051

The readiness and liveness probes were the problem. What is the next step to get a fix for this?

Recording the updates here. As Jon indicated above, the pod readiness check fails at 5s. From observation, it appears that the pod takes 15-20s to get ready, so disabling those probes works around his issue.
The only log from the pod was the 'serving registry' message above.
Diving a little deeper, just an `opm validate` operation takes 14s to complete on the build (x86-64) hardware in this environment. This is in stark contrast to my own tests on x86-64 macOS, x86-64 Fedora 36, and an x86-64 Ubuntu 18.04 LTS guest on Fedora 36, which all completed in less than half a second. My ppc64le testing (emulated via qemu on Fedora 36) took much longer, at 6s, but still not in the same realm as observed for this issue.
We thought that perhaps the tooling versions were responsible for the poor performance, but that is inconsistent with the testing results, tabulated at the end of this message.
While I don't dispute that the environment is problematic, OCS is behaving properly given the default constraints. It doesn't seem appropriate to shift the timeout to fit the 15-20s observed window. It also doesn't make sense to consider this a release blocker, given that we cannot come close to reproducing the scenario, even under adverse conditions (emulating the hardware in question).
It definitely makes sense to spend some time working to reduce the readiness time of `opm serve`.
I suggest that we retitle this BZ "optimize opm serve readiness pipeline for FBC" and change the priority/severity to MEDIUM/MEDIUM, clearing any blocking flags.
TESTING RESULTS
-----------------
Jon's validation:
-----------------
$ uname -a && time opm validate ppc64le/
Linux coc-devops-builder1.fyre.ibm.com 4.15.0-176-generic #185-Ubuntu SMP Tue Mar 29 17:40:04 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
real 0m14.040s
user 0m14.202s
sys 0m0.339s
$ opm version
Version: version.Version{OpmVersion:"v1.22.0", GitCommit:"fd85a98c", BuildDate:"2022-05-12T20:53:46Z", GoOs:"linux", GoArch:"amd64"}
----------------------------------------------
ppc64le simulated environment, for replication
----------------------------------------------
bash-4.4$ uname -a && time /bin/opm validate /configs/datapower-operator/
Linux 8cd82bcd85db 5.17.9-300.fc36.ppc64le #1 SMP Wed May 18 14:50:24 UTC 2022 ppc64le ppc64le ppc64le GNU/Linux
real 0m4.325s
user 0m5.014s
sys 0m0.531s
bash-4.4$ /bin/opm version
Version: version.Version{OpmVersion:"1cb0c9a57", GitCommit:"1cb0c9a57affcc6d471b483ab34b627430677f09", BuildDate:"2022-04-21T14:15:12Z", GoOs:"linux", GoArch:"ppc64le"}
----------------------------
MacOS 12.4 (Monterey) x86-64
----------------------------
% uname -a && time ./bin/opm validate ~/devel/test-index/ppc64le
Darwin Jordan-MBP.local 21.5.0 Darwin Kernel Version 21.5.0: Tue Apr 26 21:08:22 PDT 2022; root:xnu-8020.121.3~4/RELEASE_X86_64 x86_64
./bin/opm validate ~/devel/test-index/ppc64le 0.37s user 0.04s system 88% cpu 0.457 total
% ./bin/opm version
Version: version.Version{OpmVersion:"v1.21.0-7-g8ef648f8", GitCommit:"8ef648f8", BuildDate:"2022-05-26T17:58:21Z", GoOs:"darwin", GoArch:"amd64"}
----------------------------------
Fedora 36 x86-64 (i7-10850H intel)
----------------------------------
$ uname -a && time ./bin/opm validate ~/devel/test-index/ppc64le/ && ./bin/opm version
Linux hatbox 5.17.6-300.fc36.x86_64 #1 SMP PREEMPT Mon May 9 15:47:11 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
real 0m0.302s
user 0m0.322s
sys 0m0.020s
Version: version.Version{OpmVersion:"v1.21.0-11-g0d111cd8", GitCommit:"0d111cd8", BuildDate:"2022-05-26T18:53:35Z", GoOs:"linux", GoArch:"amd64"}
--------------------------------------------------
Ubuntu 18.04.6 LTS (kvm guest on Fedora 36, above)
--------------------------------------------------
$ uname -a && time ./bin/opm validate ~/devel/test-index/ppc64le/
Linux ubuntu-18-lts-vm 4.15.0-180-generic #189-Ubuntu SMP Wed May 18 14:13:57 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
real 0m0.325s
user 0m0.334s
sys 0m0.012s
$ ./bin/opm version
Version: version.Version{OpmVersion:"v1.22.1-3-g14368169", GitCommit:"14368169", BuildDate:"2022-05-26T20:21:42Z", GoOs:"linux", GoArch:"amd64"}
$ go version
go version go1.18.2 linux/amd64
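To make the gap between environments easier to see, the wall-clock times above can be expressed as ratios against the fastest run. This is only a sketch: the dictionary labels are shorthand invented here, the numbers are the "real" (or, for macOS, "total") values copied from the results, and the helper name `slowdowns` is hypothetical. Note that the emulated ppc64le run validated /configs/datapower-operator/ rather than the full ppc64le directory, so its ratio is not directly comparable to the others.

```python
from typing import Dict

# Wall-clock seconds copied from the TESTING RESULTS section.
timings_s: Dict[str, float] = {
    "reporter build host (x86_64, opm v1.22.0)": 14.040,
    "ppc64le emulated (qemu, different path)": 4.325,
    "macOS 12.4 x86-64": 0.457,
    "Fedora 36 x86-64": 0.302,
    "Ubuntu 18.04.6 LTS kvm guest": 0.325,
}


def slowdowns(timings: Dict[str, float], baseline: str) -> Dict[str, float]:
    """Return each timing divided by the baseline environment's timing."""
    base = timings[baseline]
    return {name: seconds / base for name, seconds in timings.items()}


ratios = slowdowns(timings_s, "Fedora 36 x86-64")

if __name__ == "__main__":
    # Print slowest first so the outlier is at the top.
    for name, ratio in sorted(ratios.items(), key=lambda kv: kv[1], reverse=True):
        print(f"{name:45s} {timings_s[name]:7.3f}s  {ratio:5.1f}x")
```

On these figures the reporter's x86-64 build host is well over an order of magnitude slower than the other x86-64 machines, and also slower than the emulated ppc64le run.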
Just to note again, we do NOT see this issue when we install an FBC-backed catalog image, which has virtually the same configs directory across architectures, for amd64 and s390x. This very much seems isolated to ppc64le, which I tested on both OCP 4.6.x and 4.10.x clusters. I'd be curious whether the 4.11 FBC catalogs for Red Hat also have this issue on a ppc64le cluster.

@jkeister Is this a potential duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=2093288?

This BZ came first, as an early adopter identified the issue, but it took us some time to realize the nature of the issue. We spun https://bugzilla.redhat.com/show_bug.cgi?id=2092464 to update the marketplace catalogs so we could get another source of diagnostic info to inform this investigation. That had to be reverted, and Vu chose to open a new BZ for the downstream fix instead of using this BZ. I don't want to lose the record of this, but all the remaining work is being done in https://bugzilla.redhat.com/show_bug.cgi?id=2093288. When that is closed, we will also close this one.

Looks like the related bug is going to merge soon. I'm marking this as a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=2093288.

*** This bug has been marked as a duplicate of bug 2093288 ***