2084213 – FBC catalog liveness and readiness probe failed to connect service :50051

Bug 2084213 - FBC catalog liveness and readiness probe failed to connect service :50051

Summary: FBC catalog liveness and readiness probe failed to connect service :50051

Keywords:
Status:	CLOSED DUPLICATE of bug 2093288
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	OLM
Sub Component:
Version:	4.10
Hardware:	ppc64le
OS:	Linux
Priority:	high
Severity:	high
Target Milestone:	---
Target Release:	4.11.0
Assignee:	jkeister
QA Contact:	Jian Zhang
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+	depends on / blocked

Reported:	2022-05-11 16:39 UTC by jdockter
Modified:	2022-06-13 19:53 UTC (History)
CC List:	3 users (show)
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2022-06-13 19:53:01 UTC
Target Upstream Version:
Embargoed:

Attachments	(Terms of Use)
FBC content and dockerfiles (2.15 MB, application/gzip) 2022-05-11 16:39 UTC, jdockter	no flags	Details
View All

Description jdockter 2022-05-11 16:39:35 UTC

Created attachment 1878707 [details]
FBC content and dockerfiles

Description of problem:
Testing build of FBC catalog image for mutli-arch clusters.  Using docker buildx to emulate each architecture which we have been doing for a long time to build sqlite backed catalogs.  The following dockerfile was used to build the catalog for ppc

FROM --platform=linux/ppc64le registry.redhat.io/openshift4/ose-operator-registry:v4.10.0 AS builder
FROM --platform=linux/ppc64le registry.redhat.io/ubi8/ubi-minimal
LABEL operators.operatorframework.io.index.configs.v1=/configs
LABEL ibm.operator.catalog.version=000-v1.24-fbc-20220510.113121-42AD3F05B
### Required OpenShift Labels 
LABEL name="IBM Operator Catalog"     maintainer="jdockter.com"     vendor="IBM"     version="000-v1.24-fbc-20220510.113121-42AD3F05B"     opm_version="v1.21.0"     release="20220510.113121-42AD3F05B"     summary="This is the IBM Operator Catalog for use with Red Hat OpenShift 4."     description="This catalog contains operators for IBM product offerings."
COPY ppc64le /configs
COPY --from=builder /bin/opm/ /bin/opm
COPY --from=builder /bin/grpc_health_probe /bin/grpc_health_probe
### add licenses to this directory
RUN mkdir /licenses
COPY licenses/LICENSE /licenses
WORKDIR /tmp
ARG user=1001
RUN microdnf clean all && chmod -vR g=u /etc/passwd
USER ${user}
ENTRYPOINT ["/bin/opm"]
CMD ["serve", "/configs"]

I've copied the manifest list and associated images to public location as 'icr.io/cpopen/ibm-operator-catalog:fbc-latest'
The catalog source below can be used to test.

apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
  name: ibm-operator-catalog
  namespace: openshift-marketplace
spec:
  displayName: IBM Operator Catalog
  publisher: IBM
  sourceType: grpc
  image: icr.io/cpopen/ibm-operator-catalog:fbc-latest


Version-Release number of selected component (if applicable):


How reproducible:
All the time based off of dockerfile above.  Tested on ppc64le OCP 4.6.53 and OCP 4.10.12 and get same errors.

Steps to Reproduce:
1. Untar catalog.tar.gz attached containing 'catalog-test' directory of FBC content and dockerfiles for each architecture
2. Run build with docker buildx or other build tools to emulate ppc64le build, example command

docker buildx build --no-cache --pull --push --platform linux/ppc64le -f index-fbc-v4.11-linux.amd64.dockerfile -t <some-registry.com/fbc-image:latest> . 

3. Create catalog source with 'some-registry.com/fbc-image:latest' image referenced above

Actual results:

$ oc project
Using project "openshift-marketplace" on server "https://api.catalog-tester-p.cp.fyre.ibm.com:6443".

$ oc get pods
NAME                                    READY   STATUS    RESTARTS   AGE
certified-operators-t48h6               1/1     Running   0          55m
community-operators-htmxl               1/1     Running   0          53m
ibm-operator-catalog-xjncd              0/1     Running   5          3m40s
marketplace-operator-775c64b5d9-cqlmn   1/1     Running   0          48m
redhat-marketplace-pvhjl                1/1     Running   0          53m
redhat-operators-fm2x5                  1/1     Running   0          48m

$ oc describe pod/ibm-operator-catalog-xjncd
...

Events:
  Type     Reason          Age                 From               Message
  ----     ------          ----                ----               -------
  Normal   Scheduled       2m9s                default-scheduler  Successfully assigned openshift-marketplace/ibm-operator-catalog-xjncd to worker2.catalog-tester-p.cp.fyre.ibm.com
  Normal   AddedInterface  2m6s                multus             Add eth0 [10.254.16.24/22]
  Normal   Pulled          2m4s                kubelet            Successfully pulled image "icr.io/cpopen/ibm-operator-catalog:fbc-latest" in 1.369921548s
  Warning  Unhealthy       113s                kubelet            Liveness probe failed: timeout: failed to connect service ":50051" within 1s
  Normal   Pulled          92s                 kubelet            Successfully pulled image "icr.io/cpopen/ibm-operator-catalog:fbc-latest" in 1.412117469s
  Normal   Started         91s (x2 over 2m4s)  kubelet            Started container registry-server
  Warning  Unhealthy       91s                 kubelet            Readiness probe errored: rpc error: code = NotFound desc = container is not created or running: checking if PID of 4b0edaf071f99c244afbe47e650be7f9e3416ad93044eab0566a055ab277f426 is running failed: container process not found
  Warning  Unhealthy       60s (x5 over 110s)  kubelet            Readiness probe failed: timeout: failed to connect service ":50051" within 1s
  Normal   Pulling         53s (x3 over 2m6s)  kubelet            Pulling image "icr.io/cpopen/ibm-operator-catalog:fbc-latest"
  Warning  Unhealthy       53s (x5 over 103s)  kubelet            Liveness probe failed: command timed out
  Normal   Killing         53s (x2 over 93s)   kubelet            Container registry-server failed liveness probe, will be restarted
  Normal   Pulled          52s                 kubelet            Successfully pulled image "icr.io/cpopen/ibm-operator-catalog:fbc-latest" in 1.304201103s
  Normal   Created         51s (x3 over 2m4s)  kubelet            Created container registry-server


Expected results:
On amd64 and s390x clusters I see the registry-server container start up with a log message of 

time="2022-05-10T21:53:27Z" level=info msg="serving registry" configs=/configs port=50051

Additional info:

Comment 1 jdockter 2022-05-16 16:08:23 UTC

Any update on this bug?

Comment 2 jkeister 2022-05-17 21:34:38 UTC

Hi Jon, 

Above you listed the build command as 

docker buildx build --no-cache --pull --push --platform linux/ppc64le -f index-fbc-v4.11-linux.amd64.dockerfile -t <some-registry.com/fbc-image:latest> . 

which specifies the platform as linux/ppc64lr, but references the dockerfile as index-fbc-v4.11-linux.amd64.dockerfile, which is the amd64 one.  Is that intentional?  

I note that when I run that command, I am able to run the resulting image, but when both platform and dockerfile agree on arch, then it (correctly) tells me I'm on the wrong architecture and fails. 

Cheers,

-j

Comment 3 jdockter 2022-05-17 22:35:06 UTC

Sorry that was just a typo, should be:

docker buildx build --no-cache --pull --push --platform linux/ppc64le -f index-fbc-v4.11-linux.ppc64le.dockerfile -t <some-registry.com/fbc-image:latest> .

Comment 4 jdockter 2022-05-17 22:41:02 UTC

Note...the dockerfile should be pulling the linux/ppc64le version of the FROM images

FROM --platform=linux/ppc64le registry.redhat.io/openshift4/ose-operator-registry:v4.10.0 AS builder
FROM --platform=linux/ppc64le registry.redhat.io/ubi8/ubi-minimal

Comment 5 Per da Silva 2022-05-18 11:04:27 UTC

Setting is a non-blocker since FBCs aren't being widely used yet

Comment 6 jdockter 2022-05-18 13:17:06 UTC

Sorry not sure I follow.  What is the next step in getting a resolution for this?

Comment 7 jkeister 2022-05-18 17:45:07 UTC

I'm waiting on a ticket to get access to the infrastructure that I need to reproduce/investigate. 
Since you confirm that the arch conflict typo was only on bug submission and not in use, my working theory is that /bin/grpc_health_probe is not the correct architecture for the ppc64le target due to a failure in the tooling. 
If you have the image available on that arch, can you verify?  

-j

Comment 8 jkeister 2022-05-18 18:14:07 UTC

That didn't pan out.  I was able to examine the executable and confirmed the hypothesis was incorrect. 

% file /tmp/grpc_health_probe
/tmp/grpc_health_probe: ELF 64-bit LSB executable, 64-bit PowerPC or cisco 7500, OpenPOWER ELF V2 ABI, version 1 (SYSV), statically linked, Go BuildID=rcGh5Z2pwZi2Vx--ZkCY/GlMuqpgNUuhookqhBxWq/aOJcwvkRSvzEgSnY8PHI/gbtmk8lNXOOfkt_07pmM, not stripped

So still waiting on the infrastructure to support a deeper dive. :(

Comment 9 jdockter 2022-05-18 20:46:12 UTC

Also tried build with the following dockerfile and it did not work as well

# The base image is expected to contain
# /bin/opm (with a serve subcommand) and /bin/grpc_health_probe
FROM --platform=linux/ppc64le quay.io/operator-framework/opm:latest

# Configure the entrypoint and command
ENTRYPOINT ["/bin/opm"]
CMD ["serve", "/configs"]

# Copy declarative config root into image at /configs
ADD ppc64le /configs

# Set DC-specific label for the location of the DC root directory
# in the image
LABEL operators.operatorframework.io.index.configs.v1=/configs


Same Readiness/Liveness probe error

Comment 10 jkeister 2022-05-24 18:31:04 UTC

Reached out in slack, but capturing here, too.  Noted that the FBC dockerfile does not contain the "EXPOSE 50051" line that the SQLite dockerfile has.  This port is needed for grpc communication.  
Please add it to your FBC dockerfile and try.

Comment 11 jkeister 2022-05-24 18:46:08 UTC

i.e. 

% cat index-fbc-v4.11-linux.ppc64le.dockerfile
FROM --platform=linux/ppc64le registry.redhat.io/openshift4/ose-operator-registry:v4.10.0 AS builder
FROM --platform=linux/ppc64le registry.redhat.io/ubi8/ubi-minimal
LABEL operators.operatorframework.io.index.configs.v1=/configs
LABEL ibm.operator.catalog.version=000-v1.24-fbc-20220510.113121-42AD3F05B
### Required OpenShift Labels
LABEL name="IBM Operator Catalog"     maintainer="jdockter.com"     vendor="IBM"     version="000-v1.24-fbc-20220510.113121-42AD3F05B"     opm_version="v1.21.0"     release="20220510.113121-42AD3F05B"     summary="This is the IBM Operator Catalog for use with Red Hat OpenShift 4."     description="This catalog contains operators for IBM product offerings."
COPY ppc64le /configs
COPY --from=builder /bin/opm/ /bin/opm
COPY --from=builder /bin/grpc_health_probe /bin/grpc_health_probe
### add licenses to this directory
RUN mkdir /licenses
COPY licenses/LICENSE /licenses
WORKDIR /tmp
ARG user=1001
RUN microdnf clean all && chmod -vR g=u /etc/passwd
EXPOSE 50051
USER ${user}
ENTRYPOINT ["/bin/opm"]
CMD ["serve", "/configs"]

Comment 12 jdockter 2022-05-24 20:40:09 UTC

Yes I had tried that as well, same error.  Also to note that is not required in the upstream opm dockerfile listed, nor did I need it called out for the amd64 or s390x dockerfile of which both work.

Comment 13 jdockter 2022-05-25 16:20:20 UTC

if I describe the pod I see this

Containers:
  registry-server:
    Container ID:   cri-o://97115188b277fbf5be0cc3e681315c8a43ebdef68b9abed51d596e39b3220a62
    Image:          cp.stg.icr.io/cp/ibm-operator-catalog@sha256:faa255413fdface26958bddc19bc2aa4ca063f8145d2879fee75da907126681b
    Image ID:       cp.stg.icr.io/cp/ibm-operator-catalog@sha256:faa255413fdface26958bddc19bc2aa4ca063f8145d2879fee75da907126681b
    Port:           50051/TCP
    Host Port:      0/TCP
    State:          Running
      Started:      Wed, 25 May 2022 11:18:41 -0500
    Last State:     Terminated
      Reason:       Error
      Exit Code:    2
      Started:      Wed, 25 May 2022 11:18:02 -0500
      Finished:     Wed, 25 May 2022 11:18:40 -0500
    Ready:          False
    Restart Count:  1
    Requests:
      cpu:        10m
      memory:     50Mi
    Liveness:     exec [grpc_health_probe -addr=:50051] delay=10s timeout=5s period=10s #success=1 #failure=3
    Readiness:    exec [grpc_health_probe -addr=:50051] delay=5s timeout=5s period=10s #success=1 #failure=3
    Environment:  <none>

Comment 14 jdockter 2022-05-25 22:15:22 UTC

Captured catalog pod yaml, removed readiness and liveness probe, and was able to get the registry to start after about 15 or 20 seconds waiting

time="2022-05-25T21:23:35Z" level=info msg="serving registry" configs=/configs port=50051

Comment 15 jdockter 2022-05-26 12:46:54 UTC

The readiness and liveness probe were the problems.  What is the next step to get a fix for this?

Comment 16 jkeister 2022-05-26 21:05:18 UTC

Recording the updates here.  As Jon indicated above, the pod readiness check fails at 5s.  From observation, it appears that the pod takes 15-20s to get ready, so disabling those probes works around his issue. 

The only log from the pod was the 'serving registry' message above.  

Diving a little deeper, just an `opm validate` operation takes 14s to complete on the build (x86-64) hardware in this environment.  This is in stark contrast to my own tests on x86-64 MacOS, x86-64 Fedora 36, x86-64 ubuntu 18LTS guest on Fedora 36 which all completed in less than half a second.  My ppc64le (emulated via qemu on Fedora 36) testing took much longer, at 6s, but still not in the same realm as observed for this issue. 

We thought that perhaps the tooling versions were responsible for the poor performance, but is inconsistent with the testing results, tabulated at the end of this message.  

While I don't dispute that the environment is problematic, OCS is behaving properly given the default constraints. It doesn't seem appropriate to shift the timeout to fit the 15-20s observed window. It also doesn't make sense to consider this a release blocker, given that we cannot come close to reproducing the scenario, even under adverse conditions (emulating the hardware in question). 
It definitely makes sense to spend some time working to reduce the readiness time of `opm serve`.  

I suggest that we retitle this BZ "optimize opm serve readiness pipeline for FBC" and change the priority/severity to MEDIUM/MEDIUM, clearing any blocking flags.



TESTING RESULTS

-----------------
Jon's validation:
-----------------
$ uname -a && time opm validate ppc64le/
Linux coc-devops-builder1.fyre.ibm.com 4.15.0-176-generic #185-Ubuntu SMP Tue Mar 29 17:40:04 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

real    0m14.040s
user    0m14.202s
sys     0m0.339s

$ opm version
Version: version.Version{OpmVersion:"v1.22.0", GitCommit:"fd85a98c", BuildDate:"2022-05-12T20:53:46Z", GoOs:"linux", GoArch:"amd64"}

----------------------------------------------
ppc64le simulated environment, for replication
----------------------------------------------
bash-4.4$ uname -a && time /bin/opm validate /configs/datapower-operator/
Linux 8cd82bcd85db 5.17.9-300.fc36.ppc64le #1 SMP Wed May 18 14:50:24 UTC 2022 ppc64le ppc64le ppc64le GNU/Linux

real    0m4.325s
user    0m5.014s
sys     0m0.531s

bash-4.4$ /bin/opm version
Version: version.Version{OpmVersion:"1cb0c9a57", GitCommit:"1cb0c9a57affcc6d471b483ab34b627430677f09", BuildDate:"2022-04-21T14:15:12Z", GoOs:"linux", GoArch:"ppc64le"}

----------------------------
MacOS 12.4 (Monterey) x86-64 
----------------------------
% uname -a  && time ./bin/opm validate ~/devel/test-index/ppc64le
Darwin Jordan-MBP.local 21.5.0 Darwin Kernel Version 21.5.0: Tue Apr 26 21:08:22 PDT 2022; root:xnu-8020.121.3~4/RELEASE_X86_64 x86_64
./bin/opm validate ~/devel/test-index/ppc64le  0.37s user 0.04s system 88% cpu 0.457 total
% ./bin/opm version
Version: version.Version{OpmVersion:"v1.21.0-7-g8ef648f8", GitCommit:"8ef648f8", BuildDate:"2022-05-26T17:58:21Z", GoOs:"darwin", GoArch:"amd64"}

----------------------------------
Fedora 36 x86-64 (i7-10850H intel)
----------------------------------
$ uname -a && time ./bin/opm validate ~/devel/test-index/ppc64le/ && ./bin/opm version
Linux hatbox 5.17.6-300.fc36.x86_64 #1 SMP PREEMPT Mon May 9 15:47:11 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

real	0m0.302s
user	0m0.322s
sys	0m0.020s
Version: version.Version{OpmVersion:"v1.21.0-11-g0d111cd8", GitCommit:"0d111cd8", BuildDate:"2022-05-26T18:53:35Z", GoOs:"linux", GoArch:"amd64"}

--------------------------------------------------
Ubuntu 18.04.6 LTS (kvm guest on Fedora 36, above)
--------------------------------------------------
$ uname -a && time ./bin/opm validate ~/devel/test-index/ppc64le/
Linux ubuntu-18-lts-vm 4.15.0-180-generic #189-Ubuntu SMP Wed May 18 14:13:57 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

real	0m0.325s
user	0m0.334s
sys	0m0.012s
$ ./bin/opm version
Version: version.Version{OpmVersion:"v1.22.1-3-g14368169", GitCommit:"14368169", BuildDate:"2022-05-26T20:21:42Z", GoOs:"linux", GoArch:"amd64"}
$ go version
go version go1.18.2 linux/amd64

Comment 17 jdockter 2022-05-26 21:58:49 UTC

Just to note again, we do NOT see this issue when we install an FBC backed catalog image, which is virtually the same configs directory across architectures, for amd64 and s390x.  This very much seems isolated to ppc64le, which I tested both on OCP 4.6.x and 4.10.x clusters.

Comment 18 jdockter 2022-05-26 21:59:48 UTC

Be curious if the 4.11 FBC catalogs for RedHat also have this issue on a ppc64le cluster?

Comment 19 tflannag 2022-06-07 15:16:13 UTC

@jkeister Is this a potential duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=2093288?

Comment 20 jkeister 2022-06-07 15:48:58 UTC

This BZ came first as an early-adopter identified the issue, but it took us some time to realize the nature of the issue.  
We spun https://bugzilla.redhat.com/show_bug.cgi?id=2092464 to update the marketplace catalogs so we could get another source of diagnostic info to inform this investigation. 
That had to be reverted, and Vu chose to open a new BZ for the downstream fix instead of using this BZ.  

I don't want to lose the record of this, but all the remaining work is being done in https://bugzilla.redhat.com/show_bug.cgi?id=2093288
When that is closed, we will also close this one.

Comment 23 Kevin Rizza 2022-06-13 19:53:01 UTC

Looks like the related bug is going to merge soon. I'm marking this as a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=2093288.

*** This bug has been marked as a duplicate of bug 2093288 ***

Note You need to log in before you can comment on or make changes to this bug.