Bug 1977241

Summary: registry-server is crashlooping due to liveness probe failing
Product: OpenShift Container Platform
Reporter: Eran Cohen <ercohen>
Component: OLM
Assignee: Kevin Rizza <krizza>
OLM sub component: OLM
QA Contact: Jian Zhang <jiazha>
Status: CLOSED DUPLICATE
Severity: high
Priority: unspecified
CC: rfreiman
Version: 4.8
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Last Closed: 2021-06-29 12:42:42 UTC
Type: Bug

Description Eran Cohen 2021-06-29 09:56:17 UTC
Description of problem:
This issue is blocking SNO CI

On single-node OpenShift, the registry-server keeps failing the liveness probe, which leads to a crash loop:

events:
00:58:54	openshift-marketplace	kubelet	redhat-operators-ghmtz	Created	Created container registry-server
00:58:55	openshift-marketplace	kubelet	redhat-operators-ghmtz	Started	Started container registry-server
00:59:04	openshift-marketplace	kubelet	redhat-operators-ghmtz	Killing	Stopping container registry-server
00:59:04	openshift-marketplace	kubelet	community-operators-6x2kn	Created	Created container registry-server
00:59:05	openshift-marketplace	kubelet	community-operators-6x2kn	Started	Started container registry-server
00:59:16	openshift-marketplace	kubelet	community-operators-6x2kn	Killing	Stopping container registry-server
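
To make the timing in the events above easier to read, here is a minimal Python sketch that computes how long each registry-server container ran between its Started and Killing events. The event tuples are copied from the listing above; the function name and data layout are illustrative assumptions, not part of any OpenShift tooling. The events themselves do not state why the containers were stopped.

```python
from datetime import datetime

# Event rows copied from the report: (time, pod, reason).
EVENTS = [
    ("00:58:54", "redhat-operators-ghmtz", "Created"),
    ("00:58:55", "redhat-operators-ghmtz", "Started"),
    ("00:59:04", "redhat-operators-ghmtz", "Killing"),
    ("00:59:04", "community-operators-6x2kn", "Created"),
    ("00:59:05", "community-operators-6x2kn", "Started"),
    ("00:59:16", "community-operators-6x2kn", "Killing"),
]


def seconds_until_killed(events):
    """Return {pod: seconds between its Started and Killing events}."""
    started = {}
    lifetimes = {}
    for stamp, pod, reason in events:
        t = datetime.strptime(stamp, "%H:%M:%S")
        if reason == "Started":
            started[pod] = t
        elif reason == "Killing" and pod in started:
            lifetimes[pod] = int((t - started.pop(pod)).total_seconds())
    return lifetimes


# Hypothetical helper output for the rows above.
LIFETIMES = seconds_until_killed(EVENTS)
print(LIFETIMES)
```

For these rows, the containers were stopped 9 and 11 seconds after they started.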


Kubelet log:

Jun 26 01:28:19.024704 ip-10-0-130-102 hyperkube[1665]: I0626 01:28:19.024669    1665 kuberuntime_manager.go:683] "Message for Container of pod" containerName="registry-server" containerStatusID={Type:cri-o ID:ddb48f03bc37d335c7d3ff17305c2f57a7b059132f6fcdec2b3e98c465df64f7} pod="openshift-marketplace/certified-operators-f2wrk" containerMessage="Container registry-server failed liveness probe, will be restarted"
Jun 26 01:28:19.024806 ip-10-0-130-102 hyperkube[1665]: I0626 01:28:19.024741    1665 kuberuntime_container.go:720] "Killing container with a grace period override" pod="openshift-marketplace/certified-operators-f2wrk" podUID=a984ce74-619f-4f08-ab75-f7b9ed00551d containerName="registry-server" containerID="cri-o://ddb48f03bc37d335c7d3ff17305c2f57a7b059132f6fcdec2b3e98c465df64f7" gracePeriod=30
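
When triaging a full kubelet journal, it can help to pull out only the liveness-probe failures. The following is a rough Python sketch that does this for the two messages quoted above. It assumes the key="value" layout seen in these lines; the function name and the returned dictionary shape are hypothetical and not part of any kubelet or OpenShift API.

```python
import re

# The two kubelet messages quoted in this report (journald prefix omitted).
KUBELET_LINES = [
    'I0626 01:28:19.024669    1665 kuberuntime_manager.go:683] '
    '"Message for Container of pod" containerName="registry-server" '
    'containerStatusID={Type:cri-o ID:ddb48f03bc37d335c7d3ff17305c2f57a7b059132f6fcdec2b3e98c465df64f7} '
    'pod="openshift-marketplace/certified-operators-f2wrk" '
    'containerMessage="Container registry-server failed liveness probe, will be restarted"',
    'I0626 01:28:19.024741    1665 kuberuntime_container.go:720] '
    '"Killing container with a grace period override" '
    'pod="openshift-marketplace/certified-operators-f2wrk" '
    'podUID=a984ce74-619f-4f08-ab75-f7b9ed00551d containerName="registry-server" '
    'containerID="cri-o://ddb48f03bc37d335c7d3ff17305c2f57a7b059132f6fcdec2b3e98c465df64f7" '
    'gracePeriod=30',
]


def liveness_failure(line):
    """Return namespace, pod and container for a liveness-probe failure line, else None."""
    # Collect only the quoted key="value" pairs; unquoted fields are ignored.
    fields = dict(re.findall(r'(\w+)="([^"]*)"', line))
    if "failed liveness probe" not in fields.get("containerMessage", ""):
        return None
    namespace, _, pod = fields["pod"].partition("/")
    return {"namespace": namespace, "pod": pod, "container": fields["containerName"]}


FAILURES = [f for f in map(liveness_failure, KUBELET_LINES) if f is not None]
print(FAILURES)
```

Only the first message carries a containerMessage field, so only it is reported; the second message records the resulting kill with its grace period.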


This issue is failing a test in the SNO CI job:
https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.8-e2e-aws-single-node/1408576091152453632

https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.8-e2e-aws-single-node/1407488936774733824

Version-Release number of selected component (if applicable):

4.8.0-0.nightly-2021-06-15-181825

* The issue still exists in 4.8.0-0.nightly-2021-06-28-165738

        "name": "operator-marketplace",
        "annotations": {
          "io.openshift.build.commit.id": "e39ff59d5abc3e27effc7b726329d06a37644f2e",
          "io.openshift.build.source-location": "https://github.com/operator-framework/operator-marketplace"
        },
        "from": {
          "kind": "DockerImage",
          "name": "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:3e3a8102eac7bac8dc67fb5303e7f842e4fbe44f4bf30c08178442e6ad68312d"


        "name": "operator-registry",
        "annotations": {
          "io.openshift.build.commit.id": "f25f670c03e849ba0fd53a56daa0d8a697f68d16",
          "io.openshift.build.source-location": "https://github.com/openshift/operator-framework-olm"
        },
        "from": {
          "kind": "DockerImage",
          "name": "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:0e2b9a90811f8a25e6cc9c6c825098861e8d70004c19e5f8f9e0f5527c8d99be"



How reproducible:

100%
The following test fails in CI due to this issue:
https://testgrid.k8s.io/redhat-single-node#periodic-ci-openshift-release-master-nightly-4.8-e2e-aws-single-node&show-stale-tests=&include-filter-by-regex=AlertmanagerReceiversNotConfigured&include-filter-by-regex=%20Alerts%20shouldn't%20report%20any%20alerts%20in%20firing%20or%20pending%20state%20apart%20from%20Watchdog%20and%20AlertmanagerReceiversNotConfigured
 

Steps to Reproduce:
1. Run the 4.8-e2e-aws-single-node job

Actual results:
The marketplace registry-server is in a crash loop.

Expected results:
Expected the pod not to crash

Additional info:
This issue is prominent in the single-node e2e job, but it is also happening across a bunch of other 4.8 suites:
https://search.ci.openshift.org/chart?search=KubePodCrashLooping.*registry-server&maxAge[…]=&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job

Comment 1 Kevin Rizza 2021-06-29 12:42:42 UTC

*** This bug has been marked as a duplicate of bug 1976326 ***