1873012 – CrashLoopBackoff occurrence in community-operators

Bug 1873012 - CrashLoopBackoff occurrence in community-operators

Summary: CrashLoopBackoff occurrence in community-operators

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	OLM
Sub Component:
Version:	4.4
Hardware:	Unspecified
OS:	Unspecified
Priority:	high
Severity:	urgent
Target Milestone:	---
Target Release:	4.4.z
Assignee:	Kevin Rizza
QA Contact:	Bruno Andrade
Docs Contact:
URL:
Whiteboard:
Depends On:	1873546
Blocks:

Reported:	2020-08-27 07:15 UTC by Jaspreet Kaur
Modified:	2024-03-25 16:22 UTC
CC List:	10 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Clones:	1873545
Environment:
Last Closed:	2021-01-05 15:21:39 UTC
Target Upstream Version:
Embargoed:


Links
System	ID	Private	Priority	Status	Summary	Last Updated
Github	operator-framework operator-marketplace pull 345	0	None	closed	[release-4.4] Bug 1873012: Increase registry delay for large payloads	2021-02-17 01:33:26 UTC
Red Hat Product Errata	RHBA-2020:4063	0	None	None	None	2020-10-13 08:18:17 UTC

Internal Links: 1946364

Description Jaspreet Kaur 2020-08-27 07:15:37 UTC

Description of problem: After upgrading to 4.4.6, the community and certified operator pods are crash-looping. The liveness and readiness probes fail with:

kind: Event
  lastTimestamp: "2020-08-26T10:50:53Z"
  message: |
    Readiness probe failed: timeout: failed to connect service "localhost:50051" within 1s
  metadata:
    creationTimestamp: "2020-08-26T10:31:43Z"
    name: certified-operators-5bcb56768c-ccj2d.162ecacdee3fa4aa
    namespace: openshift-marketplace
    resourceVersion: "705436"
    selfLink: /api/v1/namespaces/openshift-marketplace/events/certified-operators-5bcb56768c-ccj2d.162ecacdee3fa4aa
    uid: a1a0ca25-6ca7-4714-9ace-de7e8b975ac6
  reason: Unhealthy
  reportingComponent: ""
  reportingInstance: ""
  source:
    component: kubelet
    host: infra0.ao.example.com
  type: Warning

certified-operators-5bcb56768c-64fxs    0/1     CrashLoopBackOff   27         118m
certified-operators-cddd74b58-k86fv     0/1     Running            6          14m
community-operators-698654bb96-zd4s6    0/1     CrashLoopBackOff   13         51m
community-operators-786f694c8d-gl7bj    0/1     Running            6          14m
marketplace-operator-7c4959c648-fwmn7   1/1     Running            0          15m
redhat-marketplace-5874897f8f-527hz     1/1     Running            0          14m
redhat-operators-7d877d5977-jp8wz       1/1     Running            0          14m


Events:
  Type     Reason     Age                   From                            Message
  ----     ------     ----                  ----                            -------
  Normal   Scheduled  30m                   default-scheduler               Successfully assigned openshift-marketplace/community-operators-7b96c7dd85-ssv4w to infra0.ao.example.com
  Normal   Created    30m                   kubelet, infra0.ao.example.com  Created container community-operators
  Normal   Started    30m                   kubelet, infra0.ao.example.com  Started container community-operators
  Warning  Unhealthy  28m (x9 over 29m)     kubelet, infra0.ao.example.com  Readiness probe failed: timeout: failed to connect service "localhost:50051" within 1s
  Normal   Pulled     28m (x2 over 30m)     kubelet, infra0.ao.example.com  Container image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:821853c24977f49986d51cf2a3756dc3d067fc3122c27ef60db9445f67d66c5c" already present on machine
  Normal   Killing    28m                   kubelet, infra0.ao.example.com  Container community-operators failed liveness probe, will be restarted
  Warning  Unhealthy  28m                   kubelet, infra0.ao.example.com  Readiness probe failed:
  Warning  Unhealthy  4m53s (x81 over 29m)  kubelet, infra0.ao.example.com  Liveness probe failed: timeout: failed to connect service "localhost:50051" within 1s
  Warning  BackOff    24s (x43 over 15m)    kubelet, infra0.ao.example.com  Back-off restarting failed container




Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1.
2.
3.

Actual results: Pods are in CrashLoopBackOff state after the upgrade.


Expected results: Pods should be in running state.


Additional info:

Comment 12 Bruno Andrade 2020-09-28 03:59:08 UTC

Marking as VERIFIED:

OCP: 4.4.0-0.nightly-2020-09-26-084423
OLM version: 0.14.2
git commit: 6307c54ea472e772de9d421201ce5a1ef1f7413

oc get pod -n openshift-marketplace -o jsonpath='{range .items[*]}{.metadata.name}{" -- "}{.spec.containers[*].readinessProbe.initialDelaySeconds}{"\n"}{end}'
certified-operators-795fd8965-vrk6n -- 60
community-operators-7797c6cb7b-w9lrl -- 60
marketplace-operator-5bc68bdfb7-5f2nl -- 
qe-app-registry-6796f94cc8-9z9rq -- 60
redhat-marketplace-7d8dcfd6d8-mkn4d -- 60
redhat-operators-96c4d7745-pmfzv -- 60
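
The verification output above can also be checked mechanically. The following is a minimal sketch, not part of the fix itself: the helper names `parse_initial_delays` and `pods_below` are hypothetical, and it assumes one container per pod, so that the jsonpath expression prints a single value per line.

```python
# Sample copied from the `oc get pod ... -o jsonpath=...` output above.
SAMPLE = """\
certified-operators-795fd8965-vrk6n -- 60
community-operators-7797c6cb7b-w9lrl -- 60
marketplace-operator-5bc68bdfb7-5f2nl --
qe-app-registry-6796f94cc8-9z9rq -- 60
redhat-marketplace-7d8dcfd6d8-mkn4d -- 60
redhat-operators-96c4d7745-pmfzv -- 60
"""


def parse_initial_delays(text):
    """Map pod name -> readiness initialDelaySeconds (None if no probe is set)."""
    delays = {}
    for line in text.splitlines():
        if " --" not in line:
            continue  # skip blank or unrelated lines
        name, _, value = line.partition(" --")
        value = value.strip()
        # Assumption: a single container per pod, hence a single integer.
        delays[name.strip()] = int(value) if value else None
    return delays


def pods_below(delays, minimum):
    """Return pods whose readiness delay is set but lower than `minimum`."""
    return sorted(n for n, d in delays.items() if d is not None and d < minimum)


if __name__ == "__main__":
    delays = parse_initial_delays(SAMPLE)
    print("pods below 60s:", pods_below(delays, 60))
```

The marketplace-operator pod prints no value because it defines no readiness probe with an initial delay; the sketch records that as `None` rather than treating it as a failure.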

Comment 15 errata-xmlrpc 2020-10-13 08:17:44 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.4.27 bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4063

Comment 16 Arunabha Banerjee 2020-10-17 15:45:53 UTC

This still persists on version 4.4.27:

stack.hpecloud.org:/home/stack>oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.4.27    True        False         21m     Cluster version is 4.4.27
stack.hpecloud.org:/home/stack>

stack.hpecloud.org:/home/stack>oc get po -n openshift-marketplace
NAME                                    READY   STATUS    RESTARTS   AGE
certified-operators-78957bf87f-znxd8    1/1     Running   0          15m
community-operators-6f45588956-89l9f    0/1     Running   4          11m
community-operators-7957d59f7d-2868l    0/1     Running   5          11m
marketplace-operator-5df598b96b-8xpgc   1/1     Running   0          49m
redhat-marketplace-778757464b-nxwq4     1/1     Running   4          48m
redhat-operators-5745dd5649-t996z       1/1     Running   3          48m
stack.hpecloud.org:/home/stack>

I already followed this article and changed the value. The "CrashLoopBackOff" message is now gone, but the pods never become ready.
https://access.redhat.com/solutions/5388381
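
To pick out the affected pods from a listing like the one above, the table can be filtered on the READY column. This is a minimal sketch under the assumption that the text is the default `oc get po` table (NAME, READY, STATUS, RESTARTS, AGE); the function name `unready_pods` is hypothetical.

```python
# Sample copied from the `oc get po -n openshift-marketplace` output above.
SAMPLE = """\
NAME                                    READY   STATUS    RESTARTS   AGE
certified-operators-78957bf87f-znxd8    1/1     Running   0          15m
community-operators-6f45588956-89l9f    0/1     Running   4          11m
community-operators-7957d59f7d-2868l    0/1     Running   5          11m
marketplace-operator-5df598b96b-8xpgc   1/1     Running   0          49m
redhat-marketplace-778757464b-nxwq4     1/1     Running   4          48m
redhat-operators-5745dd5649-t996z       1/1     Running   3          48m
"""


def unready_pods(text):
    """Return (name, status, restarts) for pods whose READY count is incomplete."""
    rows = []
    for line in text.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 5:
            continue  # not a pod row
        name, ready, status, restarts = fields[:4]
        ready_count, total = ready.split("/")
        if ready_count != total:
            rows.append((name, status, int(restarts)))
    return rows


if __name__ == "__main__":
    for name, status, restarts in unready_pods(SAMPLE):
        print(f"{name}: {status}, {restarts} restarts")
```

A pod can report STATUS Running while READY stays at 0/1, as both community-operators pods do here, so filtering on READY catches the pods that a STATUS check alone would miss.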

Comment 17 Arunabha Banerjee 2020-10-17 15:55:47 UTC

Now I am getting the "CrashLoopBackOff" error again.

stack.hpecloud.org:/home/stack>oc get po -n openshift-marketplace
NAME                                    READY   STATUS             RESTARTS   AGE
certified-operators-78957bf87f-znxd8    1/1     Running            0          28m
community-operators-6f45588956-89l9f    0/1     CrashLoopBackOff   7          24m
community-operators-7957d59f7d-2868l    0/1     CrashLoopBackOff   8          24m
marketplace-operator-5df598b96b-8xpgc   1/1     Running            0          61m
redhat-marketplace-778757464b-nxwq4     1/1     Running            4          61m
redhat-operators-5745dd5649-t996z       1/1     Running            3          61m
stack.hpecloud.org:/home/stack>

Comment 18 Kevin Rizza 2020-10-19 17:34:42 UTC

Arunabha,

Can you get the logs related to the crashlooping community-operators pod?

Additionally, can you `oc describe` the pod?
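
The two requests above can be gathered in one pass. The following is a rough sketch, assuming `oc` is on the PATH and already logged in to the cluster; the helper names and the pod name passed in are placeholders, not values from this bug. `--previous` asks for the log of the last terminated container, which is usually the relevant one for a crash-looping pod.

```python
import subprocess

NAMESPACE = "openshift-marketplace"


def diagnostic_commands(pod, namespace=NAMESPACE):
    """Build the oc invocations for a crash-looping pod as argv lists."""
    return [
        # Log of the previous (terminated) container instance.
        ["oc", "logs", pod, "-n", namespace, "--previous"],
        # Log of the current container instance, if it has produced any.
        ["oc", "logs", pod, "-n", namespace],
        # Probe settings, last state, exit code and events.
        ["oc", "describe", "pod", pod, "-n", namespace],
    ]


def collect(pod, namespace=NAMESPACE):
    """Run each command and return {command string: combined output}."""
    results = {}
    for argv in diagnostic_commands(pod, namespace):
        proc = subprocess.run(argv, capture_output=True, text=True, check=False)
        results[" ".join(argv)] = proc.stdout + proc.stderr
    return results
```

Calling `collect("<pod-name>")` with the name of one of the failing pods returns the three outputs keyed by the command that produced them, ready to attach to the bug.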

Comment 19 Arunabha Banerjee 2020-10-19 17:41:11 UTC

(ocp2) centos@bastion2:/home/centos>oc describe po certified-operators-7894cc7667-gp98t -n openshift-marketplace
Name:         certified-operators-7894cc7667-gp98t
Namespace:    openshift-marketplace
Priority:     0
Node:         ocp2-m4ndr-worker-qhqfc/10.0.1.156
Start Time:   Mon, 19 Oct 2020 16:37:34 +0000
Labels:       marketplace.operatorSource=certified-operators
              pod-template-hash=7894cc7667
Annotations:  k8s.v1.cni.cncf.io/networks-status:
                [{
                    "name": "openshift-sdn",
                    "interface": "eth0",
                    "ips": [
                        "10.131.0.46"
                    ],
                    "dns": {},
                    "default-route": [
                        "10.131.0.1"
                    ]
                }]
              openshift-marketplace-update-hash: 163f721aa3b1c8b3
              openshift.io/scc: restricted
Status:       Running
IP:           10.131.0.46
IPs:
  IP:           10.131.0.46
Controlled By:  ReplicaSet/certified-operators-7894cc7667
Containers:
  certified-operators:
    Container ID:  cri-o://b5d868929528318532d4b694de19e6989a4886ed17096e43100d4d7b8b1510e8
    Image:         quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:7eb4e13f22a6234c70e02203e5c9adebf9fc7088b9e2d11b7a5cc614f90f413d
    Image ID:      quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:7eb4e13f22a6234c70e02203e5c9adebf9fc7088b9e2d11b7a5cc614f90f413d
    Port:          50051/TCP
    Host Port:     0/TCP
    Command:
      appregistry-server
      -r
      https://quay.io/cnr|certified-operators
      -o
      storageos,cortex-operator,ibm-management-ingress-operator-app,ibm-helm-repo-operator-app,newrelic-infrastructure,ibm-auditlogging-operator-app,crunchy-postgres-operator,sematext,gitlab-operator,aci-containers-operator,tf-operator,memql-certified,t8c-certified,nginx-ingress-operator,cyberarmor-operator-certified,vprotect-operator,neuvector-certified-operator,redhat-marketplace-operator,splunk-certified,anzograph-operator,nsx-container-plugin-operator,redis-enterprise-operator-cert,seldon-deploy-operator,tidb-operator-certified,kong,couchdb-operator-certified,ibm-spectrum-symphony-operator,eddi-operator-certified,tigera-operator,openshiftxray-operator,cic-operator-with-crds,can-operator,driverlessai-deployment-operator-certified,portshift-operator,here-service-operator-certified,openshiftartifactoryha-operator,gpu-operator-certified,hpe-csi-operator,atomicorp-helm-operator-certified,node-red-operator-certified,linstor-operator,orca,federatorai-certified,citrix-adc-istio-ingress-gateway-operator,open-liberty-certified,vfunction-server-operator,aws-event-sources-operator-certified,hazelcast-enterprise-certified,f5-bigip-ctlr-operator,zoperator,kubemq-operator-marketplace,cortex-hub-operator,percona-server-mongodb-operator-certified,traefikee-certified,mongodb-enterprise-advanced-ibm,k10-kasten-operator,percona-xtradb-cluster-operator-certified,presto-operator,insightedge-enterprise-operator2,rapidbiz-operator-certified,triggermesh-operator,transform-adv-operator,mongodb-enterprise,cic-operator,k8s-triliovault,xcrypt-operator,oneagent-certified,anchore-engine,citrix-cpx-istio-sidecar-injector-operator,robin-operator,openunison-ocp-certified,instana-agent,appranix-cps,datadog-operator-certified,hazelcast-jet-enterprise-operator,infinibox-operator-certified,ibm-mongodb-operator-app,cockroachdb-certified,appsody-operator-certified,ubix-operator,akka-cluster-operator-certified,ibm-monitoring-grafana-operator-app,cnvrg-operator-marketplace,ch-appliance-operator,cloud-na
tive-postgresql,uma-operator,cortex-fabric-operator,cass-operator,portshift-controller-operator,fep-helm-operator-certified,planetscale-certified,aqua-operator-certified,nxrm-operator-certified,cih-operator-certified,cert-manager-operator,universalagent-operator-certified,portworx-certified,ibm-spectrum-scale-csi,zabbix-operator-certified,anaconda-team-edition,hspc-operator,stonebranch-universalagent-operator-certified,appdynamics-operator,kubeturbo-certified,cpx-cic-operator,ibm-block-csi-operator,wavefront-operator,kong-offline-operator,kube-arangodb,yugabyte-operator,dell-csi-operator-certified,sysdig-certified,seldon-operator-certified,cortex-certifai-operator,kubemq-operator,ibm-helm-api-operator-app,nxiq-operator-certified,fp-predict-plus-operator-certified,storageos2,couchbase-enterprise-certified,synopsys-certified,joget-openshift-operator,open-enterprise-spinnaker,runtime-component-operator-certified,timemachine-operator,nuodb-ce-certified,joget-dx-operator,nastel-navigator-operator-certified,traefikee-redhat-certified,rocketchat-operator-certified,cortex-healthcare-hub-operator,growth-stack-operator-certified,twistlock-certified,perceptilabs-operator-package,alcide-kaudit-operator,falco-certified,densify-operator,armo-operator-certified,ivory-server-app,ocean-operator,ibm-platform-api-operator-app,coralogix-operator-certified
    State:          Running
      Started:      Mon, 19 Oct 2020 17:37:43 +0000
    Last State:     Terminated
      Reason:       Error
      Exit Code:    2
      Started:      Mon, 19 Oct 2020 17:35:07 +0000
      Finished:     Mon, 19 Oct 2020 17:37:41 +0000
    Ready:          False
    Restart Count:  15
    Requests:
      cpu:      10m
      memory:   100Mi
    Liveness:   exec [grpc_health_probe -addr=localhost:50051] delay=60s timeout=1s period=10s #success=1 #failure=10
    Readiness:  exec [grpc_health_probe -addr=localhost:50051] delay=120s timeout=1s period=10s #success=1 #failure=10
    Environment:
      HTTP_PROXY:
      HTTPS_PROXY:
      NO_PROXY:
    Mounts:
      /etc/pki/ca-trust/extracted/pem/ from marketplace-trusted-ca (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-rstgp (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  marketplace-trusted-ca:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      marketplace-trusted-ca
    Optional:  false
  default-token-rstgp:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-rstgp
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  beta.kubernetes.io/os=linux
Tolerations:     node.kubernetes.io/memory-pressure:NoSchedule
                 node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type     Reason     Age                   From                              Message
  ----     ------     ----                  ----                              -------
  Normal   Scheduled  61m                   default-scheduler                 Successfully assigned openshift-marketplace/certified-operators-7894cc7667-gp98t to ocp2-m4ndr-worker-qhqfc
  Warning  Unhealthy  59m (x3 over 59m)     kubelet, ocp2-m4ndr-worker-qhqfc  Readiness probe failed: timeout: failed to connect service "localhost:50051" within 1s
  Normal   Killing    59m                   kubelet, ocp2-m4ndr-worker-qhqfc  Container certified-operators failed liveness probe, will be restarted
  Normal   Pulled     59m (x2 over 61m)     kubelet, ocp2-m4ndr-worker-qhqfc  Container image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:7eb4e13f22a6234c70e02203e5c9adebf9fc7088b9e2d11b7a5cc614f90f413d" already present on machine
  Normal   Created    59m (x2 over 61m)     kubelet, ocp2-m4ndr-worker-qhqfc  Created container certified-operators
  Normal   Started    59m (x2 over 61m)     kubelet, ocp2-m4ndr-worker-qhqfc  Started container certified-operators
  Warning  BackOff    6m43s (x87 over 40m)  kubelet, ocp2-m4ndr-worker-qhqfc  Back-off restarting failed container
  Warning  Unhealthy  110s (x148 over 60m)  kubelet, ocp2-m4ndr-worker-qhqfc  Liveness probe failed: timeout: failed to connect service "localhost:50051" within 1s
(ocp2) centos@bastion2:/home/centos>

Comment 20 Arunabha Banerjee 2020-10-19 17:42:11 UTC

(ocp2) centos@bastion2:/home/centos>oc describe po community-operators-5cd868bdd-4vbvk -n openshift-marketplace
Name:         community-operators-5cd868bdd-4vbvk
Namespace:    openshift-marketplace
Priority:     0
Node:         ocp2-m4ndr-worker-qhqfc/10.0.1.156
Start Time:   Mon, 19 Oct 2020 16:37:40 +0000
Labels:       marketplace.operatorSource=community-operators
              pod-template-hash=5cd868bdd
Annotations:  k8s.v1.cni.cncf.io/networks-status:
                [{
                    "name": "openshift-sdn",
                    "interface": "eth0",
                    "ips": [
                        "10.131.0.47"
                    ],
                    "dns": {},
                    "default-route": [
                        "10.131.0.1"
                    ]
                }]
              openshift-marketplace-update-hash: 163f721bea95d529
              openshift.io/scc: restricted
Status:       Running
IP:           10.131.0.47
IPs:
  IP:           10.131.0.47
Controlled By:  ReplicaSet/community-operators-5cd868bdd
Containers:
  community-operators:
    Container ID:  cri-o://25de23c7ac5b99de32bc8e1ddcf02b5d5f3be3904ba998a5a98854320179d61d
    Image:         quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:7eb4e13f22a6234c70e02203e5c9adebf9fc7088b9e2d11b7a5cc614f90f413d
    Image ID:      quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:7eb4e13f22a6234c70e02203e5c9adebf9fc7088b9e2d11b7a5cc614f90f413d
    Port:          50051/TCP
    Host Port:     0/TCP
    Command:
      appregistry-server
      -r
      https://quay.io/cnr|community-operators
      -o
      ember-csi-operator,federation,jenkins-operator,sysflow-operator,reportportal-operator,codeready-toolchain-operator,cost-mgmt-operator,openebs,kubeturbo,hazelcast-jet-operator,percona-xtradb-cluster-operator,ibmcloud-iam-operator,openshift-qiskit-operator,postgresql,microsegmentation-operator,oadp-operator,opsmx-spinnaker-operator,podium-operator-bundle,horreum-operator,redis-operator,cert-utils-operator,crossplane,group-sync-operator,klusterlet,azure-service-operator,gitops-operator,maistraoperator,postgresql-operator-dev4devs-com,portworx-essentials,ripsaw,kubernetes-imagepuller-operator,t8c,snyk-operator,metering,awss3-operator-registry,myvirtualdirectory,prisma-cloud-compute-console-operator,sealed-secrets-operator-helm,apicurio-registry,prometheus,ibmcloud-operator,submariner,api-operator,dell-csi-operator,strimzi-kafka-operator,knative-kafka-operator,seldon-operator,hive-operator,hazelcast-operator,tidb-operator,planetscale,atlasmap-operator,eclipse-che,aws-efs-operator,apicast-community-operator,nexus-operator-m88i,hyperfoil-bundle,global-load-balancer-operator,knative-camel-operator,radanalytics-spark,service-binding-operator,percona-server-mongodb-operator,kogito-operator,ditto-operator,mcad-operator,microcks,event-streams-topic,traefikee-operator,argocd-operator,lib-bucket-provisioner,egressip-ipam-operator,iot-simulator,keycloak-operator,starter-kit-operator,jupyterlab-operator,cockroachdb,resource-locker-operator,keepalived-operator,elastic-cloud-eck,grafana-operator,hawtio-operator,hawkbit-operator,jaeger,spinnaker-operator,lightbend-console-operator,multicluster-operators-subscription,eunomia,prometheus-exporter-operator,openshift-nfd-operator,opendatahub-operator,nsm-operator-registry,teiid,must-gather-operator,enc-key-sync,federatorai,apicurito,enmasse,composable-operator,ibm-block-csi-operator-community,ibm-quantum-operator,node-problem-detector,namespace-configuration-operator,3scale-community-operator,special-resource-operator,keda,datadog-op
erator,spark-gcp,syndesis,splunk,pystol,cluster-manager,openshift-ibm-quantum-operator,camel-k,ibm-spectrum-scale-csi-operator,neuvector-community-operator,konveyor-operator,buildv2-operator,snapscheduler,infinispan,kubefed,kubestone,wso2am-operator,ham-deploy,aqua,kiali,esindex-operator,argocd-operator-helm,skydive-operator,etcd,akka-cluster-operator
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Error
      Exit Code:    2
      Started:      Mon, 19 Oct 2020 17:38:07 +0000
      Finished:     Mon, 19 Oct 2020 17:40:43 +0000
    Ready:          False
    Restart Count:  15
    Requests:
      cpu:      10m
      memory:   100Mi
    Liveness:   exec [grpc_health_probe -addr=localhost:50051] delay=60s timeout=1s period=10s #success=1 #failure=10
    Readiness:  exec [grpc_health_probe -addr=localhost:50051] delay=120s timeout=1s period=10s #success=1 #failure=10
    Environment:
      HTTP_PROXY:
      HTTPS_PROXY:
      NO_PROXY:
    Mounts:
      /etc/pki/ca-trust/extracted/pem/ from marketplace-trusted-ca (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-rstgp (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  marketplace-trusted-ca:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      marketplace-trusted-ca
    Optional:  false
  default-token-rstgp:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-rstgp
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  beta.kubernetes.io/os=linux
Tolerations:     node.kubernetes.io/memory-pressure:NoSchedule
                 node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type     Reason     Age                   From                              Message
  ----     ------     ----                  ----                              -------
  Normal   Scheduled  64m                   default-scheduler                 Successfully assigned openshift-marketplace/community-operators-5cd868bdd-4vbvk to ocp2-m4ndr-worker-qhqfc
  Normal   Killing    61m                   kubelet, ocp2-m4ndr-worker-qhqfc  Container community-operators failed liveness probe, will be restarted
  Normal   Pulled     61m (x2 over 63m)     kubelet, ocp2-m4ndr-worker-qhqfc  Container image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:7eb4e13f22a6234c70e02203e5c9adebf9fc7088b9e2d11b7a5cc614f90f413d" already present on machine
  Normal   Created    61m (x2 over 63m)     kubelet, ocp2-m4ndr-worker-qhqfc  Created container community-operators
  Normal   Started    61m (x2 over 63m)     kubelet, ocp2-m4ndr-worker-qhqfc  Started container community-operators
  Warning  Unhealthy  23m (x87 over 62m)    kubelet, ocp2-m4ndr-worker-qhqfc  Liveness probe failed: timeout: failed to connect service "localhost:50051" within 1s
  Warning  Unhealthy  12m (x31 over 62m)    kubelet, ocp2-m4ndr-worker-qhqfc  Liveness probe failed: command timed out
  Warning  BackOff    8m51s (x87 over 42m)  kubelet, ocp2-m4ndr-worker-qhqfc  Back-off restarting failed container
  Warning  Unhealthy  3m50s (x58 over 61m)  kubelet, ocp2-m4ndr-worker-qhqfc  Readiness probe failed: timeout: failed to connect service "localhost:50051" within 1s
(ocp2) centos@bastion2:/home/centos>

Comment 21 Arunabha Banerjee 2020-10-19 17:43:16 UTC

(ocp2) centos@bastion2:/home/centos>oc describe po certified-operators-7894cc7667-gp98t -n openshift-marketplace
Name:         certified-operators-7894cc7667-gp98t
Namespace:    openshift-marketplace
Priority:     0
Node:         ocp2-m4ndr-worker-qhqfc/10.0.1.156
Start Time:   Mon, 19 Oct 2020 16:37:34 +0000
Labels:       marketplace.operatorSource=certified-operators
              pod-template-hash=7894cc7667
Annotations:  k8s.v1.cni.cncf.io/networks-status:
                [{
                    "name": "openshift-sdn",
                    "interface": "eth0",
                    "ips": [
                        "10.131.0.46"
                    ],
                    "dns": {},
                    "default-route": [
                        "10.131.0.1"
                    ]
                }]
              openshift-marketplace-update-hash: 163f721aa3b1c8b3
              openshift.io/scc: restricted
Status:       Running
IP:           10.131.0.46
IPs:
  IP:           10.131.0.46
Controlled By:  ReplicaSet/certified-operators-7894cc7667
Containers:
  certified-operators:
    Container ID:  cri-o://b5d868929528318532d4b694de19e6989a4886ed17096e43100d4d7b8b1510e8
    Image:         quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:7eb4e13f22a6234c70e02203e5c9adebf9fc7088b9e2d11b7a5cc614f90f413d
    Image ID:      quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:7eb4e13f22a6234c70e02203e5c9adebf9fc7088b9e2d11b7a5cc614f90f413d
    Port:          50051/TCP
    Host Port:     0/TCP
    Command:
      appregistry-server
      -r
      https://quay.io/cnr|certified-operators
      -o
      storageos,cortex-operator,ibm-management-ingress-operator-app,ibm-helm-repo-operator-app,newrelic-infrastructure,ibm-auditlogging-operator-app,crunchy-postgres-operator,sematext,gitlab-operator,aci-containers-operator,tf-operator,memql-certified,t8c-certified,nginx-ingress-operator,cyberarmor-operator-certified,vprotect-operator,neuvector-certified-operator,redhat-marketplace-operator,splunk-certified,anzograph-operator,nsx-container-plugin-operator,redis-enterprise-operator-cert,seldon-deploy-operator,tidb-operator-certified,kong,couchdb-operator-certified,ibm-spectrum-symphony-operator,eddi-operator-certified,tigera-operator,openshiftxray-operator,cic-operator-with-crds,can-operator,driverlessai-deployment-operator-certified,portshift-operator,here-service-operator-certified,openshiftartifactoryha-operator,gpu-operator-certified,hpe-csi-operator,atomicorp-helm-operator-certified,node-red-operator-certified,linstor-operator,orca,federatorai-certified,citrix-adc-istio-ingress-gateway-operator,open-liberty-certified,vfunction-server-operator,aws-event-sources-operator-certified,hazelcast-enterprise-certified,f5-bigip-ctlr-operator,zoperator,kubemq-operator-marketplace,cortex-hub-operator,percona-server-mongodb-operator-certified,traefikee-certified,mongodb-enterprise-advanced-ibm,k10-kasten-operator,percona-xtradb-cluster-operator-certified,presto-operator,insightedge-enterprise-operator2,rapidbiz-operator-certified,triggermesh-operator,transform-adv-operator,mongodb-enterprise,cic-operator,k8s-triliovault,xcrypt-operator,oneagent-certified,anchore-engine,citrix-cpx-istio-sidecar-injector-operator,robin-operator,openunison-ocp-certified,instana-agent,appranix-cps,datadog-operator-certified,hazelcast-jet-enterprise-operator,infinibox-operator-certified,ibm-mongodb-operator-app,cockroachdb-certified,appsody-operator-certified,ubix-operator,akka-cluster-operator-certified,ibm-monitoring-grafana-operator-app,cnvrg-operator-marketplace,ch-appliance-operator,cloud-na
tive-postgresql,uma-operator,cortex-fabric-operator,cass-operator,portshift-controller-operator,fep-helm-operator-certified,planetscale-certified,aqua-operator-certified,nxrm-operator-certified,cih-operator-certified,cert-manager-operator,universalagent-operator-certified,portworx-certified,ibm-spectrum-scale-csi,zabbix-operator-certified,anaconda-team-edition,hspc-operator,stonebranch-universalagent-operator-certified,appdynamics-operator,kubeturbo-certified,cpx-cic-operator,ibm-block-csi-operator,wavefront-operator,kong-offline-operator,kube-arangodb,yugabyte-operator,dell-csi-operator-certified,sysdig-certified,seldon-operator-certified,cortex-certifai-operator,kubemq-operator,ibm-helm-api-operator-app,nxiq-operator-certified,fp-predict-plus-operator-certified,storageos2,couchbase-enterprise-certified,synopsys-certified,joget-openshift-operator,open-enterprise-spinnaker,runtime-component-operator-certified,timemachine-operator,nuodb-ce-certified,joget-dx-operator,nastel-navigator-operator-certified,traefikee-redhat-certified,rocketchat-operator-certified,cortex-healthcare-hub-operator,growth-stack-operator-certified,twistlock-certified,perceptilabs-operator-package,alcide-kaudit-operator,falco-certified,densify-operator,armo-operator-certified,ivory-server-app,ocean-operator,ibm-platform-api-operator-app,coralogix-operator-certified
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Error
      Exit Code:    2
      Started:      Mon, 19 Oct 2020 17:37:43 +0000
      Finished:     Mon, 19 Oct 2020 17:40:21 +0000
    Ready:          False
    Restart Count:  15
    Requests:
      cpu:      10m
      memory:   100Mi
    Liveness:   exec [grpc_health_probe -addr=localhost:50051] delay=60s timeout=1s period=10s #success=1 #failure=10
    Readiness:  exec [grpc_health_probe -addr=localhost:50051] delay=120s timeout=1s period=10s #success=1 #failure=10
    Environment:
      HTTP_PROXY:
      HTTPS_PROXY:
      NO_PROXY:
    Mounts:
      /etc/pki/ca-trust/extracted/pem/ from marketplace-trusted-ca (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-rstgp (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  marketplace-trusted-ca:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      marketplace-trusted-ca
    Optional:  false
  default-token-rstgp:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-rstgp
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  beta.kubernetes.io/os=linux
Tolerations:     node.kubernetes.io/memory-pressure:NoSchedule
                 node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type     Reason     Age                    From                              Message
  ----     ------     ----                   ----                              -------
  Normal   Scheduled  65m                    default-scheduler                 Successfully assigned openshift-marketplace/certified-operators-7894cc7667-gp98t to ocp2-m4ndr-worker-qhqfc
  Warning  Unhealthy  62m (x3 over 63m)      kubelet, ocp2-m4ndr-worker-qhqfc  Readiness probe failed: timeout: failed to connect service "localhost:50051" within 1s
  Normal   Killing    62m                    kubelet, ocp2-m4ndr-worker-qhqfc  Container certified-operators failed liveness probe, will be restarted
  Normal   Pulled     62m (x2 over 65m)      kubelet, ocp2-m4ndr-worker-qhqfc  Container image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:7eb4e13f22a6234c70e02203e5c9adebf9fc7088b9e2d11b7a5cc614f90f413d" already present on machine
  Normal   Created    62m (x2 over 65m)      kubelet, ocp2-m4ndr-worker-qhqfc  Created container certified-operators
  Normal   Started    62m (x2 over 65m)      kubelet, ocp2-m4ndr-worker-qhqfc  Started container certified-operators
  Warning  Unhealthy  5m12s (x148 over 64m)  kubelet, ocp2-m4ndr-worker-qhqfc  Liveness probe failed: timeout: failed to connect service "localhost:50051" within 1s
  Warning  BackOff    11s (x109 over 43m)    kubelet, ocp2-m4ndr-worker-qhqfc  Back-off restarting failed container
(ocp2) centos@bastion2:/home/centos>

Comment 22 Arunabha Banerjee 2020-10-19 17:50:26 UTC

Can we increase the timeout value? Currently it is only "1s".

Liveness:   exec [grpc_health_probe -addr=localhost:50051] delay=60s timeout=1s period=10s #success=1 #failure=10
Readiness:  exec [grpc_health_probe -addr=localhost:50051] delay=120s timeout=1s period=10s #success=1 #failure=10
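
For context on these settings, the time a container gets before the liveness probe restarts it can be estimated from the probe line and compared with the Started/Finished timestamps in the `oc describe` output from comments 19 to 21. This is a rough sketch: the formula is an approximation of kubelet behaviour (first probe about `delay` seconds after start, then one probe per `period`), and the function names are hypothetical.

```python
import re
from email.utils import parsedate_to_datetime

PROBE_RE = re.compile(
    r"delay=(\d+)s timeout=(\d+)s period=(\d+)s #success=(\d+) #failure=(\d+)"
)

# Probe line as reported by `oc describe po`.
LIVENESS = (
    "Liveness:   exec [grpc_health_probe -addr=localhost:50051] "
    "delay=60s timeout=1s period=10s #success=1 #failure=10"
)

# (Started, Finished) pairs of the "Last State: Terminated" entries above.
RUNS = [
    ("Mon, 19 Oct 2020 17:35:07 +0000", "Mon, 19 Oct 2020 17:37:41 +0000"),
    ("Mon, 19 Oct 2020 17:38:07 +0000", "Mon, 19 Oct 2020 17:40:43 +0000"),
    ("Mon, 19 Oct 2020 17:37:43 +0000", "Mon, 19 Oct 2020 17:40:21 +0000"),
]


def parse_probe(line):
    """Extract the numeric probe settings from an `oc describe` probe line."""
    match = PROBE_RE.search(line)
    if match is None:
        raise ValueError(f"not a probe line: {line!r}")
    keys = ("delay", "timeout", "period", "success", "failure")
    return dict(zip(keys, map(int, match.groups())))


def seconds_until_restart(probe):
    """Approximate time from container start to the final failed liveness probe."""
    # The first probe runs after `delay`; the last of `failure` consecutive
    # failures lands (failure - 1) periods later and may take up to `timeout`.
    return probe["delay"] + (probe["failure"] - 1) * probe["period"] + probe["timeout"]


def run_seconds(started, finished):
    """Duration in seconds between two RFC 2822 style timestamps."""
    delta = parsedate_to_datetime(finished) - parsedate_to_datetime(started)
    return int(delta.total_seconds())


if __name__ == "__main__":
    print("estimated budget:", seconds_until_restart(parse_probe(LIVENESS)), "s")
    print("observed runs:", [run_seconds(s, f) for s, f in RUNS], "s")
```

With the values above the estimate is 151 seconds, and the three terminated containers ran for 154, 156 and 158 seconds. That is consistent with each container being restarted by the liveness probe as soon as its failure budget ran out, before the registry server started answering on port 50051.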

Comment 23 Evan Cordell 2020-10-21 20:04:09 UTC

I spun up a 4.4.27 cluster and could not reproduce the issue.

There are multiple reasons this pod could crash loop, and the original cases attached to this BZ have been resolved. 

If you can still reproduce it, please open a new BZ that includes the logs from this pod (if available).
