Bug 1853601 - CatalogSource generates 2 pods that keep restarting
Summary: CatalogSource generates 2 pods that keep restarting
Keywords:
Status: VERIFIED
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: OLM
Version: 4.6
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: medium
Target Milestone: ---
Target Release: 4.6.0
Assignee: Ankita Thomas
QA Contact: Jian Zhang
URL:
Whiteboard:
Depends On:
Blocks: 1870782
 
Reported: 2020-07-03 10:07 UTC by gaoshang
Modified: 2020-09-29 02:44 UTC (History)
4 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 1870782
Environment:
Last Closed:
Target Upstream Version:


Attachments


Links
System ID Priority Status Summary Last Updated
Github operator-framework operator-lifecycle-manager pull 1624 None closed Bug 1853601: use server-side-apply for catalog source pod update 2020-09-29 02:43:07 UTC

Internal Links: 1870782

Description gaoshang 2020-07-03 10:07:27 UTC
Description of problem:
After applying the catalogsource YAML, 2 pods are generated and keep restarting
# oc apply -f deploy/olm-catalog/catalogsource.yaml
...

# cat deploy/olm-catalog/catalogsource.yaml
apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
  name: wmco
  # Deploying to the openshift-marketplace namespace as it is the global catalog namespace;
  # a subscription in any namespace can refer to catalogsources in this namespace without error.
  namespace: openshift-marketplace
spec:
  displayName: Windows Machine Config operators
  sourceType: grpc
  image: quay.io/sgaoshang/wmco-index:1.0.0
  updateStrategy:
    registryPoll:
      interval: 5m

# oc get all -n openshift-marketplace
NAME                                        READY   STATUS    RESTARTS   AGE
...
pod/wmco-h5gtn                              1/1     Running   0          16m
pod/wmco-xqlz4                              1/1     Running   0          10m

NAME                                   TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)             AGE
...
service/wmco                           ClusterIP   172.30.161.103   <none>        50051/TCP           14h

Version-Release number of selected component (if applicable):
# oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.5.0-0.nightly-2020-07-01-213040   True        False         31h     Cluster version is 4.5.0-0.nightly-2020-07-01-213040

How reproducible:
always

Steps to Reproduce:
1. Create wmco operator image, bundle image and index image, push to quay.io, refer to https://github.com/openshift/windows-machine-config-operator/blob/master/README.md

2. Install OCP cluster with ovn-kubernetes, apply catalogsource
e.g.
# oc apply -f deploy/olm-catalog/catalogsource.yaml
...

3. Check the registry-server pods generated in openshift-marketplace: 2 wmco pods exist and keep restarting. It seems the CatalogSource object keeps creating a new pod even though there is no update to the operator index image, and the old pod is not removed.

# oc get all -n openshift-marketplace
NAME                                        READY   STATUS    RESTARTS   AGE
...
pod/wmco-h5gtn                              1/1     Running   0          16m
pod/wmco-xqlz4                              1/1     Running   0          10m

NAME                                   TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)             AGE
...
service/wmco                           ClusterIP   172.30.161.103   <none>        50051/TCP           14h


Actual results:

Expected results:

Additional info:

Comment 1 Ankita Thomas 2020-07-07 05:42:39 UTC
Looks like this is because the registry polling interval is too short. Can you try increasing this to 25m and see if the problem persists? 

See https://github.com/operator-framework/operator-lifecycle-manager/blob/master/doc/design/catalog-polling.md#caveats
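The caveat linked above can be illustrated with a toy timeline (purely illustrative, not OLM's implementation): if an update pod is launched every poll interval but takes longer than that interval to pull its image and resolve, update pods pile up before the previous one is cleaned up.

```python
def max_concurrent_update_pods(interval, startup, horizon):
    """Peak number of update pods alive at once, in a toy model where a new
    update pod launches every `interval` seconds and each one takes
    `startup` seconds to pull its image and resolve (all values in seconds)."""
    events = []
    t = 0.0
    while t < horizon:
        events.append((t, +1))            # update pod launched
        events.append((t + startup, -1))  # update pod resolved and removed
        t += interval
    peak = current = 0
    for _, delta in sorted(events):
        current += delta
        peak = max(peak, current)
    return peak

# With a 5m poll but a 12m image pull, roughly 3 update pods are in flight
# at once; with a 25m poll the count stays at 1.
print(max_concurrent_update_pods(300, 720, 3600))   # 3
print(max_concurrent_update_pods(1500, 720, 3600))  # 1
```

This is only a model of the caveat; the actual reproducer below shows the problem persists even at 25m, which is why the fix ultimately landed in the operator itself.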

Comment 2 gaoshang 2020-07-07 11:23:28 UTC
(In reply to Ankita Thomas from comment #1)
> Looks like this is because the registry polling interval is too short. Can
> you try increasing this to 25m and see if the problem persists? 
> 
> See
> https://github.com/operator-framework/operator-lifecycle-manager/blob/master/
> doc/design/catalog-polling.md#caveats

Thanks for the info. Unfortunately, after changing the registry polling interval to 25m the issue still exists; I have added logs from catalog-operator.

# oc get all -n openshift-marketplace
NAME                                        READY   STATUS    RESTARTS   AGE
...
pod/wmco-46wlx                              1/1     Running   0          9m54s
pod/wmco-qc5b9                              1/1     Running   0          9m56s

NAME                                   TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)             AGE
...
service/wmco                           ClusterIP   172.30.48.15     <none>        50051/TCP           40m


# oc logs -f deployment.apps/catalog-operator -n openshift-operator-lifecycle-manager
...

time="2020-07-07T08:52:57Z" level=info msg="state.Key.Namespace=openshift-marketplace state.Key.Name=wmco state.State=CONNECTING"
time="2020-07-07T08:52:58Z" level=info msg="state.Key.Namespace=openshift-marketplace state.Key.Name=wmco state.State=TRANSIENT_FAILURE"
time="2020-07-07T08:52:59Z" level=info msg="state.Key.Namespace=openshift-marketplace state.Key.Name=wmco state.State=CONNECTING"
time="2020-07-07T08:52:59Z" level=info msg="state.Key.Namespace=openshift-marketplace state.Key.Name=wmco state.State=TRANSIENT_FAILURE"
time="2020-07-07T08:53:01Z" level=info msg="state.Key.Namespace=openshift-marketplace state.Key.Name=wmco state.State=CONNECTING"
time="2020-07-07T08:53:01Z" level=info msg="state.Key.Namespace=openshift-marketplace state.Key.Name=wmco state.State=TRANSIENT_FAILURE"
time="2020-07-07T08:53:03Z" level=info msg="state.Key.Namespace=openshift-marketplace state.Key.Name=wmco state.State=CONNECTING"
time="2020-07-07T08:53:03Z" level=info msg="state.Key.Namespace=openshift-marketplace state.Key.Name=wmco state.State=TRANSIENT_FAILURE"
time="2020-07-07T08:53:07Z" level=info msg="state.Key.Namespace=openshift-marketplace state.Key.Name=wmco state.State=CONNECTING"
time="2020-07-07T08:53:07Z" level=info msg="state.Key.Namespace=openshift-marketplace state.Key.Name=wmco state.State=TRANSIENT_FAILURE"
time="2020-07-07T08:53:13Z" level=info msg="state.Key.Namespace=openshift-marketplace state.Key.Name=wmco state.State=CONNECTING"
time="2020-07-07T08:53:13Z" level=info msg="state.Key.Namespace=openshift-marketplace state.Key.Name=wmco state.State=READY"
time="2020-07-07T08:54:39Z" level=info msg="Adding related objects for operator-lifecycle-manager-catalog"
time="2020-07-07T08:59:39Z" level=info msg="Adding related objects for operator-lifecycle-manager-catalog"
time="2020-07-07T09:04:39Z" level=info msg="Adding related objects for operator-lifecycle-manager-catalog"
time="2020-07-07T09:09:39Z" level=info msg="Adding related objects for operator-lifecycle-manager-catalog"
time="2020-07-07T09:14:39Z" level=info msg="Adding related objects for operator-lifecycle-manager-catalog"
time="2020-07-07T09:19:39Z" level=info msg="Adding related objects for operator-lifecycle-manager-catalog"
time="2020-07-07T09:23:55Z" level=info msg="no image update for catalogsource pod" CatalogSource=wmco
time="2020-07-07T09:23:55Z" level=info msg="creating new catalog source update pod" CatalogSource=wmco
time="2020-07-07T09:23:55Z" level=info msg="state.Key.Namespace=openshift-marketplace state.Key.Name=wmco state.State=CONNECTING"
time="2020-07-07T09:23:55Z" level=warning msg="pod status unknown" CatalogSource=wmco-qc5b9
time="2020-07-07T09:23:55Z" level=info msg="ImageID " CatalogSource=wmco-qc5b9
time="2020-07-07T09:23:55Z" level=info msg="Update Pod ImageID " CatalogSource=wmco-qc5b9
time="2020-07-07T09:23:55Z" level=info msg="ImageID quay.io/sgaoshang/wmco-index@sha256:e29b54c17d6458488acbd6b860d355ce4b6b58a568f05fee3218538745818fd7" CatalogSource=wmco-bhvcf
time="2020-07-07T09:23:55Z" level=info msg="Serving Pod ImageID quay.io/sgaoshang/wmco-index@sha256:e29b54c17d6458488acbd6b860d355ce4b6b58a568f05fee3218538745818fd7" CatalogSource=wmco-bhvcf
time="2020-07-07T09:23:55Z" level=info msg="detect image update for catalogsource pod" CatalogSource=wmco
time="2020-07-07T09:23:55Z" level=info msg="state.Key.Namespace=openshift-marketplace state.Key.Name=wmco state.State=READY"
E0707 09:23:55.225719       1 queueinformer_operator.go:290] sync {"update" "openshift-marketplace/wmco"} failed: couldn't ensure registry server - error ensuring updated catalog source pod: : error creating new pod: : Operation cannot be fulfilled on pods "wmco-qc5b9": the object has been modified; please apply your changes to the latest version and try again
time="2020-07-07T09:23:55Z" level=info msg="ImageID " CatalogSource=wmco-qc5b9
time="2020-07-07T09:23:55Z" level=info msg="Update Pod ImageID " CatalogSource=wmco-qc5b9
time="2020-07-07T09:23:55Z" level=info msg="ImageID quay.io/sgaoshang/wmco-index@sha256:e29b54c17d6458488acbd6b860d355ce4b6b58a568f05fee3218538745818fd7" CatalogSource=wmco-bhvcf
time="2020-07-07T09:23:55Z" level=info msg="Serving Pod ImageID quay.io/sgaoshang/wmco-index@sha256:e29b54c17d6458488acbd6b860d355ce4b6b58a568f05fee3218538745818fd7" CatalogSource=wmco-bhvcf
time="2020-07-07T09:23:55Z" level=info msg="detect image update for catalogsource pod" CatalogSource=wmco
E0707 09:23:55.254957       1 queueinformer_operator.go:290] sync {"update" "openshift-marketplace/wmco"} failed: couldn't ensure registry server - error ensuring updated catalog source pod: : error creating new pod: : Operation cannot be fulfilled on pods "wmco-qc5b9": the object has been modified; please apply your changes to the latest version and try again
time="2020-07-07T09:23:55Z" level=info msg="ImageID " CatalogSource=wmco-qc5b9
time="2020-07-07T09:23:55Z" level=info msg="Update Pod ImageID " CatalogSource=wmco-qc5b9
time="2020-07-07T09:23:55Z" level=info msg="ImageID quay.io/sgaoshang/wmco-index@sha256:e29b54c17d6458488acbd6b860d355ce4b6b58a568f05fee3218538745818fd7" CatalogSource=wmco-bhvcf
time="2020-07-07T09:23:55Z" level=info msg="Serving Pod ImageID quay.io/sgaoshang/wmco-index@sha256:e29b54c17d6458488acbd6b860d355ce4b6b58a568f05fee3218538745818fd7" CatalogSource=wmco-bhvcf
time="2020-07-07T09:23:55Z" level=info msg="detect image update for catalogsource pod" CatalogSource=wmco
time="2020-07-07T09:23:55Z" level=info msg="state.Key.Namespace=openshift-marketplace state.Key.Name=wmco state.State=CONNECTING"
time="2020-07-07T09:23:55Z" level=info msg="state.Key.Namespace=openshift-marketplace state.Key.Name=wmco state.State=SHUTDOWN"
time="2020-07-07T09:23:55Z" level=info msg="state.Key.Namespace=openshift-marketplace state.Key.Name=wmco state.State=CONNECTING"
time="2020-07-07T09:23:55Z" level=info msg="state.Key.Namespace=openshift-marketplace state.Key.Name=wmco state.State=TRANSIENT_FAILURE"
time="2020-07-07T09:23:56Z" level=info msg="no image update for catalogsource pod" CatalogSource=wmco
time="2020-07-07T09:23:56Z" level=info msg="creating new catalog source update pod" CatalogSource=wmco
time="2020-07-07T09:23:56Z" level=info msg="state.Key.Namespace=openshift-marketplace state.Key.Name=wmco state.State=CONNECTING"
time="2020-07-07T09:23:56Z" level=info msg="state.Key.Namespace=openshift-marketplace state.Key.Name=wmco state.State=CONNECTING"
time="2020-07-07T09:23:57Z" level=info msg="ImageID " CatalogSource=wmco-bs8pq
time="2020-07-07T09:23:57Z" level=info msg="Update Pod ImageID " CatalogSource=wmco-bs8pq
time="2020-07-07T09:23:57Z" level=info msg="ImageID " CatalogSource=wmco-qc5b9
time="2020-07-07T09:23:57Z" level=info msg="Serving Pod ImageID " CatalogSource=wmco-qc5b9
time="2020-07-07T09:23:57Z" level=info msg="no image update for catalogsource pod" CatalogSource=wmco
time="2020-07-07T09:23:57Z" level=info msg="creating new catalog source update pod" CatalogSource=wmco
time="2020-07-07T09:23:57Z" level=info msg="state.Key.Namespace=openshift-marketplace state.Key.Name=wmco state.State=SHUTDOWN"
time="2020-07-07T09:23:57Z" level=info msg="state.Key.Namespace=openshift-marketplace state.Key.Name=wmco state.State=CONNECTING"
time="2020-07-07T09:24:13Z" level=info msg="state.Key.Namespace=openshift-marketplace state.Key.Name=wmco state.State=READY"
time="2020-07-07T09:24:39Z" level=info msg="Adding related objects for operator-lifecycle-manager-catalog"
time="2020-07-07T09:29:39Z" level=info msg="Adding related objects for operator-lifecycle-manager-catalog"
time="2020-07-07T09:34:39Z" level=info msg="Adding related objects for operator-lifecycle-manager-catalog"

Comment 3 gaoshang 2020-07-07 11:25:48 UTC
If I remove the following section from catalogsource.yaml, the problem cannot be reproduced.
---
  updateStrategy:
    registryPoll:
      interval: 5m

Comment 6 Jian Zhang 2020-08-20 07:10:11 UTC
From the above logs we can see unnecessary update operations, as follows; I submitted a PR to fix this: https://github.com/operator-framework/operator-lifecycle-manager/pull/1729

time="2020-07-07T09:23:57Z" level=info msg="Update Pod ImageID " CatalogSource=wmco-bs8pq
time="2020-07-07T09:23:57Z" level=info msg="ImageID " CatalogSource=wmco-qc5b9
time="2020-07-07T09:23:57Z" level=info msg="Serving Pod ImageID " CatalogSource=wmco-qc5b9

Comment 9 Jian Zhang 2020-08-21 03:08:13 UTC
[root@preserve-olm-env data]# oc exec catalog-operator-7574bc8948-gx62l -- olm --version
OLM version: 0.16.0
git commit: c3852d57c86707deb80c042c2155ad82c2d9628f

Based on my understanding, having two pods for the catalogsource is expected:
one of the pods is the currently serving pod, and the other is used to check for index image updates, so it is restarted every poll interval.


[root@preserve-olm-env data]# oc get catalogsource qe-app-registry -o yaml
apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
...
spec:
  displayName: Production Operators
  image: quay.io/openshift-qe-optional-operators/ocp4-index:latest
  publisher: OpenShift QE
  sourceType: grpc
  updateStrategy:
    registryPoll:
      interval: 15m

[root@preserve-olm-env data]# oc get pods
NAME                                    READY   STATUS    RESTARTS   AGE
...
qe-app-registry-7wk4v                   1/1     Running   0          4m43s
qe-app-registry-ml5p8                   1/1     Running   0          62m

As we can see, this pod is the serving pod; it keeps running without restarting since there are no updates to the index image.
[root@preserve-olm-env data]# oc get pods qe-app-registry-ml5p8 -o=jsonpath='{.metadata.labels}'
{"olm.catalogSource":"qe-app-registry"}

This one is used to check for index image updates, so it is recreated every poll interval.
[root@preserve-olm-env data]# oc get pods qe-app-registry-7wk4v -o=jsonpath='{.metadata.labels}'
{"catalogsource.operators.coreos.com/update":"qe-app-registry","olm.catalogSource":""}

Verified it.

