1679309 – Garbage collector erroneously deleting resources

Bug 1679309 - Garbage collector erroneously deleting resources

Summary: Garbage collector erroneously deleting resources

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	OLM
Sub Component:
Version:	4.1.0
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	urgent
Target Milestone:	---
Target Release:	4.1.0
Assignee:	Evan Cordell
QA Contact:	Jian Zhang
Docs Contact:
URL:
Whiteboard:
Duplicates (5):	1666225 1677329 1678606 1678644 1678654 (view as bug list)
Depends On:
Blocks:	1678606
TreeView+	depends on / blocked

Reported:	2019-02-20 21:00 UTC by Aravindh Puthiyaparambil
Modified:	2019-06-04 10:44 UTC (History)
CC List:	18 users (show)
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2019-06-04 10:44:14 UTC
Target Upstream Version:
Embargoed:

Attachments	(Terms of Use)

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHBA-2019:0758	0	None	None	None	2019-06-04 10:44:20 UTC

Description Aravindh Puthiyaparambil 2019-02-20 21:00:48 UTC

Description of problem:
Multiple bugs have opened against OLM and Marketplace where QE has found OLM or Marketplace functionality not working due to missing resources that have no reason to be missing. After investigating it looks like the GC is erroneously deleting these resources. This was confirmed by looking at the audit logs of the api-server. Here is a snippet of the log:

cat ~/debug-files/audit.log | grep 'packageserver' | grep delete | grep deployments
ip-10-0-158-107.us-east-2.compute.internal {"kind":"Event","apiVersion":"audit.k8s.io/v1beta1","metadata":{"creationTimestamp":"2019-02-19T21:35:48Z"},"level":"Metadata","timestamp":"2019-02-19T21:35:48Z","auditID":"34cd42cd-cd73-41bc-a982-5d345a1bebf0","stage":"ResponseComplete","requestURI":"/apis/apps/v1/namespaces/openshift-operator-lifecycle-manager/deployments/packageserver","verb":"delete","user":{"username":"system:serviceaccount:kube-system:generic-garbage-collector","uid":"31a4ce88-33b2-11e9-b126-02ed3ac405e6","groups":["system:serviceaccounts","system:serviceaccounts:kube-system","system:authenticated"]},"sourceIPs":["::1"],"userAgent":"hyperkube/v1.12.4+1221cce (linux/amd64) kubernetes/1221cce/system:serviceaccount:kube-system:generic-garbage-collector","objectRef":{"resource":"deployments","namespace":"openshift-operator-lifecycle-manager","name":"packageserver","apiGroup":"apps","apiVersion":"v1"},"responseStatus":{"metadata":{},"status":"Success","code":200},"requestReceivedTimestamp":"2019-02-19T21:35:48.663667Z","stageTimestamp":"2019-02-19T21:35:48.708828Z","annotations":{"authorization.k8s.io/decision":"allow","authorization.k8s.io/reason":"RBAC: allowed by ClusterRoleBinding \"system:controller:generic-garbage-collector\" of ClusterRole \"system:controller:generic-garbage-collector\" to ServiceAccount \"generic-garbage-collector/kube-system\""}}

In the above instance the OLM packageserver deployment was garbage collected when there was no reason for it happen.

Steps to Reproduce:
1. Bring up a cluster
2. Wait for many hours or maybe not :-)
3. Check resources that you expect to be present. We have seen various examples that are show in the bugs that will be marked as duplicate of this bug.

Comment 1 Aravindh Puthiyaparambil 2019-02-20 21:03:25 UTC

*** Bug 1677329 has been marked as a duplicate of this bug. ***

Comment 2 Aravindh Puthiyaparambil 2019-02-20 21:04:17 UTC

*** Bug 1678654 has been marked as a duplicate of this bug. ***

Comment 3 Aravindh Puthiyaparambil 2019-02-20 21:05:30 UTC

*** Bug 1666225 has been marked as a duplicate of this bug. ***

Comment 4 Abu Kashem 2019-02-20 22:45:57 UTC

I can reproduce this issue on a cluster with CVO disabled. 

Steps:
- Bring up a new cluster
- Make sure packageserver is up and running.
- Scale down CVO - oc scale --replicas 0 -n openshift-cluster-version deployments/cluster-version-operator

Comment 5 Aravindh Puthiyaparambil 2019-02-20 22:51:31 UTC

*** Bug 1678644 has been marked as a duplicate of this bug. ***

Comment 8 Fan Jia 2019-02-21 07:58:01 UTC

All the resources of marketplace will remain stable after stopping the cluster-version-operator.

Comment 9 Aravindh Puthiyaparambil 2019-02-21 15:33:23 UTC

(In reply to Fan Jia from comment #8)
> All the resources of marketplace will remain stable after stopping the
> cluster-version-operator.

This is most likely a red herring given that this issue is sporadic.

Comment 12 Fan Jia 2019-02-25 08:53:26 UTC

The catalogsourceconfig and marketplace's pods still crash in env : clusterversion 4.0.0-0.nightly-2019-02-24-045124. The same situation as https://bugzilla.redhat.com/show_bug.cgi?id=1678654

Comment 13 Fan Jia 2019-02-25 13:11:18 UTC

the marketplace still crash in the latest cv , subscribe failed  The csc can't be updated after a while. The cluster always keep losing stuff like (csc crd, packagemanifest crd, role, rolebinding....) and will recover or not.

eg: the crd `csv` will lose and recover
[core@ip-10-0-149-70 ~]$ oc get csv -n openshift-operators
error: the server doesn't have a resource type "csv"
[core@ip-10-0-149-70 ~]$ oc get csv -n openshift-operators
No resources found.

Comment 15 Evan Cordell 2019-02-25 16:20:44 UTC

*** Bug 1678606 has been marked as a duplicate of this bug. ***

Comment 19 David Eads 2019-02-26 15:07:50 UTC

Should this happen again, please be sure to gather the audit logs.  They are located on your masters at /var/log/kube-apiserver.  we need them from all masters.

If anyone can make a quick reproducer (or knows the root objects), you should set a finalizer on the root resources to prevent them from being fully removed.  Doing so will let you know if someone is deleting them.

From the audit logs we have, we can see the marketplace operator frequently deleting what appears to be teh root resource and GC cleaning up.


13:25:48 [DELETE][7.261ms] [200]  /apis/marketplace.redhat.com/v1alpha1/namespaces/openshift-marketplace/catalogsourceconfigs/community-operators [system:serviceaccount:openshift-marketplace:marketplace-operator]

Comment 20 Aravindh Puthiyaparambil 2019-02-26 17:09:02 UTC

(In reply to David Eads from comment #19)
> Should this happen again, please be sure to gather the audit logs.  They are
> located on your masters at /var/log/kube-apiserver.  we need them from all
> masters.
> 

We have not been able to reproduce this easily on our side but QE has seen it sporadically. I will pass on the info to them.

> From the audit logs we have, we can see the marketplace operator frequently
> deleting what appears to be teh root resource and GC cleaning up.

This I think is expected bad behaviour which we just fixed in https://github.com/operator-framework/operator-marketplace/pull/117. The issue QE was seeing was a CatalogSource we create in another namespace getting deleted. This CatalogSource was owner referenced across namespaces. We now know that this should not be done and are in the process of fixing it.

Comment 21 Evan Cordell 2019-02-26 20:42:56 UTC

https://github.com/operator-framework/operator-marketplace/pull/118

This PR may fix some of the issues with OLM

Comment 22 Kevin Rizza 2019-02-26 20:44:15 UTC

Created a PR to remove owner reference usage from Operator Marketplace:

https://github.com/operator-framework/operator-marketplace/pull/118

Comment 36 Fan Jia 2019-02-28 05:28:38 UTC

1.testenv:
clusterversion: 4.0.0-0.nightly-2019-02-27-213933
marketplace image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:db3741fada8ffb0417e20b58781d2028a17c2c7697c0965b248b6af8501ef312
2.The resources of marketplace (role, rolebinding, marketplace's pod, registory's pod, csc) stay stable.

Comment 37 Jian Zhang 2019-02-28 06:17:55 UTC

The cluster version:
[jzhang@dhcp-140-18 2019-02-27-213933]$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.0.0-0.nightly-2019-02-27-213933   True        False         3h2m    Cluster version is 4.0.0-0.nightly-2019-02-27-213933

The rolebinding looks good. And, the packageserver running for 3 hours without any restart.
[jzhang@dhcp-140-18 2019-02-27-213933]$ oc get rolebinding v1alpha1.packages.apps.redhat.com-cert -o yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  creationTimestamp: 2019-02-28T02:30:58Z
  name: v1alpha1.packages.apps.redhat.com-cert
  namespace: openshift-operator-lifecycle-manager
  ownerReferences:
  - apiVersion: operators.coreos.com/v1alpha1
    blockOwnerDeletion: false
    controller: false
    kind: ClusterServiceVersion
    name: packageserver.v0.8.1
    uid: e018a26e-3b00-11e9-b761-068a49731d2a
  resourceVersion: "7766"
  selfLink: /apis/rbac.authorization.k8s.io/v1/namespaces/openshift-operator-lifecycle-manager/rolebindings/v1alpha1.packages.apps.redhat.com-cert
  uid: e2069a36-3b00-11e9-8457-027e0049c756
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: v1alpha1.packages.apps.redhat.com-cert
subjects:
- kind: ServiceAccount
  name: packageserver
  namespace: openshift-operator-lifecycle-manager

[jzhang@dhcp-140-18 2019-02-27-213933]$ oc get pods
NAME                                READY   STATUS    RESTARTS   AGE
catalog-operator-5459c85995-pwmmr   1/1     Running   0          3h12m
olm-operator-766bfd6747-crgm7       1/1     Running   0          3h12m
olm-operators-ttmhm                 1/1     Running   0          3h11m
packageserver-8855cb457-gmtt2       1/1     Running   0          3h11m
packageserver-8855cb457-jp9pd       1/1     Running   0          3h11m

No garbage delete logs found.
[jzhang@dhcp-140-18 2019-02-27-213933]$ oc logs kube-controller-manager-ip-10-0-128-205.us-east-2.compute.internal -n openshift-kube-controller-manager|grep garbage|grep delete
[jzhang@dhcp-140-18 2019-02-27-213933]$ oc logs kube-controller-manager-ip-10-0-146-107.us-east-2.compute.internal -n openshift-kube-controller-manager|grep garbage|grep delete
[jzhang@dhcp-140-18 2019-02-27-213933]$ oc logs kube-controller-manager-ip-10-0-171-50.us-east-2.compute.internal -n openshift-kube-controller-manager|grep garbage|grep delete

[core@ip-10-0-171-50 ~]$ sudo cat /var/log/kube-apiserver/audit.log |grep garbage |grep delete
[core@ip-10-0-171-50 ~]$ 
[core@ip-10-0-146-107 ~]$ sudo cat /var/log/kube-apiserver/audit.log |grep garbage |grep delete
[core@ip-10-0-146-107 ~]$ 
[core@ip-10-0-128-205 ~]$ sudo cat /var/log/kube-apiserver/audit.log |grep garbage |grep delete
[core@ip-10-0-128-205 ~]$ 

These InstallPlan/csvs are generated as expected.
[jzhang@dhcp-140-18 2019-02-27-213933]$ oc get sub -n default
NAME          PACKAGE       SOURCE                        CHANNEL
amq-streams   amq-streams   installed-redhat-default      final
prometheus    prometheus    installed-community-default   preview
[jzhang@dhcp-140-18 2019-02-27-213933]$ oc get pods -n default
NAME                                       READY   STATUS    RESTARTS   AGE
prometheus-operator-b7b7ff54-gf6nx         1/1     Running   0          18m
strimzi-cluster-operator-c56bf7877-pjs22   1/1     Running   0          21m

[jzhang@dhcp-140-18 2019-02-27-213933]$ oc get csv -n openshift-operators
NAME                        DISPLAY              VERSION   REPLACES                    PHASE
couchbase-operator.v1.1.0   Couchbase Operator   1.1.0     couchbase-operator.v1.0.0   Failed
etcdoperator.v0.9.2         etcd                 0.9.2     etcdoperator.v0.9.0         Succeeded
jaeger-operator.v1.8.2      jaeger-operator      1.8.2                                 Succeeded
mongodboperator.v0.3.2      MongoDB              0.3.2                                 Failed

[jzhang@dhcp-140-18 2019-02-27-213933]$ oc get pods -n openshift-operators
NAME                                           READY   STATUS             RESTARTS   AGE
couchbase-operator-855b57b84c-qfznv            0/1     ImagePullBackOff   0          3h8m
etcd-operator-69d5fd44f7-j475r                 3/3     Running            0          116m
jaeger-operator-7785d8c948-jj4qg               1/1     Running            0          22m
mongodb-enterprise-operator-5db87fd68b-5ppmr   0/1     ImagePullBackOff   0          173m

PS: please ignore couchbase/mongodb pods errors, this due to the lack of the auth of the "registry.connect.redhat.com" in current cluster.

Marketplace image info:
             io.openshift.build.commit.id=fedd694eb7d926ce11945f1eaa452e0fe7f8a25f
             io.openshift.build.commit.url=https://github.com/operator-framework/operator-marketplace/commit/fedd694eb7d926ce11945f1eaa452e0fe7f8a25f
             io.openshift.build.source-location=https://github.com/operator-framework/operator-marketplace

OLM image info:
               io.openshift.build.commit.id=918678103b5cf9ddfa1d196560e9828f6aa30c52
               io.openshift.build.commit.url=https://github.com/operator-framework/operator-lifecycle-manager/commit/918678103b5cf9ddfa1d196560e9828f6aa30c52
               io.openshift.build.source-location=https://github.com/operator-framework/operator-lifecycle-manager

LGTM, verify it.

Comment 40 errata-xmlrpc 2019-06-04 10:44:14 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0758

Note You need to log in before you can comment on or make changes to this bug.