Description of problem:

Multiple bugs have been opened against OLM and Marketplace where QE found OLM or Marketplace functionality not working because resources were missing that had no reason to be missing. After investigating, it looks like the garbage collector is erroneously deleting these resources. This was confirmed by looking at the audit logs of the api-server. Here is a snippet of the log:

cat ~/debug-files/audit.log | grep 'packageserver' | grep delete | grep deployments

ip-10-0-158-107.us-east-2.compute.internal {"kind":"Event","apiVersion":"audit.k8s.io/v1beta1","metadata":{"creationTimestamp":"2019-02-19T21:35:48Z"},"level":"Metadata","timestamp":"2019-02-19T21:35:48Z","auditID":"34cd42cd-cd73-41bc-a982-5d345a1bebf0","stage":"ResponseComplete","requestURI":"/apis/apps/v1/namespaces/openshift-operator-lifecycle-manager/deployments/packageserver","verb":"delete","user":{"username":"system:serviceaccount:kube-system:generic-garbage-collector","uid":"31a4ce88-33b2-11e9-b126-02ed3ac405e6","groups":["system:serviceaccounts","system:serviceaccounts:kube-system","system:authenticated"]},"sourceIPs":["::1"],"userAgent":"hyperkube/v1.12.4+1221cce (linux/amd64) kubernetes/1221cce/system:serviceaccount:kube-system:generic-garbage-collector","objectRef":{"resource":"deployments","namespace":"openshift-operator-lifecycle-manager","name":"packageserver","apiGroup":"apps","apiVersion":"v1"},"responseStatus":{"metadata":{},"status":"Success","code":200},"requestReceivedTimestamp":"2019-02-19T21:35:48.663667Z","stageTimestamp":"2019-02-19T21:35:48.708828Z","annotations":{"authorization.k8s.io/decision":"allow","authorization.k8s.io/reason":"RBAC: allowed by ClusterRoleBinding \"system:controller:generic-garbage-collector\" of ClusterRole \"system:controller:generic-garbage-collector\" to ServiceAccount \"generic-garbage-collector/kube-system\""}}

In the above instance the OLM packageserver deployment was garbage collected when there was no reason for that to happen.

Steps to Reproduce:
1. Bring up a cluster.
2. Wait for many hours (or maybe not :-)).
3. Check resources that you expect to be present. We have seen various examples, shown in the bugs that will be marked as duplicates of this bug.
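For anyone else debugging this, two quick checks that should narrow things down (a rough sketch; the resource names are just the packageserver example from the log above):

# Show the ownerReferences the GC walks when deciding whether an object is collectable
oc get deployment packageserver -n openshift-operator-lifecycle-manager \
  -o jsonpath='{.metadata.ownerReferences}{"\n"}'

# On each master, grep the api-server audit log for GC-initiated deletes
sudo grep 'generic-garbage-collector' /var/log/kube-apiserver/audit.log | grep '"verb":"delete"'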
*** Bug 1677329 has been marked as a duplicate of this bug. ***
*** Bug 1678654 has been marked as a duplicate of this bug. ***
*** Bug 1666225 has been marked as a duplicate of this bug. ***
I can reproduce this issue on a cluster with the CVO disabled. Steps:
- Bring up a new cluster.
- Make sure packageserver is up and running.
- Scale down the CVO:
  oc scale --replicas 0 -n openshift-cluster-version deployments/cluster-version-operator
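To catch the deletion when it actually happens instead of noticing hours later, a simple poll like the following can be left running against the cluster (just a sketch; the 30s interval and the packageserver deployment are only an example):

while true; do
  # report the moment the deployment can no longer be fetched
  if ! oc get deployment packageserver -n openshift-operator-lifecycle-manager >/dev/null 2>&1; then
    echo "$(date -u +%FT%TZ) packageserver deployment is GONE"
  fi
  sleep 30
done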
*** Bug 1678644 has been marked as a duplicate of this bug. ***
All the resources of marketplace will remain stable after stopping the cluster-version-operator.
(In reply to Fan Jia from comment #8)
> All the resources of marketplace will remain stable after stopping the
> cluster-version-operator.

This is most likely a red herring given that this issue is sporadic.
The CatalogSourceConfig and the Marketplace pods still crash in this env: clusterversion 4.0.0-0.nightly-2019-02-24-045124. It is the same situation as https://bugzilla.redhat.com/show_bug.cgi?id=1678654
Marketplace still crashes on the latest cluster version, and subscribing fails. The CatalogSourceConfig can't be updated after a while. The cluster keeps losing resources (the csc CRD, the packagemanifest CRD, roles, rolebindings, ...), which may or may not recover. For example, the `csv` resource type disappears and then comes back:

[core@ip-10-0-149-70 ~]$ oc get csv -n openshift-operators
error: the server doesn't have a resource type "csv"
[core@ip-10-0-149-70 ~]$ oc get csv -n openshift-operators
No resources found.
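When the type "disappears" like that, it is worth checking whether the CRD object itself was deleted or whether API discovery is just stale; a quick check, assuming the stock OLM/Marketplace names:

# the CSV and CatalogSourceConfig types are CRDs
oc get crd clusterserviceversions.operators.coreos.com catalogsourceconfigs.marketplace.redhat.com

# packagemanifests come from the aggregated packageserver, so check its APIService instead
oc get apiservice v1alpha1.packages.apps.redhat.com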
*** Bug 1678606 has been marked as a duplicate of this bug. ***
Should this happen again, please be sure to gather the audit logs. They are located on your masters at /var/log/kube-apiserver. We need them from all masters.

If anyone can make a quick reproducer (or knows the root objects), you should set a finalizer on the root resources to prevent them from being fully removed. Doing so will let you know if someone is deleting them.

From the audit logs we have, we can see the marketplace operator frequently deleting what appears to be the root resource and GC cleaning up:

13:25:48 [DELETE][7.261ms] [200] /apis/marketplace.redhat.com/v1alpha1/namespaces/openshift-marketplace/catalogsourceconfigs/community-operators [system:serviceaccount:openshift-marketplace:marketplace-operator]
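Following up on the finalizer suggestion, a minimal sketch of pinning the suspected root object so a delete stalls (deletionTimestamp set, object still visible) instead of completing; the finalizer string here is made up for debugging and must be removed by hand afterwards:

# note: a JSON merge patch replaces the whole finalizers list, which is fine only
# if the object has no other finalizers of its own
oc patch catalogsourceconfig community-operators -n openshift-marketplace \
  --type=merge -p '{"metadata":{"finalizers":["debug.openshift.io/hold-for-gc-investigation"]}}'

# once the deleter shows up in the audit log, release the object again
oc patch catalogsourceconfig community-operators -n openshift-marketplace \
  --type=merge -p '{"metadata":{"finalizers":[]}}'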
(In reply to David Eads from comment #19)
> Should this happen again, please be sure to gather the audit logs. They are
> located on your masters at /var/log/kube-apiserver. We need them from all
> masters.

We have not been able to reproduce this easily on our side, but QE has seen it sporadically. I will pass the info on to them.

> From the audit logs we have, we can see the marketplace operator frequently
> deleting what appears to be the root resource and GC cleaning up.

This, I think, is expected bad behaviour which we just fixed in https://github.com/operator-framework/operator-marketplace/pull/117. The issue QE was seeing was a CatalogSource we create in another namespace getting deleted. This CatalogSource was owner-referenced across namespaces. We now know that this should not be done and are in the process of fixing it.
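For context on why a cross-namespace owner reference ends with GC deleting the dependent: an ownerReference records kind/name/uid but no namespace, so the garbage collector looks the owner up in the dependent's own namespace, does not find it there, treats the dependent as orphaned, and deletes it. A quick way to spot such references (sketch only; the CatalogSource name and namespace are illustrative):

# ownerReferences carry no namespace field, so an owner that actually lives in
# openshift-marketplace cannot be resolved from the subscribed namespace
oc get catalogsource installed-community-default -n default \
  -o jsonpath='{range .metadata.ownerReferences[*]}{.kind}{"/"}{.name}{" uid="}{.uid}{"\n"}{end}'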
This PR may fix some of the issues with OLM: https://github.com/operator-framework/operator-marketplace/pull/118
Created a PR to remove owner reference usage from Operator Marketplace: https://github.com/operator-framework/operator-marketplace/pull/118
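Once a build with that PR lands, verifying it should just be a matter of confirming the Marketplace-created objects no longer carry any ownerReferences, e.g. (object names taken from the examples in this bug; expect empty output):

for obj in "catalogsourceconfig community-operators -n openshift-marketplace" \
           "catalogsource installed-community-default -n default"; do
  echo "== $obj"
  # print any remaining owner references on the object
  oc get $obj -o jsonpath='{.metadata.ownerReferences}{"\n"}'
done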
1. Test env:
   clusterversion: 4.0.0-0.nightly-2019-02-27-213933
   marketplace image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:db3741fada8ffb0417e20b58781d2028a17c2c7697c0965b248b6af8501ef312
2. The Marketplace resources (role, rolebinding, marketplace pod, registry pod, csc) stay stable.
The cluster version:

[jzhang@dhcp-140-18 2019-02-27-213933]$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.0.0-0.nightly-2019-02-27-213933   True        False         3h2m    Cluster version is 4.0.0-0.nightly-2019-02-27-213933

The rolebinding looks good, and the packageserver has been running for 3 hours without any restarts.

[jzhang@dhcp-140-18 2019-02-27-213933]$ oc get rolebinding v1alpha1.packages.apps.redhat.com-cert -o yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  creationTimestamp: 2019-02-28T02:30:58Z
  name: v1alpha1.packages.apps.redhat.com-cert
  namespace: openshift-operator-lifecycle-manager
  ownerReferences:
  - apiVersion: operators.coreos.com/v1alpha1
    blockOwnerDeletion: false
    controller: false
    kind: ClusterServiceVersion
    name: packageserver.v0.8.1
    uid: e018a26e-3b00-11e9-b761-068a49731d2a
  resourceVersion: "7766"
  selfLink: /apis/rbac.authorization.k8s.io/v1/namespaces/openshift-operator-lifecycle-manager/rolebindings/v1alpha1.packages.apps.redhat.com-cert
  uid: e2069a36-3b00-11e9-8457-027e0049c756
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: v1alpha1.packages.apps.redhat.com-cert
subjects:
- kind: ServiceAccount
  name: packageserver
  namespace: openshift-operator-lifecycle-manager

[jzhang@dhcp-140-18 2019-02-27-213933]$ oc get pods
NAME                                READY   STATUS    RESTARTS   AGE
catalog-operator-5459c85995-pwmmr   1/1     Running   0          3h12m
olm-operator-766bfd6747-crgm7       1/1     Running   0          3h12m
olm-operators-ttmhm                 1/1     Running   0          3h11m
packageserver-8855cb457-gmtt2       1/1     Running   0          3h11m
packageserver-8855cb457-jp9pd       1/1     Running   0          3h11m

No garbage-collector delete logs found:

[jzhang@dhcp-140-18 2019-02-27-213933]$ oc logs kube-controller-manager-ip-10-0-128-205.us-east-2.compute.internal -n openshift-kube-controller-manager | grep garbage | grep delete
[jzhang@dhcp-140-18 2019-02-27-213933]$ oc logs kube-controller-manager-ip-10-0-146-107.us-east-2.compute.internal -n openshift-kube-controller-manager | grep garbage | grep delete
[jzhang@dhcp-140-18 2019-02-27-213933]$ oc logs kube-controller-manager-ip-10-0-171-50.us-east-2.compute.internal -n openshift-kube-controller-manager | grep garbage | grep delete

[core@ip-10-0-171-50 ~]$ sudo cat /var/log/kube-apiserver/audit.log | grep garbage | grep delete
[core@ip-10-0-171-50 ~]$
[core@ip-10-0-146-107 ~]$ sudo cat /var/log/kube-apiserver/audit.log | grep garbage | grep delete
[core@ip-10-0-146-107 ~]$
[core@ip-10-0-128-205 ~]$ sudo cat /var/log/kube-apiserver/audit.log | grep garbage | grep delete
[core@ip-10-0-128-205 ~]$

These InstallPlans/CSVs are generated as expected.
[jzhang@dhcp-140-18 2019-02-27-213933]$ oc get sub -n default
NAME          PACKAGE       SOURCE                        CHANNEL
amq-streams   amq-streams   installed-redhat-default      final
prometheus    prometheus    installed-community-default   preview

[jzhang@dhcp-140-18 2019-02-27-213933]$ oc get pods -n default
NAME                                       READY   STATUS    RESTARTS   AGE
prometheus-operator-b7b7ff54-gf6nx         1/1     Running   0          18m
strimzi-cluster-operator-c56bf7877-pjs22   1/1     Running   0          21m

[jzhang@dhcp-140-18 2019-02-27-213933]$ oc get csv -n openshift-operators
NAME                        DISPLAY              VERSION   REPLACES                    PHASE
couchbase-operator.v1.1.0   Couchbase Operator   1.1.0     couchbase-operator.v1.0.0   Failed
etcdoperator.v0.9.2         etcd                 0.9.2     etcdoperator.v0.9.0         Succeeded
jaeger-operator.v1.8.2      jaeger-operator      1.8.2                                 Succeeded
mongodboperator.v0.3.2      MongoDB              0.3.2                                 Failed

[jzhang@dhcp-140-18 2019-02-27-213933]$ oc get pods -n openshift-operators
NAME                                           READY   STATUS             RESTARTS   AGE
couchbase-operator-855b57b84c-qfznv            0/1     ImagePullBackOff   0          3h8m
etcd-operator-69d5fd44f7-j475r                 3/3     Running            0          116m
jaeger-operator-7785d8c948-jj4qg               1/1     Running            0          22m
mongodb-enterprise-operator-5db87fd68b-5ppmr   0/1     ImagePullBackOff   0          173m

PS: please ignore the couchbase/mongodb pod errors; they are due to missing auth for "registry.connect.redhat.com" in the current cluster.

Marketplace image info:
io.openshift.build.commit.id=fedd694eb7d926ce11945f1eaa452e0fe7f8a25f
io.openshift.build.commit.url=https://github.com/operator-framework/operator-marketplace/commit/fedd694eb7d926ce11945f1eaa452e0fe7f8a25f
io.openshift.build.source-location=https://github.com/operator-framework/operator-marketplace

OLM image info:
io.openshift.build.commit.id=918678103b5cf9ddfa1d196560e9828f6aa30c52
io.openshift.build.commit.url=https://github.com/operator-framework/operator-lifecycle-manager/commit/918678103b5cf9ddfa1d196560e9828f6aa30c52
io.openshift.build.source-location=https://github.com/operator-framework/operator-lifecycle-manager

LGTM, verified.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0758