Bug 1771747

Summary: Operator installed via OperatorHub/marketplace crashing on pod creation with network error
Product: OpenShift Container Platform
Component: OLM
OLM sub component: OperatorHub
Version: 4.2.0
Target Release: 4.2.z
Hardware: Unspecified
OS: Unspecified
Status: CLOSED DUPLICATE
Severity: high
Priority: high
Reporter: Rogerio Bastos <rbastos>
Assignee: Evan Cordell <ecordell>
QA Contact: Fan Jia <jfan>
CC: cblecker, dageoffr, haowang, jeder, nhale, nmalik, rbastos
Last Closed: 2019-11-13 15:01:01 UTC
Type: Bug
Attachments:
Description           Flags
operator deployment   none
oc describe pod       none

Description Rogerio Bastos 2019-11-12 21:47:24 UTC
Created attachment 1635508 [details]
operator deployment

Description of problem:
An operator deployed by a customer from the certified-operators repo is crashing when creating a pod from the deployment object, with a network error.

Multiple bogus ReplicaSets are being created in the cluster, to the point where they start to degrade the responsiveness of the cluster API.

Version-Release number of selected component (if applicable):
OpenShift 4.2.2

How reproducible:
This same behavior was identified in 2 customer clusters, always in deployments created in the openshift-marketplace NS, when customers deploy operators from the marketplace/OperatorHub. It was observed with several different operators.


Steps to Reproduce:
1. Customer deploys an operator into the openshift-marketplace NS.
2. Wait until the deployment is created and its pods start crashing.
3. After a couple of days the cluster will have thousands of failed ReplicaSets, impacting cluster API responsiveness.

Actual results:

Example from openshift-marketplace NS, pods:
NAME                                                  READY   STATUS        RESTARTS   AGE
installed-certified-brandon-images-58bf78b4c6-c59n6   1/1     Running       0          131m
installed-certified-brandon-images-5d557857fb-xflrp   0/1     Terminating   0          3s
installed-certified-brandon-images-65cc8f96f8-n9pjr   0/1     Terminating   0          5s
installed-certified-brandon-images-6c6bfc9f66-xsdzs   0/1     Terminating   0          1s
installed-certified-brandon-images-6f7d85fb78-ftmqn   0/1     Terminating   0          6s
installed-certified-brandon-images-7b4ff448cb-9rbzb   0/1     Terminating   0          8s
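
Every pod in the listing above carries a different middle segment in its name. Assuming the usual Deployment naming scheme of `<deployment>-<pod-template-hash>-<suffix>`, each short-lived pod therefore belongs to a different ReplicaSet, which is consistent with the ReplicaSet count reported further down. A minimal Python sketch for summarising such a listing follows. The function name `summarize_pods` is illustrative and not part of any tool, and the input is assumed to be the plain-text output of `oc get pods`, header line included.

```python
from collections import Counter


def summarize_pods(listing: str):
    """Return (status counts, set of ReplicaSet names) from `oc get pods` text.

    Assumes Deployment-managed pods named <deployment>-<hash>-<suffix>,
    so the ReplicaSet name is the pod name without its last segment.
    """
    statuses = Counter()
    replicasets = set()
    for line in listing.strip().splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 3:
            continue  # ignore blank or truncated lines
        name, status = fields[0], fields[2]
        statuses[status] += 1
        replicasets.add(name.rsplit("-", 1)[0])
    return statuses, replicasets


# First three rows of the listing from this report.
SAMPLE = """\
NAME                                                  READY   STATUS        RESTARTS   AGE
installed-certified-brandon-images-58bf78b4c6-c59n6   1/1     Running       0          131m
installed-certified-brandon-images-5d557857fb-xflrp   0/1     Terminating   0          3s
installed-certified-brandon-images-65cc8f96f8-n9pjr   0/1     Terminating   0          5s
"""

if __name__ == "__main__":
    statuses, replicasets = summarize_pods(SAMPLE)
    print(dict(statuses), len(replicasets))
```

When the number of distinct ReplicaSets grows as fast as the number of pods, the Deployment's pod template is probably being rewritten on every reconcile rather than a single pod being restarted.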


Recurring errors seen when running: oc get events
46m         Normal    Scheduled                pod/installed-certified-brandon-images-fff78bc57-fjctc                Successfully assigned openshift-marketplace/installed-certified-brandon-images-fff78bc57-fjctc to ip-10-0-130-71.ca-central-1.compute.internal
45m         Warning   FailedCreatePodSandBox   pod/installed-certified-brandon-images-fff78bc57-fjctc                Failed create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_installed-certified-brandon-images-fff78bc57-fjctc_openshift-marketplace_77985f6e-058d-11ea-8b30-02637b2b3cea_0(e81726545e10a158414a8438639d33224de03421eca39607e80b252ff305d08e): Multus: Err adding pod to network "openshift-sdn": cannot set "openshift-sdn" ifname to "eth0": no netns: failed to Statfs "/proc/98478/ns/net": no such file or directory
46m         Normal    SuccessfulCreate         replicaset/installed-certified-brandon-images-fff78bc57               Created pod: installed-certified-brandon-images-fff78bc57-fjctc
46m         Normal    SuccessfulDelete         replicaset/installed-certified-brandon-images-fff78bc57               Deleted pod: installed-certified-brandon-images-fff78bc57-fjctc
109m        Normal    Scheduled                pod/installed-certified-brandon-images-fffb5c476-l5rfb                Successfully assigned openshift-marketplace/installed-certified-brandon-images-fffb5c476-l5rfb to ip-10-0-138-18.ca-central-1.compute.internal
109m        Warning   FailedCreatePodSandBox   pod/installed-certified-brandon-images-fffb5c476-l5rfb                Failed create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_installed-certified-brandon-images-fffb5c476-l5rfb_openshift-marketplace_a17da02a-0584-11ea-8b30-02637b2b3cea_0(7f8779b10a58ed39ef373eb43f99a7a4807b6df31cb76d70ae9d7f72b7053ba4): Multus: Err adding pod to network "openshift-sdn": cannot set "openshift-sdn" ifname to "eth0": no netns: failed to Statfs "/proc/5807/ns/net": no such file or directory

Operator Deployment:
(attached to this BZ)


Count of replicasets:
$ oc get replicaset | grep brandon | wc -l
6532

In another cluster, more than 27k bogus ReplicaSets were identified.


Additional info:

An example of `oc describe pod` is attached as well

Comment 1 Rogerio Bastos 2019-11-12 21:47:45 UTC
Created attachment 1635509 [details]
oc describe pod

Comment 2 Naveen Malik 2019-11-12 22:41:46 UTC
On this cluster I noticed an error with the CatalogSourceConfig:

$ oc get catalogsourceconfig -n openshift-marketplace
NAME                                 STATUS        MESSAGE                                 AGE
installed-certified-brandon-images   Configuring   namespaces "brandon-images" not found   50d

Comment 4 Rogerio Bastos 2019-11-13 15:01:01 UTC
A test was done to confirm that the bad CatalogSourceConfig was causing the issue:

1) Problem in the CatalogSourceConfig object

$ oc get catalogsourceconfig installed-certified-rhopp1
NAME                         STATUS        MESSAGE                                       AGE
installed-certified-rhopp1   Configuring   namespaces "rhopp1" not found                 33d

2) Missing NS created: $ oc new-project rhopp1

3) Issue fixed:

$ oc get catalogsourceconfig installed-certified-rhopp1
NAME                         STATUS      MESSAGE                                       AGE
installed-certified-rhopp1   Succeeded   The object has been successfully reconciled   33d

$ oc get pods          
NAME                                               READY   STATUS    RESTARTS   AGE
installed-certified-rhopp1-7bdfd77876-st8bg        1/1     Running   0          64m
marketplace-operator-d6774f7b6-4pcv4               1/1     Running   0          2d16h
osd-curated-certified-operators-66488dd446-vn2lh   1/1     Running   0          2d16h
osd-curated-community-operators-6bdcf8d74c-mr4kc   1/1     Running   0          2d16h
osd-curated-redhat-operators-54944c5bc4-59t45      1/1     Running   0          18h
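
The check in step 1 can be scripted when many CatalogSourceConfig objects need to be inspected. This is a minimal sketch, assuming the plain-text `oc get catalogsourceconfig` output format shown above and the `namespaces "<name>" not found` message wording. `missing_namespaces` is a hypothetical helper name, not an existing tool.

```python
import re

# Matches the message wording seen in this report, e.g.
#   namespaces "rhopp1" not found
NOT_FOUND = re.compile(r'namespaces "([^"]+)" not found')


def missing_namespaces(listing: str) -> dict:
    """Map each stuck CatalogSourceConfig name to the namespace it is waiting for."""
    result = {}
    for line in listing.strip().splitlines()[1:]:  # skip the header row
        fields = line.split()
        match = NOT_FOUND.search(line)
        # Only report objects that are still in the Configuring state.
        if len(fields) >= 2 and fields[1] == "Configuring" and match:
            result[fields[0]] = match.group(1)
    return result


# Output copied from step 1 of this comment.
SAMPLE = """\
NAME                         STATUS        MESSAGE                                       AGE
installed-certified-rhopp1   Configuring   namespaces "rhopp1" not found                 33d
"""

if __name__ == "__main__":
    for name, namespace in missing_namespaces(SAMPLE).items():
        print(f"{name}: target namespace {namespace!r} does not exist")
```

Each namespace reported this way can then be recreated as in step 2.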

*** This bug has been marked as a duplicate of bug 1767547 ***

Comment 5 Dan Geoffroy 2019-11-14 15:17:42 UTC
For easy reference, the cloned 4.2.z version of the duplicate bug above is https://bugzilla.redhat.com/show_bug.cgi?id=1769841 and is awaiting cherry-pick for the next 4.2.z release.