Bug 1638726

Summary: controller-manager pod stops responding with maximum memory usage and a lot of open TCP sockets
Product: OpenShift Container Platform Reporter: Sudarshan Chaudhari <suchaudh>
Component: Service Catalog    Assignee: Jay Boyd <jaboyd>
Status: CLOSED ERRATA QA Contact: Jian Zhang <jiazha>
Severity: medium Docs Contact:
Priority: unspecified    
Version: 3.9.0    CC: dyan, jaboyd, jfan, jiazha, jlee, zitang
Target Milestone: ---   
Target Release: 3.9.z   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version:    Doc Type: Bug Fix
Doc Text:
Cause: The OSB Client Library used by the Service Catalog controller pod was not closing and freeing the TCP connections it used to communicate with Brokers. Consequence: Over time many TCP connections would remain open; eventually communication between the Service Catalog controller and Brokers would fail, and the pod would become unresponsive. Fix: Use a new version of the OSB Client Library which contains a fix to close connections when finishing an HTTP request and to free idle connections.
Story Points: ---
Clone Of:
Clones: 1641791, 1641796    Environment:
Last Closed: 2018-12-13 19:27:05 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1641791, 1641796    
Attachments:
Description Flags
Memory Metrics of the pod "controller-manager." none

Description Sudarshan Chaudhari 2018-10-12 10:38:06 UTC
Created attachment 1493243 [details]
Memory Metrics of the pod "controller-manager."

Description of problem:

Template provisioning from the template-service-broker often fails; after deleting/restarting the controller-manager pod, deployments start working properly again. 

This time, instead of deleting the pod, we inspected it and found a very large number of open file descriptors, most of them sockets, along with a correspondingly large number of open TCP connections. Most of those connections are to 172.30.226.105:443 (hex notation 69E21EAC:01BB, decoded in the sketch after the listing below), which is the cluster IP of the `apiserver` service in the `openshift-template-service-broker` project.

```
sh-4.2$ ls /proc/1/fd | wc -l
32000
sh-4.2$ head -n25 /proc/1/net/tcp
  sl  local_address rem_address   st tx_queue rx_queue tr tm->when retrnsmt   uid  timeout inode
   0: 8604800A:BAE3 69E21EAC:01BB 01 00000000:00000000 00:00000000 00000000 1000120000        0 28996712 1 ffff9dad77ea5540 20 4 0 10 -1
   1: 8604800A:D853 69E21EAC:01BB 01 00000000:00000000 00:00000000 00000000 1000120000        0 18955000 1 ffff9dad79a4c5c0 20 4 0 10 -1
   2: 8604800A:B65E 8D461EAC:053A 01 00000000:00000000 00:00000000 00000000 1000120000        0 7074283 1 ffff9da6a7205540 20 4 1 10 -1
   3: 8604800A:99AA 69E21EAC:01BB 01 00000000:00000000 00:00000000 00000000 1000120000        0 20225303 1 ffff9dac0ee31740 20 4 0 10 -1
   4: 8604800A:B5E0 69E21EAC:01BB 01 00000000:00000000 00:00000000 00000000 1000120000        0 1415561 1 ffff9dac396f7440 20 4 0 10 -1
   5: 8604800A:BD36 69E21EAC:01BB 01 00000000:00000000 00:00000000 00000000 1000120000        0 12484495 1 ffff9da7a7f207c0 20 4 0 10 -1
   6: 8604800A:DCDD 69E21EAC:01BB 01 00000000:00000000 00:00000000 00000000 1000120000        0 9205861 1 ffff9dac0a333e00 20 4 0 10 -1
   7: 8604800A:CEE3 69E21EAC:01BB 01 00000000:00000000 00:00000000 00000000 1000120000        0 19934683 1 ffff9dad8f761f00 20 4 0 10 -1
   8: 8604800A:8744 8D461EAC:053A 01 00000000:00000000 00:00000000 00000000 1000120000        0 30344225 1 ffff9d967e7cb640 20 4 0 10 -1
   9: 8604800A:CF64 69E21EAC:01BB 01 00000000:00000000 00:00000000 00000000 1000120000        0 7938304 1 ffff9da57afeb640 20 4 0 10 -1
  10: 8604800A:C06D 8D461EAC:053A 01 00000000:00000000 00:00000000 00000000 1000120000        0 26188930 1 ffff9dad3b303640 20 4 0 10 -1
  11: 8604800A:8FAB 69E21EAC:01BB 01 00000000:00000000 00:00000000 00000000 1000120000        0 20171682 1 ffff9d8e73d79f00 20 4 0 10 -1
  12: 8604800A:8DFA 69E21EAC:01BB 01 00000000:00000000 00:00000000 00000000 1000120000        0 6412799 1 ffff9da8e8671740 20 4 0 10 -1
  13: 8604800A:D4D6 69E21EAC:01BB 01 00000000:00000000 00:00000000 00000000 1000120000        0 18936076 1 ffff9d9d52242e80 20 4 0 10 -1
  14: 8604800A:890F 69E21EAC:01BB 01 00000000:00000000 00:00000000 00000000 1000120000        0 21699608 1 ffff9dad70730f80 20 4 0 10 -1
  15: 8604800A:EA52 69E21EAC:01BB 01 00000000:00000000 00:00000000 00000000 1000120000        0 2846040 1 ffff9daa72a0be00 20 4 0 10 -1
  16: 8604800A:C97F 69E21EAC:01BB 01 00000000:00000000 00:00000000 00000000 1000120000        0 26505431 1 ffff9dad8ab95d00 20 4 0 10 -1
  17: 8604800A:C6A3 69E21EAC:01BB 01 00000000:00000000 00:00000000 00000000 1000120000        0 342953 1 ffff9dadef6c1740 20 4 0 10 -1
  18: 8604800A:B919 69E21EAC:01BB 01 00000000:00000000 00:00000000 00000000 1000120000        0 23986295 1 ffff9d981ce09740 20 4 0 10 -1
  19: 8604800A:AE6B 69E21EAC:01BB 01 00000000:00000000 00:00000000 00000000 1000120000        0 21858238 1 ffff9d9ab02a0f80 20 4 0 10 -1
  20: 8604800A:C010 69E21EAC:01BB 01 00000000:00000000 00:00000000 00000000 1000120000        0 6074854 1 ffff9da6df7d1f00 20 4 0 10 -1
  21: 8604800A:DD2F 69E21EAC:01BB 01 00000000:00000000 00:00000000 00000000 1000120000        0 8022146 1 ffff9da5552587c0 20 4 0 10 -1
  22: 8604800A:893D 8D461EAC:053A 01 00000000:00000000 00:00000000 00000000 1000120000        0 30338376 1 ffff9dad763fec80 20 4 0 10 -1
  23: 8604800A:B06A 69E21EAC:01BB 01 00000000:00000000 00:00000000 00000000 1000120000        0 1384865 1 ffff9dad6b64be00 20 4 0 10 -1
```
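
For reference, the remote addresses in /proc/net/tcp are little-endian hex. A small stand-alone sketch (not part of any OpenShift component; the file name and variable names are ours) that decodes the entry seen above:

```
// Small stand-alone helper (not part of any OpenShift component) that decodes
// the little-endian hex addresses printed in /proc/<pid>/net/tcp.
package main

import (
	"encoding/binary"
	"fmt"
	"net"
	"strconv"
	"strings"
)

func main() {
	entry := "69E21EAC:01BB" // remote address seen on most of the leaked sockets

	parts := strings.Split(entry, ":")
	ipRaw, _ := strconv.ParseUint(parts[0], 16, 32)
	port, _ := strconv.ParseUint(parts[1], 16, 16)

	// /proc/net/tcp stores the IPv4 address as a little-endian 32-bit value.
	ip := make(net.IP, 4)
	binary.LittleEndian.PutUint32(ip, uint32(ipRaw))

	fmt.Printf("%s:%d\n", ip, port) // prints 172.30.226.105:443
}
```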

In addition to that, we observed that the memory usage of the faulty `controller-manager` pod increased over time.  

I have attached a screenshot of the memory usage graph for that pod.

Looking at the kubernetes-incubator/service-catalog GitHub repository, I found this issue regarding a memory leak inside the controller manager: 

            https://github.com/kubernetes-incubator/service-catalog/issues/2276

It seems that this was fixed upstream by closing the HTTP response body in the service broker client: 

            https://github.com/pmorie/go-open-service-broker-client/pull/131/files
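
For reference, that fix boils down to the standard Go pattern of draining and closing the HTTP response body so the underlying TCP connection can be reused or released. A minimal sketch of that pattern (illustrative only, not the actual go-open-service-broker-client code; the function name and URL path are ours):

```
// Illustrative sketch of the leak pattern and its fix; this is not the actual
// go-open-service-broker-client code.
package example

import (
	"io"
	"io/ioutil"
	"net/http"
)

// getCatalog issues a broker request and, crucially, drains and closes the
// response body. Skipping the Close leaves the TCP connection held open,
// which is how sockets to the broker's apiserver service pile up over time.
func getCatalog(client *http.Client, brokerURL string) error {
	resp, err := client.Get(brokerURL + "/v2/catalog")
	if err != nil {
		return err
	}
	defer resp.Body.Close()

	// Draining the body lets the connection go back into the keep-alive pool
	// instead of being leaked.
	_, err = io.Copy(ioutil.Discard, resp.Body)
	return err
}
```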

Now, we want to know: will this fix be backported to OpenShift 3.9?

If so, can you share the status so that we know when to expect a final fix?


How reproducible:
Randomly.

Actual results:
The pod becomes unresponsive and needs to be deleted before deployments work properly again. We also noticed that the controller-manager should have a liveness/readiness probe so this state can be detected; we have raised another bug for that:
https://bugzilla.redhat.com/show_bug.cgi?id=1630324

Expected results:
The controller-manager should keep working properly without needing to be restarted.

Comment 1 Jay Boyd 2018-10-15 12:47:53 UTC
I will investigate backporting two fixes made to the go-open-service-broker package to 1.9 and 1.10:  https://github.com/pmorie/go-open-service-broker-client/pull/131 and https://github.com/pmorie/go-open-service-broker-client/pull/132.  There were other changes made in Service Catalog for reusing an established connection, but those changes will not be possible to backport to 1.9.  The service broker client changes should address the connection & memory issues you are seeing.
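
For context, "free idle connections" in Go generally means bounding and expiring the transport's idle pool and calling CloseIdleConnections on it. A rough sketch of that pattern (illustrative only; the function names and values below are not taken from the actual PRs):

```
// Illustrative sketch only; the real change is in the go-open-service-broker-client
// pull requests referenced above.
package example

import (
	"net/http"
	"time"
)

// newBrokerHTTPClient builds a client whose transport bounds and expires idle
// connections instead of letting them accumulate.
func newBrokerHTTPClient() *http.Client {
	transport := &http.Transport{
		MaxIdleConns:    10,
		IdleConnTimeout: 30 * time.Second, // idle sockets are closed after this
	}
	return &http.Client{Transport: transport, Timeout: 60 * time.Second}
}

// closeIdle drops any currently idle pooled connections so they do not linger
// as ESTABLISHED sockets between calls to a broker.
func closeIdle(c *http.Client) {
	if t, ok := c.Transport.(*http.Transport); ok {
		t.CloseIdleConnections()
	}
}
```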

Comment 2 Jay Boyd 2018-10-15 15:03:13 UTC
not 1.9 & 1.10, should have said 3.9 & 3.10

Comment 4 Jay Boyd 2018-10-22 19:00:13 UTC
fixed by https://github.com/openshift/ose/pull/1439 in 3.9.48-1

Comment 13 Jian Zhang 2018-12-05 05:43:34 UTC
LGTM, verified. Details below:
[root@ip-172-18-2-26 ~]# oc exec controller-manager-xqjsz -- service-catalog --version
v0.1.9.1

1. Record the number of connections between the service catalog controller and the apiserver.

[root@ip-172-18-11-165 ~]# oc get svc
NAME        TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)   AGE
apiserver   ClusterIP   172.30.42.121   <none>        443/TCP   1h
[root@ip-172-18-11-165 ~]# netstat -an|grep -w  172.30.42.121
tcp        0      0 10.129.0.1:52754        172.30.42.121:443       ESTABLISHED

2. Access the node running the ansible-service-broker and record its connections.

[root@ip-172-18-1-79 ~]# ps -elf |grep broker
4 S 1000100+  59768  59753  0  80   0 - 57802 futex_ 00:00 ?        00:00:04 asbd -c /etc/ansible-service-broker/config.yaml

[root@ip-172-18-1-79 ~]# nsenter -t 59768 --net netstat -n | grep ESTABLISHED
tcp        0      0 10.128.0.15:38448       172.30.0.1:443          ESTABLISHED
tcp        0      0 10.128.0.15:37646       172.30.223.196:2379     ESTABLISHED

3. Sync the ansible-service-broker with the service catalog many times.
4. Do some provision/deprovision APB operations.
5. Check the connections between the service catalog and the ansible-service-broker.
[root@ip-172-18-1-79 ~]# nsenter -t 59768 --net netstat -n | grep ESTABLISHED
tcp        0      0 10.128.0.15:38448       172.30.0.1:443          ESTABLISHED
tcp        0      0 10.128.0.15:37646       172.30.223.196:2379     ESTABLISHED
tcp        0      0 10.128.0.15:45686       104.108.121.159:443     ESTABLISHED
tcp6       0      0 10.128.0.15:1338        10.129.0.1:49556        ESTABLISHED
tcp6       0      0 10.128.0.15:1338        10.129.0.1:49580        ESTABLISHED
tcp6       0      0 10.128.0.15:1338        10.129.0.1:49578        ESTABLISHED
tcp6       0      0 10.128.0.15:1338        10.129.0.1:49558        ESTABLISHED
tcp6       0      0 10.128.0.15:1338        10.129.0.1:49592        ESTABLISHED
tcp6       0      0 10.128.0.15:1338        10.129.0.1:49586        ESTABLISHED
tcp6       0      0 10.128.0.15:1338        10.129.0.1:49554        ESTABLISHED
tcp6       0      0 10.128.0.15:1338        10.129.0.1:49692        ESTABLISHED
tcp6       0      0 10.128.0.15:1338        10.129.0.1:49616        ESTABLISHED
tcp6       0      0 10.128.0.15:1338        10.129.0.1:49588        ESTABLISHED
tcp6       0      0 10.128.0.15:1338        10.129.0.1:49642        ESTABLISHED
tcp6       0      0 10.128.0.15:1338        10.129.0.1:49786        ESTABLISHED
tcp6       0      0 10.128.0.15:1338        10.129.0.1:49560        ESTABLISHED

After a while, we can see:
[root@ip-172-18-1-79 ~]# nsenter -t 59768 --net netstat -n | grep ESTABLISHED
tcp        0      0 10.128.0.15:38448       172.30.0.1:443          ESTABLISHED
tcp        0      0 10.128.0.15:37646       172.30.223.196:2379     ESTABLISHED
tcp6       0      0 10.128.0.15:1338        10.129.0.1:51756        ESTABLISHED

Comment 15 errata-xmlrpc 2018-12-13 19:27:05 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:3748