Created attachment 1493243 [details]
Memory metrics of the "controller-manager" pod

Description of problem:

Template provisioning from the template-service-broker fails frequently; after deleting/restarting the controller-manager pod, provisioning starts working properly again. This time, instead of deleting the pod, we inspected it and found a very large number of open file descriptors (mostly sockets), along with a correspondingly large number of open TCP connections. Most of those connections are to 172.30.226.105:443 (hex notation 69E21EAC:01BB), which is the cluster IP of the `apiserver` service in the `openshift-template-service-broker` project.

```
sh-4.2$ ls /proc/1/fd | wc -l
32000
sh-4.2$ head -n25 /proc/1/net/tcp
  sl local_address rem_address   st tx_queue rx_queue tr tm->when retrnsmt uid timeout inode
 0: 8604800A:BAE3 69E21EAC:01BB 01 00000000:00000000 00:00000000 00000000 1000120000 0 28996712 1 ffff9dad77ea5540 20 4 0 10 -1
 1: 8604800A:D853 69E21EAC:01BB 01 00000000:00000000 00:00000000 00000000 1000120000 0 18955000 1 ffff9dad79a4c5c0 20 4 0 10 -1
 2: 8604800A:B65E 8D461EAC:053A 01 00000000:00000000 00:00000000 00000000 1000120000 0 7074283 1 ffff9da6a7205540 20 4 1 10 -1
 3: 8604800A:99AA 69E21EAC:01BB 01 00000000:00000000 00:00000000 00000000 1000120000 0 20225303 1 ffff9dac0ee31740 20 4 0 10 -1
 4: 8604800A:B5E0 69E21EAC:01BB 01 00000000:00000000 00:00000000 00000000 1000120000 0 1415561 1 ffff9dac396f7440 20 4 0 10 -1
 5: 8604800A:BD36 69E21EAC:01BB 01 00000000:00000000 00:00000000 00000000 1000120000 0 12484495 1 ffff9da7a7f207c0 20 4 0 10 -1
 6: 8604800A:DCDD 69E21EAC:01BB 01 00000000:00000000 00:00000000 00000000 1000120000 0 9205861 1 ffff9dac0a333e00 20 4 0 10 -1
 7: 8604800A:CEE3 69E21EAC:01BB 01 00000000:00000000 00:00000000 00000000 1000120000 0 19934683 1 ffff9dad8f761f00 20 4 0 10 -1
 8: 8604800A:8744 8D461EAC:053A 01 00000000:00000000 00:00000000 00000000 1000120000 0 30344225 1 ffff9d967e7cb640 20 4 0 10 -1
 9: 8604800A:CF64 69E21EAC:01BB 01 00000000:00000000 00:00000000 00000000 1000120000 0 7938304 1 ffff9da57afeb640 20 4 0 10 -1
10: 8604800A:C06D 8D461EAC:053A 01 00000000:00000000 00:00000000 00000000 1000120000 0 26188930 1 ffff9dad3b303640 20 4 0 10 -1
11: 8604800A:8FAB 69E21EAC:01BB 01 00000000:00000000 00:00000000 00000000 1000120000 0 20171682 1 ffff9d8e73d79f00 20 4 0 10 -1
12: 8604800A:8DFA 69E21EAC:01BB 01 00000000:00000000 00:00000000 00000000 1000120000 0 6412799 1 ffff9da8e8671740 20 4 0 10 -1
13: 8604800A:D4D6 69E21EAC:01BB 01 00000000:00000000 00:00000000 00000000 1000120000 0 18936076 1 ffff9d9d52242e80 20 4 0 10 -1
14: 8604800A:890F 69E21EAC:01BB 01 00000000:00000000 00:00000000 00000000 1000120000 0 21699608 1 ffff9dad70730f80 20 4 0 10 -1
15: 8604800A:EA52 69E21EAC:01BB 01 00000000:00000000 00:00000000 00000000 1000120000 0 2846040 1 ffff9daa72a0be00 20 4 0 10 -1
16: 8604800A:C97F 69E21EAC:01BB 01 00000000:00000000 00:00000000 00000000 1000120000 0 26505431 1 ffff9dad8ab95d00 20 4 0 10 -1
17: 8604800A:C6A3 69E21EAC:01BB 01 00000000:00000000 00:00000000 00000000 1000120000 0 342953 1 ffff9dadef6c1740 20 4 0 10 -1
18: 8604800A:B919 69E21EAC:01BB 01 00000000:00000000 00:00000000 00000000 1000120000 0 23986295 1 ffff9d981ce09740 20 4 0 10 -1
19: 8604800A:AE6B 69E21EAC:01BB 01 00000000:00000000 00:00000000 00000000 1000120000 0 21858238 1 ffff9d9ab02a0f80 20 4 0 10 -1
20: 8604800A:C010 69E21EAC:01BB 01 00000000:00000000 00:00000000 00000000 1000120000 0 6074854 1 ffff9da6df7d1f00 20 4 0 10 -1
21: 8604800A:DD2F 69E21EAC:01BB 01 00000000:00000000 00:00000000 00000000 1000120000 0 8022146 1 ffff9da5552587c0 20 4 0 10 -1
22: 8604800A:893D 8D461EAC:053A 01 00000000:00000000 00:00000000 00000000 1000120000 0 30338376 1 ffff9dad763fec80 20 4 0 10 -1
23: 8604800A:B06A 69E21EAC:01BB 01 00000000:00000000 00:00000000 00000000 1000120000 0 1384865 1 ffff9dad6b64be00 20 4 0 10 -1
```

In addition, we observed that the memory usage of the faulty `controller-manager` pod increased over time. A screenshot of the memory usage graph for that pod is attached.

Looking at the kubernetes-incubator/service-catalog GitHub repository, we found this issue describing a memory leak inside the controller manager: https://github.com/kubernetes-incubator/service-catalog/issues/2276

It appears this was fixed upstream by closing the HTTP response body in the service broker client: https://github.com/pmorie/go-open-service-broker-client/pull/131/files

Will this fix be backported to OpenShift 3.9? If so, can you share its status so that we can plan for a final fix?

How reproducible:
Randomly.

Actual results:
The pod becomes unresponsive and needs to be deleted before it works properly again. We also noticed that the controller manager has no liveness/readiness probe that could detect this condition; we have filed a separate bug for that: https://bugzilla.redhat.com/show_bug.cgi?id=1630324

Expected results:
The controller-manager should have kept working properly.
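For illustration, the suspected failure mode is easy to reproduce with a plain net/http client. This is a minimal sketch under that assumption, not the actual service-catalog or broker-client code; the function names pollBrokerLeaky/pollBrokerFixed are ours:

```
// Package sketch is illustrative only. In Go, an http.Response body
// that is never closed pins a file descriptor and keeps the underlying
// TCP connection open, which matches the FD/socket growth shown above.
package sketch

import (
	"io"
	"net/http"
)

// pollBrokerLeaky issues a request but never closes resp.Body.
// Every call leaks one socket, so repeated polling of the broker
// steadily grows /proc/<pid>/fd.
func pollBrokerLeaky(client *http.Client, url string) (int, error) {
	resp, err := client.Get(url)
	if err != nil {
		return 0, err
	}
	// resp.Body is abandoned here: the connection is never released.
	return resp.StatusCode, nil
}

// pollBrokerFixed mirrors the gist of the upstream fix: drain and
// close the body so the transport can release the connection.
func pollBrokerFixed(client *http.Client, url string) (int, error) {
	resp, err := client.Get(url)
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()
	if _, err := io.Copy(io.Discard, resp.Body); err != nil {
		return 0, err
	}
	return resp.StatusCode, nil
}
```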
I will investigate backporting two fixes made to the go-open-service-broker-client package to 1.9 and 1.10: https://github.com/pmorie/go-open-service-broker-client/pull/131 and https://github.com/pmorie/go-open-service-broker-client/pull/132. Other changes were made in Service Catalog to reuse an established connection, but those cannot be backported to 1.9. The service broker client changes should address the connection and memory issues you are seeing.
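For reference, "reusing an established connection" in net/http terms boils down to sharing one client/transport across broker calls instead of dialing anew each time. A generic sketch under that assumption, not the actual Service Catalog change (sharedClient and getCatalog are illustrative names):

```
package sketch

import (
	"io"
	"net/http"
	"time"
)

// sharedClient is constructed once; its Transport keeps an idle
// connection pool, so sequential requests to the same broker host
// reuse one TCP connection instead of opening a new one per call.
var sharedClient = &http.Client{
	Timeout: 30 * time.Second,
	Transport: &http.Transport{
		MaxIdleConnsPerHost: 10,
		IdleConnTimeout:     90 * time.Second,
	},
}

// getCatalog drains and closes the body; both steps are needed
// before the connection can go back into the idle pool for reuse.
func getCatalog(url string) (int, error) {
	resp, err := sharedClient.Get(url)
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()
	if _, err := io.Copy(io.Discard, resp.Body); err != nil {
		return 0, err
	}
	return resp.StatusCode, nil
}
```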
Correction: not 1.9 & 1.10, I should have said 3.9 & 3.10.
fixed by https://github.com/openshift/ose/pull/1439 in 3.9.48-1
LGTM, verified. Details below:

```
[root@ip-172-18-2-26 ~]# oc exec controller-manager-xqjsz -- service-catalog --version
v0.1.9.1
```

1. Record the number of connections between the service catalog controller and the apiserver:

```
[root@ip-172-18-11-165 ~]# oc get svc
NAME        TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)   AGE
apiserver   ClusterIP   172.30.42.121   <none>        443/TCP   1h
[root@ip-172-18-11-165 ~]# netstat -an | grep -w 172.30.42.121
tcp        0      0 10.129.0.1:52754        172.30.42.121:443       ESTABLISHED
```

2. On the slave node, record the number of connections held by the ansible-service-broker:

```
[root@ip-172-18-1-79 ~]# ps -elf | grep broker
4 S 1000100+  59768  59753  0  80   0 - 57802 futex_ 00:00 ?        00:00:04 asbd -c /etc/ansible-service-broker/config.yaml
[root@ip-172-18-1-79 ~]# nsenter -t 59768 --net netstat -n | grep ESTABLISHED
tcp        0      0 10.128.0.15:38448       172.30.0.1:443          ESTABLISHED
tcp        0      0 10.128.0.15:37646       172.30.223.196:2379     ESTABLISHED
```

3. Sync the ansible-service-broker with the service catalog many times.

4. Do some provision/deprovision APB operations.

5. Check the connections between the service catalog and the ansible-service-broker:

```
[root@ip-172-18-1-79 ~]# nsenter -t 59768 --net netstat -n | grep ESTABLISHED
tcp        0      0 10.128.0.15:38448       172.30.0.1:443          ESTABLISHED
tcp        0      0 10.128.0.15:37646       172.30.223.196:2379     ESTABLISHED
tcp        0      0 10.128.0.15:45686       104.108.121.159:443     ESTABLISHED
tcp6       0      0 10.128.0.15:1338        10.129.0.1:49556        ESTABLISHED
tcp6       0      0 10.128.0.15:1338        10.129.0.1:49580        ESTABLISHED
tcp6       0      0 10.128.0.15:1338        10.129.0.1:49578        ESTABLISHED
tcp6       0      0 10.128.0.15:1338        10.129.0.1:49558        ESTABLISHED
tcp6       0      0 10.128.0.15:1338        10.129.0.1:49592        ESTABLISHED
tcp6       0      0 10.128.0.15:1338        10.129.0.1:49586        ESTABLISHED
tcp6       0      0 10.128.0.15:1338        10.129.0.1:49554        ESTABLISHED
tcp6       0      0 10.128.0.15:1338        10.129.0.1:49692        ESTABLISHED
tcp6       0      0 10.128.0.15:1338        10.129.0.1:49616        ESTABLISHED
tcp6       0      0 10.128.0.15:1338        10.129.0.1:49588        ESTABLISHED
tcp6       0      0 10.128.0.15:1338        10.129.0.1:49642        ESTABLISHED
tcp6       0      0 10.128.0.15:1338        10.129.0.1:49786        ESTABLISHED
tcp6       0      0 10.128.0.15:1338        10.129.0.1:49560        ESTABLISHED
```

After a while, the transient connections are gone instead of accumulating:

```
[root@ip-172-18-1-79 ~]# nsenter -t 59768 --net netstat -n | grep ESTABLISHED
tcp        0      0 10.128.0.15:38448       172.30.0.1:443          ESTABLISHED
tcp        0      0 10.128.0.15:37646       172.30.223.196:2379     ESTABLISHED
tcp6       0      0 10.128.0.15:1338        10.129.0.1:51756        ESTABLISHED
```
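As a side note, the FD check from the original report (`ls /proc/1/fd | wc -l`) can be scripted for regression testing. A small sketch of our own, not part of any product tooling (countFDs is a hypothetical helper):

```
// Counts open file descriptors of a PID via /proc, mirroring the
// manual check from the original report; useful for watching whether
// the count keeps growing during repeated provision/deprovision runs.
package main

import (
	"fmt"
	"os"
)

func countFDs(pid int) (int, error) {
	entries, err := os.ReadDir(fmt.Sprintf("/proc/%d/fd", pid))
	if err != nil {
		return 0, err
	}
	return len(entries), nil
}

func main() {
	// PID 1 is the controller-manager process inside its container.
	n, err := countFDs(1)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println("open file descriptors:", n)
}
```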
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2018:3748