Description of problem:
The kubernetes service allows pods to access the kubernetes API. This service is created manually by the installer behind the Octavia load balancer. It appears that some of the pod connections to the kubernetes service are being closed by the API due to Octavia timeouts, resulting in pods ending up in an error status.

Version-Release number of the following components:
(shiftstack) [cloud-user@bastion ~]$ rpm -q openshift-ansible
openshift-ansible-3.10.24-1.git.0.3032aec.el7.noarch
(shiftstack) [cloud-user@bastion ~]$ rpm -q ansible
ansible-2.4.6.0-1.el7ae.noarch
(shiftstack) [cloud-user@bastion ~]$ ansible --version
ansible 2.4.6.0
  config file = /etc/ansible/ansible.cfg
  configured module search path = [u'/home/cloud-user/.ansible/plugins/modules', u'/usr/share/ansible/plugins/modules']
  ansible python module location = /usr/lib/python2.7/site-packages/ansible
  executable location = /usr/bin/ansible
  python version = 2.7.5 (default, Feb 20 2018, 09:19:12) [GCC 4.8.5 20150623 (Red Hat 4.8.5-28)]

How reproducible:
We saw it in 5/5 deployments with Octavia enabled. Different service pods failed:
docker-registry
registry-console

We also saw this with the cakephp app we launched to test the OCP deployment:
$ oc login --insecure-skip-tls-verify=true "https://console.openshift.example.com:8443" -u username -p password
$ oc new-project test-app
$ oc new-app --template=cakephp-mysql-example

Steps to Reproduce:
1. Deploy openshift-ansible with Octavia enabled
2. View pod STATUS with oc get pods --all-namespaces -o wide

Actual results:
Please include the entire output from the last TASK line through the end of output if an error is generated

[openshift@master-0 ~]$ oc get pods docker-registry-1-deploy
docker-registry-1-deploy   0/1   Error   0   1h

[openshift@master-0 ~]$ oc logs docker-registry-1-deploy
--> Scaling docker-registry-1 to 1
error: update acceptor rejected docker-registry-1: watch closed before Until timeout

Expected results:
Pods complete their builds before connections are dropped (if that is indeed what is happening).

Additional info:
Please attach logs from ansible-playbook with the -vvv flag
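For anyone trying to confirm which Octavia resources sit in front of the kubernetes service, a rough sketch of the inspection steps is below. The placeholder IDs are examples, not values from this deployment, and on the OSP13/Queens Octavia API the listener details do not expose any timeout fields (they only appear on newer releases such as the one in OSP14), which is why the values had to be checked directly on the amphora:

$ openstack loadbalancer list
$ openstack loadbalancer listener list --loadbalancer <kubernetes-lb-id>
$ openstack loadbalancer listener show <kubernetes-listener-id>
# On OSP13 the output above has no timeout_* fields, so the effective
# timeouts can only be seen in the HAProxy config inside the amphora VM.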
I've verified the Octavia timeout hypothesis today:

1. When I shrank the `timeout client` and `timeout server` HAProxy options to 20000 on the Amphora VM, the deployment containers started to die earlier.
2. When I extended the `timeout client` and `timeout server` HAProxy options to 500000 on the Amphora VM, the deployment containers completed successfully.

This proves that the root cause is these options not being configurable. Unfortunately, that is fixed only in OSP14 Octavia.
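For reference, the change on the amphora looked roughly like the snippet below. The config path and surrounding lines are from memory and may differ slightly between Octavia versions; only the two timeout values were touched, and HAProxy values without a unit are milliseconds:

# Inside the amphora VM, something like /var/lib/octavia/<listener_id>/haproxy.cfg
defaults
    timeout client 500000    # Octavia renders 50000 (50s) by default; 20000 made pods fail sooner
    timeout server 500000    # with 500000 (500s) the deployment pods finished successfully

Note that HAProxy inside the amphora has to be reloaded for the new values to take effect, and the edit is not persistent: Octavia will regenerate the config if the listener is updated or the amphora is failed over, so this is only useful for verifying the hypothesis, not as a workaround.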
Closing the bug as this is fixed in OCP 3.11 when used on top of OSP14. For OSP13, the modifications to Director that add the knobs to increase the load balancers' client timeouts have also been merged.
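For completeness, the knobs in question ultimately land in octavia.conf on the Octavia controllers. A hedged sketch of the relevant settings is below; the option names come from Octavia's haproxy_amphora config group, the values are only examples in milliseconds, and the exact Director parameter names that feed them should be taken from the merged change:

# /etc/octavia/octavia.conf on the Octavia controllers (example values)
[haproxy_amphora]
# Data inactivity timeouts rendered into the amphora's HAProxy
# 'timeout client' / 'timeout server' lines, in milliseconds.
timeout_client_data = 1200000
timeout_member_data = 1200000

On releases that expose per-listener timeouts through the API (OSP14 and later), the same adjustment can be made per load balancer without touching the service config, e.g. with something like:

$ openstack loadbalancer listener set --timeout-client-data 1200000 --timeout-member-data 1200000 <kubernetes-listener-id>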