Bug 1608950

Summary: Pod deployments failing DeploymentConfig on openshift on openstack provider
Product: OpenShift Container Platform
Reporter: jliberma <jliberma>
Component: Installer
Assignee: Luis Tomas Bolivar <ltomasbo>
Status: CLOSED CURRENTRELEASE
QA Contact: weiwei jiang <wjiang>
Severity: unspecified
Priority: unspecified
Version: 3.10.0
CC: aos-bugs, cgoncalves, jokerman, mdulko, mmccomas, tzumainn, wsun
Target Release: 3.10.z
Hardware: Unspecified
OS: Unspecified
Last Closed: 2019-02-04 08:31:20 UTC
Type: Bug

Description jliberma@redhat.com 2018-07-26 14:23:16 UTC
Description of problem:

The kubernetes service allows pods to access the Kubernetes API. This service is created manually by the installer behind the Octavia load balancer. It appears that some of the pod connections to the kubernetes service are being closed due to Octavia timeouts, resulting in pods entering an Error status.
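
For context, a rough way to identify the load balancer and listener that front the kubernetes service (a sketch only; the actual load balancer and listener names depend on what openshift-ansible created in this environment):

  $ oc get svc kubernetes -n default
  $ openstack loadbalancer list
  $ openstack loadbalancer listener list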

Version-Release number of the following components:

(shiftstack) [cloud-user@bastion ~]$ rpm -q openshift-ansible
openshift-ansible-3.10.24-1.git.0.3032aec.el7.noarch

(shiftstack) [cloud-user@bastion ~]$ rpm -q ansible
ansible-2.4.6.0-1.el7ae.noarch

(shiftstack) [cloud-user@bastion ~]$ ansible --version
ansible 2.4.6.0
  config file = /etc/ansible/ansible.cfg
  configured module search path = [u'/home/cloud-user/.ansible/plugins/modules', u'/usr/share/ansible/plugins/modules']
  ansible python module location = /usr/lib/python2.7/site-packages/ansible
  executable location = /usr/bin/ansible
  python version = 2.7.5 (default, Feb 20 2018, 09:19:12) [GCC 4.8.5 20150623 (Red Hat 4.8.5-28)]

How reproducible:

We saw it in 5/5 deployments with Octavia enabled. Different service pods failed:

docker-registry
registry-console

We also saw this with the cakephp app we launched to test the OCP deployment:

  $ oc login --insecure-skip-tls-verify=true "https://console.openshift.example.com:8443" -u username -p password
  $ oc new-project test-app
  $ oc new-app --template=cakephp-mysql-example


Steps to Reproduce:
1. Deploy openshift-ansible with octavia enabled
2. View the pod STATUS with oc get pods --all-namespaces -o wide (see the sketch below)
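
A minimal sketch for step 2, filtering for pods that are neither running nor completed (the grep pattern is an illustration, not taken from the original report):

  $ oc get pods --all-namespaces -o wide | grep -Ev 'Running|Completed'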

Actual results:
Please include the entire output from the last TASK line through the end of output if an error is generated


[openshift@master-0 ~]$ oc get pods docker-registry-1-deploy
 docker-registry-1-deploy   0/1       Error     0          1h

[openshift@master-0 ~]$ oc logs docker-registry-1-deploy
 --> Scaling docker-registry-1 to 1
 error: update acceptor rejected docker-registry-1: watch closed before Until timeout
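
A possible way to confirm the failure is repeatable is to retry the failed deployer and watch whether it hits the same timeout (the DeploymentConfig name is taken from the output above; this is a sketch, not a step from the original report):

  $ oc rollout retry dc/docker-registry
  $ oc get pods -w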

Expected results:

Pods complete their builds before connections are dropped (if that is indeed what is happening).

Additional info:
Please attach logs from ansible-playbook with the -vvv flag

Comment 1 MichaƂ Dulko 2018-07-27 17:06:50 UTC
I've verified the Octavia timeout hypothesis today:

1. When I shrank the `timeout client` and `timeout server` HAProxy options to 20000 on the amphora VM, the deployment containers started to die earlier.
2. When I extended the `timeout client` and `timeout server` HAProxy options to 500000 on the amphora VM, the deployments completed successfully.

This proves that the root cause is that those options are not configurable. Unfortunately, this is only fixed in the OSP 14 Octavia.
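
For reference, one way to inspect those values on the amphora; the config path and the 50000 ms values shown below are assumptions about the default Octavia amphora image, not data captured from this environment:

  $ sudo grep -E 'timeout (client|server)' /var/lib/octavia/*/haproxy.cfg
      timeout client 50000
      timeout server 50000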

Comment 2 Luis Tomas Bolivar 2019-02-04 08:31:20 UTC
Closing the bug, as this is fixed in OCP 3.11 when running on top of OSP 14. For OSP 13, the modifications to Director that add the knobs to increase the load balancer client timeouts have also been merged.
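
On OSP 14 and later (where the knobs mentioned above exist), the listener timeouts can be raised from the OpenStack CLI; a sketch, where the 500000 ms value mirrors the one that worked in comment 1 and <listener-id> is a placeholder:

  $ openstack loadbalancer listener list
  $ openstack loadbalancer listener set --timeout-client-data 500000 --timeout-member-data 500000 <listener-id>
  $ openstack loadbalancer listener show <listener-id>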