Description of problem:
The router (HAProxy) is not updating routes immediately after a rollout of the latest deployment config.

Commands used in the pipeline to watch the rollout status (transactions-1-0 is the app name):

  oc rollout latest dc/transactions-1-0
  oc rollout status dc/transactions-1-0 --watch=true

Result (this has been tested with multiple apps):
- Curl using the pod IP/name immediately returns response code 200 OK.
- Curl using the service IP/name immediately returns response code 200 OK.
- Curl using the external URL takes time to respond. It looks like the router is not updating routes immediately:
  - A few times it takes around 15 seconds to respond. During that time the services are not available, which is not desirable (refer to script_output.txt for results).
  - A few times it takes between 60 and 300 seconds to respond. During that time the services are not available, which is not desirable. This also breaks the builds.
  - Very rarely it responds immediately.
(I have attached the output; we captured these results using a custom script.)

Environment: There are no resource-level constraints on the infra nodes and we are using two routers.

Required logs and config output, as requested:

1) oc exec -it $(oc get pods | grep router | awk '{print $1}' | head -n 1) -- /bin/bash
   cat /var/lib/containers/router/routes.json
   Ans: Two files attached, taken immediately before and after the dc rollout (router-before-rollout.json and router-after-rollout-1.json)
2) try curling svcname:port/your-health-check (oc get svc -n namespace)
   Ans: output attached in script_output.txt
3) try curling serviceip:port/your-health-check (oc get svc -n namespace)
   Ans: output attached in script_output.txt
4) try curling endpoint-ip:port/your-health-check (oc get ep -n namespace)
   Ans: output attached in script_output.txt
5) oc get route <route-name> -o yaml
   Attached file route_app_trasactions.yaml
6) oc logs dc/router -n default
   Attached files router_logs_router-4-3pk65.logs and router_logs_router-4-npdn2.logs
7) oc get dc/router -o yaml -n default
   Attached file router_dc.yaml
8) oc get route <NAME_OF_ROUTE> -n <PROJECT>
   Attached file commands_output
9) oc get endpoints --all-namespaces
   Attached file commands_output
10) oc exec -it $ROUTER_POD -- ls -la
    Attached file script_output.txt (has both before and after dc rollout)
11) oc exec -it $ROUTER_POD -- find /var/lib/haproxy -regex ".*\(.map\|config.*\|.json\)" -print -exec cat {} \; > haproxy_configs_and_maps
    Attached files:
    haproxy_configs_and_maps_router-4-3pk65_before_rollout
    haproxy_configs_and_maps_router-4-npdn2_before_rollout
    haproxy_configs_and_maps_router-4-npdn2_1 (after rollout)
    haproxy_configs_and_maps_router-4-3pk65_1 (after rollout)
12) oc get svc svc-which-present-in-the-route -o yaml
    Attached file service_app_trasactions.yaml
13) oc get dc dc-of-pod -o yaml
    Attached file pod_app_trasactions.yaml

Please let me know if you require anything else on this. The router config files are updated every time a new dc rollout is done, so no strace of the router process was captured.

Version-Release number of selected component (if applicable):
OCP 3.5

How reproducible:
Always on the customer end

Steps to Reproduce:
1. oc rollout latest dc/transactions-1-0
2. oc rollout status dc/transactions-1-0 --watch=true
3. Immediately curl the app via pod IP, service IP, and the external route URL.

Actual results:
Curl via the pod IP and the service IP responds immediately with 200 OK; curl via the external route URL takes roughly 15 to 300 seconds before it responds, and the service is unavailable through the route during that window.

Expected results:
The route responds immediately after the rollout completes, the same as the pod IP and service IP.

Additional info:
See script_output.txt for the captured curl output; a sketch of the kind of probe loop used is shown below.
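For reference, here is a minimal sketch of the kind of probe loop described above (the actual script is attached as script_output.txt; the route host and health path below are placeholder values, not taken from the attachments):

  #!/bin/bash
  # Measure how long the external route takes to return 200 after a rollout.
  # ROUTE_HOST and HEALTH_PATH are hypothetical placeholders.
  ROUTE_HOST="transactions-1-0.apps.example.com"
  HEALTH_PATH="/health"

  oc rollout latest dc/transactions-1-0
  oc rollout status dc/transactions-1-0 --watch=true

  start=$(date +%s)
  while true; do
      code=$(curl -s -o /dev/null -w '%{http_code}' "http://${ROUTE_HOST}${HEALTH_PATH}")
      now=$(date +%s)
      echo "$(date -u +%H:%M:%S) HTTP ${code} (+$((now - start))s after rollout reported complete)"
      [ "${code}" = "200" ] && break
      sleep 1
  done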
As discussed today, I am providing the following details:
- tcpdump from the two infra nodes where the router/HAProxy is installed
- tcpdump from the node where the application pod is deployed
- Output of the commands "ovs-ofctl -O OpenFlow13 dump-flows br0" and "ovs-dpctl show" from the above three nodes
- ARP cache from the above three nodes
- The script used to capture the curl output

Observation: I changed the script to capture the ARP cache from all three nodes, to check whether the pod IP is present in the cache with the correct MAC address (this time the script was run from our Ansible host). The observation is that whenever curl fails, the new pod's IP still has a stale ARP entry (an old MAC address) in the ARP cache of the infra nodes. It takes some time for these ARP entries to be updated; as soon as they are refreshed with the pod's latest MAC, curl works perfectly fine. Please refer to the script output for this; a sketch of the ARP check is shown below.
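For reference, a minimal sketch of the kind of ARP check added to the script (the node names, namespace, and label selector below are assumptions for illustration, not taken from the attached output):

  #!/bin/bash
  # Compare the ARP cache entry for the new pod's IP on each infra node.
  # A stale entry still shows the previous pod's MAC until the cache refreshes.
  INFRA_NODES="infra-node-1 infra-node-2"
  POD_IP=$(oc get pods -n myproject -l deploymentconfig=transactions-1-0 \
           -o jsonpath='{.items[0].status.podIP}')

  for node in ${INFRA_NODES}; do
      echo "== ${node} =="
      ssh "${node}" "ip neigh show ${POD_IP}"
  done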
Apparent duplicate of this bug: https://bugzilla.redhat.com/show_bug.cgi?id=1451854
*** This bug has been marked as a duplicate of bug 1451854 ***