Bug 1462675

Summary: Router is not synchronized
Product: OpenShift Container Platform Reporter: Ruben Romero Montes <rromerom>
Component: Networking Assignee: Ben Bennett <bbennett>
Networking sub component: router QA Contact: zhaozhanqi <zzhao>
Status: CLOSED DUPLICATE Docs Contact:
Severity: high    
Priority: unspecified CC: aos-bugs, asogukpi, bbennett, javier.ramirez
Version: 3.5.0 Keywords: Reopened
Target Milestone: ---   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2017-06-21 14:04:55 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments:
Description Flags
dc router none

Description Ruben Romero Montes 2017-06-19 09:05:53 UTC
Created attachment 1289051
dc router

Description of problem:
The openshift-router process is not updating the HAProxy configuration files and appears to be stalled somewhere in its sync loop.

Neither the logs nor the generated files have been updated for weeks (since May 31st):

-rwxrwxrwx. 1 root root   517 May 31 13:01 cert_config.map
-rw-r--r--. 1 root root  2035 Apr 11 15:00 default_pub_keys.pem
-rw-r--r--. 1 root root  3278 Apr 11 15:00 error-page-503.http
-rw-r--r--. 1 root root 31863 Apr 11 15:00 haproxy-config.template
-rwxrwxrwx. 1 root root 36469 May 31 13:01 haproxy.config
-rwxrwxrwx. 1 root root   279 May 31 13:01 os_edge_http_be.map
-rwxrwxrwx. 1 root root  2114 May 31 13:01 os_http_be.map
-rwxrwxrwx. 1 root root   512 May 31 13:01 os_reencrypt.map
-rwxrwxrwx. 1 root root    64 May 31 13:01 os_route_http_expose.map
-rwxrwxrwx. 1 root root   362 May 31 13:01 os_route_http_redirect.map
-rwxrwxrwx. 1 root root   275 May 31 13:01 os_sni_passthrough.map
-rwxrwxrwx. 1 root root   783 May 31 13:01 os_tcp_be.map
-rwxrwxrwx. 1 root root     2 May 31 13:01 os_wildcard_domain.map

oc get routes works within the pod. There are two pods and the other one is working properly.

It is not even doing the full refresh every 10 minutes.
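
A minimal sketch of how the stale files listed above could be detected automatically, assuming the generated files are the ones ending in .map or .config and that a healthy router rewrites them at least once per refresh interval. The function name stale_files, the suffix list, and the 600-second threshold are illustrative assumptions, not part of the router; the directory is passed in by the caller.

```python
import os
import time


def stale_files(directory, max_age_seconds=600, now=None,
                suffixes=(".map", ".config")):
    """Return (name, age_in_seconds) for generated files older than the limit.

    Assumption: only files ending in one of `suffixes` are regenerated by
    the router; static files such as templates and keys are ignored.
    """
    now = time.time() if now is None else now
    stale = []
    for name in sorted(os.listdir(directory)):
        if not name.endswith(suffixes):
            continue
        path = os.path.join(directory, name)
        if not os.path.isfile(path):
            continue
        age = now - os.path.getmtime(path)
        if age > max_age_seconds:
            stale.append((name, age))
    return stale
```

Running this inside each router pod against its configuration directory and comparing the results would show which pod has stopped writing, as happened here with one of the two pods.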

Version-Release number of selected component (if applicable):
OCP 3.5.5.5
registry.access.redhat.com/openshift3/ose-haproxy-router         v3.5.5.5

How reproducible:
Only seen once

Steps to Reproduce:
1. Create a route

Actual results:
Configuration files and logs are only updated on one of the pods

Expected results:
Both pods should be updated with the latest configuration.

Additional info:

Comment 2 Ruben Romero Montes 2017-06-19 09:09:19 UTC
More information about the environment:
 - Non-containerized
 - Mixed deployment with Azure and OpenStack
 - network plugin: ovs-multitenant

Comment 3 Ben Bennett 2017-06-19 17:24:28 UTC
The culprit was probably https://bugzilla.redhat.com/show_bug.cgi?id=1415112 (fixed in 3.5.5.7 I believe).

*** This bug has been marked as a duplicate of bug 1415112 ***

Comment 5 Ben Bennett 2017-06-21 14:04:55 UTC
It is either https://bugzilla.redhat.com/show_bug.cgi?id=1415112 or https://bugzilla.redhat.com/show_bug.cgi?id=1429823 (fixed in 3.5.5.8).

Both bugs identified problems with the EventQueue that we were using, and both can lead to router lock-ups. The symptoms of a router that stops updating the config file better match the notes in https://bugzilla.redhat.com/show_bug.cgi?id=1429823, but in our investigation we found that https://bugzilla.redhat.com/show_bug.cgi?id=1415112 can lead to the same symptom. It is easier to make the event queue lock up by changing the labels, but rapid route insertions and deletions were later found to expose the same bug.
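
A small sketch for checking whether a router image tag is at or beyond one of the fix versions mentioned in this bug (3.5.5.7 and 3.5.5.8), assuming tags are dotted integers with an optional leading "v" as in v3.5.5.5. The helper names parse_version and has_fix are hypothetical.

```python
def parse_version(tag):
    """Turn a tag such as 'v3.5.5.5' or '3.5.5.8' into a tuple of ints."""
    return tuple(int(part) for part in tag.lstrip("v").split("."))


def has_fix(image_tag, fixed_in):
    """Return True if image_tag is the same as or newer than fixed_in."""
    return parse_version(image_tag) >= parse_version(fixed_in)
```

For the image reported here, has_fix("v3.5.5.5", "3.5.5.8") is False, so the router predates both fixes.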

*** This bug has been marked as a duplicate of bug 1429823 ***