Created attachment 1289051
dc router

Description of problem:

The openshift-router process is not updating the HAProxy files and seems to be stalled in some part of its loop. Logs and files have not been updated for weeks (since May 31st):

-rwxrwxrwx. 1 root root   517 May 31 13:01 cert_config.map
-rw-r--r--. 1 root root  2035 Apr 11 15:00 default_pub_keys.pem
-rw-r--r--. 1 root root  3278 Apr 11 15:00 error-page-503.http
-rw-r--r--. 1 root root 31863 Apr 11 15:00 haproxy-config.template
-rwxrwxrwx. 1 root root 36469 May 31 13:01 haproxy.config
-rwxrwxrwx. 1 root root   279 May 31 13:01 os_edge_http_be.map
-rwxrwxrwx. 1 root root  2114 May 31 13:01 os_http_be.map
-rwxrwxrwx. 1 root root   512 May 31 13:01 os_reencrypt.map
-rwxrwxrwx. 1 root root    64 May 31 13:01 os_route_http_expose.map
-rwxrwxrwx. 1 root root   362 May 31 13:01 os_route_http_redirect.map
-rwxrwxrwx. 1 root root   275 May 31 13:01 os_sni_passthrough.map
-rwxrwxrwx. 1 root root   783 May 31 13:01 os_tcp_be.map
-rwxrwxrwx. 1 root root     2 May 31 13:01 os_wildcard_domain.map

oc get routes works within the pod. There are two router pods, and the other one is working properly. The affected pod is not even doing the full refresh every 10 minutes.

Version-Release number of selected component (if applicable):
OCP 3.5.5.5
registry.access.redhat.com/openshift3/ose-haproxy-router v3.5.5.5

How reproducible:
Only seen once

Steps to Reproduce:
1. Create a route

Actual results:
Configuration files and logs are only updated on one of the pods.

Expected results:
Both pods should be updated with the latest configuration.

Additional info:
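The stale timestamps in the listing above are what gave the stalled router away. A minimal sketch of checking for that condition automatically is shown here. It assumes that an old modification time on haproxy.config and the *.map files is a useful signal of a stalled router. The function name stale_files, its parameters, and the age threshold are hypothetical and chosen for illustration only; the directory to inspect is left to the caller because the report does not give a path.

```python
import os
import time
from pathlib import Path


def stale_files(config_dir, max_age_seconds, now=None,
                patterns=("haproxy.config", "*.map")):
    """Return (file name, age in seconds) for generated files older than the limit.

    Hypothetical helper: only files matching `patterns` are checked, so static
    files such as haproxy-config.template are ignored by default.
    """
    now = time.time() if now is None else now
    base = Path(config_dir)
    stale = []
    for pattern in patterns:
        for path in sorted(base.glob(pattern)):
            age = now - path.stat().st_mtime
            if age > max_age_seconds:
                stale.append((path.name, age))
    return stale
```

Running this with the same threshold against both router pods would be expected to flag files only on the pod that has stopped writing its configuration.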
More information about the environment:
- Non-containerized
- Mixed deployment with Azure and OpenStack
- Network plugin: ovs-multitenant
The culprit was probably https://bugzilla.redhat.com/show_bug.cgi?id=1415112 (fixed in 3.5.5.7, I believe).

*** This bug has been marked as a duplicate of bug 1415112 ***
It is either https://bugzilla.redhat.com/show_bug.cgi?id=1415112 or https://bugzilla.redhat.com/show_bug.cgi?id=1429823 (fixed in 3.5.5.8). Both identified problems with the EventQueue that we were using, and both can lead to router lock-ups.

The symptoms of a router that stops updating the config file better match the notes in https://bugzilla.redhat.com/show_bug.cgi?id=1429823, but in our investigation we found that https://bugzilla.redhat.com/show_bug.cgi?id=1415112 can lead to the same symptom. It is easier to make the event queue lock up by changing the labels, but rapid route insertions and deletions were later found to expose the same bug.

*** This bug has been marked as a duplicate of bug 1429823 ***