Created attachment 1291101 [details]
haproxy stats page

[customer case 01789221]

Description of problem:
After an upgrade to OCP 3.4.1.24 to solve Bug 1440977 (Router hangs on deadlock), the following issue has been observed: some (2 out of 4) of the router's haproxy config files become de-synchronized, as they are missing some relevant "server" lines. Redeploying the application restores the information in the haproxy config, but as soon as the pod is recreated (oc delete), the server disappears from the config file again. I'm attaching the router config files and logs from the case.

Version-Release number of selected component (if applicable):
OCP 3.4.1.24

How reproducible:
Not yet reproducible in a predictable way; it appears only on a few routers.

Actual results:
Some servers are not included in the haproxy config file.

Expected results:
All servers are included in the haproxy config file.

Additional info:
All servers are reachable with curl. The HAProxy stats page shows the backends as active.
Created attachment 1291102 [details]
Router configurations

Created attachment 1291103 [details]
router log files

Created attachment 1291105 [details]
Deployment config and ha-proxy config template
Per @fmarchio: I've checked through the router files and found the following "broken" configurations:

./router-cpyi0018-10-5255r/haproxy.config
./router-cpyi0087-11-rbfed/haproxy.config

In more detail, they are missing the following servers (diff output against a working config):

< server 7dd6fb3777b32ca1383ef13f0622de63 10.1.22.144:8080 check inter 5000ms cookie 7dd6fb3777b32ca1383ef13f0622de63 weight 100
4606d4604
< server 7dd6fb3777b32ca1383ef13f0622de63 10.1.22.144:8080 check inter 5000ms cookie 7dd6fb3777b32ca1383ef13f0622de63 weight 100
4670d4667
< server 7dd6fb3777b32ca1383ef13f0622de63 10.1.22.144:8080 check inter 5000ms cookie 7dd6fb3777b32ca1383ef13f0622de63 weight 100
routes
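A quick way to spot which router copies dropped the backend is to count occurrences of the server's cookie hash in each extracted config. A sketch against the attached files (the ./router-*/ paths mirror the attachment layout and are illustrative; adjust to wherever the archive was extracted):

```shell
# Count occurrences of the missing server line in each router's config.
# Healthy routers should all report the same count; the "broken" ones
# from the list above will report fewer.
for f in ./router-*/haproxy.config; do
    printf '%s: ' "$f"
    grep -c 'server 7dd6fb3777b32ca1383ef13f0622de63' "$f"
done
```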
It could be https://bugzilla.redhat.com/show_bug.cgi?id=1464567 (or one of the other event queue bugs that we fixed). I know you aren't seeing the panics, but other queue problems were identified when we fixed that bug. Can they try enabling the debugging endpoint as outlined in the comment here: https://github.com/openshift/ose/pull/700 ? Then send me the output from that curl command and we can check it for anything suspicious. Alternatively, would they be willing to run the router with an elevated log level?
How to enable the debugging endpoints: the router exposes an HTTP profiling endpoint when OPENSHIFT_PROFILE=web is set; it is disabled by default until that variable is set. You can override the address it listens on (default 127.0.0.1) and the port (default 6061) with the OPENSHIFT_PROFILE_HOST and OPENSHIFT_PROFILE_PORT environment variables, respectively. With the default setup, you can do:

curl http://127.0.0.1:6061/debug/pprof/goroutine?debug=1
bbennett: comment 4 lists the same server 3 times. This suggests the router didn't see the event for it.
Can we get the current pod spec please?
*** This bug has been marked as a duplicate of bug 1464567 ***