Description of problem:
A user who is a member of 61 groups tries to log in to Kibana. The login fails with a 504 Gateway Timeout and, additionally, it causes the logging stack to become unstable.

cat 2020-11-13_all_groups.txt | grep '<name>@<domain>' | wc -l
61

The browser dev tools show a 504 after logging in to Kibana (see attached):
'Failed to load resource: the server responded with a status of 504 (Gateway Time-out)'

All other users can log in except for this one, which is why we suspect the groups as a possible cause of the problem.

When the user tries to log in, we see 'http: proxy error: context canceled' in the kibana-proxy logs:

2020-11-12T21:47:38.571344522Z time="2020-11-12T21:47:38Z" level=info msg="Handling request \"authorization\""
2020-11-12T21:47:43.299938713Z time="2020-11-12T21:47:43Z" level=info msg="Handling request \"authorization\""
2020-11-12T21:47:48.56353312Z 2020/11/12 21:47:48 http: proxy error: context canceled
2020-11-12T21:47:53.260677543Z 2020/11/12 21:47:53 http: proxy error: context canceled
2020-11-12T21:48:08.569948868Z time="2020-11-12T21:48:08Z" level=info msg="Handling request \"authorization\""
2020-11-12T21:48:09.429117008Z 2020/11/12 21:48:09 http: proxy error: context canceled
2020-11-12T21:48:13.264678167Z time="2020-11-12T21:48:13Z" level=info msg="Handling request \"authorization\""
2020-11-12T21:48:18.563518141Z 2020/11/12 21:48:18 http: proxy error: context canceled
2020-11-12T21:48:23.260694871Z 2020/11/12 21:48:23 http: proxy error: context canceled
2020-11-12T21:48:38.57132017Z time="2020-11-12T21:48:38Z" level=info msg="Handling request \"authorization\""
2020-11-12T21:48:43.264629309Z time="2020-11-12T21:48:43Z" level=info msg="Handling request \"authorization\""
2020-11-12T21:48:48.563492832Z 2020/11/12 21:48:48 http: proxy error: context canceled
2020-11-12T21:48:53.26067297Z 2020/11/12 21:48:53 http: proxy error: context canceled
2020-11-12T21:49:08.605230893Z time="2020-11-12T21:49:08Z" level=info msg="Handling request \"authorization\""
2020-11-12T21:49:13.264782811Z time="2020-11-12T21:49:13Z" level=info msg="Handling request \"authorization\""
2020-11-12T21:49:18.563476362Z 2020/11/12 21:49:18 http: proxy error: context canceled
2020-11-12T21:49:23.260727239Z 2020/11/12 21:49:23 http: proxy error: context canceled
2020-11-12T21:49:38.569684013Z time="2020-11-12T21:49:38Z" level=info msg="Handling request \"authorization\""
2020-11-12T21:49:43.265503867Z time="2020-11-12T21:49:43Z" level=info msg="Handling request \"authorization\""
2020-11-12T21:49:48.563457285Z 2020/11/12 21:49:48 http: proxy error: context canceled
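For reference, these messages can be tailed from the kibana-proxy container while the affected user retries the login; a rough sketch, assuming the default deployment and container names used by the logging stack:

# deployment/container names are the usual defaults and are an assumption; adjust if they differ
oc logs -n openshift-logging -c kibana-proxy deployment/kibana --since=10m | grep -E 'proxy error|Handling request'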
We also see various timeouts / socket hang-ups from Kibana:

2020-11-12T21:44:31.386182329Z {"type":"log","@timestamp":"2020-11-12T21:44:31Z","tags":["error","elasticsearch","data"],"pid":122,"message":"Request error, retrying\nPOST https://elasticsearch.openshift-logging.svc.cluster.local:9200/*/_search?ignore_unavailable=true => socket hang up"}
2020-11-12T21:46:31.414261869Z {"type":"log","@timestamp":"2020-11-12T21:46:31Z","tags":["error","elasticsearch","data"],"pid":122,"message":"Request error, retrying\nPOST https://elasticsearch.openshift-logging.svc.cluster.local:9200/*/_search?ignore_unavailable=true => socket hang up"}
2020-11-12T21:47:09.569597865Z {"type":"log","@timestamp":"2020-11-12T21:47:09Z","tags":["error","elasticsearch","admin"],"pid":122,"message":"Request error, retrying\nGET https://elasticsearch.openshift-logging.svc.cluster.local:9200/_nodes?filter_path=nodes.*.version%2Cnodes.*.http.publish_address%2Cnodes.*.ip => socket hang up"}
2020-11-12T21:49:09.583838754Z {"type":"log","@timestamp":"2020-11-12T21:49:09Z","tags":["status","plugin:elasticsearch.1","error"],"pid":122,"state":"red","message":"Status changed from green to red - No Living connections","prevState":"green","prevMsg":"Ready"}

Elasticsearch effectively falls over. GC throughput drops from ~90% to ~10-20%, which leads to various timeouts / ping issues. All of the GCs are young collections lasting between 0.1 s and 0.246 s (averaging 0.148 s); the issue is that between 4:40 and 5:10 pm EST (21:40-22:10 UTC) there are 9378 collections totalling 1390 seconds out of the 1800 seconds of elapsed time.

2020-11-12T21:45:21.471906457Z [2020-11-12T21:45:21,471][WARN ][o.e.m.j.JvmGcMonitorService] [elasticsearch-cdm-p1pxk400-1] [gc][365] overhead, spent [822ms] collecting in the last [1s]
2020-11-12T21:45:22.472201269Z [2020-11-12T21:45:22,472][WARN ][o.e.m.j.JvmGcMonitorService] [elasticsearch-cdm-p1pxk400-1] [gc][366] overhead, spent [813ms] collecting in the last [1s]
2020-11-12T21:45:23.472509676Z [2020-11-12T21:45:23,472][WARN ][o.e.m.j.JvmGcMonitorService] [elasticsearch-cdm-p1pxk400-1] [gc][367] overhead, spent [809ms] collecting in the last [1s]
2020-11-12T21:45:24.472821974Z [2020-11-12T21:45:24,472][WARN ][o.e.m.j.JvmGcMonitorService] [elasticsearch-cdm-p1pxk400-1] [gc][368] overhead, spent [826ms] collecting in the last [1s]
2020-11-12T21:45:25.572738797Z [2020-11-12T21:45:25,572][WARN ][o.e.m.j.JvmGcMonitorService] [elasticsearch-cdm-p1pxk400-1] [gc][369] overhead, spent [907ms] collecting in the last [1s]
2020-11-12T21:45:26.57305946Z [2020-11-12T21:45:26,572][WARN ][o.e.m.j.JvmGcMonitorService] [elasticsearch-cdm-p1pxk400-1] [gc][370] overhead, spent [807ms] collecting in the last [1s]
2020-11-12T21:45:27.648489273Z [2020-11-12T21:45:27,648][WARN ][o.e.m.j.JvmGcMonitorService] [elasticsearch-cdm-p1pxk400-1] [gc][371] overhead, spent [884ms] collecting in the last [1s]
2020-11-12T21:45:28.648831575Z [2020-11-12T21:45:28,648][WARN ][o.e.m.j.JvmGcMonitorService] [elasticsearch-cdm-p1pxk400-1] [gc][372] overhead, spent [802ms] collecting in the last [1s]
2020-11-12T21:45:29.71976221Z [2020-11-12T21:45:29,719][WARN ][o.e.m.j.JvmGcMonitorService] [elasticsearch-cdm-p1pxk400-1] [gc][373] overhead, spent [880ms] collecting in the last [1s]
2020-11-12T21:45:30.72011347Z [2020-11-12T21:45:30,719][WARN ][o.e.m.j.JvmGcMonitorService] [elasticsearch-cdm-p1pxk400-1] [gc][374] overhead, spent [796ms] collecting in the last [1s]
2020-11-12T21:45:31.82776028Z [2020-11-12T21:45:31,827][WARN ][o.e.m.j.JvmGcMonitorService] [elasticsearch-cdm-p1pxk400-1] [gc][375] overhead, spent [918ms] collecting in the last [1.1s]
2020-11-12T21:45:32.830395664Z [2020-11-12T21:45:32,830][WARN ][o.e.m.j.JvmGcMonitorService] [elasticsearch-cdm-p1pxk400-1] [gc][376] overhead, spent [814ms] collecting in the last [1s]
2020-11-12T21:45:33.830691169Z [2020-11-12T21:45:33,830][WARN ][o.e.m.j.JvmGcMonitorService] [elasticsearch-cdm-p1pxk400-1] [gc][377] overhead, spent [803ms] collecting in the last [1s]
2020-11-12T21:45:34.936488219Z [2020-11-12T21:45:34,936][WARN ][o.e.m.j.JvmGcMonitorService] [elasticsearch-cdm-p1pxk400-1] [gc][378] overhead, spent [915ms] collecting in the last [1.1s]
2020-11-12T21:45:35.936777039Z [2020-11-12T21:45:35,936][WARN ][o.e.m.j.JvmGcMonitorService] [elasticsearch-cdm-p1pxk400-1] [gc][379] overhead, spent [810ms] collecting in the last [1s]
2020-11-12T21:45:36.971670136Z [2020-11-12T21:45:36,971][WARN ][o.e.m.j.JvmGcMonitorService] [elasticsearch-cdm-p1pxk400-1] [gc][380] overhead, spent [857ms] collecting in the last [1s]
2020-11-12T21:45:37.972026989Z [2020-11-12T21:45:37,971][WARN ][o.e.m.j.JvmGcMonitorService] [elasticsearch-cdm-p1pxk400-1] [gc][381] overhead, spent [800ms] collecting in the last [1s]
2020-11-12T21:45:39.029920465Z [2020-11-12T21:45:39,029][WARN ][o.e.m.j.JvmGcMonitorService] [elasticsearch-cdm-p1pxk400-1] [gc][382] overhead, spent [822ms] collecting in the last [1s]
2020-11-12T21:45:40.06267205Z [2020-11-12T21:45:40,062][WARN ][o.e.m.j.JvmGcMonitorService] [elasticsearch-cdm-p1pxk400-1] [gc][383] overhead, spent [835ms] collecting in the last [1s]
2020-11-12T21:45:41.162303643Z [2020-11-12T21:45:41,162][WARN ][o.e.m.j.JvmGcMonitorService] [elasticsearch-cdm-p1pxk400-1] [gc][384] overhead, spent [838ms] collecting in the last [1s]
2020-11-12T21:45:42.237747891Z [2020-11-12T21:45:42,237][WARN ][o.e.m.j.JvmGcMonitorService] [elasticsearch-cdm-p1pxk400-1] [gc][385] overhead, spent [838ms] collecting in the last [1s]
2020-11-12T21:45:43.238117047Z [2020-11-12T21:45:43,237][WARN ][o.e.m.j.JvmGcMonitorService] [elasticsearch-cdm-p1pxk400-1] [gc][386] overhead, spent [800ms] collecting in the last [1s]
2020-11-12T21:45:44.306545326Z [2020-11-12T21:45:44,306][WARN ][o.e.m.j.JvmGcMonitorService] [elasticsearch-cdm-p1pxk400-1] [gc][387] overhead, spent [874ms] collecting in the last [1s]
2020-11-12T21:45:44.738827192Z [2020-11-12T21:45:44,738][WARN ][o.e.c.InternalClusterInfoService] [elasticsearch-cdm-p1pxk400-1] Failed to update node information for ClusterInfoUpdateJob within 15s timeout
2020-11-12T21:45:44.739307655Z [2020-11-12T21:45:44,738][DEBUG][o.e.a.a.c.n.s.TransportNodesStatsAction] [elasticsearch-cdm-p1pxk400-1] failed to execute on node [umEqe7zTQIGWBqJ4I4EXHQ]
2020-11-12T21:45:44.739307655Z org.elasticsearch.transport.ReceiveTimeoutTransportException: [elasticsearch-cdm-p1pxk400-3][10.130.14.21:9300][cluster:monitor/nodes/stats[n]] request_id [15974] timed out after [14876ms]
2020-11-12T21:45:44.739307655Z at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:1016) [elasticsearch-6.8.1.redhat-6.jar:6.8.1.redhat-6]
2020-11-12T21:45:44.739307655Z at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:681) [elasticsearch-6.8.1.redhat-6.jar:6.8.1.redhat-6]
2020-11-12T21:45:44.739307655Z at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_265]
2020-11-12T21:45:44.739307655Z at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_265]
2020-11-12T21:45:44.739307655Z at java.lang.Thread.run(Thread.java:748) [?:1.8.0_265]
2020-11-12T21:45:44.739366015Z [2020-11-12T21:45:44,738][DEBUG][o.e.a.a.c.n.s.TransportNodesStatsAction] [elasticsearch-cdm-p1pxk400-1] failed to execute on node [x1aOSBYpR0Cdjkt5yZlRJQ]
2020-11-12T21:45:44.739366015Z org.elasticsearch.transport.ReceiveTimeoutTransportException: [elasticsearch-cdm-p1pxk400-2][10.129.6.22:9300][cluster:monitor/nodes/stats[n]] request_id [15973] timed out after [14876ms]
2020-11-12T21:45:44.739366015Z at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:1016) [elasticsearch-6.8.1.redhat-6.jar:6.8.1.redhat-6]
2020-11-12T21:45:44.739366015Z at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:681) [elasticsearch-6.8.1.redhat-6.jar:6.8.1.redhat-6]
2020-11-12T21:45:44.739366015Z at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_265]
2020-11-12T21:45:44.739366015Z at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_265]
2020-11-12T21:45:44.739366015Z at java.lang.Thread.run(Thread.java:748) [?:1.8.0_265]
2020-11-12T21:45:45.306839983Z [2020-11-12T21:45:45,306][WARN ][o.e.m.j.JvmGcMonitorService] [elasticsearch-cdm-p1pxk400-1] [gc][388] overhead, spent [813ms] collecting in the last [1s]
2020-11-12T21:45:46.307129478Z [2020-11-12T21:45:46,307][WARN ][o.e.m.j.JvmGcMonitorService] [elasticsearch-cdm-p1pxk400-1] [gc][389] overhead, spent [817ms] collecting in the last [1s]
2020-11-12T21:45:47.30744604Z [2020-11-12T21:45:47,307][WARN ][o.e.m.j.JvmGcMonitorService] [elasticsearch-cdm-p1pxk400-1] [gc][390] overhead, spent [815ms] collecting in the last [1s]
2020-11-12T21:45:48.310788806Z [2020-11-12T21:45:48,310][WARN ][o.e.m.j.JvmGcMonitorService] [elasticsearch-cdm-p1pxk400-1] [gc][391] overhead, spent [832ms] collecting in the last [1s]
2020-11-12T21:45:59.743606102Z [2020-11-12T21:45:59,743][WARN ][o.e.c.InternalClusterInfoService] [elasticsearch-cdm-p1pxk400-1] Failed to update shard information for ClusterInfoUpdateJob within 15s timeout
2020-11-12T21:46:18.421464753Z [2020-11-12T21:46:18,421][WARN ][o.e.t.TransportService ] [elasticsearch-cdm-p1pxk400-1] Received response for a request that has timed out, sent [48686ms] ago, timed out [33810ms] ago, action [cluster:monitor/nodes/stats[n]], node [{elasticsearch-cdm-p1pxk400-3}{umEqe7zTQIGWBqJ4I4EXHQ}{eLL6-3MMRW-ihqEBU-bd1A}{10.130.14.21}{10.130.14.21:9300}], id [15974]
2020-11-12T21:46:18.429372332Z [2020-11-12T21:46:18,429][WARN ][o.e.t.TransportService ] [elasticsearch-cdm-p1pxk400-1] Received response for a request that has timed out, sent [70352ms] ago, timed out [40447ms] ago, action [internal:discovery/zen/fd/ping], node [{elasticsearch-cdm-p1pxk400-2}{x1aOSBYpR0Cdjkt5yZlRJQ}{VHLG4j3wQQW2LaQ3aC1L9w}{10.129.6.22}{10.129.6.22:9300}], id [15931]
2020-11-12T21:46:18.429427283Z [2020-11-12T21:46:18,429][WARN ][o.e.t.TransportService ] [elasticsearch-cdm-p1pxk400-1] Received response for a request that has timed out, sent [40447ms] ago, timed out [10383ms] ago, action [internal:discovery/zen/fd/ping], node [{elasticsearch-cdm-p1pxk400-2}{x1aOSBYpR0Cdjkt5yZlRJQ}{VHLG4j3wQQW2LaQ3aC1L9w}{10.129.6.22}{10.129.6.22:9300}], id [15979]
2020-11-12T21:46:18.429532841Z [2020-11-12T21:46:18,429][WARN ][o.e.t.TransportService ] [elasticsearch-cdm-p1pxk400-1] Received response for a request that has timed out, sent [48686ms] ago, timed out [33810ms] ago, action [cluster:monitor/nodes/stats[n]], node [{elasticsearch-cdm-p1pxk400-2}{x1aOSBYpR0Cdjkt5yZlRJQ}{VHLG4j3wQQW2LaQ3aC1L9w}{10.129.6.22}{10.129.6.22:9300}], id [15973]
2020-11-12T21:46:37.65583289Z [2020-11-12T21:46:37,655][INFO ][o.e.c.s.MasterService ] [elasticsearch-cdm-p1pxk400-1] zen-disco-node-failed({elasticsearch-cdm-p1pxk400-3}{umEqe7zTQIGWBqJ4I4EXHQ}{eLL6-3MMRW-ihqEBU-bd1A}{10.130.14.21}{10.130.14.21:9300}), reason(failed to ping, tried [3] times, each with maximum [30s] timeout)[{elasticsearch-cdm-p1pxk400-3}{umEqe7zTQIGWBqJ4I4EXHQ}{eLL6-3MMRW-ihqEBU-bd1A}{10.130.14.21}{10.130.14.21:9300} failed to ping, tried [3] times, each with maximum [30s] timeout], reason: removed {{elasticsearch-cdm-p1pxk400-3}{umEqe7zTQIGWBqJ4I4EXHQ}{eLL6-3MMRW-ihqEBU-bd1A}{10.130.14.21}{10.130.14.21:9300},}
2020-11-12T21:46:50.181415374Z [2020-11-12T21:46:50,181][WARN ][o.e.m.j.JvmGcMonitorService] [elasticsearch-cdm-p1pxk400-1] [gc][450] overhead, spent [893ms] collecting in the last [1.1s]
2020-11-12T21:46:51.181699253Z [2020-11-12T21:46:51,181][WARN ][o.e.m.j.JvmGcMonitorService] [elasticsearch-cdm-p1pxk400-1] [gc][451] overhead, spent [790ms] collecting in the last [1s]
2020-11-12T21:46:52.26313281Z [2020-11-12T21:46:52,263][WARN ][o.e.m.j.JvmGcMonitorService] [elasticsearch-cdm-p1pxk400-1] [gc][452] overhead, spent [893ms] collecting in the last [1s]
2020-11-12T21:46:53.281153794Z [2020-11-12T21:46:53,281][WARN ][o.e.m.j.JvmGcMonitorService] [elasticsearch-cdm-p1pxk400-1] [gc][453] overhead, spent [820ms] collecting in the last [1s]
2020-11-12T21:46:54.281433893Z [2020-11-12T21:46:54,281][WARN ][o.e.m.j.JvmGcMonitorService] [elasticsearch-cdm-p1pxk400-1] [gc][454] overhead, spent [786ms] collecting in the last [1s]
2020-11-12T21:46:55.365415789Z [2020-11-12T21:46:55,365][WARN ][o.e.m.j.JvmGcMonitorService] [elasticsearch-cdm-p1pxk400-1] [gc][455] overhead, spent [900ms] collecting in the last [1s]
2020-11-12T21:46:56.369313568Z [2020-11-12T21:46:56,369][WARN ][o.e.m.j.JvmGcMonitorService] [elasticsearch-cdm-p1pxk400-1] [gc][456] overhead, spent [807ms] collecting in the last [1s]
2020-11-12T21:46:57.48354951Z [2020-11-12T21:46:57,483][WARN ][o.e.m.j.JvmGcMonitorService] [elasticsearch-cdm-p1pxk400-1] [gc][457] overhead, spent [903ms] collecting in the last [1.1s]
2020-11-12T21:46:58.483863418Z [2020-11-12T21:46:58,483][WARN ][o.e.m.j.JvmGcMonitorService] [elasticsearch-cdm-p1pxk400-1] [gc][458] overhead, spent [798ms] collecting in the last [1s]
2020-11-12T21:46:59.55684556Z [2020-11-12T21:46:59,556][WARN ][o.e.m.j.JvmGcMonitorService] [elasticsearch-cdm-p1pxk400-1] [gc][459] overhead, spent [882ms] collecting in the last [1s]
2020-11-12T21:47:00.643944182Z [2020-11-12T21:47:00,643][WARN ][o.e.m.j.JvmGcMonitorService] [elasticsearch-cdm-p1pxk400-1] [gc][460] overhead, spent [883ms] collecting in the last [1s]
2020-11-12T21:47:01.757139253Z [2020-11-12T21:47:01,757][WARN ][o.e.m.j.JvmGcMonitorService] [elasticsearch-cdm-p1pxk400-1] [gc][461] overhead, spent [904ms] collecting in the last [1.1s]
2020-11-12T21:47:02.763469886Z [2020-11-12T21:47:02,763][WARN ][o.e.m.j.JvmGcMonitorService] [elasticsearch-cdm-p1pxk400-1] [gc][462] overhead, spent [824ms] collecting in the last [1s]
2020-11-12T21:47:03.874501784Z [2020-11-12T21:47:03,874][WARN ][o.e.m.j.JvmGcMonitorService] [elasticsearch-cdm-p1pxk400-1] [gc][463] overhead, spent [906ms] collecting in the last [1.1s]
2020-11-12T21:47:04.874772121Z [2020-11-12T21:47:04,874][WARN ][o.e.m.j.JvmGcMonitorService] [elasticsearch-cdm-p1pxk400-1] [gc][464] overhead, spent [789ms] collecting in the last [1s]
2020-11-12T21:47:05.961033847Z [2020-11-12T21:47:05,960][WARN ][o.e.m.j.JvmGcMonitorService] [elasticsearch-cdm-p1pxk400-1] [gc][465] overhead, spent [904ms] collecting in the last [1s]
2020-11-12T21:47:07.072977573Z [2020-11-12T21:47:07,072][WARN ][o.e.m.j.JvmGcMonitorService] [elasticsearch-cdm-p1pxk400-1] [gc][466] overhead, spent [902ms] collecting in the last [1.1s]
2020-11-12T21:47:07.658715241Z [2020-11-12T21:47:07,658][WARN ][o.e.d.z.ZenDiscovery ] [elasticsearch-cdm-p1pxk400-1] zen-disco-failed-to-publish, current nodes: nodes:
2020-11-12T21:47:07.658715241Z {elasticsearch-cdm-p1pxk400-2}{x1aOSBYpR0Cdjkt5yZlRJQ}{VHLG4j3wQQW2LaQ3aC1L9w}{10.129.6.22}{10.129.6.22:9300}
2020-11-12T21:47:07.658715241Z {elasticsearch-cdm-p1pxk400-1}{1Xk64uTGQ8eO_u3-4qSaYA}{Cjw7sQ5rRK6eoiH1QKRQJQ}{10.131.10.25}{10.131.10.25:9300}, local, master
2020-11-12T21:47:07.658715241Z {elasticsearch-cdm-p1pxk400-3}{umEqe7zTQIGWBqJ4I4EXHQ}{eLL6-3MMRW-ihqEBU-bd1A}{10.130.14.21}{10.130.14.21:9300}
2020-11-12T21:47:07.658715241Z
2020-11-12T21:47:07.659988632Z [2020-11-12T21:47:07,659][WARN ][o.e.c.s.MasterService ] [elasticsearch-cdm-p1pxk400-1] failing [zen-disco-node-failed({elasticsearch-cdm-p1pxk400-3}{umEqe7zTQIGWBqJ4I4EXHQ}{eLL6-3MMRW-ihqEBU-bd1A}{10.130.14.21}{10.130.14.21:9300}), reason(failed to ping, tried [3] times, each with maximum [30s] timeout)[{elasticsearch-cdm-p1pxk400-3}{umEqe7zTQIGWBqJ4I4EXHQ}{eLL6-3MMRW-ihqEBU-bd1A}{10.130.14.21}{10.130.14.21:9300} failed to ping, tried [3] times, each with maximum [30s] timeout]]: failed to commit cluster state version [200]
2020-11-12T21:47:07.659988632Z org.elasticsearch.discovery.Discovery$FailedToCommitClusterStateException: timed out while waiting for enough masters to ack sent cluster state. [1] left
2020-11-12T21:47:14.467382007Z org.elasticsearch.transport.ReceiveTimeoutTransportException: [elasticsearch-cdm-p1pxk400-3][10.130.14.21:9300][internal:discovery/zen/unicast] request_id [16349] timed out after [3873ms]
2020-11-12T21:47:14.467382007Z at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:1016) [elasticsearch-6.8.1.redhat-6.jar:6.8.1.redhat-6]
2020-11-12T21:47:14.467382007Z at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:681) [elasticsearch-6.8.1.redhat-6.jar:6.8.1.redhat-6]
2020-11-12T21:47:14.467382007Z at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_265]
2020-11-12T21:47:14.467382007Z at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_265]
2020-11-12T21:47:14.467382007Z at java.lang.Thread.run(Thread.java:748) [?:1.8.0_265]
2020-11-12T21:47:14.467382007Z [2020-11-12T21:47:14,467][WARN ][o.e.d.z.UnicastZenPing ] [elasticsearch-cdm-p1pxk400-1] failed to send ping to [{elasticsearch-cdm-p1pxk400-2}{x1aOSBYpR0Cdjkt5yZlRJQ}{VHLG4j3wQQW2LaQ3aC1L9w}{10.129.6.22}{10.129.6.22:9300}]
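For reference, the GC totals quoted above (9378 collections, ~1390 s of a 1800 s window) can be recomputed from a saved node log; a rough sketch, assuming the collection times are reported in milliseconds as in the excerpt, and that the log has been saved locally (the filename is illustrative):

# filename is illustrative; restrict the input to the 21:40-22:10 UTC window before summing
grep 'overhead, spent' elasticsearch-cdm-p1pxk400-1.log | sed 's/.*spent \[\([0-9]*\)ms\].*/\1/' | awk '{ n++; total += $1 } END { printf "%d collections, %.0f seconds of GC\n", n, total/1000 }'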
ES 2 and 3 show the same issues. Generally, though, ES is green and there is not a lot of CPU or heap pressure:

ip           heap.percent ram.percent cpu load_1m load_5m load_15m node.role master name
10.129.6.22  48           26          8   1.81    1.81    1.77     mdi       -      elasticsearch-cdm-p1pxk400-2
10.130.14.21 45           27          14  2.42    2.42    2.05     mdi       -      elasticsearch-cdm-p1pxk400-3
10.131.10.25 33           41          10  2.94    2.53    2.35     mdi       *      elasticsearch-cdm-p1pxk400-1

epoch      timestamp cluster       status node.total node.data shards pri relo init unassign pending_tasks max_task_wait_time active_shards_percent
1605219657 22:20:57  elasticsearch green  3          3         468    234 0    0    0        0             -                  100.0%

As a result, fluentd is either non-functional or really slow:

2020-11-12T21:47:14.339053774Z 2020-11-12 21:47:14 +0000 [warn]: [elasticsearch] buffer flush took longer time than slow_flush_log_threshold: elapsed_time=64.17672873500851 slow_flush_log_threshold=20.0 plugin_id="elasticsearch"
2020-11-12T21:49:24.243061207Z 2020-11-12 21:49:24 +0000 [warn]: [elasticsearch] failed to flush the buffer. retry_time=0 next_retry_seconds=2020-11-12 21:49:25 +0000 chunk="5b3ef9eb60be09eb2efba6aecc08b3cd" error_class=Fluent::Plugin::ElasticsearchOutput::RecoverableRequestFailure error="could not push logs to Elasticsearch cluster ({:host=>\"elasticsearch.openshift-logging.svc.cluster.local\", :port=>9200, :scheme=>\"https\", :user=>\"fluentd\", :password=>\"obfuscated\"}): Connection timed out - connect(2) for 172.30.61.39:9200 (Errno::ETIMEDOUT)"
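To gauge how widespread the flush problem is during the window, the slow-flush warnings can be counted per collector pod; a sketch, assuming the collector pods carry the component=fluentd label in openshift-logging:

# label selector is an assumption; adjust to match the collector pods
for p in $(oc get pods -n openshift-logging -l component=fluentd -o name); do
  echo "$p: $(oc logs -n openshift-logging "$p" --since=30m | grep -c 'slow_flush_log_threshold')"
done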
After the user's login attempt fails, things generally return to normal.

Version-Release number of selected component (if applicable):
EO is 4.5.0-202010161522.p0
CLO is 4.5.0-202010161522.p0

How reproducible:
100% with this specific user.

Steps to Reproduce:
1. A user who is a member of many groups (61 in this case) logs in to Kibana.
2. Observe the 504 on login and the Elasticsearch instability described above.

Actual results:
User login fails and ES becomes unstable.

Expected results:
User login succeeds and ES remains stable.

Additional info:
We validated that http.max_header_size is 128, and also tested with 256 and 512 just to be sure.
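Two optional cross-checks that may help triage (both are sketches; the configmap name and the availability of jq are assumptions):

# confirm the header-size value that was actually rendered into the Elasticsearch config
oc get configmap elasticsearch -n openshift-logging -o yaml | grep -i max_header_size

# recount the user's group memberships straight from the API instead of the exported file
oc get groups -o json | jq '[.items[] | select(any(.users[]?; . == "<name>@<domain>"))] | length'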