Bug 1897595 - Logging timeouts and cluster disconnections occur when a user with many groups logs in
Summary: Logging timeouts and cluster disconnections occur when a user with many groups logs in
Keywords:
Status: CLOSED DUPLICATE of bug 1895581
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Logging
Version: 4.5
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 4.7.0
Assignee: ewolinet
QA Contact: Anping Li
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-11-13 15:20 UTC by Matthew Robson
Modified: 2024-06-13 23:25 UTC
CC: 4 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-11-20 16:59:27 UTC
Target Upstream Version:
Embargoed:


Attachments

Description Matthew Robson 2020-11-13 15:20:50 UTC
Description of problem:

A user who is a member of 61 groups tries to log into Kibana. The login fails with a 504 Gateway Timeout and, additionally, it causes the logging stack to become unstable.

cat 2020-11-13_all_groups.txt | grep '<name>@<domain>' | wc -l
      61
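
For reference, a similar count can also be taken live from the cluster instead of the group dump; a minimal sketch, assuming the identity string appears verbatim in each Group's users list (the user name below is a placeholder):

# Placeholder identity; counts the Group objects whose .users list contains it.
TARGET_USER='name@domain'
oc get groups -o json \
  | jq --arg u "$TARGET_USER" '[.items[] | select((.users // []) | index($u))] | length'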

Web Dev tools show 504 after logging into Kibana (see attached)
'Failed to load resource: the server responded with a status of 504 (Gateway Time-out)'

All other users can log in except for this one, which is why we suspect the groups as a possible cause of the problem.

When the user tries to log in, we see 'http: proxy error: context canceled' in the kibana-proxy logs:

2020-11-12T21:47:38.571344522Z time="2020-11-12T21:47:38Z" level=info msg="Handling request \"authorization\""
2020-11-12T21:47:43.299938713Z time="2020-11-12T21:47:43Z" level=info msg="Handling request \"authorization\""
2020-11-12T21:47:48.56353312Z 2020/11/12 21:47:48 http: proxy error: context canceled
2020-11-12T21:47:53.260677543Z 2020/11/12 21:47:53 http: proxy error: context canceled
2020-11-12T21:48:08.569948868Z time="2020-11-12T21:48:08Z" level=info msg="Handling request \"authorization\""
2020-11-12T21:48:09.429117008Z 2020/11/12 21:48:09 http: proxy error: context canceled
2020-11-12T21:48:13.264678167Z time="2020-11-12T21:48:13Z" level=info msg="Handling request \"authorization\""
2020-11-12T21:48:18.563518141Z 2020/11/12 21:48:18 http: proxy error: context canceled
2020-11-12T21:48:23.260694871Z 2020/11/12 21:48:23 http: proxy error: context canceled
2020-11-12T21:48:38.57132017Z time="2020-11-12T21:48:38Z" level=info msg="Handling request \"authorization\""
2020-11-12T21:48:43.264629309Z time="2020-11-12T21:48:43Z" level=info msg="Handling request \"authorization\""
2020-11-12T21:48:48.563492832Z 2020/11/12 21:48:48 http: proxy error: context canceled
2020-11-12T21:48:53.26067297Z 2020/11/12 21:48:53 http: proxy error: context canceled
2020-11-12T21:49:08.605230893Z time="2020-11-12T21:49:08Z" level=info msg="Handling request \"authorization\""
2020-11-12T21:49:13.264782811Z time="2020-11-12T21:49:13Z" level=info msg="Handling request \"authorization\""
2020-11-12T21:49:18.563476362Z 2020/11/12 21:49:18 http: proxy error: context canceled
2020-11-12T21:49:23.260727239Z 2020/11/12 21:49:23 http: proxy error: context canceled
2020-11-12T21:49:38.569684013Z time="2020-11-12T21:49:38Z" level=info msg="Handling request \"authorization\""
2020-11-12T21:49:43.265503867Z time="2020-11-12T21:49:43Z" level=info msg="Handling request \"authorization\""
2020-11-12T21:49:48.563457285Z 2020/11/12 21:49:48 http: proxy error: context canceled
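
For reference, these entries can be pulled with something like the following; a sketch that assumes the default openshift-logging namespace, the stock component=kibana pod label, and the kibana-proxy sidecar container name:

# Tail the kibana-proxy sidecar of the (assumed) single Kibana pod.
KIBANA_POD=$(oc -n openshift-logging get pods -l component=kibana -o name | head -n1)
oc -n openshift-logging logs "$KIBANA_POD" -c kibana-proxy --since=1h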

And various timeouts / socket hang ups from kibana:

2020-11-12T21:44:31.386182329Z {"type":"log","@timestamp":"2020-11-12T21:44:31Z","tags":["error","elasticsearch","data"],"pid":122,"message":"Request error, retrying\nPOST https://elasticsearch.openshift-logging.svc.cluster.local:9200/*/_search?ignore_unavailable=true => socket hang up"}
2020-11-12T21:46:31.414261869Z {"type":"log","@timestamp":"2020-11-12T21:46:31Z","tags":["error","elasticsearch","data"],"pid":122,"message":"Request error, retrying\nPOST https://elasticsearch.openshift-logging.svc.cluster.local:9200/*/_search?ignore_unavailable=true => socket hang up"}
2020-11-12T21:47:09.569597865Z {"type":"log","@timestamp":"2020-11-12T21:47:09Z","tags":["error","elasticsearch","admin"],"pid":122,"message":"Request error, retrying\nGET https://elasticsearch.openshift-logging.svc.cluster.local:9200/_nodes?filter_path=nodes.*.version%2Cnodes.*.http.publish_address%2Cnodes.*.ip => socket hang up"}
2020-11-12T21:49:09.583838754Z {"type":"log","@timestamp":"2020-11-12T21:49:09Z","tags":["status","plugin:elasticsearch.1","error"],"pid":122,"state":"red","message":"Status changed from green to red - No Living connections","prevState":"green","prevMsg":"Ready"}


Elasticsearch effectively falls over. GC throughput drops from ~90% to ~10-20%, which leads to various timeouts / ping issues.

All of the GCs are young collections between 0.1s and 0.246s, averaging 0.148s. The issue is that between 4:40 and 5:10 EST (21:40 - 22:10 UTC) there are 9378 collections totalling 1390 seconds out of the 1800 seconds of elapsed time, i.e. roughly 77% of wall-clock time is spent in GC, which is consistent with the drop in throughput.

2020-11-12T21:45:21.471906457Z [2020-11-12T21:45:21,471][WARN ][o.e.m.j.JvmGcMonitorService] [elasticsearch-cdm-p1pxk400-1] [gc][365] overhead, spent [822ms] collecting in the last [1s]
2020-11-12T21:45:22.472201269Z [2020-11-12T21:45:22,472][WARN ][o.e.m.j.JvmGcMonitorService] [elasticsearch-cdm-p1pxk400-1] [gc][366] overhead, spent [813ms] collecting in the last [1s]
2020-11-12T21:45:23.472509676Z [2020-11-12T21:45:23,472][WARN ][o.e.m.j.JvmGcMonitorService] [elasticsearch-cdm-p1pxk400-1] [gc][367] overhead, spent [809ms] collecting in the last [1s]
2020-11-12T21:45:24.472821974Z [2020-11-12T21:45:24,472][WARN ][o.e.m.j.JvmGcMonitorService] [elasticsearch-cdm-p1pxk400-1] [gc][368] overhead, spent [826ms] collecting in the last [1s]
2020-11-12T21:45:25.572738797Z [2020-11-12T21:45:25,572][WARN ][o.e.m.j.JvmGcMonitorService] [elasticsearch-cdm-p1pxk400-1] [gc][369] overhead, spent [907ms] collecting in the last [1s]
2020-11-12T21:45:26.57305946Z [2020-11-12T21:45:26,572][WARN ][o.e.m.j.JvmGcMonitorService] [elasticsearch-cdm-p1pxk400-1] [gc][370] overhead, spent [807ms] collecting in the last [1s]
2020-11-12T21:45:27.648489273Z [2020-11-12T21:45:27,648][WARN ][o.e.m.j.JvmGcMonitorService] [elasticsearch-cdm-p1pxk400-1] [gc][371] overhead, spent [884ms] collecting in the last [1s]
2020-11-12T21:45:28.648831575Z [2020-11-12T21:45:28,648][WARN ][o.e.m.j.JvmGcMonitorService] [elasticsearch-cdm-p1pxk400-1] [gc][372] overhead, spent [802ms] collecting in the last [1s]
2020-11-12T21:45:29.71976221Z [2020-11-12T21:45:29,719][WARN ][o.e.m.j.JvmGcMonitorService] [elasticsearch-cdm-p1pxk400-1] [gc][373] overhead, spent [880ms] collecting in the last [1s]
2020-11-12T21:45:30.72011347Z [2020-11-12T21:45:30,719][WARN ][o.e.m.j.JvmGcMonitorService] [elasticsearch-cdm-p1pxk400-1] [gc][374] overhead, spent [796ms] collecting in the last [1s]
2020-11-12T21:45:31.82776028Z [2020-11-12T21:45:31,827][WARN ][o.e.m.j.JvmGcMonitorService] [elasticsearch-cdm-p1pxk400-1] [gc][375] overhead, spent [918ms] collecting in the last [1.1s]
2020-11-12T21:45:32.830395664Z [2020-11-12T21:45:32,830][WARN ][o.e.m.j.JvmGcMonitorService] [elasticsearch-cdm-p1pxk400-1] [gc][376] overhead, spent [814ms] collecting in the last [1s]
2020-11-12T21:45:33.830691169Z [2020-11-12T21:45:33,830][WARN ][o.e.m.j.JvmGcMonitorService] [elasticsearch-cdm-p1pxk400-1] [gc][377] overhead, spent [803ms] collecting in the last [1s]
2020-11-12T21:45:34.936488219Z [2020-11-12T21:45:34,936][WARN ][o.e.m.j.JvmGcMonitorService] [elasticsearch-cdm-p1pxk400-1] [gc][378] overhead, spent [915ms] collecting in the last [1.1s]
2020-11-12T21:45:35.936777039Z [2020-11-12T21:45:35,936][WARN ][o.e.m.j.JvmGcMonitorService] [elasticsearch-cdm-p1pxk400-1] [gc][379] overhead, spent [810ms] collecting in the last [1s]
2020-11-12T21:45:36.971670136Z [2020-11-12T21:45:36,971][WARN ][o.e.m.j.JvmGcMonitorService] [elasticsearch-cdm-p1pxk400-1] [gc][380] overhead, spent [857ms] collecting in the last [1s]
2020-11-12T21:45:37.972026989Z [2020-11-12T21:45:37,971][WARN ][o.e.m.j.JvmGcMonitorService] [elasticsearch-cdm-p1pxk400-1] [gc][381] overhead, spent [800ms] collecting in the last [1s]
2020-11-12T21:45:39.029920465Z [2020-11-12T21:45:39,029][WARN ][o.e.m.j.JvmGcMonitorService] [elasticsearch-cdm-p1pxk400-1] [gc][382] overhead, spent [822ms] collecting in the last [1s]
2020-11-12T21:45:40.06267205Z [2020-11-12T21:45:40,062][WARN ][o.e.m.j.JvmGcMonitorService] [elasticsearch-cdm-p1pxk400-1] [gc][383] overhead, spent [835ms] collecting in the last [1s]
2020-11-12T21:45:41.162303643Z [2020-11-12T21:45:41,162][WARN ][o.e.m.j.JvmGcMonitorService] [elasticsearch-cdm-p1pxk400-1] [gc][384] overhead, spent [838ms] collecting in the last [1s]
2020-11-12T21:45:42.237747891Z [2020-11-12T21:45:42,237][WARN ][o.e.m.j.JvmGcMonitorService] [elasticsearch-cdm-p1pxk400-1] [gc][385] overhead, spent [838ms] collecting in the last [1s]
2020-11-12T21:45:43.238117047Z [2020-11-12T21:45:43,237][WARN ][o.e.m.j.JvmGcMonitorService] [elasticsearch-cdm-p1pxk400-1] [gc][386] overhead, spent [800ms] collecting in the last [1s]
2020-11-12T21:45:44.306545326Z [2020-11-12T21:45:44,306][WARN ][o.e.m.j.JvmGcMonitorService] [elasticsearch-cdm-p1pxk400-1] [gc][387] overhead, spent [874ms] collecting in the last [1s]
2020-11-12T21:45:44.738827192Z [2020-11-12T21:45:44,738][WARN ][o.e.c.InternalClusterInfoService] [elasticsearch-cdm-p1pxk400-1] Failed to update node information for ClusterInfoUpdateJob within 15s timeout
2020-11-12T21:45:44.739307655Z [2020-11-12T21:45:44,738][DEBUG][o.e.a.a.c.n.s.TransportNodesStatsAction] [elasticsearch-cdm-p1pxk400-1] failed to execute on node [umEqe7zTQIGWBqJ4I4EXHQ]
2020-11-12T21:45:44.739307655Z org.elasticsearch.transport.ReceiveTimeoutTransportException: [elasticsearch-cdm-p1pxk400-3][10.130.14.21:9300][cluster:monitor/nodes/stats[n]] request_id [15974] timed out after [14876ms]
2020-11-12T21:45:44.739307655Z  at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:1016) [elasticsearch-6.8.1.redhat-6.jar:6.8.1.redhat-6]
2020-11-12T21:45:44.739307655Z  at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:681) [elasticsearch-6.8.1.redhat-6.jar:6.8.1.redhat-6]
2020-11-12T21:45:44.739307655Z  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_265]
2020-11-12T21:45:44.739307655Z  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_265]
2020-11-12T21:45:44.739307655Z  at java.lang.Thread.run(Thread.java:748) [?:1.8.0_265]
2020-11-12T21:45:44.739366015Z [2020-11-12T21:45:44,738][DEBUG][o.e.a.a.c.n.s.TransportNodesStatsAction] [elasticsearch-cdm-p1pxk400-1] failed to execute on node [x1aOSBYpR0Cdjkt5yZlRJQ]
2020-11-12T21:45:44.739366015Z org.elasticsearch.transport.ReceiveTimeoutTransportException: [elasticsearch-cdm-p1pxk400-2][10.129.6.22:9300][cluster:monitor/nodes/stats[n]] request_id [15973] timed out after [14876ms]
2020-11-12T21:45:44.739366015Z  at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:1016) [elasticsearch-6.8.1.redhat-6.jar:6.8.1.redhat-6]
2020-11-12T21:45:44.739366015Z  at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:681) [elasticsearch-6.8.1.redhat-6.jar:6.8.1.redhat-6]
2020-11-12T21:45:44.739366015Z  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_265]
2020-11-12T21:45:44.739366015Z  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_265]
2020-11-12T21:45:44.739366015Z  at java.lang.Thread.run(Thread.java:748) [?:1.8.0_265]
2020-11-12T21:45:45.306839983Z [2020-11-12T21:45:45,306][WARN ][o.e.m.j.JvmGcMonitorService] [elasticsearch-cdm-p1pxk400-1] [gc][388] overhead, spent [813ms] collecting in the last [1s]
2020-11-12T21:45:46.307129478Z [2020-11-12T21:45:46,307][WARN ][o.e.m.j.JvmGcMonitorService] [elasticsearch-cdm-p1pxk400-1] [gc][389] overhead, spent [817ms] collecting in the last [1s]
2020-11-12T21:45:47.30744604Z [2020-11-12T21:45:47,307][WARN ][o.e.m.j.JvmGcMonitorService] [elasticsearch-cdm-p1pxk400-1] [gc][390] overhead, spent [815ms] collecting in the last [1s]
2020-11-12T21:45:48.310788806Z [2020-11-12T21:45:48,310][WARN ][o.e.m.j.JvmGcMonitorService] [elasticsearch-cdm-p1pxk400-1] [gc][391] overhead, spent [832ms] collecting in the last [1s]
2020-11-12T21:45:59.743606102Z [2020-11-12T21:45:59,743][WARN ][o.e.c.InternalClusterInfoService] [elasticsearch-cdm-p1pxk400-1] Failed to update shard information for ClusterInfoUpdateJob within 15s timeout
2020-11-12T21:46:18.421464753Z [2020-11-12T21:46:18,421][WARN ][o.e.t.TransportService   ] [elasticsearch-cdm-p1pxk400-1] Received response for a request that has timed out, sent [48686ms] ago, timed out [33810ms] ago, action [cluster:monitor/nodes/stats[n]], node [{elasticsearch-cdm-p1pxk400-3}{umEqe7zTQIGWBqJ4I4EXHQ}{eLL6-3MMRW-ihqEBU-bd1A}{10.130.14.21}{10.130.14.21:9300}], id [15974]
2020-11-12T21:46:18.429372332Z [2020-11-12T21:46:18,429][WARN ][o.e.t.TransportService   ] [elasticsearch-cdm-p1pxk400-1] Received response for a request that has timed out, sent [70352ms] ago, timed out [40447ms] ago, action [internal:discovery/zen/fd/ping], node [{elasticsearch-cdm-p1pxk400-2}{x1aOSBYpR0Cdjkt5yZlRJQ}{VHLG4j3wQQW2LaQ3aC1L9w}{10.129.6.22}{10.129.6.22:9300}], id [15931]
2020-11-12T21:46:18.429427283Z [2020-11-12T21:46:18,429][WARN ][o.e.t.TransportService   ] [elasticsearch-cdm-p1pxk400-1] Received response for a request that has timed out, sent [40447ms] ago, timed out [10383ms] ago, action [internal:discovery/zen/fd/ping], node [{elasticsearch-cdm-p1pxk400-2}{x1aOSBYpR0Cdjkt5yZlRJQ}{VHLG4j3wQQW2LaQ3aC1L9w}{10.129.6.22}{10.129.6.22:9300}], id [15979]
2020-11-12T21:46:18.429532841Z [2020-11-12T21:46:18,429][WARN ][o.e.t.TransportService   ] [elasticsearch-cdm-p1pxk400-1] Received response for a request that has timed out, sent [48686ms] ago, timed out [33810ms] ago, action [cluster:monitor/nodes/stats[n]], node [{elasticsearch-cdm-p1pxk400-2}{x1aOSBYpR0Cdjkt5yZlRJQ}{VHLG4j3wQQW2LaQ3aC1L9w}{10.129.6.22}{10.129.6.22:9300}], id [15973]
2020-11-12T21:46:37.65583289Z [2020-11-12T21:46:37,655][INFO ][o.e.c.s.MasterService    ] [elasticsearch-cdm-p1pxk400-1] zen-disco-node-failed({elasticsearch-cdm-p1pxk400-3}{umEqe7zTQIGWBqJ4I4EXHQ}{eLL6-3MMRW-ihqEBU-bd1A}{10.130.14.21}{10.130.14.21:9300}), reason(failed to ping, tried [3] times, each with maximum [30s] timeout)[{elasticsearch-cdm-p1pxk400-3}{umEqe7zTQIGWBqJ4I4EXHQ}{eLL6-3MMRW-ihqEBU-bd1A}{10.130.14.21}{10.130.14.21:9300} failed to ping, tried [3] times, each with maximum [30s] timeout], reason: removed {{elasticsearch-cdm-p1pxk400-3}{umEqe7zTQIGWBqJ4I4EXHQ}{eLL6-3MMRW-ihqEBU-bd1A}{10.130.14.21}{10.130.14.21:9300},}

2020-11-12T21:46:50.181415374Z [2020-11-12T21:46:50,181][WARN ][o.e.m.j.JvmGcMonitorService] [elasticsearch-cdm-p1pxk400-1] [gc][450] overhead, spent [893ms] collecting in the last [1.1s]
2020-11-12T21:46:51.181699253Z [2020-11-12T21:46:51,181][WARN ][o.e.m.j.JvmGcMonitorService] [elasticsearch-cdm-p1pxk400-1] [gc][451] overhead, spent [790ms] collecting in the last [1s]
2020-11-12T21:46:52.26313281Z [2020-11-12T21:46:52,263][WARN ][o.e.m.j.JvmGcMonitorService] [elasticsearch-cdm-p1pxk400-1] [gc][452] overhead, spent [893ms] collecting in the last [1s]
2020-11-12T21:46:53.281153794Z [2020-11-12T21:46:53,281][WARN ][o.e.m.j.JvmGcMonitorService] [elasticsearch-cdm-p1pxk400-1] [gc][453] overhead, spent [820ms] collecting in the last [1s]
2020-11-12T21:46:54.281433893Z [2020-11-12T21:46:54,281][WARN ][o.e.m.j.JvmGcMonitorService] [elasticsearch-cdm-p1pxk400-1] [gc][454] overhead, spent [786ms] collecting in the last [1s]
2020-11-12T21:46:55.365415789Z [2020-11-12T21:46:55,365][WARN ][o.e.m.j.JvmGcMonitorService] [elasticsearch-cdm-p1pxk400-1] [gc][455] overhead, spent [900ms] collecting in the last [1s]
2020-11-12T21:46:56.369313568Z [2020-11-12T21:46:56,369][WARN ][o.e.m.j.JvmGcMonitorService] [elasticsearch-cdm-p1pxk400-1] [gc][456] overhead, spent [807ms] collecting in the last [1s]
2020-11-12T21:46:57.48354951Z [2020-11-12T21:46:57,483][WARN ][o.e.m.j.JvmGcMonitorService] [elasticsearch-cdm-p1pxk400-1] [gc][457] overhead, spent [903ms] collecting in the last [1.1s]
2020-11-12T21:46:58.483863418Z [2020-11-12T21:46:58,483][WARN ][o.e.m.j.JvmGcMonitorService] [elasticsearch-cdm-p1pxk400-1] [gc][458] overhead, spent [798ms] collecting in the last [1s]
2020-11-12T21:46:59.55684556Z [2020-11-12T21:46:59,556][WARN ][o.e.m.j.JvmGcMonitorService] [elasticsearch-cdm-p1pxk400-1] [gc][459] overhead, spent [882ms] collecting in the last [1s]
2020-11-12T21:47:00.643944182Z [2020-11-12T21:47:00,643][WARN ][o.e.m.j.JvmGcMonitorService] [elasticsearch-cdm-p1pxk400-1] [gc][460] overhead, spent [883ms] collecting in the last [1s]
2020-11-12T21:47:01.757139253Z [2020-11-12T21:47:01,757][WARN ][o.e.m.j.JvmGcMonitorService] [elasticsearch-cdm-p1pxk400-1] [gc][461] overhead, spent [904ms] collecting in the last [1.1s]
2020-11-12T21:47:02.763469886Z [2020-11-12T21:47:02,763][WARN ][o.e.m.j.JvmGcMonitorService] [elasticsearch-cdm-p1pxk400-1] [gc][462] overhead, spent [824ms] collecting in the last [1s]
2020-11-12T21:47:03.874501784Z [2020-11-12T21:47:03,874][WARN ][o.e.m.j.JvmGcMonitorService] [elasticsearch-cdm-p1pxk400-1] [gc][463] overhead, spent [906ms] collecting in the last [1.1s]
2020-11-12T21:47:04.874772121Z [2020-11-12T21:47:04,874][WARN ][o.e.m.j.JvmGcMonitorService] [elasticsearch-cdm-p1pxk400-1] [gc][464] overhead, spent [789ms] collecting in the last [1s]
2020-11-12T21:47:05.961033847Z [2020-11-12T21:47:05,960][WARN ][o.e.m.j.JvmGcMonitorService] [elasticsearch-cdm-p1pxk400-1] [gc][465] overhead, spent [904ms] collecting in the last [1s]
2020-11-12T21:47:07.072977573Z [2020-11-12T21:47:07,072][WARN ][o.e.m.j.JvmGcMonitorService] [elasticsearch-cdm-p1pxk400-1] [gc][466] overhead, spent [902ms] collecting in the last [1.1s]

2020-11-12T21:47:07.658715241Z [2020-11-12T21:47:07,658][WARN ][o.e.d.z.ZenDiscovery     ] [elasticsearch-cdm-p1pxk400-1] zen-disco-failed-to-publish, current nodes: nodes:
2020-11-12T21:47:07.658715241Z    {elasticsearch-cdm-p1pxk400-2}{x1aOSBYpR0Cdjkt5yZlRJQ}{VHLG4j3wQQW2LaQ3aC1L9w}{10.129.6.22}{10.129.6.22:9300}
2020-11-12T21:47:07.658715241Z    {elasticsearch-cdm-p1pxk400-1}{1Xk64uTGQ8eO_u3-4qSaYA}{Cjw7sQ5rRK6eoiH1QKRQJQ}{10.131.10.25}{10.131.10.25:9300}, local, master
2020-11-12T21:47:07.658715241Z    {elasticsearch-cdm-p1pxk400-3}{umEqe7zTQIGWBqJ4I4EXHQ}{eLL6-3MMRW-ihqEBU-bd1A}{10.130.14.21}{10.130.14.21:9300}
2020-11-12T21:47:07.658715241Z

2020-11-12T21:47:07.659988632Z [2020-11-12T21:47:07,659][WARN ][o.e.c.s.MasterService    ] [elasticsearch-cdm-p1pxk400-1] failing [zen-disco-node-failed({elasticsearch-cdm-p1pxk400-3}{umEqe7zTQIGWBqJ4I4EXHQ}{eLL6-3MMRW-ihqEBU-bd1A}{10.130.14.21}{10.130.14.21:9300}), reason(failed to ping, tried [3] times, each with maximum [30s] timeout)[{elasticsearch-cdm-p1pxk400-3}{umEqe7zTQIGWBqJ4I4EXHQ}{eLL6-3MMRW-ihqEBU-bd1A}{10.130.14.21}{10.130.14.21:9300} failed to ping, tried [3] times, each with maximum [30s] timeout]]: failed to commit cluster state version [200]
2020-11-12T21:47:07.659988632Z org.elasticsearch.discovery.Discovery$FailedToCommitClusterStateException: timed out while waiting for enough masters to ack sent cluster state. [1] left

2020-11-12T21:47:14.467382007Z org.elasticsearch.transport.ReceiveTimeoutTransportException: [elasticsearch-cdm-p1pxk400-3][10.130.14.21:9300][internal:discovery/zen/unicast] request_id [16349] timed out after [3873ms]
2020-11-12T21:47:14.467382007Z  at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:1016) [elasticsearch-6.8.1.redhat-6.jar:6.8.1.redhat-6]
2020-11-12T21:47:14.467382007Z  at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:681) [elasticsearch-6.8.1.redhat-6.jar:6.8.1.redhat-6]
2020-11-12T21:47:14.467382007Z  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_265]
2020-11-12T21:47:14.467382007Z  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_265]
2020-11-12T21:47:14.467382007Z  at java.lang.Thread.run(Thread.java:748) [?:1.8.0_265]
2020-11-12T21:47:14.467382007Z [2020-11-12T21:47:14,467][WARN ][o.e.d.z.UnicastZenPing   ] [elasticsearch-cdm-p1pxk400-1] failed to send ping to [{elasticsearch-cdm-p1pxk400-2}{x1aOSBYpR0Cdjkt5yZlRJQ}{VHLG4j3wQQW2LaQ3aC1L9w}{10.129.6.22}{10.129.6.22:9300}]

ES 2 and 3 show the same issues.

Generally, though, ES is green and there is not a lot of CPU or heap pressure:

ip           heap.percent ram.percent cpu load_1m load_5m load_15m node.role master name
10.129.6.22            48          26   8    1.81    1.81     1.77 mdi       -      elasticsearch-cdm-p1pxk400-2
10.130.14.21           45          27  14    2.42    2.42     2.05 mdi       -      elasticsearch-cdm-p1pxk400-3
10.131.10.25           33          41  10    2.94    2.53     2.35 mdi       *      elasticsearch-cdm-p1pxk400-1

epoch      timestamp cluster       status node.total node.data shards pri relo init unassign pending_tasks max_task_wait_time active_shards_percent
1605219657 22:20:57  elasticsearch green           3         3    468 234    0    0        0             0                  -                100.0%
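
For reference, the node and cluster health output above can be reproduced with something like the following; a sketch that assumes the stock component=elasticsearch pod label and the es_util wrapper shipped in the elasticsearch container of this stack:

# es_util issues an authenticated query against the local Elasticsearch node.
ES_POD=$(oc -n openshift-logging get pods -l component=elasticsearch -o name | head -n1)
oc -n openshift-logging exec -c elasticsearch "$ES_POD" -- es_util --query="_cat/nodes?v"
oc -n openshift-logging exec -c elasticsearch "$ES_POD" -- es_util --query="_cat/health?v"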



As a result, fluentd is either non-functional or really slow:

2020-11-12T21:47:14.339053774Z 2020-11-12 21:47:14 +0000 [warn]: [elasticsearch] buffer flush took longer time than slow_flush_log_threshold: elapsed_time=64.17672873500851 slow_flush_log_threshold=20.0 plugin_id="elasticsearch"
2020-11-12T21:49:24.243061207Z 2020-11-12 21:49:24 +0000 [warn]: [elasticsearch] failed to flush the buffer. retry_time=0 next_retry_seconds=2020-11-12 21:49:25 +0000 chunk="5b3ef9eb60be09eb2efba6aecc08b3cd" error_class=Fluent::Plugin::ElasticsearchOutput::RecoverableRequestFailure error="could not push logs to Elasticsearch cluster ({:host=>\"elasticsearch.openshift-logging.svc.cluster.local\", :port=>9200, :scheme=>\"https\", :user=>\"fluentd\", :password=>\"obfuscated\"}): Connection timed out - connect(2) for 172.30.61.39:9200 (Errno::ETIMEDOUT)"

After the user's login attempt fails, things generally go back to normal.


Version-Release number of selected component (if applicable):

EO is 4.5.0-202010161522.p0
CLO is 4.5.0-202010161522.p0
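
For reference, these versions were presumably read from the installed CSVs; a sketch assuming the default install namespaces for the two operators:

# cluster-logging-operator (CLO) and elasticsearch-operator (EO) CSV versions.
oc get csv -n openshift-logging
oc get csv -n openshift-operators-redhat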

How reproducible:

100% with this specific user.


Steps to Reproduce:
1. A user who is a member of many groups (61 in this case) logs in to Kibana.
2. The login fails with a 504 Gateway Timeout and the Elasticsearch cluster becomes unstable.

Actual results:

User login fails and ES becomes unstable.


Expected results:

User login is successful and ES maintains stability.

Additional info:

We validated that http.max_header_size is set to 128 and also tested with 256 and 512 just to be sure.
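
For reference, one way to confirm the effective value on a running node; a sketch that reuses the es_util helper above and only shows the setting if it is explicitly set in elasticsearch.yml:

# Prints http.max_header_size from the node settings of each Elasticsearch node.
ES_POD=$(oc -n openshift-logging get pods -l component=elasticsearch -o name | head -n1)
oc -n openshift-logging exec -c elasticsearch "$ES_POD" -- \
  es_util --query="_nodes/settings?pretty" | grep -i max_header_size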

