Created attachment 1696210 [details]
Pod logs

Description of problem:
After running a logging workload, I am no longer able to curl elasticsearch endpoints through the route I created.

Version-Release number of selected component (if applicable):
4.5

How reproducible:
100%

Steps to Reproduce:
1. Deploy logging
2. Create elasticsearch route (https://docs.openshift.com/container-platform/4.4/logging/config/cluster-logging-elasticsearch.html#cluster-logging-elasticsearch-exposing_cluster-logging-elasticsearch); a rough sketch of this step is included at the end of this description
3. Verify the route works
4. Run a logging workload with 10M+ messages
5. The route returns too_long_frame_exception from then on

Actual results:

before workload:
$ curl -k https://elasticsearch-openshift-logging.apps.qe-em-45.qe.devcluster.openshift.com/_cat/indices?v -H "Authorization: Bearer SOME_BEARER"
health status index uuid pri rep docs.count docs.deleted store.size pri.store.size
green open .security CsSfhyuhQx2N18YR8hDvCA 1 1 5 0 59.9kb 29.9kb
green open infra-000001 wdg7mOqYTQCeuy2f5jwLDg 3 1 240558 0 277.5mb 139.3mb
green open infra-000003 fSl2zkbBRqe1rW9OB73Qmw 3 1 368192 0 471.7mb 236.5mb
green open .kibana-6 1HXPGkCbS1yjOAl3eo69eg 1 1 0 0 522b 261b
green open .kibana_1 1MDlb9wJSFGpZEZ8wxa5ww 1 1 0 0 522b 261b
green open app-000001 nx-BmgdWTe-DGC_PM4j44Q 3 1 0 0 1.5kb 783b
green open infra-000004 FUzLcNmGSyiu9Ceb0eLQlA 3 1 212775 0 276.8mb 138.4mb
green open audit-000001 ogQ9dNlXRd2f3VffYk4DOQ 3 1 0 0 1.5kb 783b
green open infra-000002 YvSqlqRXSnWuBaZvgMl5Tw 3 1 454997 0 563.6mb 281.8mb

after workload:
$ curl -k https://elasticsearch-openshift-logging.apps.qe-em-45.qe.devcluster.openshift.com/_cat/indices?v -H "Authorization: Bearer SOME_BEARER" -s | python -m json.tool
{
    "error": {
        "root_cause": [
            {
                "type": "too_long_frame_exception",
                "reason": "HTTP header is larger than 8192 bytes."
            }
        ],
        "type": "too_long_frame_exception",
        "reason": "HTTP header is larger than 8192 bytes."
    },
    "status": 400
}

Expected results:

Using oc exec to see what it should have returned:
$ oc exec -n openshift-logging elasticsearch-cdm-i0d4932s-1-59c8b4c977-ctlmn -c elasticsearch -- \
> curl --connect-timeout 30 -s -k --cert /etc/elasticsearch/secret/admin-cert --key /etc/elasticsearch/secret/admin-key \
> https://localhost:9200/_cat/indices?v
health status index uuid pri rep docs.count docs.deleted store.size pri.store.size
green open infra-000010 dvTryhUYQx2zsUf5YjlXTw 3 1 392851 0 505.6mb 253.5mb
green open infra-000012 SEPrWK6pRs2RVBxgQgQUpQ 3 1 395405 0 509.2mb 253.9mb
green open .kibana-6 1HXPGkCbS1yjOAl3eo69eg 1 1 0 0 522b 261b
green open infra-000007 myIGSex6TWGJhRYxChu8vA 3 1 393098 0 500.6mb 250.5mb
green open .kibana_1 1MDlb9wJSFGpZEZ8wxa5ww 1 1 0 0 522b 261b
green open infra-000006 5DclncVQQDi4aRQsf8oyVA 3 1 390823 0 494.6mb 247.4mb
green open infra-000009 DkQfFXhUQfiiyxAmMIXTqQ 3 1 391861 0 505.2mb 252.6mb
green open infra-000014 OEKuNaDgSM-xxs6DSbBYwQ 3 1 324042 0 426.5mb 207.7mb
green open infra-000002 YvSqlqRXSnWuBaZvgMl5Tw 3 1 454997 0 563.6mb 281.8mb
green open app-000004 7gC999UyTE2VbZRL_1NJ7A 3 1 8680018 0 13.8gb 6.9gb
green open .security CsSfhyuhQx2N18YR8hDvCA 1 1 5 0 59.9kb 29.9kb
green open infra-000013 yXjFB8MVQHO6dT3fTo7_Rw 3 1 391241 0 503.6mb 250.9mb
green open infra-000011 EsrGtrOoRGaIg5_EMnA-wg 3 1 387519 0 499.3mb 249mb
green open infra-000003 fSl2zkbBRqe1rW9OB73Qmw 3 1 368192 0 471.7mb 236.5mb
green open infra-000001 wdg7mOqYTQCeuy2f5jwLDg 3 1 240558 0 277.5mb 139.3mb
green open app-000001 nx-BmgdWTe-DGC_PM4j44Q 3 1 2560136 0 4.2gb 2.1gb
green open infra-000004 FUzLcNmGSyiu9Ceb0eLQlA 3 1 395497 0 510.1mb 255mb
green open app-000005 3CatH-AHQsi9wsDWnyRIsA 3 1 6583800 0 10.4gb 5.2gb
green open audit-000001 ogQ9dNlXRd2f3VffYk4DOQ 3 1 0 0 1.5kb 783b
green open app-000002 6n0ABU5bR8W6WgJXG3aI6w 3 1 8860564 0 14.4gb 7.2gb
green open app-000006 ixEHho2eSNK_SZx3cmdjUg 3 1 0 0 1.5kb 783b
green open app-000003 u0NMdcNCSIKCsvrvGJ5-Lg 3 1 8740882 0 14.7gb 7.3gb
green open infra-000005 gqbyA1QmRDaN1klOHvqQdg 3 1 392899 0 499.5mb 249.8mb
green open infra-000008 NcAvleDhQziXNIJJG4rITQ 3 1 395415 0 502.9mb 251.5mb

Additional info:
I have not observed this with logging workloads that had fewer total messages indexed. The example above is after running a workload with 36M total messages.
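For completeness, step 2 above was done roughly like the following. This is paraphrased from memory of the linked doc, so the exact commands and YAML may differ slightly, and the file name es-route.yaml is just a placeholder:

$ oc extract secret/elasticsearch --to=. --keys=admin-ca -n openshift-logging   # CA used to re-encrypt traffic to ES
$ cat <<'EOF' > es-route.yaml
apiVersion: route.openshift.io/v1
kind: Route
metadata:
  name: elasticsearch
  namespace: openshift-logging
spec:
  to:
    kind: Service
    name: elasticsearch
  tls:
    termination: reencrypt
    destinationCACertificate: |
EOF
$ cat ./admin-ca | sed -e "s/^/      /" >> es-route.yaml   # indent the CA under destinationCACertificate
$ oc create -f es-route.yaml -n openshift-logging
$ token=$(oc whoami -t)   # this is the bearer token passed as SOME_BEARER in the curls above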
We can see hitting the service IP from inside the cluster has the same result:

$ oc get svc
NAME                       TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)     AGE
cluster-logging-operator   ClusterIP   172.30.110.230   <none>        8383/TCP    3h35m
elasticsearch              ClusterIP   172.30.72.50     <none>        9200/TCP    3h35m
elasticsearch-cluster      ClusterIP   172.30.67.59     <none>        9300/TCP    3h35m
elasticsearch-metrics      ClusterIP   172.30.150.72    <none>        60001/TCP   3h35m
fluentd                    ClusterIP   172.30.77.147    <none>        24231/TCP   3h35m
kibana                     ClusterIP   172.30.108.203   <none>        443/TCP     3h35m
openshift-olm-test-clo     ClusterIP   172.30.196.98    <none>        50051/TCP   3h35m

$ oc exec elasticsearch-cdm-i0d4932s-1-59c8b4c977-ctlmn -c elasticsearch -- curl -k https://172.30.72.50:9200/_cat/indices?v -H "Authorization: Bearer SOME_BEARER" -s
{"error":{"root_cause":[{"type":"too_long_frame_exception","reason":"HTTP header is larger than 8192 bytes."}],"type":"too_long_frame_exception","reason":"HTTP header is larger than 8192 bytes."},"status":400}
(In reply to Eric Matysek from comment #1)
> We can see hitting the service IP from inside the cluster has the same result:
>
> $ oc get svc
> NAME                       TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)     AGE
> cluster-logging-operator   ClusterIP   172.30.110.230   <none>        8383/TCP    3h35m
> elasticsearch              ClusterIP   172.30.72.50     <none>        9200/TCP    3h35m
> elasticsearch-cluster      ClusterIP   172.30.67.59     <none>        9300/TCP    3h35m
> elasticsearch-metrics      ClusterIP   172.30.150.72    <none>        60001/TCP   3h35m
> fluentd                    ClusterIP   172.30.77.147    <none>        24231/TCP   3h35m
> kibana                     ClusterIP   172.30.108.203   <none>        443/TCP     3h35m
> openshift-olm-test-clo     ClusterIP   172.30.196.98    <none>        50051/TCP   3h35m
>
> $ oc exec elasticsearch-cdm-i0d4932s-1-59c8b4c977-ctlmn -c elasticsearch -- curl -k https://172.30.72.50:9200/_cat/indices?v -H "Authorization: Bearer SOME_BEARER" -s
> {"error":{"root_cause":[{"type":"too_long_frame_exception","reason":"HTTP header is larger than 8192 bytes."}],"type":"too_long_frame_exception","reason":"HTTP header is larger than 8192 bytes."},"status":400}

Can you just curl the pod directly instead of going through the service endpoint?

`oc exec -c elasticsearch $pod -- indices`
or
`oc exec -c elasticsearch $pod -- es_util --query=_cat/indices?v`
Yes, I can use internal curls for the most part, but I have automation written for debugging and QoL purposes that relies on the service endpoint. Not to mention that customers could see the same thing when trying to administer elasticsearch on their end.
I will say today I was able to run a scale workload without hitting the exception afterwards, so reproducibility isn't 100%.

Interestingly, the other day I hit this error, un-deployed logging, then re-deployed it and immediately started hitting this error again without running a workload, so I wonder if there is some underlying issue.
(In reply to Eric Matysek from comment #4)
> I will say today I was able to run a scale workload without hitting the
> exception afterwards, so reproducibility isn't 100%.
>
> Interestingly, the other day I hit this error, un-deployed logging, then
> re-deployed it and immediately started hitting this error again without
> running a workload, so I wonder if there is some underlying issue.

> bytes."}],"type":"too_long_frame_exception","reason":"HTTP header is larger
> than 8192 bytes."},"status":400}

400 is a client error, which means it's something on the caller side. Consider adding a verbose flag or something else to dump the headers.

Is there maybe an issue with the size of the token you are sending?

Have you considered jumping on the pod to run the curl to remove 'oc exec' from the equation?
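For instance, something along these lines should show exactly which request headers curl ends up sending (reusing the placeholder token from the earlier comments):

$ curl -kv -o /dev/null \
    https://elasticsearch-openshift-logging.apps.qe-em-45.qe.devcluster.openshift.com/_cat/indices?v \
    -H "Authorization: Bearer SOME_BEARER" 2>&1 | grep '^> '   # lines starting with '>' are the outgoing request headers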
(In reply to Jeff Cantrill from comment #5)
> 400 is a client error, which means it's something on the caller side.
> Consider adding a verbose flag or something else to dump the headers.

You can see I am only passing the Authorization header in my curl command. I am using the exact same curl command before and after the workload; it works before but not after.

> Is there maybe an issue with the size of the token you are sending?

If that were the case, I believe my curl before the workload would fail.

> Have you considered jumping on the pod to run the curl to remove 'oc exec'
> from the equation?

The call works with an oc exec curl hitting localhost:9200 (you can see this in the "Expected results" section above), so I think that eliminates oc exec from the equation, but I can give rsh a shot.

This seems to suggest to me that something in between my curl command and the elasticsearch container is altering (adding to) my headers. I am also wondering why I see no exception in the elasticsearch pod logs.
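For reference, the rsh attempt would look roughly like this (same pod and service IP as in comment #1, with SOME_BEARER again standing in for the real token):

$ oc rsh -n openshift-logging -c elasticsearch elasticsearch-cdm-i0d4932s-1-59c8b4c977-ctlmn
# then, from the shell inside the pod:
$ curl -k -s https://172.30.72.50:9200/_cat/indices?v -H "Authorization: Bearer SOME_BEARER"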
Putting on Upcoming Sprint as investigation is ongoing.
(In reply to Eric Matysek from comment #6)
> This seems to suggest to me that something in between my curl command and
> the elasticsearch container is altering (adding to) my headers.
> I am also wondering why I see no exception in the elasticsearch pod logs.

4.5 introduced the elasticsearch-proxy, which fronts ES and sits between the service and the pod. Maybe there is something in how we are manipulating the request in the proxy that exhibits the behavior you see.

Pushing the target release as this is not a blocker for 4.6.
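One quick way to confirm the proxy sidecar is in the path and to peek at its logs might be something like the following (the 'proxy' container name here is my guess and may differ in the actual pod spec):

$ oc get pod -n openshift-logging elasticsearch-cdm-i0d4932s-1-59c8b4c977-ctlmn \
    -o jsonpath='{.spec.containers[*].name}'   # should list the elasticsearch container plus the proxy sidecar
$ oc logs -n openshift-logging elasticsearch-cdm-i0d4932s-1-59c8b4c977-ctlmn -c proxy | tail -n 50   # assumed container name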
Moving to UpcomingSprint for future evaluation
Moving to UpcomingRelease
Moving to 4.7 to satisfy CF requirements for 4.6
Marking UpcomingSprint as will not be merged or addressed by EOD
Setting UpcomingSprint as unable to resolve before EOD
Verified on:

Image: registry.ci.openshift.org/ocp/4.7:logging-elasticsearch6
Image ID: registry.ci.openshift.org/ocp/4.7@sha256:25fca4c45ca11e7dc3fd25f35eb5e5f7225ee87942f66b77e8a1ce2b8672dbc3

Image: registry.ci.openshift.org/ocp/4.7:elasticsearch-proxy
Image ID: registry.ci.openshift.org/ocp/4.7@sha256:0272be3313129cf243ab7ec884e91f3168c8ad0e01b357076ac75ecce52ba024

With 30M messages, curl to the service IP worked correctly.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Errata Advisory for Openshift Logging 5.0.0), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:0652