Bug 1845293 - Elasticsearch returning too_long_frame_exception when calling endpoints through external route after logging workload
Summary: Elasticsearch returning too_long_frame_exception when calling endpoints through external route after logging workload
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Logging
Version: 4.5
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: low
Target Milestone: ---
Target Release: 4.7.0
Assignee: ewolinet
QA Contact: Mike Fiedler
Docs Contact: Rolfe Dlugy-Hegwer
URL:
Whiteboard: logging-exploration
Depends On:
Blocks:
 
Reported: 2020-06-08 20:31 UTC by Eric Matysek
Modified: 2021-02-24 11:21 UTC
CC: 8 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
* Previously, Elasticsearch rejected HTTP requests whose headers exceeded the default maximum header size of 8 KB (8192 bytes). The current release fixes this issue by increasing the maximum header size to 128 KB. As a result, Elasticsearch no longer rejects HTTP requests for exceeding the maximum header size. (link:https://bugzilla.redhat.com/show_bug.cgi?id=1845293[*BZ#1845293*])
Clone Of:
Environment:
Last Closed: 2021-02-24 11:21:18 UTC
Target Upstream Version:
Embargoed:


Attachments
Pod logs (277.30 KB, application/gzip), 2020-06-08 20:31 UTC, Eric Matysek


Links
- GitHub: openshift/elasticsearch-operator pull 502 (closed): Bug 1878305: Bump es max_header_size to address errors seen in Kibana (last updated 2021-02-06 11:03:05 UTC)
- Red Hat Product Errata: RHBA-2021:0652 (last updated 2021-02-24 11:21:52 UTC)

Description Eric Matysek 2020-06-08 20:31:42 UTC
Created attachment 1696210 [details]
Pod logs

Description of problem:
After running a logging workload, I am no longer able to curl Elasticsearch endpoints through the route I created.


Version-Release number of selected component (if applicable):
4.5

How reproducible:
100%

Steps to Reproduce:
1. Deploy logging
2. Create elasticsearch route (https://docs.openshift.com/container-platform/4.4/logging/config/cluster-logging-elasticsearch.html#cluster-logging-elasticsearch-exposing_cluster-logging-elasticsearch)
3. Verify the route works (see the verification sketch after this list)
4. Run logging workload with 10M+ messages
5. Route returns too_long_frame_exception from then on
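
For reference, a minimal verification sketch for step 3, assuming the route is named elasticsearch in the openshift-logging namespace and that the token returned by `oc whoami -t` is authorized to query the log store (both assumptions, not taken from this report):

$ ROUTE=$(oc get route elasticsearch -n openshift-logging -o jsonpath='{.spec.host}')
$ TOKEN=$(oc whoami -t)   # assumption: this token has access to the log store
$ curl -sk -H "Authorization: Bearer ${TOKEN}" "https://${ROUTE}/_cat/indices?v"
# A table of indices (as under "Actual results" below) means the route works.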

Actual results:
before workload:
$ curl -k https://elasticsearch-openshift-logging.apps.qe-em-45.qe.devcluster.openshift.com/_cat/indices?v -H "Authorization: Bearer SOME_BEARER"
health status index        uuid                   pri rep docs.count docs.deleted store.size pri.store.size
green  open   .security    CsSfhyuhQx2N18YR8hDvCA   1   1          5            0     59.9kb         29.9kb
green  open   infra-000001 wdg7mOqYTQCeuy2f5jwLDg   3   1     240558            0    277.5mb        139.3mb
green  open   infra-000003 fSl2zkbBRqe1rW9OB73Qmw   3   1     368192            0    471.7mb        236.5mb
green  open   .kibana-6    1HXPGkCbS1yjOAl3eo69eg   1   1          0            0       522b           261b
green  open   .kibana_1    1MDlb9wJSFGpZEZ8wxa5ww   1   1          0            0       522b           261b
green  open   app-000001   nx-BmgdWTe-DGC_PM4j44Q   3   1          0            0      1.5kb           783b
green  open   infra-000004 FUzLcNmGSyiu9Ceb0eLQlA   3   1     212775            0    276.8mb        138.4mb
green  open   audit-000001 ogQ9dNlXRd2f3VffYk4DOQ   3   1          0            0      1.5kb           783b
green  open   infra-000002 YvSqlqRXSnWuBaZvgMl5Tw   3   1     454997            0    563.6mb        281.8mb

after workload:
$ curl -k https://elasticsearch-openshift-logging.apps.qe-em-45.qe.devcluster.openshift.com/_cat/indices?v -H "Authorization: Bearer SOME_BEARER" -s | python -m json.tool
{
    "error": {
        "root_cause": [
            {
                "type": "too_long_frame_exception",
                "reason": "HTTP header is larger than 8192 bytes."
            }
        ],
        "type": "too_long_frame_exception",
        "reason": "HTTP header is larger than 8192 bytes."
    },
    "status": 400
}

Expected results:
Using oc exec to see what it should have returned:
$ oc exec -n openshift-logging elasticsearch-cdm-i0d4932s-1-59c8b4c977-ctlmn -c elasticsearch -- \
> curl --connect-timeout 30 -s -k --cert /etc/elasticsearch/secret/admin-cert --key /etc/elasticsearch/secret/admin-key \
> https://localhost:9200/_cat/indices?v
health status index        uuid                   pri rep docs.count docs.deleted store.size pri.store.size
green  open   infra-000010 dvTryhUYQx2zsUf5YjlXTw   3   1     392851            0    505.6mb        253.5mb
green  open   infra-000012 SEPrWK6pRs2RVBxgQgQUpQ   3   1     395405            0    509.2mb        253.9mb
green  open   .kibana-6    1HXPGkCbS1yjOAl3eo69eg   1   1          0            0       522b           261b
green  open   infra-000007 myIGSex6TWGJhRYxChu8vA   3   1     393098            0    500.6mb        250.5mb
green  open   .kibana_1    1MDlb9wJSFGpZEZ8wxa5ww   1   1          0            0       522b           261b
green  open   infra-000006 5DclncVQQDi4aRQsf8oyVA   3   1     390823            0    494.6mb        247.4mb
green  open   infra-000009 DkQfFXhUQfiiyxAmMIXTqQ   3   1     391861            0    505.2mb        252.6mb
green  open   infra-000014 OEKuNaDgSM-xxs6DSbBYwQ   3   1     324042            0    426.5mb        207.7mb
green  open   infra-000002 YvSqlqRXSnWuBaZvgMl5Tw   3   1     454997            0    563.6mb        281.8mb
green  open   app-000004   7gC999UyTE2VbZRL_1NJ7A   3   1    8680018            0     13.8gb          6.9gb
green  open   .security    CsSfhyuhQx2N18YR8hDvCA   1   1          5            0     59.9kb         29.9kb
green  open   infra-000013 yXjFB8MVQHO6dT3fTo7_Rw   3   1     391241            0    503.6mb        250.9mb
green  open   infra-000011 EsrGtrOoRGaIg5_EMnA-wg   3   1     387519            0    499.3mb          249mb
green  open   infra-000003 fSl2zkbBRqe1rW9OB73Qmw   3   1     368192            0    471.7mb        236.5mb
green  open   infra-000001 wdg7mOqYTQCeuy2f5jwLDg   3   1     240558            0    277.5mb        139.3mb
green  open   app-000001   nx-BmgdWTe-DGC_PM4j44Q   3   1    2560136            0      4.2gb          2.1gb
green  open   infra-000004 FUzLcNmGSyiu9Ceb0eLQlA   3   1     395497            0    510.1mb          255mb
green  open   app-000005   3CatH-AHQsi9wsDWnyRIsA   3   1    6583800            0     10.4gb          5.2gb
green  open   audit-000001 ogQ9dNlXRd2f3VffYk4DOQ   3   1          0            0      1.5kb           783b
green  open   app-000002   6n0ABU5bR8W6WgJXG3aI6w   3   1    8860564            0     14.4gb          7.2gb
green  open   app-000006   ixEHho2eSNK_SZx3cmdjUg   3   1          0            0      1.5kb           783b
green  open   app-000003   u0NMdcNCSIKCsvrvGJ5-Lg   3   1    8740882            0     14.7gb          7.3gb
green  open   infra-000005 gqbyA1QmRDaN1klOHvqQdg   3   1     392899            0    499.5mb        249.8mb
green  open   infra-000008 NcAvleDhQziXNIJJG4rITQ   3   1     395415            0    502.9mb        251.5mb

Additional info:
I have not observed this with logging workloads that indexed fewer total messages. The example above is after running a workload with 36M total messages.

Comment 1 Eric Matysek 2020-06-08 20:35:55 UTC
We can see that hitting the service IP from inside the cluster has the same result:
$ oc get svc
NAME                       TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)     AGE
cluster-logging-operator   ClusterIP   172.30.110.230   <none>        8383/TCP    3h35m
elasticsearch              ClusterIP   172.30.72.50     <none>        9200/TCP    3h35m
elasticsearch-cluster      ClusterIP   172.30.67.59     <none>        9300/TCP    3h35m
elasticsearch-metrics      ClusterIP   172.30.150.72    <none>        60001/TCP   3h35m
fluentd                    ClusterIP   172.30.77.147    <none>        24231/TCP   3h35m
kibana                     ClusterIP   172.30.108.203   <none>        443/TCP     3h35m
openshift-olm-test-clo     ClusterIP   172.30.196.98    <none>        50051/TCP   3h35m


$ oc exec elasticsearch-cdm-i0d4932s-1-59c8b4c977-ctlmn -c elasticsearch -- curl -k https://172.30.72.50:9200/_cat/indices?v -H "Authorization: Bearer SOME_BEARER" -s
{"error":{"root_cause":[{"type":"too_long_frame_exception","reason":"HTTP header is larger than 8192 bytes."}],"type":"too_long_frame_exception","reason":"HTTP header is larger than 8192 bytes."},"status":400}

Comment 2 Jeff Cantrill 2020-06-08 21:33:26 UTC
(In reply to Eric Matysek from comment #1)
> We can see hitting the service IP from inside the cluster has the same result:
> ...
> $ oc exec elasticsearch-cdm-i0d4932s-1-59c8b4c977-ctlmn -c elasticsearch -- curl -k https://172.30.72.50:9200/_cat/indices?v -H "Authorization: Bearer SOME_BEARER" -s
> {"error":{"root_cause":[{"type":"too_long_frame_exception","reason":"HTTP header is larger than 8192 bytes."}],"type":"too_long_frame_exception","reason":"HTTP header is larger than 8192 bytes."},"status":400}



Can you just curl the pod directly instead of going through the service endpoint?  `oc exec -c elasticsearch $pod -- indices` or `oc exec -c elasticsearch $pod -- es_util --query=_cat/indices?v`

Comment 3 Eric Matysek 2020-06-09 14:23:24 UTC
Yes, I can use internal curls for the most part, but I have automation written for debugging and QoL purposes that relies on the service endpoint.
Not to mention that customers could hit the same thing when trying to administer Elasticsearch on their end.

Comment 4 Eric Matysek 2020-06-09 19:30:36 UTC
I will say today I was able to run a scale workload without hitting the exception afterwards, so reproducibility isn't 100%.

Interestingly, the other day I hit this error, un-deployed logging, then re-deployed it and immediately started hitting this error again without running a workload so I wonder if there is some underlying issue.

Comment 5 Jeff Cantrill 2020-06-10 13:33:21 UTC
(In reply to Eric Matysek from comment #4)
> I will say today I was able to run a scale workload without hitting the
> exception afterwards, so reproducibility isn't 100%.
> 
> Interestingly, the other day I hit this error, un-deployed logging, then
> re-deployed it and immediately started hitting this error again without
> running a workload so I wonder if there is some underlying issue.

> {"error":{"root_cause":[{"type":"too_long_frame_exception","reason":"HTTP header is larger than 8192 bytes."}],"type":"too_long_frame_exception","reason":"HTTP header is larger than 8192 bytes."},"status":400}

400 is a client error, which means it's something on the caller side. Consider adding a verbose flag or something else to dump the headers.

Is there maybe an issue with the size of the token you are sending?
Have you considered jumping on the pod to run the curl to remove 'oc exec' from the equation?
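
One way to act on the header-dumping suggestion, sketched with the same curl call from the description (SOME_BEARER again stands in for the real token):

# Print the request headers curl actually sends (the "> " lines of verbose output)
# and roughly total their size to compare against the 8192-byte limit.
$ curl -skv -o /dev/null \
    -H "Authorization: Bearer SOME_BEARER" \
    "https://elasticsearch-openshift-logging.apps.qe-em-45.qe.devcluster.openshift.com/_cat/indices?v" \
    2>&1 | grep '^> ' | wc -c
# Drop the final "| grep '^> ' | wc -c" to inspect the individual headers instead.

Note this only measures what curl sends; headers added by anything between the client and Elasticsearch (router, proxy) would not show up here.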

Comment 6 Eric Matysek 2020-06-10 18:51:20 UTC
(In reply to Jeff Cantrill from comment #5)
> 
> 400 is a client error which means it's something on the caller side. 
> Consider adding a verbose flag or something else to dump the headers.  
You can see I am only passing the Authorization header in my curl command. I am using the exact same curl command before and after the workload; it works before but not after.

> 
> Is there maybe an issue with the size of the token you are sending?
If that were the case, I believe my curl before the workload would also fail.

> Have you considered jumping on the pod to run the curl to remove 'oc exec'
> from the equation?

The call works with oc exec curl hitting localhost:9200 (you can see this in my "expected results" section), so I think that eliminates oc exec from the equation, but I can give rsh a shot.


This seems to suggest to me that something between my curl command and the elasticsearch container is altering (adding to) my headers.
I am also wondering why I see no exception in the elasticsearch pod logs.
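
For completeness, a sketch of the rsh variant mentioned above, reusing the pod name and service IP from comment 1:

$ oc rsh -n openshift-logging -c elasticsearch elasticsearch-cdm-i0d4932s-1-59c8b4c977-ctlmn
# then, from the shell inside the pod, hit the service IP directly:
$ curl -sk -H "Authorization: Bearer SOME_BEARER" "https://172.30.72.50:9200/_cat/indices?v"

If this still returns too_long_frame_exception, oc exec/rsh is not a factor, which would suggest the extra header bytes are added between the service and Elasticsearch.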

Comment 7 Periklis Tsirakidis 2020-07-06 06:56:01 UTC
Putting on Upcoming Sprint as ongoing investigation.

Comment 9 Jeff Cantrill 2020-08-19 14:49:05 UTC
(In reply to Eric Matysek from comment #6)
> 
> This seems to suggest to me that something in between my curl command and
> the elasticsearch container is altering (adding to) my headers.
> I am also wondering why I see no exception in the elasticsearch pod logs

4.5 introduced the elasticsearch-proxy, which fronts ES and sits between the service and the pod. Maybe something in how we manipulate the request in the proxy causes the behavior you see.
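
A quick way to see that proxy in the request path is to list the containers in the Elasticsearch pods; a sketch, assuming the pods carry a component=elasticsearch label (the label and container names are assumptions, not taken from this report):

$ oc get pods -n openshift-logging -l component=elasticsearch \
    -o jsonpath='{range .items[*]}{.metadata.name}{": "}{.spec.containers[*].name}{"\n"}{end}'
# In 4.5+ a proxy container should appear alongside the elasticsearch container;
# service and route traffic terminates at that proxy before reaching ES on localhost.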

Pushing the target release, as this is not a blocker for 4.6.

Comment 10 Jeff Cantrill 2020-08-21 14:11:01 UTC
Moving to UpcomingSprint for future evaluation

Comment 11 Jeff Cantrill 2020-09-11 20:18:37 UTC
Moving to UpcomingRelease

Comment 12 Jeff Cantrill 2020-10-01 12:42:11 UTC
Moving to 4.7 to satisfy CF requirements for 4.6

Comment 13 Jeff Cantrill 2020-10-02 15:23:52 UTC
Marking UpcomingSprint as will not be merged or addressed by EOD

Comment 14 Jeff Cantrill 2020-10-23 15:19:54 UTC
Setting UpcomingSprint as unable to resolve before EOD

Comment 18 Mike Fiedler 2021-02-05 00:01:28 UTC
Verified on     Image:          registry.ci.openshift.org/ocp/4.7:logging-elasticsearch6
    Image ID:       registry.ci.openshift.org/ocp/4.7@sha256:25fca4c45ca11e7dc3fd25f35eb5e5f7225ee87942f66b77e8a1ce2b8672dbc3
    Image:         registry.ci.openshift.org/ocp/4.7:elasticsearch-proxy
    Image ID:      registry.ci.openshift.org/ocp/4.7@sha256:0272be3313129cf243ab7ec884e91f3168c8ad0e01b357076ac75ecce52ba024


With 30M messages, curl to the service IP worked correctly.
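
For anyone re-verifying, a hedged sketch to confirm the effective setting on a fixed cluster. The _nodes/settings API and the filter_path parameter are standard Elasticsearch; whether the operator sets http.max_header_size explicitly (as the linked PR title suggests, to 128kb) is an assumption, and <es-pod> is a placeholder for any Elasticsearch pod name:

$ TOKEN=$(oc whoami -t)
$ SVC_IP=$(oc get svc elasticsearch -n openshift-logging -o jsonpath='{.spec.clusterIP}')
$ oc exec -n openshift-logging -c elasticsearch <es-pod> -- \
    curl -sk -H "Authorization: Bearer ${TOKEN}" \
    "https://${SVC_IP}:9200/_nodes/settings?filter_path=nodes.*.settings.http"
# http.max_header_size appears in the output only if it has been set explicitly
# in the node settings; otherwise the Elasticsearch default (8kb) applies.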

Comment 22 errata-xmlrpc 2021-02-24 11:21:18 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Errata Advisory for Openshift Logging 5.0.0), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:0652

