Bug 2033461 - loki-promtail pod readiness probe failure during CI tests
Summary: loki-promtail pod readiness probe failure during CI tests
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Logging
Version: 4.10
Hardware: Unspecified
OS: Unspecified
unspecified
unspecified
Target Milestone: ---
: ---
Assignee: Jeff Cantrill
QA Contact: Anping Li
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2021-12-16 22:03 UTC by Dan Williams
Modified: 2021-12-17 06:17 UTC (History)
6 users (show)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-12-17 06:17:01 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)

Description Dan Williams 2021-12-16 22:03:27 UTC
https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_ovn-kubernetes/875/pull-ci-openshift-ovn-kubernetes-master-e2e-aws-ovn-local-gateway/1471574928024145920

[sig-arch] events should not repeat pathologically
1 events happened too frequently

event happened 21 times, something is wrong: ns/loki pod/loki-promtail-9fmk2 node/ip-10-0-144-145.us-west-2.compute.internal - reason/Unhealthy Readiness probe failed: Get "http://10.131.0.17:3101/ready": context deadline exceeded (Client.Timeout exceeded while awaiting headers)

Looking at the logs for loki-promtail-9fmk2 we see many instances of ratelimiting continuing through the test, like:

level=warn ts=2021-12-16T21:07:49.256075215Z caller=client.go:349 component=client host=logs-prod3.grafana.net msg="error sending batch, will retry" status=429 error="server returned HTTP status 429 Too Many Requests (429): {\"code\":429,\"status\":\"error\",\"message\":\"Ingestion rate limit exceeded for user 18459 (limit: 6291456 bytes/sec) while attempting to ingest '598' lines totaling '102332' bytes, reduce log volume or contact your Loki administrator to see if the limit can be increased\"}"

Dec 16 21:07:50.000 W ns/loki pod/loki-promtail-9fmk2 node/ip-10-0-144-145.us-west-2.compute.internal reason/Unhealthy Readiness probe failed: Get "http://10.131.0.17:3101/ready": context deadline exceeded (Client.Timeout exceeded while awaiting headers)

Is it possible the health checks are timing out because the promtail pod is too busy?

Dec 16 21:07:50.000 W ns/loki pod/loki-promtail-9fmk2 node/ip-10-0-144-145.us-west-2.compute.internal reason/Unhealthy Readiness probe failed: Get "http://10.131.0.17:3101/ready": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
Dec 16 21:12:40.000 W ns/loki pod/loki-promtail-9fmk2 node/ip-10-0-144-145.us-west-2.compute.internal reason/Unhealthy Readiness probe failed: Get "http://10.131.0.17:3101/ready": context deadline exceeded (Client.Timeout exceeded while awaiting headers) (2 times)
Dec 16 21:12:50.000 W ns/loki pod/loki-promtail-9fmk2 node/ip-10-0-144-145.us-west-2.compute.internal reason/Unhealthy Readiness probe failed: Get "http://10.131.0.17:3101/ready": context deadline exceeded (Client.Timeout exceeded while awaiting headers) (3 times)
Dec 16 21:13:00.000 W ns/loki pod/loki-promtail-9fmk2 node/ip-10-0-144-145.us-west-2.compute.internal reason/Unhealthy Readiness probe failed: Get "http://10.131.0.17:3101/ready": context deadline exceeded (Client.Timeout exceeded while awaiting headers) (4 times)
Dec 16 21:13:10.000 W ns/loki pod/loki-promtail-9fmk2 node/ip-10-0-144-145.us-west-2.compute.internal reason/Unhealthy Readiness probe failed: Get "http://10.131.0.17:3101/ready": context deadline exceeded (Client.Timeout exceeded while awaiting headers) (5 times)
Dec 16 21:13:20.000 W ns/loki pod/loki-promtail-9fmk2 node/ip-10-0-144-145.us-west-2.compute.internal reason/Unhealthy Readiness probe failed: Get "http://10.131.0.17:3101/ready": context deadline exceeded (Client.Timeout exceeded while awaiting headers) (6 times)
Dec 16 21:13:21.000 W ns/loki pod/loki-promtail-9fmk2 node/ip-10-0-144-145.us-west-2.compute.internal reason/Unhealthy Readiness probe failed: Get "http://10.131.0.17:3101/ready": context deadline exceeded (Client.Timeout exceeded while awaiting headers) (7 times)
Dec 16 21:13:30.000 W ns/loki pod/loki-promtail-9fmk2 node/ip-10-0-144-145.us-west-2.compute.internal reason/Unhealthy Readiness probe failed: Get "http://10.131.0.17:3101/ready": context deadline exceeded (Client.Timeout exceeded while awaiting headers) (8 times)
Dec 16 21:13:40.000 W ns/loki pod/loki-promtail-9fmk2 node/ip-10-0-144-145.us-west-2.compute.internal reason/Unhealthy Readiness probe failed: Get "http://10.131.0.17:3101/ready": context deadline exceeded (Client.Timeout exceeded while awaiting headers) (9 times)
Dec 16 21:14:40.000 W ns/loki pod/loki-promtail-9fmk2 node/ip-10-0-144-145.us-west-2.compute.internal reason/Unhealthy Readiness probe failed: Get "http://10.131.0.17:3101/ready": context deadline exceeded (Client.Timeout exceeded while awaiting headers) (10 times)
Dec 16 21:16:50.000 W ns/loki pod/loki-promtail-9fmk2 node/ip-10-0-144-145.us-west-2.compute.internal reason/Unhealthy Readiness probe failed: Get "http://10.131.0.17:3101/ready": context deadline exceeded (Client.Timeout exceeded while awaiting headers) (11 times)
Dec 16 21:19:00.000 W ns/loki pod/loki-promtail-9fmk2 node/ip-10-0-144-145.us-west-2.compute.internal reason/Unhealthy Readiness probe failed: Get "http://10.131.0.17:3101/ready": context deadline exceeded (Client.Timeout exceeded while awaiting headers) (12 times)
Dec 16 21:19:10.000 W ns/loki pod/loki-promtail-9fmk2 node/ip-10-0-144-145.us-west-2.compute.internal reason/Unhealthy Readiness probe failed: Get "http://10.131.0.17:3101/ready": context deadline exceeded (Client.Timeout exceeded while awaiting headers) (13 times)
Dec 16 21:19:20.000 W ns/loki pod/loki-promtail-9fmk2 node/ip-10-0-144-145.us-west-2.compute.internal reason/Unhealthy Readiness probe failed: Get "http://10.131.0.17:3101/ready": context deadline exceeded (Client.Timeout exceeded while awaiting headers) (14 times)
Dec 16 21:19:30.000 W ns/loki pod/loki-promtail-9fmk2 node/ip-10-0-144-145.us-west-2.compute.internal reason/Unhealthy Readiness probe failed: Get "http://10.131.0.17:3101/ready": context deadline exceeded (Client.Timeout exceeded while awaiting headers) (15 times)
Dec 16 21:19:40.000 W ns/loki pod/loki-promtail-9fmk2 node/ip-10-0-144-145.us-west-2.compute.internal reason/Unhealthy Readiness probe failed: Get "http://10.131.0.17:3101/ready": context deadline exceeded (Client.Timeout exceeded while awaiting headers) (16 times)
Dec 16 21:19:41.000 W ns/loki pod/loki-promtail-9fmk2 node/ip-10-0-144-145.us-west-2.compute.internal reason/Unhealthy Readiness probe failed: Get "http://10.131.0.17:3101/ready": context deadline exceeded (Client.Timeout exceeded while awaiting headers) (17 times)
Dec 16 21:19:50.000 W ns/loki pod/loki-promtail-9fmk2 node/ip-10-0-144-145.us-west-2.compute.internal reason/Unhealthy Readiness probe failed: Get "http://10.131.0.17:3101/ready": context deadline exceeded (Client.Timeout exceeded while awaiting headers) (18 times)
Dec 16 21:20:00.000 W ns/loki pod/loki-promtail-9fmk2 node/ip-10-0-144-145.us-west-2.compute.internal reason/Unhealthy Readiness probe failed: Get "http://10.131.0.17:3101/ready": context deadline exceeded (Client.Timeout exceeded while awaiting headers) (19 times)
Dec 16 21:20:10.000 W ns/loki pod/loki-promtail-9fmk2 node/ip-10-0-144-145.us-west-2.compute.internal reason/Unhealthy Readiness probe failed: Get "http://10.131.0.17:3101/ready": context deadline exceeded (Client.Timeout exceeded while awaiting headers) (20 times)
Dec 16 21:20:20.000 W ns/loki pod/loki-promtail-9fmk2 node/ip-10-0-144-145.us-west-2.compute.internal reason/Unhealthy Readiness probe failed: Get "http://10.131.0.17:3101/ready": context deadline exceeded (Client.Timeout exceeded while awaiting headers) (21 times)

There are 167 instances of:

level=warn ts=2021-12-16T20:57:14.991170953Z caller=client.go:349 component=client host=logs-prod3.grafana.net msg="error sending batch, will retry" status=429 error="server returned HTTP status 429 Too Many Requests (429): {\"code\":429,\"status\":\"error\",\"message\":\"Ingestion rate limit exceeded for user 18459 (limit: 6291456 bytes/sec) while attempting to ingest '722' lines totaling '102341' bytes, reduce log volume or contact your Loki administrator to see if the limit can be increased\"}"

https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_ovn-kubernetes/875/pull-ci-openshift-ovn-kubernetes-master-e2e-aws-ovn-local-gateway/1471574928024145920/artifacts/e2e-aws-ovn-local-gateway/gather-extra/artifacts/pods/loki_loki-promtail-9fmk2_promtail.log


Note You need to log in before you can comment on or make changes to this bug.