Bug 2033461

Summary: loki-promtail pod readiness probe failure during CI tests
Product: OpenShift Container Platform Reporter: Dan Williams <dcbw>
Component: LoggingAssignee: Jeff Cantrill <jcantril>
Status: CLOSED NOTABUG QA Contact: Anping Li <anli>
Severity: unspecified Docs Contact:
Priority: unspecified    
Version: 4.10CC: amuller, anpicker, aos-bugs, arajkuma, erooth, periklis
Target Milestone: ---   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2021-12-17 06:17:01 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Dan Williams 2021-12-16 22:03:27 UTC
https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_ovn-kubernetes/875/pull-ci-openshift-ovn-kubernetes-master-e2e-aws-ovn-local-gateway/1471574928024145920

[sig-arch] events should not repeat pathologically
1 events happened too frequently

event happened 21 times, something is wrong: ns/loki pod/loki-promtail-9fmk2 node/ip-10-0-144-145.us-west-2.compute.internal - reason/Unhealthy Readiness probe failed: Get "http://10.131.0.17:3101/ready": context deadline exceeded (Client.Timeout exceeded while awaiting headers)

Looking at the logs for loki-promtail-9fmk2 we see many instances of ratelimiting continuing through the test, like:

level=warn ts=2021-12-16T21:07:49.256075215Z caller=client.go:349 component=client host=logs-prod3.grafana.net msg="error sending batch, will retry" status=429 error="server returned HTTP status 429 Too Many Requests (429): {\"code\":429,\"status\":\"error\",\"message\":\"Ingestion rate limit exceeded for user 18459 (limit: 6291456 bytes/sec) while attempting to ingest '598' lines totaling '102332' bytes, reduce log volume or contact your Loki administrator to see if the limit can be increased\"}"

Dec 16 21:07:50.000 W ns/loki pod/loki-promtail-9fmk2 node/ip-10-0-144-145.us-west-2.compute.internal reason/Unhealthy Readiness probe failed: Get "http://10.131.0.17:3101/ready": context deadline exceeded (Client.Timeout exceeded while awaiting headers)

Is it possible the health checks are timing out because the promtail pod is too busy?

Dec 16 21:07:50.000 W ns/loki pod/loki-promtail-9fmk2 node/ip-10-0-144-145.us-west-2.compute.internal reason/Unhealthy Readiness probe failed: Get "http://10.131.0.17:3101/ready": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
Dec 16 21:12:40.000 W ns/loki pod/loki-promtail-9fmk2 node/ip-10-0-144-145.us-west-2.compute.internal reason/Unhealthy Readiness probe failed: Get "http://10.131.0.17:3101/ready": context deadline exceeded (Client.Timeout exceeded while awaiting headers) (2 times)
Dec 16 21:12:50.000 W ns/loki pod/loki-promtail-9fmk2 node/ip-10-0-144-145.us-west-2.compute.internal reason/Unhealthy Readiness probe failed: Get "http://10.131.0.17:3101/ready": context deadline exceeded (Client.Timeout exceeded while awaiting headers) (3 times)
Dec 16 21:13:00.000 W ns/loki pod/loki-promtail-9fmk2 node/ip-10-0-144-145.us-west-2.compute.internal reason/Unhealthy Readiness probe failed: Get "http://10.131.0.17:3101/ready": context deadline exceeded (Client.Timeout exceeded while awaiting headers) (4 times)
Dec 16 21:13:10.000 W ns/loki pod/loki-promtail-9fmk2 node/ip-10-0-144-145.us-west-2.compute.internal reason/Unhealthy Readiness probe failed: Get "http://10.131.0.17:3101/ready": context deadline exceeded (Client.Timeout exceeded while awaiting headers) (5 times)
Dec 16 21:13:20.000 W ns/loki pod/loki-promtail-9fmk2 node/ip-10-0-144-145.us-west-2.compute.internal reason/Unhealthy Readiness probe failed: Get "http://10.131.0.17:3101/ready": context deadline exceeded (Client.Timeout exceeded while awaiting headers) (6 times)
Dec 16 21:13:21.000 W ns/loki pod/loki-promtail-9fmk2 node/ip-10-0-144-145.us-west-2.compute.internal reason/Unhealthy Readiness probe failed: Get "http://10.131.0.17:3101/ready": context deadline exceeded (Client.Timeout exceeded while awaiting headers) (7 times)
Dec 16 21:13:30.000 W ns/loki pod/loki-promtail-9fmk2 node/ip-10-0-144-145.us-west-2.compute.internal reason/Unhealthy Readiness probe failed: Get "http://10.131.0.17:3101/ready": context deadline exceeded (Client.Timeout exceeded while awaiting headers) (8 times)
Dec 16 21:13:40.000 W ns/loki pod/loki-promtail-9fmk2 node/ip-10-0-144-145.us-west-2.compute.internal reason/Unhealthy Readiness probe failed: Get "http://10.131.0.17:3101/ready": context deadline exceeded (Client.Timeout exceeded while awaiting headers) (9 times)
Dec 16 21:14:40.000 W ns/loki pod/loki-promtail-9fmk2 node/ip-10-0-144-145.us-west-2.compute.internal reason/Unhealthy Readiness probe failed: Get "http://10.131.0.17:3101/ready": context deadline exceeded (Client.Timeout exceeded while awaiting headers) (10 times)
Dec 16 21:16:50.000 W ns/loki pod/loki-promtail-9fmk2 node/ip-10-0-144-145.us-west-2.compute.internal reason/Unhealthy Readiness probe failed: Get "http://10.131.0.17:3101/ready": context deadline exceeded (Client.Timeout exceeded while awaiting headers) (11 times)
Dec 16 21:19:00.000 W ns/loki pod/loki-promtail-9fmk2 node/ip-10-0-144-145.us-west-2.compute.internal reason/Unhealthy Readiness probe failed: Get "http://10.131.0.17:3101/ready": context deadline exceeded (Client.Timeout exceeded while awaiting headers) (12 times)
Dec 16 21:19:10.000 W ns/loki pod/loki-promtail-9fmk2 node/ip-10-0-144-145.us-west-2.compute.internal reason/Unhealthy Readiness probe failed: Get "http://10.131.0.17:3101/ready": context deadline exceeded (Client.Timeout exceeded while awaiting headers) (13 times)
Dec 16 21:19:20.000 W ns/loki pod/loki-promtail-9fmk2 node/ip-10-0-144-145.us-west-2.compute.internal reason/Unhealthy Readiness probe failed: Get "http://10.131.0.17:3101/ready": context deadline exceeded (Client.Timeout exceeded while awaiting headers) (14 times)
Dec 16 21:19:30.000 W ns/loki pod/loki-promtail-9fmk2 node/ip-10-0-144-145.us-west-2.compute.internal reason/Unhealthy Readiness probe failed: Get "http://10.131.0.17:3101/ready": context deadline exceeded (Client.Timeout exceeded while awaiting headers) (15 times)
Dec 16 21:19:40.000 W ns/loki pod/loki-promtail-9fmk2 node/ip-10-0-144-145.us-west-2.compute.internal reason/Unhealthy Readiness probe failed: Get "http://10.131.0.17:3101/ready": context deadline exceeded (Client.Timeout exceeded while awaiting headers) (16 times)
Dec 16 21:19:41.000 W ns/loki pod/loki-promtail-9fmk2 node/ip-10-0-144-145.us-west-2.compute.internal reason/Unhealthy Readiness probe failed: Get "http://10.131.0.17:3101/ready": context deadline exceeded (Client.Timeout exceeded while awaiting headers) (17 times)
Dec 16 21:19:50.000 W ns/loki pod/loki-promtail-9fmk2 node/ip-10-0-144-145.us-west-2.compute.internal reason/Unhealthy Readiness probe failed: Get "http://10.131.0.17:3101/ready": context deadline exceeded (Client.Timeout exceeded while awaiting headers) (18 times)
Dec 16 21:20:00.000 W ns/loki pod/loki-promtail-9fmk2 node/ip-10-0-144-145.us-west-2.compute.internal reason/Unhealthy Readiness probe failed: Get "http://10.131.0.17:3101/ready": context deadline exceeded (Client.Timeout exceeded while awaiting headers) (19 times)
Dec 16 21:20:10.000 W ns/loki pod/loki-promtail-9fmk2 node/ip-10-0-144-145.us-west-2.compute.internal reason/Unhealthy Readiness probe failed: Get "http://10.131.0.17:3101/ready": context deadline exceeded (Client.Timeout exceeded while awaiting headers) (20 times)
Dec 16 21:20:20.000 W ns/loki pod/loki-promtail-9fmk2 node/ip-10-0-144-145.us-west-2.compute.internal reason/Unhealthy Readiness probe failed: Get "http://10.131.0.17:3101/ready": context deadline exceeded (Client.Timeout exceeded while awaiting headers) (21 times)

There are 167 instances of:

level=warn ts=2021-12-16T20:57:14.991170953Z caller=client.go:349 component=client host=logs-prod3.grafana.net msg="error sending batch, will retry" status=429 error="server returned HTTP status 429 Too Many Requests (429): {\"code\":429,\"status\":\"error\",\"message\":\"Ingestion rate limit exceeded for user 18459 (limit: 6291456 bytes/sec) while attempting to ingest '722' lines totaling '102341' bytes, reduce log volume or contact your Loki administrator to see if the limit can be increased\"}"

https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_ovn-kubernetes/875/pull-ci-openshift-ovn-kubernetes-master-e2e-aws-ovn-local-gateway/1471574928024145920/artifacts/e2e-aws-ovn-local-gateway/gather-extra/artifacts/pods/loki_loki-promtail-9fmk2_promtail.log