Bug 1700779 - e2e flake: fail [k8s.io/kubernetes/test/e2e/storage/volume_metrics.go:159]: Error finding volume metrics : timed out waiting for the condition
Summary: e2e flake: fail [k8s.io/kubernetes/test/e2e/storage/volume_metrics.go:159]: Error finding volume metrics : timed out waiting for the condition
Keywords:
Status: CLOSED WORKSFORME
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Node
Version: 4.1.0
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 4.2.0
Assignee: Robert Krawitz
QA Contact: Sunil Choudhary
URL:
Whiteboard: buildcop
Duplicates: 1701368, 1709609
Depends On:
Blocks:
 
Reported: 2019-04-17 11:17 UTC by Antonio Murdaca
Modified: 2023-09-14 05:27 UTC
CC List: 14 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-07-15 18:51:05 UTC
Target Upstream Version:
Embargoed:



Comment 1 Tomas Smetana 2019-04-17 11:44:04 UTC
Looks like an AWS API quota limit problem again:

Apr 16 20:55:56.036: INFO: At 2019-04-16 20:47:09 +0000 UTC - event for pvc-pnb6z: {persistentvolume-controller } ProvisioningFailed: Failed to provision volume with StorageClass "gp2": RequestLimitExceeded: Request limit exceeded.

Comment 2 Scott Dodson 2019-04-17 13:00:17 UTC
We need to come up with a budget that reasonably approximates the API usage required to run an e2e job and make sure we're not running too many concurrent jobs. I'm not sure whether that should fall to the test platform team, the installer team, or a wider audience, but it feels like we've not done enough to size our CI requirements, and that will require input from many teams.

Comment 3 W. Trevor King 2019-04-17 14:16:34 UTC
> We need to come up with a budget that reasonably approximates the API usage required to run an e2e job and make sure that we're not running too many concurrent jobs.

But these are per-AWS-account quotas, and test-suite tooling is unlikely to have insight into that scope.  Maybe we should start bumping the AWS retry caps when we call into the AWS SDK?  That's not throttle avoidance, but we'd be more likely to poll through to a quiet-enough window to pass the test.
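
For illustration, a minimal sketch of what bumping the retry cap could look like, assuming the AWS SDK for Go v1; the region and retry count here are illustrative, not a recommendation:

package main

import (
	"fmt"
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/ec2"
)

func main() {
	// Raise the SDK-level retry cap so throttled EC2 calls are retried with
	// exponential backoff instead of surfacing RequestLimitExceeded immediately.
	sess, err := session.NewSession(&aws.Config{
		Region:     aws.String("us-east-1"), // illustrative region
		MaxRetries: aws.Int(13),             // the SDK default for most services is 3
	})
	if err != nil {
		log.Fatal(err)
	}

	svc := ec2.New(sess)
	out, err := svc.DescribeInstances(&ec2.DescribeInstancesInput{})
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("described %d reservations\n", len(out.Reservations))
}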

Comment 4 Steve Kuznetsov 2019-04-17 23:48:17 UTC
This is a bad interaction between the installer and the AWS API, not Test Infra

Comment 5 Brenton Leanhardt 2019-04-18 18:15:02 UTC
Could the test be made more robust to handle timeouts?  We're likely going to see these in our CI environment whenever we're running many tests in parallel.
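
One hedged option, sketched below with the apimachinery wait helpers: wrap the metric lookup in a poll loop so a transient provisioning delay or throttled API call doesn't immediately fail the test. getPVCRefMetric and the intervals are hypothetical stand-ins, not the test's actual code:

package volumemetrics

import (
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

// waitForVolumeMetric polls a flaky metric check instead of relying on a
// single fixed timeout. getPVCRefMetric is a hypothetical helper that returns
// true once the expected kubelet_volume_stats series is present.
func waitForVolumeMetric(getPVCRefMetric func() (bool, error)) error {
	return wait.PollImmediate(10*time.Second, 5*time.Minute, func() (bool, error) {
		found, err := getPVCRefMetric()
		if err != nil {
			// Treat transient errors (e.g. AWS throttling while the PV is
			// still provisioning) as "not yet" and keep polling.
			return false, nil
		}
		return found, nil
	})
}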

Comment 6 Tomas Smetana 2019-04-23 10:21:39 UTC
The RequestLimitExceeded error is not present in the latest logs, so this might be something other than what I thought.

Comment 7 Fabio Bertinatto 2019-04-23 14:42:18 UTC
I tried to reproduce this by running the test in a loop for 40 minutes, but no errors occurred.

Comment 8 Tomas Smetana 2019-04-24 10:19:45 UTC
I was staring at this for a while... The volume stats are collected periodically (every minute by default) and cached by the kubelet. The logs we have suggest that either the cache is empty (i.e. no error anywhere, just no volume stats recorded) or the REST call through the proxy returned an empty body (again, no error anywhere). Both situations are quite weird.
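
For reference, one way to see what the kubelet is actually serving is to fetch its /metrics endpoint through the API server's node proxy and look for the kubelet_volume_stats_* series. A minimal client-go sketch, with the kubeconfig path and node name as placeholders:

package main

import (
	"context"
	"fmt"
	"log"

	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", "/path/to/kubeconfig") // placeholder path
	if err != nil {
		log.Fatal(err)
	}
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		log.Fatal(err)
	}

	// GET /api/v1/nodes/<node>/proxy/metrics, i.e. the kubelet's Prometheus
	// endpoint, proxied through the API server.
	raw, err := client.CoreV1().RESTClient().Get().
		Resource("nodes").
		Name("ip-10-0-157-33.ec2.internal"). // placeholder node name
		SubResource("proxy").
		Suffix("metrics").
		DoRaw(context.TODO())
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(string(raw)) // look for kubelet_volume_stats_capacity_bytes etc.
}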

Comment 9 Fabio Bertinatto 2019-04-25 11:32:34 UTC
1. I ran the test for 3 hours and no errors occurred:

(...)
real	181m41.134s
user	2m3.011s
sys	0m28.484s

2. This flake isn't currently happening in upstream Kubernetes. Note that the test runs on GCE there, which may suggest some relationship with AWS. Unfortunately, upstream doesn't run AWS serial tests, so we don't have a direct comparison.

https://testgrid.k8s.io/sig-storage-kubernetes#gce-serial

3. Filtering CI search results from the last 7 days shows two flakes:

https://search.svc.ci.openshift.org/?search=should+create+volume+metrics+with+the+correct+PVC+ref&maxAge=168h&context=2&type=all

The older one is definitely an AWS API quota limit being exceeded. The volume created by the test could not be provisioned, so the metrics count didn't match:

Apr 22 21:18:20.761: INFO: At 2019-04-22 21:09:45 +0000 UTC - event for pvc-jgxpd: {persistentvolume-controller } ProvisioningFailed: Failed to provision volume with StorageClass "gp2": RequestLimitExceeded: Request limit exceeded.

The newer flake suggests a failure getting cAdvisor filesystem stats (as @tsmetana found out):

Apr 25 09:07:57.995 I test="[sig-storage] [Serial] Volume metrics should create volume metrics with the correct PVC ref [Suite:openshift/conformance/serial] [Suite:k8s]" failed
Apr 25 09:08:03.404 W node/ip-10-0-157-33.ec2.internal failed to get imageFs info: non-existent label "crio-images" (11 times)
Apr 25 09:08:22.814 W node/ip-10-0-148-48.ec2.internal failed to get imageFs info: non-existent label "crio-images" (9 times)

Comment 10 Tomas Smetana 2019-04-25 11:41:45 UTC
Yes, it really does seem that the metrics fail to be collected. The log message about failing to get imageFs info because of a non-existent label comes from the kubelet's cAdvisor stats provider, and each time this occurs the kubelet gives up on collecting the pods' volume statistics, so if it never succeeded the caches remained empty. I checked other e2e runs where the volume metrics test passes and they do not contain this error in the logs. If that's really the case, the failure is not caused by the storage test; something else is broken. Why doesn't cAdvisor know the label?

Comment 11 Fabio Bertinatto 2019-04-25 11:51:18 UTC
To summarize:

Two flakes in the last 7 days. One happened because of the AWS quota limit. The other happened because of an error in cAdvisor (or rather, the imageFs label it expects, "crio-images", does not exist).

This is where the error happened in the code [1], and this is where the metrics checked by the flaky test are collected [2]. This confirms that the flake happened because cAdvisor couldn't collect the pod stats.


[1] https://github.com/kubernetes/kubernetes/blob/6c664746e4e33f37c38769d20b1c64b08fcd52d0/pkg/kubelet/stats/cadvisor_stats_provider.go#L72-L82
[2] https://github.com/kubernetes/kubernetes/blob/6c664746e4e33f37c38769d20b1c64b08fcd52d0/pkg/kubelet/metrics/collectors/volume_stats.go#L81-L116
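
To make the coupling concrete, here is a simplified, illustrative sketch of the pattern described above (not the upstream code): if listing per-pod stats fails, the volume-stats collector returns early and never emits any kubelet_volume_stats_* series, which is exactly what the e2e test then times out waiting for.

package main

import (
	"errors"
	"log"
)

type volumeStats struct {
	PVCName       string
	CapacityBytes int64
}

type podStats struct {
	VolumeStats []volumeStats
}

// statsProvider stands in for the kubelet's cAdvisor-backed stats provider.
type statsProvider interface {
	ListPodStats() ([]podStats, error)
}

// collectVolumeMetrics emits one sample per volume, but only if pod stats
// could be listed at all; a single cAdvisor error suppresses everything.
func collectVolumeMetrics(p statsProvider, emit func(pvc string, capacity int64)) {
	stats, err := p.ListPodStats()
	if err != nil {
		log.Printf("failed to get pod stats: %v", err)
		return
	}
	for _, pod := range stats {
		for _, v := range pod.VolumeStats {
			emit(v.PVCName, v.CapacityBytes)
		}
	}
}

type failingProvider struct{}

func (failingProvider) ListPodStats() ([]podStats, error) {
	return nil, errors.New(`failed to get imageFs info: non-existent label "crio-images"`)
}

func main() {
	// With the failing provider, emit is never called, so no volume metrics
	// appear and the test's wait for them times out.
	collectVolumeMetrics(failingProvider{}, func(pvc string, capacity int64) {
		log.Printf("kubelet_volume_stats_capacity_bytes{persistentvolumeclaim=%q} %d", pvc, capacity)
	})
}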

Comment 16 Tomas Smetana 2019-04-26 10:26:28 UTC
*** Bug 1701368 has been marked as a duplicate of this bug. ***

Comment 20 Gabe Montero 2019-05-10 15:30:01 UTC
Adding the buildcop label since this rose up in deck-build-log-hot.

Comment 21 Hemant Kumar 2019-05-10 15:38:03 UTC
Yeah, so I found out the same thing:

May 09 18:05:50 ip-10-0-130-111 hyperkube[867]: I0509 18:05:50.136571     867 secret.go:206] Received secret e2e-tests-pv-r58cj/default-token-9m5mm containing (4) pieces of data, 11273 total bytes
May 09 18:05:51 ip-10-0-130-111 hyperkube[867]: E0509 18:05:51.071715     867 log_metrics.go:59] failed to get pod stats: failed to get imageFs info: non-existent label "crio-images"
May 09 18:05:51 ip-10-0-130-111 hyperkube[867]: I0509 18:05:51.075850     867 server.go:818] GET /metrics: (4.449849ms) 200 [openshift-tests/v1.13.4+9ffd146 (linux/amd64) kubernetes/9ffd146 -- [sig-storage] [Serial] Volume metrics should create volume metrics with the correct PVC ref [Suite:openshift/conformance/serial] [Suite:k8s] 10.0.130.163:44554]


Both volume and pod stats are collected in the same function, and the error above prevents the volume metrics from being returned as well.

Comment 22 W. Trevor King 2019-05-14 06:14:22 UTC
*** Bug 1709609 has been marked as a duplicate of this bug. ***

Comment 31 Nick Hale 2019-07-12 04:37:41 UTC
Currently seeing something similar at https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-serial-4.2/1814/build-log.txt :

STEP: Found 8 events.
Jul 11 20:09:04.863: INFO: At 2019-07-11 19:59:04 +0000 UTC - event for pvc-knwvj: {persistentvolume-controller } WaitForFirstConsumer: waiting for first consumer to be created before binding
Jul 11 20:09:04.863: INFO: At 2019-07-11 20:00:48 +0000 UTC - event for pvc-knwvj: {persistentvolume-controller } ProvisioningFailed: Failed to provision volume with StorageClass "gp2": error querying for all zones: error listing AWS instances: "RequestLimitExceeded: Request limit exceeded.\n\tstatus code: 503, request id: 67cc20ad-27f7-4f43-9a56-1fc8ef04f936"
Jul 11 20:09:04.863: INFO: At 2019-07-11 20:00:49 +0000 UTC - event for pvc-tester-jfwj2: {default-scheduler } FailedScheduling: selectedNode annotation value "" not set to scheduled node "ip-10-0-143-34.ec2.internal"
Jul 11 20:09:04.863: INFO: At 2019-07-11 20:03:15 +0000 UTC - event for pvc-knwvj: {persistentvolume-controller } ProvisioningSucceeded: Successfully provisioned volume pvc-55c067cb-a416-11e9-9d8d-127bb7b703ee using kubernetes.io/aws-ebs
Jul 11 20:09:04.863: INFO: At 2019-07-11 20:03:16 +0000 UTC - event for pvc-tester-jfwj2: {default-scheduler } Scheduled: Successfully assigned pv-6295/pvc-tester-jfwj2 to ip-10-0-143-34.ec2.internal
Jul 11 20:09:04.863: INFO: At 2019-07-11 20:05:11 +0000 UTC - event for pvc-tester-jfwj2: {attachdetach-controller } FailedAttachVolume: AttachVolume.Attach failed for volume "pvc-55c067cb-a416-11e9-9d8d-127bb7b703ee" : error finding instance ip-10-0-143-34.ec2.internal: "error listing AWS instances: \"RequestLimitExceeded: Request limit exceeded.\\n\\tstatus code: 503, request id: 981b9323-7a0d-44bf-b301-382d31e7dabb\""
Jul 11 20:09:04.863: INFO: At 2019-07-11 20:05:19 +0000 UTC - event for pvc-tester-jfwj2: {kubelet ip-10-0-143-34.ec2.internal} FailedMount: Unable to mount volumes for pod "pvc-tester-jfwj2_pv-6295(55c5268e-a416-11e9-9d8d-127bb7b703ee)": timeout expired waiting for volumes to attach or mount for pod "pv-6295"/"pvc-tester-jfwj2". list of unmounted volumes=[volume1]. list of unattached volumes=[volume1 default-token-gkz9h]
Jul 11 20:09:04.863: INFO: At 2019-07-11 20:07:44 +0000 UTC - event for pvc-tester-jfwj2: {attachdetach-controller } FailedAttachVolume: AttachVolume.Attach failed for volume "pvc-55c067cb-a416-11e9-9d8d-127bb7b703ee" : error finding instance ip-10-0-143-34.ec2.internal: "error listing AWS instances: \"RequestLimitExceeded: Request limit exceeded.\\n\\tstatus code: 503, request id: 7ecf50e9-516f-43f5-8ab3-d4f78c1283e7\""

Comment 33 Robert Krawitz 2019-07-15 18:51:05 UTC
Closing this, as there have been no further reports of the "error finding volume metrics" problem in over a month.

The issue of AWS rate limiting is distinct; if there's a problem, that should be filed separately.

Comment 34 Red Hat Bugzilla 2023-09-14 05:27:06 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days

