Bug 1700779
| Field | Value | Field | Value |
|---|---|---|---|
| Summary: | e2e flake: fail [k8s.io/kubernetes/test/e2e/storage/volume_metrics.go:159]: Error finding volume metrics : timed out waiting for the condition | | |
| Product: | OpenShift Container Platform | Reporter: | Antonio Murdaca <amurdaca> |
| Component: | Node | Assignee: | Robert Krawitz <rkrawitz> |
| Status: | CLOSED WORKSFORME | QA Contact: | Sunil Choudhary <schoudha> |
| Severity: | medium | Docs Contact: | |
| Priority: | medium | | |
| Version: | 4.1.0 | CC: | aos-bugs, aos-storage-staff, bchilds, bleanhar, chancez, gblomqui, hekumar, jesusr, jokerman, mmccomas, nhale, pehunt, sdodson, wking |
| Target Milestone: | --- | | |
| Target Release: | 4.2.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | buildcop | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2019-07-15 18:51:05 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description (Antonio Murdaca, 2019-04-17 11:17:29 UTC)
Looks like the AWS API quota limit problem again:

    Apr 16 20:55:56.036: INFO: At 2019-04-16 20:47:09 +0000 UTC - event for pvc-pnb6z: {persistentvolume-controller } ProvisioningFailed: Failed to provision volume with StorageClass "gp2": RequestLimitExceeded: Request limit exceeded.

We need to come up with a budget that reasonably approximates the API usage required to run an e2e job and make sure that we're not running too many concurrent jobs. I'm not sure whether that falls to the test platform, the installer, or a wider audience, but it feels like we've not done enough to size our CI requirements, and that will require input from many teams.

> We need to come up with a budget that reasonably approximates the API usage required to run an e2e job and make sure that we're not running too many concurrent jobs.
But these are per-AWS-account quotas, and any test-suite tooling is unlikely to have insight into that scope. Maybe we should start bumping AWS retry caps when we call into the AWS SDK? That's no throttle avoidance, but we'd be more likely to poll through to a quiet-enough window and pass the test.
This is a bad interaction between the installer and the AWS API, not test infra.

Could the test be made more robust to handle timeouts? We're likely going to see these in our CI environment whenever we're running many tests in parallel.

The RequestLimitExceeded is not present in the latest logs, so this might be something other than I thought. Tried to reproduce by running this test in a loop for 40 minutes, but no errors happened.

I was staring at this for a while... The volume stats are collected periodically (every minute by default) and cached by the kubelet. The logs we have suggest either the cache is empty (i.e. no error anywhere, just no volume stats recorded) or the REST call to a proxy returned an empty body (again, no error anywhere). Both situations are quite weird.

1. Ran the test for 3 hours and no errors happened:

       (...)
       real 181m41.134s
       user 2m3.011s
       sys 0m28.484s

2. This flake isn't currently happening in upstream Kubernetes. Note that upstream runs on GCE, which may suggest some relationship with AWS. Unfortunately we don't run AWS serial tests upstream to have a direct comparison. https://testgrid.k8s.io/sig-storage-kubernetes#gce-serial

3. Filtering the last 7 days indicates two flakes: https://search.svc.ci.openshift.org/?search=should+create+volume+metrics+with+the+correct+PVC+ref&maxAge=168h&context=2&type=all

The older one is definitely the AWS API quota limit being exceeded. The volume created in the test could not be provisioned, so the metrics count didn't match:

    Apr 22 21:18:20.761: INFO: At 2019-04-22 21:09:45 +0000 UTC - event for pvc-jgxpd: {persistentvolume-controller } ProvisioningFailed: Failed to provision volume with StorageClass "gp2": RequestLimitExceeded: Request limit exceeded.
The newest flake suggests a failure getting cAdvisor stats of the filesystem (as @tsmetana found out):

    Apr 25 09:07:57.995 I test="[sig-storage] [Serial] Volume metrics should create volume metrics with the correct PVC ref [Suite:openshift/conformance/serial] [Suite:k8s]" failed
    Apr 25 09:08:03.404 W node/ip-10-0-157-33.ec2.internal failed to get imageFs info: non-existent label "crio-images" (11 times)
    Apr 25 09:08:22.814 W node/ip-10-0-148-48.ec2.internal failed to get imageFs info: non-existent label "crio-images" (9 times)

Yes, it really seems the metrics fail to collect. The log message about failed imageFs info because of a non-existent label comes from the kubelet's cAdvisor stats provider, and each time this occurs the kubelet gives up on collecting the pods' volume statistics, so if it never succeeded the caches remained empty. I checked other e2e tests where the volume metrics test passes and they do not contain this error in the logs. If that is really so, the failure is not caused by the storage test; something else is broken: why doesn't cAdvisor know the label?

To summarize: two flakes in the last 7 days. One happened because of the AWS quota limit. The other happened because of an error in cAdvisor (or rather, an image it uses is missing a label).

This is where in the code the error happened [1], and this is where we collect the metrics from the flaky test [2]. This confirms that the flake happened because cAdvisor couldn't collect the pod stats.
[1] https://github.com/kubernetes/kubernetes/blob/6c664746e4e33f37c38769d20b1c64b08fcd52d0/pkg/kubelet/stats/cadvisor_stats_provider.go#L72-L82
[2] https://github.com/kubernetes/kubernetes/blob/6c664746e4e33f37c38769d20b1c64b08fcd52d0/pkg/kubelet/metrics/collectors/volume_stats.go#L81-L116

Some failures not related to infra failures:
https://prow.k8s.io/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-serial-4.0/4198
https://prow.k8s.io/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-serial-4.0/4109
https://prow.k8s.io/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-serial-4.0/4039

If you want to look for more: https://testgrid.k8s.io/redhat-openshift-release-blocking#redhat-release-openshift-origin-installer-e2e-aws-serial-4.0&sort-by-flakiness=&width=5

*** Bug 1701368 has been marked as a duplicate of this bug. ***

Adding the buildcop label since this rose up in deck-build-log-hot.

Yeah, so I found out the same thing:

    May 09 18:05:50 ip-10-0-130-111 hyperkube[867]: I0509 18:05:50.136571 867 secret.go:206] Received secret e2e-tests-pv-r58cj/default-token-9m5mm containing (4) pieces of data, 11273 total bytes
    May 09 18:05:51 ip-10-0-130-111 hyperkube[867]: E0509 18:05:51.071715 867 log_metrics.go:59] failed to get pod stats: failed to get imageFs info: non-existent label "crio-images"
    May 09 18:05:51 ip-10-0-130-111 hyperkube[867]: I0509 18:05:51.075850 867 server.go:818] GET /metrics: (4.449849ms) 200 [openshift-tests/v1.13.4+9ffd146 (linux/amd64) kubernetes/9ffd146 -- [sig-storage] [Serial] Volume metrics should create volume metrics with the correct PVC ref [Suite:openshift/conformance/serial] [Suite:k8s] 10.0.130.163:44554]

Both volume and pod stats are collected in the same function, and the error above prevents the metrics for the volume from being returned as well.

*** Bug 1709609 has been marked as a duplicate of this bug.
***

Currently something similar at https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-serial-4.2/1814/build-log.txt :

    STEP: Found 8 events.
    Jul 11 20:09:04.863: INFO: At 2019-07-11 19:59:04 +0000 UTC - event for pvc-knwvj: {persistentvolume-controller } WaitForFirstConsumer: waiting for first consumer to be created before binding
    Jul 11 20:09:04.863: INFO: At 2019-07-11 20:00:48 +0000 UTC - event for pvc-knwvj: {persistentvolume-controller } ProvisioningFailed: Failed to provision volume with StorageClass "gp2": error querying for all zones: error listing AWS instances: "RequestLimitExceeded: Request limit exceeded.\n\tstatus code: 503, request id: 67cc20ad-27f7-4f43-9a56-1fc8ef04f936"
    Jul 11 20:09:04.863: INFO: At 2019-07-11 20:00:49 +0000 UTC - event for pvc-tester-jfwj2: {default-scheduler } FailedScheduling: selectedNode annotation value "" not set to scheduled node "ip-10-0-143-34.ec2.internal"
    Jul 11 20:09:04.863: INFO: At 2019-07-11 20:03:15 +0000 UTC - event for pvc-knwvj: {persistentvolume-controller } ProvisioningSucceeded: Successfully provisioned volume pvc-55c067cb-a416-11e9-9d8d-127bb7b703ee using kubernetes.io/aws-ebs
    Jul 11 20:09:04.863: INFO: At 2019-07-11 20:03:16 +0000 UTC - event for pvc-tester-jfwj2: {default-scheduler } Scheduled: Successfully assigned pv-6295/pvc-tester-jfwj2 to ip-10-0-143-34.ec2.internal
    Jul 11 20:09:04.863: INFO: At 2019-07-11 20:05:11 +0000 UTC - event for pvc-tester-jfwj2: {attachdetach-controller } FailedAttachVolume: AttachVolume.Attach failed for volume "pvc-55c067cb-a416-11e9-9d8d-127bb7b703ee" : error finding instance ip-10-0-143-34.ec2.internal: "error listing AWS instances: \"RequestLimitExceeded: Request limit exceeded.\\n\\tstatus code: 503, request id: 981b9323-7a0d-44bf-b301-382d31e7dabb\""
    Jul 11 20:09:04.863: INFO: At 2019-07-11 20:05:19 +0000 UTC - event for pvc-tester-jfwj2: {kubelet ip-10-0-143-34.ec2.internal} FailedMount: Unable to mount volumes for pod "pvc-tester-jfwj2_pv-6295(55c5268e-a416-11e9-9d8d-127bb7b703ee)": timeout expired waiting for volumes to attach or mount for pod "pv-6295"/"pvc-tester-jfwj2". list of unmounted volumes=[volume1]. list of unattached volumes=[volume1 default-token-gkz9h]
    Jul 11 20:09:04.863: INFO: At 2019-07-11 20:07:44 +0000 UTC - event for pvc-tester-jfwj2: {attachdetach-controller } FailedAttachVolume: AttachVolume.Attach failed for volume "pvc-55c067cb-a416-11e9-9d8d-127bb7b703ee" : error finding instance ip-10-0-143-34.ec2.internal: "error listing AWS instances: \"RequestLimitExceeded: Request limit exceeded.\\n\\tstatus code: 503, request id: 7ecf50e9-516f-43f5-8ab3-d4f78c1283e7\""

Closing this as there have been no further reports of the "Error finding volume metrics" problem in over a month. The issue of AWS rate limiting is distinct; if there's a problem there, it should be filed separately.

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days