Bug 1707495
| Summary: | [Feature:Prometheus][Conformance] Prometheus when installed on the cluster should report less than two alerts in firing or pending state [Suite:openshift/conformance/parallel/minimal] | | |
| --- | --- | --- | --- |
| Product: | OpenShift Container Platform | Reporter: | Rich Megginson <rmeggins> |
| Component: | Storage | Assignee: | Bradley Childs <bchilds> |
| Status: | CLOSED DUPLICATE | QA Contact: | Liang Xia <lxia> |
| Severity: | medium | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 4.1.0 | CC: | alegrand, anpicker, aos-bugs, aos-storage-staff, eparis, erooth, gblomqui, jokerman, jsafrane, kakkoyun, lcosic, maszulik, mfojtik, mloibl, mmccomas, pkrupa, shlao, sponnaga, sttts, surbania |
| Target Milestone: | --- | | |
| Target Release: | 4.2.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2019-09-16 14:06:43 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Rich Megginson
2019-05-07 15:59:48 UTC
This is only one report of failure, so I'm pushing to 4.2. If this is a major problem, please feel free to reassign to 4.1.

I looked into this, and this is a legitimate failure, but not of the monitoring component. The alert that fired is the TargetDown alert for the kube-controller-manager-operator:

```
up{endpoint="https",instance="10.130.0.4:8443",job="metrics",namespace="openshift-kube-controller-manager-operator",pod="kube-controller-manager-operator-6f5b4b54b9-gjcc8",service="metrics"} 0
```

This also doesn't seem to be flaky behavior, at least not within this CI run: the target was down for the entire run and was never successfully scraped.

Although we haven't seen this flake (yet), I do believe that including the pending alert state in the smoke test is a bit too aggressive, so I opened https://github.com/openshift/origin/pull/22797 to relax it and prevent potential flakes. (Note that this is not what caused this failure, but a separate realization; the above alert was truly firing, not pending, so my patch would not have made a difference. This is still something for the master team to look into.)

Moving to the master component due to the above explanation.

Old bug, and the logs seem to be gone.

I get the same error: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-sdn-multitenant-4.2/125

This is the correct url: https://prow.k8s.io/view/gcs/origin-ci-test/logs/canary-openshift-openshift-ansible-e2e-aws-scaleup-rhel7-4.2/71

This time it looks like the culprit is the monitoring component:

```
ALERTS{alertname="KubePersistentVolumeErrors",alertstate="firing",endpoint="https-main",instance="10.131.2.5:8443",job="kube-state-metrics",namespace="openshift-monitoring",persistentvolume="pvc-295abc2e-d638-11e9-8b7e-128b681d70d4",phase="Failed",pod="kube-state-metrics-6b895d4967-m5lcn",service="kube-state-metrics",severity="critical"}
```

There are 3 of these.

(In reply to Maciej Szulik from comment #6)
> This time it looks like the culprit is the monitoring component:
>
> ALERTS{alertname="KubePersistentVolumeErrors",alertstate="firing",endpoint="https-main",instance="10.131.2.5:8443",job="kube-state-metrics",namespace="openshift-monitoring",persistentvolume="pvc-295abc2e-d638-11e9-8b7e-128b681d70d4",phase="Failed",pod="kube-state-metrics-6b895d4967-m5lcn",service="kube-state-metrics",severity="critical"}
>
> There are 3 of these.

What version of the cluster are you seeing this in?
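[Editorial note: the conformance test named in the summary effectively counts active alerts through the Prometheus HTTP API and fails when two or more are pending or firing. A minimal sketch of that kind of check follows, assuming a reachable Prometheus at localhost:9090; the real test in openshift/origin uses its own client and selector, so treat this as an illustration only.]

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/url"
)

func main() {
	// ALERTS is a synthetic series Prometheus exposes for every active alert;
	// alertstate is either "pending" or "firing". The test described above
	// fails when this count reaches 2 or more.
	query := `count(ALERTS{alertstate=~"firing|pending"})`

	// Placeholder address; in a cluster this would be the Prometheus
	// route/service in the openshift-monitoring namespace.
	endpoint := "http://localhost:9090/api/v1/query?query=" + url.QueryEscape(query)

	resp, err := http.Get(endpoint)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	body, err := io.ReadAll(resp.Body)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(body)) // JSON result containing the alert count
}
```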
From looking into https://00e9e64bacd4b298441212acee1bf5ce9c256f1a3d324bb0cd-apidata.googleusercontent.com/download/storage/v1/b/origin-ci-test/o/logs%2Fcanary-openshift-openshift-ansible-e2e-aws-scaleup-rhel7-4.2%2F71%2Fartifacts%2Frelease-latest%2Frelease-payload-latest%2Fimage-references?qk=AD5uMEuCfD_iaOddAMC4IiTC_1eU3Ze8f2jYa6PnD_z_6bRX6DEn62SANCPJ68Aqnb_V1EzwQFAeqde1AAaM326Eb9AIDu1hc_0qZvUSkV21tRVzHNJXn0jDXh9aK6E5QzZpCcnmlb_sT_2R4R45pGKvwaz2jhgAPFyzQHo0Cpx49M0mxIoIEGKCN2HG4avY0ggADqX19CoOp-h-g8XiMRtharLjtTtWtccUdI70dfTRdWDtnXKI_mYZVvEO2oj4WYRQLGewiZXaOg-uxi-PDQmGAo0y7mqo90pEGVixPtIZCWKQR2_tz_FavOArMhzXMhESvYV3cVahYV_N0XPtgLjc8twY1pb2fdHI_Jm3FE2GNm91hCnXrSeLNhas-GrmIKmjhRxaqWA9f5v2cQyIt87D3OJNyGC6A29zt-0UnMC6OFCEHyAJTWIrfq8wP8P8RfFiOqlSj4SRh3WTrgnTQBUX-0PmIthzMszrjBnir3FRTjiVgsH8azpePqjvJaItJovMPmPAWKdcrjbQThRb_kYpemfB00jNQB0L0sOT0glwiUS4-gF5z0RJSxVdKBipIctkmqf3XUOX86tj5qLPdnJexOG3cqgyrP2ZSUtFbYv085YDq902RoZSQ2Y0uTcLufTS0mzQQs68H1E7QkcvPUKQpjxJU8Rdx4lD2cfKN_PU3JU-3SH6zEC-Q-AqS1Lf2iKRv8ux8CtQ9U9Omjn39sPa3ZkS6cwjHYkAA6WEvqYO5yXujB6RXLjj3GEC-zJlM9FgbLlM0pWwR0-4mMPA-q_CUUoYKm_Sfk6ii3zgTXFVNOQuzv0Dt2ezWy56ItAZQRMzYCSG6vrCWqNPs3h2U1SE6AXwb48IzwGGDmWGVznQ8bS_i-dKVLEgI-X2vAhx3qopH_mJLnu9_OP14ST9a-1dnIIHWl4TbVn1JWQnlOnfmvSFhWCJese4CvQXk5Mr8l5e5d5oedYDJix2npA-Olg2WJqLRTG4Tg it looks like it's a pretty fresh cluster:

```
"metadata": {
  "name": "0.0.1-2019-09-13-135245",
  "creationTimestamp": "2019-09-13T13:52:45Z",
  "annotations": {
    "release.openshift.io/from-image-stream": "ci-op-ljmc427b/stable"
  }
}
```

Thanks, I will spin up a cluster, but I suspect the PV called "pvc-295abc2e-d638-11e9-8b7e-128b681d70d4" doesn't actually belong to the monitoring components; I will investigate.

The above does not look like a problem with the monitoring components; after looking at the build log output, the volume itself seems to be the problem. As discussed, reassigning to the storage team.
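[Editorial note: one quick way to confirm which component a PV belongs to is to read its claimRef, which records the bound PVC and its namespace. A minimal client-go sketch follows; it assumes a recent k8s.io/client-go (older versions of Get do not take a context) and a default kubeconfig, so treat the details as illustrative.]

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Load the default kubeconfig (~/.kube/config); simplified for brevity.
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	// PV name taken from the KubePersistentVolumeErrors alert above.
	pv, err := client.CoreV1().PersistentVolumes().Get(context.TODO(),
		"pvc-295abc2e-d638-11e9-8b7e-128b681d70d4", metav1.GetOptions{})
	if err != nil {
		panic(err)
	}

	// claimRef points at the PVC (and namespace) the volume is bound to,
	// which tells us which component owns it.
	if ref := pv.Spec.ClaimRef; ref != nil {
		fmt.Printf("bound to PVC %s/%s\n", ref.Namespace, ref.Name)
	}
}
```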
(Probably best to open a separate bugzilla for this and close this one.)

```
Sep 13 14:58:30.470 I ns/openshift-machine-api machine/ci-op-ljmc427b-c5cba-8mgr7-master-2 Updated machine ci-op-ljmc427b-c5cba-8mgr7-master-2 (8 times)
Sep 13 14:58:35.786 I ns/openshift-machine-api machine/ci-op-ljmc427b-c5cba-8mgr7-master-0 Updated machine ci-op-ljmc427b-c5cba-8mgr7-master-0 (8 times)
Sep 13 14:58:38.136 I ns/openshift-machine-api machine/ci-op-ljmc427b-c5cba-8mgr7-worker-us-east-1a-centos-xdfgj Updated machine ci-op-ljmc427b-c5cba-8mgr7-worker-us-east-1a-centos-xdfgj (7 times)
Sep 13 14:58:43.707 I ns/openshift-machine-api machine/ci-op-ljmc427b-c5cba-8mgr7-master-1 Updated machine ci-op-ljmc427b-c5cba-8mgr7-master-1 (9 times)
Sep 13 14:58:46.624 I ns/openshift-machine-api machine/ci-op-ljmc427b-c5cba-8mgr7-worker-us-east-1a-centos-mr5b2 Updated machine ci-op-ljmc427b-c5cba-8mgr7-worker-us-east-1a-centos-mr5b2 (10 times)
Sep 13 14:58:55.681 W persistentvolume/pvc-dc151324-d636-11e9-8b7e-128b681d70d4 Error deleting EBS volume "vol-0ad5802fbce514f5c" since volume is currently attached to "i-08fa6cdc8babfd9c8"
Sep 13 14:59:38.973 W persistentvolume/pvc-ee15bec2-d636-11e9-8b7e-128b681d70d4 Error deleting EBS volume "vol-0deb1b52fb33a2bfb" since volume is currently attached to "i-08fa6cdc8babfd9c8"
Sep 13 15:02:17.644 W persistentvolume/pvc-4569cc7d-d637-11e9-8b7e-128b681d70d4 Error deleting EBS volume "vol-0e105c142a924de1c" since volume is currently attached to "i-08fa6cdc8babfd9c8"
Sep 13 15:03:13.860 W persistentvolume/pvc-6ac42871-d637-11e9-91a4-12a554f072c8 Error deleting EBS volume "vol-035a1e3c43dccae05" since volume is currently attached to "i-08346f3c57e94e400"
Sep 13 15:03:25.691 W endpoints/test-oauth-svc Failed to create endpoint for service e2e-test-oauth-expiration-2z6tv/test-oauth-svc: endpoints "test-oauth-svc" is forbidden: unable to create new content in namespace e2e-test-oauth-expiration-2z6tv because it is being terminated
Sep 13 15:03:25.739 W endpoints/test-oauth-svc Failed to create endpoint for service e2e-test-oauth-expiration-2z6tv/test-oauth-svc: endpoints "test-oauth-svc" is forbidden: unable to create new content in namespace e2e-test-oauth-expiration-2z6tv because it is being terminated (2 times)
Sep 13 15:03:25.749 W endpoints/test-oauth-svc Failed to create endpoint for service e2e-test-oauth-expiration-2z6tv/test-oauth-svc: endpoints "test-oauth-svc" is forbidden: unable to create new content in namespace e2e-test-oauth-expiration-2z6tv because it is being terminated (3 times)
Sep 13 15:03:25.764 W endpoints/test-oauth-svc Failed to create endpoint for service e2e-test-oauth-expiration-2z6tv/test-oauth-svc: endpoints "test-oauth-svc" is forbidden: unable to create new content in namespace e2e-test-oauth-expiration-2z6tv because it is being terminated (4 times)
Sep 13 15:03:25.809 W endpoints/test-oauth-svc Failed to create endpoint for service e2e-test-oauth-expiration-2z6tv/test-oauth-svc: endpoints "test-oauth-svc" is forbidden: unable to create new content in namespace e2e-test-oauth-expiration-2z6tv because it is being terminated (5 times)
Sep 13 15:03:25.887 W endpoints/test-oauth-svc Failed to create endpoint for service e2e-test-oauth-expiration-2z6tv/test-oauth-svc: endpoints "test-oauth-svc" is forbidden: unable to create new content in namespace e2e-test-oauth-expiration-2z6tv because it is being terminated (6 times)
Sep 13 15:03:26.046 W endpoints/test-oauth-svc Failed to create endpoint for service e2e-test-oauth-expiration-2z6tv/test-oauth-svc: endpoints "test-oauth-svc" is forbidden: unable to create new content in namespace e2e-test-oauth-expiration-2z6tv because it is being terminated (7 times)
Sep 13 15:03:26.357 W endpoints/test-oauth-svc Failed to create endpoint for service e2e-test-oauth-expiration-2z6tv/test-oauth-svc: endpoints "test-oauth-svc" is forbidden: unable to create new content in namespace e2e-test-oauth-expiration-2z6tv because it is being terminated (8 times)
Sep 13 15:03:26.603 W endpoints/test-oauth-svc Failed to create endpoint for service e2e-test-oauth-expiration-2z6tv/test-oauth-svc: endpoints "test-oauth-svc" is forbidden: unable to create new content in namespace e2e-test-oauth-expiration-2z6tv because it is being terminated (9 times)
Sep 13 15:04:09.438 W persistentvolume/pvc-805c23fb-d637-11e9-8b7e-128b681d70d4 Unavailable: The service is unavailable. Please try again shortly.\n status code: 503, request id: e8286310-9787-4889-86e6-912a1fe08783
Sep 13 15:06:43.359 W persistentvolume/pvc-33a20e7e-d637-11e9-8b7e-128b681d70d4 Error deleting EBS volume "vol-09a49ef7f8459a998" since volume is currently attached to "i-08346f3c57e94e400"
Sep 13 15:06:58.691 W persistentvolume/pvc-fb5c7578-d637-11e9-91a4-12a554f072c8 Error deleting EBS volume "vol-054e27747eaf35c0d" since volume is currently attached to "i-08fa6cdc8babfd9c8"
Sep 13 15:08:12.807 W persistentvolume/pvc-ebc69497-d637-11e9-91a4-12a554f072c8 Unavailable: The service is unavailable. Please try again shortly.\n status code: 503, request id: ae00dfe8-3a80-4341-aa2f-20ff8ad9e187
Sep 13 15:08:29.584 I ns/openshift-machine-api machine/ci-op-ljmc427b-c5cba-8mgr7-master-0 Updated machine ci-op-ljmc427b-c5cba-8mgr7-master-0 (9 times)
Sep 13 15:08:34.805 I ns/openshift-machine-api machine/ci-op-ljmc427b-c5cba-8mgr7-worker-us-east-1a-centos-xdfgj Updated machine ci-op-ljmc427b-c5cba-8mgr7-worker-us-east-1a-centos-xdfgj (8 times)
Sep 13 15:08:37.443 I ns/openshift-machine-api machine/ci-op-ljmc427b-c5cba-8mgr7-worker-us-east-1b-centos-rpgz6 Updated machine ci-op-ljmc427b-c5cba-8mgr7-worker-us-east-1b-centos-rpgz6 (10 times)
Sep 13 15:08:38.620 W persistentvolume/pvc-295abc2e-d638-11e9-8b7e-128b681d70d4 Error deleting EBS volume "vol-0f80fdc753a293715" since volume is currently attached to "i-08346f3c57e94e400"
Sep 13 15:08:40.723 I ns/openshift-machine-api machine/ci-op-ljmc427b-c5cba-8mgr7-master-2 Updated machine ci-op-ljmc427b-c5cba-8mgr7-master-2 (9 times)
Sep 13 15:08:45.293 I ns/openshift-machine-api machine/ci-op-ljmc427b-c5cba-8mgr7-worker-us-east-1a-centos-mr5b2 Updated machine ci-op-ljmc427b-c5cba-8mgr7-worker-us-east-1a-centos-mr5b2 (11 times)
Sep 13 15:08:52.137 I ns/openshift-machine-api machine/ci-op-ljmc427b-c5cba-8mgr7-master-1 Updated machine ci-op-ljmc427b-c5cba-8mgr7-master-1 (10 times)
Sep 13 15:09:20.005 W persistentvolume/pvc-17f5a7d9-d638-11e9-ab9a-0a3cfc05fb74 Error deleting EBS volume "vol-0c7c003d396769c25" since volume is currently attached to "i-08fa6cdc8babfd9c8"
Sep 13 15:10:48.679 W persistentvolume/pvc-77d461cf-d638-11e9-ab9a-0a3cfc05fb74 Error deleting EBS volume "vol-0a4c97a4c3a33981e" since volume is currently attached to "i-08346f3c57e94e400"
Sep 13 15:10:49.036 W endpoints/test-oauth-svc Failed to create endpoint for service e2e-test-oauth-server-headers-9rtb2/test-oauth-svc: endpoints "test-oauth-svc" is forbidden: unable to create new content in namespace e2e-test-oauth-server-headers-9rtb2 because it is being terminated
Sep 13 15:10:49.043 W endpoints/test-oauth-svc Failed to create endpoint for service e2e-test-oauth-server-headers-9rtb2/test-oauth-svc: endpoints "test-oauth-svc" is forbidden: unable to create new content in namespace e2e-test-oauth-server-headers-9rtb2 because it is being terminated (2 times)
Sep 13 15:10:49.051 W endpoints/test-oauth-svc Failed to create endpoint for service e2e-test-oauth-server-headers-9rtb2/test-oauth-svc: endpoints "test-oauth-svc" is forbidden: unable to create new content in namespace e2e-test-oauth-server-headers-9rtb2 because it is being terminated (3 times)
Sep 13 15:10:49.073 W endpoints/test-oauth-svc Failed to create endpoint for service e2e-test-oauth-server-headers-9rtb2/test-oauth-svc: endpoints "test-oauth-svc" is forbidden: unable to create new content in namespace e2e-test-oauth-server-headers-9rtb2 because it is being terminated (4 times)
Sep 13 15:10:49.132 W endpoints/test-oauth-svc Failed to create endpoint for service e2e-test-oauth-server-headers-9rtb2/test-oauth-svc: endpoints "test-oauth-svc" is forbidden: unable to create new content in namespace e2e-test-oauth-server-headers-9rtb2 because it is being terminated (5 times)
Sep 13 15:10:49.227 W endpoints/test-oauth-svc Failed to create endpoint for service e2e-test-oauth-server-headers-9rtb2/test-oauth-svc: endpoints "test-oauth-svc" is forbidden: unable to create new content in namespace e2e-test-oauth-server-headers-9rtb2 because it is being terminated (6 times)
Sep 13 15:10:49.378 W endpoints/test-oauth-svc Failed to create endpoint for service e2e-test-oauth-server-headers-9rtb2/test-oauth-svc: endpoints "test-oauth-svc" is forbidden: unable to create new content in namespace e2e-test-oauth-server-headers-9rtb2 because it is being terminated (7 times)
```

It's another occurrence of bug #1690588, i.e. AWS API throttling. pvc-295abc2e-d638-11e9-8b7e-128b681d70d4 cannot be detached from a node for a very long time due to "503 Unavailable: The service is unavailable" on the AWS side:

```
E0913 15:09:07.013793 1 goroutinemap.go:150] Operation for "delete-pvc-295abc2e-d638-11e9-8b7e-128b681d70d4[2d6a1f21-d638-11e9-91a4-12a554f072c8]" failed. No retries permitted until 2019-09-13 15:09:09.013760178 +0000 UTC m=+3043.320431610 (durationBeforeRetry 2s). Error: "Unavailable: The service is unavailable. Please try again shortly.\n\tstatus code: 503, request id: 39f528ac-39d1-444f-8a33-0a990f82854a"
```

[repeated with exp. backoff]

Then we got a couple of "503 RequestLimitExceeded" from AWS:

```
E0913 15:13:39.329286 1 nestedpendingoperations.go:278] Operation for "\"kubernetes.io/aws-ebs/aws://us-east-1b/vol-0f80fdc753a293715\"" failed. No retries permitted until 2019-09-13 15:15:41.329245368 +0000 UTC m=+3435.635916903 (durationBeforeRetry 2m2s). Error: "DetachVolume.Detach failed for volume \"pvc-295abc2e-d638-11e9-8b7e-128b681d70d4\" (UniqueName: \"kubernetes.io/aws-ebs/aws://us-east-1b/vol-0f80fdc753a293715\") on node \"ip-10-0-149-24.ec2.internal\" : error detaching EBS volume \"vol-0f80fdc753a293715\" from \"i-08346f3c57e94e400\": \"RequestLimitExceeded: Request limit exceeded.\\n\\tstatus code: 503, request id: a673aecb-ea06-4356-9858-2b3043dc3d85\""
```

[repeated with exp. backoff]

Finally, after ~15 minutes (!) the detach succeeded and the volume got detached and deleted:

```
I0913 15:22:39.157518 1 operation_generator.go:498] DetachVolume.Detach succeeded for volume "pvc-295abc2e-d638-11e9-8b7e-128b681d70d4" (UniqueName: "kubernetes.io/aws-ebs/aws://us-east-1b/vol-0f80fdc753a293715") on node "ip-10-0-149-24.ec2.internal"
```

All because we exhausted our API quota on AWS.
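[Editorial note: the growing durationBeforeRetry values in the logs above (2s, later 2m2s) come from the controllers' exponential backoff between retries. A minimal sketch of that pattern follows; retryWithBackoff is a hypothetical helper for illustration, not the actual goroutinemap/nestedpendingoperations code.]

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// retryWithBackoff retries op, doubling the wait after each failure up to
// maxDelay, similar to how the volume controllers space out detach/delete
// retries against a throttled AWS API.
func retryWithBackoff(op func() error, initialDelay, maxDelay time.Duration, maxAttempts int) error {
	delay := initialDelay
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		err := op()
		if err == nil {
			return nil
		}
		fmt.Printf("attempt %d failed: %v; no retries permitted for %v\n", attempt, err, delay)
		time.Sleep(delay)
		delay *= 2
		if delay > maxDelay {
			delay = maxDelay // cap the backoff rather than giving up
		}
	}
	return errors.New("retries exhausted")
}

func main() {
	calls := 0
	_ = retryWithBackoff(func() error {
		calls++
		if calls < 5 {
			// Simulated AWS throttling error, as seen in the logs above.
			return errors.New("RequestLimitExceeded: Request limit exceeded. status code: 503")
		}
		return nil
	}, 2*time.Second, 2*time.Minute, 10)
	fmt.Println("volume detached after", calls, "attempts")
}
```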
*** This bug has been marked as a duplicate of bug 1690588 ***