Bug 1707495

Summary: [Feature:Prometheus][Conformance] Prometheus when installed on the cluster should report less than two alerts in firing or pending state [Suite:openshift/conformance/parallel/minimal]
Product: OpenShift Container Platform
Component: Storage
Version: 4.1.0
Hardware: Unspecified
OS: Unspecified
Status: CLOSED DUPLICATE
Severity: medium
Priority: unspecified
Target Milestone: ---
Target Release: 4.2.0
Reporter: Rich Megginson <rmeggins>
Assignee: Bradley Childs <bchilds>
QA Contact: Liang Xia <lxia>
CC: alegrand, anpicker, aos-bugs, aos-storage-staff, eparis, erooth, gblomqui, jokerman, jsafrane, kakkoyun, lcosic, maszulik, mfojtik, mloibl, mmccomas, pkrupa, shlao, sponnaga, sttts, surbania
Type: Bug
Last Closed: 2019-09-16 14:06:43 UTC

Description Rich Megginson 2019-05-07 15:59:48 UTC
Description of problem:

Test failure in https://storage.googleapis.com/origin-ci-test/pr-logs/pull/batch/pull-ci-openshift-origin-master-e2e-aws/8504/build-log.txt


[Feature:Prometheus][Conformance] Prometheus when installed on the cluster should report less than two alerts in firing or pending state [Suite:openshift/conformance/parallel/minimal]
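
For context, the test queries Prometheus for alerts in the firing or pending state and fails when two or more are reported. A query along these lines shows roughly what it checks (an illustrative sketch only; the exact expression lives in the openshift/origin test code):

```
# Illustrative: alerts currently pending or firing
ALERTS{alertstate=~"firing|pending"}

# The test effectively expects this count to stay below 2
count(ALERTS{alertstate=~"firing|pending"})
```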




Comment 1 Eric Paris 2019-05-07 16:32:58 UTC
This is only one report of failure, so I'm pushing to 4.2. If this is a major problem, please feel free to reassign to 4.1.

Comment 2 Frederic Branczyk 2019-05-07 21:58:03 UTC
I looked into this, and this is a legitimate failure, but not of the monitoring component. The alert that fired is the TargetDown alert for the kube-controller-manager-operator.

```
up{endpoint="https",instance="10.130.0.4:8443",job="metrics",namespace="openshift-kube-controller-manager-operator",pod="kube-controller-manager-operator-6f5b4b54b9-gjcc8",service="metrics"} 0
```

This also doesn't seem to be flaky behavior, at least not within this CI run: the target was down for the entire run and was never successfully scraped.

Although we haven't seen this flake (yet), I do believe that including the pending alert state in the smoke test is a bit too aggressive, so I opened https://github.com/openshift/origin/pull/22797 to relax this and prevent potential flakes. (Note that this is not what caused this failure but a separate realization: the above alert was truly firing, not pending, so my patch would not have made a difference. This is still something for the master team to look into.)
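
For clarity on the firing vs. pending distinction: a pending alert has started evaluating true but has not yet been active for its configured `for` duration, while a firing alert has. The two states can be inspected separately with queries like these (illustrative only):

```
# Alerts still within their "for" window
ALERTS{alertstate="pending"}

# Alerts that have been active long enough to fire
ALERTS{alertstate="firing"}
```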

Moving to master component due to above explanation.

Comment 3 Stefan Schimanski 2019-08-02 10:08:32 UTC
Old bug, and logs seem to be gone.

Comment 6 Maciej Szulik 2019-09-16 10:46:48 UTC
This time it looks like the culprit is the monitoring component:

ALERTS{alertname="KubePersistentVolumeErrors",alertstate="firing",endpoint="https-main",instance="10.131.2.5:8443",job="kube-state-metrics",namespace="openshift-monitoring",persistentvolume="pvc-295abc2e-d638-11e9-8b7e-128b681d70d4",phase="Failed",pod="kube-state-metrics-6b895d4967-m5lcn",service="kube-state-metrics",severity="critical"}

There are 3 of these.
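
For reference, KubePersistentVolumeErrors is one of the kubernetes-mixin alerting rules shipped with the monitoring stack; it fires when kube-state-metrics reports a PersistentVolume in a bad phase. The underlying signal can be checked with something like the following (a sketch; the exact rule expression may differ between releases):

```
# Illustrative: PVs that kube-state-metrics reports as Failed or Pending
kube_persistentvolume_status_phase{job="kube-state-metrics", phase=~"Failed|Pending"} > 0
```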

Comment 7 Lili Cosic 2019-09-16 11:22:42 UTC
(In reply to Maciej Szulik from comment #6)
> This time it looks like the culprit is monitoring component:
> 
> ALERTS{alertname="KubePersistentVolumeErrors",alertstate="firing",
> endpoint="https-main",instance="10.131.2.5:8443",job="kube-state-metrics",
> namespace="openshift-monitoring",persistentvolume="pvc-295abc2e-d638-11e9-
> 8b7e-128b681d70d4",phase="Failed",pod="kube-state-metrics-6b895d4967-m5lcn",
> service="kube-state-metrics",severity="critical"}
> 
> there are 3 of these.

What version of the cluster are you seeing this in?

Comment 8 Maciej Szulik 2019-09-16 11:27:10 UTC
From looking into https://00e9e64bacd4b298441212acee1bf5ce9c256f1a3d324bb0cd-apidata.googleusercontent.com/download/storage/v1/b/origin-ci-test/o/logs%2Fcanary-openshift-openshift-ansible-e2e-aws-scaleup-rhel7-4.2%2F71%2Fartifacts%2Frelease-latest%2Frelease-payload-latest%2Fimage-references?qk=AD5uMEuCfD_iaOddAMC4IiTC_1eU3Ze8f2jYa6PnD_z_6bRX6DEn62SANCPJ68Aqnb_V1EzwQFAeqde1AAaM326Eb9AIDu1hc_0qZvUSkV21tRVzHNJXn0jDXh9aK6E5QzZpCcnmlb_sT_2R4R45pGKvwaz2jhgAPFyzQHo0Cpx49M0mxIoIEGKCN2HG4avY0ggADqX19CoOp-h-g8XiMRtharLjtTtWtccUdI70dfTRdWDtnXKI_mYZVvEO2oj4WYRQLGewiZXaOg-uxi-PDQmGAo0y7mqo90pEGVixPtIZCWKQR2_tz_FavOArMhzXMhESvYV3cVahYV_N0XPtgLjc8twY1pb2fdHI_Jm3FE2GNm91hCnXrSeLNhas-GrmIKmjhRxaqWA9f5v2cQyIt87D3OJNyGC6A29zt-0UnMC6OFCEHyAJTWIrfq8wP8P8RfFiOqlSj4SRh3WTrgnTQBUX-0PmIthzMszrjBnir3FRTjiVgsH8azpePqjvJaItJovMPmPAWKdcrjbQThRb_kYpemfB00jNQB0L0sOT0glwiUS4-gF5z0RJSxVdKBipIctkmqf3XUOX86tj5qLPdnJexOG3cqgyrP2ZSUtFbYv085YDq902RoZSQ2Y0uTcLufTS0mzQQs68H1E7QkcvPUKQpjxJU8Rdx4lD2cfKN_PU3JU-3SH6zEC-Q-AqS1Lf2iKRv8ux8CtQ9U9Omjn39sPa3ZkS6cwjHYkAA6WEvqYO5yXujB6RXLjj3GEC-zJlM9FgbLlM0pWwR0-4mMPA-q_CUUoYKm_Sfk6ii3zgTXFVNOQuzv0Dt2ezWy56ItAZQRMzYCSG6vrCWqNPs3h2U1SE6AXwb48IzwGGDmWGVznQ8bS_i-dKVLEgI-X2vAhx3qopH_mJLnu9_OP14ST9a-1dnIIHWl4TbVn1JWQnlOnfmvSFhWCJese4CvQXk5Mr8l5e5d5oedYDJix2npA-Olg2WJqLRTG4Tg

it looks like it's a pretty fresh cluster:

  "metadata": {
    "name": "0.0.1-2019-09-13-135245",
    "creationTimestamp": "2019-09-13T13:52:45Z",
    "annotations": {
      "release.openshift.io/from-image-stream": "ci-op-ljmc427b/stable"
    }
  }

Comment 9 Lili Cosic 2019-09-16 11:39:34 UTC
Thanks, I will spin up a cluster, but I suspect the PV called "pvc-295abc2e-d638-11e9-8b7e-128b681d70d4" doesn't actually belong to the monitoring components. I will investigate.

Comment 10 Lili Cosic 2019-09-16 12:28:18 UTC
The above does not look like a problem with the monitoring components; after looking at the build log output, the volume itself seems to be the problem. As discussed, reassigning to the storage team. (Probably best to open a separate Bugzilla for this and close this one.)

```
Sep 13 14:58:30.470 I ns/openshift-machine-api machine/ci-op-ljmc427b-c5cba-8mgr7-master-2 Updated machine ci-op-ljmc427b-c5cba-8mgr7-master-2 (8 times)
Sep 13 14:58:35.786 I ns/openshift-machine-api machine/ci-op-ljmc427b-c5cba-8mgr7-master-0 Updated machine ci-op-ljmc427b-c5cba-8mgr7-master-0 (8 times)
Sep 13 14:58:38.136 I ns/openshift-machine-api machine/ci-op-ljmc427b-c5cba-8mgr7-worker-us-east-1a-centos-xdfgj Updated machine ci-op-ljmc427b-c5cba-8mgr7-worker-us-east-1a-centos-xdfgj (7 times)
Sep 13 14:58:43.707 I ns/openshift-machine-api machine/ci-op-ljmc427b-c5cba-8mgr7-master-1 Updated machine ci-op-ljmc427b-c5cba-8mgr7-master-1 (9 times)
Sep 13 14:58:46.624 I ns/openshift-machine-api machine/ci-op-ljmc427b-c5cba-8mgr7-worker-us-east-1a-centos-mr5b2 Updated machine ci-op-ljmc427b-c5cba-8mgr7-worker-us-east-1a-centos-mr5b2 (10 times)
Sep 13 14:58:55.681 W persistentvolume/pvc-dc151324-d636-11e9-8b7e-128b681d70d4 Error deleting EBS volume "vol-0ad5802fbce514f5c" since volume is currently attached to "i-08fa6cdc8babfd9c8"
Sep 13 14:59:38.973 W persistentvolume/pvc-ee15bec2-d636-11e9-8b7e-128b681d70d4 Error deleting EBS volume "vol-0deb1b52fb33a2bfb" since volume is currently attached to "i-08fa6cdc8babfd9c8"
Sep 13 15:02:17.644 W persistentvolume/pvc-4569cc7d-d637-11e9-8b7e-128b681d70d4 Error deleting EBS volume "vol-0e105c142a924de1c" since volume is currently attached to "i-08fa6cdc8babfd9c8"
Sep 13 15:03:13.860 W persistentvolume/pvc-6ac42871-d637-11e9-91a4-12a554f072c8 Error deleting EBS volume "vol-035a1e3c43dccae05" since volume is currently attached to "i-08346f3c57e94e400"
Sep 13 15:03:25.691 W endpoints/test-oauth-svc Failed to create endpoint for service e2e-test-oauth-expiration-2z6tv/test-oauth-svc: endpoints "test-oauth-svc" is forbidden: unable to create new content in namespace e2e-test-oauth-expiration-2z6tv because it is being terminated
Sep 13 15:03:25.739 W endpoints/test-oauth-svc Failed to create endpoint for service e2e-test-oauth-expiration-2z6tv/test-oauth-svc: endpoints "test-oauth-svc" is forbidden: unable to create new content in namespace e2e-test-oauth-expiration-2z6tv because it is being terminated (2 times)
Sep 13 15:03:25.749 W endpoints/test-oauth-svc Failed to create endpoint for service e2e-test-oauth-expiration-2z6tv/test-oauth-svc: endpoints "test-oauth-svc" is forbidden: unable to create new content in namespace e2e-test-oauth-expiration-2z6tv because it is being terminated (3 times)
Sep 13 15:03:25.764 W endpoints/test-oauth-svc Failed to create endpoint for service e2e-test-oauth-expiration-2z6tv/test-oauth-svc: endpoints "test-oauth-svc" is forbidden: unable to create new content in namespace e2e-test-oauth-expiration-2z6tv because it is being terminated (4 times)
Sep 13 15:03:25.809 W endpoints/test-oauth-svc Failed to create endpoint for service e2e-test-oauth-expiration-2z6tv/test-oauth-svc: endpoints "test-oauth-svc" is forbidden: unable to create new content in namespace e2e-test-oauth-expiration-2z6tv because it is being terminated (5 times)
Sep 13 15:03:25.887 W endpoints/test-oauth-svc Failed to create endpoint for service e2e-test-oauth-expiration-2z6tv/test-oauth-svc: endpoints "test-oauth-svc" is forbidden: unable to create new content in namespace e2e-test-oauth-expiration-2z6tv because it is being terminated (6 times)
Sep 13 15:03:26.046 W endpoints/test-oauth-svc Failed to create endpoint for service e2e-test-oauth-expiration-2z6tv/test-oauth-svc: endpoints "test-oauth-svc" is forbidden: unable to create new content in namespace e2e-test-oauth-expiration-2z6tv because it is being terminated (7 times)
Sep 13 15:03:26.357 W endpoints/test-oauth-svc Failed to create endpoint for service e2e-test-oauth-expiration-2z6tv/test-oauth-svc: endpoints "test-oauth-svc" is forbidden: unable to create new content in namespace e2e-test-oauth-expiration-2z6tv because it is being terminated (8 times)
Sep 13 15:03:26.603 W endpoints/test-oauth-svc Failed to create endpoint for service e2e-test-oauth-expiration-2z6tv/test-oauth-svc: endpoints "test-oauth-svc" is forbidden: unable to create new content in namespace e2e-test-oauth-expiration-2z6tv because it is being terminated (9 times)
Sep 13 15:04:09.438 W persistentvolume/pvc-805c23fb-d637-11e9-8b7e-128b681d70d4 Unavailable: The service is unavailable. Please try again shortly.\n	status code: 503, request id: e8286310-9787-4889-86e6-912a1fe08783
Sep 13 15:06:43.359 W persistentvolume/pvc-33a20e7e-d637-11e9-8b7e-128b681d70d4 Error deleting EBS volume "vol-09a49ef7f8459a998" since volume is currently attached to "i-08346f3c57e94e400"
Sep 13 15:06:58.691 W persistentvolume/pvc-fb5c7578-d637-11e9-91a4-12a554f072c8 Error deleting EBS volume "vol-054e27747eaf35c0d" since volume is currently attached to "i-08fa6cdc8babfd9c8"
Sep 13 15:08:12.807 W persistentvolume/pvc-ebc69497-d637-11e9-91a4-12a554f072c8 Unavailable: The service is unavailable. Please try again shortly.\n	status code: 503, request id: ae00dfe8-3a80-4341-aa2f-20ff8ad9e187
Sep 13 15:08:29.584 I ns/openshift-machine-api machine/ci-op-ljmc427b-c5cba-8mgr7-master-0 Updated machine ci-op-ljmc427b-c5cba-8mgr7-master-0 (9 times)
Sep 13 15:08:34.805 I ns/openshift-machine-api machine/ci-op-ljmc427b-c5cba-8mgr7-worker-us-east-1a-centos-xdfgj Updated machine ci-op-ljmc427b-c5cba-8mgr7-worker-us-east-1a-centos-xdfgj (8 times)
Sep 13 15:08:37.443 I ns/openshift-machine-api machine/ci-op-ljmc427b-c5cba-8mgr7-worker-us-east-1b-centos-rpgz6 Updated machine ci-op-ljmc427b-c5cba-8mgr7-worker-us-east-1b-centos-rpgz6 (10 times)
Sep 13 15:08:38.620 W persistentvolume/pvc-295abc2e-d638-11e9-8b7e-128b681d70d4 Error deleting EBS volume "vol-0f80fdc753a293715" since volume is currently attached to "i-08346f3c57e94e400"
Sep 13 15:08:40.723 I ns/openshift-machine-api machine/ci-op-ljmc427b-c5cba-8mgr7-master-2 Updated machine ci-op-ljmc427b-c5cba-8mgr7-master-2 (9 times)
Sep 13 15:08:45.293 I ns/openshift-machine-api machine/ci-op-ljmc427b-c5cba-8mgr7-worker-us-east-1a-centos-mr5b2 Updated machine ci-op-ljmc427b-c5cba-8mgr7-worker-us-east-1a-centos-mr5b2 (11 times)
Sep 13 15:08:52.137 I ns/openshift-machine-api machine/ci-op-ljmc427b-c5cba-8mgr7-master-1 Updated machine ci-op-ljmc427b-c5cba-8mgr7-master-1 (10 times)
Sep 13 15:09:20.005 W persistentvolume/pvc-17f5a7d9-d638-11e9-ab9a-0a3cfc05fb74 Error deleting EBS volume "vol-0c7c003d396769c25" since volume is currently attached to "i-08fa6cdc8babfd9c8"
Sep 13 15:10:48.679 W persistentvolume/pvc-77d461cf-d638-11e9-ab9a-0a3cfc05fb74 Error deleting EBS volume "vol-0a4c97a4c3a33981e" since volume is currently attached to "i-08346f3c57e94e400"
Sep 13 15:10:49.036 W endpoints/test-oauth-svc Failed to create endpoint for service e2e-test-oauth-server-headers-9rtb2/test-oauth-svc: endpoints "test-oauth-svc" is forbidden: unable to create new content in namespace e2e-test-oauth-server-headers-9rtb2 because it is being terminated
Sep 13 15:10:49.043 W endpoints/test-oauth-svc Failed to create endpoint for service e2e-test-oauth-server-headers-9rtb2/test-oauth-svc: endpoints "test-oauth-svc" is forbidden: unable to create new content in namespace e2e-test-oauth-server-headers-9rtb2 because it is being terminated (2 times)
Sep 13 15:10:49.051 W endpoints/test-oauth-svc Failed to create endpoint for service e2e-test-oauth-server-headers-9rtb2/test-oauth-svc: endpoints "test-oauth-svc" is forbidden: unable to create new content in namespace e2e-test-oauth-server-headers-9rtb2 because it is being terminated (3 times)
Sep 13 15:10:49.073 W endpoints/test-oauth-svc Failed to create endpoint for service e2e-test-oauth-server-headers-9rtb2/test-oauth-svc: endpoints "test-oauth-svc" is forbidden: unable to create new content in namespace e2e-test-oauth-server-headers-9rtb2 because it is being terminated (4 times)
Sep 13 15:10:49.132 W endpoints/test-oauth-svc Failed to create endpoint for service e2e-test-oauth-server-headers-9rtb2/test-oauth-svc: endpoints "test-oauth-svc" is forbidden: unable to create new content in namespace e2e-test-oauth-server-headers-9rtb2 because it is being terminated (5 times)
Sep 13 15:10:49.227 W endpoints/test-oauth-svc Failed to create endpoint for service e2e-test-oauth-server-headers-9rtb2/test-oauth-svc: endpoints "test-oauth-svc" is forbidden: unable to create new content in namespace e2e-test-oauth-server-headers-9rtb2 because it is being terminated (6 times)
Sep 13 15:10:49.378 W endpoints/test-oauth-svc Failed to create endpoint for service e2e-test-oauth-server-headers-9rtb2/test-oauth-svc: endpoints "test-oauth-svc" is forbidden: unable to create new content in namespace e2e-test-oauth-server-headers-9rtb2 because it is being terminated (7 times)

```

Comment 11 Jan Safranek 2019-09-16 14:06:43 UTC
It's another occurrence of bug #1690588, i.e. AWS API throttling.

pvc-295abc2e-d638-11e9-8b7e-128b681d70d4 could not be detached from its node for a very long time due to "503 Unavailable: The service is unavailable" errors on the AWS side:

E0913 15:09:07.013793       1 goroutinemap.go:150] Operation for "delete-pvc-295abc2e-d638-11e9-8b7e-128b681d70d4[2d6a1f21-d638-11e9-91a4-12a554f072c8]" failed. No retries permitted until 2019-09-13 15:09:09.013760178 +0000 UTC m=+3043.320431610 (durationBeforeRetry 2s). Error: "Unavailable: The service is unavailable. Please try again shortly.\n\tstatus code: 503, request id: 39f528ac-39d1-444f-8a33-0a990f82854a"

[repeated with exp. backoff]

Then we got a couple of "503 RequestLimitExceeded" responses from AWS:
E0913 15:13:39.329286       1 nestedpendingoperations.go:278] Operation for "\"kubernetes.io/aws-ebs/aws://us-east-1b/vol-0f80fdc753a293715\"" failed. No retries permitted until 2019-09-13 15:15:41.329245368 +0000 UTC m=+3435.635916903 (durationBeforeRetry 2m2s). Error: "DetachVolume.Detach failed for volume \"pvc-295abc2e-d638-11e9-8b7e-128b681d70d4\" (UniqueName: \"kubernetes.io/aws-ebs/aws://us-east-1b/vol-0f80fdc753a293715\") on node \"ip-10-0-149-24.ec2.internal\" : error detaching EBS volume \"vol-0f80fdc753a293715\" from \"i-08346f3c57e94e400\": \"RequestLimitExceeded: Request limit exceeded.\\n\\tstatus code: 503, request id: a673aecb-ea06-4356-9858-2b3043dc3d85\""

[repeated with exp. backoff]

Finally, after ~15 minutes (!), the detach succeeded and the volume was detached and deleted:

I0913 15:22:39.157518       1 operation_generator.go:498] DetachVolume.Detach succeeded for volume "pvc-295abc2e-d638-11e9-8b7e-128b681d70d4" (UniqueName: "kubernetes.io/aws-ebs/aws://us-east-1b/vol-0f80fdc753a293715") on node "ip-10-0-149-24.ec2.internal" 

All because we exhausted our API quota on AWS.
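
(Side note for future triage: when the controller-manager metrics are scraped, slow detaches like this one also show up in the volume operation histograms. A query along these lines would surface them; the metric and label names here are taken from the in-tree volume plugins of that era and are an assumption, not verified against this cluster.)

```
# Illustrative: 99th percentile EBS detach latency over the last 5 minutes
histogram_quantile(0.99,
  sum(rate(storage_operation_duration_seconds_bucket{
    volume_plugin="kubernetes.io/aws-ebs", operation_name="volume_detach"}[5m]))
  by (le))
```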

*** This bug has been marked as a duplicate of bug 1690588 ***