Bug 1753012 - [GCP] e2e failure: PV alerts firing
Summary: [GCP] e2e failure: PV alerts firing
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Storage
Version: 4.2.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.2.0
Assignee: Hemant Kumar
QA Contact: Liang Xia
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2019-09-17 19:07 UTC by David Eads
Modified: 2019-10-16 06:41 UTC
CC List: 5 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-10-16 06:41:28 UTC
Target Upstream Version:
Embargoed:


Attachments: None


Links
Github openshift origin pull 23835 (closed): BUG 1753012: UPSTREAM: 82830: Do not query the cloud if PV has all the labels (last updated 2020-08-24 19:25:46 UTC)
Red Hat Product Errata RHBA-2019:2922 (last updated 2019-10-16 06:41:41 UTC)

Description David Eads 2019-09-17 19:07:38 UTC
On 2 of 10 e2e runs, an alert fired for persistentvolumes POST latency. We're seeing p99 spikes to 1.5 seconds (normal is less than 5 milliseconds or so).

https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/canary-openshift-ocp-installer-e2e-gcp-4.2/309 is a good example.

The p99 of admission POSTs is also high, which points to storage admission as a good candidate to start with.


1. download https://gcsweb-ci.svc.ci.openshift.org/gcs/origin-ci-test/logs/canary-openshift-ocp-installer-e2e-gcp-4.2/309/artifacts/e2e-gcp/metrics/

2. run a hacky bash script like:

#!/bin/bash

set -euo pipefail

# $1 is the URL of the metrics tarball gathered by the e2e run.
url=$1
tmp=/tmp/prom1

PORT=${PORT:-9090}

rm -rf "$tmp" || true
mkdir "$tmp"
curl "$url" | tar xvzf - -C "$tmp"
echo "open http://localhost:${PORT}"
prometheus --storage.tsdb.path="$tmp" --config.file ~/projects/prometheus.yml --storage.tsdb.retention=1y --web.listen-address=localhost:${PORT}
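
Once the local Prometheus is up, a query along the following lines surfaces the latency spike. This is only a sketch: the metric and label names (apiserver_request_duration_seconds_bucket, verb, resource) are assumptions based on the standard apiserver request-duration histogram, so adjust them to whatever the gathered metrics actually expose.

#!/bin/bash
# Sketch: ask the local Prometheus HTTP API for the p99 latency of
# persistentvolumes POST requests. Metric and label names are assumptions,
# not confirmed from the gathered data.
PORT=${PORT:-9090}
query='histogram_quantile(0.99, sum by (le) (rate(apiserver_request_duration_seconds_bucket{verb="POST",resource="persistentvolumes"}[5m])))'
curl -sG "http://localhost:${PORT}/api/v1/query" --data-urlencode "query=${query}"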

Comment 1 David Eads 2019-09-17 19:08:54 UTC
You could gather more information using something like https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/apiserver/pkg/storage/cacher/cacher.go#L615-L619

Comment 7 David Eads 2019-09-20 11:45:20 UTC
This is flaking hard on GCP.  We need to either tolerate the extra alert (not great) or address the problem.  Is the admission change that skips the cloud query when the PV already has labels not sufficient?  Did you add tracing to confirm this is the cause (it was a good guess, but not complete proof)?
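
One way to look for that tracing evidence is to grep the gathered kube-apiserver logs for long traces on persistentvolumes creates. This is only a sketch; the artifacts/ path is an assumption about where the run's logs end up.

# Grep for apiserver traces that flag slow persistentvolumes creates.
# The artifacts/ directory layout is an assumption; point this at wherever
# the run's kube-apiserver logs were collected.
grep -r 'Trace\[' artifacts/ | grep -i persistentvolumes | grep -i create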

This flaked in 4 of the last 10 tests on GCP: https://testgrid.k8s.io/redhat-openshift-release-informing#redhat-canary-openshift-origin-installer-e2e-gcp-4.2&sort-by-failures and https://testgrid.k8s.io/redhat-openshift-release-informing#redhat-canary-openshift-ocp-installer-e2e-gcp-4.2&sort-by-failures&sort-by-failures=

Comment 10 Liang Xia 2019-09-23 10:02:24 UTC
Current status:

For origin,
On 9/19, 2 failures
On 9/20, 4 failures
On 9/21, 1 failure
On 9/22, 1 failure

For OCP,
On 9/19, 4 failures
On 9/20, 3 failures
On 9/21, 2 failures
On 9/22, 0 failures

QE will watch it for another 1 or 2 days to decide whether to move the bug back to ASSIGNED or to VERIFIED.

Comment 11 Liang Xia 2019-09-24 08:49:53 UTC
For origin,
On 9/23, 1 failure

For OCP,
On 9/23, 0 failure

Comment 12 Liang Xia 2019-09-26 05:59:03 UTC
For OCP,
On 9/24, 1 failure
On 9/25, 0 failure

There are fewer alerts now.  Moving the bug to VERIFIED.

Comment 13 errata-xmlrpc 2019-10-16 06:41:28 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:2922

