Bug 1753012 - [GCP] e2e failure: PV alerts firing
Summary: [GCP] e2e failure: PV alerts firing
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Storage
Version: 4.2.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.2.0
Assignee: Hemant Kumar
QA Contact: Liang Xia
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2019-09-17 19:07 UTC by David Eads
Modified: 2019-10-16 06:41 UTC
CC List: 5 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-10-16 06:41:28 UTC
Target Upstream Version:
Embargoed:


Attachments: None


Links
Github openshift origin pull 23835 (closed): BUG 1753012: UPSTREAM: 82830: Do not query the cloud if PV has all the labels (last updated 2020-08-24 19:25:46 UTC)
Red Hat Product Errata RHBA-2019:2922 (last updated 2019-10-16 06:41:41 UTC)

Description David Eads 2019-09-17 19:07:38 UTC
On 2 of 10 e2e runs, an alert fired for persistentvolumes POST latency. We're seeing p99 spikes to 1.5 seconds (normal is less than 5 milliseconds or so).

https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/canary-openshift-ocp-installer-e2e-gcp-4.2/309 is a good example.

The p99 of admission POSTs is also high, which points to storage admission as a good candidate to start with.


1. download https://gcsweb-ci.svc.ci.openshift.org/gcs/origin-ci-test/logs/canary-openshift-ocp-installer-e2e-gcp-4.2/309/artifacts/e2e-gcp/metrics/

2. run a hacky bash script like:

#!/bin/bash

set -euo pipefail

# $1 is the URL of the metrics tarball gathered by the e2e run.
url=$1
tmp=/tmp/prom1

PORT=${PORT:-9090}

rm -rf "$tmp" || true
mkdir "$tmp"
curl "$url" | tar xvzf - -C "$tmp"
echo "open http://localhost:${PORT}"
prometheus --storage.tsdb.path="$tmp" --config.file ~/projects/prometheus.yml --storage.tsdb.retention=1y --web.listen-address=localhost:${PORT}
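
Once the local Prometheus is up, a query along the following lines surfaces the latency spike. This is only a sketch: the metric and label names (apiserver_request_duration_seconds_bucket, verb, resource) are assumptions based on the standard apiserver request-duration histogram, so adjust them to whatever the gathered metrics actually expose.

#!/bin/bash
# Sketch: ask the local Prometheus HTTP API for the p99 latency of
# persistentvolumes POST requests. Metric and label names are assumptions,
# not confirmed from the gathered data.
PORT=${PORT:-9090}
query='histogram_quantile(0.99, sum by (le) (rate(apiserver_request_duration_seconds_bucket{verb="POST",resource="persistentvolumes"}[5m])))'
curl -sG "http://localhost:${PORT}/api/v1/query" --data-urlencode "query=${query}"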

Comment 1 David Eads 2019-09-17 19:08:54 UTC
You could gather more information using something like https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/apiserver/pkg/storage/cacher/cacher.go#L615-L619

Comment 7 David Eads 2019-09-20 11:45:20 UTC
This is flaking hard on GCP.  We need to either tolerate the extra alert (not great) or address the problem.  Is the admission change that skips the cloud query when the PV already has labels not sufficient?  Did you add tracing to confirm this is the cause (it was a good guess, but not complete proof)?
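
One way to look for that tracing evidence is to grep the gathered kube-apiserver logs for long traces on persistentvolumes creates. This is only a sketch; the artifacts/ path is an assumption about where the run's logs end up.

# Grep for apiserver traces that flag slow persistentvolumes creates.
# The artifacts/ directory layout is an assumption; point this at wherever
# the run's kube-apiserver logs were collected.
grep -r 'Trace\[' artifacts/ | grep -i persistentvolumes | grep -i create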

This flaked in 4 of the last 10 tests on GCP: https://testgrid.k8s.io/redhat-openshift-release-informing#redhat-canary-openshift-origin-installer-e2e-gcp-4.2&sort-by-failures and https://testgrid.k8s.io/redhat-openshift-release-informing#redhat-canary-openshift-ocp-installer-e2e-gcp-4.2&sort-by-failures&sort-by-failures=

Comment 10 Liang Xia 2019-09-23 10:02:24 UTC
Current status:

For origin,
On 9/19, 2 failures
On 9/20, 4 failures
On 9/21, 1 failure
On 9/22, 1 failure

For OCP,
On 9/19, 4 failures
On 9/20, 3 failures
On 9/21, 2 failures
On 9/22, 0 failures

QE will watch it for another 1 or 2 days to decide whether to move the bug back to ASSIGNED or to VERIFIED.

Comment 11 Liang Xia 2019-09-24 08:49:53 UTC
For origin,
On 9/23, 1 failure

For OCP,
On 9/23, 0 failure

Comment 12 Liang Xia 2019-09-26 05:59:03 UTC
For OCP,
On 9/24, 1 failure
On 9/25, 0 failure

There are fewer alerts now.  Moving the bug to VERIFIED.

Comment 13 errata-xmlrpc 2019-10-16 06:41:28 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:2922

