Bug 1753012

Summary: [GCP] e2e failure: PV alerts firing
Product: OpenShift Container Platform
Component: Storage
Version: 4.2.0
Target Release: 4.2.0
Reporter: David Eads <deads>
Assignee: Hemant Kumar <hekumar>
QA Contact: Liang Xia <lxia>
Status: CLOSED ERRATA
Severity: high
Priority: high
CC: aos-bugs, aos-storage-staff, bbennett, hekumar, jsafrane
Hardware: Unspecified
OS: Unspecified
Last Closed: 2019-10-16 06:41:28 UTC
Type: Bug

Description David Eads 2019-09-17 19:07:38 UTC
On 2 of 10 e2e runs, an alert fired for persistentvolumes POST latency. We're seeing p99 spikes to 1.5 seconds (normal is under 5 milliseconds or so).

https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/canary-openshift-ocp-installer-e2e-gcp-4.2/309 is a good example.

The p99 of admission POSTs is also high, which points to storage admission as a good candidate to start with.


1. Download the metrics archive from https://gcsweb-ci.svc.ci.openshift.org/gcs/origin-ci-test/logs/canary-openshift-ocp-installer-e2e-gcp-4.2/309/artifacts/e2e-gcp/metrics/

2. Run a hacky bash script like the following:

#!/bin/bash

set -euo pipefail

# URL of the metrics tarball gathered from the CI run
url=$1
tmp=/tmp/prom1

PORT=${PORT:-9090}

# Unpack the Prometheus TSDB snapshot into a clean scratch directory
rm -rf "$tmp"
mkdir "$tmp"
curl "$url" | tar xvzf - -C "$tmp"

echo "open http://localhost:${PORT}"
# Replay the snapshot locally; point --config.file at any minimal prometheus.yml
# you have on hand (note the flag is --web.listen-address, with a hyphen)
prometheus --storage.tsdb.path="$tmp" --config.file ~/projects/prometheus.yml --storage.tsdb.retention=1y --web.listen-address=localhost:${PORT}
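
Once the local Prometheus is replaying the snapshot, a query along these lines should reproduce the spike. The metric name is an assumption based on the Kubernetes 1.14-era apiserver instrumentation; adjust if the snapshot only has the older apiserver_request_latencies histogram:

histogram_quantile(0.99, sum(rate(apiserver_request_duration_seconds_bucket{verb="POST",resource="persistentvolumes"}[5m])) by (le))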

Comment 1 David Eads 2019-09-17 19:08:54 UTC
You could gather more information using something like https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/apiserver/pkg/storage/cacher/cacher.go#L615-L619
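
For illustration, the pattern in that region is roughly the following. This is a hedged sketch using k8s.io/utils/trace; the function, message strings, and threshold here are illustrative, not the actual cacher code:

package storage

import (
	"time"

	utiltrace "k8s.io/utils/trace"
)

// createPV wraps a create call with trace steps so slow phases show up in
// the apiserver log. doCreate stands in for the real admission/storage path.
func createPV(doCreate func() error) error {
	trace := utiltrace.New("Create /api/v1/persistentvolumes")
	// Only emit the trace when the whole operation exceeds 500ms.
	defer trace.LogIfLong(500 * time.Millisecond)

	trace.Step("About to run admission and write to etcd")
	err := doCreate()
	trace.Step("Write finished")
	return err
}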

Comment 7 David Eads 2019-09-20 11:45:20 UTC
This is flaking hard on GCP.  We need to either tolerate the extra alert (not great) or address the problem.  Is the admission change that skips the cloud query when the PV already has labels not sufficient?  Did you add tracing to confirm this is the cause (it was a good guess, but not complete proof)?
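
For context, the kind of short-circuit being discussed looks roughly like this. A hedged sketch only; the helper names are illustrative, not the actual PV label admission plugin code:

package storage

import v1 "k8s.io/api/core/v1"

// admitPV fills in topology labels on a PersistentVolume, but skips the
// (potentially slow) cloud-provider lookup when a zone label is already set.
// fetchLabels stands in for the real cloud API call.
func admitPV(pv *v1.PersistentVolume, fetchLabels func(*v1.PersistentVolume) (map[string]string, error)) error {
	if _, ok := pv.Labels[v1.LabelZoneFailureDomain]; ok {
		return nil // labels already present: no cloud query, no latency spike
	}
	labels, err := fetchLabels(pv) // the expensive call the change avoids
	if err != nil {
		return err
	}
	if pv.Labels == nil {
		pv.Labels = map[string]string{}
	}
	for k, v := range labels {
		pv.Labels[k] = v
	}
	return nil
}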

This flaked in 4 of the last 10 tests on GCP: https://testgrid.k8s.io/redhat-openshift-release-informing#redhat-canary-openshift-origin-installer-e2e-gcp-4.2&sort-by-failures and https://testgrid.k8s.io/redhat-openshift-release-informing#redhat-canary-openshift-ocp-installer-e2e-gcp-4.2&sort-by-failures&sort-by-failures=

Comment 10 Liang Xia 2019-09-23 10:02:24 UTC
Current status:

For origin,
On 9/19, 2 failures
On 9/20, 4 failures
On 9/21, 1 failure
On 9/22, 1 failure

For OCP,
On 9/19, 4 failures
On 9/20, 3 failures
On 9/21, 2 failures
On 9/22, 0 failures

QE will watch it for another 1 or 2 days to decide whether to move the bug back or mark it verified.

Comment 11 Liang Xia 2019-09-24 08:49:53 UTC
For origin,
On 9/23, 1 failure

For OCP,
On 9/23, 0 failures

Comment 12 Liang Xia 2019-09-26 05:59:03 UTC
For OCP,
On 9/24, 1 failure
On 9/25, 0 failures

There are fewer alerts now.  Moving the bug to verified.

Comment 13 errata-xmlrpc 2019-10-16 06:41:28 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:2922