On 2 of 10 e2e runs, an alert has fired for persistentvolumes POST latency: p99 spikes to 1.5 seconds (normal is under ~5 milliseconds). https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/canary-openshift-ocp-installer-e2e-gcp-4.2/309 is a good example. The p99 of admission POSTs is also high, which points to storage admission as a good place to start.

To dig in locally:

1. Download https://gcsweb-ci.svc.ci.openshift.org/gcs/origin-ci-test/logs/canary-openshift-ocp-installer-e2e-gcp-4.2/309/artifacts/e2e-gcp/metrics/

2. Run a hacky bash script like:

#!/bin/bash
set -euo pipefail

url=$1
tmp=/tmp/prom1
PORT=${PORT:-9090}

rm -rf "$tmp"
mkdir "$tmp"

# Unpack the captured metrics into a local Prometheus data dir and replay them.
curl "$url" | tar xvzf - -C "$tmp"

echo "open http://localhost:${PORT}"
prometheus --storage.tsdb.path="$tmp" \
  --config.file ~/projects/prometheus.yml \
  --storage.tsdb.retention=1y \
  --web.listen-address="localhost:${PORT}"
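Once the replayed Prometheus is up, the p99 the alert is based on can be pulled straight from its HTTP API. A minimal sketch of the query, assuming the k8s 1.14-era metric name apiserver_request_duration_seconds_bucket (older releases expose apiserver_request_latencies instead, reported in microseconds), so adjust to whatever is actually in the captured data:

```shell
#!/bin/bash
# Build the PromQL expression for p99 POST latency on persistentvolumes.
# The metric and label names here are assumptions -- verify them against
# the metric names present in the replayed data before relying on this.
q='histogram_quantile(0.99, sum(rate(apiserver_request_duration_seconds_bucket{verb="POST",resource="persistentvolumes"}[5m])) by (le))'
echo "$q"

# With the local instance from the script above running, query it directly:
#   curl -G "http://localhost:${PORT:-9090}/api/v1/query" --data-urlencode "query=$q"
```

histogram_quantile over sum(rate(...)) by (le) is the standard way to recover a quantile from Prometheus histogram buckets, which is also roughly how the alerting rule computes the 1.5s p99 in the first place.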
You could gather more information using something like https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/apiserver/pkg/storage/cacher/cacher.go#L615-L619
This is flaking hard in GCP. We need to either tolerate the extra alert (not great) or address the problem. Is the admission change to skip the query when labels are already present not sufficient? Did you add tracing to confirm this is the cause (it was a good guess, but not complete proof)? It flaked in 4 of the last 10 tests on GCP: https://testgrid.k8s.io/redhat-openshift-release-informing#redhat-canary-openshift-origin-installer-e2e-gcp-4.2&sort-by-failures and https://testgrid.k8s.io/redhat-openshift-release-informing#redhat-canary-openshift-ocp-installer-e2e-gcp-4.2&sort-by-failures&sort-by-failures=
Current status:

For origin:
On 9/19, 2 failures
On 9/20, 4 failures
On 9/21, 1 failure
On 9/22, 1 failure

For OCP:
On 9/19, 4 failures
On 9/20, 3 failures
On 9/21, 2 failures
On 9/22, 0 failures

QE will watch it for another 1 or 2 days to decide whether to move the bug back or to VERIFIED.
For origin:
On 9/23, 1 failure

For OCP:
On 9/23, 0 failures
For OCP:
On 9/24, 1 failure
On 9/25, 0 failures

There are fewer alerts now. Moving the bug to VERIFIED.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2019:2922