Bug 2002776 - Prometheus when installed on the cluster shouldn't report any alerts in firing state apart from Watchdog and AlertmanagerReceiversNotConfigured - fail due to ErrImagePull
Summary: Prometheus when installed on the cluster shouldn't report any alerts in firin...
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Monitoring
Version: 4.9
Hardware: Unspecified
OS: Unspecified
unspecified
medium
Target Milestone: ---
: 4.10.0
Assignee: Sunil Thaha
QA Contact: Junqi Zhao
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2021-09-09 17:05 UTC by Eran Cohen
Modified: 2021-11-12 19:25 UTC (History)
7 users (show)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-11-11 02:13:44 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)


Links
System ID Private Priority Status Summary Last Updated
Github openshift origin pull 26472 0 None open e2e: Wait for an individual imagestream import 2021-09-21 06:18:33 UTC
Github openshift origin pull 26580 0 None open Bug 2002776: fix intermittent test failure due to missing tools image 2021-11-09 06:38:22 UTC

Description Eran Cohen 2021-09-09 17:05:13 UTC
Description of problem:

This test fails ~90% of the runs in the single node periodic jobs

https://search.ci.openshift.org/?search=Prometheus+when+installed+on+the+cluster+shouldn%27t+report+any+alerts+in+firing+state+apart+from+Watchdog+and+AlertmanagerReceiversNotConfigured&maxAge=168h&context=-1&type=bug%2Bjunit&name=single&excludeName=&maxMatches=1&maxBytes=20971520&groupBy=job

This is the test output:

Sep  9 15:25:08.589: INFO: Creating new exec pod
Sep  9 15:30:08.657: INFO: Unexpected error occurred: timed out waiting for the condition
[AfterEach] [sig-instrumentation] Prometheus
  github.com/openshift/origin/test/extended/util/client.go:140
STEP: Collecting events from namespace "e2e-test-prometheus-x28lk".
STEP: Found 7 events.
Sep  9 15:30:08.666: INFO: At 0001-01-01 00:00:00 +0000 UTC - event for execpod: { } Scheduled: Successfully assigned e2e-test-prometheus-x28lk/execpod to ci-op-cft718lg-1f3e2-stj78-master-0
Sep  9 15:30:08.666: INFO: At 2021-09-09 15:25:11 +0000 UTC - event for execpod: {multus } AddedInterface: Add eth0 [10.128.0.101/23] from openshift-sdn
Sep  9 15:30:08.666: INFO: At 2021-09-09 15:25:11 +0000 UTC - event for execpod: {kubelet ci-op-cft718lg-1f3e2-stj78-master-0} Pulling: Pulling image "image-registry.openshift-image-registry.svc:5000/openshift/tools:latest"
Sep  9 15:30:08.666: INFO: At 2021-09-09 15:25:11 +0000 UTC - event for execpod: {kubelet ci-op-cft718lg-1f3e2-stj78-master-0} Failed: Failed to pull image "image-registry.openshift-image-registry.svc:5000/openshift/tools:latest": rpc error: code = Unknown desc = reading manifest latest in image-registry.openshift-image-registry.svc:5000/openshift/tools: manifest unknown: manifest unknown
Sep  9 15:30:08.666: INFO: At 2021-09-09 15:25:11 +0000 UTC - event for execpod: {kubelet ci-op-cft718lg-1f3e2-stj78-master-0} Failed: Error: ErrImagePull
Sep  9 15:30:08.666: INFO: At 2021-09-09 15:25:11 +0000 UTC - event for execpod: {kubelet ci-op-cft718lg-1f3e2-stj78-master-0} BackOff: Back-off pulling image "image-registry.openshift-image-registry.svc:5000/openshift/tools:latest"
Sep  9 15:30:08.666: INFO: At 2021-09-09 15:25:11 +0000 UTC - event for execpod: {kubelet ci-op-cft718lg-1f3e2-stj78-master-0} Failed: Error: ImagePullBackOff
Sep  9 15:30:08.673: INFO: POD      NODE                                 PHASE    GRACE  CONDITIONS
Sep  9 15:30:08.673: INFO: execpod  ci-op-cft718lg-1f3e2-stj78-master-0  Pending         [{Initialized True 0001-01-01 00:00:00 +0000 UTC 2021-09-09 15:25:08 +0000 UTC  } {Ready False 0001-01-01 00:00:00 +0000 UTC 2021-09-09 15:25:08 +0000 UTC ContainersNotReady containers with unready status: [agnhost-container]} {ContainersReady False 0001-01-01 00:00:00 +0000 UTC 2021-09-09 15:25:08 +0000 UTC ContainersNotReady containers with unready status: [agnhost-container]} {PodScheduled True 0001-01-01 00:00:00 +0000 UTC 2021-09-09 15:25:08 +0000 UTC  }]
Sep  9 15:30:08.673: INFO: 
Sep  9 15:30:08.680: INFO: skipping dumping cluster info - cluster too large
[AfterEach] [sig-instrumentation] Prometheus
  github.com/openshift/origin/test/extended/util/client.go:141
STEP: Destroying namespace "e2e-test-prometheus-x28lk" for this suite.
fail [k8s.io/kubernetes.1/test/e2e/framework/pod/resource.go:497]: failed to create new exec pod in namespace: e2e-test-prometheus-x28lk
Unexpected error:
    <*errors.errorString | 0xc00031aaf0>: {
        s: "timed out waiting for the condition",
    }
    timed out waiting for the condition
occurred



This is the log in crio:
Sep 09 15:30:10.073159 ci-op-cft718lg-1f3e2-stj78-master-0 crio[1878]: time="2021-09-09 15:30:10.073019922Z" level=info msg="Checking image status: image-registry.openshift-image-registry.svc:5000/openshift/tools:latest" id=984a3bac-2340-4501-8f74-bf631b362911 name=/runtime.v1alpha2.ImageService/ImageStatus
Sep 09 15:30:10.073531 ci-op-cft718lg-1f3e2-stj78-master-0 crio[1878]: time="2021-09-09 15:30:10.073417545Z" level=info msg="Image image-registry.openshift-image-registry.svc:5000/openshift/tools:latest not found" id=984a3bac-2340-4501-8f74-bf631b362911 name=/runtime.v1alpha2.ImageService/ImageStatus

Unsure why it can't find the image.

Version-Release number of selected component (if applicable):


How reproducible:

Fairly consistent in our payload acceptance periodic jobs
70% on the SNO upgrade tests
90% on the SNO nightly CI

Steps to Reproduce:
1. 
2.
3.

Actual results:


Expected results:


Additional info:
Seems to happen in other jobs as well
https://search.ci.openshift.org/?search=Prometheus+when+installed+on+the+cluster+shouldn%27t+report+any+alerts+in+firing+state+apart+from+Watchdog+and+AlertmanagerReceiversNotConfigured&maxAge=168h&context=-1&type=bug%2Bjunit&name=&excludeName=&maxMatches=1&maxBytes=20971520&groupBy=job

Comment 1 Simon Pasquier 2021-09-10 07:45:46 UTC
Eran, this looks like an issue with cri-o or with the image-registry service. Can you check with these teams and get their opinion? From a monitoring standpoint, there's not much we can do unfortunately.

Comment 2 Simon Pasquier 2021-09-10 08:02:04 UTC
Looking at the image-registry logs [0] from a failed run [1]:

** the first cri-o request to the image-registry is rejected because it requires authentication **

time="2021-09-05T20:09:20.174939122Z" level=warning msg="error authorizing context: authorization header required" go.version=go1.16.6 http.request.host="image-registry.openshift-image-registry.svc:5000" http.request.id=eff50b10-9940-44c9-af85-38b2910b57d6 http.request.method=GET http.request.remoteaddr="10.128.0.1:47562" http.request.uri=/v2/ http.request.useragent="cri-o/1.22.0-68.rhaos4.9.git011c10a.el8 go/go1.16.6 os/linux arch/amd64"
time="2021-09-05T20:09:20.175016628Z" level=info msg=response go.version=go1.16.6 http.request.host="image-registry.openshift-image-registry.svc:5000" http.request.id=5448d595-1a35-44ce-94f9-501ed60c6b60 http.request.method=GET http.request.remoteaddr="10.128.0.1:47562" http.request.uri=/v2/ http.request.useragent="cri-o/1.22.0-68.rhaos4.9.git011c10a.el8 go/go1.16.6 os/linux arch/amd64" http.response.contenttype="application/json; charset=utf-8" http.response.duration=1.3ms http.response.status=401 http.response.written=87

** then cri-o authenticates successfully  **

time="2021-09-05T20:09:20.196286265Z" level=info msg="response completed" go.version=go1.16.6 http.request.host="image-registry.openshift-image-registry.svc:5000" http.request.id=67359636-e1a2-47d4-8dee-2a0970a5ad89 http.request.method=GET http.request.remoteaddr="10.128.0.1:47564" http.request.uri="/openshift/token?account=serviceaccount&scope=repository%3Aopenshift%2Ftools%3Apull" http.request.useragent="cri-o/1.22.0-68.rhaos4.9.git011c10a.el8 go/go1.16.6 os/linux arch/amd64" http.response.contenttype=application/json http.response.duration=16.265351ms http.response.status=200 http.response.written=2609
time="2021-09-05T20:09:20.196349469Z" level=info msg=response go.version=go1.16.6 http.request.host="image-registry.openshift-image-registry.svc:5000" http.request.id=11d09ec7-4170-47e3-9153-0dcf67aff829 http.request.method=GET http.request.remoteaddr="10.128.0.1:47564" http.request.uri="/openshift/token?account=serviceaccount&scope=repository%3Aopenshift%2Ftools%3Apull" http.request.useragent="cri-o/1.22.0-68.rhaos4.9.git011c10a.el8 go/go1.16.6 os/linux arch/amd64" http.response.contenttype=application/json http.response.duration=16.368559ms http.response.status=200 http.response.written=2609

** but the request from cri-o to pull the image manifest gets a 404 response **

time="2021-09-05T20:09:20.207282411Z" level=info msg="authorized request" go.version=go1.16.6 http.request.host="image-registry.openshift-image-registry.svc:5000" http.request.id=860e25a8-f3ea-4ec1-b49b-fad4237dfbd6 http.request.method=GET http.request.remoteaddr="10.128.0.1:47566" http.request.uri=/v2/openshift/tools/manifests/latest http.request.useragent="cri-o/1.22.0-68.rhaos4.9.git011c10a.el8 go/go1.16.6 os/linux arch/amd64" openshift.auth.user="system:serviceaccount:e2e-test-prometheus-pvjr9:default" vars.name=openshift/tools vars.reference=latest
time="2021-09-05T20:09:20.215149716Z" level=error msg="response completed with error" err.code="manifest unknown" err.detail="unknown tag=latest" err.message="manifest unknown" go.version=go1.16.6 http.request.host="image-registry.openshift-image-registry.svc:5000" http.request.id=860e25a8-f3ea-4ec1-b49b-fad4237dfbd6 http.request.method=GET http.request.remoteaddr="10.128.0.1:47566" http.request.uri=/v2/openshift/tools/manifests/latest http.request.useragent="cri-o/1.22.0-68.rhaos4.9.git011c10a.el8 go/go1.16.6 os/linux arch/amd64" http.response.contenttype="application/json; charset=utf-8" http.response.duration=14.446912ms http.response.status=404 http.response.written=96 openshift.auth.user="system:serviceaccount:e2e-test-prometheus-pvjr9:default" vars.name=openshift/tools vars.reference=latest
time="2021-09-05T20:09:20.215231422Z" level=info msg=response go.version=go1.16.6 http.request.host="image-registry.openshift-image-registry.svc:5000" http.request.id=6d16989c-3257-474d-8141-a806ef63cae2 http.request.method=GET http.request.remoteaddr="10.128.0.1:47566" http.request.uri=/v2/openshift/tools/manifests/latest http.request.useragent="cri-o/1.22.0-68.rhaos4.9.git011c10a.el8 go/go1.16.6 os/linux arch/amd64" http.response.contenttype="application/json; charset=utf-8" http.response.duration=14.56172ms http.response.status=404 http.response.written=96


[0] https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.9-e2e-azure-upgrade-single-node/1434598504415629312/artifacts/e2e-azure-upgrade-single-node/gather-extra/artifacts/pods/openshift-image-registry_image-registry-6d4df58c7-9d5nq_registry.log
[1] https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.9-e2e-azure-upgrade-single-node/1434598504415629312

Comment 3 Eran Cohen 2021-09-12 15:02:36 UTC
@spasquie from the imagestreams.yaml (in the must-gather logs) it seems that the image wasn't there at the time:

cat openshift/image.openshift.io/imagestreams.yaml | egrep "dockerImageRepository|created" | tail -11
    dockerImageRepository: image-registry.openshift-image-registry.svc:5000/openshift/sso74-openshift-rhel8
      - created: "2021-09-05T19:59:33Z"
      - created: "2021-09-05T19:59:33Z"
    dockerImageRepository: image-registry.openshift-image-registry.svc:5000/openshift/tests
      - created: "2021-09-05T20:13:15Z"
    dockerImageRepository: image-registry.openshift-image-registry.svc:5000/openshift/tools
      - created: "2021-09-05T20:20:45Z"
    dockerImageRepository: image-registry.openshift-image-registry.svc:5000/openshift/ubi8-openjdk-11
      - created: "2021-09-05T19:59:33Z"
    dockerImageRepository: image-registry.openshift-image-registry.svc:5000/openshift/ubi8-openjdk-8
      - created: "2021-09-05T19:59:32Z"


For some reason, the image the test is trying to pull (image-registry.openshift-image-registry.svc:5000/openshift/tools) got created 20 minutes after all other images (8 minutes after the test pull attempt).


Any idea what might cause it?

Comment 7 Eran Cohen 2021-09-19 13:02:36 UTC
As you can see in comment #3 the image get created 20 minutes later.
I'm a bit lost here about what create the image and why it shows up 20 minutes after all other images.
@wking perhaps you have an idea what's going wrong here?


Note You need to log in before you can comment on or make changes to this bug.