Description of problem: This test fails ~90% of the runs in the single node periodic jobs https://search.ci.openshift.org/?search=Prometheus+when+installed+on+the+cluster+shouldn%27t+report+any+alerts+in+firing+state+apart+from+Watchdog+and+AlertmanagerReceiversNotConfigured&maxAge=168h&context=-1&type=bug%2Bjunit&name=single&excludeName=&maxMatches=1&maxBytes=20971520&groupBy=job This is the test output: Sep 9 15:25:08.589: INFO: Creating new exec pod Sep 9 15:30:08.657: INFO: Unexpected error occurred: timed out waiting for the condition [AfterEach] [sig-instrumentation] Prometheus github.com/openshift/origin/test/extended/util/client.go:140 STEP: Collecting events from namespace "e2e-test-prometheus-x28lk". STEP: Found 7 events. Sep 9 15:30:08.666: INFO: At 0001-01-01 00:00:00 +0000 UTC - event for execpod: { } Scheduled: Successfully assigned e2e-test-prometheus-x28lk/execpod to ci-op-cft718lg-1f3e2-stj78-master-0 Sep 9 15:30:08.666: INFO: At 2021-09-09 15:25:11 +0000 UTC - event for execpod: {multus } AddedInterface: Add eth0 [10.128.0.101/23] from openshift-sdn Sep 9 15:30:08.666: INFO: At 2021-09-09 15:25:11 +0000 UTC - event for execpod: {kubelet ci-op-cft718lg-1f3e2-stj78-master-0} Pulling: Pulling image "image-registry.openshift-image-registry.svc:5000/openshift/tools:latest" Sep 9 15:30:08.666: INFO: At 2021-09-09 15:25:11 +0000 UTC - event for execpod: {kubelet ci-op-cft718lg-1f3e2-stj78-master-0} Failed: Failed to pull image "image-registry.openshift-image-registry.svc:5000/openshift/tools:latest": rpc error: code = Unknown desc = reading manifest latest in image-registry.openshift-image-registry.svc:5000/openshift/tools: manifest unknown: manifest unknown Sep 9 15:30:08.666: INFO: At 2021-09-09 15:25:11 +0000 UTC - event for execpod: {kubelet ci-op-cft718lg-1f3e2-stj78-master-0} Failed: Error: ErrImagePull Sep 9 15:30:08.666: INFO: At 2021-09-09 15:25:11 +0000 UTC - event for execpod: {kubelet ci-op-cft718lg-1f3e2-stj78-master-0} BackOff: Back-off pulling image "image-registry.openshift-image-registry.svc:5000/openshift/tools:latest" Sep 9 15:30:08.666: INFO: At 2021-09-09 15:25:11 +0000 UTC - event for execpod: {kubelet ci-op-cft718lg-1f3e2-stj78-master-0} Failed: Error: ImagePullBackOff Sep 9 15:30:08.673: INFO: POD NODE PHASE GRACE CONDITIONS Sep 9 15:30:08.673: INFO: execpod ci-op-cft718lg-1f3e2-stj78-master-0 Pending [{Initialized True 0001-01-01 00:00:00 +0000 UTC 2021-09-09 15:25:08 +0000 UTC } {Ready False 0001-01-01 00:00:00 +0000 UTC 2021-09-09 15:25:08 +0000 UTC ContainersNotReady containers with unready status: [agnhost-container]} {ContainersReady False 0001-01-01 00:00:00 +0000 UTC 2021-09-09 15:25:08 +0000 UTC ContainersNotReady containers with unready status: [agnhost-container]} {PodScheduled True 0001-01-01 00:00:00 +0000 UTC 2021-09-09 15:25:08 +0000 UTC }] Sep 9 15:30:08.673: INFO: Sep 9 15:30:08.680: INFO: skipping dumping cluster info - cluster too large [AfterEach] [sig-instrumentation] Prometheus github.com/openshift/origin/test/extended/util/client.go:141 STEP: Destroying namespace "e2e-test-prometheus-x28lk" for this suite. fail [k8s.io/kubernetes.1/test/e2e/framework/pod/resource.go:497]: failed to create new exec pod in namespace: e2e-test-prometheus-x28lk Unexpected error: <*errors.errorString | 0xc00031aaf0>: { s: "timed out waiting for the condition", } timed out waiting for the condition occurred This is the log in crio: Sep 09 15:30:10.073159 ci-op-cft718lg-1f3e2-stj78-master-0 crio[1878]: time="2021-09-09 15:30:10.073019922Z" level=info msg="Checking image status: image-registry.openshift-image-registry.svc:5000/openshift/tools:latest" id=984a3bac-2340-4501-8f74-bf631b362911 name=/runtime.v1alpha2.ImageService/ImageStatus Sep 09 15:30:10.073531 ci-op-cft718lg-1f3e2-stj78-master-0 crio[1878]: time="2021-09-09 15:30:10.073417545Z" level=info msg="Image image-registry.openshift-image-registry.svc:5000/openshift/tools:latest not found" id=984a3bac-2340-4501-8f74-bf631b362911 name=/runtime.v1alpha2.ImageService/ImageStatus Unsure why it can't find the image. Version-Release number of selected component (if applicable): How reproducible: Fairly consistent in our payload acceptance periodic jobs 70% on the SNO upgrade tests 90% on the SNO nightly CI Steps to Reproduce: 1. 2. 3. Actual results: Expected results: Additional info: Seems to happen in other jobs as well https://search.ci.openshift.org/?search=Prometheus+when+installed+on+the+cluster+shouldn%27t+report+any+alerts+in+firing+state+apart+from+Watchdog+and+AlertmanagerReceiversNotConfigured&maxAge=168h&context=-1&type=bug%2Bjunit&name=&excludeName=&maxMatches=1&maxBytes=20971520&groupBy=job
Eran, this looks like an issue with cri-o or with the image-registry service. Can you check with these teams and get their opinion? From a monitoring standpoint, there's not much we can do unfortunately.
Looking at the image-registry logs [0] from a failed run [1]: ** the first cri-o request to the image-registry is rejected because it requires authentication ** time="2021-09-05T20:09:20.174939122Z" level=warning msg="error authorizing context: authorization header required" go.version=go1.16.6 http.request.host="image-registry.openshift-image-registry.svc:5000" http.request.id=eff50b10-9940-44c9-af85-38b2910b57d6 http.request.method=GET http.request.remoteaddr="10.128.0.1:47562" http.request.uri=/v2/ http.request.useragent="cri-o/1.22.0-68.rhaos4.9.git011c10a.el8 go/go1.16.6 os/linux arch/amd64" time="2021-09-05T20:09:20.175016628Z" level=info msg=response go.version=go1.16.6 http.request.host="image-registry.openshift-image-registry.svc:5000" http.request.id=5448d595-1a35-44ce-94f9-501ed60c6b60 http.request.method=GET http.request.remoteaddr="10.128.0.1:47562" http.request.uri=/v2/ http.request.useragent="cri-o/1.22.0-68.rhaos4.9.git011c10a.el8 go/go1.16.6 os/linux arch/amd64" http.response.contenttype="application/json; charset=utf-8" http.response.duration=1.3ms http.response.status=401 http.response.written=87 ** then cri-o authenticates successfully ** time="2021-09-05T20:09:20.196286265Z" level=info msg="response completed" go.version=go1.16.6 http.request.host="image-registry.openshift-image-registry.svc:5000" http.request.id=67359636-e1a2-47d4-8dee-2a0970a5ad89 http.request.method=GET http.request.remoteaddr="10.128.0.1:47564" http.request.uri="/openshift/token?account=serviceaccount&scope=repository%3Aopenshift%2Ftools%3Apull" http.request.useragent="cri-o/1.22.0-68.rhaos4.9.git011c10a.el8 go/go1.16.6 os/linux arch/amd64" http.response.contenttype=application/json http.response.duration=16.265351ms http.response.status=200 http.response.written=2609 time="2021-09-05T20:09:20.196349469Z" level=info msg=response go.version=go1.16.6 http.request.host="image-registry.openshift-image-registry.svc:5000" http.request.id=11d09ec7-4170-47e3-9153-0dcf67aff829 http.request.method=GET http.request.remoteaddr="10.128.0.1:47564" http.request.uri="/openshift/token?account=serviceaccount&scope=repository%3Aopenshift%2Ftools%3Apull" http.request.useragent="cri-o/1.22.0-68.rhaos4.9.git011c10a.el8 go/go1.16.6 os/linux arch/amd64" http.response.contenttype=application/json http.response.duration=16.368559ms http.response.status=200 http.response.written=2609 ** but the request from cri-o to pull the image manifest gets a 404 response ** time="2021-09-05T20:09:20.207282411Z" level=info msg="authorized request" go.version=go1.16.6 http.request.host="image-registry.openshift-image-registry.svc:5000" http.request.id=860e25a8-f3ea-4ec1-b49b-fad4237dfbd6 http.request.method=GET http.request.remoteaddr="10.128.0.1:47566" http.request.uri=/v2/openshift/tools/manifests/latest http.request.useragent="cri-o/1.22.0-68.rhaos4.9.git011c10a.el8 go/go1.16.6 os/linux arch/amd64" openshift.auth.user="system:serviceaccount:e2e-test-prometheus-pvjr9:default" vars.name=openshift/tools vars.reference=latest time="2021-09-05T20:09:20.215149716Z" level=error msg="response completed with error" err.code="manifest unknown" err.detail="unknown tag=latest" err.message="manifest unknown" go.version=go1.16.6 http.request.host="image-registry.openshift-image-registry.svc:5000" http.request.id=860e25a8-f3ea-4ec1-b49b-fad4237dfbd6 http.request.method=GET http.request.remoteaddr="10.128.0.1:47566" http.request.uri=/v2/openshift/tools/manifests/latest http.request.useragent="cri-o/1.22.0-68.rhaos4.9.git011c10a.el8 go/go1.16.6 os/linux arch/amd64" http.response.contenttype="application/json; charset=utf-8" http.response.duration=14.446912ms http.response.status=404 http.response.written=96 openshift.auth.user="system:serviceaccount:e2e-test-prometheus-pvjr9:default" vars.name=openshift/tools vars.reference=latest time="2021-09-05T20:09:20.215231422Z" level=info msg=response go.version=go1.16.6 http.request.host="image-registry.openshift-image-registry.svc:5000" http.request.id=6d16989c-3257-474d-8141-a806ef63cae2 http.request.method=GET http.request.remoteaddr="10.128.0.1:47566" http.request.uri=/v2/openshift/tools/manifests/latest http.request.useragent="cri-o/1.22.0-68.rhaos4.9.git011c10a.el8 go/go1.16.6 os/linux arch/amd64" http.response.contenttype="application/json; charset=utf-8" http.response.duration=14.56172ms http.response.status=404 http.response.written=96 [0] https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.9-e2e-azure-upgrade-single-node/1434598504415629312/artifacts/e2e-azure-upgrade-single-node/gather-extra/artifacts/pods/openshift-image-registry_image-registry-6d4df58c7-9d5nq_registry.log [1] https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.9-e2e-azure-upgrade-single-node/1434598504415629312
@spasquie from the imagestreams.yaml (in the must-gather logs) it seems that the image wasn't there at the time: cat openshift/image.openshift.io/imagestreams.yaml | egrep "dockerImageRepository|created" | tail -11 dockerImageRepository: image-registry.openshift-image-registry.svc:5000/openshift/sso74-openshift-rhel8 - created: "2021-09-05T19:59:33Z" - created: "2021-09-05T19:59:33Z" dockerImageRepository: image-registry.openshift-image-registry.svc:5000/openshift/tests - created: "2021-09-05T20:13:15Z" dockerImageRepository: image-registry.openshift-image-registry.svc:5000/openshift/tools - created: "2021-09-05T20:20:45Z" dockerImageRepository: image-registry.openshift-image-registry.svc:5000/openshift/ubi8-openjdk-11 - created: "2021-09-05T19:59:33Z" dockerImageRepository: image-registry.openshift-image-registry.svc:5000/openshift/ubi8-openjdk-8 - created: "2021-09-05T19:59:32Z" For some reason, the image the test is trying to pull (image-registry.openshift-image-registry.svc:5000/openshift/tools) got created 20 minutes after all other images (8 minutes after the test pull attempt). Any idea what might cause it?
As you can see in comment #3 the image get created 20 minutes later. I'm a bit lost here about what create the image and why it shows up 20 minutes after all other images. @wking perhaps you have an idea what's going wrong here?