[sig-arch] events should not repeat pathologically

Since the merge of https://github.com/openshift/cluster-kube-apiserver-operator/pull/1199, we've seen the following repeated event in CI:

1 events happened too frequently

event happened 25 times, something is wrong: ns/openshift-kube-apiserver-operator deployment/kube-apiserver-operator - reason/MissingVersion no image found for operand pod

I think it's coming from this: https://github.com/openshift/cluster-kube-apiserver-operator/commit/ea2ec3bb5a8a36b98c987901a12822c34451354f#diff-22001281e3b968448f2558fd87069f7dbe886ce349047d0270433e17ece4372aR56

In the end, it looks like the operator got the information it wanted:

{
  "lastTransitionTime": "2021-09-07T06:38:10Z",
  "message": "KubeletMinorVersionUpgradeable: Kubelet and API server minor versions are synced.",
  "reason": "AsExpected",
  "status": "True",
  "type": "Upgradeable"
}

Example job: https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.9-e2e-aws-single-node-serial/1435120717145313280

Is this a bug, or should we add an exception for these events? Thanks!
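For reference, a quick way to see how often the event is firing on a live cluster (a hypothetical check of my own, not part of the CI test; it relies on the count field that event deduplication maintains):

$ oc get events -n openshift-kube-apiserver-operator \
    --field-selector reason=MissingVersion \
    -o custom-columns=COUNT:.count,LAST:.lastTimestamp,MESSAGE:.message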
This seems related to https://github.com/openshift/library-go/pull/1049, doesn't it?
Looks related, but that fix is already vendored in cluster-kube-apiserver-operator. In the job I linked in comment #0, bootstrap finished by 06:23, but we still had missing-image events until 07:16:32. Based on a conversation with sttts, my suggestion that it was https://github.com/openshift/cluster-kube-apiserver-operator/pull/1199 is incorrect, since that's a different controller. Other controllers also had missing-image events, including during the bootstrap; kube-apiserver just had the most. Search the message field for "operand" in https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.9-e2e-aws-single-node-serial/1435120717145313280/artifacts/e2e-aws-single-node-serial/gather-must-gather/artifacts/event-filter.html.
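For a rough count without opening the page (a hypothetical one-liner; the HTML layout of event-filter.html may not map one-to-one to events, so treat the number as an approximation):

$ curl -s 'https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.9-e2e-aws-single-node-serial/1435120717145313280/artifacts/e2e-aws-single-node-serial/gather-must-gather/artifacts/event-filter.html' \
    | grep -o 'no image found for operand pod' | wc -l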
Happens in the SNO upgrade test as well: https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.9-e2e-aws-upgrade-single-node/1435687668171149312
The fix is still missing the re-vendoring dance in the other operators. Checking one occurrence, I can see events from:

kube-controller-manager-operator
kube-scheduler-operator-container
kube-apiserver-operator
etcd-operator
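Roughly, the bump each affected operator needs looks like this (a sketch only, assuming the usual Go modules + vendor/ layout these repos use; the actual bump PRs may differ):

$ cd cluster-kube-apiserver-operator
$ go get github.com/openshift/library-go@<commit-with-the-fix>
$ go mod tidy && go mod vendor
$ git add go.mod go.sum vendor/ && git commit -m "bump(library-go)"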
Verification as below.

To check when the PR fix landed in the payload:

$ git clone https://github.com/openshift/cluster-kube-apiserver-operator
$ cd cluster-kube-apiserver-operator
$ git pull

$ oc adm release info --commits registry.ci.openshift.org/ocp/release:4.10.0-0.ci-2021-09-16-194109 | grep cluster-kube-apiserver-operator
  cluster-kube-apiserver-operator  https://github.com/openshift/cluster-kube-apiserver-operator  405ff13f18da49548dd409a0faba992cb4782961

$ git log --date local --pretty="%h %an %cd - %s" 405ff13 | grep '#1228 '
17d0234d OpenShift Merge Robot Mon Sep 13 22:49:26 2021 - Merge pull request #1228 from aojea/librarygo_bump

We can see the PR fix landed in the OCP payload on Sep 13, which means we shouldn't see the repeating events in CI tests after Sep 13th.

For kube-apiserver-operator, I still found repeating events in the past two days, in jobs tested with 4.10.0-0.ci-2021-09-16-194109:

$ w3m -dump -cols 200 'https://search.ci.openshift.org/?search=no+image+found+for+operand+pod&maxAge=48h&context=1&type=junit&name=4%5C.10&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job' | grep 'openshift-kube-apiserver-operator'
event happened 21 times, something is wrong: ns/openshift-kube-apiserver-operator deployment/kube-apiserver-operator - reason/MissingVersion no image found for operand pod
event happened 29 times, something is wrong: ns/openshift-kube-apiserver-operator deployment/kube-apiserver-operator - reason/MissingVersion no image found for operand pod

So I think this bug was not fixed, assigning back.
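Side note: the same "is the fix in the payload" question can be answered with a single ancestry check (a sketch, using the two commits found above; it exits 0 only if the merge commit of PR #1228 is reachable from the payload commit):

$ git merge-base --is-ancestor 17d0234d 405ff13 && echo "fix is contained in the payload commit"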
(In reply to Ke Wang from comment #7)
> For kube-apiserver-operator, I still found repeating events in the past two
> days, in jobs tested with 4.10.0-0.ci-2021-09-16-194109.
> [...]
> So I think this bug was not fixed, assigning back.

The jobs that you linked are upgrade jobs from 4.9 :), the fix hasn't been backported there yet.
Hi aojeagar, thank you for your reply. That means I should just check the 4.10-relevant CI jobs, right? If so, I'll leave it for a few days and then check again, excluding the upgrade CI jobs; the upgrades from 4.9 will be checked against the 4.9 bug.
Right, we can see here that in non-upgrade jobs it stopped 7 days ago. It is still happening in 4.9, but not in 4.10: https://search.ci.openshift.org/?search=no+image+found+for+operand+pod&maxAge=336h&context=1&type=junit&name=&excludeName=upgrade&maxMatches=5&maxBytes=20971520&groupBy=job

The backport is tracked here: https://bugzilla.redhat.com/show_bug.cgi?id=2003540
@kewang it seems there were no errors in "non-upgrade" 4.10 jobs in the last 10 days: https://search.ci.openshift.org/?search=no+image+found+for+operand+pod&maxAge=336h&context=1&type=junit&name=4.10&excludeName=upgrade&maxMatches=5&maxBytes=20971520&groupBy=job
aojeagar, I also checked again; as you said, the bug is fixed in 4.10. Please change the bug status to ON_QA, and I will add some comments and move it to VERIFIED.
$ w3m -dump -cols 200 'https://search.ci.openshift.org/?search=no+image+found+for+operand+pod&maxAge=336h&context=1&type=junit&name=4.10&excludeName=upgrade&maxMatches=5&maxBytes=20971520&groupBy=job' | grep 'openshift-kube-apiserver-operator'
No results found.

So the bug was fixed, moving it to VERIFIED.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:0056