There have been occasional install failures during 4.7 CI tests where the installer reports that the openshift-samples operator has not successfully rolled out. The must-gather logs show the CVO reporting that the openshift-samples operator is not rolled out, even though the openshift-samples ClusterOperator shows as available, not progressing, and not degraded.

https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-4.7/1328393068503633920
https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-4.7/1328760479341875200
https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-azure-4.7/1328677666647183360
I think I was just looking at this problem. I posted in #forum-release-oversight in CoreOS Slack (https://coreos.slack.com/archives/C01CQA76KMX/p1605659925460900): anyone happen to know anything about jobs failing to run tests because "Cluster operator openshift-samples is still updating"? I think it's widespread (it happens across clouds and different job flavors) and happens often enough to show up in my job debugging frequently.

https://search.ci.openshift.org/?search=failed+to+initialize+the+cluster%3A+Cluster+operator+openshift-samples+is+still+updating&maxAge=6h&context=1&type=build-log&name=&maxMatches=5&maxBytes=20971520&groupBy=job

Example: https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-4.7/1328760479341875200
Looks like samples isn't setting an operator version:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-4.7/1328393068503633920/artifacts/e2e-aws/clusteroperators.json | jq -r '.items[] | select(.metadata.name == "openshift-samples").status | keys[]'
conditions
extension
relatedObjects

Compare [1], where the operator claims it will be setting the 'operator' version entry (which is good, it should set that). I'm not sure why samples is not setting the version entry; the ClusterOperator looks happy otherwise:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-4.7/1328393068503633920/artifacts/e2e-aws/clusteroperators.json | jq -r '.items[] | select(.metadata.name == "openshift-samples").status.conditions[] | .lastTransitionTime + " " + .type + "=" + .status + " " + (.reason // "-") + ": " + (.message // "-")'
2020-11-16T18:13:09Z Degraded=False -: -
2020-11-16T18:13:24Z Available=True -: Samples installation successful at 4.7.0-0.nightly-2020-11-16-173710
2020-11-16T18:13:24Z Progressing=False -: Samples installation successful at

It would be nice if the operator set reasons for those conditions. And it looks like there may be a formatting issue in the Progressing message, where the version, a timestamp, or some such was not included.

[1]: https://github.com/openshift/cluster-samples-operator/blob/840c02c7efeb9ca46e6ea340941bef9ff4a0b97e/manifests/07-clusteroperator.yaml#L10-L11
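For context on why the install blocks on this: the CVO considers a ClusterOperator rolled out only once its status reports the expected 'operator' version entry alongside healthy conditions. Here is a minimal Go sketch of that expected shape, using the openshift/api config types. This is illustrative only, not the samples operator's actual code; the RELEASE_VERSION env var is the usual convention for an operator to learn its payload version, and the Reason strings are made-up placeholders:

package example

import (
	"os"
	"time"

	configv1 "github.com/openshift/api/config/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// desiredStatus builds the ClusterOperator status the CVO wants to see once
// rollout is complete: healthy conditions plus an "operator" entry in
// status.versions matching the release payload.
func desiredStatus() configv1.ClusterOperatorStatus {
	now := metav1.NewTime(time.Now())
	// Assumption for this sketch: RELEASE_VERSION is injected into the
	// operator's deployment by the CVO.
	version := os.Getenv("RELEASE_VERSION")
	return configv1.ClusterOperatorStatus{
		Versions: []configv1.OperandVersion{
			{Name: "operator", Version: version},
		},
		Conditions: []configv1.ClusterOperatorStatusCondition{
			{Type: configv1.OperatorAvailable, Status: configv1.ConditionTrue,
				LastTransitionTime: now, Reason: "SamplesInstalled", // placeholder reason
				Message: "Samples installation successful at " + version},
			{Type: configv1.OperatorProgressing, Status: configv1.ConditionFalse,
				LastTransitionTime: now, Reason: "SamplesInstalled", // placeholder reason
				Message: "Samples installation successful at " + version},
			{Type: configv1.OperatorDegraded, Status: configv1.ConditionFalse,
				LastTransitionTime: now, Reason: "AsExpected"}, // placeholder reason
		},
	}
}

Note that if version is empty here, the Progressing Message ends in "at ", which would also explain the truncated "Samples installation successful at" message in the output above.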
Pretty sure this has not actually been MODIFIED, because no PR is linked. Moving back to NEW.
Setting Progressing=False without setting versions looks like a bug introduced by [1].

[1]: https://github.com/openshift/cluster-samples-operator/pull/338/files#diff-613c854f8edbfd59f49b315d4de6c115892337b913f4913c8329af8341a3804dR59-R61
I fixed this via PR late this afternoon / early this evening, before the BZ was opened. I associated the PR with this BZ on the PR side, but it did not link here; will do so manually now. Moving back to MODIFIED.
(In reply to W. Trevor King from comment #4)
> Setting Progressing=False without setting versions looks like a bug
> introduced by [1].
>
> [1]: https://github.com/openshift/cluster-samples-operator/pull/338/files#diff-613c854f8edbfd59f49b315d4de6c115892337b913f4913c8329af8341a3804dR59-R61

No, the version and Progressing=False are not set at the same time.
I am cross-referencing the failed runs and confirming they do not contain https://github.com/openshift/cluster-samples-operator/pull/338
OK, the runs noted in the description did NOT have https://github.com/openshift/cluster-samples-operator/pull/338 (which was supposed to fix a handful of migration failures that search.ci showed me over the past week). However, I ran https://search.ci.openshift.org/?search=failed+to+initialize+the+cluster%3A+Cluster+operator+openshift-samples+is+still+updating&maxAge=6h&context=1&type=build-log&name=&maxMatches=5&maxBytes=20971520&groupBy=job and it turned up more recent runs which have the change and still did not migrate, such as https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_cluster-network-operator/874/pull-ci-openshift-cluster-network-operator-master-e2e-gcp/1328884250555977728

The bug is in the code area Trevor mentioned, but it is not that I don't set the version at the same time: I'm not actually setting progressing to false in the case where my print statement says I do. Should have a fix up soon.
More details in relation to the run noted in comment #8: the internal samples Config CRD condition ImageChangesInProgress (slightly different from the ClusterOperator Progressing condition) stays true:

- lastTransitionTime: "2020-11-18T02:39:12Z"
  lastUpdateTime: "2020-11-18T02:39:12Z"
  status: "True"
  type: ImageChangesInProgress

even though there are repeated attempts to set it to false in the pod logs:

2020-11-18T03:21:01.127784646Z time="2020-11-18T03:21:01Z" level=info msg="CRDUPDATE progressing false update on imagestream redhat-sso73-openshift event"

is one of many. We can set progressing to false along several paths, based on the timing of events ... this one did not manifest during PR e2e runs.
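For illustration, a hedged sketch of the pattern that avoids the symptom above (the type and function names are hypothetical stand-ins, not the operator's actual generated client): only log the "progressing false" update after the write is actually accepted, so a log line can never claim an update that never reached the API server.

package example

import (
	"fmt"
	"time"
)

// Hypothetical stand-ins for the samples operator's Config CRD types.
type Condition struct {
	Type               string
	Status             string
	LastTransitionTime time.Time
	LastUpdateTime     time.Time
}

type Config struct {
	// ResourceVersion is carried so the server can detect stale writes.
	ResourceVersion string
	Conditions      []Condition
}

// markImageChangesComplete flips ImageChangesInProgress to "False" and logs
// only after the write succeeds.
func markImageChangesComplete(get func() (*Config, error), update func(*Config) error) error {
	cfg, err := get()
	if err != nil {
		return err
	}
	changed := false
	for i := range cfg.Conditions {
		c := &cfg.Conditions[i]
		if c.Type == "ImageChangesInProgress" && c.Status == "True" {
			c.Status = "False"
			c.LastTransitionTime = time.Now()
			c.LastUpdateTime = time.Now()
			changed = true
		}
	}
	if !changed {
		return nil // nothing to do; do not log a phantom update
	}
	if err := update(cfg); err != nil {
		return fmt.Errorf("CRDUPDATE failed, will retry on next event: %w", err)
	}
	// Only now is it safe to claim the condition went false.
	fmt.Println("CRDUPDATE progressing false update succeeded")
	return nil
}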
OK, https://search.ci.openshift.org/?search=failed+to+initialize+the+cluster%3A+Cluster+operator+openshift-samples+is+still+updating&maxAge=6h&context=1&type=build-log&name=&maxMatches=5&maxBytes=20971520&groupBy=job has had no hits in the 5 hours since PR 340 merged. Still monitoring, but a good sign.
(In reply to Gabe Montero from comment #10)
> OK, https://search.ci.openshift.org/?search=failed+to+initialize+the+cluster%3A+Cluster+operator+openshift-samples+is+still+updating&maxAge=6h&context=1&type=build-log&name=&maxMatches=5&maxBytes=20971520&groupBy=job
> has had no hits in the 5 hours since PR 340 merged. Still monitoring, but a
> good sign.

Yep, this no longer seems to be failing anywhere. Thanks for the quick work. Yesterday when I checked, we were seeing this problem across many failing jobs, approximately 2.5% of the time.
(In reply to jamo luhrsen from comment #11)
> (In reply to Gabe Montero from comment #10)
> > OK, https://search.ci.openshift.org/?search=failed+to+initialize+the+cluster%3A+Cluster+operator+openshift-samples+is+still+updating&maxAge=6h&context=1&type=build-log&name=&maxMatches=5&maxBytes=20971520&groupBy=job
> > has had no hits in the 5 hours since PR 340 merged. Still monitoring, but
> > a good sign.
>
> Yep, this no longer seems to be failing anywhere. Thanks for the quick work.
> Yesterday when I checked, we were seeing this problem across many failing
> jobs, approximately 2.5% of the time.

Thanks for the cross-confirmation, Jamo, as well as the failure stats (I did not have time to do such calculations). I too have been running the query you provided here about once an hour throughout the day, as well as the one I had been using previously (which motivated PR 338). So far so good. I'll post again later this evening my time (US Eastern) with EOB results.
Yeah, both

https://search.ci.openshift.org/?search=failed+to+initialize+the+cluster%3A+Cluster+operator+openshift-samples+is+still+updating&maxAge=6h&context=1&type=build-log&name=&maxMatches=5&maxBytes=20971520&groupBy=job

and

https://search.ci.openshift.org/?search=openshift-samples+progressing&maxAge=6h&context=1&type=build-log&name=&maxMatches=5&maxBytes=20971520&groupBy=job

are still turning up clean for me. I'll continue to run them periodically through the end of the week, and lengthen the search window to 12h and then 1d as things progress.
I saw one more failure profile overnight in one particular periodic, along with a rehearsal job related to that periodic. Key particulars: an APIServerConflictError (on the samples exist update / samples Config CRD resource) and a "fatal error: concurrent map iteration and map write" in the pod logs. From the search: it matched 0.09% of failing runs and 0.43% of jobs, over thousands of runs. I have a PR up that should address both. My team will review today, repeat the upgrade e2e some, then merge.
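For reference, a hedged sketch of the two standard remedies for those particulars (assumptions for illustration, not necessarily what the PR does): guard any map shared across goroutines with a mutex, since iterating a map while another goroutine writes to it triggers exactly the "concurrent map iteration and map write" fatal error; and wrap status writes in client-go's retry.RetryOnConflict so a 409 conflict is retried against a fresh copy instead of surfacing as a hard failure. The imageStreamTracker type and function parameters below are hypothetical names.

package example

import (
	"sync"

	"k8s.io/client-go/util/retry"
)

// imageStreamTracker guards a map that multiple event handlers touch.
type imageStreamTracker struct {
	mu      sync.RWMutex
	pending map[string]bool
}

func (t *imageStreamTracker) set(name string, inProgress bool) {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.pending[name] = inProgress
}

func (t *imageStreamTracker) anyPending() bool {
	t.mu.RLock()
	defer t.mu.RUnlock()
	// Safe to iterate: writers are excluded while the read lock is held.
	for _, p := range t.pending {
		if p {
			return true
		}
	}
	return false
}

// updateWithConflictRetry refreshes the object to the latest resourceVersion
// and re-applies the mutation whenever the API server answers 409 Conflict.
func updateWithConflictRetry(get func() error, mutateAndUpdate func() error) error {
	return retry.RetryOnConflict(retry.DefaultRetry, func() error {
		if err := get(); err != nil {
			return err
		}
		return mutateAndUpdate()
	})
}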
Yesterday I saw a small but consistent run of failures in ovn-kubernetes-master-e2e-metal-ipi-ovn-dualstack pulls / periodics / rehearsals, even leading up to when I merged PR 341. Since PR 341 merged, those errors are not showing up. Will continue to monitor throughout the day, but so far we are looking good. If still good by EOB, I'm marking this VERIFIED.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2020:5633