Bug 1898745 - installation failing with CVO reporting openshift-samples not rolled out, samples not setting versions in its ClusterOperator
Summary: installation failing with CVO reporting openshift-samples not rolled out, samples not setting versions in its ClusterOperator
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Samples
Version: 4.7
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 4.7.0
Assignee: Gabe Montero
QA Contact: XiuJuan Wang
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-11-18 01:01 UTC by Matthew Staebler
Modified: 2021-02-24 15:34 UTC
CC List: 3 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-02-24 15:34:16 UTC
Target Upstream Version:
Embargoed:




Links
  GitHub openshift/cluster-samples-operator pull 338 (closed): Bug 1898745: set progressing false on imagestream events as well if no active streams (last updated 2021-02-09 19:47:50 UTC)
  GitHub openshift/cluster-samples-operator pull 340 (closed): Bug 1898745: actually set imagestreams in progress to false in imagestream event path (last updated 2021-02-09 19:47:50 UTC)
  GitHub openshift/cluster-samples-operator pull 341 (closed): Bug 1898745: deprecate ImageChangesInProgress, make sure it stays false (last updated 2021-02-09 19:47:50 UTC)
  Red Hat Product Errata RHSA-2020:5633 (last updated 2021-02-24 15:34:53 UTC)

Description Matthew Staebler 2020-11-18 01:01:23 UTC
There have been occasional install failures during 4.7 CI tests where the installer reports that the openshift-samples operator has not successfully rolled out. The must-gather logs show the CVO reporting that the openshift-samples operator is not rolled out, even though the openshift-samples ClusterOperator shows as available, not progressing, and not degraded.

https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-4.7/1328393068503633920

https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-4.7/1328760479341875200

https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-azure-4.7/1328677666647183360

Comment 1 jamo luhrsen 2020-11-18 01:10:58 UTC
I think I was just looking at this problem. I posted in #forum-release-oversight in the CoreOS Slack:
https://coreos.slack.com/archives/C01CQA76KMX/p1605659925460900

anyone happen to know anything about the jobs failing to run tests because Cluster operator openshift-samples is still updating? I think it's widespread (happens across clouds and different job flavors) and happens enough to show up in my job debugging frequently.

 https://search.ci.openshift.org/?search=failed+to+initialize+the+cluster%3A+Cluster+operator+openshift-samples+is+still+updating&maxAge=6h&context=1&type=build-log&name=&maxMatches=5&maxBytes=20971520&groupBy=job

example:
 https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-4.7/1328760479341875200

Comment 2 W. Trevor King 2020-11-18 03:45:00 UTC
Looks like samples isn't setting an operator version:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-4.7/1328393068503633920/artifacts/e2e-aws/clusteroperators.json | jq -r '.items[] | select(.metadata.name == "openshift-samples").status | keys[]'
conditions
extension
relatedObjects

Compare [1], where the operator claims it will be setting the 'operator' version entry (which is good, it should set that).  I'm not sure why samples is not setting the version entry; the ClusterOperator looks happy otherwise:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-4.7/1328393068503633920/artifacts/e2e-aws/clusteroperators.json | jq -r '.items[] | select(.metadata.name == "openshift-samples").status.conditions[] | .lastTransitionTime + " " + .type + "=" + .status + " " + (.reason // "-") + ": " + (.message // "-")'
2020-11-16T18:13:09Z Degraded=False -: -
2020-11-16T18:13:24Z Available=True -: Samples installation successful at 4.7.0-0.nightly-2020-11-16-173710
2020-11-16T18:13:24Z Progressing=False -: Samples installation successful at 

It would be nice if the operator were setting reasons for those conditions. It also looks like there may be a formatting issue in the Progressing message, where the version, a timestamp, or some such was not included.

[1]: https://github.com/openshift/cluster-samples-operator/blob/840c02c7efeb9ca46e6ea340941bef9ff4a0b97e/manifests/07-clusteroperator.yaml#L10-L11

Comment 3 W. Trevor King 2020-11-18 03:46:21 UTC
Pretty sure this has not been MODIFIED, because it links no PR.  Moving back to NEW.

Comment 4 W. Trevor King 2020-11-18 03:49:16 UTC
Setting Progressing=False without setting versions looks like a bug introduced by [1].

[1]: https://github.com/openshift/cluster-samples-operator/pull/338/files#diff-613c854f8edbfd59f49b315d4de6c115892337b913f4913c8329af8341a3804dR59-R61
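
For illustration (this is not the samples operator's actual code): with the github.com/openshift/api config/v1 types, the operand version can be reported in the same in-memory status that carries Progressing=False, and each condition can be given a Reason. A minimal sketch, with made-up reason strings:

// Sketch only; not the cluster-samples-operator's actual code.
// Shows the versions entry being reported alongside Progressing=False,
// using the real github.com/openshift/api/config/v1 types.
package sketch

import (
    configv1 "github.com/openshift/api/config/v1"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// markRolledOut mutates a ClusterOperator status in memory so the CVO
// will consider the operator rolled out: Progressing=False only counts
// once status.versions carries a matching "operator" entry.
func markRolledOut(co *configv1.ClusterOperator, releaseVersion string) {
    // In a real operator the LastTransitionTime is only bumped when a
    // condition actually changes; this sketch just stamps everything now.
    now := metav1.Now()
    co.Status.Versions = []configv1.OperandVersion{
        {Name: "operator", Version: releaseVersion},
    }
    co.Status.Conditions = []configv1.ClusterOperatorStatusCondition{
        {
            Type:               configv1.OperatorAvailable,
            Status:             configv1.ConditionTrue,
            Reason:             "SamplesInstalled", // hypothetical reason string
            Message:            "Samples installation successful at " + releaseVersion,
            LastTransitionTime: now,
        },
        {
            Type:               configv1.OperatorProgressing,
            Status:             configv1.ConditionFalse,
            Reason:             "SamplesInstalled", // hypothetical reason string
            Message:            "Samples installation successful at " + releaseVersion,
            LastTransitionTime: now,
        },
        {
            Type:               configv1.OperatorDegraded,
            Status:             configv1.ConditionFalse,
            Reason:             "AsExpected", // hypothetical reason string
            LastTransitionTime: now,
        },
    }
}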

Comment 5 Gabe Montero 2020-11-18 03:51:32 UTC
I fixed this via PR late this afternoon / early this evening before the BZ was opened.  I associated the PR with this BZ on the PR side, but it did not link here.

Will do manually now.

Moving back to MODIFIED.

Comment 6 Gabe Montero 2020-11-18 03:53:02 UTC
(In reply to W. Trevor King from comment #4)
> Setting Progressing=False without setting versions looks like a bug
> introduced by [1].
> 
> [1]:
> https://github.com/openshift/cluster-samples-operator/pull/338/files#diff-
> 613c854f8edbfd59f49b315d4de6c115892337b913f4913c8329af8341a3804dR59-R61

No, the version and Progressing=False are not set at the same time.

Comment 7 Gabe Montero 2020-11-18 03:55:27 UTC
I am cross-referencing the failed runs and confirming they do not contain https://github.com/openshift/cluster-samples-operator/pull/338

Comment 8 Gabe Montero 2020-11-18 04:33:44 UTC
OK, the runs noted in the description did NOT have https://github.com/openshift/cluster-samples-operator/pull/338 (which was supposed to fix a handful of migration failures that search.ci showed me over the past week).

However, I ran https://search.ci.openshift.org/?search=failed+to+initialize+the+cluster%3A+Cluster+operator+openshift-samples+is+still+updating&maxAge=6h&context=1&type=build-log&name=&maxMatches=5&maxBytes=20971520&groupBy=job
and it turned up more recent runs which have the change and still did not migrate.

such as https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_cluster-network-operator/874/pull-ci-openshift-cluster-network-operator-master-e2e-gcp/1328884250555977728

The bug is in the code area Trevor mentioned, but it is not the fact that I don't set the version at the same time. I'm not actually setting the in-progress condition to false in the case where my print statement says it does.

Should have a fix up soon.

Comment 9 Gabe Montero 2020-11-18 04:39:32 UTC
More details in relation to the run noted in comment 8:

The internal samples Config CRD condition ImageChangesInProgress (slightly different from the ClusterOperator Progressing condition)

stays true:

  - lastTransitionTime: "2020-11-18T02:39:12Z"
    lastUpdateTime: "2020-11-18T02:39:12Z"
    status: "True"
    type: ImageChangesInProgress

even though there are repeated attempts to set it to false in the pod logs:

2020-11-18T03:21:01.127784646Z time="2020-11-18T03:21:01Z" level=info msg="CRDUPDATE progressing false update on imagestream redhat-sso73-openshift event"

is one of many.

We can set Progressing to false along several paths, based on the timing of events; this particular path did not manifest during the PR e2e runs.
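
To make the class of bug concrete: the log line says the in-progress condition is being cleared on the imagestream event path, but the cleared value never actually lands in the object that gets written. A rough sketch of that shape, with stand-in types and a hypothetical update helper (not the actual operator code):

package sketch

import "fmt"

// Config is a stand-in for the samples operator's Config CRD; the real
// object carries ImageChangesInProgress as a status condition.
type Config struct {
    ImageChangesInProgress bool
}

// clearProgressBuggy shows the failure shape: it logs the intent and
// persists the object, but never actually flips the condition, so
// ImageChangesInProgress stays True and the ClusterOperator never
// finishes Progressing.
func clearProgressBuggy(cfg *Config, update func(*Config) error) error {
    fmt.Println("CRDUPDATE progressing false update on imagestream event")
    return update(cfg) // cfg.ImageChangesInProgress was never set to false
}

// clearProgressFixed actually sets the condition before persisting, and
// only calls the API when there is something to change.
func clearProgressFixed(cfg *Config, update func(*Config) error) error {
    if !cfg.ImageChangesInProgress {
        return nil
    }
    cfg.ImageChangesInProgress = false
    fmt.Println("CRDUPDATE progressing false update on imagestream event")
    return update(cfg) // hypothetical status-update helper
}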

Comment 11 jamo luhrsen 2020-11-18 20:59:08 UTC
(In reply to Gabe Montero from comment #10)
> ok 
> 
> https://search.ci.openshift.org/?search=failed+to+initialize+the+cluster%3A+Cluster+operator+openshift-samples+is+still+updating&maxAge=6h&context=1&type=build-log&name=&maxMatches=5&maxBytes=20971520&groupBy=job
> 
> has no hits in the last 5 hours since PR 340 merged
> 
> still monitoring but good sign

Yep, this no longer seems to be failing anywhere. Thanks for the quick work. Yesterday when I checked, we were seeing this problem across many failing jobs approximately 2.5% of the time.

Comment 12 Gabe Montero 2020-11-18 22:04:11 UTC
(In reply to jamo luhrsen from comment #11)
> (In reply to Gabe Montero from comment #10)
> > ok 
> > 
> > https://search.ci.openshift.org/?search=failed+to+initialize+the+cluster%3A+Cluster+operator+openshift-samples+is+still+updating&maxAge=6h&context=1&type=build-log&name=&maxMatches=5&maxBytes=20971520&groupBy=job
> > 
> > has no hits in the last 5 hours since PR 340 merged
> > 
> > still monitoring but good sign
> 
> Yep, this no longer seems to be failing anywhere. Thanks for the quick work.
> Yesterday when I checked, we were seeing this problem across many failing
> jobs approximately 2.5% of the time.

Thanks for the cross-confirmation, Jamo, as well as the failure stats (I did not have time to do such calculations).

I too have been running the query you provided here about once an hour throughout the day, as well as the one I had been using previously (which motivated PR 338). So far so good.

I'll post again later this evening my time (US Eastern) with EOB results.

Comment 14 Gabe Montero 2020-11-19 12:48:52 UTC
I saw one more failure profile overnight in one particular periodic, along with a rehearsal job related to that periodic.

Key particulars: an APIServerConflictError (on the samples exist update / samples Config CRD resource) and a "fatal error: concurrent map iteration and map write" in the pod logs.

From the search: matched 0.09% of failing runs and 0.43% of jobs, over thousands of runs.

I have a PR up that should address both. My team will review today, repeat the upgrade e2e a few times, then merge.
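
For reference, those two signatures map onto well-known client-side patterns; a sketch of the general techniques follows, with hypothetical helper signatures (this is not the actual PR 341 change). Conflict errors on the Config update are normally absorbed by re-fetching and retrying with client-go's retry.RetryOnConflict, and the concurrent map fatal error means a map shared across goroutines has to be guarded and copied before iteration.

package sketch

import (
    "context"
    "sync"

    "k8s.io/client-go/util/retry"
)

// samplesConfig is a stand-in for the samples Config CRD object.
type samplesConfig struct{}

// updateWithConflictRetry re-fetches the latest object and re-applies the
// mutation whenever the API server answers with a 409 conflict, instead
// of failing the sync loop. The get/mutate/update callbacks are
// hypothetical stand-ins for whatever client the operator uses.
func updateWithConflictRetry(ctx context.Context,
    get func(context.Context) (*samplesConfig, error),
    mutate func(*samplesConfig),
    update func(context.Context, *samplesConfig) error) error {
    return retry.RetryOnConflict(retry.DefaultRetry, func() error {
        cfg, err := get(ctx) // always start from the latest resourceVersion
        if err != nil {
            return err
        }
        mutate(cfg)
        return update(ctx, cfg)
    })
}

// streamTracker shows the locking and copying needed to avoid
// "fatal error: concurrent map iteration and map write".
type streamTracker struct {
    mu         sync.RWMutex
    inProgress map[string]bool
}

func (t *streamTracker) set(name string, v bool) {
    t.mu.Lock()
    defer t.mu.Unlock()
    if t.inProgress == nil {
        t.inProgress = map[string]bool{}
    }
    t.inProgress[name] = v
}

// snapshot copies the map under a read lock so callers can iterate the
// copy safely while other goroutines keep writing.
func (t *streamTracker) snapshot() map[string]bool {
    t.mu.RLock()
    defer t.mu.RUnlock()
    out := make(map[string]bool, len(t.inProgress))
    for k, v := range t.inProgress {
        out[k] = v
    }
    return out
}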

Comment 15 Gabe Montero 2020-11-20 14:42:52 UTC
I saw a small but consistent run of failures in ovn-kubernetes-master-e2e-metal-ipi-ovn-dualstack pulls / periodics / rehearsals yesterday, even leading up to when I merged PR 341.

Since PR 341 merged, those errors are not showing up.

I will continue to monitor throughout the day, but so far we are looking good. If it is still good by EOB, I'm marking this verified.

Comment 18 errata-xmlrpc 2021-02-24 15:34:16 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633

