Bug 1898745 - installation failing with CVO reporting openshift-samples not rolled out, samples not setting versions in its ClusterOperator
Summary: installation failing with CVO reporting openshift-samples not rolled out, samples not setting versions in its ClusterOperator
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Samples
Version: 4.7
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 4.7.0
Assignee: Gabe Montero
QA Contact: XiuJuan Wang
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-11-18 01:01 UTC by Matthew Staebler
Modified: 2021-02-24 15:34 UTC
CC List: 3 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-02-24 15:34:16 UTC
Target Upstream Version:
Embargoed:




Links
  GitHub openshift/cluster-samples-operator pull 338 (closed): Bug 1898745: set progressing false on imagestream events as well if no active streams (last updated 2021-02-09 19:47:50 UTC)
  GitHub openshift/cluster-samples-operator pull 340 (closed): Bug 1898745: actually set imagestreams in progress to false in imagestream event path (last updated 2021-02-09 19:47:50 UTC)
  GitHub openshift/cluster-samples-operator pull 341 (closed): Bug 1898745: deprecate ImageChangesInProgress, make sure it stays false (last updated 2021-02-09 19:47:50 UTC)
  Red Hat Product Errata RHSA-2020:5633 (last updated 2021-02-24 15:34:53 UTC)

Description Matthew Staebler 2020-11-18 01:01:23 UTC
There have been occasional install failures during 4.7 CI tests where the installer reports that the openshift-samples operator has not successfully rolled out. The must-gather logs show the CVO reporting that the openshift-samples operator is not rolled out, even though the openshift-samples ClusterOperator shows as available, not progressing, and not degraded.

https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-4.7/1328393068503633920

https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-4.7/1328760479341875200

https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-azure-4.7/1328677666647183360

Comment 1 jamo luhrsen 2020-11-18 01:10:58 UTC
I think I was just looking at this problem. I posted in #forum-release-oversight in the CoreOS Slack:
https://coreos.slack.com/archives/C01CQA76KMX/p1605659925460900

anyone happen to know anything about the jobs failing to run tests because Cluster operator openshift-samples is still updating? I think it's widespread (happens across clouds and different job flavors) and happens enough to show up in my job debugging frequently.

 https://search.ci.openshift.org/?search=failed+to+initialize+the+cluster%3A+Cluster+operator+openshift-samples+is+still+updating&maxAge=6h&context=1&type=build-log&name=&maxMatches=5&maxBytes=20971520&groupBy=job

example:
 https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-4.7/1328760479341875200

Comment 2 W. Trevor King 2020-11-18 03:45:00 UTC
Looks like samples isn't setting an operator version:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-4.7/1328393068503633920/artifacts/e2e-aws/clusteroperators.json | jq -r '.items[] | select(.metadata.name == "openshift-samples").status | keys[]'
conditions
extension
relatedObjects

Compare [1], where the operator claims it will be setting the 'operator' version entry (which is good, it should set that).  I'm not sure why samples is not setting the version entry; the ClusterOperator looks happy otherwise:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-4.7/1328393068503633920/artifacts/e2e-aws/clusteroperators.json | jq -r '.items[] | select(.metadata.name == "openshift-samples").status.conditions[] | .lastTransitionTime + " " + .type + "=" + .status + " " + (.reason // "-") + ": " + (.message // "-")'
2020-11-16T18:13:09Z Degraded=False -: -
2020-11-16T18:13:24Z Available=True -: Samples installation successful at 4.7.0-0.nightly-2020-11-16-173710
2020-11-16T18:13:24Z Progressing=False -: Samples installation successful at 

It would be nice if the operator were setting reasons for those conditions. It also looks like there may be a formatting issue in the Progressing message, where the version, a timestamp, or some such was not included.

[1]: https://github.com/openshift/cluster-samples-operator/blob/840c02c7efeb9ca46e6ea340941bef9ff4a0b97e/manifests/07-clusteroperator.yaml#L10-L11

Comment 3 W. Trevor King 2020-11-18 03:46:21 UTC
Pretty sure this has not been MODIFIED, because it links no PR.  Moving back to NEW.

Comment 4 W. Trevor King 2020-11-18 03:49:16 UTC
Setting Progressing=False without setting versions looks like a bug introduced by [1].

[1]: https://github.com/openshift/cluster-samples-operator/pull/338/files#diff-613c854f8edbfd59f49b315d4de6c115892337b913f4913c8329af8341a3804dR59-R61
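
For illustration (this is not the samples operator's actual code): with the github.com/openshift/api config/v1 types, the operand version can be reported in the same in-memory status that carries Progressing=False, and each condition can be given a Reason. A minimal sketch, with made-up reason strings:

// Sketch only; not the cluster-samples-operator's actual code.
// Shows the versions entry being reported alongside Progressing=False,
// using the real github.com/openshift/api/config/v1 types.
package sketch

import (
    configv1 "github.com/openshift/api/config/v1"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// markRolledOut mutates a ClusterOperator status in memory so the CVO
// will consider the operator rolled out: Progressing=False only counts
// once status.versions carries a matching "operator" entry.
func markRolledOut(co *configv1.ClusterOperator, releaseVersion string) {
    // In a real operator the LastTransitionTime is only bumped when a
    // condition actually changes; this sketch just stamps everything now.
    now := metav1.Now()
    co.Status.Versions = []configv1.OperandVersion{
        {Name: "operator", Version: releaseVersion},
    }
    co.Status.Conditions = []configv1.ClusterOperatorStatusCondition{
        {
            Type:               configv1.OperatorAvailable,
            Status:             configv1.ConditionTrue,
            Reason:             "SamplesInstalled", // hypothetical reason string
            Message:            "Samples installation successful at " + releaseVersion,
            LastTransitionTime: now,
        },
        {
            Type:               configv1.OperatorProgressing,
            Status:             configv1.ConditionFalse,
            Reason:             "SamplesInstalled", // hypothetical reason string
            Message:            "Samples installation successful at " + releaseVersion,
            LastTransitionTime: now,
        },
        {
            Type:               configv1.OperatorDegraded,
            Status:             configv1.ConditionFalse,
            Reason:             "AsExpected", // hypothetical reason string
            LastTransitionTime: now,
        },
    }
}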

Comment 5 Gabe Montero 2020-11-18 03:51:32 UTC
I fixed this via PR late this afternoon / early this evening before the BZ was opened.  I associated the PR with this BZ on the PR side, but it did not link here.

Will do manually now.

Moving back to MODIFIED.

Comment 6 Gabe Montero 2020-11-18 03:53:02 UTC
(In reply to W. Trevor King from comment #4)
> Setting Progressing=False without setting versions looks like a bug
> introduced by [1].
> 
> [1]:
> https://github.com/openshift/cluster-samples-operator/pull/338/files#diff-
> 613c854f8edbfd59f49b315d4de6c115892337b913f4913c8329af8341a3804dR59-R61

No, the version and Progressing=False are not set at the same time.

Comment 7 Gabe Montero 2020-11-18 03:55:27 UTC
I am cross-referencing the failed runs and confirming they do not contain https://github.com/openshift/cluster-samples-operator/pull/338

Comment 8 Gabe Montero 2020-11-18 04:33:44 UTC
OK, the runs noted in the description did NOT have https://github.com/openshift/cluster-samples-operator/pull/338 (which was supposed to fix a handful of migration failures that search.ci showed me over the past week).

However, I ran https://search.ci.openshift.org/?search=failed+to+initialize+the+cluster%3A+Cluster+operator+openshift-samples+is+still+updating&maxAge=6h&context=1&type=build-log&name=&maxMatches=5&maxBytes=20971520&groupBy=job
and it turned up more recent runs which have the change and still did not migrate.

such as https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_cluster-network-operator/874/pull-ci-openshift-cluster-network-operator-master-e2e-gcp/1328884250555977728

The bug is in the code area Trevor mentioned, but it is not the fact that I don't set the version at the same time. I'm not actually setting the in-progress condition to false in the case where my print statement says it does.

Should have a fix up soon.

Comment 9 Gabe Montero 2020-11-18 04:39:32 UTC
More details in relation to the run noted in comment 8:

The internal samples Config CRD condition ImageChangesInProgress (slightly different from the ClusterOperator Progressing condition)

stays true:

  - lastTransitionTime: "2020-11-18T02:39:12Z"
    lastUpdateTime: "2020-11-18T02:39:12Z"
    status: "True"
    type: ImageChangesInProgress

even though there are repeated attempts to set it to false in the pod logs:

2020-11-18T03:21:01.127784646Z time="2020-11-18T03:21:01Z" level=info msg="CRDUPDATE progressing false update on imagestream redhat-sso73-openshift event"

is one of many.

We can set Progressing to false along several paths, based on the timing of events; this particular path did not manifest during the PR e2e runs.
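
To make the class of bug concrete: the log line says the in-progress condition is being cleared on the imagestream event path, but the cleared value never actually lands in the object that gets written. A rough sketch of that shape, with stand-in types and a hypothetical update helper (not the actual operator code):

package sketch

import "fmt"

// Config is a stand-in for the samples operator's Config CRD; the real
// object carries ImageChangesInProgress as a status condition.
type Config struct {
    ImageChangesInProgress bool
}

// clearProgressBuggy shows the failure shape: it logs the intent and
// persists the object, but never actually flips the condition, so
// ImageChangesInProgress stays True and the ClusterOperator never
// finishes Progressing.
func clearProgressBuggy(cfg *Config, update func(*Config) error) error {
    fmt.Println("CRDUPDATE progressing false update on imagestream event")
    return update(cfg) // cfg.ImageChangesInProgress was never set to false
}

// clearProgressFixed actually sets the condition before persisting, and
// only calls the API when there is something to change.
func clearProgressFixed(cfg *Config, update func(*Config) error) error {
    if !cfg.ImageChangesInProgress {
        return nil
    }
    cfg.ImageChangesInProgress = false
    fmt.Println("CRDUPDATE progressing false update on imagestream event")
    return update(cfg) // hypothetical status-update helper
}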

Comment 11 jamo luhrsen 2020-11-18 20:59:08 UTC
(In reply to Gabe Montero from comment #10)
> ok 
> 
> https://search.ci.openshift.org/?search=failed+to+initialize+the+cluster%3A+Cluster+operator+openshift-samples+is+still+updating&maxAge=6h&context=1&type=build-log&name=&maxMatches=5&maxBytes=20971520&groupBy=job
> 
> has no hits in the last 5 hours since PR 340 merged
> 
> still monitoring but good sign

Yep, this no longer seems to be failing anywhere. Thanks for the quick work. Yesterday when I checked, we were seeing this problem across many failing jobs approximately 2.5% of the time.

Comment 12 Gabe Montero 2020-11-18 22:04:11 UTC
(In reply to jamo luhrsen from comment #11)
> (In reply to Gabe Montero from comment #10)
> > ok 
> > 
> > https://search.ci.openshift.org/?search=failed+to+initialize+the+cluster%3A+Cluster+operator+openshift-samples+is+still+updating&maxAge=6h&context=1&type=build-log&name=&maxMatches=5&maxBytes=20971520&groupBy=job
> > 
> > has no hits in the last 5 hours since PR 340 merged
> > 
> > still monitoring but good sign
> 
> Yep, this no longer seems to be failing anywhere. Thanks for the quick work.
> Yesterday when I checked, we were seeing this problem across many failing
> jobs approximately 2.5% of the time.

Thanks for the cross-confirmation, Jamo, as well as the failure stats (I did not have time to do such calculations).

I too have been running the query you provided here about once an hour throughout the day, as well as the one I had been using previously (which motivated PR 338). So far so good.

I'll post again later this evening my time (US Eastern) with EOB results.

Comment 14 Gabe Montero 2020-11-19 12:48:52 UTC
I saw one more failure profile overnight in one particular periodic, along with a rehearsal job related to that periodic.

Key particulars: an APIServerConflictError (on the samples exist update / samples Config CRD resource) and a "fatal error: concurrent map iteration and map write" in the pod logs.

From the search: matched 0.09% of failing runs and 0.43% of jobs, over thousands of runs.

I have a PR up that should address both. My team will review today, repeat the upgrade e2e a few times, then merge.
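
For reference, those two signatures map onto well-known client-side patterns; a sketch of the general techniques follows, with hypothetical helper signatures (this is not the actual PR 341 change). Conflict errors on the Config update are normally absorbed by re-fetching and retrying with client-go's retry.RetryOnConflict, and the concurrent map fatal error means a map shared across goroutines has to be guarded and copied before iteration.

package sketch

import (
    "context"
    "sync"

    "k8s.io/client-go/util/retry"
)

// samplesConfig is a stand-in for the samples Config CRD object.
type samplesConfig struct{}

// updateWithConflictRetry re-fetches the latest object and re-applies the
// mutation whenever the API server answers with a 409 conflict, instead
// of failing the sync loop. The get/mutate/update callbacks are
// hypothetical stand-ins for whatever client the operator uses.
func updateWithConflictRetry(ctx context.Context,
    get func(context.Context) (*samplesConfig, error),
    mutate func(*samplesConfig),
    update func(context.Context, *samplesConfig) error) error {
    return retry.RetryOnConflict(retry.DefaultRetry, func() error {
        cfg, err := get(ctx) // always start from the latest resourceVersion
        if err != nil {
            return err
        }
        mutate(cfg)
        return update(ctx, cfg)
    })
}

// streamTracker shows the locking and copying needed to avoid
// "fatal error: concurrent map iteration and map write".
type streamTracker struct {
    mu         sync.RWMutex
    inProgress map[string]bool
}

func (t *streamTracker) set(name string, v bool) {
    t.mu.Lock()
    defer t.mu.Unlock()
    if t.inProgress == nil {
        t.inProgress = map[string]bool{}
    }
    t.inProgress[name] = v
}

// snapshot copies the map under a read lock so callers can iterate the
// copy safely while other goroutines keep writing.
func (t *streamTracker) snapshot() map[string]bool {
    t.mu.RLock()
    defer t.mu.RUnlock()
    out := make(map[string]bool, len(t.inProgress))
    for k, v := range t.inProgress {
        out[k] = v
    }
    return out
}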

Comment 15 Gabe Montero 2020-11-20 14:42:52 UTC
I saw a small but consistent run of failures in ovn-kubernetes-master-e2e-metal-ipi-ovn-dualstack pulls / periodics / rehearsals yesterday, even leading up to when I merged PR 341.

Since PR 341 merged, those errors are not showing up.

I will continue to monitor throughout the day, but so far we are looking good. If it is still good by EOB, I'm marking this verified.

Comment 18 errata-xmlrpc 2021-02-24 15:34:16 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633

