Description of problem:

After upgrading from a 4.3 nightly build to 4.4-rc.11, openshift-samples keeps moving between ready and "Progressing" status; this kept happening 1-2 hours after the upgrade completed. It is visible via both the web UI and the CLI (a watch command is shown below).

Version-Release number of selected component (if applicable):

$ oc version
Client Version: 4.3.0-0.nightly-2020-03-19-220321
Server Version: 4.4.0-rc.11
Kubernetes Version: v1.17.1

Steps to Reproduce:
1. Install OCP 4.3.
2. Install OCS 4.4 on an LSO setup (VMware with direct-attached storage).
3. Run the OCP upgrade (patch the channel and upgrade).

Actual results:
The upgrade seems to be successful, but openshift-samples keeps moving between ready and "Progressing" status.

Expected results:
All cluster operators should be ready (Available, and no longer Progressing).

must-gather: https://drive.google.com/file/d/1koNffYVx3J-dIUhIXZDoQ_1MYj-53S9q/view?usp=sharing
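For reference, the flapping can be watched from the CLI with a standard oc invocation (nothing here is specific to this cluster):

$ oc get clusteroperator openshift-samples -w
# PROGRESSING should settle to False once the upgrade completes;
# in this bug it repeatedly flips back to True.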
Created attachment 1682087 [details]
samples operator log

Imagestreams have imported successfully, but ImageChangesInProgress stays True for a long time.

$ oc get config.samples -o yaml
apiVersion: v1
items:
- apiVersion: samples.operator.openshift.io/v1
  kind: Config
  metadata:
    creationTimestamp: "2020-04-27T03:08:16Z"
    finalizers:
    - samples.operator.openshift.io/finalizer
    generation: 1
    name: cluster
    resourceVersion: "137599"
    selfLink: /apis/samples.operator.openshift.io/v1/configs/cluster
    uid: 28ec00ff-52d9-4dab-bdac-ad9ec0675979
  spec:
    architectures:
    - x86_64
    managementState: Managed
  status:
    architectures:
    - x86_64
    conditions:
    - lastTransitionTime: "2020-04-27T03:08:16Z"
      lastUpdateTime: "2020-04-27T03:08:16Z"
      status: "True"
      type: ImportCredentialsExist
    - lastTransitionTime: "2020-04-27T03:08:19Z"
      lastUpdateTime: "2020-04-27T03:08:19Z"
      status: "True"
      type: ConfigurationValid
    - lastTransitionTime: "2020-04-27T03:08:17Z"
      lastUpdateTime: "2020-04-27T09:18:23Z"
      status: "False"
      type: ImportImageErrorsExist
    - lastTransitionTime: "2020-04-27T09:18:25Z"
      lastUpdateTime: "2020-04-27T09:18:25Z"
      reason: 'jboss-datagrid71-openshift jboss-datavirt64-driver-openshift fuse-apicurito-generator jboss-fuse70-eap-openshift redhat-sso71-openshift jboss-datagrid65-client-openshift redhat-sso70-openshift eap-cd-openshift openjdk-11-rhel8 jenkins-agent-maven nodejs jboss-webserver30-tomcat7-openshift jboss-webserver31-tomcat7-openshift jboss-datagrid65-openshift fuse7-console fuse7-karaf-openshift nginx rhdm-optaweb-employee-rostering-rhel8 rhpam-businesscentral-monitoring-rhel8 dotnet-runtime mysql perl jboss-amq-62 jboss-decisionserver64-openshift python redis jboss-eap72-openshift mariadb rhdm-decisioncentral-rhel8 rhpam-businesscentral-rhel8 redhat-sso73-openshift jboss-amq-63 fuse7-java-openshift jenkins-agent-nodejs postgresql jboss-fuse70-console httpd openjdk-8-rhel8 dotnet jboss-fuse70-java-openshift jenkins jboss-processserver64-openshift rhpam-kieserver-rhel8 jboss-datagrid72-openshift openjdk-11-rhel7 mongodb rhpam-smartrouter-rhel8 golang java jboss-datagrid71-client-openshift jboss-datagrid73-openshift jboss-eap70-openshift apicurito-ui fis-karaf-openshift fuse7-eap-openshift redhat-openjdk18-openshift jboss-webserver31-tomcat8-openshift jboss-datavirt64-openshift jboss-eap71-openshift ruby jboss-webserver30-tomcat8-openshift apicast-gateway jboss-eap64-openshift jboss-fuse70-karaf-openshift php rhdm-kieserver-rhel8 redhat-sso72-openshift fis-java-openshift modern-webapp jboss-webserver50-tomcat9-openshift '
      status: "True"
      type: ImageChangesInProgress
    - lastTransitionTime: "2020-04-27T09:18:31Z"
      lastUpdateTime: "2020-04-27T09:18:31Z"
      status: "True"
      type: SamplesExist
    - lastTransitionTime: "2020-04-27T03:08:19Z"
      lastUpdateTime: "2020-04-27T03:08:19Z"
      status: "False"
      type: RemovePending
    - lastTransitionTime: "2020-04-27T06:03:34Z"
      lastUpdateTime: "2020-04-27T06:03:34Z"
      status: "False"
      type: MigrationInProgress
    managementState: Managed
    version: 4.4.0-rc.12
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""

$ oc describe is jboss-datagrid71-openshift -n openshift
Name:                   jboss-datagrid71-openshift
Namespace:              openshift
Created:                6 hours ago
Labels:                 samples.operator.openshift.io/managed=true
Annotations:            openshift.io/display-name=Red Hat JBoss Data Grid 7.1
                        openshift.io/provider-display-name=Red Hat, Inc.
                        samples.operator.openshift.io/version=4.4.0-rc.12
                        version=1.4.16
Image Repository:       image-registry.openshift-image-registry.svc:5000/openshift/jboss-datagrid71-openshift
Image Lookup:           local=false
Unique Images:          4
Tags:                   4

1.3
  tagged from registry.redhat.io/jboss-datagrid-7/datagrid71-openshift:1.3
    prefer registry pullthrough when referencing this tag

  JBoss Data Grid 7.1 S2I images.
  Tags: datagrid, jboss, hidden
  Supports: datagrid:7.1

  * registry.redhat.io/jboss-datagrid-7/datagrid71-openshift@sha256:f1355e0714431109c6b4c64fe8bcc565c096395ccce0690bb21ce5fb49d51d6c
      6 hours ago

1.2
  tagged from registry.redhat.io/jboss-datagrid-7/datagrid71-openshift:1.2
    prefer registry pullthrough when referencing this tag

  JBoss Data Grid 7.1 S2I images.
  Tags: datagrid, jboss, hidden
  Supports: datagrid:7.1

  * registry.redhat.io/jboss-datagrid-7/datagrid71-openshift@sha256:5909bd1b08fd6b4428e259c4262dd70ca928089e5a2949cc57edfab019b503d0
      6 hours ago

1.1
  tagged from registry.redhat.io/jboss-datagrid-7/datagrid71-openshift:1.1
    prefer registry pullthrough when referencing this tag

  JBoss Data Grid 7.1 S2I images.
  Tags: datagrid, jboss, hidden
  Supports: datagrid:7.1

  * registry.redhat.io/jboss-datagrid-7/datagrid71-openshift@sha256:8c7690d21bedc1aa822661b5e5b72626e52178ea8295ab734d2f9c9bc7548e19
      6 hours ago

1.0
  tagged from registry.redhat.io/jboss-datagrid-7/datagrid71-openshift:1.0
    prefer registry pullthrough when referencing this tag

  JBoss Data Grid 7.1 S2I images.
  Tags: datagrid, jboss, hidden
  Supports: datagrid:7.1

  * registry.redhat.io/jboss-datagrid-7/datagrid71-openshift@sha256:df13fa689944b028595f593480bc54b17e3f81ca685a6b4290089e6a3c4e3a8e
      6 hours ago
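As a quick check (plain jsonpath against the Config object shown above; not specific to this cluster), the stuck condition can be pulled out directly:

$ oc get configs.samples.operator.openshift.io cluster \
    -o jsonpath='{.status.conditions[?(@.type=="ImageChangesInProgress")].status}{"\n"}'
True

The value stays True long after all imagestreams report successful imports.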
Setting target release to current development version (4.5) for investigation. Where fixes (if any) are required/requested for prior versions, cloned BZs will be created when appropriate.
I suspect this is a duplicate of existing bugs against 4.3 / 4.4, but I will triage the data today and report back.
I'm working off of the logs @XiuJuan posted for now, but the must-gather attachment from Aviad is producing errors when I attempt to process it with multiple tools, including gunzip, tar, and an archive manager. Can a valid must-gather archive file be resubmitted?
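For reference, a fresh archive can be produced with the standard tooling (oc adm must-gather prints the exact directory it writes; the glob below is illustrative):

$ oc adm must-gather
$ tar czf must-gather.tar.gz must-gather.local.*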
OK, I'm seeing a common thread between XiuJuan's data and the data from https://bugzilla.redhat.com/show_bug.cgi?id=1827325#c11. This is definitely a timing bug related to some of the upgrade issues previously seen where sample providers have started deleting samples between releases, just in a different spot this time.
I have a handle on this. Essentially, https://github.com/openshift/cluster-samples-operator/pull/240 was incomplete: it did not include templates in the set of removed samples that needed to be ignored. As a result we ended up with a bunch of

time="2020-04-27T06:43:12Z" level=info msg="open : no such file or directory error reading file []"
time="2020-04-27T06:43:12Z" level=info msg="CRDUPDATE event temp udpate err"

which slows down reaching progressing == false.
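For anyone triaging a cluster for this, the symptom is easy to confirm from the operator logs (standard oc invocation against the stock samples operator deployment):

$ oc logs -n openshift-cluster-samples-operator deployment/cluster-samples-operator \
    | grep 'no such file or directory'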
Added a new must-gather; it is not zipped this time. It's from today's upgrade from rc.11 to rc.12, and the issue is still there. https://drive.google.com/open?id=171vGyhchGcvC598UtSdlqxRm415L2oLW
The must-gather Aviad just posted shows the same root cause as the other two examples.
Upgrade-related e2e failures are also showing this:

[sig-devex][Feature:Templates] templateinstance readiness test should report failed soon after an annotated objects has failed [Suite:openshift/conformance/parallel]

in https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.2-to-4.3-to-4.4-to-4.5-ci/48 for example
Cross-linking bug 1811143, which is also about 4.5 samples Progressing.
Upgraded 4.4.0-0.nightly-2020-04-27-013217 to 4.5.0-0.nightly-2020-04-28-110828 (which does not include the fix) and could reproduce this issue. Continued the upgrade to 4.5.0-0.nightly-2020-04-29-040854, where the fix works: openshift-samples Progressing changes to False sooner and does not flip back to True afterwards. Also, there is no "open : no such file or directory error reading file []" error in the samples operator logs. Marking this bug as verified.
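For reference, one way to confirm the verified behavior described above is to dump the operator's condition transitions and check that Progressing stays False (plain jsonpath; not necessarily the exact commands used during verification):

$ oc get clusteroperator openshift-samples \
    -o jsonpath='{range .status.conditions[*]}{.type}{"\t"}{.status}{"\t"}{.lastTransitionTime}{"\n"}{end}'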
We saw this too in the CI build cluster.

time="2020-05-01T20:37:50Z" level=info msg="CRDERROR event temp udpate err"
time="2020-05-01T20:37:50Z" level=info msg="open : no such file or directory error reading file []"
time="2020-05-01T20:37:51Z" level=info msg="CRDERROR event temp udpate err"
time="2020-05-01T20:37:51Z" level=info msg="open : no such file or directory error reading file []"

$ oc --context build01 get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.4.2     True        False         6h38m   Cluster version is 4.4.2

What is the impact of the bug?
It can slow down, by a non-trivial number of minutes, the samples operator reporting to the CVO that it has reached the Progressing==False state as part of an upgrade, which can ultimately delay when an upgrade says it completes. Once your upgrade is finished, as your

$ oc --context build01 get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.4.2     True        False         6h38m   Cluster version is 4.4.2

output notes, you are OK.

Also, if you do not care about samples, you are OK. I would think that for CI you do not care about using the samples, so you might want to mark the operator's management state as Removed, which will remove the samples and effectively disable the operator (see the patch command below).
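For the CI case, disabling the operator is a one-line merge patch against the samples Config object (this sets the Removed management state mentioned above):

$ oc patch configs.samples.operator.openshift.io cluster --type merge \
    -p '{"spec":{"managementState":"Removed"}}'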
This is still there in OCP 4.4.3:

```
openshift-samples            4.4.3   True   False   False   5m50s
operator-lifecycle-manager   4.4.3   True   False   False   48d
```

==>

```
openshift-samples            4.4.3   True   True    False   2m39s
operator-lifecycle-manager   4.4.3   True   False   False   48d
```

There are a lot of error messages in the operator logs:

~~~
time="2020-05-05T17:49:46Z" level=error msg="unable to sync: Operation cannot be fulfilled on configs.samples.operator.openshift.io \"cluster\": the object has been modified; please apply your changes to the latest version and try again, requeuing"
time="2020-05-05T17:49:46Z" level=info msg="CRDUPDATE exist true update"
time="2020-05-05T17:49:46Z" level=error msg="unable to sync: retry secret event because in the middle of an sample upsert cycle, requeuing"
time="2020-05-05T17:49:46Z" level=error msg="unable to sync: retry secret event because in the middle of an sample upsert cycle, requeuing"
time="2020-05-05T17:49:48Z" level=info msg="CRDERROR event temp udpate err"
time="2020-05-05T17:49:48Z" level=info msg="open : no such file or directory error reading file []"
time="2020-05-05T17:49:48Z" level=info msg="CRDUPDATE event temp udpate err"
time="2020-05-05T17:49:48Z" level=info msg="CRDERROR event temp udpate err"
~~~

I believe the operator keeps restarting because of these errors.

~~~
Client Version: v4.3.0
Server Version: 4.4.3
Kubernetes Version: v1.17.1
[root@ocp4 home]#
~~~
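To distinguish an actual crash loop from the operator merely requeuing work, check the pod restart count (standard namespace for the samples operator):

$ oc get pods -n openshift-cluster-samples-operator
# A climbing RESTARTS column indicates real crashes; the "unable to sync ...
# requeuing" errors above can also occur during normal retry cycles without
# any restarts.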
After catching up with Vadim in Slack, he informs me that #Comment 17 is related to a broad set of 4.3-to-4.4 upgrade issues being triaged in https://app.slack.com/client/T027F3GAJ/C0108DW4J7P and https://app.slack.com/client/T027F3GAJ/C0134TXLGG3. For the samples operator, I will be improving on the 'no such file or directory error reading file []' message via https://bugzilla.redhat.com/show_bug.cgi?id=1832193. Now, this bug has yet to be cherry-picked back to 4.4; it is only fixed in 4.5 at the moment, hence what was seen in #Comment 17. I will be initiating the clone / cherry pick now.
Turns out the cherry pick associated with the 4.4.z clone of this bug, https://bugzilla.redhat.com/show_bug.cgi?id=1832344, is sufficient for the 'no such file or directory error reading file []' message: once we stop trying to read in template files for templates that are no longer in our current inventory (but were in past inventories), we no longer produce the message.
All, just wanted to circle around on this. We're still seeing this in the latest 4.4 nightlies (openshift-v4.4.0-0.nightly-2020-06-01-211921, more detailed job results here: https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/logs/osde2e-int-aws-e2e-upgrade-to-osd-default-plus-one-nightly/335).
Looking at that run in #Comment 20, it is a different cause than this bug:

2020/06/02 11:40:03 Operator openshift-samples type Progressing is True: Samples installation in error at 4.4.0-0.nightly-2020-06-01-211921: FailedImageImports

That one stems from the imagestream imports failing. Looking at the artifacts for that run, I do not see the must-gather data, so I can't pull up the imagestream status for the precise reason, but those failed image imports typically occur either because

a) registry.redhat.io has some sort of outage or intermittent issue preventing image download by the image registry, or
b) there is a problem with the credentials supplied during install for registry.redhat.io (where those issues could be "user" error or, again, more problems on the infrastructure side of registry.redhat.io).

We do have an open issue with that team for intermittent outages: https://projects.engineering.redhat.com/browse/ITESPLAT-1817

If must-gather data could be collected for that particular nightly job, I could get more precise in the analysis and conclude whether the failure should be added to https://projects.engineering.redhat.com/browse/ITESPLAT-1817 or a separate item should be filed.
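If anyone wants to dig without a must-gather, failed imports surface as ImportSuccess=False tag conditions on the imagestreams, so something along these lines (the jq filter is illustrative) lists the affected streams:

$ oc get imagestreams -n openshift -o json \
    | jq -r '.items[]
             | select([.status.tags[]?.conditions[]?
                       | select(.type=="ImportSuccess" and .status=="False")]
                      | length > 0)
             | .metadata.name'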
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2409