Description of problem: Do samples operator finalizer some times, the ImageChangesInProgress will stuck in true for long time(more than 1h). Version-Release number of selected component (if applicable): 4.2.0-0.nightly-2019-08-01-035705 How reproducible: sometimes Steps to Reproduce: 1.Do samples operator finalizer some times. 2.Check samples operator crd. 3. Actual results: $ oc describe config.samples.operator | grep -A1 -B3 ImageChangesInProgress Last Update Time: 2019-08-01T08:42:00Z Reason: fuse-apicurito-generator jboss-processserver64-openshift redis rhpam73-kieserver-openshift jboss-webserver31-tomcat7-openshift jboss-webserver31-tomcat8-openshift jboss-decisionserver64-openshift fuse7-karaf-openshift jenkins-agent-nodejs nginx python rhdm73-decisioncentral-indexing-openshift jboss-datagrid72-openshift dotnet-runtime jenkins-agent-maven rhpam73-businesscentral-openshift jboss-amq-62 java openjdk-11-rhel7 fis-java-openshift jboss-fuse70-karaf-openshift httpd mariadb rhpam73-smartrouter-openshift jboss-datagrid71-client-openshift jboss-datagrid73-openshift dotnet Status: True Type: ImageChangesInProgress Last Transition Time: 2019-08-01T08:41:45Z Expected results: Should update the managed imagestreams fast. Additional info: $ oc logs -f cluster-samples-operator-54c57b54f7-mvtpz | grep fuse-apicurito-generator time="2019-08-01T08:36:20Z" level=info msg="Imagestream fuse-apicurito-generator watch event do upsert false; no errors in prep true, possibly update operator conditions true" time="2019-08-01T08:41:22Z" level=info msg="one time ignoring of delete event for imagestream fuse-apicurito-generator as part of group delete" time="2019-08-01T08:41:38Z" level=info msg="processing file fuse-apicurito-generator-rhel7.json from dir /opt/openshift/operator/ocp-x86_64/fis/imagestreams" time="2019-08-01T08:41:39Z" level=info msg="created imagestream fuse-apicurito-generator" time="2019-08-01T08:41:39Z" level=info msg="caching imagestream event fuse-apicurito-generator because we have not yet completed all the samples upserts" time="2019-08-01T08:42:01Z" level=info msg="Imagestream fuse-apicurito-generator watch event do upsert false; no errors in prep true, possibly update operator conditions true" time="2019-08-01T08:42:03Z" level=info msg="There are no more image imports in flight for imagestream fuse-apicurito-generator" time="2019-08-01T08:42:11Z" level=info msg="There are no more image imports in flight for imagestream fuse-apicurito-generator" time="2019-08-01T08:46:20Z" level=info msg="imagestream fuse-apicurito-generator still not finished with its image imports, including " time="2019-08-01T08:46:20Z" level=info msg="imagestream fuse-apicurito-generator still not finished with its image imports, including " time="2019-08-01T08:46:20Z" level=info msg="imagestream fuse-apicurito-generator still not finished with its image imports, including " time="2019-08-01T08:46:20Z" level=info msg="Imagestream fuse-apicurito-generator watch event do upsert false; no errors in prep true, possibly update operator conditions true" time="2019-08-01T08:51:53Z" level=info msg="Imagestream fuse-apicurito-generator watch event do upsert false; no errors in prep true, possibly update operator conditions true" time="2019-08-01T08:51:53Z" level=info msg="There are no more image imports in flight for imagestream fuse-apicurito-generator" time="2019-08-01T08:51:53Z" level=info msg="imagestream fuse-apicurito-generator still not finished with its image imports, including " time="2019-08-01T09:01:53Z" level=info msg="imagestream fuse-apicurito-generator still not finished with its image imports, including " time="2019-08-01T09:01:53Z" level=info msg="imagestream fuse-apicurito-generator still not finished with its image imports, including " time="2019-08-01T09:01:53Z" level=info msg="Imagestream fuse-apicurito-generator watch event do upsert false; no errors in prep true, possibly update operator conditions true" time="2019-08-01T09:05:59Z" level=info msg="Imagestream fuse-apicurito-generator watch event do upsert false; no errors in prep true, possibly update operator conditions true" time="2019-08-01T09:05:59Z" level=info msg="imagestream fuse-apicurito-generator still not finished with its image imports, including " time="2019-08-01T09:13:28Z" level=info msg="Imagestream fuse-apicurito-generator watch event do upsert false; no errors in prep true, possibly update operator conditions true" time="2019-08-01T09:23:28Z" level=info msg="Imagestream fuse-apicurito-generator watch event do upsert false; no errors in prep true, possibly update operator conditions true" time="2019-08-01T09:23:28Z" level=info msg="imagestream fuse-apicurito-generator still not finished with its image imports, including " time="2019-08-01T09:24:55Z" level=info msg="Imagestream fuse-apicurito-generator watch event do upsert false; no errors in prep true, possibly update operator conditions true" time="2019-08-01T09:34:46Z" level=info msg="Imagestream fuse-apicurito-generator watch event do upsert false; no errors in prep true, possibly update operator conditions true" time="2019-08-01T09:34:46Z" level=info msg="imagestream fuse-apicurito-generator still not finished with its image imports, including " time="2019-08-01T09:44:46Z" level=info msg="Imagestream fuse-apicurito-generator watch event do upsert false; no errors in prep true, possibly update operator conditions true" time="2019-08-01T09:44:46Z" level=info msg="There are no more image imports in flight for imagestream fuse-apicurito-generator" time="2019-08-01T09:44:46Z" level=info msg="imagestream fuse-apicurito-generator still not finished with its image imports, including " time="2019-08-01T09:51:41Z" level=info msg="Imagestream fuse-apicurito-generator watch event do upsert false; no errors in prep true, possibly update operator conditions true" time="2019-08-01T09:51:41Z" level=info msg="imagestream fuse-apicurito-generator still not finished with its image imports, including " $ oc get images | grep fuse-apicurito-generator sha256:6ef87e9fc054e3e38954a663fef6feb0662711ebd128c363b32d02f8ec023510 registry.redhat.io/fuse7/fuse-apicurito-generator@sha256:6ef87e9fc054e3e38954a663fef6feb0662711ebd128c363b32d02f8ec023510 sha256:86e3cbe3e43c4048147c61bdf7b0d682f95753e4fb456ea8319302152a75073a registry.redhat.io/fuse7/fuse-apicurito-generator@sha256:86e3cbe3e43c4048147c61bdf7b0d682f95753e4fb456ea8319302152a75073a
XiuJuan - by "Do samples operator finalizer some times", do you mean you delete the config.samples object, perhaps a few times in a row?
@gabe yes, I delete the config.samples crd several times, then met the hanging issue. after 2h, the imagestreams finsihed import. $oc describe config.samples.operator | grep -A1 -B3 ImageChangesInProgress Last Transition Time: 2019-08-01T10:25:37Z Last Update Time: 2019-08-01T10:25:37Z Status: False Type: ImageChangesInProgress Last Transition Time: 2019-08-01T10:25:06Z
OK I may have reproduced this ... I at least reproduced an issue where successive deletes done quickly enough can hang on the last one, and one of the byproducts is that InProgress is still true. Any chance your last delete hung @XiuJuan ? Regardless minimally need to fix the thing I'm seeing.
Note the config obj when I'm hung / stuck in in progress == true: gmontero ~/ci-debug/4-2 $ oc get configs.samples -o yaml apiVersion: v1 items: - apiVersion: samples.operator.openshift.io/v1 kind: Config metadata: creationTimestamp: "2019-08-02T20:42:12Z" deletionGracePeriodSeconds: 0 deletionTimestamp: "2019-08-02T20:42:13Z" finalizers: - samples.operator.openshift.io/finalizer generation: 2 name: cluster resourceVersion: "98456" selfLink: /apis/samples.operator.openshift.io/v1/configs/cluster uid: 01390f91-b566-11e9-a3b3-02a744e6a3a4 spec: architectures: - x86_64 managementState: Managed status: conditions: - lastTransitionTime: "2019-08-02T20:42:15Z" lastUpdateTime: "2019-08-02T20:42:15Z" reason: 'jboss-fuse70-karaf-openshift jboss-webserver30-tomcat8-openshift rhdm73-kieserver-openshift redhat-sso71-openshift jboss-datavirt64-driver-openshift fuse7-java-openshift golang redhat-sso70-openshift jboss-amq-63 fis-karaf-openshift jboss-fuse70-java-openshift rhdm73-decisioncentral-openshift openjdk-11-rhel7 rhpam73-businesscentral-indexing-openshift jboss-webserver31-tomcat8-openshift jboss-webserver50-tomcat9-openshift fis-java-openshift rhpam73-businesscentral-monitoring-openshift rhpam73-businesscentral-openshift jboss-webserver31-tomcat7-openshift jboss-amq-62 jboss-eap70-openshift nginx perl redhat-sso73-openshift jboss-webserver30-tomcat7-openshift jboss-eap64-openshift fuse7-karaf-openshift mysql rhpam73-smartrouter-openshift jboss-datagrid72-openshift jboss-fuse70-eap-openshift redhat-openjdk18-openshift jenkins mariadb python jboss-datagrid65-client-openshift jboss-datagrid71-client-openshift dotnet-runtime jboss-eap72-openshift jboss-processserver64-openshift jboss-datagrid71-openshift jboss-datavirt64-openshift httpd php jboss-fuse70-console nodejs redis redhat-sso72-openshift rhdm73-decisioncentral-indexing-openshift jboss-decisionserver64-openshift eap-cd-openshift jenkins-agent-nodejs mongodb jenkins-agent-maven postgresql ruby dotnet fuse-apicurito-generator fuse7-console fuse7-eap-openshift rhdm73-optaweb-employee-rostering-openshift rhpam73-kieserver-openshift apicast-gateway jboss-datagrid73-openshift jboss-eap71-openshift apicurito-ui jboss-datagrid65-openshift java modern-webapp ' status: "True" type: ImageChangesInProgress - lastTransitionTime: "2019-08-02T20:42:15Z" lastUpdateTime: "2019-08-02T20:42:15Z" status: "False" type: SamplesExist - lastTransitionTime: "2019-08-02T20:42:15Z" lastUpdateTime: "2019-08-02T20:42:15Z" status: "False" type: ImportCredentialsExist - lastTransitionTime: "2019-08-02T20:42:15Z" lastUpdateTime: "2019-08-02T20:42:15Z" status: "True" type: ConfigurationValid - lastTransitionTime: "2019-08-02T20:42:15Z" lastUpdateTime: "2019-08-02T20:42:15Z" status: "False" type: RemovePending - lastTransitionTime: "2019-08-02T20:42:15Z" lastUpdateTime: "2019-08-02T20:42:15Z" status: "False" type: MigrationInProgress - lastTransitionTime: "2019-08-02T20:42:15Z" lastUpdateTime: "2019-08-02T20:42:15Z" status: "False" type: ImportImageErrorsExist kind: List metadata: resourceVersion: "" selfLink: ""
My first delete processed, and the next one that hung immediately restarted: time="2019-08-02T20:42:03Z" level=info msg="CRDUPDATE exist false update" time="2019-08-02T20:42:06Z" level=info msg="Initiating finalizer processing for a SampleResource delete attempt" time="2019-08-02T20:42:06Z" level=info msg="CRDUPDATE remove finalizer update" time="2019-08-02T20:42:09Z" level=info msg="A previous delete attempt has been successfully completed" time="2019-08-02T20:42:09Z" level=info msg="delete of Config recognized" time="2019-08-02T20:42:09Z" level=info msg="Copying secret pull-secret from the openshift-config namespace into the operator's namespace" time="2019-08-02T20:42:09Z" level=info msg="creating default Config" Issue might be related to https://github.com/openshift/cluster-samples-operator/blob/master/pkg/stub/handler.go#L467-L473 ... where we should return an error to initiate retry on the delete processing, where it eventually goes throug after the imports are done. Possibly could also augment https://github.com/openshift/cluster-samples-operator/blob/master/pkg/stub/handler.go#L153-L168 to abort image stream tracking As well as short circuit the upsert paths based on the delete timestamp being set.
Yeah, you find the rule.. I could reproduce this almost every time. Deleted config.samples.operator frequestly.Then will met this issue when deleted the config config.samples.operator when ImageChangesInProgres is not finished. Conducing ImageChangesInProgres hang on a long time.
OK thanks for the confirm @XiuJuan, and good that we are seeing the same thing I'll be working on the fix today ... I think I have a handle on it.
Fix is working for me locally/manually. Working on updating the e2e tests to cover this, and then a PR will be forthcoming.
@Gabe, Seem I failed to verified this issue. After several deletes config in continuous,the ImageChangesInProgress stuck again. time="2019-08-08T07:20:09Z" level=info msg="Received watch event imagestream/must-gather but not upserting since deletion of the Config is in progress" time="2019-08-08T07:20:09Z" level=info msg="Received watch event imagestream/jboss-webserver50-tomcat9-openshift but not upserting since deletion of the Config is in progress" time="2019-08-08T07:22:11Z" level=error msg="unable to sync: retry secret event because in the middle of an sample upsert cycle, requeuing" time="2019-08-08T07:22:12Z" level=error msg="unable to sync: retry secret event because in the middle of an sample upsert cycle, requeuing" time="2019-08-08T07:22:12Z" level=error msg="unable to sync: A delete attempt has come in while creating samples; initiating retry; creation loop should abort soon, requeuing" $ oc get co openshift-samples -o yaml | grep Progressing -B3 - lastTransitionTime: "2019-08-08T06:43:47Z" message: Samples processing to 4.2.0-0.nightly-2019-08-08-002434 status: "True" type: Progressing $ oc get co openshift-samples NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE openshift-samples 4.2.0-0.nightly-2019-08-08-002434 True True False 41m Checked in 4.2.0-0.nightly-2019-08-08-002434 payload.
@XiuJuan can you please attach the *FULL* operator pod logs, and the *FULL* configs.samples yaml, when you are able to reproduced this.
Created attachment 1601961 [details] samples operator log
Created attachment 1601962 [details] samples operator config
OK I see what is wrong now with the full data .... I did not hit the timing window exposed in my testing. Will have PR up later today my time.
Can't reproduce this issue any more in payload 4.2.0-0.nightly-2019-08-12-080504.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2019:2922