Bug 1737590 - ImageChangesInProgress stuck in true for long time
Summary: ImageChangesInProgress stuck in true for long time
Keywords:
Status: CLOSED NEXTRELEASE
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: ImageStreams
Version: 4.1.0
Hardware: Unspecified
OS: Unspecified
medium
low
Target Milestone: ---
: 4.1.z
Assignee: Gabe Montero
QA Contact: XiuJuan Wang
URL:
Whiteboard:
Depends On: 1735711
Blocks:
TreeView+ depends on / blocked
 
Reported: 2019-08-05 18:28 UTC by Gabe Montero
Modified: 2019-08-09 20:20 UTC (History)
6 users (show)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1735711
Environment:
Last Closed: 2019-08-09 20:20:11 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)

Description Gabe Montero 2019-08-05 18:28:46 UTC
+++ This bug was initially created as a clone of Bug #1735711 +++

Description of problem:
Do samples operator finalizer some times, the ImageChangesInProgress will stuck in true for long time(more than 1h).


Version-Release number of selected component (if applicable):
4.2.0-0.nightly-2019-08-01-035705

How reproducible:
sometimes

Steps to Reproduce:
1.Do samples operator finalizer some times.
2.Check samples operator crd.
3.

Actual results:

$ oc describe config.samples.operator | grep -A1 -B3 ImageChangesInProgress 
    Last Update Time:      2019-08-01T08:42:00Z
    Reason:                fuse-apicurito-generator jboss-processserver64-openshift redis rhpam73-kieserver-openshift jboss-webserver31-tomcat7-openshift jboss-webserver31-tomcat8-openshift jboss-decisionserver64-openshift fuse7-karaf-openshift jenkins-agent-nodejs nginx python rhdm73-decisioncentral-indexing-openshift jboss-datagrid72-openshift dotnet-runtime jenkins-agent-maven rhpam73-businesscentral-openshift jboss-amq-62 java openjdk-11-rhel7 fis-java-openshift jboss-fuse70-karaf-openshift httpd mariadb rhpam73-smartrouter-openshift jboss-datagrid71-client-openshift jboss-datagrid73-openshift dotnet 
    Status:                True
    Type:                  ImageChangesInProgress
    Last Transition Time:  2019-08-01T08:41:45Z

Expected results:
Should update the managed imagestreams fast.

Additional info:
$ oc logs -f cluster-samples-operator-54c57b54f7-mvtpz  | grep fuse-apicurito-generator 
time="2019-08-01T08:36:20Z" level=info msg="Imagestream fuse-apicurito-generator watch event do upsert false; no errors in prep true,  possibly update operator conditions true"
time="2019-08-01T08:41:22Z" level=info msg="one time ignoring of delete event for imagestream fuse-apicurito-generator as part of group delete"
time="2019-08-01T08:41:38Z" level=info msg="processing file fuse-apicurito-generator-rhel7.json from dir /opt/openshift/operator/ocp-x86_64/fis/imagestreams"
time="2019-08-01T08:41:39Z" level=info msg="created imagestream fuse-apicurito-generator"
time="2019-08-01T08:41:39Z" level=info msg="caching imagestream event fuse-apicurito-generator because we have not yet completed all the samples upserts"
time="2019-08-01T08:42:01Z" level=info msg="Imagestream fuse-apicurito-generator watch event do upsert false; no errors in prep true,  possibly update operator conditions true"
time="2019-08-01T08:42:03Z" level=info msg="There are no more image imports in flight for imagestream fuse-apicurito-generator"
time="2019-08-01T08:42:11Z" level=info msg="There are no more image imports in flight for imagestream fuse-apicurito-generator"
time="2019-08-01T08:46:20Z" level=info msg="imagestream fuse-apicurito-generator still not finished with its image imports, including "
time="2019-08-01T08:46:20Z" level=info msg="imagestream fuse-apicurito-generator still not finished with its image imports, including "
time="2019-08-01T08:46:20Z" level=info msg="imagestream fuse-apicurito-generator still not finished with its image imports, including "
time="2019-08-01T08:46:20Z" level=info msg="Imagestream fuse-apicurito-generator watch event do upsert false; no errors in prep true,  possibly update operator conditions true"
time="2019-08-01T08:51:53Z" level=info msg="Imagestream fuse-apicurito-generator watch event do upsert false; no errors in prep true,  possibly update operator conditions true"
time="2019-08-01T08:51:53Z" level=info msg="There are no more image imports in flight for imagestream fuse-apicurito-generator"
time="2019-08-01T08:51:53Z" level=info msg="imagestream fuse-apicurito-generator still not finished with its image imports, including "
time="2019-08-01T09:01:53Z" level=info msg="imagestream fuse-apicurito-generator still not finished with its image imports, including "
time="2019-08-01T09:01:53Z" level=info msg="imagestream fuse-apicurito-generator still not finished with its image imports, including "
time="2019-08-01T09:01:53Z" level=info msg="Imagestream fuse-apicurito-generator watch event do upsert false; no errors in prep true,  possibly update operator conditions true"
time="2019-08-01T09:05:59Z" level=info msg="Imagestream fuse-apicurito-generator watch event do upsert false; no errors in prep true,  possibly update operator conditions true"
time="2019-08-01T09:05:59Z" level=info msg="imagestream fuse-apicurito-generator still not finished with its image imports, including "
time="2019-08-01T09:13:28Z" level=info msg="Imagestream fuse-apicurito-generator watch event do upsert false; no errors in prep true,  possibly update operator conditions true"
time="2019-08-01T09:23:28Z" level=info msg="Imagestream fuse-apicurito-generator watch event do upsert false; no errors in prep true,  possibly update operator conditions true"
time="2019-08-01T09:23:28Z" level=info msg="imagestream fuse-apicurito-generator still not finished with its image imports, including "
time="2019-08-01T09:24:55Z" level=info msg="Imagestream fuse-apicurito-generator watch event do upsert false; no errors in prep true,  possibly update operator conditions true"
time="2019-08-01T09:34:46Z" level=info msg="Imagestream fuse-apicurito-generator watch event do upsert false; no errors in prep true,  possibly update operator conditions true"
time="2019-08-01T09:34:46Z" level=info msg="imagestream fuse-apicurito-generator still not finished with its image imports, including "
time="2019-08-01T09:44:46Z" level=info msg="Imagestream fuse-apicurito-generator watch event do upsert false; no errors in prep true,  possibly update operator conditions true"
time="2019-08-01T09:44:46Z" level=info msg="There are no more image imports in flight for imagestream fuse-apicurito-generator"
time="2019-08-01T09:44:46Z" level=info msg="imagestream fuse-apicurito-generator still not finished with its image imports, including "
time="2019-08-01T09:51:41Z" level=info msg="Imagestream fuse-apicurito-generator watch event do upsert false; no errors in prep true,  possibly update operator conditions true"
time="2019-08-01T09:51:41Z" level=info msg="imagestream fuse-apicurito-generator still not finished with its image imports, including "


$ oc get images  | grep fuse-apicurito-generator 
sha256:6ef87e9fc054e3e38954a663fef6feb0662711ebd128c363b32d02f8ec023510   registry.redhat.io/fuse7/fuse-apicurito-generator@sha256:6ef87e9fc054e3e38954a663fef6feb0662711ebd128c363b32d02f8ec023510
sha256:86e3cbe3e43c4048147c61bdf7b0d682f95753e4fb456ea8319302152a75073a   registry.redhat.io/fuse7/fuse-apicurito-generator@sha256:86e3cbe3e43c4048147c61bdf7b0d682f95753e4fb456ea8319302152a75073a

--- Additional comment from Gabe Montero on 2019-08-01 14:54:26 UTC ---

XiuJuan - by "Do samples operator finalizer some times", do you mean you delete the config.samples object, perhaps a few times in a row?

--- Additional comment from XiuJuan Wang on 2019-08-02 02:33:18 UTC ---

@gabe yes, I delete the config.samples crd several times, then met the hanging issue.

after 2h, the imagestreams finsihed import.
 
$oc describe config.samples.operator | grep -A1 -B3 ImageChangesInProgress  
    Last Transition Time:  2019-08-01T10:25:37Z
    Last Update Time:      2019-08-01T10:25:37Z
    Status:                False
    Type:                  ImageChangesInProgress
    Last Transition Time:  2019-08-01T10:25:06Z

--- Additional comment from Gabe Montero on 2019-08-02 20:48:12 UTC ---

OK I may have reproduced this ... I at least reproduced an issue where successive deletes done quickly enough can hang on the last one, and one of the byproducts is that InProgress is still true.

Any chance your last delete hung @XiuJuan ?

Regardless minimally need to fix the thing I'm seeing.

--- Additional comment from Gabe Montero on 2019-08-02 20:49:38 UTC ---

Note the config obj when I'm hung / stuck in in progress == true:

gmontero ~/ci-debug/4-2 $ oc get configs.samples -o yaml 
apiVersion: v1
items:
- apiVersion: samples.operator.openshift.io/v1
  kind: Config
  metadata:
    creationTimestamp: "2019-08-02T20:42:12Z"
    deletionGracePeriodSeconds: 0
    deletionTimestamp: "2019-08-02T20:42:13Z"
    finalizers:
    - samples.operator.openshift.io/finalizer
    generation: 2
    name: cluster
    resourceVersion: "98456"
    selfLink: /apis/samples.operator.openshift.io/v1/configs/cluster
    uid: 01390f91-b566-11e9-a3b3-02a744e6a3a4
  spec:
    architectures:
    - x86_64
    managementState: Managed
  status:
    conditions:
    - lastTransitionTime: "2019-08-02T20:42:15Z"
      lastUpdateTime: "2019-08-02T20:42:15Z"
      reason: 'jboss-fuse70-karaf-openshift jboss-webserver30-tomcat8-openshift rhdm73-kieserver-openshift
        redhat-sso71-openshift jboss-datavirt64-driver-openshift fuse7-java-openshift
        golang redhat-sso70-openshift jboss-amq-63 fis-karaf-openshift jboss-fuse70-java-openshift
        rhdm73-decisioncentral-openshift openjdk-11-rhel7 rhpam73-businesscentral-indexing-openshift
        jboss-webserver31-tomcat8-openshift jboss-webserver50-tomcat9-openshift fis-java-openshift
        rhpam73-businesscentral-monitoring-openshift rhpam73-businesscentral-openshift
        jboss-webserver31-tomcat7-openshift jboss-amq-62 jboss-eap70-openshift nginx
        perl redhat-sso73-openshift jboss-webserver30-tomcat7-openshift jboss-eap64-openshift
        fuse7-karaf-openshift mysql rhpam73-smartrouter-openshift jboss-datagrid72-openshift
        jboss-fuse70-eap-openshift redhat-openjdk18-openshift jenkins mariadb python
        jboss-datagrid65-client-openshift jboss-datagrid71-client-openshift dotnet-runtime
        jboss-eap72-openshift jboss-processserver64-openshift jboss-datagrid71-openshift
        jboss-datavirt64-openshift httpd php jboss-fuse70-console nodejs redis redhat-sso72-openshift
        rhdm73-decisioncentral-indexing-openshift jboss-decisionserver64-openshift
        eap-cd-openshift jenkins-agent-nodejs mongodb jenkins-agent-maven postgresql
        ruby dotnet fuse-apicurito-generator fuse7-console fuse7-eap-openshift rhdm73-optaweb-employee-rostering-openshift
        rhpam73-kieserver-openshift apicast-gateway jboss-datagrid73-openshift jboss-eap71-openshift
        apicurito-ui jboss-datagrid65-openshift java modern-webapp '
      status: "True"
      type: ImageChangesInProgress
    - lastTransitionTime: "2019-08-02T20:42:15Z"
      lastUpdateTime: "2019-08-02T20:42:15Z"
      status: "False"
      type: SamplesExist
    - lastTransitionTime: "2019-08-02T20:42:15Z"
      lastUpdateTime: "2019-08-02T20:42:15Z"
      status: "False"
      type: ImportCredentialsExist
    - lastTransitionTime: "2019-08-02T20:42:15Z"
      lastUpdateTime: "2019-08-02T20:42:15Z"
      status: "True"
      type: ConfigurationValid
    - lastTransitionTime: "2019-08-02T20:42:15Z"
      lastUpdateTime: "2019-08-02T20:42:15Z"
      status: "False"
      type: RemovePending
    - lastTransitionTime: "2019-08-02T20:42:15Z"
      lastUpdateTime: "2019-08-02T20:42:15Z"
      status: "False"
      type: MigrationInProgress
    - lastTransitionTime: "2019-08-02T20:42:15Z"
      lastUpdateTime: "2019-08-02T20:42:15Z"
      status: "False"
      type: ImportImageErrorsExist
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""

--- Additional comment from Gabe Montero on 2019-08-02 21:01:51 UTC ---

My first delete processed, and the next one that hung immediately restarted:

time="2019-08-02T20:42:03Z" level=info msg="CRDUPDATE exist false update"
time="2019-08-02T20:42:06Z" level=info msg="Initiating finalizer processing for a SampleResource delete attempt"
time="2019-08-02T20:42:06Z" level=info msg="CRDUPDATE remove finalizer update"
time="2019-08-02T20:42:09Z" level=info msg="A previous delete attempt has been successfully completed"
time="2019-08-02T20:42:09Z" level=info msg="delete of Config recognized"
time="2019-08-02T20:42:09Z" level=info msg="Copying secret pull-secret from the openshift-config namespace into the operator's namespace"
time="2019-08-02T20:42:09Z" level=info msg="creating default Config"


Issue might be related to https://github.com/openshift/cluster-samples-operator/blob/master/pkg/stub/handler.go#L467-L473 ... where we should return an error to initiate retry on the delete processing, where it eventually goes throug after the imports are done.

Possibly could also augment https://github.com/openshift/cluster-samples-operator/blob/master/pkg/stub/handler.go#L153-L168 to abort image stream tracking

As well as short circuit the upsert paths based on the delete timestamp being set.

--- Additional comment from XiuJuan Wang on 2019-08-05 08:39:43 UTC ---

Yeah, you find the rule..
I could reproduce this almost every time. 
Deleted config.samples.operator frequestly.Then will met this issue when deleted the config config.samples.operator when ImageChangesInProgres is not finished. Conducing ImageChangesInProgres hang on a long time.

--- Additional comment from Gabe Montero on 2019-08-05 13:31:26 UTC ---

OK thanks for the confirm @XiuJuan, and good that we are seeing the same thing

I'll be working on the fix today ... I think I have a handle on it.

--- Additional comment from Gabe Montero on 2019-08-05 17:32:49 UTC ---

Fix is working for me locally/manually.

Working on updating the e2e tests to cover this, and then a PR will be forthcoming.

Comment 1 Gabe Montero 2019-08-05 18:30:27 UTC
waiting on cherrypick bot to create 4.1 PR once the 4.2 PR merges

Comment 3 Gabe Montero 2019-08-07 14:36:37 UTC
cherrypick bot could not merge ... had to manually pick, reconcile, and create 4.1 PR

Comment 4 Gabe Montero 2019-08-09 20:20:11 UTC
switching gears here based on conversations regarding other 4.1 bugs ... unless this is specifically reported by a customer here, not going to bother backporting.


Note You need to log in before you can comment on or make changes to this bug.