Bug 1735711 - ImageChangesInProgress stuck in true for long time
Summary: ImageChangesInProgress stuck in true for long time
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: ImageStreams
Version: 4.1.0
Hardware: Unspecified
OS: Unspecified
medium
low
Target Milestone: ---
: 4.2.0
Assignee: Gabe Montero
QA Contact: XiuJuan Wang
URL:
Whiteboard:
Depends On:
Blocks: 1737590
TreeView+ depends on / blocked
 
Reported: 2019-08-01 10:02 UTC by XiuJuan Wang
Modified: 2019-10-16 06:34 UTC (History)
5 users (show)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: quick, successive, deletes of the samples operator config object could lead to the last delete hanging and the operator stuck in its ImageChangesInProgress condition stuck in true, which results in the clusteroperator object for the samples operator being stuck in Progressing==True Consequence: Indeterminate state for cluster samples Fix: Corrections to the coordination between the delete finalizer and samples upsert were introduced Result: quick, successive deletes of the samples operator config object work as expected.
Clone Of:
: 1737590 (view as bug list)
Environment:
Last Closed: 2019-10-16 06:34:15 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
samples operator log (319.40 KB, text/plain)
2019-08-09 02:19 UTC, XiuJuan Wang
no flags Details
samples operator config (4.36 KB, text/plain)
2019-08-09 02:20 UTC, XiuJuan Wang
no flags Details


Links
System ID Private Priority Status Summary Last Updated
Github openshift cluster-samples-operator pull 163 0 None closed Bug 1735711: lock down timing windows when deletes of the config object come in du… 2021-01-13 13:27:25 UTC
Github openshift cluster-samples-operator pull 169 0 None closed Bug 1735711: another concurrent delete/images still importing fix - cover window a… 2021-01-13 13:27:25 UTC
Red Hat Product Errata RHBA-2019:2922 0 None None None 2019-10-16 06:34:31 UTC

Description XiuJuan Wang 2019-08-01 10:02:47 UTC
Description of problem:
Do samples operator finalizer some times, the ImageChangesInProgress will stuck in true for long time(more than 1h).


Version-Release number of selected component (if applicable):
4.2.0-0.nightly-2019-08-01-035705

How reproducible:
sometimes

Steps to Reproduce:
1.Do samples operator finalizer some times.
2.Check samples operator crd.
3.

Actual results:

$ oc describe config.samples.operator | grep -A1 -B3 ImageChangesInProgress 
    Last Update Time:      2019-08-01T08:42:00Z
    Reason:                fuse-apicurito-generator jboss-processserver64-openshift redis rhpam73-kieserver-openshift jboss-webserver31-tomcat7-openshift jboss-webserver31-tomcat8-openshift jboss-decisionserver64-openshift fuse7-karaf-openshift jenkins-agent-nodejs nginx python rhdm73-decisioncentral-indexing-openshift jboss-datagrid72-openshift dotnet-runtime jenkins-agent-maven rhpam73-businesscentral-openshift jboss-amq-62 java openjdk-11-rhel7 fis-java-openshift jboss-fuse70-karaf-openshift httpd mariadb rhpam73-smartrouter-openshift jboss-datagrid71-client-openshift jboss-datagrid73-openshift dotnet 
    Status:                True
    Type:                  ImageChangesInProgress
    Last Transition Time:  2019-08-01T08:41:45Z

Expected results:
Should update the managed imagestreams fast.

Additional info:
$ oc logs -f cluster-samples-operator-54c57b54f7-mvtpz  | grep fuse-apicurito-generator 
time="2019-08-01T08:36:20Z" level=info msg="Imagestream fuse-apicurito-generator watch event do upsert false; no errors in prep true,  possibly update operator conditions true"
time="2019-08-01T08:41:22Z" level=info msg="one time ignoring of delete event for imagestream fuse-apicurito-generator as part of group delete"
time="2019-08-01T08:41:38Z" level=info msg="processing file fuse-apicurito-generator-rhel7.json from dir /opt/openshift/operator/ocp-x86_64/fis/imagestreams"
time="2019-08-01T08:41:39Z" level=info msg="created imagestream fuse-apicurito-generator"
time="2019-08-01T08:41:39Z" level=info msg="caching imagestream event fuse-apicurito-generator because we have not yet completed all the samples upserts"
time="2019-08-01T08:42:01Z" level=info msg="Imagestream fuse-apicurito-generator watch event do upsert false; no errors in prep true,  possibly update operator conditions true"
time="2019-08-01T08:42:03Z" level=info msg="There are no more image imports in flight for imagestream fuse-apicurito-generator"
time="2019-08-01T08:42:11Z" level=info msg="There are no more image imports in flight for imagestream fuse-apicurito-generator"
time="2019-08-01T08:46:20Z" level=info msg="imagestream fuse-apicurito-generator still not finished with its image imports, including "
time="2019-08-01T08:46:20Z" level=info msg="imagestream fuse-apicurito-generator still not finished with its image imports, including "
time="2019-08-01T08:46:20Z" level=info msg="imagestream fuse-apicurito-generator still not finished with its image imports, including "
time="2019-08-01T08:46:20Z" level=info msg="Imagestream fuse-apicurito-generator watch event do upsert false; no errors in prep true,  possibly update operator conditions true"
time="2019-08-01T08:51:53Z" level=info msg="Imagestream fuse-apicurito-generator watch event do upsert false; no errors in prep true,  possibly update operator conditions true"
time="2019-08-01T08:51:53Z" level=info msg="There are no more image imports in flight for imagestream fuse-apicurito-generator"
time="2019-08-01T08:51:53Z" level=info msg="imagestream fuse-apicurito-generator still not finished with its image imports, including "
time="2019-08-01T09:01:53Z" level=info msg="imagestream fuse-apicurito-generator still not finished with its image imports, including "
time="2019-08-01T09:01:53Z" level=info msg="imagestream fuse-apicurito-generator still not finished with its image imports, including "
time="2019-08-01T09:01:53Z" level=info msg="Imagestream fuse-apicurito-generator watch event do upsert false; no errors in prep true,  possibly update operator conditions true"
time="2019-08-01T09:05:59Z" level=info msg="Imagestream fuse-apicurito-generator watch event do upsert false; no errors in prep true,  possibly update operator conditions true"
time="2019-08-01T09:05:59Z" level=info msg="imagestream fuse-apicurito-generator still not finished with its image imports, including "
time="2019-08-01T09:13:28Z" level=info msg="Imagestream fuse-apicurito-generator watch event do upsert false; no errors in prep true,  possibly update operator conditions true"
time="2019-08-01T09:23:28Z" level=info msg="Imagestream fuse-apicurito-generator watch event do upsert false; no errors in prep true,  possibly update operator conditions true"
time="2019-08-01T09:23:28Z" level=info msg="imagestream fuse-apicurito-generator still not finished with its image imports, including "
time="2019-08-01T09:24:55Z" level=info msg="Imagestream fuse-apicurito-generator watch event do upsert false; no errors in prep true,  possibly update operator conditions true"
time="2019-08-01T09:34:46Z" level=info msg="Imagestream fuse-apicurito-generator watch event do upsert false; no errors in prep true,  possibly update operator conditions true"
time="2019-08-01T09:34:46Z" level=info msg="imagestream fuse-apicurito-generator still not finished with its image imports, including "
time="2019-08-01T09:44:46Z" level=info msg="Imagestream fuse-apicurito-generator watch event do upsert false; no errors in prep true,  possibly update operator conditions true"
time="2019-08-01T09:44:46Z" level=info msg="There are no more image imports in flight for imagestream fuse-apicurito-generator"
time="2019-08-01T09:44:46Z" level=info msg="imagestream fuse-apicurito-generator still not finished with its image imports, including "
time="2019-08-01T09:51:41Z" level=info msg="Imagestream fuse-apicurito-generator watch event do upsert false; no errors in prep true,  possibly update operator conditions true"
time="2019-08-01T09:51:41Z" level=info msg="imagestream fuse-apicurito-generator still not finished with its image imports, including "


$ oc get images  | grep fuse-apicurito-generator 
sha256:6ef87e9fc054e3e38954a663fef6feb0662711ebd128c363b32d02f8ec023510   registry.redhat.io/fuse7/fuse-apicurito-generator@sha256:6ef87e9fc054e3e38954a663fef6feb0662711ebd128c363b32d02f8ec023510
sha256:86e3cbe3e43c4048147c61bdf7b0d682f95753e4fb456ea8319302152a75073a   registry.redhat.io/fuse7/fuse-apicurito-generator@sha256:86e3cbe3e43c4048147c61bdf7b0d682f95753e4fb456ea8319302152a75073a

Comment 1 Gabe Montero 2019-08-01 14:54:26 UTC
XiuJuan - by "Do samples operator finalizer some times", do you mean you delete the config.samples object, perhaps a few times in a row?

Comment 2 XiuJuan Wang 2019-08-02 02:33:18 UTC
@gabe yes, I delete the config.samples crd several times, then met the hanging issue.

after 2h, the imagestreams finsihed import.
 
$oc describe config.samples.operator | grep -A1 -B3 ImageChangesInProgress  
    Last Transition Time:  2019-08-01T10:25:37Z
    Last Update Time:      2019-08-01T10:25:37Z
    Status:                False
    Type:                  ImageChangesInProgress
    Last Transition Time:  2019-08-01T10:25:06Z

Comment 3 Gabe Montero 2019-08-02 20:48:12 UTC
OK I may have reproduced this ... I at least reproduced an issue where successive deletes done quickly enough can hang on the last one, and one of the byproducts is that InProgress is still true.

Any chance your last delete hung @XiuJuan ?

Regardless minimally need to fix the thing I'm seeing.

Comment 4 Gabe Montero 2019-08-02 20:49:38 UTC
Note the config obj when I'm hung / stuck in in progress == true:

gmontero ~/ci-debug/4-2 $ oc get configs.samples -o yaml 
apiVersion: v1
items:
- apiVersion: samples.operator.openshift.io/v1
  kind: Config
  metadata:
    creationTimestamp: "2019-08-02T20:42:12Z"
    deletionGracePeriodSeconds: 0
    deletionTimestamp: "2019-08-02T20:42:13Z"
    finalizers:
    - samples.operator.openshift.io/finalizer
    generation: 2
    name: cluster
    resourceVersion: "98456"
    selfLink: /apis/samples.operator.openshift.io/v1/configs/cluster
    uid: 01390f91-b566-11e9-a3b3-02a744e6a3a4
  spec:
    architectures:
    - x86_64
    managementState: Managed
  status:
    conditions:
    - lastTransitionTime: "2019-08-02T20:42:15Z"
      lastUpdateTime: "2019-08-02T20:42:15Z"
      reason: 'jboss-fuse70-karaf-openshift jboss-webserver30-tomcat8-openshift rhdm73-kieserver-openshift
        redhat-sso71-openshift jboss-datavirt64-driver-openshift fuse7-java-openshift
        golang redhat-sso70-openshift jboss-amq-63 fis-karaf-openshift jboss-fuse70-java-openshift
        rhdm73-decisioncentral-openshift openjdk-11-rhel7 rhpam73-businesscentral-indexing-openshift
        jboss-webserver31-tomcat8-openshift jboss-webserver50-tomcat9-openshift fis-java-openshift
        rhpam73-businesscentral-monitoring-openshift rhpam73-businesscentral-openshift
        jboss-webserver31-tomcat7-openshift jboss-amq-62 jboss-eap70-openshift nginx
        perl redhat-sso73-openshift jboss-webserver30-tomcat7-openshift jboss-eap64-openshift
        fuse7-karaf-openshift mysql rhpam73-smartrouter-openshift jboss-datagrid72-openshift
        jboss-fuse70-eap-openshift redhat-openjdk18-openshift jenkins mariadb python
        jboss-datagrid65-client-openshift jboss-datagrid71-client-openshift dotnet-runtime
        jboss-eap72-openshift jboss-processserver64-openshift jboss-datagrid71-openshift
        jboss-datavirt64-openshift httpd php jboss-fuse70-console nodejs redis redhat-sso72-openshift
        rhdm73-decisioncentral-indexing-openshift jboss-decisionserver64-openshift
        eap-cd-openshift jenkins-agent-nodejs mongodb jenkins-agent-maven postgresql
        ruby dotnet fuse-apicurito-generator fuse7-console fuse7-eap-openshift rhdm73-optaweb-employee-rostering-openshift
        rhpam73-kieserver-openshift apicast-gateway jboss-datagrid73-openshift jboss-eap71-openshift
        apicurito-ui jboss-datagrid65-openshift java modern-webapp '
      status: "True"
      type: ImageChangesInProgress
    - lastTransitionTime: "2019-08-02T20:42:15Z"
      lastUpdateTime: "2019-08-02T20:42:15Z"
      status: "False"
      type: SamplesExist
    - lastTransitionTime: "2019-08-02T20:42:15Z"
      lastUpdateTime: "2019-08-02T20:42:15Z"
      status: "False"
      type: ImportCredentialsExist
    - lastTransitionTime: "2019-08-02T20:42:15Z"
      lastUpdateTime: "2019-08-02T20:42:15Z"
      status: "True"
      type: ConfigurationValid
    - lastTransitionTime: "2019-08-02T20:42:15Z"
      lastUpdateTime: "2019-08-02T20:42:15Z"
      status: "False"
      type: RemovePending
    - lastTransitionTime: "2019-08-02T20:42:15Z"
      lastUpdateTime: "2019-08-02T20:42:15Z"
      status: "False"
      type: MigrationInProgress
    - lastTransitionTime: "2019-08-02T20:42:15Z"
      lastUpdateTime: "2019-08-02T20:42:15Z"
      status: "False"
      type: ImportImageErrorsExist
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""

Comment 5 Gabe Montero 2019-08-02 21:01:51 UTC
My first delete processed, and the next one that hung immediately restarted:

time="2019-08-02T20:42:03Z" level=info msg="CRDUPDATE exist false update"
time="2019-08-02T20:42:06Z" level=info msg="Initiating finalizer processing for a SampleResource delete attempt"
time="2019-08-02T20:42:06Z" level=info msg="CRDUPDATE remove finalizer update"
time="2019-08-02T20:42:09Z" level=info msg="A previous delete attempt has been successfully completed"
time="2019-08-02T20:42:09Z" level=info msg="delete of Config recognized"
time="2019-08-02T20:42:09Z" level=info msg="Copying secret pull-secret from the openshift-config namespace into the operator's namespace"
time="2019-08-02T20:42:09Z" level=info msg="creating default Config"


Issue might be related to https://github.com/openshift/cluster-samples-operator/blob/master/pkg/stub/handler.go#L467-L473 ... where we should return an error to initiate retry on the delete processing, where it eventually goes throug after the imports are done.

Possibly could also augment https://github.com/openshift/cluster-samples-operator/blob/master/pkg/stub/handler.go#L153-L168 to abort image stream tracking

As well as short circuit the upsert paths based on the delete timestamp being set.

Comment 6 XiuJuan Wang 2019-08-05 08:39:43 UTC
Yeah, you find the rule..
I could reproduce this almost every time. 
Deleted config.samples.operator frequestly.Then will met this issue when deleted the config config.samples.operator when ImageChangesInProgres is not finished. Conducing ImageChangesInProgres hang on a long time.

Comment 7 Gabe Montero 2019-08-05 13:31:26 UTC
OK thanks for the confirm @XiuJuan, and good that we are seeing the same thing

I'll be working on the fix today ... I think I have a handle on it.

Comment 8 Gabe Montero 2019-08-05 17:32:49 UTC
Fix is working for me locally/manually.

Working on updating the e2e tests to cover this, and then a PR will be forthcoming.

Comment 10 XiuJuan Wang 2019-08-08 07:25:15 UTC
@Gabe, Seem I failed to verified this issue.
After several deletes config in continuous,the ImageChangesInProgress stuck again.

time="2019-08-08T07:20:09Z" level=info msg="Received watch event imagestream/must-gather but not upserting since deletion of the Config is in progress"
time="2019-08-08T07:20:09Z" level=info msg="Received watch event imagestream/jboss-webserver50-tomcat9-openshift but not upserting since deletion of the Config is in progress"
time="2019-08-08T07:22:11Z" level=error msg="unable to sync: retry secret event because in the middle of an sample upsert cycle, requeuing"
time="2019-08-08T07:22:12Z" level=error msg="unable to sync: retry secret event because in the middle of an sample upsert cycle, requeuing"
time="2019-08-08T07:22:12Z" level=error msg="unable to sync: A delete attempt has come in while creating samples; initiating retry; creation loop should abort soon, requeuing"

$ oc get co openshift-samples -o yaml | grep Progressing -B3
  - lastTransitionTime: "2019-08-08T06:43:47Z"
    message: Samples processing to 4.2.0-0.nightly-2019-08-08-002434
    status: "True"
    type: Progressing
$ oc get co openshift-samples 
NAME                VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
openshift-samples   4.2.0-0.nightly-2019-08-08-002434   True        True          False      41m

Checked in 4.2.0-0.nightly-2019-08-08-002434 payload.

Comment 11 Gabe Montero 2019-08-08 13:08:32 UTC
@XiuJuan can you please attach the *FULL* operator pod logs, and the *FULL* configs.samples yaml, when you are able to reproduced this.

Comment 12 XiuJuan Wang 2019-08-09 02:19:32 UTC
Created attachment 1601961 [details]
samples operator log

Comment 13 XiuJuan Wang 2019-08-09 02:20:03 UTC
Created attachment 1601962 [details]
samples operator config

Comment 14 Gabe Montero 2019-08-09 12:13:13 UTC
OK I see what is wrong now with the full data .... I did not hit the timing window exposed in my testing.

Will have PR up later today my time.

Comment 16 XiuJuan Wang 2019-08-12 13:14:00 UTC
Can't reproduce this issue any more in payload 4.2.0-0.nightly-2019-08-12-080504.

Comment 17 errata-xmlrpc 2019-10-16 06:34:15 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:2922


Note You need to log in before you can comment on or make changes to this bug.