Bug 1735711
| Summary: | ImageChangesInProgress stuck in true for long time | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | XiuJuan Wang <xiuwang> |
| Component: | ImageStreams | Assignee: | Gabe Montero <gmontero> |
| Status: | CLOSED ERRATA | QA Contact: | XiuJuan Wang <xiuwang> |
| Severity: | low | Docs Contact: | |
| Priority: | medium | | |
| Version: | 4.1.0 | CC: | adam.kaplan, aos-bugs, gmontero, jokerman, wzheng |
| Target Milestone: | --- | | |
| Target Release: | 4.2.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | Cause: Quick, successive deletes of the samples operator Config object could leave the last delete hanging, with the operator's ImageChangesInProgress condition stuck at True; this in turn left the samples ClusterOperator object stuck at Progressing==True. Consequence: Indeterminate state for cluster samples. Fix: Corrections to the coordination between the delete finalizer and samples upsert were introduced. Result: Quick, successive deletes of the samples operator Config object work as expected. | | |
| Story Points: | --- | | |
| Clone Of: | | | |
| : | 1737590 (view as bug list) | Environment: | |
| Last Closed: | 2019-10-16 06:34:15 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1737590 | | |
| Attachments: | samples operator log, samples operator config | | |

XiuJuan - by "Do samples operator finalizer some times", do you mean you delete the config.samples object, perhaps a few times in a row?

@gabe yes, I deleted the config.samples CRD several times, then hit the hanging issue.

After about 2 hours, the imagestreams finished importing:

$ oc describe config.samples.operator | grep -A1 -B3 ImageChangesInProgress
Last Transition Time: 2019-08-01T10:25:37Z
Last Update Time: 2019-08-01T10:25:37Z
Status: False
Type: ImageChangesInProgress
Last Transition Time: 2019-08-01T10:25:06Z
OK, I may have reproduced this ... at least I reproduced an issue where successive deletes done quickly enough can hang on the last one, and one of the byproducts is that ImageChangesInProgress is still true. Any chance your last delete hung, @XiuJuan? Regardless, at a minimum I need to fix the thing I'm seeing. Note the Config object while I'm hung / stuck at InProgress == true (see the note after the YAML on why the finalizer blocks the delete):
gmontero ~/ci-debug/4-2 $ oc get configs.samples -o yaml
apiVersion: v1
items:
- apiVersion: samples.operator.openshift.io/v1
kind: Config
metadata:
creationTimestamp: "2019-08-02T20:42:12Z"
deletionGracePeriodSeconds: 0
deletionTimestamp: "2019-08-02T20:42:13Z"
finalizers:
- samples.operator.openshift.io/finalizer
generation: 2
name: cluster
resourceVersion: "98456"
selfLink: /apis/samples.operator.openshift.io/v1/configs/cluster
uid: 01390f91-b566-11e9-a3b3-02a744e6a3a4
spec:
architectures:
- x86_64
managementState: Managed
status:
conditions:
- lastTransitionTime: "2019-08-02T20:42:15Z"
lastUpdateTime: "2019-08-02T20:42:15Z"
reason: 'jboss-fuse70-karaf-openshift jboss-webserver30-tomcat8-openshift rhdm73-kieserver-openshift
redhat-sso71-openshift jboss-datavirt64-driver-openshift fuse7-java-openshift
golang redhat-sso70-openshift jboss-amq-63 fis-karaf-openshift jboss-fuse70-java-openshift
rhdm73-decisioncentral-openshift openjdk-11-rhel7 rhpam73-businesscentral-indexing-openshift
jboss-webserver31-tomcat8-openshift jboss-webserver50-tomcat9-openshift fis-java-openshift
rhpam73-businesscentral-monitoring-openshift rhpam73-businesscentral-openshift
jboss-webserver31-tomcat7-openshift jboss-amq-62 jboss-eap70-openshift nginx
perl redhat-sso73-openshift jboss-webserver30-tomcat7-openshift jboss-eap64-openshift
fuse7-karaf-openshift mysql rhpam73-smartrouter-openshift jboss-datagrid72-openshift
jboss-fuse70-eap-openshift redhat-openjdk18-openshift jenkins mariadb python
jboss-datagrid65-client-openshift jboss-datagrid71-client-openshift dotnet-runtime
jboss-eap72-openshift jboss-processserver64-openshift jboss-datagrid71-openshift
jboss-datavirt64-openshift httpd php jboss-fuse70-console nodejs redis redhat-sso72-openshift
rhdm73-decisioncentral-indexing-openshift jboss-decisionserver64-openshift
eap-cd-openshift jenkins-agent-nodejs mongodb jenkins-agent-maven postgresql
ruby dotnet fuse-apicurito-generator fuse7-console fuse7-eap-openshift rhdm73-optaweb-employee-rostering-openshift
rhpam73-kieserver-openshift apicast-gateway jboss-datagrid73-openshift jboss-eap71-openshift
apicurito-ui jboss-datagrid65-openshift java modern-webapp '
status: "True"
type: ImageChangesInProgress
- lastTransitionTime: "2019-08-02T20:42:15Z"
lastUpdateTime: "2019-08-02T20:42:15Z"
status: "False"
type: SamplesExist
- lastTransitionTime: "2019-08-02T20:42:15Z"
lastUpdateTime: "2019-08-02T20:42:15Z"
status: "False"
type: ImportCredentialsExist
- lastTransitionTime: "2019-08-02T20:42:15Z"
lastUpdateTime: "2019-08-02T20:42:15Z"
status: "True"
type: ConfigurationValid
- lastTransitionTime: "2019-08-02T20:42:15Z"
lastUpdateTime: "2019-08-02T20:42:15Z"
status: "False"
type: RemovePending
- lastTransitionTime: "2019-08-02T20:42:15Z"
lastUpdateTime: "2019-08-02T20:42:15Z"
status: "False"
type: MigrationInProgress
- lastTransitionTime: "2019-08-02T20:42:15Z"
lastUpdateTime: "2019-08-02T20:42:15Z"
status: "False"
type: ImportImageErrorsExist
kind: List
metadata:
resourceVersion: ""
selfLink: ""
My first delete processed, and the next one that hung immediately restarted:

time="2019-08-02T20:42:03Z" level=info msg="CRDUPDATE exist false update"
time="2019-08-02T20:42:06Z" level=info msg="Initiating finalizer processing for a SampleResource delete attempt"
time="2019-08-02T20:42:06Z" level=info msg="CRDUPDATE remove finalizer update"
time="2019-08-02T20:42:09Z" level=info msg="A previous delete attempt has been successfully completed"
time="2019-08-02T20:42:09Z" level=info msg="delete of Config recognized"
time="2019-08-02T20:42:09Z" level=info msg="Copying secret pull-secret from the openshift-config namespace into the operator's namespace"
time="2019-08-02T20:42:09Z" level=info msg="creating default Config"

The issue might be related to https://github.com/openshift/cluster-samples-operator/blob/master/pkg/stub/handler.go#L467-L473 ... where we should return an error to initiate a retry of the delete processing, so that it eventually goes through after the imports are done. We could possibly also augment https://github.com/openshift/cluster-samples-operator/blob/master/pkg/stub/handler.go#L153-L168 to abort image stream tracking, as well as short-circuit the upsert paths based on the delete timestamp being set (a rough sketch of both ideas follows after this exchange).

Yeah, you found the rule. I could reproduce this almost every time: delete config.samples.operator frequently, and the issue shows up whenever the config is deleted while ImageChangesInProgress has not finished, causing ImageChangesInProgress to hang for a long time.

OK, thanks for the confirmation @XiuJuan, and good that we are seeing the same thing. I'll be working on the fix today ... I think I have a handle on it.

Fix is working for me locally/manually. Working on updating the e2e tests to cover this, and then a PR will be forthcoming.
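To make the proposed direction concrete, here is a minimal, hypothetical Go sketch of the two ideas from the analysis above: returning an error from delete/finalizer processing while image imports are still in flight so the controller retries it, and short-circuiting imagestream upserts once the Config carries a deletionTimestamp. The Config struct, its InProgress field, and both functions are illustrative stand-ins, not the actual code in pkg/stub/handler.go.

```go
package main

import (
	"errors"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// Config is a simplified stand-in for the samples operator's Config object.
type Config struct {
	metav1.ObjectMeta
	InProgress bool // stand-in for the ImageChangesInProgress condition
}

// errRetryDelete signals the controller to requeue the delete event instead
// of completing finalizer processing while an upsert/import cycle is running.
var errRetryDelete = errors.New("image imports still in flight; retrying delete processing")

// handleDelete sketches the finalizer path: while an upsert cycle is in
// progress, return an error so the event is retried; otherwise clean up and
// let the delete complete.
func handleDelete(cfg *Config) error {
	if cfg.InProgress {
		return errRetryDelete
	}
	// ... clean up sample imagestreams/templates, then remove the finalizer ...
	return nil
}

// handleImageStreamEvent sketches the upsert short circuit: once deletion of
// the Config is pending, imagestream watch events no longer trigger upserts.
func handleImageStreamEvent(cfg *Config, isName string) {
	if cfg.DeletionTimestamp != nil {
		fmt.Printf("not upserting %s: deletion of the Config is in progress\n", isName)
		return
	}
	// ... normal upsert/tracking path ...
}

func main() {
	now := metav1.Now()
	cfg := &Config{
		ObjectMeta: metav1.ObjectMeta{Name: "cluster", DeletionTimestamp: &now},
		InProgress: true,
	}
	handleImageStreamEvent(cfg, "jenkins")
	fmt.Println(handleDelete(cfg)) // non-nil error -> event requeued and retried later
}
```

The point of returning an error rather than silently skipping is that the controller's retry machinery keeps re-driving the delete until the imports finish and the finalizer can finally be removed.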
@Gabe, it seems I failed to verify this issue. After several deletes of the config in quick succession, ImageChangesInProgress got stuck again:
time="2019-08-08T07:20:09Z" level=info msg="Received watch event imagestream/must-gather but not upserting since deletion of the Config is in progress"
time="2019-08-08T07:20:09Z" level=info msg="Received watch event imagestream/jboss-webserver50-tomcat9-openshift but not upserting since deletion of the Config is in progress"
time="2019-08-08T07:22:11Z" level=error msg="unable to sync: retry secret event because in the middle of an sample upsert cycle, requeuing"
time="2019-08-08T07:22:12Z" level=error msg="unable to sync: retry secret event because in the middle of an sample upsert cycle, requeuing"
time="2019-08-08T07:22:12Z" level=error msg="unable to sync: A delete attempt has come in while creating samples; initiating retry; creation loop should abort soon, requeuing"
$ oc get co openshift-samples -o yaml | grep Progressing -B3
- lastTransitionTime: "2019-08-08T06:43:47Z"
message: Samples processing to 4.2.0-0.nightly-2019-08-08-002434
status: "True"
type: Progressing
$ oc get co openshift-samples
NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE
openshift-samples 4.2.0-0.nightly-2019-08-08-002434 True True False 41m
Checked in the 4.2.0-0.nightly-2019-08-08-002434 payload.
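For context on the "unable to sync: ..., requeuing" lines above: they are the visible side of an error-driven retry loop. The sketch below shows the generic client-go workqueue pattern that produces this behavior when a sync handler returns an error; it is an illustration of the pattern under assumed names (the sync function and the queue key are hypothetical), not the samples operator's actual controller code.

```go
package main

import (
	"fmt"
	"time"

	"k8s.io/client-go/util/workqueue"
)

// sync is a hypothetical handler; returning an error means "retry this event".
func sync(key string) error {
	return fmt.Errorf("retry secret event because in the middle of a sample upsert cycle")
}

func worker(queue workqueue.RateLimitingInterface) {
	for {
		key, shutdown := queue.Get()
		if shutdown {
			return
		}
		func() {
			defer queue.Done(key)
			if err := sync(key.(string)); err != nil {
				// This branch is what surfaces as
				// "unable to sync: ..., requeuing" in the operator log.
				fmt.Printf("unable to sync: %v, requeuing\n", err)
				queue.AddRateLimited(key)
				return
			}
			queue.Forget(key)
		}()
	}
}

func main() {
	queue := workqueue.NewRateLimitingQueue(workqueue.DefaultControllerRateLimiter())
	queue.Add("openshift-config/pull-secret") // hypothetical event key
	go worker(queue)
	time.Sleep(time.Second) // real controllers run workers until shutdown, fed by informers
	queue.ShutDown()
}
```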
@XiuJuan can you please attach the *FULL* operator pod logs and the *FULL* configs.samples yaml when you are able to reproduce this?

Created attachment 1601961 [details]
samples operator log
Created attachment 1601962 [details]
samples operator config
OK, I see what is wrong now with the full data ... I did not hit the timing window exposed in my testing. Will have a PR up later today my time.

Can't reproduce this issue any more in payload 4.2.0-0.nightly-2019-08-12-080504.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:2922
Description of problem:
Delete the samples operator config object (triggering its finalizer) a few times; ImageChangesInProgress gets stuck in true for a long time (more than 1 hour).

Version-Release number of selected component (if applicable):
4.2.0-0.nightly-2019-08-01-035705

How reproducible:
Sometimes

Steps to Reproduce:
1. Trigger the samples operator finalizer a few times (delete the config.samples object).
2. Check the samples operator CRD.

Actual results:
$ oc describe config.samples.operator | grep -A1 -B3 ImageChangesInProgress
Last Update Time: 2019-08-01T08:42:00Z
Reason: fuse-apicurito-generator jboss-processserver64-openshift redis rhpam73-kieserver-openshift jboss-webserver31-tomcat7-openshift jboss-webserver31-tomcat8-openshift jboss-decisionserver64-openshift fuse7-karaf-openshift jenkins-agent-nodejs nginx python rhdm73-decisioncentral-indexing-openshift jboss-datagrid72-openshift dotnet-runtime jenkins-agent-maven rhpam73-businesscentral-openshift jboss-amq-62 java openjdk-11-rhel7 fis-java-openshift jboss-fuse70-karaf-openshift httpd mariadb rhpam73-smartrouter-openshift jboss-datagrid71-client-openshift jboss-datagrid73-openshift dotnet
Status: True
Type: ImageChangesInProgress
Last Transition Time: 2019-08-01T08:41:45Z

Expected results:
The managed imagestreams should be updated quickly.

Additional info:
$ oc logs -f cluster-samples-operator-54c57b54f7-mvtpz | grep fuse-apicurito-generator
time="2019-08-01T08:36:20Z" level=info msg="Imagestream fuse-apicurito-generator watch event do upsert false; no errors in prep true, possibly update operator conditions true"
time="2019-08-01T08:41:22Z" level=info msg="one time ignoring of delete event for imagestream fuse-apicurito-generator as part of group delete"
time="2019-08-01T08:41:38Z" level=info msg="processing file fuse-apicurito-generator-rhel7.json from dir /opt/openshift/operator/ocp-x86_64/fis/imagestreams"
time="2019-08-01T08:41:39Z" level=info msg="created imagestream fuse-apicurito-generator"
time="2019-08-01T08:41:39Z" level=info msg="caching imagestream event fuse-apicurito-generator because we have not yet completed all the samples upserts"
time="2019-08-01T08:42:01Z" level=info msg="Imagestream fuse-apicurito-generator watch event do upsert false; no errors in prep true, possibly update operator conditions true"
time="2019-08-01T08:42:03Z" level=info msg="There are no more image imports in flight for imagestream fuse-apicurito-generator"
time="2019-08-01T08:42:11Z" level=info msg="There are no more image imports in flight for imagestream fuse-apicurito-generator"
time="2019-08-01T08:46:20Z" level=info msg="imagestream fuse-apicurito-generator still not finished with its image imports, including "
time="2019-08-01T08:46:20Z" level=info msg="imagestream fuse-apicurito-generator still not finished with its image imports, including "
time="2019-08-01T08:46:20Z" level=info msg="imagestream fuse-apicurito-generator still not finished with its image imports, including "
time="2019-08-01T08:46:20Z" level=info msg="Imagestream fuse-apicurito-generator watch event do upsert false; no errors in prep true, possibly update operator conditions true"
time="2019-08-01T08:51:53Z" level=info msg="Imagestream fuse-apicurito-generator watch event do upsert false; no errors in prep true, possibly update operator conditions true"
time="2019-08-01T08:51:53Z" level=info msg="There are no more image imports in flight for imagestream fuse-apicurito-generator"
time="2019-08-01T08:51:53Z" level=info msg="imagestream fuse-apicurito-generator still not finished with its image imports, including "
time="2019-08-01T09:01:53Z" level=info msg="imagestream fuse-apicurito-generator still not finished with its image imports, including "
time="2019-08-01T09:01:53Z" level=info msg="imagestream fuse-apicurito-generator still not finished with its image imports, including "
time="2019-08-01T09:01:53Z" level=info msg="Imagestream fuse-apicurito-generator watch event do upsert false; no errors in prep true, possibly update operator conditions true"
time="2019-08-01T09:05:59Z" level=info msg="Imagestream fuse-apicurito-generator watch event do upsert false; no errors in prep true, possibly update operator conditions true"
time="2019-08-01T09:05:59Z" level=info msg="imagestream fuse-apicurito-generator still not finished with its image imports, including "
time="2019-08-01T09:13:28Z" level=info msg="Imagestream fuse-apicurito-generator watch event do upsert false; no errors in prep true, possibly update operator conditions true"
time="2019-08-01T09:23:28Z" level=info msg="Imagestream fuse-apicurito-generator watch event do upsert false; no errors in prep true, possibly update operator conditions true"
time="2019-08-01T09:23:28Z" level=info msg="imagestream fuse-apicurito-generator still not finished with its image imports, including "
time="2019-08-01T09:24:55Z" level=info msg="Imagestream fuse-apicurito-generator watch event do upsert false; no errors in prep true, possibly update operator conditions true"
time="2019-08-01T09:34:46Z" level=info msg="Imagestream fuse-apicurito-generator watch event do upsert false; no errors in prep true, possibly update operator conditions true"
time="2019-08-01T09:34:46Z" level=info msg="imagestream fuse-apicurito-generator still not finished with its image imports, including "
time="2019-08-01T09:44:46Z" level=info msg="Imagestream fuse-apicurito-generator watch event do upsert false; no errors in prep true, possibly update operator conditions true"
time="2019-08-01T09:44:46Z" level=info msg="There are no more image imports in flight for imagestream fuse-apicurito-generator"
time="2019-08-01T09:44:46Z" level=info msg="imagestream fuse-apicurito-generator still not finished with its image imports, including "
time="2019-08-01T09:51:41Z" level=info msg="Imagestream fuse-apicurito-generator watch event do upsert false; no errors in prep true, possibly update operator conditions true"
time="2019-08-01T09:51:41Z" level=info msg="imagestream fuse-apicurito-generator still not finished with its image imports, including "

$ oc get images | grep fuse-apicurito-generator
sha256:6ef87e9fc054e3e38954a663fef6feb0662711ebd128c363b32d02f8ec023510   registry.redhat.io/fuse7/fuse-apicurito-generator@sha256:6ef87e9fc054e3e38954a663fef6feb0662711ebd128c363b32d02f8ec023510
sha256:86e3cbe3e43c4048147c61bdf7b0d682f95753e4fb456ea8319302152a75073a   registry.redhat.io/fuse7/fuse-apicurito-generator@sha256:86e3cbe3e43c4048147c61bdf7b0d682f95753e4fb456ea8319302152a75073a