Description of problem:
An "APIServerServiceUnavailableErrorjava" error keeps ImageChangesInProgress at "True", which blocks the upgrade process. This error should not persist for so long while the apiserver is running well.

Version-Release number of selected component (if applicable):
4.5.0-rc.7

How reproducible:
Low frequency

Steps to Reproduce:
1. Upgrade a cluster from 4.4.11 to 4.5.0-rc.7.
2. The upgrade is blocked by openshift-samples due to the error:
   message: 'error creating samples: the server is currently unable to handle the request (put imagestreams.image.openshift.io jboss-fuse70-eap-openshift)'
   reason: 'APIServerServiceUnavailableErrorjava '
   status: "True"
   type: ImageChangesInProgress

Actual results:
$ oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.4.11    True        True          131m    Unable to apply 4.5.0-rc.7: the cluster operator openshift-samples has not yet successfully rolled out

$ oc get co openshift-samples -o yaml
apiVersion: config.openshift.io/v1
kind: ClusterOperator
metadata:
  creationTimestamp: "2020-07-08T06:59:23Z"
  generation: 1
  name: openshift-samples
  resourceVersion: "90669"
  selfLink: /apis/config.openshift.io/v1/clusteroperators/openshift-samples
  uid: da63f977-9f1c-43c2-a4fc-8aed49318896
spec: {}
status:
  conditions:
  - lastTransitionTime: "2020-07-08T09:03:58Z"
    message: Samples installation successful at 4.4.11
    status: "True"
    type: Available
  - lastTransitionTime: "2020-07-08T09:03:58Z"
    message: Samples processing to 4.5.0-rc.7
    status: "True"
    type: Progressing
  - lastTransitionTime: "2020-07-08T09:03:58Z"
    status: "False"
    type: Degraded
  extension: null
  relatedObjects:
  - group: samples.operator.openshift.io
    name: cluster
    resource: configs
  - group: ""
    name: openshift-cluster-samples-operator
    resource: namespaces
  - group: ""
    name: openshift
    resource: namespaces
  versions:
  - name: operator
    version: 4.4.11

$ oc get co
NAME                                       VERSION      AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.5.0-rc.7   True        False         False      3h18m
cloud-credential                           4.5.0-rc.7   True        False         False      3h45m
cluster-autoscaler                         4.5.0-rc.7   True        False         False      3h32m
config-operator                            4.5.0-rc.7   True        False         False      125m
console                                    4.5.0-rc.7   True        False         False      141m
csi-snapshot-controller                    4.5.0-rc.7   True        False         False      3h24m
dns                                        4.4.11       True        False         False      3h39m
etcd                                       4.5.0-rc.7   True        False         False      3h39m
image-registry                             4.5.0-rc.7   True        False         False      3h24m
ingress                                    4.5.0-rc.7   True        False         False      3h24m
insights                                   4.5.0-rc.7   True        False         False      3h32m
kube-apiserver                             4.5.0-rc.7   True        False         False      3h38m
kube-controller-manager                    4.5.0-rc.7   True        False         False      3h38m
kube-scheduler                             4.5.0-rc.7   True        False         False      3h38m
kube-storage-version-migrator              4.5.0-rc.7   True        False         False      3h24m
machine-api                                4.5.0-rc.7   True        False         False      3h32m
machine-approver                           4.5.0-rc.7   True        False         False      112m
machine-config                             4.4.11       True        False         False      133m
marketplace                                4.5.0-rc.7   True        False         False      110m
monitoring                                 4.5.0-rc.7   True        False         False      109m
network                                    4.4.11       True        False         False      3h40m
node-tuning                                4.5.0-rc.7   True        False         False      111m
openshift-apiserver                        4.5.0-rc.7   True        False         False      113m
openshift-controller-manager               4.5.0-rc.7   True        False         False      3h32m
openshift-samples                          4.4.11       True        True          False      87m
operator-lifecycle-manager                 4.5.0-rc.7   True        False         False      3h39m
operator-lifecycle-manager-catalog         4.5.0-rc.7   True        False         False      3h40m
operator-lifecycle-manager-packageserver   4.5.0-rc.7   True        False         False      110m
service-ca                                 4.5.0-rc.7   True        False         False      3h40m
service-catalog-apiserver                  4.4.11       True        False         False      3h41m
service-catalog-controller-manager         4.4.11       True        False         False      3h41m
storage                                    4.5.0-rc.7   True        False         False      111m

Expected results:
This error should not persist for so long while the apiserver is running well.

Additional info:
Cannot provide more logs because the cluster has already been removed by the Jenkins CI.
@XiuJuan I understand the logs are no longer available. But since you reported this:

message: 'error creating samples: the server is currently unable to handle the request (put imagestreams.image.openshift.io jboss-fuse70-eap-openshift)'
reason: 'APIServerServiceUnavailableErrorjava '
status: "True"
type: ImageChangesInProgress

I suspect you had access to the entire `oc get configs.samples cluster -o yaml` output. Any chance you still have the full object's yaml, and not just that subset? In the interim I'm still trying to reverse engineer how we got into this state, but the entire yaml could prove helpful.
I think I have a simple fix for this (that intermittent APIServer error on the initial create should go to SamplesExist, not ImageChangesInProgress), but I'd like to examine the full samples operator config yaml if @XiuJuan has it.
I've looked at this enough now that I don't need the additional fields from the config object yaml. Getting actual API server errors is pretty rare, so we uncovered an issue that has been there for a while. Recording the error from the samples create in the ImageChangesInProgress condition messes up the transition of ImageChangesInProgress from true to false, because a non-imagestream name ends up in the reason field. Will have a fix up shortly.
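For illustration only (this is a hypothetical model, not the operator's actual Go code): the comment above describes ImageChangesInProgress as tracking in-progress imagestream imports via names kept in the condition's reason field, with the condition flipping back to false once all names are cleared. A minimal sketch of that bookkeeping shows why writing an API error token into the same field wedges the condition at "True" — the stray token never matches a completing imagestream, so the reason never empties:

```python
# Hypothetical sketch of the condition bookkeeping described in this bug.
# Class and function names are illustrative, not the samples operator's API.

class Condition:
    def __init__(self):
        self.status = "False"
        self.reason = ""  # space-separated imagestream names still importing

def start_import(cond, stream):
    """Record that an imagestream import is in progress."""
    names = cond.reason.split()
    if stream not in names:
        names.append(stream)
    cond.reason = " ".join(names)
    cond.status = "True"

def finish_import(cond, stream):
    """Clear a completed import; flip to False once no imports remain."""
    names = [n for n in cond.reason.split() if n != stream]
    cond.reason = " ".join(names)
    if not names:
        cond.status = "False"

# Normal flow: the condition returns to False when all imports complete.
ok = Condition()
start_import(ok, "jboss-fuse70-eap-openshift")
finish_import(ok, "jboss-fuse70-eap-openshift")
print(ok.status)  # -> False

# Buggy flow: an API error reason written into the same field is never
# "finished", so ImageChangesInProgress stays True and blocks the upgrade.
stuck = Condition()
start_import(stuck, "jboss-fuse70-eap-openshift")
stuck.reason += " APIServerServiceUnavailableError"  # error routed to wrong condition
finish_import(stuck, "jboss-fuse70-eap-openshift")
print(stuck.status)  # -> True (stuck)
```

This also matches the direction of the fix mentioned earlier: report transient API server errors on a condition that does not use its reason field for name tracking (SamplesExist), rather than on ImageChangesInProgress.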
Gabe, thanks. The logs in comment #0 are all I got before the cluster was removed. Glad you have found a clue.
While the likelihood of this happening is remote, it blocks an upgrade when it does, so I'm bumping the severity.
Upgraded several clusters from 4.4 -> 4.5 -> 4.6; openshift-samples did not hit this bug even while openshift-apiserver was degraded:

openshift-apiserver            4.6.0-0.nightly-2020-07-14-224428   False   False   True    3h21m
openshift-controller-manager   4.6.0-0.nightly-2020-07-14-224428   True    True    False   4h22m
openshift-samples              4.6.0-0.nightly-2020-07-14-224428   True    False   False   3h29m

Then I deleted the apiserver pods while openshift-samples was processing. The apiserver error (APIServerConflictError) moved to the SamplesExist condition. Hence marking this bug as verified.

$ oc get configs.samples -o yaml
apiVersion: v1
items:
- apiVersion: samples.operator.openshift.io/v1
  kind: Config
  metadata:
    creationTimestamp: "2020-07-15T02:53:08Z"
    finalizers:
    - samples.operator.openshift.io/finalizer
    generation: 3
    name: cluster
    resourceVersion: "253677"
    selfLink: /apis/samples.operator.openshift.io/v1/configs/cluster
    uid: f456399d-7938-412a-ba66-d40b3a69cc41
  spec:
    architectures:
    - x86_64
    managementState: Managed
  status:
    architectures:
    - x86_64
    conditions:
    - lastTransitionTime: "2020-07-15T02:53:08Z"
      lastUpdateTime: "2020-07-15T02:53:08Z"
      status: "True"
      type: ImportCredentialsExist
    - lastTransitionTime: "2020-07-15T02:53:13Z"
      lastUpdateTime: "2020-07-15T02:53:13Z"
      status: "True"
      type: ConfigurationValid
    - lastTransitionTime: "2020-07-15T08:33:00Z"
      lastUpdateTime: "2020-07-15T08:35:44Z"
      status: "False"
      type: ImportImageErrorsExist
    - lastTransitionTime: "2020-07-15T08:34:44Z"
      lastUpdateTime: "2020-07-15T08:34:44Z"
      status: "False"
      type: ImageChangesInProgress
    - lastTransitionTime: "2020-07-15T08:34:47Z"
      lastUpdateTime: "2020-07-15T08:35:44Z"
      message: 'error creating samples: Operation cannot be fulfilled on imagestreams.image.openshift.io
        "jboss-datagrid73-openshift": the object has been modified; please apply
        your changes to the latest version and try again'
      reason: APIServerConflictError
      status: Unknown
      type: SamplesExist
    - lastTransitionTime: "2020-07-15T08:33:03Z"
      lastUpdateTime: "2020-07-15T08:33:03Z"
      status: "False"
      type: RemovePending
    - lastTransitionTime: "2020-07-15T04:50:00Z"
      lastUpdateTime: "2020-07-15T04:50:00Z"
      status: "False"
      type: MigrationInProgress
    managementState: Managed
    version: 4.6.0-0.nightly-2020-07-14-224428
kind: List
metadata:
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:4196