Bug 1854857 - APIServerServiceUnavailableErrorjava error keeps ImageChangesInProgress true, which blocks the upgrade

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Samples
Sub Component:
Version:	4.5
Hardware:	Unspecified
OS:	Unspecified
Priority:	medium
Severity:	medium
Target Milestone:	---
Target Release:	4.6.0
Assignee:	Gabe Montero
QA Contact:	XiuJuan Wang
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	1857201

Reported:	2020-07-08 11:26 UTC by XiuJuan Wang
Modified:	2020-10-27 16:13 UTC (History)
CC List:	0 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:	Cause: Intermittent API server errors were reported on the wrong condition (ImageChangesInProgress instead of SamplesExist) of the cluster operator config object.
Consequence: When API server communication returned and all the samples were installed, the samples operator failed to switch Progressing to false because there was unexpected data in its ImageChangesInProgress condition, and upgrades were incorrectly marked as incomplete.
Fix: A code change was made to report APIServer communication errors on SamplesExist.
Result: Upgrades are no longer blocked if intermittent APIServer errors occur while the samples operator is upgrading.
Clone Of:
Environment:
Last Closed:	2020-10-27 16:12:56 UTC
Target Upstream Version:
Embargoed:


Links
System	ID	Private	Priority	Status	Summary	Last Updated
Github	openshift cluster-samples-operator pull 303	0	None	closed	Bug 1854857: initial create errors should map to SamplesExists instead of ImageChangesInProgress	2020-11-19 12:54:08 UTC
Red Hat Product Errata	RHBA-2020:4196	0	None	None	None	2020-10-27 16:13:15 UTC

Description XiuJuan Wang 2020-07-08 11:26:07 UTC

Description of problem:

An APIServerServiceUnavailableErrorjava error keeps ImageChangesInProgress set to true, which blocks the upgrade from completing.

This error shouldn't persist for so long while the apiserver is running well.

Version-Release number of selected component (if applicable):
4.5.0-rc.7

How reproducible:
Low frequency

Steps to Reproduce:
1. Upgrade the cluster from 4.4.11 to 4.5.0-rc.7.
2. The upgrade is blocked by openshift-samples due to the error:

      message: 'error creating samples: the server is currently unable to handle the
        request (put imagestreams.image.openshift.io jboss-fuse70-eap-openshift)'
      reason: 'APIServerServiceUnavailableErrorjava '
      status: "True"
      type: ImageChangesInProgress

Actual results:
$ oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.4.11    True        True          131m    Unable to apply 4.5.0-rc.7: the cluster operator openshift-samples has not yet successfully rolled out

$ oc get co openshift-samples -o yaml
apiVersion: config.openshift.io/v1
kind: ClusterOperator
metadata:
  creationTimestamp: "2020-07-08T06:59:23Z"
  generation: 1
  name: openshift-samples
  resourceVersion: "90669"
  selfLink: /apis/config.openshift.io/v1/clusteroperators/openshift-samples
  uid: da63f977-9f1c-43c2-a4fc-8aed49318896
spec: {}
status:
  conditions:
  - lastTransitionTime: "2020-07-08T09:03:58Z"
    message: Samples installation successful at 4.4.11
    status: "True"
    type: Available
  - lastTransitionTime: "2020-07-08T09:03:58Z"
    message: Samples processing to 4.5.0-rc.7
    status: "True"
    type: Progressing
  - lastTransitionTime: "2020-07-08T09:03:58Z"
    status: "False"
    type: Degraded
  extension: null
  relatedObjects:
  - group: samples.operator.openshift.io
    name: cluster
    resource: configs
  - group: ""
    name: openshift-cluster-samples-operator
    resource: namespaces
  - group: ""
    name: openshift
    resource: namespaces
  versions:
  - name: operator
    version: 4.4.11

$ oc get co 
NAME                                       VERSION      AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.5.0-rc.7   True        False         False      3h18m
cloud-credential                           4.5.0-rc.7   True        False         False      3h45m
cluster-autoscaler                         4.5.0-rc.7   True        False         False      3h32m
config-operator                            4.5.0-rc.7   True        False         False      125m
console                                    4.5.0-rc.7   True        False         False      141m
csi-snapshot-controller                    4.5.0-rc.7   True        False         False      3h24m
dns                                        4.4.11       True        False         False      3h39m
etcd                                       4.5.0-rc.7   True        False         False      3h39m
image-registry                             4.5.0-rc.7   True        False         False      3h24m
ingress                                    4.5.0-rc.7   True        False         False      3h24m
insights                                   4.5.0-rc.7   True        False         False      3h32m
kube-apiserver                             4.5.0-rc.7   True        False         False      3h38m
kube-controller-manager                    4.5.0-rc.7   True        False         False      3h38m
kube-scheduler                             4.5.0-rc.7   True        False         False      3h38m
kube-storage-version-migrator              4.5.0-rc.7   True        False         False      3h24m
machine-api                                4.5.0-rc.7   True        False         False      3h32m
machine-approver                           4.5.0-rc.7   True        False         False      112m
machine-config                             4.4.11       True        False         False      133m
marketplace                                4.5.0-rc.7   True        False         False      110m
monitoring                                 4.5.0-rc.7   True        False         False      109m
network                                    4.4.11       True        False         False      3h40m
node-tuning                                4.5.0-rc.7   True        False         False      111m
openshift-apiserver                        4.5.0-rc.7   True        False         False      113m
openshift-controller-manager               4.5.0-rc.7   True        False         False      3h32m
openshift-samples                          4.4.11       True        True          False      87m
operator-lifecycle-manager                 4.5.0-rc.7   True        False         False      3h39m
operator-lifecycle-manager-catalog         4.5.0-rc.7   True        False         False      3h40m
operator-lifecycle-manager-packageserver   4.5.0-rc.7   True        False         False      110m
service-ca                                 4.5.0-rc.7   True        False         False      3h40m
service-catalog-apiserver                  4.4.11       True        False         False      3h41m
service-catalog-controller-manager         4.4.11       True        False         False      3h41m
storage                                    4.5.0-rc.7   True        False         False      111m

Expected results:
This error shouldn't persist for so long while the apiserver is running well.

Additional info:

I can't provide more logs because the cluster has been removed by Jenkins CI.

Comment 1 Gabe Montero 2020-07-08 14:27:20 UTC

@XiuJuan

I understand the logs are no longer available.

But since you reported this:

      message: 'error creating samples: the server is currently unable to handle the
        request (put imagestreams.image.openshift.io jboss-fuse70-eap-openshift)'
      reason: 'APIServerServiceUnavailableErrorjava '
      status: "True"
      type: ImageChangesInProgress


I suspect you had access to the entire `oc get configs.samples cluster -o yaml` output.

Any chance you still have the entire object's yaml, and not just that subset of the entire yaml?

In the interim I'm still trying to reverse engineer how we got into this state, but the 
entire yaml could prove helpful.

Comment 2 Gabe Montero 2020-07-08 15:13:23 UTC

I think I have a simple fix for this (an intermittent APIServer error on the initial create should go to SamplesExist, not ImageChangesInProgress), but I'd like to examine the full samples operator config yaml if @XiuJuan has it.
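
A minimal sketch of the routing change proposed in this comment, written in Python rather than taken from the operator's own code. The `Condition` class and the `record_create_error` helper are hypothetical names used only for illustration; the condition types, the reason string, and the `Unknown` status come from the output quoted elsewhere in this report.

```python
from dataclasses import dataclass


@dataclass
class Condition:
    # Hypothetical stand-in for one entry in the config object's status.conditions.
    type: str
    status: str = "False"
    reason: str = ""
    message: str = ""


def record_create_error(conditions, reason, message):
    """Report a failed sample create on SamplesExist only.

    ImageChangesInProgress is deliberately left untouched, so its reason
    field never receives anything other than imagestream names.
    """
    cond = conditions["SamplesExist"]
    cond.status = "Unknown"
    cond.reason = reason
    cond.message = message
    return conditions


conditions = {
    "SamplesExist": Condition("SamplesExist", status="True"),
    "ImageChangesInProgress": Condition(
        "ImageChangesInProgress", status="True", reason="java "
    ),
}

record_create_error(
    conditions,
    "APIServerServiceUnavailableError",
    "error creating samples: the server is currently unable to handle the request",
)

print(conditions["SamplesExist"].reason)
print(repr(conditions["ImageChangesInProgress"].reason))
```

The point of the sketch is only that the error lands on one condition and the other keeps its own bookkeeping; how the real operator stores and updates conditions is not shown here.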

Comment 3 Gabe Montero 2020-07-09 11:48:14 UTC

I've looked at this enough now that I no longer need the additional fields from the config object yaml.

Getting actual API server errors is pretty rare, so we uncovered an issue that has been there for a while.

Setting the error report for the samples create on the ImageChangesInProgress condition
messes up the transition of ImageChangesInProgress from true to false, because a non-imagestream
name is in the reason field.
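
To illustrate why a non-imagestream name in the reason field can block the transition, here is a rough Python model. It assumes (this report does not spell it out) that the reason field holds a space-separated list of imagestream names still being imported and that the condition goes false once that list is empty. The function names are hypothetical and this is not the operator's code.

```python
def add_pending(reason: str, name: str) -> str:
    """Append an imagestream name to the space-separated pending list."""
    return reason + name + " "


def mark_imported(reason: str, name: str) -> str:
    """Drop one imagestream name from the pending list once its import is done."""
    remaining = [token for token in reason.split() if token != name]
    return "".join(token + " " for token in remaining)


def in_progress(reason: str) -> bool:
    """The condition stays true while any token is left in the list."""
    return reason.strip() != ""


# Healthy path: every name that was added is later removed.
healthy = add_pending(add_pending("", "java"), "jboss-fuse70-eap-openshift")
healthy = mark_imported(healthy, "java")
healthy = mark_imported(healthy, "jboss-fuse70-eap-openshift")

# Buggy path: an error string is already in the field when "java" is appended,
# so the two fuse into a token that matches no imagestream name.
stuck = add_pending("APIServerServiceUnavailableError", "java")
stuck = mark_imported(stuck, "java")

print(in_progress(healthy))
print(repr(stuck), in_progress(stuck))
```

Under these assumptions the fused token `APIServerServiceUnavailableErrorjava` is never removed, which matches the reason string seen in the description.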

Will have a fix up shortly.

Comment 4 XiuJuan Wang 2020-07-09 12:57:12 UTC

Gabe, thanks.
The logs in comment #0 are all I could get before the cluster was removed.
Glad you have a clue.

Comment 5 Gabe Montero 2020-07-09 14:06:02 UTC

While the likelihood of this happening is remote, it will block an upgrade, so I'm bumping the severity.

Comment 9 XiuJuan Wang 2020-07-15 08:45:50 UTC

Upgraded several clusters from 4.4 -> 4.5 -> 4.6.
openshift-samples did not hit this bug even when openshift-apiserver was degraded.

openshift-apiserver                        4.6.0-0.nightly-2020-07-14-224428   False       False         True       3h21m
openshift-controller-manager               4.6.0-0.nightly-2020-07-14-224428   True        True          False      4h22m
openshift-samples                          4.6.0-0.nightly-2020-07-14-224428   True        False         False      3h29m

Then I deleted the apiserver pods while openshift-samples was processing. The apiserver error APIServerConflictError was reported on SamplesExist instead.

Hence marking this bug as verified.
$ oc get configs.samples -o yaml
apiVersion: v1
items:
- apiVersion: samples.operator.openshift.io/v1
  kind: Config
  metadata:
    creationTimestamp: "2020-07-15T02:53:08Z"
    finalizers:
    - samples.operator.openshift.io/finalizer
    generation: 3
    name: cluster
    resourceVersion: "253677"
    selfLink: /apis/samples.operator.openshift.io/v1/configs/cluster
    uid: f456399d-7938-412a-ba66-d40b3a69cc41
  spec:
    architectures:
    - x86_64
    managementState: Managed
  status:
    architectures:
    - x86_64
    conditions:
    - lastTransitionTime: "2020-07-15T02:53:08Z"
      lastUpdateTime: "2020-07-15T02:53:08Z"
      status: "True"
      type: ImportCredentialsExist
    - lastTransitionTime: "2020-07-15T02:53:13Z"
      lastUpdateTime: "2020-07-15T02:53:13Z"
      status: "True"
      type: ConfigurationValid
    - lastTransitionTime: "2020-07-15T08:33:00Z"
      lastUpdateTime: "2020-07-15T08:35:44Z"
      status: "False"
      type: ImportImageErrorsExist
    - lastTransitionTime: "2020-07-15T08:34:44Z"
      lastUpdateTime: "2020-07-15T08:34:44Z"
      status: "False"
      type: ImageChangesInProgress
    - lastTransitionTime: "2020-07-15T08:34:47Z"
      lastUpdateTime: "2020-07-15T08:35:44Z"
      message: 'error creating samples: Operation cannot be fulfilled on imagestreams.image.openshift.io
        "jboss-datagrid73-openshift": the object has been modified; please apply your
        changes to the latest version and try again'
      reason: APIServerConflictError
      status: Unknown
      type: SamplesExist
    - lastTransitionTime: "2020-07-15T08:33:03Z"
      lastUpdateTime: "2020-07-15T08:33:03Z"
      status: "False"
      type: RemovePending
    - lastTransitionTime: "2020-07-15T04:50:00Z"
      lastUpdateTime: "2020-07-15T04:50:00Z"
      status: "False"
      type: MigrationInProgress
    managementState: Managed
    version: 4.6.0-0.nightly-2020-07-14-224428
kind: List
metadata:
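
The check performed in this verification can be sketched as a small Python helper. The `misplaced_api_errors` name is hypothetical, and matching on the `APIServer` prefix is an assumption based only on the two reason strings that appear in this report. The condition lists below are trimmed copies of the quoted output, not a full object.

```python
def misplaced_api_errors(conditions):
    """Return the types of conditions, other than SamplesExist, whose reason
    looks like an API server error (assumed prefix: 'APIServer')."""
    return [
        cond["type"]
        for cond in conditions
        if cond.get("reason", "").startswith("APIServer")
        and cond["type"] != "SamplesExist"
    ]


# Trimmed from the description: the error sits on ImageChangesInProgress.
before_fix = [
    {
        "type": "ImageChangesInProgress",
        "status": "True",
        "reason": "APIServerServiceUnavailableErrorjava ",
    },
]

# Trimmed from the verification output: the error sits on SamplesExist.
after_fix = [
    {"type": "ImageChangesInProgress", "status": "False"},
    {"type": "SamplesExist", "status": "Unknown", "reason": "APIServerConflictError"},
]

print(misplaced_api_errors(before_fix))
print(misplaced_api_errors(after_fix))
```

An empty result for the post-fix conditions corresponds to what was observed here: the API server error no longer appears on ImageChangesInProgress.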

Comment 11 errata-xmlrpc 2020-10-27 16:12:56 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196
