Bug 1702582 - polling apply actions in CVO should report more resource specific errors
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Installer
Version: 4.1.0
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 4.2.0
Assignee: Joseph Callen
QA Contact: Johnny Liu
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2019-04-24 07:43 UTC by Johnny Liu
Modified: 2019-10-16 06:28 UTC (History)
CC List: 0 users

Fixed In Version:
Doc Type: Enhancement
Doc Text:
Feature: Adds deployment status information to the info-level log. Reason: To provide additional information about the status of the deployment for troubleshooting. Result: The log displays deployment conditions.
Clone Of:
Environment:
Last Closed: 2019-10-16 06:28:06 UTC
Target Upstream Version:
Embargoed:


Links
System                  ID              Private  Priority  Status  Summary  Last Updated
Red Hat Product Errata  RHBA-2019:2922  0        None      None    None     2019-10-16 06:28:21 UTC

Description Johnny Liu 2019-04-24 07:43:02 UTC
Description of problem:

Version-Release number of the following components:
upgrade from 4.1.0-0.nightly-2019-04-22-005054 to 4.1.0-0.nightly-2019-04-22-192604

How reproducible:
Always

Steps to Reproduce:
1. Fresh install a cluster with 4.1.0-0.nightly-2019-04-22-005054.
2. Trigger an upgrade towards 4.1.0-0.nightly-2019-04-22-192604.
3. While the upgrade is in progress, block all traffic from quay.io.
4. Wait until the upgrade fails.

Actual results:
# oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.1.0-0.nightly-2019-04-22-192604   True        True          62m     Unable to apply 4.1.0-0.nightly-2019-04-22-192604: the update could not be applied

# oc describe clusterversion
Name:         version
Namespace:    
Labels:       <none>
Annotations:  <none>
API Version:  config.openshift.io/v1
Kind:         ClusterVersion
Metadata:
  Creation Timestamp:  2019-04-23T06:36:14Z
  Generation:          2
  Resource Version:    405822
  Self Link:           /apis/config.openshift.io/v1/clusterversions/version
  UID:                 17af3504-6592-11e9-9955-0293645e251a
Spec:
  Channel:     stable-4.0
  Cluster ID:  3b07acc8-66ba-4c9f-a465-5127b755487a
  Desired Update:
    Image:    registry.svc.ci.openshift.org/ocp/release:4.1.0-0.nightly-2019-04-22-192604
    Version:  
  Upstream:   https://api.openshift.com/api/upgrades_info/v1/graph
Status:
  Available Updates:  <nil>
  Conditions:
    Last Transition Time:  2019-04-23T06:49:11Z
    Message:               Done applying 4.1.0-0.nightly-2019-04-22-005054
    Status:                True
    Type:                  Available
    Last Transition Time:  2019-04-24T07:07:01Z
    Message:               Could not update deployment "openshift-dns-operator/dns-operator" (47 of 333)
    Reason:                UpdatePayloadFailed
    Status:                True
    Type:                  Failing
    Last Transition Time:  2019-04-23T08:58:22Z
    Message:               Unable to apply 4.1.0-0.nightly-2019-04-22-192604: the update could not be applied
    Reason:                UpdatePayloadFailed
    Status:                True
    Type:                  Progressing
    Last Transition Time:  2019-04-23T06:36:29Z
    Message:               Unable to retrieve available updates: unknown version 4.1.0-0.nightly-2019-04-22-192604
    Reason:                RemoteFailed
    Status:                False
    Type:                  RetrievedUpdates
  Desired:
    Image:    registry.svc.ci.openshift.org/ocp/release:4.1.0-0.nightly-2019-04-22-192604
    Version:  4.1.0-0.nightly-2019-04-22-192604
  History:
    Completion Time:    <nil>
    Image:              registry.svc.ci.openshift.org/ocp/release:4.1.0-0.nightly-2019-04-22-192604
    Started Time:       2019-04-23T08:58:22Z
    State:              Partial
    Version:            4.1.0-0.nightly-2019-04-22-192604
    Completion Time:    2019-04-23T08:58:22Z
    Image:              registry.svc.ci.openshift.org/ocp/release@sha256:3f3628cd9b694705cb0627ce72e61932df5d9938a291fabba1ed691230f7b548
    Started Time:       2019-04-23T06:36:29Z
    State:              Completed
    Version:            4.1.0-0.nightly-2019-04-22-005054
  Observed Generation:  2
  Version Hash:         VDUuFi-LWdE=
Events:                 <none>

The output of `oc describe clusterversion` shows only the error message 'Could not update deployment "openshift-dns-operator/dns-operator"'.
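A minimal sketch of pulling the Failing condition out of the ClusterVersion object programmatically, assuming the JSON was obtained with something like `oc get clusterversion version -o json`. The helper names `find_condition` and `failing_summary` are hypothetical, and the sample data is trimmed from the describe output above using the camelCase field names of the JSON form.

```python
from typing import Optional


def find_condition(cluster_version: dict, cond_type: str) -> Optional[dict]:
    """Return the first status condition with the given type, or None."""
    for cond in cluster_version.get("status", {}).get("conditions", []):
        if cond.get("type") == cond_type:
            return cond
    return None


def failing_summary(cluster_version: dict) -> Optional[str]:
    """Return 'reason: message' when the Failing condition is True, else None."""
    cond = find_condition(cluster_version, "Failing")
    if cond is None or cond.get("status") != "True":
        return None
    return f"{cond.get('reason', '')}: {cond.get('message', '')}"


# Sample trimmed from the describe output above (illustrative, not a full object).
sample = {
    "status": {
        "conditions": [
            {
                "type": "Available",
                "status": "True",
                "message": "Done applying 4.1.0-0.nightly-2019-04-22-005054",
            },
            {
                "type": "Failing",
                "status": "True",
                "reason": "UpdatePayloadFailed",
                "message": 'Could not update deployment "openshift-dns-operator/dns-operator" (47 of 333)',
            },
        ]
    }
}

print(failing_summary(sample))
```

As in the report, the condition names the resource that failed but not why, so the operator log is still needed for the cause.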

Expected results:
CVO should do a bit more and report why it 'Could not update deployment "openshift-dns-operator/dns-operator"', with a better message.


Additional info:
The only way to troubleshoot is to check the cluster-version-operator log.
# oc logs -f cluster-version-operator-694fb8bf89-w86mk -n openshift-cluster-version
<--snip-->
I0423 09:12:35.598277       1 apps.go:77] Deployment dns-operator is not ready. status: (replicas: 2, updated: 1, ready: 1, unavailable: 1)
E0423 09:12:35.598319       1 task.go:77] error running apply for deployment "openshift-dns-operator/dns-operator" (47 of 333): timed out waiting for the condition
I0423 09:12:35.598356       1 task_graph.go:560] Canceled worker 8
I0423 09:12:35.598366       1 task_graph.go:580] Workers finished
I0423 09:12:35.598375       1 task_graph.go:588] Result of work: [Could not update deployment "openshift-dns-operator/dns-operator" (47 of 333)]
I0423 09:12:35.598392       1 sync_worker.go:667] Summarizing 1 errors
I0423 09:12:35.598399       1 sync_worker.go:671] Update error 47/333: UpdatePayloadFailed Could not update deployment "openshift-dns-operator/dns-operator" (47 of 333) (*errors.errorString: timed out waiting for the condition)
I0423 09:12:35.598426       1 task_graph.go:508] No more reachable nodes in graph, continue
I0423 09:12:35.598442       1 task_graph.go:544] No more work
E0423 09:12:35.598461       1 sync_worker.go:288] unable to synchronize image (waiting 49.936801596s): Could not update deployment "openshift-dns-operator/dns-operator" (47 of 333)
<--snip-->
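When scanning a long operator log by hand is impractical, the two relevant lines can be picked out mechanically. The following is a rough sketch whose regular expressions are modeled only on the "is not ready" and "error running apply" lines quoted above; other CVO log formats are not covered, and the function names are hypothetical.

```python
import re

# Patterns modeled on the two log lines quoted above; this is an assumption,
# not a description of every message the CVO can emit.
NOT_READY = re.compile(
    r'Deployment (?P<name>\S+) is not ready\. status: \((?P<status>.*)\)\s*$'
)
COUNTER = re.compile(r'\b(replicas|updated|ready|unavailable): (\d+)')
APPLY_ERROR = re.compile(
    r'error running apply for (?P<kind>\S+) "(?P<target>[^"]+)" '
    r'\((?P<index>\d+) of (?P<total>\d+)\): (?P<cause>.*)$'
)


def parse_not_ready(line):
    """Return (deployment name, replica counters) or None if the line does not match."""
    match = NOT_READY.search(line)
    if match is None:
        return None
    counters = {key: int(value) for key, value in COUNTER.findall(match.group("status"))}
    return match.group("name"), counters


def parse_apply_error(line):
    """Return a dict with kind, target, index, total, and cause, or None."""
    match = APPLY_ERROR.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    fields["index"] = int(fields["index"])
    fields["total"] = int(fields["total"])
    return fields


log = [
    'I0423 09:12:35.598277       1 apps.go:77] Deployment dns-operator is not ready. status: (replicas: 2, updated: 1, ready: 1, unavailable: 1)',
    'E0423 09:12:35.598319       1 task.go:77] error running apply for deployment "openshift-dns-operator/dns-operator" (47 of 333): timed out waiting for the condition',
]

for line in log:
    print(parse_not_ready(line) or parse_apply_error(line))
```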

Comment 1 Abhinav Dahiya 2019-04-24 16:22:01 UTC
The cluster version object need not include detailed errors. Our goal is to make sure it includes high-level information that guides admins on where to look next.

> I0423 09:12:35.598277       1 apps.go:77] Deployment dns-operator is not ready. status: (replicas: 2, updated: 1, ready: 1, unavailable: 1)
> E0423 09:12:35.598319       1 task.go:77] error running apply for deployment "openshift-dns-operator/dns-operator" (47 of 333): timed out waiting for the condition

The logs you provided show exactly why the CVO thinks it is failing to update the deployment, which I think covers the detail from the CVO's perspective.

We *might* try to make them better, but it is good for now.

Comment 2 Abhinav Dahiya 2019-05-23 22:03:12 UTC
https://github.com/openshift/cluster-version-operator/pull/187 was merged to include more information about failing deployments in the CVO logs.

Comment 3 Johnny Liu 2019-05-24 08:06:29 UTC
No 4.2 nightly build is available yet; verification is pending.

Comment 4 Johnny Liu 2019-06-25 10:49:22 UTC
Verified this bug with 4.2.0-0.nightly-2019-06-25-003324; result: PASS.

$ oc logs cluster-version-operator-65544b6768-vsmg6 -n openshift-cluster-version|grep machine-api
<--snip-->
I0625 10:34:31.093887       1 apps.go:94] Deployment openshift-apiserver-operator is not ready. status: (replicas: 1, updated: 1, ready: 0, unavailable: 1, reason: MinimumReplicasUnavailable, message: Deployment does not have minimum availability., reason: ProgressDeadlineExceeded, message: ReplicaSet "openshift-apiserver-operator-c8cf58dbc" has timed out progressing.)
E0625 10:34:31.093918       1 task.go:77] error running apply for deployment "openshift-apiserver-operator/openshift-apiserver-operator" (96 of 377): timed out waiting for the condition
<--snip-->
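To illustrate the shape of the enriched message, here is a small sketch that rebuilds the "is not ready" line from a Deployment status dictionary. It is not the code from the merged pull request: the function name `describe_not_ready` is hypothetical, the field names follow the Kubernetes Deployment status JSON, and which conditions the CVO actually selects is not stated in this report, so the sketch simply appends every condition it is given.

```python
def describe_not_ready(name: str, status: dict) -> str:
    """Format replica counters plus the reason and message of each condition."""
    parts = [
        f"replicas: {status.get('replicas', 0)}",
        f"updated: {status.get('updatedReplicas', 0)}",
        f"ready: {status.get('readyReplicas', 0)}",
        f"unavailable: {status.get('unavailableReplicas', 0)}",
    ]
    # Assumption: every condition passed in is reported; the real selection
    # logic in the CVO may differ.
    for cond in status.get("conditions", []):
        parts.append(f"reason: {cond.get('reason', '')}")
        parts.append(f"message: {cond.get('message', '')}")
    return f"Deployment {name} is not ready. status: ({', '.join(parts)})"


# Values copied from the verification log above.
status = {
    "replicas": 1,
    "updatedReplicas": 1,
    "readyReplicas": 0,
    "unavailableReplicas": 1,
    "conditions": [
        {
            "reason": "MinimumReplicasUnavailable",
            "message": "Deployment does not have minimum availability.",
        },
        {
            "reason": "ProgressDeadlineExceeded",
            "message": 'ReplicaSet "openshift-apiserver-operator-c8cf58dbc" has timed out progressing.',
        },
    ],
}

print(describe_not_ready("openshift-apiserver-operator", status))
```

Compared with the 4.1 log line, the added reason and message pairs say why the rollout stalled instead of only reporting replica counts.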

Comment 5 errata-xmlrpc 2019-10-16 06:28:06 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:2922

