Bug 1745973 - Insights operator should not report "degraded" after one unsuccessful upload attempt
Summary: Insights operator should not report "degraded" after one unsuccessful upload ...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Insights Operator
Version: 4.2.0
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 4.2.0
Assignee: Ivan Necas
QA Contact: Dmitry Misharov
Radek Vokál
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2019-08-27 11:17 UTC by Radek Vokál
Modified: 2020-07-30 15:43 UTC (History)
5 users (show)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-10-16 06:37:54 UTC
Target Upstream Version:
Embargoed:



Links
System ID Private Priority Status Summary Last Updated
Github openshift insights-operator pull 16 0 'None' closed Bug 1745973: Don't reported Degraded on upload error, report UploadDegraded 2021-02-15 14:02:09 UTC
Red Hat Product Errata RHBA-2019:2922 0 None None None 2019-10-16 06:38:05 UTC

Description Radek Vokál 2019-08-27 11:17:10 UTC
Description of problem:
Insights operator switches to "degraded" after a single unsuccessful attempt to upload data. The Insights data is not critical, so I propose reporting degraded only after at least X failed attempts, where X > 5 or even higher.
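
The proposal amounts to counting consecutive upload failures and reporting degraded only once the count reaches a limit. A minimal sketch in Python (the operator itself is written in Go); the class name `UploadFailureTracker` and the default limit of 6 are hypothetical, chosen only to illustrate "X > 5":

```python
class UploadFailureTracker:
    """Counts consecutive upload failures (hypothetical helper, not operator code)."""

    def __init__(self, threshold: int = 6) -> None:
        # Assumption: 6 is an arbitrary illustrative value satisfying X > 5.
        self.threshold = threshold
        self.consecutive_failures = 0

    def record(self, success: bool) -> None:
        # A successful upload resets the streak; a failure extends it.
        if success:
            self.consecutive_failures = 0
        else:
            self.consecutive_failures += 1

    @property
    def degraded(self) -> bool:
        # Report degraded only once the streak reaches the threshold.
        return self.consecutive_failures >= self.threshold
```

With this shape a single failed upload never flips the state, and one successful upload clears it again.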

Comment 1 Eric Paris 2019-08-27 12:26:13 UTC
Failure to upload, especially if it's because "our" end is broken, must not make the operator degraded, since that means customers can't update.

Comment 8 Dmitry Misharov 2019-09-25 08:21:24 UTC
Can you please provide verification steps? As I understand it, I need to simulate network instability or an inaccessible ingress service.

Comment 9 Dmitry Misharov 2019-09-25 09:07:21 UTC
Verified on 4.2.0-0.ci-2019-09-25-043459.
Steps to verify:
1. Replace the endpoint with an invalid URL
> oc -n openshift-config create secret generic support --from-literal=endpoint=http://localhost --dry-run -o yaml | oc apply -f - -n openshift-config
2. Restart insights-operator
> oc delete pods --namespace=openshift-insights --all
3. Check the logs:

> insightsclient.go:163] Unable to build a request, possible invalid token: Post http://localhost: dial tcp [::1]:80: connect: connection refused
> insightsuploader.go:132] Unable to upload report after 0s: unable to build request to connect to Insights server
> status.go:145] Number of last upload failures 5 exceeded than threshold 5. Marking as degraded.
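
The check in step 3 can also be scripted by matching the `status.go` message with a regular expression. A rough sketch in Python, assuming the message wording seen in this bug (it may differ in other builds); `parse_threshold_lines` is a hypothetical helper name:

```python
import re
from typing import List, Tuple

# Matches both the "Marking as degraded" and "Not marking as degraded" variants.
_PATTERN = re.compile(
    r"Number of last upload failures (\d+) .*?threshold (\d+)\. "
    r"(Not marking|Marking) as degraded"
)


def parse_threshold_lines(log_text: str) -> List[Tuple[int, int, bool]]:
    """Return (failures, threshold, marked_degraded) for each matching log line."""
    return [
        (int(failures), int(threshold), decision == "Marking")
        for failures, threshold, decision in _PATTERN.findall(log_text)
    ]
```

The input would be the operator log text, for example the output of `oc logs deployment/insights-operator -n openshift-insights` saved to a file.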

Comment 10 Ivan Necas 2019-09-25 09:28:16 UTC
```
oc -n openshift-config create secret generic support --from-literal=endpoint=http://localhost --dry-run -o yaml | oc apply -f - -n openshift-config
oc delete pods --namespace=openshift-insights --all # to work around #1753755
```

In the operator logs, notice something like:



I0925 09:17:50.207271       1 insightsclient.go:160] Uploading application/vnd.redhat.openshift.periodic to http://localhost
I0925 09:17:50.208665       1 insightsclient.go:163] Unable to build a request, possible invalid token: Post http://localhost: dial tcp [::1]:80: connect: connection refused
I0925 09:17:50.208695       1 insightsuploader.go:132] Unable to upload report after 10ms: unable to build request to connect to Insights server
I0925 09:17:50.208708       1 controllerstatus.go:40] name=insightsuploader healthy=false reason=UploadFailed message=Unable to report: unable to build request to connect to Insights server
I0925 09:18:26.415451       1 status.go:142] Number of last upload failures 1 lower than threshold 5. Not marking as degraded.


After the first few failures, the operator is still not marked as degraded:
```
oc get clusteroperator insights        
NAME       VERSION                        AVAILABLE   PROGRESSING   DEGRADED   SINCE
insights   4.2.0-0.ci-2019-09-25-043459   True        False         False      37m
```

But looking at the details with `oc get clusteroperator insights -o yaml`, one sees an additional condition:

```
  - lastTransitionTime: "2019-09-25T09:18:26Z"
    message: 'Unable to report: unable to build request to connect to Insights server'
    reason: UploadFailed
    status: "True"
    type: UploadDegraded
```
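
The same lookup can be scripted against the JSON form of the resource (`oc get clusteroperator insights -o json`). A small Python sketch, assuming the `status.conditions` list layout shown above; `condition_status` is a hypothetical helper, not part of any tool:

```python
import json
from typing import Optional


def condition_status(clusteroperator_json: str, condition_type: str) -> Optional[str]:
    """Return the status string of the named condition, or None if it is absent."""
    doc = json.loads(clusteroperator_json)
    for cond in doc.get("status", {}).get("conditions", []):
        if cond.get("type") == condition_type:
            return cond.get("status")
    return None
```

Calling it with `"UploadDegraded"` on the output captured during the first few failures should give `"True"`, while `"Degraded"` should still give `"False"`.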

After 5 failed attempts, the operator should turn to the degraded state, while keeping the `UploadDegraded` condition as well.

When changing the endpoint back to the proper value:

```
oc -n openshift-config create secret generic support --from-literal=endpoint=https://cloud.redhat.com/api/ingress/v1/upload --dry-run -o yaml | oc apply -f - -n openshift-config
```

The operator should get back to Degraded=False, and the `UploadDegraded` condition should go away.
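
The expected behaviour can be summarised as a function of the consecutive-failure count. This is only a Python sketch of the behaviour described in this comment, not the operator's code; the function name is hypothetical, and the default threshold of 5 is the value reported in the logs:

```python
from typing import Dict


def expected_conditions(consecutive_failures: int, threshold: int = 5) -> Dict[str, bool]:
    """Map a consecutive-failure count to the two conditions discussed here."""
    return {
        # Any failed upload is surfaced immediately as UploadDegraded.
        "UploadDegraded": consecutive_failures > 0,
        # Degraded is set only once the count reaches the threshold.
        "Degraded": consecutive_failures >= threshold,
    }
```

A count of 0, as after a successful upload to the restored endpoint, clears both conditions.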

Comment 11 errata-xmlrpc 2019-10-16 06:37:54 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:2922

Comment 12 Ali Okan Yuksel 2020-07-30 15:43:38 UTC
This problem still exists on 4.3.29, and the resolution steps are not clear enough.


Details:

[root@lbint ~]# oc describe co insights
Name:         insights
Namespace:
Labels:       <none>
Annotations:  <none>
API Version:  config.openshift.io/v1
Kind:         ClusterOperator
Metadata:
  Creation Timestamp:  2020-07-29T22:48:24Z
  Generation:          1
  Resource Version:    236636
  Self Link:           /apis/config.openshift.io/v1/clusteroperators/insights
  UID:                 df26032f-d4d3-4dd5-b399-8af9da9673d9
Spec:
Status:
  Conditions:
    Last Transition Time:  2020-07-30T09:41:48Z
    Message:               Unable to report: unable to build request to connect to Insights server: Post https://cloud.redhat.com/api/ingress/v1/upload: x509: certificate is valid for *.apps.data.tr.test.com, not cloud.redhat.com
    Reason:                UploadFailed
    Status:                True
    Type:                  Degraded
    Last Transition Time:  2020-07-29T22:48:24Z
    Status:                True
    Type:                  Available
    Last Transition Time:  2020-07-29T22:50:24Z
    Message:               An error has occurred
    Status:                False
    Type:                  Progressing
    Last Transition Time:  2020-07-29T22:50:24Z
    Status:                False
    Type:                  Disabled
    Last Transition Time:  2020-07-30T09:41:50Z
    Message:               Unable to report: unable to build request to connect to Insights server: Post https://cloud.redhat.com/api/ingress/v1/upload: x509: certificate is valid for *.apps.data.tr.test.com, not cloud.redhat.com
    Reason:                UploadFailed
    Status:                True
    Type:                  UploadDegraded
  Extension:
    Last Report Time:  <nil>
  Related Objects:
    Group:
    Name:       openshift-insights
    Resource:   namespaces
    Group:      apps
    Name:       insights-operator
    Namespace:  openshift-insights
    Resource:   deployments
  Versions:
    Name:     operator
    Version:  4.3.29
Events:       <none>

