Bug 1745973

Summary:	Insights operator should not report "degraded" after one unsuccessful upload attempt
Product:	OpenShift Container Platform	Reporter:	Radek Vokál <rvokal>
Component:	Insights Operator	Assignee:	Ivan Necas <inecas>
Status:	CLOSED ERRATA	QA Contact:	Dmitry Misharov <dmisharo>
Severity:	medium	Docs Contact:	Radek Vokál <rvokal>
Priority:	medium
Version:	4.2.0	CC:	dmisharo, eparis, inecas, kaox.gen, mfojtik
Target Milestone:	---
Target Release:	4.2.0
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2019-10-16 06:37:54 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:

Description Radek Vokál 2019-08-27 11:17:10 UTC

Description of problem:
The Insights operator switches to "degraded" after a single unsuccessful attempt to upload the data. The insights data are not critical, so I propose we report degraded only after at least X failed attempts, where X > 5 or even higher.

Comment 1 Eric Paris 2019-08-27 12:26:13 UTC

Failure to upload, especially if it's because "our" end is broken, must not make the operator degraded, since that means customers can't update.

Comment 8 Dmitry Misharov 2019-09-25 08:21:24 UTC

Can you please provide verification steps? As I understand it, I need to simulate some network instability or inaccessibility of the ingress service.

Comment 9 Dmitry Misharov 2019-09-25 09:07:21 UTC

Verified on 4.2.0-0.ci-2019-09-25-043459.
Steps to verify:
1. Replace the endpoint with an invalid URL
> oc -n openshift-config create secret generic support --from-literal=endpoint=http://localhost --dry-run -o yaml | oc apply -f - -n openshift-config
2. Restart insights-operator
> oc delete pods --namespace=openshift-insights --all
3. Check the logs:

> insightsclient.go:163] Unable to build a request, possible invalid token: Post http://localhost: dial tcp [::1]:80: connect: connection refused
> insightsuploader.go:132] Unable to upload report after 0s: unable to build request to connect to Insights server
> status.go:145] Number of last upload failures 5 exceeded than threshold 5. Marking as degraded.

Comment 10 Ivan Necas 2019-09-25 09:28:16 UTC

```
oc -n openshift-config create secret generic support --from-literal=endpoint=http://localhost --dry-run -o yaml | oc apply -f - -n openshift-config
oc delete pods --namespace=openshift-insights --all # to work around #1753755
```

In operator logs notice something like:



I0925 09:17:50.207271       1 insightsclient.go:160] Uploading application/vnd.redhat.openshift.periodic to http://localhost
I0925 09:17:50.208665       1 insightsclient.go:163] Unable to build a request, possible invalid token: Post http://localhost: dial tcp [::1]:80: connect: connection refused
I0925 09:17:50.208695       1 insightsuploader.go:132] Unable to upload report after 10ms: unable to build request to connect to Insights server
I0925 09:17:50.208708       1 controllerstatus.go:40] name=insightsuploader healthy=false reason=UploadFailed message=Unable to report: unable to build request to connect to Insights server
I0925 09:18:26.415451       1 status.go:142] Number of last upload failures 1 lower than threshold 5. Not marking as degraded.


After the first few failures, the operator is still not marked as degraded:
```
oc get clusteroperator insights        
NAME       VERSION                        AVAILABLE   PROGRESSING   DEGRADED   SINCE
insights   4.2.0-0.ci-2019-09-25-043459   True        False         False      37m
```

But looking at the details with `oc get clusteroperator insights -o yaml`, one sees an additional condition:

```
  - lastTransitionTime: "2019-09-25T09:18:26Z"
    message: 'Unable to report: unable to build request to connect to Insights server'
    reason: UploadFailed
    status: "True"
    type: UploadDegraded
```

After 5 failed attempts, the operator should turn into the degraded state, while keeping the `UploadDegraded` condition as well.

When changing the endpoint back to the proper value

```
oc -n openshift-config create secret generic support --from-literal=endpoint=https://cloud.redhat.com/api/ingress/v1/upload --dry-run -o yaml | oc apply -f - -n openshift-config
```

The operator should return to Degraded=False, and the `UploadDegraded` condition should go away.

Comment 11 errata-xmlrpc 2019-10-16 06:37:54 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:2922

Comment 12 Ali Okan Yuksel 2020-07-30 15:43:38 UTC

This problem still exists on 4.3.29, and the solution steps are not clear enough.


Details:

[root@lbint ~]# oc describe co insights
Name:         insights
Namespace:
Labels:       <none>
Annotations:  <none>
API Version:  config.openshift.io/v1
Kind:         ClusterOperator
Metadata:
  Creation Timestamp:  2020-07-29T22:48:24Z
  Generation:          1
  Resource Version:    236636
  Self Link:           /apis/config.openshift.io/v1/clusteroperators/insights
  UID:                 df26032f-d4d3-4dd5-b399-8af9da9673d9
Spec:
Status:
  Conditions:
    Last Transition Time:  2020-07-30T09:41:48Z
    Message:               Unable to report: unable to build request to connect to Insights server: Post https://cloud.redhat.com/api/ingress/v1/upload: x509: certificate is valid for *.apps.data.tr.test.com, not cloud.redhat.com
    Reason:                UploadFailed
    Status:                True
    Type:                  Degraded
    Last Transition Time:  2020-07-29T22:48:24Z
    Status:                True
    Type:                  Available
    Last Transition Time:  2020-07-29T22:50:24Z
    Message:               An error has occurred
    Status:                False
    Type:                  Progressing
    Last Transition Time:  2020-07-29T22:50:24Z
    Status:                False
    Type:                  Disabled
    Last Transition Time:  2020-07-30T09:41:50Z
    Message:               Unable to report: unable to build request to connect to Insights server: Post https://cloud.redhat.com/api/ingress/v1/upload: x509: certificate is valid for *.apps.data.tr.test.com, not cloud.redhat.com
    Reason:                UploadFailed
    Status:                True
    Type:                  UploadDegraded
  Extension:
    Last Report Time:  <nil>
  Related Objects:
    Group:
    Name:       openshift-insights
    Resource:   namespaces
    Group:      apps
    Name:       insights-operator
    Namespace:  openshift-insights
    Resource:   deployments
  Versions:
    Name:     operator
    Version:  4.3.29
Events:       <none>