Description of problem: The Insights operator switches to "Degraded" after a single unsuccessful attempt to upload data. Insights data is not critical, so I propose we report Degraded only after at least X failed attempts, where X > 5 or even higher.
A failure to upload, especially one caused by "our" end being broken, must not make the operator degraded, since that means customers can't update.
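To make the proposal concrete, here is a minimal Go sketch of a consecutive-failure counter. It is not the operator's code: the `uploadTracker` type, its fields, and the threshold value of 5 are all hypothetical and chosen only for illustration.

```go
package main

import (
	"errors"
	"fmt"
)

// errUpload stands in for any upload error; the text is illustrative.
var errUpload = errors.New("unable to build request to connect to Insights server")

// uploadTracker counts consecutive upload failures. The name and fields are
// hypothetical; they are not taken from the operator's source.
type uploadTracker struct {
	threshold int // consecutive failures required before reporting degraded
	failures  int // consecutive failures seen so far
}

// record notes the outcome of one upload attempt and reports whether the
// operator should now be considered degraded. A success resets the counter.
func (t *uploadTracker) record(err error) bool {
	if err == nil {
		t.failures = 0
		return false
	}
	t.failures++
	return t.failures >= t.threshold
}

func main() {
	tracker := &uploadTracker{threshold: 5} // 5 is an assumed value
	for attempt := 1; attempt <= 6; attempt++ {
		fmt.Printf("attempt %d: degraded=%v\n", attempt, tracker.record(errUpload))
	}
	fmt.Printf("after success: degraded=%v\n", tracker.record(nil))
}
```

Resetting the counter on success means only an uninterrupted run of failures trips the degraded state, so a single transient network error never does.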
Can you please provide verification steps? As I understand it, I need to simulate some network instability or make the ingress service inaccessible.
Verified on 4.2.0-0.ci-2019-09-25-043459.

Steps to verify:
1. Replace the endpoint with an invalid URL:
> oc -n openshift-config create secret generic support --from-literal=endpoint=http://localhost --dry-run -o yaml | oc apply -f - -n openshift-config
2. Restart insights-operator:
> oc delete pods --namespace=openshift-insights --all
3. Check the logs:
> insightsclient.go:163] Unable to build a request, possible invalid token: Post http://localhost: dial tcp [::1]:80: connect: connection refused
> insightsuploader.go:132] Unable to upload report after 0s: unable to build request to connect to Insights server
> status.go:145] Number of last upload failures 5 exceeded than threshold 5. Marking as degraded.
```
oc -n openshift-config create secret generic support --from-literal=endpoint=http://localhost --dry-run -o yaml | oc apply -f - -n openshift-config
oc delete pods --namespace=openshift-insights --all # to work around #1753755
```

In the operator logs, notice something like:

I0925 09:17:50.207271 1 insightsclient.go:160] Uploading application/vnd.redhat.openshift.periodic to http://localhost
I0925 09:17:50.208665 1 insightsclient.go:163] Unable to build a request, possible invalid token: Post http://localhost: dial tcp [::1]:80: connect: connection refused
I0925 09:17:50.208695 1 insightsuploader.go:132] Unable to upload report after 10ms: unable to build request to connect to Insights server
I0925 09:17:50.208708 1 controllerstatus.go:40] name=insightsuploader healthy=false reason=UploadFailed message=Unable to report: unable to build request to connect to Insights server
I0925 09:18:26.415451 1 status.go:142] Number of last upload failures 1 lower than threshold 5. Not marking as degraded.

After the first few failures, the operator is still not marked as degraded:

```
oc get clusteroperator insights
NAME       VERSION                        AVAILABLE   PROGRESSING   DEGRADED   SINCE
insights   4.2.0-0.ci-2019-09-25-043459   True        False         False      37m
```

But looking at the details with `oc get clusteroperator insights -o yaml`, one sees an additional condition:

```
- lastTransitionTime: "2019-09-25T09:18:26Z"
  message: 'Unable to report: unable to build request to connect to Insights server'
  reason: UploadFailed
  status: "True"
  type: UploadDegraded
```

After 5 attempts, the operator should turn into the degraded state, while keeping the `UploadDegraded` condition as well.

When changing the endpoint back to the proper value

```
oc -n openshift-config create secret generic support --from-literal=endpoint=https://cloud.redhat.com/api/ingress/v1/upload --dry-run -o yaml | oc apply -f - -n openshift-config
```

the operator should get back to Degraded=false, and the `UploadDegraded` condition should go away.
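For anyone scripting this verification, the check on the two conditions could be sketched roughly as follows in Go. This is only an illustration: the `condition` and `clusterOperator` types, the `conditionStatus` helper, and the `sample` document are hypothetical, and the JSON is a hand-written stand-in for what `oc get clusteroperator insights -o json` might return during the first few failures.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// condition holds the ClusterOperator condition fields used in this check.
type condition struct {
	Type    string `json:"type"`
	Status  string `json:"status"`
	Reason  string `json:"reason"`
	Message string `json:"message"`
}

// clusterOperator models only the part of the object that is inspected.
type clusterOperator struct {
	Status struct {
		Conditions []condition `json:"conditions"`
	} `json:"status"`
}

// conditionStatus returns the status string of the named condition and
// whether that condition was present at all.
func conditionStatus(raw []byte, condType string) (string, bool, error) {
	var co clusterOperator
	if err := json.Unmarshal(raw, &co); err != nil {
		return "", false, err
	}
	for _, c := range co.Status.Conditions {
		if c.Type == condType {
			return c.Status, true, nil
		}
	}
	return "", false, nil
}

// sample is a hand-written stand-in, not captured cluster output.
const sample = `{"status":{"conditions":[
  {"type":"Degraded","status":"False"},
  {"type":"UploadDegraded","status":"True","reason":"UploadFailed",
   "message":"Unable to report: unable to build request to connect to Insights server"}
]}}`

func main() {
	for _, name := range []string{"Degraded", "UploadDegraded"} {
		status, found, err := conditionStatus([]byte(sample), name)
		if err != nil {
			panic(err)
		}
		fmt.Printf("%s: status=%q found=%v\n", name, status, found)
	}
}
```

Returning a separate `found` flag distinguishes "condition is False" from "condition has gone away", which matters for the last step, where `UploadDegraded` is expected to disappear.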
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2019:2922
This problem still exists on 4.3.29, and the solution steps are not clear enough. Details:

[root@lbint ~]# oc describe co insights
Name:         insights
Namespace:
Labels:       <none>
Annotations:  <none>
API Version:  config.openshift.io/v1
Kind:         ClusterOperator
Metadata:
  Creation Timestamp:  2020-07-29T22:48:24Z
  Generation:          1
  Resource Version:    236636
  Self Link:           /apis/config.openshift.io/v1/clusteroperators/insights
  UID:                 df26032f-d4d3-4dd5-b399-8af9da9673d9
Spec:
Status:
  Conditions:
    Last Transition Time:  2020-07-30T09:41:48Z
    Message:               Unable to report: unable to build request to connect to Insights server: Post https://cloud.redhat.com/api/ingress/v1/upload: x509: certificate is valid for *.apps.data.tr.test.com, not cloud.redhat.com
    Reason:                UploadFailed
    Status:                True
    Type:                  Degraded
    Last Transition Time:  2020-07-29T22:48:24Z
    Status:                True
    Type:                  Available
    Last Transition Time:  2020-07-29T22:50:24Z
    Message:               An error has occurred
    Status:                False
    Type:                  Progressing
    Last Transition Time:  2020-07-29T22:50:24Z
    Status:                False
    Type:                  Disabled
    Last Transition Time:  2020-07-30T09:41:50Z
    Message:               Unable to report: unable to build request to connect to Insights server: Post https://cloud.redhat.com/api/ingress/v1/upload: x509: certificate is valid for *.apps.data.tr.test.com, not cloud.redhat.com
    Reason:                UploadFailed
    Status:                True
    Type:                  UploadDegraded
  Extension:
    Last Report Time:  <nil>
  Related Objects:
    Group:
    Name:       openshift-insights
    Resource:   namespaces
    Group:      apps
    Name:       insights-operator
    Namespace:  openshift-insights
    Resource:   deployments
  Versions:
    Name:     operator
    Version:  4.3.29
Events:  <none>
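The x509 message in the output above is a hostname mismatch: the certificate presented for the upload connection only covers `*.apps.data.tr.test.com`. As a rough illustration of that check (not a diagnosis of this cluster), the Go sketch below runs the standard library's hostname verification against a certificate carrying only that wildcard name. The `coversHost` helper and the `console.apps.data.tr.test.com` host are hypothetical, introduced only for the example.

```go
package main

import (
	"crypto/x509"
	"fmt"
)

// coversHost returns nil if a certificate carrying the given DNS names would
// pass Go's hostname check for host, and an error otherwise. Only the name
// match is checked here; no chain of trust is verified.
func coversHost(dnsNames []string, host string) error {
	cert := &x509.Certificate{DNSNames: dnsNames}
	return cert.VerifyHostname(host)
}

func main() {
	// The wildcard name is taken from the error message in the report.
	names := []string{"*.apps.data.tr.test.com"}

	// The second host is a made-up name under the wildcard, for contrast.
	hosts := []string{"cloud.redhat.com", "console.apps.data.tr.test.com"}
	for _, host := range hosts {
		if err := coversHost(names, host); err != nil {
			fmt.Printf("%s: rejected: %v\n", host, err)
		} else {
			fmt.Printf("%s: accepted\n", host)
		}
	}
}
```

If this reading is right, the operator is reaching something that answers with the cluster's own wildcard certificate instead of the Insights endpoint, so checking what actually terminates the connection to cloud.redhat.com (for example a proxy or DNS override) may be a useful next step.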