Bug 1745973
| Summary: | Insights operator should not report "degraded" after one unsuccessful upload attempt | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Radek Vokál <rvokal> |
| Component: | Insights Operator | Assignee: | Ivan Necas <inecas> |
| Status: | CLOSED ERRATA | QA Contact: | Dmitry Misharov <dmisharo> |
| Severity: | medium | Docs Contact: | Radek Vokál <rvokal> |
| Priority: | medium | | |
| Version: | 4.2.0 | CC: | dmisharo, eparis, inecas, kaox.gen, mfojtik |
| Target Milestone: | --- | | |
| Target Release: | 4.2.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2019-10-16 06:37:54 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Radek Vokál
2019-08-27 11:17:10 UTC
Failure to upload, especially when it is because "our" end is broken, must not make the operator degraded, since that means customers can't update.

Can you please provide verification steps? As I understand it, I need to simulate some network instability or inaccessibility of the ingress service.

Verified on 4.2.0-0.ci-2019-09-25-043459. Steps to verify:

1. Replace the endpoint with an invalid URL:
> oc -n openshift-config create secret generic support --from-literal=endpoint=http://localhost --dry-run -o yaml | oc apply -f - -n openshift-config
2. Restart insights-operator:
> oc delete pods --namespace=openshift-insights --all
3. Check the logs:
> insightsclient.go:163] Unable to build a request, possible invalid token: Post http://localhost: dial tcp [::1]:80: connect: connection refused
> insightsuploader.go:132] Unable to upload report after 0s: unable to build request to connect to Insights server
> status.go:145] Number of last upload failures 5 exceeded than threshold 5. Marking as degraded.

```
oc -n openshift-config create secret generic support --from-literal=endpoint=http://localhost --dry-run -o yaml | oc apply -f - -n openshift-config
oc delete pods --namespace=openshift-insights --all # to work around #1753755
```

In the operator logs, notice something like:

I0925 09:17:50.207271 1 insightsclient.go:160] Uploading application/vnd.redhat.openshift.periodic to http://localhost
I0925 09:17:50.208665 1 insightsclient.go:163] Unable to build a request, possible invalid token: Post http://localhost: dial tcp [::1]:80: connect: connection refused
I0925 09:17:50.208695 1 insightsuploader.go:132] Unable to upload report after 10ms: unable to build request to connect to Insights server
I0925 09:17:50.208708 1 controllerstatus.go:40] name=insightsuploader healthy=false reason=UploadFailed message=Unable to report: unable to build request to connect to Insights server
I0925 09:18:26.415451 1 status.go:142] Number of last upload failures 1 lower than threshold 5. Not marking as degraded.
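The two `status.go` messages quoted above suggest a simple rule: the operator stays non-degraded while the number of recent upload failures is below the threshold, and is marked degraded once the count reaches it. A minimal sketch for checking this from operator log text follows. It assumes the exact message wording shown in this report; the function name `last_upload_state` is hypothetical, and the "count >= threshold" comparison is inferred from the log lines rather than taken from the operator source.

```shell
#!/bin/sh
# Hypothetical helper (not part of oc or the operator): reads insights-operator
# log text on stdin and reports the most recent failure count vs. threshold.
# Assumes the status.go message wording quoted in this bug report.
last_upload_state() {
    sed -n 's/.*Number of last upload failures \([0-9][0-9]*\) .* threshold \([0-9][0-9]*\)\..*/\1 \2/p' |
        tail -n 1 |
        {
            if read -r count threshold; then
                # Inferred rule: degraded once the count reaches the threshold.
                if [ "$count" -ge "$threshold" ]; then
                    echo "degraded $count/$threshold"
                else
                    echo "not-degraded $count/$threshold"
                fi
            else
                # No matching status.go line was found in the input.
                echo "no-data"
            fi
        }
}
```

One possible way to use it, given the deployment name and namespace listed in this report, is `oc logs -n openshift-insights deployment/insights-operator | last_upload_state`.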
After the first few failures, the operator is still not marked as degraded:

```
oc get clusteroperator insights
NAME       VERSION                        AVAILABLE   PROGRESSING   DEGRADED   SINCE
insights   4.2.0-0.ci-2019-09-25-043459   True        False         False      37m
```

But looking at the details with `oc get clusteroperator insights -o yaml`, one sees an additional condition:

```
- lastTransitionTime: "2019-09-25T09:18:26Z"
  message: 'Unable to report: unable to build request to connect to Insights server'
  reason: UploadFailed
  status: "True"
  type: UploadDegraded
```

After 5 attempts, the operator should turn into the degraded state, while keeping the `UploadDegraded` condition as well.

When changing the endpoint back to the proper value

```
oc -n openshift-config create secret generic support --from-literal=endpoint=https://cloud.redhat.com/api/ingress/v1/upload --dry-run -o yaml | oc apply -f - -n openshift-config
```

the operator should get back to Degraded=false, and the UploadDegraded condition should go away.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:2922

This problem still exists on 4.3.29, and the solution steps are not clear enough.
Details:
[root@lbint ~]# oc describe co insights
Name: insights
Namespace:
Labels: <none>
Annotations: <none>
API Version: config.openshift.io/v1
Kind: ClusterOperator
Metadata:
Creation Timestamp: 2020-07-29T22:48:24Z
Generation: 1
Resource Version: 236636
Self Link: /apis/config.openshift.io/v1/clusteroperators/insights
UID: df26032f-d4d3-4dd5-b399-8af9da9673d9
Spec:
Status:
Conditions:
Last Transition Time: 2020-07-30T09:41:48Z
Message: Unable to report: unable to build request to connect to Insights server: Post https://cloud.redhat.com/api/ingress/v1/upload: x509: certificate is valid for *.apps.data.tr.test.com, not cloud.redhat.com
Reason: UploadFailed
Status: True
Type: Degraded
Last Transition Time: 2020-07-29T22:48:24Z
Status: True
Type: Available
Last Transition Time: 2020-07-29T22:50:24Z
Message: An error has occurred
Status: False
Type: Progressing
Last Transition Time: 2020-07-29T22:50:24Z
Status: False
Type: Disabled
Last Transition Time: 2020-07-30T09:41:50Z
Message: Unable to report: unable to build request to connect to Insights server: Post https://cloud.redhat.com/api/ingress/v1/upload: x509: certificate is valid for *.apps.data.tr.test.com, not cloud.redhat.com
Reason: UploadFailed
Status: True
Type: UploadDegraded
Extension:
Last Report Time: <nil>
Related Objects:
Group:
Name: openshift-insights
Resource: namespaces
Group: apps
Name: insights-operator
Namespace: openshift-insights
Resource: deployments
Versions:
Name: operator
Version: 4.3.29
Events: <none>
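To read the condition states out of output like the above at a glance, the `Status:` and `Type:` lines can be paired up. This is a rough sketch that assumes the `oc describe` layout shown in this report, where each condition lists `Status:` before `Type:`; the function name `condition_summary` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: reads `oc describe co <name>` text on stdin and prints
# one "Type=Status" line per condition.
# Assumes each condition block lists "Status:" before "Type:", as in this report.
condition_summary() {
    awk '
        $1 == "Status:" { status = $2 }            # remember the latest status value
        $1 == "Type:"   { print $2 "=" status; status = "" }
    '
}
```

For example, `oc describe co insights | condition_summary` on the output above would be expected to list both `Degraded` and `UploadDegraded` with status `True`.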