Bug 2081844

Summary: (release-4.10) disconnected insights operator remains degraded after editing pull secret
Product: OpenShift Container Platform
Component: Insights Operator
Status: CLOSED ERRATA
Severity: medium
Priority: unspecified
Version: 4.10
Target Release: 4.10.z
Type: Bug
Reporter: Michael Hrivnak <mhrivnak>
Assignee: Tomas Remes <tremes>
QA Contact: Joao Fula <jfula>
CC: bschmaus, inecas, mklika, tremes, wking
Clones: 2081997 2083467
Bug Depends On: 2081997
Bug Blocks: 2083467
Last Closed: 2022-05-23 13:25:12 UTC

Description Michael Hrivnak 2022-05-04 19:36:05 UTC
Description of problem:

I have a disconnected 4.10.3 cluster for a customer PoC. The insights operator is degraded as expected, because it can't reach cloud.redhat.com. I followed the directions linked below for editing the pull secret to stop it from trying to connect to cloud.redhat.com, but it keeps trying and remains degraded.

https://docs.openshift.com/container-platform/4.10/post_installation_configuration/connected-to-disconnected.html#connected-to-disconnected-restore-insights_connected-to-disconnected
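The edit itself can be sketched locally. This is a minimal sketch, not the verbatim documented procedure: the `oc` extraction and `oc set data` invocations in the comments are assumed from the linked docs and need a live cluster, and the sample JSON below is a stand-in for the real pull secret.

```shell
# Stand-in for the real pull secret. On a live cluster it would be
# extracted with something like:
#   oc get secret/pull-secret -n openshift-config \
#     -o jsonpath='{.data.\.dockerconfigjson}' | base64 -d > pull-secret.json
cat > pull-secret.json <<'EOF'
{"auths":{"cloud.redhat.com":{"auth":"c2FtcGxl"},"quay.io":{"auth":"c2FtcGxl"}}}
EOF

# Drop the cloud.redhat.com entry so the operator stops reporting:
jq 'del(.auths."cloud.redhat.com")' pull-secret.json > pull-secret-edited.json

# List the remaining registry keys; cloud.redhat.com should be gone:
jq -r '.auths | keys[]' pull-secret-edited.json   # → quay.io

# On a live cluster the edited file would then be applied with:
#   oc set data secret/pull-secret -n openshift-config \
#     --from-file=.dockerconfigjson=pull-secret-edited.json
```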


Version-Release number of selected component (if applicable):

4.10.3


How reproducible:

I've only tried once.


Steps to Reproduce:
1. install disconnected cluster with platform baremetal
2. run the documented procedure to remove "cloud.redhat.com" from the cluster's pull secret
3. run `oc get co` or `oc get clusterversion`


Actual results:

See that insights is still degraded.

% oc get co insights
NAME       VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
insights   4.10.3    True        False         True       5d23h   Unable to report: unable to build request to connect to Insights server: Post "https://cloud.redhat.com/api/ingress/v1/upload": dial tcp 23.218.165.26:443: i/o timeout

% oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.3    True        False         5d23h   Error while reconciling 4.10.3: the cluster operator insights is degraded


Expected results:

not degraded


Additional info:

Here are the contents of the pull secret, just to show that I followed the directions (I changed the cluster FQDN to remove customer references). Note that cloud.redhat.com is not in the list:

% oc get secret/pull-secret -n openshift-config -o jsonpath='{.data.\.dockerconfigjson}' | base64 -d | jq '.auths | keys[]'
"poc-registry-quay-quay-poc.apps.some-demo-cluster.some.customer.com"
"quay.io"
"registry.connect.redhat.com"
"registry.redhat.io"
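As a self-contained illustration of why the pipeline above needs `base64 -d`: Kubernetes stores secret values base64-encoded, so the `.dockerconfigjson` payload must be decoded before jq can read it. The sample JSON here is hypothetical, not the cluster's actual secret.

```shell
# Hypothetical payload standing in for .data.".dockerconfigjson":
printf '%s' '{"auths":{"quay.io":{"auth":"c2FtcGxl"}}}' | base64 > encoded.txt

# Decoding recovers the JSON; jq then lists the registry keys:
base64 -d encoded.txt | jq -r '.auths | keys[]'   # → quay.io
```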


The insights-operator pod log shows this every 2 minutes:

I0504 19:16:01.859337       1 controller.go:203] Number of last upload failures 341 exceeded the threshold 5. Marking as degraded.
I0504 19:16:01.859386       1 controller.go:380] The operator has some internal errors: Unable to report: unable to build request to connect to Insights server: Post "https://cloud.redhat.com/api/ingress/v1/upload": dial tcp 23.218.165.26:443: i/o timeout
I0504 19:16:01.859391       1 controller.go:385] The operator is marked as disabled


I also see that when I made the change to the pull secret, two new machineconfigs were created: one for masters and one for workers. Each master node has these annotations indicating that the new machineconfig was applied:

  "machineconfiguration.openshift.io/currentConfig": "rendered-master-59c21ecfa14984911e815a3d4e1eb0db",
  "machineconfiguration.openshift.io/desiredConfig": "rendered-master-59c21ecfa14984911e815a3d4e1eb0db",

% oc get machineconfigs
NAME                                               GENERATEDBYCONTROLLER                      IGNITIONVERSION   AGE
00-master                                          14a1ca2cb91ff7e0faf9146b21ba12cd6c652d22   3.2.0             6d
00-worker                                          14a1ca2cb91ff7e0faf9146b21ba12cd6c652d22   3.2.0             6d
01-master-container-runtime                        14a1ca2cb91ff7e0faf9146b21ba12cd6c652d22   3.2.0             6d
01-master-kubelet                                  14a1ca2cb91ff7e0faf9146b21ba12cd6c652d22   3.2.0             6d
01-worker-container-runtime                        14a1ca2cb91ff7e0faf9146b21ba12cd6c652d22   3.2.0             6d
01-worker-kubelet                                  14a1ca2cb91ff7e0faf9146b21ba12cd6c652d22   3.2.0             6d
50-masters-chrony-configuration                                                               2.2.0             6d
50-workers-chrony-configuration                                                               2.2.0             6d
99-assisted-installer-master-ssh                                                              3.1.0             6d
99-master-generated-registries                     14a1ca2cb91ff7e0faf9146b21ba12cd6c652d22   3.2.0             6d
99-master-ssh                                                                                 3.2.0             6d
99-worker-generated-registries                     14a1ca2cb91ff7e0faf9146b21ba12cd6c652d22   3.2.0             6d
99-worker-ssh                                                                                 3.2.0             6d
rendered-master-59c21ecfa14984911e815a3d4e1eb0db   14a1ca2cb91ff7e0faf9146b21ba12cd6c652d22   3.2.0             107m
rendered-master-5dc83cfb95c57b713c47070143b4b429   14a1ca2cb91ff7e0faf9146b21ba12cd6c652d22   3.2.0             6d
rendered-worker-abd7e90e575ef59cee3208838d35613d   14a1ca2cb91ff7e0faf9146b21ba12cd6c652d22   3.2.0             6d
rendered-worker-bad7a562a38faa199500cf825b227404   14a1ca2cb91ff7e0faf9146b21ba12cd6c652d22   3.2.0             107m

Comment 1 Michael Hrivnak 2022-05-04 21:08:09 UTC
I also see this in the operator log every 5 minutes:


I0504 20:57:08.916006       1 configobserver.go:127] Refreshing configuration from cluster pull secret
I0504 20:57:08.918181       1 configobserver.go:154] Refreshing configuration from cluster secret
I0504 20:57:08.919550       1 configobserver.go:112] support secret does not exist

Comment 6 Joao Fula 2022-05-12 09:57:57 UTC
Verified on 4.10.0-0.nightly-2022-05-11-183751.

Steps to reproduce:
1. Create the support secret.
2. Add the key/value pairs:
       interval = 1m
       endpoint = https://httpstat.us/404
3. Wait until the insights cluster operator becomes degraded.
4. Remove the cloud token from the pull secret.
5. Wait until the insights cluster operator becomes disabled.
Conditions to verify:
    insights cluster operator is not degraded
    insights cluster operator is not upload_degraded
    insights cluster operator is disabled
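Steps 1 and 2 above can be sketched as a manifest. This assumes the operator reads a secret named "support" in the openshift-config namespace (consistent with the "support secret does not exist" log line in comment 1); treat the name and namespace as assumptions, not the verified procedure.

```yaml
# Assumed name/namespace: secret "support" in "openshift-config"
apiVersion: v1
kind: Secret
metadata:
  name: support
  namespace: openshift-config
type: Opaque
stringData:
  interval: 1m
  endpoint: https://httpstat.us/404
```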

Comment 9 errata-xmlrpc 2022-05-23 13:25:12 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.10.15 bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2022:2258