Bug 1945326 - console-operator: does not check route health periodically
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Management Console
Version: 4.7
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: 4.8.0
Assignee: Jakub Hadvig
QA Contact: Yadan Pei
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2021-03-31 16:54 UTC by Seth Jennings
Modified: 2021-07-27 22:57 UTC
CC List: 4 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-07-27 22:57:00 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Github openshift console-operator pull 521 0 None open Bug 1945326: Resync controllers every minute 2021-04-06 10:01:15 UTC
Red Hat Product Errata RHSA-2021:2438 0 None None None 2021-07-27 22:57:28 UTC

Description Seth Jennings 2021-03-31 16:54:54 UTC
Description of problem:

If ingress is not completely linked end-to-end before console-operator runs its health check, the console ClusterOperator will report Degraded=True for a long time, because the operator does not re-check route health unless something else triggers a sync in the controller.

https://github.com/openshift/console-operator/blob/fc764e21d15174c495a8b79aa46d4f22158f1049/pkg/console/controllers/route/controller.go#L150
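
For illustration, a minimal sketch of the kind of periodic resync being requested, assuming openshift/library-go's controller factory (which console-operator is built on): ResyncEvery re-queues the sync key on a fixed interval, so the sync handler re-checks route health even when no watched resource changes. Names and wiring below are illustrative, not the actual operator code.

package main

import (
	"context"
	"time"

	"github.com/openshift/library-go/pkg/controller/factory"
	"github.com/openshift/library-go/pkg/operator/events"
)

// syncRouteHealth stands in for the route controller's sync handler: it
// would probe the console route and return an error (setting Degraded)
// only while the route is actually unhealthy.
func syncRouteHealth(ctx context.Context, syncCtx factory.SyncContext) error {
	// ... fetch the route, hit its health endpoint, update conditions ...
	return nil
}

func main() {
	recorder := events.NewInMemoryRecorder("console-route-sync")
	ctrl := factory.New().
		WithSync(syncRouteHealth).
		ResyncEvery(time.Minute). // periodic resync, independent of watch events
		ToController("ConsoleRouteSyncController", recorder)

	ctrl.Run(context.TODO(), 1) // blocks; one worker is enough for a single sync key
}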

Version-Release number of selected component (if applicable):
4.7.5

How reproducible:
Always when console comes up before ingress

Steps to Reproduce:
1.
2.
3.

Actual results:
console ClusterOperator remains Degraded even though the route's health endpoint is working

Expected results:
console ClusterOperator should check route health periodically, to minimize the time between the route becoming healthy and the Degraded condition on the console ClusterOperator being set back to False.

Additional info:
We have a test that waits for all ClusterOperators to be Available=True and Degraded=False as a metric of cluster installation completion. However, this does not currently work, because console stays Degraded for a long (upper bound unknown) period of time.

Comment 2 Yadan Pei 2021-04-09 06:32:17 UTC
1. First set deployment/console-operator to unmanaged via a ClusterVersion override, then update deployment/console-operator to a higher log level, such as debug
$ cat <<EOF >version-patch-first-override.yaml
- op: add
  path: /spec/overrides
  value:
  - kind: Deployment
    group: apps/v1
    name: console-operator
    namespace: openshift-console-operator
    unmanaged: true
EOF

$ oc patch clusterversion version --type json -p "$(cat version-patch-first-override.yaml)"
$ oc edit deployment console-operator -n openshift-console-operator
       containers:
       - args:
         - -v=2   => change it to 4 and Save
         - --config=/var/run/configmaps/config/controller-config.yaml
         command:
         - console
         - operator

2. Check new console-operator pod logs
$ oc get pods -n openshift-console-operator
NAME                               READY   STATUS    RESTARTS   AGE
console-operator-d6bb5df8d-2dp9h   1/1     Running   0          3m
$ oc logs -f console-operator-d6bb5df8d-2dp9h
.....
I0409 05:55:13.234330       1 operator.go:177] started syncing operator "cluster" (2021-04-09 05:55:13.234323559 +0000 UTC m=+86.734094448)
...
I0409 05:56:13.233622       1 operator.go:177] started syncing operator "cluster" (2021-04-09 05:56:13.233613838 +0000 UTC m=+146.733384727)
.....
I0409 05:57:13.236961       1 operator.go:177] started syncing operator "cluster" (2021-04-09 05:57:13.236956616 +0000 UTC m=+206.736727605)
I0409 05:57:13.243762       1 apps.go:154] Deployment "openshift-console/downloads" changes: {"spec":{"progressDeadlineSeconds":null,"revisionHistoryLimit":null,"strategy":{"rollingUpdate":null,"type":null},"template":{"spec":{"containers":[{"args":["-c","cat \u003c\u003cEOF \u003e\u003e/tmp/serve.py\nimport errno, http.server, os, re, signal, socket, sys, tarfile, tempfile, threading, time, zipfile\n\nsignal.signal(signal.SIGTERM, lambda signum, frame: sys.exit(0))\n\ndef write_index(path, message):\n\twith open(path, 'wb') as f:\n\t\tf.write('\\n'.join([\n\t\t\t'\u003c!doctype html\u003e',\n\t\t\t'\u003chtml lang=\"en\"\u003e',\n\t\t\t'\u003chead\u003e',\n\t\t\t'  \u003cmeta charset=\"utf-8\"\u003e',\n\t\t\t'\u003c/head\u003e',\n\t\t\t'\u003cbody\u003e',\n\t\t\t'  {}'.format(message),\n\t\t\t'\u003c/body\u003e',\n\t\t\t'\u003c/html\u003e',\n\t\t\t'',\n\t\t]).encode('utf-8'))\n\n# Launch multiple listeners as threads\nclass Thread(threading.Thread):\n\tdef __init__(self, i, socket):\n\t\tthreading.Thread.__init__(self)\n\t\tself.i = i\n\t\tself.socket = socket\n\t\tself.daemon = True\n\t\tself.start()\n\n\tdef run(self):\n\t\thttpd = http.server.HTTPServer(addr, http.server.SimpleHTTPRequestHandler, False)\n\n\t\t# Prevent the HTTP server from re-binding every handler.\n\t\t# https://stackoverflow.com/questions/46210672/\n\t\thttpd.socket = self.socket\n\t\thttpd.server_bind = self.server_close = lambda self: None\n\n\t\thttpd.serve_forever()\n\ntemp_dir = tempfile.mkdtemp()\nprint('serving from {}'.format(temp_dir))\nos.chdir(temp_dir)\nfor arch in ['amd64']:\n\tos.mkdir(arch)\n\tfor operating_system in ['linux', 'mac', 'windows']:\n\t\tos.mkdir(os.path.join(arch, operating_system))\nfor arch in ['arm64', 'ppc64le', 's390x']:\n\tos.mkdir(arch)\n\tfor operating_system in ['linux']:\n\t\tos.mkdir(os.path.join(arch, operating_system))\ncontent = ['\u003ca href=\"oc-license\"\u003elicense\u003c/a\u003e']\nos.symlink('/usr/share/openshift/LICENSE', 'oc-license')\n\nfor arch, operating_system, path in [\n\t\t('amd64', 'linux', '/usr/share/openshift/linux_amd64/oc'),\n\t\t('amd64', 'mac', '/usr/share/openshift/mac/oc'),\n\t\t('amd64', 'windows', '/usr/share/openshift/windows/oc.exe'),\n\t\t('arm64', 'linux', '/usr/share/openshift/linux_arm64/oc'),\n\t\t('ppc64le', 'linux', '/usr/share/openshift/linux_ppc64le/oc'),\n\t\t('s390x', 'linux', '/usr/share/openshift/linux_s390x/oc'),\n\t\t]:\n\tbasename = os.path.basename(path)\n\ttarget_path = os.path.join(arch, operating_system, basename)\n\tos.symlink(path, target_path)\n\tbase_root, _ = os.path.splitext(basename)\n\tarchive_path_root = os.path.join(arch, operating_system, base_root)\n\twith tarfile.open('{}.tar'.format(archive_path_root), 'w') as tar:\n\t\ttar.add(path, basename)\n\twith zipfile.ZipFile('{}.zip'.format(archive_path_root), 'w') as zip:\n\t\tzip.write(path, basename)\n\tcontent.append('\u003ca href=\"{0}\"\u003eoc ({1} {2})\u003c/a\u003e (\u003ca href=\"{0}.tar\"\u003etar\u003c/a\u003e \u003ca href=\"{0}.zip\"\u003ezip\u003c/a\u003e)'.format(target_path, arch, operating_system))\n\nfor root, directories, filenames in os.walk(temp_dir):\n\troot_link = os.path.relpath(temp_dir, os.path.join(root, 'child')).replace(os.path.sep, '/')\n\tfor directory in directories:\n\t\twrite_index(\n\t\t\tpath=os.path.join(root, directory, 'index.html'),\n\t\t\tmessage='\u003cp\u003eDirectory listings are disabled.  
See \u003ca href=\"{}\"\u003ehere\u003c/a\u003e for available content.\u003c/p\u003e'.format(root_link),\n\t\t)\n\nwrite_index(\n\tpath=os.path.join(temp_dir, 'index.html'),\n\tmessage='\\n'.join(\n\t\t['\u003cul\u003e'] +\n\t\t['  \u003cli\u003e{}\u003c/li\u003e'.format(entry) for entry in content] +\n\t\t['\u003c/ul\u003e']\n\t),\n)\n\n# Create socket\n# IPv6 should handle IPv4 passively so long as it is not bound to a\n# specific address or set to IPv6_ONLY\n# https://stackoverflow.com/questions/25817848/python-3-does-http-server-support-ipv6\ntry:\n\taddr = ('::', 8080)\n\tsock = socket.socket(socket.AF_INET6, socket.SOCK_STREAM)\nexcept socket.error as err:\n\t# errno.EAFNOSUPPORT is \"socket.error: [Errno 97] Address family not supported by protocol\"\n\t# When IPv6 is disabled, socket will bind using IPv4.\n\tif err.errno == errno.EAFNOSUPPORT:\n\t\taddr = ('', 8080)\n\t\tsock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)\n\telse:\n\t\traise    \nsock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)\nsock.bind(addr)\nsock.listen(5)\n\n[Thread(i, socket=sock) for i in range(100)]\ntime.sleep(9e9)\nEOF\nexec python3 /tmp/serve.py"],"command":["/bin/sh"],"image":"quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:3c18c81505b650222a12eb50d5fdeff4546a01e05d7ae722009a2ef66fa54a07","imagePullPolicy":"IfNotPresent","livenessProbe":{"httpGet":{"path":"/","port":8080,"scheme":"HTTP"}},"name":"download-server","ports":[{"containerPort":8080,"name":"http","protocol":"TCP"}],"readinessProbe":{"failureThreshold":3,"httpGet":{"path":"/","port":8080,"scheme":"HTTP"}},"resources":{"requests":{"cpu":"10m","memory":"50Mi"}},"terminationMessagePolicy":"FallbackToLogsOnError"}],"dnsPolicy":null,"restartPolicy":null,"schedulerName":null,"securityContext":null}}}}
I0409 05:57:13.249214       1 event.go:282] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-console-operator", Name:"console-operator", UID:"c959a033-0d2a-4c6b-9d1f-f313f719f71c", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'DeploymentUpdated' Updated Deployment.apps/downloads -n openshift-console because it changed
I0409 05:57:13.249640       1 route.go:151] route "downloads" ingress 'default' found and admitted, host: downloads-openshift-console.apps.yapei48.qe.azure.devcluster.openshift.com
I0409 05:57:13.251632       1 operator.go:226] console is in a managed state.
I0409 05:57:13.251647       1 sync_v400.go:49] running sync loop 4.0.0
I0409 05:57:13.253425       1 controller.go:238] oc-cli-downloads consoleclidownloads custom resource exists and is in the correct state
I0409 05:57:13.261187       1 controller.go:238] odo-cli-downloads consoleclidownloads custom resource exists and is in the correct state
I0409 05:57:13.282418       1 route.go:151] route "console" ingress 'default' found and admitted, host: console-openshift-console.apps.yapei48.qe.azure.devcluster.openshift.com
E0409 05:57:13.290784       1 sync_v400.go:514] failed to get "prometheus-plugin2" plugin: consoleplugin.console.openshift.io "prometheus-plugin2" not found
I0409 05:57:13.322490       1 sync_v400.go:377] service-ca configmap exists and is in the correct state
I0409 05:57:13.336928       1 sync_v400.go:408] trusted-ca-bundle configmap exists and is in the correct state
I0409 05:57:13.336970       1 sync_v400.go:478] no custom logo configured
I0409 05:57:13.435253       1 request.go:591] Throttling request took 98.236356ms, request: GET:https://172.30.0.1:443/api/v1/namespaces/openshift-console/configmaps/default-ingress-cert
I0409 05:57:13.635641       1 request.go:591] Throttling request took 196.582115ms, request: GET:https://172.30.0.1:443/api/v1/namespaces/openshift-console/secrets/console-oauth-config
I0409 05:57:13.639303       1 route.go:151] route "console" ingress 'default' found and admitted, host: console-openshift-console.apps.yapei48.qe.azure.devcluster.openshift.com
I0409 05:57:13.687257       1 sync_v400.go:135] -----------------------
I0409 05:57:13.687275       1 sync_v400.go:136] sync loop 4.0.0 resources updated: false
I0409 05:57:13.687285       1 sync_v400.go:137] -----------------------
I0409 05:57:13.687299       1 sync_v400.go:166] sync_v400: updating console status
I0409 05:57:13.687306       1 route.go:151] route "console" ingress 'default' found and admitted, host: console-openshift-console.apps.yapei48.qe.azure.devcluster.openshift.com
I0409 05:57:13.687314       1 metrics.go:47] metric console_url host transition: https://console-openshift-console.apps.yapei48.qe.azure.devcluster.openshift.com to https://console-openshift-console.apps.yapei48.qe.azure.devcluster.openshift.com
I0409 05:57:13.835404       1 request.go:591] Throttling request took 147.995695ms, request: GET:https://172.30.0.1:443/api/v1/namespaces/openshift-config-managed/configmaps/console-public
I0409 05:57:13.838633       1 sync_v400.go:183] sync loop 4.0.0 complete
I0409 05:57:13.838671       1 operator.go:218] finished syncing operator "cluster" (23.3µs)
I0409 05:57:16.799595       1 httplog.go:89] "HTTP" verb="GET" URI="/readyz" latency="7.240007ms" userAgent="kube-probe/1.20" srcIP="10.129.0.1:54160" resp=200
I0409 05:57:17.957826       1 reflector.go:530] k8s.io/apiserver/pkg/server/dynamiccertificates/configmap_cafile_content.go:206: Watch close - *v1.ConfigMap total 0 items received
I0409 05:57:18.275074       1 reflector.go:530] k8s.io/apiserver/pkg/authentication/request/headerrequest/requestheader_controller.go:172: Watch close - *v1.ConfigMap total 0 items received
I0409 05:57:18.452014       1 reflector.go:530] k8s.io/apiserver/pkg/server/dynamiccertificates/configmap_cafile_content.go:206: Watch close - *v1.ConfigMap total 0 items received
I0409 05:57:20.097500       1 httplog.go:89] "HTTP" verb="GET" URI="/healthz" latency="1.958728ms" userAgent="kube-probe/1.20" srcIP="10.129.0.1:54190" resp=200
I0409 05:57:26.244241       1 httplog.go:89] "HTTP" verb="GET" URI="/metrics" latency="15.018822ms" userAgent="Prometheus/2.24.1" srcIP="10.131.0.37:35174" resp=200
I0409 05:57:26.793129       1 httplog.go:89] "HTTP" verb="GET" URI="/readyz" latency="516.707µs" userAgent="kube-probe/1.20" srcIP="10.129.0.1:54266" resp=200
I0409 05:57:27.958254       1 reflector.go:530] k8s.io/apiserver/pkg/server/dynamiccertificates/configmap_cafile_content.go:206: Watch close - *v1.ConfigMap total 0 items received
I0409 05:57:28.275695       1 reflector.go:530] k8s.io/apiserver/pkg/authentication/request/headerrequest/requestheader_controller.go:172: Watch close - *v1.ConfigMap total 0 items received
I0409 05:57:28.452950       1 reflector.go:530] k8s.io/apiserver/pkg/server/dynamiccertificates/configmap_cafile_content.go:206: Watch close - *v1.ConfigMap total 0 items received
I0409 05:57:30.099916       1 httplog.go:89] "HTTP" verb="GET" URI="/healthz" latency="4.304364ms" userAgent="kube-probe/1.20" srcIP="10.129.0.1:54298" resp=200
I0409 05:57:31.095523       1 httplog.go:89] "HTTP" verb="GET" URI="/metrics" latency="3.880158ms" userAgent="Prometheus/2.24.1" srcIP="10.128.2.25:54256" resp=200
I0409 05:57:36.800937       1 httplog.go:89] "HTTP" verb="GET" URI="/readyz" latency="8.219122ms" userAgent="kube-probe/1.20" srcIP="10.129.0.1:54372" resp=200
I0409 05:57:37.959423       1 reflector.go:530] k8s.io/apiserver/pkg/server/dynamiccertificates/configmap_cafile_content.go:206: Watch close - *v1.ConfigMap total 0 items received
I0409 05:57:38.276210       1 reflector.go:530] k8s.io/apiserver/pkg/authentication/request/headerrequest/requestheader_controller.go:172: Watch close - *v1.ConfigMap total 0 items received
I0409 05:57:38.453366       1 reflector.go:530] k8s.io/apiserver/pkg/server/dynamiccertificates/configmap_cafile_content.go:206: Watch close - *v1.ConfigMap total 0 items received
I0409 05:57:40.098124       1 httplog.go:89] "HTTP" verb="GET" URI="/healthz" latency="2.273034ms" userAgent="kube-probe/1.20" srcIP="10.129.0.1:54400" resp=200
I0409 05:57:46.797793       1 httplog.go:89] "HTTP" verb="GET" URI="/readyz" latency="4.834372ms" userAgent="kube-probe/1.20" srcIP="10.129.0.1:54480" resp=200
I0409 05:57:47.961905       1 reflector.go:530] k8s.io/apiserver/pkg/server/dynamiccertificates/configmap_cafile_content.go:206: Watch close - *v1.ConfigMap total 0 items received
I0409 05:57:48.276846       1 reflector.go:530] k8s.io/apiserver/pkg/authentication/request/headerrequest/requestheader_controller.go:172: Watch close - *v1.ConfigMap total 0 items received
I0409 05:57:48.453834       1 reflector.go:530] k8s.io/apiserver/pkg/server/dynamiccertificates/configmap_cafile_content.go:206: Watch close - *v1.ConfigMap total 0 items received
I0409 05:57:50.102234       1 httplog.go:89] "HTTP" verb="GET" URI="/healthz" latency="6.474696ms" userAgent="kube-probe/1.20" srcIP="10.129.0.1:54516" resp=200
I0409 05:57:56.243922       1 httplog.go:89] "HTTP" verb="GET" URI="/metrics" latency="14.954522ms" userAgent="Prometheus/2.24.1" srcIP="10.131.0.37:35174" resp=200
I0409 05:57:56.797847       1 httplog.go:89] "HTTP" verb="GET" URI="/readyz" latency="4.847772ms" userAgent="kube-probe/1.20" srcIP="10.129.0.1:54584" resp=200
I0409 05:57:57.962275       1 reflector.go:530] k8s.io/apiserver/pkg/server/dynamiccertificates/configmap_cafile_content.go:206: Watch close - *v1.ConfigMap total 0 items received
I0409 05:57:58.277971       1 reflector.go:530] k8s.io/apiserver/pkg/authentication/request/headerrequest/requestheader_controller.go:172: Watch close - *v1.ConfigMap total 0 items received
I0409 05:57:58.455054       1 reflector.go:530] k8s.io/apiserver/pkg/server/dynamiccertificates/configmap_cafile_content.go:206: Watch close - *v1.ConfigMap total 0 items received
I0409 05:58:00.097874       1 httplog.go:89] "HTTP" verb="GET" URI="/healthz" latency="1.931529ms" userAgent="kube-probe/1.20" srcIP="10.129.0.1:54612" resp=200
I0409 05:58:01.109778       1 httplog.go:89] "HTTP" verb="GET" URI="/metrics" latency="3.764557ms" userAgent="Prometheus/2.24.1" srcIP="10.128.2.25:54256" resp=200
I0409 05:58:06.796767       1 httplog.go:89] "HTTP" verb="GET" URI="/readyz" latency="4.063661ms" userAgent="kube-probe/1.20" srcIP="10.129.0.1:54684" resp=200
I0409 05:58:07.963448       1 reflector.go:530] k8s.io/apiserver/pkg/server/dynamiccertificates/configmap_cafile_content.go:206: Watch close - *v1.ConfigMap total 0 items received
I0409 05:58:07.964909       1 reflector.go:434] k8s.io/apiserver/pkg/server/dynamiccertificates/configmap_cafile_content.go:206: watch of *v1.ConfigMap closed with: too old resource version: 113453 (113491)
I0409 05:58:08.279187       1 reflector.go:530] k8s.io/apiserver/pkg/authentication/request/headerrequest/requestheader_controller.go:172: Watch close - *v1.ConfigMap total 0 items received
I0409 05:58:08.280922       1 reflector.go:434] k8s.io/apiserver/pkg/authentication/request/headerrequest/requestheader_controller.go:172: watch of *v1.ConfigMap closed with: too old resource version: 113456 (113492)
I0409 05:58:08.455950       1 reflector.go:530] k8s.io/apiserver/pkg/server/dynamiccertificates/configmap_cafile_content.go:206: Watch close - *v1.ConfigMap total 0 items received
I0409 05:58:08.457314       1 reflector.go:434] k8s.io/apiserver/pkg/server/dynamiccertificates/configmap_cafile_content.go:206: watch of *v1.ConfigMap closed with: too old resource version: 113457 (113492)
I0409 05:58:09.293723       1 reflector.go:255] Listing and watching *v1.ConfigMap from k8s.io/apiserver/pkg/server/dynamiccertificates/configmap_cafile_content.go:206
I0409 05:58:09.346901       1 reflector.go:255] Listing and watching *v1.ConfigMap from k8s.io/apiserver/pkg/server/dynamiccertificates/configmap_cafile_content.go:206
I0409 05:58:09.485458       1 reflector.go:255] Listing and watching *v1.ConfigMap from k8s.io/apiserver/pkg/authentication/request/headerrequest/requestheader_controller.go:172
I0409 05:58:10.101434       1 httplog.go:89] "HTTP" verb="GET" URI="/healthz" latency="6.096593ms" userAgent="kube-probe/1.20" srcIP="10.129.0.1:54716" resp=200
I0409 05:58:13.238385       1 controller.go:83] console is in a managed state: syncing service
I0409 05:58:13.238949       1 controller.go:101] console is in a managed state: syncing ConsoleCliDownloads custom resources
....
I0409 05:58:13.241447       1 operator.go:177] started syncing operator "cluster" (2021-04-09 05:58:13.2414394 +0000 UTC m=+266.741210389)



From the console-operator pod log, we can see that the operator syncs every minute; the console route health check is re-run every minute in the route sync controller, which in our case logs `console is in a managed state: syncing route`.
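
For reference, a rough sketch of the shape of the health probe that such a per-minute sync re-runs; the function and endpoint path below are hypothetical, not the actual console-operator implementation.

package main

import (
	"fmt"
	"net/http"
	"time"
)

// checkRouteHealth probes the admitted route host. With a one-minute
// resync, a failure here keeps the operator Degraded only until the
// first tick after ingress is linked end-to-end.
func checkRouteHealth(host string) error {
	client := &http.Client{Timeout: 10 * time.Second}
	resp, err := client.Get(fmt.Sprintf("https://%s/health", host))
	if err != nil {
		return fmt.Errorf("failed to reach route health endpoint: %w", err)
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("route health endpoint returned %d", resp.StatusCode)
	}
	return nil
}

func main() {
	// Hypothetical host, standing in for the cluster's console route.
	if err := checkRouteHealth("console-openshift-console.apps.example.com"); err != nil {
		fmt.Println("route not healthy yet:", err)
	}
}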


Moving to VERIFIED; this was checked against 4.8.0-0.nightly-2021-04-08-124424.

Comment 5 errata-xmlrpc 2021-07-27 22:57:00 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438

