Bug 2001825 - [SNO]ingress/authentication clusteroperator degraded when enable ccm from start
Summary: [SNO]ingress/authentication clusteroperator degraded when enable ccm from start
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cloud Compute
Version: 4.9
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: 4.10.0
Assignee: Joel Speed
QA Contact: sunzhaohua
URL:
Whiteboard:
Depends On:
Blocks: 2004924
Reported: 2021-09-07 09:34 UTC by sunzhaohua
Modified: 2022-04-11 08:33 UTC (History)

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Clones: 2004924
Environment:
Last Closed: 2022-03-10 16:07:48 UTC
Target Upstream Version:
Embargoed:


Links
System ID Private Priority Status Summary Last Updated
Github openshift cloud-provider-aws pull 9 0 None open Bug 2001825: Merge https://github.com/kubernetes/cloud-provider-aws:master into master 2021-09-15 09:56:14 UTC
Github openshift cluster-cloud-controller-manager-operator pull 120 0 None open Bug 2001825: Enforce the cloud-route controller disabled across platforms 2021-09-13 12:21:47 UTC
Red Hat Product Errata RHSA-2022:0056 0 None None None 2022-03-10 16:08:19 UTC

Description sunzhaohua 2021-09-07 09:34:18 UTC
Description of problem:
Installing an SNO (single-node OpenShift) cluster with template ipi-on-aws/versioned-installer-customer_vpc-disconnected-sno-ci and CCM (cloud controller manager) enabled from the start fails: the installation does not complete, and the ingress and authentication cluster operators are degraded.

Version-Release number of selected component (if applicable):
$ oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version             False       True          3h3m    Unable to apply 4.9.0-0.nightly-2021-09-06-055314: some cluster operators have not yet rolled out

How reproducible:
Always

Steps to Reproduce:
1. Set up an SNO cluster with template ipi-on-aws/versioned-installer-customer_vpc-disconnected-sno-ci and enable CCM from the start by adding this FeatureGate manifest before installation:
cat <<EOF > manifests/manifest_feature_gate.yaml
---
apiVersion: config.openshift.io/v1
kind: FeatureGate
metadata:
  annotations:
    include.release.openshift.io/self-managed-high-availability: "true"
    include.release.openshift.io/single-node-developer: "true"
    release.openshift.io/create-only: "true"
  name: cluster
spec:
  featureSet: TechPreviewNoUpgrade
EOF

Actual results:
Cluster installation failed.

09-07 14:25:11.350  level=debug msg=Still waiting for the cluster to initialize: Working towards 4.9.0-0.nightly-2021-09-06-055314: 713 of 734 done (97% complete)
09-07 14:26:48.048  level=debug msg=Still waiting for the cluster to initialize: Some cluster operators are still updating: authentication, console
09-07 14:58:19.736  level=error msg=Cluster operator authentication Degraded is True with OAuthServerRouteEndpointAccessibleController_SyncError::ProxyConfigController_SyncError: OAuthServerRouteEndpointAccessibleControllerDegraded: Get "https://oauth-openshift.apps.zhsun97s.qe.devcluster.openshift.com/healthz": EOF
09-07 14:58:19.736  level=error msg=ProxyConfigControllerDegraded: endpoint("https://oauth-openshift.apps.zhsun97s.qe.devcluster.openshift.com/healthz") is unreachable with proxy(Get "https://oauth-openshift.apps.zhsun97s.qe.devcluster.openshift.com/healthz": EOF) and without proxy(Get "https://oauth-openshift.apps.zhsun97s.qe.devcluster.openshift.com/healthz": context deadline exceeded)
09-07 14:58:19.736  level=info msg=Cluster operator authentication Available is False with OAuthServerRouteEndpointAccessibleController_EndpointUnavailable: OAuthServerRouteEndpointAccessibleControllerAvailable: Get "https://oauth-openshift.apps.zhsun97s.qe.devcluster.openshift.com/healthz": EOF
09-07 14:58:19.736  level=info msg=Cluster operator baremetal Disabled is True with UnsupportedPlatform: Nothing to do on this Platform
09-07 14:58:19.736  level=info msg=Cluster operator console Progressing is True with SyncLoopRefresh_InProgress: SyncLoopRefreshProgressing: Working toward version 4.9.0-0.nightly-2021-09-06-055314, 0 replicas available
09-07 14:58:19.737  level=info msg=Cluster operator console Available is False with Deployment_InsufficientReplicas::RouteHealth_FailedGet: DeploymentAvailable: 0 replicas available for console deployment
09-07 14:58:19.737  level=info msg=RouteHealthAvailable: failed to GET route (https://console-openshift-console.apps.zhsun97s.qe.devcluster.openshift.com): Get "https://console-openshift-console.apps.zhsun97s.qe.devcluster.openshift.com": EOF
09-07 14:58:19.737  level=info msg=Cluster operator etcd RecentBackup is Unknown with ControllerStarted: 
09-07 14:58:19.737  level=error msg=Cluster operator ingress Degraded is True with IngressDegraded: The "default" ingress controller reports Degraded=True: DegradedConditions: One or more other status conditions indicate a degraded state: CanaryChecksSucceeding=False (CanaryChecksRepetitiveFailures: Canary route checks for the default ingress controller are failing)
09-07 14:58:19.737  level=info msg=Cluster operator insights Disabled is True with Disabled: Health reporting is disabled
09-07 14:58:19.737  level=info msg=Cluster operator network ManagementStateDegraded is False with : 
09-07 14:58:19.737  level=error msg=Cluster initialization failed because one or more operators are not functioning properly.
09-07 14:58:19.737  level=error msg=The cluster should be accessible for troubleshooting as detailed in the documentation linked below,
09-07 14:58:19.737  level=error msg=https://docs.openshift.com/container-platform/latest/support/troubleshooting/troubleshooting-installations.html
09-07 14:58:19.737  level=error msg=The 'wait-for install-complete' subcommand can then be used to continue the installation
09-07 14:58:19.738  level=fatal msg=failed to initialize the cluster: Some cluster operators are still updating: authentication, console
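
The installer log above can be summarized per operator. The following is a minimal Python sketch, not part of the original report: it assumes the log lines keep the `level=... msg=Cluster operator <name> <condition> is <status> with <reason>: ...` shape seen above, and the function name `failing_operators` is hypothetical.

```python
import re

# Matches installer lines such as:
#   level=error msg=Cluster operator ingress Degraded is True with IngressDegraded: ...
# The reason may itself contain "::" (several reasons joined), so it is
# matched as a whitespace-free token that ends at a colon followed by a space.
CONDITION_RE = re.compile(
    r"level=(?P<level>\w+) msg=Cluster operator (?P<name>\S+) "
    r"(?P<condition>\w+) is (?P<status>\w+) with (?P<reason>\S*):(?:\s|$)"
)


def failing_operators(log_text):
    """Return {operator: [(condition, status, reason), ...]} for error-level lines."""
    result = {}
    for line in log_text.splitlines():
        match = CONDITION_RE.search(line)
        if match and match.group("level") == "error":
            result.setdefault(match.group("name"), []).append(
                (
                    match.group("condition"),
                    match.group("status"),
                    match.group("reason"),
                )
            )
    return result
```

Fed the log excerpt above, this would report only the authentication and ingress operators, since the console lines are logged at info level.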


$ oc get node
NAME                                        STATUS   ROLES           AGE    VERSION
ip-10-0-57-132.us-east-2.compute.internal   Ready    master,worker   179m   v1.22.0-rc.0+75ee307

$ oc get co
NAME                       VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
authentication             4.9.0-0.nightly-2021-09-06-055314   False       False         True       179m    OAuthServerRouteEndpointAccessibleControllerAvailable: Get "https://oauth-openshift.apps.zhsun97s.qe.devcluster.openshift.com/healthz": EOF
baremetal                  4.9.0-0.nightly-2021-09-06-055314   True        False         False      176m
cloud-controller-manager   4.9.0-0.nightly-2021-09-06-055314   True        False         False      3h
cloud-credential           4.9.0-0.nightly-2021-09-06-055314   True        False         False      3h2m
cluster-autoscaler         4.9.0-0.nightly-2021-09-06-055314   True        False         False      176m
config-operator            4.9.0-0.nightly-2021-09-06-055314   True        False         False      178m
console                    4.9.0-0.nightly-2021-09-06-055314   False       True          False      170m    DeploymentAvailable: 0 replicas available for console deployment
RouteHealthAvailable: failed to GET route (https://console-openshift-console.apps.zhsun97s.qe.devcluster.openshift.com): Get "https://console-openshift-console.apps.zhsun97s.qe.devcluster.openshift.com": EOF
csi-snapshot-controller    4.9.0-0.nightly-2021-09-06-055314   True        False         False      178m
dns                        4.9.0-0.nightly-2021-09-06-055314   True        False         False      176m
etcd                       4.9.0-0.nightly-2021-09-06-055314   True        False         False      177m
image-registry             4.9.0-0.nightly-2021-09-06-055314   True        False         False      174m
ingress                    4.9.0-0.nightly-2021-09-06-055314   True        False         True       170m    The "default" ingress controller reports Degraded=True: DegradedConditions: One or more other status conditions indicate a degraded state: CanaryChecksSucceeding=False (CanaryChecksRepetitiveFailures: Canary route checks for the default ingress controller are failing)
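
The same check can be done without reading the table by eye. This is a hedged Python sketch, not something from the original report: it assumes the JSON comes from `oc get clusteroperators -o json` and follows the ClusterOperator layout (`items[].metadata.name`, `items[].status.conditions[]` with `type` and `status`). The function name `unhealthy_operators` is hypothetical.

```python
import json


def unhealthy_operators(clusteroperators_json):
    """Return {name: ["Available=False", "Degraded=True", ...]} for problem operators.

    An operator is reported when its Available condition is False or its
    Degraded condition is True; healthy operators are omitted.
    """
    problems = {}
    for item in json.loads(clusteroperators_json).get("items", []):
        name = item["metadata"]["name"]
        for cond in item.get("status", {}).get("conditions", []):
            unavailable = cond["type"] == "Available" and cond["status"] == "False"
            degraded = cond["type"] == "Degraded" and cond["status"] == "True"
            if unavailable or degraded:
                problems.setdefault(name, []).append(
                    f'{cond["type"]}={cond["status"]}'
                )
    return problems
```

For the cluster state listed above, this would be expected to flag authentication (unavailable and degraded), console (unavailable), and ingress (degraded).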

Expected results:
Cluster installation succeeds.

Additional info: 
must-gather: http://file.rdu.redhat.com/~zhsun/must-gather.local.8824229733135208710.tar.gz
kubeconfig: https://mastern-jenkins-csb-openshift-qe.apps.ocp4.prod.psi.redhat.com/job/ocp-common/job/Flexy-install/40558/artifact/workdir/install-dir/auth/kubeconfig/*view*/

Comment 9 sunzhaohua 2021-09-16 05:37:36 UTC
Verified on clusterversion 4.10.0-0.nightly-2021-09-15-220746.

$ oc get node
NAME                                        STATUS   ROLES           AGE    VERSION
ip-10-0-65-233.us-east-2.compute.internal   Ready    master,worker   173m   v1.22.0-rc.0+75ee307

$ oc get featuregate cluster -o yaml
apiVersion: config.openshift.io/v1
kind: FeatureGate
metadata:
  annotations:
    include.release.openshift.io/self-managed-high-availability: "true"
    include.release.openshift.io/single-node-developer: "true"
    release.openshift.io/create-only: "true"
  creationTimestamp: "2021-09-16T02:41:16Z"
  generation: 1
  name: cluster
  resourceVersion: "993"
  uid: 80c405d9-e8be-4193-a2a9-2b1b27be6264
spec:
  featureSet: TechPreviewNoUpgrade

sh-4.4# cat /etc/systemd/system/kubelet.service
      --cloud-provider=external \

$ oc describe po kube-controller-manager-ip-10-0-65-233.us-east-2.compute.internal -n openshift-kube-controller-manager | grep cloud-provider -C 20
--cloud-provider=external
        --requestheader-client-ca-file=/etc/kubernetes/static-pod-certs/configmaps/aggregator-client-ca/ca-bundle.crt -v=2 --tls-cert-file=/etc/kubernetes/static-pod-resources/secrets/serving-cert/tls.crt --tls-private-key-file=/etc/kubernetes/static-pod-resources/secrets/serving-cert/tls.key --allocate-node-cidrs=false --cert-dir=/var/run/kubernetes --cloud-provider=external
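
The verification above comes down to confirming that both kubelet and kube-controller-manager run with `--cloud-provider=external`. A small Python sketch of that check follows. It is an illustration only: the helper name `cloud_provider` is hypothetical, and it assumes the flags are whitespace-separated as in the output shown here.

```python
def cloud_provider(command_line):
    """Return the value of the last --cloud-provider flag, or None if absent.

    Accepts both "--cloud-provider=value" and "--cloud-provider value".
    Tokens are split on whitespace, so a trailing line-continuation
    backslash (as in the kubelet unit file) is simply ignored.
    """
    value = None
    tokens = command_line.split()
    for i, token in enumerate(tokens):
        if token.startswith("--cloud-provider="):
            value = token.split("=", 1)[1]
        elif token == "--cloud-provider" and i + 1 < len(tokens):
            value = tokens[i + 1]
    return value
```

Applied to the kubelet unit line and to the kube-controller-manager arguments quoted above, the helper should return `external` in both cases.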

Comment 12 errata-xmlrpc 2022-03-10 16:07:48 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0056

