Description of problem:

The kube-apiserver operator is degraded when the cluster is installed with the openshift-ovs-multitenant network plugin, with error:

~~~
$ oc get co kube-apiserver
NAME             VERSION                              AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
kube-apiserver   4.10.0-0.nightly-2021-12-23-153012   True        False         True       4h49m   ValidatingAdmissionWebhookConfigurationDegraded: multus-validating-config.k8s.io: dial tcp 172.30.92.175:443: i/o timeout

$ oc get netnamespaces | grep kube-apiserver
openshift-kube-apiserver            0
openshift-kube-apiserver-operator   13846250

$ oc get netnamespaces | grep multus
openshift-multus   4093322

$ oc get svc -n openshift-multus
NAME                          TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)            AGE
multus-admission-controller   ClusterIP   172.30.92.175   <none>        443/TCP,8443/TCP   6h42m
network-metrics-service       ClusterIP   None            <none>        8443/TCP           6h42m
~~~

Version-Release number of selected component (if applicable):
4.10.0-0.nightly-2021-12-23-153012

How reproducible:
Always

Steps to Reproduce:
1. Set up a cluster with the openshift-ovs-multitenant network plugin (OpenShiftSDN in Multitenant mode).

Actual results:
The kube-apiserver cluster operator reports Degraded=True with ValidatingAdmissionWebhookConfigurationDegraded.

Expected results:
The kube-apiserver cluster operator is not degraded.

Additional info:
Not sure whether the ValidatingAdmissionWebhookConfigurationDegraded condition is newly added in 4.10; I did not see this error in 4.9.
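For context on the NETID values above: with the openshift-ovs-multitenant plugin, NETID 0 is the global VNID, and any project with a non-zero NETID is isolated from other projects. Below is a minimal sketch of how the isolated projects could be listed programmatically (the equivalent of filtering `oc get netnamespaces`), assuming the github.com/openshift/client-go network clientset and a kubeconfig referenced by $KUBECONFIG:

~~~
// List NetNamespaces and flag the projects that are isolated in
// ovs-multitenant mode (any non-zero NETID). Sketch only: assumes the
// github.com/openshift/client-go network clientset and a kubeconfig
// pointed to by $KUBECONFIG.
package main

import (
	"context"
	"fmt"
	"os"

	networkclient "github.com/openshift/client-go/network/clientset/versioned"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", os.Getenv("KUBECONFIG"))
	if err != nil {
		panic(err)
	}
	client, err := networkclient.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}
	netnsList, err := client.NetworkV1().NetNamespaces().List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	for _, netns := range netnsList.Items {
		// NETID 0 is the global VNID; those projects can reach (and be reached
		// by) every other project. Everything else is isolated unless it is
		// explicitly joined to another project or made global.
		if netns.NetID != 0 {
			fmt.Printf("isolated: %s (NETID %d)\n", netns.NetName, netns.NetID)
		}
	}
}
~~~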
## Reproduction steps

~~~
../openshift-install-4.10.0-0.nightly-2021-12-23-153012 create manifests --log-level=debug --dir=install-config

cat <<'EOF' > install-config/manifests/cluster-network-03-config.yml
apiVersion: operator.openshift.io/v1
kind: Network
metadata:
  name: cluster
spec:
  defaultNetwork:
    type: OpenShiftSDN
    openshiftSDNConfig:
      mode: Multitenant
EOF

../openshift-install-4.10.0-0.nightly-2021-12-23-153012 create cluster --log-level=debug --dir=install-config
~~~

## Symptoms

~~~
[akaris@linux ipi-us-west-2]$ oc get co | grep kube-apiserver
kube-apiserver   4.10.0-0.nightly-2021-12-23-153012   True   False   True   45m   ValidatingAdmissionWebhookConfigurationDegraded: multus-validating-config.k8s.io: dial tcp 172.30.14.167:443: i/o timeout
~~~

And:

~~~
oc logs -n openshift-kube-apiserver-operator kube-apiserver-operator-595c999448-75p55
(...)
I0105 18:22:21.736926 1 status_controller.go:211] clusteroperator/kube-apiserver diff {"status":{"conditions":[{"lastTransitionTime":"2022-01-05T18:07:49Z","message":"ValidatingAdmissionWebhookConfigurationDegraded: multus-validating-config.k8s.io: dial tcp 172.30.14.167:443: i/o timeout","reason":"ValidatingAdmissionWebhookConfiguration_WebhookServiceConnectionError","status":"True","type":"Degraded"},{"lastTransitionTime":"2022-01-05T18:22:21Z","message":"NodeInstallerProgressing: 3 nodes are at revision 10","reason":"AsExpected","status":"False","type":"Progressing"},{"lastTransitionTime":"2022-01-05T18:13:30Z","message":"StaticPodsAvailable: 3 nodes are active; 3 nodes are at revision 10","reason":"AsExpected","status":"True","type":"Available"},{"lastTransitionTime":"2022-01-05T18:05:31Z","message":"KubeletMinorVersionUpgradeable: Kubelet and API server minor versions are synced.","reason":"AsExpected","status":"True","type":"Upgradeable"}]}}
I0105 18:22:21.751957 1 event.go:285] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-kube-apiserver-operator", Name:"kube-apiserver-operator", UID:"73805f7a-aae4-4023-b870-370a5130fdd0", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'OperatorStatusChanged' Status for clusteroperator/kube-apiserver changed: Progressing changed from True to False ("NodeInstallerProgressing: 3 nodes are at revision 10"),Available message changed from "StaticPodsAvailable: 3 nodes are active; 1 nodes are at revision 7; 2 nodes are at revision 10" to "StaticPodsAvailable: 3 nodes are active; 3 nodes are at revision 10"
E0105 18:22:22.652930 1 base_controller.go:272] ConnectivityCheckController reconciliation failed: customresourcedefinitions.apiextensions.k8s.io "podnetworkconnectivitychecks.controlplane.operator.openshift.io" already exists
I0105 18:22:22.653325 1 event.go:285] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-kube-apiserver-operator", Name:"kube-apiserver-operator", UID:"73805f7a-aae4-4023-b870-370a5130fdd0", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Warning' reason: 'CustomResourceDefinitionCreateFailed' Failed to create CustomResourceDefinition.apiextensions.k8s.io/podnetworkconnectivitychecks.controlplane.operator.openshift.io: customresourcedefinitions.apiextensions.k8s.io "podnetworkconnectivitychecks.controlplane.operator.openshift.io" already exists
I0105 18:22:22.921761 1 request.go:665] Waited for 1.184239944s due to client-side throttling, not priority and fairness, request: GET:https://172.30.0.1:443/api/v1/namespaces/openshift-kube-apiserver/serviceaccounts/installer-sa
E0105 18:22:23.562199 1 degraded_webhook.go:128] dial tcp 172.30.43.8:443: i/o timeout
I0105 18:22:24.117180 1 request.go:665] Waited for 1.596831621s due to client-side throttling, not priority and fairness, request: GET:https://172.30.0.1:443/api/v1/namespaces/openshift-kube-apiserver/pods/revision-pruner-10-ip-10-0-207-239.us-west-2.compute.internal
I0105 18:22:25.322848 1 request.go:665] Waited for 1.001132807s due to client-side throttling, not priority and fairness, request: GET:https://172.30.0.1:443/api/v1/namespaces/openshift-kube-apiserver/configmaps/client-ca
E0105 18:22:26.564250 1 degraded_webhook.go:56] validation.machine.machine.openshift.io: dial tcp 172.30.43.8:443: i/o timeout
I0105 18:22:26.916638 1 request.go:665] Waited for 1.015735618s due to client-side throttling, not priority and fairness, request: GET:https://172.30.0.1:443/api/v1/namespaces/openshift-kube-apiserver/secrets/node-kubeconfigs
E0105 18:22:27.564702 1 degraded_webhook.go:128] dial tcp 172.30.43.8:443: i/o timeout
I0105 18:22:28.116852 1 request.go:665] Waited for 1.795964631s due to client-side throttling, not priority and fairness, request: GET:https://172.30.0.1:443/api/v1/namespaces/openshift-kube-apiserver/pods/kube-apiserver-ip-10-0-185-28.us-west-2.compute.internal
I0105 18:22:29.316760 1 request.go:665] Waited for 1.795603527s due to client-side throttling, not priority and fairness, request: GET:https://172.30.0.1:443/api/v1/namespaces/openshift-kube-apiserver/pods?labelSelector=app%3Dinstaller
E0105 18:22:29.565944 1 degraded_webhook.go:128] dial tcp 172.30.43.8:443: i/o timeout
I0105 18:22:30.517346 1 request.go:665] Waited for 1.792440164s due to client-side throttling, not priority and fairness, request: GET:https://172.30.0.1:443/api/v1/namespaces/openshift-kube-apiserver/secrets/node-kubeconfigs
I0105 18:22:31.717004 1 request.go:665] Waited for 1.596201388s due to client-side throttling, not priority and fairness, request: GET:https://172.30.0.1:443/api/v1/namespaces/openshift-kube-apiserver/pods/kube-apiserver-ip-10-0-207-239.us-west-2.compute.internal
E0105 18:22:32.566947 1 degraded_webhook.go:56] validation.machineset.machine.openshift.io: dial tcp 172.30.43.8:443: i/o timeout
I0105 18:22:32.717208 1 request.go:665] Waited for 1.196119821s due to client-side throttling, not priority and fairness, request: GET:https://172.30.0.1:443/api/v1/namespaces/openshift-kube-apiserver/pods/kube-apiserver-ip-10-0-185-28.us-west-2.compute.internal
E0105 18:22:33.572917 1 degraded_webhook.go:128] dial tcp 172.30.14.167:443: i/o timeout
E0105 18:22:35.573829 1 degraded_webhook.go:128] dial tcp 172.30.14.167:443: i/o timeout
I0105 18:22:39.316927 1 request.go:665] Waited for 1.08936891s due to client-side throttling, not priority and fairness, request: GET:https://172.30.0.1:443/api/v1/namespaces/openshift-kube-apiserver/pods/kube-apiserver-ip-10-0-185-28.us-west-2.compute.internal
E0105 18:22:39.574843 1 degraded_webhook.go:128] dial tcp 172.30.150.191:443: i/o timeout
E0105 18:22:41.576073 1 degraded_webhook.go:128] dial tcp 172.30.150.191:443: i/o timeout
E0105 18:22:44.577804 1 degraded_webhook.go:56] volumesnapshotclasses.snapshot.storage.k8s.io: dial tcp 172.30.150.191:443: i/o timeout
E0105 18:22:45.579584 1 degraded_webhook.go:128] dial tcp 172.30.95.236:443: i/o timeout
E0105 18:22:47.580945 1 degraded_webhook.go:128] dial tcp 172.30.95.236:443: i/o timeout
E0105 18:22:50.582763 1 degraded_webhook.go:56] pod-identity-webhook.amazonaws.com: dial tcp 172.30.95.236:443: i/o timeout
E0105 18:22:51.583595 1 degraded_webhook.go:128] dial tcp 172.30.43.8:443: i/o timeout
E0105 18:22:53.584325 1 degraded_webhook.go:128] dial tcp 172.30.43.8:443: i/o timeout
E0105 18:22:56.584790 1 degraded_webhook.go:56] default.machine.machine.openshift.io: dial tcp 172.30.43.8:443: i/o timeout
E0105 18:22:57.585008 1 degraded_webhook.go:128] dial tcp 172.30.43.8:443: i/o timeout
E0105 18:22:59.585998 1 degraded_webhook.go:128] dial tcp 172.30.43.8:443: i/o timeout
E0105 18:23:02.587596 1 degraded_webhook.go:56] default.machineset.machine.openshift.io: dial tcp 172.30.43.8:443: i/o timeout
E0105 18:23:03.593371 1 degraded_webhook.go:128] dial tcp 172.30.14.167:443: i/o timeout
E0105 18:23:05.594368 1 degraded_webhook.go:128] dial tcp 172.30.14.167:443: i/o timeout
E0105 18:23:09.596183 1 degraded_webhook.go:128] dial tcp 172.30.150.191:443: i/o timeout
E0105 18:23:11.597135 1 degraded_webhook.go:128] dial tcp 172.30.150.191:443: i/o timeout
E0105 18:23:14.597718 1 degraded_webhook.go:56] volumesnapshotclasses.snapshot.storage.k8s.io: dial tcp 172.30.150.191:443: i/o timeout
(...)
~~~

## Root cause

This is clearly due to denied TCP traffic from the kube-apiserver-operator pod to the webhook services in question: with the ovs-multitenant plugin, the openshift-kube-apiserver-operator project has its own non-zero NETID and is therefore isolated from the projects that host the webhook services. Here is the code that performs the checks:

https://github.com/openshift/cluster-kube-apiserver-operator/blob/c39e6d1281f46582c47c6ebe2c01229b575ed494/pkg/operator/webhooksupportabilitycontroller/degraded_webhook.go#L34

~~~
// updateWebhookConfigurationDegraded updates the condition specified after
// checking that the services associated with the specified webhooks exist
// and have at least one ready endpoint.
func (c *webhookSupportabilityController) updateWebhookConfigurationDegraded(ctx context.Context, condition operatorv1.OperatorCondition, webhookInfos []webhookInfo) v1helpers.UpdateStatusFunc {
    var serviceMsgs []string
    var tlsMsgs []string
    for _, webhook := range webhookInfos {
        if webhook.Service != nil {
            err := c.assertService(webhook.Service)
            if err != nil {
                msg := fmt.Sprintf("%s: %s", webhook.Name, err)
                if webhook.FailurePolicyIsIgnore {
                    klog.Error(msg)
                    continue
                }
                serviceMsgs = append(serviceMsgs, msg)
                continue
            }
            err = c.assertConnect(ctx, webhook.Service, webhook.CABundle)
            if err != nil {
                msg := fmt.Sprintf("%s: %s", webhook.Name, err)
                if webhook.FailurePolicyIsIgnore {
                    klog.Error(msg)
                    continue
                }
                tlsMsgs = append(tlsMsgs, msg)
                continue
            }
        }
    }
    (...)
~~~

https://github.com/openshift/cluster-kube-apiserver-operator/blob/release-4.10/pkg/operator/webhooksupportabilitycontroller/degraded_webhook.go#L96

~~~
// assertConnect performs a dns lookup of service, opens a tcp connection, and performs a tls handshake.
func (c *webhookSupportabilityController) assertConnect(ctx context.Context, reference *serviceReference, caBundle []byte) error {
    host := reference.Name + "." + reference.Namespace + ".svc"
    port := "443"
    if reference.Port != nil {
        port = fmt.Sprintf("%d", *reference.Port)
    }
    rootCAs := x509.NewCertPool()
    if len(caBundle) > 0 {
        rootCAs.AppendCertsFromPEM(caBundle)
    }
    // the last error that occurred in the loop below
    var err error
    // retry up to 3 times on error
    for i := 0; i < 3; i++ {
        select {
        case <-ctx.Done():
            return nil
        case <-time.After(time.Duration(i) * time.Second):
        }
        dialer := &tls.Dialer{
            NetDialer: &net.Dialer{Timeout: 1 * time.Second},
            Config: &tls.Config{
                ServerName: host,
                RootCAs:    rootCAs,
            },
        }
        var conn net.Conn
        conn, err = dialer.DialContext(ctx, "tcp", net.JoinHostPort(host, port))
        if err != nil {
            if i != 2 {
                // log err since only last one is reported
                runtime.HandleError(err)
            }
            continue
        }
        // error from closing connection should not affect Degraded condition
        runtime.HandleError(conn.Close())
        break
    }
    return err
}
~~~

Any connection attempt towards these endpoints indeed fails:

~~~
[akaris@linux ipi-us-west-2]$ oc rsh -n openshift-kube-apiserver-operator kube-apiserver-operator-595c999448-75p55
sh-4.4# curl -v --connect-timeout 5 https://172.30.43.8:443
* Rebuilt URL to: https://172.30.43.8:443/
*   Trying 172.30.43.8...
* TCP_NODELAY set
* Connection timed out after 5001 milliseconds
* Closing connection 0
curl: (28) Connection timed out after 5001 milliseconds
sh-4.4#
~~~

~~~
[akaris@linux ipi-us-west-2]$ oc get svc -A | grep 172.30.43.8
openshift-machine-api   machine-api-operator-webhook   ClusterIP   172.30.43.8   <none>   443/TCP   63m
~~~

## Solution

Use the procedure from https://docs.openshift.com/container-platform/4.9/networking/openshift_sdn/multitenant-isolation.html to make the openshift-kube-apiserver-operator project global:

~~~
[akaris@linux ipi-us-west-2]$ oc get netnamespaces
NAME                                NETID     EGRESS IPS
(...)
openshift-kube-apiserver            0
openshift-kube-apiserver-operator   5354513
(...)
~~~

~~~
$ oc adm pod-network make-projects-global openshift-kube-apiserver-operator
~~~

~~~
[akaris@linux ipi-us-west-2]$ oc get netnamespaces | grep kube-api
openshift-kube-apiserver            0
openshift-kube-apiserver-operator   0
~~~

~~~
sh-4.4# curl -v --connect-timeout 5 https://172.30.43.8:443 -k
* Rebuilt URL to: https://172.30.43.8:443/
*   Trying 172.30.43.8...
* TCP_NODELAY set
* Connected to 172.30.43.8 (172.30.43.8) port 443 (#0)
* ALPN, offering h2
* ALPN, offering http/1.1
* successfully set certificate verify locations:
*   CAfile: /etc/pki/tls/certs/ca-bundle.crt
    CApath: none
* TLSv1.3 (OUT), TLS handshake, Client hello (1):
* TLSv1.3 (IN), TLS handshake, Server hello (2):
* TLSv1.3 (IN), TLS handshake, [no content] (0):
* TLSv1.3 (IN), TLS handshake, Encrypted Extensions (8):
* TLSv1.3 (IN), TLS handshake, [no content] (0):
* TLSv1.3 (IN), TLS handshake, Certificate (11):
* TLSv1.3 (IN), TLS handshake, [no content] (0):
* TLSv1.3 (IN), TLS handshake, CERT verify (15):
* TLSv1.3 (IN), TLS handshake, [no content] (0):
* TLSv1.3 (IN), TLS handshake, Finished (20):
* TLSv1.3 (OUT), TLS change cipher, Change cipher spec (1):
* TLSv1.3 (OUT), TLS handshake, [no content] (0):
* TLSv1.3 (OUT), TLS handshake, Finished (20):
* SSL connection using TLSv1.3 / TLS_AES_128_GCM_SHA256
* ALPN, server accepted to use h2
* Server certificate:
*  subject: CN=machine-api-operator-webhook.openshift-machine-api.svc
*  start date: Jan 5 18:05:33 2022 GMT
*  expire date: Jan 5 18:05:34 2024 GMT
*  issuer: CN=openshift-service-serving-signer@1641405914
*  SSL certificate verify result: self signed certificate in certificate chain (19), continuing anyway.
* Using HTTP2, server supports multi-use
* Connection state changed (HTTP/2 confirmed)
* Copying HTTP/2 data in stream buffer to connection buffer after upgrade: len=0
* TLSv1.3 (OUT), TLS app data, [no content] (0):
* TLSv1.3 (OUT), TLS app data, [no content] (0):
* TLSv1.3 (OUT), TLS app data, [no content] (0):
* Using Stream ID: 1 (easy handle 0x55ba9126f740)
* TLSv1.3 (OUT), TLS app data, [no content] (0):
> GET / HTTP/2
> Host: 172.30.43.8
> User-Agent: curl/7.61.1
> Accept: */*
>
* TLSv1.3 (IN), TLS handshake, [no content] (0):
* TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):
* TLSv1.3 (IN), TLS app data, [no content] (0):
* Connection state changed (MAX_CONCURRENT_STREAMS == 250)!
* TLSv1.3 (OUT), TLS app data, [no content] (0):
* TLSv1.3 (IN), TLS app data, [no content] (0):
* TLSv1.3 (IN), TLS app data, [no content] (0):
* TLSv1.3 (IN), TLS app data, [no content] (0):
< HTTP/2 404
< content-type: text/plain; charset=utf-8
< x-content-type-options: nosniff
< content-length: 19
< date: Wed, 05 Jan 2022 19:10:20 GMT
<
* TLSv1.3 (IN), TLS app data, [no content] (0):
404 page not found
* Connection #0 to host 172.30.43.8 left intact
sh-4.4#
~~~

~~~
[akaris@linux ipi-us-west-2]$ oc get co | grep kube-apiserver
kube-apiserver   4.10.0-0.nightly-2021-12-23-153012   True   False   False   57m
~~~

## Fix

TBD, I'll work on that next.

You can tear down the reproducer, I got my own now, thank you :-)
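For reference, the check that assertConnect performs above can be reproduced with a small standalone program run from a pod in the openshift-kube-apiserver-operator netnamespace. This is a minimal sketch using only the Go standard library; the target host is the multus admission webhook service from this report, and certificate verification is skipped because only reachability matters here:

~~~
// Standalone sketch of the operator's webhook reachability probe:
// resolve the service DNS name, open a TCP connection, and perform a
// TLS handshake. Host and port are example values taken from this
// report (the multus admission webhook service).
package main

import (
	"crypto/tls"
	"fmt"
	"net"
	"time"
)

func main() {
	host := "multus-admission-controller.openshift-multus.svc"
	port := "443"

	dialer := &tls.Dialer{
		NetDialer: &net.Dialer{Timeout: 5 * time.Second},
		Config: &tls.Config{
			ServerName: host,
			// The operator verifies against the webhook's CA bundle; this
			// sketch only checks whether the TCP/TLS dial succeeds at all.
			InsecureSkipVerify: true,
		},
	}

	conn, err := dialer.Dial("tcp", net.JoinHostPort(host, port))
	if err != nil {
		// From an isolated netnamespace this is expected to be "i/o timeout".
		fmt.Printf("dial failed: %v\n", err)
		return
	}
	defer conn.Close()
	fmt.Println("TLS handshake succeeded")
}
~~~

While the operator's project is isolated this should fail with an i/o timeout, and it should succeed once the project has been made global.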
Verified this bug on 4.10.0-0.nightly-2022-01-11-065245.

~~~
$ oc get clusternetwork
NAME      CLUSTER NETWORK   SERVICE NETWORK   PLUGIN NAME
default   10.128.0.0/14     172.30.0.0/16     redhat/openshift-ovs-multitenant

$ oc get co
NAME                                       VERSION                              AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
authentication                             4.10.0-0.nightly-2022-01-11-065245   True        False         False      13h
baremetal                                  4.10.0-0.nightly-2022-01-11-065245   True        False         False      13h
cloud-controller-manager                   4.10.0-0.nightly-2022-01-11-065245   True        False         False      13h
cloud-credential                           4.10.0-0.nightly-2022-01-11-065245   True        False         False      13h
cluster-autoscaler                         4.10.0-0.nightly-2022-01-11-065245   True        False         False      13h
config-operator                            4.10.0-0.nightly-2022-01-11-065245   True        False         False      13h
console                                    4.10.0-0.nightly-2022-01-11-065245   True        False         False      13h
csi-snapshot-controller                    4.10.0-0.nightly-2022-01-11-065245   True        False         False      13h
dns                                        4.10.0-0.nightly-2022-01-11-065245   True        False         False      13h
etcd                                       4.10.0-0.nightly-2022-01-11-065245   True        False         False      13h
image-registry                             4.10.0-0.nightly-2022-01-11-065245   True        False         False      13h
ingress                                    4.10.0-0.nightly-2022-01-11-065245   True        False         False      13h
insights                                   4.10.0-0.nightly-2022-01-11-065245   True        False         False      13h
kube-apiserver                             4.10.0-0.nightly-2022-01-11-065245   True        False         False      13h
kube-controller-manager                    4.10.0-0.nightly-2022-01-11-065245   True        False         False      13h
kube-scheduler                             4.10.0-0.nightly-2022-01-11-065245   True        False         False      13h
kube-storage-version-migrator              4.10.0-0.nightly-2022-01-11-065245   True        False         False      13h
machine-api                                4.10.0-0.nightly-2022-01-11-065245   True        False         False      13h
machine-approver                           4.10.0-0.nightly-2022-01-11-065245   True        False         False      13h
machine-config                             4.10.0-0.nightly-2022-01-11-065245   True        False         False      13h
marketplace                                4.10.0-0.nightly-2022-01-11-065245   True        False         False      13h
monitoring                                 4.10.0-0.nightly-2022-01-11-065245   True        False         False      13h
network                                    4.10.0-0.nightly-2022-01-11-065245   True        False         False      13h
node-tuning                                4.10.0-0.nightly-2022-01-11-065245   True        False         False      13h
openshift-apiserver                        4.10.0-0.nightly-2022-01-11-065245   True        False         False      13h
openshift-controller-manager               4.10.0-0.nightly-2022-01-11-065245   True        False         False      13h
openshift-samples                          4.10.0-0.nightly-2022-01-11-065245   True        False         False      13h
operator-lifecycle-manager                 4.10.0-0.nightly-2022-01-11-065245   True        False         False      13h
operator-lifecycle-manager-catalog         4.10.0-0.nightly-2022-01-11-065245   True        False         False      13h
operator-lifecycle-manager-packageserver   4.10.0-0.nightly-2022-01-11-065245   True        False         False      13h
service-ca                                 4.10.0-0.nightly-2022-01-11-065245   True        False         False      13h
storage                                    4.10.0-0.nightly-2022-01-11-065245   True        False         False      13h
~~~
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0056