Bug 2036861
| Summary: | kube-apiserver is degraded while enable multitenant | ||
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | zhaozhanqi <zzhao> |
| Component: | Networking | Assignee: | Andreas Karis <akaris> |
| Networking sub component: | openshift-sdn | QA Contact: | zhaozhanqi <zzhao> |
| Status: | CLOSED ERRATA | Docs Contact: | |
| Severity: | high | ||
| Priority: | medium | CC: | akaris, yunjiang |
| Version: | 4.10 | ||
| Target Milestone: | --- | ||
| Target Release: | 4.10.0 | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Last Closed: | 2022-03-12 04:40:32 UTC | Type: | Bug |
Description
zhaozhanqi
2022-01-04 09:18:58 UTC
## Reproduction steps:
~~~
../openshift-install-4.10.0-0.nightly-2021-12-23-153012 create manifests --log-level=debug --dir=install-config
cat <<'EOF' > install-config/manifests/cluster-network-03-config.yml
apiVersion: operator.openshift.io/v1
kind: Network
metadata:
  name: cluster
spec:
  defaultNetwork:
    type: OpenShiftSDN
    openshiftSDNConfig:
      mode: Multitenant
EOF
../openshift-install-4.10.0-0.nightly-2021-12-23-153012 create cluster --log-level=debug --dir=install-config
~~~
## Symptoms
~~~
[akaris@linux ipi-us-west-2]$ oc get co | grep kube-apiserver
kube-apiserver 4.10.0-0.nightly-2021-12-23-153012 True False True 45m ValidatingAdmissionWebhookConfigurationDegraded: multus-validating-config.k8s.io: dial tcp 172.30.14.167:443: i/o timeout
~~~
And:
~~~
oc logs -n openshift-kube-apiserver-operator kube-apiserver-operator-595c999448-75p55
(...)
I0105 18:22:21.736926 1 status_controller.go:211] clusteroperator/kube-apiserver diff {"status":{"conditions":[{"lastTransitionTime":"2022-01-05T18:07:49Z","message":"ValidatingAdmissionWebhookConfigurationDegraded: multus-validating-config.k8s.io: dial tcp 172.30.14.167:443: i/o timeout","reason":"ValidatingAdmissionWebhookConfiguration_WebhookServiceConnectionError","status":"True","type":"Degraded"},{"lastTransitionTime":"2022-01-05T18:22:21Z","message":"NodeInstallerProgressing: 3 nodes are at revision 10","reason":"AsExpected","status":"False","type":"Progressing"},{"lastTransitionTime":"2022-01-05T18:13:30Z","message":"StaticPodsAvailable: 3 nodes are active; 3 nodes are at revision 10","reason":"AsExpected","status":"True","type":"Available"},{"lastTransitionTime":"2022-01-05T18:05:31Z","message":"KubeletMinorVersionUpgradeable: Kubelet and API server minor versions are synced.","reason":"AsExpected","status":"True","type":"Upgradeable"}]}}
I0105 18:22:21.751957 1 event.go:285] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-kube-apiserver-operator", Name:"kube-apiserver-operator", UID:"73805f7a-aae4-4023-b870-370a5130fdd0", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'OperatorStatusChanged' Status for clusteroperator/kube-apiserver changed: Progressing changed from True to False ("NodeInstallerProgressing: 3 nodes are at revision 10"),Available message changed from "StaticPodsAvailable: 3 nodes are active; 1 nodes are at revision 7; 2 nodes are at revision 10" to "StaticPodsAvailable: 3 nodes are active; 3 nodes are at revision 10"
E0105 18:22:22.652930 1 base_controller.go:272] ConnectivityCheckController reconciliation failed: customresourcedefinitions.apiextensions.k8s.io "podnetworkconnectivitychecks.controlplane.operator.openshift.io" already exists
I0105 18:22:22.653325 1 event.go:285] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-kube-apiserver-operator", Name:"kube-apiserver-operator", UID:"73805f7a-aae4-4023-b870-370a5130fdd0", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Warning' reason: 'CustomResourceDefinitionCreateFailed' Failed to create CustomResourceDefinition.apiextensions.k8s.io/podnetworkconnectivitychecks.controlplane.operator.openshift.io: customresourcedefinitions.apiextensions.k8s.io "podnetworkconnectivitychecks.controlplane.operator.openshift.io" already exists
I0105 18:22:22.921761 1 request.go:665] Waited for 1.184239944s due to client-side throttling, not priority and fairness, request: GET:https://172.30.0.1:443/api/v1/namespaces/openshift-kube-apiserver/serviceaccounts/installer-sa
E0105 18:22:23.562199 1 degraded_webhook.go:128] dial tcp 172.30.43.8:443: i/o timeout
I0105 18:22:24.117180 1 request.go:665] Waited for 1.596831621s due to client-side throttling, not priority and fairness, request: GET:https://172.30.0.1:443/api/v1/namespaces/openshift-kube-apiserver/pods/revision-pruner-10-ip-10-0-207-239.us-west-2.compute.internal
I0105 18:22:25.322848 1 request.go:665] Waited for 1.001132807s due to client-side throttling, not priority and fairness, request: GET:https://172.30.0.1:443/api/v1/namespaces/openshift-kube-apiserver/configmaps/client-ca
E0105 18:22:26.564250 1 degraded_webhook.go:56] validation.machine.machine.openshift.io: dial tcp 172.30.43.8:443: i/o timeout
I0105 18:22:26.916638 1 request.go:665] Waited for 1.015735618s due to client-side throttling, not priority and fairness, request: GET:https://172.30.0.1:443/api/v1/namespaces/openshift-kube-apiserver/secrets/node-kubeconfigs
E0105 18:22:27.564702 1 degraded_webhook.go:128] dial tcp 172.30.43.8:443: i/o timeout
I0105 18:22:28.116852 1 request.go:665] Waited for 1.795964631s due to client-side throttling, not priority and fairness, request: GET:https://172.30.0.1:443/api/v1/namespaces/openshift-kube-apiserver/pods/kube-apiserver-ip-10-0-185-28.us-west-2.compute.internal
I0105 18:22:29.316760 1 request.go:665] Waited for 1.795603527s due to client-side throttling, not priority and fairness, request: GET:https://172.30.0.1:443/api/v1/namespaces/openshift-kube-apiserver/pods?labelSelector=app%3Dinstaller
E0105 18:22:29.565944 1 degraded_webhook.go:128] dial tcp 172.30.43.8:443: i/o timeout
I0105 18:22:30.517346 1 request.go:665] Waited for 1.792440164s due to client-side throttling, not priority and fairness, request: GET:https://172.30.0.1:443/api/v1/namespaces/openshift-kube-apiserver/secrets/node-kubeconfigs
I0105 18:22:31.717004 1 request.go:665] Waited for 1.596201388s due to client-side throttling, not priority and fairness, request: GET:https://172.30.0.1:443/api/v1/namespaces/openshift-kube-apiserver/pods/kube-apiserver-ip-10-0-207-239.us-west-2.compute.internal
E0105 18:22:32.566947 1 degraded_webhook.go:56] validation.machineset.machine.openshift.io: dial tcp 172.30.43.8:443: i/o timeout
I0105 18:22:32.717208 1 request.go:665] Waited for 1.196119821s due to client-side throttling, not priority and fairness, request: GET:https://172.30.0.1:443/api/v1/namespaces/openshift-kube-apiserver/pods/kube-apiserver-ip-10-0-185-28.us-west-2.compute.internal
E0105 18:22:33.572917 1 degraded_webhook.go:128] dial tcp 172.30.14.167:443: i/o timeout
E0105 18:22:35.573829 1 degraded_webhook.go:128] dial tcp 172.30.14.167:443: i/o timeout
I0105 18:22:39.316927 1 request.go:665] Waited for 1.08936891s due to client-side throttling, not priority and fairness, request: GET:https://172.30.0.1:443/api/v1/namespaces/openshift-kube-apiserver/pods/kube-apiserver-ip-10-0-185-28.us-west-2.compute.internal
E0105 18:22:39.574843 1 degraded_webhook.go:128] dial tcp 172.30.150.191:443: i/o timeout
E0105 18:22:41.576073 1 degraded_webhook.go:128] dial tcp 172.30.150.191:443: i/o timeout
E0105 18:22:44.577804 1 degraded_webhook.go:56] volumesnapshotclasses.snapshot.storage.k8s.io: dial tcp 172.30.150.191:443: i/o timeout
E0105 18:22:45.579584 1 degraded_webhook.go:128] dial tcp 172.30.95.236:443: i/o timeout
E0105 18:22:47.580945 1 degraded_webhook.go:128] dial tcp 172.30.95.236:443: i/o timeout
E0105 18:22:50.582763 1 degraded_webhook.go:56] pod-identity-webhook.amazonaws.com: dial tcp 172.30.95.236:443: i/o timeout
E0105 18:22:51.583595 1 degraded_webhook.go:128] dial tcp 172.30.43.8:443: i/o timeout
E0105 18:22:53.584325 1 degraded_webhook.go:128] dial tcp 172.30.43.8:443: i/o timeout
E0105 18:22:56.584790 1 degraded_webhook.go:56] default.machine.machine.openshift.io: dial tcp 172.30.43.8:443: i/o timeout
E0105 18:22:57.585008 1 degraded_webhook.go:128] dial tcp 172.30.43.8:443: i/o timeout
E0105 18:22:59.585998 1 degraded_webhook.go:128] dial tcp 172.30.43.8:443: i/o timeout
E0105 18:23:02.587596 1 degraded_webhook.go:56] default.machineset.machine.openshift.io: dial tcp 172.30.43.8:443: i/o timeout
E0105 18:23:03.593371 1 degraded_webhook.go:128] dial tcp 172.30.14.167:443: i/o timeout
E0105 18:23:05.594368 1 degraded_webhook.go:128] dial tcp 172.30.14.167:443: i/o timeout
E0105 18:23:09.596183 1 degraded_webhook.go:128] dial tcp 172.30.150.191:443: i/o timeout
E0105 18:23:11.597135 1 degraded_webhook.go:128] dial tcp 172.30.150.191:443: i/o timeout
E0105 18:23:14.597718 1 degraded_webhook.go:56] volumesnapshotclasses.snapshot.storage.k8s.io: dial tcp 172.30.150.191:443: i/o timeout
(...)
~~~
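To see at a glance which webhooks sit behind each unreachable service IP, the named `degraded_webhook.go:56` lines can be grouped by address. The following is only a rough sketch that assumes the log format shown above; `webhooksByAddr` is a hypothetical helper and not part of the operator.

```go
package main

import (
	"bufio"
	"fmt"
	"regexp"
	"sort"
	"strings"
)

// namedTimeout matches only the lines that carry a webhook name, for example:
//   ... degraded_webhook.go:56] default.machine.machine.openshift.io: dial tcp 172.30.43.8:443: i/o timeout
// Lines without a name (degraded_webhook.go:128) do not match.
var namedTimeout = regexp.MustCompile(`degraded_webhook\.go:\d+\] (\S+): dial tcp (\S+): i/o timeout`)

// webhooksByAddr is a hypothetical helper that maps each service address
// to the distinct webhook names that timed out against it.
func webhooksByAddr(log string) map[string][]string {
	byAddr := map[string][]string{}
	seen := map[string]bool{}
	sc := bufio.NewScanner(strings.NewReader(log))
	for sc.Scan() {
		m := namedTimeout.FindStringSubmatch(sc.Text())
		if m == nil {
			continue
		}
		name, addr := m[1], m[2]
		if seen[addr+" "+name] {
			continue
		}
		seen[addr+" "+name] = true
		byAddr[addr] = append(byAddr[addr], name)
	}
	return byAddr
}

func main() {
	// Three lines copied from the operator log above.
	sample := `E0105 18:22:26.564250 1 degraded_webhook.go:56] validation.machine.machine.openshift.io: dial tcp 172.30.43.8:443: i/o timeout
E0105 18:22:27.564702 1 degraded_webhook.go:128] dial tcp 172.30.43.8:443: i/o timeout
E0105 18:22:44.577804 1 degraded_webhook.go:56] volumesnapshotclasses.snapshot.storage.k8s.io: dial tcp 172.30.150.191:443: i/o timeout`

	byAddr := webhooksByAddr(sample)
	addrs := make([]string, 0, len(byAddr))
	for addr := range byAddr {
		addrs = append(addrs, addr)
	}
	sort.Strings(addrs) // map iteration order is random, so sort for stable output
	for _, addr := range addrs {
		fmt.Println(addr, byAddr[addr])
	}
}
```

The addresses can then be matched against `oc get svc -A` to find the namespaces involved.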
## Root cause
This is clearly due to denied TCP traffic from the kube-apiserver-operator pod to the webhook services in question. Here is the code that performs the checks:
https://github.com/openshift/cluster-kube-apiserver-operator/blob/c39e6d1281f46582c47c6ebe2c01229b575ed494/pkg/operator/webhooksupportabilitycontroller/degraded_webhook.go#L34
~~~
// updateWebhookConfigurationDegraded updates the condition specified after
// checking that the services associated with the specified webhooks exist
// and have at least one ready endpoint.
func (c *webhookSupportabilityController) updateWebhookConfigurationDegraded(ctx context.Context, condition operatorv1.OperatorCondition, webhookInfos []webhookInfo) v1helpers.UpdateStatusFunc {
	var serviceMsgs []string
	var tlsMsgs []string
	for _, webhook := range webhookInfos {
		if webhook.Service != nil {
			err := c.assertService(webhook.Service)
			if err != nil {
				msg := fmt.Sprintf("%s: %s", webhook.Name, err)
				if webhook.FailurePolicyIsIgnore {
					klog.Error(msg)
					continue
				}
				serviceMsgs = append(serviceMsgs, msg)
				continue
			}
			err = c.assertConnect(ctx, webhook.Service, webhook.CABundle)
			if err != nil {
				msg := fmt.Sprintf("%s: %s", webhook.Name, err)
				if webhook.FailurePolicyIsIgnore {
					klog.Error(msg)
					continue
				}
				tlsMsgs = append(tlsMsgs, msg)
				continue
			}
		}
	}
	(...)
~~~
https://github.com/openshift/cluster-kube-apiserver-operator/blob/release-4.10/pkg/operator/webhooksupportabilitycontroller/degraded_webhook.go#L96
~~~
// assertConnect performs a dns lookup of service, opens a tcp connection, and performs a tls handshake.
func (c *webhookSupportabilityController) assertConnect(ctx context.Context, reference *serviceReference, caBundle []byte) error {
	host := reference.Name + "." + reference.Namespace + ".svc"
	port := "443"
	if reference.Port != nil {
		port = fmt.Sprintf("%d", *reference.Port)
	}
	rootCAs := x509.NewCertPool()
	if len(caBundle) > 0 {
		rootCAs.AppendCertsFromPEM(caBundle)
	}
	// the last error that occurred in the loop below
	var err error
	// retry up to 3 times on error
	for i := 0; i < 3; i++ {
		select {
		case <-ctx.Done():
			return nil
		case <-time.After(time.Duration(i) * time.Second):
		}
		dialer := &tls.Dialer{
			NetDialer: &net.Dialer{Timeout: 1 * time.Second},
			Config: &tls.Config{
				ServerName: host,
				RootCAs:    rootCAs,
			},
		}
		var conn net.Conn
		conn, err = dialer.DialContext(ctx, "tcp", net.JoinHostPort(host, port))
		if err != nil {
			if i != 2 {
				// log err since only last one is reported
				runtime.HandleError(err)
			}
			continue
		}
		// error from closing connection should not affect Degraded condition
		runtime.HandleError(conn.Close())
		break
	}
	return err
}
~~~
Any connection attempt from the operator pod towards the webhook service endpoints indeed fails:
~~~
[akaris@linux ipi-us-west-2]$ oc rsh -n openshift-kube-apiserver-operator kube-apiserver-operator-595c999448-75p55
sh-4.4# curl -v --connect-timeout 5 https://172.30.43.8:443
* Rebuilt URL to: https://172.30.43.8:443/
* Trying 172.30.43.8...
* TCP_NODELAY set
* Connection timed out after 5001 milliseconds
* Closing connection 0
curl: (28) Connection timed out after 5001 milliseconds
sh-4.4#
~~~
~~~
[akaris@linux ipi-us-west-2]$ oc get svc -A | grep 172.30.43.8
openshift-machine-api machine-api-operator-webhook ClusterIP 172.30.43.8 <none> 443/TCP 63m
~~~
## Solution
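Make the openshift-kube-apiserver-operator project global, as described in https://docs.openshift.com/container-platform/4.9/networking/openshift_sdn/multitenant-isolation.html. The project has a non-zero NETID, so multitenant isolation blocks its traffic to the webhook services in other projects: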
~~~
[akaris@linux ipi-us-west-2]$ oc get netnamespaces
NAME NETID EGRESS IPS
(...)
openshift-kube-apiserver 0
openshift-kube-apiserver-operator 5354513
(...)
~~~
~~~
$ oc adm pod-network make-projects-global openshift-kube-apiserver-operator
~~~
~~~
[akaris@linux ipi-us-west-2]$ oc get netnamespaces | grep kube-api
openshift-kube-apiserver 0
openshift-kube-apiserver-operator 0
~~~
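The NETID change from 5354513 to 0 matters because of how multitenant mode decides whether two pods may talk. As a rough model (an assumption based on the documented behaviour of global projects, not the SDN's actual flow rules): traffic is allowed when either side has NETID 0 or both sides share the same NETID. `canReach` and the webhook namespace's NETID below are hypothetical; the real value is elided in the listing.

```go
package main

import "fmt"

// globalNetID is the NETID of a global project, as shown for
// openshift-kube-apiserver in the listing.
const globalNetID uint32 = 0

// canReach is a simplified, hypothetical model of multitenant isolation:
// global projects can talk to everyone, other projects only to themselves.
func canReach(src, dst uint32) bool {
	return src == globalNetID || dst == globalNetID || src == dst
}

func main() {
	const operatorBefore uint32 = 5354513 // NETID of the operator project before the workaround
	const webhookNS uint32 = 1234567      // hypothetical NETID for a webhook project

	fmt.Println(canReach(operatorBefore, webhookNS)) // false: isolated from each other
	fmt.Println(canReach(globalNetID, webhookNS))    // true: operator project is now global
}
```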
~~~
sh-4.4# curl -v --connect-timeout 5 https://172.30.43.8:443 -k
* Rebuilt URL to: https://172.30.43.8:443/
* Trying 172.30.43.8...
* TCP_NODELAY set
* Connected to 172.30.43.8 (172.30.43.8) port 443 (#0)
* ALPN, offering h2
* ALPN, offering http/1.1
* successfully set certificate verify locations:
* CAfile: /etc/pki/tls/certs/ca-bundle.crt
CApath: none
* TLSv1.3 (OUT), TLS handshake, Client hello (1):
* TLSv1.3 (IN), TLS handshake, Server hello (2):
* TLSv1.3 (IN), TLS handshake, [no content] (0):
* TLSv1.3 (IN), TLS handshake, Encrypted Extensions (8):
* TLSv1.3 (IN), TLS handshake, [no content] (0):
* TLSv1.3 (IN), TLS handshake, Certificate (11):
* TLSv1.3 (IN), TLS handshake, [no content] (0):
* TLSv1.3 (IN), TLS handshake, CERT verify (15):
* TLSv1.3 (IN), TLS handshake, [no content] (0):
* TLSv1.3 (IN), TLS handshake, Finished (20):
* TLSv1.3 (OUT), TLS change cipher, Change cipher spec (1):
* TLSv1.3 (OUT), TLS handshake, [no content] (0):
* TLSv1.3 (OUT), TLS handshake, Finished (20):
* SSL connection using TLSv1.3 / TLS_AES_128_GCM_SHA256
* ALPN, server accepted to use h2
* Server certificate:
* subject: CN=machine-api-operator-webhook.openshift-machine-api.svc
* start date: Jan 5 18:05:33 2022 GMT
* expire date: Jan 5 18:05:34 2024 GMT
* issuer: CN=openshift-service-serving-signer@1641405914
* SSL certificate verify result: self signed certificate in certificate chain (19), continuing anyway.
* Using HTTP2, server supports multi-use
* Connection state changed (HTTP/2 confirmed)
* Copying HTTP/2 data in stream buffer to connection buffer after upgrade: len=0
* TLSv1.3 (OUT), TLS app data, [no content] (0):
* TLSv1.3 (OUT), TLS app data, [no content] (0):
* TLSv1.3 (OUT), TLS app data, [no content] (0):
* Using Stream ID: 1 (easy handle 0x55ba9126f740)
* TLSv1.3 (OUT), TLS app data, [no content] (0):
> GET / HTTP/2
> Host: 172.30.43.8
> User-Agent: curl/7.61.1
> Accept: */*
>
* TLSv1.3 (IN), TLS handshake, [no content] (0):
* TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):
* TLSv1.3 (IN), TLS app data, [no content] (0):
* Connection state changed (MAX_CONCURRENT_STREAMS == 250)!
* TLSv1.3 (OUT), TLS app data, [no content] (0):
* TLSv1.3 (IN), TLS app data, [no content] (0):
* TLSv1.3 (IN), TLS app data, [no content] (0):
* TLSv1.3 (IN), TLS app data, [no content] (0):
< HTTP/2 404
< content-type: text/plain; charset=utf-8
< x-content-type-options: nosniff
< content-length: 19
< date: Wed, 05 Jan 2022 19:10:20 GMT
<
* TLSv1.3 (IN), TLS app data, [no content] (0):
404 page not found
* Connection #0 to host 172.30.43.8 left intact
sh-4.4#
~~~
~~~
[akaris@linux ipi-us-west-2]$ oc get co | grep kube-apiserver
kube-apiserver 4.10.0-0.nightly-2021-12-23-153012 True False False 57m
~~~
## Fix
TBD, I'll work on that next. You can tear down the reproducer, I got my own now, thank you :-)
Verified this bug on 4.10.0-0.nightly-2022-01-11-065245

$ oc get clusternetwork
NAME CLUSTER NETWORK SERVICE NETWORK PLUGIN NAME
default 10.128.0.0/14 172.30.0.0/16 redhat/openshift-ovs-multitenant

$ oc get co
NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE MESSAGE
authentication 4.10.0-0.nightly-2022-01-11-065245 True False False 13h
baremetal 4.10.0-0.nightly-2022-01-11-065245 True False False 13h
cloud-controller-manager 4.10.0-0.nightly-2022-01-11-065245 True False False 13h
cloud-credential 4.10.0-0.nightly-2022-01-11-065245 True False False 13h
cluster-autoscaler 4.10.0-0.nightly-2022-01-11-065245 True False False 13h
config-operator 4.10.0-0.nightly-2022-01-11-065245 True False False 13h
console 4.10.0-0.nightly-2022-01-11-065245 True False False 13h
csi-snapshot-controller 4.10.0-0.nightly-2022-01-11-065245 True False False 13h
dns 4.10.0-0.nightly-2022-01-11-065245 True False False 13h
etcd 4.10.0-0.nightly-2022-01-11-065245 True False False 13h
image-registry 4.10.0-0.nightly-2022-01-11-065245 True False False 13h
ingress 4.10.0-0.nightly-2022-01-11-065245 True False False 13h
insights 4.10.0-0.nightly-2022-01-11-065245 True False False 13h
kube-apiserver 4.10.0-0.nightly-2022-01-11-065245 True False False 13h
kube-controller-manager 4.10.0-0.nightly-2022-01-11-065245 True False False 13h
kube-scheduler 4.10.0-0.nightly-2022-01-11-065245 True False False 13h
kube-storage-version-migrator 4.10.0-0.nightly-2022-01-11-065245 True False False 13h
machine-api 4.10.0-0.nightly-2022-01-11-065245 True False False 13h
machine-approver 4.10.0-0.nightly-2022-01-11-065245 True False False 13h
machine-config 4.10.0-0.nightly-2022-01-11-065245 True False False 13h
marketplace 4.10.0-0.nightly-2022-01-11-065245 True False False 13h
monitoring 4.10.0-0.nightly-2022-01-11-065245 True False False 13h
network 4.10.0-0.nightly-2022-01-11-065245 True False False 13h
node-tuning 4.10.0-0.nightly-2022-01-11-065245 True False False 13h
openshift-apiserver 4.10.0-0.nightly-2022-01-11-065245 True False False 13h
openshift-controller-manager 4.10.0-0.nightly-2022-01-11-065245 True False False 13h
openshift-samples 4.10.0-0.nightly-2022-01-11-065245 True False False 13h
operator-lifecycle-manager 4.10.0-0.nightly-2022-01-11-065245 True False False 13h
operator-lifecycle-manager-catalog 4.10.0-0.nightly-2022-01-11-065245 True False False 13h
operator-lifecycle-manager-packageserver 4.10.0-0.nightly-2022-01-11-065245 True False False 13h
service-ca 4.10.0-0.nightly-2022-01-11-065245 True False False 13h
storage 4.10.0-0.nightly-2022-01-11-065245 True False False 13h

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0056