Description of problem:
- ovnkube-master pods keep restarting after upgrading from 4.7.3 to 4.7.6.
- The cluster started as an IPI installation on VMware with the network customization required for Windows container support. The first release was 4.7.0, with three upgrades before this one (4.7.1/4.7.2/4.7.3).

$ omg get all -n openshift-ovn-kubernetes
NAME                       READY   STATUS    RESTARTS   AGE
pod/ovnkube-master-d22xq   6/6     Running   225        2d
pod/ovnkube-master-njwg7   6/6     Running   88         3d
pod/ovnkube-master-xdspg   6/6     Running   132        2d

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1.
2.
3.

Actual results:
The previous upgrades 4.7.0 --> 4.7.1 --> 4.7.2 --> 4.7.3 were smooth and without issues like this; with this release something appears to be wrong with the ovn-kubernetes CNI provider.

Expected results:

Additional info:
- There is no external load balancer; it is an IPI installation, as required by Hybrid Networking / Windows Container Support, so the cluster uses the keepalived static pod configuration on OCP.
- Latest and earlier must-gather reports are available under /cases/02920882
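For anyone triaging this on a live cluster rather than from the must-gather, a minimal sketch of the equivalent checks with oc; the pod name is the one from this report and will differ elsewhere:

```
# Restart counts and node placement of the OVN control-plane pods
oc -n openshift-ovn-kubernetes get pods -o wide

# Last termination reason and the log of the previous (crashed) container instance
oc -n openshift-ovn-kubernetes describe pod ovnkube-master-d22xq | grep -A5 "Last State"
oc -n openshift-ovn-kubernetes logs ovnkube-master-d22xq -c ovnkube-master --previous | tail -50
```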
Latest Must gather & sos report from Master nodes on support-shell - containerID: cri-o://bc9b104a876b2f4acabccbf451da48667417fe03a2ce8b7a08bcb955efaa65e3 image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:282ac9111e6fb5cc4ffe8fb17ab7809a154166a2dc5b7b0bf81c0bd3b9b62ada imageID: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:282ac9111e6fb5cc4ffe8fb17ab7809a154166a2dc5b7b0bf81c0bd3b9b62ada lastState: terminated: containerID: cri-o://e31ea45ef72191b999e412adffefbad19b13a35167db7394b901b89e276248e3 exitCode: 1 finishedAt: '2021-04-23T10:13:46Z' message: 's ssl:172.21.8.40:9641,ssl:172.21.8.41:9641,ssl:172.21.8.42:9641 --nb-client-privkey /ovn-cert/tls.key --nb-client-cert /ovn-cert/tls.crt --nb-client-cacert /ovn-ca/ca-bundle.crt --nb-cert-common-name ovn I0423 10:13:46.476810 1 config.go:1306] Parsed config file /run/ovnkube-config/ovnkube.conf I0423 10:13:46.476938 1 config.go:1307] Parsed config: {Default:{MTU:1400 ConntrackZone:64000 EncapType:geneve EncapIP: EncapPort:6081 InactivityProbe:100000 OpenFlowProbe:180 RawClusterSubnets:10.128.0.0/14/23 ClusterSubnets:[]} Logging:{File: CNIFile: Level:4 LogFileMaxSize:100 LogFileMaxBackups:5 LogFileMaxAge:5} CNI:{ConfDir:/etc/cni/net.d Plugin:ovn-k8s-cni-overlay} OVNKubernetesFeature:{EnableEgressIP:true} Kubernetes:{Kubeconfig: CACert: APIServer:https://api-int.openshift02.swqa.tst:6443 Token: CompatServiceCIDR: RawServiceCIDRs:172.30.0.0/16 ServiceCIDRs:[] OVNConfigNamespace:openshift-ovn-kubernetes MetricsBindAddress: OVNMetricsBindAddress: MetricsEnablePprof:false OVNEmptyLbEvents:false PodIP: RawNoHostSubnetNodes:kubernetes.io/os=windows NoHostSubnetNodes:nil} OvnNorth:{Address: PrivKey: Cert: CACert: CertCommonName: Scheme: northbound:false externalID: exec:<nil>} OvnSouth:{Address: PrivKey: Cert: CACert: CertCommonName: Scheme: northbound:false externalID: exec:<nil>} Gateway:{Mode:local Interface: NextHop: VLANID:0 NodeportEnable:true DisableSNATMultipleGWs:false V4JoinSubnet:100.64.0.0/16 V6JoinSubnet:fd98::/64} MasterHA:{ElectionLeaseDuration:60 ElectionRenewDeadline:30 ElectionRetryPeriod:20} HybridOverlay:{Enabled:true RawClusterSubnets:10.132.0.0/14 ClusterSubnets:[] VXLANPort:9898}} F0423 10:13:46.495274 1 ovndbchecker.go:118] unable to turn on memory trimming for SB DB, stderr: 2021-04-23T10:13:46Z|00001|unixctl|WARN|failed to connect to /var/run/ovn/ovnsb_db.ctl ovn-appctl: cannot connect to "/var/run/ovn/ovnsb_db.ctl" (No such file or directory) , error: OVN command ''/usr/bin/ovn-appctl -t /var/run/ovn/ovnsb_db.ctl ovsdb-server/memory-trim-on-compaction on'' failed: exit status 1 ' reason: Error startedAt: '2021-04-23T10:13:46Z' name: ovn-dbchecker ready: true restartCount: 2 started: true state: running: startedAt: '2021-04-23T10:15:22Z' - containerID: cri-o://8d095a08e051c276111b805f783a8078335bdc6c585221fd7c28473c22a1c06c image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:282ac9111e6fb5cc4ffe8fb17ab7809a154166a2dc5b7b0bf81c0bd3b9b62ada imageID: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:282ac9111e6fb5cc4ffe8fb17ab7809a154166a2dc5b7b0bf81c0bd3b9b62ada lastState: terminated: containerID: cri-o://c2275456fde334e67a13719723487e2d3189054d55e0dbde3e8735398313a6cc exitCode: 1 finishedAt: '2021-04-26T08:19:01Z' message: 'bea036d-75b0-43cf-96b7-8e7f1b52354b-172.30.226.246\:1936 2021-04-26T08:19:01.767Z|01981|nbctl|INFO|Running command run --if-exists -- remove load_balancer 7bea036d-75b0-43cf-96b7-8e7f1b52354b vips "\"172.30.244.244:80\"" I0426 08:19:01.827455 1 loadbalancer.go:337] Reject ACL 
not found to remove for name: 7bea036d-75b0-43cf-96b7-8e7f1b52354b-172.30.244.244\:80 I0426 08:19:01.827535 1 loadbalancer.go:305] Unable to delete Reject ACL for load-balancer: 7bea036d-75b0-43cf-96b7-8e7f1b52354b, vip: 172.30.244.244:80. No entry in cache and error occurred while trying to find the ACL by name in OVN, error: reject ACL not found to remove for name: 7bea036d-75b0-43cf-96b7-8e7f1b52354b-172.30.244.244\:80 2021-04-26T08:19:01.830Z|01982|nbctl|INFO|Running command run --if-exists -- remove load_balancer 7bea036d-75b0-43cf-96b7-8e7f1b52354b vips "\"172.30.140.22:443\"" I0426 08:19:01.893546 1 loadbalancer.go:337] Reject ACL not found to remove for name: 7bea036d-75b0-43cf-96b7-8e7f1b52354b-172.30.140.22\:443 I0426 08:19:01.893620 1 loadbalancer.go:305] Unable to delete Reject ACL for load-balancer: 7bea036d-75b0-43cf-96b7-8e7f1b52354b, vip: 172.30.140.22:443. No entry in cache and error occurred while trying to find the ACL by name in OVN, error: reject ACL not found to remove for name: 7bea036d-75b0-43cf-96b7-8e7f1b52354b-172.30.140.22\:443 2021-04-26T08:19:01.896Z|01983|nbctl|INFO|Running command run --if-exists -- remove load_balancer 7bea036d-75b0-43cf-96b7-8e7f1b52354b vips "\"172.30.2.14:443\"" E0426 08:19:01.950738 1 leaderelection.go:325] error retrieving resource lock openshift-ovn-kubernetes/ovn-kubernetes-master: Get "https://api-int.openshift02.swqa.tst:6443/api/v1/namespaces/openshift-ovn-kubernetes/configmaps/ovn-kubernetes-master": context deadline exceeded I0426 08:19:01.950775 1 leaderelection.go:278] failed to renew lease openshift-ovn-kubernetes/ovn-kubernetes-master: timed out waiting for the condition I0426 08:19:01.950789 1 master.go:105] No longer leader; exiting ' reason: Error startedAt: '2021-04-26T08:17:09Z' name: ovnkube-master ready: true restartCount: 225 started: true state: running: startedAt: '2021-04-26T08:19:28Z' - containerID: cri-o://d7ecab81e9bd79255aa9a0dfb3d9688039323796419b6e2f39a51be115b62ec1 image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:282ac9111e6fb5cc4ffe8fb17ab7809a154166a2dc5b7b0bf81c0bd3b9b62ada imageID: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:282ac9111e6fb5cc4ffe8fb17ab7809a154166a2dc5b7b0bf81c0bd3b9b62ada lastState: terminated: containerID: cri-o://04178649cdf7c1e055a00a265df0db64697eee151b241dfaf94002e9deb1a3a4 exitCode: 0 finishedAt: '2021-04-23T10:13:46Z' reason: Completed startedAt: '2021-04-23T10:08:58Z' name: sbdb ready: true restartCount: 2 started: true state: running: startedAt: '2021-04-23T10:14:01Z' hostIP: 172.21.8.40 phase: Running podIP: 172.21.8.40 podIPs: - ip: 172.21.8.40 qosClass: Burstable startTime: '2021-04-23T10:03:58Z' ---------------------- must-gather.local.6934947225619097706]$ omg logs pod/ovnkube-master-d22xq -c ovnkube-master -n openshift-ovn-kubernetes | grep -i error 2021-04-26T08:38:38.775807889Z W0426 08:38:38.774559 1 reflector.go:436] k8s.io/client-go/informers/factory.go:134: watch of *v1.NetworkPolicy ended with: an error on the server ("unable to decode an event from the watch stream: http2: client connection lost") has prevented the request from succeeding 2021-04-26T08:38:38.775908763Z W0426 08:38:38.775891 1 reflector.go:436] github.com/openshift/ovn-kubernetes/go-controller/pkg/crd/egressip/v1/apis/informers/externalversions/factory.go:117: watch of *v1.EgressIP ended with: an error on the server ("unable to decode an event from the watch stream: http2: client connection lost") has prevented the request from succeeding 2021-04-26T08:38:38.775962320Z W0426 
08:38:38.775942 1 reflector.go:436] k8s.io/client-go/informers/factory.go:134: watch of *v1.Endpoints ended with: an error on the server ("unable to decode an event from the watch stream: http2: client connection lost") has prevented the request from succeeding 2021-04-26T08:38:38.775995478Z W0426 08:38:38.775986 1 reflector.go:436] k8s.io/client-go/informers/factory.go:134: watch of *v1.Pod ended with: an error on the server ("unable to decode an event from the watch stream: http2: client connection lost") has prevented the request from succeeding 2021-04-26T08:38:38.776134334Z W0426 08:38:38.776125 1 reflector.go:436] k8s.io/client-go/informers/factory.go:134: watch of *v1.Node ended with: an error on the server ("unable to decode an event from the watch stream: http2: client connection lost") has prevented the request from succeeding 2021-04-26T08:38:38.776261354Z W0426 08:38:38.776254 1 reflector.go:436] k8s.io/client-go/informers/factory.go:134: watch of *v1.Namespace ended with: an error on the server ("unable to decode an event from the watch stream: http2: client connection lost") has prevented the request from succeeding 2021-04-26T08:38:38.776372097Z W0426 08:38:38.776363 1 reflector.go:436] k8s.io/client-go/informers/factory.go:134: watch of *v1.Service ended with: an error on the server ("unable to decode an event from the watch stream: http2: client connection lost") has prevented the request from succeeding 2021-04-26T08:38:38.784545847Z W0426 08:38:38.778348 1 reflector.go:436] k8s.io/apiextensions-apiserver/pkg/client/informers/externalversions/factory.go:117: watch of *v1beta1.CustomResourceDefinition ended with: an error on the server ("unable to decode an event from the watch stream: http2: client connection lost") has prevented the request from succeeding 2021-04-26T08:40:44.363753311Z I0426 08:40:44.363736 1 loadbalancer.go:305] Unable to delete Reject ACL for load-balancer: 7bea036d-75b0-43cf-96b7-8e7f1b52354b, vip: 172.30.141.249:5443. No entry in cache and erro occurred while trying to find the ACL by name in OVN, error: reject ACL not found to remove for name: 7bea036d-75b0-43cf-96b7-8e7f1b52354b-172.30.141.249\:5443 2021-04-26T08:40:44.572660119Z I0426 08:40:44.570252 1 loadbalancer.go:305] Unable to delete Reject ACL for load-balancer: bf9f0abc-7b3b-4e69-9d5c-3b8639e3cf9e, vip: 172.21.8.46:31623. No entry in cache and error occurred while trying to find the ACL by name in OVN, error: reject ACL not found to remove for name: bf9f0abc-7b3b-4e69-9d5c-3b8639e3cf9e-172.21.8.46\:31623 2021-04-26T08:40:44.594511660Z I0426 08:40:44.594491 1 loadbalancer.go:305] Unable to delete Reject ACL for load-balancer: bf9f0abc-7b3b-4e69-9d5c-3b8639e3cf9e, vip: 172.21.8.46:32023. No entry in cache and error occurred while trying to find the ACL by name in OVN, error: reject ACL not found to remove for name: bf9f0abc-7b3b-4e69-9d5c-3b8639e3cf9e-172.21.8.46\:32023 2021-04-26T08:40:44.621579632Z I0426 08:40:44.621563 1 loadbalancer.go:305] Unable to delete Reject ACL for load-balancer: bf9f0abc-7b3b-4e69-9d5c-3b8639e3cf9e, vip: 172.21.8.46:32181. No entry in cache and error occurred while trying to find the ACL by name in OVN, error: reject ACL not found to remove for name: bf9f0abc-7b3b-4e69-9d5c-3b8639e3cf9e-172.21.8.46\:32181 2021-04-26T08:40:44.647520360Z I0426 08:40:44.647498 1 loadbalancer.go:305] Unable to delete Reject ACL for load-balancer: bf9f0abc-7b3b-4e69-9d5c-3b8639e3cf9e, vip: 172.21.8.46:32742. 
No entry in cache and error occurred while trying to find the ACL by name in OVN, error: reject ACL not found to remove for name: bf9f0abc-7b3b-4e69-9d5c-3b8639e3cf9e-172.21.8.46\:32742 2021-04-26T08:40:44.682844927Z I0426 08:40:44.682835 1 loadbalancer.go:305] Unable to delete Reject ACL for load-balancer: bf9f0abc-7b3b-4e69-9d5c-3b8639e3cf9e, vip: 172.21.8.46:30386. No entry in cache and error occurred while trying to find the ACL by name in OVN, error: reject ACL not found to remove for name: bf9f0abc-7b3b-4e69-9d5c-3b8639e3cf9e-172.21.8.46\:30386 2021-04-26T08:40:44.705004020Z I0426 08:40:44.704994
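Assuming the same o-must-gather (omg) workflow used above, a quick way to gauge how dominant the repeated "Reject ACL not found" warnings are is to count them instead of paging through them; the pod name is the one from this report:

```
omg logs pod/ovnkube-master-d22xq -c ovnkube-master -n openshift-ovn-kubernetes | grep -c "Reject ACL not found"
```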
All the masters are bouncing for: 2021-04-26T08:19:01.950792531Z E0426 08:19:01.950738 1 leaderelection.go:325] error retrieving resource lock openshift-ovn-kubernetes/ovn-kubernetes-master: Get "https://api-int.openshift02.swqa.tst:6443/api/v1/namespaces/openshift-ovn-kubernetes/configmaps/ovn-kubernetes-master": context deadline exceeded 2021-04-26T08:19:01.950792531Z I0426 08:19:01.950775 1 leaderelection.go:278] failed to renew lease openshift-ovn-kubernetes/ovn-kubernetes-master: timed out waiting for the condition 2021-04-26T08:19:01.950825291Z I0426 08:19:01.950789 1 master.go:105] No longer leader; exiting At some point, they time out their request to the apiserver and they shut down. This is what I saw in the logs_previous of all the masters. All the api servers have a considerable amount of restarts: kube-apiserver-openshift02-tvdck-master-0 5/5 Running 94 1d kube-apiserver-openshift02-tvdck-master-1 5/5 Running 90 1d kube-apiserver-openshift02-tvdck-master-2 5/5 Running 111 1d Two logs says it's waiting for 6443 to be available again, so we don't get the real reason for restarting. timed out waiting for port :6443 to be released The one on master-1 starts after 08:19:01, and it shuts down again: 2021-04-26T08:43:48.312615310Z I0426 08:43:48.312548 213 sdn_readyz_wait.go:143] reached openshift-oauth-apiserver via SDN after 88 milliseconds 2021-04-26T08:43:48.313384403Z E0426 08:43:48.313334 213 repair.go:301] the cluster IP 172.30.126.26 may have leaked: flagging for later clean up 2021-04-26T08:43:48.314454080Z I0426 08:43:48.314366 213 httplog.go:94] "HTTP" verb="GET" URI="/apis/flowcontrol.apiserver.k8s.io/v1beta1/flowschemas/catch-all" latency="3.638461ms" userAgent="kube-apiserver/v1.20.0+bafe72f (linux/amd64) kubernetes/bafe72f" srcIP="[::1]:49520" resp=200 2021-04-26T08:43:48.325279367Z I0426 08:43:48.324934 213 apf_controller.go:266] Running API Priority and Fairness config worker 2021-04-26T08:43:48.325306816Z I0426 08:43:48.325264 213 cache.go:39] Caches are synced for AvailableConditionController controller 2021-04-26T08:43:48.333685772Z I0426 08:43:48.333609 213 shared_informer.go:247] Caches are synced for cluster_authentication_trust_controller 2021-04-26T08:43:48.339556432Z I0426 08:43:48.339485 213 shared_informer.go:247] Caches are synced for crd-autoregister 2021-04-26T08:43:48.362726075Z I0426 08:43:48.362616 213 cache.go:39] Caches are synced for autoregister controller 2021-04-26T08:43:48.368098677Z I0426 08:43:48.368027 213 httplog.go:94] "HTTP" verb="GET" URI="/apis/flowcontrol.apiserver.k8s.io/v1beta1/prioritylevelconfigurations/catch-all" latency="46.999798ms" userAgent="kube-apiserver/v1.20.0+bafe72f (linux/amd64) kubernetes/bafe72f" srcIP="[::1]:49520" resp=200 2021-04-26T08:43:48.368119897Z I0426 08:43:48.368093 213 cache.go:39] Caches are synced for APIServiceRegistrationController controller 2021-04-26T08:43:48.389546269Z I0426 08:43:48.389461 213 httplog.go:94] "HTTP" verb="GET" URI="/api/v1/services" latency="69.749084ms" userAgent="kube-apiserver/v1.20.0+bafe72f (linux/amd64) kubernetes/bafe72f" srcIP="[::1]:49520" resp=200 2021-04-26T08:43:48.394990427Z I0426 08:43:48.393499 213 httplog.go:94] "HTTP" verb="GET" URI="/apis/flowcontrol.apiserver.k8s.io/v1beta1/prioritylevelconfigurations/exempt" latency="13.328424ms" userAgent="kube-apiserver/v1.20.0+bafe72f (linux/amd64) kubernetes/bafe72f" srcIP="[::1]:49520" resp=200 2021-04-26T08:43:48.426738531Z I0426 08:43:48.426372 213 shared_informer.go:247] Caches are synced for node_authorizer 
2021-04-26T08:43:48.447148403Z I0426 08:43:48.442058 213 aggregator.go:231] Updating OpenAPI spec because v1.route.openshift.io is updated 2021-04-26T08:43:48.484563943Z I0426 08:43:48.482520 213 healthz.go:266] poststarthook/rbac/bootstrap-roles,poststarthook/scheduling/bootstrap-system-priority-classes,shutdown check failed: readyz 2021-04-26T08:43:48.484563943Z [-]poststarthook/rbac/bootstrap-roles failed: not finished 2021-04-26T08:43:48.484563943Z [-]poststarthook/scheduling/bootstrap-system-priority-classes failed: not finished 2021-04-26T08:43:48.484563943Z [-]shutdown failed: process is shutting down and then it shuts down. If I check the apiserver on master2 at the time of the ovnkube-master error: apiserver on msater 2: 2021-04-26T08:19:01.071457028Z I0426 08:19:01.065629 102 balancer_conn_wrappers.go:78] pickfirstBalancer: HandleSubConnStateChange: 0xc06e05fce0, {READY <nil>} 2021-04-26T08:19:01.071457028Z I0426 08:19:01.066840 102 controlbuf.go:508] transport: loopyWriter.run returning. connection error: desc = "transport is closing" 2021-04-26T08:19:01.228113812Z W0426 08:19:01.228041 102 clientconn.go:1223] grpc: addrConn.createTransport failed to connect to {https://172.21.8.41:2379 <nil> 0 <nil>}. Err :connection error: desc = "transport: authentication handshake failed: context deadline exceeded". Reconnecting... 2021-04-26T08:19:01.325114453Z W0426 08:19:01.325046 102 clientconn.go:1223] grpc: addrConn.createTransport failed to connect to {https://172.21.8.41:2379 <nil> 0 <nil>}. Err :connection error: desc = "transport: authentication handshake failed: context deadline exceeded". Reconnecting... 2021-04-26T08:19:01.796223110Z W0426 08:19:01.796146 102 clientconn.go:1223] grpc: addrConn.createTransport failed to connect to {https://172.21.8.41:2379 <nil> 0 <nil>}. Err :connection error: desc = "transport: authentication handshake failed: context deadline exceeded". Reconnecting... 2021-04-26T08:19:01.804741475Z I0426 08:19:01.804675 102 trace.go:205] Trace[1894265952]: "cacher list" type:*storage.CSIDriver (26-Apr-2021 08:18:58.804) (total time: 3000ms): 2021-04-26T08:19:01.804741475Z Trace[1894265952]: [3.000390326s] [3.000390326s] END 2021-04-26T08:19:01.806269576Z I0426 08:19:01.804929 102 trace.go:205] Trace[337787948]: "List" url:/apis/storage.k8s.io/v1/csidrivers,user-agent:kubelet.exe/v1.20.0 (windows/amd64) kubernetes/cac2421,client:172.21.8.42 (26-Apr-2021 08:18:58.804) (total time: 3000ms): 2021-04-26T08:19:01.806269576Z Trace[337787948]: [3.000668268s] [3.000668268s] END 2021-04-26T08:19:01.907128509Z W0426 08:19:01.907057 102 clientconn.go:1223] grpc: addrConn.createTransport failed to connect to {https://172.21.8.41:2379 <nil> 0 <nil>}. Err :connection error: desc = "transport: authentication handshake failed: context deadline exceeded". Reconnecting... 2021-04-26T08:19:01.935561713Z W0426 08:19:01.933845 102 clientconn.go:1223] grpc: addrConn.createTransport failed to connect to {https://172.21.8.41:2379 <nil> 0 <nil>}. Err :connection error: desc = "transport: authentication handshake failed: context deadline exceeded". Reconnecting... 
2021-04-26T08:19:02.047888799Z E0426 08:19:02.047825 102 memcache.go:101] couldn't get resource list for project.openshift.io/v1: the server is currently unable to handle the request 2021-04-26T08:19:02.154264204Z I0426 08:19:02.153675 102 trace.go:205] Trace[1913750063]: "Create" url:/api/v1/namespaces/openshift-marketplace/serviceaccounts,user-agent:catalog/v0.0.0 (linux/amd64) kubernetes/$Format,client:172.21.8.42 (26-Apr-2021 08:19:00.542) (total time: 1610ms): 2021-04-26T08:19:02.154264204Z Trace[1913750063]: [1.610705492s] [1.610705492s] END 2021-04-26T08:19:02.236812346Z I0426 08:19:02.236727 102 trace.go:205] Trace[1894080860]: "Create" url:/api/v1/namespaces/openshift-marketplace/serviceaccounts,user-agent:catalog/v0.0.0 (linux/amd64) kubernetes/$Format,client:172.21.8.42 (26-Apr-2021 08:18:58.292) (total time: 3943ms): 2021-04-26T08:19:02.236812346Z Trace[1894080860]: [3.943793009s] [3.943793009s] END 2021-04-26T08:19:02.417111205Z I0426 08:19:02.417040 102 controlbuf.go:508] transport: loopyWriter.run returning. connection error: desc = "transport is closing" 2021-04-26T08:19:02.574512715Z I0426 08:19:02.574389 102 client.go:360] parsed scheme: "passthrough" 2021-04-26T08:19:02.574512715Z I0426 08:19:02.574423 102 passthrough.go:48] ccResolverWrapper: sending update to cc: {[{https://localhost:2379 <nil> 0 <nil>}] <nil> <nil>} 2021-04-26T08:19:02.574512715Z I0426 08:19:02.574438 102 clientconn.go:948] ClientConn switching balancer to "pick_first" 2021-04-26T08:19:02.574792842Z I0426 08:19:02.574753 102 balancer_conn_wrappers.go:78] pickfirstBalancer: HandleSubConnStat Going back to etcd: On etcd on master 0 (what's backing https://172.21.8.41:2379 I think) I am not seeing anything related to the request from master, but there are some requests taking more than one minute: 2021-04-26T08:18:01.803648356Z 2021-04-26 08:18:01.803605 W | etcdserver: request "header:<ID:14026593988871619459 username:\"etcd\" auth_revision:1 > txn:<compare:<target:MOD key:\"/kubernetes.io/pods/openshift-kube-controller-manager/kube-controller-manager-openshift02-tvdck-master-1\" mod_revision:49691263 > success:<request_put:<key:\"/kubernetes.io/pods/openshift-kube-controller-manager/kube-controller-manager-openshift02-tvdck-master-1\" value_size:18653 >> failure:<>>" with result "size:20" took too long (141.14734ms) to execute 2021-04-26T08:18:01.936449854Z 2021-04-26 08:18:01.936265 W | etcdserver: request "header:<ID:17666346912641502105 username:\"etcd\" auth_revision:1 > txn:<compare:<target:MOD key:\"/kubernetes.io/events/openshift-marketplace/redhat-marketplace-gg9q5.16795a87fde829f9\" mod_revision:0 > success:<request_put:<key:\"/kubernetes.io/events/openshift-marketplace/redhat-marketplace-gg9q5.16795a87fde829f9\" value_size:741 lease:8442974875786721941 >> failure:<>>" with result "size:20" took too long (132.567234ms) to execute 2021-04-26T08:19:08.284610296Z 2021-04-26 08:19:08.278299 W | etcdserver: read-only range request "key:\"/kubernetes.io/health\" " with result "error:context deadline exceeded" took too long (1m3.643745769s) to execute 2021-04-26T08:19:08.383984071Z 2021-04-26 08:19:08.383944 W | etcdserver: read-only range request "key:\"/kubernetes.io/health\" " with result "error:context deadline exceeded" took too long (1m4.899034204s) to execute 2021-04-26T08:19:22.890951797Z 2021-04-26 08:19:22.890458 W | etcdserver: read-only range request "key:\"/kubernetes.io/configmaps/openshift-service-ca-operator/service-ca-operator-lock\" " with result "error:context canceled" took too 
long (1m19.453799065s) to execute 2021-04-26T08:19:22.953455395Z 2021-04-26 08:19:22.937646 W | etcdserver: read-only range request "key:\"/kubernetes.io/configmaps/openshift-kube-controller-manager/cluster-policy-controller\" " with result "error:context canceled" took too long (1m17.911003707s) to execute 2021-04-26T08:19:22.953455395Z WARNING: 2021/04/26 08:19:22 grpc: Server.processUnaryRPC failed to write status: connection error: desc = "transport is closing" Given the considerations above, I am moving this to apiserver, considering it could slip to etcd (but I am too unfamiliar with the interaction between them).
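A rough way to quantify how widespread the slow etcd requests are, assuming the standard must-gather directory layout (adjust the glob to the actual dump):

```
# Count "took too long" warnings per etcd member
for f in namespaces/openshift-etcd/pods/etcd-*/etcd/etcd/logs/current.log; do
  echo "== $f =="
  grep -c "took too long" "$f"
done
```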
Etcd shows lots of errors around 08:19 too, on all three nodes: 2021-04-26T08:18:01.803648356Z 2021-04-26 08:18:01.803605 W | etcdserver: request "header:<ID:14026593988871619459 username:\"etcd\" auth_revision:1 > txn:<compare:<target:MOD key:\"/kubernetes.io/pods/openshift-kube-controller-manager/kube-controller-manager-openshift02-tvdck-master-1\" mod_revision:49691263 > success:<request_put:<key:\"/kubernetes.io/pods/openshift-kube-controller-manager/kube-controller-manager-openshift02-tvdck-master-1\" value_size:18653 >> failure:<>>" with result "size:20" took too long (141.14734ms) to execute 2021-04-26T08:18:01.936449854Z 2021-04-26 08:18:01.936265 W | etcdserver: request "header:<ID:17666346912641502105 username:\"etcd\" auth_revision:1 > txn:<compare:<target:MOD key:\"/kubernetes.io/events/openshift-marketplace/redhat-marketplace-gg9q5.16795a87fde829f9\" mod_revision:0 > success:<request_put:<key:\"/kubernetes.io/events/openshift-marketplace/redhat-marketplace-gg9q5.16795a87fde829f9\" value_size:741 lease:8442974875786721941 >> failure:<>>" with result "size:20" took too long (132.567234ms) to execute 2021-04-26T08:19:08.284610296Z 2021-04-26 08:19:08.278299 W | etcdserver: read-only range request "key:\"/kubernetes.io/health\" " with result "error:context deadline exceeded" took too long (1m3.643745769s) to execute 2021-04-26T08:19:08.383984071Z 2021-04-26 08:19:08.383944 W | etcdserver: read-only range request "key:\"/kubernetes.io/health\" " with result "error:context deadline exceeded" took too long (1m4.899034204s) to execute 2021-04-26T08:19:22.890951797Z 2021-04-26 08:19:22.890458 W | etcdserver: read-only range request "key:\"/kubernetes.io/configmaps/openshift-service-ca-operator/service-ca-operator-lock\" " with result "error:context canceled" took too long (1m19.453799065s) to execute 2021-04-26T08:19:22.953455395Z 2021-04-26 08:19:22.937646 W | etcdserver: read-only range request "key:\"/kubernetes.io/configmaps/openshift-kube-controller-manager/cluster-policy-controller\" " with result "error:context canceled" took too long (1m17.911003707s) to execute 2021-04-26T08:19:22.953455395Z WARNING: 2021/04/26 08:19:22 grpc: Server.processUnaryRPC failed to write status: connection error: desc = "transport is closing" 2021-04-26T08:1 2021-04-26T08:18:59.278253911Z 2021-04-26 08:18:59.277614 W | rafthttp: health check for peer 1f84f55eb037bb5f could not connect: read tcp 172.21.8.42:51728->172.21.8.41:2380: i/o timeout 2021-04-26T08:19:01.348427523Z 2021-04-26 08:19:01.348390 I | embed: rejected connection from "172.21.8.42:38846" (error "EOF", ServerName "") 2021-04-26T08:19:02.334501491Z 2021-04-26 08:19:02.333472 W | rafthttp: health check for peer 1f84f55eb037bb5f could not connect: read tcp 172.21.8.40:48696->172.21.8.41:2380: i/o timeout 2021-04-26T08:19:02.949160993Z 2021-04-26 08:19:02.949120 I | embed: rejected connection from "172.21.8.40:43632" (error "EOF", ServerName "") Change is high that the network went down during this period. Another hint for that, kube-apiserver also experienced http write timeouts: 2021-04-26T08:19:40.431773236Z I0426 08:19:40.181292 37 controlbuf.go:508] transport: loopyWriter.run returning. 
connection error: desc = "transport is closing"
2021-04-26T08:19:40.783304621Z I0426 08:19:40.781614 37 cacher.go:1238] Forcing watcher close due to unresponsiveness: *core.Service
2021-04-26T08:19:42.785542018Z E0426 08:19:42.785512 37 writers.go:107] apiserver was unable to write a JSON response: http: Handler timeout

Four components independently hitting network request timeouts is suspicious. Sending this back to networking.
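To back up the "four components" observation, a sketch that checks which namespaces logged the characteristic connection-loss and timeout messages, again assuming the standard must-gather namespaces/<ns>/pods layout:

```
# Namespaces whose pod logs contain the connection-loss / timeout signatures
grep -rl "http2: client connection lost" namespaces/*/pods | cut -d/ -f2 | sort | uniq -c
grep -rl "context deadline exceeded" namespaces/*/pods | cut -d/ -f2 | sort | uniq -c
```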
I'm the customer impacted by this behavior. I think this is not an API issue but a network issue specifically related to OVN: if you look at the must-gather you will see core dumps of the OVN pod on all three master nodes, like the one also reported in https://bugzilla.redhat.com/show_bug.cgi?id=1943413. During these phases the cluster loses API reachability, but you are also unable to log in, gather logs via must-gather, or create containers, because pods get stuck in ContainerCreating status without receiving an IP address. This makes the cluster unstable and unusable, because you don't know when the pods will crash or for how long the cluster will be unreachable. Also note that after the upgrade the cluster's CPU utilization went up to a constant 25%-30%, without any workload on it (all CI/CD development is still on the old 4.5 cluster).
Took a look into the logs. The main issue is ovnkube-master and other pods are having problems contacting the api server and continuously restart. If we look at kube-apiserver there's a lot of transport is closing, and if we check etcd there are some quorum guard failures. It looks to me like there is resource contention (either CPU, memory, or both). I see from the sosreport on one of the nodes the RAM is somewhat low: [trozet@fedora memory]$ cat free total used free shared buff/cache available Mem: 16416956 14433032 167544 379692 1816380 1234460 Swap: 0 0 0 Additionally I see in dmesg and the core must gather there are core dumps on northd and journald: [566489.699222] systemd[1]: systemd-journal-flush.service: Consumed 0 CPU time [566489.699381] systemd[1]: Stopping Flush Journal to Persistent Storage... [566489.699519] systemd[1]: Stopped Journal Service. [566501.223604] systemd-journald[2072112]: File /var/log/journal/db59b16c433b4f088367d70770e0a529/system.journal corrupted or uncleanly shut down, renaming and replacing. [566509.127309] systemd[1]: Started Journal Service. [566520.418976] systemd-coredump[2072055]: Process 1652223 (systemd-journal) of user 0 dumped core. [566520.419145] systemd-coredump[2072055]: Coredump diverted to /var/lib/systemd/coredump/core.systemd-journal.0.8e76dd4e0f6b41e5acc7184ebb6282db.1652223.1619411248000000.lz4 [566520.419424] systemd-coredump[2072055]: Stack trace of thread 1652223: [566520.419541] systemd-coredump[2072055]: #0 0x00007f81bb6194df journal_file_append_entry (libsystemd-shared-239.so) [566520.419723] systemd-coredump[2072055]: #1 0x0000563706651d1c dispatch_message_real (systemd-journald) [566520.419888] systemd-coredump[2072055]: #2 0x000056370665739d stdout_stream_log (systemd-journald) [566520.420046] systemd-coredump[2072055]: #3 0x0000563706657654 stdout_stream_line (systemd-journald) [566520.420207] systemd-coredump[2072055]: #4 0x0000563706657def stdout_stream_scan (systemd-journald) [566520.420383] systemd-coredump[2072055]: #5 0x000056370665819a stdout_stream_process (systemd-journald) [566520.420548] systemd-coredump[2072055]: #6 0x00007f81bb609b40 source_dispatch (libsystemd-shared-239.so) [566527.653260] printk: systemd-coredum: 5 output lines suppressed due to ratelimiting [566627.393710] updating oom_score_adj for 2072428 (kubelet) from -999 to -999 because it shares mm with 2147 (kubelet). Report if this is unexpected. openshift02-tvdck-master-0_core_dump: total 48064 -rw-r--r--. 1 trozet trozet 3381894 Apr 23 06:04 core.systemd-journal.0.f78fa98b65584355b04b61fdb11da53e.859.1619129096000000.lz4 -rw-r--r--. 1 trozet trozet 45831400 Apr 23 06:04 core.ovn-northd.0.f78fa98b65584355b04b61fdb11da53e.4029.1619103625000000.lz4 The northd core might be the same as: https://bugzilla.redhat.com/show_bug.cgi?id=1943413#c59 but I will file another OVN bug so the OVN team can take a look at it. The etcd, kapiserver, ovnkube-master pods are all host networked and do not use OVN networking to communicate. If there is a problem with networking between them it's most likely host related. 
If we look at the sosreport we can see a bunch of problems in the networking stack: Tcp: 8283225 active connection openings 6705267 passive connection openings 1281379 failed connection attempts 2514925 connection resets received 2119 connections established 2217546216 segments received 2750481221 segments sent out 2753477 segments retransmitted 397 bad segments received 7841840 resets sent TcpExt: 1907 SYN cookies sent 3716 SYN cookies received 4204 invalid SYN cookies received 475 resets received for embryonic SYN_RECV sockets 151197 packets pruned from receive queue because of socket buffer overrun 2234 ICMP packets dropped because they were out-of-window 2093438 TCP sockets finished time wait in fast timer 2 time wait sockets recycled by time stamp 582963 packets rejected in established connections because of timestamp 58030999 delayed acks sent 15928 delayed acks further delayed because of locked socket Quick ack mode was activated 3585102 times 1836721 times the listen queue of a socket overflowed 1836732 SYNs to LISTEN sockets dropped 983205214 packet headers predicted 133627178 acknowledgments not containing data payload received 755888118 predicted acknowledgments TCPSackRecovery: 15867 Detected reordering 62604 times using SACK Detected reordering 2412 times using time stamp 5236 congestion windows fully recovered without slow start 1457 congestion windows partially recovered using Hoe heuristic TCPDSACKUndo: 8232 25237 congestion windows recovered without slow start after partial ack TCPLostRetransmit: 531026 TCPSackFailures: 12 2 timeouts in loss state 25130 fast retransmits 1388 retransmits in slow start TCPTimeouts: 186092 TCPLossProbes: 2071776 TCPLossProbeRecovery: 6100 TCPSackRecoveryFail: 4 118028 packets collapsed in receive queue due to low socket buffer ^we can see above lots of low socket buffer, memory, and connections dropped. This also indicates to me that the host cannot keep up. Now if we focus on memory, we can see who is taking up most of the RAM during the sosreport: %Mem RSS Process 3.3 3555920 ovsdb-server -vconsole:info -vfile:off 4.2 560068 ovs-vswitchd unix:/var/run/openvswitch/db.sock -vconsole:emer 4.5 2028772 /usr/bin/ruby /usr/local/bin/fluentd --suppress-config-dump 5.0 - - 10.3 - - 10.4 5316592 ovsdb-server -vconsole:info -vfile:off 10.8 3831100 kubelet --config=/etc/kubernetes/kubelet.conf --bootstrap-kubeconfig=/etc/kubernetes/kubeconfig 11.9 10224352 etcd --initial-advertise-peer-urls=https://172.21.8.41:2380 --cert-file=/etc/kubernetes/static-pod-certs/secrets/etcd-all-serving/etcd-serving-openshift02-tvdck-master-0.crt 14.3 - - 14.3 176656 ovn-northd --no-chdir -vconsole:info 19.0 - - 19.0 616092 journalctl --no-pager --unit 19.9 - - 25.6 3670068 kube-apiserver --openshift-config=/etc/kubernetes/static-pod-resources/configmaps/config/config.yaml --advertise-address=172.21.8.41 28.3 - - 28.3 6981276 ovn-controller unix:/var/run/openvswitch/db.sock -vfile:off 29.3 1277928 /usr/libexec/platform-python -s /usr/sbin/sosreport 35.7 - - 35.7 616336 journalctl --no-pager --catalog We can see that ovn-controller is using 28% of the system's RAM. This is way too much at 7GB. 
Now we look at ovn-controller: 2021-04-26T08:40:59.094134419Z 2021-04-26T08:40:59Z|318134|ofctrl|INFO|OpenFlow error: OFPT_ERROR (xid=0x2320): OFPBRC_BAD_VERSION 2021-04-26T08:40:59.094134419Z ***decode error: OFPBRC_BAD_TYPE*** 2021-04-26T08:40:59.094134419Z 00000000 ff ff 00 10 00 00 23 20-00 22 00 02 00 00 00 f1 |......# ."......| 2021-04-26T08:40:59.094164758Z 2021-04-26T08:40:59Z|318135|ofctrl|INFO|OpenFlow error: OFPT_ERROR (xid=0x2320): OFPBRC_BAD_VERSION 2021-04-26T08:40:59.094164758Z ***decode error: OFPBRC_BAD_TYPE*** 2021-04-26T08:40:59.094164758Z 00000000 ff ff 00 10 00 00 23 20-00 22 00 02 00 00 00 f1 |......# ."......| 2021-04-26T08:40:59.094199271Z 2021-04-26T08:40:59Z|318136|ofctrl|INFO|OpenFlow error: OFPT_ERROR (xid=0x2320): OFPBRC_BAD_VERSION 2021-04-26T08:40:59.094199271Z ***decode error: OFPBRC_BAD_TYPE*** 2021-04-26T08:40:59.094199271Z 00000000 ff ff 00 10 00 00 23 20-00 22 00 02 00 00 00 f1 |......# ."......| OVS complains in another log: 2021-04-25T18:13:51.653255903Z 2021-04-25T18:13:51.643Z|57494|connmgr|INFO|br-int<->unix#979216: sending OFPBMC_BAD_LEN error reply to OFPT_FLOW_MOD message 2021-04-25T18:13:51.653255903Z 2021-04-25T18:13:51.643Z|57495|vconn_stream|ERR|received too-short ofp_header (0 bytes) 2021-04-25T18:13:51.653255903Z 2021-04-25T18:13:51.643Z|57496|rconn|WARN|br-int<->unix#979216: connection dropped (Protocol error) 2021-04-25T18:13:51.653255903Z 2021-04-25T18:13:51.643Z|57497|connmgr|INFO|br-int<->unix#979216: 4005 flow_mods in the last 17 s (344 adds, 3642 deletes, 19 modifications) 2021-04-25T18:13:53.003691314Z 2021-04-25T18:13:53.001Z|57498|ofp_msgs|WARN|unknown OpenFlow message (version 255, type 255) 2021-04-25T18:13:53.003691314Z 2021-04-25T18:13:53.001Z|57499|vconn|ERR|unix#1015512: received OpenFlow version 0xff != expected 06 Looks like there may be some corruption/memory leak with ovn-controller here. It would be good if we could generate a core of ovn-controller in this state: Can we try to generate a core for this ovn-controller on openshift02-tvdck-master-0 doing "kill -11 <pid>"? I believe that will generate a core. I'll go ahead and file a bug on this too. 
Also in kube-apiserver I see: 2021-04-26T09:27:55.649131530Z E0426 09:27:55.649069 37 fieldmanager.go:186] [SHOULD NOT HAPPEN] failed to update managedFields for /, Kind=: failed to convert new object (batch/v1beta1, Kind=CronJob) to smd typed: .spec.jobTemplate.spec.template.spec.containers[name="indexmanagement"].env: duplicate entries for key [name="POLICY_MAPPING"] 2021-04-26T09:27:56.251981856Z W0426 09:27:56.251905 37 admission.go:98] ClusterQuotaAdmission received non-meta object *unstructured.Unstructured 2021-04-26T09:27:56.265460064Z W0426 09:27:56.264932 37 admission.go:98] ClusterQuotaAdmission received non-meta object *unstructured.Unstructured 2021-04-26T09:27:56.281971228Z W0426 09:27:56.281886 37 admission.go:98] ClusterQuotaAdmission received non-meta object *unstructured.Unstructured 2021-04-26T09:27:57.008885981Z W0426 09:27:57.008761 37 admission.go:98] ClusterQuotaAdmission received non-meta object *unstructured.Unstructured 2021-04-26T09:27:57.027293039Z W0426 09:27:57.027168 37 admission.go:98] ClusterQuotaAdmission received non-meta object *unstructured.Unstructured 2021-04-26T09:27:57.044257584Z W0426 09:27:57.044177 37 admission.go:98] ClusterQuotaAdmission received non-meta object *unstructured.Unstructured 2021-04-26T09:27:57.174630661Z E0426 09:27:57.174559 37 fieldmanager.go:186] [SHOULD NOT HAPPEN] failed to update managedFields for /, Kind=: failed to convert new object (batch/v1beta1, Kind=CronJob) to smd typed: .spec.jobTemplate.spec.template.spec.containers[name="indexmanagement"].env: duplicate entries for key [name="POLICY_MAPPING"] @sttts not sure if this is expected or not, so just wanted to highlight it in case.
Minor correction: the parsing I had on the memory was not quite accurate. ovn-controller is actually consuming 41% memory with 6.7GB RSS: [trozet@fedora process]$ cat ps_auxwwwm | egrep 'USER|ovn-controller' USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND root 3824 0.0 0.0 143732 2324 ? - Apr19 0:11 /usr/libexec/crio/conmon -b /var/run/containers/storage/overlay-containers/6d11511d3356979d09e19079b3a52f7f1c58c8353945337f30aa80c21f37b03f/userdata -c 6d11511d3356979d09e19079b3a52f7f1c58c8353945337f30aa80c21f37b03f --exit-dir /var/run/crio/exits -l /var/log/pods/openshift-ovn-kubernetes_ovnkube-node-g8n2c_27d64f48-a34c-4cfb-acf2-1e6ac0c9605e/ovn-controller/0.log --log-level info -n k8s_ovn-controller_ovnkube-node-g8n2c_openshift-ovn-kubernetes_27d64f48-a34c-4cfb-acf2-1e6ac0c9605e_0 -P /var/run/containers/storage/overlay-containers/6d11511d3356979d09e19079b3a52f7f1c58c8353945337f30aa80c21f37b03f/userdata/conmon-pidfile -p /var/run/containers/storage/overlay-containers/6d11511d3356979d09e19079b3a52f7f1c58c8353945337f30aa80c21f37b03f/userdata/pidfile --persist-dir /var/lib/containers/storage/overlay-containers/6d11511d3356979d09e19079b3a52f7f1c58c8353945337f30aa80c21f37b03f/userdata -r /usr/bin/runc --runtime-arg --root=/run/runc --socket-dir-path /var/run/crio -u 6d11511d3356979d09e19079b3a52f7f1c58c8353945337f30aa80c21f37b03f -s root 3983 28.3 40.9 6981276 6719588 ? - Apr19 2819:43 ovn-controller unix:/var/run/openvswitch/db.sock -vfile:off --no-chdir --pidfile=/var/run/ovn/ovn-controller.pid -p /ovn-cert/tls.key -c /ovn-cert/tls.crt -C /ovn-ca/ca-bundle.crt -vconsole:info
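For reference, a small sketch of ranking processes by memory straight from the sosreport; sos_commands/process/ps_auxwwwm is the usual location for this file, but the exact path can vary between sos versions:

```
# Top 5 processes by %MEM (column 4 of ps aux output): PID, %MEM, RSS, command
sort -k4 -rn sos_commands/process/ps_auxwwwm | head -5 \
  | awk '{ printf "%-8s %-6s %-10s %s\n", $2, $4, $6, $11 }'
```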
@trozet
(In reply to Tim Rozet from comment #8)
> Can we try to generate a core for this ovn-controller on
> openshift02-tvdck-master-0 doing "kill -11 <pid>"? I believe that will
> generate a core. I'll go ahead and file a bug on this too.

If you need that, I can try to do it on the node and then upload the dump here or to the case. Let me know how to proceed.

Gianluca
Hi Gianluca,
It would be great if you could provide that. Also, after you kill it and it restarts, it would be good to know:
1. Does the cluster become more stable?
2. Does ovn-controller now consume less RAM?
3. If 2 is true, is it slowly increasing?
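A sketch of the steps being requested, run on the affected node (openshift02-tvdck-master-0 in this case); whether a usable core is written depends on the node's systemd-coredump configuration, and the coredump directory below is the one already seen in the dmesg excerpt earlier in this bug:

```
# Find the ovn-controller PID and send SIGSEGV so systemd-coredump captures a core
pid=$(pgrep -f '^ovn-controller' | head -1)
sudo kill -11 "$pid"

# systemd-coredump normally diverts cores here (see the earlier dmesg excerpt)
ls -lh /var/lib/systemd/coredump/ | grep -i ovn-controller
```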
Gianluca, We suspect the issue with ovn-controller is coming from egress firewall. It looks like you have some egress firewalls created. Can you please try removing those and let us know if it helps? In the meantime we will evaluate a proper fix. Thanks.
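A minimal sketch of how to list and (after backing them up) remove the EgressFirewall objects; the example assumes the objects are named "default", as the one pasted later in this report is, and <namespace> is a placeholder for whatever the first command returns:

```
# List, back up, then delete the EgressFirewall objects
oc get egressfirewall -A
oc get egressfirewall -A -o yaml > egressfirewall-backup.yaml
oc delete egressfirewall default -n <namespace>
```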
Hi Tim,
Thank you for your update. We have deleted them and are now restarting everything to see how things behave. It could be related or a completely separate issue, but the kube-apiserver pods have over 300 restarts and the kube-apiserver container keeps crashing on at least 2 masters, which most likely explains the performance issues on the cluster. Not sure if OVN is causing that or if it is a separate problem, but we are trying to check it.
(In reply to Tim Rozet from comment #11) > Hi Gianluca, > That would be great if you can provide that. Also after you kill it and it > restarts, would be good to know: > 1. Does the cluster becomes more stable? > 2. Does ovn-controller now consume less RAM? > 3. If 2 is true, is it slowly increasing? Hi Tim, I was able to do the test, here below the command issued: =~=~=~=~=~=~=~=~=~=~=~= PuTTY log 2021.04.29 15:53:06 =~=~=~=~=~=~=~=~=~=~=~= ps -aux | egrep 'USER|ovn-controller' USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND core 259048 0.0 0.0 12792 1112 pts/0 S+ 13:53 0:00 grep -E --color=auto USER|ovn-controller root 3523610 0.0 0.0 143732 1916 ? Ssl Apr26 0:03 /usr/libexec/crio/conmon -b /var/run/containers/storage/overlay-containers/9a472c41ff47fe4c98f049f87c21155aef0e5a107a779c8ddbdbfc89d973d1e2/userdata -c 9a472c41ff47fe4c98f049f87c21155aef0e5a107a779c8ddbdbfc89d973d1e2 --exit-dir /var/run/crio/exits -l /var/log/pods/openshift-ovn-kubernetes_ovnkube-node-g8n2c_27d64f48-a34c-4cfb-acf2-1e6ac0c9605e/ovn-controller/1.log --log-level info -n k8s_ovn-controller_ovnkube-node-g8n2c_openshift-ovn-kubernetes_27d64f48-a34c-4cfb-acf2-1e6ac0c9605e_1 -P /var/run/containers/storage/overlay-containers/9a472c41ff47fe4c98f049f87c21155aef0e5a107a779c8ddbdbfc89d973d1e2/userdata/conmon-pidfile -p /var/run/containers/storage/overlay-containers/9a472c41ff47fe4c98f049f87c21155aef0e5a107a779c8ddbdbfc89d973d1e2/userdata/pidfile --persist-dir /var/lib/containers/storage/overlay-containers/9a472c41ff47fe4c98f049f87c21155aef0e5a107a779c8ddbdbfc89d973d1e2/userdata -r /usr/bin/runc --runtime-arg --root=/run/runc --socket-dir-path /var/run/crio -u 9a472c41ff47fe4c98f049f87c21155aef0e5a107a779c8ddbdbfc89d973d1e2 -s root 3523636 33.4 11.8 2204988 1944488 ? Rsl Apr26 1352:22 ovn-controller unix:/var/run/openvswitch/db.sock -vfile:off --no-chdir --pidfile=/var/run/ovn/ovn-controller.pid -p /ovn-cert/tls.key -c /ovn-cert/tls.crt -C /ovn-ca/ca-bundle.crt -vconsole:info [core@openshift02-tvdck-master-0 ~]$ kill -11 3523636 -bash: kill: (3523636) - Operation not permitted [core@openshift02-tvdck-master-0 ~]$ kill -11 3523636sudo [core@openshift02-tvdck-master-0 ~]$ sudo kill -11 3523636ps -aux | egrep 'USER|ovn-controller' USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND core 269916 0.0 0.0 12792 988 pts/0 S+ 13:54 0:00 grep -E --color=auto USER|ovn-controller root 3523610 0.0 0.0 143732 1904 ? Ssl Apr26 0:03 /usr/libexec/crio/conmon -b /var/run/containers/storage/overlay-containers/9a472c41ff47fe4c98f049f87c21155aef0e5a107a779c8ddbdbfc89d973d1e2/userdata -c 9a472c41ff47fe4c98f049f87c21155aef0e5a107a779c8ddbdbfc89d973d1e2 --exit-dir /var/run/crio/exits -l /var/log/pods/openshift-ovn-kubernetes_ovnkube-node-g8n2c_27d64f48-a34c-4cfb-acf2-1e6ac0c9605e/ovn-controller/1.log --log-level info -n k8s_ovn-controller_ovnkube-node-g8n2c_openshift-ovn-kubernetes_27d64f48-a34c-4cfb-acf2-1e6ac0c9605e_1 -P /var/run/containers/storage/overlay-containers/9a472c41ff47fe4c98f049f87c21155aef0e5a107a779c8ddbdbfc89d973d1e2/userdata/conmon-pidfile -p /var/run/containers/storage/overlay-containers/9a472c41ff47fe4c98f049f87c21155aef0e5a107a779c8ddbdbfc89d973d1e2/userdata/pidfile --persist-dir /var/lib/containers/storage/overlay-containers/9a472c41ff47fe4c98f049f87c21155aef0e5a107a779c8ddbdbfc89d973d1e2/userdata -r /usr/bin/runc --runtime-arg --root=/run/runc --socket-dir-path /var/run/crio -u 9a472c41ff47fe4c98f049f87c21155aef0e5a107a779c8ddbdbfc89d973d1e2 -s root 3523636 33.4 11.8 2205320 1945732 ? 
Ssl Apr26 1353:12 ovn-controller unix:/var/run/openvswitch/db.sock -vfile:off --no-chdir --pidfile=/var/run/ovn/ovn-controller.pid -p /ovn-cert/tls.key -c /ovn-cert/tls.crt -C /ovn-ca/ca-bundle.crt -vconsole:info [core@openshift02-tvdck-master-0 ~]$ ps -aux | egrep 'USER|ovn-controller' USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND root 270902 0.0 0.0 143732 2628 ? Ssl 13:54 0:00 /usr/libexec/crio/conmon -b /var/run/containers/storage/overlay-containers/713e5a8d9e00d8c9a8b65f0da1bab6fad68801385ef8b9df45c5f74a492aa57a/userdata -c 713e5a8d9e00d8c9a8b65f0da1bab6fad68801385ef8b9df45c5f74a492aa57a --exit-dir /var/run/crio/exits -l /var/log/pods/openshift-ovn-kubernetes_ovnkube-node-g8n2c_27d64f48-a34c-4cfb-acf2-1e6ac0c9605e/ovn-controller/2.log --log-level info -n k8s_ovn-controller_ovnkube-node-g8n2c_openshift-ovn-kubernetes_27d64f48-a34c-4cfb-acf2-1e6ac0c9605e_2 -P /var/run/containers/storage/overlay-containers/713e5a8d9e00d8c9a8b65f0da1bab6fad68801385ef8b9df45c5f74a492aa57a/userdata/conmon-pidfile -p /var/run/containers/storage/overlay-containers/713e5a8d9e00d8c9a8b65f0da1bab6fad68801385ef8b9df45c5f74a492aa57a/userdata/pidfile --persist-dir /var/lib/containers/storage/overlay-containers/713e5a8d9e00d8c9a8b65f0da1bab6fad68801385ef8b9df45c5f74a492aa57a/userdata -r /usr/bin/runc --runtime-arg --root=/run/runc --socket-dir-path /var/run/crio -u 713e5a8d9e00d8c9a8b65f0da1bab6fad68801385ef8b9df45c5f74a492aa57a -s root 270931 65.2 1.0 420244 165864 ? Rsl 13:54 0:06 ovn-controller unix:/var/run/openvswitch/db.sock -vfile:off --no-chdir --pidfile=/var/run/ovn/ovn-controller.pid -p /ovn-cert/tls.key -c /ovn-cert/tls.crt -C /ovn-ca/ca-bundle.crt -vconsole:info core 272632 0.0 0.0 12924 1100 pts/0 S+ 13:54 0:00 grep -E --color=auto USER|ovn-controller [core@openshift02-tvdck-master-0 ~]$ ps -aux | egrep 'USER|ovn-controller' USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND root 270902 0.0 0.0 143732 2628 ? Ssl 13:54 0:00 /usr/libexec/crio/conmon -b /var/run/containers/storage/overlay-containers/713e5a8d9e00d8c9a8b65f0da1bab6fad68801385ef8b9df45c5f74a492aa57a/userdata -c 713e5a8d9e00d8c9a8b65f0da1bab6fad68801385ef8b9df45c5f74a492aa57a --exit-dir /var/run/crio/exits -l /var/log/pods/openshift-ovn-kubernetes_ovnkube-node-g8n2c_27d64f48-a34c-4cfb-acf2-1e6ac0c9605e/ovn-controller/2.log --log-level info -n k8s_ovn-controller_ovnkube-node-g8n2c_openshift-ovn-kubernetes_27d64f48-a34c-4cfb-acf2-1e6ac0c9605e_2 -P /var/run/containers/storage/overlay-containers/713e5a8d9e00d8c9a8b65f0da1bab6fad68801385ef8b9df45c5f74a492aa57a/userdata/conmon-pidfile -p /var/run/containers/storage/overlay-containers/713e5a8d9e00d8c9a8b65f0da1bab6fad68801385ef8b9df45c5f74a492aa57a/userdata/pidfile --persist-dir /var/lib/containers/storage/overlay-containers/713e5a8d9e00d8c9a8b65f0da1bab6fad68801385ef8b9df45c5f74a492aa57a/userdata -r /usr/bin/runc --runtime-arg --root=/run/runc --socket-dir-path /var/run/crio -u 713e5a8d9e00d8c9a8b65f0da1bab6fad68801385ef8b9df45c5f74a492aa57a -s root 270931 69.7 1.0 423436 169068 ? Rsl 13:54 0:17 ovn-controller unix:/var/run/openvswitch/db.sock -vfile:off --no-chdir --pidfile=/var/run/ovn/ovn-controller.pid -p /ovn-cert/tls.key -c /ovn-cert/tls.crt -C /ovn-ca/ca-bundle.crt -vconsole:info core 274754 0.0 0.0 12924 2568 pts/0 S+ 13:54 0:00 grep -E --color=auto USER|ovn-controller [core@openshift02-tvdck-master-0 ~]$ l After the kill I hit the issue that I'm experiencing so the cluster become unstable mainly with some resource consumption on master-2 that in some case become unready. 
After a while - maybe after some restart of the ovn pod - the cluster become more responsive and after looking at it with Andre Costa, we decided do do a reboot of all vm starting from the worker. After the reboot the adm top show: root:~ (☸️ |dapr-system/api-openshift02-swqa-tst:6443/gianluca.lini:dapr-system)# date Thu Apr 29 17:24:23 CEST 2021 root:~ (☸️ |dapr-system/api-openshift02-swqa-tst:6443/gianluca.lini:dapr-system)# oc adm top pod -l app=ovnkube-node --containers -n openshift-ovn-kubernetes POD NAME CPU(cores) MEMORY(bytes) ovnkube-node-5648w kube-rbac-proxy 0m 19Mi ovnkube-node-5648w ovn-controller 757m 180Mi ovnkube-node-5648w ovnkube-node 38m 54Mi ovnkube-node-84wkf kube-rbac-proxy 0m 17Mi ovnkube-node-84wkf ovn-controller 1014m 185Mi ovnkube-node-84wkf ovnkube-node 101m 57Mi ovnkube-node-bqk8j kube-rbac-proxy 0m 25Mi ovnkube-node-bqk8j ovn-controller 739m 177Mi ovnkube-node-bqk8j ovnkube-node 94m 52Mi ovnkube-node-bvt9t kube-rbac-proxy 0m 20Mi ovnkube-node-bvt9t ovn-controller 930m 185Mi ovnkube-node-bvt9t ovnkube-node 87m 74Mi ovnkube-node-g8n2c kube-rbac-proxy 0m 14Mi ovnkube-node-g8n2c ovn-controller 990m 178Mi ovnkube-node-g8n2c ovnkube-node 77m 45Mi ovnkube-node-ph9k4 kube-rbac-proxy 0m 30Mi ovnkube-node-ph9k4 ovn-controller 1069m 182Mi ovnkube-node-ph9k4 ovnkube-node 77m 72Mi ovnkube-node-xf7v2 kube-rbac-proxy 0m 15Mi ovnkube-node-xf7v2 ovn-controller 803m 182Mi ovnkube-node-xf7v2 ovnkube-node 94m 75Mi root:~ (☸️ |dapr-system/api-openshift02-swqa-tst:6443/gianluca.lini:dapr-system)# This morning the output is: root:~ (☸️ |dapr-system/api-openshift02-swqa-tst:6443/gianluca.lini:dapr-system)# date Fri Apr 30 07:52:09 CEST 2021 root:~ (☸️ |dapr-system/api-openshift02-swqa-tst:6443/gianluca.lini:dapr-system)# oc adm top pod -l app=ovnkube-node --containers -n openshift-ovn-kubernetes POD NAME CPU(cores) MEMORY(bytes) ovnkube-node-5648w kube-rbac-proxy 0m 14Mi ovnkube-node-5648w ovn-controller 1161m 1739Mi ovnkube-node-5648w ovnkube-node 45m 53Mi ovnkube-node-84wkf kube-rbac-proxy 0m 17Mi ovnkube-node-84wkf ovn-controller 448m 778Mi ovnkube-node-84wkf ovnkube-node 64m 60Mi ovnkube-node-bqk8j kube-rbac-proxy 0m 16Mi ovnkube-node-bqk8j ovn-controller 719m 874Mi ovnkube-node-bqk8j ovnkube-node 39m 36Mi ovnkube-node-g8n2c kube-rbac-proxy 0m 17Mi ovnkube-node-g8n2c ovn-controller 886m 1252Mi ovnkube-node-g8n2c ovnkube-node 65m 42Mi ovnkube-node-j9cs7 kube-rbac-proxy 0m 19Mi ovnkube-node-j9cs7 ovn-controller 872m 767Mi ovnkube-node-j9cs7 ovnkube-node 85m 55Mi ovnkube-node-ph9k4 kube-rbac-proxy 0m 22Mi ovnkube-node-ph9k4 ovn-controller 1104m 751Mi ovnkube-node-ph9k4 ovnkube-node 96m 52Mi ovnkube-node-xf7v2 kube-rbac-proxy 0m 29Mi ovnkube-node-xf7v2 ovn-controller 898m 842Mi ovnkube-node-xf7v2 ovnkube-node 84m 99Mi root:~ (☸️ |dapr-system/api-openshift02-swqa-tst:6443/gianluca.lini:dapr-system)# The pod with cpu > 1000 are on this nodes: root:~ (☸️ |dapr-system/api-openshift02-swqa-tst:6443/gianluca.lini:dapr-system)# oc get pod ovnkube-node-5648w -n openshift-ovn-kubernetes -o wide NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES ovnkube-node-5648w 3/3 Running 0 10d 172.21.8.42 openshift02-tvdck-master-1 <none> <none> root:~ (☸️ |dapr-system/api-openshift02-swqa-tst:6443/gianluca.lini:dapr-system)# root:~ (☸️ |dapr-system/api-openshift02-swqa-tst:6443/gianluca.lini:dapr-system)# oc get pod ovnkube-node-ph9k4 -n openshift-ovn-kubernetes -o wide NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES ovnkube-node-ph9k4 3/3 Running 0 10d 172.21.8.46 
openshift02-tvdck-worker-nnn9b <none> <none> root:~ (☸️ |dapr-system/api-openshift02-swqa-tst:6443/gianluca.lini:dapr-system)# This is the top -c on the worker node: top - 05:55:55 up 14:56, 1 user, load average: 6.07, 6.22, 6.05 Tasks: 573 total, 3 running, 567 sleeping, 0 stopped, 3 zombie %Cpu(s): 23.0 us, 5.0 sy, 0.0 ni, 51.5 id, 18.5 wa, 0.6 hi, 1.3 si, 0.0 st MiB Mem : 32167.3 total, 286.1 free, 17311.1 used, 14570.2 buff/cache MiB Swap: 0.0 total, 0.0 free, 0.0 used. 15278.7 avail Mem PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 6832 root 20 0 1018276 762404 6456 R 98.7 2.3 728:12.13 ovn-controller unix:/var/run/openvswitch/db.sock -vfile:off --no-chdir --pidfile=/var/run/ovn/ovn-controller.pid -p /ovn-ce+ 20293 1000660+ 20 0 72.3g 4.9g 515364 S 19.6 15.6 184:27.04 /usr/lib/jvm/jre/bin/java -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly + 13270 root 20 0 1941156 660460 312020 S 17.3 2.0 152:35.88 /usr/local/bin/px-storage As said by Andre the egress rule was removed yesterday there are two in place and they are quite simple: root:~ (☸️ |dapr-system/api-openshift02-swqa-tst:6443/gianluca.lini:dapr-system)# cat /ocp/config/egress-demo-lel.yaml apiVersion: k8s.ovn.org/v1 kind: EgressFirewall metadata: name: default spec: egress: - type: Allow to: cidrSelector: 172.23.199.72/32 cidrSelector: 172.27.0.65/32 - type: Allow to: dnsName: docker.io dnsName: registry-1.docker.io dnsName: mcr.microsoft.com dnsName: azure.microsoft.com - type: Deny to: cidrSelector: 0.0.0.0/0 root:~ (☸️ |dapr-system/api-openshift02-swqa-tst:6443/gianluca.lini:dapr-system)# Let me know if you need more test, we can do everything on that cluster. Gianluca
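To answer questions 2 and 3 from comment #11 (does ovn-controller memory drop after a restart and then slowly climb again), a crude sampling loop built on the same oc adm top command used above; the interval and log file name are arbitrary choices:

```
# Sample ovn-controller memory every 5 minutes with a timestamp
while true; do
  echo "--- $(date -u) ---"
  oc adm top pod -l app=ovnkube-node --containers -n openshift-ovn-kubernetes | grep ovn-controller
  sleep 300
done | tee -a ovn-controller-mem.log
```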
Thanks Gianluca and Andre. I'll get the OVN team to analyze the cores. Could you please also grab a must-gather and the network must-gather: 'oc adm must-gather -- gather_network_logs'. Having the latest DBs might help us reproduce the problem and/or provide some insight into which logical flows are causing it.
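For completeness, a sketch of the two collections being requested; --dest-dir is optional and only used here to keep the default and network gathers separate:

```
# Default must-gather, then the network gather mentioned above (expected to include the OVN DBs)
oc adm must-gather --dest-dir=./must-gather-default
oc adm must-gather --dest-dir=./must-gather-network -- gather_network_logs
```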
Hi Tim, Attaching the information requested. Also I would like to point that ETCD seems to have some performance issues. Would this be worth to check it with the respective team? Not sure if the OVN might be influencing this or not, but if ETCD is not responding it makes sense the issues we have with the kube-apiserver and the slowness of interacting with the cluster and the number of restarts in the api pods. etcdctl endpoint status -w table +--------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+ | ENDPOINT | ID | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS | +--------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+ | https://<IP>:2379 | 70a3eccd1ab6c2a8 | 3.4.9 | 407 MB | true | false | 8172 | 105701328 | 105701328 | | | https://<IP>:2379 | 1f84f55eb037bb5f | 3.4.9 | 407 MB | false | false | 8172 | 105701328 | 105701328 | | | https://<IP>:2379 | 84479000a194f52b | 3.4.9 | 407 MB | false | false | 8172 | 105701328 | 105701328 | | +--------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+ sh-4.4# etcdctl check perf --load=l 60 / 60 Boooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooo! 100.00% 1m0s FAIL: Throughput too low: 4298 writes/s PASS: Slowest request took 0.466931s PASS: Stddev is 0.039390s FAIL sh-4.4# sh-4.4# etcdctl check perf --load=m 60 / 60 Boooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooo! 100.00% 1m0s FAIL: Throughput too low: 615 writes/s Slowest request took too long: 1.033440s Stddev too high: 0.134737s FAIL I have tried to replicate the issues with OVN somehow and so far I don't see anything. Upgrade to 4.7.8 went well and I tried deploying some egressFirewall policies and still nothing. My cluster is running on my local VMs and I don't see any performance issues with ETCD nor constant restarts on the API.
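For reference, a sketch of how these etcdctl checks are typically run on an OpenShift cluster; this assumes the usual etcd static pod layout where an etcdctl sidecar container already has the client certificates and endpoints configured (pod and container names may differ):

```
# Pick one of the etcd static pods, then run the checks from its etcdctl container
oc -n openshift-etcd get pods -o wide
oc -n openshift-etcd rsh -c etcdctl etcd-<master-node-name>
# inside the pod:
#   etcdctl endpoint status -w table
#   etcdctl check perf --load=m
```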
@anusaxen
We would like to dig deeper into cluster performance; could you please provide a dump of the Prometheus state? Often the dump is very large (60GB or more is possible), so please consider this when running the command.

```
#!/bin/bash
ARTIFACT_DIR=$PWD
mkdir -p $ARTIFACT_DIR/metrics
echo "Snapshotting prometheus ..."
oc --insecure-skip-tls-verify exec -n openshift-monitoring prometheus-k8s-0 -- tar cvzf - -C /prometheus . >$ARTIFACT_DIR/metrics/prometheus.tar.gz
```

Please ensure the file can be decompressed before uploading. With this data we will be able to drill deeply into the root cause.
etcd team will dig into the logs provided as a first pass.
(In reply to Sam Batschelet from comment #21)
> @anusaxen
>
> We would like to dig deeper into cluster perf could you please provide a
> dump of Prometheus state. Often times the dump is very large 60GB or more
> possible so please consider this when running the command.
>
> ```
> #!/bin/bash
> ARTIFACT_DIR=$PWD
> mkdir -p $ARTIFACT_DIR/metrics
> echo "Snapshotting prometheus ..."
> oc --insecure-skip-tls-verify exec -n openshift-monitoring prometheus-k8s-0
> -- tar cvzf - -C /prometheus . >$ARTIFACT_DIR/metrics/prometheus.tar.gz
> ```
>
> Please ensure the file can be decompressed before uploading. With this data
> we will be able to drill deeply into the root cause.

Hi Sam, I've just uploaded the data (.tar.gz size is 1.8GB) to the case.
On a first pass, I am very concerned by the etcd outages, many of which span hours.

```
$ cat host_service_logs/masters/kubelet_service.log | awk '/Readiness probe for "etcd-quorum-guard.*/ { print $1, $2, $3, $4, $10, $14; }' | wc -l
5602
```
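As a hypothetical follow-up on the same kubelet log, assuming the syslog-style "Mon DD HH:MM:SS" timestamps shown by journalctl, the failures can be bucketed by hour to see whether they cluster around the incident windows rather than being spread evenly:

```
# Count etcd-quorum-guard readiness probe failures per day and hour
grep 'Readiness probe for "etcd-quorum-guard' host_service_logs/masters/kubelet_service.log \
  | awk '{ print $1, $2, substr($3, 1, 2) ":00" }' | sort | uniq -c | sort -rn | head
```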
Not sure whether the performance issues in etcd are caused by OVN or by a separate issue, but etcd itself doesn't show any specific problem that needs further investigation. Sending the BZ back to @trozet.
I see from the audit logs that an operator called "dapr-operator" is deleting the openshift-ingress services: openshift02-tvdck-master-1-audit-2021-05-07T04-14-34.582.log:{"kind":"Event","apiVersion":"audit.k8s.io/v1","level":"Metadata","auditID":"33039c90-8978-4945-bd0c-4c78e0bd8dac","stage":"ResponseComplete","requestURI":"/api/v1/namespaces/openshift-ingress/services/router-internal-default","verb":"delete","user":{"username":"system:serviceaccount:dapr-system:dapr-operator","uid":"238776f7-c142-41c4-8854-e0d4a81ccf91","groups":["system:serviceaccounts","system:serviceaccounts:dapr-system","system:authenticated"]},"sourceIPs":["172.21.8.44"],"userAgent":"operator/v0.0.0 (linux/amd64) kubernetes/$Format/leader-election","objectRef":{"resource":"services","namespace":"openshift-ingress","name":"router-internal-default","apiVersion":"v1"},"responseStatus":{"metadata":{},"status":"Success","code":200},"requestReceivedTimestamp":"2021-05-07T04:14:17.151385Z","stageTimestamp":"2021-05-07T04:14:17.268381Z","annotations":{"authentication.k8s.io/legacy-token":"system:serviceaccount:dapr-system:dapr-operator","authorization.k8s.io/decision":"allow","authorization.k8s.io/reason":"RBAC: allowed by ClusterRoleBinding \"dapr-operator\" of ClusterRole \"dapr-operator-admin\" to ServiceAccount \"dapr-operator/dapr-system\""}} I'm unfamiliar with this operator, I'm guessing it is this: https://docs.dapr.io/operations/hosting/kubernetes/kubernetes-overview/ Either way it shouldn't be deleting core openshift services. Can we try disabling this operator?
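For anyone re-checking this, a sketch of one way to pull these deletions out of the audit logs in the must-gather; jq is assumed to be available, and the file glob matches the audit log file names quoted above:

```
# Which service account is deleting services in openshift-ingress, and when
cat openshift02-tvdck-master-*-audit-*.log \
  | jq -c 'select(.verb == "delete" and .objectRef.resource == "services" and .objectRef.namespace == "openshift-ingress")
           | {user: .user.username, name: .objectRef.name, time: .requestReceivedTimestamp}'
```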