Description of problem:
- ovnkube-master pods keep restarting after upgrading from 4.7.3 to 4.7.6.
- The cluster started as an IPI installation on VMware with the network customization required for Windows container support. The first release was 4.7.0, with three upgrades before this one (4.7.1/4.7.2/4.7.3).

$ omg get all -n openshift-ovn-kubernetes
NAME                       READY   STATUS    RESTARTS   AGE
pod/ovnkube-master-d22xq   6/6     Running   225        2d
pod/ovnkube-master-njwg7   6/6     Running   88         3d
pod/ovnkube-master-xdspg   6/6     Running   132        2d

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1.
2.
3.

Actual results:
The previous upgrades 4.7.0 --> 4.7.1 --> 4.7.2 --> 4.7.3 were smooth and without issues like this; with this release something appears to be wrong with the ovn-kubernetes CNI provider.

Expected results:

Additional info:
- There is no external load balancer; it is an IPI installation, as required by Hybrid Networking / Windows Container Support, so the cluster uses the keepalived static pod configuration on OCP.
- Latest and earlier must-gather reports are available under /cases/02920882
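For anyone triaging this on a live cluster rather than from the must-gather, a minimal sketch of the equivalent checks with oc; the pod name is the one from this report and will differ elsewhere:

```
# Restart counts and node placement of the OVN control-plane pods
oc -n openshift-ovn-kubernetes get pods -o wide

# Last termination reason and the log of the previous (crashed) container instance
oc -n openshift-ovn-kubernetes describe pod ovnkube-master-d22xq | grep -A5 "Last State"
oc -n openshift-ovn-kubernetes logs ovnkube-master-d22xq -c ovnkube-master --previous | tail -50
```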
Latest Must gather & sos report from Master nodes on support-shell - containerID: cri-o://bc9b104a876b2f4acabccbf451da48667417fe03a2ce8b7a08bcb955efaa65e3 image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:282ac9111e6fb5cc4ffe8fb17ab7809a154166a2dc5b7b0bf81c0bd3b9b62ada imageID: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:282ac9111e6fb5cc4ffe8fb17ab7809a154166a2dc5b7b0bf81c0bd3b9b62ada lastState: terminated: containerID: cri-o://e31ea45ef72191b999e412adffefbad19b13a35167db7394b901b89e276248e3 exitCode: 1 finishedAt: '2021-04-23T10:13:46Z' message: 's ssl:172.21.8.40:9641,ssl:172.21.8.41:9641,ssl:172.21.8.42:9641 --nb-client-privkey /ovn-cert/tls.key --nb-client-cert /ovn-cert/tls.crt --nb-client-cacert /ovn-ca/ca-bundle.crt --nb-cert-common-name ovn I0423 10:13:46.476810 1 config.go:1306] Parsed config file /run/ovnkube-config/ovnkube.conf I0423 10:13:46.476938 1 config.go:1307] Parsed config: {Default:{MTU:1400 ConntrackZone:64000 EncapType:geneve EncapIP: EncapPort:6081 InactivityProbe:100000 OpenFlowProbe:180 RawClusterSubnets:10.128.0.0/14/23 ClusterSubnets:[]} Logging:{File: CNIFile: Level:4 LogFileMaxSize:100 LogFileMaxBackups:5 LogFileMaxAge:5} CNI:{ConfDir:/etc/cni/net.d Plugin:ovn-k8s-cni-overlay} OVNKubernetesFeature:{EnableEgressIP:true} Kubernetes:{Kubeconfig: CACert: APIServer:https://api-int.openshift02.swqa.tst:6443 Token: CompatServiceCIDR: RawServiceCIDRs:172.30.0.0/16 ServiceCIDRs:[] OVNConfigNamespace:openshift-ovn-kubernetes MetricsBindAddress: OVNMetricsBindAddress: MetricsEnablePprof:false OVNEmptyLbEvents:false PodIP: RawNoHostSubnetNodes:kubernetes.io/os=windows NoHostSubnetNodes:nil} OvnNorth:{Address: PrivKey: Cert: CACert: CertCommonName: Scheme: northbound:false externalID: exec:<nil>} OvnSouth:{Address: PrivKey: Cert: CACert: CertCommonName: Scheme: northbound:false externalID: exec:<nil>} Gateway:{Mode:local Interface: NextHop: VLANID:0 NodeportEnable:true DisableSNATMultipleGWs:false V4JoinSubnet:100.64.0.0/16 V6JoinSubnet:fd98::/64} MasterHA:{ElectionLeaseDuration:60 ElectionRenewDeadline:30 ElectionRetryPeriod:20} HybridOverlay:{Enabled:true RawClusterSubnets:10.132.0.0/14 ClusterSubnets:[] VXLANPort:9898}} F0423 10:13:46.495274 1 ovndbchecker.go:118] unable to turn on memory trimming for SB DB, stderr: 2021-04-23T10:13:46Z|00001|unixctl|WARN|failed to connect to /var/run/ovn/ovnsb_db.ctl ovn-appctl: cannot connect to "/var/run/ovn/ovnsb_db.ctl" (No such file or directory) , error: OVN command ''/usr/bin/ovn-appctl -t /var/run/ovn/ovnsb_db.ctl ovsdb-server/memory-trim-on-compaction on'' failed: exit status 1 ' reason: Error startedAt: '2021-04-23T10:13:46Z' name: ovn-dbchecker ready: true restartCount: 2 started: true state: running: startedAt: '2021-04-23T10:15:22Z' - containerID: cri-o://8d095a08e051c276111b805f783a8078335bdc6c585221fd7c28473c22a1c06c image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:282ac9111e6fb5cc4ffe8fb17ab7809a154166a2dc5b7b0bf81c0bd3b9b62ada imageID: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:282ac9111e6fb5cc4ffe8fb17ab7809a154166a2dc5b7b0bf81c0bd3b9b62ada lastState: terminated: containerID: cri-o://c2275456fde334e67a13719723487e2d3189054d55e0dbde3e8735398313a6cc exitCode: 1 finishedAt: '2021-04-26T08:19:01Z' message: 'bea036d-75b0-43cf-96b7-8e7f1b52354b-172.30.226.246\:1936 2021-04-26T08:19:01.767Z|01981|nbctl|INFO|Running command run --if-exists -- remove load_balancer 7bea036d-75b0-43cf-96b7-8e7f1b52354b vips "\"172.30.244.244:80\"" I0426 08:19:01.827455 1 loadbalancer.go:337] Reject ACL 
not found to remove for name: 7bea036d-75b0-43cf-96b7-8e7f1b52354b-172.30.244.244\:80 I0426 08:19:01.827535 1 loadbalancer.go:305] Unable to delete Reject ACL for load-balancer: 7bea036d-75b0-43cf-96b7-8e7f1b52354b, vip: 172.30.244.244:80. No entry in cache and error occurred while trying to find the ACL by name in OVN, error: reject ACL not found to remove for name: 7bea036d-75b0-43cf-96b7-8e7f1b52354b-172.30.244.244\:80 2021-04-26T08:19:01.830Z|01982|nbctl|INFO|Running command run --if-exists -- remove load_balancer 7bea036d-75b0-43cf-96b7-8e7f1b52354b vips "\"172.30.140.22:443\"" I0426 08:19:01.893546 1 loadbalancer.go:337] Reject ACL not found to remove for name: 7bea036d-75b0-43cf-96b7-8e7f1b52354b-172.30.140.22\:443 I0426 08:19:01.893620 1 loadbalancer.go:305] Unable to delete Reject ACL for load-balancer: 7bea036d-75b0-43cf-96b7-8e7f1b52354b, vip: 172.30.140.22:443. No entry in cache and error occurred while trying to find the ACL by name in OVN, error: reject ACL not found to remove for name: 7bea036d-75b0-43cf-96b7-8e7f1b52354b-172.30.140.22\:443 2021-04-26T08:19:01.896Z|01983|nbctl|INFO|Running command run --if-exists -- remove load_balancer 7bea036d-75b0-43cf-96b7-8e7f1b52354b vips "\"172.30.2.14:443\"" E0426 08:19:01.950738 1 leaderelection.go:325] error retrieving resource lock openshift-ovn-kubernetes/ovn-kubernetes-master: Get "https://api-int.openshift02.swqa.tst:6443/api/v1/namespaces/openshift-ovn-kubernetes/configmaps/ovn-kubernetes-master": context deadline exceeded I0426 08:19:01.950775 1 leaderelection.go:278] failed to renew lease openshift-ovn-kubernetes/ovn-kubernetes-master: timed out waiting for the condition I0426 08:19:01.950789 1 master.go:105] No longer leader; exiting ' reason: Error startedAt: '2021-04-26T08:17:09Z' name: ovnkube-master ready: true restartCount: 225 started: true state: running: startedAt: '2021-04-26T08:19:28Z' - containerID: cri-o://d7ecab81e9bd79255aa9a0dfb3d9688039323796419b6e2f39a51be115b62ec1 image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:282ac9111e6fb5cc4ffe8fb17ab7809a154166a2dc5b7b0bf81c0bd3b9b62ada imageID: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:282ac9111e6fb5cc4ffe8fb17ab7809a154166a2dc5b7b0bf81c0bd3b9b62ada lastState: terminated: containerID: cri-o://04178649cdf7c1e055a00a265df0db64697eee151b241dfaf94002e9deb1a3a4 exitCode: 0 finishedAt: '2021-04-23T10:13:46Z' reason: Completed startedAt: '2021-04-23T10:08:58Z' name: sbdb ready: true restartCount: 2 started: true state: running: startedAt: '2021-04-23T10:14:01Z' hostIP: 172.21.8.40 phase: Running podIP: 172.21.8.40 podIPs: - ip: 172.21.8.40 qosClass: Burstable startTime: '2021-04-23T10:03:58Z' ---------------------- must-gather.local.6934947225619097706]$ omg logs pod/ovnkube-master-d22xq -c ovnkube-master -n openshift-ovn-kubernetes | grep -i error 2021-04-26T08:38:38.775807889Z W0426 08:38:38.774559 1 reflector.go:436] k8s.io/client-go/informers/factory.go:134: watch of *v1.NetworkPolicy ended with: an error on the server ("unable to decode an event from the watch stream: http2: client connection lost") has prevented the request from succeeding 2021-04-26T08:38:38.775908763Z W0426 08:38:38.775891 1 reflector.go:436] github.com/openshift/ovn-kubernetes/go-controller/pkg/crd/egressip/v1/apis/informers/externalversions/factory.go:117: watch of *v1.EgressIP ended with: an error on the server ("unable to decode an event from the watch stream: http2: client connection lost") has prevented the request from succeeding 2021-04-26T08:38:38.775962320Z W0426 
08:38:38.775942 1 reflector.go:436] k8s.io/client-go/informers/factory.go:134: watch of *v1.Endpoints ended with: an error on the server ("unable to decode an event from the watch stream: http2: client connection lost") has prevented the request from succeeding 2021-04-26T08:38:38.775995478Z W0426 08:38:38.775986 1 reflector.go:436] k8s.io/client-go/informers/factory.go:134: watch of *v1.Pod ended with: an error on the server ("unable to decode an event from the watch stream: http2: client connection lost") has prevented the request from succeeding 2021-04-26T08:38:38.776134334Z W0426 08:38:38.776125 1 reflector.go:436] k8s.io/client-go/informers/factory.go:134: watch of *v1.Node ended with: an error on the server ("unable to decode an event from the watch stream: http2: client connection lost") has prevented the request from succeeding 2021-04-26T08:38:38.776261354Z W0426 08:38:38.776254 1 reflector.go:436] k8s.io/client-go/informers/factory.go:134: watch of *v1.Namespace ended with: an error on the server ("unable to decode an event from the watch stream: http2: client connection lost") has prevented the request from succeeding 2021-04-26T08:38:38.776372097Z W0426 08:38:38.776363 1 reflector.go:436] k8s.io/client-go/informers/factory.go:134: watch of *v1.Service ended with: an error on the server ("unable to decode an event from the watch stream: http2: client connection lost") has prevented the request from succeeding 2021-04-26T08:38:38.784545847Z W0426 08:38:38.778348 1 reflector.go:436] k8s.io/apiextensions-apiserver/pkg/client/informers/externalversions/factory.go:117: watch of *v1beta1.CustomResourceDefinition ended with: an error on the server ("unable to decode an event from the watch stream: http2: client connection lost") has prevented the request from succeeding 2021-04-26T08:40:44.363753311Z I0426 08:40:44.363736 1 loadbalancer.go:305] Unable to delete Reject ACL for load-balancer: 7bea036d-75b0-43cf-96b7-8e7f1b52354b, vip: 172.30.141.249:5443. No entry in cache and erro occurred while trying to find the ACL by name in OVN, error: reject ACL not found to remove for name: 7bea036d-75b0-43cf-96b7-8e7f1b52354b-172.30.141.249\:5443 2021-04-26T08:40:44.572660119Z I0426 08:40:44.570252 1 loadbalancer.go:305] Unable to delete Reject ACL for load-balancer: bf9f0abc-7b3b-4e69-9d5c-3b8639e3cf9e, vip: 172.21.8.46:31623. No entry in cache and error occurred while trying to find the ACL by name in OVN, error: reject ACL not found to remove for name: bf9f0abc-7b3b-4e69-9d5c-3b8639e3cf9e-172.21.8.46\:31623 2021-04-26T08:40:44.594511660Z I0426 08:40:44.594491 1 loadbalancer.go:305] Unable to delete Reject ACL for load-balancer: bf9f0abc-7b3b-4e69-9d5c-3b8639e3cf9e, vip: 172.21.8.46:32023. No entry in cache and error occurred while trying to find the ACL by name in OVN, error: reject ACL not found to remove for name: bf9f0abc-7b3b-4e69-9d5c-3b8639e3cf9e-172.21.8.46\:32023 2021-04-26T08:40:44.621579632Z I0426 08:40:44.621563 1 loadbalancer.go:305] Unable to delete Reject ACL for load-balancer: bf9f0abc-7b3b-4e69-9d5c-3b8639e3cf9e, vip: 172.21.8.46:32181. No entry in cache and error occurred while trying to find the ACL by name in OVN, error: reject ACL not found to remove for name: bf9f0abc-7b3b-4e69-9d5c-3b8639e3cf9e-172.21.8.46\:32181 2021-04-26T08:40:44.647520360Z I0426 08:40:44.647498 1 loadbalancer.go:305] Unable to delete Reject ACL for load-balancer: bf9f0abc-7b3b-4e69-9d5c-3b8639e3cf9e, vip: 172.21.8.46:32742. 
No entry in cache and error occurred while trying to find the ACL by name in OVN, error: reject ACL not found to remove for name: bf9f0abc-7b3b-4e69-9d5c-3b8639e3cf9e-172.21.8.46\:32742 2021-04-26T08:40:44.682844927Z I0426 08:40:44.682835 1 loadbalancer.go:305] Unable to delete Reject ACL for load-balancer: bf9f0abc-7b3b-4e69-9d5c-3b8639e3cf9e, vip: 172.21.8.46:30386. No entry in cache and error occurred while trying to find the ACL by name in OVN, error: reject ACL not found to remove for name: bf9f0abc-7b3b-4e69-9d5c-3b8639e3cf9e-172.21.8.46\:30386 2021-04-26T08:40:44.705004020Z I0426 08:40:44.704994
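Assuming the same o-must-gather (omg) workflow used above, a quick way to gauge how dominant the repeated "Reject ACL not found" warnings are is to count them instead of paging through them; the pod name is the one from this report:

```
omg logs pod/ovnkube-master-d22xq -c ovnkube-master -n openshift-ovn-kubernetes | grep -c "Reject ACL not found"
```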
All the masters are bouncing for: 2021-04-26T08:19:01.950792531Z E0426 08:19:01.950738 1 leaderelection.go:325] error retrieving resource lock openshift-ovn-kubernetes/ovn-kubernetes-master: Get "https://api-int.openshift02.swqa.tst:6443/api/v1/namespaces/openshift-ovn-kubernetes/configmaps/ovn-kubernetes-master": context deadline exceeded 2021-04-26T08:19:01.950792531Z I0426 08:19:01.950775 1 leaderelection.go:278] failed to renew lease openshift-ovn-kubernetes/ovn-kubernetes-master: timed out waiting for the condition 2021-04-26T08:19:01.950825291Z I0426 08:19:01.950789 1 master.go:105] No longer leader; exiting At some point, they time out their request to the apiserver and they shut down. This is what I saw in the logs_previous of all the masters. All the api servers have a considerable amount of restarts: kube-apiserver-openshift02-tvdck-master-0 5/5 Running 94 1d kube-apiserver-openshift02-tvdck-master-1 5/5 Running 90 1d kube-apiserver-openshift02-tvdck-master-2 5/5 Running 111 1d Two logs says it's waiting for 6443 to be available again, so we don't get the real reason for restarting. timed out waiting for port :6443 to be released The one on master-1 starts after 08:19:01, and it shuts down again: 2021-04-26T08:43:48.312615310Z I0426 08:43:48.312548 213 sdn_readyz_wait.go:143] reached openshift-oauth-apiserver via SDN after 88 milliseconds 2021-04-26T08:43:48.313384403Z E0426 08:43:48.313334 213 repair.go:301] the cluster IP 172.30.126.26 may have leaked: flagging for later clean up 2021-04-26T08:43:48.314454080Z I0426 08:43:48.314366 213 httplog.go:94] "HTTP" verb="GET" URI="/apis/flowcontrol.apiserver.k8s.io/v1beta1/flowschemas/catch-all" latency="3.638461ms" userAgent="kube-apiserver/v1.20.0+bafe72f (linux/amd64) kubernetes/bafe72f" srcIP="[::1]:49520" resp=200 2021-04-26T08:43:48.325279367Z I0426 08:43:48.324934 213 apf_controller.go:266] Running API Priority and Fairness config worker 2021-04-26T08:43:48.325306816Z I0426 08:43:48.325264 213 cache.go:39] Caches are synced for AvailableConditionController controller 2021-04-26T08:43:48.333685772Z I0426 08:43:48.333609 213 shared_informer.go:247] Caches are synced for cluster_authentication_trust_controller 2021-04-26T08:43:48.339556432Z I0426 08:43:48.339485 213 shared_informer.go:247] Caches are synced for crd-autoregister 2021-04-26T08:43:48.362726075Z I0426 08:43:48.362616 213 cache.go:39] Caches are synced for autoregister controller 2021-04-26T08:43:48.368098677Z I0426 08:43:48.368027 213 httplog.go:94] "HTTP" verb="GET" URI="/apis/flowcontrol.apiserver.k8s.io/v1beta1/prioritylevelconfigurations/catch-all" latency="46.999798ms" userAgent="kube-apiserver/v1.20.0+bafe72f (linux/amd64) kubernetes/bafe72f" srcIP="[::1]:49520" resp=200 2021-04-26T08:43:48.368119897Z I0426 08:43:48.368093 213 cache.go:39] Caches are synced for APIServiceRegistrationController controller 2021-04-26T08:43:48.389546269Z I0426 08:43:48.389461 213 httplog.go:94] "HTTP" verb="GET" URI="/api/v1/services" latency="69.749084ms" userAgent="kube-apiserver/v1.20.0+bafe72f (linux/amd64) kubernetes/bafe72f" srcIP="[::1]:49520" resp=200 2021-04-26T08:43:48.394990427Z I0426 08:43:48.393499 213 httplog.go:94] "HTTP" verb="GET" URI="/apis/flowcontrol.apiserver.k8s.io/v1beta1/prioritylevelconfigurations/exempt" latency="13.328424ms" userAgent="kube-apiserver/v1.20.0+bafe72f (linux/amd64) kubernetes/bafe72f" srcIP="[::1]:49520" resp=200 2021-04-26T08:43:48.426738531Z I0426 08:43:48.426372 213 shared_informer.go:247] Caches are synced for node_authorizer 
2021-04-26T08:43:48.447148403Z I0426 08:43:48.442058 213 aggregator.go:231] Updating OpenAPI spec because v1.route.openshift.io is updated 2021-04-26T08:43:48.484563943Z I0426 08:43:48.482520 213 healthz.go:266] poststarthook/rbac/bootstrap-roles,poststarthook/scheduling/bootstrap-system-priority-classes,shutdown check failed: readyz 2021-04-26T08:43:48.484563943Z [-]poststarthook/rbac/bootstrap-roles failed: not finished 2021-04-26T08:43:48.484563943Z [-]poststarthook/scheduling/bootstrap-system-priority-classes failed: not finished 2021-04-26T08:43:48.484563943Z [-]shutdown failed: process is shutting down and then it shuts down. If I check the apiserver on master2 at the time of the ovnkube-master error: apiserver on msater 2: 2021-04-26T08:19:01.071457028Z I0426 08:19:01.065629 102 balancer_conn_wrappers.go:78] pickfirstBalancer: HandleSubConnStateChange: 0xc06e05fce0, {READY <nil>} 2021-04-26T08:19:01.071457028Z I0426 08:19:01.066840 102 controlbuf.go:508] transport: loopyWriter.run returning. connection error: desc = "transport is closing" 2021-04-26T08:19:01.228113812Z W0426 08:19:01.228041 102 clientconn.go:1223] grpc: addrConn.createTransport failed to connect to {https://172.21.8.41:2379 <nil> 0 <nil>}. Err :connection error: desc = "transport: authentication handshake failed: context deadline exceeded". Reconnecting... 2021-04-26T08:19:01.325114453Z W0426 08:19:01.325046 102 clientconn.go:1223] grpc: addrConn.createTransport failed to connect to {https://172.21.8.41:2379 <nil> 0 <nil>}. Err :connection error: desc = "transport: authentication handshake failed: context deadline exceeded". Reconnecting... 2021-04-26T08:19:01.796223110Z W0426 08:19:01.796146 102 clientconn.go:1223] grpc: addrConn.createTransport failed to connect to {https://172.21.8.41:2379 <nil> 0 <nil>}. Err :connection error: desc = "transport: authentication handshake failed: context deadline exceeded". Reconnecting... 2021-04-26T08:19:01.804741475Z I0426 08:19:01.804675 102 trace.go:205] Trace[1894265952]: "cacher list" type:*storage.CSIDriver (26-Apr-2021 08:18:58.804) (total time: 3000ms): 2021-04-26T08:19:01.804741475Z Trace[1894265952]: [3.000390326s] [3.000390326s] END 2021-04-26T08:19:01.806269576Z I0426 08:19:01.804929 102 trace.go:205] Trace[337787948]: "List" url:/apis/storage.k8s.io/v1/csidrivers,user-agent:kubelet.exe/v1.20.0 (windows/amd64) kubernetes/cac2421,client:172.21.8.42 (26-Apr-2021 08:18:58.804) (total time: 3000ms): 2021-04-26T08:19:01.806269576Z Trace[337787948]: [3.000668268s] [3.000668268s] END 2021-04-26T08:19:01.907128509Z W0426 08:19:01.907057 102 clientconn.go:1223] grpc: addrConn.createTransport failed to connect to {https://172.21.8.41:2379 <nil> 0 <nil>}. Err :connection error: desc = "transport: authentication handshake failed: context deadline exceeded". Reconnecting... 2021-04-26T08:19:01.935561713Z W0426 08:19:01.933845 102 clientconn.go:1223] grpc: addrConn.createTransport failed to connect to {https://172.21.8.41:2379 <nil> 0 <nil>}. Err :connection error: desc = "transport: authentication handshake failed: context deadline exceeded". Reconnecting... 
2021-04-26T08:19:02.047888799Z E0426 08:19:02.047825 102 memcache.go:101] couldn't get resource list for project.openshift.io/v1: the server is currently unable to handle the request 2021-04-26T08:19:02.154264204Z I0426 08:19:02.153675 102 trace.go:205] Trace[1913750063]: "Create" url:/api/v1/namespaces/openshift-marketplace/serviceaccounts,user-agent:catalog/v0.0.0 (linux/amd64) kubernetes/$Format,client:172.21.8.42 (26-Apr-2021 08:19:00.542) (total time: 1610ms): 2021-04-26T08:19:02.154264204Z Trace[1913750063]: [1.610705492s] [1.610705492s] END 2021-04-26T08:19:02.236812346Z I0426 08:19:02.236727 102 trace.go:205] Trace[1894080860]: "Create" url:/api/v1/namespaces/openshift-marketplace/serviceaccounts,user-agent:catalog/v0.0.0 (linux/amd64) kubernetes/$Format,client:172.21.8.42 (26-Apr-2021 08:18:58.292) (total time: 3943ms): 2021-04-26T08:19:02.236812346Z Trace[1894080860]: [3.943793009s] [3.943793009s] END 2021-04-26T08:19:02.417111205Z I0426 08:19:02.417040 102 controlbuf.go:508] transport: loopyWriter.run returning. connection error: desc = "transport is closing" 2021-04-26T08:19:02.574512715Z I0426 08:19:02.574389 102 client.go:360] parsed scheme: "passthrough" 2021-04-26T08:19:02.574512715Z I0426 08:19:02.574423 102 passthrough.go:48] ccResolverWrapper: sending update to cc: {[{https://localhost:2379 <nil> 0 <nil>}] <nil> <nil>} 2021-04-26T08:19:02.574512715Z I0426 08:19:02.574438 102 clientconn.go:948] ClientConn switching balancer to "pick_first" 2021-04-26T08:19:02.574792842Z I0426 08:19:02.574753 102 balancer_conn_wrappers.go:78] pickfirstBalancer: HandleSubConnStat Going back to etcd: On etcd on master 0 (what's backing https://172.21.8.41:2379 I think) I am not seeing anything related to the request from master, but there are some requests taking more than one minute: 2021-04-26T08:18:01.803648356Z 2021-04-26 08:18:01.803605 W | etcdserver: request "header:<ID:14026593988871619459 username:\"etcd\" auth_revision:1 > txn:<compare:<target:MOD key:\"/kubernetes.io/pods/openshift-kube-controller-manager/kube-controller-manager-openshift02-tvdck-master-1\" mod_revision:49691263 > success:<request_put:<key:\"/kubernetes.io/pods/openshift-kube-controller-manager/kube-controller-manager-openshift02-tvdck-master-1\" value_size:18653 >> failure:<>>" with result "size:20" took too long (141.14734ms) to execute 2021-04-26T08:18:01.936449854Z 2021-04-26 08:18:01.936265 W | etcdserver: request "header:<ID:17666346912641502105 username:\"etcd\" auth_revision:1 > txn:<compare:<target:MOD key:\"/kubernetes.io/events/openshift-marketplace/redhat-marketplace-gg9q5.16795a87fde829f9\" mod_revision:0 > success:<request_put:<key:\"/kubernetes.io/events/openshift-marketplace/redhat-marketplace-gg9q5.16795a87fde829f9\" value_size:741 lease:8442974875786721941 >> failure:<>>" with result "size:20" took too long (132.567234ms) to execute 2021-04-26T08:19:08.284610296Z 2021-04-26 08:19:08.278299 W | etcdserver: read-only range request "key:\"/kubernetes.io/health\" " with result "error:context deadline exceeded" took too long (1m3.643745769s) to execute 2021-04-26T08:19:08.383984071Z 2021-04-26 08:19:08.383944 W | etcdserver: read-only range request "key:\"/kubernetes.io/health\" " with result "error:context deadline exceeded" took too long (1m4.899034204s) to execute 2021-04-26T08:19:22.890951797Z 2021-04-26 08:19:22.890458 W | etcdserver: read-only range request "key:\"/kubernetes.io/configmaps/openshift-service-ca-operator/service-ca-operator-lock\" " with result "error:context canceled" took too 
long (1m19.453799065s) to execute 2021-04-26T08:19:22.953455395Z 2021-04-26 08:19:22.937646 W | etcdserver: read-only range request "key:\"/kubernetes.io/configmaps/openshift-kube-controller-manager/cluster-policy-controller\" " with result "error:context canceled" took too long (1m17.911003707s) to execute 2021-04-26T08:19:22.953455395Z WARNING: 2021/04/26 08:19:22 grpc: Server.processUnaryRPC failed to write status: connection error: desc = "transport is closing" Given the considerations above, I am moving this to apiserver, considering it could slip to etcd (but I am too unfamiliar with the interaction between them).
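A rough way to quantify how widespread the slow etcd requests are, assuming the standard must-gather directory layout (adjust the glob to the actual dump):

```
# Count "took too long" warnings per etcd member
for f in namespaces/openshift-etcd/pods/etcd-*/etcd/etcd/logs/current.log; do
  echo "== $f =="
  grep -c "took too long" "$f"
done
```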
Etcd shows lots of errors around 08:19 too, on all three nodes: 2021-04-26T08:18:01.803648356Z 2021-04-26 08:18:01.803605 W | etcdserver: request "header:<ID:14026593988871619459 username:\"etcd\" auth_revision:1 > txn:<compare:<target:MOD key:\"/kubernetes.io/pods/openshift-kube-controller-manager/kube-controller-manager-openshift02-tvdck-master-1\" mod_revision:49691263 > success:<request_put:<key:\"/kubernetes.io/pods/openshift-kube-controller-manager/kube-controller-manager-openshift02-tvdck-master-1\" value_size:18653 >> failure:<>>" with result "size:20" took too long (141.14734ms) to execute 2021-04-26T08:18:01.936449854Z 2021-04-26 08:18:01.936265 W | etcdserver: request "header:<ID:17666346912641502105 username:\"etcd\" auth_revision:1 > txn:<compare:<target:MOD key:\"/kubernetes.io/events/openshift-marketplace/redhat-marketplace-gg9q5.16795a87fde829f9\" mod_revision:0 > success:<request_put:<key:\"/kubernetes.io/events/openshift-marketplace/redhat-marketplace-gg9q5.16795a87fde829f9\" value_size:741 lease:8442974875786721941 >> failure:<>>" with result "size:20" took too long (132.567234ms) to execute 2021-04-26T08:19:08.284610296Z 2021-04-26 08:19:08.278299 W | etcdserver: read-only range request "key:\"/kubernetes.io/health\" " with result "error:context deadline exceeded" took too long (1m3.643745769s) to execute 2021-04-26T08:19:08.383984071Z 2021-04-26 08:19:08.383944 W | etcdserver: read-only range request "key:\"/kubernetes.io/health\" " with result "error:context deadline exceeded" took too long (1m4.899034204s) to execute 2021-04-26T08:19:22.890951797Z 2021-04-26 08:19:22.890458 W | etcdserver: read-only range request "key:\"/kubernetes.io/configmaps/openshift-service-ca-operator/service-ca-operator-lock\" " with result "error:context canceled" took too long (1m19.453799065s) to execute 2021-04-26T08:19:22.953455395Z 2021-04-26 08:19:22.937646 W | etcdserver: read-only range request "key:\"/kubernetes.io/configmaps/openshift-kube-controller-manager/cluster-policy-controller\" " with result "error:context canceled" took too long (1m17.911003707s) to execute 2021-04-26T08:19:22.953455395Z WARNING: 2021/04/26 08:19:22 grpc: Server.processUnaryRPC failed to write status: connection error: desc = "transport is closing" 2021-04-26T08:1 2021-04-26T08:18:59.278253911Z 2021-04-26 08:18:59.277614 W | rafthttp: health check for peer 1f84f55eb037bb5f could not connect: read tcp 172.21.8.42:51728->172.21.8.41:2380: i/o timeout 2021-04-26T08:19:01.348427523Z 2021-04-26 08:19:01.348390 I | embed: rejected connection from "172.21.8.42:38846" (error "EOF", ServerName "") 2021-04-26T08:19:02.334501491Z 2021-04-26 08:19:02.333472 W | rafthttp: health check for peer 1f84f55eb037bb5f could not connect: read tcp 172.21.8.40:48696->172.21.8.41:2380: i/o timeout 2021-04-26T08:19:02.949160993Z 2021-04-26 08:19:02.949120 I | embed: rejected connection from "172.21.8.40:43632" (error "EOF", ServerName "") Change is high that the network went down during this period. Another hint for that, kube-apiserver also experienced http write timeouts: 2021-04-26T08:19:40.431773236Z I0426 08:19:40.181292 37 controlbuf.go:508] transport: loopyWriter.run returning. 
connection error: desc = "transport is closing"
2021-04-26T08:19:40.783304621Z I0426 08:19:40.781614 37 cacher.go:1238] Forcing watcher close due to unresponsiveness: *core.Service
2021-04-26T08:19:42.785542018Z E0426 08:19:42.785512 37 writers.go:107] apiserver was unable to write a JSON response: http: Handler timeout

Four components independently hitting network request timeouts is suspicious. Sending this back to networking.
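To back up the "four components" observation, a sketch that checks which namespaces logged the characteristic connection-loss and timeout messages, again assuming the standard must-gather namespaces/<ns>/pods layout:

```
# Namespaces whose pod logs contain the connection-loss / timeout signatures
grep -rl "http2: client connection lost" namespaces/*/pods | cut -d/ -f2 | sort | uniq -c
grep -rl "context deadline exceeded" namespaces/*/pods | cut -d/ -f2 | sort | uniq -c
```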
I'm the customer impacted by this behavior. I think this is not an API issue but a network issue specifically related to OVN: if you look at the must-gather you will see core dumps of the OVN pod on all three master nodes, like the one also reported in https://bugzilla.redhat.com/show_bug.cgi?id=1943413. During these phases the cluster loses API reachability, but you are also unable to log in, gather logs via must-gather, or create containers, because pods get stuck in ContainerCreating status without receiving an IP address. This makes the cluster unstable and unusable, because you don't know when the pods will crash or for how long the cluster will be unreachable. Also note that after the upgrade the cluster's CPU utilization went up to a constant 25%-30%, without any workload on it (all CI/CD development is still on the old 4.5 cluster).
Took a look into the logs. The main issue is ovnkube-master and other pods are having problems contacting the api server and continuously restart. If we look at kube-apiserver there's a lot of transport is closing, and if we check etcd there are some quorum guard failures. It looks to me like there is resource contention (either CPU, memory, or both). I see from the sosreport on one of the nodes the RAM is somewhat low: [trozet@fedora memory]$ cat free total used free shared buff/cache available Mem: 16416956 14433032 167544 379692 1816380 1234460 Swap: 0 0 0 Additionally I see in dmesg and the core must gather there are core dumps on northd and journald: [566489.699222] systemd[1]: systemd-journal-flush.service: Consumed 0 CPU time [566489.699381] systemd[1]: Stopping Flush Journal to Persistent Storage... [566489.699519] systemd[1]: Stopped Journal Service. [566501.223604] systemd-journald[2072112]: File /var/log/journal/db59b16c433b4f088367d70770e0a529/system.journal corrupted or uncleanly shut down, renaming and replacing. [566509.127309] systemd[1]: Started Journal Service. [566520.418976] systemd-coredump[2072055]: Process 1652223 (systemd-journal) of user 0 dumped core. [566520.419145] systemd-coredump[2072055]: Coredump diverted to /var/lib/systemd/coredump/core.systemd-journal.0.8e76dd4e0f6b41e5acc7184ebb6282db.1652223.1619411248000000.lz4 [566520.419424] systemd-coredump[2072055]: Stack trace of thread 1652223: [566520.419541] systemd-coredump[2072055]: #0 0x00007f81bb6194df journal_file_append_entry (libsystemd-shared-239.so) [566520.419723] systemd-coredump[2072055]: #1 0x0000563706651d1c dispatch_message_real (systemd-journald) [566520.419888] systemd-coredump[2072055]: #2 0x000056370665739d stdout_stream_log (systemd-journald) [566520.420046] systemd-coredump[2072055]: #3 0x0000563706657654 stdout_stream_line (systemd-journald) [566520.420207] systemd-coredump[2072055]: #4 0x0000563706657def stdout_stream_scan (systemd-journald) [566520.420383] systemd-coredump[2072055]: #5 0x000056370665819a stdout_stream_process (systemd-journald) [566520.420548] systemd-coredump[2072055]: #6 0x00007f81bb609b40 source_dispatch (libsystemd-shared-239.so) [566527.653260] printk: systemd-coredum: 5 output lines suppressed due to ratelimiting [566627.393710] updating oom_score_adj for 2072428 (kubelet) from -999 to -999 because it shares mm with 2147 (kubelet). Report if this is unexpected. openshift02-tvdck-master-0_core_dump: total 48064 -rw-r--r--. 1 trozet trozet 3381894 Apr 23 06:04 core.systemd-journal.0.f78fa98b65584355b04b61fdb11da53e.859.1619129096000000.lz4 -rw-r--r--. 1 trozet trozet 45831400 Apr 23 06:04 core.ovn-northd.0.f78fa98b65584355b04b61fdb11da53e.4029.1619103625000000.lz4 The northd core might be the same as: https://bugzilla.redhat.com/show_bug.cgi?id=1943413#c59 but I will file another OVN bug so the OVN team can take a look at it. The etcd, kapiserver, ovnkube-master pods are all host networked and do not use OVN networking to communicate. If there is a problem with networking between them it's most likely host related. 
If we look at the sosreport we can see a bunch of problems in the networking stack: Tcp: 8283225 active connection openings 6705267 passive connection openings 1281379 failed connection attempts 2514925 connection resets received 2119 connections established 2217546216 segments received 2750481221 segments sent out 2753477 segments retransmitted 397 bad segments received 7841840 resets sent TcpExt: 1907 SYN cookies sent 3716 SYN cookies received 4204 invalid SYN cookies received 475 resets received for embryonic SYN_RECV sockets 151197 packets pruned from receive queue because of socket buffer overrun 2234 ICMP packets dropped because they were out-of-window 2093438 TCP sockets finished time wait in fast timer 2 time wait sockets recycled by time stamp 582963 packets rejected in established connections because of timestamp 58030999 delayed acks sent 15928 delayed acks further delayed because of locked socket Quick ack mode was activated 3585102 times 1836721 times the listen queue of a socket overflowed 1836732 SYNs to LISTEN sockets dropped 983205214 packet headers predicted 133627178 acknowledgments not containing data payload received 755888118 predicted acknowledgments TCPSackRecovery: 15867 Detected reordering 62604 times using SACK Detected reordering 2412 times using time stamp 5236 congestion windows fully recovered without slow start 1457 congestion windows partially recovered using Hoe heuristic TCPDSACKUndo: 8232 25237 congestion windows recovered without slow start after partial ack TCPLostRetransmit: 531026 TCPSackFailures: 12 2 timeouts in loss state 25130 fast retransmits 1388 retransmits in slow start TCPTimeouts: 186092 TCPLossProbes: 2071776 TCPLossProbeRecovery: 6100 TCPSackRecoveryFail: 4 118028 packets collapsed in receive queue due to low socket buffer ^we can see above lots of low socket buffer, memory, and connections dropped. This also indicates to me that the host cannot keep up. Now if we focus on memory, we can see who is taking up most of the RAM during the sosreport: %Mem RSS Process 3.3 3555920 ovsdb-server -vconsole:info -vfile:off 4.2 560068 ovs-vswitchd unix:/var/run/openvswitch/db.sock -vconsole:emer 4.5 2028772 /usr/bin/ruby /usr/local/bin/fluentd --suppress-config-dump 5.0 - - 10.3 - - 10.4 5316592 ovsdb-server -vconsole:info -vfile:off 10.8 3831100 kubelet --config=/etc/kubernetes/kubelet.conf --bootstrap-kubeconfig=/etc/kubernetes/kubeconfig 11.9 10224352 etcd --initial-advertise-peer-urls=https://172.21.8.41:2380 --cert-file=/etc/kubernetes/static-pod-certs/secrets/etcd-all-serving/etcd-serving-openshift02-tvdck-master-0.crt 14.3 - - 14.3 176656 ovn-northd --no-chdir -vconsole:info 19.0 - - 19.0 616092 journalctl --no-pager --unit 19.9 - - 25.6 3670068 kube-apiserver --openshift-config=/etc/kubernetes/static-pod-resources/configmaps/config/config.yaml --advertise-address=172.21.8.41 28.3 - - 28.3 6981276 ovn-controller unix:/var/run/openvswitch/db.sock -vfile:off 29.3 1277928 /usr/libexec/platform-python -s /usr/sbin/sosreport 35.7 - - 35.7 616336 journalctl --no-pager --catalog We can see that ovn-controller is using 28% of the system's RAM. This is way too much at 7GB. 
Now we look at ovn-controller: 2021-04-26T08:40:59.094134419Z 2021-04-26T08:40:59Z|318134|ofctrl|INFO|OpenFlow error: OFPT_ERROR (xid=0x2320): OFPBRC_BAD_VERSION 2021-04-26T08:40:59.094134419Z ***decode error: OFPBRC_BAD_TYPE*** 2021-04-26T08:40:59.094134419Z 00000000 ff ff 00 10 00 00 23 20-00 22 00 02 00 00 00 f1 |......# ."......| 2021-04-26T08:40:59.094164758Z 2021-04-26T08:40:59Z|318135|ofctrl|INFO|OpenFlow error: OFPT_ERROR (xid=0x2320): OFPBRC_BAD_VERSION 2021-04-26T08:40:59.094164758Z ***decode error: OFPBRC_BAD_TYPE*** 2021-04-26T08:40:59.094164758Z 00000000 ff ff 00 10 00 00 23 20-00 22 00 02 00 00 00 f1 |......# ."......| 2021-04-26T08:40:59.094199271Z 2021-04-26T08:40:59Z|318136|ofctrl|INFO|OpenFlow error: OFPT_ERROR (xid=0x2320): OFPBRC_BAD_VERSION 2021-04-26T08:40:59.094199271Z ***decode error: OFPBRC_BAD_TYPE*** 2021-04-26T08:40:59.094199271Z 00000000 ff ff 00 10 00 00 23 20-00 22 00 02 00 00 00 f1 |......# ."......| OVS complains in another log: 2021-04-25T18:13:51.653255903Z 2021-04-25T18:13:51.643Z|57494|connmgr|INFO|br-int<->unix#979216: sending OFPBMC_BAD_LEN error reply to OFPT_FLOW_MOD message 2021-04-25T18:13:51.653255903Z 2021-04-25T18:13:51.643Z|57495|vconn_stream|ERR|received too-short ofp_header (0 bytes) 2021-04-25T18:13:51.653255903Z 2021-04-25T18:13:51.643Z|57496|rconn|WARN|br-int<->unix#979216: connection dropped (Protocol error) 2021-04-25T18:13:51.653255903Z 2021-04-25T18:13:51.643Z|57497|connmgr|INFO|br-int<->unix#979216: 4005 flow_mods in the last 17 s (344 adds, 3642 deletes, 19 modifications) 2021-04-25T18:13:53.003691314Z 2021-04-25T18:13:53.001Z|57498|ofp_msgs|WARN|unknown OpenFlow message (version 255, type 255) 2021-04-25T18:13:53.003691314Z 2021-04-25T18:13:53.001Z|57499|vconn|ERR|unix#1015512: received OpenFlow version 0xff != expected 06 Looks like there may be some corruption/memory leak with ovn-controller here. It would be good if we could generate a core of ovn-controller in this state: Can we try to generate a core for this ovn-controller on openshift02-tvdck-master-0 doing "kill -11 <pid>"? I believe that will generate a core. I'll go ahead and file a bug on this too. 
Also in kube-apiserver I see: 2021-04-26T09:27:55.649131530Z E0426 09:27:55.649069 37 fieldmanager.go:186] [SHOULD NOT HAPPEN] failed to update managedFields for /, Kind=: failed to convert new object (batch/v1beta1, Kind=CronJob) to smd typed: .spec.jobTemplate.spec.template.spec.containers[name="indexmanagement"].env: duplicate entries for key [name="POLICY_MAPPING"] 2021-04-26T09:27:56.251981856Z W0426 09:27:56.251905 37 admission.go:98] ClusterQuotaAdmission received non-meta object *unstructured.Unstructured 2021-04-26T09:27:56.265460064Z W0426 09:27:56.264932 37 admission.go:98] ClusterQuotaAdmission received non-meta object *unstructured.Unstructured 2021-04-26T09:27:56.281971228Z W0426 09:27:56.281886 37 admission.go:98] ClusterQuotaAdmission received non-meta object *unstructured.Unstructured 2021-04-26T09:27:57.008885981Z W0426 09:27:57.008761 37 admission.go:98] ClusterQuotaAdmission received non-meta object *unstructured.Unstructured 2021-04-26T09:27:57.027293039Z W0426 09:27:57.027168 37 admission.go:98] ClusterQuotaAdmission received non-meta object *unstructured.Unstructured 2021-04-26T09:27:57.044257584Z W0426 09:27:57.044177 37 admission.go:98] ClusterQuotaAdmission received non-meta object *unstructured.Unstructured 2021-04-26T09:27:57.174630661Z E0426 09:27:57.174559 37 fieldmanager.go:186] [SHOULD NOT HAPPEN] failed to update managedFields for /, Kind=: failed to convert new object (batch/v1beta1, Kind=CronJob) to smd typed: .spec.jobTemplate.spec.template.spec.containers[name="indexmanagement"].env: duplicate entries for key [name="POLICY_MAPPING"] @sttts not sure if this is expected or not, so just wanted to highlight it in case.
Minor correction: the parsing I had on the memory was not quite accurate. ovn-controller is actually consuming 41% memory with 6.7GB RSS: [trozet@fedora process]$ cat ps_auxwwwm | egrep 'USER|ovn-controller' USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND root 3824 0.0 0.0 143732 2324 ? - Apr19 0:11 /usr/libexec/crio/conmon -b /var/run/containers/storage/overlay-containers/6d11511d3356979d09e19079b3a52f7f1c58c8353945337f30aa80c21f37b03f/userdata -c 6d11511d3356979d09e19079b3a52f7f1c58c8353945337f30aa80c21f37b03f --exit-dir /var/run/crio/exits -l /var/log/pods/openshift-ovn-kubernetes_ovnkube-node-g8n2c_27d64f48-a34c-4cfb-acf2-1e6ac0c9605e/ovn-controller/0.log --log-level info -n k8s_ovn-controller_ovnkube-node-g8n2c_openshift-ovn-kubernetes_27d64f48-a34c-4cfb-acf2-1e6ac0c9605e_0 -P /var/run/containers/storage/overlay-containers/6d11511d3356979d09e19079b3a52f7f1c58c8353945337f30aa80c21f37b03f/userdata/conmon-pidfile -p /var/run/containers/storage/overlay-containers/6d11511d3356979d09e19079b3a52f7f1c58c8353945337f30aa80c21f37b03f/userdata/pidfile --persist-dir /var/lib/containers/storage/overlay-containers/6d11511d3356979d09e19079b3a52f7f1c58c8353945337f30aa80c21f37b03f/userdata -r /usr/bin/runc --runtime-arg --root=/run/runc --socket-dir-path /var/run/crio -u 6d11511d3356979d09e19079b3a52f7f1c58c8353945337f30aa80c21f37b03f -s root 3983 28.3 40.9 6981276 6719588 ? - Apr19 2819:43 ovn-controller unix:/var/run/openvswitch/db.sock -vfile:off --no-chdir --pidfile=/var/run/ovn/ovn-controller.pid -p /ovn-cert/tls.key -c /ovn-cert/tls.crt -C /ovn-ca/ca-bundle.crt -vconsole:info
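For reference, a small sketch of ranking processes by memory straight from the sosreport; sos_commands/process/ps_auxwwwm is the usual location for this file, but the exact path can vary between sos versions:

```
# Top 5 processes by %MEM (column 4 of ps aux output): PID, %MEM, RSS, command
sort -k4 -rn sos_commands/process/ps_auxwwwm | head -5 \
  | awk '{ printf "%-8s %-6s %-10s %s\n", $2, $4, $6, $11 }'
```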
@trozet
(In reply to Tim Rozet from comment #8)
> Can we try to generate a core for this ovn-controller on
> openshift02-tvdck-master-0 doing "kill -11 <pid>"? I believe that will
> generate a core. I'll go ahead and file a bug on this too.

If you need that, I can try to do it on the node and then upload the dump here or to the case. Let me know how to proceed.

Gianluca
Hi Gianluca,
It would be great if you could provide that. Also, after you kill it and it restarts, it would be good to know:
1. Does the cluster become more stable?
2. Does ovn-controller now consume less RAM?
3. If 2 is true, is it slowly increasing?
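A sketch of the steps being requested, run on the affected node (openshift02-tvdck-master-0 in this case); whether a usable core is written depends on the node's systemd-coredump configuration, and the coredump directory below is the one already seen in the dmesg excerpt earlier in this bug:

```
# Find the ovn-controller PID and send SIGSEGV so systemd-coredump captures a core
pid=$(pgrep -f '^ovn-controller' | head -1)
sudo kill -11 "$pid"

# systemd-coredump normally diverts cores here (see the earlier dmesg excerpt)
ls -lh /var/lib/systemd/coredump/ | grep -i ovn-controller
```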
Gianluca, We suspect the issue with ovn-controller is coming from egress firewall. It looks like you have some egress firewalls created. Can you please try removing those and let us know if it helps? In the meantime we will evaluate a proper fix. Thanks.
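A minimal sketch of how to list and (after backing them up) remove the EgressFirewall objects; the example assumes the objects are named "default", as the one pasted later in this report is, and <namespace> is a placeholder for whatever the first command returns:

```
# List, back up, then delete the EgressFirewall objects
oc get egressfirewall -A
oc get egressfirewall -A -o yaml > egressfirewall-backup.yaml
oc delete egressfirewall default -n <namespace>
```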
Hi Tim,
Thank you for your update. We have deleted them and are now restarting everything to see how things behave. It could be related or a completely separate issue, but the kube-apiserver pods have over 300 restarts and the kube-apiserver container keeps crashing on at least 2 masters, which most likely explains the performance issues on the cluster. Not sure if OVN is causing that or if it is a separate problem, but we are trying to check it.
(In reply to Tim Rozet from comment #11) > Hi Gianluca, > That would be great if you can provide that. Also after you kill it and it > restarts, would be good to know: > 1. Does the cluster becomes more stable? > 2. Does ovn-controller now consume less RAM? > 3. If 2 is true, is it slowly increasing? Hi Tim, I was able to do the test, here below the command issued: =~=~=~=~=~=~=~=~=~=~=~= PuTTY log 2021.04.29 15:53:06 =~=~=~=~=~=~=~=~=~=~=~= ps -aux | egrep 'USER|ovn-controller' USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND core 259048 0.0 0.0 12792 1112 pts/0 S+ 13:53 0:00 grep -E --color=auto USER|ovn-controller root 3523610 0.0 0.0 143732 1916 ? Ssl Apr26 0:03 /usr/libexec/crio/conmon -b /var/run/containers/storage/overlay-containers/9a472c41ff47fe4c98f049f87c21155aef0e5a107a779c8ddbdbfc89d973d1e2/userdata -c 9a472c41ff47fe4c98f049f87c21155aef0e5a107a779c8ddbdbfc89d973d1e2 --exit-dir /var/run/crio/exits -l /var/log/pods/openshift-ovn-kubernetes_ovnkube-node-g8n2c_27d64f48-a34c-4cfb-acf2-1e6ac0c9605e/ovn-controller/1.log --log-level info -n k8s_ovn-controller_ovnkube-node-g8n2c_openshift-ovn-kubernetes_27d64f48-a34c-4cfb-acf2-1e6ac0c9605e_1 -P /var/run/containers/storage/overlay-containers/9a472c41ff47fe4c98f049f87c21155aef0e5a107a779c8ddbdbfc89d973d1e2/userdata/conmon-pidfile -p /var/run/containers/storage/overlay-containers/9a472c41ff47fe4c98f049f87c21155aef0e5a107a779c8ddbdbfc89d973d1e2/userdata/pidfile --persist-dir /var/lib/containers/storage/overlay-containers/9a472c41ff47fe4c98f049f87c21155aef0e5a107a779c8ddbdbfc89d973d1e2/userdata -r /usr/bin/runc --runtime-arg --root=/run/runc --socket-dir-path /var/run/crio -u 9a472c41ff47fe4c98f049f87c21155aef0e5a107a779c8ddbdbfc89d973d1e2 -s root 3523636 33.4 11.8 2204988 1944488 ? Rsl Apr26 1352:22 ovn-controller unix:/var/run/openvswitch/db.sock -vfile:off --no-chdir --pidfile=/var/run/ovn/ovn-controller.pid -p /ovn-cert/tls.key -c /ovn-cert/tls.crt -C /ovn-ca/ca-bundle.crt -vconsole:info [core@openshift02-tvdck-master-0 ~]$ kill -11 3523636 -bash: kill: (3523636) - Operation not permitted [core@openshift02-tvdck-master-0 ~]$ kill -11 3523636sudo [core@openshift02-tvdck-master-0 ~]$ sudo kill -11 3523636ps -aux | egrep 'USER|ovn-controller' USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND core 269916 0.0 0.0 12792 988 pts/0 S+ 13:54 0:00 grep -E --color=auto USER|ovn-controller root 3523610 0.0 0.0 143732 1904 ? Ssl Apr26 0:03 /usr/libexec/crio/conmon -b /var/run/containers/storage/overlay-containers/9a472c41ff47fe4c98f049f87c21155aef0e5a107a779c8ddbdbfc89d973d1e2/userdata -c 9a472c41ff47fe4c98f049f87c21155aef0e5a107a779c8ddbdbfc89d973d1e2 --exit-dir /var/run/crio/exits -l /var/log/pods/openshift-ovn-kubernetes_ovnkube-node-g8n2c_27d64f48-a34c-4cfb-acf2-1e6ac0c9605e/ovn-controller/1.log --log-level info -n k8s_ovn-controller_ovnkube-node-g8n2c_openshift-ovn-kubernetes_27d64f48-a34c-4cfb-acf2-1e6ac0c9605e_1 -P /var/run/containers/storage/overlay-containers/9a472c41ff47fe4c98f049f87c21155aef0e5a107a779c8ddbdbfc89d973d1e2/userdata/conmon-pidfile -p /var/run/containers/storage/overlay-containers/9a472c41ff47fe4c98f049f87c21155aef0e5a107a779c8ddbdbfc89d973d1e2/userdata/pidfile --persist-dir /var/lib/containers/storage/overlay-containers/9a472c41ff47fe4c98f049f87c21155aef0e5a107a779c8ddbdbfc89d973d1e2/userdata -r /usr/bin/runc --runtime-arg --root=/run/runc --socket-dir-path /var/run/crio -u 9a472c41ff47fe4c98f049f87c21155aef0e5a107a779c8ddbdbfc89d973d1e2 -s root 3523636 33.4 11.8 2205320 1945732 ? 
Ssl Apr26 1353:12 ovn-controller unix:/var/run/openvswitch/db.sock -vfile:off --no-chdir --pidfile=/var/run/ovn/ovn-controller.pid -p /ovn-cert/tls.key -c /ovn-cert/tls.crt -C /ovn-ca/ca-bundle.crt -vconsole:info [core@openshift02-tvdck-master-0 ~]$ ps -aux | egrep 'USER|ovn-controller' USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND root 270902 0.0 0.0 143732 2628 ? Ssl 13:54 0:00 /usr/libexec/crio/conmon -b /var/run/containers/storage/overlay-containers/713e5a8d9e00d8c9a8b65f0da1bab6fad68801385ef8b9df45c5f74a492aa57a/userdata -c 713e5a8d9e00d8c9a8b65f0da1bab6fad68801385ef8b9df45c5f74a492aa57a --exit-dir /var/run/crio/exits -l /var/log/pods/openshift-ovn-kubernetes_ovnkube-node-g8n2c_27d64f48-a34c-4cfb-acf2-1e6ac0c9605e/ovn-controller/2.log --log-level info -n k8s_ovn-controller_ovnkube-node-g8n2c_openshift-ovn-kubernetes_27d64f48-a34c-4cfb-acf2-1e6ac0c9605e_2 -P /var/run/containers/storage/overlay-containers/713e5a8d9e00d8c9a8b65f0da1bab6fad68801385ef8b9df45c5f74a492aa57a/userdata/conmon-pidfile -p /var/run/containers/storage/overlay-containers/713e5a8d9e00d8c9a8b65f0da1bab6fad68801385ef8b9df45c5f74a492aa57a/userdata/pidfile --persist-dir /var/lib/containers/storage/overlay-containers/713e5a8d9e00d8c9a8b65f0da1bab6fad68801385ef8b9df45c5f74a492aa57a/userdata -r /usr/bin/runc --runtime-arg --root=/run/runc --socket-dir-path /var/run/crio -u 713e5a8d9e00d8c9a8b65f0da1bab6fad68801385ef8b9df45c5f74a492aa57a -s root 270931 65.2 1.0 420244 165864 ? Rsl 13:54 0:06 ovn-controller unix:/var/run/openvswitch/db.sock -vfile:off --no-chdir --pidfile=/var/run/ovn/ovn-controller.pid -p /ovn-cert/tls.key -c /ovn-cert/tls.crt -C /ovn-ca/ca-bundle.crt -vconsole:info core 272632 0.0 0.0 12924 1100 pts/0 S+ 13:54 0:00 grep -E --color=auto USER|ovn-controller [core@openshift02-tvdck-master-0 ~]$ ps -aux | egrep 'USER|ovn-controller' USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND root 270902 0.0 0.0 143732 2628 ? Ssl 13:54 0:00 /usr/libexec/crio/conmon -b /var/run/containers/storage/overlay-containers/713e5a8d9e00d8c9a8b65f0da1bab6fad68801385ef8b9df45c5f74a492aa57a/userdata -c 713e5a8d9e00d8c9a8b65f0da1bab6fad68801385ef8b9df45c5f74a492aa57a --exit-dir /var/run/crio/exits -l /var/log/pods/openshift-ovn-kubernetes_ovnkube-node-g8n2c_27d64f48-a34c-4cfb-acf2-1e6ac0c9605e/ovn-controller/2.log --log-level info -n k8s_ovn-controller_ovnkube-node-g8n2c_openshift-ovn-kubernetes_27d64f48-a34c-4cfb-acf2-1e6ac0c9605e_2 -P /var/run/containers/storage/overlay-containers/713e5a8d9e00d8c9a8b65f0da1bab6fad68801385ef8b9df45c5f74a492aa57a/userdata/conmon-pidfile -p /var/run/containers/storage/overlay-containers/713e5a8d9e00d8c9a8b65f0da1bab6fad68801385ef8b9df45c5f74a492aa57a/userdata/pidfile --persist-dir /var/lib/containers/storage/overlay-containers/713e5a8d9e00d8c9a8b65f0da1bab6fad68801385ef8b9df45c5f74a492aa57a/userdata -r /usr/bin/runc --runtime-arg --root=/run/runc --socket-dir-path /var/run/crio -u 713e5a8d9e00d8c9a8b65f0da1bab6fad68801385ef8b9df45c5f74a492aa57a -s root 270931 69.7 1.0 423436 169068 ? Rsl 13:54 0:17 ovn-controller unix:/var/run/openvswitch/db.sock -vfile:off --no-chdir --pidfile=/var/run/ovn/ovn-controller.pid -p /ovn-cert/tls.key -c /ovn-cert/tls.crt -C /ovn-ca/ca-bundle.crt -vconsole:info core 274754 0.0 0.0 12924 2568 pts/0 S+ 13:54 0:00 grep -E --color=auto USER|ovn-controller [core@openshift02-tvdck-master-0 ~]$ l After the kill I hit the issue that I'm experiencing so the cluster become unstable mainly with some resource consumption on master-2 that in some case become unready. 
After a while - maybe after some restart of the ovn pod - the cluster become more responsive and after looking at it with Andre Costa, we decided do do a reboot of all vm starting from the worker. After the reboot the adm top show: root:~ (☸️ |dapr-system/api-openshift02-swqa-tst:6443/gianluca.lini:dapr-system)# date Thu Apr 29 17:24:23 CEST 2021 root:~ (☸️ |dapr-system/api-openshift02-swqa-tst:6443/gianluca.lini:dapr-system)# oc adm top pod -l app=ovnkube-node --containers -n openshift-ovn-kubernetes POD NAME CPU(cores) MEMORY(bytes) ovnkube-node-5648w kube-rbac-proxy 0m 19Mi ovnkube-node-5648w ovn-controller 757m 180Mi ovnkube-node-5648w ovnkube-node 38m 54Mi ovnkube-node-84wkf kube-rbac-proxy 0m 17Mi ovnkube-node-84wkf ovn-controller 1014m 185Mi ovnkube-node-84wkf ovnkube-node 101m 57Mi ovnkube-node-bqk8j kube-rbac-proxy 0m 25Mi ovnkube-node-bqk8j ovn-controller 739m 177Mi ovnkube-node-bqk8j ovnkube-node 94m 52Mi ovnkube-node-bvt9t kube-rbac-proxy 0m 20Mi ovnkube-node-bvt9t ovn-controller 930m 185Mi ovnkube-node-bvt9t ovnkube-node 87m 74Mi ovnkube-node-g8n2c kube-rbac-proxy 0m 14Mi ovnkube-node-g8n2c ovn-controller 990m 178Mi ovnkube-node-g8n2c ovnkube-node 77m 45Mi ovnkube-node-ph9k4 kube-rbac-proxy 0m 30Mi ovnkube-node-ph9k4 ovn-controller 1069m 182Mi ovnkube-node-ph9k4 ovnkube-node 77m 72Mi ovnkube-node-xf7v2 kube-rbac-proxy 0m 15Mi ovnkube-node-xf7v2 ovn-controller 803m 182Mi ovnkube-node-xf7v2 ovnkube-node 94m 75Mi root:~ (☸️ |dapr-system/api-openshift02-swqa-tst:6443/gianluca.lini:dapr-system)# This morning the output is: root:~ (☸️ |dapr-system/api-openshift02-swqa-tst:6443/gianluca.lini:dapr-system)# date Fri Apr 30 07:52:09 CEST 2021 root:~ (☸️ |dapr-system/api-openshift02-swqa-tst:6443/gianluca.lini:dapr-system)# oc adm top pod -l app=ovnkube-node --containers -n openshift-ovn-kubernetes POD NAME CPU(cores) MEMORY(bytes) ovnkube-node-5648w kube-rbac-proxy 0m 14Mi ovnkube-node-5648w ovn-controller 1161m 1739Mi ovnkube-node-5648w ovnkube-node 45m 53Mi ovnkube-node-84wkf kube-rbac-proxy 0m 17Mi ovnkube-node-84wkf ovn-controller 448m 778Mi ovnkube-node-84wkf ovnkube-node 64m 60Mi ovnkube-node-bqk8j kube-rbac-proxy 0m 16Mi ovnkube-node-bqk8j ovn-controller 719m 874Mi ovnkube-node-bqk8j ovnkube-node 39m 36Mi ovnkube-node-g8n2c kube-rbac-proxy 0m 17Mi ovnkube-node-g8n2c ovn-controller 886m 1252Mi ovnkube-node-g8n2c ovnkube-node 65m 42Mi ovnkube-node-j9cs7 kube-rbac-proxy 0m 19Mi ovnkube-node-j9cs7 ovn-controller 872m 767Mi ovnkube-node-j9cs7 ovnkube-node 85m 55Mi ovnkube-node-ph9k4 kube-rbac-proxy 0m 22Mi ovnkube-node-ph9k4 ovn-controller 1104m 751Mi ovnkube-node-ph9k4 ovnkube-node 96m 52Mi ovnkube-node-xf7v2 kube-rbac-proxy 0m 29Mi ovnkube-node-xf7v2 ovn-controller 898m 842Mi ovnkube-node-xf7v2 ovnkube-node 84m 99Mi root:~ (☸️ |dapr-system/api-openshift02-swqa-tst:6443/gianluca.lini:dapr-system)# The pod with cpu > 1000 are on this nodes: root:~ (☸️ |dapr-system/api-openshift02-swqa-tst:6443/gianluca.lini:dapr-system)# oc get pod ovnkube-node-5648w -n openshift-ovn-kubernetes -o wide NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES ovnkube-node-5648w 3/3 Running 0 10d 172.21.8.42 openshift02-tvdck-master-1 <none> <none> root:~ (☸️ |dapr-system/api-openshift02-swqa-tst:6443/gianluca.lini:dapr-system)# root:~ (☸️ |dapr-system/api-openshift02-swqa-tst:6443/gianluca.lini:dapr-system)# oc get pod ovnkube-node-ph9k4 -n openshift-ovn-kubernetes -o wide NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES ovnkube-node-ph9k4 3/3 Running 0 10d 172.21.8.46 
openshift02-tvdck-worker-nnn9b <none> <none> root:~ (☸️ |dapr-system/api-openshift02-swqa-tst:6443/gianluca.lini:dapr-system)# This is the top -c on the worker node: top - 05:55:55 up 14:56, 1 user, load average: 6.07, 6.22, 6.05 Tasks: 573 total, 3 running, 567 sleeping, 0 stopped, 3 zombie %Cpu(s): 23.0 us, 5.0 sy, 0.0 ni, 51.5 id, 18.5 wa, 0.6 hi, 1.3 si, 0.0 st MiB Mem : 32167.3 total, 286.1 free, 17311.1 used, 14570.2 buff/cache MiB Swap: 0.0 total, 0.0 free, 0.0 used. 15278.7 avail Mem PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 6832 root 20 0 1018276 762404 6456 R 98.7 2.3 728:12.13 ovn-controller unix:/var/run/openvswitch/db.sock -vfile:off --no-chdir --pidfile=/var/run/ovn/ovn-controller.pid -p /ovn-ce+ 20293 1000660+ 20 0 72.3g 4.9g 515364 S 19.6 15.6 184:27.04 /usr/lib/jvm/jre/bin/java -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly + 13270 root 20 0 1941156 660460 312020 S 17.3 2.0 152:35.88 /usr/local/bin/px-storage As said by Andre the egress rule was removed yesterday there are two in place and they are quite simple: root:~ (☸️ |dapr-system/api-openshift02-swqa-tst:6443/gianluca.lini:dapr-system)# cat /ocp/config/egress-demo-lel.yaml apiVersion: k8s.ovn.org/v1 kind: EgressFirewall metadata: name: default spec: egress: - type: Allow to: cidrSelector: 172.23.199.72/32 cidrSelector: 172.27.0.65/32 - type: Allow to: dnsName: docker.io dnsName: registry-1.docker.io dnsName: mcr.microsoft.com dnsName: azure.microsoft.com - type: Deny to: cidrSelector: 0.0.0.0/0 root:~ (☸️ |dapr-system/api-openshift02-swqa-tst:6443/gianluca.lini:dapr-system)# Let me know if you need more test, we can do everything on that cluster. Gianluca
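To answer questions 2 and 3 from comment #11 (does ovn-controller memory drop after a restart and then slowly climb again), a crude sampling loop built on the same oc adm top command used above; the interval and log file name are arbitrary choices:

```
# Sample ovn-controller memory every 5 minutes with a timestamp
while true; do
  echo "--- $(date -u) ---"
  oc adm top pod -l app=ovnkube-node --containers -n openshift-ovn-kubernetes | grep ovn-controller
  sleep 300
done | tee -a ovn-controller-mem.log
```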
Thanks Gianluca and Andre. I'll get the OVN team to analyze the cores. Could you please also grab a must-gather and the network must-gather: 'oc adm must-gather -- gather_network_logs'. Having the latest DBs might help us reproduce the problem and/or provide some insight into which logical flows are causing it.
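For completeness, a sketch of the two collections being requested; --dest-dir is optional and only used here to keep the default and network gathers separate:

```
# Default must-gather, then the network gather mentioned above (expected to include the OVN DBs)
oc adm must-gather --dest-dir=./must-gather-default
oc adm must-gather --dest-dir=./must-gather-network -- gather_network_logs
```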
Hi Tim, Attaching the information requested. Also I would like to point that ETCD seems to have some performance issues. Would this be worth to check it with the respective team? Not sure if the OVN might be influencing this or not, but if ETCD is not responding it makes sense the issues we have with the kube-apiserver and the slowness of interacting with the cluster and the number of restarts in the api pods. etcdctl endpoint status -w table +--------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+ | ENDPOINT | ID | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS | +--------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+ | https://<IP>:2379 | 70a3eccd1ab6c2a8 | 3.4.9 | 407 MB | true | false | 8172 | 105701328 | 105701328 | | | https://<IP>:2379 | 1f84f55eb037bb5f | 3.4.9 | 407 MB | false | false | 8172 | 105701328 | 105701328 | | | https://<IP>:2379 | 84479000a194f52b | 3.4.9 | 407 MB | false | false | 8172 | 105701328 | 105701328 | | +--------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+ sh-4.4# etcdctl check perf --load=l 60 / 60 Boooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooo! 100.00% 1m0s FAIL: Throughput too low: 4298 writes/s PASS: Slowest request took 0.466931s PASS: Stddev is 0.039390s FAIL sh-4.4# sh-4.4# etcdctl check perf --load=m 60 / 60 Boooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooo! 100.00% 1m0s FAIL: Throughput too low: 615 writes/s Slowest request took too long: 1.033440s Stddev too high: 0.134737s FAIL I have tried to replicate the issues with OVN somehow and so far I don't see anything. Upgrade to 4.7.8 went well and I tried deploying some egressFirewall policies and still nothing. My cluster is running on my local VMs and I don't see any performance issues with ETCD nor constant restarts on the API.
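For reference, a sketch of how these etcdctl checks are typically run on an OpenShift cluster; this assumes the usual etcd static pod layout where an etcdctl sidecar container already has the client certificates and endpoints configured (pod and container names may differ):

```
# Pick one of the etcd static pods, then run the checks from its etcdctl container
oc -n openshift-etcd get pods -o wide
oc -n openshift-etcd rsh -c etcdctl etcd-<master-node-name>
# inside the pod:
#   etcdctl endpoint status -w table
#   etcdctl check perf --load=m
```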
@anusaxen
We would like to dig deeper into cluster performance; could you please provide a dump of the Prometheus state? Often the dump is very large (60GB or more is possible), so please consider this when running the command.

```
#!/bin/bash
ARTIFACT_DIR=$PWD
mkdir -p $ARTIFACT_DIR/metrics
echo "Snapshotting prometheus ..."
oc --insecure-skip-tls-verify exec -n openshift-monitoring prometheus-k8s-0 -- tar cvzf - -C /prometheus . >$ARTIFACT_DIR/metrics/prometheus.tar.gz
```

Please ensure the file can be decompressed before uploading. With this data we will be able to drill deeply into the root cause.
etcd team will dig into the logs provided as a first pass.
(In reply to Sam Batschelet from comment #21)
> @anusaxen
>
> We would like to dig deeper into cluster perf could you please provide a
> dump of Prometheus state. Often times the dump is very large 60GB or more
> possible so please consider this when running the command.
>
> ```
> #!/bin/bash
> ARTIFACT_DIR=$PWD
> mkdir -p $ARTIFACT_DIR/metrics
> echo "Snapshotting prometheus ..."
> oc --insecure-skip-tls-verify exec -n openshift-monitoring prometheus-k8s-0
> -- tar cvzf - -C /prometheus . >$ARTIFACT_DIR/metrics/prometheus.tar.gz
> ```
>
> Please ensure the file can be decompressed before uploading. With this data
> we will be able to drill deeply into the root cause.

Hi Sam, I've just uploaded the data (.tar.gz size is 1.8GB) to the case.
On a first pass, I am very concerned by the etcd outages, many of which span hours.

```
$ cat host_service_logs/masters/kubelet_service.log | awk '/Readiness probe for "etcd-quorum-guard.*/ { print $1, $2, $3, $4, $10, $14; }' | wc -l
5602
```
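As a hypothetical follow-up on the same kubelet log, assuming the syslog-style "Mon DD HH:MM:SS" timestamps shown by journalctl, the failures can be bucketed by hour to see whether they cluster around the incident windows rather than being spread evenly:

```
# Count etcd-quorum-guard readiness probe failures per day and hour
grep 'Readiness probe for "etcd-quorum-guard' host_service_logs/masters/kubelet_service.log \
  | awk '{ print $1, $2, substr($3, 1, 2) ":00" }' | sort | uniq -c | sort -rn | head
```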
Not sure whether the performance issues in etcd are caused by OVN or by a separate issue, but etcd itself doesn't show any specific problem that needs further investigation. Sending the BZ back to @trozet.
I see from the audit logs that an operator called "dapr-operator" is deleting the openshift-ingress services: openshift02-tvdck-master-1-audit-2021-05-07T04-14-34.582.log:{"kind":"Event","apiVersion":"audit.k8s.io/v1","level":"Metadata","auditID":"33039c90-8978-4945-bd0c-4c78e0bd8dac","stage":"ResponseComplete","requestURI":"/api/v1/namespaces/openshift-ingress/services/router-internal-default","verb":"delete","user":{"username":"system:serviceaccount:dapr-system:dapr-operator","uid":"238776f7-c142-41c4-8854-e0d4a81ccf91","groups":["system:serviceaccounts","system:serviceaccounts:dapr-system","system:authenticated"]},"sourceIPs":["172.21.8.44"],"userAgent":"operator/v0.0.0 (linux/amd64) kubernetes/$Format/leader-election","objectRef":{"resource":"services","namespace":"openshift-ingress","name":"router-internal-default","apiVersion":"v1"},"responseStatus":{"metadata":{},"status":"Success","code":200},"requestReceivedTimestamp":"2021-05-07T04:14:17.151385Z","stageTimestamp":"2021-05-07T04:14:17.268381Z","annotations":{"authentication.k8s.io/legacy-token":"system:serviceaccount:dapr-system:dapr-operator","authorization.k8s.io/decision":"allow","authorization.k8s.io/reason":"RBAC: allowed by ClusterRoleBinding \"dapr-operator\" of ClusterRole \"dapr-operator-admin\" to ServiceAccount \"dapr-operator/dapr-system\""}} I'm unfamiliar with this operator, I'm guessing it is this: https://docs.dapr.io/operations/hosting/kubernetes/kubernetes-overview/ Either way it shouldn't be deleting core openshift services. Can we try disabling this operator?
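For anyone re-checking this, a sketch of one way to pull these deletions out of the audit logs in the must-gather; jq is assumed to be available, and the file glob matches the audit log file names quoted above:

```
# Which service account is deleting services in openshift-ingress, and when
cat openshift02-tvdck-master-*-audit-*.log \
  | jq -c 'select(.verb == "delete" and .objectRef.resource == "services" and .objectRef.namespace == "openshift-ingress")
           | {user: .user.username, name: .objectRef.name, time: .requestReceivedTimestamp}'
```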