Bug 2041307 - [4.7z] [Documentation] OVN idle service cannot be accessed after restarting ovnkube-master pods
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Documentation
Version: 4.7
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 4.6.z
Assignee: Andrew Taylor
QA Contact: zhaozhanqi
Docs Contact: Latha S
URL:
Whiteboard:
: 2107657 (view as bug list)
Depends On: 2018398 2023985
Blocks: 2042870
 
Reported: 2022-01-17 03:22 UTC by zhaozhanqi
Modified: 2022-07-19 16:30 UTC
CC List: 11 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 2018398
: 2042870 (view as bug list)
Environment:
Last Closed: 2022-06-15 14:33:26 UTC
Target Upstream Version:
Embargoed:




Links:
Red Hat Knowledge Base (Solution) 6671241 (last updated 2022-01-26 09:58:08 UTC)

Comment 1 zhaozhanqi 2022-01-17 04:51:01 UTC
The pod cannot be woken up after a 4.6/4.7 cluster is upgraded to 4.7.41; this does not seem to be entirely the same as the issue above.


$ oc get ep -n idle-upgrade
NAME           ENDPOINTS   AGE
test-service   <none>      17h

$ oc get svc -n idle-upgrade -o yaml
apiVersion: v1
items:
- apiVersion: v1
  kind: Service
  metadata:
    annotations:
      idling.alpha.openshift.io/idled-at: "2022-01-16T11:06:29Z"
      idling.alpha.openshift.io/unidle-targets: '[{"kind":"ReplicationController","name":"test-rc","replicas":2}]'
    creationTimestamp: "2022-01-16T11:06:23Z"
    labels:
      name: test-service
    name: test-service
    namespace: idle-upgrade
    resourceVersion: "43749"
    selfLink: /api/v1/namespaces/idle-upgrade/services/test-service
    uid: f3c9bd55-3aae-41a4-8e05-431cbd562ccc
  spec:
    clusterIP: 172.30.4.73
    clusterIPs:
    - 172.30.4.73
    ports:
    - name: http
      port: 27017
      protocol: TCP
      targetPort: 8080
    selector:
      name: test-pods
    sessionAffinity: None
    type: ClusterIP
  status:
    loadBalancer: {}
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""


#### Trying to access the service from an OVN node, but it times out ####

sh-4.4# curl --connect-timeout 10 172.30.4.73:27017
curl: (28) Connection timed out after 10001 milliseconds
sh-4.4# 


#### Found that the pod cannot be woken up ####
$ oc get ep -n idle-upgrade
NAME           ENDPOINTS   AGE
test-service   <none>      17h


Here are the must-gather logs: http://file.apac.redhat.com/~zzhao/must-gather.local.3054145436653881521.tar.gz

Comment 3 Andreas Karis 2022-01-17 18:37:46 UTC
Hi there,

Could you generate and attach an oc adm inspect for that namespace?
~~~
namespace=idle-upgrade
exclude_list="secrets"
oc adm inspect -n $namespace $(oc api-resources --verbs=get,list --namespaced=true | tail -n+2 | egrep -v "$exclude_list" | awk '{print $1}' | tr '\n' ',' | sed 's/,$//')
~~~

Thanks!

- Andreas

Comment 4 Andreas Karis 2022-01-17 18:49:09 UTC
Also, while still in the erroneous state, can you provide an OVN / network must-gather so that I can have a look at the databases:
https://access.redhat.com/solutions/5459251
~~~
oc adm must-gather -- /usr/bin/gather_network_logs
~~~

Comment 5 Andreas Karis 2022-01-17 22:33:10 UTC
Hi there,

Just some further info. I can reproduce this after an upgrade, as well. I upgraded from 4.6.53 to 4.7.41.

I idle the SVC first, then upgrade the cluster:
~~~
[akaris@linux 2041307]$ oc get pods
NAME                                READY   STATUS    RESTARTS   AGE
ip-10-0-139-82ec2internal-debug     1/1     Running   0          12m
nginx-deployment-66b6c48dd5-cdbdz   1/1     Running   0          110s
[akaris@linux 2041307]$ oc get deployment
NAME               READY   UP-TO-DATE   AVAILABLE   AGE
nginx-deployment   1/1     1            1           14m
[akaris@linux 2041307]$ oc get -o yaml svc nginx
apiVersion: v1
kind: Service
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"v1","kind":"Service","metadata":{"annotations":{},"labels":{"app":"nginx"},"name":"nginx","namespace":"default"},"spec":{"ports":[{"name":"http","port":27017,"protocol":"TCP","targetPort":80}],"selector":{"app":"nginx"},"sessionAffinity":"None","type":"ClusterIP"}}
  creationTimestamp: "2022-01-17T19:20:06Z"
  labels:
    app: nginx
  name: nginx
  namespace: default
  resourceVersion: "35183"
  selfLink: /api/v1/namespaces/default/services/nginx
  uid: bd9056a3-ad65-410c-a7a9-b3cb71e88a7d
spec:
  clusterIP: 172.30.174.162
  ports:
  - name: http
    port: 27017
    protocol: TCP
    targetPort: 80
  selector:
    app: nginx
  sessionAffinity: None
  type: ClusterIP
status:
  loadBalancer: {}
[akaris@linux 2041307]$ oc get pods
No resources found in default namespace.
[akaris@linux 2041307]$ oc get deployment
NAME               READY   UP-TO-DATE   AVAILABLE   AGE
nginx-deployment   0/0     0            0           16m
[akaris@linux 2041307]$ oc adm upgrade --to-image=quay.io/openshift-release-dev/ocp-release:4.7.41-x86_64 --force --allow-explicit-upgrade
warning: Using by-tag pull specs is dangerous, and while we still allow it in combination with --force for backward compatibility, it would be much safer to pass a by-digest pull spec instead
warning: The requested upgrade image is not one of the available updates.You have used --allow-explicit-upgrade for the update to proceed anyway
warning: --force overrides cluster verification of your supplied release image and waives any update precondition failures.
Updating to release image quay.io/openshift-release-dev/ocp-release:4.7.41-x86_64
[akaris@linux 2041307]$ oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.6.53    True        True          4s      Working towards quay.io/openshift-release-dev/ocp-release:4.7.41-x86_64: downloading update
~~~

After the upgrade is done, I try to unidle the SVC by connecting to it:
~~~
[akaris@linux 2041307]$ oc debug node/ip-10-0-139-82.ec2.internal
Starting pod/ip-10-0-139-82ec2internal-debug ...
To use host binaries, run `chroot /host`
Pod IP: 10.0.139.82
If you don't see a command prompt, try pressing enter.
(failed reverse-i-search)`curl': ^C
sh-4.4# time curl 172.30.174.162:27017
^C

real	1m9.746s
user	0m0.005s
sys	0m0.005s
sh-4.4# 

~~~

The OVN part does its job: we can see that the controller events are deleted from the SB DB and then recreated by OVN with an incremented sequence number, over and over:
~~~
[akaris@linux 2041307]$ oc rsh -n openshift-ovn-kubernetes -c sbdb ovnkube-master-zlkbw
sh-4.4# for i in {0..300}; do echo "." ; ovn-sbctl list controller_event; sleep 1 ; done
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
_uuid               : a8ae2e61-b38e-492e-9f0b-d54eac67e1ae
chassis             : f76ccada-3bc2-4fbd-bf77-bb2eed28bc4a
event_info          : {load_balancer="6ed1feda-9cd7-45cc-9847-81a9a20f167b", protocol=tcp, vip="172.30.174.162:27017"}
event_type          : empty_lb_backends
seq_num             : 2
.
_uuid               : a8ae2e61-b38e-492e-9f0b-d54eac67e1ae
chassis             : f76ccada-3bc2-4fbd-bf77-bb2eed28bc4a
event_info          : {load_balancer="6ed1feda-9cd7-45cc-9847-81a9a20f167b", protocol=tcp, vip="172.30.174.162:27017"}
event_type          : empty_lb_backends
seq_num             : 2
.
_uuid               : a8ae2e61-b38e-492e-9f0b-d54eac67e1ae
chassis             : f76ccada-3bc2-4fbd-bf77-bb2eed28bc4a
event_info          : {load_balancer="6ed1feda-9cd7-45cc-9847-81a9a20f167b", protocol=tcp, vip="172.30.174.162:27017"}
event_type          : empty_lb_backends
seq_num             : 2
.
_uuid               : a8ae2e61-b38e-492e-9f0b-d54eac67e1ae
chassis             : f76ccada-3bc2-4fbd-bf77-bb2eed28bc4a
event_info          : {load_balancer="6ed1feda-9cd7-45cc-9847-81a9a20f167b", protocol=tcp, vip="172.30.174.162:27017"}
event_type          : empty_lb_backends
seq_num             : 2
.
.
.
_uuid               : 66d045a0-258a-404d-8493-1a3e7bf1e13a
chassis             : f76ccada-3bc2-4fbd-bf77-bb2eed28bc4a
event_info          : {load_balancer="6ed1feda-9cd7-45cc-9847-81a9a20f167b", protocol=tcp, vip="172.30.174.162:27017"}
event_type          : empty_lb_backends
seq_num             : 3
.
_uuid               : 66d045a0-258a-404d-8493-1a3e7bf1e13a
chassis             : f76ccada-3bc2-4fbd-bf77-bb2eed28bc4a
event_info          : {load_balancer="6ed1feda-9cd7-45cc-9847-81a9a20f167b", protocol=tcp, vip="172.30.174.162:27017"}
event_type          : empty_lb_backends
seq_num             : 3
.
_uuid               : 66d045a0-258a-404d-8493-1a3e7bf1e13a
chassis             : f76ccada-3bc2-4fbd-bf77-bb2eed28bc4a
event_info          : {load_balancer="6ed1feda-9cd7-45cc-9847-81a9a20f167b", protocol=tcp, vip="172.30.174.162:27017"}
event_type          : empty_lb_backends
seq_num             : 3
.
.
.
.
.
.
_uuid               : c86eee8a-6b78-4fa5-8e41-4603335e96a4
chassis             : f76ccada-3bc2-4fbd-bf77-bb2eed28bc4a
event_info          : {load_balancer="6ed1feda-9cd7-45cc-9847-81a9a20f167b", protocol=tcp, vip="172.30.174.162:27017"}
event_type          : empty_lb_backends
seq_num             : 4
.
_uuid               : c86eee8a-6b78-4fa5-8e41-4603335e96a4
chassis             : f76ccada-3bc2-4fbd-bf77-bb2eed28bc4a
event_info          : {load_balancer="6ed1feda-9cd7-45cc-9847-81a9a20f167b", protocol=tcp, vip="172.30.174.162:27017"}
event_type          : empty_lb_backends
seq_num             : 4
.
_uuid               : c86eee8a-6b78-4fa5-8e41-4603335e96a4
chassis             : f76ccada-3bc2-4fbd-bf77-bb2eed28bc4a
event_info          : {load_balancer="6ed1feda-9cd7-45cc-9847-81a9a20f167b", protocol=tcp, vip="172.30.174.162:27017"}
event_type          : empty_lb_backends
seq_num             : 4
.
_uuid               : c86eee8a-6b78-4fa5-8e41-4603335e96a4
chassis             : f76ccada-3bc2-4fbd-bf77-bb2eed28bc4a
event_info          : {load_balancer="6ed1feda-9cd7-45cc-9847-81a9a20f167b", protocol=tcp, vip="172.30.174.162:27017"}
event_type          : empty_lb_backends
seq_num             : 4
.
_uuid               : c86eee8a-6b78-4fa5-8e41-4603335e96a4
chassis             : f76ccada-3bc2-4fbd-bf77-bb2eed28bc4a
event_info          : {load_balancer="6ed1feda-9cd7-45cc-9847-81a9a20f167b", protocol=tcp, vip="172.30.174.162:27017"}
event_type          : empty_lb_backends
seq_num             : 4
.
.
.
.
.
.
.
.
.
.
.
.
_uuid               : e7cfbbd9-525a-495d-b259-b71e227899ac
chassis             : f76ccada-3bc2-4fbd-bf77-bb2eed28bc4a
event_info          : {load_balancer="6ed1feda-9cd7-45cc-9847-81a9a20f167b", protocol=tcp, vip="172.30.174.162:27017"}
event_type          : empty_lb_backends
seq_num             : 5
.
_uuid               : e7cfbbd9-525a-495d-b259-b71e227899ac
chassis             : f76ccada-3bc2-4fbd-bf77-bb2eed28bc4a
event_info          : {load_balancer="6ed1feda-9cd7-45cc-9847-81a9a20f167b", protocol=tcp, vip="172.30.174.162:27017"}
event_type          : empty_lb_backends
seq_num             : 5
.
_uuid               : e7cfbbd9-525a-495d-b259-b71e227899ac
chassis             : f76ccada-3bc2-4fbd-bf77-bb2eed28bc4a
event_info          : {load_balancer="6ed1feda-9cd7-45cc-9847-81a9a20f167b", protocol=tcp, vip="172.30.174.162:27017"}
event_type          : empty_lb_backends
seq_num             : 5
.
_uuid               : e7cfbbd9-525a-495d-b259-b71e227899ac
chassis             : f76ccada-3bc2-4fbd-bf77-bb2eed28bc4a
event_info          : {load_balancer="6ed1feda-9cd7-45cc-9847-81a9a20f167b", protocol=tcp, vip="172.30.174.162:27017"}
event_type          : empty_lb_backends
seq_num             : 5
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
_uuid               : f6fd5410-c5f8-449f-9279-ebedf17c79ff
chassis             : f76ccada-3bc2-4fbd-bf77-bb2eed28bc4a
event_info          : {load_balancer="6ed1feda-9cd7-45cc-9847-81a9a20f167b", protocol=tcp, vip="172.30.174.162:27017"}
event_type          : empty_lb_backends
seq_num             : 6
.
.
.
.
.
.
^C
sh-4.4# 
~~~

The OVNKUBE layer does not generate the necessary event (the event below is from before the upgrade):
~~~
[akaris@linux 2041307]$ oc get events | grep Need
155m        Normal    NeedPods                                     service/nginx                            The service nginx needs pods
~~~

Log level 5:
~~~
[akaris@linux 2041307]$ oc get configmap -n openshift-ovn-kubernetes env-overrides -o yaml
apiVersion: v1
data:
  _master: |
    # This sets the log level for the ovn-kubernetes master process as well as the ovn-dbchecker:
    OVN_KUBE_LOG_LEVEL=5
kind: ConfigMap
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"v1","data":{"_master":"# This sets the log level for the ovn-kubernetes master process as well as the ovn-dbchecker:\nOVN_KUBE_LOG_LEVEL=5\n"},"kind":"ConfigMap","metadata":{"annotations":{},"name":"env-overrides","namespace":"openshift-ovn-kubernetes"}}
  creationTimestamp: "2022-01-17T19:15:50Z"
  name: env-overrides
  namespace: openshift-ovn-kubernetes
  resourceVersion: "30197"
  selfLink: /api/v1/namespaces/openshift-ovn-kubernetes/configmaps/env-overrides
  uid: e3df0f4a-5df7-4654-be49-c6e113273ba7
~~~

The logs do not show the "Sending a NeedPods event for service" message:
~~~
[akaris@linux 2041307]$ oc logs -n openshift-ovn-kubernetes ovnkube-master-h7g4w -c ovnkube-master | grep 'Sending a NeedPods event for service'
[akaris@linux 2041307]$ 
~~~

But neither do they show an error:
~~~
[akaris@linux 2041307]$ oc logs -n openshift-ovn-kubernetes ovnkube-master-h7g4w -c ovnkube-master | grep 'Unable to remove controller'
[akaris@linux 2041307]$ 
~~~

But the deletion is happening:
~~~
[akaris@linux 2041307]$ oc logs -n openshift-ovn-kubernetes ovnkube-master-h7g4w -c ovnkube-master | grep 'destroy'
I0117 21:40:06.019387       1 ovs.go:169] exec(1457): /usr/bin/ovn-sbctl --private-key=/ovn-cert/tls.key --certificate=/ovn-cert/tls.crt --bootstrap-ca-cert=/ovn-ca/ca-bundle.crt --db=ssl:10.0.128.219:9642,ssl:10.0.156.145:9642,ssl:10.0.173.46:9642 --timeout=15 destroy controller_event c37c0d86-dfc5-4fb0-a439-bd07ad284d7b
I0117 21:40:11.006009       1 ovs.go:169] exec(1459): /usr/bin/ovn-sbctl --private-key=/ovn-cert/tls.key --certificate=/ovn-cert/tls.crt --bootstrap-ca-cert=/ovn-ca/ca-bundle.crt --db=ssl:10.0.128.219:9642,ssl:10.0.156.145:9642,ssl:10.0.173.46:9642 --timeout=15 destroy controller_event a8ae2e61-b38e-492e-9f0b-d54eac67e1ae
I0117 21:40:16.018151       1 ovs.go:169] exec(1461): /usr/bin/ovn-sbctl --private-key=/ovn-cert/tls.key --certificate=/ovn-cert/tls.crt --bootstrap-ca-cert=/ovn-ca/ca-bundle.crt --db=ssl:10.0.128.219:9642,ssl:10.0.156.145:9642,ssl:10.0.173.46:9642 --timeout=15 destroy controller_event 66d045a0-258a-404d-8493-1a3e7bf1e13a
I0117 21:40:26.011394       1 ovs.go:169] exec(1464): /usr/bin/ovn-sbctl --private-key=/ovn-cert/tls.key --certificate=/ovn-cert/tls.crt --bootstrap-ca-cert=/ovn-ca/ca-bundle.crt --db=ssl:10.0.128.219:9642,ssl:10.0.156.145:9642,ssl:10.0.173.46:9642 --timeout=15 destroy controller_event c86eee8a-6b78-4fa5-8e41-4603335e96a4
I0117 21:40:41.013367       1 ovs.go:169] exec(1468): /usr/bin/ovn-sbctl --private-key=/ovn-cert/tls.key --certificate=/ovn-cert/tls.crt --bootstrap-ca-cert=/ovn-ca/ca-bundle.crt --db=ssl:10.0.128.219:9642,ssl:10.0.156.145:9642,ssl:10.0.173.46:9642 --timeout=15 destroy controller_event e7cfbbd9-525a-495d-b259-b71e227899ac
I0117 21:41:11.014046       1 ovs.go:169] exec(1475): /usr/bin/ovn-sbctl --private-key=/ovn-cert/tls.key --certificate=/ovn-cert/tls.crt --bootstrap-ca-cert=/ovn-ca/ca-bundle.crt --db=ssl:10.0.128.219:9642,ssl:10.0.156.145:9642,ssl:10.0.173.46:9642 --timeout=15 destroy controller_event f6fd5410-c5f8-449f-9279-ebedf17c79ff
~~~

Hence, the failure (or rather the symptom of the failure) is somewhere in the code below, given that the destroy succeeds but we never see the Infof log "Sending a NeedPods ...":
~~~
 541 func (oc *Controller) ovnControllerEventChecker() {
 542         ticker := time.NewTicker(5 * time.Second)
 543 
 544         _, _, err := util.RunOVNNbctl("set", "nb_global", ".", "options:controller_event=true")
 545         if err != nil {
 546                 klog.Error("Unable to enable controller events. Unidling not possible")
 547                 return
 548         }
 549 
 550         for {
 551                 select {
 552                 case <-ticker.C:
 553                         out, _, err := util.RunOVNSbctl("--format=json", "list", "controller_event")
 554                         if err != nil {
 555                                 continue
 556                         }
 557 
 558                         events, err := extractEmptyLBBackendsEvents([]byte(out))
 559                         if err != nil || len(events) == 0 {
 560                                 continue
 561                         }
 562 
 563                         for _, event := range events {
 564                                 _, _, err := util.RunOVNSbctl("destroy", "controller_event", event.uuid)
 565                                 if err != nil {
 566                                         // Don't unidle until we are able to remove the controller event
 567                                         klog.Errorf("Unable to remove controller event %s", event.uuid)
 568                                         continue
 569                                 }
 570                                 if serviceName, ok := oc.GetServiceVIPToName(event.vip, event.protocol); ok {
 571                                         serviceRef := kapi.ObjectReference{
 572                                                 Kind:      "Service",
 573                                                 Namespace: serviceName.Namespace,
 574                                                 Name:      serviceName.Name,
 575                                         }
 576                                         klog.V(5).Infof("Sending a NeedPods event for service %s in namespace %s.", serviceName.Name, serviceName.Namespace)
 577                                         oc.recorder.Eventf(&serviceRef, kapi.EventTypeNormal, "NeedPods", "The service %s needs pods", serviceName.Name)
 578                                 }
 579                         }
 580                 case <-oc.stopChan:
 581                         return
 582                 }
 583         }
 584 }
 585 
 586 func podScheduled(pod *kapi.Pod) bool {
 587         return pod.Spec.NodeName != ""
 588 }
 NORMAL  ᚠ cd1a152  go-controller/pkg/ovn/ovn.go   
~~~

That means that oc.GetServiceVIPToName fails and that most likely, we are not finding a match here:
~~~
 570                                 if serviceName, ok := oc.GetServiceVIPToName(event.vip, event.protocol); ok {
~~~

And that would indicate that the map entry is perhaps missing:
~~~
1234 // GetServiceVIPToName retrieves the associated k8s service name for a load balancer VIP
1235 func (oc *Controller) GetServiceVIPToName(vip string, protocol kapi.Protocol) (types.NamespacedName, bool) {
1236         oc.serviceVIPToNameLock.Lock()
1237         defer oc.serviceVIPToNameLock.Unlock()
1238         namespace, ok := oc.serviceVIPToName[ServiceVIPKey{vip, protocol}]
1239         return namespace, ok
1240 }
~~~
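To make this concrete, here is a minimal, self-contained sketch of that lookup (with simplified stand-in types, not the real ovn-kubernetes structs): if the serviceVIPToName cache was never populated for the VIP reported by the controller event, the comma-ok lookup misses and the NeedPods event is skipped without any error being logged.
~~~
// Hedged sketch, not the actual ovn-kubernetes code: simplified types that
// mimic ServiceVIPKey / serviceVIPToName as quoted above.
package main

import (
	"fmt"
	"net"
)

type serviceVIPKey struct {
	vip      string // "clusterIP:port", e.g. "172.30.174.162:27017"
	protocol string // "TCP", "UDP" or "SCTP"
}

type namespacedName struct {
	Namespace, Name string
}

func main() {
	// The cache that AddServiceVIPToName is supposed to populate. After an
	// ovnkube-master restart with an idled (endpoint-less) service, it stays empty.
	serviceVIPToName := map[serviceVIPKey]namespacedName{}

	// The lookup key is the same VIP string that the controller_event reports.
	vip := net.JoinHostPort("172.30.174.162", "27017")

	// Mirrors the comma-ok lookup in GetServiceVIPToName: a miss returns
	// ok == false, so ovnControllerEventChecker silently skips the NeedPods
	// event instead of logging an error.
	if svc, ok := serviceVIPToName[serviceVIPKey{vip, "TCP"}]; ok {
		fmt.Printf("would send NeedPods for %s/%s\n", svc.Namespace, svc.Name)
	} else {
		fmt.Println("no cache entry for", vip, "- NeedPods is never emitted")
	}
}
~~~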

I'll continue my research tomorrow.

Comment 6 Andreas Karis 2022-01-18 10:16:04 UTC
Update: this is not an upgrade-related issue. It happens in both 4.7.41 and 4.6.53 when a service is idled and the ovnkube-master process is then restarted:
~~~
[akaris@linux 2041307]$ oc idle nginx
Deployment "default/nginx-deployment" has been idled 
[akaris@linux 2041307]$ oc get pods
NAME                                           READY   STATUS    RESTARTS   AGE
ip-10-0-141-63us-west-1computeinternal-debug   1/1     Running   0          45s
[akaris@linux 2041307]$ oc get deployment
NAME               READY   UP-TO-DATE   AVAILABLE   AGE
nginx-deployment   0/0     0            0           94s
[akaris@linux 2041307]$ oc get pods
NAME                                           READY   STATUS    RESTARTS   AGE
ip-10-0-141-63us-west-1computeinternal-debug   1/1     Running   0          50s
[akaris@linux 2041307]$ oc get pods -n openshift-ovn-kubernetes
NAME                   READY   STATUS    RESTARTS   AGE
ovnkube-master-c968t   6/6     Running   6          31m
ovnkube-master-vnvm9   6/6     Running   1          31m
ovnkube-master-wxn75   6/6     Running   6          31m
ovnkube-node-6xds7     3/3     Running   0          21m
ovnkube-node-c8hcj     3/3     Running   0          31m
ovnkube-node-l2xhj     3/3     Running   0          21m
ovnkube-node-wf8xx     3/3     Running   0          31m
ovnkube-node-whjxv     3/3     Running   0          31m
ovnkube-node-zfv55     3/3     Running   0          24m
[akaris@linux 2041307]$ oc delete pod -n openshift-ovn-kubernetes -l app=ovnkube-master
pod "ovnkube-master-c968t" deleted
pod "ovnkube-master-vnvm9" deleted
pod "ovnkube-master-wxn75" deleted
[akaris@linux 2041307]$ oc get pods -n openshift-ovn-kubernetes
NAME                   READY   STATUS    RESTARTS   AGE
ovnkube-master-gv5wn   6/6     Running   0          115s
ovnkube-master-v7r2h   6/6     Running   0          116s
ovnkube-master-w8594   6/6     Running   0          108s
ovnkube-node-6xds7     3/3     Running   0          24m
ovnkube-node-c8hcj     3/3     Running   0          34m
ovnkube-node-l2xhj     3/3     Running   0          24m
ovnkube-node-wf8xx     3/3     Running   0          34m
ovnkube-node-whjxv     3/3     Running   0          34m
ovnkube-node-zfv55     3/3     Running   0          26m
~~~

~~~
sh-4.4# time curl 172.30.175.221:27017
^C

real	0m44.482s
user	0m0.003s
sys	0m0.003s
~~~

And then, nothing shows up:
~~~
[akaris@linux 2041307]$ oc get deployments
NAME               READY   UP-TO-DATE   AVAILABLE   AGE
nginx-deployment   0/0     0            0           7m47s
[akaris@linux 2041307]$ oc get deployments
NAME               READY   UP-TO-DATE   AVAILABLE   AGE
nginx-deployment   0/0     0            0           7m49s
[akaris@linux 2041307]$ oc get deployments
NAME               READY   UP-TO-DATE   AVAILABLE   AGE
nginx-deployment   0/0     0            0           7m50s
[akaris@linux 2041307]$ oc get deployments
NAME               READY   UP-TO-DATE   AVAILABLE   AGE
nginx-deployment   0/0     0            0           8m3s
[akaris@linux 2041307]$ oc get deployments
NAME               READY   UP-TO-DATE   AVAILABLE   AGE
nginx-deployment   0/0     0            0           8m5s
[akaris@linux 2041307]$ oc get deployments
NAME               READY   UP-TO-DATE   AVAILABLE   AGE
nginx-deployment   0/0     0            0           8m6s
[akaris@linux 2041307]$ oc get deployments
NAME               READY   UP-TO-DATE   AVAILABLE   AGE
nginx-deployment   0/0     0            0           8m8s
[akaris@linux 2041307]$ oc get deployments
NAME               READY   UP-TO-DATE   AVAILABLE   AGE
nginx-deployment   0/0     0            0           8m9s
[akaris@linux 2041307]$ oc get deployments
NAME               READY   UP-TO-DATE   AVAILABLE   AGE
nginx-deployment   0/0     0            0           8m18s
[akaris@linux 2041307]$ oc get deployments
NAME               READY   UP-TO-DATE   AVAILABLE   AGE
nginx-deployment   0/0     0            0           8m21s
[akaris@linux 2041307]$ oc get deployments
NAME               READY   UP-TO-DATE   AVAILABLE   AGE
nginx-deployment   0/0     0            0           8m23s
[akaris@linux 2041307]$ oc get deployments
NAME               READY   UP-TO-DATE   AVAILABLE   AGE
nginx-deployment   0/0     0            0           8m25s
(reverse-i-search)`dele': oc ^Clete pod -n openshift-ovn-kubernetes -l app=ovnkube-master
[akaris@linux 2041307]$ oc get deployment
NAME               READY   UP-TO-DATE   AVAILABLE   AGE
nginx-deployment   0/0     0            0           10m
[akaris@linux 2041307]$ 
~~~

Comment 7 Andreas Karis 2022-01-18 11:51:00 UTC
Here's what's happening in the OCP 4.6 and 4.7 code.

When the ovnkube daemon is restarted, it must add endpoints in AddEndpoints to build the map/cache in oc.serviceVIPToName. See comment 5 for the retrieval logic later on. Here is AddServiceVIPToName:
~~~
1227 // AddServiceVIPToName associates a k8s service name with a load balancer VIP
1228 func (oc *Controller) AddServiceVIPToName(vip string, protocol kapi.Protocol, namespace, name string) {
1229         oc.serviceVIPToNameLock.Lock()
1230         defer oc.serviceVIPToNameLock.Unlock()
1231         klog.Infof("BZ2041307 (AddServiceVIPToName): oc.serviceVIPToName[%v] = %v", ServiceVIPKey{vip, protocol}, types.NamespacedName{Namespace: namespace, Name: name})
1232         oc.serviceVIPToName[ServiceVIPKey{vip, protocol}] = types.NamespacedName{Namespace: namespace, Name: name}
1233 }       
1234         
1235 // GetServiceVIPToName retrieves the associated k8s service name for a load balancer VIP
 NORMAL  ᚠ BZ2041307-reproducer-4.7.41!  go-controller/pkg/ovn/ovn.go                                                                                                                                                                                                              go  utf-8[unix]   90% ☰ 1228/1356 ㏑ : 28 
  1 go-controller/pkg/ovn/endpoints.go|111 col 8| AddEndpoints    
~~~

That method has a single caller, AddEndpoints:
~~~
 51 // AddEndpoints adds endpoints and creates corresponding resources in OVN
 52 func (ovn *Controller) AddEndpoints(ep *kapi.Endpoints) error {
 53         klog.Infof("Adding endpoints: %s for namespace: %s", ep.Name, ep.Namespace)
 54         // get service
 55         // TODO: cache the service
 56         svc, err := ovn.watchFactory.GetService(ep.Namespace, ep.Name)
 57         if err != nil {
 58                 // This is not necessarily an error. For e.g when there are endpoints
 59                 // without a corresponding service.
 60                 klog.V(5).Infof("No service found for endpoint %s in namespace %s", ep.Name, ep.Namespace)
 61                 return nil
 62         }
 63         if !util.IsClusterIPSet(svc) {
 64                 klog.V(5).Infof("Skipping service %s due to clusterIP = %q", svc.Name, svc.Spec.ClusterIP)
 65                 return nil
 66         }
 67 
 68         klog.V(5).Infof("Matching service %s found for ep: %s, with cluster IP: %s", svc.Name, ep.Name, svc.Spec.ClusterIP)
 69 
 70         klog.Infof("BZ2041307 (AddEndpoints): %s, ep: %v", svc.Name, *ep)
 71         protoPortMap := ovn.getLbEndpoints(ep)
 72         klog.V(5).Infof("Matching service %s ports: %v", svc.Name, svc.Spec.Ports)
 73         klog.Infof("BZ2041307 (AddEndpoints): Matching service %s ports: %v", svc.Name, svc.Spec.Ports)
 74         klog.Infof("BZ2041307 (AddEndpoints): %s, protoPortMap: %v", svc.Name, protoPortMap)
 75         for _, svcPort := range svc.Spec.Ports {
 76                 klog.Infof("BZ2041307 (AddEndpoints): %s, svcPort: %v", svc.Name, svcPort)
 77                 klog.Infof("BZ2041307 (AddEndpoints): %s, protoPortMap: %v", svc.Name, protoPortMap)
 78                 lbEps, isFound := protoPortMap[svcPort.Protocol][svcPort.Name]
 79                 if !isFound {
 80                         continue
 81
...                 }
~~~

At this point, AddEndpoints skips to the next service port if protoPortMap does not contain a match for the port. Otherwise, it continues all the way down and eventually runs `ovn.AddServiceVIPToName(vip, svcPort.Protocol, svc.Namespace, svc.Name)`:

~~~
...
 82                 klog.Infof("BZ2041307 (AddEndpoints): isFound, %s, %v", svc.Name, svcPort)
 83                 if !ovn.SCTPSupport && svcPort.Protocol == kapi.ProtocolSCTP {
 84                         klog.Errorf("Rejecting endpoint creation for unsupported SCTP protocol: %s, %s", ep.Namespace, ep.Name)
 85                         continue
 86                 }
 87                 klog.Infof("BZ2041307 (AddEndpoints): passed SCTP check %s, %v", svc.Name, svcPort)
 88                 if util.ServiceTypeHasNodePort(svc) {
 89                         if err := ovn.createGatewayVIPs(svcPort.Protocol, svcPort.NodePort, lbEps.IPs, lbEps.Port); err != nil {
 90                                 klog.Errorf("Error in creating Node Port for svc %s, node port: %d - %v\n", svc.Name, svcPort.NodePort, err)
 91                                 continue
 92                         }
 93                 }
 94                 klog.Infof("BZ2041307 (AddEndpoints): passed GatewayVIP check %s, %v", svc.Name, svcPort)
 95                 if util.ServiceTypeHasClusterIP(svc) {
 96                         klog.Infof("BZ2041307 (AddEndpoints): hasClusterIP %s, %v", svc.Name, svcPort)
 97                         var loadBalancer string
 98                         loadBalancer, err = ovn.getLoadBalancer(svcPort.Protocol)
 99                         if err != nil {
100                                 klog.Errorf("Failed to get load balancer for %s (%v)", svcPort.Protocol, err)
101                                 continue
102                         }
103                         klog.Infof("BZ2041307 (AddEndpoints): loadbalancer %s: %v", svc.Name, loadBalancer)
104                         if err = ovn.createLoadBalancerVIPs(loadBalancer, []string{svc.Spec.ClusterIP}, svcPort.Port, lbEps.IPs, lbEps.Port); err != nil {
105                                 klog.Errorf("Error in creating Cluster IP for svc %s, target port: %d - %v\n", svc.Name, lbEps.Port, err)
106                                 continue
107                         }
108                         klog.Infof("BZ2041307 (AddEndpoints): loadbalancerVIPs passed %s", svc.Name)
109                         vip := util.JoinHostPortInt32(svc.Spec.ClusterIP, svcPort.Port)
110                         klog.Infof("BZ2041307 (AddEndpoints): vip %s = %v", svc.Name, vip)
111                         ovn.AddServiceVIPToName(vip, svcPort.Protocol, svc.Namespace, svc.Name)
 NORMAL  ᚠ BZ2041307-reproducer-4.7.41!  go-controller/pkg/ovn/endpoints.go      
~~~

Unfortunately, an idle service has no endpoints:
~~~
[akaris@linux 2041307]$ oc get ep
NAME         ENDPOINTS                                             AGE
kubernetes   10.0.163.13:6443,10.0.168.127:6443,10.0.253.15:6443   128m
nginx        <none>                                                54m
~~~

So, this call yields an empty protoPortMap for the service:
~~~
71         protoPortMap := ovn.getLbEndpoints(ep)
~~~

~~~
 18 func (ovn *Controller) getLbEndpoints(ep *kapi.Endpoints) map[kapi.Protocol]map[string]lbEndpoints {
 19         protoPortMap := map[kapi.Protocol]map[string]lbEndpoints{
 20                 kapi.ProtocolTCP:  make(map[string]lbEndpoints),
 21                 kapi.ProtocolUDP:  make(map[string]lbEndpoints),
 22                 kapi.ProtocolSCTP: make(map[string]lbEndpoints),
 23         }
 24         klog.Infof("BZ2041307 (getLbEndpoints): ep.Subsets: %v", ep.Subsets)
 25         for _, s := range ep.Subsets {
 26                 klog.Infof("BZ2041307 (getLbEndpoints): %s, s.Addresses: %v", s.Addresses)
 27                 for _, ip := range s.Addresses {
 28                         klog.Infof("BZ2041307 (getLbEndpoints): %s, s.Ports: %v", s.Ports)
 29                         for _, port := range s.Ports {
 30                                 klog.Infof("BZ2041307 (getLbEndpoints): %s, s: %v, ip: %v, port: %v", s, ip, port)
 31                                 var ips []string
 32                                 klog.Infof("BZ2041307 (getLbEndpoints): util.ValidatePort(%v, %v)", port.Protocol, port.Port)
 33                                 if err := util.ValidatePort(port.Protocol, port.Port); err != nil {
 34                                         klog.Errorf("Invalid endpoint port: %s: %v", port.Name, err)
 35                                         continue
 36                                 }
 37                                 if lbEps, ok := protoPortMap[port.Protocol][port.Name]; ok {
 38                                         ips = append(lbEps.IPs, ip.IP)
 39                                 } else {
 40                                         ips = []string{ip.IP}
 41                                 }
 42                                 protoPortMap[port.Protocol][port.Name] = lbEndpoints{IPs: ips, Port: port.Port}
 43                         }
 44                 }
 45         }
 46         klog.V(5).Infof("Endpoint Protocol Map is: %v", protoPortMap)
 47         klog.Infof("BZ2041307 (getLbEndpoints): Endpoint Protocol Map is: %v", protoPortMap)
 48         return protoPortMap
 49 }
 NORMAL  ᚠ BZ2041307-reproducer-4.7.41!  go-controller/pkg/ovn/endpoints.go
~~~

So, when ovnkube-master is restarted, we see only this:
~~~
I0118 11:26:56.889656       1 endpoints.go:70] BZ2041307 (AddEndpoints): nginx, ep: {{ } {nginx  default /api/v1/namespaces/default/endpoints/nginx 1cf60665-4016-4d09-958c-7bbf52e0c801 43233 0 2022-01-18 10:40:03 +0000 UTC <nil> <nil> map[app:nginx] map[idling.alpha.openshift.io/idled-at:2022-01-18T10:42:36Z idling.alpha.openshift.io/unidle-targets:[{"kind":"Deployment","name":"nginx-deployment","group":"apps","replicas":1}]] [] []  [{kube-controller-manager Update v1 2022-01-18 10:41:47 +0000 UTC FieldsV1 {"f:metadata":{"f:annotations":{},"f:labels":{".":{},"f:app":{}}}}} {oc Update v1 2022-01-18 10:42:37 +0000 UTC FieldsV1 {"f:metadata":{"f:annotations":{"f:idling.alpha.openshift.io/idled-at":{},"f:idling.alpha.openshift.io/unidle-targets":{}}}}}]} []}
I0118 11:26:56.889714       1 endpoints.go:24] BZ2041307 (getLbEndpoints): [], ep.Subsets: %!v(MISSING)
I0118 11:26:56.889721       1 endpoints.go:47] BZ2041307 (getLbEndpoints): Endpoint Protocol Map is: map[SCTP:map[] TCP:map[] UDP:map[]]
I0118 11:26:56.889732       1 endpoints.go:73] BZ2041307 (AddEndpoints): Matching service nginx ports: [{http TCP <nil> 27017 {0 80 } 0}]
I0118 11:26:56.889742       1 endpoints.go:74] BZ2041307 (AddEndpoints): nginx, protoPortMap: map[SCTP:map[] TCP:map[] UDP:map[]]
I0118 11:26:56.889751       1 endpoints.go:76] BZ2041307 (AddEndpoints): nginx, svcPort: {http TCP <nil> 27017 {0 80 } 0}
I0118 11:26:56.889759       1 endpoints.go:77] BZ2041307 (AddEndpoints): nginx, protoPortMap: map[SCTP:map[] TCP:map[] UDP:map[]]
~~~

Logically, we never find any endpoints, and so we never add the SVC to oc.serviceVIPToName.
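For illustration, here is a minimal, self-contained sketch of that behavior (simplified stand-in types rather than the real k8s.io/api structs): with ep.Subsets empty, the nested loops in getLbEndpoints never execute, the returned protoPortMap contains only the three empty sub-maps seen in the debug output above, and AddEndpoints never reaches ovn.AddServiceVIPToName.
~~~
// Hedged sketch, not the actual ovn-kubernetes code: stripped-down stand-ins
// for the k8s.io/api Endpoints types and for the getLbEndpoints loop above.
package main

import "fmt"

type endpointAddress struct{ IP string }

type endpointPort struct {
	Name     string
	Port     int32
	Protocol string
}

type endpointSubset struct {
	Addresses []endpointAddress
	Ports     []endpointPort
}

type endpoints struct{ Subsets []endpointSubset }

type lbEndpoints struct {
	IPs  []string
	Port int32
}

func getLbEndpoints(ep *endpoints) map[string]map[string]lbEndpoints {
	protoPortMap := map[string]map[string]lbEndpoints{
		"TCP": {}, "UDP": {}, "SCTP": {},
	}
	// An idled service has ep.Subsets == nil ("nginx   <none>"), so none of
	// these loop bodies ever run and the map stays empty.
	for _, s := range ep.Subsets {
		for _, ip := range s.Addresses {
			for _, port := range s.Ports {
				lbEps := protoPortMap[port.Protocol][port.Name]
				lbEps.IPs = append(lbEps.IPs, ip.IP)
				lbEps.Port = port.Port
				protoPortMap[port.Protocol][port.Name] = lbEps
			}
		}
	}
	return protoPortMap
}

func main() {
	idled := &endpoints{}
	// Prints map[SCTP:map[] TCP:map[] UDP:map[]], the same empty protocol map
	// seen in the BZ2041307 debug logs, so AddServiceVIPToName is never called.
	fmt.Println(getLbEndpoints(idled))
}
~~~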

Comment 8 Andreas Karis 2022-01-18 12:11:37 UTC
This is related to: https://github.com/ovn-org/ovn-kubernetes/pull/2124

Comment 9 Andreas Karis 2022-01-20 11:01:28 UTC
Just to clarify: this isn't an upgrade issue. This happens whenever the ovnkube-master process is restarted while a service is idle. I'll have to investigate if this ever worked in prior versions, but my gut feeling tells me that this was always broken.

Comment 10 Andreas Karis 2022-01-25 17:27:53 UTC
I had a look at the latest 4.5 code from downstream and the code looks very similar to 4.6 and 4.7.

~~~
git checkout downstream/release-4.5
~~~

#### Retrieval logic:

~~~
331 func (oc *Controller) ovnControllerEventChecker() {
332         ticker := time.NewTicker(5 * time.Second)
333 
334         _, _, err := util.RunOVNNbctl("set", "nb_global", ".", "options:controller_event=true")
335         if err != nil {
336                 klog.Error("Unable to enable controller events. Unidling not possible")
337                 return
338         }
339 
340         for {
341                 select {
342                 case <-ticker.C:
343                         out, _, err := util.RunOVNSbctl("--format=json", "list", "controller_event")
344                         if err != nil {
345                                 continue
346                         }
347 
348                         events, err := extractEmptyLBBackendsEvents([]byte(out))
349                         if err != nil || len(events) == 0 {
350                                 continue
351                         }
352 
353                         for _, event := range events {
354                                 _, _, err := util.RunOVNSbctl("destroy", "controller_event", event.uuid)
355                                 if err != nil {
356                                         // Don't unidle until we are able to remove the controller event
357                                         klog.Errorf("Unable to remove controller event %s", event.uuid)
358                                         continue
359                                 }
360                                 if serviceName, ok := oc.GetServiceVIPToName(event.vip, event.protocol); ok {
361                                         serviceRef := kapi.ObjectReference{
362                                                 Kind:      "Service",
363                                                 Namespace: serviceName.Namespace,
364                                                 Name:      serviceName.Name,
365                                         }
366                                         klog.V(5).Infof("Sending a NeedPods event for service %s in namespace %s.", serviceName.Name, serviceName.Namespace)
367                                         oc.recorder.Eventf(&serviceRef, kapi.EventTypeNormal, "NeedPods", "The service %s needs pods", serviceName.Name)
368                                 }
369                         }
370                 case <-oc.stopChan:
371                         return
372                 }
373         }
374 }
375 
376 func podWantsNetwork(pod *kapi.Pod) bool {
377         return !pod.Spec.HostNetwork
378 }
 NORMAL  ᚠ b8ac1b4  go-controller/pkg/ovn/ovn.go 
~~~

Retrieval from map oc.serviceVIPToName is the same as in 4.6/4.7:
~~~
668 // GetServiceVIPToName retrieves the associated k8s service name for a load balancer VIP
669 func (oc *Controller) GetServiceVIPToName(vip string, protocol kapi.Protocol) (types.NamespacedName, bool) {
670         oc.serviceVIPToNameLock.Lock()
671         defer oc.serviceVIPToNameLock.Unlock()
672         namespace, ok := oc.serviceVIPToName[ServiceVIPKey{vip, protocol}]
673         return namespace, ok
674 }
675 
 NORMAL  ᚠ b8ac1b4  go-controller/pkg/ovn/ovn.go 
~~~

#### And populating the cache:

~~~
661 // AddServiceVIPToName associates a k8s service name with a load balancer VIP
662 func (oc *Controller) AddServiceVIPToName(vip string, protocol kapi.Protocol, namespace, name string) {
663         oc.serviceVIPToNameLock.Lock()
664         defer oc.serviceVIPToNameLock.Unlock()
665         oc.serviceVIPToName[ServiceVIPKey{vip, protocol}] = types.NamespacedName{Namespace: namespace, Name: name}
666 }
 NORMAL  ᚠ b8ac1b4  go-controller/pkg/ovn/ovn.go
~~~

Again, there's a single user of that method in go-controller/pkg/ovn/endpoints.go:
~~~
// AddEndpoints adds endpoints and creates corresponding resources in OVN
func (ovn *Controller) AddEndpoints(ep *kapi.Endpoints) error {
        klog.V(5).Infof("Adding endpoints: %s for namespace: %s", ep.Name, ep.Namespace)
        // get service
        // TODO: cache the service
        svc, err := ovn.watchFactory.GetService(ep.Namespace, ep.Name)
        if err != nil {
                // This is not necessarily an error. For e.g when there are endpoints
                // without a corresponding service.
                klog.V(5).Infof("no service found for endpoint %s in namespace %s",
                        ep.Name, ep.Namespace)
                return nil
        }
        if !util.IsClusterIPSet(svc) {
                klog.V(5).Infof("Skipping service %s due to clusterIP = %q",
                        svc.Name, svc.Spec.ClusterIP)
                return nil
        }
        klog.V(5).Infof("Matching service %s found for ep: %s, with cluster IP: %s", svc.Name, ep.Name,
                svc.Spec.ClusterIP)

        protoPortMap := ovn.getLbEndpoints(ep)
        klog.V(5).Infof("Matching service %s ports: %v", svc.Name, svc.Spec.Ports)
        for proto, portMap := range protoPortMap {
                for svcPortName, lbEps := range portMap {
(...)
~~~

Compared to 4.6/4.7, there is a nested loop here. However, this loop iterates over what was retrieved from the endpoints list, hence over an empty map, and so idled services are never added to the cache, just as in 4.6 and 4.7 (we never get to ovn.AddServiceVIPToName).

~~~
(...)
                        ips := lbEps.IPs
                        targetPort := lbEps.Port
                        for _, svcPort := range svc.Spec.Ports {
                                if svcPort.Protocol == proto && svcPort.Name == svcPortName {
                                        if !ovn.SCTPSupport && proto == kapi.ProtocolSCTP {
                                                klog.Errorf("Rejecting endpoint creation for unsupported SCTP protocol: %s, %s", ep.Namespace, ep.Name)
                                                continue
                                        }
                                        if util.ServiceTypeHasNodePort(svc) {
                                                klog.V(5).Infof("Creating Gateways IP for NodePort: %d, %v", svcPort.NodePort, ips)
                                                err = ovn.createGatewayVIPs(svcPort.Protocol, svcPort.NodePort, ips, targetPort)
                                                if err != nil {
                                                        klog.Errorf("Error in creating Node Port for svc %s, node port: %d - %v\n", svc.Name, svcPort.NodePort, err)
                                                        continue
                                                }
                                        }
                                        if util.ServiceTypeHasClusterIP(svc) {
                                                var loadBalancer string
                                                loadBalancer, err = ovn.getLoadBalancer(svcPort.Protocol)
                                                if err != nil {
                                                        klog.Errorf("Failed to get loadbalancer for %s (%v)",
                                                                svcPort.Protocol, err)
                                                        continue
                                                }
                                                err = ovn.createLoadBalancerVIPs(loadBalancer,
                                                        []string{svc.Spec.ClusterIP},
                                                        svcPort.Port, ips, targetPort)
                                                if err != nil {
                                                        klog.Errorf("Error in creating Cluster IP for svc %s, target port: %d - %v\n", svc.Name, targetPort, err)
                                                        continue
                                                }
                                                vip := util.JoinHostPortInt32(svc.Spec.ClusterIP, svcPort.Port)
                                                ovn.handleExternalIPs(svc, svcPort, ips, targetPort, false)
                                        }
                                }
                        }
                }
        }
        return nil
}
~~~

The code for 4.4 ...
~~~
git checkout downstream/release-4.4
~~~
... is basically the same as the code for 4.5 for the relevant sections.

I also ran a spot check against 4.5.41 and it exhibits exactly the same symptoms:
~~~
[akaris@linux 2041307]$ export KUBECONFIG=/home/akaris/development/aws-env/ipi-us-east-1/install-config/auth/kubeconfig
[akaris@linux 2041307]$ oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.5.41    True        False         60m     Cluster version is 4.5.41
[akaris@linux 2041307]$ cat nginx.yaml 
apiVersion: v1
kind: Service
metadata:
  labels:
    app: nginx
  name: nginx
spec:
  ports:
  - port: 27017
    protocol: TCP
    targetPort: 80
    name: "http"
  selector:
    app: nginx
  sessionAffinity: None
  type: ClusterIP
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deployment
  labels:
    app: nginx
spec:
  replicas: 1
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx:1.14.2
        ports:
        - containerPort: 80
[akaris@linux 2041307]$ oc apply -f nginx.yaml 
service/nginx created
deployment.apps/nginx-deployment created
[akaris@linux 2041307]$ oc get pods
NAME                                READY   STATUS    RESTARTS   AGE
nginx-deployment-6b474476c4-wdwgb   1/1     Running   0          10s
[akaris@linux 2041307]$ oc get deployment
NAME               READY   UP-TO-DATE   AVAILABLE   AGE
nginx-deployment   1/1     1            1           14s
[akaris@linux 2041307]$ oc get pods -o wide
NAME                                READY   STATUS    RESTARTS   AGE   IP            NODE                           NOMINATED NODE   READINESS GATES
nginx-deployment-6b474476c4-wdwgb   1/1     Running   0          47s   10.128.2.17   ip-10-0-172-234.ec2.internal   <none>           <none>
[akaris@linux 2041307]$ oc get svc -o wide
NAME         TYPE           CLUSTER-IP     EXTERNAL-IP                            PORT(S)     AGE    SELECTOR
kubernetes   ClusterIP      172.30.0.1     <none>                                 443/TCP     100m   <none>
nginx        ClusterIP      172.30.17.17   <none>                                 27017/TCP   79s    app=nginx
openshift    ExternalName   <none>         kubernetes.default.svc.cluster.local   <none>      85m    <none>
[akaris@linux 2041307]$ oc debug node/ip-10-0-172-234.ec2.internal -- curl 172.30.17.17:27017
Starting pod/ip-10-0-172-234ec2internal-debug ...
To use host binaries, run `chroot /host`
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0<!DOCTYPE html>
<html>
<head>
<title>Welcome to nginx!</title>
<style>
    body {
        width: 35em;
        margin: 0 auto;
        font-family: Tahoma, Verdana, Arial, sans-serif;
    }
</style>
</head>
<body>
<h1>Welcome to nginx!</h1>
<p>If you see this page, the nginx web server is successfully installed and
working. Further configuration is required.</p>

<p>For online documentation and support please refer to
<a href="http://nginx.org/">nginx.org</a>.<br/>
Commercial support is available at
<a href="http://nginx.com/">nginx.com</a>.</p>

<p><em>Thank you for using nginx.</em></p>
</body>
</html>
100   612  100   612    0     0   130k      0 --:--:-- --:--:-- --:--:--  149k

Removing debug pod ...
[akaris@linux 2041307]$ oc get svc -o wide
NAME         TYPE           CLUSTER-IP     EXTERNAL-IP                            PORT(S)     AGE     SELECTOR
kubernetes   ClusterIP      172.30.0.1     <none>                                 443/TCP     103m    <none>
nginx        ClusterIP      172.30.17.17   <none>                                 27017/TCP   3m51s   app=nginx
openshift    ExternalName   <none>         kubernetes.default.svc.cluster.local   <none>      88m     <none>

#### idling and waking up

[akaris@linux 2041307]$ oc idle nginx
The service "default/nginx" has been marked as idled 
The service will unidle Deployment "default/nginx-deployment" to 1 replicas once it receives traffic 
Deployment "default/nginx-deployment" has been idled 
[akaris@linux 2041307]$ oc get pods -o wide
No resources found in default namespace.
[akaris@linux 2041307]$ oc get ep -o wide
NAME         ENDPOINTS                                              AGE
kubernetes   10.0.137.190:6443,10.0.149.204:6443,10.0.172.70:6443   103m
nginx        <none>                                                 4m14s
[akaris@linux 2041307]$ oc get svc -o wide
NAME         TYPE           CLUSTER-IP     EXTERNAL-IP                            PORT(S)     AGE     SELECTOR
kubernetes   ClusterIP      172.30.0.1     <none>                                 443/TCP     103m    <none>
nginx        ClusterIP      172.30.17.17   <none>                                 27017/TCP   4m18s   app=nginx
openshift    ExternalName   <none>         kubernetes.default.svc.cluster.local   <none>      88m     <none>
[akaris@linux 2041307]$ oc get deployment -o wide
NAME               READY   UP-TO-DATE   AVAILABLE   AGE     CONTAINERS   IMAGES         SELECTOR
nginx-deployment   0/0     0            0           4m22s   nginx        nginx:1.14.2   app=nginx
[akaris@linux 2041307]$ oc debug node/ip-10-0-172-234.ec2.internal -- curl -m 60 172.30.17.17:27017
Starting pod/ip-10-0-172-234ec2internal-debug ...
To use host binaries, run `chroot /host`
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   612  100   612    0     0     86      0  0:00:0<!DOCTYPE html>-:--     0
<html>
<head>
<title>Welcome to nginx!</title>
<style>
    body {
        width: 35em;
        margin: 0 auto;
        font-family: Tahoma, Verdana, Arial, sans-serif;
    }
</style>
</head>
<body>
<h1>Welcome to nginx!</h1>
<p>If you see this page, the nginx web server is successfully installed and
working. Further configuration is required.</p>

<p>For online documentation and support please refer to
<a href="http://nginx.org/">nginx.org</a>.<br/>
Commercial support is available at
<a href="http://nginx.com/">nginx.com</a>.</p>

<p><em>Thank you for using nginx.</em></p>
</body>
</html>
7  0:00:07 --:--:--   150

Removing debug pod ...
[akaris@linux 2041307]$ oc get pods -o wide
NAME                                READY   STATUS    RESTARTS   AGE   IP            NODE                           NOMINATED NODE   READINESS GATES
nginx-deployment-6b474476c4-lzt6q   1/1     Running   0          28s   10.128.2.17   ip-10-0-172-234.ec2.internal   <none>           <none>
[akaris@linux 2041307]$ oc get svc -o wide
NAME         TYPE           CLUSTER-IP     EXTERNAL-IP                            PORT(S)     AGE    SELECTOR
kubernetes   ClusterIP      172.30.0.1     <none>                                 443/TCP     104m   <none>
nginx        ClusterIP      172.30.17.17   <none>                                 27017/TCP   5m2s   app=nginx
openshift    ExternalName   <none>         kubernetes.default.svc.cluster.local   <none>      89m    <none>
[akaris@linux 2041307]$ oc get ep -o wide
NAME         ENDPOINTS                                              AGE
kubernetes   10.0.137.190:6443,10.0.149.204:6443,10.0.172.70:6443   104m
nginx        10.128.2.17:80                                         5m7s


#### idling, deleting the ovnkube-master leader; not waking up

[akaris@linux 2041307]$ oc idle nginx
The service "default/nginx" has been marked as idled 
The service will unidle Deployment "default/nginx-deployment" to 1 replicas once it receives traffic 
Deployment "default/nginx-deployment" has been idled 
[akaris@linux 2041307]$ oc get pods
No resources found in default namespace.
[akaris@linux 2041307]$ oc get ep
NAME         ENDPOINTS                                              AGE
kubernetes   10.0.137.190:6443,10.0.149.204:6443,10.0.172.70:6443   106m
nginx        <none>                                                 7m12s
[akaris@linux 2041307]$ oc get svc
NAME         TYPE           CLUSTER-IP     EXTERNAL-IP                            PORT(S)     AGE
kubernetes   ClusterIP      172.30.0.1     <none>                                 443/TCP     106m
nginx        ClusterIP      172.30.17.17   <none>                                 27017/TCP   7m13s
openshift    ExternalName   <none>         kubernetes.default.svc.cluster.local   <none>      91m

[akaris@linux 2041307]$  oc delete pod -n openshift-ovn-kubernetes ovnkube-master-kfkt
(...)

[akaris@linux 2041307]$ oc get pods -n openshift-ovn-kubernetes
NAME                   READY   STATUS    RESTARTS   AGE
ovnkube-master-7d6xp   4/4     Running   1          99m
ovnkube-master-9plqt   2/4     Running   0          28s
ovnkube-master-j4dvq   4/4     Running   0          99m
ovnkube-node-7q265     2/2     Running   0          81m
ovnkube-node-9wlfj     2/2     Running   0          99m
ovnkube-node-ghfq7     2/2     Running   0          99m
ovnkube-node-q74sk     2/2     Running   0          81m
ovnkube-node-wb8qt     2/2     Running   0          99m
ovnkube-node-z9q44     2/2     Running   0          81m
ovs-node-5crdp         1/1     Running   0          99m
ovs-node-skjfh         1/1     Running   0          81m
ovs-node-sm4b9         1/1     Running   0          81m
ovs-node-ssfnd         1/1     Running   0          81m
ovs-node-sz6sf         1/1     Running   0          99m
ovs-node-vxhv7         1/1     Running   0          99m
[akaris@linux 2041307]$ oc get pods -n openshift-ovn-kubernetes
NAME                   READY   STATUS    RESTARTS   AGE
ovnkube-master-7d6xp   4/4     Running   1          100m
ovnkube-master-9plqt   4/4     Running   0          56s
ovnkube-master-j4dvq   4/4     Running   0          100m
ovnkube-node-7q265     2/2     Running   0          82m
ovnkube-node-9wlfj     2/2     Running   0          100m
ovnkube-node-ghfq7     2/2     Running   0          100m
ovnkube-node-q74sk     2/2     Running   0          82m
ovnkube-node-wb8qt     2/2     Running   0          100m
ovnkube-node-z9q44     2/2     Running   0          82m
ovs-node-5crdp         1/1     Running   0          100m
ovs-node-skjfh         1/1     Running   0          82m
ovs-node-sm4b9         1/1     Running   0          82m
ovs-node-ssfnd         1/1     Running   0          82m
ovs-node-sz6sf         1/1     Running   0          100m
ovs-node-vxhv7         1/1     Running   0          100m
[akaris@linux 2041307]$  oc logs -c ovnkube-master -n openshift-ovn-kubernetes ovnkube-master-9plqt | tail
I0125 17:05:00.997515       1 pods.go:231] [openshift-kube-apiserver/revision-pruner-5-ip-10-0-149-204.ec2.internal] addLogicalPort took 84.984982ms
I0125 17:05:01.046245       1 pods.go:231] [openshift-machine-api/machine-api-operator-69fcc8cb7d-49mwj] addLogicalPort took 62.329628ms
I0125 17:05:01.055968       1 pods.go:231] [openshift-service-ca/service-ca-69d9df78f9-htnb8] addLogicalPort took 72.216816ms
I0125 17:05:01.074806       1 pods.go:231] [openshift-dns/dns-default-4zvb9] addLogicalPort took 86.241346ms
I0125 17:05:01.075224       1 pods.go:231] [openshift-kube-controller-manager/revision-pruner-5-ip-10-0-172-70.ec2.internal] addLogicalPort took 83.423011ms
I0125 17:05:01.078542       1 pods.go:231] [openshift-dns/dns-default-4wfrt] addLogicalPort took 90.12651ms
I0125 17:05:01.091201       1 pods.go:231] [openshift-ingress/router-default-8695db5c69-fdvzz] addLogicalPort took 97.682271ms
I0125 17:05:01.108856       1 pods.go:231] [openshift-kube-scheduler/installer-5-ip-10-0-172-70.ec2.internal] addLogicalPort took 52.834518ms
I0125 17:05:01.117518       1 pods.go:231] [openshift-service-catalog-removed/openshift-service-catalog-apiserver-remover-9g478] addLogicalPort took 38.925562ms
I0125 17:05:02.132551       1 loadbalancer.go:222] Existing Service Reject ACL found: 0700a1a1-93d9-4d63-beec-1b0885480571 for 5c3f1e7d-3170-4a29-9dfd-38890a2d8129-172.30.17.17:27017
[akaris@linux 2041307]$ oc debug node/ip-10-0-172-234.ec2.internal -- curl -m 60 172.30.17.17:27017
Starting pod/ip-10-0-172-234ec2internal-debug ...
To use host binaries, run `chroot /host`
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:--  0:01:00 --:--:--     0
curl: (28) Connection timed out after 60001 milliseconds

Removing debug pod ...
[akaris@linux 2041307]$ oc debug node/ip-10-0-172-234.ec2.internal -- curl -m 60 172.30.17.17:27017
Starting pod/ip-10-0-172-234ec2internal-debug ...
To use host binaries, run `chroot /host`
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:--  0:01:00 --:--:--     0
curl: (28) Connection timed out after 60001 milliseconds

Removing debug pod ...
[akaris@linux 2041307]$ oc get pods
No resources found in default namespace.
[akaris@linux 2041307]$ oc get svc
NAME         TYPE           CLUSTER-IP     EXTERNAL-IP                            PORT(S)     AGE
kubernetes   ClusterIP      172.30.0.1     <none>                                 443/TCP     112m
nginx        ClusterIP      172.30.17.17   <none>                                 27017/TCP   13m
openshift    ExternalName   <none>         kubernetes.default.svc.cluster.local   <none>      97m
[akaris@linux 2041307]$ oc get ep
NAME         ENDPOINTS                                              AGE
kubernetes   10.0.137.190:6443,10.0.149.204:6443,10.0.172.70:6443   113m
nginx        <none>                                                 13m
[akaris@linux 2041307]$ 
~~~

Monitoring the sbdb while that's happening:
~~~
[akaris@linux ipi-us-east-1]$ oc rsh -n openshift-ovn-kubernetes -c sbdb ovnkube-master-j4dvq
sh-4.2# ovn-sbctl list controller_event
sh-4.2# for i in {0..300}; do echo "." ; ovn-sbctl list controller_event; sleep 1 ; done
.
.
.
.
.
.
.
.
.
.
.
.
_uuid               : 64cfba2c-2f06-4c4a-bc1e-2a2ff276ab93
chassis             : bcd347cb-12cc-4b07-896a-62d85fdbee4a
event_info          : {load_balancer="5c3f1e7d-3170-4a29-9dfd-38890a2d8129", protocol=tcp, vip="172.30.17.17:27017"}
event_type          : empty_lb_backends
seq_num             : 7
.
_uuid               : 64cfba2c-2f06-4c4a-bc1e-2a2ff276ab93
chassis             : bcd347cb-12cc-4b07-896a-62d85fdbee4a
event_info          : {load_balancer="5c3f1e7d-3170-4a29-9dfd-38890a2d8129", protocol=tcp, vip="172.30.17.17:27017"}
event_type          : empty_lb_backends
seq_num             : 7
.
_uuid               : 64cfba2c-2f06-4c4a-bc1e-2a2ff276ab93
chassis             : bcd347cb-12cc-4b07-896a-62d85fdbee4a
event_info          : {load_balancer="5c3f1e7d-3170-4a29-9dfd-38890a2d8129", protocol=tcp, vip="172.30.17.17:27017"}
event_type          : empty_lb_backends
seq_num             : 7
.
.
.
.
.
_uuid               : 885cf534-156e-4d33-adb9-503c2565a144
chassis             : bcd347cb-12cc-4b07-896a-62d85fdbee4a
event_info          : {load_balancer="5c3f1e7d-3170-4a29-9dfd-38890a2d8129", protocol=tcp, vip="172.30.17.17:27017"}
event_type          : empty_lb_backends
seq_num             : 8
.
.
.
.
.
.
.
.
_uuid               : 62c2c817-8ab0-4178-90cd-d757a91a631e
chassis             : bcd347cb-12cc-4b07-896a-62d85fdbee4a
event_info          : {load_balancer="5c3f1e7d-3170-4a29-9dfd-38890a2d8129", protocol=tcp, vip="172.30.17.17:27017"}
event_type          : empty_lb_backends
seq_num             : 9
.
_uuid               : 62c2c817-8ab0-4178-90cd-d757a91a631e
chassis             : bcd347cb-12cc-4b07-896a-62d85fdbee4a
event_info          : {load_balancer="5c3f1e7d-3170-4a29-9dfd-38890a2d8129", protocol=tcp, vip="172.30.17.17:27017"}
event_type          : empty_lb_backends
seq_num             : 9
.
_uuid               : 62c2c817-8ab0-4178-90cd-d757a91a631e
chassis             : bcd347cb-12cc-4b07-896a-62d85fdbee4a
event_info          : {load_balancer="5c3f1e7d-3170-4a29-9dfd-38890a2d8129", protocol=tcp, vip="172.30.17.17:27017"}
event_type          : empty_lb_backends
seq_num             : 9
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
_uuid               : 71d6b249-1f75-43f3-af8a-30ffff8bd195
chassis             : bcd347cb-12cc-4b07-896a-62d85fdbee4a
event_info          : {load_balancer="5c3f1e7d-3170-4a29-9dfd-38890a2d8129", protocol=tcp, vip="172.30.17.17:27017"}
event_type          : empty_lb_backends
seq_num             : 10
.
.
.
.
.
.
.
.
.
.
^C
sh-4.2# 
~~~
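The recurring empty_lb_backends rows above are the controller events that ovn-controller raises whenever traffic hits a load-balancer VIP that has no backends; the sequence number keeps incrementing while the service's endpoints stay empty, which is consistent with ovnkube-master not acting on the events after the restart. As an additional check, the NB DB can be queried for the idled VIP. A diagnostic sketch, assuming the NB DB container is named nbdb (next to the sbdb container used above) and reusing the pod name and VIP from this reproducer:
~~~
# Diagnostic sketch: confirm that the idled Service VIP still maps to an empty
# backend list in the NB DB (pod name, container name, and VIP are assumptions
# taken from this reproducer).
oc rsh -n openshift-ovn-kubernetes -c nbdb ovnkube-master-j4dvq
# inside the container:
ovn-nbctl list load_balancer | grep -B 5 '172.30.17.17:27017'
# an idled service typically shows the VIP with no backends, for example:
#   vips                : {"172.30.17.17:27017"=""}
~~~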

Comment 12 Andreas Karis 2022-01-26 09:58:08 UTC
Hello docs team,

We need to document a shortcoming of OVNKubernetes on OCP 4.6 and 4.7. The following issue is present in all versions of OVN Kubernetes <= 4.7 and is fixed in version 4.8 and later.

Issue description:

Idled applications (https://docs.openshift.com/container-platform/4.7/applications/idling-applications.html) do not automatically wake up when they receive traffic after the ovnkube-master pods are restarted or the ovnkube-master leader changes.
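For context, a minimal sketch of the failure mode with placeholder names; it assumes the ovnkube-master pods carry the app=ovnkube-master label:
~~~
# Idle the application's service, then restart the ovnkube-master pods
# (all names in angle brackets are placeholders).
oc idle <service_name> -n <namespace>
oc delete pod -n openshift-ovn-kubernetes -l app=ovnkube-master

# After the masters come back, traffic to the Service's ClusterIP no longer
# wakes the backing pods: the endpoints stay empty and connections time out,
# as shown in the curl output above.
oc get endpoints <service_name> -n <namespace>
oc debug node/<node_name> -- curl -m 60 <cluster_ip>:<service_port>
~~~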

Comment 13 Andreas Karis 2022-01-26 10:00:48 UTC
Documentation team, I created knowledge base solution https://access.redhat.com/solutions/6671241. Please decide if it should be enough to document this in a KCS, or if you want to integrate this as part of our 4.6 and 4.7 documentation. Either way, the Knowledge Base Solution should contain all the details that you need in order to document this issue.

Thanks,

Andreas

Comment 15 Andrew Taylor 2022-04-26 17:57:44 UTC
Hey Andreas, thanks for the hard work on this. 

As I understand your request, you're stating that idled applications do not automatically wake up. As a result, you must manually scale the deployment or other resources back up by using oc scale.

At the very bottom of the page you linked in comment 12, it states: 

~~~
Applications can also be manually unidled by scaling up the resources.

Procedure
To scale up a DeploymentConfig, run:


$ oc scale --replicas=1 dc <dc_name>
~~~

https://docs.openshift.com/container-platform/4.7/applications/idling-applications.html#idle-unidling-applications_idling-applications


Is this not exactly what you're asking us to document?
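
For reference, the documented command targets a DeploymentConfig; the same manual step applies to other workload kinds, for example a Deployment (names here are illustrative):
~~~
$ oc scale --replicas=1 deployment/<deployment_name> -n <namespace>
~~~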

Comment 17 Andrew Taylor 2022-05-23 16:05:28 UTC
Hi Andreas, 

Can you follow up on my above comment when you get a chance? 

Thanks

Comment 18 Andrew Taylor 2022-06-09 14:18:13 UTC
Per comment 11, I'd like to ask Feng Pan to review my feedback in comment 15. It seems we have already documented the suggested updates:

https://docs.openshift.com/container-platform/4.7/applications/idling-applications.html#idle-unidling-applications_idling-applications

Please correct me if my understanding of the docs request is incorrect.

Comment 19 Andreas Karis 2022-06-09 14:57:37 UTC
Hi! Sorry, I don't know why I missed this.

What we need to document (or refer to the KCS) is what I wrote in here: https://access.redhat.com/solutions/6671241

So, we need to document somewhere that automatic unidling with OVN-Kubernetes in 4.6 and 4.7 doesn't work (and in fact never worked).

https://bugzilla.redhat.com/show_bug.cgi?id=2041307#c13
~~~
Documentation team, I created knowledge base solution https://access.redhat.com/solutions/6671241. Please decide if it should be enough to document this in a KCS, or if you want to integrate this as part of our 4.6 and 4.7 documentation. Either way, the Knowledge Base Solution should contain all the details that you need in order to document this issue.
~~~

So, what I'm asking for is ideally another note at the bottom of the following page:
https://docs.openshift.com/container-platform/4.7/applications/idling-applications.html#idle-unidling-applications_idling-applications

NOTE:
Automatic unidling by a router is currently only supported by the default HAProxy router.

NOTE:
Services do not support automatic unidling if you configure Kuryr-Kubernetes as an SDN.

NOTE:
In OpenShift 4.6 and 4.7, idled services do not become active again when they receive network traffic if the leader ovnkube-master process was restarted, or the ovnkube-master leader changed, while the services were idled. For further details, refer to https://access.redhat.com/solutions/6671241
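
For reference, the manual way to wake such a service is to scale its workload back up yourself, as in comment 15. A sketch with illustrative names, relying on the idling controller recording the scale target in the service's idling.alpha.openshift.io/unidle-targets annotation:
~~~
# Find out which workload the idled service expects to be scaled back up
# (names in angle brackets are placeholders).
oc get svc <service_name> -n <namespace> -o yaml | grep unidle-targets
#   idling.alpha.openshift.io/unidle-targets: '[{"kind":"ReplicationController","name":"<rc_name>","replicas":1}]'

# Scale that target back up to wake the application.
oc scale --replicas=1 rc/<rc_name> -n <namespace>
~~~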

Comment 20 Andrew Taylor 2022-06-09 16:41:29 UTC
Thanks for explaining that, Andreas!

I've created a pull request for your review: 
https://github.com/openshift/openshift-docs/pull/46520

Comment 21 Andrew Taylor 2022-06-09 17:04:09 UTC
Setting ON_QA as well.

Comment 22 Andreas Karis 2022-06-09 17:48:35 UTC
looks good to me :)

Comment 23 zhaozhanqi 2022-06-15 13:23:01 UTC
Docs look good to me. Moving this to verified.

Comment 24 Andrew Taylor 2022-06-15 14:33:26 UTC
These changes are now live. I will close this bug as current release. Thanks for your help, Andreas!

https://docs.openshift.com/container-platform/4.7/applications/idling-applications.html#idle-unidling-applications_idling-applications

Comment 25 Miciah Dashiel Butler Masters 2022-07-19 16:30:33 UTC
*** Bug 2107657 has been marked as a duplicate of this bug. ***

