The pod cannot be woken up after a 4.6/4.7 cluster is upgraded to 4.7.41; this does not appear to be exactly the same as the issue above.

~~~
$ oc get ep -n idle-upgrade
NAME           ENDPOINTS   AGE
test-service   <none>      17h

$ oc get svc -n idle-upgrade -o yaml
apiVersion: v1
items:
- apiVersion: v1
  kind: Service
  metadata:
    annotations:
      idling.alpha.openshift.io/idled-at: "2022-01-16T11:06:29Z"
      idling.alpha.openshift.io/unidle-targets: '[{"kind":"ReplicationController","name":"test-rc","replicas":2}]'
    creationTimestamp: "2022-01-16T11:06:23Z"
    labels:
      name: test-service
    name: test-service
    namespace: idle-upgrade
    resourceVersion: "43749"
    selfLink: /api/v1/namespaces/idle-upgrade/services/test-service
    uid: f3c9bd55-3aae-41a4-8e05-431cbd562ccc
  spec:
    clusterIP: 172.30.4.73
    clusterIPs:
    - 172.30.4.73
    ports:
    - name: http
      port: 27017
      protocol: TCP
      targetPort: 8080
    selector:
      name: test-pods
    sessionAffinity: None
    type: ClusterIP
  status:
    loadBalancer: {}
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""
~~~

#### Trying to access the service from an OVN node, but it times out ####
~~~
sh-4.4# curl --connect-timeout 10 172.30.4.73:27017
curl: (28) Connection timed out after 10001 milliseconds
sh-4.4#
~~~

#### The pod cannot be woken up ####
~~~
$ oc get ep -n idle-upgrade
NAME           ENDPOINTS   AGE
test-service   <none>      17h
~~~

Here are the must-gather logs: http://file.apac.redhat.com/~zzhao/must-gather.local.3054145436653881521.tar.gz
Hi there, Could you generate and attach an oc adm inspect for that namespace? ~~~ namespace=idle-upgrade exclude_list="secrets" oc adm inspect -n $namespace $(oc api-resources --verbs=get,list --namespaced=true | tail -n+2 | egrep -v "$exclude_list" | awk '{print $1}' | tr '\n' ',' | sed 's/,$//') ~~~ Thanks! - Andreas
Also, while still in the erroneous state, can you provide an OVN / network must-gather so that I can have a look at the databases: https://access.redhat.com/solutions/5459251 ~~~ oc adm must-gather -- /usr/bin/gather_network_logs ~~~
Hi there, Just some further info. I can reproduce this after an upgrade, as well. I upgraded from 4.6.53 to 4.7.41. I idle the SVC first, then upgrade the cluster: ~~~ [akaris@linux 2041307]$ oc get pods NAME READY STATUS RESTARTS AGE ip-10-0-139-82ec2internal-debug 1/1 Running 0 12m nginx-deployment-66b6c48dd5-cdbdz 1/1 Running 0 110s [akaris@linux 2041307]$ oc get deployment NAME READY UP-TO-DATE AVAILABLE AGE nginx-deployment 1/1 1 1 14m [akaris@linux 2041307]$ oc get -o yaml svc nginx apiVersion: v1 kind: Service metadata: annotations: kubectl.kubernetes.io/last-applied-configuration: | {"apiVersion":"v1","kind":"Service","metadata":{"annotations":{},"labels":{"app":"nginx"},"name":"nginx","namespace":"default"},"spec":{"ports":[{"name":"http","port":27017,"protocol":"TCP","targetPort":80}],"selector":{"app":"nginx"},"sessionAffinity":"None","type":"ClusterIP"}} creationTimestamp: "2022-01-17T19:20:06Z" labels: app: nginx name: nginx namespace: default resourceVersion: "35183" selfLink: /api/v1/namespaces/default/services/nginx uid: bd9056a3-ad65-410c-a7a9-b3cb71e88a7d spec: clusterIP: 172.30.174.162 ports: - name: http port: 27017 protocol: TCP targetPort: 80 selector: app: nginx sessionAffinity: None type: ClusterIP status: loadBalancer: {} [akaris@linux 2041307]$ oc get pods No resources found in default namespace. [akaris@linux 2041307]$ oc get deployment NAME READY UP-TO-DATE AVAILABLE AGE nginx-deployment 0/0 0 0 16m [akaris@linux 2041307]$ oc adm upgrade --to-image=quay.io/openshift-release-dev/ocp-release:4.7.41-x86_64 --force --allow-explicit-upgrade warning: Using by-tag pull specs is dangerous, and while we still allow it in combination with --force for backward compatibility, it would be much safer to pass a by-digest pull spec instead warning: The requested upgrade image is not one of the available updates.You have used --allow-explicit-upgrade for the update to proceed anyway warning: --force overrides cluster verification of your supplied release image and waives any update precondition failures. Updating to release image quay.io/openshift-release-dev/ocp-release:4.7.41-x86_64 [akaris@linux 2041307]$ oc get clusterversion NAME VERSION AVAILABLE PROGRESSING SINCE STATUS version 4.6.53 True True 4s Working towards quay.io/openshift-release-dev/ocp-release:4.7.41-x86_64: downloading update ~~~ After the upgrade is done, I try to unidle the SVC by connecting to it: ~~~ [akaris@linux 2041307]$ oc debug node/ip-10-0-139-82.ec2.internal Starting pod/ip-10-0-139-82ec2internal-debug ... To use host binaries, run `chroot /host` Pod IP: 10.0.139.82 If you don't see a command prompt, try pressing enter. (failed reverse-i-search)`curl': ^C sh-4.4# time curl 172.30.174.162:27017 ^C real 1m9.746s user 0m0.005s sys 0m0.005s sh-4.4# ~~~ The OVN part does its job. However, we see that the events are deleted from the SB DB, and then OVN recreates them, and so forth: ~~~ [akaris@linux 2041307]$ oc rsh -n openshift-ovn-kubernetes -c sbdb ovnkube-master-zlkbw sh-4.4# for i in {0..300}; do echo "." ; ovn-sbctl list controller_event; sleep 1 ; done . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . _uuid : a8ae2e61-b38e-492e-9f0b-d54eac67e1ae chassis : f76ccada-3bc2-4fbd-bf77-bb2eed28bc4a event_info : {load_balancer="6ed1feda-9cd7-45cc-9847-81a9a20f167b", protocol=tcp, vip="172.30.174.162:27017"} event_type : empty_lb_backends seq_num : 2 . 
_uuid : a8ae2e61-b38e-492e-9f0b-d54eac67e1ae chassis : f76ccada-3bc2-4fbd-bf77-bb2eed28bc4a event_info : {load_balancer="6ed1feda-9cd7-45cc-9847-81a9a20f167b", protocol=tcp, vip="172.30.174.162:27017"} event_type : empty_lb_backends seq_num : 2 . _uuid : a8ae2e61-b38e-492e-9f0b-d54eac67e1ae chassis : f76ccada-3bc2-4fbd-bf77-bb2eed28bc4a event_info : {load_balancer="6ed1feda-9cd7-45cc-9847-81a9a20f167b", protocol=tcp, vip="172.30.174.162:27017"} event_type : empty_lb_backends seq_num : 2 . _uuid : a8ae2e61-b38e-492e-9f0b-d54eac67e1ae chassis : f76ccada-3bc2-4fbd-bf77-bb2eed28bc4a event_info : {load_balancer="6ed1feda-9cd7-45cc-9847-81a9a20f167b", protocol=tcp, vip="172.30.174.162:27017"} event_type : empty_lb_backends seq_num : 2 . . . _uuid : 66d045a0-258a-404d-8493-1a3e7bf1e13a chassis : f76ccada-3bc2-4fbd-bf77-bb2eed28bc4a event_info : {load_balancer="6ed1feda-9cd7-45cc-9847-81a9a20f167b", protocol=tcp, vip="172.30.174.162:27017"} event_type : empty_lb_backends seq_num : 3 . _uuid : 66d045a0-258a-404d-8493-1a3e7bf1e13a chassis : f76ccada-3bc2-4fbd-bf77-bb2eed28bc4a event_info : {load_balancer="6ed1feda-9cd7-45cc-9847-81a9a20f167b", protocol=tcp, vip="172.30.174.162:27017"} event_type : empty_lb_backends seq_num : 3 . _uuid : 66d045a0-258a-404d-8493-1a3e7bf1e13a chassis : f76ccada-3bc2-4fbd-bf77-bb2eed28bc4a event_info : {load_balancer="6ed1feda-9cd7-45cc-9847-81a9a20f167b", protocol=tcp, vip="172.30.174.162:27017"} event_type : empty_lb_backends seq_num : 3 . . . . . . _uuid : c86eee8a-6b78-4fa5-8e41-4603335e96a4 chassis : f76ccada-3bc2-4fbd-bf77-bb2eed28bc4a event_info : {load_balancer="6ed1feda-9cd7-45cc-9847-81a9a20f167b", protocol=tcp, vip="172.30.174.162:27017"} event_type : empty_lb_backends seq_num : 4 . _uuid : c86eee8a-6b78-4fa5-8e41-4603335e96a4 chassis : f76ccada-3bc2-4fbd-bf77-bb2eed28bc4a event_info : {load_balancer="6ed1feda-9cd7-45cc-9847-81a9a20f167b", protocol=tcp, vip="172.30.174.162:27017"} event_type : empty_lb_backends seq_num : 4 . _uuid : c86eee8a-6b78-4fa5-8e41-4603335e96a4 chassis : f76ccada-3bc2-4fbd-bf77-bb2eed28bc4a event_info : {load_balancer="6ed1feda-9cd7-45cc-9847-81a9a20f167b", protocol=tcp, vip="172.30.174.162:27017"} event_type : empty_lb_backends seq_num : 4 . _uuid : c86eee8a-6b78-4fa5-8e41-4603335e96a4 chassis : f76ccada-3bc2-4fbd-bf77-bb2eed28bc4a event_info : {load_balancer="6ed1feda-9cd7-45cc-9847-81a9a20f167b", protocol=tcp, vip="172.30.174.162:27017"} event_type : empty_lb_backends seq_num : 4 . _uuid : c86eee8a-6b78-4fa5-8e41-4603335e96a4 chassis : f76ccada-3bc2-4fbd-bf77-bb2eed28bc4a event_info : {load_balancer="6ed1feda-9cd7-45cc-9847-81a9a20f167b", protocol=tcp, vip="172.30.174.162:27017"} event_type : empty_lb_backends seq_num : 4 . . . . . . . . . . . . _uuid : e7cfbbd9-525a-495d-b259-b71e227899ac chassis : f76ccada-3bc2-4fbd-bf77-bb2eed28bc4a event_info : {load_balancer="6ed1feda-9cd7-45cc-9847-81a9a20f167b", protocol=tcp, vip="172.30.174.162:27017"} event_type : empty_lb_backends seq_num : 5 . _uuid : e7cfbbd9-525a-495d-b259-b71e227899ac chassis : f76ccada-3bc2-4fbd-bf77-bb2eed28bc4a event_info : {load_balancer="6ed1feda-9cd7-45cc-9847-81a9a20f167b", protocol=tcp, vip="172.30.174.162:27017"} event_type : empty_lb_backends seq_num : 5 . _uuid : e7cfbbd9-525a-495d-b259-b71e227899ac chassis : f76ccada-3bc2-4fbd-bf77-bb2eed28bc4a event_info : {load_balancer="6ed1feda-9cd7-45cc-9847-81a9a20f167b", protocol=tcp, vip="172.30.174.162:27017"} event_type : empty_lb_backends seq_num : 5 . 
_uuid : e7cfbbd9-525a-495d-b259-b71e227899ac chassis : f76ccada-3bc2-4fbd-bf77-bb2eed28bc4a event_info : {load_balancer="6ed1feda-9cd7-45cc-9847-81a9a20f167b", protocol=tcp, vip="172.30.174.162:27017"} event_type : empty_lb_backends seq_num : 5 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . _uuid : f6fd5410-c5f8-449f-9279-ebedf17c79ff chassis : f76ccada-3bc2-4fbd-bf77-bb2eed28bc4a event_info : {load_balancer="6ed1feda-9cd7-45cc-9847-81a9a20f167b", protocol=tcp, vip="172.30.174.162:27017"} event_type : empty_lb_backends seq_num : 6 . . . . . . ^C sh-4.4# ~~~ The OVNKUBE layer does not generate the necessary event (the event below is from before the upgrade): ~~~ [akaris@linux 2041307]$ oc get events | grep Need 155m Normal NeedPods service/nginx The service nginx needs pods ~~~ Log level 5: ~~~ [akaris@linux 2041307]$ oc get configmap -n openshift-ovn-kubernetes env-overrides -o yaml apiVersion: v1 data: _master: | # This sets the log level for the ovn-kubernetes master process as well as the ovn-dbchecker: OVN_KUBE_LOG_LEVEL=5 kind: ConfigMap metadata: annotations: kubectl.kubernetes.io/last-applied-configuration: | {"apiVersion":"v1","data":{"_master":"# This sets the log level for the ovn-kubernetes master process as well as the ovn-dbchecker:\nOVN_KUBE_LOG_LEVEL=5\n"},"kind":"ConfigMap","metadata":{"annotations":{},"name":"env-overrides","namespace":"openshift-ovn-kubernetes"}} creationTimestamp: "2022-01-17T19:15:50Z" name: env-overrides namespace: openshift-ovn-kubernetes resourceVersion: "30197" selfLink: /api/v1/namespaces/openshift-ovn-kubernetes/configmaps/env-overrides uid: e3df0f4a-5df7-4654-be49-c6e113273ba7 ~~~ The logs do not show this here: ~~~ [akaris@linux 2041307]$ oc logs -n openshift-ovn-kubernetes ovnkube-master-h7g4w -c ovnkube-master | grep 'Sending a NeedPods event for service' [akaris@linux 2041307]$ ~~~ But neither do they show an error: ~~~ [akaris@linux 2041307]$ oc logs -n openshift-ovn-kubernetes ovnkube-master-h7g4w -c ovnkube-master | grep 'Unable to remove controller' [akaris@linux 2041307]$ ~~~ But the deletion is happening: ~~~ [akaris@linux 2041307]$ oc logs -n openshift-ovn-kubernetes ovnkube-master-h7g4w -c ovnkube-master | grep 'destroy' I0117 21:40:06.019387 1 ovs.go:169] exec(1457): /usr/bin/ovn-sbctl --private-key=/ovn-cert/tls.key --certificate=/ovn-cert/tls.crt --bootstrap-ca-cert=/ovn-ca/ca-bundle.crt --db=ssl:10.0.128.219:9642,ssl:10.0.156.145:9642,ssl:10.0.173.46:9642 --timeout=15 destroy controller_event c37c0d86-dfc5-4fb0-a439-bd07ad284d7b I0117 21:40:11.006009 1 ovs.go:169] exec(1459): /usr/bin/ovn-sbctl --private-key=/ovn-cert/tls.key --certificate=/ovn-cert/tls.crt --bootstrap-ca-cert=/ovn-ca/ca-bundle.crt --db=ssl:10.0.128.219:9642,ssl:10.0.156.145:9642,ssl:10.0.173.46:9642 --timeout=15 destroy controller_event a8ae2e61-b38e-492e-9f0b-d54eac67e1ae I0117 21:40:16.018151 1 ovs.go:169] exec(1461): /usr/bin/ovn-sbctl --private-key=/ovn-cert/tls.key --certificate=/ovn-cert/tls.crt --bootstrap-ca-cert=/ovn-ca/ca-bundle.crt --db=ssl:10.0.128.219:9642,ssl:10.0.156.145:9642,ssl:10.0.173.46:9642 --timeout=15 destroy controller_event 66d045a0-258a-404d-8493-1a3e7bf1e13a I0117 21:40:26.011394 1 ovs.go:169] exec(1464): /usr/bin/ovn-sbctl --private-key=/ovn-cert/tls.key --certificate=/ovn-cert/tls.crt --bootstrap-ca-cert=/ovn-ca/ca-bundle.crt --db=ssl:10.0.128.219:9642,ssl:10.0.156.145:9642,ssl:10.0.173.46:9642 --timeout=15 destroy controller_event c86eee8a-6b78-4fa5-8e41-4603335e96a4 I0117 21:40:41.013367 1 ovs.go:169] exec(1468): 
/usr/bin/ovn-sbctl --private-key=/ovn-cert/tls.key --certificate=/ovn-cert/tls.crt --bootstrap-ca-cert=/ovn-ca/ca-bundle.crt --db=ssl:10.0.128.219:9642,ssl:10.0.156.145:9642,ssl:10.0.173.46:9642 --timeout=15 destroy controller_event e7cfbbd9-525a-495d-b259-b71e227899ac I0117 21:41:11.014046 1 ovs.go:169] exec(1475): /usr/bin/ovn-sbctl --private-key=/ovn-cert/tls.key --certificate=/ovn-cert/tls.crt --bootstrap-ca-cert=/ovn-ca/ca-bundle.crt --db=ssl:10.0.128.219:9642,ssl:10.0.156.145:9642,ssl:10.0.173.46:9642 --timeout=15 destroy controller_event f6fd5410-c5f8-449f-9279-ebedf17c79ff ~~~ Hence, the failure (or rather the symptom of the failure) is somewhere here in the code. Given that the destroy succeeds, but that we do not see the Infof log "Sending a NeedPods ...": ~~~ 541 func (oc *Controller) ovnControllerEventChecker() { 542 ticker := time.NewTicker(5 * time.Second) 543 544 _, _, err := util.RunOVNNbctl("set", "nb_global", ".", "options:controller_event=true") 545 if err != nil { 546 klog.Error("Unable to enable controller events. Unidling not possible") 547 return 548 } 549 550 for { 551 select { 552 case <-ticker.C: 553 out, _, err := util.RunOVNSbctl("--format=json", "list", "controller_event") 554 if err != nil { 555 continue 556 } 557 558 events, err := extractEmptyLBBackendsEvents([]byte(out)) 559 if err != nil || len(events) == 0 { 560 continue 561 } 562 563 for _, event := range events { 564 _, _, err := util.RunOVNSbctl("destroy", "controller_event", event.uuid) 565 if err != nil { 566 // Don't unidle until we are able to remove the controller event 567 klog.Errorf("Unable to remove controller event %s", event.uuid) 568 continue 569 } 570 if serviceName, ok := oc.GetServiceVIPToName(event.vip, event.protocol); ok { 571 serviceRef := kapi.ObjectReference{ 572 Kind: "Service", 573 Namespace: serviceName.Namespace, 574 Name: serviceName.Name, 575 } 576 klog.V(5).Infof("Sending a NeedPods event for service %s in namespace %s.", serviceName.Name, serviceName.Namespace) 577 oc.recorder.Eventf(&serviceRef, kapi.EventTypeNormal, "NeedPods", "The service %s needs pods", serviceName.Name) 578 } 579 } 580 case <-oc.stopChan: 581 return 582 } 583 } 584 } 585 586 func podScheduled(pod *kapi.Pod) bool { 587 return pod.Spec.NodeName != "" 588 } NORMAL ᚠ cd1a152 go-controller/pkg/ovn/ovn.go ~~~ That means that oc.GetServiceVIPToName fails and that most likely, we are not finding a match here: ~~~ 570 if serviceName, ok := oc.GetServiceVIPToName(event.vip, event.protocol); ok { ~~~ And that would indicate that the map entry is perhaps missing: ~~~ 1234 // GetServiceVIPToName retrieves the associated k8s service name for a load balancer VIP 1235 func (oc *Controller) GetServiceVIPToName(vip string, protocol kapi.Protocol) (types.NamespacedName, bool) { 1236 oc.serviceVIPToNameLock.Lock() 1237 defer oc.serviceVIPToNameLock.Unlock() 1238 namespace, ok := oc.serviceVIPToName[ServiceVIPKey{vip, protocol}] 1239 return namespace, ok 1240 } ~~~ I'll continue my research tomorrow.
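To make the silent-failure path described above concrete, here is a minimal, self-contained sketch of the lookup that drops the controller_event. The types and names are simplified stand-ins for the ovn-kubernetes originals (ServiceVIPKey, GetServiceVIPToName), so treat it as an illustration of the mechanism rather than the actual implementation:
~~~
package main

import (
	"fmt"
	"sync"
)

// serviceVIPKey stands in for ServiceVIPKey: the "clusterIP:port" string plus protocol.
type serviceVIPKey struct {
	vip      string // e.g. "172.30.174.162:27017"
	protocol string // e.g. "TCP"
}

type namespacedName struct {
	Namespace, Name string
}

type controller struct {
	mu               sync.Mutex
	serviceVIPToName map[serviceVIPKey]namespacedName
}

// getServiceVIPToName mirrors GetServiceVIPToName: a plain map lookup under a lock.
func (c *controller) getServiceVIPToName(vip, proto string) (namespacedName, bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	nn, ok := c.serviceVIPToName[serviceVIPKey{vip, proto}]
	return nn, ok
}

func main() {
	// After an ovnkube-master restart with an idled service, the cache was never
	// repopulated for that VIP, so the map is effectively empty for it.
	c := &controller{serviceVIPToName: map[serviceVIPKey]namespacedName{}}

	if svc, ok := c.getServiceVIPToName("172.30.174.162:27017", "TCP"); ok {
		fmt.Printf("would send NeedPods for %s/%s\n", svc.Namespace, svc.Name)
	} else {
		// ok == false: the event loop destroys the controller_event and then simply
		// skips it - no NeedPods event and no error log, matching what the logs show.
		fmt.Println("no match in serviceVIPToName; controller_event dropped silently")
	}
}
~~~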
Update - this is not an upgrade related issue. This happens in both 4.7.41 and 4.6.53 when a service is idled, and when the ovnkube master process is restarted: ~~~ [akaris@linux 2041307]$ oc idle nginx Deployment "default/nginx-deployment" has been idled [akaris@linux 2041307]$ oc get pods NAME READY STATUS RESTARTS AGE ip-10-0-141-63us-west-1computeinternal-debug 1/1 Running 0 45s [akaris@linux 2041307]$ oc get deployment NAME READY UP-TO-DATE AVAILABLE AGE nginx-deployment 0/0 0 0 94s [akaris@linux 2041307]$ oc get pods NAME READY STATUS RESTARTS AGE ip-10-0-141-63us-west-1computeinternal-debug 1/1 Running 0 50s [akaris@linux 2041307]$ oc get pods -n openshift-ovn-kubernetes NAME READY STATUS RESTARTS AGE ovnkube-master-c968t 6/6 Running 6 31m ovnkube-master-vnvm9 6/6 Running 1 31m ovnkube-master-wxn75 6/6 Running 6 31m ovnkube-node-6xds7 3/3 Running 0 21m ovnkube-node-c8hcj 3/3 Running 0 31m ovnkube-node-l2xhj 3/3 Running 0 21m ovnkube-node-wf8xx 3/3 Running 0 31m ovnkube-node-whjxv 3/3 Running 0 31m ovnkube-node-zfv55 3/3 Running 0 24m [akaris@linux 2041307]$ oc delete pod -n openshift-ovn-kubernetes -l app=ovnkube-master pod "ovnkube-master-c968t" deleted pod "ovnkube-master-vnvm9" deleted pod "ovnkube-master-wxn75" deleted [akaris@linux 2041307]$ oc get pods -n openshift-ovn-kubernetes NAME READY STATUS RESTARTS AGE ovnkube-master-gv5wn 6/6 Running 0 115s ovnkube-master-v7r2h 6/6 Running 0 116s ovnkube-master-w8594 6/6 Running 0 108s ovnkube-node-6xds7 3/3 Running 0 24m ovnkube-node-c8hcj 3/3 Running 0 34m ovnkube-node-l2xhj 3/3 Running 0 24m ovnkube-node-wf8xx 3/3 Running 0 34m ovnkube-node-whjxv 3/3 Running 0 34m ovnkube-node-zfv55 3/3 Running 0 26m ~~~ ~~~ sh-4.4# time curl 172.30.175.221:27017 ^C real 0m44.482s user 0m0.003s sys 0m0.003s ~~~ And then, nothing shows up: ~~~ [akaris@linux 2041307]$ oc get deployments NAME READY UP-TO-DATE AVAILABLE AGE nginx-deployment 0/0 0 0 7m47s [akaris@linux 2041307]$ oc get deployments NAME READY UP-TO-DATE AVAILABLE AGE nginx-deployment 0/0 0 0 7m49s [akaris@linux 2041307]$ oc get deployments NAME READY UP-TO-DATE AVAILABLE AGE nginx-deployment 0/0 0 0 7m50s [akaris@linux 2041307]$ oc get deployments NAME READY UP-TO-DATE AVAILABLE AGE nginx-deployment 0/0 0 0 8m3s [akaris@linux 2041307]$ oc get deployments NAME READY UP-TO-DATE AVAILABLE AGE nginx-deployment 0/0 0 0 8m5s [akaris@linux 2041307]$ oc get deployments NAME READY UP-TO-DATE AVAILABLE AGE nginx-deployment 0/0 0 0 8m6s [akaris@linux 2041307]$ oc get deployments NAME READY UP-TO-DATE AVAILABLE AGE nginx-deployment 0/0 0 0 8m8s [akaris@linux 2041307]$ oc get deployments NAME READY UP-TO-DATE AVAILABLE AGE nginx-deployment 0/0 0 0 8m9s [akaris@linux 2041307]$ oc get deployments NAME READY UP-TO-DATE AVAILABLE AGE nginx-deployment 0/0 0 0 8m18s [akaris@linux 2041307]$ oc get deployments NAME READY UP-TO-DATE AVAILABLE AGE nginx-deployment 0/0 0 0 8m21s [akaris@linux 2041307]$ oc get deployments NAME READY UP-TO-DATE AVAILABLE AGE nginx-deployment 0/0 0 0 8m23s [akaris@linux 2041307]$ oc get deployments NAME READY UP-TO-DATE AVAILABLE AGE nginx-deployment 0/0 0 0 8m25s (reverse-i-search)`dele': oc ^Clete pod -n openshift-ovn-kubernetes -l app=ovnkube-master [akaris@linux 2041307]$ oc get deployment NAME READY UP-TO-DATE AVAILABLE AGE nginx-deployment 0/0 0 0 10m [akaris@linux 2041307]$ ~~~
Here's what's happening in the OCP 4.6 and 4.7 code. When the ovnkube daemon is restarted, it must add endpoints in AddEndpoints to build the map / cache in oc.serviceVIPToName. See #5 for the retrieval logic later on. Here is AddServiceVIPToName: ~~~ 1227 // AddServiceVIPToName associates a k8s service name with a load balancer VIP 1228 func (oc *Controller) AddServiceVIPToName(vip string, protocol kapi.Protocol, namespace, name string) { 1229 oc.serviceVIPToNameLock.Lock() 1230 defer oc.serviceVIPToNameLock.Unlock() 1231 klog.Infof("BZ2041307 (AddServiceVIPToName): oc.serviceVIPToName[%v] = %v", ServiceVIPKey{vip, protocol}, types.NamespacedName{Namespace: namespace, Name: name}) 1232 oc.serviceVIPToName[ServiceVIPKey{vip, protocol}] = types.NamespacedName{Namespace: namespace, Name: name} 1233 } 1234 1235 // GetServiceVIPToName retrieves the associated k8s service name for a load balancer VIP NORMAL ᚠ BZ2041307-reproducer-4.7.41! go-controller/pkg/ovn/ovn.go go utf-8[unix] 90% ☰ 1228/1356 ㏑ : 28 1 go-controller/pkg/ovn/endpoints.go|111 col 8| AddEndpoints ~~~ That method has a single caller, AddEndpoints: ~~~ 51 // AddEndpoints adds endpoints and creates corresponding resources in OVN 52 func (ovn *Controller) AddEndpoints(ep *kapi.Endpoints) error { 53 klog.Infof("Adding endpoints: %s for namespace: %s", ep.Name, ep.Namespace) 54 // get service 55 // TODO: cache the service 56 svc, err := ovn.watchFactory.GetService(ep.Namespace, ep.Name) 57 if err != nil { 58 // This is not necessarily an error. For e.g when there are endpoints 59 // without a corresponding service. 60 klog.V(5).Infof("No service found for endpoint %s in namespace %s", ep.Name, ep.Namespace) 61 return nil 62 } 63 if !util.IsClusterIPSet(svc) { 64 klog.V(5).Infof("Skipping service %s due to clusterIP = %q", svc.Name, svc.Spec.ClusterIP) 65 return nil 66 } 67 68 klog.V(5).Infof("Matching service %s found for ep: %s, with cluster IP: %s", svc.Name, ep.Name, svc.Spec.ClusterIP) 69 70 klog.Infof("BZ2041307 (AddEndpoints): %s, ep: %v", svc.Name, *ep) 71 protoPortMap := ovn.getLbEndpoints(ep) 72 klog.V(5).Infof("Matching service %s ports: %v", svc.Name, svc.Spec.Ports) 73 klog.Infof("BZ2041307 (AddEndpoints): Matching service %s ports: %v", svc.Name, svc.Spec.Ports) 74 klog.Infof("BZ2041307 (AddEndpoints): %s, protoPortMap: %v", svc.Name, protoPortMap) 75 for _, svcPort := range svc.Spec.Ports { 76 klog.Infof("BZ2041307 (AddEndpoints): %s, svcPort: %v", svc.Name, svcPort) 77 klog.Infof("BZ2041307 (AddEndpoints): %s, protoPortMap: %v", svc.Name, protoPortMap) 78 lbEps, isFound := protoPortMap[svcPort.Protocol][svcPort.Name] 79 if !isFound { 80 continue 81 ... } ~~~ AddEndpoints will continue the loop at this point if protoPortMap does not contain a match for the service. Otherwise, it will continue all the way down and eventually run `ovn.AddServiceVIPToName(vip, svcPort.Protocol, svc.Namespace, svc.Name)`: ~~~ ... 
82 klog.Infof("BZ2041307 (AddEndpoints): isFound, %s, %v", svc.Name, svcPort) 83 if !ovn.SCTPSupport && svcPort.Protocol == kapi.ProtocolSCTP { 84 klog.Errorf("Rejecting endpoint creation for unsupported SCTP protocol: %s, %s", ep.Namespace, ep.Name) 85 continue 86 } 87 klog.Infof("BZ2041307 (AddEndpoints): passed SCTP check %s, %v", svc.Name, svcPort) 88 if util.ServiceTypeHasNodePort(svc) { 89 if err := ovn.createGatewayVIPs(svcPort.Protocol, svcPort.NodePort, lbEps.IPs, lbEps.Port); err != nil { 90 klog.Errorf("Error in creating Node Port for svc %s, node port: %d - %v\n", svc.Name, svcPort.NodePort, err) 91 continue 92 } 93 } 94 klog.Infof("BZ2041307 (AddEndpoints): passed GatewayVIP check %s, %v", svc.Name, svcPort) 95 if util.ServiceTypeHasClusterIP(svc) { 96 klog.Infof("BZ2041307 (AddEndpoints): hasClusterIP %s, %v", svc.Name, svcPort) 97 var loadBalancer string 98 loadBalancer, err = ovn.getLoadBalancer(svcPort.Protocol) 99 if err != nil { 100 klog.Errorf("Failed to get load balancer for %s (%v)", svcPort.Protocol, err) 101 continue 102 } 103 klog.Infof("BZ2041307 (AddEndpoints): loadbalancer %s: %v", svc.Name, loadBalancer) 104 if err = ovn.createLoadBalancerVIPs(loadBalancer, []string{svc.Spec.ClusterIP}, svcPort.Port, lbEps.IPs, lbEps.Port); err != nil { 105 klog.Errorf("Error in creating Cluster IP for svc %s, target port: %d - %v\n", svc.Name, lbEps.Port, err) 106 continue 107 } 108 klog.Infof("BZ2041307 (AddEndpoints): loadbalancerVIPs passed %s", svc.Name) 109 vip := util.JoinHostPortInt32(svc.Spec.ClusterIP, svcPort.Port) 110 klog.Infof("BZ2041307 (AddEndpoints): vip %s = %v", svc.Name, vip) 111 ovn.AddServiceVIPToName(vip, svcPort.Protocol, svc.Namespace, svc.Name) NORMAL ᚠ BZ2041307-reproducer-4.7.41! go-controller/pkg/ovn/endpoints.go ~~~ Unfortunately, an idle service has no endpoints: ~~~ [akaris@linux 2041307]$ oc get ep NAME ENDPOINTS AGE kubernetes 10.0.163.13:6443,10.0.168.127:6443,10.0.253.15:6443 128m nginx <none> 54m ~~~ So, this here yields an empty protoPortMap for the service: ~~~ 71 protoPortMap := ovn.getLbEndpoints(ep) ~~~ ~~~ 18 func (ovn *Controller) getLbEndpoints(ep *kapi.Endpoints) map[kapi.Protocol]map[string]lbEndpoints { 19 protoPortMap := map[kapi.Protocol]map[string]lbEndpoints{ 20 kapi.ProtocolTCP: make(map[string]lbEndpoints), 21 kapi.ProtocolUDP: make(map[string]lbEndpoints), 22 kapi.ProtocolSCTP: make(map[string]lbEndpoints), 23 } 24 klog.Infof("BZ2041307 (getLbEndpoints): ep.Subsets: %v", ep.Subsets) 25 for _, s := range ep.Subsets { 26 klog.Infof("BZ2041307 (getLbEndpoints): %s, s.Addresses: %v", s.Addresses) 27 for _, ip := range s.Addresses { 28 klog.Infof("BZ2041307 (getLbEndpoints): %s, s.Ports: %v", s.Ports) 29 for _, port := range s.Ports { 30 klog.Infof("BZ2041307 (getLbEndpoints): %s, s: %v, ip: %v, port: %v", s, ip, port) 31 var ips []string 32 klog.Infof("BZ2041307 (getLbEndpoints): util.ValidatePort(%v, %v)", port.Protocol, port.Port) 33 if err := util.ValidatePort(port.Protocol, port.Port); err != nil { 34 klog.Errorf("Invalid endpoint port: %s: %v", port.Name, err) 35 continue 36 } 37 if lbEps, ok := protoPortMap[port.Protocol][port.Name]; ok { 38 ips = append(lbEps.IPs, ip.IP) 39 } else { 40 ips = []string{ip.IP} 41 } 42 protoPortMap[port.Protocol][port.Name] = lbEndpoints{IPs: ips, Port: port.Port} 43 } 44 } 45 } 46 klog.V(5).Infof("Endpoint Protocol Map is: %v", protoPortMap) 47 klog.Infof("BZ2041307 (getLbEndpoints): Endpoint Protocol Map is: %v", protoPortMap) 48 return protoPortMap 49 } NORMAL ᚠ 
BZ2041307-reproducer-4.7.41! go-controller/pkg/ovn/endpoints.go ~~~ So, when ovnkube-master is restarted, we see only this: ~~~ I0118 11:26:56.889656 1 endpoints.go:70] BZ2041307 (AddEndpoints): nginx, ep: {{ } {nginx default /api/v1/namespaces/default/endpoints/nginx 1cf60665-4016-4d09-958c-7bbf52e0c801 43233 0 2022-01-18 10:40:03 +0000 UTC <nil> <nil> map[app:nginx] map[idling.alpha.openshift.io/idled-at:2022-01-18T10:42:36Z idling.alpha.openshift.io/unidle-targets:[{"kind":"Deployment","name":"nginx-deployment","group":"apps","replicas":1}]] [] [] [{kube-controller-manager Update v1 2022-01-18 10:41:47 +0000 UTC FieldsV1 {"f:metadata":{"f:annotations":{},"f:labels":{".":{},"f:app":{}}}}} {oc Update v1 2022-01-18 10:42:37 +0000 UTC FieldsV1 {"f:metadata":{"f:annotations":{"f:idling.alpha.openshift.io/idled-at":{},"f:idling.alpha.openshift.io/unidle-targets":{}}}}}]} []} I0118 11:26:56.889714 1 endpoints.go:24] BZ2041307 (getLbEndpoints): [], ep.Subsets: %!v(MISSING) I0118 11:26:56.889721 1 endpoints.go:47] BZ2041307 (getLbEndpoints): Endpoint Protocol Map is: map[SCTP:map[] TCP:map[] UDP:map[]] I0118 11:26:56.889732 1 endpoints.go:73] BZ2041307 (AddEndpoints): Matching service nginx ports: [{http TCP <nil> 27017 {0 80 } 0}] I0118 11:26:56.889742 1 endpoints.go:74] BZ2041307 (AddEndpoints): nginx, protoPortMap: map[SCTP:map[] TCP:map[] UDP:map[]] I0118 11:26:56.889751 1 endpoints.go:76] BZ2041307 (AddEndpoints): nginx, svcPort: {http TCP <nil> 27017 {0 80 } 0} I0118 11:26:56.889759 1 endpoints.go:77] BZ2041307 (AddEndpoints): nginx, protoPortMap: map[SCTP:map[] TCP:map[] UDP:map[]] ~~~ We never find any endpoints (logically) and we never add the SVC to oc.serviceVIPToName.
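A hypothetical way around this restart gap would be to populate the VIP-to-name cache from the Service object itself, which still exists while the service is idled, instead of relying solely on AddEndpoints. The sketch below only reuses helpers that appear in the code above (util.IsClusterIPSet, util.JoinHostPortInt32, AddServiceVIPToName); the function name registerServiceVIPs and the idea of calling it from the service watcher are assumptions for illustration, not the actual 4.8 fix:
~~~
// registerServiceVIPs is a hypothetical helper (name and call site assumed):
// it records the cluster-IP VIPs of a service in oc.serviceVIPToName from the
// Service spec alone, so an idled service with no endpoints would still be
// registered after an ovnkube-master restart.
func (oc *Controller) registerServiceVIPs(svc *kapi.Service) {
	if !util.IsClusterIPSet(svc) {
		return
	}
	for _, svcPort := range svc.Spec.Ports {
		// The cluster IP and service port are known without any endpoints.
		vip := util.JoinHostPortInt32(svc.Spec.ClusterIP, svcPort.Port)
		oc.AddServiceVIPToName(vip, svcPort.Protocol, svc.Namespace, svc.Name)
	}
}
~~~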
This is related to: https://github.com/ovn-org/ovn-kubernetes/pull/2124
Just to clarify: this isn't an upgrade issue. This happens whenever the ovnkube-master process is restarted while a service is idle. I'll have to investigate if this ever worked in prior versions, but my gut feeling tells me that this was always broken.
I had a look at the latest 4.5 code from downstream and the code looks very similar to 4.6 and 4.7. ~~~ git checkout downstream/release-4.5 ~~~ #### Retrieval logic: ~~~ 331 func (oc *Controller) ovnControllerEventChecker() { 332 ticker := time.NewTicker(5 * time.Second) 333 334 _, _, err := util.RunOVNNbctl("set", "nb_global", ".", "options:controller_event=true") 335 if err != nil { 336 klog.Error("Unable to enable controller events. Unidling not possible") 337 return 338 } 339 340 for { 341 select { 342 case <-ticker.C: 343 out, _, err := util.RunOVNSbctl("--format=json", "list", "controller_event") 344 if err != nil { 345 continue 346 } 347 348 events, err := extractEmptyLBBackendsEvents([]byte(out)) 349 if err != nil || len(events) == 0 { 350 continue 351 } 352 353 for _, event := range events { 354 _, _, err := util.RunOVNSbctl("destroy", "controller_event", event.uuid) 355 if err != nil { 356 // Don't unidle until we are able to remove the controller event 357 klog.Errorf("Unable to remove controller event %s", event.uuid) 358 continue 359 } 360 if serviceName, ok := oc.GetServiceVIPToName(event.vip, event.protocol); ok { 361 serviceRef := kapi.ObjectReference{ 362 Kind: "Service", 363 Namespace: serviceName.Namespace, 364 Name: serviceName.Name, 365 } 366 klog.V(5).Infof("Sending a NeedPods event for service %s in namespace %s. ", serviceName.Name, serviceName.Namespace) 367 oc.recorder.Eventf(&serviceRef, kapi.EventTypeNormal, "NeedPods", "The se rvice %s needs pods", serviceName.Name) 368 } 369 } 370 case <-oc.stopChan: 371 return 372 } 373 } 374 } 375 376 func podWantsNetwork(pod *kapi.Pod) bool { 377 return !pod.Spec.HostNetwork 378 } NORMAL ᚠ b8ac1b4 go-controller/pkg/ovn/ovn.go ~~~ Retrieval from map oc.serviceVIPToName is the same as in 4.6/4.7: ~~~ 668 // GetServiceVIPToName retrieves the associated k8s service name for a load balancer VIP 669 func (oc *Controller) GetServiceVIPToName(vip string, protocol kapi.Protocol) (types.NamespacedName, bool) { 670 oc.serviceVIPToNameLock.Lock() 671 defer oc.serviceVIPToNameLock.Unlock() 672 namespace, ok := oc.serviceVIPToName[ServiceVIPKey{vip, protocol}] 673 return namespace, ok 674 } 675 NORMAL ᚠ b8ac1b4 go-controller/pkg/ovn/ovn.go ~~~ #### And populating the cache: ~~~ 661 // AddServiceVIPToName associates a k8s service name with a load balancer VIP 662 func (oc *Controller) AddServiceVIPToName(vip string, protocol kapi.Protocol, namespace, name string) { 663 oc.serviceVIPToNameLock.Lock() 664 defer oc.serviceVIPToNameLock.Unlock() 665 oc.serviceVIPToName[ServiceVIPKey{vip, protocol}] = types.NamespacedName{Namespace: namespace, Name: name } 666 } NORMAL ᚠ b8ac1b4 go-controller/pkg/ovn/ovn.go ~~~ Again, there's a single user of that method in go-controller/pkg/ovn/endpoints.go: ~~~ // AddEndpoints adds endpoints and creates corresponding resources in OVN func (ovn *Controller) AddEndpoints(ep *kapi.Endpoints) error { klog.V(5).Infof("Adding endpoints: %s for namespace: %s", ep.Name, ep.Namespace) // get service // TODO: cache the service svc, err := ovn.watchFactory.GetService(ep.Namespace, ep.Name) if err != nil { // This is not necessarily an error. For e.g when there are endpoints // without a corresponding service. 
klog.V(5).Infof("no service found for endpoint %s in namespace %s", ep.Name, ep.Namespace) return nil } if !util.IsClusterIPSet(svc) { klog.V(5).Infof("Skipping service %s due to clusterIP = %q", svc.Name, svc.Spec.ClusterIP) return nil } klog.V(5).Infof("Matching service %s found for ep: %s, with cluster IP: %s", svc.Name, ep.Name, svc.Spec.ClusterIP) protoPortMap := ovn.getLbEndpoints(ep) klog.V(5).Infof("Matching service %s ports: %v", svc.Name, svc.Spec.Ports) for proto, portMap := range protoPortMap { for svcPortName, lbEps := range portMap { (...) ~~~ ... compared to 4.6/4.7, there is a nested loop here. However, this loop uses what's retrieved from the endpoint list, hence an empty map, and thus idle services will not be added to the map, the same as in 4.6 and 4.7 (we will never get to ovn.AddServiceVIPToName). ~~~ (...) ips := lbEps.IPs targetPort := lbEps.Port for _, svcPort := range svc.Spec.Ports { if svcPort.Protocol == proto && svcPort.Name == svcPortName { if !ovn.SCTPSupport && proto == kapi.ProtocolSCTP { klog.Errorf("Rejecting endpoint creation for unsupported SCTP protocol: %s, %s", ep.Namespace, ep.Name) continue } if util.ServiceTypeHasNodePort(svc) { klog.V(5).Infof("Creating Gateways IP for NodePort: %d, %v", svcPort.NodePort, ips) err = ovn.createGatewayVIPs(svcPort.Protocol, svcPort.NodePort, ips, targetPort) if err != nil { klog.Errorf("Error in creating Node Port for svc %s, node port: %d - %v\n", svc.Name, svcPort.NodePort, err) continue } } if util.ServiceTypeHasClusterIP(svc) { var loadBalancer string loadBalancer, err = ovn.getLoadBalancer(svcPort.Protocol) if err != nil { klog.Errorf("Failed to get loadbalancer for %s (%v)", svcPort.Protocol, err) continue } err = ovn.createLoadBalancerVIPs(loadBalancer, []string{svc.Spec.ClusterIP}, svcPort.Port, ips, targetPort) if err != nil { klog.Errorf("Error in creating Cluster IP for svc %s, target port: %d - %v\n", svc.Name, targetPort, err) continue } vip := util.JoinHostPortInt32(svc.Spec.ClusterIP, svcPort.Port) ovn.handleExternalIPs(svc, svcPort, ips, targetPort, false) } } } } } return nil } ~~~ The code for 4.4 ... ~~~ git checkout downstream/release-4.4 ~~~ ... is basically the same as the code for 4.5 for the relevant sections. 
I also ran a spotcheck against 4.5.41 and it exhibits exactly the same symptoms: ~~~ [akaris@linux 2041307]$ export KUBECONFIG=/home/akaris/development/aws-env/ipi-us-east-1/install-config/auth/kubeconfig [akaris@linux 2041307]$ oc get clusterversion oc apply NAME VERSION AVAILABLE PROGRESSING SINCE STATUS version 4.5.41 True False 60m Cluster version is 4.5.41 [akaris@linux 2041307]$ cat nginx.yaml apiVersion: v1 kind: Service metadata: labels: app: nginx name: nginx spec: ports: - port: 27017 protocol: TCP targetPort: 80 name: "http" selector: app: nginx sessionAffinity: None type: ClusterIP --- apiVersion: apps/v1 kind: Deployment metadata: name: nginx-deployment labels: app: nginx spec: replicas: 1 selector: matchLabels: app: nginx template: metadata: labels: app: nginx spec: containers: - name: nginx image: nginx:1.14.2 ports: - containerPort: 80 [akaris@linux 2041307]$ oc apply -f nginx.yaml service/nginx created deployment.apps/nginx-deployment created [akaris@linux 2041307]$ oc get pods NAME READY STATUS RESTARTS AGE nginx-deployment-6b474476c4-wdwgb 1/1 Running 0 10s [akaris@linux 2041307]$ oc get deployment NAME READY UP-TO-DATE AVAILABLE AGE nginx-deployment 1/1 1 1 14s [akaris@linux 2041307]$ oc get pods -o wide NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES nginx-deployment-6b474476c4-wdwgb 1/1 Running 0 47s 10.128.2.17 ip-10-0-172-234.ec2.internal <none> <none> [akaris@linux 2041307]$ oc get svc -o wide NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE SELECTOR kubernetes ClusterIP 172.30.0.1 <none> 443/TCP 100m <none> nginx ClusterIP 172.30.17.17 <none> 27017/TCP 79s app=nginx openshift ExternalName <none> kubernetes.default.svc.cluster.local <none> 85m <none> [akaris@linux 2041307]$ oc debug node/ip-10-0-172-234.ec2.internal -- curl 172.30.17.17:27017 Starting pod/ip-10-0-172-234ec2internal-debug ... To use host binaries, run `chroot /host` % Total % Received % Xferd Average Speed Time Time Time Current Dload Upload Total Spent Left Speed 0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0<!DOCTYPE html> <html> <head> <title>Welcome to nginx!</title> <style> body { width: 35em; margin: 0 auto; font-family: Tahoma, Verdana, Arial, sans-serif; } </style> </head> <body> <h1>Welcome to nginx!</h1> <p>If you see this page, the nginx web server is successfully installed and working. Further configuration is required.</p> <p>For online documentation and support please refer to <a href="http://nginx.org/">nginx.org</a>.<br/> Commercial support is available at <a href="http://nginx.com/">nginx.com</a>.</p> <p><em>Thank you for using nginx.</em></p> </body> </html> 100 612 100 612 0 0 130k 0 --:--:-- --:--:-- --:--:-- 149k Removing debug pod ... [akaris@linux 2041307]$ oc get svc -o wide oc NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE SELECTOR kubernetes ClusterIP 172.30.0.1 <none> 443/TCP 103m <none> nginx ClusterIP 172.30.17.17 <none> 27017/TCP 3m51s app=nginx openshift ExternalName <none> kubernetes.default.svc.cluster.local <none> 88m <none> #### idling and waking up [akaris@linux 2041307]$ oc idle nginx The service "default/nginx" has been marked as idled The service will unidle Deployment "default/nginx-deployment" to 1 replicas once it receives traffic Deployment "default/nginx-deployment" has been idled [akaris@linux 2041307]$ oc get pods -o wide No resources found in default namespace. 
[akaris@linux 2041307]$ oc get ep -o wide NAME ENDPOINTS AGE kubernetes 10.0.137.190:6443,10.0.149.204:6443,10.0.172.70:6443 103m nginx <none> 4m14s [akaris@linux 2041307]$ oc get svc -o wide NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE SELECTOR kubernetes ClusterIP 172.30.0.1 <none> 443/TCP 103m <none> nginx ClusterIP 172.30.17.17 <none> 27017/TCP 4m18s app=nginx openshift ExternalName <none> kubernetes.default.svc.cluster.local <none> 88m <none> [akaris@linux 2041307]$ oc get deployment -o wide NAME READY UP-TO-DATE AVAILABLE AGE CONTAINERS IMAGES SELECTOR nginx-deployment 0/0 0 0 4m22s nginx nginx:1.14.2 app=nginx [akaris@linux 2041307]$ oc debug node/ip-10-0-172-234.ec2.internal -- curl -m 60 172.30.17.17:27017 Starting pod/ip-10-0-172-234ec2internal-debug ... To use host binaries, run `chroot /host` % Total % Received % Xferd Average Speed Time Time Time Current Dload Upload Total Spent Left Speed 100 612 100 612 0 0 86 0 0:00:0<!DOCTYPE html>-:-- 0 <html> <head> <title>Welcome to nginx!</title> <style> body { width: 35em; margin: 0 auto; font-family: Tahoma, Verdana, Arial, sans-serif; } </style> </head> <body> <h1>Welcome to nginx!</h1> <p>If you see this page, the nginx web server is successfully installed and working. Further configuration is required.</p> <p>For online documentation and support please refer to <a href="http://nginx.org/">nginx.org</a>.<br/> Commercial support is available at <a href="http://nginx.com/">nginx.com</a>.</p> <p><em>Thank you for using nginx.</em></p> </body> </html> 7 0:00:07 --:--:-- 150 Removing debug pod ... [akaris@linux 2041307]$ oc get pods -o wide NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES nginx-deployment-6b474476c4-lzt6q 1/1 Running 0 28s 10.128.2.17 ip-10-0-172-234.ec2.internal <none> <none> [akaris@linux 2041307]$ oc get svc -o wide NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE SELECTOR kubernetes ClusterIP 172.30.0.1 <none> 443/TCP 104m <none> nginx ClusterIP 172.30.17.17 <none> 27017/TCP 5m2s app=nginx openshift ExternalName <none> kubernetes.default.svc.cluster.local <none> 89m <none> [akaris@linux 2041307]$ oc get ep -o wide NAME ENDPOINTS AGE kubernetes 10.0.137.190:6443,10.0.149.204:6443,10.0.172.70:6443 104m nginx 10.128.2.17:80 5m7s #### idling, deleting the ovnkube master leader ; not waking up [akaris@linux 2041307]$ oc idle nginx The service "default/nginx" has been marked as idled The service will unidle Deployment "default/nginx-deployment" to 1 replicas once it receives traffic Deployment "default/nginx-deployment" has been idled [akaris@linux 2041307]$ oc get pods oNo resources found in default namespace. c[akaris@linux 2041307]$ oc get ep oc NAME ENDPOINTS AGE kubernetes 10.0.137.190:6443,10.0.149.204:6443,10.0.172.70:6443 106m nginx <none> 7m12s [akaris@linux 2041307]$ oc get svc NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE kubernetes ClusterIP 172.30.0.1 <none> 443/TCP 106m nginx ClusterIP 172.30.17.17 <none> 27017/TCP 7m13s openshift ExternalName <none> kubernetes.default.svc.cluster.local <none> 91m [akaris@linux 2041307]$ oc delete pod -n openshift-ovn-kubernetes ovnkube-master-kfkt (...) 
[akaris@linux 2041307]$ oc get pods -n openshift-ovn-kubernetes NAME READY STATUS RESTARTS AGE ovnkube-master-7d6xp 4/4 Running 1 99m ovnkube-master-9plqt 2/4 Running 0 28s ovnkube-master-j4dvq 4/4 Running 0 99m ovnkube-node-7q265 2/2 Running 0 81m ovnkube-node-9wlfj 2/2 Running 0 99m ovnkube-node-ghfq7 2/2 Running 0 99m ovnkube-node-q74sk 2/2 Running 0 81m ovnkube-node-wb8qt 2/2 Running 0 99m ovnkube-node-z9q44 2/2 Running 0 81m ovs-node-5crdp 1/1 Running 0 99m ovs-node-skjfh 1/1 Running 0 81m ovs-node-sm4b9 1/1 Running 0 81m ovs-node-ssfnd 1/1 Running 0 81m ovs-node-sz6sf 1/1 Running 0 99m ovs-node-vxhv7 1/1 Running 0 99m [akaris@linux 2041307]$ oc get pods -n openshift-ovn-kubernetes NAME READY STATUS RESTARTS AGE ovnkube-master-7d6xp 4/4 Running 1 100m ovnkube-master-9plqt 4/4 Running 0 56s ovnkube-master-j4dvq 4/4 Running 0 100m ovnkube-node-7q265 2/2 Running 0 82m ovnkube-node-9wlfj 2/2 Running 0 100m ovnkube-node-ghfq7 2/2 Running 0 100m ovnkube-node-q74sk 2/2 Running 0 82m ovnkube-node-wb8qt 2/2 Running 0 100m ovnkube-node-z9q44 2/2 Running 0 82m ovs-node-5crdp 1/1 Running 0 100m ovs-node-skjfh 1/1 Running 0 82m ovs-node-sm4b9 1/1 Running 0 82m ovs-node-ssfnd 1/1 Running 0 82m ovs-node-sz6sf 1/1 Running 0 100m ovs-node-vxhv7 1/1 Running 0 100m [akaris@linux 2041307]$ oc logs -c ovnkube-master -n openshift-ovn-kubernetes ovnkube-master-9plqt | tail I0125 17:05:00.997515 1 pods.go:231] [openshift-kube-apiserver/revision-pruner-5-ip-10-0-149-204.ec2.internal] addLogicalPort took 84.984982ms I0125 17:05:01.046245 1 pods.go:231] [openshift-machine-api/machine-api-operator-69fcc8cb7d-49mwj] addLogicalPort took 62.329628ms I0125 17:05:01.055968 1 pods.go:231] [openshift-service-ca/service-ca-69d9df78f9-htnb8] addLogicalPort took 72.216816ms I0125 17:05:01.074806 1 pods.go:231] [openshift-dns/dns-default-4zvb9] addLogicalPort took 86.241346ms I0125 17:05:01.075224 1 pods.go:231] [openshift-kube-controller-manager/revision-pruner-5-ip-10-0-172-70.ec2.internal] addLogicalPort took 83.423011ms I0125 17:05:01.078542 1 pods.go:231] [openshift-dns/dns-default-4wfrt] addLogicalPort took 90.12651ms I0125 17:05:01.091201 1 pods.go:231] [openshift-ingress/router-default-8695db5c69-fdvzz] addLogicalPort took 97.682271ms I0125 17:05:01.108856 1 pods.go:231] [openshift-kube-scheduler/installer-5-ip-10-0-172-70.ec2.internal] addLogicalPort took 52.834518ms I0125 17:05:01.117518 1 pods.go:231] [openshift-service-catalog-removed/openshift-service-catalog-apiserver-remover-9g478] addLogicalPort took 38.925562ms I0125 17:05:02.132551 1 loadbalancer.go:222] Existing Service Reject ACL found: 0700a1a1-93d9-4d63-beec-1b0885480571 for 5c3f1e7d-3170-4a29-9dfd-38890a2d8129-172.30.17.17:27017 [akaris@linux 2041307]$ oc debug node/ip-10-0-172-234.ec2.internal -- curl -m 60 172.30.17.17:27017 Starting pod/ip-10-0-172-234ec2internal-debug ... To use host binaries, run `chroot /host` % Total % Received % Xferd Average Speed Time Time Time Current Dload Upload Total Spent Left Speed 0 0 0 0 0 0 0 0 --:--:-- 0:01:00 --:--:-- 0 curl: (28) Connection timed out after 60001 milliseconds Removing debug pod ... [akaris@linux 2041307]$ oc debug node/ip-10-0-172-234.ec2.internal -- curl -m 60 172.30.17.17:27017 Starting pod/ip-10-0-172-234ec2internal-debug ... To use host binaries, run `chroot /host` % Total % Received % Xferd Average Speed Time Time Time Current Dload Upload Total Spent Left Speed 0 0 0 0 0 0 0 0 --:--:-- 0:01:00 --:--:-- 0 curl: (28) Connection timed out after 60001 milliseconds Removing debug pod ... 
[akaris@linux 2041307]$ oc get pods No resources found in default namespace. [akaris@linux 2041307]$ oc get svc oc NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE kubernetes ClusterIP 172.30.0.1 <none> 443/TCP 112m nginx ClusterIP 172.30.17.17 <none> 27017/TCP 13m openshift ExternalName <none> kubernetes.default.svc.cluster.local <none> 97m [akaris@linux 2041307]$ oc get ep NAME ENDPOINTS AGE kubernetes 10.0.137.190:6443,10.0.149.204:6443,10.0.172.70:6443 113m nginx <none> 13m [akaris@linux 2041307]$ ~~~ Monitoing the sbdb while that's happening: ~~~ karis@linux ipi-us-east-1]$ oc rsh -n openshift-ovn-kubernetes -c sbdb ovnkube-master-j4dvq sh-4.2# ovn-sbctl list controller_event sh-4.2# for i in {0..300}; do echo "." ; ovn-sbctl list controller_event; sleep 1 ; done . . . . . . . . . . . . _uuid : 64cfba2c-2f06-4c4a-bc1e-2a2ff276ab93 chassis : bcd347cb-12cc-4b07-896a-62d85fdbee4a event_info : {load_balancer="5c3f1e7d-3170-4a29-9dfd-38890a2d8129", protocol=tcp, vip="172.30.17.17:27017"} event_type : empty_lb_backends seq_num : 7 . _uuid : 64cfba2c-2f06-4c4a-bc1e-2a2ff276ab93 chassis : bcd347cb-12cc-4b07-896a-62d85fdbee4a event_info : {load_balancer="5c3f1e7d-3170-4a29-9dfd-38890a2d8129", protocol=tcp, vip="172.30.17.17:27017"} event_type : empty_lb_backends seq_num : 7 . _uuid : 64cfba2c-2f06-4c4a-bc1e-2a2ff276ab93 chassis : bcd347cb-12cc-4b07-896a-62d85fdbee4a event_info : {load_balancer="5c3f1e7d-3170-4a29-9dfd-38890a2d8129", protocol=tcp, vip="172.30.17.17:27017"} event_type : empty_lb_backends seq_num : 7 . . . . . _uuid : 885cf534-156e-4d33-adb9-503c2565a144 chassis : bcd347cb-12cc-4b07-896a-62d85fdbee4a event_info : {load_balancer="5c3f1e7d-3170-4a29-9dfd-38890a2d8129", protocol=tcp, vip="172.30.17.17:27017"} event_type : empty_lb_backends seq_num : 8 . . . . . . . . _uuid : 62c2c817-8ab0-4178-90cd-d757a91a631e chassis : bcd347cb-12cc-4b07-896a-62d85fdbee4a event_info : {load_balancer="5c3f1e7d-3170-4a29-9dfd-38890a2d8129", protocol=tcp, vip="172.30.17.17:27017"} event_type : empty_lb_backends seq_num : 9 . _uuid : 62c2c817-8ab0-4178-90cd-d757a91a631e chassis : bcd347cb-12cc-4b07-896a-62d85fdbee4a event_info : {load_balancer="5c3f1e7d-3170-4a29-9dfd-38890a2d8129", protocol=tcp, vip="172.30.17.17:27017"} event_type : empty_lb_backends seq_num : 9 . _uuid : 62c2c817-8ab0-4178-90cd-d757a91a631e chassis : bcd347cb-12cc-4b07-896a-62d85fdbee4a event_info : {load_balancer="5c3f1e7d-3170-4a29-9dfd-38890a2d8129", protocol=tcp, vip="172.30.17.17:27017"} event_type : empty_lb_backends seq_num : 9 . . . . . . . . . . . . . . . _uuid : 71d6b249-1f75-43f3-af8a-30ffff8bd195 chassis : bcd347cb-12cc-4b07-896a-62d85fdbee4a event_info : {load_balancer="5c3f1e7d-3170-4a29-9dfd-38890a2d8129", protocol=tcp, vip="172.30.17.17:27017"} event_type : empty_lb_backends seq_num : 10 . . . . . . . . . . ^C sh-4.2# ~~~
Hello docs team, We need to document a shortcoming of OVNKubernetes on OCP 4.6 and 4.7. The following issue has been present in all versions of OVN Kubernetes <= 4.7 and is fixed in version 4.8 and later. Issue description: Idled applications (https://docs.openshift.com/container-platform/4.7/applications/idling-applications.html) do not automatically wake up when they receive traffic if the ovnkube-master pods were restarted after the application was idled.
Documentation team, I created knowledge base solution https://access.redhat.com/solutions/6671241. Please decide if it should be enough to document this in a KCS, or if you want to integrate this as part of our 4.6 and 4.7 documentation. Either way, the Knowledge Base Solution should contain all the details that you need in order to document this issue. Thanks, Andreas
Hey Andreas, thanks for the hard work on this. As I understand your request, you're stating that idled applications do not automatically wake up. As a result, you must manually scale the deployment or resources using oc scale. At the very bottom of the page you linked in comment 12, it states: ~~~ Applications can also be manually unidled by scaling up the resources. Procedure To scale up a DeploymentConfig, run: $ oc scale --replicas=1 dc <dc_name> ~~~ https://docs.openshift.com/container-platform/4.7/applications/idling-applications.html#idle-unidling-applications_idling-applications Is this not exactly what you're asking us to document?
Hi Andreas, Can you follow up on my above comment when you get a chance? Thanks
Per comment 11, I'd like to ask Feng Pan to review my feedback in comment 15. It seems we have already documented the suggested updates: https://docs.openshift.com/container-platform/4.7/applications/idling-applications.html#idle-unidling-applications_idling-applications Please correct me if my understanding of the docs request is incorrect.
Hi! Sorry, I don't know why I missed this. What we need to document (or refer to the KCS for) is what I wrote here: https://access.redhat.com/solutions/6671241
So, we need to document somewhere that auto-unidling in 4.6 and 4.7 doesn't work (and in fact never worked). https://bugzilla.redhat.com/show_bug.cgi?id=2041307#c13
~~~
Documentation team, I created knowledge base solution https://access.redhat.com/solutions/6671241. Please decide if it should be enough to document this in a KCS, or if you want to integrate this as part of our 4.6 and 4.7 documentation. Either way, the Knowledge Base Solution should contain all the details that you need in order to document this issue.
~~~
So, what I'm asking for is, ideally, an additional note at the bottom of the following page, alongside the two notes that are already there:
https://docs.openshift.com/container-platform/4.7/applications/idling-applications.html#idle-unidling-applications_idling-applications
NOTE: Automatic unidling by a router is currently only supported by the default HAProxy router.
NOTE: Services do not support automatic unidling if you configure Kuryr-Kubernetes as an SDN.
NOTE: In OpenShift 4.6 and 4.7, idled services do not become active again when they receive network traffic if the leader ovnkube-master process was restarted, or the ovnkube-master leader changed, while the service was idled. For further details, refer to https://access.redhat.com/solutions/6671241
Thanks for explaining that, Andreas! I've created a pull request for your review: https://github.com/openshift/openshift-docs/pull/46520
Setting ON_QA as well.
looks good to me :)
Docs look good to me. Moving this to verified.
These changes are now live. I will close this bug as current release. Thanks for your help, Andreas! https://docs.openshift.com/container-platform/4.7/applications/idling-applications.html#idle-unidling-applications_idling-applications
*** Bug 2107657 has been marked as a duplicate of this bug. ***