Description of problem:

All routes cannot be accessed after creating 2000 services across 4 namespaces and scaling the routers to 0 and back to 2.

A 4.10 bug was filed for this: https://bugzilla.redhat.com/show_bug.cgi?id=2035481. It was fixed, but a regression happened; a related bug was opened later against 4.10: https://bugzilla.redhat.com/show_bug.cgi?id=2070674

On 4.11, however, I saw a difference compared to the 4.10 bug: using the perf tool to analyze the master with the highest CPU usage, 'swapper' is the highest CPU user and ovnkube is second on 4.11. This differs from 4.10 (https://bugzilla.redhat.com/show_bug.cgi?id=2035481#c17), where ovnkube was the highest CPU user. Opened this bug for 4.11.

Version-Release number of selected component (if applicable):
4.11.0-0.nightly-2022-04-25-171513

How reproducible:
Easy to reproduce. Not reproduced on SDN.

Steps to Reproduce:
1. Install a cluster with a 4.11 nightly build, OVN, vm_type_masters: m5.4xlarge, vm_type_workers: m5.2xlarge. Scale it up to 120 worker nodes.
2. Run the router-perf test to create 2000 applications. Note: the router-perf test scales the routers to 0 and back to 2 (a minimal sketch of this step appears below, after the pod/service listing). This step was added in this PR on Sep 8, 2021, during 4.9: https://github.com/cloud-bulldozer/e2e-benchmarking/pull/222
3. Visit the applications.

Actual results:
All routes cannot be accessed. Cluster routes such as the canary route, and some cluster operator (co) routes, cannot be accessed either.

Cluster operators that failed due to the route issue:
authentication   4.11.0-0.nightly-2022-04-25-171513   False   False   True    9m4s    OAuthServerRouteEndpointAccessibleControllerAvailable: Get "https://oauth-openshift.apps.qili-411-aws-ovn.qe.devcluster.openshift.com/healthz": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
console          4.11.0-0.nightly-2022-04-25-171513   False   False   False   9m11s   RouteHealthAvailable: failed to GET route (https://console-openshift-console.apps.qili-411-aws-ovn.qe.devcluster.openshift.com): Get "https://console-openshift-console.apps.qili-411-aws-ovn.qe.devcluster.openshift.com": context deadline exceeded (Client.Timeout exceeded while awaiting headers)

Jenkins job: https://mastern-jenkins-csb-openshift-qe.apps.ocp-c1.prod.psi.redhat.com/job/scale-ci/job/e2e-benchmarking-multibranch-pipeline/job/router-perf/418/console

Expected results:
All routes should be accessible; canary and cluster operator routes should not be impacted.

Additional info:
------------------------------
# All pods/services/routes are created successfully
% for termination in http edge passthrough reencrypt; do echo pods in http-scale-${termination}; oc get pods -n http-scale-${termination} | grep Running | wc -l; echo services in http-scale-${termination}; oc get services --no-headers -n http-scale-${termination} | wc -l; echo endpoints in http-scale-${termination}; oc get endpoints --no-headers -n http-scale-${termination} | wc -l; done
pods in http-scale-http
500
services in http-scale-http
500
endpoints in http-scale-http
500
pods in http-scale-edge
500
services in http-scale-edge
500
endpoints in http-scale-edge
500
pods in http-scale-passthrough
500
services in http-scale-passthrough
500
endpoints in http-scale-passthrough
500
pods in http-scale-reencrypt
500
services in http-scale-reencrypt
500
endpoints in http-scale-reencrypt
500
------------------------------
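# For reference, a minimal sketch of the router scale-down/scale-up step the test performs.
# This is an assumption based on the note in step 2 above; the actual e2e-benchmarking
# script may change the replica count through a different mechanism.
% oc -n openshift-ingress-operator scale ingresscontroller/default --replicas=0
% oc -n openshift-ingress rollout status deploy/router-default
% oc -n openshift-ingress-operator scale ingresscontroller/default --replicas=2
% oc -n openshift-ingress rollout status deploy/router-default
------------------------------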
# Monitor the cluster's canary route every 5 seconds. Right after the router-perf test scaled the routers down to 0 and back up to 2, the canary route started to fail, at around Tue Apr 26 14:02:08 CST 2022, and it did not recover until Tue Apr 26 16:26:46 CST 2022; the outage lasted about 2 hours 25 minutes.
04-26 14:01:57.613 + oc rollout status -n openshift-ingress deploy/router-default
04-26 14:01:57.613 Waiting for deployment "router-default" rollout to finish: 0 of 2 updated replicas are available...
04-26 14:02:12.457 Waiting for deployment "router-default" rollout to finish: 0 of 2 updated replicas are available...
04-26 14:02:44.517 Waiting for deployment "router-default" rollout to finish: 1 of 2 updated replicas are available...
04-26 14:03:10.979 Waiting for deployment "router-default" rollout to finish: 1 of 2 updated replicas are available...
04-26 14:03:43.013 deployment "router-default" successfully rolled out

% while true; do echo -$(date)-; echo $(curl -Is canary-openshift-ingress-canary.apps.qili-411-aws-ovn.qe.devcluster.openshift.com -w "%{http_code}" -o /dev/null); sleep 5; done
-Tue Apr 26 13:48:18 CST 2022- 302
....
-Tue Apr 26 14:02:03 CST 2022- 302
-Tue Apr 26 14:02:08 CST 2022- 000
....
....
-Tue Apr 26 16:25:39 CST 2022- 000
-Tue Apr 26 16:26:46 CST 2022- 302
------------------------------
# Canary route's response during the failure
% curl canary-openshift-ingress-canary.apps.qili-411-aws-ovn.qe.devcluster.openshift.com
curl: (52) Empty reply from server
------------------------------
# At the same time, cluster operator failures were seen because their routes failed.
% oc get co --no-headers | grep -v 'True.*False.*False'
authentication   4.11.0-0.nightly-2022-04-25-171513   False   False   True    9m4s    OAuthServerRouteEndpointAccessibleControllerAvailable: Get "https://oauth-openshift.apps.qili-411-aws-ovn.qe.devcluster.openshift.com/healthz": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
console          4.11.0-0.nightly-2022-04-25-171513   False   False   False   9m11s   RouteHealthAvailable: failed to GET route (https://console-openshift-console.apps.qili-411-aws-ovn.qe.devcluster.openshift.com): Get "https://console-openshift-console.apps.qili-411-aws-ovn.qe.devcluster.openshift.com": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
------------------------------
# Curl the test routes; ALL of them are unavailable.
% for i in {1..100}; do echo ${i}: $(curl -sI http-perf-${i}-http-scale-http.apps.qili-411-aws.qe.devcluster.openshift.com -w "%{http_code}" -o /dev/null); done
1: 000
2: 000
3: 000
4: 000
5: 000
6: 000
7: 000
8: 000
9: 000
10: 000
....
100: 000
------------------------------
# Check the masters' CPU usage
% oc get nodes | grep master
ip-10-0-133-85.us-east-2.compute.internal    Ready   master   5h2m   v1.23.3+54654d2
ip-10-0-166-3.us-east-2.compute.internal     Ready   master   5h2m   v1.23.3+54654d2
ip-10-0-204-126.us-east-2.compute.internal   Ready   master   5h2m   v1.23.3+54654
------------------------------
% oc adm top node --sort-by=cpu | head -5
NAME                                         CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
ip-10-0-166-3.us-east-2.compute.internal     6233m        40%    18052Mi         28%
ip-10-0-133-85.us-east-2.compute.internal    3718m        23%    18128Mi         29%
ip-10-0-204-126.us-east-2.compute.internal   2219m        14%    16733Mi         26%
ip-10-0-203-117.us-east-2.compute.internal   1536m        3%     14066Mi         7%

% oc adm top node --sort-by=cpu | head -5
NAME                                         CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
ip-10-0-166-3.us-east-2.compute.internal     5084m        32%    19496Mi         31%
ip-10-0-133-85.us-east-2.compute.internal    2324m        14%    19307Mi         31%
ip-10-0-204-126.us-east-2.compute.internal   1703m        10%    18198Mi         29%
ip-10-0-175-156.us-east-2.compute.internal   989m         2%     15250Mi         8%
------------------------------
# Get top on a master; ovnkube used the most CPU
% oc debug node/ip-10-0-166-3.us-east-2.compute.internal
sh-4.4# top -b
top - 06:22:29 up 5:00, 0 users, load average: 6.15, 6.39, 6.17
Tasks: 455 total, 1 running, 454 sleeping, 0 stopped, 0 zombie
%Cpu(s): 66.4 us, 2.5 sy, 0.0 ni, 29.3 id, 0.0 wa, 1.1 hi, 0.7 si, 0.0 st
MiB Mem : 63605.6 total, 34346.3 free, 17396.9 used, 11862.3 buff/cache
MiB Swap: 0.0 total, 0.0 free, 0.0 used. 45401.8 avail Mem

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
   7328 root      20   0 3770604   1.2g  36108 S  1024   2.0 119:22.19 ovnkube
   2710 root      20   0 3125456 255908  72216 S  11.8   0.4  44:00.08 kubelet
  45379 root      20   0 3332648   2.0g  87864 S  11.8   3.2  47:41.24 kube-apiserver
  49743 root      20   0   10.6g 804420 300800 S  11.8   1.2  31:25.57 etcd
      1 root      20   0  246052  18736   8724 S   5.9   0.0   2:07.65 systemd
   9676 1000      20   0  740056  62376  25640 S   5.9   0.1   0:41.06 dns-operator
   9805 nobody    20   0  735696  38340  20420 S   5.9   0.1   0:03.29 kube-rbac-proxy
  11182 1000340+  20   0  739604  40384  22244 S   5.9   0.1   0:54.49 machine-config-
  30484 root      20   0 2546640 194968  48392 S   5.9   0.3   6:05.01 cluster-kube-ap
 334970 root      20   0   54568   4812   3792 R   5.9   0.0   0:00.02 top
      2 root      20   0       0      0      0 S   0.0   0.0   0:00.01 kthreadd

top - 06:36:57 up 5:14, 0 users, load average: 6.37, 6.78, 6.58
Tasks: 454 total, 1 running, 453 sleeping, 0 stopped, 0 zombie
%Cpu(s): 28.1 us, 1.5 sy, 0.0 ni, 69.1 id, 0.1 wa, 0.6 hi, 0.6 si, 0.0 st
MiB Mem : 63605.6 total, 31663.0 free, 18557.8 used, 13384.8 buff/cache
MiB Swap: 0.0 total, 0.0 free, 0.0 used. 44240.9 avail Mem

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
   7328 root      20   0 3908596   1.4g  36172 S 389.1   2.2 177:10.12 ovnkube
  45379 root      20   0 3332648   2.0g  87928 S  15.6   3.2  49:47.21 kube-apiserver
   2710 root      20   0 3125456 254404  72216 S  14.2   0.4  46:18.48 kubelet
  49743 root      20   0   10.7g 890536 300800 S  10.6   1.4  33:03.01 etcd
   3564 root      20   0 3092664   2.8g   7220 S   5.3   4.5  16:50.93 ovn-northd
  21248 root      20   0 2406832 115296  49420 S   5.0   0.2   6:19.80 oauth-apiserver
   1453 800       10 -10 1617384 190348  40936 S   4.3   0.3  16:20.57 ovs-vswitchd
------------------------------
# Use the perf tool to analyze the master with the highest CPU usage: 'swapper' is the highest CPU user and ovnkube is second on 4.11. This differs from the 4.10 bug (https://bugzilla.redhat.com/show_bug.cgi?id=2035481#c17), where ovnkube was the highest CPU user.
sh-4.4# perf record -a -g
sh-4.4# perf report
Samples: 883K of event 'cpu-clock:pppH', Event count (approx.): 220842500000
  Children      Self  Command  Shared Object      Symbol
+   61.59%     0.00%  swapper  [kernel.kallsyms]  [k] secondary_startup_64_no_verify
+   61.59%     0.00%  swapper  [kernel.kallsyms]  [k] cpu_startup_entry
+   61.59%     0.01%  swapper  [kernel.kallsyms]  [k] do_idle
+   61.39%     0.00%  swapper  [kernel.kallsyms]  [k] cpuidle_enter
+   61.38%     0.04%  swapper  [kernel.kallsyms]  [k] cpuidle_enter_state
+   61.35%     0.00%  swapper  [kernel.kallsyms]  [k] acpi_idle_enter
+   61.35%     0.00%  swapper  [kernel.kallsyms]  [k] acpi_idle_do_entry
+   61.34%    61.09%  swapper  [kernel.kallsyms]  [k] native_safe_halt
+   58.16%     0.00%  swapper  [kernel.kallsyms]  [k] start_secondary
+   17.16%     0.00%  ovnkube  ovnkube            [.] runtime.goexit.abi0
+   13.35%     0.00%  ovnkube  ovnkube            [.] k8s.io/apimachinery/pkg/util/wait.BackoffUntil
+   13.35%     0.00%  ovnkube  ovnkube            [.] k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1
+   13.35%     0.00%  ovnkube  ovnkube            [.] k8s.io/apimachinery/pkg/util/wait.JitterUntil
+   13.35%     0.00%  ovnkube  ovnkube            [.] github.com/ovn-org/ovn-kubernetes/go-controller/pkg/ovn/controller/services.(*Controller).Run.dwrap.8
+   13.35%     0.00%  ovnkube  ovnkube            [.] k8s.io/apimachinery/pkg/util/wait.Until
+   13.35%     0.00%  ovnkube  ovnkube            [.] github.com/ovn-org/ovn-kubernetes/go-controller/pkg/ovn/controller/services.(*Controller).worker-fm
+   13.35%     0.00%  ovnkube  ovnkube            [.] github.com/ovn-org/ovn-kubernetes/go-controller/pkg/ovn/controller/services.(*Controller).processNextW
+   13.35%     0.00%  ovnkube  ovnkube            [.] github.com/ovn-org/ovn-kubernetes/go-controller/pkg/ovn/controller/services.(*Controller).syncService
+   13.34%     0.00%  ovnkube  ovnkube            [.] github.com/ovn-org/ovn-kubernetes/go-controller/pkg/ovn/loadbalancer.EnsureLBs
+   13.31%     0.00%  ovnkube  ovnkube            [.] github.com/ovn-org/ovn-kubernetes/go-controller/pkg/libovsdbops.(*modelClient).buildOps
+   13.31%     0.00%  ovnkube  ovnkube            [.] github.com/ovn-org/ovn-kubernetes/go-controller/pkg/libovsdbops.(*modelClient).createOrUpdateOps
+   13.30%     0.00%  ovnkube  ovnkube            [.] github.com/ovn-org/ovn-kubernetes/go-controller/pkg/libovsdbops.(*modelClient).lookup
+   13.30%     0.00%  ovnkube  ovnkube            [.] github.com/ovn-org/ovn-kubernetes/go-controller/pkg/libovsdbops.(*modelClient).whereCache
+   13.30%     0.00%  ovnkube  ovnkube            [.] github.com/ovn-org/libovsdb/client.(*api).List
+   13.29%     0.00%  ovnkube  ovnkube            [.] github.com/ovn-org/ovn-kubernetes/go-controller/pkg/libovsdbops.CreateOrUpdateLoadBalancersOps
+   13.24%     0.31%  ovnkube  ovnkube            [.] github.com/ovn-org/libovsdb/client.api.List
+   10.21%     0.00%  ovnkube  [unknown]          [k] 0x000000000046ef41
+    7.85%     0.05%  ovnkube  ovnkube            [.] reflect.Value.Call
+    7.81%     0.12%  ovnkube  ovnkube            [.] github.com/ovn-org/libovsdb/client.(*predicateConditional).Matches
+    7.42%     0.00%  ovnkube  [unknown]          [.] 0x0000000001745dee

Samples: 253K of event 'cpu-clock:pppH', Event count (approx.): 63359500000
  Children      Self  Command  Shared Object      Symbol
+   70.85%     0.00%  swapper  [kernel.kallsyms]  [k] secondary_startup_64_no_verify
+   70.85%     0.00%  swapper  [kernel.kallsyms]  [k] cpu_startup_entry
+   70.84%     0.01%  swapper  [kernel.kallsyms]  [k] do_idle
+   70.64%     0.00%  swapper  [kernel.kallsyms]  [k] cpuidle_enter
+   70.64%     0.04%  swapper  [kernel.kallsyms]  [k] cpuidle_enter_state
+   70.60%     0.00%  swapper  [kernel.kallsyms]  [k] acpi_idle_enter
+   70.60%     0.00%  swapper  [kernel.kallsyms]  [k] acpi_idle_do_entry
+   70.60%    70.29%  swapper  [kernel.kallsyms]  [k] native_safe_halt
+   65.94%     0.00%  swapper  [kernel.kallsyms]  [k] start_secondary
+   16.97%     0.00%  ovnkube  ovnkube            [.] runtime.goexit.abi0
+   14.13%     0.00%  ovnkube  ovnkube            [.] github.com/ovn-org/ovn-kubernetes/go-controller/pkg/ovn/controller/services.(*Controller).Run.dwrap.8
+   14.13%     0.00%  ovnkube  ovnkube            [.] k8s.io/apimachinery/pkg/util/wait.Until
+   14.13%     0.00%  ovnkube  ovnkube            [.] k8s.io/apimachinery/pkg/util/wait.JitterUntil
+   14.13%     0.00%  ovnkube  ovnkube            [.] k8s.io/apimachinery/pkg/util/wait.BackoffUntil
+   14.13%     0.00%  ovnkube  ovnkube            [.] k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1
+   14.13%     0.00%  ovnkube  ovnkube            [.] github.com/ovn-org/ovn-kubernetes/go-controller/pkg/ovn/controller/services.(*Controller).worker-fm
+   14.13%     0.00%  ovnkube  ovnkube            [.] github.com/ovn-org/ovn-kubernetes/go-controller/pkg/ovn/controller/services.(*Controller).processNextW
+   14.13%     0.00%  ovnkube  ovnkube            [.] github.com/ovn-org/ovn-kubernetes/go-controller/pkg/ovn/controller/services.(*Controller).syncService
+   14.12%     0.00%  ovnkube  ovnkube            [.] github.com/ovn-org/ovn-kubernetes/go-controller/pkg/ovn/loadbalancer.EnsureLBs
+   14.06%     0.00%  ovnkube  ovnkube            [.] github.com/ovn-org/ovn-kubernetes/go-controller/pkg/libovsdbops.(*modelClient).buildOps
+   14.06%     0.00%  ovnkube  ovnkube            [.] github.com/ovn-org/ovn-kubernetes/go-controller/pkg/libovsdbops.(*modelClient).createOrUpdateOps
+   14.05%     0.00%  ovnkube  ovnkube            [.] github.com/ovn-org/ovn-kubernetes/go-controller/pkg/libovsdbops.(*modelClient).lookup
+   14.05%     0.00%  ovnkube  ovnkube            [.] github.com/ovn-org/ovn-kubernetes/go-controller/pkg/libovsdbops.(*modelClient).whereCache
+   14.05%     0.00%  ovnkube  ovnkube            [.] github.com/ovn-org/libovsdb/client.(*api).List
+   14.02%     0.00%  ovnkube  ovnkube            [.] github.com/ovn-org/ovn-kubernetes/go-controller/pkg/libovsdbops.CreateOrUpdateLoadBalancersOps
+   13.99%     0.31%  ovnkube  ovnkube            [.] github.com/ovn-org/libovsdb/client.api.List
+    8.54%     0.16%  ovnkube  ovnkube            [.] github.com/ovn-org/libovsdb/client.(*predicateConditional).Matches
+    8.38%     0.06%  ovnkube  ovnkube            [.] reflect.Value.Call
+    7.90%     1.19%  ovnkube  ovnkube            [.] reflect.Value.call
+    6.41%     0.00%  ovnkube  [unknown]          [.] 0x000000000046ef41
------------------------------
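# Note on the capture method: 'perf record -a -g' with no command runs until interrupted,
# so the profiles above were presumably stopped manually. A bounded, non-interactive
# variant (a sketch, assuming the same debug shell on the node) would be:
% oc debug node/ip-10-0-166-3.us-east-2.compute.internal
sh-4.4# perf record -a -g -- sleep 30    # system-wide, with call graphs, for a fixed 30 seconds
sh-4.4# perf report --stdio | head -40   # print the top of the report without the interactive UI
------------------------------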
# The other two masters have different top CPU consumers: ovsdb-server and ovs-vswitchd.
% oc debug node/ip-10-0-133-85.us-east-2.compute.internal
sh-4.4# top -b
top - 06:24:31 up 5:02, 0 users, load average: 3.69, 3.35, 3.60
Tasks: 412 total, 2 running, 410 sleeping, 0 stopped, 0 zombie
%Cpu(s): 9.3 us, 1.4 sy, 0.0 ni, 87.9 id, 0.3 wa, 0.7 hi, 0.3 si, 0.0 st
MiB Mem : 62933.6 total, 35864.7 free, 17041.6 used, 10027.2 buff/cache
MiB Swap: 0.0 total, 0.0 free, 0.0 used. 45094.6 avail Mem

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
   4719 root      20   0  848164 657224   8484 R  88.9   1.0  23:48.41 ovsdb-server
  50704 root      20   0 5339676   2.7g  88888 S  16.7   4.4 107:04.00 kube-apiserver
   2719 root      20   0 3051020 221876  72868 S  11.1   0.3  32:01.62 kubelet
  39591 1000530+  20   0 2560156 265824  53932 S  11.1   0.4   1:40.27 openshift-contr
  56494 root      20   0   10.9g 854024 300768 S  11.1   1.3  44:01.51 etcd
  56666 root      20   0 2409076  76844  41680 S  11.1   0.1  16:12.69 cluster-etcd-op
 525161 root      20   0   54568   4864   3844 R  11.1   0.0   0:00.02 top
      1 root      20   0  244904  17756   8980 S   0.0   0.0   2:49.25 systemd
------------------------------
% oc debug node/ip-10-0-204-126.us-east-2.compute.internal
sh-4.4# top -b
top - 06:26:32 up 5:04, 0 users, load average: 0.62, 1.21, 2.04
Tasks: 395 total, 1 running, 394 sleeping, 0 stopped, 0 zombie
%Cpu(s): 1.5 us, 1.1 sy, 0.0 ni, 96.7 id, 0.0 wa, 0.4 hi, 0.4 si, 0.0 st
MiB Mem : 63605.6 total, 38258.1 free, 16139.1 used, 9208.4 buff/cache
MiB Swap: 0.0 total, 0.0 free, 0.0 used. 46664.6 avail Mem

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
   1455 800       10 -10 1614944 144336  40936 S   6.2   0.2   6:25.74 ovs-vswitchd
   2687 root      20   0 3198868 222668  70024 S   6.2   0.3  30:44.20 kubelet
  44141 root      20   0 2722496 420632  69808 S   6.2   0.6   6:27.47 openshift-apise
  48844 root      20   0 3400748   2.1g  87952 S   6.2   3.3  46:03.42 kube-apiserver
  51491 root      20   0 1458696 676576  73656 S   6.2   1.0   8:24.01 kube-controller
  55096 root      20   0   10.7g 849880 300032 S   6.2   1.3  32:16.96 etcd
  58991 root      20   0 2319396  74300  35468 S   6.2   0.1   1:00.46 oauth-server
      1 root      20   0  244648  17408   8884 S   0.0   0.0   2:47.74 systemd
      2 root      20   0       0      0      0 S   0.0   0.0   0:00.02 kthreadd
------------------------------
*** This bug has been marked as a duplicate of bug 2070674 ***
Marking closed and putting TestBlocker on the bug this was DUP-ed against.