Created attachment 1894458 [details]
Load balancer test that ended up with a connectivity outage

1. Issue:
Testing the upgrade via the version annotation causes a connectivity outage to the service's external IP.

2. WMCO & OpenShift Version:
WMCO "version": "4.0.1+f66f0980"
OCP 4.9.0-0.nightly-2022-06-24-070308

3. Platform (AWS/Azure/vSphere/platform=none):
Azure

4. If the platform is vSphere, what is the VMware Tools version?
N/A (platform is Azure).

5. Is it a new test case or an old test case? If old, is it a regression or first-time tested?
Old test case: OCP-35707.
Is it platform-specific or consistent across all platforms?
Reproduces on AWS as well.

6. Steps to Reproduce:
Create and run a script that continuously probes the Windows service's external IP (LB):

$ cat probeLB.sh
#!/bin/bash
set -e
while true
do
    date
    echo "curl 52.189.34.88"
    curl 52.189.34.88
    echo ""
    sleep 2
done

NAME                      TYPE           CLUSTER-IP      EXTERNAL-IP     PORT(S)          AGE
service/linux-webserver   LoadBalancer   172.30.3.186    52.189.33.119   8080:31053/TCP   113m
service/win-webserver     LoadBalancer   172.30.105.53   52.189.34.88    80:30648/TCP     115m

a. Scale WMCO down to 0:
# oc scale deployment.apps/windows-machine-config-operator --replicas=0 -n openshift-windows-machine-config-operator
b. Overwrite the version annotation on the Windows node:
# oc annotate node windows-zrtrt --overwrite windowsmachineconfig.openshift.io/version=invalidVersion
c. Scale WMCO back up to 1:
# oc scale deployment.apps/windows-machine-config-operator --replicas=1 -n openshift-windows-machine-config-operator

7. Actual Result and Expected Result:
Connectivity is lost while the machine is in the Provisioning state; after the machine gets an IP, connectivity returns.

8. Has a possible workaround been tried? Is there a way to recover from the issue?
No workaround; connectivity comes back on its own after about a minute.

9. Logs:
Must-gather Windows node logs (https://github.com/openshift/must-gather/blob/master/collection-scripts/gather_windows_node_logs#L24)
oc get network.operator cluster -o yaml
oc logs -f deployment/windows-machine-config-operator -n openshift-windows-machine-config-operator
Windows MachineSet yaml or windows-instances ConfigMap:
oc get machineset <windows_machineSet_name> -n openshift-machine-api -o yaml
oc get configmaps <windows_configmap_name> -n <namespace_name> -o yaml
Optional logs: anything that can be useful to debug the issue.
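For convenience, the three reproduction steps above can be combined into one script. This is a hedged sketch; the node name windows-zrtrt is specific to this report's environment and must be replaced to match the target cluster:

#!/bin/bash
set -euo pipefail
WMCO_NS=openshift-windows-machine-config-operator
NODE=windows-zrtrt   # replace with a Windows node from the target cluster

# a. Scale WMCO down so the annotation is not immediately reverted
oc scale deployment.apps/windows-machine-config-operator --replicas=0 -n "${WMCO_NS}"
# b. Overwrite the version annotation to force reconciliation on scale-up
oc annotate node "${NODE}" --overwrite windowsmachineconfig.openshift.io/version=invalidVersion
# c. Scale WMCO back up; it will reconcile (re-provision) the mismatched node
oc scale deployment.apps/windows-machine-config-operator --replicas=1 -n "${WMCO_NS}"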
Confirmed this issue also appears in 4.11 ("version": "6.0.0-dd57309").
@rrasouli Looking at the logs you've attached I'm not clear on the problem. Is it that the final curl never reaches the LB? It would be helpful to have the curl stderr. What happens when curling again? Does this issue go away eventually or do all subsequent curls have the same behavior?
@ssoto The problem at the end of the curl script test is that the LB stops responding while the machine is in the Provisioning state, so the final curl is the one that failed to reach the LB address (the script exits there because of `set -e`). That repeats in all the scenarios described in the bug description. I can remove the `set -e` to confirm the LB becomes available again. After the machine state switches to Provisioned (i.e. it has an IP address), the LB responds again.

> What happens when curling again? Does this issue go away eventually or do all subsequent curls have the same behavior?

The issue goes away, until we start a new upgrade scenario.
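For reference, a probe variant without `set -e`, so the loop keeps running through the outage and records when connectivity drops and returns (a hedged sketch; the external IP is the one from the description):

#!/bin/bash
# Probe the win-webserver LB continuously; log failures instead of exiting.
LB_IP=52.189.34.88   # external IP from the description; adjust as needed
while true
do
    if curl --connect-timeout 5 -sS "http://${LB_IP}" >/dev/null; then
        echo "$(date) OK"
    else
        echo "$(date) FAILED (curl exit code $?)"
    fi
    sleep 2
done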
@rrasouli Any update here after the session with the dev team?
Same results: there is still packet loss in 4.12 on AWS as well.
@rrasouli is attachment 1913549 [details] related to this bug?
@rrasouli I followed the steps described in the test [1] and I could replicate this scenario where the service becomes unavailable for a short period of time. Yet, I don't feel it is a bug. Looking at the win-webserver deployment [2] referenced in the test, there is only one (1) replica, and you need at least two pods for the service to be considered minimally available. The load balancer will distribute network traffic across all pods of the deployment.

Recommendation: set the number of replicas to at least two (2) to ensure that deleting a single pod will not cause downtime; see the sketch after this comment. For example, you are running a single instance of the win-webserver workload; if the one and only pod gets deleted, evicted, or scheduled on another Windows node (the WMCO upgrade scenario), you may find the service completely unavailable for a short period of time. In general, if you only have one application replica, any termination will result in downtime.

> 7. Actual Result and Expected Result
> Connectivity lost when machine is in provisioning state, after the machine
> get an IP the connectivity return

WRT the actual result: the events are independent. In this case, the new machine getting an IP address in the Provisioning state is unrelated to the service becoming available. Connectivity is back after the scheduler assigns the win-webserver pod to another Windows node.

[1] https://polarion.engineering.redhat.com/polarion/#/project/OSE/workitem?id=OCP-35707
[2] https://raw.githubusercontent.com/sgaoshang/winc-test/master/data/WinWebServer.yaml
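A minimal sketch of that recommendation, assuming the deployment runs in the winc-test namespace seen later in this thread:

# Run two replicas so a single pod termination does not take the service down
oc scale deployment/win-webserver -n winc-test --replicas=2
# Wait until the new replica is available before relying on the LB again
oc rollout status deployment/win-webserver -n winc-test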
Hi @rrasouli, the number of Windows workload replicas is essential, and the workloads should be distributed (scheduled) evenly across the existing Windows nodes to achieve better availability. As per our conversation in Slack [1], to avoid service outages in the load balancer, ensure there are at least two (2) Windows nodes in Ready state and that the running Windows workloads are not all scheduled on the same Windows node. Otherwise, the web-server workload pods will all be re-scheduled to another Windows node at once, and the service will experience a short outage during this process, as described above. A sketch of one way to enforce that spread follows this comment.

[1] https://coreos.slack.com/archives/CM4ERHBJS/p1666203999912499?thread_ts=1666195645.101679&cid=CM4ERHBJS
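One way to keep the replicas from landing on the same node is a topology spread constraint on the deployment. A hedged sketch, assuming the winc-test namespace and an app=win-webserver pod label (verify the actual labels in WinWebServer.yaml before applying):

# Hypothetical patch: spread win-webserver pods across Windows nodes.
# ScheduleAnyway keeps pods schedulable even when only one node is Ready.
oc -n winc-test patch deployment/win-webserver --type=merge -p \
    '{"spec":{"template":{"spec":{"topologySpreadConstraints":[{"maxSkew":1,"topologyKey":"kubernetes.io/hostname","whenUnsatisfiable":"ScheduleAnyway","labelSelector":{"matchLabels":{"app":"win-webserver"}}}]}}}}'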
This bug was reviewed in today's QE sync meeting. QE Team confirmed that after following the suggestion mentioned above[1] the bug is not reproducible in AWS. QE Team is actively working on the verification in Azure, so moving this bug to ON_QA. [1] https://bugzilla.redhat.com/show_bug.cgi?id=2103631#c14
After some thorough testing, I can confirm that there is an existing issue with the Azure Load Balancer (tested in 4.8 and 4.11) when performing the node reconciliation, even though there are 3 workers:

#######ATTEMPT #4 Wed Nov 2 01:25:04 PM CET 2022 ######
NAME            STATUS   ROLES    AGE     VERSION
windows-gbjbz   Ready    worker   7m35s   v1.21.11-rc.0.1506+5cc9227e4695d1
windows-gmn4c   Ready    worker   34m     v1.21.11-rc.0.1506+5cc9227e4695d1
windows-v88gm   Ready    worker   66m     v1.21.11-rc.0.1506+5cc9227e4695d1

NAME                               READY   STATUS    RESTARTS   AGE     IP            NODE                                           NOMINATED NODE   READINESS GATES
linux-webserver-7749c5ddff-8xm6g   1/1     Running   0          109m    10.131.0.26   jfrancoa-0211-rete-k28gm-worker-westus-vpbcw   <none>           <none>
win-webserver-549cd7495d-5s8mq     1/1     Running   0          3m29s   10.132.8.8    windows-gmn4c                                  <none>           <none>
win-webserver-549cd7495d-6l47p     1/1     Running   0          22m     10.132.6.9    windows-v88gm                                  <none>           <none>
win-webserver-549cd7495d-d5z5f     1/1     Running   0          22m     10.132.8.6    windows-gmn4c                                  <none>           <none>
win-webserver-549cd7495d-fs9c6     1/1     Running   0          3m29s   10.132.9.3    windows-gbjbz                                  <none>           <none>
win-webserver-549cd7495d-gbbmd     1/1     Running   0          22m     10.132.8.7    windows-gmn4c                                  <none>           <none>
win-webserver-549cd7495d-m6lhf     1/1     Running   0          22m     10.132.6.10   windows-v88gm                                  <none>           <none>
win-webserver-549cd7495d-m7jdc     1/1     Running   0          3m29s   10.132.9.4    windows-gbjbz                                  <none>           <none>
win-webserver-549cd7495d-qlqsd     1/1     Running   0          3m29s   10.132.9.2    windows-gbjbz                                  <none>           <none>
win-webserver-549cd7495d-wdgkk     1/1     Running   0          22m     10.132.6.11   windows-v88gm                                  <none>           <none>

[curl progress meter trimmed]
<html><body><H1>Windows Container Web Server</H1></body></html>

#######ATTEMPT #5 Wed Nov 2 01:26:06 PM CET 2022 ######
NAME            STATUS                     ROLES    AGE     VERSION
windows-gbjbz   Ready                      worker   8m38s   v1.21.11-rc.0.1506+5cc9227e4695d1
windows-gmn4c   Ready,SchedulingDisabled   worker   35m     v1.21.11-rc.0.1506+5cc9227e4695d1
windows-v88gm   Ready                      worker   68m     v1.21.11-rc.0.1506+5cc9227e4695d1

NAME                               READY   STATUS              RESTARTS   AGE     IP            NODE                                           NOMINATED NODE   READINESS GATES
linux-webserver-7749c5ddff-8xm6g   1/1     Running             0          110m    10.131.0.26   jfrancoa-0211-rete-k28gm-worker-westus-vpbcw   <none>           <none>
win-webserver-549cd7495d-6l47p     1/1     Running             0          23m     10.132.6.9    windows-v88gm                                  <none>           <none>
win-webserver-549cd7495d-b6hvg     0/1     ContainerCreating   0          9s      <none>        windows-gbjbz                                  <none>           <none>
win-webserver-549cd7495d-cwlmn     0/1     ContainerCreating   0          9s      <none>        windows-v88gm                                  <none>           <none>
win-webserver-549cd7495d-fs9c6     1/1     Running             0          4m32s   10.132.9.3    windows-gbjbz                                  <none>           <none>
win-webserver-549cd7495d-m6lhf     1/1     Running             0          23m     10.132.6.10   windows-v88gm                                  <none>           <none>
win-webserver-549cd7495d-m7jdc     1/1     Running             0          4m32s   10.132.9.4    windows-gbjbz                                  <none>           <none>
win-webserver-549cd7495d-pcrwt     0/1     ContainerCreating   0          9s      <none>        windows-v88gm                                  <none>           <none>
win-webserver-549cd7495d-qlqsd     1/1     Running             0          4m32s   10.132.9.2    windows-gbjbz                                  <none>           <none>
win-webserver-549cd7495d-wdgkk     1/1     Running             0          23m     10.132.6.11   windows-v88gm                                  <none>           <none>

[curl progress meter trimmed]
curl: (7) Failed to connect to 20.237.202.229 port 80 after 8270 ms: Connection refused

Another situation happened on AWS: every time a node is reconciled a new machine is allocated, and the newly created machine does not have the Windows container image pre-pulled, so creating the container takes longer. While that image was still being pulled, the node on which most of the containers were running reached its turn to be reconciled too, causing a disruption in the service provided by the load balancer:

#######ATTEMPT #35 Wed Nov 2 01:56:22 PM CET 2022 ######
NAME                                        STATUS                     ROLES    AGE     VERSION
ip-10-0-69-232.us-east-2.compute.internal   Ready                      worker   78m     v1.21.11-rc.0.1506+5cc9227e4695d1
ip-10-0-71-242.us-east-2.compute.internal   Ready                      worker   14m     v1.21.11-rc.0.1506+5cc9227e4695d1
ip-10-0-72-103.us-east-2.compute.internal   Ready,SchedulingDisabled   worker   3m14s   v1.21.11-rc.0.1506+5cc9227e4695d1

NAME                               READY   STATUS              RESTARTS   AGE     IP            NODE                                        NOMINATED NODE   READINESS GATES
linux-webserver-7749c5ddff-8frhq   1/1     Running             0          4h45m   10.128.2.21   ip-10-0-72-51.us-east-2.compute.internal    <none>           <none>
win-webserver-858656469f-224cr     0/1     ContainerCreating   0          9m23s   <none>        ip-10-0-71-242.us-east-2.compute.internal   <none>           <none>
win-webserver-858656469f-f89rb     1/1     Running             0          9m23s   10.132.8.14   ip-10-0-69-232.us-east-2.compute.internal   <none>           <none>
win-webserver-858656469f-hcg9j     1/1     Running             0          9m24s   10.132.8.12   ip-10-0-69-232.us-east-2.compute.internal   <none>           <none>
win-webserver-858656469f-jhx9d     1/1     Running             0          34m     10.132.8.10   ip-10-0-69-232.us-east-2.compute.internal   <none>           <none>
win-webserver-858656469f-l5xp5     1/1     Running             0          34m     10.132.8.11   ip-10-0-69-232.us-east-2.compute.internal   <none>           <none>
win-webserver-858656469f-mxvbg     1/1     Running             0          9m23s   10.132.8.13   ip-10-0-69-232.us-east-2.compute.internal   <none>           <none>
win-webserver-858656469f-s5vjh     1/1     Running             0          53m     10.132.8.8    ip-10-0-69-232.us-east-2.compute.internal   <none>           <none>
win-webserver-858656469f-vcwrj     0/1     ContainerCreating   0          9m24s   <none>        ip-10-0-71-242.us-east-2.compute.internal   <none>           <none>
win-webserver-858656469f-vspgf     1/1     Running             0          53m     10.132.8.9    ip-10-0-69-232.us-east-2.compute.internal   <none>           <none>

[curl progress meter trimmed]
<html><body><H1>Windows Container Web Server</H1></body></html>

#######ATTEMPT #36 Wed Nov 2 01:57:24 PM CET 2022 ######
NAME                                        STATUS                     ROLES    AGE     VERSION
ip-10-0-69-232.us-east-2.compute.internal   Ready,SchedulingDisabled   worker   79m     v1.21.11-rc.0.1506+5cc9227e4695d1
ip-10-0-71-242.us-east-2.compute.internal   Ready                      worker   15m     v1.21.11-rc.0.1506+5cc9227e4695d1
ip-10-0-72-103.us-east-2.compute.internal   Ready                      worker   4m15s   v1.21.11-rc.0.1506+5cc9227e4695d1

NAME                               READY   STATUS              RESTARTS   AGE     IP            NODE                                        NOMINATED NODE   READINESS GATES
linux-webserver-7749c5ddff-8frhq   1/1     Running             0          4h46m   10.128.2.21   ip-10-0-72-51.us-east-2.compute.internal    <none>           <none>
win-webserver-858656469f-224cr     0/1     ContainerCreating   0          10m     <none>        ip-10-0-71-242.us-east-2.compute.internal   <none>           <none>
win-webserver-858656469f-2j8zb     0/1     ContainerCreating   0          4s      <none>        ip-10-0-71-242.us-east-2.compute.internal   <none>           <none>
win-webserver-858656469f-4rnb5     0/1     ContainerCreating   0          4s      <none>        ip-10-0-72-103.us-east-2.compute.internal   <none>           <none>
win-webserver-858656469f-blkfk     0/1     ContainerCreating   0          4s      <none>        ip-10-0-71-242.us-east-2.compute.internal   <none>           <none>
win-webserver-858656469f-f89rb     1/1     Terminating         0          10m     10.132.8.14   ip-10-0-69-232.us-east-2.compute.internal   <none>           <none>
win-webserver-858656469f-hcg9j     1/1     Terminating         0          10m     10.132.8.12   ip-10-0-69-232.us-east-2.compute.internal   <none>           <none>
win-webserver-858656469f-jhx9d     1/1     Terminating         0          35m     10.132.8.10   ip-10-0-69-232.us-east-2.compute.internal   <none>           <none>
win-webserver-858656469f-l5xp5     1/1     Terminating         0          35m     10.132.8.11   ip-10-0-69-232.us-east-2.compute.internal   <none>           <none>
win-webserver-858656469f-lvs4z     0/1     ContainerCreating   0          4s      <none>        ip-10-0-72-103.us-east-2.compute.internal   <none>           <none>
win-webserver-858656469f-m8svn     0/1     ContainerCreating   0          4s      <none>        ip-10-0-71-242.us-east-2.compute.internal   <none>           <none>
win-webserver-858656469f-mxvbg     1/1     Terminating         0          10m     10.132.8.13   ip-10-0-69-232.us-east-2.compute.internal   <none>           <none>
win-webserver-858656469f-s5vjh     1/1     Terminating         0          54m     10.132.8.8    ip-10-0-69-232.us-east-2.compute.internal   <none>           <none>
win-webserver-858656469f-vcwrj     0/1     ContainerCreating   0          10m     <none>        ip-10-0-71-242.us-east-2.compute.internal   <none>           <none>
win-webserver-858656469f-vspgf     1/1     Terminating         0          54m     10.132.8.9    ip-10-0-69-232.us-east-2.compute.internal   <none>           <none>
win-webserver-858656469f-xdpqn     0/1     ContainerCreating   0          4s      <none>        ip-10-0-72-103.us-east-2.compute.internal   <none>           <none>
win-webserver-858656469f-zv6sq     0/1     ContainerCreating   0          4s      <none>        ip-10-0-72-103.us-east-2.compute.internal   <none>           <none>

[curl progress meter trimmed; the request hung for ~60 seconds]
curl: (52) Empty reply from server

My doubt regarding the AWS case is: what's the policy in the case of an upgrade? Should we ensure availability of the service during the whole procedure? Because if that's the case, WMCO should wait until the image is pulled and the container is created before jumping to the next node to reconcile, in my humble opinion.

I am uploading the logs for the three cases (Azure 4.11: 35707_Azure_411.log, Azure 4.8: 35707_Azure_48.log, and AWS 4.8: 35707_AWS_48.log) and moving this BZ back to ASSIGNED.

Sorry for the delay in answering, but I was getting failures in the Load Balancer's connectivity and I wasn't sure whether the logic of the test itself was wrong. After running the following loop in parallel, I could confirm it was not a glitch:

for i in {1..60}; do
    time=`date`
    echo -e "\n#######ATTEMPT #${i} ${time} ######" &>> /tmp/35707_Azure_48.log
    oc get nodes -l=node.openshift.io/os_id="Windows" &>> /tmp/35707_Azure_48.log
    oc get pods -n winc-test -o wide &>> /tmp/35707_Azure_48.log
    curl --connect-timeout 60 20.237.202.229 &>> /tmp/35707_Azure_48.log
    sleep 60
done
Regarding the AWS scenario in which the connectivity gets lost because the image isn't pulled yet, I can confirm it also occurs in IPI AWS version 4.11:

Nov 3 09:24:28.221: INFO: Checked LB connectivity of a72eb2193d310475189a0d026446609b-701037125.us-east-2.elb.amazonaws.com
Nov 3 09:24:48.610: INFO: Windows machine is not provisioned yet. Waiting 30 seconds more ...
Nov 3 09:25:18.409: INFO: numberOfMachines value is: 3
Nov 3 09:25:28.534: INFO: Checked LB connectivity of a72eb2193d310475189a0d026446609b-701037125.us-east-2.elb.amazonaws.com
Nov 3 09:25:54.471: INFO: numberOfMachines value is: 3
Nov 3 09:26:28.827: INFO: Connectivity check failed: error in curl command exit status 56 the IP of a72eb2193d310475189a0d026446609b-701037125.us-east-2.elb.amazonaws.com is not accesible

I wanted to upload the same "scenario" logs, but when re-running the test case it went fine. Everything depends on the time it takes to pull the Windows container image: if it is pulled quickly enough, before the next node gets reconciled, everything works fine; if not, we see a short service disruption.
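As an aside, the curl exit codes seen across this thread (7, 52, 56) point at different failure modes, and a small hedged wrapper like the one below can make the probe logs easier to read (the LB hostname is the one from the log above):

#!/bin/bash
# Classify curl failures against the LB during reconciliation.
LB=a72eb2193d310475189a0d026446609b-701037125.us-east-2.elb.amazonaws.com
curl --connect-timeout 60 -sS "http://${LB}" >/dev/null
rc=$?
case $rc in
    0)  echo "OK" ;;
    7)  echo "connection refused: no backend accepted the connection" ;;
    52) echo "empty reply: a backend accepted the connection but sent no response" ;;
    56) echo "receive failure: connection reset while reading the response" ;;
    *)  echo "curl failed with exit code $rc" ;;
esac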
After several attempts, I managed to reproduce the issue in 4.11 for AWS as well (attached log for the AWS 4.11 scenario). The truth is that it's quite a corner case, but it does impact the underlying service. I think we should make the expectations on workload availability during the upgrade clear (do we allow some small service disruption, or should there be zero disruption?), and if we decide to allow some disruption, it should be documented.
Hi @jfrancoa, thanks for working on this. With regard to your comment about the policy in the case of an upgrade: yes, the upgrade should pick one node at a time.

> Should we ensure availability of the service during the whole procedure?

Yes, and this aligns more with the responsibility of the cluster administrator. WMCO cannot ensure that, as it does not control the number of Windows nodes or workloads.

> Another situation happened on AWS: every time a node is reconciled a new machine is allocated, and the newly created machine does not have the Windows container image pre-pulled [...] causing a disruption in the service provided by the load balancer

This is a valid scenario, as Windows container images are always pulled on Windows machines if not already present. In ATTEMPT #36 from comment #16, the reason for the load balancer outage is the lack of any Windows workload (win-webserver-*) in Running state; a quick way to confirm this is shown after this comment:

> #######ATTEMPT #36 Wed Nov 2 01:57:24 PM CET 2022 ######
> NAME                                        STATUS                     ROLES    AGE     VERSION
> ip-10-0-69-232.us-east-2.compute.internal   Ready,SchedulingDisabled   worker   79m     v1.21.11-rc.0.1506+5cc9227e4695d1
> ip-10-0-71-242.us-east-2.compute.internal   Ready                      worker   15m     v1.21.11-rc.0.1506+5cc9227e4695d1
> ip-10-0-72-103.us-east-2.compute.internal   Ready                      worker   4m15s   v1.21.11-rc.0.1506+5cc9227e4695d1
> (full pod listing in comment #16: every win-webserver pod is either Terminating on ip-10-0-69-232 or ContainerCreating on the two newer nodes; none is Running)
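A quick way to confirm that diagnosis is to check the service endpoints, which only list pods that are Ready; while the list is empty the load balancer has no backend to forward traffic to. A hedged sketch, assuming the service lives in the winc-test namespace used earlier in this thread:

# If this prints nothing, no Ready pod backs the service and the LB
# cannot serve requests until a win-webserver pod reaches Running/Ready.
oc get endpoints win-webserver -n winc-test \
    -o jsonpath='{.subsets[*].addresses[*].ip}'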
The issue with the pre-pulled images is clear now. We will try to use the nanocore image (hopefully that will reduce the container image pull times), as having a Windows VM template with pre-pulled images for Azure and AWS isn't an option for our automation. However, there is still the issue with Azure: every test I ran showed connectivity issues, and in fact Azure was the provider for which this BZ was initially opened. I think it requires further investigation, as there seems to be some issue with the LB when reconciliation occurs.
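Besides switching to a smaller image, another mitigation worth considering is warming the image cache on each Windows node as soon as it joins, e.g. with a pre-pull DaemonSet. The sketch below is hypothetical: the namespace, image tag, and toleration must be adapted to the actual workload (WMCO applies an os=Windows:NoSchedule taint by default), and this only narrows the pull window on freshly provisioned nodes rather than eliminating it:

oc apply -n winc-test -f - <<'EOF'
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: windows-image-prepull
spec:
  selector:
    matchLabels:
      app: windows-image-prepull
  template:
    metadata:
      labels:
        app: windows-image-prepull
    spec:
      nodeSelector:
        kubernetes.io/os: windows
      tolerations:
      - key: os
        value: Windows
        effect: NoSchedule
      containers:
      - name: prepull
        # hypothetical image tag; use the same base image as win-webserver
        image: mcr.microsoft.com/windows/nanoserver:ltsc2019
        # keep the container alive so the image stays cached on the node
        command: ["cmd", "/c", "ping", "-t", "localhost"]
EOF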
Created https://issues.redhat.com/browse/OCPBUGS-3506 to track the issue with the load balancer outage during the Windows nodes upgrade in Azure. @jfrancoa PTAL.
Ack, looking good. Thanks for that Jose