Created attachment 1894458 [details]
Load balancer test that ended up with a connectivity outage

1. Issue:
Testing the upgrade via the version annotation causes a connectivity outage to the service's external IP.

2. WMCO & OpenShift Version:
WMCO "version": "4.0.1+f66f0980"
OCP 4.9.0-0.nightly-2022-06-24-070308

3. Platform (AWS/Azure/vSphere/platform=none):
Azure

4. If the platform is vSphere, what is the VMware Tools version?
N/A (platform is Azure).

5. Is it a new test case or an old test case? If old, is it a regression or first-time tested?
Old test case: OCP-35707.
Is it platform-specific or consistent across all platforms?
Reproduces on AWS as well.

6. Steps to Reproduce:
Create and run a script that continuously probes the Windows service's external IP (LB):

$ cat probeLB.sh
#!/bin/bash
set -e
while true
do
    date
    echo "curl 52.189.34.88"
    curl 52.189.34.88
    echo ""
    sleep 2
done

NAME                      TYPE           CLUSTER-IP      EXTERNAL-IP     PORT(S)          AGE
service/linux-webserver   LoadBalancer   172.30.3.186    52.189.33.119   8080:31053/TCP   113m
service/win-webserver     LoadBalancer   172.30.105.53   52.189.34.88    80:30648/TCP     115m

a. Scale WMCO down to 0:
# oc scale deployment.apps/windows-machine-config-operator --replicas=0 -n openshift-windows-machine-config-operator
b. Overwrite the version annotation on the Windows node:
# oc annotate node windows-zrtrt --overwrite windowsmachineconfig.openshift.io/version=invalidVersion
c. Scale WMCO back up to 1:
# oc scale deployment.apps/windows-machine-config-operator --replicas=1 -n openshift-windows-machine-config-operator

7. Actual Result and Expected Result:
Connectivity is lost while the machine is in the Provisioning state; after the machine gets an IP, connectivity returns.

8. Has a possible workaround been tried? Is there a way to recover from the issue?
No workaround; connectivity comes back on its own after about a minute.

9. Logs:
Must-gather Windows node logs (https://github.com/openshift/must-gather/blob/master/collection-scripts/gather_windows_node_logs#L24)
oc get network.operator cluster -o yaml
oc logs -f deployment/windows-machine-config-operator -n openshift-windows-machine-config-operator
Windows MachineSet yaml or windows-instances ConfigMap:
oc get machineset <windows_machineSet_name> -n openshift-machine-api -o yaml
oc get configmaps <windows_configmap_name> -n <namespace_name> -o yaml
Optional logs: anything that can be useful to debug the issue.
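For convenience, the three reproduction steps above can be combined into one script. This is a hedged sketch; the node name windows-zrtrt is specific to this report's environment and must be replaced to match the target cluster:

#!/bin/bash
set -euo pipefail
WMCO_NS=openshift-windows-machine-config-operator
NODE=windows-zrtrt   # replace with a Windows node from the target cluster

# a. Scale WMCO down so the annotation is not immediately reverted
oc scale deployment.apps/windows-machine-config-operator --replicas=0 -n "${WMCO_NS}"
# b. Overwrite the version annotation to force reconciliation on scale-up
oc annotate node "${NODE}" --overwrite windowsmachineconfig.openshift.io/version=invalidVersion
# c. Scale WMCO back up; it will reconcile (re-provision) the mismatched node
oc scale deployment.apps/windows-machine-config-operator --replicas=1 -n "${WMCO_NS}"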
Confirmed this issue also appears in 4.11 ("version": "6.0.0-dd57309").
@rrasouli Looking at the logs you've attached I'm not clear on the problem. Is it that the final curl never reaches the LB? It would be helpful to have the curl stderr. What happens when curling again? Does this issue go away eventually or do all subsequent curls have the same behavior?
@ssoto The problem at the end of the curl script test is that the LB stops responding while the machine is in the Provisioning state, so the final curl is the one that failed to reach the LB address (the script exits there because of `set -e`). That repeats in all the scenarios described in the bug description. I can remove the `set -e` to confirm the LB becomes available again. After the machine state switches to Provisioned (i.e. it has an IP address), the LB responds again.

> What happens when curling again? Does this issue go away eventually or do all subsequent curls have the same behavior?

The issue goes away, until we start a new upgrade scenario.
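For reference, a probe variant without `set -e`, so the loop keeps running through the outage and records when connectivity drops and returns (a hedged sketch; the external IP is the one from the description):

#!/bin/bash
# Probe the win-webserver LB continuously; log failures instead of exiting.
LB_IP=52.189.34.88   # external IP from the description; adjust as needed
while true
do
    if curl --connect-timeout 5 -sS "http://${LB_IP}" >/dev/null; then
        echo "$(date) OK"
    else
        echo "$(date) FAILED (curl exit code $?)"
    fi
    sleep 2
done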
@rrasouli Any update here after the session with the dev team?
Same results: there is still packet loss in 4.12 on AWS as well.
@rrasouli is attachment 1913549 [details] related to this bug?
@rrasouli I followed the steps described in the test [1] and I could replicate this scenario where the service becomes unavailable for a short period of time. Yet, I don't feel it is a bug. Looking at the win-webserver deployment [2] referenced in the test, there is only one (1) replica, and you need at least two pods for the service to be considered minimally available. The load balancer will distribute network traffic across all pods of the deployment.

Recommendation: set the number of replicas to at least two (2) to ensure that deleting a single pod will not cause downtime; see the sketch after this comment. For example, you are running a single instance of the win-webserver workload; if the one and only pod gets deleted, evicted, or scheduled on another Windows node (the WMCO upgrade scenario), you may find the service completely unavailable for a short period of time. In general, if you only have one application replica, any termination will result in downtime.

> 7. Actual Result and Expected Result
> Connectivity lost when machine is in provisioning state, after the machine
> get an IP the connectivity return

WRT the actual result: the events are independent. In this case, the new machine getting an IP address in the Provisioning state is unrelated to the service becoming available. Connectivity is back after the scheduler assigns the win-webserver pod to another Windows node.

[1] https://polarion.engineering.redhat.com/polarion/#/project/OSE/workitem?id=OCP-35707
[2] https://raw.githubusercontent.com/sgaoshang/winc-test/master/data/WinWebServer.yaml
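A minimal sketch of that recommendation, assuming the deployment runs in the winc-test namespace seen later in this thread:

# Run two replicas so a single pod termination does not take the service down
oc scale deployment/win-webserver -n winc-test --replicas=2
# Wait until the new replica is available before relying on the LB again
oc rollout status deployment/win-webserver -n winc-test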
Hi @rrasouli, the number of Windows workload replicas is essential, and the workloads should be distributed (scheduled) evenly across the existing Windows nodes to achieve better availability. As per our conversation in Slack [1], to avoid service outages in the load balancer, ensure there are at least two (2) Windows nodes in Ready state and that the running Windows workloads are not all scheduled on the same Windows node. Otherwise, the web-server workload pods will all be re-scheduled to another Windows node at once, and the service will experience a short outage during this process, as described above. A sketch of one way to enforce that spread follows this comment.

[1] https://coreos.slack.com/archives/CM4ERHBJS/p1666203999912499?thread_ts=1666195645.101679&cid=CM4ERHBJS
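One way to keep the replicas from landing on the same node is a topology spread constraint on the deployment. A hedged sketch, assuming the winc-test namespace and an app=win-webserver pod label (verify the actual labels in WinWebServer.yaml before applying):

# Hypothetical patch: spread win-webserver pods across Windows nodes.
# ScheduleAnyway keeps pods schedulable even when only one node is Ready.
oc -n winc-test patch deployment/win-webserver --type=merge -p \
    '{"spec":{"template":{"spec":{"topologySpreadConstraints":[{"maxSkew":1,"topologyKey":"kubernetes.io/hostname","whenUnsatisfiable":"ScheduleAnyway","labelSelector":{"matchLabels":{"app":"win-webserver"}}}]}}}}'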
This bug was reviewed in today's QE sync meeting. QE Team confirmed that after following the suggestion mentioned above[1] the bug is not reproducible in AWS. QE Team is actively working on the verification in Azure, so moving this bug to ON_QA. [1] https://bugzilla.redhat.com/show_bug.cgi?id=2103631#c14
After some thorough testing, I can confirm that there is an existing issue with the Azure Load Balancer (tested in 4.8 and 4.11) when performing the node reconciliation, even though there are 3 workers:

#######ATTEMPT #4 Wed Nov 2 01:25:04 PM CET 2022 ######
NAME            STATUS   ROLES    AGE     VERSION
windows-gbjbz   Ready    worker   7m35s   v1.21.11-rc.0.1506+5cc9227e4695d1
windows-gmn4c   Ready    worker   34m     v1.21.11-rc.0.1506+5cc9227e4695d1
windows-v88gm   Ready    worker   66m     v1.21.11-rc.0.1506+5cc9227e4695d1

NAME                               READY   STATUS    RESTARTS   AGE     IP            NODE                                           NOMINATED NODE   READINESS GATES
linux-webserver-7749c5ddff-8xm6g   1/1     Running   0          109m    10.131.0.26   jfrancoa-0211-rete-k28gm-worker-westus-vpbcw   <none>           <none>
win-webserver-549cd7495d-5s8mq     1/1     Running   0          3m29s   10.132.8.8    windows-gmn4c                                  <none>           <none>
win-webserver-549cd7495d-6l47p     1/1     Running   0          22m     10.132.6.9    windows-v88gm                                  <none>           <none>
win-webserver-549cd7495d-d5z5f     1/1     Running   0          22m     10.132.8.6    windows-gmn4c                                  <none>           <none>
win-webserver-549cd7495d-fs9c6     1/1     Running   0          3m29s   10.132.9.3    windows-gbjbz                                  <none>           <none>
win-webserver-549cd7495d-gbbmd     1/1     Running   0          22m     10.132.8.7    windows-gmn4c                                  <none>           <none>
win-webserver-549cd7495d-m6lhf     1/1     Running   0          22m     10.132.6.10   windows-v88gm                                  <none>           <none>
win-webserver-549cd7495d-m7jdc     1/1     Running   0          3m29s   10.132.9.4    windows-gbjbz                                  <none>           <none>
win-webserver-549cd7495d-qlqsd     1/1     Running   0          3m29s   10.132.9.2    windows-gbjbz                                  <none>           <none>
win-webserver-549cd7495d-wdgkk     1/1     Running   0          22m     10.132.6.11   windows-v88gm                                  <none>           <none>

[curl progress meter trimmed]
<html><body><H1>Windows Container Web Server</H1></body></html>

#######ATTEMPT #5 Wed Nov 2 01:26:06 PM CET 2022 ######
NAME            STATUS                     ROLES    AGE     VERSION
windows-gbjbz   Ready                      worker   8m38s   v1.21.11-rc.0.1506+5cc9227e4695d1
windows-gmn4c   Ready,SchedulingDisabled   worker   35m     v1.21.11-rc.0.1506+5cc9227e4695d1
windows-v88gm   Ready                      worker   68m     v1.21.11-rc.0.1506+5cc9227e4695d1

NAME                               READY   STATUS              RESTARTS   AGE     IP            NODE                                           NOMINATED NODE   READINESS GATES
linux-webserver-7749c5ddff-8xm6g   1/1     Running             0          110m    10.131.0.26   jfrancoa-0211-rete-k28gm-worker-westus-vpbcw   <none>           <none>
win-webserver-549cd7495d-6l47p     1/1     Running             0          23m     10.132.6.9    windows-v88gm                                  <none>           <none>
win-webserver-549cd7495d-b6hvg     0/1     ContainerCreating   0          9s      <none>        windows-gbjbz                                  <none>           <none>
win-webserver-549cd7495d-cwlmn     0/1     ContainerCreating   0          9s      <none>        windows-v88gm                                  <none>           <none>
win-webserver-549cd7495d-fs9c6     1/1     Running             0          4m32s   10.132.9.3    windows-gbjbz                                  <none>           <none>
win-webserver-549cd7495d-m6lhf     1/1     Running             0          23m     10.132.6.10   windows-v88gm                                  <none>           <none>
win-webserver-549cd7495d-m7jdc     1/1     Running             0          4m32s   10.132.9.4    windows-gbjbz                                  <none>           <none>
win-webserver-549cd7495d-pcrwt     0/1     ContainerCreating   0          9s      <none>        windows-v88gm                                  <none>           <none>
win-webserver-549cd7495d-qlqsd     1/1     Running             0          4m32s   10.132.9.2    windows-gbjbz                                  <none>           <none>
win-webserver-549cd7495d-wdgkk     1/1     Running             0          23m     10.132.6.11   windows-v88gm                                  <none>           <none>

[curl progress meter trimmed]
curl: (7) Failed to connect to 20.237.202.229 port 80 after 8270 ms: Connection refused

Another situation happened on AWS: every time a node is reconciled a new machine is allocated, and the newly created machine does not have the Windows container image pre-pulled, so creating the container takes longer. While that image was still being pulled, the node on which most of the containers were running reached its turn to be reconciled too, causing a disruption in the service provided by the load balancer:

#######ATTEMPT #35 Wed Nov 2 01:56:22 PM CET 2022 ######
NAME                                        STATUS                     ROLES    AGE     VERSION
ip-10-0-69-232.us-east-2.compute.internal   Ready                      worker   78m     v1.21.11-rc.0.1506+5cc9227e4695d1
ip-10-0-71-242.us-east-2.compute.internal   Ready                      worker   14m     v1.21.11-rc.0.1506+5cc9227e4695d1
ip-10-0-72-103.us-east-2.compute.internal   Ready,SchedulingDisabled   worker   3m14s   v1.21.11-rc.0.1506+5cc9227e4695d1

NAME                               READY   STATUS              RESTARTS   AGE     IP            NODE                                        NOMINATED NODE   READINESS GATES
linux-webserver-7749c5ddff-8frhq   1/1     Running             0          4h45m   10.128.2.21   ip-10-0-72-51.us-east-2.compute.internal    <none>           <none>
win-webserver-858656469f-224cr     0/1     ContainerCreating   0          9m23s   <none>        ip-10-0-71-242.us-east-2.compute.internal   <none>           <none>
win-webserver-858656469f-f89rb     1/1     Running             0          9m23s   10.132.8.14   ip-10-0-69-232.us-east-2.compute.internal   <none>           <none>
win-webserver-858656469f-hcg9j     1/1     Running             0          9m24s   10.132.8.12   ip-10-0-69-232.us-east-2.compute.internal   <none>           <none>
win-webserver-858656469f-jhx9d     1/1     Running             0          34m     10.132.8.10   ip-10-0-69-232.us-east-2.compute.internal   <none>           <none>
win-webserver-858656469f-l5xp5     1/1     Running             0          34m     10.132.8.11   ip-10-0-69-232.us-east-2.compute.internal   <none>           <none>
win-webserver-858656469f-mxvbg     1/1     Running             0          9m23s   10.132.8.13   ip-10-0-69-232.us-east-2.compute.internal   <none>           <none>
win-webserver-858656469f-s5vjh     1/1     Running             0          53m     10.132.8.8    ip-10-0-69-232.us-east-2.compute.internal   <none>           <none>
win-webserver-858656469f-vcwrj     0/1     ContainerCreating   0          9m24s   <none>        ip-10-0-71-242.us-east-2.compute.internal   <none>           <none>
win-webserver-858656469f-vspgf     1/1     Running             0          53m     10.132.8.9    ip-10-0-69-232.us-east-2.compute.internal   <none>           <none>

[curl progress meter trimmed]
<html><body><H1>Windows Container Web Server</H1></body></html>

#######ATTEMPT #36 Wed Nov 2 01:57:24 PM CET 2022 ######
NAME                                        STATUS                     ROLES    AGE     VERSION
ip-10-0-69-232.us-east-2.compute.internal   Ready,SchedulingDisabled   worker   79m     v1.21.11-rc.0.1506+5cc9227e4695d1
ip-10-0-71-242.us-east-2.compute.internal   Ready                      worker   15m     v1.21.11-rc.0.1506+5cc9227e4695d1
ip-10-0-72-103.us-east-2.compute.internal   Ready                      worker   4m15s   v1.21.11-rc.0.1506+5cc9227e4695d1

NAME                               READY   STATUS              RESTARTS   AGE     IP            NODE                                        NOMINATED NODE   READINESS GATES
linux-webserver-7749c5ddff-8frhq   1/1     Running             0          4h46m   10.128.2.21   ip-10-0-72-51.us-east-2.compute.internal    <none>           <none>
win-webserver-858656469f-224cr     0/1     ContainerCreating   0          10m     <none>        ip-10-0-71-242.us-east-2.compute.internal   <none>           <none>
win-webserver-858656469f-2j8zb     0/1     ContainerCreating   0          4s      <none>        ip-10-0-71-242.us-east-2.compute.internal   <none>           <none>
win-webserver-858656469f-4rnb5     0/1     ContainerCreating   0          4s      <none>        ip-10-0-72-103.us-east-2.compute.internal   <none>           <none>
win-webserver-858656469f-blkfk     0/1     ContainerCreating   0          4s      <none>        ip-10-0-71-242.us-east-2.compute.internal   <none>           <none>
win-webserver-858656469f-f89rb     1/1     Terminating         0          10m     10.132.8.14   ip-10-0-69-232.us-east-2.compute.internal   <none>           <none>
win-webserver-858656469f-hcg9j     1/1     Terminating         0          10m     10.132.8.12   ip-10-0-69-232.us-east-2.compute.internal   <none>           <none>
win-webserver-858656469f-jhx9d     1/1     Terminating         0          35m     10.132.8.10   ip-10-0-69-232.us-east-2.compute.internal   <none>           <none>
win-webserver-858656469f-l5xp5     1/1     Terminating         0          35m     10.132.8.11   ip-10-0-69-232.us-east-2.compute.internal   <none>           <none>
win-webserver-858656469f-lvs4z     0/1     ContainerCreating   0          4s      <none>        ip-10-0-72-103.us-east-2.compute.internal   <none>           <none>
win-webserver-858656469f-m8svn     0/1     ContainerCreating   0          4s      <none>        ip-10-0-71-242.us-east-2.compute.internal   <none>           <none>
win-webserver-858656469f-mxvbg     1/1     Terminating         0          10m     10.132.8.13   ip-10-0-69-232.us-east-2.compute.internal   <none>           <none>
win-webserver-858656469f-s5vjh     1/1     Terminating         0          54m     10.132.8.8    ip-10-0-69-232.us-east-2.compute.internal   <none>           <none>
win-webserver-858656469f-vcwrj     0/1     ContainerCreating   0          10m     <none>        ip-10-0-71-242.us-east-2.compute.internal   <none>           <none>
win-webserver-858656469f-vspgf     1/1     Terminating         0          54m     10.132.8.9    ip-10-0-69-232.us-east-2.compute.internal   <none>           <none>
win-webserver-858656469f-xdpqn     0/1     ContainerCreating   0          4s      <none>        ip-10-0-72-103.us-east-2.compute.internal   <none>           <none>
win-webserver-858656469f-zv6sq     0/1     ContainerCreating   0          4s      <none>        ip-10-0-72-103.us-east-2.compute.internal   <none>           <none>

[curl progress meter trimmed; the request hung for ~60 seconds]
curl: (52) Empty reply from server

My doubt regarding the AWS case is: what's the policy in the case of an upgrade? Should we ensure availability of the service during the whole procedure? Because if that's the case, WMCO should wait until the image is pulled and the container is created before jumping to the next node to reconcile, in my humble opinion.

I am uploading the logs for the three cases (Azure 4.11: 35707_Azure_411.log, Azure 4.8: 35707_Azure_48.log, and AWS 4.8: 35707_AWS_48.log) and moving this BZ back to ASSIGNED.

Sorry for the delay in answering, but I was getting failures in the Load Balancer's connectivity and I wasn't sure whether the logic of the test itself was wrong. After running the following loop in parallel, I could confirm it was not a glitch:

for i in {1..60}; do
    time=`date`
    echo -e "\n#######ATTEMPT #${i} ${time} ######" &>> /tmp/35707_Azure_48.log
    oc get nodes -l=node.openshift.io/os_id="Windows" &>> /tmp/35707_Azure_48.log
    oc get pods -n winc-test -o wide &>> /tmp/35707_Azure_48.log
    curl --connect-timeout 60 20.237.202.229 &>> /tmp/35707_Azure_48.log
    sleep 60
done
Regarding the AWS scenario in which the connectivity gets lost because the image isn't pulled yet, I can confirm it also occurs in IPI AWS version 4.11:

Nov 3 09:24:28.221: INFO: Checked LB connectivity of a72eb2193d310475189a0d026446609b-701037125.us-east-2.elb.amazonaws.com
Nov 3 09:24:48.610: INFO: Windows machine is not provisioned yet. Waiting 30 seconds more ...
Nov 3 09:25:18.409: INFO: numberOfMachines value is: 3
Nov 3 09:25:28.534: INFO: Checked LB connectivity of a72eb2193d310475189a0d026446609b-701037125.us-east-2.elb.amazonaws.com
Nov 3 09:25:54.471: INFO: numberOfMachines value is: 3
Nov 3 09:26:28.827: INFO: Connectivity check failed: error in curl command exit status 56 the IP of a72eb2193d310475189a0d026446609b-701037125.us-east-2.elb.amazonaws.com is not accesible

I wanted to upload the same "scenario" logs, but when re-running the test case it went fine. Everything depends on the time it takes to pull the Windows container image: if it is pulled quickly enough, before the next node gets reconciled, everything works fine; if not, we see a short service disruption.
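As an aside, the curl exit codes seen across this thread (7, 52, 56) point at different failure modes, and a small hedged wrapper like the one below can make the probe logs easier to read (the LB hostname is the one from the log above):

#!/bin/bash
# Classify curl failures against the LB during reconciliation.
LB=a72eb2193d310475189a0d026446609b-701037125.us-east-2.elb.amazonaws.com
curl --connect-timeout 60 -sS "http://${LB}" >/dev/null
rc=$?
case $rc in
    0)  echo "OK" ;;
    7)  echo "connection refused: no backend accepted the connection" ;;
    52) echo "empty reply: a backend accepted the connection but sent no response" ;;
    56) echo "receive failure: connection reset while reading the response" ;;
    *)  echo "curl failed with exit code $rc" ;;
esac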
After several attempts, I managed to reproduce the issue in 4.11 for AWS as well (attached log for the AWS 4.11 scenario). The truth is that it's quite a corner case, but it does impact the underlying service. I think we should make the expectations on workload availability during the upgrade clear (do we allow some small service disruption, or should there be zero disruption?), and if we decide to allow some disruption, it should be documented.
Hi @jfrancoa, thanks for working on this. With regard to your comment about the policy in the case of an upgrade: yes, the upgrade should pick one node at a time.

> Should we ensure availability of the service during the whole procedure?

Yes, and this aligns more with the responsibility of the cluster administrator. WMCO cannot ensure that, as it does not control the number of Windows nodes or workloads.

> Another situation happened on AWS: every time a node is reconciled a new machine is allocated, and the newly created machine does not have the Windows container image pre-pulled [...] causing a disruption in the service provided by the load balancer

This is a valid scenario, as Windows container images are always pulled on Windows machines if not already present. In ATTEMPT #36 from comment #16, the reason for the load balancer outage is the lack of any Windows workload (win-webserver-*) in Running state; a quick way to confirm this is shown after this comment:

> #######ATTEMPT #36 Wed Nov 2 01:57:24 PM CET 2022 ######
> NAME                                        STATUS                     ROLES    AGE     VERSION
> ip-10-0-69-232.us-east-2.compute.internal   Ready,SchedulingDisabled   worker   79m     v1.21.11-rc.0.1506+5cc9227e4695d1
> ip-10-0-71-242.us-east-2.compute.internal   Ready                      worker   15m     v1.21.11-rc.0.1506+5cc9227e4695d1
> ip-10-0-72-103.us-east-2.compute.internal   Ready                      worker   4m15s   v1.21.11-rc.0.1506+5cc9227e4695d1
> (full pod listing in comment #16: every win-webserver pod is either Terminating on ip-10-0-69-232 or ContainerCreating on the two newer nodes; none is Running)
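A quick way to confirm that diagnosis is to check the service endpoints, which only list pods that are Ready; while the list is empty the load balancer has no backend to forward traffic to. A hedged sketch, assuming the service lives in the winc-test namespace used earlier in this thread:

# If this prints nothing, no Ready pod backs the service and the LB
# cannot serve requests until a win-webserver pod reaches Running/Ready.
oc get endpoints win-webserver -n winc-test \
    -o jsonpath='{.subsets[*].addresses[*].ip}'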
The issue with the pre-pulled images is clear now. We will try to use the nanocore image (hopefully that will reduce the container image pull times), as having a Windows VM template with pre-pulled images for Azure and AWS isn't an option for our automation. However, there is still the issue with Azure: every test I ran showed connectivity issues, and in fact Azure was the provider for which this BZ was initially opened. I think it requires further investigation, as there seems to be some issue with the LB when reconciliation occurs.
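Besides switching to a smaller image, another mitigation worth considering is warming the image cache on each Windows node as soon as it joins, e.g. with a pre-pull DaemonSet. The sketch below is hypothetical: the namespace, image tag, and toleration must be adapted to the actual workload (WMCO applies an os=Windows:NoSchedule taint by default), and this only narrows the pull window on freshly provisioned nodes rather than eliminating it:

oc apply -n winc-test -f - <<'EOF'
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: windows-image-prepull
spec:
  selector:
    matchLabels:
      app: windows-image-prepull
  template:
    metadata:
      labels:
        app: windows-image-prepull
    spec:
      nodeSelector:
        kubernetes.io/os: windows
      tolerations:
      - key: os
        value: Windows
        effect: NoSchedule
      containers:
      - name: prepull
        # hypothetical image tag; use the same base image as win-webserver
        image: mcr.microsoft.com/windows/nanoserver:ltsc2019
        # keep the container alive so the image stays cached on the node
        command: ["cmd", "/c", "ping", "-t", "localhost"]
EOF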
Created https://issues.redhat.com/browse/OCPBUGS-3506 to track the issue with the load balancer outage during the Windows nodes upgrade in Azure. @jfrancoa PTAL.
Ack, looking good. Thanks for that Jose