Bug 1905950

Summary: LB service unstable with multiple Windows nodes and pods
Product: OpenShift Container Platform Reporter: gaoshang <sgao>
Component: Windows Containers Assignee: Sebastian Soto <ssoto>
Status: CLOSED ERRATA QA Contact: gaoshang <sgao>
Severity: high Docs Contact:
Priority: high    
Version: 4.6.z CC: anusaxen, aos-bugs, aravindh, dcbw, gmarkley, pmahajan, rgudimet, sdodson, ssoto, zzhao
Target Milestone: ---   
Target Release: 4.8.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Cause: Issue in kube-proxy
Consequence: Nodes would become unreachable under certain situations when using a load balancer
Fix: Upstream fix was merged downstream
Result: Normal behavior
Story Points: ---
Clone Of:
: 1942628 (view as bug list) Environment:
Last Closed: 2021-08-03 20:29:16 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1942628    
Attachments:
Description    Flags
machineset     none
WinWebServer   none

Description gaoshang 2020-12-09 11:18:36 UTC
Description of problem:
The LB service works well with 1 Windows node + n Windows pods. After scaling up a second Windows node, the LB becomes unstable, e.g.:
# oc get nodes -l kubernetes.io/os=windows
NAME              STATUS   ROLES    AGE     VERSION
winworker-5hjbf   Ready    worker   4h13m   v1.19.2-1005+e8f355a526fe47
winworker-wcvxf   Ready    worker   4h17m   v1.19.2-1005+e8f355a526fe47

# oc get pod -owide
NAME                             READY   STATUS    RESTARTS   AGE    IP           NODE              NOMINATED NODE   READINESS GATES
win-webserver-549cd7495d-86fx7   1/1     Running   0          85m    10.132.0.6   winworker-wcvxf   <none>           <none>
win-webserver-549cd7495d-8x4v9   1/1     Running   0          2m7s   10.132.1.7   winworker-5hjbf   <none>           <none>

# oc get service
NAME            TYPE           CLUSTER-IP       EXTERNAL-IP      PORT(S)        AGE
win-webserver   LoadBalancer   172.30.136.239   52.182.219.200   80:31231/TCP   148m

# curl 52.182.219.200
curl: (7) Failed connect to 52.182.219.200:80; Connection timed out


Version-Release number of selected component (if applicable):
WMCO 1.0.0
OCP version 4.6.0-0.nightly-2020-12-08-021151

How reproducible:
always

Steps to Reproduce:
1, Install an OCP cluster with the ovn-kubernetes network
2, Scale up to 2 Windows nodes and 2 Windows pods
3, Check the LB service
# curl 52.182.219.200

Actual results:
LB service unstable

Expected results:
LB service should work well

Additional info:

Comment 1 Aravindh Puthiyaparambil 2020-12-09 14:34:25 UTC
Procedure to reproduce on a new AWS 4.6 cluster with WMCO running:
1. Bring up a MachineSet with 2 replicas.
2. Bring up a Webserver Deployment with 2 replicas and a Service. The pods should land on different nodes.
3. curl the website; it will not work.
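
Step 2 depends on the pods landing on different nodes. A minimal sketch for checking that, assuming the `oc get pod -owide` column layout shown in the description (NODE as the seventh column); the function name is hypothetical:

```shell
# count_pod_nodes: read `oc get pod -owide` rows on stdin and print the
# number of distinct nodes hosting them.
# Assumption: NODE is the 7th whitespace-separated column, as in the
# `oc get pod -owide` output in the description.
count_pod_nodes() {
    awk 'NF >= 7 && $1 != "NAME" { nodes[$7] = 1 }
         END { n = 0; for (k in nodes) n++; print n }'
}

# Usage:
#   oc get pod -owide | count_pod_nodes
```

A result of 2 or more means the Deployment spans multiple Windows nodes, which is the condition under which the failure was reported.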

Comment 2 gaoshang 2020-12-09 14:37:14 UTC
I tried this: with 2 Windows nodes, scaling the Webserver Deployment down to 1 replica makes the LB work again; scaling it back up to 2 replicas, with the pods landing on different nodes, makes the LB unstable again.
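
The scale-down and scale-up experiment can be sketched as follows. This assumes the Deployment is named `win-webserver` (inferred from the pod names above, not confirmed by the attachment); the function name and the `OC_CMD` override are illustrative:

```shell
# scale_and_wait DEPLOYMENT REPLICAS: scale a Deployment and wait for the
# rollout to finish, so the LB can be probed after each step.
# OC_CMD may be overridden (e.g. with a stub); it defaults to oc.
scale_and_wait() {
    deployment=$1
    replicas=$2
    "${OC_CMD:-oc}" scale "deployment/$deployment" --replicas="$replicas" &&
        "${OC_CMD:-oc}" rollout status "deployment/$deployment"
}

# Usage (deployment name is an assumption):
#   scale_and_wait win-webserver 1   # LB reported to work again
#   scale_and_wait win-webserver 2   # LB reported unstable once pods span both nodes
```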

Comment 4 Anurag saxena 2020-12-09 22:14:26 UTC
cc'ing @dcbw

Comment 5 Aravindh Puthiyaparambil 2020-12-10 00:12:45 UTC
I tried the following experiments:

Brought up a 4.6 cluster on AWS and ran WMCO@https://github.com/openshift/windows-machine-config-operator/commit/c28e7de1f6f48780a44fcc0e3a19bae7ae07ee07

1. Created a MachineSet with 2 replicas using the Windows Server 1809 (10.0.17763.1457) Datacenter image (us-east-2/ami-002ad2301d1a7322d)
2. Brought up a webserver deployment using mcr.microsoft.com/powershell:lts-nanoserver-1809 image.
3. curl <service external IP/DNS>

I was *not able* to reproduce the problem. Also tried with mcr.microsoft.com/windows/servercore:1809 and could not reproduce the problem.

I also tried:

1. Created a MachineSet with 2 replicas using the Windows Server 1909 Datacenter image (us-east-2/ami-08c4963e382fe473f)
2. Brought up a webserver deployment using mcr.microsoft.com/windows/servercore:1909 image. 
3. curl <service external IP/DNS>

I was *able* to reproduce the problem, i.e. intermittent "curl: (52) Empty reply from server". Also tried with mcr.microsoft.com/powershell:nanoserver-1909 and could reproduce the problem there as well.

To summarize:
.............................................................................................................
:                Windows version                 :                 Deployment Image                 : Issue :
:................................................:..................................................:.......:
: 1809 v10.0.17763.1457 (ami-002ad2301d1a7322d)  : mcr.microsoft.com/windows/servercore:1809        : No    :
: 1809 v10.0.17763.1457 (ami-002ad2301d1a7322d)  : mcr.microsoft.com/powershell:lts-nanoserver-1809 : No    :
: 1909 v10.0.18363.1198 (ami-08c4963e382fe473f)  : mcr.microsoft.com/windows/servercore:1909        : Yes   :
: 1909 v10.0.18363.1198 (ami-08c4963e382fe473f)  : mcr.microsoft.com/powershell:nanoserver-1909     : Yes   :
:................................................:..................................................:.......:

Comment 6 gaoshang 2020-12-10 16:32:34 UTC
I tested with Windows Server 2019 (10.0.17763.1579) + image mcr.microsoft.com/powershell:lts-nanoserver-1809; the issue exists.

Env:
OCP version 4.6.8
Cloud provider Azure
MachineSet and WinWebServer(deployment and service included) attached
Must-gather attached

Found some errors in kube-proxy; not sure whether they are related.

# oc adm node-logs -l kubernetes.io/os=windows --path=kube-proxy/kube-proxy.exe.ERROR
winworker-9zcmf Log file created at: 2020/12/10 14:44:59
winworker-9zcmf Running on machine: winworker-9zcmf
winworker-9zcmf Binary: Built with gc go1.15.2 for windows/amd64
winworker-9zcmf Log line format: [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg
winworker-9zcmf E1210 14:44:59.788510    2788 proxier.go:1116] Endpoint information not available for service openshift-windows-machine-config-operator/linux-webserver. Not applying any policy
winworker-9zcmf E1210 14:44:59.835528    2788 proxier.go:1116] Endpoint information not available for service openshift-windows-machine-config-operator/linux-webserver. Not applying any policy
winworker-9zcmf E1210 14:45:14.054623    2788 proxier.go:1116] Endpoint information not available for service openshift-windows-machine-config-operator/linux-webserver. Not applying any policy
winworker-9zcmf E1210 14:45:18.780544    2788 proxier.go:1116] Endpoint information not available for service openshift-windows-machine-config-operator/linux-webserver. Not applying any policy
winworker-9zcmf E1210 14:45:18.831558    2788 proxier.go:1116] Endpoint information not available for service openshift-windows-machine-config-operator/linux-webserver. Not applying any policy
winworker-9zcmf E1210 14:45:48.897218    2788 proxier.go:1116] Endpoint information not available for service openshift-windows-machine-config-operator/linux-webserver. Not applying any policy
winworker-9zcmf E1210 14:47:24.298405    2788 proxier.go:1116] Endpoint information not available for service winc-test/linux-webserver. Not applying any policy
winworker-9zcmf E1210 14:47:24.343417    2788 proxier.go:1116] Endpoint information not available for service winc-test/linux-webserver. Not applying any policy
winworker-9zcmf E1210 14:54:04.440172    2788 proxier.go:1116] Endpoint information not available for service winc-test/win-webserver. Not applying any policy
winworker-9zcmf E1210 14:54:04.484189    2788 proxier.go:1116] Endpoint information not available for service winc-test/win-webserver. Not applying any policy
winworker-9zcmf E1210 14:54:09.011283    2788 proxier.go:1116] Endpoint information not available for service winc-test/win-webserver. Not applying any policy
winworker-9zcmf E1210 14:54:39.080454    2788 proxier.go:1116] Endpoint information not available for service winc-test/win-webserver. Not applying any policy
winworker-9zcmf E1210 15:48:58.684729    2788 proxier.go:1116] Endpoint information not available for service winc-test/win-webserver. Not applying any policy
winworker-9zcmf E1210 15:48:58.927741    2788 proxier.go:1116] Endpoint information not available for service winc-test/win-webserver. Not applying any policy
winworker-9zcmf E1210 15:53:19.122508    2788 proxier.go:1116] Endpoint information not available for service winc-test/win-webserver. Not applying any policy
winworker-9zcmf E1210 15:53:19.175513    2788 proxier.go:1116] Endpoint information not available for service winc-test/win-webserver. Not applying any policy
winworker-gx7ch Log file created at: 2020/12/10 14:45:01
winworker-gx7ch Running on machine: winworker-gx7ch
winworker-gx7ch Binary: Built with gc go1.15.2 for windows/amd64
winworker-gx7ch Log line format: [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg
winworker-gx7ch E1210 14:45:01.908906    4064 proxier.go:1116] Endpoint information not available for service openshift-windows-machine-config-operator/linux-webserver. Not applying any policy
winworker-gx7ch E1210 14:45:14.008221    4064 proxier.go:1116] Endpoint information not available for service openshift-windows-machine-config-operator/linux-webserver. Not applying any policy
winworker-gx7ch E1210 14:45:18.797934    4064 proxier.go:1116] Endpoint information not available for service openshift-windows-machine-config-operator/linux-webserver. Not applying any policy
winworker-gx7ch E1210 14:45:18.859936    4064 proxier.go:1116] Endpoint information not available for service openshift-windows-machine-config-operator/linux-webserver. Not applying any policy
winworker-gx7ch E1210 14:45:48.987742    4064 proxier.go:1116] Endpoint information not available for service openshift-windows-machine-config-operator/linux-webserver. Not applying any policy
winworker-gx7ch E1210 14:47:24.312969    4064 proxier.go:1116] Endpoint information not available for service winc-test/linux-webserver. Not applying any policy
winworker-gx7ch E1210 14:47:24.381970    4064 proxier.go:1116] Endpoint information not available for service winc-test/linux-webserver. Not applying any policy
winworker-gx7ch E1210 14:54:04.479336    4064 proxier.go:1116] Endpoint information not available for service winc-test/win-webserver. Not applying any policy
winworker-gx7ch E1210 14:54:04.583347    4064 proxier.go:1116] Endpoint information not available for service winc-test/win-webserver. Not applying any policy
winworker-gx7ch E1210 14:54:09.276368    4064 proxier.go:1116] Endpoint information not available for service winc-test/win-webserver. Not applying any policy
winworker-gx7ch E1210 14:54:39.357825    4064 proxier.go:1116] Endpoint information not available for service winc-test/win-webserver. Not applying any policy
winworker-gx7ch E1210 15:48:58.812977    4064 proxier.go:1116] Endpoint information not available for service winc-test/win-webserver. Not applying any policy
winworker-gx7ch E1210 15:48:58.995992    4064 proxier.go:1116] Endpoint information not available for service winc-test/win-webserver. Not applying any policy
winworker-gx7ch E1210 15:53:21.489424    4064 proxier.go:1116] Endpoint information not available for service winc-test/win-webserver. Not applying any policy
winworker-gx7ch E1210 15:53:21.560448    4064 proxier.go:1116] Endpoint information not available for service winc-test/win-webserver. Not applying any policy
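
The log above repeats a single message many times. A sketch for condensing it per node and service, assuming the `oc adm node-logs` output format shown (node name as the first field); the function name is illustrative, and whether these errors relate to the LB failure is not established here:

```shell
# summarize_proxy_errors: read kube-proxy.exe.ERROR output (as printed by
# `oc adm node-logs`) on stdin and print "<count> <node> <namespace/service>"
# for each node/service pair that logged the missing-endpoint error.
summarize_proxy_errors() {
    awk '/Endpoint information not available for service/ {
             svc = $0
             sub(/.*not available for service /, "", svc)
             sub(/\. Not applying any policy.*/, "", svc)
             count[$1 " " svc]++
         }
         END { for (k in count) print count[k], k }' | sort -k2,3
}

# Usage:
#   oc adm node-logs -l kubernetes.io/os=windows \
#     --path=kube-proxy/kube-proxy.exe.ERROR | summarize_proxy_errors
```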

Comment 7 gaoshang 2020-12-10 16:33:10 UTC
Created attachment 1738262 [details]
machineset

Comment 8 gaoshang 2020-12-10 16:34:22 UTC
Created attachment 1738263 [details]
WinWebServer

Comment 10 Aravindh Puthiyaparambil 2020-12-10 23:54:19 UTC
In https://bugzilla.redhat.com/show_bug.cgi?id=1905950#c5, "1809" needs to be replaced with "Windows Server 2019". Here is an updated table with findings:

......................................
: Windows :     Version      : Issue :
:.........:..................:.......:
:    2019 : v10.0.17763.1457 : No    :
:    2019 : v10.0.17763.1577 : Yes   :
:    1909 : v10.0.18363.1198 : Yes   :
:.........:..................:.......:

It is possible that there could be a version of 1809 and 1909 that works but that would need more bisection.
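
The table above can be encoded as a lookup against the builds reported by the Windows nodes. This is only a sketch of the observations recorded in this comment, not a definitive compatibility list; the function names are hypothetical, and any build not in the table is reported as untested:

```shell
# lb_issue_status BUILD: map a Windows build (e.g. 10.0.17763.1457, with or
# without a leading "v") to the result recorded in the table above.
lb_issue_status() {
    case "${1#v}" in
        10.0.17763.1457) echo "No" ;;
        10.0.17763.1577|10.0.18363.1198) echo "Yes" ;;
        *) echo "untested" ;;
    esac
}

# report_windows_nodes: read "<node> <build>" pairs on stdin and append the
# recorded status to each line.
report_windows_nodes() {
    while read -r node build; do
        if [ -n "$node" ]; then
            printf '%s %s %s\n' "$node" "$build" "$(lb_issue_status "$build")"
        fi
    done
}

# Usage (kernelVersion holds the Windows build on Windows nodes):
#   oc get nodes -l kubernetes.io/os=windows \
#     -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.nodeInfo.kernelVersion}{"\n"}{end}' \
#     | report_windows_nodes
```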

AWS
---
The way to find the AMI ID for a working Windows image is to use the output of `aws ec2 describe-images --filters Name=name,Values=Windows_Server-2019-English-Full-ContainersLatest-2020.09.09 --region <region> --query 'Images[*].[ImageId]' --output=json | jq .[0][0]`,
replacing <region> with the one your cluster is using.

Azure
-----
The way to find the Azure info is to run `az vm image list --all --location <location> --publisher MicrosoftWindowsServer --offer WindowsServer --sku 2019-Datacenter-with-Containers --query "[?contains(version, '17763.1457.2009030514')]"` and then pick an offer similar to the example, replacing <location> with your cluster's location (e.g. centralus).

Example:
{
    "offer": "WindowsServer",
    "publisher": "MicrosoftWindowsServer",
    "sku": "2019-Datacenter-with-Containers",
    "urn": "MicrosoftWindowsServer:WindowsServer:2019-Datacenter-with-Containers:17763.1457.2009030514",
    "version": "17763.1457.2009030514"
}
Use that info in the MachineSet. The main thing to note is that the `sku` needs to be `2019-Datacenter-with-Containers` and the version `17763.1457.2009030514`

Comment 11 Sebastian Soto 2020-12-11 22:06:45 UTC
Opened a GitHub issue to collaborate with Microsoft.
https://github.com/microsoft/Windows-Containers/issues/78

Comment 12 Aravindh Puthiyaparambil 2020-12-14 15:53:29 UTC
*** Bug 1905949 has been marked as a duplicate of this bug. ***

Comment 13 ravig 2021-01-18 17:39:44 UTC
Sebastian is talking to Microsoft folks on this. Waiting for response from Microsoft as of now.

Comment 15 Sebastian Soto 2021-03-24 16:10:25 UTC
This seems to be due to release-4.6 of openshift/kubernetes missing the commit https://github.com/kubernetes/kubernetes/pull/96499/files

Comment 16 Sebastian Soto 2021-03-24 16:15:04 UTC
This should not be a problem on 4.8 clusters

Comment 17 gaoshang 2021-03-25 07:53:40 UTC
This bug has been verified and passed on OCP 4.8, thanks.

Version:
OCP: 4.8.0-0.nightly-2021-03-24-200346
WMCO commit: 31138e36d04f17830de941afad9da89778d514a4

Steps:
1, When Windows pods land on different Windows nodes, check that the LB service works well.

# oc get nodes -owide -l kubernetes.io/os=windows
NAME            STATUS   ROLES    AGE     VERSION                       INTERNAL-IP   EXTERNAL-IP   OS-IMAGE                         KERNEL-VERSION    CONTAINER-RUNTIME
windows-l52f8   Ready    worker   3h29m   v1.20.0-1079+aa519d98764112   10.0.32.7     <none>        Windows Server 2019 Datacenter   10.0.17763.1817   docker://19.3.14
windows-m94g2   Ready    worker   3h24m   v1.20.0-1079+aa519d98764112   10.0.32.8     <none>        Windows Server 2019 Datacenter   10.0.17763.1817   docker://19.3.14

# oc get pod -owide
NAME                               READY   STATUS    RESTARTS   AGE    IP            NODE                                    NOMINATED NODE   READINESS GATES
win-webserver-549cd7495d-bgrvr     1/1     Running   0          170m   10.132.1.3    windows-m94g2                           <none>           <none>
win-webserver-549cd7495d-dcjls     1/1     Running   1          170m   10.132.0.6    windows-l52f8                           <none>           <none>
win-webserver-549cd7495d-sfh7n     1/1     Running   1          170m   10.132.0.5    windows-l52f8                           <none>           <none>
win-webserver-549cd7495d-tr6vz     1/1     Running   0          170m   10.132.1.2    windows-m94g2                           <none>           <none>
win-webserver-549cd7495d-wn7gt     1/1     Running   1          170m   10.132.0.7    windows-l52f8                           <none>           <none>

# oc get service 
NAME              TYPE           CLUSTER-IP       EXTERNAL-IP      PORT(S)        AGE
linux-webserver   ClusterIP      172.30.104.62    <none>           8080/TCP       151m
win-webserver     LoadBalancer   172.30.222.188   52.154.240.144   80:31741/TCP   152m

# curl 52.154.240.144
<html><body><H1>Windows Container Web Server</H1></body></html>

# ./curlloop.sh 52.154.240.144
Attempt 1 03:32:18- infinite loops [ hit CTRL+C to stop]
<html><body><H1>Windows Container Web Server</H1></body></html>Attempt 2 03:32:18- infinite loops [ hit CTRL+C to stop]
<html><body><H1>Windows Container Web Server</H1></body></html>Attempt 3 03:32:18- infinite loops [ hit CTRL+C to stop]
<html><body><H1>Windows Container Web Server</H1></body></html>Attempt 4 03:32:18- infinite loops [ hit CTRL+C to stop]
<html><body><H1>Windows Container Web Server</H1></body></html>Attempt 5 03:32:18- infinite loops [ hit CTRL+C to stop]
<html><body><H1>Windows Container Web Server</H1></body></html>Attempt 6 03:32:18- infinite loops [ hit CTRL+C to stop]
<html><body><H1>Windows Container Web Server</H1></body></html>Attempt 7 03:32:18- infinite loops [ hit CTRL+C to stop]
<html><body><H1>Windows Container Web Server</H1></body></html>Attempt 8 03:32:18- infinite loops [ hit CTRL+C to stop]
<html><body><H1>Windows Container Web Server</H1></body></html>Attempt 9 03:32:18- infinite loops [ hit CTRL+C to stop]
<html><body><H1>Windows Container Web Server</H1></body></html>Attempt 10 03:32:19- infinite loops [ hit CTRL+C to stop]
<html><body><H1>Windows Container Web Server</H1></body></html>Attempt 11 03:32:19- infinite loops [ hit CTRL+C to stop]
<html><body><H1>Windows Container Web Server</H1></body></html>Attempt 12 03:32:19- infinite loops [ hit CTRL+C to stop]
<html><body><H1>Windows Container Web Server</H1></body></html>Attempt 13 03:32:19- infinite loops [ hit CTRL+C to stop]
<html><body><H1>Windows Container Web Server</H1></body></html>Attempt 14 03:32:19- infinite loops [ hit CTRL+C to stop]
<html><body><H1>Windows Container Web Server</H1></body></html>Attempt 15 03:32:19- infinite loops [ hit CTRL+C to stop]
<html><body><H1>Windows Container Web Server</H1></body></html>Attempt 16 03:32:19- infinite loops [ hit CTRL+C to stop]
<html><body><H1>Windows Container Web Server</H1></body></html>Attempt 17 03:32:19- infinite loops [ hit CTRL+C to stop]
<html><body><H1>Windows Container Web Server</H1></body></html>Attempt 18 03:32:19- infinite loops [ hit CTRL+C to stop]
<html><body><H1>Windows Container Web Server</H1></body></html>Attempt 19 03:32:20- infinite loops [ hit CTRL+C to stop]
<html><body><H1>Windows Container Web Server</H1></body></html>Attempt 20 03:32:20- infinite loops [ hit CTRL+C to stop]
<html><body><H1>Windows Container Web Server</H1></body></html>Attempt 21 03:32:20- infinite loops [ hit CTRL+C to stop]
<html><body><H1>Windows Container Web Server</H1></body></html>Attempt 22 03:32:20- infinite loops [ hit CTRL+C to stop]
<html><body><H1>Windows Container Web Server</H1></body></html>Attempt 23 03:32:20- infinite loops [ hit CTRL+C to stop]
<html><body><H1>Windows Container Web Server</H1></body></html>Attempt 24 03:32:20- infinite loops [ hit CTRL+C to stop]
<html><body><H1>Windows Container Web Server</H1></body></html>Attempt 25 03:32:20- infinite loops [ hit CTRL+C to stop]
<html><body><H1>Windows Container Web Server</H1></body></html>Attempt 26 03:32:20- infinite loops [ hit CTRL+C to stop]
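
The source of `curlloop.sh` is not included in this report. A minimal sketch of an equivalent check, assuming plain `curl` against the service's external IP; the function name, default attempt count, timeout, and the `CURL_CMD` override are all illustrative:

```shell
# curl_loop URL [ATTEMPTS]: request URL repeatedly and report how many
# attempts failed. Returns non-zero if any attempt failed.
# CURL_CMD may be overridden (e.g. with a stub); it defaults to curl.
curl_loop() {
    url=$1
    attempts=${2:-20}
    failures=0
    i=1
    while [ "$i" -le "$attempts" ]; do
        # -s: quiet, -f: treat HTTP errors as failures,
        # --max-time: bound requests that would otherwise hang
        if ! "${CURL_CMD:-curl}" -s -f --max-time 5 -o /dev/null "$url"; then
            failures=$((failures + 1))
            echo "Attempt $i failed"
        fi
        i=$((i + 1))
    done
    echo "$failures/$attempts requests failed"
    [ "$failures" -eq 0 ]
}

# Usage:
#   curl_loop http://52.154.240.144 50
```

Counting failures rather than printing every response makes intermittent errors, such as the timeouts and empty replies reported earlier, easier to spot over many attempts.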

Comment 20 errata-xmlrpc 2021-08-03 20:29:16 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: Red Hat OpenShift Container Platform for Windows Containers 3.0.0 security and bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:3001