Bug 1905950
Summary: | LB service unstable with multiple Windows nodes and pods | |
---|---|---|---
Product: | OpenShift Container Platform | Reporter: | gaoshang <sgao>
Component: | Windows Containers | Assignee: | Sebastian Soto <ssoto>
Status: | CLOSED ERRATA | QA Contact: | gaoshang <sgao>
Severity: | high | Docs Contact: |
Priority: | high | |
Version: | 4.6.z | CC: | anusaxen, aos-bugs, aravindh, dcbw, gmarkley, pmahajan, rgudimet, sdodson, ssoto, zzhao
Target Milestone: | --- | |
Target Release: | 4.8.0 | |
Hardware: | Unspecified | |
OS: | Unspecified | |
Whiteboard: | | |
Fixed In Version: | | Doc Type: | Bug Fix
Doc Text: | Cause: Issue in kube-proxy. Consequence: Nodes would become unreachable under certain situations when using a load balancer. Fix: Upstream fix was merged downstream. Result: Normal behavior. | |
Story Points: | --- | |
Clone Of: | | |
: | 1942628 (view as bug list) | Environment: |
Last Closed: | 2021-08-03 20:29:16 UTC | Type: | Bug
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Bug Depends On: | | |
Bug Blocks: | 1942628 | |
Attachments: | | |
Description
gaoshang
2020-12-09 11:18:36 UTC
Procedure to reproduce on a new AWS 4.6 cluster with WMCO running:
1. Bring up a MachineSet with 2 replicas.
2. Bring up a Webserver Deployment with 2 replicas and a Service. The pods should land on different nodes.
3. curl the website; it will not work.

I tried this: keep 2 Windows nodes and scale the Webserver Deployment down to 1 replica, and the LB works again. Scale the Webserver back up to 2 replicas so that the pods land on different nodes, and the LB becomes unstable.

cc'ing @dcbw

I tried the following experiments:

Brought up a 4.6 cluster on AWS and ran WMCO@https://github.com/openshift/windows-machine-config-operator/commit/c28e7de1f6f48780a44fcc0e3a19bae7ae07ee07
1. Created a MachineSet with 2 replicas using the Windows Server 1809 (10.0.17763.1457) Datacenter image (us-east-2/ami-002ad2301d1a7322d).
2. Brought up a webserver deployment using the mcr.microsoft.com/powershell:lts-nanoserver-1809 image.
3. curl <service external IP/DNS>

I was *not able* to reproduce the problem. Also tried with mcr.microsoft.com/windows/servercore:1809 and could not reproduce the problem.

I also tried:
1. Created a MachineSet with 2 replicas using the Windows Server 1909 Datacenter image (us-east-2/ami-08c4963e382fe473f).
2. Brought up a webserver deployment using the mcr.microsoft.com/windows/servercore:1909 image.
3. curl <service external IP/DNS>

I was *able* to reproduce the problem, i.e. intermittent "curl: (52) Empty reply from server". Also tried with mcr.microsoft.com/powershell:nanoserver-1909 and could reproduce the problem there as well.

To summarize:
.............................................................................................................
: Windows version                                : Deployment Image                                 : Issue :
:................................................:..................................................:.......:
: 1809 v10.0.17763.1457 (ami-002ad2301d1a7322d)  : mcr.microsoft.com/windows/servercore:1809        : No    :
: 1809 v10.0.17763.1457 (ami-002ad2301d1a7322d)  : mcr.microsoft.com/powershell:lts-nanoserver-1809 : No    :
: 1909 v10.0.18363.1198 (ami-08c4963e382fe473f)  : mcr.microsoft.com/windows/servercore:1909        : Yes   :
: 1909 v10.0.18363.1198 (ami-08c4963e382fe473f)  : mcr.microsoft.com/powershell:nanoserver-1909     : Yes   :
:................................................:..................................................:.......:

I tested with Windows Server 2019 (10.0.17763.1579) + image mcr.microsoft.com/powershell:lts-nanoserver-1809, and the issue exists.

Env:
OCP version 4.6.8
Cloud provider Azure
MachineSet and WinWebServer (deployment and service included) attached
Must-gather attached

Found some errors in kube-proxy, not sure if they are related:

# oc adm node-logs -l kubernetes.io/os=windows --path=kube-proxy/kube-proxy.exe.ERROR
winworker-9zcmf Log file created at: 2020/12/10 14:44:59
winworker-9zcmf Running on machine: winworker-9zcmf
winworker-9zcmf Binary: Built with gc go1.15.2 for windows/amd64
winworker-9zcmf Log line format: [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg
winworker-9zcmf E1210 14:44:59.788510 2788 proxier.go:1116] Endpoint information not available for service openshift-windows-machine-config-operator/linux-webserver. Not applying any policy
winworker-9zcmf E1210 14:44:59.835528 2788 proxier.go:1116] Endpoint information not available for service openshift-windows-machine-config-operator/linux-webserver. Not applying any policy
winworker-9zcmf E1210 14:45:14.054623 2788 proxier.go:1116] Endpoint information not available for service openshift-windows-machine-config-operator/linux-webserver. Not applying any policy
winworker-9zcmf E1210 14:45:18.780544 2788 proxier.go:1116] Endpoint information not available for service openshift-windows-machine-config-operator/linux-webserver. Not applying any policy
winworker-9zcmf E1210 14:45:18.831558 2788 proxier.go:1116] Endpoint information not available for service openshift-windows-machine-config-operator/linux-webserver. Not applying any policy
winworker-9zcmf E1210 14:45:48.897218 2788 proxier.go:1116] Endpoint information not available for service openshift-windows-machine-config-operator/linux-webserver. Not applying any policy
winworker-9zcmf E1210 14:47:24.298405 2788 proxier.go:1116] Endpoint information not available for service winc-test/linux-webserver. Not applying any policy
winworker-9zcmf E1210 14:47:24.343417 2788 proxier.go:1116] Endpoint information not available for service winc-test/linux-webserver. Not applying any policy
winworker-9zcmf E1210 14:54:04.440172 2788 proxier.go:1116] Endpoint information not available for service winc-test/win-webserver. Not applying any policy
winworker-9zcmf E1210 14:54:04.484189 2788 proxier.go:1116] Endpoint information not available for service winc-test/win-webserver. Not applying any policy
winworker-9zcmf E1210 14:54:09.011283 2788 proxier.go:1116] Endpoint information not available for service winc-test/win-webserver. Not applying any policy
winworker-9zcmf E1210 14:54:39.080454 2788 proxier.go:1116] Endpoint information not available for service winc-test/win-webserver. Not applying any policy
winworker-9zcmf E1210 15:48:58.684729 2788 proxier.go:1116] Endpoint information not available for service winc-test/win-webserver. Not applying any policy
winworker-9zcmf E1210 15:48:58.927741 2788 proxier.go:1116] Endpoint information not available for service winc-test/win-webserver. Not applying any policy
winworker-9zcmf E1210 15:53:19.122508 2788 proxier.go:1116] Endpoint information not available for service winc-test/win-webserver. Not applying any policy
winworker-9zcmf E1210 15:53:19.175513 2788 proxier.go:1116] Endpoint information not available for service winc-test/win-webserver. Not applying any policy
winworker-gx7ch Log file created at: 2020/12/10 14:45:01
winworker-gx7ch Running on machine: winworker-gx7ch
winworker-gx7ch Binary: Built with gc go1.15.2 for windows/amd64
winworker-gx7ch Log line format: [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg
winworker-gx7ch E1210 14:45:01.908906 4064 proxier.go:1116] Endpoint information not available for service openshift-windows-machine-config-operator/linux-webserver. Not applying any policy
winworker-gx7ch E1210 14:45:14.008221 4064 proxier.go:1116] Endpoint information not available for service openshift-windows-machine-config-operator/linux-webserver. Not applying any policy
winworker-gx7ch E1210 14:45:18.797934 4064 proxier.go:1116] Endpoint information not available for service openshift-windows-machine-config-operator/linux-webserver. Not applying any policy
winworker-gx7ch E1210 14:45:18.859936 4064 proxier.go:1116] Endpoint information not available for service openshift-windows-machine-config-operator/linux-webserver. Not applying any policy
winworker-gx7ch E1210 14:45:48.987742 4064 proxier.go:1116] Endpoint information not available for service openshift-windows-machine-config-operator/linux-webserver. Not applying any policy
winworker-gx7ch E1210 14:47:24.312969 4064 proxier.go:1116] Endpoint information not available for service winc-test/linux-webserver. Not applying any policy
winworker-gx7ch E1210 14:47:24.381970 4064 proxier.go:1116] Endpoint information not available for service winc-test/linux-webserver. Not applying any policy
winworker-gx7ch E1210 14:54:04.479336 4064 proxier.go:1116] Endpoint information not available for service winc-test/win-webserver. Not applying any policy
winworker-gx7ch E1210 14:54:04.583347 4064 proxier.go:1116] Endpoint information not available for service winc-test/win-webserver. Not applying any policy
winworker-gx7ch E1210 14:54:09.276368 4064 proxier.go:1116] Endpoint information not available for service winc-test/win-webserver. Not applying any policy
winworker-gx7ch E1210 14:54:39.357825 4064 proxier.go:1116] Endpoint information not available for service winc-test/win-webserver. Not applying any policy
winworker-gx7ch E1210 15:48:58.812977 4064 proxier.go:1116] Endpoint information not available for service winc-test/win-webserver. Not applying any policy
winworker-gx7ch E1210 15:48:58.995992 4064 proxier.go:1116] Endpoint information not available for service winc-test/win-webserver. Not applying any policy
winworker-gx7ch E1210 15:53:21.489424 4064 proxier.go:1116] Endpoint information not available for service winc-test/win-webserver. Not applying any policy
winworker-gx7ch E1210 15:53:21.560448 4064 proxier.go:1116] Endpoint information not available for service winc-test/win-webserver. Not applying any policy

Created attachment 1738262 [details]
machineset

Created attachment 1738263 [details]
WinWebServer
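
The kube-proxy errors quoted above repeat the same message for a handful of services on both Windows nodes, so a per-node, per-service tally can make the pattern easier to read. The following is a minimal sketch in Python, assuming the log text has the node-prefixed shape shown in this report. The names `LINE_RE`, `tally_missing_endpoints`, and `SAMPLE` are hypothetical and not part of any OpenShift tooling, and the sample contains only a few lines copied from the output above.

```python
import re
from collections import Counter

# Assumed line shape, taken from the `oc adm node-logs` output in this report:
#   <node> E<mmdd> <hh:mm:ss.uuuuuu> <threadid> proxier.go:<line>] Endpoint information
#   not available for service <namespace>/<name>. Not applying any policy
LINE_RE = re.compile(
    r"(?P<node>\S+)\s+E\d{4}\s+\d{2}:\d{2}:\d{2}\.\d+\s+\d+\s+proxier\.go:\d+\]\s+"
    r"Endpoint information not available for service (?P<service>\S+?)\.(?:\s|$)"
)


def tally_missing_endpoints(log_text):
    """Count 'Endpoint information not available' errors per (node, service)."""
    counts = Counter()
    for line in log_text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            counts[(match.group("node"), match.group("service"))] += 1
    return counts


# A few lines copied from the report, including one header line that must be ignored.
SAMPLE = """\
winworker-9zcmf Log file created at: 2020/12/10 14:44:59
winworker-9zcmf E1210 14:44:59.788510 2788 proxier.go:1116] Endpoint information not available for service openshift-windows-machine-config-operator/linux-webserver. Not applying any policy
winworker-9zcmf E1210 14:54:04.440172 2788 proxier.go:1116] Endpoint information not available for service winc-test/win-webserver. Not applying any policy
winworker-gx7ch E1210 14:54:04.479336 4064 proxier.go:1116] Endpoint information not available for service winc-test/win-webserver. Not applying any policy
winworker-gx7ch E1210 14:54:04.583347 4064 proxier.go:1116] Endpoint information not available for service winc-test/win-webserver. Not applying any policy
"""

if __name__ == "__main__":
    for (node, service), count in sorted(tally_missing_endpoints(SAMPLE).items()):
        print(f"{node}  {service}  {count}")
```

Feeding the full log into `tally_missing_endpoints` instead of `SAMPLE` would give the complete counts. The tally only describes how often the message appears; it does not by itself show that the message is related to the LB instability.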
In https://bugzilla.redhat.com/show_bug.cgi?id=1905950#c5, "1809" needs to be replaced with "Windows Server 2019". Here is an updated table with findings:

......................................
: Windows : Version          : Issue :
:.........:..................:.......:
: 2019    : v10.0.17763.1457 : No    :
: 2019    : v10.0.17763.1577 : Yes   :
: 1909    : v10.0.18363.1198 : Yes   :
:.........:..................:.......:

It is possible that there could be a version of 1809 and 1909 that works but that would need more bisection.

AWS
---
The way to find the AMI ID for a working Windows image is to use the output of `aws ec2 describe-images --filters Name=name,Values=Windows_Server-2019-English-Full-ContainersLatest-2020.09.09 --region <region> --query 'Images[*].[ImageId]' --output=json | jq .[0][0]`, replacing <region> with the one your cluster is using.

Azure
-----
The way to find the Azure info is to run `az vm image list --all --location centralus --publisher MicrosoftWindowsServer --offer WindowsServer --sku 2019-Datacenter-with-Containers --query "[?contains(version, '17763.1457.2009030514')]"` and then pick an offer similar to the example. Remember to replace `centralus` with your cluster's location.

Example:
{
  "offer": "WindowsServer",
  "publisher": "MicrosoftWindowsServer",
  "sku": "2019-Datacenter-with-Containers",
  "urn": "MicrosoftWindowsServer:WindowsServer:2019-Datacenter-with-Containers:17763.1457.2009030514",
  "version": "17763.1457.2009030514"
}

Use that info in the MachineSet. The main thing to note is that the `sku` needs to be `2019-Datacenter-with-Containers` and the version `17763.1457.2009030514`.

Opened a GitHub issue to collaborate with Microsoft: https://github.com/microsoft/Windows-Containers/issues/78

*** Bug 1905949 has been marked as a duplicate of this bug. ***

Sebastian is talking to Microsoft folks on this. Waiting for a response from Microsoft as of now.

This seems to be due to release-4.6 of openshift/kubernetes missing the commit https://github.com/kubernetes/kubernetes/pull/96499/files

This should not be a problem on 4.8 clusters.

This bug has been verified and passed on OCP 4.8, thanks.

Version:
OCP: 4.8.0-0.nightly-2021-03-24-200346
WMCO commit: 31138e36d04f17830de941afad9da89778d514a4

Steps:
1. When Windows pods land on different Windows nodes, check that the LB service works well.

# oc get nodes -owide -l kubernetes.io/os=windows
NAME            STATUS   ROLES    AGE     VERSION                       INTERNAL-IP   EXTERNAL-IP   OS-IMAGE                         KERNEL-VERSION    CONTAINER-RUNTIME
windows-l52f8   Ready    worker   3h29m   v1.20.0-1079+aa519d98764112   10.0.32.7     <none>        Windows Server 2019 Datacenter   10.0.17763.1817   docker://19.3.14
windows-m94g2   Ready    worker   3h24m   v1.20.0-1079+aa519d98764112   10.0.32.8     <none>        Windows Server 2019 Datacenter   10.0.17763.1817   docker://19.3.14

# oc get pod -owide
NAME                             READY   STATUS    RESTARTS   AGE    IP           NODE            NOMINATED NODE   READINESS GATES
win-webserver-549cd7495d-bgrvr   1/1     Running   0          170m   10.132.1.3   windows-m94g2   <none>           <none>
win-webserver-549cd7495d-dcjls   1/1     Running   1          170m   10.132.0.6   windows-l52f8   <none>           <none>
win-webserver-549cd7495d-sfh7n   1/1     Running   1          170m   10.132.0.5   windows-l52f8   <none>           <none>
win-webserver-549cd7495d-tr6vz   1/1     Running   0          170m   10.132.1.2   windows-m94g2   <none>           <none>
win-webserver-549cd7495d-wn7gt   1/1     Running   1          170m   10.132.0.7   windows-l52f8   <none>           <none>

# oc get service
NAME              TYPE           CLUSTER-IP       EXTERNAL-IP      PORT(S)        AGE
linux-webserver   ClusterIP      172.30.104.62    <none>           8080/TCP       151m
win-webserver     LoadBalancer   172.30.222.188   52.154.240.144   80:31741/TCP   152m

# curl 52.154.240.144
<html><body><H1>Windows Container Web Server</H1></body></html>

# ./curlloop.sh 52.154.240.144
Attempt 1 03:32:18- infinite loops [ hit CTRL+C to stop]
<html><body><H1>Windows Container Web Server</H1></body></html>Attempt 2 03:32:18- infinite loops [ hit CTRL+C to stop]
<html><body><H1>Windows Container Web Server</H1></body></html>Attempt 3 03:32:18- infinite loops [ hit CTRL+C to stop]
<html><body><H1>Windows Container Web Server</H1></body></html>Attempt 4 03:32:18- infinite loops [ hit CTRL+C to stop]
<html><body><H1>Windows Container Web Server</H1></body></html>Attempt 5 03:32:18- infinite loops [ hit CTRL+C to stop]
<html><body><H1>Windows Container Web Server</H1></body></html>Attempt 6 03:32:18- infinite loops [ hit CTRL+C to stop]
<html><body><H1>Windows Container Web Server</H1></body></html>Attempt 7 03:32:18- infinite loops [ hit CTRL+C to stop]
<html><body><H1>Windows Container Web Server</H1></body></html>Attempt 8 03:32:18- infinite loops [ hit CTRL+C to stop]
<html><body><H1>Windows Container Web Server</H1></body></html>Attempt 9 03:32:18- infinite loops [ hit CTRL+C to stop]
<html><body><H1>Windows Container Web Server</H1></body></html>Attempt 10 03:32:19- infinite loops [ hit CTRL+C to stop]
<html><body><H1>Windows Container Web Server</H1></body></html>Attempt 11 03:32:19- infinite loops [ hit CTRL+C to stop]
<html><body><H1>Windows Container Web Server</H1></body></html>Attempt 12 03:32:19- infinite loops [ hit CTRL+C to stop]
<html><body><H1>Windows Container Web Server</H1></body></html>Attempt 13 03:32:19- infinite loops [ hit CTRL+C to stop]
<html><body><H1>Windows Container Web Server</H1></body></html>Attempt 14 03:32:19- infinite loops [ hit CTRL+C to stop]
<html><body><H1>Windows Container Web Server</H1></body></html>Attempt 15 03:32:19- infinite loops [ hit CTRL+C to stop]
<html><body><H1>Windows Container Web Server</H1></body></html>Attempt 16 03:32:19- infinite loops [ hit CTRL+C to stop]
<html><body><H1>Windows Container Web Server</H1></body></html>Attempt 17 03:32:19- infinite loops [ hit CTRL+C to stop]
<html><body><H1>Windows Container Web Server</H1></body></html>Attempt 18 03:32:19- infinite loops [ hit CTRL+C to stop]
<html><body><H1>Windows Container Web Server</H1></body></html>Attempt 19 03:32:20- infinite loops [ hit CTRL+C to stop]
<html><body><H1>Windows Container Web Server</H1></body></html>Attempt 20 03:32:20- infinite loops [ hit CTRL+C to stop]
<html><body><H1>Windows Container Web Server</H1></body></html>Attempt 21 03:32:20- infinite loops [ hit CTRL+C to stop]
<html><body><H1>Windows Container Web Server</H1></body></html>Attempt 22 03:32:20- infinite loops [ hit CTRL+C to stop]
<html><body><H1>Windows Container Web Server</H1></body></html>Attempt 23 03:32:20- infinite loops [ hit CTRL+C to stop]
<html><body><H1>Windows Container Web Server</H1></body></html>Attempt 24 03:32:20- infinite loops [ hit CTRL+C to stop]
<html><body><H1>Windows Container Web Server</H1></body></html>Attempt 25 03:32:20- infinite loops [ hit CTRL+C to stop]
<html><body><H1>Windows Container Web Server</H1></body></html>Attempt 26 03:32:20- infinite loops [ hit CTRL+C to stop]

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: Red Hat OpenShift Container Platform for Windows Containers 3.0.0 security and bug fix update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:3001
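
The verification in this report drives the load balancer with `./curlloop.sh`, whose contents are not included here. As a rough stand-in, the sketch below shows one way to probe a LoadBalancer address repeatedly and count failed attempts, which is the symptom this bug produced ("curl: (52) Empty reply from server"). It uses only the Python standard library. The function names `fetch_once`, `probe`, and `summarize`, the attempt count, and the delay are all assumptions made for illustration, not the original script's behavior.

```python
import http.client
import sys
import time
import urllib.error
import urllib.request


def fetch_once(url, timeout=5.0):
    """Return True if the URL answers with HTTP 200, False on any failure."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        # Covers refused connections, timeouts, and a server that closes the
        # connection without replying (what curl reports as an empty reply).
        return False


def probe(url, attempts=20, delay=0.1, fetch=fetch_once, sleep=time.sleep):
    """Call fetch(url) `attempts` times and return the list of boolean outcomes."""
    results = []
    for _ in range(attempts):
        results.append(bool(fetch(url)))
        sleep(delay)
    return results


def summarize(results):
    """Reduce a list of outcomes to attempt and failure counts."""
    failures = results.count(False)
    return {"attempts": len(results), "failures": failures, "stable": failures == 0}


if __name__ == "__main__" and len(sys.argv) > 1:
    # Example: python probe.py http://<service external IP/DNS>
    print(summarize(probe(sys.argv[1])))
```

The `fetch` and `sleep` parameters are injectable so the loop can be exercised without a live cluster. Against a healthy service every attempt should succeed, while an affected cluster would be expected to show intermittent failures once the pods are spread across Windows nodes.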