Description of problem: When Windows pods land on different Windows node, re-create LoadBalancer service, found kube-proxy terminated unexpectedly on Windows nodes Version-Release number of selected component (if applicable): WMCO version: 2.0.0+9cfb5c1 OCP version: 4.7.0-0.nightly-2021-03-14-223051 Windows version: Windows Server 2019 10.0.17763.1817 How reproducible: Always Steps to Reproduce: 1. Install WMCO on OCP cluster and scale up 2 Windows nodes. # oc get nodes -l kubernetes.io/os=windows -owide NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME ip-10-0-133-68.us-east-2.compute.internal Ready worker 123m v1.20.0-1030+cac2421340a449 10.0.133.68 <none> Windows Server 2019 Datacenter 10.0.17763.1817 docker://19.3.14 ip-10-0-140-118.us-east-2.compute.internal Ready worker 130m v1.20.0-1030+cac2421340a449 10.0.140.118 <none> Windows Server 2019 Datacenter 10.0.17763.1817 docker://19.3.14 2. Create 2 Windows pods e.g win-webserver and make sure they are land on different Windows node # oc get pod -owide NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES win-webserver-549cd7495d-vtbsd 1/1 Running 0 65s 10.132.0.2 ip-10-0-140-118.us-east-2.compute.internal <none> <none> win-webserver-549cd7495d-xklxc 1/1 Running 0 7m31s 10.132.1.2 ip-10-0-133-68.us-east-2.compute.internal <none> <none> # cat WinWebServer.yaml apiVersion: apps/v1 kind: Deployment metadata: labels: app: win-webserver name: win-webserver spec: selector: matchLabels: app: win-webserver replicas: 1 template: metadata: labels: app: win-webserver name: win-webserver spec: tolerations: - key: "os" value: "Windows" Effect: "NoSchedule" containers: - name: windowswebserver image: mcr.microsoft.com/windows/servercore:ltsc2019 imagePullPolicy: IfNotPresent command: - powershell.exe - -command - $listener = New-Object System.Net.HttpListener; $listener.Prefixes.Add('http://*:80/'); $listener.Start();Write-Host('Listening at http://*:80/'); while ($listener.IsListening) { $context = $listener.GetContext(); $response = $context.Response; $content='<html><body><H1>Windows Container Web Server</H1></body></html>'; $buffer = [System.Text.Encoding]::UTF8.GetBytes($content); $response.ContentLength64 = $buffer.Length; $response.OutputStream.Write($buffer, 0, $buffer.Length); $response.Close(); }; nodeSelector: beta.kubernetes.io/os: windows 3. Create LB service, check it's available # oc get service NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE win-webserver LoadBalancer 172.30.200.235 a40403ddbf3da439289d9fc3b0875f5c-743076964.us-east-2.elb.amazonaws.com 80:30043/TCP 4h1m # curl a40403ddbf3da439289d9fc3b0875f5c-743076964.us-east-2.elb.amazonaws.com <html><body><H1>Windows Container Web Server</H1></body></html> # cat WinLoadBalancer.yaml apiVersion: v1 kind: Service metadata: name: win-webserver labels: app: win-webserver spec: ports: # the port that this service should serve on - port: 80 targetPort: 80 selector: app: win-webserver type: LoadBalancer 4. Delete LB service and re-create it again, fould LB is no longer available, ssh to Windows node, found kube-proxy service terminated unexpectedly. # oc delete service win-webserver service "win-webserver" deleted # oc create -f WinLoadBalancer.yaml service/win-webserver created # oc get service NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE win-webserver LoadBalancer 172.30.206.11 a2a24b5ff62c54b07a988200b1fecbca-998036870.us-east-2.elb.amazonaws.com 80:30320/TCP 55s # curl a2a24b5ff62c54b07a988200b1fecbca-998036870.us-east-2.elb.amazonaws.com curl: (52) Empty reply from server PS C:\Users\Administrator> get-service kube-proxy Status Name DisplayName ------ ---- ----------- Stopped kube-proxy kube-proxy PS C:\> Get-EventLog -LogName System -Newest 15 Index Time EntryType Source InstanceID Message ----- ---- --------- ------ ---------- ------- 73594 Mar 17 09:26 Information Service Control M... 1073748860 The Network Setup Service service entered the stopped state. 73593 Mar 17 09:26 Information Service Control M... 1073748860 The Network Setup Service service entered the running state. 73592 Mar 17 08:26 Information Service Control M... 1073748860 The Network Setup Service service entered the stopped state. 73591 Mar 17 08:26 Information Service Control M... 1073748860 The Network Setup Service service entered the running state. 73590 Mar 17 07:44 Information Service Control M... 1073748860 The WinHTTP Web Proxy Auto-Discovery Service service entered the stopped state. 73589 Mar 17 07:26 Information Service Control M... 1073748860 The Network Setup Service service entered the stopped state. 73588 Mar 17 07:26 Information Service Control M... 1073748860 The Network Setup Service service entered the running state. 73587 Mar 17 07:12 Information Service Control M... 1073748860 The Windows Modules Installer service entered the stopped state. 73586 Mar 17 07:11 Information Service Control M... 1073748860 The Network Setup Service service entered the stopped state. 73585 Mar 17 07:10 Information Service Control M... 1073748860 The Windows Modules Installer service entered the running state. 73584 Mar 17 07:10 Information Service Control M... 1073748860 The Network Setup Service service entered the running state. 73583 Mar 17 07:08 Information Service Control M... 1073748860 The WinHTTP Web Proxy Auto-Discovery Service service entered the running state. 73582 Mar 17 06:54 Information Service Control M... 1073748860 The Windows Modules Installer service entered the stopped state. 73581 Mar 17 06:52 Information Service Control M... 1073748860 The Windows Modules Installer service entered the running state. 73580 Mar 17 06:31 Error Service Control M... 3221232506 The kube-proxy service terminated unexpectedly. It has done this 1 time(s). 5. If I start kube-proxy service manually on all Windows nodes, LB works again. PS C:\> Start-Service kube-proxy PS C:\> # curl a2a24b5ff62c54b07a988200b1fecbca-998036870.us-east-2.elb.amazonaws.com <html><body><H1>Windows Container Web Server</H1></body></html> Actual results: kube-proxy is terminated unexpectedly after recreated LB service Expected results: kube-proxy should not be terminated after recreated LB service Additional info:
@sgao, WinWebServer.yaml has replicas = 1. Is that a typo given you say that there are two pods, each running on a different Windows node?
@sgao, how did you ssh into the Windows node?
I was able to reproduce the problem except when I tried the kube-proxy service stopped after creating the k8s LB service the first time. I then ran kube-proxy from the command line instead of as a service and observed the following crash when the k8s LB service was created: panic: runtime error: invalid memory address or nil pointer dereference [signal 0xc0000005 code=0x0 addr=0x0 pc=0x253c0dc] goroutine 91 [running]: k8s.io/kubernetes/pkg/proxy/winkernel.(*Proxier).syncProxyRules(0xc000085200) /remote-source/build/windows-machine-config-operator/kubernetes/_output/local/go/src/k8s.io/kubernetes/pkg/proxy/winkernel/proxier.go:1124 +0x9dc k8s.io/kubernetes/pkg/util/async.(*BoundedFrequencyRunner).tryRun(0xc0005ee3c0) /remote-source/build/windows-machine-config-operator/kubernetes/_output/local/go/src/k8s.io/kubernetes/pkg/util/async/bounded_frequency_runner.go:292 +0xbe k8s.io/kubernetes/pkg/util/async.(*BoundedFrequencyRunner).Loop(0xc0005ee3c0, 0xc00007c0c0) /remote-source/build/windows-machine-config-operator/kubernetes/_output/local/go/src/k8s.io/kubernetes/pkg/util/async/bounded_frequency_runner.go:203 +0x21b k8s.io/kubernetes/pkg/proxy/winkernel.(*Proxier).SyncLoop(0xc000085200) /remote-source/build/windows-machine-config-operator/kubernetes/_output/local/go/src/k8s.io/kubernetes/pkg/proxy/winkernel/proxier.go:752 +0x72 created by k8s.io/kubernetes/cmd/kube-proxy/app.(*ProxyServer).Run /remote-source/build/windows-machine-config-operator/kubernetes/_output/local/go/src/k8s.io/kubernetes/cmd/kube-proxy/app/server.go:778 +0x855
(In reply to Aravindh Puthiyaparambil from comment #3) > @sgao, WinWebServer.yaml has replicas = 1. Is that a typo given > you say that there are two pods, each running on a different Windows node? @aravindh Yes, replicas should be 2 here, actually I scaled it by # oc scale deployment.apps/win-webserver --replicas=2.
(In reply to Aravindh Puthiyaparambil from comment #4) > @sgao, how did you ssh into the Windows node? I'm ssh into Windows node according to https://docs.openshift.com/container-platform/4.7/support/troubleshooting/troubleshooting-windows-container-workload-issues.html#accessing-windows-node-using-ssh_troubleshooting-windows-container-workload-issues
Another data point: if I don't wait for the deployment pods to go to running and create the LB service immediately, kube-proxy does not crash on the nodes.
Waiting on https://github.com/kubernetes/kubernetes/issues/100384 to be fixed
Waiting on a patch from Microsoft to test out.
This bug has been verified and passed on OCP 4.8.0-0.nightly-2021-05-25-223219, thanks. Version: OCP 4.8.0-0.nightly-2021-05-25-223219 WMCO built from https://github.com/openshift/windows-machine-config-operator/commit/6996efa111654ba29ec1b1d9cd7ec76567bd0987 Steps: Repeat steps in Bug, after LB service re-created, kube-proxy service did not crash anymore. # oc get service NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE win-webserver LoadBalancer 172.30.190.136 a50f53fdba80d4b0289acde82b8c335d-288421468.us-east-2.elb.amazonaws.com 80:30797/TCP 36m # curl a50f53fdba80d4b0289acde82b8c335d-288421468.us-east-2.elb.amazonaws.com <html><body><H1>Windows Container Web Server</H1></body></html> # oc delete service win-webserver service "win-webserver" deleted # oc create -f LB.yaml # oc get service NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE win-webserver LoadBalancer 172.30.245.246 a1eb57db05562481ead3a5f9ba4ed492-2096297118.us-east-2.elb.amazonaws.com 80:32488/TCP 2m33s # curl a1eb57db05562481ead3a5f9ba4ed492-2096297118.us-east-2.elb.amazonaws.com <html><body><H1>Windows Container Web Server</H1></body></html>
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: Red Hat OpenShift Container Platform for Windows Containers 3.0.0 security and bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:3001