Bug 1939968

Summary: kube-proxy service terminated unexpectedly after recreated LB service
Product: OpenShift Container Platform Reporter: gaoshang <sgao>
Component: Windows ContainersAssignee: Aravindh Puthiyaparambil <aravindh>
Status: CLOSED ERRATA QA Contact: gaoshang <sgao>
Severity: high Docs Contact:
Priority: high    
Version: 4.7CC: aos-bugs, jvaldes, ssoto
Target Milestone: ---   
Target Release: 4.8.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Cause: kube-proxy was incorrectly performing reference counting Consequence: kube-proxy was crashing when a LB service was created Fix: Apply reference counting to only remote endpoints Result: kube-proxy no longer crashes when a LB service is created
Story Points: ---
Clone Of:
: 1963263 (view as bug list) Environment:
Last Closed: 2021-08-03 20:29:16 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1963263    

Description gaoshang 2021-03-17 11:09:05 UTC
Description of problem:
When Windows pods land on different Windows node, re-create LoadBalancer service, found kube-proxy terminated unexpectedly on Windows nodes

Version-Release number of selected component (if applicable):
WMCO version: 2.0.0+9cfb5c1
OCP version: 4.7.0-0.nightly-2021-03-14-223051
Windows version: Windows Server 2019 10.0.17763.1817

How reproducible:
Always

Steps to Reproduce:
1. Install WMCO on OCP cluster and scale up 2 Windows nodes.
# oc get nodes -l kubernetes.io/os=windows -owide
NAME                                         STATUS   ROLES    AGE    VERSION                       INTERNAL-IP    EXTERNAL-IP   OS-IMAGE                         KERNEL-VERSION    CONTAINER-RUNTIME
ip-10-0-133-68.us-east-2.compute.internal    Ready    worker   123m   v1.20.0-1030+cac2421340a449   10.0.133.68    <none>        Windows Server 2019 Datacenter   10.0.17763.1817   docker://19.3.14
ip-10-0-140-118.us-east-2.compute.internal   Ready    worker   130m   v1.20.0-1030+cac2421340a449   10.0.140.118   <none>        Windows Server 2019 Datacenter   10.0.17763.1817   docker://19.3.14

2. Create 2 Windows pods e.g win-webserver and make sure they are land on different Windows node
# oc get pod -owide
NAME                             READY   STATUS    RESTARTS   AGE     IP           NODE                                         NOMINATED NODE   READINESS GATES
win-webserver-549cd7495d-vtbsd   1/1     Running   0          65s     10.132.0.2   ip-10-0-140-118.us-east-2.compute.internal   <none>           <none>
win-webserver-549cd7495d-xklxc   1/1     Running   0          7m31s   10.132.1.2   ip-10-0-133-68.us-east-2.compute.internal    <none>           <none>

# cat WinWebServer.yaml 
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: win-webserver
  name: win-webserver
spec:
  selector:
    matchLabels:
      app: win-webserver
  replicas: 1
  template:
    metadata:
      labels:
        app: win-webserver
      name: win-webserver
    spec:
      tolerations:
      - key: "os"
        value: "Windows"
        Effect: "NoSchedule"
      containers:
      - name: windowswebserver
        image: mcr.microsoft.com/windows/servercore:ltsc2019
        imagePullPolicy: IfNotPresent
        command:
        - powershell.exe
        - -command
        - $listener = New-Object System.Net.HttpListener; $listener.Prefixes.Add('http://*:80/'); $listener.Start();Write-Host('Listening at http://*:80/'); while ($listener.IsListening) { $context = $listener.GetContext(); $response = $context.Response; $content='<html><body><H1>Windows Container Web Server</H1></body></html>'; $buffer = [System.Text.Encoding]::UTF8.GetBytes($content); $response.ContentLength64 = $buffer.Length; $response.OutputStream.Write($buffer, 0, $buffer.Length); $response.Close(); };
      nodeSelector:
        beta.kubernetes.io/os: windows

3. Create LB service, check it's available
# oc get service
NAME            TYPE           CLUSTER-IP       EXTERNAL-IP                                                              PORT(S)        AGE
win-webserver   LoadBalancer   172.30.200.235   a40403ddbf3da439289d9fc3b0875f5c-743076964.us-east-2.elb.amazonaws.com   80:30043/TCP   4h1m

# curl a40403ddbf3da439289d9fc3b0875f5c-743076964.us-east-2.elb.amazonaws.com
<html><body><H1>Windows Container Web Server</H1></body></html>

# cat WinLoadBalancer.yaml
apiVersion: v1
kind: Service
metadata:
  name: win-webserver
  labels:
    app: win-webserver
spec:
  ports:
  # the port that this service should serve on
  - port: 80
    targetPort: 80
  selector:
    app: win-webserver
  type: LoadBalancer

4. Delete LB service and re-create it again, fould LB is no longer available, ssh to Windows node, found kube-proxy service terminated unexpectedly.

# oc delete service win-webserver
service "win-webserver" deleted
# oc create -f WinLoadBalancer.yaml
service/win-webserver created

# oc get service
NAME            TYPE           CLUSTER-IP      EXTERNAL-IP                                                              PORT(S)        AGE
win-webserver   LoadBalancer   172.30.206.11   a2a24b5ff62c54b07a988200b1fecbca-998036870.us-east-2.elb.amazonaws.com   80:30320/TCP   55s

# curl a2a24b5ff62c54b07a988200b1fecbca-998036870.us-east-2.elb.amazonaws.com
curl: (52) Empty reply from server

PS C:\Users\Administrator> get-service kube-proxy 

Status   Name               DisplayName
------   ----               -----------
Stopped  kube-proxy         kube-proxy

PS C:\> Get-EventLog -LogName System -Newest 15
   Index Time          EntryType   Source                 InstanceID Message
   ----- ----          ---------   ------                 ---------- -------
   73594 Mar 17 09:26  Information Service Control M...   1073748860 The Network Setup Service service entered the stopped state.
   73593 Mar 17 09:26  Information Service Control M...   1073748860 The Network Setup Service service entered the running state.
   73592 Mar 17 08:26  Information Service Control M...   1073748860 The Network Setup Service service entered the stopped state.
   73591 Mar 17 08:26  Information Service Control M...   1073748860 The Network Setup Service service entered the running state.
   73590 Mar 17 07:44  Information Service Control M...   1073748860 The WinHTTP Web Proxy Auto-Discovery Service service entered the stopped state.
   73589 Mar 17 07:26  Information Service Control M...   1073748860 The Network Setup Service service entered the stopped state.
   73588 Mar 17 07:26  Information Service Control M...   1073748860 The Network Setup Service service entered the running state.
   73587 Mar 17 07:12  Information Service Control M...   1073748860 The Windows Modules Installer service entered the stopped state.
   73586 Mar 17 07:11  Information Service Control M...   1073748860 The Network Setup Service service entered the stopped state.
   73585 Mar 17 07:10  Information Service Control M...   1073748860 The Windows Modules Installer service entered the running state.
   73584 Mar 17 07:10  Information Service Control M...   1073748860 The Network Setup Service service entered the running state.
   73583 Mar 17 07:08  Information Service Control M...   1073748860 The WinHTTP Web Proxy Auto-Discovery Service service entered the running state.
   73582 Mar 17 06:54  Information Service Control M...   1073748860 The Windows Modules Installer service entered the stopped state.
   73581 Mar 17 06:52  Information Service Control M...   1073748860 The Windows Modules Installer service entered the running state.
   73580 Mar 17 06:31  Error       Service Control M...   3221232506 The kube-proxy service terminated unexpectedly.  It has done this 1 time(s).

5. If I start kube-proxy service manually on all Windows nodes, LB works again.
PS C:\> Start-Service kube-proxy
PS C:\> 

# curl a2a24b5ff62c54b07a988200b1fecbca-998036870.us-east-2.elb.amazonaws.com
<html><body><H1>Windows Container Web Server</H1></body></html>

Actual results:
kube-proxy is terminated unexpectedly after recreated LB service

Expected results:
kube-proxy should not be terminated after recreated LB service

Additional info:

Comment 3 Aravindh Puthiyaparambil 2021-03-17 20:07:35 UTC
@sgao, WinWebServer.yaml has replicas = 1. Is that a typo given you say that there are two pods, each running on a different Windows node?

Comment 4 Aravindh Puthiyaparambil 2021-03-17 20:11:02 UTC
@sgao, how did you ssh into the Windows node?

Comment 5 Aravindh Puthiyaparambil 2021-03-17 23:00:11 UTC
I was able to reproduce the problem except when I tried the kube-proxy service stopped after creating the k8s LB service the first time. I then ran kube-proxy from the command line instead of as a service and observed the following crash when the k8s LB service was created:

panic: runtime error: invalid memory address or nil pointer dereference
[signal 0xc0000005 code=0x0 addr=0x0 pc=0x253c0dc]
goroutine 91 [running]:
k8s.io/kubernetes/pkg/proxy/winkernel.(*Proxier).syncProxyRules(0xc000085200)
        /remote-source/build/windows-machine-config-operator/kubernetes/_output/local/go/src/k8s.io/kubernetes/pkg/proxy/winkernel/proxier.go:1124 +0x9dc
k8s.io/kubernetes/pkg/util/async.(*BoundedFrequencyRunner).tryRun(0xc0005ee3c0)
        /remote-source/build/windows-machine-config-operator/kubernetes/_output/local/go/src/k8s.io/kubernetes/pkg/util/async/bounded_frequency_runner.go:292 +0xbe
k8s.io/kubernetes/pkg/util/async.(*BoundedFrequencyRunner).Loop(0xc0005ee3c0, 0xc00007c0c0)
        /remote-source/build/windows-machine-config-operator/kubernetes/_output/local/go/src/k8s.io/kubernetes/pkg/util/async/bounded_frequency_runner.go:203 +0x21b
k8s.io/kubernetes/pkg/proxy/winkernel.(*Proxier).SyncLoop(0xc000085200)
        /remote-source/build/windows-machine-config-operator/kubernetes/_output/local/go/src/k8s.io/kubernetes/pkg/proxy/winkernel/proxier.go:752 +0x72
created by k8s.io/kubernetes/cmd/kube-proxy/app.(*ProxyServer).Run
        /remote-source/build/windows-machine-config-operator/kubernetes/_output/local/go/src/k8s.io/kubernetes/cmd/kube-proxy/app/server.go:778 +0x855

Comment 6 gaoshang 2021-03-18 01:50:42 UTC
(In reply to Aravindh Puthiyaparambil from comment #3)
> @sgao, WinWebServer.yaml has replicas = 1. Is that a typo given
> you say that there are two pods, each running on a different Windows node?

@aravindh Yes, replicas should be 2 here, actually I scaled it by # oc scale deployment.apps/win-webserver --replicas=2.

Comment 7 gaoshang 2021-03-18 03:28:37 UTC
(In reply to Aravindh Puthiyaparambil from comment #4)
> @sgao, how did you ssh into the Windows node?

I'm ssh into Windows node according to https://docs.openshift.com/container-platform/4.7/support/troubleshooting/troubleshooting-windows-container-workload-issues.html#accessing-windows-node-using-ssh_troubleshooting-windows-container-workload-issues

Comment 8 Aravindh Puthiyaparambil 2021-03-18 17:55:11 UTC
Another data point: if I don't wait for the deployment pods to go to running and create the LB service immediately, kube-proxy does not crash on the nodes.

Comment 9 Aravindh Puthiyaparambil 2021-03-29 18:28:49 UTC
Waiting on https://github.com/kubernetes/kubernetes/issues/100384 to be fixed

Comment 10 Aravindh Puthiyaparambil 2021-04-19 16:44:46 UTC
Waiting on a patch from Microsoft to test out.

Comment 11 gaoshang 2021-05-26 09:41:33 UTC
This bug has been verified and passed on OCP 4.8.0-0.nightly-2021-05-25-223219, thanks.

Version:
OCP 4.8.0-0.nightly-2021-05-25-223219
WMCO built from https://github.com/openshift/windows-machine-config-operator/commit/6996efa111654ba29ec1b1d9cd7ec76567bd0987

Steps:

Repeat steps in Bug, after LB service re-created, kube-proxy service did not crash anymore.

# oc get service
NAME            TYPE           CLUSTER-IP       EXTERNAL-IP                                                              PORT(S)        AGE
win-webserver   LoadBalancer   172.30.190.136   a50f53fdba80d4b0289acde82b8c335d-288421468.us-east-2.elb.amazonaws.com   80:30797/TCP   36m

# curl a50f53fdba80d4b0289acde82b8c335d-288421468.us-east-2.elb.amazonaws.com
<html><body><H1>Windows Container Web Server</H1></body></html>

# oc delete service win-webserver
service "win-webserver" deleted

# oc create -f LB.yaml

# oc get service
NAME            TYPE           CLUSTER-IP       EXTERNAL-IP                                                               PORT(S)        AGE
win-webserver   LoadBalancer   172.30.245.246   a1eb57db05562481ead3a5f9ba4ed492-2096297118.us-east-2.elb.amazonaws.com   80:32488/TCP   2m33s

# curl a1eb57db05562481ead3a5f9ba4ed492-2096297118.us-east-2.elb.amazonaws.com
<html><body><H1>Windows Container Web Server</H1></body></html>

Comment 14 errata-xmlrpc 2021-08-03 20:29:16 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: Red Hat OpenShift Container Platform for Windows Containers 3.0.0 security and bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:3001