Bug 1939968
Summary: | kube-proxy service terminated unexpectedly after recreated LB service | |||
---|---|---|---|---|
Product: | OpenShift Container Platform | Reporter: | gaoshang <sgao> | |
Component: | Windows Containers | Assignee: | Aravindh Puthiyaparambil <aravindh> | |
Status: | CLOSED ERRATA | QA Contact: | gaoshang <sgao> | |
Severity: | high | Docs Contact: | ||
Priority: | high | |||
Version: | 4.7 | CC: | aos-bugs, jvaldes, ssoto | |
Target Milestone: | --- | |||
Target Release: | 4.8.0 | |||
Hardware: | Unspecified | |||
OS: | Unspecified | |||
Whiteboard: | ||||
Fixed In Version: | Doc Type: | Bug Fix | ||
Doc Text: |
Cause: kube-proxy was incorrectly performing reference counting.
Consequence: kube-proxy was crashing when an LB service was created.
Fix: Apply reference counting to remote endpoints only.
Result: kube-proxy no longer crashes when an LB service is created.
|
Story Points: | --- | |
Clone Of: | ||||
: | 1963263 | Environment: | |
Last Closed: | 2021-08-03 20:29:16 UTC | Type: | Bug | |
Bug Depends On: | ||||
Bug Blocks: | 1963263 |
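The Doc Text above says the crash came from reference counting being applied incorrectly, and that the fix restricts it to remote endpoints. A minimal Go sketch of that idea follows. It is not the kube-proxy code: the `endpoint` type, its fields, the helper names, and the IP addresses are all hypothetical, and the assumption that a local endpoint carries a nil counter is only one way a nil pointer dereference like the one reported in this bug could arise.

```go
package main

import "fmt"

// endpoint is a hypothetical stand-in for the record a proxy keeps per
// backend. It is NOT the kube-proxy type.
type endpoint struct {
	ip       string
	isLocal  bool
	refCount *uint16 // assumption: only allocated for remote endpoints
}

// newEndpoint allocates a counter for remote endpoints only.
func newEndpoint(ip string, isLocal bool) *endpoint {
	ep := &endpoint{ip: ip, isLocal: isLocal}
	if !isLocal {
		ep.refCount = new(uint16)
	}
	return ep
}

// acquireUnsafe mimics the assumed failure mode: it increments the counter
// unconditionally, so a local endpoint with a nil counter panics.
func acquireUnsafe(ep *endpoint) { *ep.refCount++ }

// acquire counts a reference for remote endpoints only.
func acquire(ep *endpoint) {
	if ep.isLocal || ep.refCount == nil {
		return
	}
	*ep.refCount++
}

// release drops one reference and reports whether a remote endpoint is now
// unreferenced. Local endpoints are never reported as releasable here.
func release(ep *endpoint) bool {
	if ep.isLocal || ep.refCount == nil {
		return false
	}
	if *ep.refCount > 0 {
		*ep.refCount--
	}
	return *ep.refCount == 0
}

// panics reports whether calling f panics.
func panics(f func()) (p bool) {
	defer func() {
		if recover() != nil {
			p = true
		}
	}()
	f()
	return false
}

func main() {
	local := newEndpoint("10.132.0.5", true)   // illustrative address
	remote := newEndpoint("10.131.0.7", false) // illustrative address

	fmt.Println("unguarded increment on local endpoint panics:",
		panics(func() { acquireUnsafe(local) }))

	acquire(local) // no-op: local endpoints are not counted
	acquire(remote)
	acquire(remote)
	fmt.Println("remote refcount:", *remote.refCount)
	fmt.Println("first release frees remote:", release(remote))
	fmt.Println("second release frees remote:", release(remote))
}
```

The guard sits in both `acquire` and `release` so that a local endpoint can pass through the same code path as a remote one without its counter ever being touched.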
Description
gaoshang
2021-03-17 11:09:05 UTC
@sgao, WinWebServer.yaml has replicas = 1. Is that a typo, given that you say there are two pods, each running on a different Windows node?

@sgao, how did you ssh into the Windows node?

I was able to reproduce the problem, except that when I tried it, the kube-proxy service stopped after creating the k8s LB service the first time. I then ran kube-proxy from the command line instead of as a service and observed the following crash when the k8s LB service was created:

panic: runtime error: invalid memory address or nil pointer dereference
[signal 0xc0000005 code=0x0 addr=0x0 pc=0x253c0dc]

goroutine 91 [running]:
k8s.io/kubernetes/pkg/proxy/winkernel.(*Proxier).syncProxyRules(0xc000085200)
    /remote-source/build/windows-machine-config-operator/kubernetes/_output/local/go/src/k8s.io/kubernetes/pkg/proxy/winkernel/proxier.go:1124 +0x9dc
k8s.io/kubernetes/pkg/util/async.(*BoundedFrequencyRunner).tryRun(0xc0005ee3c0)
    /remote-source/build/windows-machine-config-operator/kubernetes/_output/local/go/src/k8s.io/kubernetes/pkg/util/async/bounded_frequency_runner.go:292 +0xbe
k8s.io/kubernetes/pkg/util/async.(*BoundedFrequencyRunner).Loop(0xc0005ee3c0, 0xc00007c0c0)
    /remote-source/build/windows-machine-config-operator/kubernetes/_output/local/go/src/k8s.io/kubernetes/pkg/util/async/bounded_frequency_runner.go:203 +0x21b
k8s.io/kubernetes/pkg/proxy/winkernel.(*Proxier).SyncLoop(0xc000085200)
    /remote-source/build/windows-machine-config-operator/kubernetes/_output/local/go/src/k8s.io/kubernetes/pkg/proxy/winkernel/proxier.go:752 +0x72
created by k8s.io/kubernetes/cmd/kube-proxy/app.(*ProxyServer).Run
    /remote-source/build/windows-machine-config-operator/kubernetes/_output/local/go/src/k8s.io/kubernetes/cmd/kube-proxy/app/server.go:778 +0x855

(In reply to Aravindh Puthiyaparambil from comment #3)
> @sgao, WinWebServer.yaml has replicas = 1. Is that a typo given
> you say that there are two pods, each running on a different Windows node?
@aravindh Yes, replicas should be 2 here. I actually scaled the deployment with:
# oc scale deployment.apps/win-webserver --replicas=2

(In reply to Aravindh Puthiyaparambil from comment #4)
> @sgao, how did you ssh into the Windows node?

I ssh into the Windows node according to https://docs.openshift.com/container-platform/4.7/support/troubleshooting/troubleshooting-windows-container-workload-issues.html#accessing-windows-node-using-ssh_troubleshooting-windows-container-workload-issues

Another data point: if I don't wait for the deployment pods to go to running and create the LB service immediately, kube-proxy does not crash on the nodes.

Waiting on https://github.com/kubernetes/kubernetes/issues/100384 to be fixed.

Waiting on a patch from Microsoft to test out.

This bug has been verified and passed on OCP 4.8.0-0.nightly-2021-05-25-223219, thanks.

Version: OCP 4.8.0-0.nightly-2021-05-25-223219
WMCO built from https://github.com/openshift/windows-machine-config-operator/commit/6996efa111654ba29ec1b1d9cd7ec76567bd0987

Steps: Repeat the steps in the bug. After the LB service was re-created, the kube-proxy service no longer crashed.
# oc get service
NAME            TYPE           CLUSTER-IP       EXTERNAL-IP                                                              PORT(S)        AGE
win-webserver   LoadBalancer   172.30.190.136   a50f53fdba80d4b0289acde82b8c335d-288421468.us-east-2.elb.amazonaws.com   80:30797/TCP   36m
# curl a50f53fdba80d4b0289acde82b8c335d-288421468.us-east-2.elb.amazonaws.com
<html><body><H1>Windows Container Web Server</H1></body></html>
# oc delete service win-webserver
service "win-webserver" deleted
# oc create -f LB.yaml
# oc get service
NAME            TYPE           CLUSTER-IP       EXTERNAL-IP                                                               PORT(S)        AGE
win-webserver   LoadBalancer   172.30.245.246   a1eb57db05562481ead3a5f9ba4ed492-2096297118.us-east-2.elb.amazonaws.com   80:32488/TCP   2m33s
# curl a1eb57db05562481ead3a5f9ba4ed492-2096297118.us-east-2.elb.amazonaws.com
<html><body><H1>Windows Container Web Server</H1></body></html>

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: Red Hat OpenShift Container Platform for Windows Containers 3.0.0 security and bug fix update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:3001
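The verification in this report compares the LoadBalancer hostname before and after the service is re-created. As a rough sketch, that hostname could be pulled out of the plain-text `oc get service` table as shown below. The function name `externalHost` is hypothetical, and treating `<pending>` or `<none>` as "not ready" is an assumption made for illustration, not part of the original verification.

```go
package main

import (
	"fmt"
	"strings"
)

// externalHost returns the EXTERNAL-IP column for the named service from the
// plain-text table printed by `oc get service`. The second result is false
// when the service is missing or has no external address yet.
func externalHost(table, name string) (string, bool) {
	col := -1 // index of the EXTERNAL-IP column, found from the header row
	for _, line := range strings.Split(table, "\n") {
		fields := strings.Fields(line)
		if len(fields) == 0 {
			continue
		}
		if col < 0 {
			for i, f := range fields {
				if f == "EXTERNAL-IP" {
					col = i
				}
			}
			continue
		}
		if fields[0] != name || col >= len(fields) {
			continue
		}
		host := fields[col]
		if host == "<pending>" || host == "<none>" {
			return "", false
		}
		return host, true
	}
	return "", false
}

func main() {
	// Tables copied from the verification comment in this report.
	before := "NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE\n" +
		"win-webserver LoadBalancer 172.30.190.136 a50f53fdba80d4b0289acde82b8c335d-288421468.us-east-2.elb.amazonaws.com 80:30797/TCP 36m\n"
	after := "NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE\n" +
		"win-webserver LoadBalancer 172.30.245.246 a1eb57db05562481ead3a5f9ba4ed492-2096297118.us-east-2.elb.amazonaws.com 80:32488/TCP 2m33s\n"

	oldHost, _ := externalHost(before, "win-webserver")
	newHost, ok := externalHost(after, "win-webserver")
	fmt.Println("old host:", oldHost)
	fmt.Println("new host:", newHost)
	fmt.Println("re-created with a new hostname:", ok && newHost != oldHost)
}
```

Parsing columns by whitespace is fragile. A structured output format such as `-o json` is likely a more robust choice for scripted checks.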