Description of problem:

The router pod fails with a panic when the stats port is already in use:

```
I1221 23:03:13.480857       1 template.go:403] router "msg"="starting router" "version"="majorFromGit: \nminorFromGit: \ncommitFromGit: 89e441002d0dfaefd5587bd9a380f254e864dba2\nversionFromGit: 4.0.0-219-g89e4410\ngitTreeState: clean\nbuildDate: 2020-12-05T14:15:58Z\n"
panic: listen tcp 0.0.0.0:1936: bind: address already in use

goroutine 1 [running]:
github.com/openshift/router/pkg/router/metrics.Listener.Listen(0xc0002905a0, 0xc, 0xc000582f00, 0xc00000a04f, 0xc, 0xc00000a0ef, 0xc, 0x1bf7840, 0xc0002b43a0, 0x1bf7900, ...)
	/go/src/github.com/openshift/router/pkg/router/metrics/metrics.go:131 +0x57f
github.com/openshift/router/pkg/cmd/infra/router.(*TemplateRouterOptions).Run(0xc0004fe000, 0xc0004de2a0, 0x0, 0x2d)
	/go/src/github.com/openshift/router/pkg/cmd/infra/router/template.go:536 +0x23fc
github.com/openshift/router/pkg/cmd/infra/router.NewCommandTemplateRouter.func1(0xc0004d1340, 0xc0003315c0, 0x0, 0x1, 0x0, 0x0)
	/go/src/github.com/openshift/router/pkg/cmd/infra/router/template.go:223 +0x1dc
github.com/spf13/cobra.(*Command).execute(0xc0004d1340, 0xc00000e090, 0x1, 0x1, 0xc0004d1340, 0xc00000e090)
	/go/src/github.com/openshift/router/vendor/github.com/spf13/cobra/command.go:842 +0x47c
github.com/spf13/cobra.(*Command).ExecuteC(0xc0004d1340, 0xc0002aa1e0, 0xc0004d1340, 0x10)
	/go/src/github.com/openshift/router/vendor/github.com/spf13/cobra/command.go:950 +0x375
github.com/spf13/cobra.(*Command).Execute(...)
	/go/src/github.com/openshift/router/vendor/github.com/spf13/cobra/command.go:887
main.main()
	/go/src/github.com/openshift/router/cmd/openshift-router/main.go:35 +0x305
```

Version-Release number of selected component (if applicable):
OpenShift Container Platform 4.6

How reproducible:
n/a - it happened during the operator's action to redeploy the router; unfortunately it is not reproducible now that the environment has stabilized.

Actual results:
panic error

Expected results:
At the very least, the router should not panic; it should log a fatal or ordinary error instead.

Additional info:
The behavior is strange: the router binary should not exit with a panic, but rather with a fatal error.
I'm also seeing this:

```
$ oc logs router-default-58d8b6c48d-jhqzm
I1226 22:16:39.842873       1 template.go:403] router "msg"="starting router" "version"="majorFromGit: \nminorFromGit: \ncommitFromGit: 89e441002d0dfaefd5587bd9a380f254e864dba2\nversionFromGit: 4.0.0-219-g89e4410\ngitTreeState: clean\nbuildDate: 2020-12-16T17:13:49Z\n"
panic: listen tcp 0.0.0.0:1936: bind: address already in use

goroutine 1 [running]:
github.com/openshift/router/pkg/router/metrics.Listener.Listen(0xc000200700, 0xc, 0xc000183200, 0xc00000a0af, 0xc, 0xc00000a0ef, 0xc, 0x1bf7840, 0xc0003b3f80, 0x1bf7900, ...)
	/go/src/github.com/openshift/router/pkg/router/metrics/metrics.go:131 +0x57f
github.com/openshift/router/pkg/cmd/infra/router.(*TemplateRouterOptions).Run(0xc000360000, 0xc0002bb7a0, 0x0, 0x2d)
	/go/src/github.com/openshift/router/pkg/cmd/infra/router/template.go:536 +0x23fc
github.com/openshift/router/pkg/cmd/infra/router.NewCommandTemplateRouter.func1(0xc0005138c0, 0xc00006f6f0, 0x0, 0x1, 0x0, 0x0)
	/go/src/github.com/openshift/router/pkg/cmd/infra/router/template.go:223 +0x1dc
github.com/spf13/cobra.(*Command).execute(0xc0005138c0, 0xc00000e0b0, 0x1, 0x1, 0xc0005138c0, 0xc00000e0b0)
	/go/src/github.com/openshift/router/vendor/github.com/spf13/cobra/command.go:842 +0x47c
github.com/spf13/cobra.(*Command).ExecuteC(0xc0005138c0, 0xc00007afc0, 0xc0005138c0, 0x10)
	/go/src/github.com/openshift/router/vendor/github.com/spf13/cobra/command.go:950 +0x375
github.com/spf13/cobra.(*Command).Execute(...)
	/go/src/github.com/openshift/router/vendor/github.com/spf13/cobra/command.go:887
main.main()
	/go/src/github.com/openshift/router/cmd/openshift-router/main.go:35 +0x305
```

It started happening on upgrade to 4.6.9:

```
$ oc version
Client Version: 4.6.8
Server Version: 4.6.9
Kubernetes Version: v1.19.0+7070803
```
The port conflict is unexpected. If the cluster is on a cloud platform, the router uses its own private network namespace by default; if the cluster is on bare metal, the router uses the host network by default, but the scheduler should prevent the router pod from being scheduled on a node where another pod has claimed port 1936. In either case, there should never be a port conflict. Is the cluster running on a cloud platform or on bare metal? Can you get the default ingresscontroller (`oc -n openshift-ingress-operator get ingresscontrollers/default -o yaml`)? Can you determine which process is using port 1936 (`ss -tlnp 'sport = :1936'`)?
Hi, the problem is not really an issue with the router itself, but with logging the failure as a panic; it should not be a "panic" error, a fatal error would be better. I think it just needs a change in the code (as the comment above the affected line in the router code suggests). The point of the issue is that the router can't bind to 1936 because the previous container was not yet stopped, or the port was not released by the previous run of the router. The router was running normally; after a change in the ingresscontroller CRD, the operator started a new deployment of the router. The newly created pod was not able to start because the previous pod, still in Terminating state, was still bound to port 1936. After two restarts, the router was able to bind to the port correctly. The message is clear about what is happening; it just shouldn't be logged as an error within the code, when the real issue is in the network configuration. Imagine an admin who sees that error and thinks the issue is in the code, when it really isn't. The issue is in the infrastructure, and as you mentioned, the admin needs to investigate why the port is not available. It needs a fix in the code, from panic to fatal.
hi @miciah, it was a bare metal cluster running the OVNKubernetes SDN. For me, killing/restarting the ingress pod (multiple times) did not work. I traced the process listening on 1936, and it was ovnkube, so the old pod could still have been terminating and still wired into the SDN. Cheers
(In reply to Vladislav Walek from comment #4)
> Hi,
>
> the problem is not really an issue with the router, rather than logging it
> as panic, where it should not be a "panic" error, fatal would be better.

Fair enough, we can change the panic to a fatal error.

(In reply to Mike Hepburn from comment #5)
> hi @miciah
>
> was bare metal cluster running OVNKubernetes sdn
> for me killing/restarting the ingress pod (multiple times) did not work
> i traced the process listening on 1936 it was ovnkube
> so could be the old pod was still terminating and still wired into sdn
>
> Cheers

If it was the old pod, it should show up as the openshift-router process, not ovnkube. Do you have a custom service for the router? What does `oc -n openshift-ingress get services` list? Can you provide the output of `oc -n openshift-ingress get services -o yaml`? By default, on bare metal, there should be only a single, ClusterIP-type service named "router-internal-default".
Actually, would it be possible to get all the services in the cluster (`oc get services -A -o yaml`) and the ovnkube logs as well?
hi .. sorry, I had deleted that cluster and installed a fresh 4.9. I can try to reproduce it, of course. I was wondering whether the other 4.8->4.9 blockers were causing any of this behavior: https://bugzilla.redhat.com/show_bug.cgi?id=1914284 https://bugzilla.redhat.com/show_bug.cgi?id=1914393 One thing of note: I had the default routers scheduled on the master nodes.
Updating target-release as it appears this will not make the 4.7 release schedule.
I'll try to fix the panic next sprint.
Sorry, I need to make a follow-up PR to address <https://github.com/openshift/router/pull/266#issuecomment-792051542>.
Verified in "4.10.0-0.nightly-2021-12-23-153012" release version.
Hi, if there is anything that customers should know about this bug, or if there are any important workarounds that should be outlined in the bug fixes section of the OpenShift Container Platform 4.10 release notes, please update the Doc Type and Doc Text fields. If not, can you please mark it as "no doc update"? Thanks!
This is a minor issue, so we don't really need the doc update.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:0056