Bug 1909906 - The router fails with PANIC error when stats port already in use
Summary: The router fails with PANIC error when stats port already in use
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.6
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 4.10.0
Assignee: Miciah Dashiel Butler Masters
QA Contact: Arvind iyengar
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-12-22 01:23 UTC by Vladislav Walek
Modified: 2024-03-25 17:38 UTC
CC: 7 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-03-10 16:02:33 UTC
Target Upstream Version:
Embargoed:


Attachments


Links:
- Github openshift router pull 266 (Closed): [Insights/Rules/RFE] Add check for file before doing (expensive) calls to usernames - last updated 2022-04-15 07:00:21 UTC
- Github openshift router pull 366 (Merged): Bug 1909906: Exit immediately if stats port is taken - last updated 2022-02-10 22:49:00 UTC
- Red Hat Product Errata RHSA-2022:0056 - last updated 2022-03-10 16:02:50 UTC

Description Vladislav Walek 2020-12-22 01:23:07 UTC
Description of problem:

The router pod fails with PANIC error when the stats port is already in use:

I1221 23:03:13.480857       1 template.go:403] router "msg"="starting router"  "version"="majorFromGit: \nminorFromGit: \ncommitFromGit: 89e441002d0dfaefd5587bd9a380f254e864dba2\nversionFromGit: 4.0.0-219-g89e4410\ngitTreeState: clean\nbuildDate: 2020-12-05T14:15:58Z\n"
panic: listen tcp 0.0.0.0:1936: bind: address already in use

goroutine 1 [running]:
github.com/openshift/router/pkg/router/metrics.Listener.Listen(0xc0002905a0, 0xc, 0xc000582f00, 0xc00000a04f, 0xc, 0xc00000a0ef, 0xc, 0x1bf7840, 0xc0002b43a0, 0x1bf7900, ...)
        /go/src/github.com/openshift/router/pkg/router/metrics/metrics.go:131 +0x57f
github.com/openshift/router/pkg/cmd/infra/router.(*TemplateRouterOptions).Run(0xc0004fe000, 0xc0004de2a0, 0x0, 0x2d)
        /go/src/github.com/openshift/router/pkg/cmd/infra/router/template.go:536 +0x23fc
github.com/openshift/router/pkg/cmd/infra/router.NewCommandTemplateRouter.func1(0xc0004d1340, 0xc0003315c0, 0x0, 0x1, 0x0, 0x0)
        /go/src/github.com/openshift/router/pkg/cmd/infra/router/template.go:223 +0x1dc
github.com/spf13/cobra.(*Command).execute(0xc0004d1340, 0xc00000e090, 0x1, 0x1, 0xc0004d1340, 0xc00000e090)
        /go/src/github.com/openshift/router/vendor/github.com/spf13/cobra/command.go:842 +0x47c
github.com/spf13/cobra.(*Command).ExecuteC(0xc0004d1340, 0xc0002aa1e0, 0xc0004d1340, 0x10)
        /go/src/github.com/openshift/router/vendor/github.com/spf13/cobra/command.go:950 +0x375
github.com/spf13/cobra.(*Command).Execute(...)
        /go/src/github.com/openshift/router/vendor/github.com/spf13/cobra/command.go:887
main.main()
        /go/src/github.com/openshift/router/cmd/openshift-router/main.go:35 +0x305



Version-Release number of selected component (if applicable):
OpenShift Container Platform 4.6

How reproducible:
N/A - this happened during the operator's action to redeploy the router.
Unfortunately it is not reproducible, as the environment has stabilized since then.

Actual results:
panic error


Expected results:
At the very least, the router should not panic; it should log a fatal or error message instead.


Additional info:
I would say the issue is odd, because the router binary should not exit with a panic, but rather with a fatal error.
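
For illustration, a minimal standalone sketch (not the actual router code; startMetricsListener is a hypothetical stand-in for the router's stats listener setup) showing how an unchecked error from net.Listen becomes a panic like the one in the log above:

    package main

    import (
        "fmt"
        "net"
    )

    // startMetricsListener is a hypothetical stand-in for the router's stats
    // listener setup. If the address is already bound, net.Listen returns an
    // error, and panicking on that error produces a goroutine dump like the
    // one pasted above.
    func startMetricsListener(addr string) net.Listener {
        l, err := net.Listen("tcp", addr)
        if err != nil {
            panic(err) // e.g. "listen tcp 0.0.0.0:1936: bind: address already in use"
        }
        return l
    }

    func main() {
        // Bind the port once, then try again to simulate the conflict.
        first := startMetricsListener("0.0.0.0:1936")
        defer first.Close()
        second := startMetricsListener("0.0.0.0:1936") // panics here
        fmt.Println(second.Addr())
    }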

Comment 2 Mike Hepburn 2020-12-26 22:19:53 UTC
I'm also seeing this

oc logs router-default-58d8b6c48d-jhqzm
I1226 22:16:39.842873       1 template.go:403] router "msg"="starting router"  "version"="majorFromGit: \nminorFromGit: \ncommitFromGit: 89e441002d0dfaefd5587bd9a380f254e864dba2\nversionFromGit: 4.0.0-219-g89e4410\ngitTreeState: clean\nbuildDate: 2020-12-16T17:13:49Z\n"
panic: listen tcp 0.0.0.0:1936: bind: address already in use

goroutine 1 [running]:
github.com/openshift/router/pkg/router/metrics.Listener.Listen(0xc000200700, 0xc, 0xc000183200, 0xc00000a0af, 0xc, 0xc00000a0ef, 0xc, 0x1bf7840, 0xc0003b3f80, 0x1bf7900, ...)
	/go/src/github.com/openshift/router/pkg/router/metrics/metrics.go:131 +0x57f
github.com/openshift/router/pkg/cmd/infra/router.(*TemplateRouterOptions).Run(0xc000360000, 0xc0002bb7a0, 0x0, 0x2d)
	/go/src/github.com/openshift/router/pkg/cmd/infra/router/template.go:536 +0x23fc
github.com/openshift/router/pkg/cmd/infra/router.NewCommandTemplateRouter.func1(0xc0005138c0, 0xc00006f6f0, 0x0, 0x1, 0x0, 0x0)
	/go/src/github.com/openshift/router/pkg/cmd/infra/router/template.go:223 +0x1dc
github.com/spf13/cobra.(*Command).execute(0xc0005138c0, 0xc00000e0b0, 0x1, 0x1, 0xc0005138c0, 0xc00000e0b0)
	/go/src/github.com/openshift/router/vendor/github.com/spf13/cobra/command.go:842 +0x47c
github.com/spf13/cobra.(*Command).ExecuteC(0xc0005138c0, 0xc00007afc0, 0xc0005138c0, 0x10)
	/go/src/github.com/openshift/router/vendor/github.com/spf13/cobra/command.go:950 +0x375
github.com/spf13/cobra.(*Command).Execute(...)
	/go/src/github.com/openshift/router/vendor/github.com/spf13/cobra/command.go:887
main.main()
	/go/src/github.com/openshift/router/cmd/openshift-router/main.go:35 +0x305


Started happening on upgrade to 4.6.9

oc version
Client Version: 4.6.8
Server Version: 4.6.9
Kubernetes Version: v1.19.0+7070803

Comment 3 Miciah Dashiel Butler Masters 2020-12-31 04:45:13 UTC
The port conflict is unexpected because if the cluster is on a cloud platform, then the router uses its own private net namespace by default; and if the cluster is on bare metal, then the router uses the host network by default, but the scheduler should prevent the router pod from being scheduled on a node where another pod has claimed port 1936, so in either case, there should never be a port conflict.  

Is the cluster running on a cloud platform or on bare metal?  

Can you get the default ingresscontroller (`oc -n openshift-ingress-operator get ingresscontrollers/default -o yaml`)?  

Can you determine which process is using port 1936 (`ss -tlnp 'sport = 1936'`)?

Comment 4 Vladislav Walek 2020-12-31 05:04:12 UTC
Hi,

the problem is not really an issue with the router itself, but rather with how it is logged: it should not be a "panic" error, a fatal error would be better.
I think it just needs a change in the code (as commented above the affected line in the router code).

The point of the issue is really that the router can't bind to port 1936, because the previous container was not yet stopped or the port was not released by the previous run of the router.
The router was running normally. After a change in the IngressController CRD, the operator started a new deployment of the router.
The newly created pod was not able to start because the previous one was still bound to port 1936 (in Terminating state). After 2 restarts, the router was able to bind to the port correctly.

The message is clear about what is happening; it just shouldn't be logged as an error within the code, when the real issue is in the network configuration.
Just imagine an admin seeing that error and thinking the issue is with the code, when it really isn't. The issue is in the infrastructure - and as you mentioned, the admin needs to investigate why the port is not available.

It needs a fix in the code, from panic to fatal.
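
A minimal standalone sketch of the kind of change being asked for here (listenOrExit is a hypothetical helper, not the actual router code): check the error from net.Listen and exit with a fatal log message instead of panicking.

    package main

    import (
        "log"
        "net"
    )

    // listenOrExit illustrates the requested behavior: on a bind failure, log
    // a fatal error and exit instead of panicking with a goroutine dump.
    func listenOrExit(addr string) net.Listener {
        l, err := net.Listen("tcp", addr)
        if err != nil {
            // log.Fatalf prints the message and exits with status 1, so the
            // pod still restarts, but without a stack trace in the logs.
            log.Fatalf("router: failed to bind stats port %s: %v", addr, err)
        }
        return l
    }

    func main() {
        l := listenOrExit("0.0.0.0:1936")
        defer l.Close()
        log.Printf("stats listener started on %s", l.Addr())
    }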

Comment 5 Mike Hepburn 2021-01-02 00:19:16 UTC
hi @miciah

It was a bare metal cluster running the OVNKubernetes SDN.
For me, killing/restarting the ingress pod (multiple times) did not work.
I traced the process listening on 1936; it was ovnkube.
So it could be that the old pod was still terminating and still wired into the SDN.

Cheers

Comment 6 Miciah Dashiel Butler Masters 2021-01-05 19:04:19 UTC
(In reply to Vladislav Walek from comment #4)
> Hi,
> 
> the problem is not really an issue with the router itself, but rather with
> how it is logged: it should not be a "panic" error, a fatal error would be
> better.

Fair enough, we can change the panic to a fatal error.


(In reply to Mike Hepburn from comment #5)
> hi @miciah
> 
> It was a bare metal cluster running the OVNKubernetes SDN.
> For me, killing/restarting the ingress pod (multiple times) did not work.
> I traced the process listening on 1936; it was ovnkube.
> So it could be that the old pod was still terminating and still wired into
> the SDN.
> 
> Cheers

If it was the old pod, it should show up as the openshift-router process, not ovnkube.  

Do you have a custom service for the router?  What does `oc -n openshift-ingress get services` list?  Can you provide the output of `oc -n openshift-ingress get services -o yaml`?  By default, on bare metal, there should be only a single, ClusterIP-type service named "router-internal-default".

Comment 7 Miciah Dashiel Butler Masters 2021-01-05 22:24:44 UTC
Actually, would it be possible to get all the services in the cluster (`oc get services -A -o yaml`) and the ovnkube logs as well?

Comment 8 Mike Hepburn 2021-01-13 03:43:58 UTC
Hi .. sorry, I had deleted that cluster and installed a fresh 4.9.

I can try to reproduce it, of course. I was wondering if the other 4.8->4.9 blockers were causing any of this behavior:

https://bugzilla.redhat.com/show_bug.cgi?id=1914284
https://bugzilla.redhat.com/show_bug.cgi?id=1914393

One thing of note: I had the default routers scheduled on the master nodes.

Comment 9 mfisher 2021-01-20 20:04:53 UTC
Updating target-release as it appears this will not make the 4.7 release schedule.

Comment 10 Miciah Dashiel Butler Masters 2021-02-06 02:11:46 UTC
I'll try to fix the panic next sprint.

Comment 12 Miciah Dashiel Butler Masters 2021-03-08 04:53:57 UTC
Sorry, I need to make a follow-up PR to address <https://github.com/openshift/router/pull/266#issuecomment-792051542>.

Comment 18 Arvind iyengar 2021-12-28 08:29:01 UTC
Verified in "4.10.0-0.nightly-2021-12-23-153012" release version.

Comment 22 Brandi Munilla 2022-02-10 20:09:04 UTC
Hi, if there is anything that customers should know about this bug, or if there are any important workarounds that should be outlined in the bug fixes section of the OpenShift Container Platform 4.10 release notes, please update the Doc Type and Doc Text fields. If not, can you please mark it as "no doc update"? Thanks!

Comment 23 Miciah Dashiel Butler Masters 2022-02-10 23:16:23 UTC
This is a minor issue, so we don't really need the doc update.

Comment 25 errata-xmlrpc 2022-03-10 16:02:33 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0056

