Bug 1979822 - mdns-publisher pods are crashing and restarting often.
Summary: mdns-publisher pods are crashing and restarting often.
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.6.z
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: urgent
Target Milestone: ---
Target Release: 4.9.0
Assignee: Ben Nemec
QA Contact: Victor Voronkov
URL:
Whiteboard:
Depends On:
Blocks: 1988145
Reported: 2021-07-07 07:53 UTC by tmicheli
Modified: 2022-04-25 05:46 UTC
CC: 9 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Cloned As: 2003563
Environment:
Last Closed: 2021-10-18 17:38:02 UTC
Target Upstream Version:
Embargoed:


Attachments: none


Links:
- GitHub openshift/mdns-publisher pull 33 (open): Bug 1979822: Update zeroconf vendoring (last updated 2021-07-07 15:41:55 UTC)
- Red Hat Knowledge Base (Solution) 6171012 (last updated 2021-09-08 14:13:20 UTC)
- Red Hat Product Errata RHSA-2021:3759 (last updated 2021-10-18 17:38:16 UTC)

Description tmicheli 2021-07-07 07:53:29 UTC
Description of problem:

The mdns-publisher worker pods are crashing and restarting due to panics in the Go code.

~~~
# oc get  pod mdns-publisher-name-xn2w4-worker-0-tkpmc -o jsonpath='{.status}'  |jq .
[...]
  "containerStatuses": [
    {
      "containerID": "cri-o://d58725c656d0cf9f8c83dfb54eb03b8b06513461ccc6c73a0392d64375897b25",
      "image": "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:582087fd639adc5ed9064c32ff3891babad6d339f8046fd217a69e2f5caef9dc",
      "imageID": "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:582087fd639adc5ed9064c32ff3891babad6d339f8046fd217a69e2f5caef9dc",
      "lastState": {
        "terminated": {
          "containerID": "cri-o://edf27522aea137b84a3671401f51457ea8d222034d80b0603925a1408b3c86b0",
          "exitCode": 2,
          "finishedAt": "2021-07-05T00:06:41Z",
          "message": " 0x10, 0x2, 0x5a2, 0xc000024a1c, 0x4, 0xc0001b8610, 0x6, 0x1470, ...)\n\t/go/src/github.com/openshift/mdns-publisher/pkg/publisher/publisher.go:95 +0x45\ngithub.com/openshift/mdns-publisher/pkg/publisher.
IfaceCheck(0xc0000249c0, 0x10, 0x10, 0x2, 0x5a2, 0xc000024a1c, 0x4, 0xc0001b8610, 0x6, 0x1470, ...)\n\t/go/src/github.com/openshift/mdns-publisher/pkg/publisher/publisher.go:101 +0xa5\ncreated by github.com/openshift/mdns-publisher/cmd.glob..func1\n\t/go/src/github.com/openshift/mdns-publisher/cmd/publish.go:93 +0xc6a\n\ngoroutine 22 [IO wait]:\ninternal/poll.runtime_pollWait(0x7efe9e7c4e90, 0x72, 0xc0004803c0)\n\t/usr/lib/golang/src/runtime/netpoll.go:222 +0x55\ninternal/poll.(*pollDesc).wait(0xc000198818, 0x72, 0x0, 0x0, 0x0)\n\t/usr/lib/golang/src/internal/poll/fd_poll_runtime.go:87 +0x45\ninternal/poll.(*pollDesc).waitRead(...)\n\t/usr/lib/golang/src/internal/poll/f
d_poll_runtime.go:92\ninternal/poll.(*FD).RawRead(0xc000198800, 0xc00017b4a0, 0x0, 0x0)\n\t/usr/lib/golang/src/internal/poll/fd_unix.go:533 +0xfc\nnet.(*rawConn).Read(0xc000012150, 0xc00017b4a0, 0x1, 0x1)\n\t/usr/lib/golang/src/net/rawconn.go:43 +0x68\ngolang.org/x/net/internal/socket.(*Conn).recvMsg(0xc000010e20, 0xc000208ee0, 0x0, 0x28, 0xc000208f38)\n\t/go/src/github.com/openshift/mdns-publisher/vendor/golang.org/x/net/internal/socket/rawconn_ms
g.go:31 +0x20e\ngolang.org/x/net/internal/socket.(*Conn).RecvMsg(...)\n\t/go/src/github.com/openshift/mdns-publisher/vendor/golang.org/x/net/internal/socket/socket.go:255\ngolang.org/x/net/ipv6.(*payloadHandler).ReadFrom(0xc0000a2e70, 0xc00015a000, 0x10000, 0x10000, 0x2, 0x94f740, 0xc00017b440, 0x0, 0x0, 0x0)\n\t/go/src/github.com/openshift/mdns-publisher/vendor/golang.org/x/net/ipv6/payload_cmsg.go:31 +0x1df\ngithub.com/celebdor/zeroconf.(*Server).recv6(0xc000099260, 0xc0000a2e60)\n\t/go/src/github.com/openshift/mdns-publisher/vendor/github.com/celebdor/zeroconf/server.go:302 +0x109\ncreated by github.com/celebdor/zeroconf.(*Server).mainloop\n\t/go/src/github.com/opens
hift/mdns-publisher/vendor/github.com/celebdor/zeroconf/server.go:211 +0x65\n",
          "reason": "Error",
          "startedAt": "2021-07-04T17:28:19Z"
        }
[...]


# oc logs    mdns-publisher-name-xn2w4-worker-0-tkpmc -p
time="2021-07-04T17:28:19Z" level=info msg="Publishing with settings" collision_avoidance=hostname ip=x.x.x.x
time="2021-07-04T17:28:19Z" level=info msg="Binding interface" name=ens3
time="2021-07-04T17:28:19Z" level=debug msg="Changing service name" new="name Workstation-name-xn2w4-worker-0-tkpmc" original="name Workstation"
time="2021-07-04T17:28:19Z" level=info msg="Publishing service" domain=local. hostname=name-xn2w4-worker-0-tkpmc.local. name="name Workstation-name-xn2w4-worker-0-tkpmc" port=42424 ttl=3200 type=_workstation._tcp
time="2021-07-04T17:28:19Z" level=info msg="Zeroconf registering service" name="name Workstation-name-xn2w4-worker-0-tkpmc"
time="2021-07-04T17:28:19Z" level=info msg="Zeroconf setting service ttl" name="name Workstation-name-xn2w4-worker-0-tkpmc" ttl=3200
fatal error: concurrent map read and map write

goroutine 21 [running]:
runtime.throw(0x8d6800, 0x21)
        /usr/lib/golang/src/runtime/panic.go:1116 +0x72 fp=0xc000202d18 sp=0xc000202ce8 pc=0x43b952
runtime.mapaccess2_fast64(0x85d0a0, 0xc0001c2660, 0x2, 0xc00001000c, 0xc000202de0)
        /usr/lib/golang/src/runtime/map_fast64.go:61 +0x1ac fp=0xc000202d40 sp=0xc000202d18 pc=0x41832c
github.com/celebdor/zeroconf.(*Server).handleQuery(0xc000099260, 0xc000202eb0, 0x2, 0x94f740, 0xc0004269c0, 0x0, 0x0)
        /go/src/github.com/openshift/mdns-publisher/vendor/github.com/celebdor/zeroconf/server.go:366 +0x2f3 fp=0xc000202e78 sp=0xc000202d40 pc=0x693cd3
github.com/celebdor/zeroconf.(*Server).parsePacket(0xc000099260, 0xc00028e000, 0x29, 0x10000, 0x2, 0x94f740, 0xc0004269c0, 0xc0004269c0, 0x0)
        /go/src/github.com/openshift/mdns-publisher/vendor/github.com/celebdor/zeroconf/server.go:323 +0xf9 fp=0xc000202f48 sp=0xc000202e78 pc=0x693999
github.com/celebdor/zeroconf.(*Server).recv4(0xc000099260, 0xc0000a2e10)
        /go/src/github.com/openshift/mdns-publisher/vendor/github.com/celebdor/zeroconf/server.go:281 +0x16f fp=0xc000202fd0 sp=0xc000202f48 pc=0x69364f
runtime.goexit()
        /usr/lib/golang/src/runtime/asm_amd64.s:1374 +0x1 fp=0xc000202fd8 sp=0xc000202fd0 pc=0x470401
created by github.com/celebdor/zeroconf.(*Server).mainloop
        /go/src/github.com/openshift/mdns-publisher/vendor/github.com/celebdor/zeroconf/server.go:208 +0x89

goroutine 1 [select, 396 minutes]:
github.com/openshift/mdns-publisher/cmd.glob..func1(0xbbeaa0, 0xc00007b280, 0x0, 0x1)
        /go/src/github.com/openshift/mdns-publisher/cmd/publish.go:95 +0xcfa
github.com/spf13/cobra.(*Command).execute(0xbbeaa0, 0xc000010090, 0x1, 0x1, 0xbbeaa0, 0xc000010090)
        /go/src/github.com/openshift/mdns-publisher/vendor/github.com/spf13/cobra/command.go:830 +0x2c2
github.com/spf13/cobra.(*Command).ExecuteC(0xbbeaa0, 0x44b24a, 0xb84160, 0xc000000180)
        /go/src/github.com/openshift/mdns-publisher/vendor/github.com/spf13/cobra/command.go:914 +0x30b
github.com/spf13/cobra.(*Command).Execute(...)
        /go/src/github.com/openshift/mdns-publisher/vendor/github.com/spf13/cobra/command.go:864
github.com/openshift/mdns-publisher/cmd.Execute()
        /go/src/github.com/openshift/mdns-publisher/cmd/publish.go:139 +0x31
main.main()
        /go/src/github.com/openshift/mdns-publisher/main.go:8 +0x25

goroutine 18 [syscall, 396 minutes]:
os/signal.signal_recv(0x0)
        /usr/lib/golang/src/runtime/sigqueue.go:147 +0x9d
os/signal.loop()
        /usr/lib/golang/src/os/signal/signal_unix.go:23 +0x25
created by os/signal.Notify.func1.1
        /usr/lib/golang/src/os/signal/signal.go:150 +0x45

goroutine 19 [chan receive, 396 minutes]:
github.com/openshift/mdns-publisher/pkg/publisher.Publish(0xc0000249c0, 0x10, 0x10, 0x2, 0x5a2, 0xc000024a1c, 0x4, 0xc0001b8610, 0x6, 0x1470, ...)
        /go/src/github.com/openshift/mdns-publisher/pkg/publisher/publisher.go:41 +0x6ac
created by github.com/openshift/mdns-publisher/cmd.glob..func1
        /go/src/github.com/openshift/mdns-publisher/cmd/publish.go:90 +0xba5

goroutine 20 [sleep]:
time.Sleep(0x12a05f200)
        /usr/lib/golang/src/runtime/time.go:188 +0xbf
github.com/openshift/mdns-publisher/pkg/publisher.ifaceCheck(0xc0000249c0, 0x10, 0x10, 0x2, 0x5a2, 0xc000024a1c, 0x4, 0xc0001b8610, 0x6, 0x1470, ...)
        /go/src/github.com/openshift/mdns-publisher/pkg/publisher/publisher.go:95 +0x45
github.com/openshift/mdns-publisher/pkg/publisher.IfaceCheck(0xc0000249c0, 0x10, 0x10, 0x2, 0x5a2, 0xc000024a1c, 0x4, 0xc0001b8610, 0x6, 0x1470, ...)
        /go/src/github.com/openshift/mdns-publisher/pkg/publisher/publisher.go:101 +0xa5
created by github.com/openshift/mdns-publisher/cmd.glob..func1
        /go/src/github.com/openshift/mdns-publisher/cmd/publish.go:93 +0xc6a

goroutine 22 [IO wait]:
internal/poll.runtime_pollWait(0x7efe9e7c4e90, 0x72, 0xc0004803c0)
        /usr/lib/golang/src/runtime/netpoll.go:222 +0x55
internal/poll.(*pollDesc).wait(0xc000198818, 0x72, 0x0, 0x0, 0x0)
        /usr/lib/golang/src/internal/poll/fd_poll_runtime.go:87 +0x45
internal/poll.(*pollDesc).waitRead(...)
        /usr/lib/golang/src/internal/poll/fd_poll_runtime.go:92
internal/poll.(*FD).RawRead(0xc000198800, 0xc00017b4a0, 0x0, 0x0)
        /usr/lib/golang/src/internal/poll/fd_unix.go:533 +0xfc
net.(*rawConn).Read(0xc000012150, 0xc00017b4a0, 0x1, 0x1)
        /usr/lib/golang/src/net/rawconn.go:43 +0x68
golang.org/x/net/internal/socket.(*Conn).recvMsg(0xc000010e20, 0xc000208ee0, 0x0, 0x28, 0xc000208f38)
        /go/src/github.com/openshift/mdns-publisher/vendor/golang.org/x/net/internal/socket/rawconn_msg.go:31 +0x20e
golang.org/x/net/internal/socket.(*Conn).RecvMsg(...)
        /go/src/github.com/openshift/mdns-publisher/vendor/golang.org/x/net/internal/socket/socket.go:255
golang.org/x/net/ipv6.(*payloadHandler).ReadFrom(0xc0000a2e70, 0xc00015a000, 0x10000, 0x10000, 0x2, 0x94f740, 0xc00017b440, 0x0, 0x0, 0x0)
        /go/src/github.com/openshift/mdns-publisher/vendor/golang.org/x/net/ipv6/payload_cmsg.go:31 +0x1df
github.com/celebdor/zeroconf.(*Server).recv6(0xc000099260, 0xc0000a2e60)
        /go/src/github.com/openshift/mdns-publisher/vendor/github.com/celebdor/zeroconf/server.go:302 +0x109
created by github.com/celebdor/zeroconf.(*Server).mainloop
        /go/src/github.com/openshift/mdns-publisher/vendor/github.com/celebdor/zeroconf/server.go:211 +0x65
~~~
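The fatal error above is the Go runtime detecting a data race on a plain map: goroutine 21 (recv4) and goroutine 22 (recv6) both feed packets through parsePacket into handleQuery, which reads a shared map (mapaccess2_fast64 points at a map with 64-bit keys) while another goroutine writes to it. Below is a minimal illustrative sketch of the pattern and the conventional sync.RWMutex guard; this is not the vendored zeroconf code, and the type and field names are hypothetical:
~~~
// Illustrative only: two reader goroutines (analogous to recv4/recv6)
// query a shared map while a third goroutine (analogous to the TTL
// updates in the log above) writes it. The RWMutex makes this safe.
package main

import (
	"sync"
	"time"
)

type server struct {
	mu  sync.RWMutex      // guards ttl
	ttl map[uint64]uint32 // shared state with 64-bit keys, hypothetical
}

// handleQuery stands in for the reader side (recv4/recv6 -> parsePacket -> handleQuery).
func (s *server) handleQuery(id uint64) (uint32, bool) {
	s.mu.RLock()
	defer s.mu.RUnlock()
	v, ok := s.ttl[id]
	return v, ok
}

// setTTL stands in for the writer side (cf. "Zeroconf setting service ttl").
func (s *server) setTTL(id uint64, ttl uint32) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.ttl[id] = ttl
}

func main() {
	s := &server{ttl: map[uint64]uint32{}}
	for i := 0; i < 2; i++ { // two readers, like recv4 and recv6
		go func() {
			for {
				s.handleQuery(1)
			}
		}()
	}
	go func() { // one writer
		for {
			s.setTTL(1, 3200)
		}
	}()
	time.Sleep(time.Second)
}
~~~
With the RLock/Lock calls removed, this program dies almost immediately with the same "fatal error: concurrent map read and map write"; running the unguarded version under `go run -race` reports the race before it becomes fatal.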
Dual-stack networking is not in use:
~~~
# oc describe network.config/cluster
Name:         cluster
Namespace:
Labels:       <none>
Annotations:  <none>
API Version:  config.openshift.io/v1
Kind:         Network
Metadata:
  Creation Timestamp:  2021-06-21T11:23:30Z
  Generation:          2
  Managed Fields:
    API Version:  config.openshift.io/v1
    Fields Type:  FieldsV1
    fieldsV1:
      f:spec:
        .:
        f:clusterNetwork:
        f:externalIP:
          .:
          f:policy:
        f:networkType:
        f:serviceNetwork:
      f:status:
    Manager:      cluster-bootstrap
    Operation:    Update
    Time:         2021-06-21T11:23:30Z
    API Version:  config.openshift.io/v1
    Fields Type:  FieldsV1
    fieldsV1:
      f:status:
        f:clusterNetwork:
        f:networkType:
        f:serviceNetwork:
    Manager:         cluster-network-operator
    Operation:       Update
    Time:            2021-06-21T11:32:03Z
  Resource Version:  3631
  Self Link:         /apis/config.openshift.io/v1/networks/cluster
  UID:               eb3a9750-6040-4be4-8ac4-ef9b4e325eea
Spec:
  Cluster Network:
    Cidr:         10.128.0.0/14
    Host Prefix:  23
  External IP:
    Policy:
  Network Type:  Kuryr
  Service Network:
    172.30.0.0/16
Status:
  Cluster Network:
    Cidr:         10.128.0.0/14
    Host Prefix:  23
  Network Type:   Kuryr
  Service Network:
    172.30.0.0/16
Events:  <none>
~~~
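As an aside, the same single-stack check can be made more compactly; a sketch using jsonpath with the field paths from the object above:
~~~
# oc get network.config/cluster -o jsonpath='{.spec.networkType}{"\n"}{.spec.clusterNetwork}{"\n"}{.spec.serviceNetwork}{"\n"}'
~~~
A single-stack cluster prints one IPv4 clusterNetwork CIDR and one IPv4 serviceNetwork entry, as in the describe output above.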

Version-Release number of selected component (if applicable):


How reproducible:
This was observed on SoS (Shift on Stack) with Kuryr enabled, but also with the IPI installer in non-RHOSP environments.

Steps to Reproduce:


Actual results:
The mdns-publisher worker pods get restarted often.

Expected results:
The mdns-publisher worker pods do not restart frequently.

Additional info:

Comment 2 Ben Nemec 2021-07-07 15:33:46 UTC
It turns out this was already fixed by https://github.com/dcbw/zeroconf/commit/cf83d55efa2450344cb81a395c7ba439a001f6ca but we need to re-vendor zeroconf in mdns-publisher to make it effective.
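
For reference, re-vendoring a fixed dependency in a Go modules project generally looks like the sketch below. The authoritative change is openshift/mdns-publisher pull 33 linked above; the commands here illustrate the technique and are not a record of what that PR did:
~~~
# Point the vendored celebdor/zeroconf at the commit carrying the race fix,
# then regenerate vendor/ so the fix actually ships in the built image.
go mod edit -replace github.com/celebdor/zeroconf=github.com/dcbw/zeroconf@cf83d55efa2450344cb81a395c7ba439a001f6ca
go mod tidy    # resolves the commit hash to a pseudo-version in go.mod
go mod vendor  # refreshes vendor/github.com/celebdor/zeroconf
~~~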

Comment 11 Victor Voronkov 2021-08-03 10:00:18 UTC
Can't really verify on this build, since mDNS pods were removed from 4.8; will do real verification on the 4.7 backport.

Comment 15 errata-xmlrpc 2021-10-18 17:38:02 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:3759

Comment 18 Victor Voronkov 2022-04-25 05:46:02 UTC
Automation is not relevant for this version, only for 4.7 and 4.6.

