Bug 2055865
| Field | Value |
|---|---|
| Summary | northd doesn't wake up for poll intervals when DB IP cannot be reached |
| Product | Red Hat Enterprise Linux Fast Datapath |
| Component | OVN |
| Version | FDP 21.H |
| Target Release | FDP 22.B |
| Hardware | All |
| OS | Linux |
| Status | CLOSED ERRATA |
| Severity | urgent |
| Priority | unspecified |
| Keywords | CustomerScenariosInitiative |
| Reporter | bzhai |
| Assignee | Ilya Maximets <i.maximets> |
| QA Contact | Ehsan Elahi <eelahi> |
| CC | achernet, ctrautma, dcbw, i.maximets, jiji, mmichels |
| Fixed In Version | ovn2.13-20.12.0-195 |
| Doc Type | If docs needed, set a value |
| Type | Bug |
| Last Closed | 2022-03-30 16:28:12 UTC |
Description
bzhai
2022-02-17 20:18:23 UTC
First we thought the northd probe interval was set to 0, but it was 5000:

sh-4.4# ovn-nbctl get NB_GLOBAL . options:northd_probe_interval
"5000"

nbctl daemon logs (from ovnkube-master pod):

2022-02-17T18:22:10Z|00002|stream_ssl|ERR|ssl:192.168.58.108:9641: connect: Network is unreachable
2022-02-17T18:22:11Z|00005|stream_ssl|ERR|ssl:192.168.58.108:9641: connect: Network is unreachable
2022-02-17T18:22:13Z|00009|stream_ssl|ERR|ssl:192.168.58.108:9641: connect: Network is unreachable
2022-02-17T18:22:17Z|00013|stream_ssl|ERR|ssl:192.168.58.108:9641: connect: Network is unreachable
<snip>
2022-02-17T18:28:17.387Z|00062|stream_ssl|ERR|ssl:192.168.58.108:9641: connect: Network is unreachable
2022-02-17T18:28:25.395Z|00063|stream_ssl|ERR|ssl:192.168.58.108:9641: connect: Network is unreachable
2022-02-17T18:28:33.404Z|00064|stream_ssl|ERR|ssl:192.168.58.108:9641: connect: Network is unreachable
2022-02-17T18:28:41.412Z|00065|stream_ssl|ERR|ssl:192.168.58.108:9641: connect: Network is unreachable
2022-02-17T18:28:49.420Z|00066|stream_ssl|ERR|ssl:192.168.58.108:9641: connect: Network is unreachable
2022-02-17T18:28:57.431Z|00067|reconnect|INFO|ssl:192.168.58.108:9641: connected
<i.e., connectivity is restored at 18:28:57 at the latest>

ovn-northd logs:

2022-02-17T18:20:05Z|17665|poll_loop(stopwatch0)|DBG|wakeup due to [POLLIN] on fd 7 (FIFO pipe:[237268213]) at lib/stopwatch.c:458 (0% CPU usage)
2022-02-17T18:20:05Z|20529|reconnect|DBG|ssl:192.168.58.108:9642: idle 29997 ms, sending inactivity probe
2022-02-17T18:20:05Z|20530|reconnect|DBG|ssl:192.168.58.108:9642: entering IDLE
2022-02-17T18:20:05Z|20531|jsonrpc|DBG|ssl:192.168.58.108:9642: send request, method="echo", params=[], id="echo"
2022-02-17T18:20:05Z|20532|stream_ssl|DBG|client6-->ssl:192.168.58.108:9642 type 256 (5 bytes)
2022-02-17T18:20:05Z|20533|stream_ssl|DBG|client6-->ssl:192.168.58.108:9642 type 257 (1 bytes)
2022-02-17T18:20:05Z|17666|poll_loop(stopwatch0)|DBG|wakeup due to [POLLIN] on fd 7 (FIFO pipe:[237268213]) at lib/stopwatch.c:458 (0% CPU usage)
2022-02-17T18:20:05Z|17667|poll_loop(stopwatch0)|DBG|wakeup due to [POLLIN] on fd 7 (FIFO pipe:[237268213]) at lib/stopwatch.c:458 (0% CPU usage)
2022-02-17T18:20:05Z|20534|poll_loop|DBG|wakeup due to [POLLIN] on fd 14 (192.168.58.108:59440<->192.168.58.108:9642) at lib/stream-ssl.c:832 (0% CPU usage)
2022-02-17T18:20:05Z|17668|poll_loop(stopwatch0)|DBG|wakeup due to [POLLIN] on fd 7 (FIFO pipe:[237268213]) at lib/stopwatch.c:458 (0% CPU usage)
2022-02-17T18:20:05Z|20535|stream_ssl|DBG|client6<--ssl:192.168.58.108:9642 type 256 (5 bytes)
2022-02-17T18:20:05Z|20536|stream_ssl|DBG|client6<--ssl:192.168.58.108:9642 type 257 (1 bytes)
2022-02-17T18:20:05Z|20537|jsonrpc|DBG|ssl:192.168.58.108:9642: received reply, result=[], id="echo"
2022-02-17T18:20:05Z|20538|reconnect|DBG|ssl:192.168.58.108:9642: entering ACTIVE
2022-02-17T18:20:05Z|17669|poll_loop(stopwatch0)|DBG|wakeup due to [POLLIN] on fd 7 (FIFO pipe:[237268213]) at lib/stopwatch.c:458 (0% CPU usage)
2022-02-17T18:20:05Z|20539|poll_loop|DBG|wakeup due to 0-ms timeout at lib/stream-ssl.c:834 (0% CPU usage)
2022-02-17T18:20:05Z|17670|poll_loop(stopwatch0)|DBG|wakeup due to [POLLIN] on fd 7 (FIFO pipe:[237268213]) at lib/stopwatch.c:458 (0% CPU usage)
2022-02-17T18:20:05Z|17671|poll_loop(stopwatch0)|DBG|wakeup due to [POLLIN] on fd 7 (FIFO pipe:[237268213]) at lib/stopwatch.c:458 (0% CPU usage)
2022-02-17T18:20:05Z|20540|poll_loop|DBG|wakeup due to [POLLIN] on fd 14 (192.168.58.108:59440<->192.168.58.108:9642) at lib/stream-ssl.c:832 (0% CPU usage)
2022-02-17T18:20:05Z|20541|stream_ssl|DBG|client6<--ssl:192.168.58.108:9642 type 256 (5 bytes)
2022-02-17T18:20:05Z|20542|stream_ssl|DBG|client6<--ssl:192.168.58.108:9642 type 257 (1 bytes)
2022-02-17T18:20:05Z|17672|poll_loop(stopwatch0)|DBG|wakeup due to [POLLIN] on fd 7 (FIFO pipe:[237268213]) at lib/stopwatch.c:458 (0% CPU usage)
2022-02-17T18:20:05Z|20543|jsonrpc|DBG|ssl:192.168.58.108:9642: received notification, method="update2", params=[["monid","_Server"],{"Database":{"91a38ed7-d7ac-45c2-8123-659db7650ff6":{"modify":{"index":48732}}}}]
2022-02-17T18:20:05Z|20544|stream_ssl|DBG|client6<--ssl:192.168.58.108:9642 type 256 (5 bytes)
2022-02-17T18:20:05Z|20545|stream_ssl|DBG|client6<--ssl:192.168.58.108:9642 type 257 (1 bytes)
2022-02-17T18:20:05Z|20546|jsonrpc|DBG|ssl:192.168.58.108:9642: received notification, method="update3", params=[["monid","OVN_Southbound"],"0b74fdca-f575-42d3-ba2a-533e98ab2df2",{"SB_Global":{"72de3344-ac6f-4f2e-b26f-0b2087b05614":{"modify":{"options":["map",[["e2e_timestamp","1645122005"]]]}}}}]
2022-02-17T18:20:05Z|20547|poll_loop|DBG|wakeup due to [POLLIN] on fd 14 (192.168.58.108:59440<->192.168.58.108:9642) at lib/stream-ssl.c:832 (0% CPU usage)
2022-02-17T18:20:05Z|17673|poll_loop(stopwatch0)|DBG|wakeup due to [POLLIN] on fd 7 (FIFO pipe:[237268213]) at lib/stopwatch.c:458 (0% CPU usage)
2022-02-17T18:20:05Z|20548|stream_ssl|DBG|client6<--ssl:192.168.58.108:9642 type 256 (5 bytes)
2022-02-17T18:20:05Z|20549|stream_ssl|DBG|client6<--ssl:192.168.58.108:9642 type 257 (1 bytes)
2022-02-17T18:20:05Z|20550|jsonrpc|DBG|ssl:192.168.58.108:9642: received reply, result=[{"count":1},{"count":1},{},{}], id=4200
2022-02-17T18:20:05Z|17674|poll_loop(stopwatch0)|DBG|wakeup due to [POLLIN] on fd 7 (FIFO pipe:[237268213]) at lib/stopwatch.c:458 (0% CPU usage)
2022-02-17T18:20:05Z|17675|poll_loop(stopwatch0)|DBG|wakeup due to [POLLIN] on fd 7 (FIFO pipe:[237268213]) at lib/stopwatch.c:458 (0% CPU usage)
2022-02-17T18:20:05Z|17676|poll_loop(stopwatch0)|DBG|wakeup due to [POLLIN] on fd 7 (FIFO pipe:[237268213]) at lib/stopwatch.c:458 (0% CPU usage)
2022-02-17T18:20:05Z|17677|poll_loop(stopwatch0)|DBG|wakeup due to [POLLIN] on fd 7 (FIFO pipe:[237268213]) at lib/stopwatch.c:458 (0% CPU usage)
2022-02-17T18:20:05Z|17678|poll_loop(stopwatch0)|DBG|wakeup due to [POLLIN] on fd 7 (FIFO pipe:[237268213]) at lib/stopwatch.c:458 (0% CPU usage)
2022-02-17T18:20:05Z|17679|poll_loop(stopwatch0)|DBG|wakeup due to [POLLIN] on fd 7 (FIFO pipe:[237268213]) at lib/stopwatch.c:458 (0% CPU usage)
2022-02-17T18:20:05Z|17680|poll_loop(stopwatch0)|DBG|wakeup due to [POLLIN] on fd 7 (FIFO pipe:[237268213]) at lib/stopwatch.c:458 (0% CPU usage)
2022-02-17T18:20:05Z|17681|poll_loop(stopwatch0)|DBG|wakeup due to [POLLIN] on fd 7 (FIFO pipe:[237268213]) at lib/stopwatch.c:458 (0% CPU usage)
2022-02-17T18:20:05Z|17682|poll_loop(stopwatch0)|DBG|wakeup due to [POLLIN] on fd 7 (FIFO pipe:[237268213]) at lib/stopwatch.c:458 (0% CPU usage)
2022-02-17T18:20:05Z|17683|poll_loop(stopwatch0)|DBG|wakeup due to [POLLIN] on fd 7 (FIFO pipe:[237268213]) at lib/stopwatch.c:458 (0% CPU usage)

< connectivity is interrupted by allowing the DHCP lease to expire, and the IP to be removed from the node's interface >
< northd goes to sleep and doesn't wake up for a long time >
< then allow DHCP to re-acquire the lease for the same IP and assign it back to the interface around 18:28:57, but northd doesn't wake up for an hour >
2022-02-17T19:28:20Z|20551|poll_loop|DBG|wakeup due to [POLLIN] on fd 6 (/var/run/ovn/ovn-northd.1.ctl<->) at lib/stream-fd.c:274 (0% CPU usage)

< attempt to wake northd up by running "ovn-appctl status", it resurrects itself >

2022-02-17T19:28:20Z|17684|poll_loop(stopwatch0)|DBG|wakeup due to [POLLIN] on fd 7 (FIFO pipe:[237268213]) at lib/stopwatch.c:458 (0% CPU usage)
2022-02-17T19:28:20Z|20552|hmap|DBG|Dropped 9 log messages in last 4095 seconds (most recently, 4095 seconds ago) due to excessive rate
2022-02-17T19:28:20Z|20553|hmap|DBG|northd/northd.c:1571: 1 bucket with 6+ nodes, including 1 bucket with 6 nodes (32 nodes total across 32 buckets)
2022-02-17T19:28:20Z|20554|hmap|DBG|northd/northd.c:1571: 2 buckets with 6+ nodes, including 1 bucket with 7 nodes (64 nodes total across 64 buckets)
2022-02-17T19:28:20Z|20555|hmap|DBG|lib/ovn-util.c:613: 1 bucket with 6+ nodes, including 1 bucket with 6 nodes (32 nodes total across 32 buckets)
2022-02-17T19:28:20Z|17685|poll_loop(stopwatch0)|DBG|wakeup due to [POLLIN] on fd 7 (FIFO pipe:[237268213]) at lib/stopwatch.c:458 (0% CPU usage)
2022-02-17T19:28:20Z|17686|poll_loop(stopwatch0)|DBG|wakeup due to [POLLIN] on fd 7 (FIFO pipe:[237268213]) at lib/stopwatch.c:458 (0% CPU usage)
2022-02-17T19:28:20Z|17687|poll_loop(stopwatch0)|DBG|wakeup due to [POLLIN] on fd 7 (FIFO pipe:[237268213]) at lib/stopwatch.c:458 (0% CPU usage)
2022-02-17T19:28:20Z|17688|poll_loop(stopwatch0)|DBG|wakeup due to [POLLIN] on fd 7 (FIFO pipe:[237268213]) at lib/stopwatch.c:458 (0% CPU usage)
2022-02-17T19:28:20Z|20556|hmap|DBG|lib/hmapx.c:41: 1 bucket with 6+ nodes, including 1 bucket with 6 nodes (64 nodes total across 64 buckets)
2022-02-17T19:28:20Z|20557|hmap|DBG|lib/hmapx.c:41: 1 bucket with 6+ nodes, including 1 bucket with 7 nodes (128 nodes total across 128 buckets)
2022-02-17T19:28:20Z|20558|hmap|DBG|lib/hmapx.c:41: 3 buckets with 6+ nodes, including 1 bucket with 8 nodes (256 nodes total across 256 buckets)
2022-02-17T19:28:20Z|20559|hmap|DBG|lib/hmapx.c:41: 4 buckets with 6+ nodes, including 3 buckets with 7 nodes (512 nodes total across 512 buckets)
2022-02-17T19:28:20Z|20560|hmap|DBG|lib/hmapx.c:41: 10 buckets with 6+ nodes, including 3 buckets with 7 nodes (1024 nodes total across 1024 buckets)
2022-02-17T19:28:20Z|17689|poll_loop(stopwatch0)|DBG|wakeup due to [POLLIN] on fd 7 (FIFO pipe:[237268213]) at lib/stopwatch.c:458 (0% CPU usage)
2022-02-17T19:28:20Z|17690|poll_loop(stopwatch0)|DBG|wakeup due to [POLLIN] on fd 7 (FIFO pipe:[237268213]) at lib/stopwatch.c:458 (0% CPU usage)
2022-02-17T19:28:20Z|20561|jsonrpc|DBG|unix#1: received request, method="status", params=[], id=0
2022-02-17T19:28:20Z|20562|unixctl|DBG|received request status[], id=0
2022-02-17T19:28:20Z|20563|unixctl|DBG|replying with success, id=0: "Status: active "
2022-02-17T19:28:20Z|20564|jsonrpc|DBG|unix#1: send reply, result="Status: active\n", id=0
2022-02-17T19:28:20Z|20565|poll_loop|DBG|wakeup due to 0-ms timeout at lib/reconnect.c:664 (0% CPU usage)
2022-02-17T19:28:20Z|17691|poll_loop(stopwatch0)|DBG|wakeup due to [POLLIN] on fd 7 (FIFO pipe:[237268213]) at lib/stopwatch.c:458 (0% CPU usage)
2022-02-17T19:28:20Z|20566|reconnect|DBG|ssl:192.168.58.108:9641: idle 4094953 ms, sending inactivity probe
2022-02-17T19:28:20Z|20567|reconnect|DBG|ssl:192.168.58.108:9641: entering IDLE
2022-02-17T19:28:20Z|20568|jsonrpc|DBG|ssl:192.168.58.108:9641: send request, method="echo", params=[], id="echo"
2022-02-17T19:28:20Z|20569|stream_ssl|DBG|client5-->ssl:192.168.58.108:9641 type 256 (5 bytes)
2022-02-17T19:28:20Z|20570|stream_ssl|DBG|client5-->ssl:192.168.58.108:9641 type 257 (1 bytes)
2022-02-17T19:28:20Z|20571|stream_ssl|WARN|SSL_read: system error (Connection reset by peer)
2022-02-17T19:28:20Z|20572|jsonrpc|WARN|ssl:192.168.58.108:9641: receive error: Connection reset by peer
2022-02-17T19:28:20Z|20573|stream_ssl|DBG|client5-->ssl:192.168.58.108:9641 type 256 (5 bytes)
2022-02-17T19:28:20Z|20574|stream_ssl|DBG|client5-->ssl:192.168.58.108:9641 type 257 (1 bytes)
2022-02-17T19:28:20Z|20575|reconnect|DBG|ssl:192.168.58.108:9642: idle 4094944 ms, sending inactivity probe
2022-02-17T19:28:20Z|20576|reconnect|DBG|ssl:192.168.58.108:9642: entering IDLE
2022-02-17T19:28:20Z|20577|jsonrpc|DBG|ssl:192.168.58.108:9642: send request, method="echo", params=[], id="echo"
2022-02-17T19:28:20Z|20578|stream_ssl|DBG|client6-->ssl:192.168.58.108:9642 type 256 (5 bytes)
2022-02-17T19:28:20Z|20579|stream_ssl|DBG|client6-->ssl:192.168.58.108:9642 type 257 (1 bytes)
2022-02-17T19:28:20Z|20580|stream_ssl|WARN|SSL_read: system error (Connection reset by peer)
2022-02-17T19:28:20Z|20581|jsonrpc|WARN|ssl:192.168.58.108:9642: receive error: Connection reset by peer
2022-02-17T19:28:20Z|20582|stream_ssl|DBG|client6-->ssl:192.168.58.108:9642 type 256 (5 bytes)
2022-02-17T19:28:20Z|20583|stream_ssl|DBG|client6-->ssl:192.168.58.108:9642 type 257 (1 bytes)
2022-02-17T19:28:20Z|17692|poll_loop(stopwatch0)|DBG|wakeup due to [POLLIN] on fd 7 (FIFO pipe:[237268213]) at lib/stopwatch.c:458 (0% CPU usage)
2022-02-17T19:28:20Z|20584|hmap|DBG|northd/northd.c:1571: 1 bucket with 6+ nodes, including 1 bucket with 6 nodes (32 nodes total across 32 buckets)
2022-02-17T19:28:20Z|20585|hmap|DBG|northd/northd.c:1571: 2 buckets with 6+ nodes, including 1 bucket with 7 nodes (64 nodes total across 64 buckets)
2022-02-17T19:28:20Z|17693|poll_loop(stopwatch0)|DBG|wakeup due to [POLLIN] on fd 7 (FIFO pipe:[237268213]) at lib/stopwatch.c:458 (0% CPU usage)
2022-02-17T19:28:20Z|17694|poll_loop(stopwatch0)|DBG|wakeup due to [POLLIN] on fd 7 (FIFO pipe:[237268213]) at lib/stopwatch.c:458 (0% CPU usage)
2022-02-17T19:28:20Z|17695|poll_loop(stopwatch0)|DBG|wakeup due to [POLLIN] on fd 7 (FIFO pipe:[237268213]) at lib/stopwatch.c:458 (0% CPU usage)
2022-02-17T19:28:20Z|17696|poll_loop(stopwatch0)|DBG|wakeup due to [POLLIN] on fd 7 (FIFO pipe:[237268213]) at lib/stopwatch.c:458 (0% CPU usage)
2022-02-17T19:28:20Z|17697|poll_loop(stopwatch0)|DBG|wakeup due to [POLLIN] on fd 7 (FIFO pipe:[237268213]) at lib/stopwatch.c:458 (0% CPU usage)
2022-02-17T19:28:20Z|17698|poll_loop(stopwatch0)|DBG|wakeup due to [POLLIN] on fd 7 (FIFO pipe:[237268213]) at lib/stopwatch.c:458 (0% CPU usage)
2022-02-17T19:28:20Z|17699|poll_loop(stopwatch0)|DBG|wakeup due to [POLLIN] on fd 7 (FIFO pipe:[237268213]) at lib/stopwatch.c:458 (0% CPU usage)
2022-02-17T19:28:20Z|17700|poll_loop(stopwatch0)|DBG|wakeup due to [POLLIN] on fd 7 (FIFO pipe:[237268213]) at lib/stopwatch.c:458 (0% CPU usage)
2022-02-17T19:28:20Z|20586|poll_loop|DBG|wakeup due to 0-ms timeout at ssl:192.168.58.108:9641 (0% CPU usage)
2022-02-17T19:28:20Z|17701|poll_loop(stopwatch0)|DBG|wakeup due to [POLLIN] on fd 7 (FIFO pipe:[237268213]) at lib/stopwatch.c:458 (0% CPU usage)
2022-02-17T19:28:20Z|17702|poll_loop(stopwatch0)|DBG|wakeup due to [POLLIN] on fd 7 (FIFO pipe:[237268213]) at lib/stopwatch.c:458 (0% CPU usage)
2022-02-17T19:28:20Z|20587|reconnect|WARN|ssl:192.168.58.108:9641: connection dropped (Connection reset by peer)
2022-02-17T19:28:20Z|20588|reconnect|DBG|ssl:192.168.58.108:9641: entering BACKOFF
2022-02-17T19:28:20Z|20589|ovsdb_cs|DBG|ssl:192.168.58.108:9641: MONITORING -> SERVER_SCHEMA_REQUESTED at lib/ovsdb-cs.c:419
2022-02-17T19:28:20Z|20590|reconnect|WARN|ssl:192.168.58.108:9642: connection dropped (Connection reset by peer)
2022-02-17T19:28:20Z|20591|reconnect|DBG|ssl:192.168.58.108:9642: entering BACKOFF
2022-02-17T19:28:20Z|20592|ovsdb_cs|DBG|ssl:192.168.58.108:9642: MONITORING -> SERVER_SCHEMA_REQUESTED at lib/ovsdb-cs.c:419
2022-02-17T19:28:20Z|20593|ovn_northd|INFO|ovn-northd lock lost. This ovn-northd instance is now on standby.
2022-02-17T19:28:20Z|17703|poll_loop(stopwatch0)|DBG|wakeup due to [POLLIN] on fd 7 (FIFO pipe:[237268213]) at lib/stopwatch.c:458 (0% CPU usage)
2022-02-17T19:28:21Z|20594|poll_loop|DBG|wakeup due to 1000-ms timeout at lib/reconnect.c:664 (0% CPU usage)
2022-02-17T19:28:21Z|20595|reconnect|INFO|ssl:192.168.58.108:9641: connecting...
2022-02-17T19:28:21Z|20596|reconnect|DBG|ssl:192.168.58.108:9641: entering CONNECTING
2022-02-17T19:28:21Z|17704|poll_loop(stopwatch0)|DBG|wakeup due to [POLLIN] on fd 7 (FIFO pipe:[237268213]) at lib/stopwatch.c:458 (0% CPU usage)
2022-02-17T19:28:21Z|20597|reconnect|INFO|ssl:192.168.58.108:9642: connecting...
2022-02-17T19:28:21Z|20598|reconnect|DBG|ssl:192.168.58.108:9642: entering CONNECTING
2022-02-17T19:28:21Z|20599|stream_ssl|DBG|client7-->ssl:192.168.58.108:9641 type 256 (5 bytes)
2022-02-17T19:28:21Z|20600|stream_ssl|DBG|client7-->ssl:192.168.58.108:9641 handshake: client_hello (340 bytes)
2022-02-17T19:28:21Z|20601|stream_ssl|DBG|client8-->ssl:192.168.58.108:9642 type 256 (5 bytes)
2022-02-17T19:28:21Z|20602|stream_ssl|DBG|client8-->ssl:192.168.58.108:9642 handshake: client_hello (340 bytes)
2022-02-17T19:28:21Z|17705|poll_loop(stopwatch0)|DBG|wakeup due to [POLLIN] on fd 7 (FIFO pipe:[237268213]) at lib/stopwatch.c:458 (0% CPU usage)
2022-02-17T19:28:21Z|20603|poll_loop|DBG|wakeup due to [POLLIN] on fd 13 (192.168.58.108:60464<->192.168.58.108:9641) at lib/stream-ssl.c:821 (0% CPU usage)
2022-02-17T19:28:21Z|17706|poll_loop(stopwatch0)|DBG|wakeup due to [POLLIN] on fd 7 (FIFO pipe:[237268213]) at lib/stopwatch.c:458 (0% CPU usage)
2022-02-17T19:28:21Z|20604|stream_ssl|DBG|client7<--ssl:192.168.58.108:9641 type 256 (5 bytes)
2022-02-17T19:28:21Z|20605|stream_ssl|DBG|client7<--ssl:192.168.58.108:9641 handshake: server_hello (122 bytes)
2022-02-17T19:28:21Z|20606|stream_ssl|DBG|client7<--ssl:192.168.58.108:9641 type 256 (5 bytes)
2022-02-17T19:28:21Z|20607|stream_ssl|DBG|client7<--ssl:192.168.58.108:9641 type 256 (5 bytes)
2022-02-17T19:28:21Z|20608|stream_ssl|DBG|client7<--ssl:192.168.58.108:9641 type 257 (1 bytes)
2022-02-17T19:28:21Z|20609|stream_ssl|DBG|client7<--ssl:192.168.58.108:9641 handshake: <unknown> (6 bytes)
2022-02-17T19:28:21Z|20610|stream_ssl|DBG|client7<--ssl:192.168.58.108:9641 type 256 (5 bytes)
2022-02-17T19:28:21Z|20611|stream_ssl|DBG|client7<--ssl:192.168.58.108:9641 type 257 (1 bytes)
2022-02-17T19:28:21Z|20612|stream_ssl|DBG|client7<--ssl:192.168.58.108:9641 handshake: certificate_request (112 bytes)
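The checks used above can be condensed into a short diagnostic sequence. This is only a sketch: it assumes the /var/run/ovn/ovn-northd.1.ctl control socket seen in these logs and a shell with access to the NB database (for example the northd container of an ovnkube-master pod).

# Confirm the configured probe interval (5000 ms in this case, i.e. not disabled).
ovn-nbctl get NB_Global . options:northd_probe_interval

# Any unixctl request wakes the blocked main loop; this is the "ovn-appctl status"
# poke noted in the log annotations above, after which northd resurrects itself.
ovn-appctl -t /var/run/ovn/ovn-northd.1.ctl status

# Optionally raise the log level to capture the reconnect activity that follows.
ovn-appctl -t /var/run/ovn/ovn-northd.1.ctl vlog/set dbg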
My 3-node cluster recovered with the same command:

kni@jumphost ~/rhacm $ oc get pods
NAME     READY   STATUS              RESTARTS   AGE
nginx2   0/1     ContainerCreating   0          4h57m
kni@jumphost ~/rhacm $
kni@jumphost ~/rhacm $ oc get pods -n openshift-ovn-kubernetes
NAME                   READY   STATUS    RESTARTS        AGE
ovnkube-master-dqhtm   6/6     Running   2 (22h ago)     22h
ovnkube-master-dwrn4   6/6     Running   1 (18h ago)     22h
ovnkube-master-pmc2j   6/6     Running   3 (3h27m ago)   22h
ovnkube-node-8lxdj     4/4     Running   0               23h
ovnkube-node-q8t4d     4/4     Running   0               23h
ovnkube-node-zt549     4/4     Running   0               23h
kni@jumphost ~/rhacm $
kni@jumphost ~/rhacm $
kni@jumphost ~/rhacm $ oc exec -it ovnkube-master-dqhtm -n openshift-ovn-kubernetes -- ovn-appctl -t /var/run/ovn/ovn-northd.1.ctl vlog/set dbg
Defaulted container "northd" out of: northd, nbdb, kube-rbac-proxy, sbdb, ovnkube-master, ovn-dbchecker
kni@jumphost ~/rhacm $ oc get pods
NAME     READY   STATUS    RESTARTS   AGE
nginx2   1/1     Running   0          4h58m
kni@jumphost ~/rhacm $ oc describe pod nginx2
Name:         nginx2
Namespace:    default
Priority:     0
Node:         openshift-master-0/192.168.58.21
Start Time:   Thu, 17 Feb 2022 11:50:00 -0500
......
              node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason                  Age                      From     Message
  ----     ------                  ----                     ----     -------
  Warning  FailedCreatePodSandBox  3m49s (x238 over 4h50m)  kubelet  (combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_nginx2_default_97b7c90b-238e-41da-b54b-5bedb1cfc5b7_0(b5b325344f7949c3e330896e96a529658b4695a3883c2ac6ec85b4cec82b41ac): error adding pod default_nginx2 to CNI network "multus-cni-network": [default/nginx2/97b7c90b-238e-41da-b54b-5bedb1cfc5b7:ovn-kubernetes]: error adding container to network "ovn-kubernetes": CNI request failed with status 400: '[default/nginx2 b5b325344f7949c3e330896e96a529658b4695a3883c2ac6ec85b4cec82b41ac] [default/nginx2 b5b325344f7949c3e330896e96a529658b4695a3883c2ac6ec85b4cec82b41ac] failed to configure pod interface: timed out waiting for OVS port binding (ovn-installed) for 0a:58:0a:80:00:3c [10.128.0.60/24] '
  Normal   AddedInterface          2m40s                    multus   Add eth0 [10.128.0.60/24] from ovn-kubernetes

I think I found the root cause and I posted the fix for review:
https://patchwork.ozlabs.org/project/openvswitch/patch/20220221141622.767208-1-i.maximets@ovn.org/

The problem is caused by the following OVS patch: 4241d652e465 ("jsonrpc: Avoid disconnecting prematurely due to long poll intervals."). It effectively breaks the inactivity probe functionality when the application doesn't have any other sources of wake-up events (like northd).
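One way to observe whether the inactivity-probe machinery is still doing its job is to watch for probe messages from the reconnect module. This is a sketch only, reusing the vlog/set command and the /var/log/ovn/ovn-northd.log path that appear elsewhere in this report:

# Raise only the reconnect module to debug level in the log file, to keep the noise down.
ovn-appctl -t /var/run/ovn/ovn-northd.1.ctl vlog/set reconnect:file:dbg

# A healthy connection logs "sending inactivity probe" roughly once per probe interval;
# with the regression described above, these messages stop as soon as there are no other
# wake-up events, even though the configured interval is unchanged.
grep 'sending inactivity probe' /var/log/ovn/ovn-northd.log | tail -n 5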
### Reproduced on:

[root@bz_2055865 ~]# rpm -qa |grep -E 'ovn|openvswitch'
ovn2.13-central-20.12.0-193.el8fdp.x86_64
openvswitch-selinux-extra-policy-1.0-28.el8fdp.noarch
openvswitch2.13-2.13.0-162.el8fdp.x86_64
ovn2.13-20.12.0-193.el8fdp.x86_64
ovn2.13-host-20.12.0-193.el8fdp.x86_64

### Created a cluster and started ovn-northd

[root@bz_2055865 ~]# /usr/share/ovn/scripts/ovn-ctl --db-nb-addr=172.16.1.1 --db-nb-create-insecure-remote=yes --db-sb-addr=172.16.1.1 --db-sb-create-insecure-remote=yes --db-sb-cluster-local-addr=172.16.1.1 --db-nb-cluster-local-addr=172.16.1.1 --ovn-northd-nb-db=tcp:172.16.1.1:6641 --ovn-northd-sb-db=tcp:172.16.1.1:6642 start_northd
Creating cluster database /etc/ovn/ovnnb_db.db [ OK ]
Starting ovsdb-nb [ OK ]
Waiting for OVN_Northbound to come up
2022-03-08T21:31:24Z|00001|reconnect|INFO|unix:/var/run/ovn/ovnnb_db.sock: connecting...
2022-03-08T21:31:24Z|00002|reconnect|INFO|unix:/var/run/ovn/ovnnb_db.sock: connected
[ OK ]
Creating cluster database /etc/ovn/ovnsb_db.db [ OK ]
Starting ovsdb-sb [ OK ]
Waiting for OVN_Southbound to come up
2022-03-08T21:31:24Z|00001|reconnect|INFO|unix:/var/run/ovn/ovnsb_db.sock: connecting...
2022-03-08T21:31:24Z|00002|reconnect|INFO|unix:/var/run/ovn/ovnsb_db.sock: connected
[ OK ]
Starting ovn-northd [ OK ]

### Check ovn-northd.log

[root@bz_2055865 ~]# cat /var/log/ovn/ovn-northd.log
2022-03-08T21:30:12.435Z|00003|vlog(monitor)|INFO|opened log file /var/log/ovn/ovn-northd.log
2022-03-08T21:30:12.435Z|00004|daemon_unix(monitor)|INFO|pid 67309 died, exit status 0, exiting
2022-03-08T21:31:24.448Z|00001|vlog|INFO|opened log file /var/log/ovn/ovn-northd.log
2022-03-08T21:31:24.449Z|00002|ovn_northd|INFO|OVN internal version is : [20.12.0-20.17.0-56.1]
2022-03-08T21:31:24.449Z|00003|reconnect|INFO|tcp:172.16.1.2:6641: connecting...
2022-03-08T21:31:24.449Z|00004|reconnect|INFO|tcp:172.16.1.2:6642: connecting...
2022-03-08T21:31:24.449Z|00005|reconnect|INFO|tcp:172.16.1.2:6641: connected
2022-03-08T21:31:24.449Z|00006|reconnect|INFO|tcp:172.16.1.2:6642: connected
2022-03-08T21:31:24.449Z|00007|ovn_northd|INFO|ovn-northd lock acquired. This ovn-northd instance is now active.

### Look up the process IDs of ovnsb_db and ovnnb_db. The objective is to block the communication between ovnsb_db and ovn-northd in such a way that ovn-northd does not know about this blockage.
### This way we can see whether the inactivity probes are disabled or enabled.

[root@bz_2055865 ~]# cat /var/run/ovn/ovnsb_db.pid
67571
[root@bz_2055865 ~]# cat /var/run/ovn/ovnnb_db.pid
67550
[root@bz_2055865 ~]# kill -19 67571
[root@bz_2055865 ~]# kill -19 67550
[root@bz_2055865 ~]# cat /var/log/ovn/ovn-northd.log
2022-03-08T21:30:12.435Z|00003|vlog(monitor)|INFO|opened log file /var/log/ovn/ovn-northd.log
2022-03-08T21:30:12.435Z|00004|daemon_unix(monitor)|INFO|pid 67309 died, exit status 0, exiting
2022-03-08T21:31:24.448Z|00001|vlog|INFO|opened log file /var/log/ovn/ovn-northd.log
2022-03-08T21:31:24.449Z|00002|ovn_northd|INFO|OVN internal version is : [20.12.0-20.17.0-56.1]
2022-03-08T21:31:24.449Z|00003|reconnect|INFO|tcp:172.16.1.2:6641: connecting...
2022-03-08T21:31:24.449Z|00004|reconnect|INFO|tcp:172.16.1.2:6642: connecting...
2022-03-08T21:31:24.449Z|00005|reconnect|INFO|tcp:172.16.1.2:6641: connected
2022-03-08T21:31:24.449Z|00006|reconnect|INFO|tcp:172.16.1.2:6642: connected
2022-03-08T21:31:24.449Z|00007|ovn_northd|INFO|ovn-northd lock acquired. This ovn-northd instance is now active.
2022-03-08T21:31:34.457Z|00008|memory|INFO|3800 kB peak resident set size after 10.0 seconds

<============ Inactivity probes are disabled. ovn-northd did not detect the disconnection, so it did not attempt to reconnect.
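The reproduction boils down to freezing the database servers with SIGSTOP so their TCP connections stay open but go silent, then watching whether ovn-northd ever notices. A condensed sketch of the same check, assuming the default pid and log file locations used above:

# Freeze both database servers without closing their sockets (kill -19 above is SIGSTOP).
kill -STOP "$(cat /var/run/ovn/ovnsb_db.pid)" "$(cat /var/run/ovn/ovnnb_db.pid)"

# Broken build (-193): the log stays silent and northd never reconnects.
# Fixed build (-195): "no response to inactivity probe after 5 seconds, disconnecting"
# shows up within roughly the probe interval.
tail -f /var/log/ovn/ovn-northd.log

# Unfreeze when done (kill -18 below is SIGCONT).
kill -CONT "$(cat /var/run/ovn/ovnsb_db.pid)" "$(cat /var/run/ovn/ovnnb_db.pid)"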
### Verified on:

[root@bz_2055865 ~]# rpm -qa |grep -E 'ovn|openvswitch'
openvswitch-selinux-extra-policy-1.0-28.el8fdp.noarch
ovn2.13-20.12.0-195.el8fdp.x86_64
ovn2.13-central-20.12.0-195.el8fdp.x86_64
ovn2.13-host-20.12.0-195.el8fdp.x86_64
openvswitch2.13-2.13.0-162.el8fdp.x86_64

### Cluster created

[root@bz_2055865 ~]# /usr/share/ovn/scripts/ovn-ctl --db-nb-addr=172.16.1.1 --db-nb-create-insecure-remote=yes --db-sb-addr=172.16.1.1 --db-sb-create-insecure-remote=yes --db-sb-cluster-local-addr=172.16.1.1 --db-nb-cluster-local-addr=172.16.1.1 --ovn-northd-nb-db=tcp:172.16.1.1:6641 --ovn-northd-sb-db=tcp:172.16.1.1:6642 start_northd
Creating cluster database /etc/ovn/ovnnb_db.db [ OK ]
Starting ovsdb-nb [ OK ]
Waiting for OVN_Northbound to come up
2022-03-08T21:35:22Z|00001|reconnect|INFO|unix:/var/run/ovn/ovnnb_db.sock: connecting...
2022-03-08T21:35:22Z|00002|reconnect|INFO|unix:/var/run/ovn/ovnnb_db.sock: connected
[ OK ]
Creating cluster database /etc/ovn/ovnsb_db.db [ OK ]
Starting ovsdb-sb [ OK ]
Waiting for OVN_Southbound to come up
2022-03-08T21:35:22Z|00001|reconnect|INFO|unix:/var/run/ovn/ovnsb_db.sock: connecting...
2022-03-08T21:35:22Z|00002|reconnect|INFO|unix:/var/run/ovn/ovnsb_db.sock: connected
[ OK ]
Starting ovn-northd [ OK ]

[root@bz_2055865 ~]# cat /var/log/ovn/ovn-northd.log
2022-03-08T21:34:47.965Z|00003|vlog(monitor)|INFO|opened log file /var/log/ovn/ovn-northd.log
2022-03-08T21:34:47.965Z|00004|daemon_unix(monitor)|INFO|pid 1793028 died, exit status 0, exiting
2022-03-08T21:35:22.139Z|00001|vlog|INFO|opened log file /var/log/ovn/ovn-northd.log
2022-03-08T21:35:22.140Z|00002|ovn_northd|INFO|OVN internal version is : [20.12.0-20.17.0-56.1]
2022-03-08T21:35:22.140Z|00003|reconnect|INFO|tcp:172.16.1.1:6641: connecting...
2022-03-08T21:35:22.140Z|00004|reconnect|INFO|tcp:172.16.1.1:6642: connecting...
2022-03-08T21:35:22.140Z|00005|reconnect|INFO|tcp:172.16.1.1:6641: connected
2022-03-08T21:35:22.140Z|00006|reconnect|INFO|tcp:172.16.1.1:6642: connected
2022-03-08T21:35:22.140Z|00007|ovn_northd|INFO|ovn-northd lock acquired. This ovn-northd instance is now active.
2022-03-08T21:35:32.145Z|00008|memory|INFO|3936 kB peak resident set size after 10.0 seconds

[root@bz_2055865 ~]# cat /var/run/ovn/ovnsb_db.pid
1793479
[root@bz_2055865 ~]# cat /var/run/ovn/ovnnb_db.pid
1793458
[root@bz_2055865 ~]# kill -19 1793479
[root@bz_2055865 ~]# kill -19 1793458
[root@bz_2055865 ~]# cat /var/log/ovn/ovn-northd.log
2022-03-08T21:34:47.965Z|00003|vlog(monitor)|INFO|opened log file /var/log/ovn/ovn-northd.log
2022-03-08T21:34:47.965Z|00004|daemon_unix(monitor)|INFO|pid 1793028 died, exit status 0, exiting
2022-03-08T21:35:22.139Z|00001|vlog|INFO|opened log file /var/log/ovn/ovn-northd.log
2022-03-08T21:35:22.140Z|00002|ovn_northd|INFO|OVN internal version is : [20.12.0-20.17.0-56.1]
2022-03-08T21:35:22.140Z|00003|reconnect|INFO|tcp:172.16.1.1:6641: connecting...
2022-03-08T21:35:22.140Z|00004|reconnect|INFO|tcp:172.16.1.1:6642: connecting...
2022-03-08T21:35:22.140Z|00005|reconnect|INFO|tcp:172.16.1.1:6641: connected
2022-03-08T21:35:22.140Z|00006|reconnect|INFO|tcp:172.16.1.1:6642: connected
2022-03-08T21:35:22.140Z|00007|ovn_northd|INFO|ovn-northd lock acquired. This ovn-northd instance is now active.
2022-03-08T21:35:32.145Z|00008|memory|INFO|3936 kB peak resident set size after 10.0 seconds
2022-03-08T21:36:22.160Z|00009|reconnect|ERR|tcp:172.16.1.1:6642: no response to inactivity probe after 5 seconds, disconnecting
2022-03-08T21:36:22.160Z|00010|reconnect|INFO|tcp:172.16.1.1:6642: connection dropped
2022-03-08T21:36:22.160Z|00011|ovn_northd|INFO|ovn-northd lock lost. This ovn-northd instance is now on standby.
2022-03-08T21:36:23.161Z|00012|reconnect|INFO|tcp:172.16.1.1:6642: connecting...
2022-03-08T21:36:23.161Z|00013|reconnect|INFO|tcp:172.16.1.1:6642: connected

<============= Inactivity probes are enabled. ovn-northd is trying to reconnect to the DB servers.
### Resume the northbound DB and southbound DB processes

[root@bz_2055865 ~]# kill -18 1793479
[root@bz_2055865 ~]# kill -18 1793458
[root@bz_2055865 ~]# cat /var/log/ovn/ovn-northd.log
2022-03-08T21:34:47.965Z|00003|vlog(monitor)|INFO|opened log file /var/log/ovn/ovn-northd.log
2022-03-08T21:34:47.965Z|00004|daemon_unix(monitor)|INFO|pid 1793028 died, exit status 0, exiting
2022-03-08T21:35:22.139Z|00001|vlog|INFO|opened log file /var/log/ovn/ovn-northd.log
2022-03-08T21:35:22.140Z|00002|ovn_northd|INFO|OVN internal version is : [20.12.0-20.17.0-56.1]
2022-03-08T21:35:22.140Z|00003|reconnect|INFO|tcp:172.16.1.1:6641: connecting...
2022-03-08T21:35:22.140Z|00004|reconnect|INFO|tcp:172.16.1.1:6642: connecting...
2022-03-08T21:35:22.140Z|00005|reconnect|INFO|tcp:172.16.1.1:6641: connected
2022-03-08T21:35:22.140Z|00006|reconnect|INFO|tcp:172.16.1.1:6642: connected
2022-03-08T21:35:22.140Z|00007|ovn_northd|INFO|ovn-northd lock acquired. This ovn-northd instance is now active.
2022-03-08T21:35:32.145Z|00008|memory|INFO|3936 kB peak resident set size after 10.0 seconds
2022-03-08T21:36:22.160Z|00009|reconnect|ERR|tcp:172.16.1.1:6642: no response to inactivity probe after 5 seconds, disconnecting
2022-03-08T21:36:22.160Z|00010|reconnect|INFO|tcp:172.16.1.1:6642: connection dropped
2022-03-08T21:36:22.160Z|00011|ovn_northd|INFO|ovn-northd lock lost. This ovn-northd instance is now on standby.
2022-03-08T21:36:23.161Z|00012|reconnect|INFO|tcp:172.16.1.1:6642: connecting...
2022-03-08T21:36:23.161Z|00013|reconnect|INFO|tcp:172.16.1.1:6642: connected
2022-03-08T21:36:27.164Z|00014|reconnect|ERR|tcp:172.16.1.1:6641: no response to inactivity probe after 5 seconds, disconnecting
2022-03-08T21:36:27.164Z|00015|reconnect|INFO|tcp:172.16.1.1:6641: connection dropped
2022-03-08T21:36:28.165Z|00016|reconnect|INFO|tcp:172.16.1.1:6641: connecting...
2022-03-08T21:36:28.165Z|00017|reconnect|INFO|tcp:172.16.1.1:6641: connected
...........
2022-03-08T21:38:16.247Z|00066|reconnect|ERR|tcp:172.16.1.1:6641: no response to inactivity probe after 5 seconds, disconnecting
2022-03-08T21:38:16.247Z|00067|reconnect|INFO|tcp:172.16.1.1:6641: connection dropped
2022-03-08T21:38:19.245Z|00068|reconnect|INFO|tcp:172.16.1.1:6642: connected
2022-03-08T21:38:24.251Z|00069|reconnect|INFO|tcp:172.16.1.1:6641: connected
2022-03-08T21:38:24.626Z|00070|ovn_northd|INFO|ovn-northd lock acquired. This ovn-northd instance is now active.

<======== ovn-northd reconnected again with the northbound and southbound DB servers.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (ovn2.13 bug fix and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2022:1143
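For anyone checking an existing deployment against this erratum, a quick sketch (using the FDP package names from the rpm output above) to confirm the installed build is at or past the Fixed In Version, ovn2.13-20.12.0-195:

# The fix ships in ovn2.13-20.12.0-195 and later; compare the installed NVRs against it.
rpm -q ovn2.13 ovn2.13-central ovn2.13-host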