Bug 1944160 - CNO: nbctl daemon should log reconnection info
Summary: CNO: nbctl daemon should log reconnection info
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.7
Hardware: Unspecified
OS: Unspecified
high
high
Target Milestone: ---
: 4.8.0
Assignee: Dan Williams
QA Contact: Ross Brattain
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2021-03-29 12:40 UTC by Ilya Maximets
Modified: 2021-07-27 22:56 UTC (History)
3 users (show)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-07-27 22:56:24 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)


Links
System ID Private Priority Status Summary Last Updated
Github openshift cluster-network-operator pull 1040 0 None open Bug 1944160: ovn: log more info about nbctl daemon database reconnections 2021-03-29 13:48:23 UTC
Red Hat Product Errata RHSA-2021:2438 0 None None None 2021-07-27 22:56:50 UTC

Internal Links: 1943800

Description Ilya Maximets 2021-03-29 12:40:35 UTC
Unlike other daemons, ovn-nbctl by default sets "reconnect:warn" logging level.
This means that no usual reconnection events logged.  Missing logs makes it
hard to debug reconnection related issues, i.e. raft leader changes and
transaction retries.

To fix that, ovn-nbctl daemon should be started with '-vreconnect:file:info'.

Comment 1 Dan Williams 2021-03-29 13:49:16 UTC
(In reply to Ilya Maximets from comment #0)
> Unlike other daemons, ovn-nbctl by default sets "reconnect:warn" logging
> level.
> This means that no usual reconnection events logged.  Missing logs makes it
> hard to debug reconnection related issues, i.e. raft leader changes and
> transaction retries.
> 
> To fix that, ovn-nbctl daemon should be started with '-vreconnect:file:info'.

PR posted and linked above.

Comment 2 Dan Williams 2021-03-29 17:34:17 UTC
Ilya, is this what you'd expect to see:

2021-03-29T14:12:31.752Z|00017|reconnect|INFO|ssl:10.0.0.5:9641: waiting 4 seconds before reconnect
2021-03-29T14:12:35.756Z|00018|reconnect|INFO|ssl:10.0.0.7:9641: connecting...
2021-03-29T14:12:35.762Z|00019|reconnect|INFO|ssl:10.0.0.7:9641: connected
2021-03-29T14:12:35.768Z|00020|ovsdb_idl|INFO|ssl:10.0.0.7:9641: clustered database server is disconnected from cluster; trying another server
2021-03-29T14:12:35.768Z|00021|reconnect|INFO|ssl:10.0.0.7:9641: connection attempt timed out
2021-03-29T14:12:35.768Z|00022|reconnect|INFO|ssl:10.0.0.7:9641: continuing to reconnect in the background but suppressing further logging
2021-03-29T14:12:43.781Z|00023|reconnect|INFO|ssl:10.0.0.6:9641: connected
2021-03-29T14:12:58.501Z|00001|vlog|INFO|opened log file /run/ovn/ovn-nbctl.log
2021-03-29T14:12:58.502Z|00002|reconnect|INFO|ssl:10.0.0.6:9641: connecting...
2021-03-29T14:12:58.506Z|00003|reconnect|INFO|ssl:10.0.0.6:9641: connected

Comment 3 Ilya Maximets 2021-03-29 17:47:02 UTC
(In reply to Dan Williams from comment #2)
> Ilya, is this what you'd expect to see:
> 
> 2021-03-29T14:12:31.752Z|00017|reconnect|INFO|ssl:10.0.0.5:9641: waiting 4
> seconds before reconnect
> 2021-03-29T14:12:35.756Z|00018|reconnect|INFO|ssl:10.0.0.7:9641:
> connecting...
> 2021-03-29T14:12:35.762Z|00019|reconnect|INFO|ssl:10.0.0.7:9641: connected
> 2021-03-29T14:12:35.768Z|00020|ovsdb_idl|INFO|ssl:10.0.0.7:9641: clustered
> database server is disconnected from cluster; trying another server
> 2021-03-29T14:12:35.768Z|00021|reconnect|INFO|ssl:10.0.0.7:9641: connection
> attempt timed out
> 2021-03-29T14:12:35.768Z|00022|reconnect|INFO|ssl:10.0.0.7:9641: continuing
> to reconnect in the background but suppressing further logging
> 2021-03-29T14:12:43.781Z|00023|reconnect|INFO|ssl:10.0.0.6:9641: connected
> 2021-03-29T14:12:58.501Z|00001|vlog|INFO|opened log file
> /run/ovn/ovn-nbctl.log
> 2021-03-29T14:12:58.502Z|00002|reconnect|INFO|ssl:10.0.0.6:9641:
> connecting...
> 2021-03-29T14:12:58.506Z|00003|reconnect|INFO|ssl:10.0.0.6:9641: connected

Yes.  It logs actual 'connected' states and 'backoff' intervals that should
be enough to compare with logs from other services.

Comment 5 Ross Brattain 2021-04-13 02:14:21 UTC
Verified on 4.8.0-0.nightly-2021-04-09-222447


8:2021-04-13T02:04:32.574979811Z ++ ovn-nbctl --pidfile=/var/run/ovn/ovn-nbctl.pid --detach -p /ovn-cert/tls.key -c /ovn-cert/tls.crt -C /ovn-ca/ca-bundle.crt --db ssl:172.31.248.47:9641,ssl:172.31.248.66:9641,ssl:172.31.248.67:9641 --log-file=/run/ovn/ovn-nbctl.log -vreconnect:file:info

29:2021-04-13T02:04:18.227946947Z 2021-04-13T02:04:18Z|00024|reconnect|INFO|ssl:172.31.248.47:9642: connection attempt failed (Connection refused)
30:2021-04-13T02:04:18.228017656Z 2021-04-13T02:04:18Z|00025|reconnect|INFO|ssl:172.31.248.47:9642: waiting 4 seconds before reconnect
31:2021-04-13T02:04:22.231229002Z 2021-04-13T02:04:22Z|00026|reconnect|INFO|ssl:172.31.248.66:9641: connecting...
32:2021-04-13T02:04:22.231324985Z 2021-04-13T02:04:22Z|00027|reconnect|INFO|ssl:172.31.248.66:9642: connecting...
33:2021-04-13T02:04:22.231803237Z 2021-04-13T02:04:22Z|00028|reconnect|INFO|ssl:172.31.248.66:9642: connection attempt failed (Connection refused)
34:2021-04-13T02:04:22.231803237Z 2021-04-13T02:04:22Z|00029|reconnect|INFO|ssl:172.31.248.66:9642: continuing to reconnect in the background but suppressing further logging
35:2021-04-13T02:04:22.235475930Z 2021-04-13T02:04:22Z|00030|reconnect|INFO|ssl:172.31.248.66:9641: connected
36:2021-04-13T02:04:22.237612078Z 2021-04-13T02:04:22Z|00031|ovsdb_idl|INFO|ssl:172.31.248.66:9641: clustered database server is disconnected from cluster; trying another server
37:2021-04-13T02:04:22.237652441Z 2021-04-13T02:04:22Z|00032|reconnect|INFO|ssl:172.31.248.66:9641: connection attempt timed out
38:2021-04-13T02:04:22.237676690Z 2021-04-13T02:04:22Z|00033|reconnect|INFO|ssl:172.31.248.66:9641: continuing to reconnect in the background but suppressing further logging
39:2021-04-13T02:04:30.242999407Z 2021-04-13T02:04:30Z|00034|reconnect|INFO|ssl:172.31.248.67:9641: connected
40:2021-04-13T02:04:30.244964985Z 2021-04-13T02:04:30Z|00035|ovsdb_idl|INFO|ssl:172.31.248.67:9641: clustered database server is disconnected from cluster; trying another server
41:2021-04-13T02:04:38.247968966Z 2021-04-13T02:04:38Z|00036|reconnect|INFO|ssl:172.31.248.47:9642: connected
42:2021-04-13T02:04:38.249320774Z 2021-04-13T02:04:38Z|00037|reconnect|INFO|ssl:172.31.248.47:9641: connected
44:2021-04-13T02:04:38.250341564Z 2021-04-13T02:04:38Z|00039|ovsdb_idl|INFO|ssl:172.31.248.47:9642: clustered database server is disconnected from cluster; trying another server
46:2021-04-13T02:04:46.261486129Z 2021-04-13T02:04:46Z|00041|reconnect|INFO|ssl:172.31.248.66:9641: connected
47:2021-04-13T02:04:46.263476362Z 2021-04-13T02:04:46Z|00042|reconnect|INFO|ssl:172.31.248.66:9642: connected
51:2021-04-13T02:04:46.265192771Z 2021-04-13T02:04:46Z|00046|ovsdb_idl|INFO|ssl:172.31.248.66:9642: clustered database server is disconnected from cluster; trying another server
52:2021-04-13T02:04:54.274893865Z 2021-04-13T02:04:54Z|00047|reconnect|INFO|ssl:172.31.248.67:9641: connected
53:2021-04-13T02:04:54.276203252Z 2021-04-13T02:04:54Z|00048|reconnect|INFO|ssl:172.31.248.67:9642: connected

Comment 8 errata-xmlrpc 2021-07-27 22:56:24 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438


Note You need to log in before you can comment on or make changes to this bug.