Bug 1944160

Summary: CNO: nbctl daemon should log reconnection info
Product: OpenShift Container Platform Reporter: Ilya Maximets <i.maximets>
Component: NetworkingAssignee: Dan Williams <dcbw>
Networking sub component: ovn-kubernetes QA Contact: Ross Brattain <rbrattai>
Status: CLOSED ERRATA Docs Contact:
Severity: high    
Priority: high CC: dcbw, rbrattai, trozet
Version: 4.7   
Target Milestone: ---   
Target Release: 4.8.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2021-07-27 22:56:24 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Ilya Maximets 2021-03-29 12:40:35 UTC
Unlike other daemons, ovn-nbctl by default sets "reconnect:warn" logging level.
This means that no usual reconnection events logged.  Missing logs makes it
hard to debug reconnection related issues, i.e. raft leader changes and
transaction retries.

To fix that, ovn-nbctl daemon should be started with '-vreconnect:file:info'.

Comment 1 Dan Williams 2021-03-29 13:49:16 UTC
(In reply to Ilya Maximets from comment #0)
> Unlike other daemons, ovn-nbctl by default sets "reconnect:warn" logging
> level.
> This means that no usual reconnection events logged.  Missing logs makes it
> hard to debug reconnection related issues, i.e. raft leader changes and
> transaction retries.
> 
> To fix that, ovn-nbctl daemon should be started with '-vreconnect:file:info'.

PR posted and linked above.

Comment 2 Dan Williams 2021-03-29 17:34:17 UTC
Ilya, is this what you'd expect to see:

2021-03-29T14:12:31.752Z|00017|reconnect|INFO|ssl:10.0.0.5:9641: waiting 4 seconds before reconnect
2021-03-29T14:12:35.756Z|00018|reconnect|INFO|ssl:10.0.0.7:9641: connecting...
2021-03-29T14:12:35.762Z|00019|reconnect|INFO|ssl:10.0.0.7:9641: connected
2021-03-29T14:12:35.768Z|00020|ovsdb_idl|INFO|ssl:10.0.0.7:9641: clustered database server is disconnected from cluster; trying another server
2021-03-29T14:12:35.768Z|00021|reconnect|INFO|ssl:10.0.0.7:9641: connection attempt timed out
2021-03-29T14:12:35.768Z|00022|reconnect|INFO|ssl:10.0.0.7:9641: continuing to reconnect in the background but suppressing further logging
2021-03-29T14:12:43.781Z|00023|reconnect|INFO|ssl:10.0.0.6:9641: connected
2021-03-29T14:12:58.501Z|00001|vlog|INFO|opened log file /run/ovn/ovn-nbctl.log
2021-03-29T14:12:58.502Z|00002|reconnect|INFO|ssl:10.0.0.6:9641: connecting...
2021-03-29T14:12:58.506Z|00003|reconnect|INFO|ssl:10.0.0.6:9641: connected

Comment 3 Ilya Maximets 2021-03-29 17:47:02 UTC
(In reply to Dan Williams from comment #2)
> Ilya, is this what you'd expect to see:
> 
> 2021-03-29T14:12:31.752Z|00017|reconnect|INFO|ssl:10.0.0.5:9641: waiting 4
> seconds before reconnect
> 2021-03-29T14:12:35.756Z|00018|reconnect|INFO|ssl:10.0.0.7:9641:
> connecting...
> 2021-03-29T14:12:35.762Z|00019|reconnect|INFO|ssl:10.0.0.7:9641: connected
> 2021-03-29T14:12:35.768Z|00020|ovsdb_idl|INFO|ssl:10.0.0.7:9641: clustered
> database server is disconnected from cluster; trying another server
> 2021-03-29T14:12:35.768Z|00021|reconnect|INFO|ssl:10.0.0.7:9641: connection
> attempt timed out
> 2021-03-29T14:12:35.768Z|00022|reconnect|INFO|ssl:10.0.0.7:9641: continuing
> to reconnect in the background but suppressing further logging
> 2021-03-29T14:12:43.781Z|00023|reconnect|INFO|ssl:10.0.0.6:9641: connected
> 2021-03-29T14:12:58.501Z|00001|vlog|INFO|opened log file
> /run/ovn/ovn-nbctl.log
> 2021-03-29T14:12:58.502Z|00002|reconnect|INFO|ssl:10.0.0.6:9641:
> connecting...
> 2021-03-29T14:12:58.506Z|00003|reconnect|INFO|ssl:10.0.0.6:9641: connected

Yes.  It logs actual 'connected' states and 'backoff' intervals that should
be enough to compare with logs from other services.

Comment 5 Ross Brattain 2021-04-13 02:14:21 UTC
Verified on 4.8.0-0.nightly-2021-04-09-222447


8:2021-04-13T02:04:32.574979811Z ++ ovn-nbctl --pidfile=/var/run/ovn/ovn-nbctl.pid --detach -p /ovn-cert/tls.key -c /ovn-cert/tls.crt -C /ovn-ca/ca-bundle.crt --db ssl:172.31.248.47:9641,ssl:172.31.248.66:9641,ssl:172.31.248.67:9641 --log-file=/run/ovn/ovn-nbctl.log -vreconnect:file:info

29:2021-04-13T02:04:18.227946947Z 2021-04-13T02:04:18Z|00024|reconnect|INFO|ssl:172.31.248.47:9642: connection attempt failed (Connection refused)
30:2021-04-13T02:04:18.228017656Z 2021-04-13T02:04:18Z|00025|reconnect|INFO|ssl:172.31.248.47:9642: waiting 4 seconds before reconnect
31:2021-04-13T02:04:22.231229002Z 2021-04-13T02:04:22Z|00026|reconnect|INFO|ssl:172.31.248.66:9641: connecting...
32:2021-04-13T02:04:22.231324985Z 2021-04-13T02:04:22Z|00027|reconnect|INFO|ssl:172.31.248.66:9642: connecting...
33:2021-04-13T02:04:22.231803237Z 2021-04-13T02:04:22Z|00028|reconnect|INFO|ssl:172.31.248.66:9642: connection attempt failed (Connection refused)
34:2021-04-13T02:04:22.231803237Z 2021-04-13T02:04:22Z|00029|reconnect|INFO|ssl:172.31.248.66:9642: continuing to reconnect in the background but suppressing further logging
35:2021-04-13T02:04:22.235475930Z 2021-04-13T02:04:22Z|00030|reconnect|INFO|ssl:172.31.248.66:9641: connected
36:2021-04-13T02:04:22.237612078Z 2021-04-13T02:04:22Z|00031|ovsdb_idl|INFO|ssl:172.31.248.66:9641: clustered database server is disconnected from cluster; trying another server
37:2021-04-13T02:04:22.237652441Z 2021-04-13T02:04:22Z|00032|reconnect|INFO|ssl:172.31.248.66:9641: connection attempt timed out
38:2021-04-13T02:04:22.237676690Z 2021-04-13T02:04:22Z|00033|reconnect|INFO|ssl:172.31.248.66:9641: continuing to reconnect in the background but suppressing further logging
39:2021-04-13T02:04:30.242999407Z 2021-04-13T02:04:30Z|00034|reconnect|INFO|ssl:172.31.248.67:9641: connected
40:2021-04-13T02:04:30.244964985Z 2021-04-13T02:04:30Z|00035|ovsdb_idl|INFO|ssl:172.31.248.67:9641: clustered database server is disconnected from cluster; trying another server
41:2021-04-13T02:04:38.247968966Z 2021-04-13T02:04:38Z|00036|reconnect|INFO|ssl:172.31.248.47:9642: connected
42:2021-04-13T02:04:38.249320774Z 2021-04-13T02:04:38Z|00037|reconnect|INFO|ssl:172.31.248.47:9641: connected
44:2021-04-13T02:04:38.250341564Z 2021-04-13T02:04:38Z|00039|ovsdb_idl|INFO|ssl:172.31.248.47:9642: clustered database server is disconnected from cluster; trying another server
46:2021-04-13T02:04:46.261486129Z 2021-04-13T02:04:46Z|00041|reconnect|INFO|ssl:172.31.248.66:9641: connected
47:2021-04-13T02:04:46.263476362Z 2021-04-13T02:04:46Z|00042|reconnect|INFO|ssl:172.31.248.66:9642: connected
51:2021-04-13T02:04:46.265192771Z 2021-04-13T02:04:46Z|00046|ovsdb_idl|INFO|ssl:172.31.248.66:9642: clustered database server is disconnected from cluster; trying another server
52:2021-04-13T02:04:54.274893865Z 2021-04-13T02:04:54Z|00047|reconnect|INFO|ssl:172.31.248.67:9641: connected
53:2021-04-13T02:04:54.276203252Z 2021-04-13T02:04:54Z|00048|reconnect|INFO|ssl:172.31.248.67:9642: connected

Comment 8 errata-xmlrpc 2021-07-27 22:56:24 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438