Bug 2058368 - Openshift OVN-K got restarted mutilple times with the error " ovsdb-server/memory-trim-on-compaction on'' failed: exit status 1 and " ovndbchecker.go:118] unable to turn on memory trimming for SB DB, stderr " , cluster unavailable
Summary: Openshift OVN-K got restarted mutilple times with the error " ovsdb-server/m...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.8
Hardware: x86_64
OS: Other
unspecified
high
Target Milestone: ---
: 4.11.0
Assignee: Mohamed Mahmoud
QA Contact: Ross Brattain
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2022-02-24 18:26 UTC by Immanuvel
Modified: 2022-08-10 10:51 UTC (History)
5 users (show)

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-08-10 10:51:15 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)


Links
System ID Private Priority Status Summary Last Updated
Github openshift cluster-network-operator pull 1320 0 None open Bug 2058368: Move memory-trimming-on-compaction out of dbchecker to nbdb/sbdb 2022-02-28 16:56:59 UTC
Github openshift cluster-network-operator pull 1365 0 None open Bug 2058368: Add wait to make sure db control file is created 2022-04-01 18:31:48 UTC
Red Hat Product Errata RHSA-2022:5069 0 None None None 2022-08-10 10:51:34 UTC

Description Immanuvel 2022-02-24 18:26:10 UTC
Description of problem:

After Upgrade from 4.6.44 to 4.8.28

 Openshift OVN-Kubernetes  ovnkube-master pods  getting  restarted multiple times with the error  " ovsdb-server/memory-trim-on-compaction on'' failed: exit status 1   and " ovndbchecker.go:118] unable to turn on memory       trimming for SB DB, stderr "  , cluster  unavailable

Particularly ovn sbdb  and also we have some errors on etcd which makes the ingress unstable

Version-Release number of selected component (if applicable):

Openshift  4.8.28  

AZURE IPI 

How reproducible:

Not Sure what has happened so not sure if it can be reproduced 

Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

After migrated to 4.8.28 this version customer started to observe this cluster instability  issues like etcd leader election , ingress  degraded and console down ,  authentication and OVN-kubernetes etc..

Comment 4 Riccardo Ravaioli 2022-03-07 14:51:31 UTC
Moving back the status to ON_QA, unless there's something that Mohamed's PR is not capturing. I suppose the status change happened by accident.

Comment 6 Ross Brattain 2022-03-17 17:17:53 UTC
There are no logs for lifecycle postStart hooks on success, so I can't see the postStart script run.

I don't see "|ovsdb_server|INFO|memory trimming after compaction enabled." in the [ns]bdb container logs, so I can't tell if compaction is enabled.

I haven't seen any compaction Database compaction took %"PRIu64"ms" messages yet.

There is no retry loop in the [[ -f $OVNNBDB_CTL ]], so this is probably failing silently.

Comment 7 Ross Brattain 2022-03-17 17:23:37 UTC
It would be nice to have a command to get the value of "ovsdb-server/memory-trim-on-compaction".  There shouldn't be a "set" without a corresponding "get" to verify.

Comment 8 Mohamed Mahmoud 2022-03-21 12:31:16 UTC
(In reply to Ross Brattain from comment #7)
> It would be nice to have a command to get the value of
> "ovsdb-server/memory-trim-on-compaction".  There shouldn't be a "set"
> without a corresponding "get" to verify.

if turning on memory trimming failed u will see errors in the events since this is done in PostStart hook the assumption here is ovnnbdb_ctl should have already been there hence no retry, did u check the nbdb and sbdb logs on the elected leader ?

Comment 9 Ross Brattain 2022-03-21 16:20:12 UTC
I checked all the logs.

I can't get a timestamp of when the postStart hook occurred so I can't compare when it occurred with the timestamps in the log.

Maybe there is a logging race?  Hence the comment about a "get" command, logs are not reliable for checking current state.

Comment 17 Ross Brattain 2022-04-01 19:14:06 UTC
testing notes:

editing CNO bindata in running container.

oc debug node/$(oc -n openshift-network-operator get pod -l name=network-operator -o jsonpath='{.items[0].spec.nodeName}')
chroot /host
# edit bindata inside running CNO container
vi /proc/$(pgrep -f cluster-network-operator)/root/bindata/network/ovn-kubernetes/self-hosted/ovnkube-master.yaml
exit

# delete Daemonset to have CNO re-render
oc delete -n openshift-ovn-kubernetes  daemonset.apps/ovnkube-master

# delete CNO to revert
oc delete -n openshift-network-operator pod -l name=network-operator

Comment 20 Ross Brattain 2022-04-04 18:27:42 UTC
Another issue, log spam is going to break things.


We need a set-once type of thing.

2022-04-04T17:01:44.917Z|00115|timeval|WARN|disk: 0 reads, 72 writes
2022-04-04T17:01:44.917Z|00116|timeval|WARN|context switches: 17 voluntary, 9 involuntary
2022-04-04T17:01:44.917Z|00117|coverage|INFO|Skipping details of duplicate event coverage for hash=ddd04a25
2022-04-04T17:01:46.378Z|00118|ovsdb_server|INFO|memory trimming after compaction enabled.
2022-04-04T17:01:56.344Z|00119|ovsdb_server|INFO|memory trimming after compaction enabled.
2022-04-04T17:02:06.345Z|00120|ovsdb_server|INFO|memory trimming after compaction enabled.
2022-04-04T17:02:16.349Z|00121|ovsdb_server|INFO|memory trimming after compaction enabled.
2022-04-04T17:02:26.331Z|00010|timeval(log_fsync3)|WARN|Unreasonably long 1139ms poll interval (0ms user, 0ms system)
2022-04-04T17:02:26.331Z|00011|timeval(log_fsync3)|WARN|context switches: 5 voluntary, 0 involuntary
2022-04-04T17:02:26.331Z|00012|coverage(log_fsync3)|INFO|Skipping details of duplicate event coverage for hash=ddd04a25
2022-04-04T17:02:26.368Z|00122|ovsdb_server|INFO|memory trimming after compaction enabled.
2022-04-04T17:02:36.345Z|00123|ovsdb_server|INFO|memory trimming after compaction enabled.
2022-04-04T17:02:46.370Z|00124|ovsdb_server|INFO|memory trimming after compaction enabled.
2022-04-04T17:02:56.339Z|00125|ovsdb_server|INFO|memory trimming after compaction enabled.
2022-04-04T17:03:02.730Z|00013|timeval(log_fsync3)|WARN|Unreasonably long 3904ms poll interval (0ms user, 0ms system)
2022-04-04T17:03:02.730Z|00014|timeval(log_fsync3)|WARN|context switches: 6 voluntary, 0 involuntary
2022-04-04T17:03:02.730Z|00015|coverage(log_fsync3)|INFO|Skipping details of duplicate event coverage for hash=ddd04a25
2022-04-04T17:03:06.346Z|00126|ovsdb_server|INFO|memory trimming after compaction enabled.
2022-04-04T17:03:16.345Z|00127|ovsdb_server|INFO|memory trimming after compaction enabled.
2022-04-04T17:03:26.346Z|00128|ovsdb_server|INFO|memory trimming after compaction enabled.
2022-04-04T17:03:36.338Z|00129|ovsdb_server|INFO|memory trimming after compaction enabled.
2022-04-04T17:03:46.359Z|00130|ovsdb_server|INFO|memory trimming after compaction enabled.
2022-04-04T17:03:56.350Z|00131|ovsdb_server|INFO|memory trimming after compaction enabled.
2022-04-04T17:04:06.339Z|00132|ovsdb_server|INFO|memory trimming after compaction enabled.
2022-04-04T17:04:16.343Z|00133|ovsdb_server|INFO|memory trimming after compaction enabled.
2022-04-04T17:04:26.346Z|00134|ovsdb_server|INFO|memory trimming after compaction enabled.
2022-04-04T17:04:36.343Z|00135|ovsdb_server|INFO|memory trimming after compaction enabled.
2022-04-04T17:04:46.344Z|00136|ovsdb_server|INFO|memory trimming after compaction enabled.
2022-04-04T17:04:56.342Z|00137|ovsdb_server|INFO|memory trimming after compaction enabled.
2022-04-04T17:05:06.341Z|00138|ovsdb_server|INFO|memory trimming after compaction enabled.
2022-04-04T17:05:16.350Z|00139|ovsdb_server|INFO|memory trimming after compaction enabled.
2022-04-04T17:05:26.341Z|00140|ovsdb_server|INFO|memory trimming after compaction enabled.
2022-04-04T17:05:36.350Z|00141|ovsdb_server|INFO|memory trimming after compaction enabled.
2022-04-04T17:05:46.346Z|00142|ovsdb_server|INFO|memory trimming after compaction enabled.
2022-04-04T17:05:56.344Z|00143|ovsdb_server|INFO|memory trimming after compaction enabled.
2022-04-04T17:06:06.345Z|00144|ovsdb_server|INFO|memory trimming after compaction enabled.
2022-04-04T17:06:16.348Z|00145|ovsdb_server|INFO|memory trimming after compaction enabled.
2022-04-04T17:06:26.338Z|00146|ovsdb_server|INFO|memory trimming after compaction enabled.
2022-04-04T17:06:36.340Z|00147|ovsdb_server|INFO|memory trimming after compaction enabled.
2022-04-04T17:06:46.340Z|00148|ovsdb_server|INFO|memory trimming after compaction enabled.
2022-04-04T17:06:56.343Z|00149|ovsdb_server|INFO|memory trimming after compaction enabled.
2022-04-04T17:07:06.393Z|00150|ovsdb_server|INFO|memory trimming after compaction enabled.
2022-04-04T17:07:16.358Z|00151|ovsdb_server|INFO|memory trimming after compaction enabled.
2022-04-04T17:07:26.343Z|00152|ovsdb_server|INFO|memory trimming after compaction enabled.
2022-04-04T17:07:36.344Z|00153|ovsdb_server|INFO|memory trimming after compaction enabled.
2022-04-04T17:07:46.343Z|00154|ovsdb_server|INFO|memory trimming after compaction enabled.
2022-04-04T17:07:56.343Z|00155|ovsdb_server|INFO|memory trimming after compaction enabled.
2022-04-04T17:08:06.358Z|00156|ovsdb_server|INFO|memory trimming after compaction enabled.
2022-04-04T17:08:16.352Z|00157|ovsdb_server|INFO|memory trimming after compaction enabled.
2022-04-04T17:08:26.368Z|00158|ovsdb_server|INFO|memory trimming after compaction enabled.
2022-04-04T17:08:36.399Z|00159|ovsdb_server|INFO|memory trimming after compaction enabled.
2022-04-04T17:08:46.349Z|00160|ovsdb_server|INFO|memory trimming after compaction enabled.
2022-04-04T17:08:56.352Z|00161|ovsdb_server|INFO|memory trimming after compaction enabled.
2022-04-04T17:09:06.363Z|00162|ovsdb_server|INFO|memory trimming after compaction enabled.
2022-04-04T17:09:16.351Z|00163|ovsdb_server|INFO|memory trimming after compaction enabled.
2022-04-04T17:09:26.344Z|00164|ovsdb_server|INFO|memory trimming after compaction enabled.
2022-04-04T17:09:36.346Z|00165|ovsdb_server|INFO|memory trimming after compaction enabled.
2022-04-04T17:09:46.351Z|00166|ovsdb_server|INFO|memory trimming after compaction enabled.
2022-04-04T17:09:56.344Z|00167|ovsdb_server|INFO|memory trimming after compaction enabled.
2022-04-04T17:10:06.341Z|00168|ovsdb_server|INFO|memory trimming after compaction enabled.
2022-04-04T17:10:16.340Z|00169|ovsdb_server|INFO|memory trimming after compaction enabled.
2022-04-04T17:10:26.354Z|00170|ovsdb_server|INFO|memory trimming after compaction enabled.
2022-04-04T17:10:36.344Z|00171|ovsdb_server|INFO|memory trimming after compaction enabled.
2022-04-04T17:10:46.344Z|00172|ovsdb_server|INFO|memory trimming after compaction enabled.
2022-04-04T17:10:56.357Z|00173|ovsdb_server|INFO|memory trimming after compaction enabled.
2022-04-04T17:11:06.349Z|00174|ovsdb_server|INFO|memory trimming after compaction enabled.
2022-04-04T17:11:16.349Z|00175|ovsdb_server|INFO|memory trimming after compaction enabled.
2022-04-04T17:11:26.353Z|00176|ovsdb_server|INFO|memory trimming after compaction enabled.
2022-04-04T17:11:36.353Z|00177|ovsdb_server|INFO|memory trimming after compaction enabled.
2022-04-04T17:11:46.348Z|00178|ovsdb_server|INFO|memory trimming after compaction enabled.
2022-04-04T17:11:56.353Z|00179|ovsdb_server|INFO|memory trimming after compaction enabled.
2022-04-04T17:12:06.355Z|00180|ovsdb_server|INFO|memory trimming after compaction enabled.
2022-04-04T17:12:16.351Z|00181|ovsdb_server|INFO|memory trimming after compaction enabled.
2022-04-04T17:12:26.349Z|00182|ovsdb_server|INFO|memory trimming after compaction enabled.
2022-04-04T17:12:36.343Z|00183|ovsdb_server|INFO|memory trimming after compaction enabled.
2022-04-04T17:12:46.357Z|00184|ovsdb_server|INFO|memory trimming after compaction enabled.
2022-04-04T17:12:56.342Z|00185|ovsdb_server|INFO|memory trimming after compaction enabled.
2022-04-04T17:13:06.341Z|00186|ovsdb_server|INFO|memory trimming after compaction enabled.
2022-04-04T17:13:16.348Z|00187|ovsdb_server|INFO|memory trimming after compaction enabled.
2022-04-04T17:13:26.341Z|00188|ovsdb_server|INFO|memory trimming after compaction enabled.
2022-04-04T17:13:36.352Z|00189|ovsdb_server|INFO|memory trimming after compaction enabled.
2022-04-04T17:13:46.346Z|00190|ovsdb_server|INFO|memory trimming after compaction enabled.
2022-04-04T17:13:56.354Z|00191|ovsdb_server|INFO|memory trimming after compaction enabled.
2022-04-04T17:14:06.350Z|00192|ovsdb_server|INFO|memory trimming after compaction enabled.
2022-04-04T17:14:16.365Z|00193|ovsdb_server|INFO|memory trimming after compaction enabled.
2022-04-04T17:14:26.347Z|00194|ovsdb_server|INFO|memory trimming after compaction enabled.
2022-04-04T17:14:36.350Z|00195|ovsdb_server|INFO|memory trimming after compaction enabled.
2022-04-04T17:14:46.425Z|00196|ovsdb_server|INFO|memory trimming after compaction enabled.
2022-04-04T17:14:56.342Z|00197|ovsdb_server|INFO|memory trimming after compaction enabled.
2022-04-04T17:15:06.343Z|00198|ovsdb_server|INFO|memory trimming after compaction enabled.
2022-04-04T17:15:16.344Z|00199|ovsdb_server|INFO|memory trimming after compaction enabled.
2022-04-04T17:15:26.341Z|00200|ovsdb_server|INFO|memory trimming after compaction enabled.
2022-04-04T17:15:36.350Z|00201|ovsdb_server|INFO|memory trimming after compaction enabled.
2022-04-04T17:15:46.348Z|00202|ovsdb_server|INFO|memory trimming after compaction enabled.
2022-04-04T17:15:56.345Z|00203|ovsdb_server|INFO|memory trimming after compaction enabled.
2022-04-04T17:16:06.350Z|00204|ovsdb_server|INFO|memory trimming after compaction enabled.
2022-04-04T17:16:16.353Z|00205|ovsdb_server|INFO|memory trimming after compaction enabled.
2022-04-04T17:16:26.353Z|00206|ovsdb_server|INFO|memory trimming after compaction enabled.
2022-04-04T17:16:36.351Z|00207|ovsdb_server|INFO|memory trimming after compaction enabled.
2022-04-04T17:16:46.351Z|00208|ovsdb_server|INFO|memory trimming after compaction enabled.
2022-04-04T17:16:56.354Z|00209|ovsdb_server|INFO|memory trimming after compaction enabled.
2022-04-04T17:17:06.343Z|00210|ovsdb_server|INFO|memory trimming after compaction enabled.
2022-04-04T17:17:16.355Z|00211|ovsdb_server|INFO|memory trimming after compaction enabled.
2022-04-04T17:17:26.343Z|00212|ovsdb_server|INFO|memory trimming after compaction enabled.
2022-04-04T17:17:36.348Z|00213|ovsdb_server|INFO|memory trimming after compaction enabled.
2022-04-04T17:17:46.344Z|00214|ovsdb_server|INFO|memory trimming after compaction enabled.
2022-04-04T17:17:56.403Z|00215|ovsdb_server|INFO|memory trimming after compaction enabled.
2022-04-04T17:18:06.343Z|00216|ovsdb_server|INFO|memory trimming after compaction enabled.
2022-04-04T17:18:16.346Z|00217|ovsdb_server|INFO|memory trimming after compaction enabled.
2022-04-04T17:18:26.346Z|00218|ovsdb_server|INFO|memory trimming after compaction enabled.
2022-04-04T17:18:36.359Z|00219|ovsdb_server|INFO|memory trimming after compaction enabled.
2022-04-04T17:18:46.347Z|00220|ovsdb_server|INFO|memory trimming after compaction enabled.
2022-04-04T17:18:56.349Z|00221|ovsdb_server|INFO|memory trimming after compaction enabled.
2022-04-04T17:19:06.350Z|00222|ovsdb_server|INFO|memory trimming after compaction enabled.
2022-04-04T17:19:16.366Z|00223|ovsdb_server|INFO|memory trimming after compaction enabled.
2022-04-04T17:19:26.361Z|00224|ovsdb_server|INFO|memory trimming after compaction enabled.

Comment 21 Ross Brattain 2022-04-04 18:30:22 UTC
That was on 4.11.0-0.ci-2022-04-04-131844

Comment 25 Ross Brattain 2022-04-05 15:24:06 UTC
Okay, so I don't know if the log spam will have any functional impact.  ~90 bytes every 10 seconds is ~32k per hour, ~759k per day, so it's not like it will impact container log storage.  At least it will annoying clear that memory trimming is in fact enabled.

I assume the memory trim itself is saving use lots of MB.

Moving to Verified.

Comment 27 errata-xmlrpc 2022-08-10 10:51:15 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069


Note You need to log in before you can comment on or make changes to this bug.