Bug 1887585

Summary:	ovn-masters stuck in crashloop after scale test
Product:	OpenShift Container Platform	Reporter:	Joe Talerico <jtaleric>
Component:	Networking	Assignee:	Aniket Bhat <anbhat>
Networking sub component:	ovn-kubernetes	QA Contact:	Mike Fiedler <mifiedle>
Status:	CLOSED ERRATA	Docs Contact:
Severity:	high
Priority:	high	CC:	anbhat, bbennett, dblack, fpan, mifiedle, sburke, smalleni, syangsao, trozet
Version:	4.6	Keywords:	Regression, TestBlocker
Target Milestone:	---
Target Release:	4.7.0
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:	aos-scalability-46
Fixed In Version:		Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:
Clones:	1888721 (view as bug list)		Environment:
Last Closed:	2021-02-24 15:25:25 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks:	1888721

Description Joe Talerico 2020-10-12 20:47:41 UTC

Description of problem:
Ran object density test with 500 projects, with 100 worker nodes. Previous to OCP4.6.rc0 this test ran fine, however, rc0 has introduced something new, which causes the masters to get in a crashloop.

root@ip-172-31-68-73: ~/benchmark-operator # oc get pods -n openshift-ovn-kubernetes | grep master
ovnkube-master-9g6bz   6/6     Running            22         4d8h
ovnkube-master-jn4sr   5/6     CrashLoopBackOff   21         4d8h
ovnkube-master-lrgmk   5/6     CrashLoopBackOff   17         4d8h

  Warning  Unhealthy  115m  kubelet  Readiness probe failed: + /usr/bin/ovn-appctl -t /var/run/ovn/ovnsb_db.ctl --timeout=3 cluster/status OVN_Southbound
+ grep -v Address -q
+ grep 10.0.172.219
  Warning  Unhealthy  110m  kubelet  Readiness probe failed: + /usr/bin/ovn-appctl -t /var/run/ovn/ovnnb_db.ctl --timeout=3 cluster/status OVN_Northbound
+ grep 10.0.172.219
+ grep -v Address -q
  Warning  Unhealthy  99m (x3 over 163m)  kubelet  Readiness probe failed: + /usr/bin/ovn-appctl -t /var/run/ovn/ovnsb_db.ctl --timeout=3 cluster/status OVN_Southbound
+ grep 10.0.172.219
+ grep -v Address -q
  Normal   Created  93m (x5 over 4d8h)     kubelet  Created container ovn-dbchecker
  Normal   Started  93m (x5 over 4d8h)     kubelet  Started container ovn-dbchecker

4:39
      Started:   Mon, 12 Oct 2020 20:36:01 +0000
    Last State:  Terminated
      Reason:    Error
      Message:   vn-k8s-cni-overlay} OVNKubernetesFeature:{EnableEgressIP:true} Kubernetes:{Kubeconfig: CACert: APIServer:https://api-int.rookovn.perf-testing.devcluster.openshift.com:6443 Token: CompatServiceCIDR: RawServiceCIDRs:172.30.0.0/16 ServiceCIDRs:[] OVNConfigNamespace:openshift-ovn-kubernetes MetricsBindAd
dress: OVNMetricsBindAddress: MetricsEnablePprof:false OVNEmptyLbEvents:false PodIP: RawNoHostSubnetNodes: NoHostSubnetNodes:nil} OvnNorth:{Address: PrivKey: Cert: CACert: CertCommonName: Scheme: northbound:false externalID: exec:<nil>} OvnSouth:{Address: PrivKey: Cert: CACert: CertCommonName: Scheme: northbound:fals
e externalID: exec:<nil>} Gateway:{Mode:local Interface: NextHop: VLANID:0 NodeportEnable:true DisableSNATMultipleGWs:false} MasterHA:{ElectionLeaseDuration:60 ElectionRenewDeadline:30 ElectionRetryPeriod:20} HybridOverlay:{Enabled:false RawClusterSubnets: ClusterSubnets:[] VXLANPort:4789}}
I1012 20:29:31.060303       1 ovndbmanager.go:22] Starting DB Checker to ensure cluster membership and DB consistency
I1012 20:29:31.060345       1 ovndbmanager.go:43] Starting ensure routine for Raft db: /etc/ovn/ovnsb_db.db
I1012 20:29:31.060366       1 ovndbmanager.go:43] Starting ensure routine for Raft db: /etc/ovn/ovnnb_db.db
I1012 20:30:31.648730       1 ovndbmanager.go:229] check-cluster returned out: "", stderr: ""
W1012 20:30:45.297491       1 ovndbmanager.go:89] Unable to get db server ID for: /etc/ovn/ovnsb_db.db, stderr: ovsdb-tool: syntax error: /etc/ovn/ovnsb_db.db: 625335735 bytes starting at offset 65 have SHA-1 hash ff4ef44ee1817d3482803f9cec049584f1db7a32 but should have hash 2b7967802ce8c93f46d5cca5ea0564f28c07ee46
, err: exit status 1
F1012 20:30:59.280415       1 ovndbmanager.go:200] Error occured during checking of clustered db db: /etc/ovn/ovnsb_db.db,stdout: "", stderr: "ovsdb-tool: syntax error: /etc/ovn/ovnsb_db.db: 625335735 bytes starting at offset 65 have SHA-1 hash ff4ef44ee1817d3482803f9cec049584f1db7a32 but should have hash 2b7967802ce
8c93f46d5cca5ea0564f28c07ee46\n", err: exit status 1
      Exit Code:    255
      Started:      Mon, 12 Oct 2020 20:29:31 +0000
      Finished:     Mon, 12 Oct 2020 20:30:59 +0000
    Ready:          True

Version-Release number of selected component (if applicable):
4.6.0-rc.0:

How reproducible:
N/A

Comment 2 Mike Fiedler 2020-10-13 14:37:19 UTC

I did not see this while verifying https://bugzilla.redhat.com/show_bug.cgi?id=1859883 on 4.6.0-0.nightly-2020-09-28-061045

I do see the same issue as reported here on 4.6.0.rc2.   100 node cluster is stable when idle.   When 1000 projects with 4000 pods are created, the apiserver goes unavailable and never seems to recover.   If any logs would help I can create a bastion and try to access nodes directly.

Comment 3 Joe Talerico 2020-10-14 16:55:15 UTC

I have confirmed this is still present in rc2.

I was able to get a must-gather from this run: 
http://dell-r510-01.perf.lab.eng.rdu2.redhat.com/large-scale/4.6-ovn/must-gather/kaboom-must-gather.local.6910035819870841487.tar.gz


Recreate issue :
  oc apply -f resources/namespace.yaml 
  oc apply -f resources/kube-burner-role.yml 
  oc apply -f deploy/
  oc create -f resources/crds/ripsaw_v1alpha1_ripsaw_crd.yaml
  oc apply -f resources/operator.yaml
  

Ripsaw workload:

apiVersion: ripsaw.cloudbulldozer.io/v1alpha1
kind: Benchmark
metadata:
  name: kube-burner-cluster-density
  namespace: my-ripsaw
spec:
  elasticsearch:
    server: search-cloud-perf-lqrf3jjtaqo7727m7ynd2xyt4y.us-west-2.es.amazonaws.com 
    port: 80 
  prometheus:
    server: https://search-cloud-perf-lqrf3jjtaqo7727m7ynd2xyt4y.us-west-2.es.amazonaws.com:443
    prom_token: <token>
    prom_url: <url>
  workload:
    name: kube-burner
    args:
      workload: cluster-density
      job_iterations: 500 
      wait_for: ["Build"]
      wait_when_finished: true
      image: quay.io/cloud-bulldozer/kube-burner:latest 
      qps: 25
      burst: 25

Comment 4 Aniket Bhat 2020-10-14 19:59:40 UTC

The plan is to figure out what ovn/ovs build between September 19 and October 1st introduced the issue for db corruption.

OVS needs to dial back to el7-52 version of the rpms - Aniket is working on creating a PR and testing this on a jump host.
OVN needs to be dialed back to 20.06.2-3.el8fdp rpms - Anil will test this with a PR provided by Tim.

Comment 5 Feng Pan 2020-10-14 21:44:53 UTC

Re-targeting 4.6.0, this causes ovn to be unrecoverable.

Comment 8 Tim Rozet 2020-10-15 20:57:31 UTC

Investigating the must gather and a reproduced setup revealed multiple problems and findings:
1. The original problem stated in this bug is around a container crashing. This is due to an issue in ovn-dbchecker where we accidentally exit out of the container if we hit an error reading the db file for OVN. There is a fix for not exiting the script here:
https://github.com/openshift/ovn-kubernetes/pull/308

The original error message here indicated to us that there was corruption in the southbound database:
/etc/ovn/ovnsb_db.db: 625335735 bytes starting at offset 65 have SHA-1 hash ff4ef44ee1817d3482803f9cec049584f1db7a32 but should have hash 2b7967802ce8c93f46d5cca5ea0564f28c07ee46

However, after looking into this it is caused because ovn sbdb is writing to the file while we are trying to read it. Therefore we will need to come up with a proper fix for this, in this bug.

2. More concerning than the dbchecker crashing is that OVN NB is completely hosed. We can see that every OVN NB DB instance is over 17GB of RAM, and all northds are pegged at 100% CPU. This causes commands from ovn-kubernetes to fail. Tracking this bug here:
https://bugzilla.redhat.com/show_bug.cgi?id=1888829

We do not know what is causing this to happen in OVN yet. We want to rule out the service reject ACL scale issues (https://bugzilla.redhat.com/show_bug.cgi?id=1855408) by only running the test with pod density.

3. The failed calls in #2 exposed a segfault in ovn-kubernetes master. In this scenario, we cannot create the address set for a namespace in OVN and end up crashing when a pod is later added for the namespace. When ovnkube-master restarts, things seem to reconcile. Tracked here: https://bugzilla.redhat.com/show_bug.cgi?id=1888827

4. During the object density scale test, many containers are in "Init:Error". This is because the first thing they do is try to issue a dns request to github.com and git clone. The DNS fails because we are missing flows in OVS. This is only temporary and subsequent DNS lookups work. Before we return that CNI has succeeded in ovn-k8s we need to check for the presence of flows in port security out table. Tracked by: https://bugzilla.redhat.com/show_bug.cgi?id=1885761

Comment 9 Joe Talerico 2020-10-16 00:12:31 UTC

As a data point, we created 5000 naked pods across 75 nodes, and had no issues. All pods where in Running state, and no restarts/error/crashloops from OVN.

Comment 13 Anurag saxena 2020-12-01 15:47:59 UTC

@mifiedle Could you re-assign appropriately to right QA contact? Thanks

Comment 14 Mike Fiedler 2020-12-01 21:28:57 UTC

This problem still occurs on 4.7.0-0.nightly-2020-11-30-172451

In a 100 node OVN cluster, the API becomes completely unresponsive after 800 namespaces and 3200 pods are created.   Let me know what documentation/logs are needed.   oc adm must-gather is not an option.

Comment 15 Sai Sindhur Malleni 2020-12-04 14:07:10 UTC

Yes, I'm hitting the same issues on baremetal even with the said fix.

Comment 16 Sai Sindhur Malleni 2020-12-04 14:07:56 UTC

Yes, I'm hitting the same issues on baremetal even with the said fix.

Comment 19 Mike Fiedler 2020-12-08 20:43:03 UTC

I believe the issue I hit in comment 14 is different.   While verifying https://bugzilla.redhat.com/show_bug.cgi?id=1855408 (basically the same as this bz with 2000 pods instead of 4000 as here), we hit https://bugzilla.redhat.com/show_bug.cgi?id=1905680.   The problem was not master-related but nodes OOMing and going NotReady due to ovn-controller memory usage.  

With big enough nodes, we could run that workload successfully.   I will re-try this workload with huge nodes.

Comment 20 Mike Fiedler 2020-12-08 21:20:57 UTC

Verified on 4.7.0-0.nightly-2020-12-04-013308

With m5.2xlarge nodes I could get this workload running with no master crash issue.   With m5.xlarge nodes (which work fine for openshift-sdn), I hit https://bugzilla.redhat.com/show_bug.cgi?id=1905680

Comment 24 errata-xmlrpc 2021-02-24 15:25:25 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633