Description of problem:

Ran an object density test with 500 projects on 100 worker nodes. Prior to OCP 4.6 rc0 this test ran fine; rc0 has introduced something new which causes the masters to go into a crashloop.

root@ip-172-31-68-73: ~/benchmark-operator # oc get pods -n openshift-ovn-kubernetes | grep master
ovnkube-master-9g6bz   6/6   Running            22   4d8h
ovnkube-master-jn4sr   5/6   CrashLoopBackOff   21   4d8h
ovnkube-master-lrgmk   5/6   CrashLoopBackOff   17   4d8h

Events on the crashing pods:

  Warning  Unhealthy  115m                kubelet  Readiness probe failed: + /usr/bin/ovn-appctl -t /var/run/ovn/ovnsb_db.ctl --timeout=3 cluster/status OVN_Southbound
                                                   + grep -v Address -q
                                                   + grep 10.0.172.219
  Warning  Unhealthy  110m                kubelet  Readiness probe failed: + /usr/bin/ovn-appctl -t /var/run/ovn/ovnnb_db.ctl --timeout=3 cluster/status OVN_Northbound
                                                   + grep 10.0.172.219
                                                   + grep -v Address -q
  Warning  Unhealthy  99m (x3 over 163m)  kubelet  Readiness probe failed: + /usr/bin/ovn-appctl -t /var/run/ovn/ovnsb_db.ctl --timeout=3 cluster/status OVN_Southbound
                                                   + grep 10.0.172.219
                                                   + grep -v Address -q
  Normal   Created    93m (x5 over 4d8h)  kubelet  Created container ovn-dbchecker
  Normal   Started    93m (x5 over 4d8h)  kubelet  Started container ovn-dbchecker

ovn-dbchecker container status:

    Started:      Mon, 12 Oct 2020 20:36:01 +0000
    Last State:   Terminated
      Reason:     Error
      Message:    vn-k8s-cni-overlay} OVNKubernetesFeature:{EnableEgressIP:true} Kubernetes:{Kubeconfig: CACert: APIServer:https://api-int.rookovn.perf-testing.devcluster.openshift.com:6443 Token: CompatServiceCIDR: RawServiceCIDRs:172.30.0.0/16 ServiceCIDRs:[] OVNConfigNamespace:openshift-ovn-kubernetes MetricsBindAddress: OVNMetricsBindAddress: MetricsEnablePprof:false OVNEmptyLbEvents:false PodIP: RawNoHostSubnetNodes: NoHostSubnetNodes:nil} OvnNorth:{Address: PrivKey: Cert: CACert: CertCommonName: Scheme: northbound:false externalID: exec:<nil>} OvnSouth:{Address: PrivKey: Cert: CACert: CertCommonName: Scheme: northbound:false externalID: exec:<nil>} Gateway:{Mode:local Interface: NextHop: VLANID:0 NodeportEnable:true DisableSNATMultipleGWs:false} MasterHA:{ElectionLeaseDuration:60 ElectionRenewDeadline:30 ElectionRetryPeriod:20} HybridOverlay:{Enabled:false RawClusterSubnets: ClusterSubnets:[] VXLANPort:4789}}
        I1012 20:29:31.060303       1 ovndbmanager.go:22] Starting DB Checker to ensure cluster membership and DB consistency
        I1012 20:29:31.060345       1 ovndbmanager.go:43] Starting ensure routine for Raft db: /etc/ovn/ovnsb_db.db
        I1012 20:29:31.060366       1 ovndbmanager.go:43] Starting ensure routine for Raft db: /etc/ovn/ovnnb_db.db
        I1012 20:30:31.648730       1 ovndbmanager.go:229] check-cluster returned out: "", stderr: ""
        W1012 20:30:45.297491       1 ovndbmanager.go:89] Unable to get db server ID for: /etc/ovn/ovnsb_db.db, stderr: ovsdb-tool: syntax error: /etc/ovn/ovnsb_db.db: 625335735 bytes starting at offset 65 have SHA-1 hash ff4ef44ee1817d3482803f9cec049584f1db7a32 but should have hash 2b7967802ce8c93f46d5cca5ea0564f28c07ee46, err: exit status 1
        F1012 20:30:59.280415       1 ovndbmanager.go:200] Error occured during checking of clustered db db: /etc/ovn/ovnsb_db.db,stdout: "", stderr: "ovsdb-tool: syntax error: /etc/ovn/ovnsb_db.db: 625335735 bytes starting at offset 65 have SHA-1 hash ff4ef44ee1817d3482803f9cec049584f1db7a32 but should have hash 2b7967802ce8c93f46d5cca5ea0564f28c07ee46\n", err: exit status 1
      Exit Code:  255
      Started:    Mon, 12 Oct 2020 20:29:31 +0000
      Finished:   Mon, 12 Oct 2020 20:30:59 +0000
    Ready:        True

Version-Release number of selected component (if applicable):
4.6.0-rc.0

How reproducible:
N/A
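For anyone who wants to poke at this on a live cluster, this is roughly what I run by hand to mirror the readiness probe and the dbchecker (the "sbdb" container name and db path are assumptions taken from the log above; adjust for your deployment):

  # Raft membership as the readiness probe sees it
  oc -n openshift-ovn-kubernetes exec ovnkube-master-jn4sr -c sbdb -- \
      ovn-appctl -t /var/run/ovn/ovnsb_db.ctl --timeout=3 cluster/status OVN_Southbound

  # The same integrity checks ovn-dbchecker performs against the on-disk db
  oc -n openshift-ovn-kubernetes exec ovnkube-master-jn4sr -c sbdb -- \
      ovsdb-tool db-sid /etc/ovn/ovnsb_db.db
  oc -n openshift-ovn-kubernetes exec ovnkube-master-jn4sr -c sbdb -- \
      ovsdb-tool check-cluster /etc/ovn/ovnsb_db.db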
I did not see this while verifying https://bugzilla.redhat.com/show_bug.cgi?id=1859883 on 4.6.0-0.nightly-2020-09-28-061045. I do see the same issue as reported here on 4.6.0-rc.2. The 100-node cluster is stable when idle, but when 1000 projects with 4000 pods are created, the apiserver becomes unavailable and never seems to recover. If any logs would help, I can create a bastion and try to access the nodes directly.
I have confirmed this is still present in rc2. I was able to get a must-gather from this run: http://dell-r510-01.perf.lab.eng.rdu2.redhat.com/large-scale/4.6-ovn/must-gather/kaboom-must-gather.local.6910035819870841487.tar.gz

To recreate the issue:

  oc apply -f resources/namespace.yaml
  oc apply -f resources/kube-burner-role.yml
  oc apply -f deploy/
  oc create -f resources/crds/ripsaw_v1alpha1_ripsaw_crd.yaml
  oc apply -f resources/operator.yaml

Ripsaw workload:

apiVersion: ripsaw.cloudbulldozer.io/v1alpha1
kind: Benchmark
metadata:
  name: kube-burner-cluster-density
  namespace: my-ripsaw
spec:
  elasticsearch:
    server: search-cloud-perf-lqrf3jjtaqo7727m7ynd2xyt4y.us-west-2.es.amazonaws.com
    port: 80
  prometheus:
    server: https://search-cloud-perf-lqrf3jjtaqo7727m7ynd2xyt4y.us-west-2.es.amazonaws.com:443
    prom_token: <token>
    prom_url: <url>
  workload:
    name: kube-burner
    args:
      workload: cluster-density
      job_iterations: 500
      wait_for: ["Build"]
      wait_when_finished: true
      image: quay.io/cloud-bulldozer/kube-burner:latest
      qps: 25
      burst: 25
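For completeness, this is roughly how I kick off and watch the run once the operator is up (the file name is whatever you saved the Benchmark CR above as, and using "benchmarks" as the resource name is an assumption from the ripsaw CRD):

  oc apply -f kube-burner-cluster-density.yaml
  oc get benchmarks -n my-ripsaw -w
  oc get pods -n my-ripsaw
  # watch restart counts on the OVN masters while the workload runs
  oc get pods -n openshift-ovn-kubernetes | grep master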
The plan is to figure out which ovn/ovs build between September 19 and October 1st introduced the db corruption issue. OVS needs to be dialed back to the el7-52 version of the rpms - Aniket is working on creating a PR and testing this on a jump host. OVN needs to be dialed back to the 20.06.2-3.el8fdp rpms - Anil will test this with a PR provided by Tim.
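A quick way to confirm which OVS/OVN rpm versions a given payload is actually carrying (assuming rpm is available in the ovnkube image and that "northd" is one of the ovnkube-master container names; adjust as needed):

  # OVS on the host
  oc debug node/<worker-node> -- chroot /host rpm -qa | grep -E 'openvswitch|ovn'

  # OVN inside the ovnkube-master image
  oc -n openshift-ovn-kubernetes exec <ovnkube-master-pod> -c northd -- rpm -qa | grep ovn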
Re-targeting to 4.6.0; this causes OVN to be unrecoverable.
Investigating the must-gather and a reproduced setup revealed multiple problems and findings:

1. The original problem stated in this bug is a container crashing. This is due to an issue in ovn-dbchecker where we accidentally exit the container if we hit an error reading the OVN db file. There is a fix for not exiting the script here: https://github.com/openshift/ovn-kubernetes/pull/308

   The original error message indicated to us that there was corruption in the southbound database:
   /etc/ovn/ovnsb_db.db: 625335735 bytes starting at offset 65 have SHA-1 hash ff4ef44ee1817d3482803f9cec049584f1db7a32 but should have hash 2b7967802ce8c93f46d5cca5ea0564f28c07ee46

   However, after looking into this, it is caused by the OVN SB DB writing to the file while we are trying to read it. We will need to come up with a proper fix for this in this bug.

2. More concerning than the dbchecker crashing is that OVN NB is completely hosed. Every OVN NB DB instance is using over 17 GB of RAM, and all northds are pegged at 100% CPU (see the sketch after this list for how to spot this). This causes commands from ovn-kubernetes to fail. Tracking this bug here: https://bugzilla.redhat.com/show_bug.cgi?id=1888829
   We do not yet know what is causing this to happen in OVN. We want to rule out the service reject ACL scale issues (https://bugzilla.redhat.com/show_bug.cgi?id=1855408) by only running the test with pod density.

3. The failed calls in #2 exposed a segfault in ovn-kubernetes master. In this scenario, we cannot create the address set for a namespace in OVN and end up crashing when a pod is later added for that namespace. When ovnkube-master restarts, things seem to reconcile. Tracked here: https://bugzilla.redhat.com/show_bug.cgi?id=1888827

4. During the object density scale test, many containers are in "Init:Error". This is because the first thing they do is issue a DNS request to github.com and git clone. The DNS lookup fails because we are missing flows in OVS. This is only temporary and subsequent DNS lookups work. Before we return that CNI has succeeded in ovn-k8s, we need to check for the presence of flows in the port security out table. Tracked by: https://bugzilla.redhat.com/show_bug.cgi?id=1885761
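For reference, this is roughly how I spot the NB DB memory / northd CPU blowup and the Init:Error pods during a run (assumes cluster metrics are up so oc adm top works):

  # per-container CPU/memory for the OVN control plane
  oc adm top pods -n openshift-ovn-kubernetes --containers | grep -E 'nbdb|northd'

  # pods stuck failing their init container during the density test
  oc get pods -A | grep Init:Error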
As a data point, we created 5000 naked pods across 75 nodes and had no issues. All pods were in Running state, and there were no restarts/errors/crashloops from OVN.
@mifiedle Could you re-assign this to the right QA contact? Thanks.
This problem still occurs on 4.7.0-0.nightly-2020-11-30-172451. In a 100-node OVN cluster, the API becomes completely unresponsive after 800 namespaces and 3200 pods are created. Let me know what documentation/logs are needed; oc adm must-gather is not an option.
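Since must-gather is out when the apiserver is down, this is roughly what I can pull straight off a master over SSH from a bastion (crictl and journalctl are standard on RHCOS; the "ovnkube-master" name filter is an assumption):

  ssh core@<master-ip>
  sudo crictl ps -a --name ovnkube-master
  sudo crictl logs --tail 200 <container-id>
  sudo journalctl -u kubelet --since "1 hour ago" > kubelet.log
  sudo journalctl -u ovs-vswitchd -u ovsdb-server --since "1 hour ago" > ovs.log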
Yes, I'm hitting the same issues on bare metal even with the aforementioned fix.
I believe the issue I hit in comment 14 is different. While verifying https://bugzilla.redhat.com/show_bug.cgi?id=1855408 (basically the same workload as this bz, with 2000 pods instead of 4000), we hit https://bugzilla.redhat.com/show_bug.cgi?id=1905680. The problem was not master-related; rather, nodes were OOMing and going NotReady due to ovn-controller memory usage. With big enough nodes, we could run that workload successfully. I will re-try this workload with larger nodes.
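To tell the node-side OOM case apart from the master crashloop, I check node status and the kernel log on an affected node (if the node is too far gone for a debug pod, SSH from a bastion instead):

  oc get nodes | grep NotReady
  oc adm top nodes
  oc debug node/<affected-node> -- chroot /host journalctl -k | grep -i 'out of memory'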
Verified on 4.7.0-0.nightly-2020-12-04-013308. With m5.2xlarge nodes I could get this workload running with no master crash issue. With m5.xlarge nodes (which work fine for openshift-sdn), I hit https://bugzilla.redhat.com/show_bug.cgi?id=1905680.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2020:5633