Bug 1887585
| Field | Value | Field | Value |
|---|---|---|---|
| Summary | ovn-masters stuck in crashloop after scale test | | |
| Product | OpenShift Container Platform | Reporter | Joe Talerico <jtaleric> |
| Component | Networking | Assignee | Aniket Bhat <anbhat> |
| Networking sub component | ovn-kubernetes | QA Contact | Mike Fiedler <mifiedle> |
| Status | CLOSED ERRATA | Docs Contact | |
| Severity | high | | |
| Priority | high | CC | anbhat, bbennett, dblack, fpan, mifiedle, sburke, smalleni, syangsao, trozet |
| Version | 4.6 | Keywords | Regression, TestBlocker |
| Target Milestone | --- | | |
| Target Release | 4.7.0 | | |
| Hardware | Unspecified | | |
| OS | Unspecified | | |
| Whiteboard | aos-scalability-46 | | |
| Fixed In Version | | Doc Type | If docs needed, set a value |
| Doc Text | | Story Points | --- |
| Clone Of | | | |
| | 1888721 (view as bug list) | Environment | |
| Last Closed | 2021-02-24 15:25:25 UTC | Type | Bug |
| Regression | --- | Mount Type | --- |
| Documentation | --- | CRM | |
| Verified Versions | | Category | --- |
| oVirt Team | --- | RHEL 7.3 requirements from Atomic Host | |
| Cloudforms Team | --- | Target Upstream Version | |
| Embargoed | | | |
| Bug Depends On | | | |
| Bug Blocks | 1888721 | | |
Description
Joe Talerico
2020-10-12 20:47:41 UTC
I did not see this while verifying https://bugzilla.redhat.com/show_bug.cgi?id=1859883 on 4.6.0-0.nightly-2020-09-28-061045, but I do see the same issue as reported here on 4.6.0.rc2. A 100-node cluster is stable when idle; when 1000 projects with 4000 pods are created, the apiserver goes unavailable and never seems to recover. If any logs would help, I can create a bastion and try to access nodes directly.

I have confirmed this is still present in rc2. I was able to get a must-gather from this run: http://dell-r510-01.perf.lab.eng.rdu2.redhat.com/large-scale/4.6-ovn/must-gather/kaboom-must-gather.local.6910035819870841487.tar.gz

To recreate the issue:

```shell
oc apply -f resources/namespace.yaml
oc apply -f resources/kube-burner-role.yml
oc apply -f deploy/
oc create -f resources/crds/ripsaw_v1alpha1_ripsaw_crd.yaml
oc apply -f resources/operator.yaml
```

Ripsaw workload:

```yaml
apiVersion: ripsaw.cloudbulldozer.io/v1alpha1
kind: Benchmark
metadata:
  name: kube-burner-cluster-density
  namespace: my-ripsaw
spec:
  elasticsearch:
    server: search-cloud-perf-lqrf3jjtaqo7727m7ynd2xyt4y.us-west-2.es.amazonaws.com
    port: 80
  prometheus:
    server: https://search-cloud-perf-lqrf3jjtaqo7727m7ynd2xyt4y.us-west-2.es.amazonaws.com:443
    prom_token: <token>
    prom_url: <url>
  workload:
    name: kube-burner
    args:
      workload: cluster-density
      job_iterations: 500
      wait_for: ["Build"]
      wait_when_finished: true
      image: quay.io/cloud-bulldozer/kube-burner:latest
      qps: 25
      burst: 25
```

The plan is to figure out which ovn/ovs build between September 19 and October 1st introduced the db corruption issue. OVS needs to be dialed back to the el7-52 version of the rpms; Aniket is working on creating a PR and testing this on a jump host. OVN needs to be dialed back to the 20.06.2-3.el8fdp rpms; Anil will test this with a PR provided by Tim.

Re-targeting 4.6.0, as this causes ovn to be unrecoverable.

Investigating the must-gather and a reproduced setup revealed multiple problems and findings:
1. The original problem stated in this bug is a container crashing. This is due to an issue in ovn-dbchecker, where we accidentally exit the container if we hit an error reading the db file for OVN. There is a fix for not exiting the script here: https://github.com/openshift/ovn-kubernetes/pull/308. The original error message indicated corruption in the southbound database:

   ```
   /etc/ovn/ovnsb_db.db: 625335735 bytes starting at offset 65 have SHA-1 hash ff4ef44ee1817d3482803f9cec049584f1db7a32 but should have hash 2b7967802ce8c93f46d5cca5ea0564f28c07ee46
   ```

   However, after looking into this, the mismatch is caused by ovn sbdb writing to the file while we are trying to read it. We will therefore need to come up with a proper fix for this, in this bug.

2. More concerning than the dbchecker crashing is that OVN NB is completely hosed. Every OVN NB DB instance is using over 17 GB of RAM, and all northds are pegged at 100% CPU. This causes commands from ovn-kubernetes to fail. Tracked here: https://bugzilla.redhat.com/show_bug.cgi?id=1888829. We do not yet know what is causing this in OVN. We want to rule out the service reject ACL scale issues (https://bugzilla.redhat.com/show_bug.cgi?id=1855408) by running the test with pod density only.

3. The failed calls in #2 exposed a segfault in ovn-kubernetes master. In this scenario, we cannot create the address set for a namespace in OVN and end up crashing when a pod is later added to the namespace. When ovnkube-master restarts, things seem to reconcile. Tracked here: https://bugzilla.redhat.com/show_bug.cgi?id=1888827

4. During the object density scale test, many containers are in "Init:Error". The first thing they do is issue a DNS request to github.com and git clone; the DNS lookup fails because we are missing flows in OVS. This is only temporary, and subsequent DNS lookups work.
Before we return that CNI has succeeded in ovn-k8s, we need to check for the presence of flows in the port security out table. Tracked by: https://bugzilla.redhat.com/show_bug.cgi?id=1885761

As a data point, we created 5000 naked pods across 75 nodes and had no issues. All pods were in Running state, with no restarts/errors/crashloops from OVN.

@mifiedle Could you re-assign to the right QA contact? Thanks.

This problem still occurs on 4.7.0-0.nightly-2020-11-30-172451. In a 100-node OVN cluster, the API becomes completely unresponsive after 800 namespaces and 3200 pods are created. Let me know what documentation/logs are needed; oc adm must-gather is not an option.

Yes, I'm hitting the same issues on baremetal even with the said fix.

I believe the issue I hit in comment 14 is different. While verifying https://bugzilla.redhat.com/show_bug.cgi?id=1855408 (basically the same as this bz, with 2000 pods instead of 4000), we hit https://bugzilla.redhat.com/show_bug.cgi?id=1905680. The problem was not master-related but nodes OOMing and going NotReady due to ovn-controller memory usage. With big enough nodes, we could run that workload successfully. I will re-try this workload with huge nodes.

Verified on 4.7.0-0.nightly-2020-12-04-013308. With m5.2xlarge nodes I could get this workload running with no master crash issue. With m5.xlarge nodes (which work fine for openshift-sdn), I hit https://bugzilla.redhat.com/show_bug.cgi?id=1905680

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633
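As an illustration of how the crashloop described in this bug could be watched during a scale run, a minimal sketch is below. The `count_restarts` helper is hypothetical tooling of mine, not part of the bug's actual reproduction; it assumes the standard `openshift-ovn-kubernetes` namespace and `app=ovnkube-master` pod label.

```shell
# Sketch only: sum restartCount across all ovnkube-master containers so a
# climbing total flags the crashloop while the kube-burner workload runs.
# The namespace and label selector are assumptions about the deployment.
count_restarts() {
    oc get pods -n openshift-ovn-kubernetes -l app=ovnkube-master \
        -o 'jsonpath={range .items[*]}{range .status.containerStatuses[*]}{.restartCount}{"\n"}{end}{end}' \
        | awk '{s += $1} END {print s + 0}'
}

# Example polling loop (commented out): print the total once a minute.
# while true; do echo "$(date +%T) restarts=$(count_restarts)"; sleep 60; done
```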
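Finding 1 above (the SHA-1 mismatch coming from reading the db while the sbdb server is appending to it) suggests the checker needs to distinguish "file changing under us" from real corruption. A minimal sketch of that idea follows; the `db_is_quiescent` and `check_with_retry` helpers, the size-comparison heuristic, and the retry budget are all my illustrative choices, not the actual ovn-dbchecker code.

```shell
# Sketch: before trusting an integrity failure, check whether the db file
# was being written during the read by comparing its size across a short
# interval. A size change means an active writer, so a hash mismatch is
# expected and should trigger a retry rather than a corruption report.
db_is_quiescent() {
    db="$1"
    s1=$(wc -c < "$db")
    sleep "${CHECK_DELAY:-1}"
    s2=$(wc -c < "$db")
    [ "$s1" -eq "$s2" ]
}

# Retry a few times before concluding the file is really corrupt; on
# persistent churn, defer the check instead of exiting the container.
check_with_retry() {
    for attempt in 1 2 3; do
        if db_is_quiescent "$1"; then
            return 0    # stable snapshot; safe to run the real hash check
        fi
    done
    return 1            # still changing; defer, do not exit the container
}
```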
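The port-security flow gate described for finding 4 (bug 1885761) could look roughly like the sketch below. This is the idea only: the bridge name `br-int` is standard for ovn-kubernetes, but `wait_for_flows`, the retry budget, and grepping the interface id out of `ovs-ofctl dump-flows` output are my illustrative choices, not the actual fix.

```shell
# Sketch: before CNI ADD returns success, poll the OVS flow table until a
# flow mentioning the pod's interface id appears, so the pod's first DNS
# query is not sent before the datapath is programmed.
OVS_DUMP="${OVS_DUMP:-ovs-ofctl dump-flows br-int}"

wait_for_flows() {
    iface_id="$1"
    i=0
    while [ "$i" -lt "${FLOW_RETRIES:-20}" ]; do
        if $OVS_DUMP 2>/dev/null | grep -q "$iface_id"; then
            return 0     # flows present; safe to report CNI success
        fi
        i=$((i + 1))
        sleep "${FLOW_DELAY:-1}"
    done
    return 1             # flows never appeared; fail the CNI request
}
```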