Bug 1916029
| Summary: | Upgrade from 4.6.9 to 4.7.0-fc.2 is failing, network begins to drop packets during upgrade, stable prior to upgrade | ||
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Naga Ravi Chaitanya Elluri <nelluri> |
| Component: | Networking | Assignee: | Naga Ravi Chaitanya Elluri <nelluri> |
| Networking sub component: | openshift-sdn | QA Contact: | zhaozhanqi <zzhao> |
| Status: | CLOSED EOL | Docs Contact: | |
| Severity: | high | ||
| Priority: | high | CC: | bbennett, dcbw, nelluri, oarribas, rravaiol, surya, wking |
| Version: | 4.6 | Flags: | surya: needinfo- |
| Target Milestone: | --- | ||
| Target Release: | --- | ||
| Hardware: | Unspecified | ||
| OS: | Linux | ||
| Whiteboard: | aos-scalability-46 | ||
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2022-10-28 10:18:28 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
|
Description
Naga Ravi Chaitanya Elluri
2021-01-14 00:44:50 UTC
We did a second round of the upgrade, but this time the builds were excluded during the cluster load phase. This seems to reduce the load on the cluster, especially on etcd, since the DB size is about half of what it was on the cluster with builds. The upgrade passed without any issues in this case, so the suspect is high load causing network congestion, which makes the control plane unstable and eventually unresponsive while the machine-config cluster operator upgrade is in progress and the masters are being rebooted.

During situations where resource contention/packet drops are suspected, can you jump into an OVS container on the node and run "ovs-dpctl show" to get OVS data plane stats on packet drops/upcalls?

Can I have access to the environment for the stress test, or could you provide us with the OVS and SDN container logs from the nodes when you saw the stress on ens5? Also, out of curiosity, where can I find more info on the builds? I am not an expert on OCP stress tests/upgrades, so I would like to know what kind of build was included per project, since you mentioned in the second comment that removing the builds seemed to get the cluster back to normal.

Oh, never mind about the container logs; I didn't notice that you've attached the master node's logs.

Okay, so I skimmed through the logs, and basically at this point every component is screaming that the API is unavailable, as stated in comment #1:

2021-01-13T22:40:14.824620481+00:00 stderr F F0113 22:40:14.824578 34446 cmd.go:106] Failed to initialize sdn: failed to initialize SDN: could not get ClusterNetwork resource: Get "https://api-int.upgradecluster.perfscale.devcluster.openshift.com:6443/apis/network.openshift.io/v1/clusternetworks/default": dial tcp 10.0.xxx.xxx:6443: i/o timeout

So we'd need the OVS flows from the nodes like Dan asked above, and I'll try to get help from the OVS team to find the bottleneck for the load.
There is an issue with network flows in 4.6.9: https://bugzilla.redhat.com/show_bug.cgi?id=1914284. We are re-running the test on a cluster built with the 4.6.12 payload, which has the fix, and will try to upgrade to 4.7.0.fc.3. We will grab the OVS flows information this time if/when we see the packet drops. Thanks. |
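For the "ovs-dpctl show" check requested earlier in this thread, a minimal Python sketch of how two snapshots of that output could be compared is shown below. It is only an illustration: the sample text follows the usual layout of the command's `lookups:` line, the counter values are made up rather than taken from this cluster, and the helper names `parse_lookups` and `lost_delta` are hypothetical. In that line, `missed` counts packets that matched no kernel flow and were sent to userspace as upcalls, and `lost` counts upcalls that were dropped before reaching userspace, so a growing `lost` value during the upgrade would support the packet-drop suspicion.

```python
import re

# Sample text in the layout printed by `ovs-dpctl show`.
# ASSUMPTION: the numbers are invented for illustration only.
SAMPLE = """\
system@ovs-system:
  lookups: hit:12173 missed:712 lost:3
  flows: 42
  masks: hit:12882 total:1 hit/pkt:1.00
  port 0: ovs-system (internal)
  port 1: br0 (internal)
"""

LOOKUPS_RE = re.compile(r"lookups:\s+hit:(\d+)\s+missed:(\d+)\s+lost:(\d+)")


def parse_lookups(text):
    """Return {datapath: {"hit": int, "missed": int, "lost": int}}."""
    stats = {}
    datapath = None
    for line in text.splitlines():
        # A datapath header is an unindented line ending with ':'.
        if line and not line[0].isspace() and line.rstrip().endswith(":"):
            datapath = line.rstrip()[:-1]
            continue
        match = LOOKUPS_RE.search(line)
        if match and datapath is not None:
            hit, missed, lost = map(int, match.groups())
            stats[datapath] = {"hit": hit, "missed": missed, "lost": lost}
    return stats


def lost_delta(before, after):
    """Per-datapath increase in the 'lost' counter between two snapshots."""
    return {
        dp: after[dp]["lost"] - before[dp]["lost"]
        for dp in after
        if dp in before
    }


if __name__ == "__main__":
    first = parse_lookups(SAMPLE)
    # Simulate a later snapshot in which more upcalls were lost.
    second = parse_lookups(SAMPLE.replace("lost:3", "lost:10"))
    print(lost_delta(first, second))
```

In practice the two snapshots would come from running the command inside the OVS container on the affected node at two points in time, for example before and during the master reboots, instead of from the embedded sample.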