Bug 1613200
| Summary: | [Scale][HA] Unable to Spawn all the 500 VMs due to PortStatus not getting updated to ACTIVE for certain VMs | | |
|---|---|---|---|
| Product: | Red Hat OpenStack | Reporter: | Sridhar Gaddam <sgaddam> |
| Component: | opendaylight | Assignee: | Josh Hershberg <jhershbe> |
| Status: | CLOSED INSUFFICIENT_DATA | QA Contact: | Noam Manos <nmanos> |
| Severity: | high | Docs Contact: | |
| Priority: | high | | |
| Version: | 13.0 (Queens) | CC: | aadam, jhershbe, mkolesni, mpeterso, nyechiel, smalleni |
| Target Milestone: | z4 | Keywords: | Triaged, ZStream |
| Target Release: | 13.0 (Queens) | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | HA | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2018-10-07 06:59:01 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Sridhar Gaddam
2018-08-07 08:38:36 UTC
Some additional notes: During the test-run JAVA_HEAP was also tweaked to 8GB (to avoid OOM), so there was no OOM during the entire test-run.
Created attachment 1473924 [details]
neutron-controller-0-iter1
Created attachment 1473925 [details]
opendaylight-controller-0-iter1
Created attachment 1473926 [details]
neutron-controller-1-iter1
Created attachment 1473927 [details]
opendaylight-controller-1-iter1
Created attachment 1473929 [details]
neutron-controller-2-iter1
Created attachment 1473930 [details]
overcloud-controller-2-iter1
I spent some time analyzing this, and I can say with confidence that there is no reason to assume there is a problem with the port-status update mechanism. Here is the breakdown of what I found:
* The neutron logs indicate there are 1913 ports in total.
* Of those, only 401 ports never transition to active.
* Of those, only 260 are VM ports.
* Of those, 248 ports have no log line indicating that genius received the openflow port-status event and created the InterfaceState. This usually means that the VM port was not attached to the switch.
* A further 6 ports have a span of more than 5 minutes between when the neutron port shows up in the karaf log and when the smac and dmac flows are programmed (port status is set to ACTIVE immediately following that). Five minutes is the amount of time nova waits for a VM's port to go active, so nova gives up waiting and sets the VM to error state.
* The remaining 6 ports never get smac and dmac flows configured; why not requires some research.
That covers all the failed ports: 248 + 6 + 6 = 260 VM ports. More research is needed to determine why all these failures are happening. In all honesty, it seems that the system was simply very overloaded: all queues were backed up, and it was generally hosed.

I am not sure this is a bug; it could just be too far beyond our performance capabilities.

If OVS can support it, then ODL should as well. Do we need to increase the memory or CPU cores?

After consultation with Sridhar, we agreed to close this one. It was opened initially because there was a suspicion that it was somehow related to port status. At this point it is clear that this is not a specific bug but rather cluster overload (e.g., ask timeouts). These issues are already being worked on in various forums.
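
For anyone repeating the per-port breakdown above on another run, the three failure buckets can be sketched roughly as follows. This is only an illustration, not the script used for the analysis: the `PortTimeline` record, its field names, and the bucket labels are hypothetical, and it assumes the relevant timestamps have already been extracted from the neutron and karaf logs by some other means. The 300-second threshold is taken from the five-minute nova wait mentioned in the analysis.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Iterable, Optional

# Five minutes, per the analysis: how long nova waits for a port to go ACTIVE.
NOVA_WAIT_SECONDS = 300.0


@dataclass
class PortTimeline:
    """Hypothetical per-port record built from already-parsed log timestamps."""
    port_id: str
    seen_in_karaf: float                      # when the neutron port first shows up in karaf
    interface_state_created: Optional[float]  # None: no port-status / InterfaceState log line
    flows_programmed: Optional[float]         # None: smac/dmac flows never programmed


def classify(port: PortTimeline) -> str:
    """Assign a port to one of the buckets used in the breakdown."""
    if port.interface_state_created is None:
        # Usually means the VM port was never attached to the switch.
        return "no_interface_state"
    if port.flows_programmed is None:
        return "flows_never_programmed"
    if port.flows_programmed - port.seen_in_karaf > NOVA_WAIT_SECONDS:
        # Flows arrived, but after nova had already given up on the VM.
        return "flows_too_late"
    return "ok"


def tally(ports: Iterable[PortTimeline]) -> Counter:
    """Count how many ports fall into each bucket."""
    return Counter(classify(p) for p in ports)


if __name__ == "__main__":
    # Made-up sample data, purely to exercise the function.
    sample = [
        PortTimeline("port-a", 0.0, None, None),
        PortTimeline("port-b", 0.0, 10.0, 400.0),
        PortTimeline("port-c", 0.0, 10.0, None),
        PortTimeline("port-d", 0.0, 5.0, 20.0),
    ]
    print(tally(sample))
```

Applied to the 260 failed VM ports from this run, the counts per bucket would be expected to match the 248 / 6 / 6 split reported above, provided the timestamps are extracted the same way.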