Bug 2015141
| Summary: | nova-api shows Apache/mod_wsgi request data read error | | |
|---|---|---|---|
| Product: | Red Hat OpenStack | Reporter: | Eric Nothen <enothen> |
| Component: | openstack-ceilometer | Assignee: | Matthias Runge <mrunge> |
| Status: | CLOSED NOTABUG | QA Contact: | Leonid Natapov <lnatapov> |
| Severity: | high | Docs Contact: | |
| Priority: | high | | |
| Version: | 16.1 (Train) | CC: | alifshit, apevec, bdobreli, dasmith, dhill, dparkes, eglynn, fgarciad, gkadam, jhakimra, kchamart, kurathod, morazi, sbauza, sgordon, smooney, vromanso |
| Target Milestone: | --- | | |
| Target Release: | --- | | |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2021-11-09 14:57:50 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Eric Nothen
2021-10-18 13:52:04 UTC
Hello DFG:CloudOps! At this point we're looking for an RCA to the apparent behaviour change described in comment #14. Thanks in advance!

Moving this back to Compute as a placeholder until we get confirmation that we can close this. Thanks Sean for the insight!

(In reply to Artom Lifshitz from comment #18)
> Moving this back to Compute as a placeholder until we get confirmation that
> we can close this. Thanks Sean for the insight!

Bleah, looks like we still have some questions around the support level for Ceilometer in general, and the NodesDiscoveryTripleO plugin (and where it runs, overcloud or undercloud) in particular.

The customer has been able to replicate the issue they had in production on RHOSP 16.1 by running batches of calls using the openstack CLI, like the following:

```shell
$ cat parallel.sh
#!/bin/bash

source ~/overcloudrc

counter=1
total=$1

while [ $total -ge $counter ]; do
    openstack server list --all-projects &
    echo $counter
    let counter=$counter+1
done

$ while [ TRUE ]; do ./parallel.sh 50; sleep 30; done
```

After some loops, the same error reported above starts showing up. What would be the tuning recommendations? [api]max_limit, anything else worth looking at?

(In reply to Eric Nothen from comment #20)
> After some loops, the same error reported above starts showing up.
>
> What would be the tuning recommendations? [api]max_limit, anything else
> worth looking at?

That's the best place to start, as it addresses the problem at the source by limiting the number of results returned.
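For reference, the option discussed here lives in the `[api]` section of nova.conf. A minimal sketch of what the workaround would touch (the value shown is the default, not a tuned recommendation):

```ini
[api]
# Maximum number of items a single list request may return; results beyond
# this force clients to paginate with limit/marker. 1000 is the default;
# the workaround discussed in this bug would lower it.
max_limit = 1000
```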
Another potential avenue is the Timeout value in /var/lib/config-data/nova/etc/httpd/conf/httpd.conf. They currently have it set to 60; the OSP Director default is 90. The risk in increasing that value is that the problem is just shifted somewhere else: if Apache stops timing out, HAProxy will time out instead, for example.

My understanding is that the customer will re-attempt the FFU in the near future. Please open a new BZ if any issues crop up; I think we can safely close this one now that we understand the root cause (NodesDiscoveryTripleO) and have a workaround ([api]max_limit).

The workaround of reducing nova [api]max_limit will not be applied by the customer. Aside from the identified API calls coming from the ceilometer compute agent, they also have a valid use case in which they need to periodically extract the full list of servers, which is roughly five times the TripleO default (1000). Therefore, while the number of calls will be greatly reduced after disabling the ceilometer compute agent, the possibility of blocking the Nova API by issuing multiple parallel calls (as mentioned in comment #20) is still there.

(In reply to Eric Nothen from comment #24)
> The workaround of reducing nova [api]max_limit will not be applied by the
> customer. Aside from the identified API calls coming from the ceilometer
> compute agent, they also have a valid use case in which they need to
> periodically extract the full list of servers, which is roughly five times
> the TripleO default (1000).

And that's still possible with [api]max_limit. That config option just forces pagination of the API results, and sets the number of results per page. So instead of making 1 request, waiting 30 seconds, and getting 10000 results back, you would make 10 requests, wait 3 seconds for each, and get 1000 results back from every one (the numbers are completely made up and arbitrary).
Whatever client they're using needs to be set up to handle pagination, but the API would return a list of results, and a marker to send with the next request to indicate "start listing results from this one".

> Therefore, while the number of calls will be greatly reduced after disabling
> the ceilometer compute agent, the possibility of blocking the Nova API by
> issuing multiple parallel calls (as mentioned in comment #20) is still there.

(In reply to Artom Lifshitz from comment #25)
> That config option just forces pagination of the API results, and sets the
> number of results per page. [...] Whatever client they're using needs to be
> set up to handle pagination, but the API would return a list of results, and
> a marker to send with the next request to indicate "start listing results
> from this one".

Thank you for the tip, Artom. I've checked the Nova API reference and found that they can test the paginated results without even changing max_limit on the service, by using "limit" and "marker" [1] on their periodic query for all servers. I have passed this information to the customer.

[1] https://docs.openstack.org/api-ref/compute/?expanded=list-servers-detailed-detail#list-servers-detailed
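The limit/marker flow described above boils down to a loop: request a page, remember the ID of the last item, and pass it back as the marker for the next request. The sketch below is only an illustration: the `fetch_page` function and the `vm-*` IDs are made-up stand-ins for a real call such as `openstack server list --all-projects --limit "$limit" --marker "$marker"`.

```shell
#!/bin/bash
# Marker-based pagination sketch (hypothetical data, no real API calls).

# Stub standing in for the Nova API / openstack CLI: prints up to $1 IDs,
# starting after the ID given as $2 (an empty marker means the first page).
ALL_IDS=(vm-1 vm-2 vm-3 vm-4 vm-5 vm-6 vm-7)

fetch_page() {
    local limit=$1 marker=$2 start=0 i
    if [ -n "$marker" ]; then
        for i in "${!ALL_IDS[@]}"; do
            [ "${ALL_IDS[$i]}" = "$marker" ] && start=$((i + 1))
        done
    fi
    printf '%s\n' "${ALL_IDS[@]:$start:$limit}"
}

limit=3       # stands in for [api]max_limit / the "limit" query parameter
marker=""
collected=()
while :; do
    page=$(fetch_page "$limit" "$marker")
    [ -z "$page" ] && break
    while read -r id; do collected+=("$id"); done <<< "$page"
    marker="${collected[${#collected[@]}-1]}"   # last ID becomes the next marker
done

echo "fetched ${#collected[@]} servers in pages of up to $limit"
# prints: fetched 7 servers in pages of up to 3
```

The loop ends when a page comes back empty, which is also how a client would detect the last page from the real API.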