Description of problem:

During an upgrade from RHOSP 13 to RHOSP 16.1, nova-api started to fail with the following error:

2021-10-16 10:03:30.615 27 INFO nova.api.openstack.requestlog [req-9b797aea-83ab-4a31-a83d-d4662775eb92 852a6ca63b2c4e689c8fe8b79d4380e0 f32f6c2cdbe44a009d2dfcceeac0cc7d - default default] 10.48.84.241 "GET /" status: 500 len: 0 microversion: - time: 0.000147
2021-10-16 10:03:30.615 27 ERROR nova.api.openstack [req-9b797aea-83ab-4a31-a83d-d4662775eb92 852a6ca63b2c4e689c8fe8b79d4380e0 f32f6c2cdbe44a009d2dfcceeac0cc7d - default default] Caught error: Apache/mod_wsgi request data read error: Partial results are valid but processing is incomplete.: OSError: Apache/mod_wsgi request data read error: Partial results are valid but processing is incomplete.

At the same time, errors like this start showing up on rabbitmq:

2021-10-16 03:00:14.766 [error] <0.6383.4> closing AMQP connection <0.6383.4> (10.48.84.243:39276 -> 10.48.10.28:5672 - mod_wsgi:27:79f28be8-3e29-4677-b514-c38b32659c0a): missed heartbeats from client, timeout: 60s

Additionally, all messages managed by rabbitmq show as unacknowledged.

Version-Release number of selected component (if applicable):

Errors are visible with both of the following groups of images (images were downgraded as part of troubleshooting but the error persisted):

REPOSITORY                                                   TAG        IMAGE ID       CREATED        SIZE
registry.redhat.io/rhosp-rhel8/openstack-nova-api            16.1.6-8   48b9f3c0a2ef   4 months ago   1.13 GB
registry.redhat.io/rhosp-rhel8/openstack-nova-scheduler      16.1.6-8   d896812ca795   4 months ago   1.26 GB
registry.redhat.io/rhosp-rhel8/openstack-nova-novncproxy     16.1.6-7   07bf36b8d9da   4 months ago   1.1 GB
registry.redhat.io/rhosp-rhel8/openstack-nova-conductor      16.1.6-7   c181ebb649d6   4 months ago   1.04 GB

REPOSITORY                                                   TAG                    IMAGE ID       CREATED        SIZE
registry.redhat.io/rhosp-rhel8/openstack-nova-scheduler      16.1.6-8.1627296370    0c42ef7af5f1   2 months ago   1.26 GB
registry.redhat.io/rhosp-rhel8/openstack-nova-api            16.1.6-8.1627296549    8b4e89dfab49   2 months ago   1.13 GB
registry.redhat.io/rhosp-rhel8/openstack-nova-conductor      16.1.6-7.1627296380    3d00d0d9fcee   2 months ago   1.04 GB
registry.redhat.io/rhosp-rhel8/openstack-nova-novncproxy     16.1.6-7.1627296377    9706ed17a0af   2 months ago   1.1 GB

How reproducible:

Not sure at this time, but the Red Hat consultants assigned to the customer will attempt to set up an environment to replicate the issue.

Steps to Reproduce:
1.
2.
3.

Actual results:
Errors on Nova API and rabbitmq, unacknowledged messages.

Expected results:
No errors on Nova API and rabbitmq, no unacknowledged messages.

Additional info:
Nova API traces, sosreports and other debugging information (with customer data) are attached to the case.
Hello DFG:CloudOps! At this point we're looking for an RCA for the apparent behaviour change described in comment #14. Thanks in advance!
Moving this back to Compute as a placeholder until we get confirmation that we can close this. Thanks Sean for the insight!
(In reply to Artom Lifshitz from comment #18)
> Moving this back to Compute as a placeholder until we get confirmation that
> we can close this. Thanks Sean for the insight!

Bleah, looks like we still have some questions around the support level for Ceilometer in general, and the NodesDiscoveryTripleO plugin (and where it runs, overcloud or undercloud) in particular.
Customer has been able to replicate the issue they had in production on RHOSP 16.1 by running batches of calls using the openstack cli, like the following:

$ cat parallel.sh
#!/bin/bash

source ~/overcloudrc

counter=1
total=$1

while [ $total -ge $counter ]; do
    openstack server list --all-projects &
    echo $counter
    let counter=$counter+1
done

$ while [ TRUE ] ;do ./parallel.sh 50 ;sleep 30 ;done

After some loops, the same error reported above starts showing up.

What would be the tuning recommendations? [api]max_limit, anything else worth looking at?
(In reply to Eric Nothen from comment #20)
> Customer has been able to replicate the issue they had in production on
> RHOSP 16.1 by running batches of calls using the openstack cli, like the
> following:
>
> $ cat parallel.sh
> #!/bin/bash
>
> source ~/overcloudrc
>
> counter=1
> total=$1
>
> while [ $total -ge $counter ]; do
>     openstack server list --all-projects &
>     echo $counter
>     let counter=$counter+1
> done
>
> $ while [ TRUE ] ;do ./parallel.sh 50 ;sleep 30 ;done
>
> After some loops, the same error reported above starts showing up.
>
> What would be the tuning recommendations? [api]max_limit, anything else
> worth looking at?

That's the best place to start, as it addresses the problem at the source by limiting the number of results returned.

Another potential avenue is the Timeout value in /var/lib/config-data/nova/etc/httpd/conf/httpd.conf. They currently have it set to 60; the OSP Director default is 90. The risk in increasing that value is that the problem is just being shifted somewhere else - in other words, if Apache stops timing out, HAProxy will time out instead, for example.
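For illustration only, a minimal sketch of where those two knobs live on a standard RHOSP 16.1 controller; the values shown are just the defaults mentioned above, not tuning recommendations:

# nova.conf, [api] section: caps the number of results returned per page and
# forces clients to paginate larger result sets (upstream default: 1000)
[api]
max_limit = 1000

# /var/lib/config-data/nova/etc/httpd/conf/httpd.conf: how long Apache waits
# for a request to complete before timing out (OSP Director default: 90)
Timeout 90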
My understanding is that the customer will re-attempt the FFU in the near future. Please open a new BZ if any issues crop up; I think we can safely close this one now that we understand the root cause (NodesDiscoveryTripleO) and have a workaround ([api]max_limit).
The workaround of reducing nova [api]max_limit will not be applied by the customer. Aside from the identified API calls coming from the ceilometer compute agent, they also have a valid use case in which they need to periodically extract the full list of servers, which is roughly five times the TripleO default (1000).

Therefore, while the number of calls will be greatly reduced after disabling the ceilometer compute agent, the possibility of blocking Nova API by issuing multiple parallel calls (as mentioned in comment #20) is still there.
(In reply to Eric Nothen from comment #24)
> The workaround of reducing nova [api]max_limit will not be applied by the
> customer. Aside from the identified API calls coming from the ceilometer
> compute agent, they also have a valid use case in which they need to
> periodically extract the full list of servers, which is roughly five times
> the TripleO default (1000).

And that's still possible with [api]max_limit. That config option just forces pagination in the API results, and sets a number of results per page. So instead of making 1 request, waiting 30 seconds, and getting 10000 results back, you would make 10 requests, wait 3 seconds for each, and get 1000 results back in every one (the numbers are completely made up and arbitrary). Whatever client they're using needs to be set up to handle pagination, but the API would return a list of results, and a marker to send with the next request to indicate "start listing results from this one".

> Therefore, while the number of calls will be greatly reduced after disabling
> the ceilometer compute agent, the possibility of blocking Nova API by
> issuing multiple parallel calls (as mentioned in comment #20) is still there.
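To illustrate the marker mechanism, here is a minimal client-side pagination sketch against the servers API. The TOKEN and NOVA_ENDPOINT variables, the use of jq, and the page size are assumptions for the example, not part of the customer's setup:

# Sketch: walk the full server list one page at a time, using the id of the
# last server on each page as the marker for the next request.
page_size=1000
marker=""
while true; do
    url="${NOVA_ENDPOINT}/servers/detail?limit=${page_size}"
    [ -n "$marker" ] && url="${url}&marker=${marker}"
    page=$(curl -s -H "X-Auth-Token: ${TOKEN}" "$url")
    echo "$page" | jq -r '.servers[].name'
    count=$(echo "$page" | jq '.servers | length')
    # fewer results than the page size means this was the last page
    [ "$count" -lt "$page_size" ] && break
    # next request starts after the last server returned on this page
    marker=$(echo "$page" | jq -r '.servers[-1].id')
done

Each request stays small, so no single call holds the API worker long enough to hit the Apache or HAProxy timeouts discussed above.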
(In reply to Artom Lifshitz from comment #25)
> And that's still possible with [api]max_limit. That config option just
> forces pagination in the API results, and sets a number of results per page.
> So instead of making 1 request, waiting 30 seconds, and getting 10000
> results back, you would make 10 requests, wait 3 seconds for each, and get
> 1000 results back in every one (the numbers are completely made up and
> arbitrary). Whatever client they're using needs to be set up to handle
> pagination, but the API would return a list of results, and a marker to
> send with the next request to indicate "start listing results from this one".

Thank you for the tip Artom. I've checked the Nova API reference and found that they can test the paginated results without changing max_limit on the service to begin with, by using "limit" and "marker" [1] on their periodic query for all servers. I have passed this information to the customer.

[1] https://docs.openstack.org/api-ref/compute/?expanded=list-servers-detailed-detail#list-servers-detailed
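As a rough sketch of what that client-side test could look like with the openstack CLI (the page size is illustrative and the marker value is a placeholder, assuming the client's --limit and --marker options, which map to the query parameters in the API reference above):

$ # first page of results
$ openstack server list --all-projects --limit 1000
$ # next page, starting after the last server returned by the previous call
$ openstack server list --all-projects --limit 1000 --marker <id-of-last-server-from-previous-page>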