Description of problem:

During an upgrade from RHOSP 13 to RHOSP 16.1, nova-api started to fail with the following error:

2021-10-16 10:03:30.615 27 INFO nova.api.openstack.requestlog [req-9b797aea-83ab-4a31-a83d-d4662775eb92 852a6ca63b2c4e689c8fe8b79d4380e0 f32f6c2cdbe44a009d2dfcceeac0cc7d - default default] 10.48.84.241 "GET /" status: 500 len: 0 microversion: - time: 0.000147
2021-10-16 10:03:30.615 27 ERROR nova.api.openstack [req-9b797aea-83ab-4a31-a83d-d4662775eb92 852a6ca63b2c4e689c8fe8b79d4380e0 f32f6c2cdbe44a009d2dfcceeac0cc7d - default default] Caught error: Apache/mod_wsgi request data read error: Partial results are valid but processing is incomplete.: OSError: Apache/mod_wsgi request data read error: Partial results are valid but processing is incomplete.

At the same time, errors like this start showing up on rabbitmq:

2021-10-16 03:00:14.766 [error] <0.6383.4> closing AMQP connection <0.6383.4> (10.48.84.243:39276 -> 10.48.10.28:5672 - mod_wsgi:27:79f28be8-3e29-4677-b514-c38b32659c0a): missed heartbeats from client, timeout: 60s

Additionally, all messages managed by rabbitmq show as unacknowledged.

Version-Release number of selected component (if applicable):

Errors are visible with both of the following groups of images (images were downgraded as part of troubleshooting but the error persisted):

REPOSITORY                                                   TAG        IMAGE ID       CREATED        SIZE
registry.redhat.io/rhosp-rhel8/openstack-nova-api            16.1.6-8   48b9f3c0a2ef   4 months ago   1.13 GB
registry.redhat.io/rhosp-rhel8/openstack-nova-scheduler      16.1.6-8   d896812ca795   4 months ago   1.26 GB
registry.redhat.io/rhosp-rhel8/openstack-nova-novncproxy     16.1.6-7   07bf36b8d9da   4 months ago   1.1 GB
registry.redhat.io/rhosp-rhel8/openstack-nova-conductor      16.1.6-7   c181ebb649d6   4 months ago   1.04 GB

REPOSITORY                                                   TAG                    IMAGE ID       CREATED        SIZE
registry.redhat.io/rhosp-rhel8/openstack-nova-scheduler      16.1.6-8.1627296370    0c42ef7af5f1   2 months ago   1.26 GB
registry.redhat.io/rhosp-rhel8/openstack-nova-api            16.1.6-8.1627296549    8b4e89dfab49   2 months ago   1.13 GB
registry.redhat.io/rhosp-rhel8/openstack-nova-conductor      16.1.6-7.1627296380    3d00d0d9fcee   2 months ago   1.04 GB
registry.redhat.io/rhosp-rhel8/openstack-nova-novncproxy     16.1.6-7.1627296377    9706ed17a0af   2 months ago   1.1 GB

How reproducible:

Not sure at this time, but the Red Hat consultants assigned to the customer will attempt to set up an environment to replicate the issue.

Steps to Reproduce:
1.
2.
3.

Actual results:
Errors on Nova API and rabbitmq, unacknowledged messages.

Expected results:
No errors on Nova API and rabbitmq, no unacknowledged messages.

Additional info:
Nova API traces, sosreports and other debugging information (with customer data) are attached to the case.
Hello DFG:CloudOps! At this point we're looking for an RCA for the apparent behaviour change described in comment #14. Thanks in advance!
Moving this back to Compute as a placeholder until we get confirmation that we can close this. Thanks Sean for the insight!
(In reply to Artom Lifshitz from comment #18)
> Moving this back to Compute as a placeholder until we get confirmation that
> we can close this. Thanks Sean for the insight!

Bleah, looks like we still have some questions around the support level for Ceilometer in general, and the NodesDiscoveryTripleO plugin (and where it runs, overcloud or undercloud) in particular.
Customer has been able to replicate the issue they had in production on RHOSP 16.1 by running batches of calls using the openstack cli, like the following:

$ cat parallel.sh
#!/bin/bash

source ~/overcloudrc

counter=1
total=$1

while [ $total -ge $counter ]; do
    openstack server list --all-projects &
    echo $counter
    let counter=$counter+1
done

$ while [ TRUE ] ;do ./parallel.sh 50 ;sleep 30 ;done

After some loops, the same error reported above starts showing up.

What would be the tuning recommendations? [api]max_limit, anything else worth looking at?
(In reply to Eric Nothen from comment #20)
> Customer has been able to replicate the issue they had in production on
> RHOSP 16.1 by running batches of calls using the openstack cli, like the
> following:
>
> $ cat parallel.sh
> #!/bin/bash
>
> source ~/overcloudrc
>
> counter=1
> total=$1
>
> while [ $total -ge $counter ]; do
>     openstack server list --all-projects &
>     echo $counter
>     let counter=$counter+1
> done
>
> $ while [ TRUE ] ;do ./parallel.sh 50 ;sleep 30 ;done
>
> After some loops, the same error reported above starts showing up.
>
> What would be the tuning recommendations? [api]max_limit, anything else
> worth looking at?

That's the best place to start, as it addresses the problem at the source by limiting the number of results returned.

Another potential avenue is the Timeout value in /var/lib/config-data/nova/etc/httpd/conf/httpd.conf. They currently have it set to 60; the OSP Director default is 90. The risk in increasing that value is that the problem is just being shifted somewhere else - in other words, if Apache stops timing out, HAProxy will time out instead, for example.
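For illustration only, a minimal sketch of where those two knobs live on a standard RHOSP 16.1 controller; the values shown are just the defaults mentioned above, not tuning recommendations:

# nova.conf, [api] section: caps the number of results returned per page and
# forces clients to paginate larger result sets (upstream default: 1000)
[api]
max_limit = 1000

# /var/lib/config-data/nova/etc/httpd/conf/httpd.conf: how long Apache waits
# for a request to complete before timing out (OSP Director default: 90)
Timeout 90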
My understanding is that the customer will re-attempt the FFU in the near future. Please open a new BZ if any issues crop up; I think we can safely close this one now that we understand the root cause (NodesDiscoveryTripleO) and have a workaround ([api]max_limit).
The workaround of reducing nova [api]max_limit will not be applied by the customer. Aside from the identified API calls coming from the ceilometer compute agent, they also have a valid use case in which they need to periodically extract the full list of servers, which is roughly five times the TripleO default (1000).

Therefore, while the number of calls will be greatly reduced after disabling the ceilometer compute agent, the possibility of blocking Nova API by issuing multiple parallel calls (as mentioned in comment #20) is still there.
(In reply to Eric Nothen from comment #24)
> The workaround of reducing nova [api]max_limit will not be applied by the
> customer. Aside from the identified API calls coming from the ceilometer
> compute agent, they also have a valid use case in which they need to
> periodically extract the full list of servers, which is roughly five times
> the TripleO default (1000).

And that's still possible with [api]max_limit. That config option just forces pagination in the API results, and sets a number of results per page. So instead of making 1 request, waiting 30 seconds, and getting 10000 results back, you would make 10 requests, wait 3 seconds for each, and get 1000 results back in every one (the numbers are completely made up and arbitrary). Whatever client they're using needs to be set up to handle pagination, but the API would return a list of results, and a marker to send with the next request to indicate "start listing results from this one".

> Therefore, while the number of calls will be greatly reduced after disabling
> the ceilometer compute agent, the possibility of blocking Nova API by
> issuing multiple parallel calls (as mentioned in comment #20) is still there.
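To illustrate the marker mechanism, here is a minimal client-side pagination sketch against the servers API. The TOKEN and NOVA_ENDPOINT variables, the use of jq, and the page size are assumptions for the example, not part of the customer's setup:

# Sketch: walk the full server list one page at a time, using the id of the
# last server on each page as the marker for the next request.
page_size=1000
marker=""
while true; do
    url="${NOVA_ENDPOINT}/servers/detail?limit=${page_size}"
    [ -n "$marker" ] && url="${url}&marker=${marker}"
    page=$(curl -s -H "X-Auth-Token: ${TOKEN}" "$url")
    echo "$page" | jq -r '.servers[].name'
    count=$(echo "$page" | jq '.servers | length')
    # fewer results than the page size means this was the last page
    [ "$count" -lt "$page_size" ] && break
    # next request starts after the last server returned on this page
    marker=$(echo "$page" | jq -r '.servers[-1].id')
done

Each request stays small, so no single call holds the API worker long enough to hit the Apache or HAProxy timeouts discussed above.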
(In reply to Artom Lifshitz from comment #25)
> And that's still possible with [api]max_limit. That config option just
> forces pagination in the API results, and sets a number of results per page.
> So instead of making 1 request, waiting 30 seconds, and getting 10000
> results back, you would make 10 requests, wait 3 seconds for each, and get
> 1000 results back in every one (the numbers are completely made up and
> arbitrary). Whatever client they're using needs to be set up to handle
> pagination, but the API would return a list of results, and a marker to
> send with the next request to indicate "start listing results from this one".

Thank you for the tip Artom. I've checked the Nova API reference and found that they can test the paginated results without changing max_limit on the service to begin with, by using "limit" and "marker" [1] on their periodic query for all servers. I have passed this information to the customer.

[1] https://docs.openstack.org/api-ref/compute/?expanded=list-servers-detailed-detail#list-servers-detailed
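As a rough sketch of what that client-side test could look like with the openstack CLI (the page size is illustrative and the marker value is a placeholder, assuming the client's --limit and --marker options, which map to the query parameters in the API reference above):

$ # first page of results
$ openstack server list --all-projects --limit 1000
$ # next page, starting after the last server returned by the previous call
$ openstack server list --all-projects --limit 1000 --marker <id-of-last-server-from-previous-page>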