Bug 1613995 - Ironic api service returns http 500 errors
Summary: Ironic api service returns http 500 errors
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: puppet-ironic
Version: 13.0 (Queens)
Hardware: Unspecified
OS: Linux
medium
urgent
Target Milestone: z3
: 13.0 (Queens)
Assignee: Julia Kreger
QA Contact: mlammon
URL:
Whiteboard:
Depends On:
Blocks: epmosp13bugs 1588541
TreeView+ depends on / blocked
 
Reported: 2018-08-08 18:48 UTC by Chris Dearborn
Modified: 2018-11-13 22:29 UTC (History)
21 users (show)

Fixed In Version: puppet-ironic-12.4.0-3.el7ost
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-11-13 22:28:18 UTC
Target Upstream Version:
Embargoed:
lmarsh: needinfo-
lmarsh: needinfo-


Attachments (Terms of Use)


Links
System ID Private Priority Status Summary Last Updated
OpenStack gerrit 591493 0 'None' MERGED Correct default number of threads 2020-12-15 00:19:21 UTC
OpenStack gerrit 596776 0 'None' MERGED Correct default number of threads 2020-12-15 00:18:52 UTC
Red Hat Product Errata RHBA-2018:3587 0 None None None 2018-11-13 22:29:36 UTC
Storyboard 2003464 0 None None None 2018-08-16 17:49:50 UTC

Internal Links: 1648091

Description Chris Dearborn 2018-08-08 18:48:29 UTC
Description of problem:
The ironic api service intermittently returns http 500 errors during RAID configuration using the iDRAC driver.

Version-Release number of selected component (if applicable):
OSP13

How reproducible:
Attempt to deploy the JetStream solution on 14G Dell hardware.  Note the consistent failures when RAID is configured during the assign_role script execution.

Steps to Reproduce:
1. See above.

Actual results:
The ironic api service intermittently returns http 500 errors.

Expected results:
The ironic api service should not return http 500 errors.

Additional info:
When attempting to set the target RAID configuration on a node an http 500 error is sometimes returned from the ironic api service:
Traceback (most recent call last):
  File "./assign_role.py", line 1509, in main
    drac_client)
  File "./assign_role.py", line 782, in configure_raid
    ironic_client.node.set_target_raid_config(node_uuid, target_raid_config)
  File "/usr/lib/python2.7/site-packages/ironicclient/v1/node.py", line 474, in set_target_raid_config
    return self.update(path, target_raid_config, http_method='PUT')
  File "/usr/lib/python2.7/site-packages/ironicclient/v1/node.py", line 336, in update
    method=http_method)
  File "/usr/lib/python2.7/site-packages/ironicclient/common/base.py", line 200, in _update
    resp, body = self.api.json_request(method, url, **kwargs)
  File "/usr/lib/python2.7/site-packages/ironicclient/common/http.py", line 645, in json_request
    resp = self._http_request(url, method, **kwargs)
  File "/usr/lib/python2.7/site-packages/ironicclient/common/http.py", line 247, in wrapper
    return func(self, url, method, **kwargs)
  File "/usr/lib/python2.7/site-packages/ironicclient/common/http.py", line 627, in _http_request
    error_json.get('debuginfo'), method, url)
InternalServerError: Internal Server Error (HTTP 500)


At close to the same time, the ironic api service returns another http 500 error to a different client thread.  This occurs when calling list_unfinished_jobs through vendor passthru:
Traceback (most recent call last):
  File "./assign_role.py", line 1509, in main
    drac_client)
  File "./assign_role.py", line 757, in configure_raid
    node_uuid, ironic_client, drac_client, 'JBOD')
  File "./assign_role.py", line 1295, in change_physical_disk_state_wait
    JobHelper.wait_for_job_completions(ironic_client, node_uuid)
  File "/home/osp_admin/pilot/job_helper.py", line 32, in wait_for_job_completions
    http_method='GET').unfinished_jobs:
  File "/usr/lib/python2.7/site-packages/ironicclient/v1/node.py", line 362, in vendor_passthru
    return self.get(path)
  File "/usr/lib/python2.7/site-packages/ironicclient/v1/node.py", line 313, in get
    return self._get(resource_id=node_id, fields=fields)
  File "/usr/lib/python2.7/site-packages/ironicclient/common/base.py", line 88, in _get
    return self._list(self._path(resource_id))[0]
  File "/usr/lib/python2.7/site-packages/ironicclient/common/base.py", line 182, in _list
    data = self.__list(url, response_key=response_key, body=body)
  File "/usr/lib/python2.7/site-packages/ironicclient/common/base.py", line 174, in __list
    resp, body = self.api.json_request('GET', url)
  File "/usr/lib/python2.7/site-packages/ironicclient/common/http.py", line 645, in json_request
    resp = self._http_request(url, method, **kwargs)
  File "/usr/lib/python2.7/site-packages/ironicclient/common/http.py", line 247, in wrapper
    return func(self, url, method, **kwargs)
  File "/usr/lib/python2.7/site-packages/ironicclient/common/http.py", line 627, in _http_request
    error_json.get('debuginfo'), method, url)
InternalServerError: Internal Server Error (HTTP 500)


/var/log/httpd/ironic_wsgi_error.log shows the following error:
[Tue Aug 07 21:56:20.014913 2018] [:error] [pid 23514] [remote 192.168.120.10:184] mod_wsgi (pid=23514): Exception occurred processing WSGI script '/var/www/cgi-bin/ironic/app'.
[Tue Aug 07 21:56:20.014953 2018] [:error] [pid 23514] [remote 192.168.120.10:184] Traceback (most recent call last):
[Tue Aug 07 21:56:20.014967 2018] [:error] [pid 23514] [remote 192.168.120.10:184]   File "/usr/lib/python2.7/site-packages/ironic/api/app.py", line 126, in __call__
[Tue Aug 07 21:56:20.015059 2018] [:error] [pid 23514] [remote 192.168.120.10:184]     return self.v1(environ, start_response)
[Tue Aug 07 21:56:20.015067 2018] [:error] [pid 23514] [remote 192.168.120.10:184]   File "/usr/lib/python2.7/site-packages/ironic/api/middleware/json_ext.py", line 43, in __call__
[Tue Aug 07 21:56:20.015105 2018] [:error] [pid 23514] [remote 192.168.120.10:184]     return self.app(env, start_response)
[Tue Aug 07 21:56:20.015111 2018] [:error] [pid 23514] [remote 192.168.120.10:184]   File "/usr/lib/python2.7/site-packages/webob/dec.py", line 131, in __call__
[Tue Aug 07 21:56:20.015200 2018] [:error] [pid 23514] [remote 192.168.120.10:184]     resp = self.call_func(req, *args, **self.kwargs)
[Tue Aug 07 21:56:20.015207 2018] [:error] [pid 23514] [remote 192.168.120.10:184]   File "/usr/lib/python2.7/site-packages/webob/dec.py", line 196, in call_func
[Tue Aug 07 21:56:20.015217 2018] [:error] [pid 23514] [remote 192.168.120.10:184]     return self.func(req, *args, **kwargs)
[Tue Aug 07 21:56:20.015222 2018] [:error] [pid 23514] [remote 192.168.120.10:184]   File "/usr/lib/python2.7/site-packages/oslo_middleware/base.py", line 131, in __call__
[Tue Aug 07 21:56:20.015291 2018] [:error] [pid 23514] [remote 192.168.120.10:184]     response = req.get_response(self.application)
[Tue Aug 07 21:56:20.015308 2018] [:error] [pid 23514] [remote 192.168.120.10:184]   File "/usr/lib/python2.7/site-packages/webob/request.py", line 1316, in send
[Tue Aug 07 21:56:20.015619 2018] [:error] [pid 23514] [remote 192.168.120.10:184]     application, catch_exc_info=False)
[Tue Aug 07 21:56:20.015628 2018] [:error] [pid 23514] [remote 192.168.120.10:184]   File "/usr/lib/python2.7/site-packages/webob/request.py", line 1280, in call_application
[Tue Aug 07 21:56:20.015639 2018] [:error] [pid 23514] [remote 192.168.120.10:184]     app_iter = application(self.environ, start_response)
[Tue Aug 07 21:56:20.015645 2018] [:error] [pid 23514] [remote 192.168.120.10:184]   File "/usr/lib/python2.7/site-packages/ironic/api/middleware/auth_token.py", line 59, in __call__
[Tue Aug 07 21:56:20.015681 2018] [:error] [pid 23514] [remote 192.168.120.10:184]     return super(AuthTokenMiddleware, self).__call__(env, start_response)
[Tue Aug 07 21:56:20.015687 2018] [:error] [pid 23514] [remote 192.168.120.10:184]   File "/usr/lib/python2.7/site-packages/webob/dec.py", line 131, in __call__
[Tue Aug 07 21:56:20.015696 2018] [:error] [pid 23514] [remote 192.168.120.10:184]     resp = self.call_func(req, *args, **self.kwargs)
[Tue Aug 07 21:56:20.015717 2018] [:error] [pid 23514] [remote 192.168.120.10:184]   File "/usr/lib/python2.7/site-packages/webob/dec.py", line 196, in call_func
[Tue Aug 07 21:56:20.015731 2018] [:error] [pid 23514] [remote 192.168.120.10:184]     return self.func(req, *args, **kwargs)
[Tue Aug 07 21:56:20.015735 2018] [:error] [pid 23514] [remote 192.168.120.10:184]   File "/usr/lib/python2.7/site-packages/keystonemiddleware/auth_token/__init__.py", line 337, in __call__
[Tue Aug 07 21:56:20.015911 2018] [:error] [pid 23514] [remote 192.168.120.10:184]     response = req.get_response(self._app)
[Tue Aug 07 21:56:20.015918 2018] [:error] [pid 23514] [remote 192.168.120.10:184]   File "/usr/lib/python2.7/site-packages/webob/request.py", line 1316, in send
[Tue Aug 07 21:56:20.015928 2018] [:error] [pid 23514] [remote 192.168.120.10:184]     application, catch_exc_info=False)
[Tue Aug 07 21:56:20.015932 2018] [:error] [pid 23514] [remote 192.168.120.10:184]   File "/usr/lib/python2.7/site-packages/webob/request.py", line 1280, in call_application
[Tue Aug 07 21:56:20.015940 2018] [:error] [pid 23514] [remote 192.168.120.10:184]     app_iter = application(self.environ, start_response)
[Tue Aug 07 21:56:20.015944 2018] [:error] [pid 23514] [remote 192.168.120.10:184]   File "/usr/lib/python2.7/site-packages/pecan/middleware/recursive.py", line 56, in __call__
[Tue Aug 07 21:56:20.016016 2018] [:error] [pid 23514] [remote 192.168.120.10:184]     return self.application(environ, start_response)
[Tue Aug 07 21:56:20.016022 2018] [:error] [pid 23514] [remote 192.168.120.10:184]   File "/usr/lib/python2.7/site-packages/ironic/api/middleware/parsable_error.py", line 67, in __call__
[Tue Aug 07 21:56:20.016061 2018] [:error] [pid 23514] [remote 192.168.120.10:184]     app_iter = self.app(environ, replacement_start_response)
[Tue Aug 07 21:56:20.016066 2018] [:error] [pid 23514] [remote 192.168.120.10:184]   File "/usr/lib/python2.7/site-packages/pecan/core.py", line 835, in __call__
[Tue Aug 07 21:56:20.016222 2018] [:error] [pid 23514] [remote 192.168.120.10:184]     return super(Pecan, self).__call__(environ, start_response)
[Tue Aug 07 21:56:20.016230 2018] [:error] [pid 23514] [remote 192.168.120.10:184]   File "/usr/lib/python2.7/site-packages/pecan/core.py", line 677, in __call__
[Tue Aug 07 21:56:20.016240 2018] [:error] [pid 23514] [remote 192.168.120.10:184]     controller, args, kwargs = self.find_controller(state)
[Tue Aug 07 21:56:20.016244 2018] [:error] [pid 23514] [remote 192.168.120.10:184]   File "/usr/lib/python2.7/site-packages/pecan/core.py", line 853, in find_controller
[Tue Aug 07 21:56:20.016252 2018] [:error] [pid 23514] [remote 192.168.120.10:184]     controller, args, kw = super(Pecan, self).find_controller(_state)
[Tue Aug 07 21:56:20.016256 2018] [:error] [pid 23514] [remote 192.168.120.10:184]   File "/usr/lib/python2.7/site-packages/pecan/core.py", line 526, in find_controller
[Tue Aug 07 21:56:20.016263 2018] [:error] [pid 23514] [remote 192.168.120.10:184]     if not isinstance(req.json, dict):
[Tue Aug 07 21:56:20.016267 2018] [:error] [pid 23514] [remote 192.168.120.10:184]   File "/usr/lib/python2.7/site-packages/pecan/core.py", line 52, in __getattribute__
[Tue Aug 07 21:56:20.016274 2018] [:error] [pid 23514] [remote 192.168.120.10:184]     return WebObRequest.__getattribute__(self, name)
[Tue Aug 07 21:56:20.016279 2018] [:error] [pid 23514] [remote 192.168.120.10:184]   File "/usr/lib/python2.7/site-packages/webob/request.py", line 724, in _json_body__get
[Tue Aug 07 21:56:20.016286 2018] [:error] [pid 23514] [remote 192.168.120.10:184]     return json.loads(self.body.decode(self.charset))
[Tue Aug 07 21:56:20.016290 2018] [:error] [pid 23514] [remote 192.168.120.10:184]   File "/usr/lib/python2.7/site-packages/pecan/core.py", line 52, in __getattribute__
[Tue Aug 07 21:56:20.016297 2018] [:error] [pid 23514] [remote 192.168.120.10:184]     return WebObRequest.__getattribute__(self, name)
[Tue Aug 07 21:56:20.016301 2018] [:error] [pid 23514] [remote 192.168.120.10:184]   File "/usr/lib/python2.7/site-packages/webob/request.py", line 702, in body
[Tue Aug 07 21:56:20.016312 2018] [:error] [pid 23514] [remote 192.168.120.10:184]     self.make_body_seekable() # we need this to have content_length
[Tue Aug 07 21:56:20.016317 2018] [:error] [pid 23514] [remote 192.168.120.10:184]   File "/usr/lib/python2.7/site-packages/webob/request.py", line 930, in make_body_seekable
[Tue Aug 07 21:56:20.016323 2018] [:error] [pid 23514] [remote 192.168.120.10:184]     self.copy_body()
[Tue Aug 07 21:56:20.016328 2018] [:error] [pid 23514] [remote 192.168.120.10:184]   File "/usr/lib/python2.7/site-packages/webob/request.py", line 954, in copy_body
[Tue Aug 07 21:56:20.016334 2018] [:error] [pid 23514] [remote 192.168.120.10:184]     data = input.read(min(todo, 65535))
[Tue Aug 07 21:56:20.016341 2018] [:error] [pid 23514] [remote 192.168.120.10:184]   File "/usr/lib/python2.7/site-packages/webob/request.py", line 1540, in readinto
[Tue Aug 07 21:56:20.016348 2018] [:error] [pid 23514] [remote 192.168.120.10:184]     data = self.file.read(sz0)
[Tue Aug 07 21:56:20.016362 2018] [:error] [pid 23514] [remote 192.168.120.10:184] IOError: request data read error


If I run the assign_role script to configure RAID following overcloud deployment failure against the nodes that errored out, it runs successfully with no issues.  The assign_role script works against OSP10, and no http 500 errors were returned from the ironic api service.

No errors appear in ironic-conductor.log.

Comment 1 Bob Fournier 2018-08-09 19:53:17 UTC
This was discussed upstream, along with options for turning off wsgi. The upstream irc log is here -
http://eavesdrop.openstack.org/irclogs/%23openstack-ironic/%23openstack-ironic.2018-08-07.log.html#t2018-08-07T14:05:3

To summarize from chat log:
"For turning off wsgi - you'd have to manually stop the httpd running the ironic service, and should then be able to directly invoke the ironic-api in debug mode... although I don't remember how that is assembled so it may require something like proxying through httpd"

To test:
"uwsgi has an option for dealing with taking to long to read the data on the service side, its "post-buffering = 1" in the uwsgi config, I ran into similar issues on a different project <but...> looks like its apache itself thats running the wsgi daemon not uwsgi so not sure what the equivolent config option is."

Julia thinks what he is encountering is threading issue where somehow the code is running to the point where it is touching the same basic object but that object is getting overridden and upon read the socket is thought to be closed.

Comment 2 Julia Kreger 2018-08-10 20:02:56 UTC
Working with Chris while comparing Bob's environment as well. We found that the apache wsgi config was set for 4 processes and 1 thread on Chris's environment, where as Bob's was 8 processes and 1 thread.

In review of the provided log data and exceptions, it looks as if the requests may be stacking, and upon understanding what the is actually occurring with the client, that is likely the case (at least without reviewing full access logs). We changed Chris's environment to run 8 processes with 1 thread each, thus allowing for 8 concurrent requests to be passed through apache instead of four, and an initial test run passed the point it was previously failing.

Chris is going to retry some more, and look at trying some variations and will follow back up with us. In the mean time, in development we don't set processes, which I believe results in a default of 1 per CPU, and we set threads to 10... So it seems like maybe the template we're deploying results in an overly restrictive configuration that is not able to keep up with their script.

Comment 3 arkady kanevsky 2018-08-10 20:11:58 UTC
Thank you Julia.
What is the default configuration in OSPd 13 for Apache WSGI?

Any guidance on how this setting should be done based on the number of nodes under management?

Comment 4 Chris Dearborn 2018-08-10 21:16:09 UTC
All of the WSGI services are configured the same way on the director node in OSP13: 4 processes, 1 thread.

Comment 5 Bob Fournier 2018-08-13 13:29:53 UTC
Chris,

Can you check what api_workers is set to in  /etc/ironic/ironic.conf?

On my system its 8, this is what is setting the number of processes for wsgi I believe:

# Number of workers for OpenStack Ironic API service. The
# default is equal to the number of CPUs available if that can
# be determined, else a default worker count of 1 is returned.
# (integer value)
#api_workers = <None>
api_workers=8

(undercloud) [stack@host01 ~]$ cat /etc/httpd/conf.d/10-ironic_wsgi.conf | grep proc
  WSGIDaemonProcess ironic display-name=ironic_wsgi group=ironic processes=8 threads=1 user=ironic

We should be able to override this.

Comment 6 Chris Dearborn 2018-08-13 13:39:26 UTC
Hey Bob,

I just checked, and api_workers is set to 4 in /etc/ironic/ironic.conf:

# Number of workers for OpenStack Ironic API service. The
# default is equal to the number of CPUs available if that can
# be determined, else a default worker count of 1 is returned.
# (integer value)
#api_workers = <None>
api_workers=4

Comment 7 Julia Kreger 2018-08-13 14:32:39 UTC
Hey Chris, in your Apache httpd config that launches the WSGI workers. Could you remove the part of the line that has "threads=1" or set it to something like "threads=10". And then reset "processes=4" which is the default that your install setup initially. Once done, restart httpd, and then re-test? We're hoping that you do not encounter the issue with threads being used since they would naturally be more memory friendly and be able to support a much higher concurrent request load.

-Julia

Comment 8 Chris Dearborn 2018-08-13 15:24:06 UTC
Hey Julia,

sure.  I've just started the test run with processes=4 and threads=10.

I did try processes=8, threads=1 and processes=1, threads=10 over the weekend, and both of these worked with no issues.

Comment 9 Chris Dearborn 2018-08-13 21:59:08 UTC
Just completed a test with processes=4 and threads=10, and it worked perfectly.

Comment 10 Chris Dearborn 2018-08-16 17:51:47 UTC
Link to upstream bug:
https://storyboard.openstack.org/#!/story/2003464

Comment 16 Chris Dearborn 2018-11-07 20:44:22 UTC
We have tested this on several 9 node clusters and have verified that it resolves the problem in nearly all cases.

Comment 20 errata-xmlrpc 2018-11-13 22:28:18 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:3587


Note You need to log in before you can comment on or make changes to this bug.