Description of problem: The ironic api service intermittently returns http 500 errors during RAID configuration using the iDRAC driver. Version-Release number of selected component (if applicable): OSP13 How reproducible: Attempt to deploy the JetStream solution on 14G Dell hardware. Note the consistent failures when RAID is configured during the assign_role script execution. Steps to Reproduce: 1. See above. Actual results: The ironic api service intermittently returns http 500 errors. Expected results: The ironic api service should not return http 500 errors. Additional info: When attempting to set the target RAID configuration on a node an http 500 error is sometimes returned from the ironic api service: Traceback (most recent call last): File "./assign_role.py", line 1509, in main drac_client) File "./assign_role.py", line 782, in configure_raid ironic_client.node.set_target_raid_config(node_uuid, target_raid_config) File "/usr/lib/python2.7/site-packages/ironicclient/v1/node.py", line 474, in set_target_raid_config return self.update(path, target_raid_config, http_method='PUT') File "/usr/lib/python2.7/site-packages/ironicclient/v1/node.py", line 336, in update method=http_method) File "/usr/lib/python2.7/site-packages/ironicclient/common/base.py", line 200, in _update resp, body = self.api.json_request(method, url, **kwargs) File "/usr/lib/python2.7/site-packages/ironicclient/common/http.py", line 645, in json_request resp = self._http_request(url, method, **kwargs) File "/usr/lib/python2.7/site-packages/ironicclient/common/http.py", line 247, in wrapper return func(self, url, method, **kwargs) File "/usr/lib/python2.7/site-packages/ironicclient/common/http.py", line 627, in _http_request error_json.get('debuginfo'), method, url) InternalServerError: Internal Server Error (HTTP 500) At close to the same time, the ironic api service returns another http 500 error to a different client thread. This occurs when calling list_unfinished_jobs through vendor passthru: Traceback (most recent call last): File "./assign_role.py", line 1509, in main drac_client) File "./assign_role.py", line 757, in configure_raid node_uuid, ironic_client, drac_client, 'JBOD') File "./assign_role.py", line 1295, in change_physical_disk_state_wait JobHelper.wait_for_job_completions(ironic_client, node_uuid) File "/home/osp_admin/pilot/job_helper.py", line 32, in wait_for_job_completions http_method='GET').unfinished_jobs: File "/usr/lib/python2.7/site-packages/ironicclient/v1/node.py", line 362, in vendor_passthru return self.get(path) File "/usr/lib/python2.7/site-packages/ironicclient/v1/node.py", line 313, in get return self._get(resource_id=node_id, fields=fields) File "/usr/lib/python2.7/site-packages/ironicclient/common/base.py", line 88, in _get return self._list(self._path(resource_id))[0] File "/usr/lib/python2.7/site-packages/ironicclient/common/base.py", line 182, in _list data = self.__list(url, response_key=response_key, body=body) File "/usr/lib/python2.7/site-packages/ironicclient/common/base.py", line 174, in __list resp, body = self.api.json_request('GET', url) File "/usr/lib/python2.7/site-packages/ironicclient/common/http.py", line 645, in json_request resp = self._http_request(url, method, **kwargs) File "/usr/lib/python2.7/site-packages/ironicclient/common/http.py", line 247, in wrapper return func(self, url, method, **kwargs) File "/usr/lib/python2.7/site-packages/ironicclient/common/http.py", line 627, in _http_request error_json.get('debuginfo'), method, url) InternalServerError: Internal Server Error (HTTP 500) /var/log/httpd/ironic_wsgi_error.log shows the following error: [Tue Aug 07 21:56:20.014913 2018] [:error] [pid 23514] [remote 192.168.120.10:184] mod_wsgi (pid=23514): Exception occurred processing WSGI script '/var/www/cgi-bin/ironic/app'. [Tue Aug 07 21:56:20.014953 2018] [:error] [pid 23514] [remote 192.168.120.10:184] Traceback (most recent call last): [Tue Aug 07 21:56:20.014967 2018] [:error] [pid 23514] [remote 192.168.120.10:184] File "/usr/lib/python2.7/site-packages/ironic/api/app.py", line 126, in __call__ [Tue Aug 07 21:56:20.015059 2018] [:error] [pid 23514] [remote 192.168.120.10:184] return self.v1(environ, start_response) [Tue Aug 07 21:56:20.015067 2018] [:error] [pid 23514] [remote 192.168.120.10:184] File "/usr/lib/python2.7/site-packages/ironic/api/middleware/json_ext.py", line 43, in __call__ [Tue Aug 07 21:56:20.015105 2018] [:error] [pid 23514] [remote 192.168.120.10:184] return self.app(env, start_response) [Tue Aug 07 21:56:20.015111 2018] [:error] [pid 23514] [remote 192.168.120.10:184] File "/usr/lib/python2.7/site-packages/webob/dec.py", line 131, in __call__ [Tue Aug 07 21:56:20.015200 2018] [:error] [pid 23514] [remote 192.168.120.10:184] resp = self.call_func(req, *args, **self.kwargs) [Tue Aug 07 21:56:20.015207 2018] [:error] [pid 23514] [remote 192.168.120.10:184] File "/usr/lib/python2.7/site-packages/webob/dec.py", line 196, in call_func [Tue Aug 07 21:56:20.015217 2018] [:error] [pid 23514] [remote 192.168.120.10:184] return self.func(req, *args, **kwargs) [Tue Aug 07 21:56:20.015222 2018] [:error] [pid 23514] [remote 192.168.120.10:184] File "/usr/lib/python2.7/site-packages/oslo_middleware/base.py", line 131, in __call__ [Tue Aug 07 21:56:20.015291 2018] [:error] [pid 23514] [remote 192.168.120.10:184] response = req.get_response(self.application) [Tue Aug 07 21:56:20.015308 2018] [:error] [pid 23514] [remote 192.168.120.10:184] File "/usr/lib/python2.7/site-packages/webob/request.py", line 1316, in send [Tue Aug 07 21:56:20.015619 2018] [:error] [pid 23514] [remote 192.168.120.10:184] application, catch_exc_info=False) [Tue Aug 07 21:56:20.015628 2018] [:error] [pid 23514] [remote 192.168.120.10:184] File "/usr/lib/python2.7/site-packages/webob/request.py", line 1280, in call_application [Tue Aug 07 21:56:20.015639 2018] [:error] [pid 23514] [remote 192.168.120.10:184] app_iter = application(self.environ, start_response) [Tue Aug 07 21:56:20.015645 2018] [:error] [pid 23514] [remote 192.168.120.10:184] File "/usr/lib/python2.7/site-packages/ironic/api/middleware/auth_token.py", line 59, in __call__ [Tue Aug 07 21:56:20.015681 2018] [:error] [pid 23514] [remote 192.168.120.10:184] return super(AuthTokenMiddleware, self).__call__(env, start_response) [Tue Aug 07 21:56:20.015687 2018] [:error] [pid 23514] [remote 192.168.120.10:184] File "/usr/lib/python2.7/site-packages/webob/dec.py", line 131, in __call__ [Tue Aug 07 21:56:20.015696 2018] [:error] [pid 23514] [remote 192.168.120.10:184] resp = self.call_func(req, *args, **self.kwargs) [Tue Aug 07 21:56:20.015717 2018] [:error] [pid 23514] [remote 192.168.120.10:184] File "/usr/lib/python2.7/site-packages/webob/dec.py", line 196, in call_func [Tue Aug 07 21:56:20.015731 2018] [:error] [pid 23514] [remote 192.168.120.10:184] return self.func(req, *args, **kwargs) [Tue Aug 07 21:56:20.015735 2018] [:error] [pid 23514] [remote 192.168.120.10:184] File "/usr/lib/python2.7/site-packages/keystonemiddleware/auth_token/__init__.py", line 337, in __call__ [Tue Aug 07 21:56:20.015911 2018] [:error] [pid 23514] [remote 192.168.120.10:184] response = req.get_response(self._app) [Tue Aug 07 21:56:20.015918 2018] [:error] [pid 23514] [remote 192.168.120.10:184] File "/usr/lib/python2.7/site-packages/webob/request.py", line 1316, in send [Tue Aug 07 21:56:20.015928 2018] [:error] [pid 23514] [remote 192.168.120.10:184] application, catch_exc_info=False) [Tue Aug 07 21:56:20.015932 2018] [:error] [pid 23514] [remote 192.168.120.10:184] File "/usr/lib/python2.7/site-packages/webob/request.py", line 1280, in call_application [Tue Aug 07 21:56:20.015940 2018] [:error] [pid 23514] [remote 192.168.120.10:184] app_iter = application(self.environ, start_response) [Tue Aug 07 21:56:20.015944 2018] [:error] [pid 23514] [remote 192.168.120.10:184] File "/usr/lib/python2.7/site-packages/pecan/middleware/recursive.py", line 56, in __call__ [Tue Aug 07 21:56:20.016016 2018] [:error] [pid 23514] [remote 192.168.120.10:184] return self.application(environ, start_response) [Tue Aug 07 21:56:20.016022 2018] [:error] [pid 23514] [remote 192.168.120.10:184] File "/usr/lib/python2.7/site-packages/ironic/api/middleware/parsable_error.py", line 67, in __call__ [Tue Aug 07 21:56:20.016061 2018] [:error] [pid 23514] [remote 192.168.120.10:184] app_iter = self.app(environ, replacement_start_response) [Tue Aug 07 21:56:20.016066 2018] [:error] [pid 23514] [remote 192.168.120.10:184] File "/usr/lib/python2.7/site-packages/pecan/core.py", line 835, in __call__ [Tue Aug 07 21:56:20.016222 2018] [:error] [pid 23514] [remote 192.168.120.10:184] return super(Pecan, self).__call__(environ, start_response) [Tue Aug 07 21:56:20.016230 2018] [:error] [pid 23514] [remote 192.168.120.10:184] File "/usr/lib/python2.7/site-packages/pecan/core.py", line 677, in __call__ [Tue Aug 07 21:56:20.016240 2018] [:error] [pid 23514] [remote 192.168.120.10:184] controller, args, kwargs = self.find_controller(state) [Tue Aug 07 21:56:20.016244 2018] [:error] [pid 23514] [remote 192.168.120.10:184] File "/usr/lib/python2.7/site-packages/pecan/core.py", line 853, in find_controller [Tue Aug 07 21:56:20.016252 2018] [:error] [pid 23514] [remote 192.168.120.10:184] controller, args, kw = super(Pecan, self).find_controller(_state) [Tue Aug 07 21:56:20.016256 2018] [:error] [pid 23514] [remote 192.168.120.10:184] File "/usr/lib/python2.7/site-packages/pecan/core.py", line 526, in find_controller [Tue Aug 07 21:56:20.016263 2018] [:error] [pid 23514] [remote 192.168.120.10:184] if not isinstance(req.json, dict): [Tue Aug 07 21:56:20.016267 2018] [:error] [pid 23514] [remote 192.168.120.10:184] File "/usr/lib/python2.7/site-packages/pecan/core.py", line 52, in __getattribute__ [Tue Aug 07 21:56:20.016274 2018] [:error] [pid 23514] [remote 192.168.120.10:184] return WebObRequest.__getattribute__(self, name) [Tue Aug 07 21:56:20.016279 2018] [:error] [pid 23514] [remote 192.168.120.10:184] File "/usr/lib/python2.7/site-packages/webob/request.py", line 724, in _json_body__get [Tue Aug 07 21:56:20.016286 2018] [:error] [pid 23514] [remote 192.168.120.10:184] return json.loads(self.body.decode(self.charset)) [Tue Aug 07 21:56:20.016290 2018] [:error] [pid 23514] [remote 192.168.120.10:184] File "/usr/lib/python2.7/site-packages/pecan/core.py", line 52, in __getattribute__ [Tue Aug 07 21:56:20.016297 2018] [:error] [pid 23514] [remote 192.168.120.10:184] return WebObRequest.__getattribute__(self, name) [Tue Aug 07 21:56:20.016301 2018] [:error] [pid 23514] [remote 192.168.120.10:184] File "/usr/lib/python2.7/site-packages/webob/request.py", line 702, in body [Tue Aug 07 21:56:20.016312 2018] [:error] [pid 23514] [remote 192.168.120.10:184] self.make_body_seekable() # we need this to have content_length [Tue Aug 07 21:56:20.016317 2018] [:error] [pid 23514] [remote 192.168.120.10:184] File "/usr/lib/python2.7/site-packages/webob/request.py", line 930, in make_body_seekable [Tue Aug 07 21:56:20.016323 2018] [:error] [pid 23514] [remote 192.168.120.10:184] self.copy_body() [Tue Aug 07 21:56:20.016328 2018] [:error] [pid 23514] [remote 192.168.120.10:184] File "/usr/lib/python2.7/site-packages/webob/request.py", line 954, in copy_body [Tue Aug 07 21:56:20.016334 2018] [:error] [pid 23514] [remote 192.168.120.10:184] data = input.read(min(todo, 65535)) [Tue Aug 07 21:56:20.016341 2018] [:error] [pid 23514] [remote 192.168.120.10:184] File "/usr/lib/python2.7/site-packages/webob/request.py", line 1540, in readinto [Tue Aug 07 21:56:20.016348 2018] [:error] [pid 23514] [remote 192.168.120.10:184] data = self.file.read(sz0) [Tue Aug 07 21:56:20.016362 2018] [:error] [pid 23514] [remote 192.168.120.10:184] IOError: request data read error If I run the assign_role script to configure RAID following overcloud deployment failure against the nodes that errored out, it runs successfully with no issues. The assign_role script works against OSP10, and no http 500 errors were returned from the ironic api service. No errors appear in ironic-conductor.log.
This was discussed upstream, along with options for turning off wsgi. The upstream irc log is here - http://eavesdrop.openstack.org/irclogs/%23openstack-ironic/%23openstack-ironic.2018-08-07.log.html#t2018-08-07T14:05:3 To summarize from chat log: "For turning off wsgi - you'd have to manually stop the httpd running the ironic service, and should then be able to directly invoke the ironic-api in debug mode... although I don't remember how that is assembled so it may require something like proxying through httpd" To test: "uwsgi has an option for dealing with taking to long to read the data on the service side, its "post-buffering = 1" in the uwsgi config, I ran into similar issues on a different project <but...> looks like its apache itself thats running the wsgi daemon not uwsgi so not sure what the equivolent config option is." Julia thinks what he is encountering is threading issue where somehow the code is running to the point where it is touching the same basic object but that object is getting overridden and upon read the socket is thought to be closed.
Working with Chris while comparing Bob's environment as well. We found that the apache wsgi config was set for 4 processes and 1 thread on Chris's environment, where as Bob's was 8 processes and 1 thread. In review of the provided log data and exceptions, it looks as if the requests may be stacking, and upon understanding what the is actually occurring with the client, that is likely the case (at least without reviewing full access logs). We changed Chris's environment to run 8 processes with 1 thread each, thus allowing for 8 concurrent requests to be passed through apache instead of four, and an initial test run passed the point it was previously failing. Chris is going to retry some more, and look at trying some variations and will follow back up with us. In the mean time, in development we don't set processes, which I believe results in a default of 1 per CPU, and we set threads to 10... So it seems like maybe the template we're deploying results in an overly restrictive configuration that is not able to keep up with their script.
Thank you Julia. What is the default configuration in OSPd 13 for Apache WSGI? Any guidance on how this setting should be done based on the number of nodes under management?
All of the WSGI services are configured the same way on the director node in OSP13: 4 processes, 1 thread.
Chris, Can you check what api_workers is set to in /etc/ironic/ironic.conf? On my system its 8, this is what is setting the number of processes for wsgi I believe: # Number of workers for OpenStack Ironic API service. The # default is equal to the number of CPUs available if that can # be determined, else a default worker count of 1 is returned. # (integer value) #api_workers = <None> api_workers=8 (undercloud) [stack@host01 ~]$ cat /etc/httpd/conf.d/10-ironic_wsgi.conf | grep proc WSGIDaemonProcess ironic display-name=ironic_wsgi group=ironic processes=8 threads=1 user=ironic We should be able to override this.
Hey Bob, I just checked, and api_workers is set to 4 in /etc/ironic/ironic.conf: # Number of workers for OpenStack Ironic API service. The # default is equal to the number of CPUs available if that can # be determined, else a default worker count of 1 is returned. # (integer value) #api_workers = <None> api_workers=4
Hey Chris, in your Apache httpd config that launches the WSGI workers. Could you remove the part of the line that has "threads=1" or set it to something like "threads=10". And then reset "processes=4" which is the default that your install setup initially. Once done, restart httpd, and then re-test? We're hoping that you do not encounter the issue with threads being used since they would naturally be more memory friendly and be able to support a much higher concurrent request load. -Julia
Hey Julia, sure. I've just started the test run with processes=4 and threads=10. I did try processes=8, threads=1 and processes=1, threads=10 over the weekend, and both of these worked with no issues.
Just completed a test with processes=4 and threads=10, and it worked perfectly.
Link to upstream bug: https://storyboard.openstack.org/#!/story/2003464
We have tested this on several 9 node clusters and have verified that it resolves the problem in nearly all cases.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2018:3587