Red Hat Bugzilla – Bug 1462306
Failed create Instance under load: Remote error: NoSuchColumnError
Last modified: 2017-06-30 10:15:16 EDT
Description of problem:
Failed to create several instances during a REST API load test.
"Remote error: NoSuchColumnError "Could not locate column in row for column 'compute_nodes.id'" [u'Traceback (most recent call last):\n', u' File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/server.py", line 155, in _process_incoming\n res = se..."
File "/usr/lib/python2.7/site-packages/nova/conductor/manager.py", line 866, in schedule_and_build_instances
    request_specs.to_legacy_filter_properties_dict())
File "/usr/lib/python2.7/site-packages/nova/conductor/manager.py", line 597, in _schedule_instances
    hosts = self.scheduler_client.select_destinations(context, spec_obj)
File "/usr/lib/python2.7/site-packages/nova/scheduler/utils.py", line 371, in wrapped
    return func(*args, **kwargs)
File "/usr/lib/python2.7/site-packages/nova/scheduler/client/__init__.py", line 51, in select_destinations
    return self.queryclient.select_destinations(context, spec_obj)
File "/usr/lib/python2.7/site-packages/nova/scheduler/client/__init__.py", line 37, in __run_method
    return getattr(self.instance, __name)(*args, **kwargs)
File "/usr/lib/python2.7/site-packages/nova/scheduler/client/query.py", line 32, in select_destinations
    return self.scheduler_rpcapi.select_destinations(context, spec_obj)
File "/usr/lib/python2.7/site-packages/nova/scheduler/rpcapi.py", line 129, in select_destinations
    return cctxt.call(ctxt, 'select_destinations', **msg_args)
File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/client.py", line 169, in call
    retry=self.retry)
File "/usr/lib/python2.7/site-packages/oslo_messaging/transport.py", line 97, in _send
    timeout=timeout, retry=retry)
File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 505, in send
    retry=retry)
File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 496, in _send
    raise result
nova-conductor.log has many errors (see attached file)
Look for instance:
Name - perf-16-1-vm
ID - d21aa630-71f9-43ff-a578-9f7b8140e82f
Status - Error
Created - June 15, 2017, 9:11 p.m.
Version-Release number of selected component (if applicable):
rhos-release 11 -p 2017-05-30.1
[root@overcloud-controller-0 log]# rpm -qa | grep nova
It can be reproduced by running the REST API performance test,
which simulates load on OpenStack with 20 concurrent threads.
Each thread creates an instance and performs actions on it.
Steps to Reproduce:
Run the REST API performance test described above.

Actual results:
Several instances in state Error

Expected results:
No instances in state Error

Additional info:
Did not happen in RHOS 9 and 10
Created attachment 1288416 [details]
Created attachment 1288417 [details]
Your conductor log indicates two things:
1. Occasionally nova times out waiting for neutron
2. A _lot_ of database traffic is taking a very long time to complete
Both of these could come from purely overwhelming the system (maybe the database?) with too much traffic.
The NoSuchColumnError sounds like incomplete setup to me, as that should never be possible unless the schema doesn't match the code. Since you didn't provide that whole log, it's hard to draw much of a conclusion from what you have provided.
I would double-check your deployment and make sure that you have synced your schema levels to match the code on all databases.
For diagnosing the issues in the conductor log, I would start by checking the database load to see if it's beyond a reasonable level for your deployment. Next, I would figure out why neutron is timing out and try to resolve that. I'm not sure what your deployment (hardware) looks like, but 20 parallel threads of the load you described is a LOT of traffic.
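For reference, the two checks suggested above can be done with commands like the following. This is a sketch assuming a standard RHOS controller where nova-manage is installed and mysql can be run with local credentials; adjust for your deployment:

```shell
# Check that the nova schemas match the code (run on a controller).
# "db sync" is a no-op if the schema is already current.
nova-manage api_db sync
nova-manage db sync

# Inspect current database load and long-running queries.
mysql -e "SHOW FULL PROCESSLIST;"
mysql -e "SHOW GLOBAL STATUS LIKE 'Threads_running';"
```

If `db sync` reports migrations being applied on a deployment that was supposed to be current, that would explain a schema/code mismatch.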
I have resent you the mail describing the test configuration and reports.
Unfortunately, I cannot attach it to the bug.
I'll retest everything starting June 25 and update you with the results.
Regarding traffic: I ran the same test on the same hardware on RHOS 9 and 10 without failures. The bug reproduces only in RHOS 11.
And I don't think this is a LOT of traffic:
only 20 threads for 3 controllers and 6 computes,
where each server is Dell Inc. PowerEdge R620/0KCKR5,
Red Hat Enterprise Linux Server release 7.3 (Maipo)
24 x Intel(R) Xeon(R) CPU E5-2620 0 @ 2.00GHz
Thanks for the email context.
It's definitely a lot of traffic. Knowing that the same deployment was able to handle it in previous releases is a good data point.
The errors in the conductor log are almost definitely related to stress on the database (regardless of where it's coming from). It sounds like the people who have replied on your mail thread have some ideas to resolve that.
The NoSuchColumn error is structural. It should either always happen, or never happen. It indicates that the schema of some database we're talking to doesn't match what we expect it to be. I can't really think of any reason that would be dependent on load.
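As a minimal illustration of this class of failure (using stdlib sqlite3, not nova's actual code, schema, or the real `mapped` column situation): when code written against a newer schema selects a column the deployed database lacks, the error is deterministic and repeats on every query, independent of load.

```python
import sqlite3

# Hypothetical example: simulate a schema that is behind the code's
# expectations, as after a missed "db sync".
conn = sqlite3.connect(":memory:")
# Old schema: compute_nodes has no "mapped" column.
conn.execute("CREATE TABLE compute_nodes (id INTEGER PRIMARY KEY, host TEXT)")
conn.execute("INSERT INTO compute_nodes (host) VALUES ('compute-0')")

try:
    # Code written against the new schema selects a missing column.
    conn.execute("SELECT id, host, mapped FROM compute_nodes").fetchall()
except sqlite3.OperationalError as exc:
    # Fails structurally, every single time, regardless of load.
    print("schema mismatch:", exc)
```

The point is that a schema-level mismatch is not intermittent, which is why a load-dependent NoSuchColumnError is surprising.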
Because the other errors indicate an overstressed database, I would say you should resolve those things and then see if we're still hitting the NoSuchColumn error.
(In reply to Dan Smith from comment #5)
> Thanks for the email context.
> It's definitely a lot of traffic. Knowing that the same deployment was able
> to handle it in previous releases is a good data point.
> The errors in the conductor log are almost definitely related to stress on
> the database (regardless of where it's coming from). It sounds like the
> people who have replied on your mail thread have some ideas to resolve that.
I'm going to monitor the database and update everyone with the results.
> The NoSuchColumn error is structural. It should either always happen, or
> never happen. It indicates that the schema of some database we're talking to
> doesn't match what we expect it to be. I can't really think of any reason
> that would be dependent on load.
I'll ping you to check the database structure when we deploy RHOS 11 next week.
Is that OK?
> Because the other errors indicate an overstressed database, I would say you
> should resolve those things and then see if we're still hitting the
> NoSuchColumn error.
If you reproduce this, please re-open and needinfo me. We're going to close this to get it off our dashboard.