Bug 1462306

Summary: Failed create Instance under load: Remote error: NoSuchColumnError
Product: Red Hat OpenStack Reporter: Yuri Obshansky <yobshans>
Component: openstack-novaAssignee: Eoghan Glynn <eglynn>
Status: CLOSED NOTABUG QA Contact: Joe H. Rahme <jhakimra>
Severity: high Docs Contact:
Priority: unspecified    
Version: 11.0 (Ocata)CC: berrange, dasmith, eglynn, kchamart, sbauza, sferdjao, sgordon, srevivo, vromanso
Target Milestone: ---   
Target Release: ---   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2017-06-30 14:15:16 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments:
Description Flags
controller nova-conductor.log
none
controller mysqld.log none

Description Yuri Obshansky 2017-06-16 16:56:20 UTC
Description of problem:
Failed create several instances under REST API load test. 
Message: 
"Remote error: NoSuchColumnError "Could not locate column in row for column 'compute_nodes.id'" [u'Traceback (most recent call last):\n', u' File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/server.py", line 155, in _process_incoming\n res = se..."
Details:
"File "/usr/lib/python2.7/site-packages/nova/conductor/manager.py", line 866, in schedule_and_build_instances request_specs[0].to_legacy_filter_properties_dict()) File "/usr/lib/python2.7/site-packages/nova/conductor/manager.py", line 597, in _schedule_instances hosts = self.scheduler_client.select_destinations(context, spec_obj) File "/usr/lib/python2.7/site-packages/nova/scheduler/utils.py", line 371, in wrapped return func(*args, **kwargs) File "/usr/lib/python2.7/site-packages/nova/scheduler/client/__init__.py", line 51, in select_destinations return self.queryclient.select_destinations(context, spec_obj) File "/usr/lib/python2.7/site-packages/nova/scheduler/client/__init__.py", line 37, in __run_method return getattr(self.instance, __name)(*args, **kwargs) File "/usr/lib/python2.7/site-packages/nova/scheduler/client/query.py", line 32, in select_destinations return self.scheduler_rpcapi.select_destinations(context, spec_obj) File "/usr/lib/python2.7/site-packages/nova/scheduler/rpcapi.py", line 129, in select_destinations return cctxt.call(ctxt, 'select_destinations', **msg_args) File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/client.py", line 169, in call retry=self.retry) File "/usr/lib/python2.7/site-packages/oslo_messaging/transport.py", line 97, in _send timeout=timeout, retry=retry) File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 505, in send retry=retry) File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 496, in _send raise result: 
nova-conductor.log has many errors (see attached file) 
Look for instance:
Name - perf-16-1-vm
ID - d21aa630-71f9-43ff-a578-9f7b8140e82f
Status - Error
Created - June 15, 2017, 9:11 p.m.

Version-Release number of selected component (if applicable):
rhos-release 11   -p 2017-05-30.1
[root@overcloud-controller-0 log]# rpm -qa | grep nova
openstack-nova-novncproxy-15.0.3-3.el7ost.noarch
python-nova-15.0.3-3.el7ost.noarch
openstack-nova-cert-15.0.3-3.el7ost.noarch
openstack-nova-migration-15.0.3-3.el7ost.noarch
openstack-nova-api-15.0.3-3.el7ost.noarch
python-novaclient-7.1.0-1.el7ost.noarch
puppet-nova-10.4.0-5.el7ost.noarch
openstack-nova-compute-15.0.3-3.el7ost.noarch
openstack-nova-scheduler-15.0.3-3.el7ost.noarch
openstack-nova-console-15.0.3-3.el7ost.noarch
openstack-nova-common-15.0.3-3.el7ost.noarch
openstack-nova-conductor-15.0.3-3.el7ost.noarch
openstack-nova-placement-api-15.0.3-3.el7ost.noarch

How reproducible:
It could be reproduced by run the REST API performance test
which simulates load on Openstack with 20 concurrent threads.
Each thread create instance and perform actions:
01.NOVA.GET.Images
02.NOVA.GET.Flavors
03.NEUTRON.POST.Create.Network
04.NEUTRON.POST.Create.Subnet
05.NOVA.POST.Boot.Server
00.NOVA.GET.Server.Details
06.NOVA.POST.Pause.Server
07.NOVA.POST.Unpause.Server
08.NOVA.POST.Suspend.Server
09.NOVA.POST.Resume.Server
10.NOVA.POST.Soft.Reboot.Server
11.NOVA.POST.Hard.Reboot.Server
12.NOVA.POST.Stop.Server
13.NOVA.POST.Start.Server
14.NOVA.POST.Create.Image
15.NOVA.GET.Image.Id
16.NOVA.DELETE.Image
17.NOVA.DELETE.Server
18.NEUTRON.DELETE.Network

Steps to Reproduce:
1.
2.
3.

Actual results:
Several instances in state Error

Expected results:
No instances in state Error

Additional info:
Not happened in RHOS 9 and 10

Comment 1 Yuri Obshansky 2017-06-16 16:57:39 UTC
Created attachment 1288416 [details]
controller nova-conductor.log

Comment 2 Yuri Obshansky 2017-06-16 16:58:09 UTC
Created attachment 1288417 [details]
controller mysqld.log

Comment 3 Dan Smith 2017-06-22 19:54:57 UTC
Your conductor log indicates two things:

1. Occasionally nova times out waiting for neutron
2. A _lot_ of database traffic is taking a very long time to complete

Both of these could come from purely overwhelming the system (maybe the database?) with too much traffic.

The NoSuchColumnError sounds like incomplete setup to me, as that should never be possible unless the schema doesn't match the code. Since you didn't provide that whole log, it's hard to draw much of a conclusion from what you have provided.

I would double check your deployment and make sure that you have sync'd your schema levels to match the code on all databases.

For diagnosing the issues in the conductor log, I would start by checking the database load to see if it's beyond a reasonable level for your deployment. Next, I would figure out why neutron is timing out and try to resolve that. I'm not sure what your deployment (hardware) looks like, but 20 parallel threads of the load you described is a LOT of traffic.

Comment 4 Yuri Obshansky 2017-06-22 20:15:36 UTC
Hi Dan,

I resend to you mail which describe tests configuration and reports.
Unfortunately, I cannot attach it to bug.

But, I'll retest all starting from June 25 and update you with result

Regarding traffic, I performed the same test on  the same hardware on RHOS9 and 10 without failures. Bug reproduce only in RHOS11.
And I don't think, this a LOT traffic.
ONLY 20 threads for 3 controllers and 6 computes
where each server is Dell Inc. PowerEdge R620/0KCKR5,	
Red Hat Enterprise Linux Server release 7.3 (Maipo)	
24 x Intel(R) Xeon(R) CPU E5-2620 0 @ 2.00GHz	
65758316 kB

Yuri

Comment 5 Dan Smith 2017-06-22 20:25:53 UTC
Thanks for the email context.

It's definitely a lot of traffic. Knowing that the same deployment was able to handle it in previous releases is a good data point.

The errors in the conductor log are almost definitely related to stress on the database (regardless of where it's coming from). It sounds like the people who have replied on your mail thread have some ideas to resolve that.

The NoSuchColumn error is structural. It should either always happen, or never happen. It indicates that the schema of some database we're talking to doesn't match what we expect it to be. I can't really think of any reason that would be dependent on load.

Because the other errors indicate an overstressed database, I would say you should resolve those things and then see if we're still hitting the NoSuchColumn error.

Comment 6 Yuri Obshansky 2017-06-22 20:32:18 UTC
(In reply to Dan Smith from comment #5)
> Thanks for the email context.
> 
> It's definitely a lot of traffic. Knowing that the same deployment was able
> to handle it in previous releases is a good data point.
> 
> The errors in the conductor log are almost definitely related to stress on
> the database (regardless of where it's coming from). It sounds like the
> people who have replied on your mail thread have some ideas to resolve that.

I'm going to monitor database and update all with results
> 
> The NoSuchColumn error is structural. It should either always happen, or
> never happen. It indicates that the schema of some database we're talking to
> doesn't match what we expect it to be. I can't really think of any reason
> that would be dependent on load.
> 

I'll ping you to check the database structure when deploy rhos11 on next week.
Is it OK? 

> Because the other errors indicate an overstressed database, I would say you
> should resolve those things and then see if we're still hitting the
> NoSuchColumn error.

Comment 7 Dan Smith 2017-06-30 14:15:16 UTC
If you reproduce this, please re-open and needinfo me. We're going to close this to get it off our dashboard.