Bug 1033188 - cinder: when attach volume fails on more than 1 compute setup no error is issued on command
Keywords:
Status: CLOSED UPSTREAM
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: python-novaclient
Version: 4.0
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: 5.0 (RHEL 7)
Assignee: David Ripton
QA Contact: Dafna Ron
URL:
Whiteboard: storage
Depends On:
Blocks: 1034255
 
Reported: 2013-11-21 16:32 UTC by Dafna Ron
Modified: 2019-09-09 15:20 UTC (History)
7 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
: 1034255
Environment:
Last Closed: 2014-05-21 16:48:07 UTC
Target Upstream Version:


Attachments


Links
System ID Private Priority Status Summary Last Updated
Launchpad 1253695 0 None None None Never

Description Dafna Ron 2013-11-21 16:32:39 UTC
Description of problem:

I have a two-compute setup and was trying to attach a volume to an instance running on compute #2; the command was run from compute #1.
The action failed in libvirt on compute #2 because of SELinux, yet all I could see on compute #1 was that the volume had been attached.

I think that if a volume action runs on a different compute and fails, we should get some sort of message on the compute we ran the command from.
Alternatively, we should log which compute the request was sent to, to allow easy debugging.

Version-Release number of selected component (if applicable):

openstack-cinder-2013.2-2.el6ost.noarch

How reproducible:

100%

Steps to Reproduce:
1. In a two-compute setup with GlusterFS, disable FUSE use for virt on the second compute (setsebool -P virt_use_fusefs=0).
2. Boot an instance on compute #2.
3. Create a volume.
4. Run the following command from compute #1 against the instance running on compute #2:
nova volume-attach <instance> <volume> /dev/vdc

Actual results:

The attach fails because of SELinux, but no error is reported on compute #1, nor is there any indication of which compute the request was sent to.

Expected results:

Either the failure should be logged on compute #1, or there should be an indication of which compute the request was sent to.

Additional info:

This is from compute #2 - the ERROR is clearly reported:

2013-11-21 14:59:00.372+0000: 28993: error : qemuMonitorJSONCheckError:357 : internal error unable to execute QEMU command '__com.redhat_drive_add': Device 'drive-virtio-disk1' could not be initialized
2013-11-21 15:00:04.009+0000: 28993: error : qemuMonitorJSONCheckError:357 : internal error unable to execute QEMU command '__com.redhat_drive_add': Device 'drive-virtio-disk1' could not be initialized
2013-11-21 15:02:47.382+0000: 28996: error : qemuMonitorJSONCheckError:357 : internal error unable to execute QEMU command '__com.redhat_drive_add': Device 'drive-virtio-disk2' could not be initialized
2013-11-21 15:04:27.390+0000: 28997: error : qemuMonitorJSONCheckError:357 : internal error unable to execute QEMU command '__com.redhat_drive_add': Device 'drive-virtio-disk2' could not be initialized
2013-11-21 15:08:04.638+0000: 28993: error : qemuMonitorJSONCheckError:357 : internal error unable to execute QEMU command '__com.redhat_drive_add': Device 'drive-virtio-disk2' could not be initialized

2013-11-21 17:08:05.350 29477 ERROR nova.openstack.common.rpc.amqp [req-90daa905-9b91-4260-9d4b-4ed1cf0ee0b8 24b77982be8049ee9cd5ad7bed913565 7eb59aa89e8944d098554ff6f5a4cf88] Exception during message handling
2013-11-21 17:08:05.350 29477 TRACE nova.openstack.common.rpc.amqp Traceback (most recent call last):
2013-11-21 17:08:05.350 29477 TRACE nova.openstack.common.rpc.amqp File "/usr/lib/python2.6/site-packages/nova/openstack/common/rpc/amqp.py", line 461, in _process_data
2013-11-21 17:08:05.350 29477 TRACE nova.openstack.common.rpc.amqp **args)
2013-11-21 17:08:05.350 29477 TRACE nova.openstack.common.rpc.amqp File "/usr/lib/python2.6/site-packages/nova/openstack/common/rpc/dispatcher.py", line 172, in dispatch
2013-11-21 17:08:05.350 29477 TRACE nova.openstack.common.rpc.amqp result = getattr(proxyobj, method)(ctxt, **kwargs)
2013-11-21 17:08:05.350 29477 TRACE nova.openstack.common.rpc.amqp File "/usr/lib/python2.6/site-packages/nova/exception.py", line 90, in wrapped
2013-11-21 17:08:05.350 29477 TRACE nova.openstack.common.rpc.amqp payload)
2013-11-21 17:08:05.350 29477 TRACE nova.openstack.common.rpc.amqp File "/usr/lib/python2.6/site-packages/nova/exception.py", line 73, in wrapped
2013-11-21 17:08:05.350 29477 TRACE nova.openstack.common.rpc.amqp return f(self, context, *args, **kw)
2013-11-21 17:08:05.350 29477 TRACE nova.openstack.common.rpc.amqp File "/usr/lib/python2.6/site-packages/nova/compute/manager.py", line 243, in decorated_function
2013-11-21 17:08:05.350 29477 TRACE nova.openstack.common.rpc.amqp pass
2013-11-21 17:08:05.350 29477 TRACE nova.openstack.common.rpc.amqp File "/usr/lib/python2.6/site-packages/nova/compute/manager.py", line 229, in decorated_function
2013-11-21 17:08:05.350 29477 TRACE nova.openstack.common.rpc.amqp return function(self, context, *args, **kwargs)
2013-11-21 17:08:05.350 29477 TRACE nova.openstack.common.rpc.amqp File "/usr/lib/python2.6/site-packages/nova/compute/manager.py", line 271, in decorated_function
2013-11-21 17:08:05.350 29477 TRACE nova.openstack.common.rpc.amqp e, sys.exc_info())
2013-11-21 17:08:05.350 29477 TRACE nova.openstack.common.rpc.amqp File "/usr/lib/python2.6/site-packages/nova/compute/manager.py", line 258, in decorated_function
2013-11-21 17:08:05.350 29477 TRACE nova.openstack.common.rpc.amqp return function(self, context, *args, **kwargs)
2013-11-21 17:08:05.350 29477 TRACE nova.openstack.common.rpc.amqp File "/usr/lib/python2.6/site-packages/nova/compute/manager.py", line 3638, in attach_volume
2013-11-21 17:08:05.350 29477 TRACE nova.openstack.common.rpc.amqp context, instance, mountpoint)
2013-11-21 17:08:05.350 29477 TRACE nova.openstack.common.rpc.amqp File "/usr/lib/python2.6/site-packages/nova/compute/manager.py", line 3633, in attach_volume
2013-11-21 17:08:05.350 29477 TRACE nova.openstack.common.rpc.amqp mountpoint, instance)
2013-11-21 17:08:05.350 29477 TRACE nova.openstack.common.rpc.amqp File "/usr/lib/python2.6/site-packages/nova/compute/manager.py", line 3680, in _attach_volume
2013-11-21 17:08:05.350 29477 TRACE nova.openstack.common.rpc.amqp connector)
2013-11-21 17:08:05.350 29477 TRACE nova.openstack.common.rpc.amqp File "/usr/lib/python2.6/site-packages/nova/compute/manager.py", line 3670, in _attach_volume
2013-11-21 17:08:05.350 29477 TRACE nova.openstack.common.rpc.amqp encryption=encryption)
2013-11-21 17:08:05.350 29477 TRACE nova.openstack.common.rpc.amqp File "/usr/lib/python2.6/site-packages/nova/virt/libvirt/driver.py", line 1105, in attach_volume
2013-11-21 17:08:05.350 29477 TRACE nova.openstack.common.rpc.amqp disk_dev)
2013-11-21 17:08:05.350 29477 TRACE nova.openstack.common.rpc.amqp File "/usr/lib/python2.6/site-packages/nova/virt/libvirt/driver.py", line 1092, in attach_volume
2013-11-21 17:08:05.350 29477 TRACE nova.openstack.common.rpc.amqp virt_dom.attachDeviceFlags(conf.to_xml(), flags)
2013-11-21 17:08:05.350 29477 TRACE nova.openstack.common.rpc.amqp File "/usr/lib/python2.6/site-packages/eventlet/tpool.py", line 187, in doit
2013-11-21 17:08:05.350 29477 TRACE nova.openstack.common.rpc.amqp result = proxy_call(self._autowrap, f, *args, **kwargs)
2013-11-21 17:08:05.350 29477 TRACE nova.openstack.common.rpc.amqp File "/usr/lib/python2.6/site-packages/eventlet/tpool.py", line 147, in proxy_call
2013-11-21 17:08:05.350 29477 TRACE nova.openstack.common.rpc.amqp rv = execute(f,*args,**kwargs)
2013-11-21 17:08:05.350 29477 TRACE nova.openstack.common.rpc.amqp File "/usr/lib/python2.6/site-packages/eventlet/tpool.py", line 76, in tworker
2013-11-21 17:08:05.350 29477 TRACE nova.openstack.common.rpc.amqp rv = meth(*args,**kwargs)
2013-11-21 17:08:05.350 29477 TRACE nova.openstack.common.rpc.amqp File "/usr/lib64/python2.6/site-packages/libvirt.py", line 419, in attachDeviceFlags
2013-11-21 17:08:05.350 29477 TRACE nova.openstack.common.rpc.amqp if ret == -1: raise libvirtError ('virDomainAttachDeviceFlags() failed', dom=self)
2013-11-21 17:08:05.350 29477 TRACE nova.openstack.common.rpc.amqp libvirtError: internal error unable to execute QEMU command '__com.redhat_drive_add': Device 'drive-virtio-disk2' could not be initialized
2013-11-21 17:08:05.350 29477 TRACE nova.openstack.common.rpc.amqp
2013-11-21 17:08:05.352 29477 DEBUG qpid.messaging.io.raw [-] SENT[5966680]: '\x0f\x00\x00;\x00\x00\x00\x00\x00\x00\x00\x00\x02\x01\x01\x00\x00)01801d71-64fe-44c8-99f5-f99f88b9e700:1115\x0f\x00\x00\x1c\x00\x00\x00\x00\x00\x00\x00\x00\x02
\x07\x03\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00' writeable /usr/lib/python2.6/site-packages/qpid/messaging/driver.py:480
2013-11-21 17:08:05.353 29477 DEBUG qpid.messaging.io.raw [-] READ[5966680]: '\x0f\x00\x00;\x00\x00\x00\x00\x00\x00\x00\x00\x02\x02\x01\x00\x00)01801d71-64fe-44c8-99f5-f99f8

This is on compute #1 - no error, just the request:

2013-11-21 16:54:25.835 19854 DEBUG nova.openstack.common.rpc.amqp [req-b773a8d2-175a-4b8c-aead-7c5dca33dadf 24b77982be8049ee9cd5ad7bed913565 7eb59aa89e8944d098554ff6f5a4cf88] Making synchronous call on conductor ... multicall /usr/lib/p
ython2.6/site-packages/nova/openstack/common/rpc/amqp.py:553
2013-11-21 16:54:25.836 19854 DEBUG nova.openstack.common.rpc.amqp [req-b773a8d2-175a-4b8c-aead-7c5dca33dadf 24b77982be8049ee9cd5ad7bed913565 7eb59aa89e8944d098554ff6f5a4cf88] MSG_ID is b2921a9060e148ec9186e478717672e8 multicall /usr/lib
/python2.6/site-packages/nova/openstack/common/rpc/amqp.py:556
2013-11-21 16:54:25.836 19854 DEBUG nova.openstack.common.rpc.amqp [req-b773a8d2-175a-4b8c-aead-7c5dca33dadf 24b77982be8049ee9cd5ad7bed913565 7eb59aa89e8944d098554ff6f5a4cf88] UNIQUE_ID is 677b4b67f7a64feebb87378496a87337. _add_unique_id
 /usr/lib/python2.6/site-packages/nova/openstack/common/rpc/amqp.py:341
2013-11-21 16:54:25.839 19854 DEBUG qpid.messaging.io.ops [-] SENT[3afc290]: MessageTransfer(destination='amq.topic', id=serial(0), sync=True, headers=(DeliveryProperties(routing_key='topic/nova/conductor'), MessageProperties(content_typ
e='amqp/map', application_headers={'qpid.subject': 'topic/nova/conductor'})), payload='\x00\x00\x06+\x00\x00\x00\x02\x0coslo.message\x95\x06\x04{"_context_roles": ["admin"], "_msg_id": "b2921a9060e148ec9186e478717672e8", "_context_quota_
class": null, "_context_request_id": "req-b773a8d2-175a-4b8c-aead-7c5dca33dadf", "_context_service_catalog": [{"endpoints_links": [], "endpoints": [{"adminURL": "http://10.35.160.133:8776/v1/7eb59aa89e8944d098554ff6f5a4cf88", "region": "
RegionOne", "publicURL": "http://10.35.160.133:8776/v1/7eb59aa89e8944d098554ff6f5a4cf88", "id": "46f73e2ea31540a5aed0daa8ebc82857", "internalURL": "http://10.35.160.133:8776/v1/7eb59aa89e8944d098554ff6f5a4cf88"}], "type": "volume", "name
": "cinder"}], "_context_tenant": "7eb59aa89e8944d098554ff6f5a4cf88", "args": {"values": {"instance_uuid": "d1fa2f73-b81d-47f3-b30e-68ed069a849e", "finish_time": "2013-11-21T14:54:25.835276", "request_id": "req-b773a8d2-175a-4b8c-aead-7c
5dca33dadf", "result": "Success", "event": "compute_terminate_instance"}}, "_unique_id": "677b4b67f7a64feebb87378496a87337", "_context_timestamp": "2013-11-21T14:54:18.767872", "_context_user_id": "24b77982be8049ee9cd5ad7bed913565", "_co
ntext_project_name": "admin", "_context_read_deleted": "no", "_reply_q": "reply_82fcbb14702642ffbf4a0d5e9c745598", "_context_auth_token": "ea5a64ace812ebaa1e84934d65c1eac8", "namespace": null, "_context_instance_lock_checked": false, "_c
ontext_is_admin": true, "version": "1.25", "_context_project_id": "7eb59aa89e8944d098554ff6f5a4cf88", "_context_user": "24b77982be8049ee9cd5ad7bed913565", "_context_user_name": "admin", "method": "action_event_finish", "_context_remote_a
ddress": "10.35.160.133"}\x0coslo.version\x95\x00\x032.0') write_op /usr/lib/python2.6/site-packages/qpid/messaging/driver.py:686
2013-11-21 16:54:25.839 19854 DEBUG qpid.messaging [-] SENT[3d9e638]: Message(properties={'qpid.subject': 'topic/nova/conductor'}, content={'oslo.message': '{"_context_roles": ["admin"], "_msg_id": "b2921a9060e148ec9186e478717672e8", "_c
ontext_quota_class": null, "_context_request_id": "req-b773a8d2-175a-4b8c-aead-7c5dca33dadf", "_context_service_catalog": [{"endpoints_links": [], "endpoints": [{"adminURL": "http://10.35.160.133:8776/v1/7eb59aa89e8944d098554ff6f5a4cf88"
, "region": "RegionOne", "publicURL": "http://10.35.160.133:8776/v1/7eb59aa89e8944d098554ff6f5a4cf88", "id": "46f73e2ea31540a5aed0daa8ebc82857", "internalURL": "http://10.35.160.133:8776/v1/7eb59aa89e8944d098554ff6f5a4cf88"}], "type": "v
olume", "name": "cinder"}], "_context_tenant": "7eb59aa89e8944d098554ff6f5a4cf88", "args": {"values": {"instance_uuid": "d1fa2f73-b81d-47f3-b30e-68ed069a849e", "finish_time": "2013-11-21T14:54:25.835276", "request_id": "req-b773a8d2-175a
-4b8c-aead-7c5dca33dadf", "result": "Success", "event": "compute_terminate_instance"}}, "_unique_id": "677b4b67f7a64feebb87378496a87337", "_context_timestamp": "2013-11-21T14:54:18.767872", "_context_user_id": "24b77982be8049ee9cd5ad7bed
913565", "_context_project_name": "admin", "_context_read_deleted": "no", "_reply_q": "reply_82fcbb14702642ffbf4a0d5e9c745598", "_context_auth_token": "ea5a64ace812ebaa1e84934d65c1eac8", "namespace": null, "_context_instance_lock_checked
": false, "_context_is_admin": true, "version": "1.25", "_context_project_id": "7eb59aa89e8944d098554ff6f5a4cf88", "_context_user": "24b77982be8049ee9cd5ad7bed913565", "_context_user_name": "admin", "method": "action_event_finish", "_con
text_remote_address": "10.35.160.133"}', 'oslo.version': '2.0'}) send /usr/lib/python2.6/site-packages/qpid/messaging/driver.py:1280
2013-11-21 16:54:25.840 19854 DEBUG qpid.messaging.io.raw [-] SENT[3afc290]: '\x0f\x00\x00:\x00\x00\x00\x00\x00\x00\x00\x00\x02\x01\x01\x00\x00(27cb1792-1439-4b48-b1ea-7910b7dd1d1f:850\x0f\x00\x00\x1c\x00\x00\x00\x00\x00\x00\x00\x00\x02\
x07\x03\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x0b\x01\x00\x1c\x00\x01\x00\x00\x00\x00\x00\x00\x04\x01\x01\x01\x01\x00\tamq.topic\x03\x02\x00f\x00\x01\x00\x00\x00\x00\x00\x00\x00\x00\x00\x19\x04\x01\x00\x02\x14topic/nova/con
ductor\x00\x00\x009\x04\x03\x10\x01\x08amqp/map\x00\x00\x00(\x00\x00\x00\x01\x0cqpid.subject\x95\x00\x14topic/nova/conductor\x07\x03\x06;\x00\x01\x00\x00\x00\x00\x00\x00\x00\x00\x06+\x00\x00\x00\x02\x0coslo.message\x95\x06\x04{"_context_
roles": ["admin"], "_msg_id": "b2921a9060e148ec9186e478717672e8", "_context_quota_class": null, "_context_request_id": "req-b773a8d2-175a-4b8c-aead-7c5dca33dadf", "_context_service_catalog": [{"endpoints_links": [], "endpoints": [{"admin
URL": "http://10.35.160.133:8776/v1/7eb59aa89e8944d098554ff6f5a4cf88", "region": "RegionOne", "publicURL": "http://10.35.160.133:8776/v1/7eb59aa89e8944d098554ff6f5a4cf88", "id": "46f73e2ea31540a5aed0daa8ebc82857", "internalURL": "http://
10.35.160.133:8776/v1/7eb59aa89e8944d098554ff6f5a4cf88"}], "type": "volume", "name": "cinder"}], "_context_tenant": "7eb59aa89e8944d098554ff6f5a4cf88", "args": {"values": {"instance_uuid": "d1fa2f73-b81d-47f3-b30e-68ed069a849e", "finish_
time": "2013-11-21T14:54:25.835276", "request_id": "req-b773a8d2-175a-4b8c-aead-7c5dca33dadf", "result": "Success", "event": "compute_terminate_instance"}}, "_unique_id": "677b4b67f7a64feebb87378496a87337", "_context_timestamp": "2013-11
-21T14:54:18.767872", "_context_user_id": "24b77982be8049ee9cd5ad7bed913565", "_context_project_name": "admin", "_context_read_deleted": "no", "_reply_q": "reply_82fcbb14702642ffbf4a0d5e9c745598", "_context_auth_token": "ea5a64ace812ebaa
1e84934d65c1eac8", "namespace": null, "_context_instance_lock_checked": false, "_context_is_admin": true, "version": "1.25", "_context_project_id": "7eb59aa89e8944d098554ff6f5a4cf88", "_context_user": "24b77982be8049ee9cd5ad7bed913565",
"_context_user_name": "admin", "method": "action_event_finish", "_context_remote_address": "10.35.160.133"}\x0coslo.version\x95\x00\x032.0' writeable /usr/lib/python2.6/site-packages/qpid/messaging/driver.py:480
2013-11-21 16:54:25.841 19854 DEBUG qpid.messaging.io.raw [-] READ[3afc290]: '\x0f\x00\x00:\x00\x00\x00\x00\x00\x00\x00\x00\x02\x02\x01\x00\x00(27cb1792-1439-4b48-b1ea-7910b7dd1d1f:850\x0f\x00\x00\x1c\x00\x00\x00\x00\x00\x00\x00\x00\x02\
x07\x03\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x0f\x00\x00\x12\x00\x00\x00\x00\x00\x00\x00\x00\x02\n\x03\x00\x00\x00\x0f\x00\x00\x12\x00\x00\x00\x00\x00\x00\x00\x00\x02\n\x03\x00\x00\x00\x0f\x00\x00\x1a\x00\x00\x00\x00\x00\x
00\x00\x00\x02\n\x01\x00\x00\x08\x00\x00\x00\x00\x00\x00\x00\x00' readable /usr/lib/python2.6/site-packages/qpid/messaging/driver.py:416

Comment 3 Xavier Queralt 2013-11-25 10:30:24 UTC
I think your assumption is wrong and that nova is already logging in the right place. The service receiving the attach request from the client is nova-api, which forwards the request to the compute node hosting the specified instance. That node (compute #2) is where the request is executed, so it makes sense to log the error on that host. The API service only takes care of validating the request (parsing the JSON/XML, ensuring that all the fields are valid, ensuring that the volume and the instance exist, and reserving the volume so it cannot be attached a second time). Besides, we don't want to block the API service waiting for the result and/or the error, as that is not a very RESTful approach. Remember that you can always get the host where the instance is running from the output of the command "nova show <instance_id>".

Take for example a cluster with multiple nova-api services spread across different hosts. When you make a request using the nova client or the API, you don't know on which host the request was initially processed until you go through all the hosts and check the API logs. Doesn't it, in this case, make more sense to log the error on the compute hosts, which you can identify easily?

Another option would be to make the nova client report an error if the attachment was unsuccessful. This could be done by implementing a "--poll" option for this command that checks the attachment status periodically until the volume either moves to the "attached" state (successful attachment) or back to the "available" state (unsuccessful attachment). If that is the case, I suggest moving the bug to python-novaclient as a feature request.
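The "--poll" behaviour proposed above could look roughly like the following. This is a hypothetical sketch, not python-novaclient code: get_status stands in for a real volume-status lookup (Cinder reports a successfully attached volume as 'in-use'), and all names here are illustrative assumptions.

```python
import time

def wait_for_attach(get_status, interval=2.0, timeout=60.0):
    """Poll a volume's status until it settles.

    get_status is any zero-argument callable returning the current
    volume status string. Returns True when the volume reaches the
    attached ('in-use') state, False when it falls back to
    'available' (the attach failed), and raises TimeoutError if
    neither happens within the timeout.
    """
    deadline = time.time() + timeout
    while time.time() < deadline:
        status = get_status()
        if status == 'in-use':
            return True       # attach succeeded
        if status == 'available':
            return False      # volume released: attach failed
        time.sleep(interval)  # still 'attaching'; keep waiting
    raise TimeoutError('volume never reached a terminal state')
```

A real client would plug in something like a volume lookup against the Block Storage API and turn a False result into a non-zero exit code - exactly the error signal missing from the reported scenario.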

Comment 4 Dafna Ron 2013-11-25 11:01:43 UTC
well, first, nova volume-attach has no --poll :) 

usage: nova volume-attach <server> <volume> <device>

Attach a volume to a server.

Positional arguments:
  <server>  Name or ID of server.
  <volume>  ID of the volume to attach.
  <device>  Name of the device e.g. /dev/vdb. Use "auto" for autoassign (if
            supported)


Second, you are talking about the technical issues in fixing this bug, and I am talking about the user experience of debugging. I don't think this should be corrected for one single command; it should be a global change in how we monitor and/or report issues.
I understand that this may not be a simple or easy fix, but I think that the experience of debugging an issue (especially in a large environment) is very time-consuming and complicated.

Currently, if I try to perform an action like attaching a volume while the instance is running on a different compute, and there is a problem completing the action, neither Horizon nor the log reports an ERROR at all.
So a user attaches a volume from Horizon -> it reports success -> the volume is not actually attached -> the user goes to the log on their main server (since we can have 1000 computes but would use one main one) -> no error is reported.
This is not a correct debugging flow...
I would say that the least we can do, if we direct the command to a different compute, is to log very clearly that the command was redirected to that compute.
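The "log the redirect clearly" idea could be sketched like this on the sending side. This is an illustrative sketch, not actual Nova code: cast_attach_volume and the rpc_cast parameter are hypothetical names, and instance['host'] mirrors the host field that "nova show" displays.

```python
import logging

LOG = logging.getLogger(__name__)

def cast_attach_volume(rpc_cast, instance, volume_id, mountpoint):
    """Forward a volume-attach request to the instance's compute host
    (hypothetical sketch; names are illustrative)."""
    # Log the redirect before casting the async RPC, so the local log
    # records which compute node the request was sent to.
    LOG.info("Forwarding volume-attach of %s to compute host %s for "
             "instance %s", volume_id, instance['host'], instance['uuid'])
    rpc_cast(instance['host'], 'attach_volume',
             volume_id=volume_id, mountpoint=mountpoint)
```

With such a line in the log on the host where the command originated, a failed attach at least leaves a pointer to the compute node whose log holds the actual error.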

Personally, I would like us to come up with a good solution for HA and large-scale environments (like a centralized log server, or perhaps a log entry on a load server), since I think this will become a bigger issue once large-scale environments are deployed.

Comment 5 Xavier Queralt 2013-11-25 12:40:05 UTC
(In reply to Dafna Ron from comment #4)
> well, first, nova volume-attach has no --poll :) 
> 
> usage: nova volume-attach <server> <volume> <device>
> 
> Attach a volume to a server.
> 
> Positional arguments:
>   <server>  Name or ID of server.
>   <volume>  ID of the volume to attach.
>   <device>  Name of the device e.g. /dev/vdb. Use "auto" for autoassign (if
>             supported)
> 

Sorry, I should have been more specific. I was trying to say that it should be implemented in the nova client and, although not mentioned, in the dashboard.

> 
> second, you are talking about technical issues in fixing this bug and I am
> talking about user experience in debugging + I don't think this should be
> corrected for one single command but should be global change in how we
> monitor and or report issues). 
> I understand that this may not be simple or easy fix but I think that the
> user experience in debugging an issue (especially when you have a large
> environment) is very time consuming and complicated. 

It would be easy to fix by making the API call block until it gets an answer from the compute node. I'm just saying that I don't think that is the right way to do it, and that it should instead be done by extending the clients so they can check and report the status of the action.

> 
> Currently If I try to perform an action like attach volume and instance is
> running on a different compute, if there is a problem completing the action,
> both Horizon and the log will not report an ERROR at all. 
> so user will attach a volume from horizon -> it reports success -> volume is
> not actually attached -> user goes to the log on his main server (since we
> can have 1000 computes but would use one main one) -> no error is reported. 
> This is not a correct debugging flow... 

The main idea behind having multiple, redundant services talking through the messaging queue is to not have a "main" server. So your assumption wouldn't hold in a real deployment, and nothing prevents you from having 1000 API nodes too.

Don't get me wrong, I completely agree with you that we should improve the error reporting for cases like the one you reported. But considering that we're talking to a REST API, I think it's the client's duty to query the status of an action before reporting it. The error should be reported by the nova client or the dashboard, and if the problem needs further debugging, the sysadmin should know how to get the host where the instance was running (either from the database or using the "nova show" command) and fetch the logs for that compute node.

> I would say that the least we can do if we direct the command to a different
> compute is to log that that command was redirected to a different compute in
> a very clear way.
> 
> Personally, I would like us to come up with a nice solution for HA and large
> scale environments (like a centralized log server or perhaps a log entry in
> a load server) since I think this would become a bigger issue once large
> scale environments are configured.

Sure, I've used services for aggregating/parsing the logs in the past and it helps a lot in big deployments.

Comment 6 Dafna Ron 2013-11-25 12:51:00 UTC
I think we are pretty much in agreement ;)

1. This is a bug with user-experience impact.
2. The solution should not be to block the API with sync requests, even though that is a simple solution.
3. We need to give some thought to possible solutions.

Are we working on an async task system? Perhaps it could be configured to query tasks in a large-scale environment?

Comment 7 Dafna Ron 2013-11-25 13:51:52 UTC
Following the chat with Xavier, I am moving this bug to the CLI component and cloning an RFE to improve monitoring for HA/large-scale environments.

Comment 8 Dave Allan 2014-05-21 16:48:07 UTC
Since this is being tracked upstream as a bug in Launchpad (1253695), I'm closing this as UPSTREAM.

