Bug 1374764

Summary: CLI host evacuation and instanceHA failing
Product: Red Hat OpenStack
Component: python-novaclient
Version: 9.0 (Mitaka)
Reporter: John Williams <j_t_williams>
Assignee: melanie witt <mwitt>
QA Contact: awaugama
Status: CLOSED ERRATA
Severity: urgent
Priority: high
Target Milestone: async
Target Release: 9.0 (Mitaka)
Keywords: Rebase, ZStream
Hardware: Unspecified
OS: Linux
Fixed In Version: python-novaclient-3.3.2-1.el7ost
Last Closed: 2016-10-05 19:14:02 UTC
Type: Bug
Bug Blocks: 1305654
CC: abeekhof, arkady_kanevsky, audra_cooper, berrange, cdevine, christopher_dearborn, dasmith, david_paterson, dcain, eglynn, jjoyce, John_walsh, jruzicka, j_t_williams, kasmith, kchamart, kurt_hey, mburns, morazi, mwitt, nbarcet, randy_perryman, sbauza, sferdjao, sgordon, sreichar, srevivo, vromanso

Description John Williams 2016-09-09 14:38:38 UTC
Description of problem:
The CLI host evacuation command is failing as shown below: 

[osp_admin@r7-director deployment-validation]$ nova list --field name,status,host
| ID                                   | Name         | Status | Host                      |
| b5ad8947-1cb8-49cf-aa28-338d9bf04d8c | cirros_test  | ACTIVE | r7-13g-compute-2.rcbd.lab |
| fb4dcfc8-869e-43b0-ba4a-22815048d616 | cirros_test1 | ACTIVE | r7-13g-compute-2.rcbd.lab |

[osp_admin@r7-director deployment-validation]$ nova host-evacuate --target_host r7-13g-compute-0.rcbd.lab --on-shared-storage r7-13g-compute-2.rcbd.lab

| Server UUID                          | Evacuate Accepted | Error Message |
| b5ad8947-1cb8-49cf-aa28-338d9bf04d8c | False             | Error while evacuating instance: evacuate() got an unexpected keyword argument 'on_shared_storage' |
| fb4dcfc8-869e-43b0-ba4a-22815048d616 | False             | Error while evacuating instance: evacuate() got an unexpected keyword argument 'on_shared_storage' |


How reproducible:
Very reproducible with OSP 9

Steps to Reproduce:
1. Install OSP 9 and create a cluster
2. Create an instance and attempt to evacuate the host the instance lives on using the CLI command shown in the description of this issue.


Actual results:
'nova host-evacuate' reports "Evacuate Accepted: False" for every instance, with the error "evacuate() got an unexpected keyword argument 'on_shared_storage'", as shown above.

Expected results:
The instances are evacuated to the target host with "Evacuate Accepted: True".

Additional info:

Some additional version information: 

fence-agents-common-4.0.11-27.el7_2.7.x86_64

[heat-admin@r7-13g-compute-0 ~]$ rpm -qa | grep nova
openstack-nova-compute-13.1.1-2.el7ost.noarch
openstack-nova-console-13.1.1-2.el7ost.noarch
python-novaclient-3.3.1-1.el7ost.noarch
python-nova-13.1.1-2.el7ost.noarch
openstack-nova-novncproxy-13.1.1-2.el7ost.noarch
openstack-nova-common-13.1.1-2.el7ost.noarch
openstack-nova-api-13.1.1-2.el7ost.noarch
openstack-nova-conductor-13.1.1-2.el7ost.noarch
openstack-nova-cert-13.1.1-2.el7ost.noarch
openstack-nova-scheduler-13.1.1-2.el7ost.noarch

[heat-admin@r7-13g-compute-0 ~]$ nova --version
3.3.1

Comment 2 Stephen Gordon 2016-09-11 02:19:39 UTC
Added external trackers that appear to be related to the same issue and flagging for RHOSP 9 z-stream.

    https://bugs.launchpad.net/python-novaclient/+bug/1581336
    https://review.openstack.org/#/c/332038/

In addition, we need to confirm whether this fix needs to be applied to the versions of the client shipping with RHOSP 8, RHOSP 9, and the RHEL Workstation OpenStack Client Tools channels.

Comment 4 arkady kanevsky 2016-09-12 20:29:02 UTC
Mr. Burns,
when do we expect that to land in OSP9 on CDN?
We cannot start the validation cycle without this fix in OSP9.

Comment 5 Mike Burns 2016-09-12 20:38:48 UTC
(In reply to arkady kanevsky from comment #4)
> Mr. Burns,
> when do we expect that to land in OSP9 on CDN?
> We cannot start the validation cycle without this fix in OSP9.

At this time, it's not clear.  This needs some engineering review first from the compute team.  Once they reply, I'll be able to give a better answer.

Comment 6 Mike Burns 2016-09-12 22:07:54 UTC
Eoghan, can we get dev to review and comment?  Is the patch that Steve referenced enough to resolve this?  If so, can we get it set up for the async update?

Comment 7 John Williams 2016-09-13 15:54:24 UTC
Is there a previous version of python-novaclient rpm that we can downgrade to and version lock as a work-around?   

Our current version is: python-novaclient-3.3.1-1.el7ost.noarch

Comment 8 Mike Burns 2016-09-13 18:17:27 UTC
(In reply to John Williams from comment #7)
> Is there a previous version of python-novaclient rpm that we can downgrade
> to and version lock as a work-around?   
> 
> Our current version is: python-novaclient-3.3.1-1.el7ost.noarch

As of now, in OSP 9, that is the only version of python-novaclient we've shipped.

Comment 9 melanie witt 2016-09-13 23:04:41 UTC
As mentioned in the upstream bug [1] this can be worked around by doing two commands instead of one 'host-evacuate':

 1. 'nova hypervisor-servers <host>' to get a list of servers on a host
 2. 'nova evacuate <server>' per server returned in step 1.

This works because 'nova host-evacuate' is a batch command that does 1 and 2 behind the scenes. The bug is that when microversion 2.14 was introduced, only 'nova evacuate' was modified to work with it and 'nova host-evacuate' was missed.

Another workaround that might work is:

 'nova --os-compute-api-version 2.13 host-evacuate'

since the problem is that novaclient 3.3.1 doesn't support 2.14 for 'host-evacuate'.
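
Roughly, the batch command is doing something like this via the Python API (a sketch only, not the actual novaclient code; 'nova' is an authenticated client object and 'failed_host' is a placeholder):

 # list the servers on the failed host, then evacuate each one
 servers = nova.servers.list(search_opts={'host': failed_host,
                                          'all_tenants': 1})
 for server in servers:
     nova.servers.evacuate(server)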

The fix for the bug was merged to the stable/mitaka branch on July 10 but there hasn't been an upstream novaclient release from stable/mitaka since. I have requested an upstream release of version 3.3.2 [2] today.

[1] https://bugs.launchpad.net/python-novaclient/+bug/1581336/comments/2
[2] https://review.openstack.org/369723

Comment 10 Andrew Beekhof 2016-09-14 00:35:26 UTC
Unfortunately we can't make use of these work-arounds as it is an automated evacuation coming from a shipping fence agent.

Bumping priority accordingly.

Comment 11 Andrew Beekhof 2016-09-14 00:40:50 UTC
(In reply to melanie witt from comment #9)
> As mentioned in the upstream bug [1] this can be worked around by doing two
> commands instead of one 'host-evacuate':
> 
>  1. 'nova hypervisor-servers <host>' to get a list of servers on a host
>  2. 'nova evacuate <server>' per server returned in step 1.
> 
> This works because 'nova host-evacuate' is a batch command that does 1 and 2
> behind the scenes. The bug is that when microversion 2.14 was introduced,
> only 'nova evacuate' was modified to work with it and 'nova host-evacuate'
> was missed.

Something doesn't sound right here.
As far as I can tell, the agent ( https://github.com/ClusterLabs/fence-agents/blob/master/fence/agents/compute/fence_compute.py ) already functions this way:

servers = nova.servers.list(search_opts={'host': options["--plug"], 'all_tenants': 1 })

for server in servers:

   (response, dictionary) = nova.servers.evacuate(server=server, on_shared_storage=on_shared_storage)

Comment 12 melanie witt 2016-09-14 00:59:10 UTC
> Something doesn't sound right here.
> As far as I can tell, the agent (
> https://github.com/ClusterLabs/fence-agents/blob/master/fence/agents/compute/
> fence_compute.py ) already functions this way:
> 
> servers = nova.servers.list(search_opts={'host': options["--plug"],
> 'all_tenants': 1 })
> 
> for server in servers:
> 
>    (response, dictionary) = nova.servers.evacuate(server=server,
> on_shared_storage=on_shared_storage)

You're right. So, the problem here is that from Nova API microversion 2.14 onward, the 'on_shared_storage' parameter has been removed [1]. With the novaclient 3.3.1 Python API, you should have seen the error "Setting 'on_shared_storage' argument is prohibited after microversion 2.14" [2]. The fix for this is to stop passing the 'on_shared_storage' parameter if you're requesting microversion >= 2.14.
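
A minimal sketch of that fix for Python API callers (placeholder names, not the actual fence agent code; this assumes the client object exposes the requested version as 'nova.api_version'):

 from novaclient import api_versions

 kwargs = {}
 # 'on_shared_storage' is only accepted before microversion 2.14
 if nova.api_version < api_versions.APIVersion('2.14'):
     kwargs['on_shared_storage'] = on_shared_storage
 nova.servers.evacuate(server, **kwargs)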

The aforementioned upstream bug is about the Nova CLI 'host-evacuate' batch command (which is what was described in the first comment of this bug report). The fix for this would be either a backport or a rebase to 3.3.2 once it's released.

[1] http://docs.openstack.org/developer/nova/api_microversion_history.html#id12
[2] https://github.com/openstack/python-novaclient/blob/3.3.1/novaclient/v2/servers.py#L496-L498

Comment 13 melanie witt 2016-09-14 01:07:25 UTC
Also note that it's not valid to pass --on-shared-storage to 'nova host-evacuate' from Nova API microversion >= 2.14, but there was a bug where the command failed even when --on-shared-storage was not passed.

Comment 14 Andrew Beekhof 2016-09-14 01:15:48 UTC
Looking at the agent, I see:

	nova = nova_client.Client('2',
		options["--username"],
		options["--password"],
		options["--tenant-name"],
		options["--auth-url"],
		insecure=options["--insecure"],
		region_name=options["--region-name"],
		endpoint_type=options["--endpoint-type"])


Does '2' mean "the latest 2.x microversion" ?

I'm unsure of the path forward here... what happened to the functionality behind that feature?  Is it now auto-detected? 

It's unclear to me what would be worse, not passing that option to older versions or explicitly requesting microversion 2.13 for the foreseeable future.  Which OSPs support 2.13?

Comment 15 melanie witt 2016-09-14 01:42:15 UTC
(In reply to Andrew Beekhof from comment #14)
> Looking at the agent, I see:
> 
> 	nova = nova_client.Client('2',
> 		options["--username"],
> 		options["--password"],
> 		options["--tenant-name"],
> 		options["--auth-url"],
> 		insecure=options["--insecure"],
> 		region_name=options["--region-name"],
> 		endpoint_type=options["--endpoint-type"])
> 
> 
> Does '2' mean "the latest 2.x microversion" ?

No, the novaclient Python API does not automatically request the latest microversion. '2' means microversion '2.0' which is the base version 2.1 from Nova API. So I would not expect you to see an error in the agent. Are you seeing an error in the agent?

The Nova CLI defaults to the latest microversion, so it will auto-detect the latest. This bug was opened about the 'nova host-evacuate' CLI command which will automatically call the latest Nova API version.

The Python API does not do version discovery because it's assumed users will be writing automation with it that shouldn't be disturbed by changing API microversions. The CLI does version discovery because it's assumed users are interactively running it on the command line and would expect to receive new API versions as they arrive.
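
If you ever do need a specific newer microversion from the Python API, you can pin one explicitly in the constructor. For example, a sketch using the same positional arguments the agent already passes:

 from novaclient import client as nova_client

 # '2.13' is the last microversion that still accepts on_shared_storage
 nova = nova_client.Client('2.13',
                           username, password, tenant_name, auth_url)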

Can you explain the problem you're encountering? Are you experiencing problems with the agent you described earlier? Or do you need a fix for the Nova CLI 'host-evacuate' command?

Comment 16 Andrew Beekhof 2016-09-14 03:00:07 UTC
(In reply to melanie witt from comment #15)
> (In reply to Andrew Beekhof from comment #14)
> > Does '2' mean "the latest 2.x microversion" ?
> 
> No, the novaclient Python API does not automatically request the latest
> microversion. '2' means microversion '2.0' which is the base version 2.1
> from Nova API. So I would not expect you to see an error in the agent. Are
> you seeing an error in the agent?

I believe that was the original context for this report.
JT: can you confirm please?

Comment 18 John Williams 2016-09-14 20:52:40 UTC
We are seeing the same error when we try to run the "nova host-evacuate ..." command shown above from the command line of one of the controller nodes with and without the --on-shared-storage argument.  

I tried to patch the host_evacuate.py file using the diffs as shown in changeset 332038 -- https://review.openstack.org/#/c/332038/ and it had no effect on the nova host-evacuate command-line behavior.  Thus I suspect there is another file that must also be modified for the changeset to work properly.   

Since we are using the Red Hat yum packages with our installation, it is difficult to associate the upstream novaclient API version with the Red Hat rpm versions.  Do you have a suggestion on version identification of the files you wish to compare?

Comment 19 melanie witt 2016-09-14 22:20:31 UTC
(In reply to John Williams from comment #18)
> We are seeing the same error when we try to run the "nova host-evacuate ..."
> command shown above from the command line of one of the controller nodes
> with and without the --on-shared-storage argument.  
> 
> I tried to patch the host_evacuate.py file using the diffs as shown in
> changeset 332038 -- https://review.openstack.org/#/c/332038/ and it had no
> effect on the nova host-evacuate command-line behavior.  Thus I suspect
> there is another file that must also be modified for the changeset to work
> properly.   

That's unexpected that the patch wouldn't help, but I'm building a new package that contains that change for you to try out.

> Since we are using the Red Hat yum packages with our installation, it is
> difficult to associate the upstream novaclient API version with the Red Hat
> rpm versions.  Do you have a suggestion on version identification of the
> files you wish to compare?

I'm not sure what you mean by version identification of files to compare, but you can see what version of novaclient you're running by:

 nova --version

and you can see what versions of Nova API are supported by your novaclient and by the server with:

 nova version-list

Comment 20 melanie witt 2016-09-14 22:30:07 UTC
The new package python-novaclient-3.3.2-1.el7ost is ready for you to try.

Comment 22 Andrew Beekhof 2016-09-15 02:36:01 UTC
(In reply to John Williams from comment #18)
> We are seeing the same error when we try to run the "nova host-evacuate ..."
> command shown above from the command line of one of the controller nodes
> with and without the --on-shared-storage argument.  

Ah, my bad, I just assumed the fence agent was involved.

Comment 23 Stephen Gordon 2016-09-15 12:38:00 UTC
(In reply to Andrew Beekhof from comment #22)
> (In reply to John Williams from comment #18)
> > We are seeing the same error when we try to run the "nova host-evacuate ..."
> > command shown above from the command line of one of the controller nodes
> > with and without the --on-shared-storage argument.  
> 
> Ah, my bad, I just assumed the fence agent was involved.

John, can you perhaps clarify here: the bug title indicates you see this issue with both CLI-initiated host evacuation *and* instance HA (fence agent), but the description and comments focus on the CLI.

Are you seeing the issue from both avenues or just when using the CLI directly?

Comment 24 John Williams 2016-09-15 14:01:36 UTC
(In reply to Stephen Gordon from comment #23)
> (In reply to Andrew Beekhof from comment #22)
> > (In reply to John Williams from comment #18)
> > > We are seeing the same error when we try to run the "nova host-evacuate ..."
> > > command shown above from the command line of one of the controller nodes
> > > with and without the --on-shared-storage argument.  
> > 
> > Ah, my bad, I just assumed the fence agent was involved.
> 
> John, can you perhaps clarify here: the bug title indicates you see this
> issue with both CLI-initiated host evacuation *and* instance HA (fence
> agent), but the description and comments focus on the CLI.
> 
> Are you seeing the issue from both avenues or just when using the CLI
> directly?

Yes, we are seeing issues with both avenues.  I have not been able to check log files with the instance HA (fence_agents) as the QA person has since rebuilt their stamp and is working on other validation tasks.  I have mainly been working with another QA person who reported the CLI issue, thus the focus on CLI.  I'm hoping once the CLI issue is resolved the instance HA will be a non-issue.

Comment 25 John Williams 2016-09-15 14:29:55 UTC
(In reply to melanie witt from comment #20)
> The new package python-novaclient-3.3.2-1.el7ost is ready for you to try.

I looked in our subscribed repos and couldn't find it.  Where else can I look to find the new package python-novaclient-3.3.2-1.el7ost?

Comment 26 Stephen Gordon 2016-09-15 15:02:19 UTC
Mike can you clarify how this was handed off?

Comment 27 John Williams 2016-09-15 15:04:51 UTC
I received the file from Mike Burns.

Comment 28 John Williams 2016-09-15 15:56:37 UTC
With the new rpm package installed and without the argument --on-shared-storage, the CLI works.  I still need to test instanceHA.   

[root@overcloud-controller-0 ~]# nova host-evacuate --target_host  overcloud-compute-1.amberauto.org overcloud-compute-2.amberauto.org
+--------------------------------------+-------------------+---------------+
| Server UUID                          | Evacuate Accepted | Error Message |
+--------------------------------------+-------------------+---------------+
| d8e4e94a-6927-4a2e-93d4-53deec5d3fba | True              |               |
| 727b2c2c-ce87-404c-9c29-b925c9d1fbb5 | True              |               |
| 995fa66a-ab75-4751-bdb9-3954cca38e94 | True              |               |
| f8644da6-caa4-4022-92b6-1ef70ed50715 | True              |               |
| b3930673-f62c-4399-b3b5-777c4c53b292 | True              |               |
| 1d1cf10a-da7b-468f-a522-594b8b184b64 | True              |               |
| f26f4017-57d7-42b8-bf20-f1a0f3949b88 | True              |               |
| 9ae5f178-9b94-43ff-843c-1526d6dbe82a | True              |               |
| c64790bd-4532-400a-8f96-bd92ecbc5bfc | True              |               |
| cbfec874-2db1-4ab3-abf4-fb6c3b648320 | True              |               |
| ac58d9cd-56a8-4409-88c1-5e7fdb620505 | True              |               |
+--------------------------------------+-------------------+---------------+

Comment 30 John Williams 2016-09-16 16:17:07 UTC
I'm still having issues with InstanceHA.  Andrew and I did a quick review of my cib.xml configuration, and he located a few entries I should change in the pcs stonith fence-nova configuration.  

auth_url => auth-url 
password => passwd 
username => login 
tenant_name => tenant-name 

[root@r11b-controller-0 ~]# pcs stonith show fence-nova

 Resource: fence-nova (class=stonith type=fence_compute)
  Attributes: auth-url=http://192.168.190.50:5000/v2.0 login=admin passwd=NsDJdtWeDzezvvvTYnGKgArB3 tenant-name=admin record-only=1 action=off
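
For reference, the corrected resource could be created with something like the following (a sketch assuming the attribute names shown above):

 pcs stonith create fence-nova fence_compute \
     auth-url=http://192.168.190.50:5000/v2.0 login=admin \
     passwd=NsDJdtWeDzezvvvTYnGKgArB3 tenant-name=admin \
     record-only=1 action=off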

I made those changes this morning and it cleared up the Parse errors in the log, but I still don't see the entries in the log where it tries to evacuate the instance.

Comment 31 John Williams 2016-09-16 20:01:35 UTC
Further testing of the CLI is indicating there may be another issue (different bug).  

Evacuation of an instance or instances from a node (A) to a new node (B) appears to be working correctly.  However, if you then attempt to evacuate the same instance/s from node B to another node (C, D, ...) or back to node (A) you receive the following error.  

ERROR (BadRequest): Compute service of r11b-compute-0.localdomain is still in use. (HTTP 400) (Request-ID: req-f279117f-b962-4f57-8a92-feac8fa26396).   

This was seen on 2 stamps, just waiting for QA to test and they will file a new bug on this if they see it as well.

Comment 32 Andrew Beekhof 2016-09-19 02:03:13 UTC
(In reply to John Williams from comment #30)
> I'm still having issues with InstanceHA.  Andrew and I did a quick review
> of my cib.xml configuration, and he located a few entries I should change
> in the pcs stonith fence-nova configuration.  
> 
> auth_url => auth-url 
> password => passwd 
> username => login 
> tenant_name => tenant-name 
> 
> [root@r11b-controller-0 ~]# pcs stonith show fence-nova
> 
>  Resource: fence-nova (class=stonith type=fence_compute)
>   Attributes: auth-url=http://192.168.190.50:5000/v2.0 login=admin
> passwd=NsDJdtWeDzezvvvTYnGKgArB3 tenant-name=admin record-only=1 action=off
> 
> I made those changes this morning and it cleared up the Parse errors in the
> log, but I still don't see the entries in the log where it tries to evacuate
> the instance.

Can we get updated logs please?

Comment 33 Andrew Beekhof 2016-09-19 02:04:16 UTC
(In reply to John Williams from comment #31)
> Further testing of the CLI is indicating there may be another issue
> (different bug).  
> 
> Evacuation of an instance or instances from a node (A) to a new node (B)
> appears to be working correctly.  However, if you then attempt to evacuate
> the same instance/s from node B to another node (C, D, ...) or back to node
> (A) you receive the following error.  
> 
> ERROR (BadRequest): Compute service of r11b-compute-0.localdomain is still
> in use. (HTTP 400) (Request-ID: req-f279117f-b962-4f57-8a92-feac8fa26396).   
> 
> This was seen on 2 stamps, just waiting for QA to test and they will file a
> new bug on this if they see it as well.

I've seen this in the past; I thought the nova folks had squashed it, though.

Comment 34 John Williams 2016-09-19 17:45:26 UTC
Working on re-creating the issue since the QA stamp I was borrowing was wiped out over the weekend.

Comment 35 John Williams 2016-09-20 15:54:25 UTC
I found the last issue.  It was on my side: the stonith resources were being created with the wrong login information, so fencing the compute nodes was failing.  

To summarize:  

I had the original issue with python-novaclient needing to be patched, which was fixed with the new rpm package python-novaclient-3.3.2-1.el7ost.noarch. 

I had to adjust the names of the parameters for the pcs stonith create fence-nova command. 

I had to fix a bug on my side in the creation of the stonith compute resources.

- Thanks for the help.

Comment 37 Audra Cooper 2016-09-21 19:05:19 UTC
(In reply to Andrew Beekhof from comment #33)
> (In reply to John Williams from comment #31)
> > Further testing of the CLI is indicating there may be another issue
> > (different bug).  
> > 
> > Evacuation of an instance or instances from a node (A) to a new node (B)
> > appears to be working correctly.  However, if you then attempt to evacuate
> > the same instance/s from node B to another node (C, D, ...) or back to node
> > (A) you receive the following error.  
> > 
> > ERROR (BadRequest): Compute service of r11b-compute-0.localdomain is still
> > in use. (HTTP 400) (Request-ID: req-f279117f-b962-4f57-8a92-feac8fa26396).   
> > 
> > This was seen on 2 stamps, just waiting for QA to test and they will file a
> > new bug on this if they see it as well.
> 
> I've seen this in the past; I thought the nova folks had squashed it, though.

I did see this same issue.  When I removed instances in Error state, nova host-evacuate from one compute to the next worked.  It seems to be intermittent, and it looks like there is an older bug on this that is still in the ASSIGNED state: https://bugzilla.redhat.com/show_bug.cgi?id=1168676

Comment 39 errata-xmlrpc 2016-10-05 19:14:02 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2016-2032.html