| Summary: | CLI host evacuation and instanceHA failing | ||
|---|---|---|---|
| Product: | Red Hat OpenStack | Reporter: | John Williams <j_t_williams> |
| Component: | python-novaclient | Assignee: | melanie witt <mwitt> |
| Status: | CLOSED ERRATA | QA Contact: | awaugama |
| Severity: | urgent | Docs Contact: | |
| Priority: | high | ||
| Version: | 9.0 (Mitaka) | CC: | abeekhof, arkady_kanevsky, audra_cooper, berrange, cdevine, christopher_dearborn, dasmith, david_paterson, dcain, eglynn, jjoyce, John_walsh, jruzicka, j_t_williams, kasmith, kchamart, kurt_hey, mburns, morazi, mwitt, nbarcet, randy_perryman, sbauza, sferdjao, sgordon, sreichar, srevivo, vromanso |
| Target Milestone: | async | Keywords: | Rebase, ZStream |
| Target Release: | 9.0 (Mitaka) | ||
| Hardware: | Unspecified | ||
| OS: | Linux | ||
| Whiteboard: | |||
| Fixed In Version: | python-novaclient-3.3.2-1.el7ost | Doc Type: | If docs needed, set a value |
| Doc Text: | Story Points: | --- | |
| Clone Of: | Environment: | ||
| Last Closed: | 2016-10-05 19:14:02 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | Category: | --- | |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Bug Depends On: | |||
| Bug Blocks: | 1305654 | ||
Description
John Williams
2016-09-09 14:38:38 UTC
Added external trackers that appear to be related to the same issue and flagging for RHOSP 9 z-stream.
https://bugs.launchpad.net/python-novaclient/+bug/1581336
https://review.openstack.org/#/c/332038/
In addition we need to confirm if this needs to be applied to the versions of the client shipping with RHOSP 8, RHOSP 9, and the RHEL Workstation OpenStack Client Tools channels.
Mr. Burns, when do we expect that to land in OSP9 on CDN? We cannot start the validation cycle without this fix in OSP9.

(In reply to arkady kanevsky from comment #4)
> Mr. Burns,
> when do we expect that to land in OSP9 on CDN?
> We cannot start the validation cycle without this fix in OSP9.

At this time, it's not clear. This needs some engineering review first from the compute team. Once they reply, I'll be able to give a better answer.

Eoghan, can we get dev to review and comment? Is the patch that Steve referenced enough to resolve this? If so, can we get it set up for the async update?

Is there a previous version of python-novaclient rpm that we can downgrade to and version lock as a work-around?

Our current version is: python-novaclient-3.3.1-1.el7ost.noarch

(In reply to John Williams from comment #7)
> Is there a previous version of python-novaclient rpm that we can downgrade
> to and version lock as a work-around?
>
> Our current version is: python-novaclient-3.3.1-1.el7ost.noarch

As of now, in OSP 9, that is the only version of python-novaclient we've shipped.

As mentioned in the upstream bug [1] this can be worked around by doing two commands instead of one 'host-evacuate':

1. 'nova hypervisor-servers <host>' to get a list of servers on a host
2. 'nova evacuate <server>' per server returned in step 1.

This works because 'nova host-evacuate' is a batch command that does 1 and 2 behind the scenes. The bug is that when microversion 2.14 was introduced, only 'nova evacuate' was modified to work with it and 'nova host-evacuate' was missed.

Another workaround that might work is 'nova --os-compute-api-version 2.13 host-evacuate', since the problem is that novaclient 3.3.1 doesn't support 2.14 for 'host-evacuate'.

The fix for the bug was merged to the stable/mitaka branch on July 10, but there hasn't been an upstream novaclient release from stable/mitaka since. I have requested an upstream release of version 3.3.2 [2] today.

[1] https://bugs.launchpad.net/python-novaclient/+bug/1581336/comments/2
[2] https://review.openstack.org/369723

Unfortunately we can't make use of these work-arounds as it is an automated evacuation coming from a shipping fence agent. Bumping priority accordingly.

(In reply to melanie witt from comment #9)
> As mentioned in the upstream bug [1] this can be worked around by doing two
> commands instead of one 'host-evacuate':
>
> 1. 'nova hypervisor-servers <host>' to get a list of servers on a host
> 2. 'nova evacuate <server>' per server returned in step 1.
>
> This works because 'nova host-evacuate' is a batch command that does 1 and 2
> behind the scenes. The bug is that when microversion 2.14 was introduced,
> only 'nova evacuate' was modified to work with it and 'nova host-evacuate'
> was missed.

Something doesn't sound right here. As far as I can tell, the agent ( https://github.com/ClusterLabs/fence-agents/blob/master/fence/agents/compute/fence_compute.py ) already functions this way:

    servers = nova.servers.list(search_opts={'host': options["--plug"],
                                             'all_tenants': 1 })

    for server in...
> (response, dictionary) = nova.servers.evacuate(server=server,
> on_shared_storage=on_shared_storage)

You're right. So, the problem here is that from Nova API microversion 2.14 onward, the 'on_shared_storage' parameter has been removed [1]. With the novaclient 3.3.1 Python API, you should have seen the error "Setting 'on_shared_storage' argument is prohibited after microversion 2.14" [2]. The fix for this is to stop passing the 'on_shared_storage' parameter if you're requesting microversion >= 2.14.

The aforementioned upstream bug is about the Nova CLI 'host-evacuate' batch command (which is what was described in the first comment of this bug report). The fix for this would be either a backport or a rebase to 3.3.2 once it's released.

[1] http://docs.openstack.org/developer/nova/api_microversion_history.html#id12
[2] https://github.com/openstack/python-novaclient/blob/3.3.1/novaclient/v2/servers.py#L496-L498

Also note it's not valid to do 'nova host-evacuate --on-shared-storage' from Nova API microversion >= 2.14, but there was a bug where even when --on-shared-storage was not passed, it wouldn't work.
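As an illustrative aside (not code from this bug or from the fence agent), here is a minimal Python sketch of what the fix amounts to for a caller of the novaclient Python API: only pass 'on_shared_storage' when the requested microversion is below 2.14. The helper name evacuate_host and the supports_2_14 flag are assumptions; only the servers.list()/servers.evacuate() calls mirror the agent snippet quoted above.

    # Hypothetical helper; 'nova' is a novaclient v2 Client built the same way
    # the fence agent builds it, e.g.:
    #   nova = nova_client.Client('2', username, password, tenant, auth_url)
    def evacuate_host(nova, failed_host, supports_2_14, on_shared_storage=True):
        # Step 1 of the workaround: list every instance on the failed hypervisor.
        servers = nova.servers.list(search_opts={'host': failed_host,
                                                 'all_tenants': 1})
        # Step 2: evacuate each instance individually.
        for server in servers:
            if supports_2_14:
                # Microversion >= 2.14 removed 'on_shared_storage'; passing it
                # is rejected by the client.
                nova.servers.evacuate(server=server)
            else:
                nova.servers.evacuate(server=server,
                                      on_shared_storage=on_shared_storage)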
Looking at the agent, I see:

    nova = nova_client.Client('2',
                              options["--username"],
                              options["--password"],
                              options["--tenant-name"],
                              options["--auth-url"],
                              insecure=options["--insecure"],
                              region_name=options["--region-name"],
                              endpoint_type=options["--endpoint-type"])
Does '2' mean "the latest 2.x microversion" ?
I'm unsure of the path forward here... what happened to the functionality behind that feature? Is it now auto-detected?
It's unclear to me what would be worse, not passing that option to older versions or explicitly requesting microversion 2.13 for the foreseeable future. Which OSPs support 2.13?
(In reply to Andrew Beekhof from comment #14)
> Looking at the agent, I see:
>
> nova = nova_client.Client('2',
> options["--username"],
> options["--password"],
> options["--tenant-name"],
> options["--auth-url"],
> insecure=options["--insecure"],
> region_name=options["--region-name"],
> endpoint_type=options["--endpoint-type"])
>
> Does '2' mean "the latest 2.x microversion" ?

No, the novaclient Python API does not automatically request the latest microversion. '2' means microversion '2.0', which is the base version 2.1 from the Nova API. So I would not expect you to see an error in the agent. Are you seeing an error in the agent?

The Nova CLI defaults to the latest microversion, so it will auto-detect the latest. This bug was opened about the 'nova host-evacuate' CLI command, which will automatically call the latest Nova API version.

The Python API does not do version discovery because it's assumed users will be writing automation with it that shouldn't be disturbed by changing API microversions. The CLI does version discovery because it's assumed users are interactively running it on the command line and would expect to receive new API versions as they arrive.

Can you explain the problem you're encountering? Are you experiencing problems with the agent you described earlier? Or do you need a fix for the Nova CLI 'host-evacuate' command?

(In reply to melanie witt from comment #15)
> (In reply to Andrew Beekhof from comment #14)
> > Does '2' mean "the latest 2.x microversion" ?
>
> No, the novaclient Python API does not automatically request the latest
> microversion. '2' means microversion '2.0' which is the base version 2.1
> from Nova API. So I would not expect you to see an error in the agent. Are
> you seeing an error in the agent?

I believe that was the original context for this report. JT: can you confirm please?

We are seeing the same error when we try to run the "nova host-evacuate ..." command shown above from the command line of one of the controller nodes, with and without the --on-shared-storage argument.

I tried to patch the host_evacuate.py file using the diffs as shown in changeset 332038 -- https://review.openstack.org/#/c/332038/ and it had no effect on the nova host-evacuate command-line behavior. Thus I suspect there is another file that must also be modified for the changeset to work properly.

Since we are using the Red Hat yum packages with our installation, it is difficult to associate the openstack novaclient api version to the Red Hat rpm versions. Do you have a suggestion on version identification of the files you wish to compare?
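As an illustrative aside (not from the original thread), a hedged Python sketch of inspecting the installed client instead of using the CLI; it assumes the __version__ attribute and the API_MIN_VERSION/API_MAX_VERSION constants exposed by the 3.x python-novaclient series.

    # Hedged sketch: report the installed python-novaclient version and the
    # Nova API microversion range this client build claims to support.
    import novaclient

    print("python-novaclient %s supports microversions %s through %s" % (
        novaclient.__version__,
        novaclient.API_MIN_VERSION.get_string(),
        novaclient.API_MAX_VERSION.get_string()))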
(In reply to John Williams from comment #18)
> We are seeing the same error when we try to run the "nova host-evacuate ..."
> command shown above from the command line of one of the controller nodes
> with and without the --on-shared-storage argument.
>
> I tried to patch the host_evacuate.py file using the diffs as shown in
> changeset 332038 -- https://review.openstack.org/#/c/332038/ and it had no
> effect on the nova host-evacuate command-line behavior. Thus I suspect
> there is another file that must also be modified for the changeset to work
> properly.

That's unexpected that the patch wouldn't help, but I'm building a new package that contains that change for you to try out.

> Since we are using the Red Hat yum packages with our installation, it is
> difficult to associate the openstack novaclient api version to the Red Hat
> rpm versions. Do you have a suggestion on version identification of the
> files you wish to compare?

I'm not sure what you mean by version identification of files to compare, but you can see what version of novaclient you're running with:

nova --version

and you can see what versions of the Nova API are supported by your novaclient and by the server with:

nova version-list

The new package python-novaclient-3.3.2-1.el7ost is ready for you to try.

(In reply to John Williams from comment #18)
> We are seeing the same error when we try to run the "nova host-evacuate ..."
> command shown above from the command line of one of the controller nodes
> with and without the --on-shared-storage argument.

Ah, my bad, I just assumed the fence agent was involved.

(In reply to Andrew Beekhof from comment #22)
> (In reply to John Williams from comment #18)
> > We are seeing the same error when we try to run the "nova host-evacuate ..."
> > command shown above from the command line of one of the controller nodes
> > with and without the --on-shared-storage argument.
>
> Ah, my bad, I just assumed the fence agent was involved.

John, can you perhaps clarify here: the bug title indicates you see this issue with both CLI-initiated host evacuation *and* instance HA (fence agent), but the description and comments focus on the CLI.

Are you seeing the issue from both avenues or just when using the CLI directly?

(In reply to Stephen Gordon from comment #23)
> (In reply to Andrew Beekhof from comment #22)
> > (In reply to John Williams from comment #18)
> > > We are seeing the same error when we try to run the "nova host-evacuate ..."
> > > command shown above from the command line of one of the controller nodes
> > > with and without the --on-shared-storage argument.
> >
> > Ah, my bad, I just assumed the fence agent was involved.
>
> John can you perhaps clarify here, the bug title indicates you see this
> issue with both CLI initiated host evacuation *and* instance HA (fence
> agent) but the description and comments focus on the CLI.
>
> Are you seeing the issue from both avenues or just when using the CLI
> directly?

Yes, we are seeing issues with both avenues. I have not been able to check log files with the instance HA (fence_agents) as the QA person has since rebuilt their stamp and is working on other validation tasks. I have mainly been working with another QA person who reported the CLI issue, thus the focus on the CLI. I'm hoping once the CLI issue is resolved the instance HA will be a non-issue.

(In reply to melanie witt from comment #20)
> The new package python-novaclient-3.3.2-1.el7ost is ready for you to try.

I looked in our subscribed repos and couldn't find it. Where else can I look to find the new package python-novaclient-3.3.2-1.el7ost? Mike, can you clarify how this was handed off?

I received the file from Mike Burns. With the new rpm package installed and without the argument --on-shared-storage, the CLI works. I still need to test instanceHA.
[root@overcloud-controller-0 ~]# nova host-evacuate --target_host overcloud-compute-1.amberauto.org overcloud-compute-2.amberauto.org
+--------------------------------------+-------------------+---------------+
| Server UUID                          | Evacuate Accepted | Error Message |
+--------------------------------------+-------------------+---------------+
| d8e4e94a-6927-4a2e-93d4-53deec5d3fba | True              |               |
| 727b2c2c-ce87-404c-9c29-b925c9d1fbb5 | True              |               |
| 995fa66a-ab75-4751-bdb9-3954cca38e94 | True              |               |
| f8644da6-caa4-4022-92b6-1ef70ed50715 | True              |               |
| b3930673-f62c-4399-b3b5-777c4c53b292 | True              |               |
| 1d1cf10a-da7b-468f-a522-594b8b184b64 | True              |               |
| f26f4017-57d7-42b8-bf20-f1a0f3949b88 | True              |               |
| 9ae5f178-9b94-43ff-843c-1526d6dbe82a | True              |               |
| c64790bd-4532-400a-8f96-bd92ecbc5bfc | True              |               |
| cbfec874-2db1-4ab3-abf4-fb6c3b648320 | True              |               |
| ac58d9cd-56a8-4409-88c1-5e7fdb620505 | True              |               |
+--------------------------------------+-------------------+---------------+

I'm still having issues with InstanceHA. Andrew and I did a quick review of my cib.xml configuration and he located a few entries I should change in the pcs stonith fence-nova configuration:

auth_url => auth-url
password => password
username => login
tenant_name => tenant-name

[root@r11b-controller-0 ~]# pcs stonith show fence-nova
Resource: fence-nova (class=stonith type=fence_compute)
 Attributes: auth-url=http://192.168.190.50:5000/v2.0 login=admin passwd=NsDJdtWeDzezvvvTYnGKgArB3 tenant-name=admin record-only=1 action=off

I made those changes this morning and they cleared up the Parse errors in the log, but I still don't see the entries in the log where it tries to evacuate the instance.

Further testing of the CLI is indicating there may be another issue (different bug).

Evacuation of an instance or instances from a node (A) to a new node (B) appears to be working correctly. However, if you then attempt to evacuate the same instance/s from node B to another node (C, D, ...) or back to node (A), you receive the following error.

ERROR (BadRequest): Compute service of r11b-compute-0.localdomain is still in use. (HTTP 400) (Request-ID: req-f279117f-b962-4f57-8a92-feac8fa26396).

This was seen on 2 stamps; we are just waiting for QA to test, and they will file a new bug on this if they see it as well.

(In reply to John Williams from comment #30)
> I'm still having issues with InstanceHA, Andrew and I did a quick review of
> my cib.xml configuration and he located a few entries I should change on the
> pcs stonith fence-nova configuration.
>
> auth_url => auth-url
> password => password
> username => login
> tenant_name => tenant-name
>
> [root@r11b-controller-0 ~]# pcs stonith show fence-nova
>
> Resource: fence-nova (class=stonith type=fence_compute)
> Attributes: auth-url=http://192.168.190.50:5000/v2.0 login=admin
> passwd=NsDJdtWeDzezvvvTYnGKgArB3 tenant-name=admin record-only=1 action=off
>
> I made those changes this morning and it cleared up the Parse errors in the
> log, but still I don't see the entries in the log where it tries to evacuate
> the instance.

Can we get updated logs please?

(In reply to John Williams from comment #31)
> Further testing of the CLI is indicating there may be another issue
> (different bug).
>
> Evacuation of an instance or instances from a node (A) to a new node (B)
> appears to be working correctly. However, if you then attempt to evacuate
> the same instance/s from node B to another node (C, D, ...) or back to node
> (A) you receive the following error.
> ERROR (BadRequest): Compute service of r11b-compute-0.localdomain is still
> in use. (HTTP 400) (Request-ID: req-f279117f-b962-4f57-8a92-feac8fa26396).
>
> This was seen on 2 stamps, just waiting for QA to test and they will file a
> new bug on this if they see it as well.

I've seen this in the past, I thought the nova folks had squashed it though.

Working on re-creating the issue since the QA stamp I was borrowing was wiped out over the weekend.

I found the last issue. It was on my side: the stonith resources were being created with the wrong login information, so fencing the compute nodes was failing.

To summarize:

I had the original issue with python-novaclient needing to be patched, which was fixed with the new rpm package python-novaclient-3.3.2-1.el7ost.noarch.

I had to adjust the names of the parameters for the pcs stonith create fence-nova command.

I had to fix a bug on my side for the creation of the stonith compute resources.

Thanks for the help.

(In reply to Andrew Beekhof from comment #33)
> (In reply to John Williams from comment #31)
> > Further testing of the CLI is indicating there may be another issue
> > (different bug).
> >
> > Evacuation of an instance or instances from a node (A) to a new node (B)
> > appears to be working correctly. However, if you then attempt to evacuate
> > the same instance/s from node B to another node (C, D, ...) or back to node
> > (A) you receive the following error.
> >
> > ERROR (BadRequest): Compute service of r11b-compute-0.localdomain is still
> > in use. (HTTP 400) (Request-ID: req-f279117f-b962-4f57-8a92-feac8fa26396).
> >
> > This was seen on 2 stamps, just waiting for QA to test and they will file a
> > new bug on this if they see it as well.
>
> I've seen this in the past, I thought the nova folks had squashed it though.

I did see this same issue. When I removed instances in Error state, nova host-evacuate from one compute to the next compute worked. It seems to be intermittent, and it looks like there is an older bug on this that is still in the assigned state: https://bugzilla.redhat.com/show_bug.cgi?id=1168676

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2016-2032.html