Bug 1477706
Summary: migration with block migration fails as disk_available_least is negative

| Field | Value |
|---|---|
| Product | Red Hat OpenStack |
| Component | openstack-nova |
| Version | 9.0 (Mitaka) |
| Target Release | 9.0 (Mitaka) |
| Target Milestone | zstream |
| Status | CLOSED ERRATA |
| Severity | high |
| Priority | high |
| Keywords | Triaged, ZStream |
| Reporter | Martin Schuppert <mschuppe> |
| Assignee | Matthew Booth <mbooth> |
| QA Contact | Archit Modi <amodi> |
| CC | achernet, berrange, ccollett, dasmith, ebarrera, eglynn, gkadam, kchamart, lyarwood, mbooth, mmethot, mnadeem, mschuppe, sbauza, sferdjao, sgordon, skinjo, srevivo, vromanso |
| Hardware | Unspecified |
| OS | Unspecified |
| Fixed In Version | openstack-nova-13.1.4-15.el7ost |
| Doc Type | Bug Fix |
| Type | Bug |
| Last Closed | 2018-03-15 12:43:27 UTC |
| Clones | 1485427, 1530330 |
| Bug Blocks | 1485427, 1530330, 1533161, 1533164 |

Doc Text:

Previously, a combination of circumstances could result in a failed live migration. This could arise when using block migration with disk overcommit enabled, and a client sending microversion < 2.25 (this scope included the openstack client, but not the nova client). Under these circumstances, the live migration call could fail with the following error:

    Migration pre-check error: Unable to migrate [instance uuid]: Disk of instance is too large(available on destination host:[some negative number] < need:[disk space required])

With this update, the check correctly considers the overcommitment of existing instances. As a result, live migration succeeds as expected.
Description
Martin Schuppert
2017-08-02 16:28:32 UTC
The launchpad bug you have noticed in the description is not related to Nova (or it's related to a fork or something else); in any case, the fix is not merged upstream. After a look at master it seems that the issue is fixed by that patch [0], but unfortunately I'm not sure it will be easily backportable, since it comes with a large refactor. Some investigation still needs to be done...

[0] https://review.openstack.org/#/c/275585/

---

Just realized: --disk-overcommit is deprecated since API version 2.25, and e.g. the need to pass disk-overcommit got removed:
https://specs.openstack.org/openstack/nova-specs/specs/mitaka/implemented/making_live_migration_api_friendly.html

The openstackclient uses v2 per default:

```
$ openstack server migrate --live compute-0.localdomain --disk-overcommit test2 --debug
defaults: {u'auth_type': 'password', u'compute_api_version': u'2', 'key': None, u'database_api_version': u'1.0', 'api_timeout': None, u'baremetal_api_version': u'1', u'image_api_version': u'2', 'cacert': None, u'image_api_use_tasks': False, u'floating_ip_source': u'neutron', u'orchestration_api_version': u'1', u'interface': None, u'network_api_version': u'2', u'image_format': u'qcow2', u'key_manager_api_version': u'v1', u'metering_api_version': u'2', 'verify': True, u'identity_api_version': u'2.0', u'volume_api_version': u'2', 'cert': None, u'secgroup_source': u'neutron', u'container_api_version': u'1', u'dns_api_version': u'2', u'object_store_api_version': u'1', u'disable_vendor_agent': {}}
cloud cfg: {'auth_type': 'password', 'tripleoclient_api_version': '1', u'compute_api_version': u'2', u'orchestration_api_version': '1', u'database_api_version': u'1.0', 'data_processing_api_version': '1.1', 'inspector_api_version': '1', u'network_api_version': u'2', u'image_format': u'qcow2', u'image_api_version': u'2', 'verify': True, u'dns_api_version': '2', u'object_store_api_version': u'1', 'verbose_level': 3, 'region_name': '', 'api_timeout': None, u'baremetal_api_version': '1.6', 'queues_api_version': '1.1', 'auth': {'username': 'admin', 'project_name': 'admin', 'password': '***', 'auth_url': 'http://10.0.0.101:5000/v2.0'}, 'default_domain': 'default', u'container_api_version': u'1', u'image_api_use_tasks': False, u'floating_ip_source': u'neutron', 'key': None, 'timing': False, 'cacert': None, u'key_manager_api_version': '1', u'metering_api_version': u'2', 'deferred_help': False, u'identity_api_version': u'2.0', u'volume_api_version': u'2', 'cert': None, u'secgroup_source': u'neutron', 'alarming_api_version': '2', 'debug': True, u'interface': None, u'disable_vendor_agent': {}}

Migration pre-check error: Unable to migrate 5ca04270-d047-4f6a-b09e-c6f11dd97079: Disk of instance is too large(available on destination host:-48318382080 < need:1573888) (HTTP 400) (Request-ID: req-6405a5af-3d09-4694-9f6e-ed27a2a76df9)
```

When we use the novaclient in OSP9, which per default uses compute API version 2.25:

```
$ nova --debug live-migration --block-migrate test2 compute-0.localdomain --debug
...
DEBUG (session:248) REQ: curl -g -i -X POST http://10.0.0.101:8774/v2.1/a126a6b887d6452bb4c8ae99774a07ef/servers/90b887b5-0cd6-4145-a88d-8eba2dcbea12/action -H "User-Agent: python-novaclient" -H "Content-Type: application/json" -H "Accept: application/json" -H "X-OpenStack-Nova-API-Version: 2.25" -H "X-Auth-Token: {SHA1}c71f107636c6bbf12ade9b747bb259133c5fc6c0" -d '{"os-migrateLive": {"block_migration": true, "host": "compute-0.localdomain"}}'

$ nova show test2
+--------------------------------------+-----------------------+
| Property                             | Value                 |
+--------------------------------------+-----------------------+
| OS-DCF:diskConfig                    | MANUAL                |
| OS-EXT-AZ:availability_zone          | nova                  |
| OS-EXT-SRV-ATTR:host                 | compute-0.localdomain |
| OS-EXT-SRV-ATTR:hostname             | test2                 |
| OS-EXT-SRV-ATTR:hypervisor_hostname  | compute-0.localdomain |
```

It seems that with API version 2.25 and later, where we do not have the disk_over_commit flag, we do not run the check for the destination host's storage:

```python
if 'disk_over_commit' in dest_check_data:
    self._assert_dest_node_has_enough_disk(context, instance,
                                           dest_check_data.disk_available_mb,
                                           dest_check_data.disk_over_commit,
                                           block_device_info)
```

But when we specify API version 2.25 with the openstackclient, it fails, as disk_over_commit is apparently provided per default in the API request:

```
$ openstack --os-compute-api-version 2.25 server migrate --live compute-0.localdomain test2
Setting 'disk_over_commit' argument is prohibited after microversion 2.25.
```
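The microversion behaviour above can be sketched with a small model of how a client builds the os-migrateLive request body. This is an illustrative sketch, not novaclient or openstackclient source; the helper name `live_migration_body` is hypothetical:

```python
# Hypothetical sketch: how a client must gate the disk_over_commit flag
# on the requested compute API microversion. Names are illustrative.

def live_migration_body(host, block_migration, disk_over_commit, microversion):
    """Build the os-migrateLive request body for a given microversion.

    From microversion 2.25 on, disk_over_commit is no longer accepted,
    so it must be omitted entirely rather than sent as false/None.
    """
    body = {"host": host, "block_migration": block_migration}
    major, minor = (int(part) for part in microversion.split("."))
    if (major, minor) < (2, 25):
        # Older microversions still carry the flag. Sending it at >= 2.25
        # is rejected ("Setting 'disk_over_commit' argument is prohibited").
        body["disk_over_commit"] = bool(disk_over_commit)
    return {"os-migrateLive": body}
```

This matches the observed behaviour: the < 2.25 body carries the flag (and triggers the broken destination-disk check in Nova), while the >= 2.25 body must drop it entirely.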
Maybe the correct fix would be to set the disk-overcommit default values to None:

```python
disk_group.add_argument(
    '--disk-overcommit',
    action='store_true',
    default=None,
    help=_('Allow disk over-commit on the destination host'),
)
disk_group.add_argument(
    '--no-disk-overcommit',
    dest='disk_overcommit',
    action='store_false',
    default=None,
    help=_('Do not over-commit disk on the'
           ' destination host (default)'),
)
```

Then we can use API version 2.25 with openstackclient as well:

```
$ openstack --os-compute-api-version 2.25 server migrate --live compute-0.localdomain test2
```

With this change, when we use the default API version we fail with the "disk too large" error, but I'd say that is ok, as we have not specified disk-overcommit:

```
$ openstack server migrate --live compute-1.localdomain test1
Migration pre-check error: Unable to migrate 90b887b5-0cd6-4145-a88d-8eba2dcbea12: Disk of instance is too large(available on destination host:38654705664 < need:85940472320) (HTTP 400) (Request-ID: req-2067b927-af13-494a-acd8-fe5226d17a30)

$ openstack server migrate --live compute-1.localdomain --disk-overcommit test1
```

So maybe this bug needs to be moved to the openstackclient component?

---

Yes, that seems to be an issue in openstackclient. Thanks for your investigation.

---

I don't think this is a client issue. I think the bug is in the reporting of disk_available_least by the compute host. The disk_over_commit setting affects this code:

```python
necessary = 0
if disk_over_commit:
    for info in disk_infos:
        necessary += int(info['disk_size'])
else:
    for info in disk_infos:
        necessary += int(info['virt_disk_size'])
```

It's picking between checking that the available space at the destination is big enough for the actual [overcommit] or maximum [no overcommit] disk sizes. This makes sense. The problem is that the reported available space at the destination is negative, which is nonsense in both cases.
I think that the only value we're interested in here is the actual amount of disk space at the destination, and overcommit is irrelevant. Perhaps this is why it has been removed from the newest version of the API. We need to fix this flow such that the value we're comparing against is the actual current free disk space on the target.

Additionally, this code seems to imply that we're mixing overcommit and non-overcommit instances on the same host, with non-overcommit instances not fully allocated. If so, this is highly likely to be broken.

---

(In reply to Matthew Booth from comment #6)
> [...] The problem is that the reported available space at
> the destination is negative, which is nonsense in both cases.

Yes, this is correct, but we only hit it when we use API version < 2.25, as with 2.25 we no longer have the overcommit setting. So as a workaround you can use the novaclient and specify API v2.25 to not go down that path in the code.

> We need to fix this flow such that the value we're comparing against is the
> actual current free disk space on the target.

Agreed, it would be good to fix it, as e.g. in horizon there is no way to specify a minor API version to be used; you can just say API v1 or v2.

> Additionally, this code seems to imply that we're mixing overcommit and
> non-overcommit instances on the same host, with non-overcommit instances not
> fully allocated. If so, this is highly likely to be broken.

Not sure I understand this correctly. You mean you should not mix instances with and without overcommit on a compute?

@mdbooth: we only check the disk size when disk_over_commit is set, and if I got it correctly, we do not have it any more with API 2.25 (virt/libvirt/driver.py):

```python
if 'disk_over_commit' in dest_check_data:
    self._assert_dest_node_has_enough_disk(context, instance,
                                           dest_check_data.disk_available_mb,
                                           dest_check_data.disk_over_commit,
                                           block_device_info)
```

---

Sorry, Martin. I missed where you said this worked with nova client. The only way I can see this would be a bug in openstack client is if the client wasn't setting the API version to 2.25 as requested. As you point out, this test should be entirely skipped[1] if disk_over_commit isn't set. However, MigrateServerController._migrate_live won't attempt to pull disk_over_commit out of the request for API version >= 2.25. Also, I don't think it would validate, as disk_over_commit is removed from migrate_server.migrate_live_v2_25.

Assuming you still have the reproducer system available, could you please run the command again with both openstack client and nova client, and check which API version request is actually received? I can't remember if we already log this, but if not could you please wedge some extra debug into MigrateServerController._migrate_live to capture it.

If we're receiving version >= 2.25, this is a bug in nova, because we should either be rejecting an invalid request or ignoring the parameter. My guess would be that disk_over_commit=None accidentally becomes False or something like that at some point. Maybe something changed in ovo which means the 'in' test doesn't work. I'm guessing.

If we're receiving version < 2.25, this is a bug in openstack client.
If the code responsible for setting the outgoing API version is in the nova plugin, the bug remains with us. Otherwise we need to punt it to whoever maintains the framework portion of the client.

[1] This seems like a different bug to me. Where does disk space checking happen?

---

Martin, as far as I know how OSC works, its default behaviour regarding microversions is different from the nova CLI's. By default, the nova CLI asks for the latest microversion it knows (not 2.latest, but rather a specific maximum version related to the features it supports), while AFAICS openstackclient (OSC) asks for the stable v2.1 microversion. Given that the REST resource changed in 2.25 by not accepting the "disk_over_commit" flag, the problem was silently fixed for that version and later, but I guess the problem still remains for older microversions.

Since Nova needs to support 2.1 as a minimum, I think it's reasonable to do this:
- first, could you please verify that when passing a specific microversion older than 2.25 to the nova CLI, you still have the problem?
- second, if the problem is still there with old API microversions, it's worth discussing upstream how we can fix it from a Nova perspective.

HTH,
-Sylvain

---

(In reply to Sylvain Bauza from comment #10)
> Since Nova needs to support 2.1 as a minimum, I think it's reasonable to do
> this:
> - first, could you please verify that when passing a specific microversion
> older than 2.25 to the nova CLI, you still have the problem?

Yes, when using an older minor version, e.g. 2.1, we see the problem with the nova client as well:

```
$ nova --debug --os-compute-api-version 2.1 live-migration --block-migrate --disk_over_commit true test compute-0.localdomain
...
DEBUG (session:248) REQ: curl -g -i -X POST http://10.0.0.101:8774/v2.1/a126a6b887d6452bb4c8ae99774a07ef/servers/257bb249-3a2f-4ca3-95d0-f7a2d48406b5/action -H "User-Agent: python-novaclient" -H "Content-Type: application/json" -H "Accept: application/json" -H "X-OpenStack-Nova-API-Version: 2.1" -H "X-Auth-Token: {SHA1}07d5b8d3105a99288466337f2bca346a292271d6" -d '{"os-migrateLive": {"disk_over_commit": true, "block_migration": true, "host": "compute-0.localdomain"}}'

RESP BODY: {"badRequest": {"message": "Migration pre-check error: Unable to migrate 257bb249-3a2f-4ca3-95d0-f7a2d48406b5: Disk of instance is too large(available on destination host:-49392123904 < need:1573888)", "code": 400}}

DEBUG (shell:1082) Migration pre-check error: Unable to migrate 257bb249-3a2f-4ca3-95d0-f7a2d48406b5: Disk of instance is too large(available on destination host:-49392123904 < need:1573888) (HTTP 400) (Request-ID: req-2c960511-b24d-4db3-a50e-00fcdf2d9323)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/novaclient/shell.py", line 1080, in main
    OpenStackComputeShell().main(argv)
  File "/usr/lib/python2.7/site-packages/novaclient/shell.py", line 1007, in main
    args.func(self.cs, args)
  File "/usr/lib/python2.7/site-packages/novaclient/v2/shell.py", line 3850, in do_live_migration
    args.disk_over_commit)
  File "/usr/lib/python2.7/site-packages/novaclient/v2/servers.py", line 433, in live_migrate
    disk_over_commit)
  File "/usr/lib/python2.7/site-packages/novaclient/api_versions.py", line 370, in substitution
    return methods[-1].func(obj, *args, **kwargs)
  File "/usr/lib/python2.7/site-packages/novaclient/v2/servers.py", line 1515, in live_migrate
    'disk_over_commit': disk_over_commit})
  File "/usr/lib/python2.7/site-packages/novaclient/v2/servers.py", line 1682, in _action
    info=info, **kwargs)
  File "/usr/lib/python2.7/site-packages/novaclient/v2/servers.py", line 1693, in _action_return_resp_and_body
    return self.api.client.post(url, body=body)
  File "/usr/lib/python2.7/site-packages/keystoneauth1/adapter.py", line 179, in post
    return self.request(url, 'POST', **kwargs)
  File "/usr/lib/python2.7/site-packages/novaclient/client.py", line 94, in request
    raise exceptions.from_response(resp, body, url, method)
BadRequest: Migration pre-check error: Unable to migrate 257bb249-3a2f-4ca3-95d0-f7a2d48406b5: Disk of instance is too large(available on destination host:-49392123904 < need:1573888) (HTTP 400) (Request-ID: req-2c960511-b24d-4db3-a50e-00fcdf2d9323)
ERROR (BadRequest): Migration pre-check error: Unable to migrate 257bb249-3a2f-4ca3-95d0-f7a2d48406b5: Disk of instance is too large(available on destination host:-49392123904 < need:1573888) (HTTP 400) (Request-ID: req-2c960511-b24d-4db3-a50e-00fcdf2d9323)
```

> - second, if the problem is still there with old API microversions, it's
> worth discussing upstream how we can fix it from a Nova perspective.

---

Summary of the facts from above as I understand them:

* If the client sends < 2.25, we execute a broken overcommit check in Nova
* If the client sends >= 2.25, it's fine
* Nova client sends >= 2.25
* OSC sends 2.1 by default
* Nova is supposed to support 2.1

Based on the above, I think the correct resolution is to fix (or at least bring some sanity to) the broken overcommit check in Nova.

---

(In reply to Matthew Booth from comment #12)
> [...]
> Based on the above, I think the correct resolution is to fix (or at least
> bring some sanity to) the broken overcommit check in Nova.

Yes, and OSC should not send overcommit flags; the current default is to set it to false, which makes it fail when specifying >= 2.25, as shown in comment 3.

---

This bug will continue to track the Nova aspect of this bug described in comment 6.
Specifically, Nova is expected to correctly handle API requests with microversion 2.1, but there is a bug in this code path.

---

*** Bug 1496913 has been marked as a duplicate of this bug. ***

---

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:0538
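For reference, the root cause discussed in this bug can be illustrated with a small model of how disk_available_least goes negative on an overcommitted host. This is a simplified sketch with hypothetical function names, not Nova source: disk_available_least is roughly the free disk minus the space existing sparse/qcow2 disks could still claim if they grew to their full virtual size, so heavy overcommit drives it below zero and the pre-check then rejects every migration, however small.

```python
# Simplified model (hypothetical names, not Nova code) of the
# disk_available_least statistic and the live-migration pre-check.

def disk_available_least_gb(free_disk_gb, instances):
    """instances: list of (allocated_gb, virt_size_gb) pairs.

    Subtract the *potential* growth of existing instances: space their
    disks could still consume if fully written out to virtual size.
    """
    potential_growth = sum(virt - alloc for alloc, virt in instances)
    return free_disk_gb - potential_growth

def precheck_ok(disk_available_gb, needed_gb):
    # Simplified form of the "Disk of instance is too large" pre-check:
    # a negative available value fails for any positive need.
    return disk_available_gb >= needed_gb

# Overcommitted host: 40 GB actually free, but existing guests with
# 50 GB virtual disks have only 5 GB and 10 GB allocated, so they could
# still grow by 85 GB in total. The reported value goes negative.
available = disk_available_least_gb(40, [(5, 50), (10, 50)])
```

This mirrors the reports above (e.g. "available on destination host:-49392123904 < need:1573888"): the destination genuinely had free space, but the overcommit-aware statistic was negative, so even a 1.5 MB migration was refused. Hence the fix, which makes the check consider the overcommitment of existing instances correctly rather than comparing against a negative number.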