Description of problem:

When an instance is forced to be spawned on a specific compute node, the scheduler filters are ignored and we can oversubscribe the resource limits. But there is a situation, or rather an inconsistent behavior, where instances end up in ERROR even though a compute node is specified.

Version-Release number of selected component (if applicable):
nova 2015.1.4-17.el7ost, but also tested with latest 2015.1.2-18.2

How reproducible:
always

Steps to Reproduce:
- OSP7 all-in-one => so this is not HA related like the evacuation issue we have (for reference BZ#1358284)
- system with 4 CPUs
- cpu_allocation_ratio=1.0

1) spawn 5 instances _with_ AZ, so we see the over-subscription (vm1-5) => ACTIVE, expected.

nova boot --flavor m1.tiny --image 017f7c3d-2acc-4fad-bfb2-fd75eb2b6194 --nic net-id=31fea0b3-aae0-4a24-8515-3885f01d14be --availability_zone nova:osp7-all-in-one-2 vm1

# nova list
+--------------------------------------+------+--------+------------+-------------+-------------------+
| ID                                   | Name | Status | Task State | Power State | Networks          |
+--------------------------------------+------+--------+------------+-------------+-------------------+
| 58240408-3e49-4568-a228-4646ddc6eb4b | vm1  | ACTIVE | -          | Running     | private=10.0.0.15 |
| 3c9ce74c-b95d-4f0d-8dea-6cb8a4247fe2 | vm2  | ACTIVE | -          | Running     | private=10.0.0.16 |
| 4bdc4248-d2e3-4f1b-93f2-1f4329223db4 | vm3  | ACTIVE | -          | Running     | private=10.0.0.17 |
| 6f22faa8-06e2-40c3-a3ca-12dc83154ad1 | vm4  | ACTIVE | -          | Running     | private=10.0.0.18 |
| 030b99b0-1f55-4ad0-b280-bc23b6aca272 | vm5  | ACTIVE | -          | Running     | private=10.0.0.19 |
| 5d350881-bac5-4868-b677-7dd1ccacd771 | vm6  | ERROR  | -          | NOSTATE     |                   |
| bbd8e594-51aa-48d7-97fd-f48efe73ab27 | vm7  | ERROR  | -          | NOSTATE     |                   |
| 7f1f4bfd-76f1-4b35-ab83-d72e9b483078 | vm8  | ACTIVE | -          | Running     | private=10.0.0.20 |
+--------------------------------------+------+--------+------------+-------------+-------------------+

2) spawn another instance _without_ the AZ (vm6) => ERROR, expected.

3) spawn another instance again _with_ the AZ (vm7) => ERROR, not expected.

4) we can revert to the initial state by restarting nova-scheduler

5) spawn another instance again _with_ the AZ (vm8) => ACTIVE, different behavior from step 3)

When we look at the ERRORed instance bbd8e594-51aa-48d7-97fd-f48efe73ab27 that had the AZ specified, we see:

* from the nova-scheduler log that, since we passed the AZ, the scheduler logs the forcing message and none of the resource filters are triggered.
This is the expected behavior when we set the target AZ/host:

2016-10-06 08:54:41.487 27496 WARNING nova.scheduler.host_manager [req-0bde59b1-cb46-4ac8-9317-18ceacf56112 bc4a2c2129f8487e97f3e836185fc19f fa18ae0d190f48d0bd830c6f6afd328f - - -] Host osp7-all-in-one-2 has more disk space than database expected (35gb > 34gb)
2016-10-06 08:54:41.514 27496 INFO nova.scheduler.host_manager [req-0bde59b1-cb46-4ac8-9317-18ceacf56112 bc4a2c2129f8487e97f3e836185fc19f fa18ae0d190f48d0bd830c6f6afd328f - - -] Host filter forcing available hosts to osp7-all-in-one-2
2016-10-06 08:54:41.514 27496 DEBUG nova.scheduler.filter_scheduler [req-0bde59b1-cb46-4ac8-9317-18ceacf56112 bc4a2c2129f8487e97f3e836185fc19f fa18ae0d190f48d0bd830c6f6afd328f - - -] Filtered [(osp7-all-in-one-2, osp7-all-in-one-2) ram:4751 disk:34816 io_ops:5 instances:5] _schedule /usr/lib/python2.7/site-packages/nova/scheduler/filter_scheduler.py:143
2016-10-06 08:54:41.515 27496 DEBUG nova.scheduler.filter_scheduler [req-0bde59b1-cb46-4ac8-9317-18ceacf56112 bc4a2c2129f8487e97f3e836185fc19f fa18ae0d190f48d0bd830c6f6afd328f - - -] Weighed [WeighedHost [host: (osp7-all-in-one-2, osp7-all-in-one-2) ram:4751 disk:34816 io_ops:5 instances:5, weight: 0.0]] _schedule /usr/lib/python2.7/site-packages/nova/scheduler/filter_scheduler.py:148
2016-10-06 08:54:41.515 27496 DEBUG nova.scheduler.filter_scheduler [req-0bde59b1-cb46-4ac8-9317-18ceacf56112 bc4a2c2129f8487e97f3e836185fc19f fa18ae0d190f48d0bd830c6f6afd328f - - -] Selected host: WeighedHost [host: (osp7-all-in-one-2, osp7-all-in-one-2) ram:4751 disk:34816 io_ops:5 instances:5, weight: 0.0] _schedule /usr/lib/python2.7/site-packages/nova/scheduler/filter_scheduler.py:158

* we also see from the compute log that there was an attempt to start it, and it fails in the compute manager because there are no free VCPUs:

2016-10-06 08:54:41.855 24488 DEBUG nova.compute.resources.vcpu [req-0bde59b1-cb46-4ac8-9317-18ceacf56112 bc4a2c2129f8487e97f3e836185fc19f fa18ae0d190f48d0bd830c6f6afd328f - - -] CPUs limit: 4.00 VCPUs, free: -1.00 VCPUs test /usr/lib/python2.7/site-packages/nova/compute/resources/vcpu.py:63
2016-10-06 08:54:41.856 24488 DEBUG oslo_concurrency.lockutils [req-0bde59b1-cb46-4ac8-9317-18ceacf56112 bc4a2c2129f8487e97f3e836185fc19f fa18ae0d190f48d0bd830c6f6afd328f - - -] Lock "compute_resources" released by "instance_claim" :: held 0.044s inner /usr/lib/python2.7/site-packages/oslo_concurrency/lockutils.py:456

This is also the error message we get from nova show:

| fault | {"message": "Build of instance da582d59-de0c-439d-80f5-ad95ddc9ec85 was re-scheduled: Insufficient compute resources: Free CPUs -1.00 VCPUs < requested 1 VCPUs.", "code": 500, "details": " File \"/usr/lib/python2.7/site-packages/nova/compute/manager.py\", line 2261, in _do_build_and_run_instance |
|       |     filter_properties) |
|       |   File \"/usr/lib/python2.7/site-packages/nova/compute/manager.py\", line 2393, in _build_and_run_instance |
|       |     instance_uuid=instance.uuid, reason=e.format_message()) |
|       | ", "created": "2016-10-06T12:54:42Z"} |

Actual results:
inconsistent behavior

Expected results:
consistent behavior

Additional info:
Let me clarify a few misunderstandings: the Availability Zone hack that forces a specific destination when booting is not intended to guarantee that the instance will eventually boot, only that the instance will be spawned on the provided destination. The use case behind that specific feature was that operators wanted to force the scheduler to return a specific destination instead of blindly picking a host, but they also wanted to make sure that the compute node would still accept it.

To be clear, since it's an admin-only boot flag, it assumes that the person booting the instance knows all the possible destinations and selected by themselves which one would be best. That means they are sure that the instance wouldn't be a bad neighbor for the existing VMs, and also that the resources would be enough to support the new allocation.

Technically, that means that even if the scheduler bypasses the filters, the compute node still verifies the resource usage. That's done by the "instance claiming" operation in the compute manager, and that's what is responsible for rescheduling in case something goes wrong on the compute side. In theory, we should only be able to spin up 4 instances of a 1-vCPU flavor on a 4-pCPU host with a 1.0 allocation ratio, hence why you're getting kicked for vm6 even though you're providing the destination. See the arithmetic sketch below.

That said, there were a couple of race conditions hitting the resource claiming on the compute side that were fixed in Liberty and beyond, which I guess is why vm5 was still accepted when it shouldn't have been. Unfortunately, those race condition bugfixes are very difficult to backport given the code complexity, but either way, like I mentioned, that's something we should consider unlikely in practice anyway, given it's an admin-only call, which supposes the admin being responsible for their cloud.
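For illustration, here is the arithmetic the compute-side claim ends up doing in this scenario (a simplified sketch with made-up names, not the actual nova code):

# Assumption: 4 pCPUs, cpu_allocation_ratio=1.0, five 1-vCPU instances already claimed.
pcpus = 4
cpu_allocation_ratio = 1.0
vcpu_limit = pcpus * cpu_allocation_ratio   # 4.0 -- the limit the CoreFilter would compute
used_vcpus = 5                              # vm1-vm5 already running
free = vcpu_limit - used_vcpus              # -1.0, matching the "free: -1.00 VCPUs" compute log above
print(free >= 1)                            # False -> claim rejected, instance is rescheduled / goes to ERROR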
Hi Sylvain,

I did test on:

openstack-nova-api-12.0.4-8.el7ost.noarch
openstack-nova-novncproxy-12.0.4-8.el7ost.noarch
python-nova-12.0.4-8.el7ost.noarch
openstack-nova-common-12.0.4-8.el7ost.noarch
python-novaclient-3.1.0-2.el7ost.noarch
openstack-nova-compute-12.0.4-8.el7ost.noarch
openstack-nova-conductor-12.0.4-8.el7ost.noarch
openstack-nova-cert-12.0.4-8.el7ost.noarch
openstack-nova-scheduler-12.0.4-8.el7ost.noarch
openstack-nova-console-12.0.4-8.el7ost.noarch

and I was still able to reproduce the bad behavior. I launched 11 instances; after 10 the scheduler complained about quota, but only 2 actually ended up running (2 cores on this system), with:

for test in $(seq 10 20); do nova boot --image ffdef9ad-468a-4c3f-868e-03deb25a7fe2 --nic net-id=ec54357c-9800-4c22-8e74-71c4de4d7e73 --flavor m1.tiny test$test ;done

Now, testing with the AZ set to the host:

for test in $(seq 10 20); do nova boot --image ffdef9ad-468a-4c3f-868e-03deb25a7fe2 --nic net-id=ec54357c-9800-4c22-8e74-71c4de4d7e73 --flavor m1.tiny test$test --availability_zone nova:osp8-allinone.example.com;done

This also fails once the quota is hit, and again only two instances are reported as running:

[root@osp8-allinone ~(keystone_admin)]# nova list
+--------------------------------------+--------+--------+------------+-------------+-----------------------+
| ID                                   | Name   | Status | Task State | Power State | Networks              |
+--------------------------------------+--------+--------+------------+-------------+-----------------------+
| e86bcc85-9b17-477a-8887-904d347e9072 | test10 | ACTIVE | -          | Running     | internal=192.168.3.18 |
| c2df8f79-6786-45c3-8e7b-2a3d4d3130d7 | test11 | ACTIVE | -          | Running     | internal=192.168.3.19 |
| 8caa98b4-9a20-45e8-80b4-abf6d5ec795a | test12 | ERROR  | -          | NOSTATE     |                       |
| 94114ab2-80dd-46f9-83a1-c2024168238b | test13 | ERROR  | -          | NOSTATE     |                       |
| 6d6076e1-4cb6-4281-84ab-03b2df93b384 | test14 | ERROR  | -          | NOSTATE     |                       |
| d0ae3e39-db6a-440e-9aa9-2df57e8f444a | test15 | ERROR  | -          | NOSTATE     |                       |
| 379b5c56-ec98-487f-8215-055b8adc7f4d | test16 | ERROR  | -          | NOSTATE     |                       |
| da817e9b-4c5e-430c-8231-361d5ef5add1 | test17 | ERROR  | -          | NOSTATE     |                       |
| 96a5a9d6-119e-4abb-a4dc-72a5d040a143 | test18 | ERROR  | -          | NOSTATE     |                       |
| 503bec17-c0b9-444c-999a-d800467961e3 | test19 | ERROR  | -          | NOSTATE     |                       |
+--------------------------------------+--------+--------+------------+-------------+-----------------------+

Then (I removed some instances so as not to hit project quotas, but left the two running ones and some of the failed ones):

- booted another instance WITHOUT AZ -> failed
- booted another instance WITH AZ -> failed
- restarted nova-scheduler
- booted another instance WITH AZ -> ACTIVE

[root@osp8-allinone ~(keystone_admin)]# nova list
+--------------------------------------+--------+--------+------------+-------------+-----------------------+
| ID                                   | Name   | Status | Task State | Power State | Networks              |
+--------------------------------------+--------+--------+------------+-------------+-----------------------+
| e86bcc85-9b17-477a-8887-904d347e9072 | test10 | ACTIVE | -          | Running     | internal=192.168.3.18 |
| c2df8f79-6786-45c3-8e7b-2a3d4d3130d7 | test11 | ACTIVE | -          | Running     | internal=192.168.3.19 |
| 8caa98b4-9a20-45e8-80b4-abf6d5ec795a | test12 | ERROR  | -          | NOSTATE     |                       |
| 94114ab2-80dd-46f9-83a1-c2024168238b | test13 | ERROR  | -          | NOSTATE     |                       |
| 6d6076e1-4cb6-4281-84ab-03b2df93b384 | test14 | ERROR  | -          | NOSTATE     |                       |
| 14125128-9a29-45ba-8e8d-56a882f14a1e | test50 | ERROR  | -          | NOSTATE     |                       |
| 4cefd77f-a573-4de3-b7fe-9076bb85b4c4 | test50 | ERROR  | -          | NOSTATE     |                       |
| 66700107-cbd5-42d2-acba-121f66972b12 | test51 | ACTIVE | -          | Running     | internal=192.168.3.20 |
+--------------------------------------+--------+--------+------------+-------------+-----------------------+

I'm now deploying an all-in-one OSP9 to retest. It would be great if we could get the patches identified, to see whether any of them are pending a rebase onto OSP8.

Thanks,
Pablo
Hi,

I got the same results with an OSP9 all-in-one with one CPU:

- only one instance became ACTIVE, the remaining ones went to ERROR
- tried with AZ nova:host and got the same result
- restarted nova-scheduler and booted again:

[root@osp9-allinone ~(keystone_admin)]# nova list
+--------------------------------------+--------+--------+------------+-------------+--------------------+
| ID                                   | Name   | Status | Task State | Power State | Networks           |
+--------------------------------------+--------+--------+------------+-------------+--------------------+
| 5adc8ac6-bcad-425b-a372-a5d1833588a6 | test10 | ACTIVE | -          | Running     | internal=10.0.0.9  |
| dbc9047a-f5bb-421d-8218-c8243c241e60 | test10 | ACTIVE | -          | Running     | internal=10.0.0.8  |
| c5752d44-1613-45ea-b8de-e7f5a9af9474 | test11 | ACTIVE | -          | Running     | internal=10.0.0.10 |
| fe4b0ff5-8786-4189-b430-f35eaa9d0ae9 | test11 | ERROR  | -          | NOSTATE     |                    |
| 5734dde4-2888-47cb-8aed-9d7f2a70b485 | test12 | ERROR  | -          | NOSTATE     |                    |
| c377ad31-93e6-4b03-9b58-aaf986f6aeec | test12 | ACTIVE | -          | Running     | internal=10.0.0.11 |
| 54be36d2-c932-4fa0-86fb-462c344fa05a | test13 | ACTIVE | -          | Running     | internal=10.0.0.12 |
| c4f2b15e-cc7d-40f2-af27-32c0864eab5e | test14 | ERROR  | -          | NOSTATE     |                    |
| e76c17b6-adba-4f7b-86cb-424f5dc5b8fd | test14 | ACTIVE | -          | Running     | internal=10.0.0.13 |
| 8835136b-92f2-43e3-a0c9-9b11d731f263 | test15 | ACTIVE | -          | Running     | internal=10.0.0.14 |
+--------------------------------------+--------+--------+------------+-------------+--------------------+

So, as said, it would be great if we could get the patches identified, to check whether they are still pending to be applied or whether we're hitting something else.

Regards,
Pablo
So, after a few findings, I found the problem. Disclaimer: I haven't yet had time to verify whether the problem is also in master, but I'm about 90% sure it is.

When the scheduler finds a destination, it returns a tuple (host, node, limits) to the conductor, which is passed on to the compute node. The compute node then checks the resource usage by claiming the instance values (at least VCPU, memory and disk) against the host values. That claim uses the limits to bound the allowed request size: for example, with a vcpu allocation ratio of 1.0, the vcpu limit is the number of physical CPUs times 1.0. When the filters are run, each filter sets its corresponding limit field: for example, the CoreFilter sets the vcpu limit on the scheduler's HostState. When we return the destination, we then pass those limit fields on to the conductor, as I said before.

So, why do we see different behaviour after restarting the scheduler? Because when restarting it, we recreate the in-memory objects called HostState, so the limits are reset to empty until at least one instance goes through the CoreFilter. Proof here:

- case #1: restart the scheduler and boot an instance using the AZ hack

(Pdb) dests
[{'host': u'vm-133.gsslab.fab.redhat.com', 'nodename': u'vm-133.gsslab.fab.redhat.com', 'limits': {}}]

- case #2: restart the scheduler, boot a normal instance and then boot an instance with the AZ hack

(Pdb) dests
[{'host': u'vm-133.gsslab.fab.redhat.com', 'nodename': u'vm-133.gsslab.fab.redhat.com', 'limits': {'memory_mb': 19150.5, 'vcpu': 1.0}}]

That looks like a design problem, honestly. If we never run the filters when using the AZ hack, we shouldn't return limits either, which would give consistent behaviour regardless of whether the scheduler had computed those limits beforehand or not. On the other hand, we would then never be able to verify whether the host can still support the instance request, but given it's an admin-only usage, I think the operator already knows whether the host is a good fit or not.
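To make the mechanism concrete, here is a minimal sketch (simplified, hypothetical names, not the actual nova code) of why an empty vs. populated limits dict changes the outcome of the compute-side claim:

class HostState(object):
    def __init__(self, total_pcpus, used_vcpus):
        self.total_pcpus = total_pcpus
        self.used_vcpus = used_vcpus
        self.limits = {}            # empty right after a scheduler restart

def core_filter(host_state, cpu_allocation_ratio):
    # The real CoreFilter records the vcpu limit on the HostState as a
    # side effect of filtering; this just mimics that behaviour.
    host_state.limits['vcpu'] = host_state.total_pcpus * cpu_allocation_ratio

def compute_claim_vcpus(requested, used, limits):
    limit = limits.get('vcpu')
    if limit is None:
        return True                 # no limit forwarded -> the claim is effectively unbounded
    return (limit - used) >= requested

host = HostState(total_pcpus=4, used_vcpus=5)

# Case #1: scheduler restarted, boot with the AZ hack -> filters bypassed,
# limits stays {} and the claim succeeds despite the oversubscription.
print(compute_claim_vcpus(1, host.used_vcpus, host.limits))   # True

# Case #2: a normal boot ran CoreFilter first, so the vcpu limit is forwarded
# and the next AZ-forced boot fails its claim (free = 4 - 5 = -1 < 1).
core_filter(host, cpu_allocation_ratio=1.0)
print(compute_claim_vcpus(1, host.used_vcpus, host.limits))   # False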
Just reported the upstream bug since I'm pretty sure the problem still exists.