| Summary: | not always possible to oversubscribe vCPU when forcing an instance onto a compute node using --availability_zone | ||
|---|---|---|---|
| Product: | Red Hat OpenStack | Reporter: | Martin Schuppert <mschuppe> |
| Component: | openstack-nova | Assignee: | Sylvain Bauza <sbauza> |
| Status: | CLOSED WONTFIX | QA Contact: | Prasanth Anbalagan <panbalag> |
| Severity: | medium | Docs Contact: | |
| Priority: | medium | ||
| Version: | 7.0 (Kilo) | CC: | berrange, dasmith, eglynn, kchamart, owalsh, pablo.iranzo, sbauza, sferdjao, sgordon, srevivo, vromanso |
| Target Milestone: | --- | Keywords: | Reopened, ZStream |
| Target Release: | 7.0 (Kilo) | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2018-01-19 11:08:55 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
Description
Martin Schuppert
2016-10-06 13:27:52 UTC
Let me clarify a few misunderstandings: the Availability Zone hack that forces a specific destination at boot time is not intended to guarantee that the instance will eventually boot, only that the instance will be spawned on the provided destination. The use case behind that feature was that operators wanted to force the scheduler to return a specific destination instead of blindly picking a host, while still making sure the compute node would accept it. To be clear, since it is an admin-only boot flag, it assumes that the person booting the instance knows all the possible destinations and has selected the best one themselves. That means they are confident the instance will not be a bad neighbor for the existing VMs, and that the resources are sufficient to support the new allocation.

Technically, even though the scheduler bypasses the filters, the compute node still verifies the resource usage. That is done by the "instance claiming" operation in the compute manager, which is also what triggers rescheduling when something goes wrong on the compute side. In theory, we should only be able to spin up 4 instances of a 1-vCPU flavor on a 4-pCPU host with a 1.0 allocation ratio, which is why you are getting rejected for vm6 even when you provide the destination.

That said, there were a couple of race conditions in the compute-side resource claiming that were fixed in Liberty and later, which is probably why vm5 was still accepted when it should not have been. Unfortunately, those race condition bugfixes are very hard to backport given the code complexity. Either way, as I mentioned, this is something we should consider acceptable given it is an admin-only call, which presumes the admin is responsible for their cloud.
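As a rough illustration of the claim arithmetic described above, here is a minimal sketch (not Nova code; the numbers and helper are hypothetical, matching the 4-pCPU / 1.0-ratio example):

```python
# Minimal sketch of the vCPU claim arithmetic described above (not Nova code).
# Hypothetical numbers: a 4-pCPU host with a cpu_allocation_ratio of 1.0.
pcpus = 4
cpu_allocation_ratio = 1.0
vcpu_limit = pcpus * cpu_allocation_ratio   # at most 4 vCPUs may be claimed in total

def claim_passes(vcpus_used, vcpus_requested, limit):
    # Mimics the compute-side check: reject if the new total would exceed the limit.
    return vcpus_used + vcpus_requested <= limit

# Four 1-vCPU instances fit; the fifth claim fails, so that boot is rescheduled or errored.
for n in range(1, 6):
    print("vm%d:" % n, claim_passes(vcpus_used=n - 1, vcpus_requested=1, limit=vcpu_limit))
```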
Hi Sylvain,

I did test on:

openstack-nova-api-12.0.4-8.el7ost.noarch
openstack-nova-novncproxy-12.0.4-8.el7ost.noarch
python-nova-12.0.4-8.el7ost.noarch
openstack-nova-common-12.0.4-8.el7ost.noarch
python-novaclient-3.1.0-2.el7ost.noarch
openstack-nova-compute-12.0.4-8.el7ost.noarch
openstack-nova-conductor-12.0.4-8.el7ost.noarch
openstack-nova-cert-12.0.4-8.el7ost.noarch
openstack-nova-scheduler-12.0.4-8.el7ost.noarch
openstack-nova-console-12.0.4-8.el7ost.noarch

and was still able to reproduce the bad behaviour. I launched 11 instances; after 10, the scheduler complained about quota, but only 2 actually ended up running (2 cores on this system), with:

for test in $(seq 10 20); do nova boot --image ffdef9ad-468a-4c3f-868e-03deb25a7fe2 --nic net-id=ec54357c-9800-4c22-8e74-71c4de4d7e73 --flavor m1.tiny test$test ;done

Now, testing with the AZ set to the host:

for test in $(seq 10 20); do nova boot --image ffdef9ad-468a-4c3f-868e-03deb25a7fe2 --nic net-id=ec54357c-9800-4c22-8e74-71c4de4d7e73 --flavor m1.tiny test$test --availability_zone nova:osp8-allinone.example.com;done

This also fails once the quota is hit, but the status shows just two instances actually running:

[root@osp8-allinone ~(keystone_admin)]# nova list
+--------------------------------------+--------+--------+------------+-------------+-----------------------+
| ID                                   | Name   | Status | Task State | Power State | Networks              |
+--------------------------------------+--------+--------+------------+-------------+-----------------------+
| e86bcc85-9b17-477a-8887-904d347e9072 | test10 | ACTIVE | -          | Running     | internal=192.168.3.18 |
| c2df8f79-6786-45c3-8e7b-2a3d4d3130d7 | test11 | ACTIVE | -          | Running     | internal=192.168.3.19 |
| 8caa98b4-9a20-45e8-80b4-abf6d5ec795a | test12 | ERROR  | -          | NOSTATE     |                       |
| 94114ab2-80dd-46f9-83a1-c2024168238b | test13 | ERROR  | -          | NOSTATE     |                       |
| 6d6076e1-4cb6-4281-84ab-03b2df93b384 | test14 | ERROR  | -          | NOSTATE     |                       |
| d0ae3e39-db6a-440e-9aa9-2df57e8f444a | test15 | ERROR  | -          | NOSTATE     |                       |
| 379b5c56-ec98-487f-8215-055b8adc7f4d | test16 | ERROR  | -          | NOSTATE     |                       |
| da817e9b-4c5e-430c-8231-361d5ef5add1 | test17 | ERROR  | -          | NOSTATE     |                       |
| 96a5a9d6-119e-4abb-a4dc-72a5d040a143 | test18 | ERROR  | -          | NOSTATE     |                       |
| 503bec17-c0b9-444c-999a-d800467961e3 | test19 | ERROR  | -          | NOSTATE     |                       |
+--------------------------------------+--------+--------+------------+-------------+-----------------------+

(I removed some instances so as not to hit project quotas, but left the two running ones and some others that failed), and then:

- booted another instance WITHOUT AZ -> failed
- booted another instance WITH AZ -> failed
- restarted nova-scheduler
- booted another instance WITH AZ -> ACTIVE

[root@osp8-allinone ~(keystone_admin)]# nova list
+--------------------------------------+--------+--------+------------+-------------+-----------------------+
| ID                                   | Name   | Status | Task State | Power State | Networks              |
+--------------------------------------+--------+--------+------------+-------------+-----------------------+
| e86bcc85-9b17-477a-8887-904d347e9072 | test10 | ACTIVE | -          | Running     | internal=192.168.3.18 |
| c2df8f79-6786-45c3-8e7b-2a3d4d3130d7 | test11 | ACTIVE | -          | Running     | internal=192.168.3.19 |
| 8caa98b4-9a20-45e8-80b4-abf6d5ec795a | test12 | ERROR  | -          | NOSTATE     |                       |
| 94114ab2-80dd-46f9-83a1-c2024168238b | test13 | ERROR  | -          | NOSTATE     |                       |
| 6d6076e1-4cb6-4281-84ab-03b2df93b384 | test14 | ERROR  | -          | NOSTATE     |                       |
| 14125128-9a29-45ba-8e8d-56a882f14a1e | test50 | ERROR  | -          | NOSTATE     |                       |
| 4cefd77f-a573-4de3-b7fe-9076bb85b4c4 | test50 | ERROR  | -          | NOSTATE     |                       |
| 66700107-cbd5-42d2-acba-121f66972b12 | test51 | ACTIVE | -          | Running     | internal=192.168.3.20 |
+--------------------------------------+--------+--------+------------+-------------+-----------------------+

Now deploying an all-in-one osp9 to retest. It would be great if we could identify the patches involved, to see whether any of them are pending a rebase onto OSP8.

Thanks,
Pablo

Hi,

Got the same results with an osp9 all-in-one with one CPU:

- only one instance became active, the rest stayed in error
- tried with AZ nova:host, same result
- restarted nova-scheduler

[root@osp9-allinone ~(keystone_admin)]# nova list
+--------------------------------------+--------+--------+------------+-------------+--------------------+
| ID                                   | Name   | Status | Task State | Power State | Networks           |
+--------------------------------------+--------+--------+------------+-------------+--------------------+
| 5adc8ac6-bcad-425b-a372-a5d1833588a6 | test10 | ACTIVE | -          | Running     | internal=10.0.0.9  |
| dbc9047a-f5bb-421d-8218-c8243c241e60 | test10 | ACTIVE | -          | Running     | internal=10.0.0.8  |
| c5752d44-1613-45ea-b8de-e7f5a9af9474 | test11 | ACTIVE | -          | Running     | internal=10.0.0.10 |
| fe4b0ff5-8786-4189-b430-f35eaa9d0ae9 | test11 | ERROR  | -          | NOSTATE     |                    |
| 5734dde4-2888-47cb-8aed-9d7f2a70b485 | test12 | ERROR  | -          | NOSTATE     |                    |
| c377ad31-93e6-4b03-9b58-aaf986f6aeec | test12 | ACTIVE | -          | Running     | internal=10.0.0.11 |
| 54be36d2-c932-4fa0-86fb-462c344fa05a | test13 | ACTIVE | -          | Running     | internal=10.0.0.12 |
| c4f2b15e-cc7d-40f2-af27-32c0864eab5e | test14 | ERROR  | -          | NOSTATE     |                    |
| e76c17b6-adba-4f7b-86cb-424f5dc5b8fd | test14 | ACTIVE | -          | Running     | internal=10.0.0.13 |
| 8835136b-92f2-43e3-a0c9-9b11d731f263 | test15 | ACTIVE | -          | Running     | internal=10.0.0.14 |
+--------------------------------------+--------+--------+------------+-------------+--------------------+

So, as said, it would be great if we could identify the patches, to check whether they are still pending to be applied or whether we are hitting something else.

Regards,
Pablo

So, after a few findings, I found the problem. Disclaimer: I haven't yet had time to verify whether it's already in master, but I'm 90% sure it is.
When the scheduler finds a destination, it returns a (host, node, limits) tuple to the conductor, which is then passed on to the compute node.
The compute node then checks resource usage by claiming the instance's values (at least vCPU, memory and disk) against the host's values. That claim uses the limits to cap how much can be requested.
For example, with a vCPU allocation ratio of 1.0, the claimable vCPUs are limited to 1x the number of physical CPUs on the host.
When the filters run, each filter sets its corresponding limit field: for example, the CoreFilter records the vCPU limit on the scheduler's in-memory HostState.
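For illustration, a simplified sketch of what a CoreFilter-like check does (modeled on the behaviour described here, not verbatim upstream code; the class and function names are hypothetical):

```python
# Simplified sketch of a CoreFilter-like check (illustrative, not verbatim Nova code).
class HostStateSketch(object):
    def __init__(self, vcpus_total, vcpus_used):
        self.vcpus_total = vcpus_total
        self.vcpus_used = vcpus_used
        self.limits = {}            # starts empty on a freshly (re)started scheduler

def core_filter_passes(host_state, instance_vcpus, cpu_allocation_ratio):
    vcpu_limit = host_state.vcpus_total * cpu_allocation_ratio
    # Side effect: the limit is recorded on the in-memory HostState and later
    # returned to the conductor as part of the selected destination.
    host_state.limits['vcpu'] = vcpu_limit
    return (vcpu_limit - host_state.vcpus_used) >= instance_vcpus
```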
When the destination is returned, those limit fields are passed along to the conductor, as described above.
So why do we see a different behaviour after restarting the scheduler? Because restarting it recreates the in-memory HostState objects, so the limits are reset (empty again) until at least one instance has gone through the CoreFilter.
Proof here:
- case #1: restart the scheduler and boot an instance using the AZ hack
(Pdb) dests
[{'host': u'vm-133.gsslab.fab.redhat.com', 'nodename': u'vm-133.gsslab.fab.redhat.com', 'limits': {}}]
- case #2: restart the scheduler, boot a normal instance and then boot an instance with the AZ hack
(Pdb) dests
[{'host': u'vm-133.gsslab.fab.redhat.com', 'nodename': u'vm-133.gsslab.fab.redhat.com', 'limits': {'memory_mb': 19150.5, 'vcpu': 1.0}}]
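To make the difference between those two limits payloads concrete at claim time, a minimal sketch (hypothetical helper, not Nova code; it assumes, as described in this bug, that a resource missing from the limits dict is simply not enforced by the claim):

```python
# Minimal sketch (not Nova code) of how the two 'limits' payloads above behave
# when the compute node claims a 1-vCPU instance on a fully used 1-pCPU host.
def vcpu_claim_ok(vcpus_used, vcpus_requested, limits):
    if 'vcpu' not in limits:
        return True                     # case #1: no limit was returned, nothing to enforce
    return vcpus_used + vcpus_requested <= limits['vcpu']

print(vcpu_claim_ok(1, 1, {}))                                   # True  (case #1: AZ hack after restart)
print(vcpu_claim_ok(1, 1, {'memory_mb': 19150.5, 'vcpu': 1.0}))  # False (case #2: limits populated)
```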
Honestly, that looks like a design problem. If the filters are never run when using the AZ hack, then we should not return limits either; that would give consistent behaviour regardless of whether the scheduler had already verified those limits or not.
On the other hand, we would then not be able to verify whether the host can still accommodate the instance request; but given this is only usable by admins, the operator presumably already knows whether the host is a good fit.
Just reported the upstream bug since I'm pretty sure the problem still exists.