Bug 1302074 - [OSP 8.0 Bug]: Nova-scheduler DiskFilter causing problems with NFS shared storage for Cinder [NEEDINFO]
Status: NEW
Product: Red Hat OpenStack
Classification: Red Hat
Component: rhosp-director
Version: 8.0 (Liberty)
Hardware: All Linux
Priority: medium
Severity: medium
Target Milestone: Upstream M2
Target Release: ---
Assigned To: Eoghan Glynn
QA Contact: Prasanth Anbalagan
Keywords: FutureFeature, Triaged
Depends On:
Blocks:
Reported: 2016-01-26 12:26 EST by Dave Cain
Modified: 2017-11-16 22:01 EST

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed:
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
sgordon: needinfo? (sbauza)


Attachments
nova-scheduler.log in RHEL-OSP8 beta showing the DiskFilter preventing a new instance from spinning up. (3.47 KB, text/plain)
2016-01-26 12:26 EST, Dave Cain
nova-scheduler.log in RHEL-OSP6 showing the DiskFilter not running (3.77 KB, text/plain)
2016-01-26 12:28 EST, Dave Cain


External Trackers
Tracker ID Priority Status Summary Last Updated
Launchpad 1387812 None None None 2016-01-26 12:26 EST
Launchpad 1469179 None None None 2016-10-10 05:48 EDT

Description Dave Cain 2016-01-26 12:26:24 EST
Created attachment 1118556 [details]
nova-scheduler.log in RHEL-OSP8 beta showing the DiskFilter preventing a new instance from spinning up.

Using the OSP8 Overcloud image contained in Beta 2.

Even with NFS used for persistent storage, the system adds up the local disks seen on each compute node, and the DiskFilter in nova-scheduler uses that total for placement decisions. This quickly becomes a problem when spinning up multiple instances: the scheduler eventually concludes there is no storage left for root disk placement. See attachment titled 'diskfilterosp8.txt' for full debug output from Nova.

The following message is displayed to the user, informing them of the failure to boot a new instance:

ERROR (InstanceInErrorState): No valid host was found. There are not enough hosts available.

If one omits the DiskFilter in /etc/nova/nova.conf:
scheduler_default_filters=RetryFilter,AvailabilityZoneFilter,RamFilter,ComputeFilter,ComputeCapabilitiesFilter,ImagePropertiesFilter,ServerGroupAntiAffinityFilter,ServerGroupAffinityFilter

Instances can then be built on shared storage without the scheduler consulting (in error) the free space reported in "free_disk_gb", as shown by the following command:

[stack@osp-director ~]$ nova hypervisor-stats
+----------------------+--------+
| Property             | Value  |
+----------------------+--------+
| count                | 1      |
| current_workload     | 1      |
| disk_available_least | 36     |
| free_disk_gb         | -31    |
| free_ram_mb          | 255158 |
| local_gb             | 39     |
| local_gb_used        | 70     |
| memory_mb            | 257718 |
| memory_mb_used       | 2560   |
| running_vms          | 4      |
| vcpus                | 40     |
| vcpus_used           | 4      |
+----------------------+--------+

I checked, and RHEL-OSP6 did not employ the DiskFilter as part of an OpenStack deployment, so negative values in the "free_disk_gb" field do not seem to cause a problem there. See attachment 'diskfilterosp6.txt' for comparison; it shows that the DiskFilter is not enabled by default.
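For illustration, the negative "free_disk_gb" above is simply local_gb minus local_gb_used. The sketch below is a simplified model of a DiskFilter-style check, not Nova's actual code: it ignores disk_allocation_ratio and disk_available_least, and the function names and the 20 GB request are hypothetical.

```python
def free_disk_gb(local_gb: int, local_gb_used: int) -> int:
    """Free local disk as reported by the hypervisor stats (may go negative)."""
    return local_gb - local_gb_used


def disk_check_passes(local_gb: int, local_gb_used: int, requested_gb: int) -> bool:
    """Simplified DiskFilter-style check: is there room for the requested disk?"""
    return free_disk_gb(local_gb, local_gb_used) >= requested_gb


# Values taken from the `nova hypervisor-stats` output in this report.
print(free_disk_gb(39, 70))                        # -31
# Hypothetical flavor asking for a 20 GB root disk.
print(disk_check_passes(39, 70, requested_gb=20))  # False
```

Under this model the host is rejected even though the instance's disk would live on NFS, which matches the "No valid host was found" error.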
Comment 2 Dave Cain 2016-01-26 12:28 EST
Created attachment 1118557 [details]
nova-scheduler.log in RHEL-OSP6 showing the DiskFilter not running
Comment 3 Dave Cain 2016-01-26 12:34:19 EST
I found this bug on Launchpad, which suggests that more needs to happen upstream before the DiskFilter can be fixed:

https://bugs.launchpad.net/bugs/1387812

I guess what I'd like to see here is suggestions for, or an automated way of, disabling the DiskFilter as part of an OSP Director-based installation, without having to manually edit files on the Controllers. I'd appreciate any help or guidance you can provide!
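As a rough sketch of what such automation would have to do to the filter list, the helper below removes one filter from a comma-separated scheduler_default_filters value. The function name is hypothetical, and the starting list is an assumption: it is the list from the description with DiskFilter re-inserted at an arbitrary position.

```python
def drop_filter(filters: str, name: str) -> str:
    """Return the comma-separated filter list with `name` removed."""
    kept = [f.strip() for f in filters.split(",") if f.strip() and f.strip() != name]
    return ",".join(kept)


# Assumed starting value; the position of DiskFilter in the list is a guess.
with_disk = (
    "RetryFilter,AvailabilityZoneFilter,RamFilter,DiskFilter,ComputeFilter,"
    "ComputeCapabilitiesFilter,ImagePropertiesFilter,"
    "ServerGroupAntiAffinityFilter,ServerGroupAffinityFilter"
)

# Prints a line suitable for /etc/nova/nova.conf.
print("scheduler_default_filters=" + drop_filter(with_disk, "DiskFilter"))
```

This only covers the string manipulation; how the resulting value gets pushed to the Controllers by the Director is the open question here.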
Comment 4 Stephen Gordon 2016-01-29 16:42:05 EST
Needs discussion, we may want to consider disabling this filter OOTB for now.
Comment 5 Stephen Gordon 2016-02-02 12:06:54 EST
We need to determine whether we can (or should) disable the DiskFilter OOTB to work around this issue for now. A secondary question is where we do that: in the conf, or somewhere in the configuration done by director's puppet/heat?
Comment 6 Sylvain Bauza 2016-02-03 06:18:05 EST
There is a long history of Nova not handling shared storage correctly, and there is an ongoing effort to fix that.
In the meantime, another bug report describes a possible workaround more precisely:

https://bugs.launchpad.net/nova/+bug/1469179

That would spare us from disabling the DiskFilter: the scheduler would simply not look at the disk space if the instance is volume-backed.

At the moment, the patch (https://review.openstack.org/#/c/200870/) needs a refresh (merge conflict + unit tests failing), so I don't know how likely it is that we can land it in Mitaka and do the OSP8 backport.

On the other hand, disabling the DiskFilter would somewhat penalize regular instances that are not volume-backed. Fortunately, we do check the disk when the instance claims resources on the compute node, so a bad placement would just result in a retry.
Comment 7 Stephen Gordon 2016-02-09 02:33:02 EST
We discussed this on the compute call and determined that the best approach for now would be to remove the DiskFilter from the list of enabled filters.
Comment 8 Dave Cain 2016-03-31 21:42:17 EDT
Is this going to be in the GA, meaning a default install will have the DiskFilter disabled by default?  This is still enabled in Beta9 of OSP8.
Comment 9 Mike Burns 2016-04-01 05:05:55 EDT
It appears this won't be fixed for OSP 8 GA.
Comment 10 Dave Cain 2016-04-04 08:23:04 EDT
Seems this behavior has gotten worse in Beta9 vs. what I found in Beta2.  In Beta9, I found that I couldn't spin up any instances at all (backed by NFS storage) until I disabled the DiskFilter.  Contrast that with Beta2 where I could spin up a few at least.  Is this expected?

What is the recommended way of disabling the DiskFilter in Nova in the resulting Compute servers as a part of an Overcloud install by the Director?
Comment 11 Mike Burns 2016-04-07 17:07:13 EDT
This bug did not make the OSP 8.0 release.  It is being deferred to OSP 10.
Comment 13 Sylvain Bauza 2016-10-10 05:48:08 EDT
There was an effort to work around that problem in https://bugs.launchpad.net/nova/+bug/1469179 by setting the instance's root disk to 0 when it is booted from volume. Unfortunately, that change did not land in Newton. Related to my comment #6, the resource-providers effort is now trying to land something for Ocata that would solve the problem.

As you can see in https://bugs.launchpad.net/nova/+bug/1469179/comments/25, there are two possibilities, as you said:
 - either disable the DiskFilter if you only have users creating boot-from-volume instances
 - or provide sets of flavors having root disk set to 0 for boot-from-volume instances
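To illustrate the second option, here is a minimal sketch of why a root disk of 0 helps. It assumes local disk usage is accounted as the sum of each instance's flavor root and ephemeral disk, which is a simplification; the Flavor class, flavor names and sizes are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Flavor:
    name: str
    root_gb: int
    ephemeral_gb: int = 0


def local_disk_claimed(flavors):
    """GB counted against the hypervisor's local disk under the assumed accounting."""
    return sum(f.root_gb + f.ephemeral_gb for f in flavors)


regular = Flavor("m1.hypothetical", root_gb=20)
bfv = Flavor("bfv.hypothetical", root_gb=0)  # intended for boot-from-volume only

print(local_disk_claimed([regular] * 4))  # 80
print(local_disk_claimed([bfv] * 4))      # 0
```

With the zero-root flavor, boot-from-volume instances no longer inflate local_gb_used, so the DiskFilter can stay enabled for regular instances.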

Given our OSP deployment tool, I don't know how we could do either automatically, so I'll punt this bugzilla to the TripleO team so they can help us see how to do that.
Comment 14 Stephen Gordon 2016-11-11 16:18:11 EST
(In reply to Sylvain Bauza from comment #13)
> Given our OSP deployment tool, I don't know how we can ask for automatically
> do either way, so I'll punt the bugzilla to the TripleO team so they could
> help us see how to do that.

The right approach here is to work with Sven and Ollie from our DFG; if they need help from the Deployment Framework DFG, they will pursue it. To me it seems like the question is more about what the deployment flow would look like than about the implementation. I don't see how you can automatically determine ahead of time whether the environment will primarily be used for boot-from-volume or not; that's a question the deployer would have to answer.
Comment 15 Stephen Gordon 2017-02-23 15:17:41 EST
(In reply to Stephen Gordon from comment #14)
> (In reply to Sylvain Bauza from comment #13)
> > Given our OSP deployment tool, I don't know how we can ask for automatically
> > do either way, so I'll punt the bugzilla to the TripleO team so they could
> > help us see how to do that.
> 
> The right approach here is to work with Sven and Ollie from our DFG, if they
> need help from the Deployment Framework DFG then they will pursue it. To me
> it seems like the question is more about what the deployment flow would look
> like rather than the implementation. I don't see how you can automatically
> determine ahead of time whether the environment will primarily be used for
> boot-from-volume or not, that's a question the deployer would have to answer.

Sylvain, it's still not clear to me what action you took to work with the folks involved with TripleO development here.
Comment 16 Sylvain Bauza 2017-02-23 16:01:11 EST
Per your comment #14, I agree that we can't really guess how customers deploying Nova will use their cloud, so we can't really consider disabling the DiskFilter.

As for the second option (i.e. having flavors only for BFV instances), that is only a workaround: we could provide some prepared flavors in case operators want them.

I haven't yet discussed that with Ollie and Sven, but we can try to see how to add some separate flavors to our deployment in case it helps operators.
Comment 17 Stephen Gordon 2017-03-16 15:29:41 EDT
(In reply to Sylvain Bauza from comment #13)
> There was an effort to workaround that problem in
> https://bugs.launchpad.net/nova/+bug/1469179 by modifying the instance's
> root disk to 0 in case it was a boot from volume. Unfortunately, that change
> has not landed in Newton and now related to my comment #6 the
> resource-providers effort tries to land something for Ocata that would solve
> the problem.
> 
> If you see https://bugs.launchpad.net/nova/+bug/1469179/comments/25 there
> are two possibilities like you said :
>  - either disable the DiskFilter if you only have users creating
> boot-from-volume instances
>  - or provide sets of flavors having root disk set to 0 for boot-from-volume
> instances
> 
> Given our OSP deployment tool, I don't know how we can ask for automatically
> do either way, so I'll punt the bugzilla to the TripleO team so they could
> help us see how to do that.

Reading the release notes for the Placement API, it also seems like the mechanism is about to change:

"""
It is currently possible to exclude the CoreFilter, RamFilter and DiskFilter from the list of enabled FilterScheduler filters such that scheduling decisions are not based on CPU, RAM or disk usage. Once all computes are reporting into the Placement service, however, and the FilterScheduler starts to use the Placement service for decisions, those excluded filters are ignored and the scheduler will make requests based on VCPU, MEMORY_MB and DISK_GB inventory. If you wish to effectively ignore that type of resource for placement decisions, you will need to adjust the corresponding cpu_allocation_ratio, ram_allocation_ratio, and/or disk_allocation_ratio configuration options to be very high values, e.g. 9999.0.
"""

How does this intersect with the resource-providers work you mention that would address this issue?
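To make the effect of a very high disk_allocation_ratio concrete, the sketch below applies the ratio to the numbers from the description. It assumes usable capacity is (total - reserved) * allocation_ratio, which is my understanding of how inventory capacity is computed and should be treated as an assumption; the function names and the 20 GB request are hypothetical.

```python
def effective_capacity(total_gb: float, allocation_ratio: float, reserved_gb: float = 0) -> float:
    """Assumed capacity formula: (total - reserved) * allocation_ratio."""
    return (total_gb - reserved_gb) * allocation_ratio


def fits(used_gb: float, requested_gb: float, capacity_gb: float) -> bool:
    """True if the request still fits under the given capacity."""
    return used_gb + requested_gb <= capacity_gb


# local_gb=39 and local_gb_used=70 come from the hypervisor stats in the description.
print(fits(70, 20, effective_capacity(39, 1.0)))     # False
print(fits(70, 20, effective_capacity(39, 9999.0)))  # True
```

A ratio like 9999.0 effectively removes disk from the decision, which has the same trade-off as dropping the DiskFilter: regular, non-volume-backed instances are no longer protected at scheduling time.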
