Bug 1258877 - 40GB disk default for undercloud gives you about 7 days of uptime before undercloud fireworks
Status: CLOSED CURRENTRELEASE
Product: Red Hat OpenStack
Classification: Red Hat
Component: rhosp-director
Version: 7.0 (Kilo)
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: high
Target Milestone: ---
Target Release: 10.0 (Newton)
Assigned To: Ben Nemec
QA Contact: Shai Revivo
Keywords: Triaged
Depends On:
Blocks:
Reported: 2015-09-01 08:52 EDT by Fabio Massimo Di Nitto
Modified: 2016-10-04 14:38 EDT
CC: 6 users

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Last Closed: 2016-10-04 14:38:23 EDT
Type: Bug


Attachments: None
Description Fabio Massimo Di Nitto 2015-09-01 08:52:49 EDT
Description of problem:

40 GB is just not enough to deploy the undercloud when all debug logs are enabled by default.

Following the documentation, and using the undercloud deployment defaults in virt env, the VM will receive 40GB of disk. The disk fills up in about a week.

Even though this is a virt environment, it should still have sane defaults.
Comment 3 Fabio Massimo Di Nitto 2015-09-02 04:05:32 EDT
After additional investigation, it turns out that running out of disk on the undercloud caused dhcpd to be unable to renew leases to the overcloud, and the overcloud exploded.

3 problems really:

1) default disk size is too small
2) logs are in debug mode by default. This would be fine if they were rotated more often
3) overcloud should really not be so dependent on the undercloud after deployment. Yes, I am aware that there is an RFE to make static IPs configurable for the overcloud, but the issue is that the undercloud is not HA and presents itself as a single point of failure in the infrastructure, which is exactly what we are trying to avoid by using HA in the overcloud.
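For problem 2, a tighter logrotate policy is one way to keep debug logs in check. A minimal sketch, assuming the services log under /var/log/heat and /var/log/keystone; the 100M threshold and rotation count are illustrative values, not the director defaults:

```
/var/log/heat/*.log /var/log/keystone/*.log {
    size 100M
    rotate 4
    compress
    delaycompress
    missingok
    notifempty
}
```

With size-based rotation, the files are trimmed whenever they have grown past the threshold at the time logrotate runs, instead of purely on a daily schedule.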
Comment 5 Hugh Brock 2016-02-04 07:30:54 EST
The DHCP issue is fixed, but let's get the image size or the log rotation fixed anyway.
Comment 6 Ben Nemec 2016-02-09 16:39:48 EST
I'm not really sure what to do here.  We already configure log rotation on the undercloud, and I can't say that I've ever had a problem with a 40 GB undercloud so I don't think it's correct to say that 40 GB is universally too small.  In fact, on an undercloud that is approaching a week of uptime, here's what I see:

[centos@undercloud ~]$ df -h
Filesystem      Size  Used Avail Use% Mounted on
/dev/vda1        40G   21G   20G  51% /
[centos@undercloud ~]$ uptime
 18:34:59 up 6 days,  3:53,  1 user,  load average: 0.80, 0.98, 0.94

Now, obviously there are factors that can impact the disk usage.  Doing a lot of overcloud deployments will probably cause more log usage from things like Heat and Keystone, but even then the log rotation should mitigate the disk usage.

Note that when this bug was opened, it's possible log rotation was not configured.  I don't know for sure what the configuration looked like in 7.0.  In any case, I'm inclined to call this one fixed unless someone is frequently running out of disk space in virt environments on the latest version of director.
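When disk usage does climb, the first step is to see which logs are responsible. A small sketch using only coreutils (the /var/log paths are the usual locations; adjust for your deployment):

```shell
# List the ten largest entries under /var/log, smallest to largest,
# so runaway services (e.g. heat, keystone) stand out at the bottom.
du -sh /var/log/* 2>/dev/null | sort -h | tail -n 10
```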
Comment 7 Fabio Massimo Di Nitto 2016-02-09 23:59:56 EST
(In reply to Ben Nemec from comment #6)
> I'm not really sure what to do here.  We already configure log rotation on
> the undercloud, and I can't say that I've ever had a problem with a 40 GB
> undercloud so I don't think it's correct to say that 40 GB is universally
> too small.  In fact, on an undercloud that is approaching a week of uptime,
> here's what I see:
> 
> [centos@undercloud ~]$ df -h
> Filesystem      Size  Used Avail Use% Mounted on
> /dev/vda1        40G   21G   20G  51% /
> [centos@undercloud ~]$ uptime
>  18:34:59 up 6 days,  3:53,  1 user,  load average: 0.80, 0.98, 0.94
> 
> Now, obviously there are factors that can impact the disk usage.  Doing a
> lot of overcloud deployments will probably cause more log usage from things
> like Heat and Keystone, but even then the log rotation should mitigate the
> disk usage.
> 
> Note that when this bug was opened, it's possible log rotation was not
> configured.  I don't know for sure what the configuration looked like in
> 7.0.  In any case, I'm inclined to call this one fixed unless someone is
> frequently running out of disk space in virt environments on the latest
> version of director.

From what I remember, log rotation was already active at the time.

We can't really exclude that users will do lots of overcloud deployments/changes and that the logs will grow.

Truth be told, considering how cheap disk space is these days, it shouldn't be an issue to create a sparse disk image of at least 200GB to be safer.
But I agree with you, there is no perfect magic number here.
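The sparse-image idea is cheap to demonstrate: a sparse file reserves its apparent size but only consumes real blocks as data is written, and qcow2 images created by qemu-img grow on demand in much the same way. A sketch using only coreutils:

```shell
# Create a 200 GB sparse file; the apparent size is 200G, but the
# filesystem allocates blocks only as data is actually written.
truncate -s 200G undercloud.img
ls -lh undercloud.img    # apparent size: 200G
du -h undercloud.img     # actual allocation: ~0 until written to
```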
Comment 8 Ben Nemec 2016-02-11 13:55:04 EST
The other thing that occurs to me is that I believe in 7.0 Heat was doing some excessive logging of some very large requests, which may also explain why I'm seeing much less disk usage on an undercloud that saw moderately heavy use over the course of the week.  Are you still seeing problems with this in more recent versions?

That said, looking at the upstream code it appears we are actually creating a 30 GB VM in virt setups: https://github.com/openstack/instack-undercloud/blob/master/scripts/instack-virt-setup#L140

I would be open to increasing that default, although I'm a little wary of going straight to 200+.  My personal development environment is on SSDs where storage is not as cheap and plentiful as spinny disks and I'd be a little nervous that an undercloud with some runaway logging or something might actually fill up the host disk.  How would you feel about just doubling it to 60 GB?
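The runaway-logging worry can also be mitigated at runtime. A hypothetical watchdog sketch (the standalone script and the 90% threshold are assumptions for illustration, not anything director ships):

```shell
#!/bin/sh
# Warn when the root filesystem crosses a usage threshold, before
# services like dhcpd start failing for lack of space.
THRESHOLD=90  # percent; illustrative value, tune to taste
usage=$(df --output=pcent / | tail -n 1 | tr -dc '0-9')
if [ "$usage" -ge "$THRESHOLD" ]; then
    echo "WARNING: / is at ${usage}% - investigate /var/log" >&2
fi
```

Run from cron, this gives an early signal well before the lease-renewal failures described in comment 3.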
Comment 9 Mike Burns 2016-04-07 16:50:54 EDT
This bug did not make the OSP 8.0 release.  It is being deferred to OSP 10.
Comment 11 Jaromir Coufal 2016-10-04 14:38:23 EDT
This does not seem to be an issue anymore. Please re-open if it re-appears.
