Bug 1606573 - Overcloud deployment fails on nodes with 4G of RAM with Unable to write image to /tmp/ec6cd4ad-e2ec-4f3a-a3fb-7b01a87440c1. Error: [Errno 28] No space left on device
Summary: Overcloud deployment fails on nodes with 4G of RAM with Unable to write image...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-tripleo-heat-templates
Version: 14.0 (Rocky)
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: urgent
Target Milestone: beta
Target Release: 14.0 (Rocky)
Assignee: Dmitry Tantsur
QA Contact: Alexander Chuzhoy
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2018-07-20 20:40 UTC by Marius Cornea
Modified: 2019-01-11 11:51 UTC (History)

Fixed In Version: openstack-tripleo-heat-templates-9.0.0-0.20180827161726.1bdefbe.0rc1.el7ost
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-01-11 11:50:46 UTC
Target Upstream Version:
Embargoed:


Links
System | ID | Private | Priority | Status | Summary | Last Updated
OpenStack gerrit | 585370 | 0 | None | MERGED | undercloud: revert to using the iscsi deploy interface by default | 2020-12-07 08:17:46 UTC
Red Hat Product Errata | RHEA-2019:0045 | 0 | None | None | None | 2019-01-11 11:51:03 UTC

Description Marius Cornea 2018-07-20 20:40:44 UTC
Description of problem:

Overcloud deployment fails on nodes with 4G of RAM with Unable to write image to /tmp/ec6cd4ad-e2ec-4f3a-a3fb-7b01a87440c1. Error: [Errno 28] No space left on device.

It looks like /tmp is backed by the root tmpfs, so if the image fills up the free memory, the deployment fails with "No space left on device".
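
The failure mode described above can be illustrated with a minimal Python sketch. It assumes the download target is a directory on a RAM-backed filesystem, so the free space reported for it is bounded by free memory. The function name `image_fits` and the `headroom_bytes` parameter are hypothetical and are not taken from ironic-python-agent, whose actual logic is not shown in this report.

```python
import os
import shutil


def image_fits(image_path, target_dir="/tmp", headroom_bytes=0):
    """Return True if the image would fit into the free space of target_dir.

    Hypothetical helper for illustration only. When target_dir lives on a
    tmpfs, the free space reported here is limited by available memory, which
    is why a large image can trigger [Errno 28] on a 4G node.
    """
    image_size = os.path.getsize(image_path)       # size of the image file in bytes
    free = shutil.disk_usage(target_dir).free      # free bytes on the target filesystem
    return image_size + headroom_bytes <= free
```

A check of this kind only compares sizes; it does not account for a qcow2 image expanding when converted, which is an additional assumption a real pre-flight check would need to handle.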

Version-Release number of selected component (if applicable):
rhosp-director-images-14.0-20180713.3.el7ost.noarch

How reproducible:
100%

Steps to Reproduce:
1. Deploy an OSP14 virtual environment with overcloud nodes that have 4G of RAM

Actual results:
Node provisioning fails, and /var/log/containers/ironic/ironic-conductor.log shows Error: [Errno 28] No space left on device

Expected results:
Deployment succeeds without issues.

Additional info:
This is a regression compared to OSP13 where we didn't see this behavior. If it is expected we need to make sure it is properly documented.

Comment 1 Bob Fournier 2018-07-20 20:53:52 UTC
As Marius pointed out, this looks similar to  https://bugs.launchpad.net/ironic-python-agent/+bug/1661328

TheJulia
   bfournier, yup, it is totally valid too, and the only way around it is to use a raw file type, not qcow2; raw gets streamed out. Alternatively, iscsi deploy is another option, since the image does not get held in memory on the node being deployed
bfournier
    mcornea: are we using a different file type in 14 or…
TheJulia
    bfournier, I suspect the default to direct deploy..
mcornea
    bfournier: TheJulia ok, so I can confirm that it passed after increasing the ceph memory from 4G to 6G

Comment 2 Bob Fournier 2018-07-20 21:01:13 UTC
The direct deploy RFE is here - https://bugzilla.redhat.com/show_bug.cgi?id=1477713 - not sure whether it is related.

Comment 3 Dmitry Tantsur 2018-07-24 09:22:06 UTC
Yes, I think it's because of the direct deploy. The options are:
1. Recommend IronicDefaultDeployInterface=iscsi for low-memory deployments
2. Revert the default to iscsi, allow high-scale deployments to override
3. Store images as RAW to allow their streaming right to the disk (probably too late for Rocky, also will consume undercloud space).

Discussed with shardy, he votes for #2. We can revisit the default again for Stein if bug 1607779 helps with streaming images. Ramon, thoughts?
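
As a rough illustration of options 1 and 2, the sketch below renders a Heat environment file that sets IronicDefaultDeployInterface. It assumes the parameter is set under `parameter_defaults`, as is usual for tripleo-heat-templates parameters. The function names and the output path are hypothetical, and how the file is passed to the undercloud installation is not covered by this report.

```python
from pathlib import Path

# Only the two deploy interfaces discussed in this report are accepted by
# this sketch; that restriction is an assumption made for illustration.
KNOWN_INTERFACES = ("iscsi", "direct")


def render_deploy_interface_env(interface="iscsi"):
    """Return the text of an environment file selecting the deploy interface."""
    if interface not in KNOWN_INTERFACES:
        raise ValueError("unexpected deploy interface: %r" % (interface,))
    return (
        "parameter_defaults:\n"
        "  IronicDefaultDeployInterface: %s\n" % interface
    )


def write_deploy_interface_env(path, interface="iscsi"):
    """Write the environment file to path (hypothetical location) and return it."""
    target = Path(path)
    target.write_text(render_deploy_interface_env(interface))
    return target
```

With option 2, low-memory environments would need no such file, and high-scale deployments would render it with `interface="direct"` to opt in.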

Comment 4 Jaromir Coufal 2018-07-24 13:09:55 UTC
I am voting for #2 as well. We want to have minimum requirements and a low barrier to entry for the director node. If users want to go big in production, they should tweak the configuration to allow that.

Comment 5 Dmitry Tantsur 2018-07-24 13:39:25 UTC
Jarda, this conversation is not quite about production; IIRC we don't support nodes with less than 8 GiB (12 or even 16 in practice). But I do agree with making it opt-in for now.

Comment 6 Ramon Acedo 2018-07-24 14:04:06 UTC
Direct deploy is supposed to allow the undercloud to deploy a larger number of nodes by default by reducing the load added per overcloud node deployed at a time.

I wouldn't base the decision on the fact that it forces the VMs used for the overcloud to have 6GB instead of 4GB, but rather on how much of an improvement it brings for operators in production environments.

I propose making it opt-in for OSP 14, testing the performance improvements in director, and, based on the results, making it the default in OSP 15. This would also allow time for any not-yet-uncovered issues to surface.

Comment 8 Dmitry Tantsur 2018-08-22 13:58:49 UTC
The partial revert landed. We will look into making it less RAM-consuming in the next release.

Comment 14 Alexander Chuzhoy 2018-09-24 14:37:36 UTC
Verified:
Environment:
openstack-tripleo-heat-templates-9.0.0-0.20180831204457.17bb71e.0rc1.el7ost.noarch

Successfully deployed ceph nodes with 4GB of RAM:
[heat-admin@overcloud-cephstorage2-1 ~]$ free
              total        used        free      shared  buff/cache   available
Mem:        3880860      144940     3471580         680      264340     3467496
Swap:             0           0           0

Comment 18 errata-xmlrpc 2019-01-11 11:50:46 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2019:0045

