Bug 1564671

Summary: Container configuration generation fails if the host file system is xfs that was created with ftype=0
Product: Red Hat OpenStack Reporter: Alex Schultz <aschultz>
Component: openstack-tripleo-heat-templates Assignee: Emilien Macchi <emacchi>
Status: CLOSED CANTFIX QA Contact: Gurenko Alex <agurenko>
Severity: high Docs Contact:
Priority: high    
Version: 8.0 (Liberty) CC: augol, ccamacho, dwalsh, esandeen, jcoufal, jschluet, mburns, mcornea, morazi, mszeredi, pasik, rhel-osp-director-maint, roxenham, rscarazz, sbaker, vgoyal
Target Milestone: zstream Keywords: Triaged, ZStream
Target Release: 8.0 (Liberty)   
Hardware: Unspecified   
OS: Unspecified   
: 1575115 (view as bug list) Environment:
Last Closed: 2018-10-25 20:48:15 UTC Type: Bug
Bug Blocks: 1575115, 1580463, 1580469, 1580476    
Attachments:
Description Flags
fast_forward_upgrade_playbook.yaml output none

Description Alex Schultz 2018-04-06 20:33:21 UTC
Description of problem:

docker-puppet.py may fail with IO or rsync errors when run on systems with xfs and the xfs setting ftype=0.

More details:
While trying out the containerized undercloud, I kept running into an issue where the configuration generation would fail with something like:

+ rsync -a -R --delay-updates --delete-after /etc /root /opt /var/www /var/spool/cron /var/lib/config-data/heat_api
file has vanished: "/etc/httpd/conf.d/README"
file has vanished: "/etc/httpd/conf.d/autoindex.conf"
file has vanished: "/etc/httpd/conf.d/userdir.conf"
file has vanished: "/etc/httpd/conf.d/welcome.conf"
file has vanished: "/etc/httpd/conf.modules.d/00-base.conf"
file has vanished: "/etc/httpd/conf.modules.d/00-dav.conf"
file has vanished: "/etc/httpd/conf.modules.d/00-lua.conf"
file has vanished: "/etc/httpd/conf.modules.d/00-mpm.conf"
file has vanished: "/etc/httpd/conf.modules.d/00-proxy.conf"
file has vanished: "/etc/httpd/conf.modules.d/00-ssl.conf"
file has vanished: "/etc/httpd/conf.modules.d/00-systemd.conf"
file has vanished: "/etc/httpd/conf.modules.d/01-cgi.conf"
file has vanished: "/etc/httpd/conf.modules.d/10-wsgi.conf"
rsync warning: some files vanished before they could be transferred (code 24) at main.c(1052) [sender=3.0.9]

2018-04-06 20:12:30,750 INFO: 15716 -- Finished processing puppet configs for heat_api
2018-04-06 20:12:30,751 ERROR: 15715 -- ERROR configuring heat_api
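
For readers unfamiliar with the "code 24" in the log above: rsync(1) documents exit status 24 as "Partial transfer due to vanished source files". The snippet below is only a minimal sketch of how a wrapper could tell that status apart from other failures. The function name is hypothetical, and this is not how docker-puppet.py actually handles the result.

```python
# Exit status 24 is documented in rsync(1) as
# "Partial transfer due to vanished source files".
RSYNC_VANISHED = 24


def describe_rsync_status(returncode):
    """Return a short description of an rsync exit status (hypothetical helper)."""
    if returncode == 0:
        return "ok"
    if returncode == RSYNC_VANISHED:
        return "source files vanished during transfer"
    return "failed with exit code %d" % returncode


print(describe_rsync_status(24))
```

Whether status 24 should be treated as fatal is a policy decision for the caller; the sketch only names the condition.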


In trying to figure out what was happening, I noticed that in the dmesg output there would be these messages:
[79910.073570] overlayfs: upper fs needs to support d_type. This is an invalid configuration.
[79910.091994] overlayfs: upper fs needs to support d_type. This is an invalid configuration.
[79910.110953] overlayfs: upper fs needs to support d_type. This is an invalid configuration.


These messages led me to https://github.com/moby/moby/issues/10294#issuecomment-267846091

From this comment I found the v1.13 deprecation notice for this message, which indicates that xfs doesn't support d_type if it was formatted with ftype=0.

https://github.com/moby/moby/blob/v1.13.0-rc4/docs/deprecated.md#backing-filesystem-without-d_type-support-for-overlayoverlay2


The system I was using came from a CentOS guest image whose xfs root filesystem had crc=0 and ftype=0:

$ xfs_info /
meta-data=/dev/vda1              isize=256    agcount=20, agsize=524224 blks
         =                       sectsz=512   attr=2, projid32bit=1
         =                       crc=0        finobt=0 spinodes=0
data     =                       bsize=4096   blocks=10484164, imaxpct=25
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0 ftype=0
log      =internal               bsize=4096   blocks=2560, version=2
         =                       sectsz=512   sunit=0 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
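
As a rough illustration of how the ftype value could be checked programmatically, here is a minimal sketch that pulls it out of xfs_info output like the above. The function name is hypothetical and the parsing assumes the "ftype=N" token appears as shown in this report; it is not taken from any TripleO code.

```python
import re


def xfs_ftype(xfs_info_output):
    """Return the ftype value from xfs_info output as an int, or None if absent.

    Hypothetical helper: it only looks for an "ftype=N" token in the text.
    """
    match = re.search(r"\bftype=(\d+)", xfs_info_output)
    return int(match.group(1)) if match else None


# The "naming" line from the xfs_info output in this report.
sample = "naming   =version 2              bsize=4096   ascii-ci=0 ftype=0"
print(xfs_ftype(sample))
```

A result of 0 corresponds to the problematic configuration described here; None means the field was not found in the text and the caller has to decide how to treat that.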


Impact:
New installs won't be affected, as we have the correct fs setting now. Older systems being upgraded from baremetal installations to containers may fail.

Comment 1 Alex Schultz 2018-04-06 21:16:48 UTC
CRC was enabled in RHEL in rhbz#1309498

Comment 2 Steve Baker 2018-04-08 22:13:17 UTC
I hit this 12 months ago which resulted in the following bugs:
https://bugs.launchpad.net/tripleo/+bug/1693398
https://bugzilla.redhat.com/show_bug.cgi?id=1455713
https://bugzilla.redhat.com/show_bug.cgi?id=1288162

My takeaway at the time was that the next unreleased RHEL kernel might improve the situation for overlay2 on ftype=0 xfs, so it should be retested then. The situation has improved: we now get an early error message instead of weird behaviour on deleted files.

But yes, we have a problem now for those who have upgraded all the way from early OSP versions when the default ftype was still 0.

It would be interesting to know which OSP/RHEL version combo was the last one to be deployed with xfs ftype=0 to get an idea of the scope of this upgrade problem.

Comment 3 Alex Schultz 2018-04-09 16:31:52 UTC
Based on what I found, it was changed in RHEL7.3. According to the lifecycle page, we shipped OSP10 on 7.3. https://access.redhat.com/support/policy/updates/openstack/platform

So <=OSP9 upgrades may be affected.

Comment 4 Marius Cornea 2018-04-09 20:22:33 UTC
I did some checks of the overcloud images that we shipped in the past
and below are my results:

OSP10 shipped rhel 7.3 overcloud image at GA time so we should be safe there.

The overcloud image shipped at 9 GA (rhosp-director-image rpm in [1])
has the root fs formatted as ext4. According to [2] xfs is the only
supported lower layer fs for OverlayFS, so I believe deployments that
used this image for initial deployment cannot be upgraded to
containers. Overcloud images in the following 9-director builds are
RHEL 7.3.

Regarding the initial XFS issue - I found a RHEL 7.2 overcloud image
with an ftype=0 xfs root fs in OSP8 director Y1 [3]. I can confirm that I
reproduced the issue reported by Alex during FFU of the OSP8
environment deployed with that image (8->9->10->FFU->13).

To summarize: OSP7/8/9 deployments are potentially blocked from being
upgraded to containerized deployments (depending on whether the initial
deployment was on RHEL 7.3 or earlier).

Comment 5 Marius Cornea 2018-04-09 20:24:34 UTC
Created attachment 1419558 [details]
fast_forward_upgrade_playbook.yaml output

Attaching the output of fast_forward_upgrade_playbook.yaml playbook where these errors show up.

Comment 7 Marius Cornea 2018-04-09 21:02:18 UTC
(In reply to Marius Cornea from comment #5)
> Created attachment 1419558 [details]
> fast_forward_upgrade_playbook.yaml output
> 
> Attaching the output of fast_forward_upgrade_playbook.yaml playbook where
> these errors show up.

Small correction - it's actually the deploy_steps_playbook.yaml playbook which fails.

Comment 9 Eric Sandeen 2018-04-12 14:29:14 UTC
It is true that overlayfs in RHEL7 requires ftype to be enabled on XFS:

https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html/7.2_release_notes/technology-preview-file_systems

> Note that XFS file systems must be created with the -n ftype=1 option enabled for use as an overlay.

If you are attempting to use overlayfs on a filesystem without the ftype feature enabled, then unfortunately this behavior is expected.  Further, there is no in-place upgrade to add the ftype feature; dump, mkfs, & restore is the only path forward if you require ftype.

ftype was indeed made default in RHEL7.3 / xfsprogs-4.5.0-1 in March 2016.


However, I'm digging into this situation a bit more; upstream, lack of d_type causes overlayfs to warn but not fail.

cc: miklos as well.

-Eric

Comment 10 Eric Sandeen 2018-04-12 14:39:21 UTC
(But this bug may be conflating two issues, I'm not sure that

> rsync warning: some files vanished before they could be transferred

has anything to do with ftype support.  Doesn't that simply mean that the source files were removed while rsync was running?)

Comment 11 Vivek Goyal 2018-04-12 14:48:32 UTC
(In reply to Alex Schultz from comment #0)
> 
> In trying to figure out what was happening, I noticed that in the dmesg
> output there would be these messages:
> [79910.073570] overlayfs: upper fs needs to support d_type. This is an
> invalid configuration.
> [79910.091994] overlayfs: upper fs needs to support d_type. This is an
> invalid configuration.
> [79910.110953] overlayfs: upper fs needs to support d_type. This is an
> invalid configuration.
> 

This just means that overlay has an underlying xfs with ftype=0, and the side effect
of this should be that whiteout files will become visible to the user/container. It should not lead to missing files during rsync. So something else is wrong.

> 
> From these messages I found,
> https://github.com/moby/moby/issues/10294#issuecomment-267846091
> 
> From this comment I found the deprecation notice for v1.13 around this
> message which indicates that xfs doesn't support d_type if it was formatted
> with ftype=0
> 
> https://github.com/moby/moby/blob/v1.13.0-rc4/docs/deprecated.md#backing-filesystem-without-d_type-support-for-overlayoverlay2
> 

BTW, to catch errors during configuration, I modified container-storage-setup to error out if overlay is being set up with ftype=0 on the underlying fs. But it looks like your setup is somehow bypassing it.

https://github.com/projectatomic/container-storage-setup/commit/7fffea78b4195bdb883c3dada90d11d140a2c60a


> 
> So the system I was using was from a centos guest image that did not have
> crc enabled for the xfs.
> 
> $ xfs_info /
> meta-data=/dev/vda1              isize=256    agcount=20, agsize=524224 blks
>          =                       sectsz=512   attr=2, projid32bit=1
>          =                       crc=0        finobt=0 spinodes=0
> data     =                       bsize=4096   blocks=10484164, imaxpct=25
>          =                       sunit=0      swidth=0 blks
> naming   =version 2              bsize=4096   ascii-ci=0 ftype=0
> log      =internal               bsize=4096   blocks=2560, version=2
>          =                       sectsz=512   sunit=0 blks, lazy-count=1
> realtime =none                   extsz=4096   blocks=0, rtextents=0
> 
> 
> Impact:
> New installs won't be affected, as we have the correct fs setting now.
> Older systems being upgraded from baremetal installations to containers may
> fail.

Was this system using overlay even before the upgrade? Or did it set up a fresh docker after the upgrade?

Can you paste "docker info" output after the upgrade and possibly before the upgrade as well?

Comment 12 Alex Schultz 2018-04-12 14:51:17 UTC
> has anything to do with ftype support.  Doesn't that simply mean that the source files were removed while rsync was running?)

So because overlayfs doesn't fail, you end up with weird results. I had some containers not start up and some would start but get the rsync issues.  So it "works" but you get some really odd interactions in the containers.

Comment 13 Vivek Goyal 2018-04-12 14:57:40 UTC
(In reply to Alex Schultz from comment #12)
> > has anything to do with ftype support.  Doesn't that simply mean that the source files were removed while rsync was running?)
> 
> So because overlayfs doesn't fail, you end up with weird results. I had some
> containers not start up and some would start but get the rsync issues.  So
> it "works" but you get some really odd interactions in the containers.

I doubt that this is related to ftype=0. Even if it is, simply don't use overlay with ftype=0. And, to make it easy, we put a check in container-storage-setup. Docker will fail, the user will notice it and change the storage driver to, say, devicemapper.

Comment 14 Alex Schultz 2018-04-12 16:25:41 UTC
(In reply to Vivek Goyal from comment #13)
> I doubt that this is related to ftype=0. Even if it is, simply don't use
> overlay with ftype=0. And, to make it easy, we put a check in
> container-storage-setup. Docker will fail, user will notice it and change
> your storage driver to say devicemapper.

So for the openstack deployments we've settled on overlayfs, and this bug is around the fact that there is an issue with older xfs and overlayfs. We'll have to evaluate the various issues related to not using it. Currently this is not a configurable thing. The problem is not on new installs, where everyone is getting compatible xfs, but rather on systems customers may be migrating from baremetal installations (done with <=7.2) to containerized installations (done with >=7.4). We're trying to figure out a solution that isn't "reformat your system".

NOTE: In my original test the same processes/software versions were used, and the only difference was that one node was a 7.2 node that was yum updated to 7.4 and the other node was a 7.4 node. Once both systems were up to date, the brand new installation proceeded and the 7.2 node exhibited odd docker behavior while the 7.4 node worked fine. The only thing different was the xfs version.

Comment 15 Alex Schultz 2018-04-12 17:23:28 UTC
(In reply to Vivek Goyal from comment #11)
> This just means that overlay has underlying xfs with ftype=0 and side effect
> of this should be that whiteout files will become visible to user/container.
> It should not lead to missing files during rsync. So something else is wrong.

Yea it just seemed to be the only difference between the two machines when one was successful and the other was not.

> 
> BTW, to catch errors during configuration, I had modified
> container-storage-setup and error out if overlay is being setup with ftype=0
> on underlying fs. But looks like in your setup you are somehow bypassing it.
> 
> https://github.com/projectatomic/container-storage-setup/commit/7fffea78b4195bdb883c3dada90d11d140a2c60a
> 

We're not using this in openstack.  I think we might need to add a similar check to prevent anything from proceeding.

> 
> Was this system using overlay even before upgrade? Or it setup a fresh
> docker after upgrade?
> 
> Can you paste "docker info" output after upgrade and possibly before upgrade
> as well.

Fresh docker install after system updated to 7.4

[centos@undercloud ~]$ sudo docker info
Containers: 11
 Running: 0
 Paused: 0
 Stopped: 11
Images: 21
Server Version: 1.13.1
Storage Driver: overlay2
 Backing Filesystem: xfs
 Supports d_type: false
 Native Overlay Diff: true
Logging Driver: journald
Cgroup Driver: systemd
Plugins: 
 Volume: local
 Network: bridge host macvlan null overlay
Swarm: inactive
Runtimes: docker-runc runc
Default Runtime: docker-runc
Init Binary: docker-init
containerd version:  (expected: aa8187dbd3b7ad67d8e5e3a15115d3eef43a7ed1)
runc version: N/A (expected: 9df8b306d01f59d3a8029be411de015b7304dd8f)
init version: N/A (expected: 949e6facb77383876aeff8a6944dde66b3089574)
Security Options:
 seccomp
  WARNING: You're not using the default seccomp profile
  Profile: /etc/docker/seccomp.json
Kernel Version: 3.10.0-693.21.1.el7.x86_64
Operating System: CentOS Linux 7 (Core)
OSType: linux
Architecture: x86_64
Number of Docker Hooks: 3
CPUs: 4
Total Memory: 7.639 GiB
Name: undercloud.localdomain
ID: SCW7:NFRC:TDQB:DF7A:PIT3:JDZB:RE4W:FL3K:2YEZ:W7LD:YPGO:EFOH
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): true
 File Descriptors: 15
 Goroutines: 23
 System Time: 2018-04-12T17:20:13.340489947Z
 EventsListeners: 0
Registry: https://index.docker.io/v1/
WARNING: bridge-nf-call-iptables is disabled
WARNING: bridge-nf-call-ip6tables is disabled
Experimental: false
Insecure Registries:
 192.168.24.1:8787
 127.0.0.0/8
Live Restore Enabled: true
Registries: docker.io (secure)
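
The two lines that matter in the output above are "Storage Driver: overlay2" and "Supports d_type: false". As a minimal sketch of the kind of check discussed in this thread, the following parses "docker info" text and flags that combination. The function name is hypothetical, and it assumes the plain "Key: value" layout shown above; it is not the check that container-storage-setup or the TripleO validations implement.

```python
def overlay_without_d_type(docker_info_output):
    """Return True if the text reports an overlay driver without d_type support.

    Hypothetical helper: assumes "Key: value" lines as printed by
    "docker info" in this report.
    """
    fields = {}
    for line in docker_info_output.splitlines():
        # Split on the first colon only, so values containing ":" stay intact.
        key, sep, value = line.strip().partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    driver = fields.get("Storage Driver", "")
    d_type = fields.get("Supports d_type", "").lower()
    return driver.startswith("overlay") and d_type == "false"


sample = "Storage Driver: overlay2\n Backing Filesystem: xfs\n Supports d_type: false\n"
print(overlay_without_d_type(sample))
```

A deployment tool could refuse to continue when this returns True, which is the "fail early" behaviour suggested earlier in the thread.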

Comment 16 Vivek Goyal 2018-04-13 12:34:30 UTC
(In reply to Alex Schultz from comment #14)
> (In reply to Vivek Goyal from comment #13)
> > I doubt that this is related to ftype=0. Even if it is, simply don't use
> > overlay with ftype=0. And, to make it easy, we put a check in
> > container-storage-setup. Docker will fail, user will notice it and change
> > your storage driver to say devicemapper.
> 
> So for the openstack deployments we've settled on overlayfs and this bug is
> around the fact that there is an issue with older xfs and overlayfs. We'll
> have to evaluate the various issues related to not using it.  Currently
> this is not a configurable thing.  The problem is not on new installs where
> everyone is getting compatible xfs but rather systems customers may be
> migrating from baremetal installations (done with <=7.2) to containerized
> installations (done with >=7.4).  We're trying to figure out a solution that
> isn't reformat your system.
> 
> NOTE: In my original test the same processes/software versions were used
> and the only difference was 1 node was a 7.2 node that was yum updated to
> 7.4. And the other node was a 7.4 node.  Once both systems were up to date,
> the brand new installation proceeded and the 7.2 node exhibited odd docker
> behavior while the 7.4 worked fine.  The only thing different was the xfs
> version.

If it was working on 7.2 and stopped working after upgrading to 7.4, this is really strange. Are you able to reproduce this consistently? If yes, let us
try to narrow it down. I don't understand puppet and all the operations which are happening. If somebody can reduce the reproducer to the docker level, I might be able to help you.

Comment 17 Vivek Goyal 2018-04-13 12:39:40 UTC
(In reply to Alex Schultz from comment #15)
> > BTW, to catch errors during configuration, I had modified
> > container-storage-setup and error out if overlay is being setup with ftype=0
> > on underlying fs. But looks like in your setup you are somehow bypassing it.
> > 
> > https://github.com/projectatomic/container-storage-setup/commit/7fffea78b4195bdb883c3dada90d11d140a2c60a
> > 
> 
> We're not using this in openstack.  I think we might need to add a similar
> check to prevent anything from proceeding.

Why did you decide to bypass container-storage-setup in openstack? I think
it is a good idea to keep container-storage-setup in the path.


> Operating System: CentOS Linux 7 (Core)

Hmmm... you are using CentOS. Interesting.

Comment 19 Alex Schultz 2018-04-13 15:02:51 UTC
(In reply to Vivek Goyal from comment #16)
> If it was working on 7.2 and stopped working after upgrading to 7.4, this is
> really strange. Are you able to reproduce this consistently. If yes, let us
> try to narrow it down. I don't understand puppet and all the operations
> which are happening. If somebody can bring down the reproducer to docker
> level, I might be able to help you.

Yes it's consistent. Also it's not puppet, we're actually running a shell script to do some file copy operations during the config generation phase.  Specifically it's this bit of code:

https://github.com/openstack/tripleo-heat-templates/blob/master/docker/docker-puppet.py#L253-L276


It should be noted that if you manually do this from within the container via a docker run -it bash, it works fine. It only fails when it happens so quickly in the throwaway container we're using. It seems like a race condition of some sort.

Comment 20 Alex Schultz 2018-04-13 15:04:42 UTC
(In reply to Vivek Goyal from comment #17)
> (In reply to Alex Schultz from comment #15)
> > > BTW, to catch errors during configuration, I had modified
> > > container-storage-setup and error out if overlay is being setup with ftype=0
> > > on underlying fs. But looks like in your setup you are somehow bypassing it.
> > > 
> > > https://github.com/projectatomic/container-storage-setup/commit/7fffea78b4195bdb883c3dada90d11d140a2c60a
> > > 
> > 
> > We're not using this in openstack.  I think we might need to add a similar
> > check to prevent anything from proceeding.
> 
> Why did you decide to bypass container-storage-setup in openstack. I think
> it is a good idea to keep container-storage-setup in the path.
> 

We don't use atomic in the OSP project yet.

> 
> > Operating System: CentOS Linux 7 (Core)
> 
> Hmmm... you are using CentOS. Interesting.

Yes I could try and reproduce it in RHEL, but it's unlikely to change anything as we're using the same version of docker upstream.

Comment 21 Daniel Walsh 2018-04-13 16:35:36 UTC
Overlay storage was not supported in 7.2, so upgrading it to 7.4/7.5 is not supported. We did not support overlay until 7.4, and only with newly created xfs with the correct d_type setting. The issue, as has been pointed out, is that container images built using the bad xfs setting will be invalid. Basically they will end up with bogus files in them.

Comment 23 Vivek Goyal 2018-04-13 17:19:32 UTC
(In reply to Daniel Walsh from comment #21)
> Overlay Storage was not supported in 7.2.  So upgrading it to 7.4/7.5 is not
> supported.  We did not support overlay until 7.4 and only with newly created
> xfs with the correct D-Type.  The issue as has been pointed out is container
> images built using the bad xfs setting will be invalid.  Basically they will
> end up with bogus files in them.

Right. There might not be much point in debugging issues on an unsupported configuration. That is, have ftype=1; otherwise use devicemapper as storage.

Comment 28 Carlos Camacho 2018-04-18 13:58:59 UTC
Here are some validations for the steps prior to the upgrade: https://review.openstack.org/#/c/562282/

Comment 29 Alex Schultz 2018-10-25 20:48:15 UTC
There is no fix for OSP8. We have documentation about the issue and have added some validations.