Bug 841358 - Stateless-Linux readonly-root tmpfs overlay reversion
Status: CLOSED NOTABUG
Product: Fedora
Classification: Fedora
Component: nfs-utils
Version: 16
Hardware: x86_64 Linux
Priority: unspecified
Severity: unspecified
Assigned To: Steve Dickson
QA Contact: Fedora Extras Quality Assurance
Reported: 2012-07-18 14:41 EDT by James Vess
Modified: 2012-07-31 11:55 EDT (History)
Doc Type: Bug Fix
Last Closed: 2012-07-31 11:55:20 EDT
Type: Bug
Description James Vess 2012-07-18 14:41:10 EDT
Description of problem:
When using read-only root support on an NFS root, files and directories made read-writable via tmpfs ( mount --bind ) will suddenly show the original content instead of the overlay-mounted content if server side contents are updated.

Version-Release number of selected component (if applicable):


How reproducible:
At this time, I do not have replication steps as users are experiencing this issue intermittently. The only reliable part is that it does happen eventually. It may take anywhere from hours to days. I will update this bug report once I have a method of causing this to occur.

Steps to Reproduce:
1.
2.
3.
  
Actual results:
[root@workstation ~]# mount | grep /etc/openldap/ldap.conf
none on /etc/openldap/ldap.conf type tmpfs (rw,relatime)
[root@workstation ~]# grep /etc/openldap/ldap.conf /etc/mtab
none /etc/openldap/ldap.conf tmpfs rw,relatime 0 0
[root@workstation ~]# umount /etc/openldap/ldap.conf
umount: /etc/openldap/ldap.conf: not mounted
[root@workstation ~]# grep /etc/openldap/ldap.conf /etc/rwtab
files	/etc/openldap/ldap.conf
[root@workstation ~]# md5sum /etc/openldap/ldap.conf
85add6611836caf716fe03f0f3bb12f2  /etc/openldap/ldap.conf
[root@workstation ~]# md5sum /var/lib/stateless/writable/etc/openldap/ldap.conf 
84826e68c2fa259e7245d8f61be1d88a  /var/lib/stateless/writable/etc/openldap/ldap.conf

Expected results:
The overlay mount remains in use at all times.

Additional info:
At startup everything works as expected. However, after several hours of use this issue will suddenly occur.
I have not been able to find a causal link as of yet; common applications thus far would be google-chrome, firefox, thunderbird, terminator, gnome-terminal.

If anyone can point me toward a potential resolution or any troubleshooting steps, I will be happy to provide all the information I can.
Comment 1 James Vess 2012-07-18 14:42:21 EDT
I apologize, please strike "if server side contents are updated". Though I thought I had found a link between NFS file updates and this occurring, I have not been able to replicate the issue.
Comment 2 Bill Nottingham 2012-07-18 14:48:54 EDT
Are the bind mounts still in place when you're seeing the underlying content?
Comment 3 James Vess 2012-07-18 16:11:59 EDT
Hello Bill,

I have the ability to replicate now.
Yes, the bind mounts are still in place; however, I now know what is causing the issue.

When forcing the workstation to lose NFS access ( to the point where I see NFS re-connection messages within /var/log/messages ) and allowing it to reconnect, I lose access to my bind mounts.

What was interesting is that this was occurring on workstations which had not been up for an extremely long period of time and were not experiencing any issue that would present itself within /var/log/messages ( when using the standard debugging levels ).

It appears that NFS does a soft-reconnect at some point and due to this does a remount of sorts that causes the bind-mounts to fall below the read-only system in terms of overlay ordering.

I'm currently looking to find out:
  A. Why the existing FC16 images I'm using are not experiencing this issue ( they are about 3-4 months out of date )
  B. A proper resolution path to allow NFS to gracefully perform its soft-reloads without losing bind access or ( in the best possible situation ) survive a disconnection without losing bind overlays.

I may simply have to generate some symlink workarounds in the meantime in order to get our NFS updates pushed out :)
Comment 4 Bill Nottingham 2012-07-18 16:18:21 EDT
Moving to nfs-utils - might be a kernel filesystem thing.
Comment 5 James Vess 2012-07-18 16:26:36 EDT
Thank you all for your assistance.

As I've not provided a large amount of debugging detail, please advise what logs or debug output you would like, or whether there are any additional tests/troubleshooting steps that would assist you.

I will be happy to provide you all the information I can.
Comment 6 James Vess 2012-07-18 18:09:29 EDT
As an example, on one system the following is shown after the issue occurs:

06c66951345ac3cd6194f376eac98f3a  /etc/adjtime
06c66951345ac3cd6194f376eac98f3a  /var/lib/stateless/writable/etc/adjtime

628057c7b29c51e01cb2741885df9f95  /etc/resolv.conf
628057c7b29c51e01cb2741885df9f95  /var/lib/stateless/writable/etc/resolv.conf

24824d8748b3df06b08183868b861d2b  /var/lib/logrotate.status
0ca6f4b61d8b8b8770a2f7e95e55c70e  /var/lib/stateless/writable/var/lib/logrotate.status

613e99de852d94f0048b1868a7fdd470  /var/lib/random-seed
8a1fe730daddfdf4b5f7e33ad5b08c8e  /var/lib/stateless/writable/var/lib/random-seed

7c4abb395031ed87f3a2c5158bb1fab1  /etc/krb5.conf
5fd908d70880a32e2d2e0cbd18325420  /var/lib/stateless/writable/etc/krb5.conf

e8daaee7109cc66563c227f2759fef99  /etc/nslcd.conf
f554a7af6d0f305f6df491d740cb6329  /var/lib/stateless/writable/etc/nslcd.conf

85add6611836caf716fe03f0f3bb12f2  /etc/openldap/ldap.conf
84826e68c2fa259e7245d8f61be1d88a  /var/lib/stateless/writable/etc/openldap/ldap.conf

c84029fb0e2ae481660fa3cf653a8eea  /etc/pam_ldap.conf
98f4441846de033e4dd4699456eacd8a  /var/lib/stateless/writable/etc/pam_ldap.conf

adc1b27570c7f3040f7218b3c73048a0  /etc/sssd/sssd.conf
7055da41221089bf31020ff3da80cac4  /var/lib/stateless/writable/etc/sssd/sssd.conf

857cb32eb006acbccb6eacbdea6bc62b  /etc/cups/client.conf
857cb32eb006acbccb6eacbdea6bc62b  /var/lib/stateless/writable/etc/cups/client.conf

30ab8d7721d3ea3ae1292700e0d02c26  /etc/security/pam_mount.conf.xml
e748828a5407780e88f61ba66cc44751  /var/lib/stateless/writable/etc/security/pam_mount.conf.xml

/etc/cups/client.conf is perfectly writable at this point; all the other files are not.

none on /etc/adjtime type tmpfs (rw,relatime)
none on /etc/resolv.conf type tmpfs (rw,relatime)
none on /var/lib/logrotate.status type tmpfs (rw,relatime)
none on /var/lib/random-seed type tmpfs (rw,relatime)
none on /etc/krb5.conf type tmpfs (rw,relatime)
none on /etc/nslcd.conf type tmpfs (rw,relatime)
none on /etc/openldap/ldap.conf type tmpfs (rw,relatime)
none on /etc/pam_ldap.conf type tmpfs (rw,relatime)
none on /etc/cups/client.conf type tmpfs (rw,relatime)
none on /etc/security/pam_mount.conf.xml type tmpfs (rw,relatime)

[root@workstation ~]# umount /etc/cups/client.conf
[root@workstation ~]# umount /etc/security/pam_mount.conf.xml
umount: /etc/security/pam_mount.conf.xml: not mounted
[root@workstation ~]#

Looking at several systems, which files remain writable via the bind mount appears to be completely random.
This may be a race condition that's occurring.
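
Since the mount table still lists these entries even when the overlay is not what gets served, one way to check each path is to test whether it resolves to the same device and inode as its copy under /var/lib/stateless/writable. The following is a minimal Python sketch under that assumption; `overlay_active` and `reverted_entries` are hypothetical helper names, not part of the stateless packages. A checksum match alone (as with /etc/adjtime above) does not prove the bind mount is in effect, which is why the sketch compares inodes instead.

```python
import os


def overlay_active(path, writable_root="/var/lib/stateless/writable"):
    """Return True if `path` currently resolves to its writable copy.

    A bind mount makes both names refer to the same device and inode,
    so os.path.samefile() is a stricter check than comparing checksums.
    """
    writable_copy = os.path.join(writable_root, path.lstrip("/"))
    try:
        return os.path.samefile(path, writable_copy)
    except OSError:
        # Either name is missing or unreadable: treat as not active.
        return False


def reverted_entries(paths, writable_root="/var/lib/stateless/writable"):
    """List the paths whose overlay is no longer what the system sees."""
    return [p for p in paths if not overlay_active(p, writable_root)]
```

Calling `reverted_entries` with the paths listed in /etc/rwtab would return the ones that have fallen back to the read-only content.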
Comment 7 James Vess 2012-07-18 18:16:35 EDT
On our existing FC16 image, we are currently running nfs-utils version: 1:1.2.5-5.fc16
This package version appears not to have the issue.

I will downgrade the nfs-utils version on the image experiencing this issue to verify. The current installed version on the problematic image is: 1:1.2.5-8.fc16
Comment 8 James Vess 2012-07-18 18:43:11 EDT
After downgrading to 1:1.2.5-1.fc16, I am not able to immediately replicate as I was.
I'm reserving judgement, however, as to whether or not the issue is "resolved" with the downgrade for at least a day. If the issue is still occurring, someone will experience it.

I'll update the bug report as more information arrives.
Comment 9 James Vess 2012-07-18 19:09:09 EDT
Additional information:

When pushing an image update on top of the existing NFS Root image ( downgraded nfs-utils image ) the bind mount issue immediately occurred on the client workstations.

It may not be too commonplace to roll over existing nfsroot images; however, for minor changes it's a lifesaver.

Is this a known issue or perchance related to the issue above?
Comment 10 James Vess 2012-07-18 19:50:36 EDT
I've now been able to replicate bind mounts dropping when server content is updated.

It may have been a fluke on my side with the updated version that it did not occur immediately or perhaps applications had the files in a state where the bind mounts held. I've not had any issues replicating post-downgrade.

I'm just rsyncing the same filesystem from another box to the nfs mount ( no changes ); however, the clients immediately drop their bind mounts.

Here's the kernel NFS debug output captured while running tail on /etc/openldap/ldap.conf during the bind mount drop.

-- Valid bind mount
Jul 18 18:31:08 workstation kernel: [  169.767483] NFS: revalidating (0:19/48104820)
Jul 18 18:31:08 workstation kernel: [  169.767484] NFS call  getattr
Jul 18 18:31:08 workstation kernel: [  169.767671] NFS reply getattr: 0
Jul 18 18:31:08 workstation kernel: [  169.767678] NFS: nfs_update_inode(0:19/48104820 fh_crc=0xb57efcf1 ct=1 info=0x7e7f)
Jul 18 18:31:08 workstation kernel: [  169.767682] NFS: (0:19/48104820) revalidation complete
Jul 18 18:31:08 workstation kernel: [  169.767686] NFS: nfs_lookup_revalidate(openldap/ldap.conf) is valid
Jul 18 18:31:08 workstation kernel: [  169.767847] NFS: release(bin/tail)
Jul 18 18:31:08 workstation kernel: [  169.767853] NFS: release(lib64/ld-2.14.90.so)
Jul 18 18:31:08 workstation kernel: [  169.767859] NFS: release(lib64/libc-2.14.90.so)
Jul 18 18:31:08 workstation kernel: [  169.767863] NFS: release(locale/locale-archive)

-- Bind mount no more :(
Jul 18 18:31:26 workstation kernel: [  188.132129] NFS: revalidating (0:19/48104820)
Jul 18 18:31:26 workstation kernel: [  188.132134] NFS call  getattr
Jul 18 18:31:26 workstation kernel: [  188.132265] NFS reply getattr: -116
Jul 18 18:31:26 workstation kernel: [  188.132266] nfs_revalidate_inode: (0:19/48104820) getattr failed, error=-116
Jul 18 18:31:26 workstation kernel: [  188.132270] NFS: nfs_lookup_revalidate(openldap/ldap.conf) is invalid
Jul 18 18:31:26 workstation kernel: [  188.132273] NFS: lookup(openldap/ldap.conf)
Jul 18 18:31:26 workstation kernel: [  188.132275] NFS call  lookup ldap.conf
Jul 18 18:31:26 workstation kernel: [  188.132440] NFS: nfs_update_inode(0:19/48103990 fh_crc=0x919a7b62 ct=1 info=0x7e7f)
Jul 18 18:31:26 workstation kernel: [  188.132443] NFS: mtime change on server for file 0:19/48103990
Jul 18 18:31:26 workstation kernel: [  188.132445] NFS reply lookup: 0
Jul 18 18:31:26 workstation kernel: [  188.132451] NFS: nfs_fhget(0:19/48104115 fh_crc=0x35255c1f ct=1)
Jul 18 18:31:26 workstation kernel: [  188.132454] NFS call  access
Jul 18 18:31:26 workstation kernel: [  188.132641] NFS: nfs_update_inode(0:19/48104115 fh_crc=0x35255c1f ct=1 info=0x7e7f)
Jul 18 18:31:26 workstation kernel: [  188.132646] NFS reply access: 0
Jul 18 18:31:26 workstation kernel: [  188.132651] NFS: permission(0:19/48104115), mask=0x24, res=0
Jul 18 18:31:26 workstation kernel: [  188.132655] NFS: open file(openldap/ldap.conf)
Jul 18 18:31:26 workstation kernel: [  188.132665] NFS: llseek file(openldap/ldap.conf, 0, 1)
Jul 18 18:31:26 workstation kernel: [  188.132669] NFS: llseek file(openldap/ldap.conf, 0, 2)
Jul 18 18:31:26 workstation kernel: [  188.132679] NFS: llseek file(openldap/ldap.conf, 0, 0)
Jul 18 18:31:26 workstation kernel: [  188.132686] NFS: read(openldap/ldap.conf, 336@0)
Jul 18 18:31:26 workstation kernel: [  188.132693] NFS: nfs_readpages (0:19/48104115 1)
Jul 18 18:31:26 workstation kernel: [  188.132703] NFS:     0 initiated read call (req 0:19/48104115, 336 bytes @ offset 0)
Jul 18 18:31:26 workstation kernel: [  188.132906] NFS: nfs_readpage_result: 42381, (status 336)
Jul 18 18:31:26 workstation kernel: [  188.132909] NFS: nfs_update_inode(0:19/48104115 fh_crc=0x35255c1f ct=1 info=0x7e7f)
Jul 18 18:31:26 workstation kernel: [  188.132915] NFS: read done (0:19/48104115 336@0)
Jul 18 18:31:26 workstation kernel: [  188.132932] NFS: read(openldap/ldap.conf, 0@336)
Jul 18 18:31:26 workstation kernel: [  188.132934] NFS: flush(openldap/ldap.conf)
Jul 18 18:31:26 workstation kernel: [  188.132936] NFS: release(openldap/ldap.conf)
Jul 18 18:31:26 workstation kernel: [  188.132939] NFS: dentry_delete(openldap/ldap.conf, 8c)
Jul 18 18:31:26 workstation kernel: [  188.133018] NFS: release(bin/tail)
Jul 18 18:31:26 workstation kernel: [  188.133021] NFS: release(lib64/ld-2.14.90.so)
Jul 18 18:31:26 workstation kernel: [  188.133024] NFS: release(lib64/libc-2.14.90.so)
Jul 18 18:31:26 workstation kernel: [  188.133026] NFS: release(locale/locale-archive)

It looks like since validation failed the bind mount gets dropped?
Is the change in mtime enough to cause this or is there something else going on?

Our existing image does not appear to have this issue and is a few minors above the current nfs-utils version, so it may not be an nfs-utils issue.
Any ideas?
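
For reference, error -116 in the log above is ESTALE (stale file handle) on Linux, and the same name is subsequently looked up under a different inode number (48104820 before, 48104115 after). A small Python sketch for pulling such old/new pairs out of a debug log follows. The regular expressions are written against the lines quoted here only, `stale_inode_swaps` is a hypothetical name, and pairing the failed getattr with the next nfs_fhget is an assumption about how to read this output.

```python
import re

# Linux errno value for ESTALE ("Stale file handle"). Hard-coded because the
# number is platform-specific and the log above comes from a Linux kernel.
LINUX_ESTALE = 116

GETATTR_FAIL = re.compile(
    r"nfs_revalidate_inode: \((?P<dev>\d+:\d+)/(?P<ino>\d+)\) "
    r"getattr failed, error=-(?P<err>\d+)"
)
FHGET = re.compile(r"NFS: nfs_fhget\((?P<dev>\d+:\d+)/(?P<ino>\d+) ")


def stale_inode_swaps(lines):
    """Yield (old_inode, new_inode) for each ESTALE followed by an nfs_fhget."""
    pending = None
    for line in lines:
        failed = GETATTR_FAIL.search(line)
        if failed and int(failed.group("err")) == LINUX_ESTALE:
            # Remember the inode whose file handle went stale.
            pending = int(failed.group("ino"))
            continue
        fetched = FHGET.search(line)
        if fetched and pending is not None:
            # Assumed to be the inode now found under the same name.
            yield pending, int(fetched.group("ino"))
            pending = None
```

Feeding it the lines of /var/log/messages (for example `stale_inode_swaps(open("/var/log/messages"))`) would list every such swap seen while NFS debugging was enabled.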
Comment 11 James Vess 2012-07-18 19:55:36 EDT
The more I work on this, the more I find that the entire issue may be just atime / mtime changes on the server side causing client bind mounts to drop.
Comment 12 J. Bruce Fields 2012-07-23 08:52:09 EDT
I'm pretty confused by the problem: I don't understand what you mean by "bind mounts getting dropped".  (And part of the problem may be that I'm ignorant of stateless/read-only root support.)

But: if you're e.g. bind-mounting a file that may be replaced on the server side, that's unsupported.  The client really has to assume that mountpoints aren't replaced.  (On local filesystems this can be enforced by the VFS.  For NFS, I'm not sure what the client does--I would have expected it to start returning ESTALE on attempts to access the mountpoint.)
Comment 13 James Vess 2012-07-23 13:10:12 EDT
The base problem is that I had been able to roll overlay updates on the server side without issue until this last series of updates ( fresh FC16 install image ). It may be a change in the environment that I'm not aware of, or a change in one or more packages, that is causing the issue.

Basically here's the situation:

  I have a template image which I'm making updates to; this template image is read/writable. I have several boxes which have read-only exports of a copy of this image that is serving clients. Previously, I was able to simply roll over the read-only exports with the changes made on the read-write image without any client issue.

Suddenly now when doing the same thing, the clients drop any bind mounts ( which are performed by the stateless linux package to allow read-write files on a read-only filesystem ), subsequently causing them to lose their runtime configuration.

If it was a fluke that the method of rollout was working prior, I can accept that perfectly and I have already altered my methods to work around this issue.

It just seems strange that it would suddenly no longer work.

Looking at the debug information above, it seems that mtime changes are the cause of the issue; however, I'm certain that prior rsyncs would have caused the same time disparity.

This may not necessarily be a bug if it's intended to function this way; however, I am certainly not an expert, nor do I have enough programming experience to properly digest the code involved to reach a decision on my own.

Honestly, that's why I've turned to you.

So to confirm, you're saying that this is an expected result?

The files are not being replaced, Their content is not being updated, However their metadata is ( mtime/atime/etc ).

However that is enough to cause NFS to re-validate and thus drop the bind mount?
Comment 14 J. Bruce Fields 2012-07-23 13:40:37 EDT
"I was able to simply roll over the read-only exports with the changes made on the read-write image"

What does "roll over" mean?  (How exactly are you doing this?)

"The files are not being replaced, Their content is not being updated, However their metadata is ( mtime/atime/etc )."

I would expect the client to be able to deal with changes in mtime of mountpoints.  (The problem would be if the files are being replaced: for example, if the inode number changes.)
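
To illustrate the distinction: overwriting a file in place keeps its inode number, while writing a new file and renaming it over the old name gives that name a new inode. The Python sketch below demonstrates this on a local temporary directory; it does not involve NFS or bind mounts, and the helper names are hypothetical.

```python
import os
import tempfile


def inode_of(path):
    """Return the inode number currently behind `path`."""
    return os.stat(path).st_ino


def update_in_place(path, data):
    """Overwrite the existing file; the inode stays the same."""
    with open(path, "w") as fh:
        fh.write(data)


def update_by_replace(path, data):
    """Write a temporary sibling and rename it over `path`; the inode changes."""
    directory = os.path.dirname(path) or "."
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as fh:
        fh.write(data)
    os.replace(tmp, path)


def demo():
    """Return the inode after creation, after an in-place write, after a replace."""
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "ldap.conf")
        update_in_place(path, "v1\n")
        first = inode_of(path)
        update_in_place(path, "v2\n")
        second = inode_of(path)
        update_by_replace(path, "v2\n")  # same content, new inode
        third = inode_of(path)
        return first, second, third
```

rsync by default writes each updated file to a temporary name and renames it into place, which corresponds to `update_by_replace`; its `--inplace` option instead updates the destination file directly. Whether that avoids the problem in this setup is not established in this report.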

"However that is enough to cause NFS to re-validate and thus drop the bind mount?"

I'm not sure what you mean by "drop the bind mount" (and I'm having trouble following the discussions of md5 sums above as I'm not clear what's expected).

Do you mean: the bind mounts never happen?  Or it appears that something on the system has unmounted them?  Or mounted something else over them?
Comment 15 James Vess 2012-07-23 15:51:21 EDT
I apologize, When I say "roll over", I mean that I am performing an rsync of the entire contents of read-only export over the read-only export. Typically only a few files ( such as a configuration file ) may have been changed before the rsync.

I completely agree that an inode change would cause such an issue. I'll replicate and report the results.

When I say that the client drops the bind mount, I mean that from the time that I perform a roll over update ( though the affected files are not modified, except for metadata ), the bind mounts which were present and operating properly at system startup are suddenly no longer on top.

By this, I mean that the existing read-only content which would be expected before the rwtab was processed ( the rwtab being a list of files/directories to copy into tmpfs and then bind mount over their original locations ) suddenly bleeds back through. Though the bind mounts show in mtab/mount, the original content is suddenly visible.

An example would be:

[root@pizza james]# echo "omgtest" > /tmp/omgtest
[root@pizza james]# chattr +ia /tmp/omgtest
[root@pizza james]# cp /tmp/omgtest /tmp/omgtest.rw
[root@pizza james]# chattr -ia /tmp/omgtest.rw
[root@pizza james]# mount --bind /tmp/omgtest.rw /tmp/omgtest
[root@pizza james]# echo "This is a test" > /tmp/omgtest
[root@pizza james]# cat /tmp/omgtest
This is a test
[root@pizza james]# mount | grep omgtest
/dev/mapper/luks-3c2bb197-fa06-479b-9890-ad02ccd83134 on /tmp/omgtest type ext4 (rw,relatime,seclabel,user_xattr,barrier=1,data=ordered)

Now, after performing a rollover update, here is what would occur:

[root@pizza james]# cat /tmp/omgtest
omgtest
[root@pizza james]# mount | grep omgtest
/dev/mapper/luks-3c2bb197-fa06-479b-9890-ad02ccd83134 on /tmp/omgtest type ext4 (rw,relatime,seclabel,user_xattr,barrier=1,data=ordered)

Granted, this is a dramatization done on my local workstation; however, the results are identical and are listed above in my bug report.

This is the reason for including md5sums.
The files in /var/lib/stateless/writable are the equivalent of the .rw file in this example. They are the files in tmpfs that are bind mounted by the stateless linux package ( in accordance with the content of the rwtab ) over the real files to allow read-write of certain files/directories on a read-only root.
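
As a rough illustration of the mechanism described above, the Python sketch below parses rwtab-style lines such as "files /etc/openldap/ldap.conf" and builds, without running them, the copy and bind-mount commands for each `files` entry. It is not the actual stateless/initscripts implementation: the function names, the handling of `#` comments, and the exact cp/mount invocations are assumptions made for the example.

```python
import os
import shlex

WRITABLE_ROOT = "/var/lib/stateless/writable"  # path seen in this report


def parse_rwtab(text):
    """Return (kind, path) pairs, skipping blank lines and '#' comments."""
    entries = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # comment handling is an assumption of this sketch
        kind, path = line.split(None, 1)
        entries.append((kind, path))
    return entries


def plan_for_files(entries, writable_root=WRITABLE_ROOT):
    """Build (not run) the copy and bind-mount commands for 'files' entries."""
    plan = []
    for kind, path in entries:
        if kind != "files":
            continue  # other entry kinds are out of scope for this sketch
        copy = os.path.join(writable_root, path.lstrip("/"))
        plan.append(["mkdir", "-p", os.path.dirname(copy)])
        plan.append(["cp", "-a", path, copy])
        plan.append(["mount", "--bind", copy, path])
    return plan


def render(plan):
    """Format the planned commands as shell-quoted lines for inspection."""
    return "\n".join(" ".join(shlex.quote(arg) for arg in cmd) for cmd in plan)
```

Printing `render(plan_for_files(parse_rwtab(open("/etc/rwtab").read())))` shows the planned steps without touching any mounts.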
Comment 16 James Vess 2012-07-23 15:53:43 EDT
Hello, in my last response I made a mistake:

"I mean that I am performing an rsync of the entire contents of read-only export over the read-only export."

Should be

"I mean that I am performing an rsync of the entire contents of the read-write export over the read-only exports."
Comment 17 James Vess 2012-07-23 16:14:19 EDT
Alright, I verified an inode change. That completely accounts for everything.
Thank you for pointing me in the right direction.

I'll work on this issue; you can consider the bug closed.
Comment 18 J. Bruce Fields 2012-07-31 11:55:20 EDT
If further analysis does find a bug it may be best to open a new bug with a summary.
