Bug 1066751

Summary: tmpfs: creates files with inode number 0, rendering parent directory unremovable
Product: Red Hat Enterprise Linux 6
Reporter: Roni Gordon / Casale Media <roni.gordon>
Component: kernel
Assignee: Carlos Maiolino <cmaiolin>
kernel sub component: Other
QA Contact: Murphy Zhou <xzhou>
Status: CLOSED ERRATA
Severity: medium
Priority: medium
CC: ajb, cmaiolin, eguan, esandeen, kdudka, kfujii, manuel.wolfshant, myamazak, pasteur, riel, roni.gordon, rwheeler, salmy, tejaswinipoluri3, toracat, trajaraman, yanwang, yohmura
Version: 6.4
Hardware: x86_64
OS: Linux
Fixed In Version: kernel-2.6.32-582.el6
Doc Type: Bug Fix
Clones: 1241665
Last Closed: 2016-05-10 21:52:20 UTC
Type: Bug
Bug Blocks: 1075802, 1159933, 1172231, 1241665, 1268411
Attachments: test script to overflow the inode counter (flags: none)

Description Roni Gordon / Casale Media 2014-02-19 03:22:31 UTC

Created attachment 864931
test script to overflow the inode counter

Description of problem:

Directories containing a file that was assigned inode number zero cannot be removed by 'rm -rf', even as root.  Currently, the only way to remove the parent directory is to delete the offending file by name -- but ls will not list the filename for this inode number.  This makes TMPFS unreliable for transient scratch data where cleanup jobs remove stale files/directories.

Version-Release number of selected component (if applicable):
N/A

How reproducible:
always

Steps to Reproduce:

1. create an arbitrarily named directory (e.g. 1381276560) on a TMPFS mount
2. generate sufficient (~4.3G) inodes inside it to overflow the 32-bit inode counter
3. run 'rm -rf' to delete all files inside this directory, and the directory itself

(see attached perl script)

Actual results:

# rm -rf 1381276560
rm: cannot remove directory `1381276560': Directory not empty

# ls -la 1381276560
total 0
drwxrw-r-- 2 user user 60 Oct 8 19:58 .
drwxrw-r-- 5 user user 100 Dec 4 18:10 ..

# ls -i 64045
? 64045

# stat 64045
  File: `64045'
  Size: 5 Blocks: 8 IO Block: 4096 regular file
Device: 15h/21d Inode: 0 Links: 1
Access: (0644/-rw-r--r--) Uid: ( 0/ root) Gid: ( 0/ root)
Access: 2014-01-13 10:39:42.272135907 -0500
Modify: 2014-01-11 22:55:36.418285866 -0500
Change: 2014-01-11 22:55:36.418285866 -0500

Expected results:

Directory and all of its contents should be deleted.  

Additional info:

According to http://stackoverflow.com/questions/4411701/how-are-inode-numbers-generated-in-linux-tmpfs, "the bulk of the tmpfs code is in mm/shmem.c., but it delegates almost everything to the generic filesystem code in fs/inode.c." The "i_ino" field of the inode struct is set by new_inode(), which simply performs 'inode->i_ino = ++last_ino;', where last_ino is a 32-bit unsigned integer that can overflow. On other filesystems, this value is typically overwritten with an unused inode number, but TMPFS does not appear to have any special handling for this.
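
The wrap described above can be modeled in user space. The following is only a minimal sketch, assuming a 32-bit unsigned counter that is pre-incremented as in the quoted 'inode->i_ino = ++last_ino;'. The names model_counter and model_next_ino are hypothetical, and this is not the kernel code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical user-space model of a 32-bit inode counter; not kernel code. */
struct model_counter {
    uint32_t last_ino;
};

/* Mirrors the quoted pre-increment. Unsigned arithmetic wraps modulo 2^32,
 * so the value handed out after UINT32_MAX is 0. */
uint32_t model_next_ino(struct model_counter *c)
{
    assert(c != NULL);
    c->last_ino += 1u;
    return c->last_ino;
}
```

Starting the model at UINT32_MAX - 1, successive calls return UINT32_MAX, then 0, then 1, which is the sequence that gives one file inode number 0.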

Confirmed to be an issue with the kernels used with RHEL 5.8, 5.9, 6.3 and 6.4 (albeit on CentOS).

Comment 1 Roni Gordon / Casale Media 2014-02-19 03:24:13 UTC

We modified https://raw.github.com/aidenbell/getdents/master/src/getdents.c, which was originally designed as a faster alternative to ls, so that it would only list files with inode number 0:

- if( d->d_ino != 0 && d_type == DT_REG ) {
- printf("%s\n", (char *)d->d_name );
+ if( d->d_ino == 0 && d_type == DT_REG ) {
+ printf("Inode number %ld: %s\n", d->d_ino, (char *)d->d_name );
 
And much to our horror/delight, the mystery filename that neither ls nor rm could locate appeared out of thin air:

# gcc getdents.c -o getdents
# getdents 1381276560
Inode number 0: 71A800181400

This file was completely intact (i.e. contained the correct contents and typical file size for a file in this directory), and could be trivially deleted by name:

# cat 71A800181400 | wc -c
776

# rm 71A800181400
rm: remove regular file `71A800181400'? y

At which point removing its parent directory was no longer an issue (directory block size was restored, etc.), and our problem went away.
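
For readers who want to look for such entries themselves, here is a minimal, Linux-only sketch of the same idea. It is not the modified getdents.c referenced above: the function name list_zero_ino_entries and the struct name model_dirent64 are hypothetical, and unlike the diff it does not filter on d_type. It assumes the raw getdents64 system call is reachable through syscall().

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Record layout returned by the Linux getdents64 system call. */
struct model_dirent64 {
    uint64_t       d_ino;
    int64_t        d_off;
    unsigned short d_reclen;
    unsigned char  d_type;
    char           d_name[];
};

/* Print every entry of `path` whose d_ino is 0. Returns the number of such
 * entries, or -1 if the directory cannot be opened or read. */
long list_zero_ino_entries(const char *path)
{
    /* The union keeps the byte buffer aligned for the 64-bit fields. */
    union { char bytes[32768]; uint64_t align; } buf;
    long found = 0;
    int fd;

    assert(path != NULL);
    fd = open(path, O_RDONLY | O_DIRECTORY);
    if (fd < 0)
        return -1;

    for (;;) {
        long nread = syscall(SYS_getdents64, fd, buf.bytes, sizeof(buf.bytes));
        if (nread < 0) {
            close(fd);
            return -1;
        }
        if (nread == 0)
            break;                       /* end of directory */
        for (long pos = 0; pos < nread;) {
            struct model_dirent64 *d =
                (struct model_dirent64 *)(buf.bytes + pos);
            if (d->d_ino == 0) {
                printf("Inode number 0: %s\n", d->d_name);
                found++;
            }
            pos += d->d_reclen;          /* step to the next record */
        }
    }
    close(fd);
    return found;
}
```

Calling list_zero_ino_entries("1381276560") on an affected directory would be expected to print the hidden name. The sketch bypasses readdir() because, per the discussion in this bug, the library call is what hides such entries.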

It's possible that it's remained unknown because the following things need to occur in order to get this unlikely situation to re-occur:

1) have a server with sufficient uptime to generate ~4.3G files on a device without a reboot; and
2) have the file that would be allocated inode 0 for that device created on the TMPFS partition; and
3) trigger a process which deletes these TMPFS files without knowledge of their name; and finally
4) try to delete the parent directory

Nonetheless, we consider this a bug in TMPFS -- there's no reason to hand out a reserved inode number when starting again at 1 would be just fine, and thereby never encounter this issue.

Comment 2 Roni Gordon / Casale Media 2014-02-19 03:25:28 UTC

Further details (strace, etc.) are all cross-posted in the original CentOS bug (see link in external tracker section).

Comment 4 Kamil Dudka 2014-02-19 10:42:30 UTC

The filesystem component has nothing to do with the file system implementation in kernel.  I am switching the component...

Comment 5 Rik van Riel 2014-02-19 14:07:48 UTC

Sounds like the VFS should not be generating inode number 0.

Comment 6 Eric Sandeen 2014-02-21 17:15:15 UTC

Rik, I agree.

Roni, have you contacted your RHEL support team for this bug?

Comment 7 Roni Gordon / Casale Media 2014-02-21 17:20:15 UTC

@Eric: this bug was logged here at the request of a CentOS developer (http://bugs.centos.org/view.php?id=6992#c19301) -- the kernel bug was discovered on CentOS, which is running the same kernels as the RHEL releases mentioned above.

Comment 8 Eric Sandeen 2014-02-21 17:21:54 UTC

Ok, but RHEL staff doesn't support CentOS, so this won't have a high priority compared to other customer bugs.

If you can reproduce it on an upstream kernel, sending the problem to LKML or the fs-devel mailing list may get some traction.

Comment 9 Akemi Yagi 2014-02-26 20:47:29 UTC

@Eric,

I would rather see this as 'fixing the EL kernel' than 'supporting CentOS'.

From that point of view, as you suggested, if this can be reproduced in the mainline kernel, reporting the problem to LKML would be the right thing to do. I suppose RHEL kernel would not be fixed anyway unless the patch is in the upstream kernel. 

But once this issue is brought to LKML, aren't you the one who would be in charge? :)

Comment 10 Eric Sandeen 2014-02-26 21:00:47 UTC

Not necessarily me, no ;)

I do appreciate bug reports from RHEL clones; it's free testing and all that, and sometimes exposes serious issues that we need to get right on top of.

But like just about everyone, there's more work to do than there are hours in the day.  If I have customers waiting on me, they have to come first.  And I always have customers waiting on me.  ;)

So I'm asking you to do a little legwork to help me out; if you can still duplicate the behavior upstream, that's a great datapoint.  If you *can't* then knowing which kernel release fixed it would be very valuable as well.

If you wanted to go so far as to take the issue upstream, if it persists, that'd be super helpful too.  Maybe someone who is intimately familiar w/ the generic inode counters will know what the obvious one-liner fix is.

IOW: Backporting an upstream patch is a lot less work than testing, triaging, reading code, patching, testing again, etc.  You can help me get there.  ;)

-Eric

Comment 11 Roni Gordon / Casale Media 2014-02-27 15:14:27 UTC

(In reply to Eric Sandeen from comment #10)
> So I'm asking you to do a little legwork to help me out; if you can still
> duplicate the behavior upstream, that's a great datapoint.  If you *can't*
> then knowing which kernel release fixed it would be very valuable as well.

Do you have any preference as to which upstream kernel is tested?  I'll see if I can do some of that "legwork" on my end.

Comment 12 Eric Sandeen 2014-02-27 17:17:50 UTC

thanks.

Usually, if I want to know if something is fixed upstream, I test the latest kernel available in git.  If that's out of scope for you, see if your distro has something prebuilt which is almost that new.

-Eric

Comment 13 Akemi Yagi 2014-02-27 17:23:07 UTC

Yes, indeed there is such a thing. The latest mainline kernel is available for RHEL from The ELRepo Project [1] and is named kernel-ml [2].

[1] http://elrepo.org
[2] http://elrepo.org/tiki/kernel-ml

Comment 14 Roni Gordon / Casale Media 2014-03-03 17:02:11 UTC

Recreated on kernel 3.13.5-1.el6.elrepo.x86_64 (latest stable kernel):

[root@centos6-5 inode0Test]# rm -rf test
rm: cannot remove `test': Device or resource busy
[root@centos6-5 inode0Test]# cd test
[root@centos6-5 test]# ls -i 3946307038
? 3946307038
[root@centos6-5 test]# stat 3946307038
  File: `3946307038'
  Size: 10        	Blocks: 8          IO Block: 4096   regular file
Device: 14h/20d	Inode: 0           Links: 1
Access: (0644/-rw-r--r--)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2014-03-01 11:00:11.561272342 -0500
Modify: 2014-03-01 11:00:11.561272342 -0500
Change: 2014-03-01 11:00:11.561272342 -0500

Comment 20 Eric Sandeen 2014-04-18 17:48:23 UTC

Ok, I'll dig into this one - but while an i_ino of "0" is a problem, I agree with Rik that the potential for duplicate inodes is another serious problem.

Perhaps the use case is such that no file is long-lived enough for the counter to wrap and obtain a duplicate inode number, but otherwise I think it'd be a concern - and one that is unlikely to be addressed in a simple filesystem like tmpfs.

-Eric
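
The duplicate-number concern can be illustrated with a small model. This is a sketch only, assuming a plain wrapping 32-bit counter with no check for numbers still in use; the function name allocations_until_reuse is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model: count how many allocations a plain wrapping 32-bit
 * counter makes before it hands out `live_ino`, a number that is still in
 * use by a long-lived file. */
uint64_t allocations_until_reuse(uint32_t last_ino, uint32_t live_ino)
{
    uint64_t n = 0;

    do {
        last_ino += 1u;                      /* wraps modulo 2^32 */
        n++;
        /* A 32-bit counter revisits every value within 2^32 steps. */
        assert(n <= (UINT64_C(1) << 32));
    } while (last_ino != live_ino);
    return n;
}
```

For example, with the counter at UINT32_MAX - 1 and a live file holding inode number 3, the fifth allocation collides with it. In general, a file that keeps its number is duplicated after 2^32 further allocations, whether or not 0 is skipped.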

Comment 22 Carlos Maiolino 2014-07-22 16:42:46 UTC

It looks like there is no problem with the kernel at all; in fact, there is no problem in having a file with inode 0.

The problem with removing files with inode zero appears to be caused by userspace tools (coreutils executables and/or glibc).

I'm still investigating in more depth exactly where the problem is but, afaik, the readdir() function provided by glibc used to ignore files with inode 0, due to prehistoric problems. AFAICT, this readdir() behavior was fixed a while ago, but I'm still investigating it to get more information.

-Carlos
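
The readdir() behavior described in this comment can be modeled as a filter over directory entries. The sketch below is a hypothetical user-space model, not glibc source; model_entry and model_readdir are invented names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical directory entry used only for this model. */
struct model_entry {
    uint64_t    ino;
    const char *name;
};

/* Return the next entry at or after *pos whose inode number is non-zero,
 * advancing *pos past it; return NULL once the entries are exhausted.
 * Entries with ino == 0 are silently skipped, which is the behavior
 * attributed to readdir() in the comment above. */
const struct model_entry *model_readdir(const struct model_entry *entries,
                                        size_t count, size_t *pos)
{
    assert(entries != NULL && pos != NULL);
    while (*pos < count) {
        const struct model_entry *e = &entries[(*pos)++];
        if (e->ino != 0)
            return e;
    }
    return NULL;
}
```

In this model, a caller that walks the entries to unlink each one never sees the entry with ino 0, so the directory still holds one file when the final rmdir is attempted.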

Comment 23 Roni Gordon / Casale Media 2014-07-22 17:04:19 UTC

(In reply to Carlos Maiolino from comment #22)
> I'm still investigating more in depth where the problem exactly is, but,
> afaik, the readdir() function provided by glibc used to ignore files with
> the inode 0, due pre-historic problems.

This link (http://stackoverflow.com/questions/2099121/why-do-inode-numbers-start-from-1-and-not-0) seems to suggest some "historic" issues with EXT2, and possibly MacOS related to deleted-but-not-removed files.

Comment 24 Carlos Maiolino 2014-07-25 18:17:10 UTC

Hi,

I confirmed that this is not a kernel bug; the problem described here is caused by the behavior of the glibc library.

I'm now looking into how we should proceed with this bug.

-Carlos

Comment 25 Carlos Maiolino 2014-08-04 14:05:52 UTC

Hi,

just an update about the bug.

Although the problem with removing the files lies not in the kernel but in the way glibc treats files with inode 0, the kernel development team decided that a better solution here is to prevent the VFS from allocating inode 0 to a file (when the inode number is generated by the VFS, as in the tmpfs case).

The problem, as already known, is still visible in upstream kernels, and a discussion on fixing it is already happening, so we should have a resolution soon.

-Carlos

Comment 32 tejaswini 2015-06-25 07:38:59 UTC

Hi,

We are facing the same problem on the 3.13 kernel as well. It would be great if you could share the upstream kernel discussion on this, along with any details of the fix.

- Tejaswini

Comment 33 tejaswini 2015-06-25 11:47:50 UTC

For 2.6.32 kernel, we have tried the following fix and it worked. It would be great if you can review the same and confirm it.

@@ -682,6 +682,8 @@ struct inode *new_inode(struct super_block *sb)
        if (inode) {
                spin_lock(&inode_lock);
                __inode_add_to_lists(sb, NULL, inode);
+               if (unlikely(!(last_ino + 1)))
+                       last_ino = 0;
                inode->i_ino = ++last_ino;
                inode->i_state = 0;
                spin_unlock(&inode_lock);

- Tejaswini

Comment 34 tejaswini 2015-06-25 11:50:44 UTC

For 2.6.32 kernel, we have tried the following fix in fs/inode.c and it worked. It would be great if you can review the same and confirm it.

@@ -682,6 +682,8 @@ struct inode *new_inode(struct super_block *sb)
        if (inode) {
                spin_lock(&inode_lock);
                __inode_add_to_lists(sb, NULL, inode);
+               if (unlikely(!(last_ino + 1)))
+                       last_ino = 0;
                inode->i_ino = ++last_ino;
                inode->i_state = 0;
                spin_unlock(&inode_lock);

- Tejaswini

Comment 35 Carlos Maiolino 2015-06-25 13:51:32 UTC

Hello Tejaswini,

A similar solution was proposed upstream previously, but it was not accepted at first glance.

I'm talking to other developers to see what the best solution for this would be.

I'm looking for the discussion thread so I can provide it to you; as soon as I find it, I'll let you know.

Comment 36 tejaswini 2015-06-26 07:13:36 UTC

Thanks Carlos. Looking forward to the updates. Also, wrapping around has repercussions, such as the possibility of two files having the same inode number, right? Did you find any fixes for that?

Comment 37 thiruvadi rajaraman 2015-06-26 07:53:43 UTC

Hi,

The following test logs were observed on Ubuntu 14.04.

Since the issue exists in the latest kernel and the "test" directory cannot be
removed, I ran this test on Ubuntu 14.04.

Test:1
======

root@cavium-desktop:~/temp# ls
test  test-logs

root@cavium-desktop:~/temp# rm -rf test
rm: cannot remove ‘test’: Directory not empty
root@cavium-desktop:~/temp#

root@cavium-desktop:~/temp# ls -la test
total 0
d--x--x--x 2 root root 60 Jun 24 15:00 .
dr--r--r-- 3 root root 80 Jun 24 15:28 ..
root@cavium-desktop:~/temp#

root@cavium-desktop:~/temp# stat test
  File: ‘test’
  Size: 60              Blocks: 0          IO Block: 4096   directory 
         -------------------------------------------------------->   [size 60 ]
Device: 1bh/27d Inode: 149834664   Links: 2
Access: (0111/d--x--x--x)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2015-06-24 15:00:29.658154496 +0530
Modify: 2015-06-24 15:00:28.434154518 +0530
Change: 2015-06-24 15:00:28.434154518 +0530
 Birth: -
root@cavium-desktop:~/temp#
root@cavium-desktop:~/temp#


Test:2 
======

The directory "test1" created in the same tmpfs mount path

root@cavium-desktop:~/temp# mkdir test1
root@cavium-desktop:~/temp# touch test1/test-file
root@cavium-desktop:~/temp#

root@cavium-desktop:~/temp# stat test1
  File: ‘test1’
  Size: 60              Blocks: 0          IO Block: 4096   directory
Device: 1bh/27d Inode: 1024301708  Links: 2
Access: (0755/drwxr-xr-x)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2015-06-24 15:25:50.902127297 +0530
Modify: 2015-06-24 15:26:01.166127114 +0530
Change: 2015-06-24 15:26:01.166127114 +0530
 Birth: -
root@cavium-desktop:~/temp#

root@cavium-desktop:~/temp# ls -la test1/
total 0
drwxr-xr-x 2 root root 60 Jun 24 15:26 .
dr--r--r-- 4 root root 80 Jun 24 15:25 ..
-rw-r--r-- 1 root root  0 Jun 24 15:26 test-file
root@cavium-desktop:~/temp#

root@cavium-desktop:~/temp# rm -rf test1/*

root@cavium-desktop:~/temp#
root@cavium-desktop:~/temp# stat test1
  File: ‘test1’
  Size: 40              Blocks: 0          IO Block: 4096   directory  
         ------------------------------------------------------>  [ size: 40 ]
Device: 1bh/27d Inode: 1024301708  Links: 2
Access: (0755/drwxr-xr-x)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2015-06-24 15:26:16.902126832 +0530
Modify: 2015-06-24 15:26:27.382126645 +0530
Change: 2015-06-24 15:26:27.382126645 +0530
 Birth: -
root@cavium-desktop:~/temp#

root@cavium-desktop:~/temp#

From the Test:1 and Test:2 logs, my observation is that an empty directory
reports a size of "40", while a non-empty directory reports "60" or greater.

Although all visible files inside the "test" directory could be removed, its size was not updated to "40" after the deletion but stayed at "60", because of the file with inode "0".

There are no visible files inside the directory, yet its size is "60"; I suspect
this is the cause of the removal failure on tmpfs.

Would forcing the size to "40" when the directory is empty make the removal possible?

Please correct me if my observation is wrong, and share your comments.

Thanks,
Thiruvadi rajaraman

Comment 38 Carlos Maiolino 2015-06-26 14:01:51 UTC

Hello Thiruvadi,

The only explanation I can give you now for the bigger size you are seeing in a supposedly empty directory is that it is not empty at all, which is the reason for this BZ. Files with inode 0 are not listed by the userspace tools, so although you are not seeing the files, they are inside the directory, and this is why you could not delete the directory.
Whether the directory and the files inside it can be removed has nothing to do with the size of the directory, but with the inodes inside it. Any file with inode 0 will not be processed by userspace tools based on glibc, which ignores files with this inode number.
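
The sizes in the logs above are consistent with one hidden entry. As a rough worked example, assuming tmpfs accounts a fixed 20 bytes per directory entry (BOGO_DIRENT_SIZE in mm/shmem.c, which may differ between kernel versions) and starts an empty directory at two entries for "." and "..", the entry count can be estimated from st_size. The function name below is hypothetical.

```c
#include <assert.h>

/* Assumption: per-entry size accounting used for tmpfs directories
 * (BOGO_DIRENT_SIZE in mm/shmem.c); verify against the kernel in use. */
enum { MODEL_DIRENT_SIZE = 20 };

/* Estimate how many entries a tmpfs directory holds beyond "." and "..",
 * from the size reported by stat. */
long tmpfs_entries_from_size(long st_size)
{
    assert(st_size >= 2 * MODEL_DIRENT_SIZE);
    return (st_size - 2 * MODEL_DIRENT_SIZE) / MODEL_DIRENT_SIZE;
}
```

Under that assumption, a size of 60 corresponds to one entry and a size of 40 to none, so a directory that lists nothing but reports 60 still contains one file.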

Comment 39 tejaswini 2015-06-29 14:18:34 UTC

Hi Carlos,

We are trying to work on another workaround for inode zero. As glibc does not recognize inode 0, I thought the kernel could check whether an inode 0 is present while doing rmdir and delete it if so. The code changes are below. It would be great if you could review them:
static int shmem_rmdir(struct inode *dir, struct dentry *dentry)
{
        if (!simple_empty(dentry)) {
        /*Check if it has a zero inode*/
                struct dentry *child;
                int is_inode_zero = 0;
                list_for_each_entry(child, &dentry->d_subdirs, d_u.d_child)
                        if (child->d_inode->i_ino == 0) {
                                printk("XXX %s:inode is zero \n",__func__);
                                iput(child->d_inode);
                                is_inode_zero = 1;
                        }
                if(!is_inode_zero)
                        return -ENOTEMPTY;
        }       
        drop_nlink(dentry->d_inode);
        drop_nlink(dir);
        return shmem_unlink(dir, dentry);
}

Comment 40 Carlos Maiolino 2015-06-30 03:33:20 UTC

Hi tejaswini.

I don't believe this is the correct approach, because it limits the handling of inode 0 to the rmdir call and to tmpfs only. But as you said, this is a workaround and might work for you.

FYI, last week I sent a patch to the linux-fsdevel list, with changes I discussed with some other filesystem developers.

http://marc.info/?l=linux-fsdevel&m=143526593507774&w=2

The main idea is to prevent the VFS from creating an inode with number 0 when using get_next_ino(). With this approach, no filesystem relying on the VFS for inode number allocation will be able to create a file with inode 0.

I'm waiting for comments on the patch, or for it to be picked up by the VFS maintainer; let's see what happens. If the patch is accepted, it should hit upstream code soon.

Cheers
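
The idea of never handing out 0 can be sketched in user space as follows. This is a hypothetical model and not the patch linked above: model_get_next_ino is an invented name, and anything the real get_next_ino() does beyond incrementing a counter (such as per-CPU batching) is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical user-space model of an allocator that skips 0. */
uint32_t model_get_next_ino(uint32_t *last_ino)
{
    uint32_t res;

    assert(last_ino != NULL);
    do {
        *last_ino += 1u;        /* wraps modulo 2^32 */
        res = *last_ino;
    } while (!res);             /* repeat only when the result is 0 */
    return res;
}
```

With the counter at UINT32_MAX - 1, successive calls return UINT32_MAX, then 1, then 2. The loop body runs a second time only at the wrap, so the common path costs one increment and one comparison.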

Comment 41 tejaswini 2015-06-30 10:43:34 UTC

Hi Carlos,

Yeah. I took that approach on the assumption that creating inode 0 shouldn't be a problem. The workaround was done for tmpfs alone because all the mature filesystems, like ext4, have their own mechanism for allocating inode numbers.

Regarding the patch, I am just wondering whether it should be while(res) instead of while(!res). Correct me if I am wrong; I might be missing something.


-Tejaswini

Comment 42 tejaswini 2015-06-30 12:12:40 UTC

I am sorry about the previous comment on the patch; I was a little confused.
What is the reason for moving from unlikely() to a do-while loop?

Comment 43 Carlos Maiolino 2015-07-09 15:59:29 UTC

Hi tejaswini,

it's more for readability.

Comment 50 Aristeu Rozanski 2015-10-20 19:09:08 UTC

Patch(es) available on kernel-2.6.32-582.el6

Comment 53 Murphy Zhou 2016-04-15 02:25:19 UTC

TEST PASS

Reproduced with inodeOverflow.pl[1] on -573.el6 kernel.
...
Tue Apr 12 01:09:23 2016 Current inode number:4294000000
File '/tmpfs//test/4233754690' created with inode number 0
Finished
# uname -r
2.6.32-573.el6.x86_64
# stat /tmpfs//test/4233754690
  File: `/tmpfs//test/4233754690'
  Size: 10              Blocks: 8          IO Block: 4096   regular file
Device: 15h/21d Inode: 0           Links: 1
Access: (0644/-rw-r--r--)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2016-04-12 01:11:08.573329775 +0800
Modify: 2016-04-12 01:11:08.573329775 +0800
Change: 2016-04-12 01:11:08.573329775 +0800


Verified on -639.el6 kernel.

After running multiple instances for days:
...
Thu Apr 14 14:26:38 2016 Current inode number:4291000000
Thu Apr 14 14:27:10 2016 Current inode number:4292000000
Thu Apr 14 14:28:13 2016 Current inode number:4294000000
Thu Apr 14 14:29:47 2016 Current inode number:2000000     # no overflow
Thu Apr 14 14:32:27 2016 Current inode number:7000000
...

# ps -ef | grep erl
root     16171 16083 99 Apr12 pts/1    2-23:34:40 perl inodeOverflow.pl /tmpfs/
root     24087 24063 99 Apr13 pts/0    1-16:02:17 perl inodeOverflow.pl zxm/
root     24126 24103 99 Apr13 pts/3    1-15:57:49 perl inodeOverflow.pl zxm1
root     31685 31642  0 10:21 pts/2    00:00:00 grep erl
# uname -r
2.6.32-639.el6.x86_64

[1] https://bugzilla.redhat.com/attachment.cgi?bugid=1241665&action=enter
[2] set testcoverage to - due to long long test time.

Comment 55 errata-xmlrpc 2016-05-10 21:52:20 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2016-0855.html