Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 729176

Summary:

ext4 regression: quota incorrect/orphan inodes on removal of (locked) files

Product:

Red Hat Enterprise Linux 6

Reporter:

J. Bruce Fields <bfields>

Component:

kernel

Assignee:

J. Bruce Fields <bfields>

Status:

CLOSED ERRATA

QA Contact:

Petr Beňas <pbenas>

Severity:

high

Docs Contact:

Priority:

medium

Version:

6.1

CC:

bfields, eguan, esandeen, jiali, jlayton, kmcmartin, lczerner, pbenas, pstehlik, rik.theys, rwheeler, yanwang

Target Milestone:

Target Release:

---

Hardware:

All

OS:

Linux

Whiteboard:

Fixed In Version:

kernel-2.6.32-206.el6

Doc Type:

Bug Fix

Doc Text:

Story Points:

---

Clone Of:

714153

Environment:

Last Closed:

2011-12-06 14:00:58 UTC

Type:

---

Regression:

---

Mount Type:

---

Documentation:

---

CRM:

Verified Versions:

Category:

---

oVirt Team:

---

RHEL 7.3 requirements from Atomic Host:

Cloudforms Team:

---

Target Upstream Version:

Embargoed:

Bug Depends On:

714153

Bug Blocks:

Attachments:

Description	Flags
fix an open-downgrade problem	none

Comment 1 J. Bruce Fields 2011-08-08 23:00:28 UTC

I reproduced problems locally that resulted in similar reference count leaks, and submitted patches to fix those.  According to the original reporter, those patches may help his situation somewhat, but he is still left with an unmountable filesystem.

Further debugging will therefore be required.

Comment 3 Rik Theys 2011-08-10 09:42:25 UTC

I've tested the -178 kernel and (as mentioned) it does not fully fix our bug. The quota information is more correct now, but still not completely correct. In the past the quota would show a value slightly below the hard limit. With the patches applied, the quota shows 28 blocks in use, where this should be 4 (only a single directory).

I've also tried the 3.0.1 kernel as this seemed to have a patch that could be relevant and I was not sure the -178 kernel has this one:

commit 83d20a07d3fc171d5d7cddb6ebe2cd7a5fee1047
Author: J. Bruce Fields <bfields>
Date:   Wed Jun 29 16:49:04 2011 -0400

    svcrpc: fix list-corrupting race on nfsd shutdown
    
    commit ebc63e531cc6a457595dd110b07ac530eae788c3 upstream.

The 3.0.1 kernel shows similar behaviour.

Kernels having these fixes applied all seem to have the 40s delay during the simulation, although there doesn't seem to be any network traffic during the freeze. Older kernels don't have this freeze period.

I will also add this comment to the new bug report. I've attached a vmcore and tcpdump using the -178 kernel to the support case. I'm not sure you can access that information?

Comment 4 Rik Theys 2011-08-10 09:43:50 UTC

FYI: I can not see comment 2. Does it contain relevant information?

Comment 5 J. Bruce Fields 2011-09-13 20:36:07 UTC

No fix on hand at this point, so too late for 6.2; expecting to get back to this for 6.3.

Comment 6 Rik Theys 2011-09-19 09:53:25 UTC

Hi,

According to the support case, "the issue is now known and we are expecting to come with the patch soon".

Does that mean you have been able to reproduce the remaining problems and are working on a fix?

If you have not yet been able to reproduce this problem, can I provide some more debugging output in some way? If possible I can run test kernels with specific debugging enabled on the server to hopefully help pinpoint where things go wrong.

Do you have any estimate on when a patch could be available? The system on which I'm testing this needs to be reconfigured for something else in the near future.

Does it make sense to test newer -rc kernels?

Regards,

Rik

Comment 7 J. Bruce Fields 2011-09-19 15:12:29 UTC

Thanks for your patience.  Taking a closer look at the 3.0.1 network trace, I see strange behavior around frame 1677: a downgrade to OPEN_WRITE, followed by an open for READ which is granted a read delegation.  That read delegation should not have been granted as long as the client still held a write open.

Subsequent delays appear to be due to a write performed under the stateid associated with that open returning DELAY, because the write is causing the server to recall the delegation.

I'm not quite sure what's going on here yet.  But I'll try to have a patch for testing soon.
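
The delegation behavior described in this comment can be illustrated with a small toy model. This is only a sketch: the class `ToyServerFile` and its methods are hypothetical names invented for illustration and do not correspond to nfsd internals. It assumes a simplified server that tracks one file, refuses a read delegation while a write open is outstanding, and answers a write with NFS4ERR_DELAY while a read delegation still has to be recalled.

```python
# Toy model only: class and method names are hypothetical, not nfsd internals.
class ToyServerFile:
    def __init__(self):
        self.write_opens = 0
        self.read_delegation = False

    def open_write(self):
        self.write_opens += 1

    def open_read(self):
        """Open for read; hand out a read delegation only if nobody writes."""
        if self.write_opens == 0:
            self.read_delegation = True
        return self.read_delegation

    def write(self):
        """A write conflicts with an outstanding read delegation."""
        if self.read_delegation:
            # The server has to recall the delegation first, so it asks
            # the client to retry later.
            return "NFS4ERR_DELAY"
        return "NFS4_OK"

    def delegation_returned(self):
        self.read_delegation = False


f = ToyServerFile()
f.open_write()                 # the client still holds a write open
granted = f.open_read()        # a correct server grants no delegation here
print(granted, f.write())

# Forcing the wrongly granted delegation reproduces the stall on write.
f.read_delegation = True
stalled = f.write()            # delayed until the delegation is returned
f.delegation_returned()
print(stalled, f.write())
```

In this model the write only stalls once the server's bookkeeping wrongly allows the delegation, which matches the sequence seen around frame 1677 of the trace.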

Comment 8 J. Bruce Fields 2011-09-20 12:29:40 UTC

Created attachment 524022 [details]
fix an open-downgrade problem

I think this could explain both of the symptoms you were seeing: the client's and server's idea of the open state could get out of sync, causing strange delegation recall behavior that could cause delays on write (explaining the ERR_DELAY replies I see to WRITEs in your trace).  And a failure to convert open types in the downgrade logic here could mess up the reference counting.

Thanks for your patience; any testing you could do would be appreciated.

The patch is against the tip of my latest (3.1-rc1-based) tree, but I believe it should also apply to any kernel (such as 3.0.1) that has the patches you previously tested.
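
As a rough illustration of how a missing conversion between open types could upset reference counting, here is a toy model. It is not the attached patch and not the nfsd code: `ToyFile`, `access_to_omode` and `downgrade_to_write` are hypothetical names. The only facts it relies on are that NFSv4 share-access values are READ=1, WRITE=2, BOTH=3, while Linux open modes are O_RDONLY=0, O_WRONLY=1, O_RDWR=2, so passing one where the other is expected releases the wrong reference.

```python
# Toy model only; not the nfsd code or the attached patch.
# NFSv4 share-access values and Linux open-mode values.
SHARE_READ, SHARE_WRITE, SHARE_BOTH = 1, 2, 3
O_RDONLY, O_WRONLY, O_RDWR = 0, 1, 2


def access_to_omode(access):
    """Map a share-access value onto the matching open mode."""
    return {SHARE_READ: O_RDONLY,
            SHARE_WRITE: O_WRONLY,
            SHARE_BOTH: O_RDWR}[access]


class ToyFile:
    """Per-file reference counts, indexed by open mode (hypothetical)."""

    def __init__(self):
        self.refs = {O_RDONLY: 0, O_WRONLY: 0}

    def _modes(self, omode):
        # O_RDWR holds one read reference and one write reference.
        return (O_RDONLY, O_WRONLY) if omode == O_RDWR else (omode,)

    def get_access(self, omode):
        for m in self._modes(omode):
            self.refs[m] += 1

    def put_access(self, omode):
        for m in self._modes(omode):
            self.refs[m] -= 1


def downgrade_to_write(f, convert):
    """Drop read access after a downgrade from BOTH to WRITE."""
    dropped = SHARE_READ
    # Without the conversion, the share-access value 1 is read as O_WRONLY.
    f.put_access(access_to_omode(dropped) if convert else dropped)


good, bad = ToyFile(), ToyFile()
for f in (good, bad):
    f.get_access(access_to_omode(SHARE_BOTH))   # open for read and write

downgrade_to_write(good, convert=True)
downgrade_to_write(bad, convert=False)

print(good.refs)  # read released, write still held
print(bad.refs)   # write released by mistake, read still held
```

In the unconverted case the model ends up believing no write open exists (so a read delegation looks safe to grant) while a read reference is never dropped, which is the kind of leak that would keep a filesystem busy at unmount time.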

Comment 9 Rik Theys 2011-09-20 13:30:04 UTC

I tested the 3.0.1 kernel with your patch two times and it seems to fix the issue!

I no longer have the long delay during the start of the simulation, the quota information is correct and I can unmount the file system on the server! Yay!

I assume this will be in the upstream 3.1 kernel?

Will it be possible to still have this for RHEL 6.2, please?

If the patch is in Linus' tree, can it be proposed to be included in the longterm stable kernels?

So for the RHEL 6.1 kernels I need the 3 patches you provided. I believe the RHEL 5.7 kernel also has this bug now? Does it need all 3 patches or just the latest one?

Thanks for your help!

Regards,

Rik

Comment 10 J. Bruce Fields 2011-09-20 19:06:47 UTC

Thanks once more for the quick test results.

I've posted the same patch upstream, and if nobody catches a problem in review then it should be included in 3.2 (and applied to stable 3.1.z and 3.0.z shortly afterwards).  I want to give other upstream developers a chance to comment and then we can start the process for 6.3 and 6.2.z.

Comment 11 Rik Theys 2011-09-26 14:41:40 UTC

Hi,

I've tried rebuilding the 2.6.32-131.12.1 kernel with the patches applied but it seems it fails to build from source. Even without any patches applied the compilation (on x86_64) fails with 

Documentation/video4linux/v4lgrab.c:34:28: error: linux/videodev.h: No such file or directory
Documentation/video4linux/v4lgrab.c: In function 'main':
Documentation/video4linux/v4lgrab.c:103: error: storage size of 'cap' isn't known
Documentation/video4linux/v4lgrab.c:104: error: storage size of 'win' isn't known
Documentation/video4linux/v4lgrab.c:105: error: storage size of 'vpic' isn't known
Documentation/video4linux/v4lgrab.c:116: error: 'VIDIOCGCAP' undeclared (first use in this function)
Documentation/video4linux/v4lgrab.c:116: error: (Each undeclared identifier is reported only once
Documentation/video4linux/v4lgrab.c:116: error: for each function it appears in.)
Documentation/video4linux/v4lgrab.c:123: error: 'VIDIOCGWIN' undeclared (first use in this function)
Documentation/video4linux/v4lgrab.c:129: error: 'VIDIOCGPICT' undeclared (first use in this function)
Documentation/video4linux/v4lgrab.c:135: error: 'VID_TYPE_MONOCHROME' undeclared (first use in this function)
Documentation/video4linux/v4lgrab.c:137: error: 'VIDEO_PALETTE_GREY' undeclared (first use in this function)
Documentation/video4linux/v4lgrab.c:138: error: 'VIDIOCSPICT' undeclared (first use in this function)
Documentation/video4linux/v4lgrab.c:151: error: 'VIDEO_PALETTE_RGB24' undeclared (first use in this function)
Documentation/video4linux/v4lgrab.c:154: error: 'VIDEO_PALETTE_RGB565' undeclared (first use in this function)
Documentation/video4linux/v4lgrab.c:158: error: 'VIDEO_PALETTE_RGB555' undeclared (first use in this function)
Documentation/video4linux/v4lgrab.c:105: warning: unused variable 'vpic'
Documentation/video4linux/v4lgrab.c:104: warning: unused variable 'win'
Documentation/video4linux/v4lgrab.c:103: warning: unused variable 'cap'
make[2]: *** [Documentation/video4linux/v4lgrab] Error 1
make[1]: *** [Documentation/video4linux] Error 2
make[1]: *** Waiting for unfinished jobs....
make: *** [vmlinux] Error 2

Is there a patch for this error that is already applied to a later revision of the RHEL6 kernel? Are the official RHEL kernels not built from the same sources?

Regards,

Rik

Comment 12 J. Bruce Fields 2011-09-28 01:30:02 UTC

For QA: I've added a pynfs test to

  git://linux-nfs.org/~bfields/pynfs.git

Run

  # ./nfs4.0/testserver.py server:/export/ --maketree --rundeps OPDG10

and then, on "server", run "service nfs stop" and "umount /export".  The umount should fail before the patch, and succeed after.
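
The steps above could be collected into a small dry-run helper, sketched below. The helper names `repro_commands` and `umount_verdict` are hypothetical; the commands themselves are the ones quoted in this comment. The script only prints what to run and where, and leaves actually executing the server-side steps (for example over a remote shell) to the tester.

```python
import shlex


def repro_commands(server, export="/export"):
    """Return the reproduction steps as (where, argv) pairs."""
    return [
        ("client", ["./nfs4.0/testserver.py", f"{server}:{export}/",
                    "--maketree", "--rundeps", "OPDG10"]),
        ("server", ["service", "nfs", "stop"]),
        ("server", ["umount", export]),
    ]


def umount_verdict(returncode, patched):
    """umount is expected to fail before the patch and succeed after it."""
    succeeded = returncode == 0
    return "PASS" if succeeded == patched else "FAIL"


# Dry run: show each step and the host it belongs on.
for where, argv in repro_commands("server"):
    print(f"[{where}] {shlex.join(argv)}")
```

`umount_verdict` takes the exit status of the final `umount` and whether the kernel under test carries the patch, so the same check serves for reproducing the bug and for verifying the fix.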

Comment 13 J. Bruce Fields 2011-09-28 01:32:21 UTC

I'm not sure why that compile is failing, apologies.  It seems unrelated to the patch.

This is a worse bug than I thought, and would be a regression new to 6.2, so I think it should go into 6.2.

Comment 14 RHEL Program Management 2011-09-28 01:40:40 UTC

This request was evaluated by Red Hat Product Management for inclusion
in a Red Hat Enterprise Linux maintenance release. Product Management has 
requested further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed 
products. This request is not yet committed for inclusion in an Update release.

Comment 15 Rik Theys 2011-09-28 06:35:44 UTC

Hi,

Compiling the kernel from the .src.rpm works on a 6.0 system, but not a 6.1 system.

When you say that the bug is worse than you thought and that it should go into 6.2, are you referring to the NFS bug or the FTBFS?

Regards,

Rik

Comment 17 J. Bruce Fields 2011-09-28 11:21:42 UTC

"When you say that the bug is worse than you thought and that it should go into
6.2, are you referring to the NFS bug or the FTBFS?"

I'm referring to the NFS bug.

Comment 19 Aristeu Rozanski 2011-10-05 15:33:51 UTC

Patch(es) available on kernel-2.6.32-206.el6

Comment 27 Petr Beňas 2011-10-06 15:20:17 UTC

Reproduced in 2.6.32-178.el6.x86_64, unable to reproduce in 2.6.32-205.el6.x86_64 and verified in 2.6.32-206.el6.x86_64.

Comment 28 Rik Theys 2011-12-02 15:11:39 UTC

Are the fixes for this bug now in the upstream kernel? What are the relevant commits, and/or since which kernel version are they included? Have they been applied to -stable kernels?

Comment 29 J. Bruce Fields 2011-12-02 16:37:53 UTC

The upstream commit was 3d02fa29dec920c, upstream as of 3.2-rc1, 3.1.1, and 3.0.9.  Looks like it's passing all our tests, but any additional test results are welcomed.

Comment 30 errata-xmlrpc 2011-12-06 14:00:58 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHSA-2011-1530.html