Bug 664931 - COW corruption using popen(3).
Summary: COW corruption using popen(3).
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel
Version: 5.6
Hardware: Unspecified
OS: Unspecified
Priority: low
Severity: medium
Target Milestone: rc
Target Release: ---
Assignee: Larry Woodman
QA Contact: Zhouping Liu
URL:
Whiteboard:
Depends On:
Blocks: 667050
 
Reported: 2010-12-22 06:53 UTC by Wade Mealing
Modified: 2018-11-14 15:54 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Prior to this update, a multi-threaded application, which invoked popen(3) internally, could cause a thread stall by FILE lock corruption. The application program waited for a FILE lock in glibc, but the lock seemed to be corrupted, which was caused by a race condition in the COW (Copy On Write) logic. With this update, the race condition was corrected and FILE lock corruption no longer occurs.
Clone Of:
Environment:
Last Closed: 2011-07-21 10:27:19 UTC
Target Upstream Version:
Embargoed:


Attachments
Reproducing example. (48.12 KB, application/x-tar-gz)
2010-12-22 06:55 UTC, Wade Mealing
backport to 2_6_18-238 (2.16 KB, patch)
2010-12-22 16:08 UTC, Jon Thomas


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHSA-2011:1065 0 normal SHIPPED_LIVE Important: Red Hat Enterprise Linux 5.7 kernel security and bug fix update 2011-07-21 09:21:37 UTC

Description Wade Mealing 2010-12-22 06:53:59 UTC
Description of problem:

A multi-threaded customer application that invokes popen(3) internally
can stall a thread because of FILE lock corruption. The application stalled twice this month during operation. The stalled thread is waiting for a FILE lock in glibc, but the lock appears to be corrupted.

Version-Release number of selected component (if applicable):

2.6.18-194

How reproducible:

Once every few weeks of real-world usage. The custom reproducer (to be attached) should trigger the stall about once every 10 minutes.

Steps to Reproduce:

This is a simplified reproducer based on the real application.

Sixteen threads each call popen(3) to run gunzip and read the decompressed
data through the pipe using fread(3). The main thread checks whether the
other 16 threads are still running.

You can run it:
----------------------------------------------------------------
# tar xvzf popen-reproducer.tgz
# sh MK
# ./pt-popen
---------------------------------------------------------------- 
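
For readers without the attachment, the worker pattern described above can be sketched roughly as follows. This is not the attached reproducer: the names drain_command, run_popen_workers and struct worker are made up for illustration, the command string is supplied by the caller (the real test runs gunzip), and the stall-watching logic of the main thread is reduced to a per-thread byte counter that a watchdog could poll.

```c
#define _POSIX_C_SOURCE 200809L
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

#define MAX_WORKERS 16

/* Hypothetical per-thread state; not taken from the attached reproducer. */
struct worker {
    pthread_t tid;
    const char *cmd;    /* command handed to popen(3) */
    int rounds;         /* how many popen/fread/pclose cycles to run */
    atomic_long bytes;  /* progress counter a watchdog thread could poll */
};

/* Run cmd through popen(3) and read its whole output with fread(3).
 * Returns the number of bytes read, or -1 if popen failed. */
long drain_command(const char *cmd)
{
    char buf[4096];
    long total = 0;
    size_t n;
    FILE *fp = popen(cmd, "r");

    if (fp == NULL)
        return -1;
    while ((n = fread(buf, 1, sizeof buf, fp)) > 0)
        total += (long)n;
    pclose(fp);
    return total;
}

static void *worker_main(void *arg)
{
    struct worker *w = arg;

    for (int i = 0; i < w->rounds; i++) {
        long n = drain_command(w->cmd);
        if (n > 0)
            atomic_fetch_add(&w->bytes, n);
    }
    return NULL;
}

/* Start up to MAX_WORKERS threads that all popen the same command
 * concurrently, wait for them, and return the total bytes read. */
long run_popen_workers(int nthreads, int rounds, const char *cmd)
{
    static struct worker workers[MAX_WORKERS];
    long total = 0;
    int started = 0;

    if (nthreads > MAX_WORKERS)
        nthreads = MAX_WORKERS;
    for (int i = 0; i < nthreads; i++) {
        workers[i].cmd = cmd;
        workers[i].rounds = rounds;
        atomic_store(&workers[i].bytes, 0);
        if (pthread_create(&workers[i].tid, NULL, worker_main, &workers[i]) != 0)
            break;
        started++;
    }
    for (int i = 0; i < started; i++) {
        pthread_join(workers[i].tid, NULL);
        total += atomic_load(&workers[i].bytes);
    }
    return total;
}
```

On an affected kernel, a loop of this shape is where a worker would be expected to hang inside fread(3); on a fixed kernel the call simply returns. Build with -pthread.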

Actual results:

After it starts running, you will see a "thr XX stalled" assertion report.

----------------------------------------------------------------
# ./pt-popen
thr 12 stalled, fp=0x404ba50, futex=2, nest=-1, owner=4288e940
----------------------------------------------------------------

The victim thread stalls in fread(3); "fp" is its FILE pointer.
"futex", "nest" and "owner" are the contents of FILE->_lock.
Note that FILE->_lock.nest should be >= 0, so the value -1 shows it is corrupted.
For the detailed scenario, see "Additional info" below.

Expected results:

The program runs forever without an assertion report.

Business impact:

This issue was detected in the customer's production application,
and it is considered data corruption.


Additional info:


* Upstream patch
commit 945754a1754f9d4c2974a8241ad4f92fad7f3a6a

* Stall scenario details between popen(3) and fread(3)

popen(3) creates a child process, and both parent and child write fp->pid, which triggers the do_wp_page() race. The child then scans all FILE objects and initializes (zeroes) fp->_lock.

At this point, another thread in the parent that is doing fread(3) tries to
unlock the fp. It can see the zeroed nest value, and the decrement in the
unlock path makes nest -1.
The unlock code sets owner to 0 if and only if nest becomes 0, so owner is
left in the 'owned' state.

When the victim thread then accesses the fp, it tries to lock it and waits
for a wake-up because the lock still looks owned, but nobody ever wakes it up.
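
The unlock step in this scenario can be illustrated with a small single-threaded model. This is a sketch, not glibc's code: struct model_lock, model_trylock, model_unlock and simulate_stale_unlock are hypothetical names, the field names are borrowed from the "thr XX stalled" report, and the kernel race is replaced by simply forcing nest to 0 before the unlock runs.

```c
#include <stddef.h>

/* Simplified stand-in for the recursive FILE lock (assumption: the real
 * lock has the same three pieces of state, as the report's output suggests). */
struct model_lock {
    int futex;    /* 0 = free, nonzero = held */
    int nest;     /* recursion depth, expected to stay >= 0 */
    void *owner;  /* owning thread, NULL when free */
};

/* Returns 1 if 'self' now holds the lock, 0 if it would have to wait.
 * A real implementation would sleep on the futex instead of returning 0. */
int model_trylock(struct model_lock *l, void *self)
{
    if (l->owner == self) {
        l->nest++;
        return 1;
    }
    if (l->futex != 0)
        return 0;
    l->futex = 1;
    l->owner = self;
    l->nest = 1;
    return 1;
}

/* Releases the lock only when the recursion depth drops to exactly 0. */
void model_unlock(struct model_lock *l)
{
    if (--l->nest == 0) {
        l->owner = NULL;
        l->futex = 0;
    }
}

/* Replays the scenario: the owner takes the lock, then unlocks while
 * seeing a zeroed nest value (standing in for the COW race). */
struct model_lock simulate_stale_unlock(void *owner_thread)
{
    struct model_lock l = { 0, 0, NULL };

    model_trylock(&l, owner_thread);  /* futex=1, nest=1, owner set */
    l.nest = 0;                       /* the zeroed value leaks in */
    model_unlock(&l);                 /* nest becomes -1, release is skipped */
    return l;
}
```

After simulate_stale_unlock() the model is left with nest at -1, the owner still set and the futex still held, so model_trylock() from any other thread reports that it would have to wait. That matches the shape of the state printed in the stall report.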

* Simpler reproducer

The attached corruption-reproduce.tgz demonstrates the parent
seeing the child process's memory.

One thread repeatedly invokes fork(2); both parent and child then write
to a data region at the same virtual address, breaking COW.

The parent writes 0xffffffff and the child writes 0xa5a5a5a5.

Another thread in the parent watches the parent's data region and checks
that the value is always the parent-side value (0xffffffff).

You can run it:
----------------------------------------------------------------
# tar xvzf corruption-reproduce.tgz
# sh MK
# ./pt-fork-simple2
----------------------------------------------------------------

You will see a data corruption assertion:
----------------------------------------------------------------
# ./pt-fork-simple2
!!! corrupt !!! read=a5a5a5a5
----------------------------------------------------------------
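
The idea behind this simpler test can be sketched as below. This is not the attached pt-fork-simple2 source: run_cow_check, watcher and shared_word are hypothetical names, and the loop count is a parameter rather than an endless loop. Only the two values 0xffffffff and 0xa5a5a5a5 come from the description above.

```c
#define _POSIX_C_SOURCE 200809L
#include <pthread.h>
#include <stdatomic.h>
#include <stdint.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define PARENT_VALUE 0xffffffffu
#define CHILD_VALUE  0xa5a5a5a5u

/* Word that both parent and child write after fork(2), breaking COW. */
static _Atomic uint32_t shared_word = PARENT_VALUE;
static atomic_int stop_watching;
static atomic_long corrupt_reads;

/* Second parent thread: counts every read that is not the parent value. */
static void *watcher(void *arg)
{
    (void)arg;
    while (!atomic_load(&stop_watching)) {
        uint32_t v = atomic_load_explicit(&shared_word, memory_order_relaxed);
        if (v != PARENT_VALUE)
            atomic_fetch_add(&corrupt_reads, 1);
    }
    return NULL;
}

/* Forks 'iterations' times. Returns the number of corrupt reads seen by
 * the watcher (0 is expected on a correct kernel), or -1 on setup failure. */
long run_cow_check(int iterations)
{
    pthread_t tid;
    int failed = 0;

    atomic_store(&stop_watching, 0);
    atomic_store(&corrupt_reads, 0);
    atomic_store(&shared_word, PARENT_VALUE);
    if (pthread_create(&tid, NULL, watcher, NULL) != 0)
        return -1;

    for (int i = 0; i < iterations; i++) {
        pid_t pid = fork();
        if (pid < 0) {
            failed = 1;
            break;
        }
        if (pid == 0) {
            /* Child: write its own value to the same address and exit. */
            atomic_store_explicit(&shared_word, CHILD_VALUE, memory_order_relaxed);
            _exit(0);
        }
        /* Parent: write the parent value while the child is writing too. */
        atomic_store_explicit(&shared_word, PARENT_VALUE, memory_order_relaxed);
        waitpid(pid, NULL, 0);
    }

    atomic_store(&stop_watching, 1);
    pthread_join(tid, NULL);
    return failed ? -1 : atomic_load(&corrupt_reads);
}
```

A nonzero return corresponds to the "!!! corrupt !!!" report above, since the parent's watcher should never be able to observe the child's value. Build with -pthread.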

Comment 1 Wade Mealing 2010-12-22 06:55:38 UTC
Created attachment 470135 [details]
Reproducing example.

Comment 6 Larry Woodman 2010-12-23 06:03:45 UTC
Posted upstream backport for RHEL5 to rhkernel-list

Larry Woodman

Comment 12 RHEL Program Management 2010-12-23 18:50:19 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 18 Jarod Wilson 2011-01-26 21:09:14 UTC
in kernel-2.6.18-241.el5
You can download this test kernel (or newer) from http://people.redhat.com/jwilson/el5

Detailed testing feedback is always welcomed.

Comment 23 Martin Prpič 2011-07-13 20:29:06 UTC
    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
Prior to this update, a multi-threaded application, which invoked popen(3) internally, could cause a thread stall by FILE lock corruption. The application program waited for a FILE lock in glibc, but the lock seemed to be corrupted, which was caused by a race condition in the COW (Copy On Write) logic. With this update, the race condition was corrected and FILE lock corruption no longer occurs.

Comment 24 errata-xmlrpc 2011-07-21 10:27:19 UTC
An advisory has been issued which should help with the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2011-1065.html

