Bug 186325 - upstream kernel patch for global_flush_tlb missing in Red Hat kernel
Summary: upstream kernel patch for global_flush_tlb missing in Red Hat kernel
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 4
Classification: Red Hat
Component: kernel
Version: 4.0
Hardware: x86_64
OS: Linux
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: ---
Assignee: Peter Martuccelli
QA Contact: Brian Brock
URL:
Whiteboard:
Duplicates: 170538
Depends On:
Blocks:
 
Reported: 2006-03-22 22:14 UTC by Terence Ripperda
Modified: 2007-11-30 22:07 UTC (History)
CC List: 7 users

Fixed In Version: RHBA-2007-0304
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2007-05-08 00:59:57 UTC
Target Upstream Version:
Embargoed:


Links
System: Red Hat Product Errata
ID: RHBA-2007:0304
Private: 0
Priority: normal
Status: SHIPPED_LIVE
Summary: Updated kernel packages available for Red Hat Enterprise Linux 4 Update 5
Last Updated: 2007-04-28 18:58:50 UTC

Description Terence Ripperda 2006-03-22 22:14:48 UTC
Description of problem:
There are two upstream kernel patches to global_flush_tlb: the first attempted
an optimization and the second removed it. Red Hat's kernels have picked up the
first of these two changes, but not the second.

The attempted optimization caused global_flush_tlb to flush the caches and
TLBs only if pages passed to change_page_attr had been restored to the
'cached' state. If the pages were changed to the 'uncached' state,
global_flush_tlb would skip flushing the caches and TLBs (right when it's
needed the most).

The net result is that stale cache data exists for pages marked uncached and
used for DMA push buffers. This stale cache data can later be flushed out to
system memory, corrupting push buffer data.

The file and changes in question are linux/arch/x86_64/mm/pageattr.c, revisions
1.16 and 1.17. The problem is that df_list only contains pages added to the
list by change_page_attr when a page is being reverted. Pages being marked
uncached are never added to this list. So when global_flush_tlb is called, it
returns early because df_list is empty and, as a result, does not flush the
caches/TLB.
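
To make the control flow above concrete, here is a minimal user-space sketch
of it. This is not the kernel's code: the model_* functions, flush_count and
the helper functions at the bottom are hypothetical names invented for
illustration, and the real flush is reduced to a counter. It only models the
behavior described in this report: the list is populated on the revert path,
and the flush routine returns early when the list is empty.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical user-space model; none of this is actual kernel code. */
struct deferred_page {
    struct deferred_page *next;
    unsigned long address;
};

static struct deferred_page *df_list; /* pages queued on the revert path */
static int flush_count;               /* number of cache/TLB flushes done */

/* Models change_page_attr() as described above: only a page that is
 * being reverted to 'cached' is queued on df_list. */
void model_change_page_attr(struct deferred_page *page, bool make_cached)
{
    assert(page != NULL);
    if (make_cached) {
        page->next = df_list;
        df_list = page;
    }
    /* make_cached == false: attributes change, nothing is queued. */
}

/* Models global_flush_tlb(). early_return selects the behavior with
 * the optimization (true) or with it removed (false). */
void model_global_flush_tlb(bool early_return)
{
    struct deferred_page *df = df_list;

    df_list = NULL;
    if (early_return && df == NULL)
        return;             /* skips the flush when nothing was queued */
    flush_count++;          /* stands in for the real cache/TLB flush */
    while (df != NULL) {    /* release the queued entries */
        struct deferred_page *next = df->next;
        df->next = NULL;
        df = next;
    }
}

/* Returns how many flushes happen after marking one page uncached. */
int flushes_after_marking_uncached(bool early_return)
{
    struct deferred_page page = { NULL, 0x1000UL };

    df_list = NULL;
    flush_count = 0;
    model_change_page_attr(&page, false);
    model_global_flush_tlb(early_return);
    return flush_count;
}

/* Returns how many flushes happen after reverting one page to cached. */
int flushes_after_reverting(bool early_return)
{
    struct deferred_page page = { NULL, 0x1000UL };

    df_list = NULL;
    flush_count = 0;
    model_change_page_attr(&page, true);
    model_global_flush_tlb(early_return);
    return flush_count;
}
```

In this model, flushes_after_marking_uncached(true) reports zero flushes,
while the same call with false reports one. The revert path flushes either
way, which is why the optimization looks harmless if only that path is tested.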

Version-Release number of selected component (if applicable):


How reproducible:
By their very nature, caching issues vary greatly in reproducibility, depending
on system and application. A customer who reported this earlier could reproduce
it very easily. We worked around this by adding an extra flush
in our driver, but we're still seeing problems when running stress tests for
multiple days, so we're re-investigating our workaround.

Hopefully there's enough information above about the problem. If needed, I can
try to put together a specific reproduction for you.

Steps to Reproduce:
1.
2.
3.
  
Actual results:


Expected results:


Additional info:

Comment 1 Terence Ripperda 2006-04-10 16:25:49 UTC
I just wanted to check if there were any updates to this. I was hoping there
would be an updated kernel package that we could have some of our customers test
to verify if their bugs were related to this. thanks.

Comment 3 Gunther Mayer 2006-04-11 07:45:10 UTC
I am also waiting for a Red Hat comment/test kernel for this.
Please also see my IT#90639.



Comment 4 Pancrazio `ezio' de Mauro 2006-04-11 17:37:38 UTC
Terence, exactly which patches are you referring to?

From a look at upstream, it looks like you may be talking about
http://git.kernel.org/git/?p=linux/kernel/git/stable/linux-2.6.16.y.git;a=blobdiff;h=b90e8fe9eeb00509da9cbda82aa45034e38a64c2;hp=94862e1ec032d2616ca270071e70fb523e1aa150;hb=094804c5a132f04c12dd4902ee15c64362e5c1af;f=arch/x86_64/mm/pageattr.c

@@ -220,8 +220,6 @@ void global_flush_tlb(void)
       down_read(&init_mm.mmap_sem);
       df = xchg(&df_list, NULL);
       up_read(&init_mm.mmap_sem);
-       if (!df)
-               return;
       flush_map((df && !df->next) ? df->address : 0);
       for (; df; df = next_df) {
               next_df = df->next;

which effectively reverts one of the changes we apply in our
linux-2.6.9-x86_64-change_page_attr-flush-fix.patch

Could you confirm which patches you are referring to?

Comment 5 Terence Ripperda 2006-04-11 17:42:12 UTC
Yes, that's the patch.

That early return triggers if the df_list (deferred page list) is empty. But
if you look through the rest of the file, this list is only populated in
save_page, which is only called from __change_page_attr when a page is being
reverted back to cached.

The flip side of this is that no pages are added to the df_list when a
page is being converted to uncached, so the TLB/cache flush is skipped in this
case. That leaves stale data cached for a page that is now expected to be
uncached. This cached data may be flushed out to system memory at a later point
in time.

It's subtle, but leads to a lot of stability problems in graphics-intensive
environments (especially stress tests).
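
The corruption sequence can be sketched with a toy model of a single
write-back cache line in front of one word of memory. Everything here
(struct cache_model, the function names, the 0xAAAA/0xBBBB values) is a
hypothetical illustration and not driver or kernel code. It only shows why a
dirty line left behind by a skipped flush can later overwrite data written
through the uncached mapping.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model: one write-back cache line over one memory word. */
struct cache_model {
    uint32_t memory;      /* system memory backing the push buffer */
    uint32_t cache_data;  /* the CPU's cached copy of that word */
    bool     cache_dirty; /* cached copy not yet written to memory */
    bool     uncached;    /* current page attribute */
};

/* A CPU store goes to the cache, or straight to memory if uncached. */
void cpu_write(struct cache_model *m, uint32_t value)
{
    assert(m != NULL);
    if (m->uncached) {
        m->memory = value;
    } else {
        m->cache_data = value;
        m->cache_dirty = true;
    }
}

/* Write back the dirty line, if any; used for flush and for eviction. */
void write_back(struct cache_model *m)
{
    assert(m != NULL);
    if (m->cache_dirty) {
        m->memory = m->cache_data;
        m->cache_dirty = false;
    }
}

/* Change the page to uncached, with or without the cache flush. */
void mark_uncached(struct cache_model *m, bool flush)
{
    assert(m != NULL);
    if (flush)
        write_back(m);
    m->uncached = true;
}

/* Returns what the device would read from the push buffer at the end. */
uint32_t push_buffer_after(bool flush_on_attr_change)
{
    struct cache_model m = { 0, 0, false, false };

    cpu_write(&m, 0xAAAAu);                  /* old data, left dirty in cache */
    mark_uncached(&m, flush_on_attr_change); /* page becomes uncached */
    cpu_write(&m, 0xBBBBu);                  /* command written uncached */
    write_back(&m);                          /* stale line evicted later */
    return m.memory;
}
```

With the flush, the model ends with the intended 0xBBBB in memory. Without it,
the later eviction writes the stale 0xAAAA over the command, which is the kind
of push buffer corruption described above.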

Comment 15 Jeff Burke 2006-06-02 14:31:27 UTC
Which stress tests were used in duplicating the issue?
I would like to add them to our internal tests.

Thanks,
Jeff

Comment 16 Terence Ripperda 2006-06-20 22:46:08 UTC
Hi Jeff,

Unfortunately, this was reproduced using our binary driver running a stress test
suite. We're working on a directed test that should reproduce this problem,
and we can then give you its full source. I hope to have that done within 1-2
weeks.

Thanks,
Terence

Comment 17 Ernie Petrides 2006-08-11 20:13:13 UTC
*** Bug 170538 has been marked as a duplicate of this bug. ***

Comment 18 Jason Baron 2006-08-21 20:45:24 UTC
Committed in stream U5 build 42.2. A test kernel with this patch is available
from http://people.redhat.com/~jbaron/rhel4/


Comment 19 RHEL Program Management 2006-09-07 19:25:30 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 20 RHEL Program Management 2006-09-07 19:25:49 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 21 RHEL Program Management 2006-09-07 19:26:11 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 24 Red Hat Bugzilla 2007-03-18 22:38:20 UTC
User jparadis's account has been closed

Comment 26 Red Hat Bugzilla 2007-05-08 00:59:58 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2007-0304.html

