Bug 526092

Summary: rw_semaphore bug
Product: Red Hat Enterprise Linux 5 Reporter: Issue Tracker <tao>
Component: kernelAssignee: Cong Wang <amwang>
Status: CLOSED ERRATA QA Contact: Red Hat Kernel QE team <kernel-qe>
Severity: medium Docs Contact:
Priority: medium    
Version: 5.4CC: dzickus, kbaxley, lwoodman, pbenas, rkhan, syeghiay, tao, woodard
Target Milestone: rc   
Target Release: ---   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2010-03-30 07:29:31 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Attachments:
Description Flags
reproducer none

Description Issue Tracker 2009-09-28 18:09:41 UTC
Escalated to Bugzilla from IssueTracker

Comment 1 Issue Tracker 2009-09-28 18:09:43 UTC
Event posted on 09-25-2009 06:07pm EDT by woodard


From: Brian Behlendorf <behlendorf1@llnl.gov>
To: Ben Woodard <bwoodard@llnl.gov>
Subject: rwsem_is_locked() bug
Date: Wed, 23 Sep 2009 13:56:36 -0700



Ben,

This morning I stumbled in to a bug with the kernels semaphore
implementation and
I am hoping RH could shed some light on it.  The bug is that
rwsem_is_locked
may return 0 even when the semaphore is locked.  This is because
sem->activity is checked outside the wait_lock (at least in the
generic+x86
versions).  The fix is easy...just add the needed spin locks.

I assume this was never really a problem because in the latest kernels
there
are only two callers __mlock_vma_pages_range() and 
page_referenced_one().
Neither of thse cause a problem because they are called quite a while
after
the lock is taken, and for the race to occur rwsem_is_locked()  must be
called immediately after the lock is acquired.

I've attached a trival reproducer which I've tested against the chaos
kernel.
Just make the module and use the test script to load and unload the module
a
few times.  It shouldn't take long before you trip the BUG_ON().

Thanks,
Brian





To: Brian Behlendorf <behlendorf1@llnl.gov>
cc: Ben Woodard <bwoodard@llnl.gov>, Jim Garlick <garlick1@llnl.gov>
Subject: Re: rwsem_is_locked() bug 
In-Reply-To: Message from Brian Behlendorf <behlendorf1@llnl.gov> 
   of "Wed, 23 Sep 2009 13:56:36 PDT."
<200909231356.40745.behlendorf1@llnl.gov> 
From: "Mark A. Grondona" <mgrondona@llnl.gov>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Date: Wed, 23 Sep 2009 14:29:10 -0700
Sender: mgrondona@llnl.gov
Message-Id: <E1MqZOg-0004S5-Sk@wednesday.llnl.gov>

> 
> I assume this was never really a problem because in the latest kernels
there
> are only two callers __mlock_vma_pages_range() and 
page_referenced_one().
> Neither of thse cause a problem because they are called quite a while
after
> the lock is taken, and for the race to occur rwsem_is_locked()  must be
> called immediately after the lock is acquired.

This part is interesting to me. For my own edification, can you explain
why there is only a race just after down_read() is called?

Also, what is the purpose of calling rwsem_is_locked() just after you've
locked the rwsem? I am confused by that part too. I do know that
rwsem_is_locked() is not supposed to block, maybe they left out all
locks in the function for that reason?

Thanks!
mark


kbaxley assigned to issue for LLNL (HPC).
Status set to: Waiting on Tech
Version set to: '2.1'

This event sent from IssueTracker by kbaxley  [LLNL (HPC)]
 issue 347389

Comment 2 Issue Tracker 2009-09-28 18:09:45 UTC
Event posted on 09-25-2009 06:17pm EDT by woodard

From: Brian Behlendorf <behlendorf1@llnl.gov>
To: "Mark A. Grondona" <mgrondona@llnl.gov>
Subject: Re: rwsem_is_locked() bug
Date: Wed, 23 Sep 2009 14:48:37 -0700


> > I assume this was never really a problem because in the latest
kernels
> > there are only two callers __mlock_vma_pages_range() and=20
> > page_referenced_one(). Neither of thse cause a problem because they
are
> > called quite a while after the lock is taken, and for the race to
occur
> > rwsem_is_locked()  must be called immediately after the lock is
acquire=
d.
>
> This part is interesting to me. For my own edification, can you explain
> why there is only a race just after down_read() is called?

The race is possible because on an SMP system it's possible for there to
be a version of the semaphore structure in each of the cpu caches.  This
means while one processor calls down_read() and updates the activity field
under the spin lock.  Another processor might  be operating on a stale copy
of the the activity field in its local cpu cache.  This condition will only
last as long as the stale entry resides in the cpu cache which won't be
for long. Moving the check under the spin lock closes the race entirely,
which is what all the rest of the semaphore functions do and why this is
only an issue for this function.

> Also, what is the purpose of calling rwsem_is_locked() just after
you've
> locked the rwsem? I am confused by that part too. I do know that
> rwsem_is_locked() is not supposed to block, maybe they left out all
> locks in the function for that reason?

The only reason I was don't this was because I was updating my SPL code
to more tightly layer the Solaris rw locks on top of Linux semaphores.
Everything works fine but the regression tests I wrote to verify
everything was fine would trip this issue.  So my finding this was really
just a side effect of me wanting to make sure my Solaris like locks were
working properly.  I can't think of any reason you would actually ever
want to do this outside of a regression test.  Of course that still
doesn't excuse a broken implementation.  :)

Thanks,
Brian


This event sent from IssueTracker by kbaxley  [LLNL (HPC)]
 issue 347389

Comment 3 Issue Tracker 2009-09-28 18:09:46 UTC
Event posted on 09-25-2009 06:18pm EDT by woodard

From: "Mark A. Grondona" <mgrondona@llnl.gov>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Date: Wed, 23 Sep 2009 14:58:04 -0700
Sender: mgrondona@llnl.gov
Message-Id: <E1MqZqe-0006J7-1c@wednesday.llnl.gov>


> The race is possible because on an SMP system it's possible for there
to be a
> version of the semaphore structure in each of the cpu caches.  This
means
> while one processor calls down_read() and updates the activity field
under
> the spin lock.  Another processor might  be operating on a stale copy of
the
> the activity field in its local cpu cache.  This condition will only
last as
> long as the stale entry resides in the cpu cache which won't be for
long.
> Moving the check under the spin lock closes the race entirely, which is
what
> all the rest of the semaphore functions do and why this is only an issue
for
> this function.

Thanks, that is what I suspected. Now, continuing to treat this as a
lesson
for me... ;-) 

How does the current task get rescheduled onto a different CPU between
the
time down_read() returns and rwsem_is_locked() is called?

And would a memory barrier in rwsem_is_locked() also fix this problem? Or
am I still confused ;-)

mark



This event sent from IssueTracker by kbaxley  [LLNL (HPC)]
 issue 347389

Comment 4 Issue Tracker 2009-09-28 18:09:49 UTC
Event posted on 09-25-2009 06:20pm EDT by woodard

From: Brian Behlendorf <behlendorf1@llnl.gov>
To: "Mark A. Grondona" <mgrondona@llnl.gov>
Subject: Re: rwsem_is_locked() bug
Date: Wed, 23 Sep 2009 15:30:17 -0700

> > The race is possible because on an SMP system it's possible for there
to
> > be a version of the semaphore structure in each of the cpu caches. 
This
> > means while one processor calls down_read() and updates the activity
> > field under the spin lock.  Another processor might  be operating on
a
> > stale copy of the the activity field in its local cpu cache.  This
> > condition will only last as long as the stale entry resides in the
cpu
> > cache which won't be for long. Moving the check under the spin lock
> > closes the race entirely, which is what all the rest of the semaphore
> > functions do and why this is only an issue for this function.
>
> Thanks, that is what I suspected. Now, continuing to treat this as a
less=
on
> for me... ;-)
>
> How does the current task get rescheduled onto a different CPU between
the
> time down_read() returns and rwsem_is_locked() is called?

It doesn't.  The only way I've been able to reproduce this is with
multiple threads.  I don't see how you could cause it with one thread
because just as you said you would need to reschedule at which point
there's a memory barrior.

> And would a memory barrier in rwsem_is_locked() also fix this problem?
Or
> am I still confused ;-)

Yes it should.  I just happen to like the spin lock better because it's
already there and more readable.

Thanks,
Brian






This event sent from IssueTracker by kbaxley  [LLNL (HPC)]
 issue 347389

Comment 5 Issue Tracker 2009-09-28 18:09:50 UTC
Event posted on 09-25-2009 06:40pm EDT by woodard

Message-ID: <4ABD4630.20506@llnl.gov>
Date: Fri, 25 Sep 2009 15:37:36 -0700
From: Ben Woodard <bwoodard@llnl.gov>

Brian Behlendorf wrote:
> Ben,
>
> This morning I stumbled in to a bug with the kernels semaphore
implementation
> I was hoping RH could shed some light on.  The bug is that
rwsem_is_locked()
> may return 0 even when the semaphore is locked.  This is because
> sem->activity is checked outside the wait_lock (at least in the
generic+x86
> versions).  The fix is easy just add the needed spin locks.
>

OK I looked at this. It appears to be pretty cut and dried and as
described. That one looks really obvious I don't know how it passed
review. It IS a tight race but it is a race never the less. Your
reproducer looks like the classic, provoke a race kind of reproducer.
There is not really anything else that I can do except for file a bug on
this. Good job. I really can't add anything.

My only thing is that I'm not sure that a spinlock is the right way to
fix it. I think that this might require a slight change to the assembly
code so that the atomic operation is synchronized across all the
processors. For as low level as a rw semaphore is, coded in assembly,
there must be a way to tell the processors that this incl instruction
needs to be coordinated across processors. I just don't know what the
right approach is. It just seems like a spinlock is kind of like a
sledgehammer in this case.

Did you find it by inspection or did you find it through trying to make
use of it in a way different than __mlock_vma_pages_range() and
page_referenced_one(). do?

-ben






This event sent from IssueTracker by kbaxley  [LLNL (HPC)]
 issue 347389

Comment 6 Issue Tracker 2009-09-28 18:09:52 UTC
Event posted on 09-25-2009 08:16pm EDT by woodard

From: Brian Behlendorf <behlendorf1@llnl.gov>
To: bwoodard@llnl.gov
Subject: Re: rwsem_is_locked() bug
Date: Fri, 25 Sep 2009 17:06:24 -0700



> My only thing is that I'm not sure that a spinlock is the right way to
> fix it. I think that this might require a slight change to the assembly
> code so that the atomic operation is synchronized across all the
> processors. For as low level as a rw semaphore is coded in assembly
> there must be a way to tell the processors that this incl instruction
> needs to be coordinated across processors. I just don't know what the
> right approach is. It just seems like a spinlock is kind of like a
> sledgehammer in this case.

So the right fix will be a spin lock for the generic implementation.  Then
there will need to be a per-arch fix as well likely in assembly which is
why I didn't spend the time putting together a patch.

> Did you find it by inspection or did you find it through trying to make
> use of it in a way different than __mlock_vma_pages_range() and
> page_referenced_one(). do?

I stumbled across it when adding some paranoia checks to some locking code
I
was working on.  I managed to trip the check with one of my test cases
which
prompted me to investigate.

Thanks,
Brian



This event sent from IssueTracker by kbaxley  [LLNL (HPC)]
 issue 347389

Comment 7 Issue Tracker 2009-09-28 18:09:54 UTC
Event posted on 09-25-2009 08:23pm EDT by woodard

Message-ID: <4ABD5400.3090101@llnl.gov>
Date: Fri, 25 Sep 2009 16:36:32 -0700
From: Ben Woodard <bwoodard@llnl.gov>
Reply-To: bwoodard@llnl.gov
To: "Mark A. Grondona" <mgrondona@llnl.gov>
Subject: Re: rwsem_is_locked() bug

Mark A. Grondona wrote:
> Hmm, my subsequent lesson from Brian wasn't really meant to go into
the
> bug report, but that's ok.
>
>   
Sorry. I thought it added something to the understanding of the issue.

> What I was trying to get at was that locks may have been left out
> on purpose in rwsem_is_locked(). It will be interesting to see what RH
> engineers say.
>
> I also still don't get how Brian's reproducer shows there is an SMP
> coherency problem  with rwsem_is_locked(), since he always calls it
> just after he has successfully taken the lock, and the check must be
> run on the same CPU that the lock was granted on.
>   
To me that is the bug. That simple test should never ever fail and it
only does when you manage to hit a pretty tight race due to the variable
not being coherent across the cpus (maybe cores -- didn't look that
closely). The code looks fine but demonstrably doesn't work as
anticipated. That is the bug. Does that make sense to you?

If this was Dave Peterson and he said, "This is a hypothetical problem"
as Dave was apt to do, there would be little that I could do. However,
since Brian included a reproducer which demonstrates the race condition,
he proved that it is more than theoretical. The only problem that I face
is how to convince engineering that this is important to fix. The fact
that neither__mlock_vma_pages_range() nor page_referenced_one() will hit
the problem tells engineering that the problem is with an out of tree
module and they are less inclined to fix it that case. So that is why I
asked Brian why and how he triggered it.

-ben






This event sent from IssueTracker by kbaxley  [LLNL (HPC)]
 issue 347389

Comment 8 Issue Tracker 2009-09-28 18:09:56 UTC
Event posted on 09-25-2009 08:34pm EDT by woodard

Message-ID: <4ABD6147.6090809@llnl.gov>
Date: Fri, 25 Sep 2009 17:33:11 -0700
From: Ben Woodard <bwoodard@llnl.gov>
Reply-To: bwoodard@llnl.gov
Organization: Red Hat Inc.
User-Agent: Thunderbird 2.0.0.22 (X11/20090609)
MIME-Version: 1.0
To: Brian Behlendorf <behlendorf1@llnl.gov>
CC: Jim Garlick <garlick1@llnl.gov>, Mark Grondona <mgrondona@llnl.gov>
Subject: Re: rwsem_is_locked() bug
References: <200909231356.40745.behlendorf1@llnl.gov>
<4ABD4630.20506@llnl.gov> <200909251706.29751.behlendorf1@llnl.gov>
In-Reply-To: <200909251706.29751.behlendorf1@llnl.gov>
X-Enigmail-Version: 0.95.6
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: 7bit

Brian Behlendorf wrote:
> I stumbled across it when adding some paranoia checks to some locking
code I
> was working on.  I managed to trip the check with one of my test cases
which
> prompted me to investigate.

The only problem that I face is how to convince engineering that this is
important to fix. The fact that neither__mlock_vma_pages_range() nor
page_referenced_one() will hit
the problem tells engineering that the problem is with an out of tree
module and they are less inclined to fix it that case. So that is why I
asked. I might have have trouble convincing engineering that this is
more than a hypothetical problem.

Have you submitted this upstream?

>
> Thanks,
> Brian






This event sent from IssueTracker by kbaxley  [LLNL (HPC)]
 issue 347389

Comment 9 Issue Tracker 2009-09-28 18:09:58 UTC
Event posted on 09-28-2009 12:47pm EDT by woodard

To: bwoodard@llnl.gov
Subject: Re: rwsem_is_locked() bug
Date: Mon, 28 Sep 2009 09:09:47 -0700
User-Agent: KMail/1.9.4
Cc: Jim Garlick <garlick1@llnl.gov>, Mark Grondona 
> Have you submitted this upstream?

Yes, I thought that might be the response.  I have not submitted this
upstr=
eam yet hoping RH would just take the ball and fix what is obviously
wrong.  It's also not inconceivable that the other two usage cases could
hit this, it's just exceptionally unlikely.

Thanks,
Brian

--nextPart2701221.nVbLf53G29
Content-Type: application/pgp-signature

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.5 (GNU/Linux)

iD8DBQBKwN/OCrnpkcavZYsRAu19AKDEqKzCDLOTEHB4yB9xOvScLQOmLgCgwNDS
x4eRBQvw5J8LJLQJqPH+Row=
=jOWv
-----END PGP SIGNATURE-----

--nextPart2701221.nVbLf53G29--



This event sent from IssueTracker by kbaxley  [LLNL (HPC)]
 issue 347389

Comment 10 Kent Baxley 2009-09-28 18:11:53 UTC
Created attachment 362930 [details]
reproducer

extract the file, run make, then run the shell script to try and reproduce the problem

Comment 11 Cong Wang 2009-09-29 08:18:09 UTC
(In reply to comment #2)
> Event posted on 09-25-2009 06:17pm EDT by woodard
> 
> From: Brian Behlendorf <behlendorf1@llnl.gov>
> To: "Mark A. Grondona" <mgrondona@llnl.gov>
> Subject: Re: rwsem_is_locked() bug
> Date: Wed, 23 Sep 2009 14:48:37 -0700
> 
> 
> > > I assume this was never really a problem because in the latest
> kernels
> > > there are only two callers __mlock_vma_pages_range() and=20
> > > page_referenced_one(). Neither of thse cause a problem because they
> are
> > > called quite a while after the lock is taken, and for the race to
> occur
> > > rwsem_is_locked()  must be called immediately after the lock is
> acquire=
> d.
> >
> > This part is interesting to me. For my own edification, can you explain
> > why there is only a race just after down_read() is called?
> 
> The race is possible because on an SMP system it's possible for there to
> be a version of the semaphore structure in each of the cpu caches.  This
> means while one processor calls down_read() and updates the activity field
> under the spin lock.  Another processor might  be operating on a stale copy
> of the the activity field in its local cpu cache.  This condition will only
> last as long as the stale entry resides in the cpu cache which won't be
> for long. Moving the check under the spin lock closes the race entirely,
> which is what all the rest of the semaphore functions do and why this is
> only an issue for this function.


This doesn't make any sense. On SMP, CPU caches use MESI protocol to maintain the cache coherence. It shouldn't let you to have two different versions of the same copy. If this does happen, that probably means the CPU cache is ill-designed.

Comment 12 Ben Woodard 2009-09-29 18:58:59 UTC
Are you willing to push this with Intel and assert that their cache is ill defined? Do you have another explanation for the fact that the reproducer works?

Comment 13 Cong Wang 2009-09-30 02:58:43 UTC
(In reply to comment #12)
> Are you willing to push this with Intel and assert that their cache is ill
> defined?

If you could prove you really have two different versions for the same memory
location, yes, you should file a bug to Intel, that is a HUGE bug which they are supposed to fix.

The problem is, however, the code is more likely than Intel to have bugs.

> Do you have another explanation for the fact that the reproducer
> works?  

Of course yes.
I believe the patch will say everything, I will send my patch out now.

Guys, you are going to a _totally_ wrong direction, we don't need memory barriers, nor spinlocks. We *have* bugs.

Comment 14 Cong Wang 2009-09-30 03:31:08 UTC
FYI:
http://lkml.org/lkml/2009/9/29/376

Comment 15 Cong Wang 2009-12-04 09:52:02 UTC
Hello,

Patch is still in -mm, I am still waiting for it to be merged into Linus tree,
then I will backport to RHEL5.

If you can't wait, please let me know.

Thanks.

Comment 17 RHEL Product and Program Management 2010-02-04 20:23:01 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 19 Jarod Wilson 2010-02-23 20:05:11 UTC
in kernel-2.6.18-190.el5
You can download this test kernel from http://people.redhat.com/jwilson/el5

Please update the appropriate value in the Verified field
(cf_verified) to indicate this fix has been successfully
verified. Include a comment with verification details.

Comment 23 Petr Be┼łas 2010-03-19 14:28:15 UTC
VERIFIED

[root@dell-pe2950-01 bz526092-rw-semaphore-bug]# uname -r
2.6.18-189.el5
:::::rwsem_test: Unknown symbol rwsem_is_locked
::::::::::::::::rwsem_test: Unknown symbol rwsem_is_locked
::::::::::::::::rwsem_test: Unknown symbol rwsem_is_locked
::::::::::::::::rwsem_test: Unknown symbol rwsem_is_locked
::::::::::::::::rwsem_test: Unknown symbol rwsem_is_locked
:::::::::::

Prwsem_test: Unknown symbol rwsem_is_locked
ass 0
insmod: erwsem_test: Unknown symbol rwsem_is_locked
rror inserting 'rwsem_test: Unknown symbol rwsem_is_locked
./rwsem-test.ko'rwsem_test: Unknown symbol rwsem_is_locked
: -1 Unknown symrwsem_test: Unknown symbol rwsem_is_locked
bol in module
Prwsem_test: Unknown symbol rwsem_is_locked

[root@dell-pe2950-01 bz526092-rw-semaphore-bug]# uname -r
2.6.18-194.el5
Pass 97
Pass 98
Pass 99
:: [   PASS   ] :: running the test

Comment 25 errata-xmlrpc 2010-03-30 07:29:31 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on therefore solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2010-0178.html