Escalated to Bugzilla from IssueTracker
Event posted on 09-25-2009 06:07pm EDT by woodard From: Brian Behlendorf <behlendorf1> To: Ben Woodard <bwoodard> Subject: rwsem_is_locked() bug Date: Wed, 23 Sep 2009 13:56:36 -0700 Ben, This morning I stumbled in to a bug with the kernels semaphore implementation and I am hoping RH could shed some light on it. The bug is that rwsem_is_locked may return 0 even when the semaphore is locked. This is because sem->activity is checked outside the wait_lock (at least in the generic+x86 versions). The fix is easy...just add the needed spin locks. I assume this was never really a problem because in the latest kernels there are only two callers __mlock_vma_pages_range() and page_referenced_one(). Neither of thse cause a problem because they are called quite a while after the lock is taken, and for the race to occur rwsem_is_locked() must be called immediately after the lock is acquired. I've attached a trival reproducer which I've tested against the chaos kernel. Just make the module and use the test script to load and unload the module a few times. It shouldn't take long before you trip the BUG_ON(). Thanks, Brian To: Brian Behlendorf <behlendorf1> cc: Ben Woodard <bwoodard>, Jim Garlick <garlick1> Subject: Re: rwsem_is_locked() bug In-Reply-To: Message from Brian Behlendorf <behlendorf1> of "Wed, 23 Sep 2009 13:56:36 PDT." <200909231356.40745.behlendorf1> From: "Mark A. Grondona" <mgrondona> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Date: Wed, 23 Sep 2009 14:29:10 -0700 Sender: mgrondona Message-Id: <E1MqZOg-0004S5-Sk.gov> > > I assume this was never really a problem because in the latest kernels there > are only two callers __mlock_vma_pages_range() and page_referenced_one(). > Neither of thse cause a problem because they are called quite a while after > the lock is taken, and for the race to occur rwsem_is_locked() must be > called immediately after the lock is acquired. This part is interesting to me. For my own edification, can you explain why there is only a race just after down_read() is called? Also, what is the purpose of calling rwsem_is_locked() just after you've locked the rwsem? I am confused by that part too. I do know that rwsem_is_locked() is not supposed to block, maybe they left out all locks in the function for that reason? Thanks! mark kbaxley assigned to issue for LLNL (HPC). Status set to: Waiting on Tech Version set to: '2.1' This event sent from IssueTracker by kbaxley [LLNL (HPC)] issue 347389
Event posted on 09-25-2009 06:17pm EDT by woodard From: Brian Behlendorf <behlendorf1> To: "Mark A. Grondona" <mgrondona> Subject: Re: rwsem_is_locked() bug Date: Wed, 23 Sep 2009 14:48:37 -0700 > > I assume this was never really a problem because in the latest kernels > > there are only two callers __mlock_vma_pages_range() and=20 > > page_referenced_one(). Neither of thse cause a problem because they are > > called quite a while after the lock is taken, and for the race to occur > > rwsem_is_locked() must be called immediately after the lock is acquire= d. > > This part is interesting to me. For my own edification, can you explain > why there is only a race just after down_read() is called? The race is possible because on an SMP system it's possible for there to be a version of the semaphore structure in each of the cpu caches. This means while one processor calls down_read() and updates the activity field under the spin lock. Another processor might be operating on a stale copy of the the activity field in its local cpu cache. This condition will only last as long as the stale entry resides in the cpu cache which won't be for long. Moving the check under the spin lock closes the race entirely, which is what all the rest of the semaphore functions do and why this is only an issue for this function. > Also, what is the purpose of calling rwsem_is_locked() just after you've > locked the rwsem? I am confused by that part too. I do know that > rwsem_is_locked() is not supposed to block, maybe they left out all > locks in the function for that reason? The only reason I was don't this was because I was updating my SPL code to more tightly layer the Solaris rw locks on top of Linux semaphores. Everything works fine but the regression tests I wrote to verify everything was fine would trip this issue. So my finding this was really just a side effect of me wanting to make sure my Solaris like locks were working properly. I can't think of any reason you would actually ever want to do this outside of a regression test. Of course that still doesn't excuse a broken implementation. :) Thanks, Brian This event sent from IssueTracker by kbaxley [LLNL (HPC)] issue 347389
Event posted on 09-25-2009 06:18pm EDT by woodard From: "Mark A. Grondona" <mgrondona> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Date: Wed, 23 Sep 2009 14:58:04 -0700 Sender: mgrondona Message-Id: <E1MqZqe-0006J7-1c.gov> > The race is possible because on an SMP system it's possible for there to be a > version of the semaphore structure in each of the cpu caches. This means > while one processor calls down_read() and updates the activity field under > the spin lock. Another processor might be operating on a stale copy of the > the activity field in its local cpu cache. This condition will only last as > long as the stale entry resides in the cpu cache which won't be for long. > Moving the check under the spin lock closes the race entirely, which is what > all the rest of the semaphore functions do and why this is only an issue for > this function. Thanks, that is what I suspected. Now, continuing to treat this as a lesson for me... ;-) How does the current task get rescheduled onto a different CPU between the time down_read() returns and rwsem_is_locked() is called? And would a memory barrier in rwsem_is_locked() also fix this problem? Or am I still confused ;-) mark This event sent from IssueTracker by kbaxley [LLNL (HPC)] issue 347389
Event posted on 09-25-2009 06:20pm EDT by woodard From: Brian Behlendorf <behlendorf1> To: "Mark A. Grondona" <mgrondona> Subject: Re: rwsem_is_locked() bug Date: Wed, 23 Sep 2009 15:30:17 -0700 > > The race is possible because on an SMP system it's possible for there to > > be a version of the semaphore structure in each of the cpu caches. This > > means while one processor calls down_read() and updates the activity > > field under the spin lock. Another processor might be operating on a > > stale copy of the the activity field in its local cpu cache. This > > condition will only last as long as the stale entry resides in the cpu > > cache which won't be for long. Moving the check under the spin lock > > closes the race entirely, which is what all the rest of the semaphore > > functions do and why this is only an issue for this function. > > Thanks, that is what I suspected. Now, continuing to treat this as a less= on > for me... ;-) > > How does the current task get rescheduled onto a different CPU between the > time down_read() returns and rwsem_is_locked() is called? It doesn't. The only way I've been able to reproduce this is with multiple threads. I don't see how you could cause it with one thread because just as you said you would need to reschedule at which point there's a memory barrior. > And would a memory barrier in rwsem_is_locked() also fix this problem? Or > am I still confused ;-) Yes it should. I just happen to like the spin lock better because it's already there and more readable. Thanks, Brian This event sent from IssueTracker by kbaxley [LLNL (HPC)] issue 347389
Event posted on 09-25-2009 06:40pm EDT by woodard Message-ID: <4ABD4630.20506> Date: Fri, 25 Sep 2009 15:37:36 -0700 From: Ben Woodard <bwoodard> Brian Behlendorf wrote: > Ben, > > This morning I stumbled in to a bug with the kernels semaphore implementation > I was hoping RH could shed some light on. The bug is that rwsem_is_locked() > may return 0 even when the semaphore is locked. This is because > sem->activity is checked outside the wait_lock (at least in the generic+x86 > versions). The fix is easy just add the needed spin locks. > OK I looked at this. It appears to be pretty cut and dried and as described. That one looks really obvious I don't know how it passed review. It IS a tight race but it is a race never the less. Your reproducer looks like the classic, provoke a race kind of reproducer. There is not really anything else that I can do except for file a bug on this. Good job. I really can't add anything. My only thing is that I'm not sure that a spinlock is the right way to fix it. I think that this might require a slight change to the assembly code so that the atomic operation is synchronized across all the processors. For as low level as a rw semaphore is, coded in assembly, there must be a way to tell the processors that this incl instruction needs to be coordinated across processors. I just don't know what the right approach is. It just seems like a spinlock is kind of like a sledgehammer in this case. Did you find it by inspection or did you find it through trying to make use of it in a way different than __mlock_vma_pages_range() and page_referenced_one(). do? -ben This event sent from IssueTracker by kbaxley [LLNL (HPC)] issue 347389
Event posted on 09-25-2009 08:16pm EDT by woodard From: Brian Behlendorf <behlendorf1> To: bwoodard Subject: Re: rwsem_is_locked() bug Date: Fri, 25 Sep 2009 17:06:24 -0700 > My only thing is that I'm not sure that a spinlock is the right way to > fix it. I think that this might require a slight change to the assembly > code so that the atomic operation is synchronized across all the > processors. For as low level as a rw semaphore is coded in assembly > there must be a way to tell the processors that this incl instruction > needs to be coordinated across processors. I just don't know what the > right approach is. It just seems like a spinlock is kind of like a > sledgehammer in this case. So the right fix will be a spin lock for the generic implementation. Then there will need to be a per-arch fix as well likely in assembly which is why I didn't spend the time putting together a patch. > Did you find it by inspection or did you find it through trying to make > use of it in a way different than __mlock_vma_pages_range() and > page_referenced_one(). do? I stumbled across it when adding some paranoia checks to some locking code I was working on. I managed to trip the check with one of my test cases which prompted me to investigate. Thanks, Brian This event sent from IssueTracker by kbaxley [LLNL (HPC)] issue 347389
Event posted on 09-25-2009 08:23pm EDT by woodard Message-ID: <4ABD5400.3090101> Date: Fri, 25 Sep 2009 16:36:32 -0700 From: Ben Woodard <bwoodard> Reply-To: bwoodard To: "Mark A. Grondona" <mgrondona> Subject: Re: rwsem_is_locked() bug Mark A. Grondona wrote: > Hmm, my subsequent lesson from Brian wasn't really meant to go into the > bug report, but that's ok. > > Sorry. I thought it added something to the understanding of the issue. > What I was trying to get at was that locks may have been left out > on purpose in rwsem_is_locked(). It will be interesting to see what RH > engineers say. > > I also still don't get how Brian's reproducer shows there is an SMP > coherency problem with rwsem_is_locked(), since he always calls it > just after he has successfully taken the lock, and the check must be > run on the same CPU that the lock was granted on. > To me that is the bug. That simple test should never ever fail and it only does when you manage to hit a pretty tight race due to the variable not being coherent across the cpus (maybe cores -- didn't look that closely). The code looks fine but demonstrably doesn't work as anticipated. That is the bug. Does that make sense to you? If this was Dave Peterson and he said, "This is a hypothetical problem" as Dave was apt to do, there would be little that I could do. However, since Brian included a reproducer which demonstrates the race condition, he proved that it is more than theoretical. The only problem that I face is how to convince engineering that this is important to fix. The fact that neither__mlock_vma_pages_range() nor page_referenced_one() will hit the problem tells engineering that the problem is with an out of tree module and they are less inclined to fix it that case. So that is why I asked Brian why and how he triggered it. -ben This event sent from IssueTracker by kbaxley [LLNL (HPC)] issue 347389
Event posted on 09-25-2009 08:34pm EDT by woodard Message-ID: <4ABD6147.6090809> Date: Fri, 25 Sep 2009 17:33:11 -0700 From: Ben Woodard <bwoodard> Reply-To: bwoodard Organization: Red Hat Inc. User-Agent: Thunderbird 2.0.0.22 (X11/20090609) MIME-Version: 1.0 To: Brian Behlendorf <behlendorf1> CC: Jim Garlick <garlick1>, Mark Grondona <mgrondona> Subject: Re: rwsem_is_locked() bug References: <200909231356.40745.behlendorf1> <4ABD4630.20506> <200909251706.29751.behlendorf1> In-Reply-To: <200909251706.29751.behlendorf1> X-Enigmail-Version: 0.95.6 Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 7bit Brian Behlendorf wrote: > I stumbled across it when adding some paranoia checks to some locking code I > was working on. I managed to trip the check with one of my test cases which > prompted me to investigate. The only problem that I face is how to convince engineering that this is important to fix. The fact that neither__mlock_vma_pages_range() nor page_referenced_one() will hit the problem tells engineering that the problem is with an out of tree module and they are less inclined to fix it that case. So that is why I asked. I might have have trouble convincing engineering that this is more than a hypothetical problem. Have you submitted this upstream? > > Thanks, > Brian This event sent from IssueTracker by kbaxley [LLNL (HPC)] issue 347389
Event posted on 09-28-2009 12:47pm EDT by woodard To: bwoodard Subject: Re: rwsem_is_locked() bug Date: Mon, 28 Sep 2009 09:09:47 -0700 User-Agent: KMail/1.9.4 Cc: Jim Garlick <garlick1>, Mark Grondona > Have you submitted this upstream? Yes, I thought that might be the response. I have not submitted this upstr= eam yet hoping RH would just take the ball and fix what is obviously wrong. It's also not inconceivable that the other two usage cases could hit this, it's just exceptionally unlikely. Thanks, Brian --nextPart2701221.nVbLf53G29 Content-Type: application/pgp-signature -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.5 (GNU/Linux) iD8DBQBKwN/OCrnpkcavZYsRAu19AKDEqKzCDLOTEHB4yB9xOvScLQOmLgCgwNDS x4eRBQvw5J8LJLQJqPH+Row= =jOWv -----END PGP SIGNATURE----- --nextPart2701221.nVbLf53G29-- This event sent from IssueTracker by kbaxley [LLNL (HPC)] issue 347389
Created attachment 362930 [details] reproducer extract the file, run make, then run the shell script to try and reproduce the problem
(In reply to comment #2) > Event posted on 09-25-2009 06:17pm EDT by woodard > > From: Brian Behlendorf <behlendorf1> > To: "Mark A. Grondona" <mgrondona> > Subject: Re: rwsem_is_locked() bug > Date: Wed, 23 Sep 2009 14:48:37 -0700 > > > > > I assume this was never really a problem because in the latest > kernels > > > there are only two callers __mlock_vma_pages_range() and=20 > > > page_referenced_one(). Neither of thse cause a problem because they > are > > > called quite a while after the lock is taken, and for the race to > occur > > > rwsem_is_locked() must be called immediately after the lock is > acquire= > d. > > > > This part is interesting to me. For my own edification, can you explain > > why there is only a race just after down_read() is called? > > The race is possible because on an SMP system it's possible for there to > be a version of the semaphore structure in each of the cpu caches. This > means while one processor calls down_read() and updates the activity field > under the spin lock. Another processor might be operating on a stale copy > of the the activity field in its local cpu cache. This condition will only > last as long as the stale entry resides in the cpu cache which won't be > for long. Moving the check under the spin lock closes the race entirely, > which is what all the rest of the semaphore functions do and why this is > only an issue for this function. This doesn't make any sense. On SMP, CPU caches use MESI protocol to maintain the cache coherence. It shouldn't let you to have two different versions of the same copy. If this does happen, that probably means the CPU cache is ill-designed.
Are you willing to push this with Intel and assert that their cache is ill defined? Do you have another explanation for the fact that the reproducer works?
(In reply to comment #12) > Are you willing to push this with Intel and assert that their cache is ill > defined? If you could prove you really have two different versions for the same memory location, yes, you should file a bug to Intel, that is a HUGE bug which they are supposed to fix. The problem is, however, the code is more likely than Intel to have bugs. > Do you have another explanation for the fact that the reproducer > works? Of course yes. I believe the patch will say everything, I will send my patch out now. Guys, you are going to a _totally_ wrong direction, we don't need memory barriers, nor spinlocks. We *have* bugs.
FYI: http://lkml.org/lkml/2009/9/29/376
Hello, Patch is still in -mm, I am still waiting for it to be merged into Linus tree, then I will backport to RHEL5. If you can't wait, please let me know. Thanks.
Done: http://post-office.corp.redhat.com/archives/rhkernel-list/2009-December/msg01335.html
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.
in kernel-2.6.18-190.el5 You can download this test kernel from http://people.redhat.com/jwilson/el5 Please update the appropriate value in the Verified field (cf_verified) to indicate this fix has been successfully verified. Include a comment with verification details.
VERIFIED [root@dell-pe2950-01 bz526092-rw-semaphore-bug]# uname -r 2.6.18-189.el5 :::::rwsem_test: Unknown symbol rwsem_is_locked ::::::::::::::::rwsem_test: Unknown symbol rwsem_is_locked ::::::::::::::::rwsem_test: Unknown symbol rwsem_is_locked ::::::::::::::::rwsem_test: Unknown symbol rwsem_is_locked ::::::::::::::::rwsem_test: Unknown symbol rwsem_is_locked ::::::::::: Prwsem_test: Unknown symbol rwsem_is_locked ass 0 insmod: erwsem_test: Unknown symbol rwsem_is_locked rror inserting 'rwsem_test: Unknown symbol rwsem_is_locked ./rwsem-test.ko'rwsem_test: Unknown symbol rwsem_is_locked : -1 Unknown symrwsem_test: Unknown symbol rwsem_is_locked bol in module Prwsem_test: Unknown symbol rwsem_is_locked [root@dell-pe2950-01 bz526092-rw-semaphore-bug]# uname -r 2.6.18-194.el5 Pass 97 Pass 98 Pass 99 :: [ PASS ] :: running the test
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on therefore solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHSA-2010-0178.html