Bug 883429

Summary: Incorrect synchronization in mmap cache
Product: [Fedora] Fedora Reporter: Florian Weimer <fweimer>
Component: sssdAssignee: Jakub Hrozek <jhrozek>
Status: CLOSED CURRENTRELEASE QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: unspecified Docs Contact:
Priority: unspecified    
Version: 17CC: jhrozek, sbose, sgallagh, ssorce
Target Milestone: ---   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2013-01-12 00:10:15 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 882335    
Attachments:
Description Flags
mmap-cache.patch
none
USe a macro to do memcpy with barriers
none
Macro to deal with memcpy barriers none

Description Florian Weimer 2012-12-04 15:18:52 UTC
sss_nss_check_header() in src/sss_client/nss_mc_common.c lacks a memory barrier.  As a result, this loop:

    /* retry barrier protected reading max 5 times then give up */
    for (count = 5; count > 0; count--) {
        memcpy(&h, ctx->mmap_base, sizeof(struct sss_mc_header));
        if (MC_VALID_BARRIER(h.b1) && h.b1 == h.b2) {
            /* record is consistent so we can proceed */
            break;
        }
    }
    if (count == 0) {
        /* couldn't successfully read header we have to give up */
        return EIO;
    }

is compiled to:

   0x0000000000004d20 <+0>:	mov    0x10(%rdi),%rdx
   0x0000000000004d24 <+4>:	mov    %rbx,-0x28(%rsp)
   0x0000000000004d29 <+9>:	mov    $0x5,%eax
   0x0000000000004d2e <+14>:	mov    %rbp,-0x20(%rsp)
   0x0000000000004d33 <+19>:	mov    %r12,-0x18(%rsp)
   0x0000000000004d38 <+24>:	mov    %r13,-0x10(%rsp)
   0x0000000000004d3d <+29>:	mov    %r14,-0x8(%rsp)
   0x0000000000004d42 <+34>:	mov    (%rdx),%esi
   0x0000000000004d44 <+36>:	mov    0x28(%rdx),%ebp
   0x0000000000004d47 <+39>:	mov    0x20(%rdx),%ebx
   0x0000000000004d4a <+42>:	mov    0x30(%rdx),%r8d
   0x0000000000004d4e <+46>:	mov    0x8(%rdx),%r9d
   0x0000000000004d52 <+50>:	mov    0xc(%rdx),%r11d
   0x0000000000004d56 <+54>:	mov    0x10(%rdx),%r12d
   0x0000000000004d5a <+58>:	mov    0x1c(%rdx),%r13d
   0x0000000000004d5e <+62>:	mov    %esi,%ecx
   0x0000000000004d60 <+64>:	mov    0x4(%rdx),%r10d
   0x0000000000004d64 <+68>:	mov    0x14(%rdx),%r14d
   0x0000000000004d68 <+72>:	and    $0xff000000,%ecx
   0x0000000000004d6e <+78>:	cmp    $0xf0000000,%ecx
   0x0000000000004d74 <+84>:	je     0x4da0 <sss_nss_check_header+128>
   0x0000000000004d76 <+86>:	sub    $0x1,%eax
   0x0000000000004d79 <+89>:	jne    0x4d6e <sss_nss_check_header+78>
   0x0000000000004d7b <+91>:	mov    $0x5,%eax
   0x0000000000004d80 <+96>:	mov    -0x28(%rsp),%rbx
   0x0000000000004d85 <+101>:	mov    -0x20(%rsp),%rbp
   0x0000000000004d8a <+106>:	mov    -0x18(%rsp),%r12
   0x0000000000004d8f <+111>:	mov    -0x10(%rsp),%r13
   0x0000000000004d94 <+116>:	mov    -0x8(%rsp),%r14
   0x0000000000004d99 <+121>:	retq   
   0x0000000000004d9a <+122>:	nopw   0x0(%rax,%rax,1)
   0x0000000000004da0 <+128>:	cmp    %r8d,%esi

That is, %ecx is never reloaded from memory.

In sss_nss_mc_get_record(), __sync_synchronize() is needed before the final barrier check.

sss_mc_add_rec_to_chain() in src/responder/nss/nsssrv_mmap_cache.c contains this comment:

    /* changing a single uint32_t is atomic, so there is no
     * need to use barriers in this case */

I think the comment is misleading because it's not just atomicity that matters, ordering can also be relevant.  However, in this particular case, it does not appear to matter in what order the writes to the links in the hash chain happen.

A client can pick up a reference to a hash table slot which is about to be invalidated, resulting in a lookup failure when the record is actually present in the cache.  This is probably not a problem for this application.

You need to deal with counter overflow in your barriers (the ABA problem).  One way to do this is to switch to switch to a different file when it happens, and rename it over the existing file.  Concurrent readers will still have a consistent view.

It seems you use this scheme because you need wait-free writers, so that the privileged process which updates the cache does not block on readers.  Correct?  Otherwise, there are probably simpler approaches.

Comment 1 Florian Weimer 2012-12-05 10:06:59 UTC
(In reply to comment #0)
> sss_nss_check_header() in src/sss_client/nss_mc_common.c lacks a memory
> barrier.

I've thought about this some more, and I think you need at least acquire barriers after loading h.b1 and before loading h.b2.  Similar for sss_nss_mc_get_record().  (Plus the change to deal with counter overflow.)

I don't feel confident to confirm the absence of any remaining concurrency bugs, though.

Comment 2 Jakub Hrozek 2012-12-05 12:53:40 UTC
Upstream ticket:
https://fedorahosted.org/sssd/ticket/1694

Comment 3 Fedora Update System 2012-12-06 19:30:51 UTC
sssd-1.9.3-1.fc18 has been submitted as an update for Fedora 18.
https://admin.fedoraproject.org/updates/sssd-1.9.3-1.fc18

Comment 4 Fedora Update System 2012-12-07 20:46:05 UTC
Package sssd-1.9.3-1.fc18:
* should fix your issue,
* was pushed to the Fedora 18 testing repository,
* should be available at your local mirror within two days.
Update it with:
# su -c 'yum update --enablerepo=updates-testing sssd-1.9.3-1.fc18'
as soon as you are able to.
Please go to the following url:
https://admin.fedoraproject.org/updates/FEDORA-2012-19960/sssd-1.9.3-1.fc18
then log in and leave karma (feedback).

Comment 5 Florian Weimer 2012-12-10 15:57:18 UTC
Created attachment 660930 [details]
mmap-cache.patch

The upstream patch fb0de650e7454e1dfa76136e325e62a00748238b is still incorrect, I believe.  Barriers are needed after loading b1 and before loading b2, so that the internal ordering of the memcpy does not matter.  I'm attaching a patch on top of current master which hopefully clarifies my concerns.

Comment 6 Simo Sorce 2012-12-10 20:54:50 UTC
I do not see why the new patch would be necessary.
The way b1 and b2 are *written* guarantees that if you read the memory and you get both with the same value then the memory chunk is consistent.

Can you describe a case where the current code would fail ?
Are you worring that the memcpy may end up copying the stuff in the middle before either b1 or b2 are loaded ? Is that possible ?

Comment 7 Florian Weimer 2012-12-11 07:57:15 UTC
(In reply to comment #6)
> I do not see why the new patch would be necessary.
> The way b1 and b2 are *written* guarantees that if you read the memory and
> you get both with the same value then the memory chunk is consistent.

Reads can be reordered as well, not just writes.

And in the sss_nss_mc_get_record, you really have to check against the old b1 value, otherwise you don't detect the case where the slot has been altered during the memcpy.

> Can you describe a case where the current code would fail ?
> Are you worring that the memcpy may end up copying the stuff in the middle
> before either b1 or b2 are loaded ? Is that possible ?

It's definitely possible.  Even if the machine code contains the read instructions in the expected order, the processor may execute them in a different order or issue speculative reads which are used later.

Perhaps you could put the reader and writer code into a library, then we could write a test harness and run it on several architectures, to check that the results are as expected.  But I'm not sure how effective such testing is in uncovering concurrency bugs.

Comment 8 Simo Sorce 2012-12-11 17:21:15 UTC
(In reply to comment #7)
> (In reply to comment #6)
> > I do not see why the new patch would be necessary.
> > The way b1 and b2 are *written* guarantees that if you read the memory and
> > you get both with the same value then the memory chunk is consistent.
> 
> Reads can be reordered as well, not just writes.
> 
> And in the sss_nss_mc_get_record, you really have to check against the old
> b1 value, otherwise you don't detect the case where the slot has been
> altered during the memcpy.
> 
> > Can you describe a case where the current code would fail ?
> > Are you worring that the memcpy may end up copying the stuff in the middle
> > before either b1 or b2 are loaded ? Is that possible ?
> 
> It's definitely possible.  Even if the machine code contains the read
> instructions in the expected order, the processor may execute them in a
> different order or issue speculative reads which are used later.

You are right, I didn't properly consider the effect on CPU architectures with SMP and multiple layers of caching. This may hit particularly hard on NUMA architectures where CPUs try to optmize as much as possible cross-node transfers.

I'll reopen the upstream ticket.

> Perhaps you could put the reader and writer code into a library, then we
> could write a test harness and run it on several architectures, to check
> that the results are as expected.  But I'm not sure how effective such
> testing is in uncovering concurrency bugs.

I used to have test code to verify barriers worked properly. and you can catch concurrency issues if you have multiple threads running in tight loops over and over.
I will try to find out if I still have them around.

Comment 10 Simo Sorce 2012-12-11 22:35:22 UTC
Created attachment 661730 [details]
USe a macro to do memcpy with barriers

I came up with a slightly different patch that uses a macro to memcpy with barriers.
Can you check this is ok, it is not 100% identical to your code.
Also I found a glaring error around line 200, b2 wsa ssigned the value of rec-b1 instead of rec-b2 ...

Comment 12 Florian Weimer 2012-12-12 15:25:31 UTC
(In reply to comment #10)
> Created attachment 661730 [details]
> USe a macro to do memcpy with barriers
> 
> I came up with a slightly different patch that uses a macro to memcpy with
> barriers.  Can you check this is ok, it is not 100% identical to your code.

+        if (copy_ok && b1 == copy_rec->b2) {

The read of b2 does not necessarily come after all the other reads for other members of the copy_rec structure, so an unchanged value does not necessarily mean that copy is consistent (I think).  Using rec->b2 should be okay, but the macro really obscures what's going on here.

Comment 13 Simo Sorce 2012-12-12 17:11:33 UTC
Created attachment 662504 [details]
Macro to deal with memcpy barriers

I think the patch is correct wrt using copy_rec->b2 however Florian pointed out on IRC that I mistakenly swapped src/dest in the macro when testing b2.
With that corrected I am confident this code properly handles the necessary synchronizations now.

Comment 14 Fedora Update System 2013-01-12 00:10:17 UTC
sssd-1.9.3-1.fc18 has been pushed to the Fedora 18 stable repository.  If problems still persist, please make note of it in this bug report.