Bug 1327142

Summary: Object delete did not sync to the peer zone in multisite configuration
Product: [Red Hat Storage] Red Hat Ceph Storage
Component: RGW
Status: CLOSED ERRATA
Severity: urgent
Priority: unspecified
Version: 2.0
Target Milestone: rc
Target Release: 2.0
Hardware: Unspecified
OS: Unspecified
Reporter: shilpa <smanjara>
Assignee: Casey Bodley <cbodley>
QA Contact: shilpa <smanjara>
CC: amaredia, cbodley, ceph-eng-bugs, gmeno, hnallurv, j.klare, kbader, kdreyer, mbenjamin, owasserm, smanjara, sweil
Fixed In Version: RHEL: ceph-10.2.1-13.el7cp; Ubuntu: ceph_10.2.1-2redhat1xenial
Doc Type: Bug Fix
Type: Bug
Last Closed: 2016-08-23 19:36:18 UTC
Bug Depends On: 1347904
Bug Blocks: 1343229
Attachments: Master zone logs, Logs

Description shilpa 2016-04-14 10:32:54 UTC
Description of problem:
In an active-active multisite configuration, I created an object of about 1.5 GB and uploaded it to a bucket on one zone. While the object was being synced to the peer zone, I deleted it on the source zone. The peer zone did not pick up the delete request; instead, it resynced the object back to the source zone.

Version-Release number of selected component (if applicable):
ceph-radosgw-10.1.1-1.el7cp.x86_64

How reproducible:
Always

Steps to Reproduce:
1. Create an object of about 1.5 GB and upload it to a bucket on one zone, rgw2.
2. Ensure that the object is being synced to the master zone, rgw1.
3. While the sync is in progress, delete the object from the source zone rgw2, where the object was uploaded.
4. Check whether the object is deleted on both rgw1 and rgw2.

Actual results:
The deleted object was resynced back to rgw2, and rgw1 also still had the object. The delete request never reached rgw1.


Expected results:
The object should have been deleted on rgw1.
The object should not have been rewritten on rgw2.

Additional info:

[root@rgw2 ~]# swift -A http://rgw2:8080/auth/1.0 -U test-user:swift -K 'kzmbCQgR3L5CqjQmvjatXLjeZi1Ss8RFlWLGu1Vj' upload container4 f22

[root@rgw2 ~]# swift -A http://rgw2:8080/auth/1.0 -U test-user:swift -K 'kzmbCQgR3L5CqjQmvjatXLjeZi1Ss8RFlWLGu1Vj' list container4
f22

[root@rgw2 ~]# swift -A http://rgw2:8080/auth/1.0 -U test-user:swift -K 'kzmbCQgR3L5CqjQmvjatXLjeZi1Ss8RFlWLGu1Vj' delete container4 f22
f22
[root@rgw2 ~]# swift -A http://rgw2:8080/auth/1.0 -U test-user:swift -K 'kzmbCQgR3L5CqjQmvjatXLjeZi1Ss8RFlWLGu1Vj' list container4
[root@rgw2 ~]

On rgw2 we should have seen a _send_request() sent to rgw1. I did not find that in the logs. 

On rgw1 the object is still listed. Waited for about 30 minutes and checked:

[root@rgw1 ~]# swift -A http://rgw1:8080/auth/1.0 -U test-user:swift -K 'kzmbCQgR3L5CqjQmvjatXLjeZi1Ss8RFlWLGu1Vj' list container4
f22

Back on rgw2 I see that the object has been created again:
[root@rgw2 ~]# swift -A http://rgw2:8080/auth/1.0 -U test-user:swift -K 'kzmbCQgR3L5CqjQmvjatXLjeZi1Ss8RFlWLGu1Vj' list container4
f22


I see errors like this on source rgw2 zone:

2016-04-14 08:08:16.236540 7feaa1ffb700  0 ERROR: lease cr failed, done early 
2016-04-14 08:08:16.236567 7feaa1ffb700  0 ERROR: incremental sync on container4 bucket_id=acadcc66-10b9-4829-b8e2-306c0048bff5.4176.8 shard_id=-1 failed, retcode=-16

The sync status is stuck at this:

[root@rgw2 ~]# radosgw-admin sync status --rgw-zone=us-2
          realm 4e00a610-36e9-43d0-803e-4001442b8232 (earth)
      zonegroup e66e1293-e63b-4afe-9dad-3397647dfb03 (us)
           zone 001da65b-c3a8-42e2-a1ce-79cacefbace2 (us-2)
  metadata sync syncing
                full sync: 0/64 shards
                metadata is caught up with master
                incremental sync: 64/64 shards
      data sync source: acadcc66-10b9-4829-b8e2-306c0048bff5 (us-1)
                        syncing
                        full sync: 0/128 shards
                        incremental sync: 128/128 shards
                        data is behind on 1 shards
                        oldest incremental change not applied: 2016-04-14 08:16:23.0.899181s


[root@rgw1 ~]# radosgw-admin sync status --rgw-zone=us-1
          realm 4e00a610-36e9-43d0-803e-4001442b8232 (earth)
      zonegroup e66e1293-e63b-4afe-9dad-3397647dfb03 (us)
           zone acadcc66-10b9-4829-b8e2-306c0048bff5 (us-1)
  metadata sync no sync (zone is master)
      data sync source: 001da65b-c3a8-42e2-a1ce-79cacefbace2 (us-2)
                        syncing
                        full sync: 0/128 shards
                        incremental sync: 128/128 shards
                        data is behind on 1 shards

Comment 2 Yehuda Sadeh 2016-04-27 18:26:42 UTC
There are multiple issues coming into play here. The following upstream PRs take care of the different issues:

https://github.com/ceph/ceph/pull/8709
https://github.com/ceph/ceph/pull/8685
https://github.com/ceph/ceph/pull/8772

Comment 3 Yehuda Sadeh 2016-05-03 17:44:54 UTC
*** Bug 1327519 has been marked as a duplicate of this bug. ***

Comment 4 Ken Dreyer (Red Hat) 2016-05-10 02:41:26 UTC
The patches Yehuda mentioned in Comment 2 have been merged to master. We still need the changes to be backported to jewel.

Comment 5 Ken Dreyer (Red Hat) 2016-05-10 13:08:11 UTC
Cherry-pick to jewel in progress: 
https://github.com/ceph/ceph/pull/9017

Comment 6 shilpa 2016-05-11 11:19:58 UTC
(In reply to Ken Dreyer (Red Hat) from comment #5)
> Cherry-pick to jewel in progress: 
> https://github.com/ceph/ceph/pull/9017

Ken, will this be available in the next build?

Comment 7 Ken Dreyer (Red Hat) 2016-05-11 14:36:02 UTC
The next build will be based on 10.2.1, so yes.

Comment 8 Ken Dreyer (Red Hat) 2016-05-16 15:31:11 UTC
The above PR was merged to jewel and is present in v10.2.1.

Comment 11 shilpa 2016-05-18 08:22:15 UTC
I still see this issue on ceph-radosgw-10.2.1-1.el7cp.x86_64

Object deletes are still not synced to the peer zone. Only this time, the file is not resynced back to the zone where the delete was initiated. Any further operations (creates or deletes) also fail to sync. 


2016-05-18 07:09:44.685026 7feee1ffb700  0 rgw meta sync: ERROR: can't remove key: bucket.instance:container17:0a6973f5-af95-4665-aa6b-be8dd41491d8.14164.19 ret=-2
2016-05-18 07:09:44.849754 7feda3fff700  1 civetweb: 0x7feda4004df0: 10.8.128.59 - - [18/May/2016:07:09:43 +0000] "POST /admin/log HTTP/1.1" 200 0 - -
2016-05-18 07:09:44.851600 7fee53fff700  0 rgw meta sync: cr:s=0x7fee4c003890:op=0x7fee4c036e90:18RGWMetaSyncShardCR: adjusting marker pos=1_1463555295.641429_553.1
2016-05-18 07:09:44.851609 7fee53fff700  0 rgw meta sync: cr:s=0x7fee4c003890:op=0x7fee4c036e90:18RGWMetaSyncShardCR: adjusting marker pos=1_1463555295.717155_557.1
2016-05-18 07:09:44.851613 7fee53fff700  0 rgw meta sync: cr:s=0x7fee4c003890:op=0x7fee4c036e90:18RGWMetaSyncShardCR: adjusting marker pos=1_1463555295.764512_559.1
2016-05-18 07:09:44.851616 7fee53fff700  0 rgw meta sync: cr:s=0x7fee4c003890:op=0x7fee4c036e90:18RGWMetaSyncShardCR: adjusting marker pos=1_1463555295.787972_560.1
2016-05-18 07:09:45.144490 7feda37fe700  1 civetweb: 0x7fedac004df0: 10.8.128.59 - - [18/May/2016:07:09:44 +0000] "POST /admin/log HTTP/1.1" 200 0 - -
2016-05-18 07:09:45.194290 7fee52ffd700  0 ERROR: lease cr failed, done early 
2016-05-18 07:09:45.194302 7fee52ffd700  0 ERROR: full sync on container17 bucket_id=0a6973f5-af95-4665-aa6b-be8dd41491d8.14164.16 shard_id=-1 failed, retcode=-16

Any further syncs between the zones are failing. 


# radosgw-admin sync status --rgw-zone=us-1 --debug-rgw=0 --debug-ms=0
          realm 4a77379a-174c-42c6-abff-45c148cdab4e (earth)
      zonegroup c0e3ca48-6260-4ed8-b83a-165f530e4869 (us)
           zone 0a6973f5-af95-4665-aa6b-be8dd41491d8 (us-1)
  metadata sync no sync (zone is master)
      data sync source: 411e1b05-2194-447c-a30d-63cf9b2d8884 (us-2)
                        syncing
                        full sync: 0/128 shards
                        incremental sync: 128/128 shards
                        data is behind on 5 shards


A radosgw restart helps. After restarting, the requests sync successfully:
2016-05-18 08:18:55.598521 7f91d37f6700  0 store->fetch_remote_obj() returned r=-2
2016-05-18 08:18:55.599695 7f91e57fa700  0 store->fetch_remote_obj() returned r=-2
2016-05-18 08:18:55.599953 7f91dd7fa700  0 store->fetch_remote_obj() returned r=-2
2016-05-18 08:18:55.612941 7f91d2ff5700  0 _send_request(): deleting obj=container3:ks-pre.log
2016-05-18 08:18:55.612954 7f91e5ffb700  0 _send_request(): deleting obj=container3:s3file.py
2016-05-18 08:18:55.617699 7f91d57fa700  0 _send_request(): deleting obj=container17:s3file.py
2016-05-18 08:18:55.646109 7f91cefed700  0 store->fetch_remote_obj() returned r=-2
2016-05-18 08:18:55.648808 7f91d67fc700  0 _send_request(): deleting obj=container18:s3file.py

# radosgw-admin sync status --rgw-zone=us-1 --debug-rgw=0 --debug-ms=0
          realm 4a77379a-174c-42c6-abff-45c148cdab4e (earth)
      zonegroup c0e3ca48-6260-4ed8-b83a-165f530e4869 (us)
           zone 0a6973f5-af95-4665-aa6b-be8dd41491d8 (us-1)
  metadata sync no sync (zone is master)
      data sync source: 411e1b05-2194-447c-a30d-63cf9b2d8884 (us-2)
                        syncing
                        full sync: 0/128 shards
                        incremental sync: 128/128 shards
                        data is caught up with source

Comment 12 shilpa 2016-05-18 13:47:51 UTC
Another thing to add: trying a delete operation on an object of about 1.5 GB recreates the deleted object, same as before. http://tracker.ceph.com/issues/15542

Comment 13 Yehuda Sadeh 2016-05-18 15:40:32 UTC
(In reply to shilpa from comment #11)
> I still see this issue on ceph-radosgw-10.2.1-1.el7cp.x86_64
> 
> Object deletes are still not synced to the peer zone. Only this time, the
> file is not resynced back to the zone where the delete was initiated. Any
> further operations (creates or deletes) also fail to sync. 
> 
> 
> 2016-05-18 07:09:44.685026 7feee1ffb700  0 rgw meta sync: ERROR: can't
> remove key:
> bucket.instance:container17:0a6973f5-af95-4665-aa6b-be8dd41491d8.14164.19
> ret=-2
> 2016-05-18 07:09:44.849754 7feda3fff700  1 civetweb: 0x7feda4004df0:
> 10.8.128.59 - - [18/May/2016:07:09:43 +0000] "POST /admin/log HTTP/1.1" 200
> 0 - -
> 2016-05-18 07:09:44.851600 7fee53fff700  0 rgw meta sync:
> cr:s=0x7fee4c003890:op=0x7fee4c036e90:18RGWMetaSyncShardCR: adjusting marker
> pos=1_1463555295.641429_553.1
> 2016-05-18 07:09:44.851609 7fee53fff700  0 rgw meta sync:
> cr:s=0x7fee4c003890:op=0x7fee4c036e90:18RGWMetaSyncShardCR: adjusting marker
> pos=1_1463555295.717155_557.1
> 2016-05-18 07:09:44.851613 7fee53fff700  0 rgw meta sync:
> cr:s=0x7fee4c003890:op=0x7fee4c036e90:18RGWMetaSyncShardCR: adjusting marker
> pos=1_1463555295.764512_559.1
> 2016-05-18 07:09:44.851616 7fee53fff700  0 rgw meta sync:
> cr:s=0x7fee4c003890:op=0x7fee4c036e90:18RGWMetaSyncShardCR: adjusting marker
> pos=1_1463555295.787972_560.1
> 2016-05-18 07:09:45.144490 7feda37fe700  1 civetweb: 0x7fedac004df0:
> 10.8.128.59 - - [18/May/2016:07:09:44 +0000] "POST /admin/log HTTP/1.1" 200
> 0 - -
> 2016-05-18 07:09:45.194290 7fee52ffd700  0 ERROR: lease cr failed, done
> early 
> 2016-05-18 07:09:45.194302 7fee52ffd700  0 ERROR: full sync on container17
> bucket_id=0a6973f5-af95-4665-aa6b-be8dd41491d8.14164.16 shard_id=-1 failed,
> retcode=-16

This means that it couldn't lock the container (probably in order to do the sync) because the container was busy with another operation; retcode=-16 is -EBUSY. Did you do anything else on that bucket before the object removal?

Can you document precisely what you do, and the order and timing of the operations?

> 
> Any further syncs between the zones are failing. 

For how long?

Comment 15 shilpa 2016-05-24 07:25:55 UTC
Created attachment 1160898 [details]
Master zone logs

Comment 16 shilpa 2016-05-24 07:27:09 UTC
Created attachment 1160899 [details]
Logs

Comment 19 Casey Bodley 2016-05-26 19:06:08 UTC
*** Bug 1326290 has been marked as a duplicate of this bug. ***

Comment 20 Casey Bodley 2016-05-26 19:06:22 UTC
*** Bug 1327569 has been marked as a duplicate of this bug. ***

Comment 22 Jan Klare 2016-05-31 17:16:46 UTC
Hi,

I am seeing the same errors in my logs while using COSBench to test the multisite deployment. The COSBench workload first creates 100 containers, then writes one object to each container, then deletes the object, and finally deletes the containers as well. For now I am working with 4k objects to avoid multipart uploads, and it seems that I get exactly one message per container and later on per object. I tried to pin it down to specific actions by manually creating a bucket with s3cmd and then adding one 4k object to it:

(The usual civetweb log lines are filtered out; only lines containing "ERROR" are shown.)

create bucket1:
-> logs in master zone:
2016-05-31 16:58:52.343411 7f5850ff1700  0 ERROR: failed to wait for op, ret=-22: POST http://rgw1.de/admin/log?type=metadata&notify&rgwx-zonegroup=ce98cb46-10d3-4b0f-827a-cc19774057af
-> no logs in secondary zone

create bucket2:
-> logs in master zone:
2016-05-31 16:59:13.205150 7f5850ff1700  0 ERROR: failed to wait for op, ret=-22: POST http://rgw1.de/admin/log?type=metadata&notify&rgwx-zonegroup=ce98cb46-10d3-4b0f-827a-cc19774057af
-> no logs in secondary zone

put one 4k file into bucket1:
-> no logs in master zone:
-> logs in secondary zone:
2016-05-31 17:00:25.341013 7f8488fe9700  0 ERROR: lease cr failed, done early
2016-05-31 17:00:25.341037 7f8488fe9700  0 ERROR: full sync on bucket1 bucket_id=79d864f5-3c4b-48da-8f9e-b24d82c1aacd.770892.103 shard_id=1 failed, retcode=-16

put one 4k file into bucket2:
-> no logs in master zone:
-> logs in secondary zone:
2016-05-31 17:00:03.020905 7f8488fe9700  0 ERROR: lease cr failed, done early
2016-05-31 17:00:03.020932 7f8488fe9700  0 ERROR: full sync on bucket2 bucket_id=79d864f5-3c4b-48da-8f9e-b24d82c1aacd.770892.104 shard_id=1 failed, retcode=-16

delete one 4k file from bucket1:
-> no logs in master zone:
-> logs in secondary zone:
2016-05-31 17:07:09.891998 7f8488fe9700  0 ERROR: lease cr failed, done early
2016-05-31 17:07:09.892024 7f8488fe9700  0 ERROR: incremental sync on bucket1 bucket_id=79d864f5-3c4b-48da-8f9e-b24d82c1aacd.770892.103 shard_id=1 failed, retcode=-16

same for the file in bucket2

I am seeing no errors when deleting the bucket.

The sync itself seems to work fine, and I can download the file from both sites without any issue after creating it. So I have no idea whether this is an actual issue or just an ERROR in the logs. I am running radosgw on Ubuntu xenial with a self-built package from today's jewel branch.

Cheers,
Jan

Comment 26 Casey Bodley 2016-05-31 19:38:07 UTC
(In reply to Jan Klare from comment #22)
> Hi,
> 
> i am seeing the same errors in my logs while i use COSBench to test the
> multisite deployment. The COSBench workload first creates 100 containers,
> then writes one object to each container, then deletes the object and lastly
> the containers too. For now i am working with 4k objects to avoid multipart
> uploads and it seems that i get exactly one message per container and
> lateron per object. I tried to pin it to some specific actions by manually
> creating a bucket with s3cmd and then adding one 4k object into it:
> 
> (the usual logs from civetweb are filtered out, just showing logs containing
> "ERROR")
> 
> create bucket1:
> -> logs in master zone:
> 2016-05-31 16:58:52.343411 7f5850ff1700  0 ERROR: failed to wait for op,
> ret=-22: POST
> http://rgw1.de/admin/log?type=metadata&notify&rgwx-zonegroup=ce98cb46-10d3-
> 4b0f-827a-cc19774057af
> -> no logs in secondary zone
> 
> create bucket2:
> -> logs in master zone:
> 2016-05-31 16:59:13.205150 7f5850ff1700  0 ERROR: failed to wait for op,
> ret=-22: POST
> http://rgw1.de/admin/log?type=metadata&notify&rgwx-zonegroup=ce98cb46-10d3-
> 4b0f-827a-cc19774057af
> -> no logs in secondary zone
> 
> put one 4k file into bucket1:
> -> no logs in master zone:
> -> logs in secondary zone:
> 2016-05-31 17:00:25.341013 7f8488fe9700  0 ERROR: lease cr failed, done early
> 2016-05-31 17:00:25.341037 7f8488fe9700  0 ERROR: full sync on bucket1
> bucket_id=79d864f5-3c4b-48da-8f9e-b24d82c1aacd.770892.103 shard_id=1 failed,
> retcode=-16
> 
> put one 4k file into bucket2:
> -> no logs in master zone:
> -> logs in secondary zone:
> 2016-05-31 17:00:03.020905 7f8488fe9700  0 ERROR: lease cr failed, done early
> 2016-05-31 17:00:03.020932 7f8488fe9700  0 ERROR: full sync on bucket2
> bucket_id=79d864f5-3c4b-48da-8f9e-b24d82c1aacd.770892.104 shard_id=1 failed,
> retcode=-16
> 
> delete one 4k file from bucket1:
> -> no logs in master zone:
> -> logs in secondary zone:
> 2016-05-31 17:07:09.891998 7f8488fe9700  0 ERROR: lease cr failed, done early
> 2016-05-31 17:07:09.892024 7f8488fe9700  0 ERROR: incremental sync on
> bucket1 bucket_id=79d864f5-3c4b-48da-8f9e-b24d82c1aacd.770892.103 shard_id=1
> failed, retcode=-16
> 
> same for the file in bucket2
> 
> I am seeing no errors when deleting the bucket.
> 
> The sync itself seems to work fine and i can download the file without any
> issue from both sites after i create it. So i have no idea if this is an
> actual issue or just an ERROR in the logs. I am running the radosgw on
> ubuntu xenial and use a selfbuild package from the jewel branch of today.
> 
> Cheers,
> Jan

Thanks Jan. The ERRORs you're seeing do occur during normal operation. You can safely ignore them, unless you also observe puts or deletes that don't get synced to other zones for a significant amount of time.

Comment 27 Jan Klare 2016-06-01 09:50:35 UTC
Hi Casey,

thanks for the fast response. If these ERRORs are part of "normal operation", I can ignore them for my testing setup, but in my opinion ERROR messages should never be part of normal operation. Any chance this will be fixed in jewel? Any idea why I get these ERRORs even though everything works fine? If you want, we can move this to a separate bug, since it does not seem to be directly related to the actual "delete sync not working" issue.

Cheers,
Jan

Comment 28 Casey Bodley 2016-06-01 18:58:59 UTC
(In reply to Jan Klare from comment #27)
> Hi Casey,
> 
> thanks for the fast response. If these ERRORs are part of the "normal
> operation", i can ignore them for my testing setup, but in my opinion ERROR
> messages should never be part of "normal operation". Any chance this will be
> fixed in jewel? Any idea why i get these ERRORs, even thou everything works
> fine? If you want, we can move this to a separate bug, since this seems not
> to be directly related to the actual "delete sync not working" issue.
> 
> Cheers,
> Jan

I agree that the ERROR text and --debug-rgw levels for these messages should change. Could you please open a new bug?

Comment 29 Casey Bodley 2016-06-01 20:47:50 UTC
Hi Shilpa,

We're preparing a new build that includes the latest fix from Yehuda. This allows the data sync process to keep retrying on bucket sync failures, including those related to "ERROR: lease cr failed, done early". This may not resolve all of the sync issues that you're seeing, but should at least limit the ways that it can reproduce. We'll follow up with the fixed-in-version when it's ready.

Comment 30 Jan Klare 2016-06-02 10:36:34 UTC
(In reply to Casey Bodley from comment #28)
> (In reply to Jan Klare from comment #27)
> > Hi Casey,
> > 
> > thanks for the fast response. If these ERRORs are part of the "normal
> > operation", i can ignore them for my testing setup, but in my opinion ERROR
> > messages should never be part of "normal operation". Any chance this will be
> > fixed in jewel? Any idea why i get these ERRORs, even thou everything works
> > fine? If you want, we can move this to a separate bug, since this seems not
> > to be directly related to the actual "delete sync not working" issue.
> > 
> > Cheers,
> > Jan
> 
> I agree that the ERROR text and --debug-rgw levels for these messages should
> change. Could you please open a new bug?

New bug created here: http://tracker.ceph.com/issues/16121

Comment 32 shilpa 2016-06-14 10:22:58 UTC
(In reply to Casey Bodley from comment #29)
> Hi Shilpa,
> 
> We're preparing a new build that includes the latest fix from Yehuda. This
> allows the data sync process to keep retrying on bucket sync failures,
> including those related to "ERROR: lease cr failed, done early". This may
> not resolve all of the sync issues that you're seeing, but should at least
> limit the ways that it can reproduce. We'll follow up with the
> fixed-in-version when it's ready.


Hi Casey,

I was able to reproduce the issue again on the new build 10.2.1-18. 
I have uploaded the compressed logs here. I first saw the issue on container2, so that would be the place to look, around timestamp 2016-06-14 10:05:30.0.783646s. I uploaded about 235 files at once, and that's when none of the files got synced.

Comment 33 shilpa 2016-06-14 10:28:47 UTC
Created attachment 1167825 [details]
magna075 logs

Comment 36 Casey Bodley 2016-06-16 13:10:32 UTC
Logs for magna075 show that the data sync thread stops producing output about 15 minutes before the ~235 objects were uploaded on magna115. The last output from that thread indicates that it deadlocked while submitting an HTTP request to another thread.

I've asked Shilpa to set up another reproducer so that I can attach a debugger and figure out exactly where these threads are blocked.

Comment 37 Casey Bodley 2016-06-16 18:05:05 UTC
Here are the stack traces from gdb illustrating the deadlock. The same deadlock occurred between the metadata sync and http threads.

data sync thread:

Thread 138 (Thread 0x7f261affd700 (LWP 9325)):
#0  0x00007f271b5e41cd in write () from /lib64/libpthread.so.0
#1  0x00007f271bf7f940 in RGWHTTPManager::signal_thread() () from /lib64/librgw.so.2
#2  0x00007f271bf807ee in RGWHTTPManager::remove_request(RGWHTTPClient*) () from /lib64/librgw.so.2
#3  0x00007f271bf80855 in RGWHTTPClient::~RGWHTTPClient() () from /lib64/librgw.so.2
#4  0x00007f271c0cd8a9 in RGWRESTStreamRWRequest::~RGWRESTStreamRWRequest() () from /lib64/librgw.so.2
#5  0x00007f271c0d3549 in RGWRESTReadResource::~RGWRESTReadResource() () from /lib64/librgw.so.2
#6  0x00007f271bf316f8 in RefCountedObject::put() () from /lib64/librgw.so.2
#7  0x00007f271bfe2bc3 in RGWReadRemoteDataLogShardInfoCR::~RGWReadRemoteDataLogShardInfoCR() () from /lib64/librgw.so.2
#8  0x00007f271bf316f8 in RefCountedObject::put() () from /lib64/librgw.so.2
#9  0x00007f271bf2d94f in RGWCoroutinesStack::operate(RGWCoroutinesEnv*) () from /lib64/librgw.so.2
#10 0x00007f271bf2f6a1 in RGWCoroutinesManager::run(std::list<RGWCoroutinesStack*, std::allocator<RGWCoroutinesStack*> >&) () from /lib64/librgw.so.2
#11 0x00007f271bf30260 in RGWCoroutinesManager::run(RGWCoroutine*) () from /lib64/librgw.so.2
#12 0x00007f271bfd75d2 in RGWRemoteDataLog::run_sync(int, rgw_data_sync_status&) () from /lib64/librgw.so.2
#13 0x00007f271c0988b9 in RGWDataSyncProcessorThread::process() () from /lib64/librgw.so.2
#14 0x00007f271c03e313 in RGWRadosThread::Worker::entry() () from /lib64/librgw.so.2
#15 0x00007f271b5dddc5 in start_thread () from /lib64/libpthread.so.0
#16 0x00007f271abe7ced in clone () from /lib64/libc.so.6


data sync http thread:

Thread 136 (Thread 0x7f2619ffb700 (LWP 9327)):
#0  0x00007f271abdd69d in poll () from /lib64/libc.so.6
#1  0x00007f271ba353a9 in Curl_poll () from /lib64/libcurl.so.4
#2  0x00007f271ba2e37b in curl_multi_wait () from /lib64/libcurl.so.4
#3  0x00007f271bf7e311 in do_curl_wait(CephContext*, void*, int) () from /lib64/librgw.so.2
#4  0x00007f271bf80e02 in RGWHTTPManager::reqs_thread_entry() () from /lib64/librgw.so.2
#5  0x00007f271bf8148d in RGWHTTPManager::ReqsThread::entry() () from /lib64/librgw.so.2
#6  0x00007f271b5dddc5 in start_thread () from /lib64/libpthread.so.0
#7  0x00007f271abe7ced in clone () from /lib64/libc.so.6


RGWHTTPManager::signal_thread() is trying to write 4 bytes to a pipe while curl_multi_wait() is polling that pipe for wakeups. Blocking on a write() indicates that the pipe's buffer is full, so we must be leaking bytes on the read side. I'm now testing a potential fix.
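
To make the failure mode concrete, here is a minimal sketch of the self-pipe wakeup pattern that curl_multi_wait() is being used for (illustrative code only, with simplified names; it is not the actual RGWHTTPManager implementation): the poll thread passes the pipe's read end to curl_multi_wait() as an extra fd and must drain it whenever it is flagged readable, while other threads wake the poller by writing to the write end. If the poller never drains the pipe, the pipe buffer (typically 64 KiB on Linux) eventually fills and the next write() blocks, which is exactly the state captured in the stack traces above.

// Minimal sketch of the self-pipe wakeup pattern around curl_multi_wait().
// Illustrative only -- simplified names, not the actual RGWHTTPManager code.
#include <curl/curl.h>
#include <unistd.h>
#include <cstdint>

struct Poller {
  CURLM *multi;
  int pipefd[2];   // [0] = read end polled by the loop, [1] = write end used to signal it

  Poller() : multi(curl_multi_init()) { (void)pipe(pipefd); }
  ~Poller() { curl_multi_cleanup(multi); close(pipefd[0]); close(pipefd[1]); }

  // Called from other threads (e.g. when adding or removing a request) to wake
  // the poll loop; this is the "write 4 bytes to a pipe" in frame #1 above.
  // If the loop never drains the pipe, this write() eventually blocks once the
  // pipe buffer fills, which is the deadlock shown in the stack traces.
  void signal() {
    uint32_t buf = 0;
    (void)write(pipefd[1], &buf, sizeof(buf));
  }

  // One iteration of the poll loop.
  void wait_once() {
    struct curl_waitfd wakeup;
    wakeup.fd = pipefd[0];
    wakeup.events = CURL_WAIT_POLLIN;
    wakeup.revents = 0;

    int numfds = 0;
    curl_multi_wait(multi, &wakeup, 1, 1000 /* ms */, &numfds);

    // The crucial step: drain the pipe when it is flagged as readable. If
    // revents is never reported for the extra fd (or the loop otherwise
    // fails to read), the bytes pile up unread and signal() eventually blocks.
    if (wakeup.revents & CURL_WAIT_POLLIN) {
      uint32_t buf;
      while (read(pipefd[0], &buf, sizeof(buf)) == sizeof(buf))
        ;  // keep reading until empty (the fd would be non-blocking in practice)
    }

    int running = 0;
    curl_multi_perform(multi, &running);
  }
};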

Comment 38 Casey Bodley 2016-06-17 16:05:57 UTC
I was able to reproduce this deadlock locally by building with HAVE_CURL_MULTI_WAIT=0. If libcurl doesn't support curl_multi_wait(), we fall back to different logic that uses select() instead. This fallback has a bug that prevents us from ever reading bytes from the pipe, which is why it deadlocks consistently.

However, looking at the build logs for these downstream packages, I do see "checking for curl_multi_wait in -lcurl... yes" which would indicate that HAVE_CURL_MULTI_WAIT=1. curl_multi_wait() was added in libcurl 7.28.0, and it appears that we're building against libcurl-devel-7.29.0.

So I'm having trouble reconciling these: configure is telling us that HAVE_CURL_MULTI_WAIT=1, while we're observing behavior that should only be possible with HAVE_CURL_MULTI_WAIT=0.
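
For comparison, a select()-based fallback has to do the same bookkeeping by hand. The following is a minimal sketch of what a correct fallback looks like (again illustrative only, not the actual Ceph fallback code): the pipe's read end has to be added to the read set alongside the fds reported by curl_multi_fdset(), and drained after select() returns; skipping that drain is the class of bug described above.

// Sketch of a correct select()-based wait (the HAVE_CURL_MULTI_WAIT=0 path).
// Illustrative only; not the actual Ceph fallback implementation.
#include <curl/curl.h>
#include <sys/select.h>
#include <unistd.h>
#include <algorithm>
#include <cstdint>

void wait_once_select(CURLM *multi, int pipe_rd) {
  fd_set rfds, wfds, efds;
  FD_ZERO(&rfds); FD_ZERO(&wfds); FD_ZERO(&efds);

  int maxfd = -1;
  curl_multi_fdset(multi, &rfds, &wfds, &efds, &maxfd);

  // The wakeup pipe must be watched along with curl's own fds...
  FD_SET(pipe_rd, &rfds);
  maxfd = std::max(maxfd, pipe_rd);

  struct timeval tv = {1, 0};  // 1 second timeout
  select(maxfd + 1, &rfds, &wfds, &efds, &tv);

  // ...and drained afterwards, otherwise the thread writing to the other end
  // of the pipe eventually blocks, just like in the curl_multi_wait() case.
  if (FD_ISSET(pipe_rd, &rfds)) {
    uint32_t buf;
    while (read(pipe_rd, &buf, sizeof(buf)) == sizeof(buf))
      ;  // pipe_rd is assumed non-blocking
  }

  int running = 0;
  curl_multi_perform(multi, &running);
}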

Comment 39 Casey Bodley 2016-06-17 17:30:44 UTC
Submitted a PR against master with the fix for select(): https://github.com/ceph/ceph/pull/9785

Comment 40 Casey Bodley 2016-06-17 20:46:03 UTC
I've established that these downstream packages do have curl_multi_wait(), so the fix to select() will not help here.

Further investigation with gdb on magna075 shows that curl_multi_wait() is never flagging this pipe fd as signaled. This behavior has not been observed in development environments running recent versions of libcurl (my Fedora has libcurl-devel-7.43.0).

I managed to find a "curl_multi_wait: fix revents" commit at https://github.com/curl/curl/commit/513e587c5eb966038731530c8f47fe0cf27513ce, which landed in 7.32.0. I haven't yet tested it to determine whether it resolves the issue.

Comment 41 Casey Bodley 2016-06-18 01:57:12 UTC
Ken created a rhel-7.2-patches branch of libcurl at version 7.29.0.

After building that branch, my RGWHTTPManager unit test reproduced the deadlock every time.

I then cherry-picked "curl_multi_wait: fix revents" and two related commits onto that branch, and the unit test passed consistently without deadlock. This confirms that those fixes will resolve this bug.

I opened a pull request for Ken at https://github.com/ktdreyer/curl/pull/1 for packaging.

Comment 45 shilpa 2016-06-22 08:38:35 UTC
Testing with the new curl package. Plain object sync seems to work. But I noticed two issues with syncing deletes:

1. 
Upload an object of about 1.5 GB from the master zone and wait for the upload to finish. While the object has not yet finished syncing, delete it from the master zone. At this point I checked that the object sync operation had completed, and the object then started to sync in reverse, re-creating it on the master zone.


2. 
Introduced a network delay of 200 ms. Did the same operation of creating an object on the master zone, but with a smaller file of 500 MB, and deleted it before the sync finished. This time, the first object create sync operation never completed. The two zones are out of sync:

# radosgw-admin sync status --rgw-zone=us-2 --debug-rgw=0 --debug-ms=0
          realm fedc07d8-a4cc-40c0-b8ad-4e1be8251726 (earth)
      zonegroup 4401713c-7fdf-4619-adea-829c5e7fdd0d (us)
           zone 591f5f4f-2b22-4346-ae9c-45c7e37ad5ac (us-2)
  metadata sync syncing
                full sync: 0/64 shards
                metadata is caught up with master
                incremental sync: 64/64 shards
      data sync source: 38b0ab46-20fd-4c94-9f19-193e86c7e343 (us-1)
                        syncing
                        full sync: 0/128 shards
                        incremental sync: 128/128 shards
                        data is behind on 2 shards
                        oldest incremental change not applied: 2016-06-22 07:37:26.0.924151s

I have debug logs for the second scenario attached. Will reproduce the first one and upload.

Comment 46 shilpa 2016-06-22 09:31:21 UTC
The tests were run in buckets 'test20' and 'test21' with file name f22

Comment 47 shilpa 2016-06-22 09:48:57 UTC
Created attachment 1170650 [details]
magna115 logs

Comment 48 shilpa 2016-06-22 10:00:02 UTC
Created attachment 1170656 [details]
magna075

Comment 49 Casey Bodley 2016-06-22 14:26:45 UTC
Thanks Shilpa. We have a preliminary fix for these issues upstream. Can we track them as a separate bug so we can close this one for 2.0?

Comment 50 shilpa 2016-06-23 07:10:22 UTC
(In reply to Casey Bodley from comment #49)
> Thanks Shilpa. We have a preliminary fix for these issues upstream. Can we
> track them as a separate bug so we can close this one for 2.0?

Sure that makes sense.

Comment 51 shilpa 2016-06-23 07:15:25 UTC
The new curl package fixes object create sync. The remaining sync delete issues are being tracked in BZ 1349285.

Comment 53 errata-xmlrpc 2016-08-23 19:36:18 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2016-1755.html