Bug 1359696

Summary: When the zones are switched, sync status complains of period mismatch
Product: [Red Hat Storage] Red Hat Ceph Storage
Reporter: shilpa <smanjara>
Component: RGW
Assignee: Casey Bodley <cbodley>
Status: CLOSED ERRATA
QA Contact: ceph-qe-bugs <ceph-qe-bugs>
Severity: medium
Priority: unspecified
Version: 2.0
CC: cbodley, ceph-eng-bugs, kbader, kdreyer, mbenjamin, nlevine, owasserm, smanjara, sweil
Target Milestone: rc
Target Release: 2.0
Hardware: Unspecified
OS: Unspecified
Fixed In Version: RHEL: ceph-10.2.2-31.el7cp; Ubuntu: ceph_10.2.2-24redhat1xenial
Last Closed: 2016-08-23 19:45:13 UTC
Type: Bug

Description shilpa 2016-07-25 10:17:19 UTC
Description of problem:
Switch the master and non-master zones, then check the sync status on the now non-master zone. The output reports a period mismatch with an empty master period, even though both zones are on the same period.


Version-Release number of selected component (if applicable):
ceph-radosgw-10.2.2-27.el7cp.x86_64

How reproducible:
Always

Steps to Reproduce:
1. Promote the non-master zone by running 'zone modify' with the '--master' flag:
radosgw-admin zone modify --rgw-zonegroup=us --rgw-zone=us-2 --access_key=secret --secret=secret --endpoints=http://magna059:80 --default --master
2. Update and commit the period, then restart the RGW gateway.
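
A typical sequence (the exact systemd unit name varies with the deployment; shown here for a host named by `hostname -s`):

# radosgw-admin period update --commit
# systemctl restart ceph-radosgw@rgw.$(hostname -s)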
3. Check the sync status on the current non-master zone:

# radosgw-admin sync status --rgw-zone=us-1 --debug-rgw=0 --debug-ms=0
2016-07-25 10:14:33.479604 7f82e22b39c0  0 error in read_id for id  : (2) No such file or directory
          realm bee08496-1b97-4f04-8d3f-c682c08565a3 (earth)
      zonegroup 0bf0fc77-43ce-4a44-8b16-8f5fcfa84c95 (us)
           zone f5717851-2682-475a-b24b-7bcdec728cbe (us-1)
  metadata sync syncing
                full sync: 0/64 shards
                master is on a different period: master_period= local_period=86c3c833-ee7e-454c-a54d-b451fd829755
                metadata is caught up with master
                incremental sync: 64/64 shards
      data sync source: a6dfdcd1-6d8a-4a34-8c47-4b6a02ac8105 (us-2)
                        syncing
                        full sync: 0/128 shards
                        incremental sync: 128/128 shards
                        data is caught up with source


Both zones are actually on the same period, and it is entirely different from the one shown in the output above:

# radosgw-admin period get-current --debug-rgw=0 --debug-ms=0
{
    "current_period": "21a40ae5-721f-4a67-8c3a-28baf3104d3f"

  "period_map": {
        "id": "21a40ae5-721f-4a67-8c3a-28baf3104d3f",
        "zonegroups": [
            {
                "id": "0bf0fc77-43ce-4a44-8b16-8f5fcfa84c95",
                "name": "us",
                "api_name": "us",
                "is_master": "true",
                "endpoints": [
                    "http:\/\/magna115:80"
                ],
                "hostnames": [],
                "hostnames_s3website": [],
                "master_zone": "a6dfdcd1-6d8a-4a34-8c47-4b6a02ac8105",
                "zones": [
                    {
                        "id": "a6dfdcd1-6d8a-4a34-8c47-4b6a02ac8105",
                        "name": "us-2",
                        "endpoints": [
                            "http:\/\/magna059:80"
                        ],
                        "log_meta": "true",
                        "log_data": "true",
                        "bucket_index_max_shards": 0,
                        "read_only": "false"
                    },
                    {
                        "id": "f5717851-2682-475a-b24b-7bcdec728cbe",
                        "name": "us-1",
                        "endpoints": [
                            "http:\/\/magna115:80"
                        ],
                        "log_meta": "false",
                        "log_data": "true",
                        "bucket_index_max_shards": 0,
                        "read_only": "false"
                    }
                ],
                "placement_targets": [
                    {
                        "name": "default-placement",
                        "tags": []
                    }
                ],
                "default_placement": "default-placement",
                "realm_id": "bee08496-1b97-4f04-8d3f-c682c08565a3"
            }
        ],
        "short_zone_ids": [
            {
                "key": "a6dfdcd1-6d8a-4a34-8c47-4b6a02ac8105",
                "val": 4235791203
            },
            {
                "key": "f5717851-2682-475a-b24b-7bcdec728cbe",
                "val": 311909243
            }
        ]
    },
    "master_zonegroup": "0bf0fc77-43ce-4a44-8b16-8f5fcfa84c95",
    "master_zone": "a6dfdcd1-6d8a-4a34-8c47-4b6a02ac8105",

Comment 2 Orit Wasserman 2016-07-25 18:49:15 UTC
Can you provide the procedure you used to switch masters?
Can you provide the RGW logs?

Comment 5 Casey Bodley 2016-07-26 21:09:00 UTC
I've reproduced this "master is on a different period: master_period= " error. Here's what I've learned:

The "radosgw-admin zone modify --rgw-zonegroup=us --rgw-zone=us-2 --master" command correctly changes the zonegroup's "master_zone" field to point to us-2, but doesn't modify the zonegroup's "endpoints" field.

When the "radosgw-admin sync status" command sees that it's not the master zone, it tries to send a "get_metadata_log_info" request to the new master zone's gateway. It uses the RGWRados::rest_master_conn connection, which is initialized with the zonegroup's endpoints, to do this. So after switching the master zone, it's accidentally sending the request to the endpoint associated with the old master zone, us-1. us-1 knows it's not the master zone, so it returns the empty period id that's displayed as "master_period= ".

This issue is larger than just the "sync status" output, though. We also use rest_master_conn:
* for all metadata sync requests
* when forwarding bucket/user creation operations that need to be processed by the metadata master
* when committing periods or fetching periods that we're missing

Comment 6 Casey Bodley 2016-07-26 22:32:49 UTC
As a workaround, you can fix the zonegroup endpoints with:

$ radosgw-admin zonegroup modify --rgw-zonegroup=us --endpoints=http://magna059:80

I'll discuss this with the rest of the multisite team to see if we can do better.
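
(If the zonegroup change doesn't propagate on its own, it presumably also needs a period commit afterwards, i.e. "radosgw-admin period update --commit", so that the other zones pick it up.)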

Comment 7 shilpa 2016-07-27 13:39:18 UTC
(In reply to Casey Bodley from comment #6)
> As a workaround, you can fix the zonegroup endpoints with:
> 
> $ radosgw-admin zonegroup modify --rgw-zonegroup=us
> --endpoints=http://magna059:80
> 
> I'll discuss this with the rest of the multisite team to see if we can do
> better.

The workaround helps. Thanks.

Comment 8 Ken Dreyer (Red Hat) 2016-07-27 14:02:39 UTC
Shilpa, did you follow any document when running the initial "radosgw-admin zone modify" command that led to this bug?

Discussed in the QE/Dev sync today. Next steps: Casey and the RGW team will update the zone modification workflow so that the workaround in Comment #6 is not needed.

Comment 16 shilpa 2016-08-01 05:46:35 UTC
Verified in ceph-10.2.2-32

Comment 18 errata-xmlrpc 2016-08-23 19:45:13 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2016-1755.html