Bug 1352888 - [Upgrade]: on Ceph upgrade from 1.3.2 to 2.0 the RGW default zone setup is not working
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat Storage
Component: RGW
Version: 2.0
Hardware: Unspecified
OS: Linux
Priority: high
Severity: high
Target Milestone: rc
Target Release: 2.1
Assignee: Orit Wasserman
QA Contact: Vasishta
Docs Contact: Bara Ancincova
URL:
Whiteboard:
Depends On:
Blocks: 1322504 1383917
 
Reported: 2016-07-05 12:07 UTC by Tejas
Modified: 2017-07-30 15:48 UTC
CC: 13 users

Fixed In Version: RHEL: ceph-10.2.3-2.el7cp Ubuntu: ceph_10.2.3-3redhat1xenial
Doc Type: Bug Fix
Doc Text:
.Bucket creation no longer fails after upgrading Red Hat Ceph Storage 1.3 to 2.0

Previously, after upgrading a Ceph Object Gateway node from Red Hat Ceph Storage 1.3 to 2.0, an attempt to create a bucket failed. This bug has been fixed, and bucket creation no longer fails in this case.
Clone Of:
Environment:
Last Closed: 2016-11-22 19:28:17 UTC
Embargoed:


Attachments
rgw log (18.18 MB, text/plain)
2016-07-05 12:07 UTC, Tejas
no flags
rgw log with debug_rgw=20 debug_ms=5 (18.71 MB, text/plain)
2016-07-05 12:14 UTC, Tejas
no flags
Rados gw log (10.66 MB, text/plain)
2016-07-15 16:37 UTC, Tejas
no flags
multipart upload script (1.13 KB, text/plain)
2016-07-21 15:05 UTC, Tejas
no flags


Links
System ID Private Priority Status Summary Last Updated
Ceph Project Bug Tracker 16627 0 None None None 2016-07-08 23:25:20 UTC
Red Hat Product Errata RHSA-2016:2815 0 normal SHIPPED_LIVE Moderate: Red Hat Ceph Storage security, bug fix, and enhancement update 2017-03-22 02:06:33 UTC

Description Tejas 2016-07-05 12:07:27 UTC
Created attachment 1176390 [details]
rgw log

Description of problem:

After upgrading Ceph from 1.3.2 to 2.0, the RGW setup has changed.
I am unable to do I/O with the older RGW setup.

Version-Release number of selected component (if applicable):
ceph version 10.2.2-15.el7cp (60cd52496ca02bdde9c2f4191e617f75166d87b6)

How reproducible:
Always

Steps to Reproduce:
1. Create an S3 user and run some I/O on a Ceph 1.3.2 cluster.
2. Upgrade the Ceph cluster to 2.0.
3. The older RGW setup no longer works after the upgrade.
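
For reference, a minimal sketch of step 1 (the user ID, bucket, and file names below are illustrative and not taken from this report; s3cmd is just one possible client and is assumed to be configured with the keys printed by user create):

# Create an S3 user on the RGW node (illustrative uid/display name)
radosgw-admin user create --uid=testuser --display-name="Test User"

# Run some I/O against the gateway with the new user's credentials
s3cmd mb s3://testbucket
s3cmd put /tmp/object.bin s3://testbucket/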



Additional info:

I have attached the log files, which have debug_rgw=20 and debug_ms=5 enabled.

RGW node:
magna080

Comment 2 Tejas 2016-07-05 12:14:34 UTC
Created attachment 1176399 [details]
rgw log with debug_rgw=20 debug_ms=5

Comment 5 Orit Wasserman 2016-07-07 17:02:17 UTC
Can you try running:

radosgw-admin zone modify --master --rgw-zone=default

and see if you can now create buckets successfully?
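
For anyone following along, a rough sketch of how this could be checked and applied (the zone get and period commands are generic radosgw-admin usage added here as an assumption; the comment above only asks for the zone modify):

# Inspect the current default zone configuration
radosgw-admin zone get --rgw-zone=default

# Mark the default zone as master, as suggested above
radosgw-admin zone modify --master --rgw-zone=default

# Depending on the setup, committing the period may also be needed (assumption)
radosgw-admin period update --commit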

Comment 9 Orit Wasserman 2016-07-08 08:52:16 UTC
upstream fix: https://github.com/ceph/ceph/pull/10205

Comment 18 Harish NV Rao 2016-07-13 06:17:42 UTC
Tejas provided the QA ack. Resetting needinfo.

Comment 20 Tejas 2016-07-15 14:39:32 UTC
Orit,

    The issue is fixed for sure, but the behaviour seems strange.
We do a reboot after the upgrade, and the radosgw process comes up automatically after the reboot:

[root@magna080 ~]# ps -ef | grep ceph
ceph      1445     1  0 14:26 ?        00:00:00 /usr/bin/radosgw -f --cluster ceph --name client.rgw.magna080 --setuser ceph --setgroup ceph
ceph      3162     1  1 14:26 ?        00:00:00 /usr/bin/ceph-osd -f --cluster ceph --id 4 --setuser ceph --setgroup ceph
ceph      4034     1  1 14:26 ?        00:00:00 /usr/bin/ceph-osd -f --cluster ceph --id 3 --setuser ceph --setgroup ceph
ceph      4843     1  1 14:26 ?        00:00:00 /usr/bin/ceph-osd -f --cluster ceph --id 5 --setuser ceph --setgroup ceph

[root@magna080 ~]# netstat -ntlp | grep :7480
tcp        0      0 0.0.0.0:7480            0.0.0.0:*               LISTEN      1445/radosgw

But the I/O still fails. The moment I restart the radosgw process, the I/O works as expected.
Any idea why an additional process restart is needed?

Thanks,
Tejas
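
For the record, the process restart described above would look roughly like this on a systemd-based node (the unit instance name is an assumption derived from the client.rgw.magna080 name in the ps output and may need adjusting):

# Restart the radosgw instance and confirm it is listening again (unit name assumed)
systemctl restart ceph-radosgw@rgw.magna080
systemctl status ceph-radosgw@rgw.magna080
netstat -ntlp | grep :7480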

Comment 21 Orit Wasserman 2016-07-15 14:54:43 UTC
can you provide radosgw logs?

Comment 22 Orit Wasserman 2016-07-15 14:55:41 UTC
It could be related to https://bugzilla.redhat.com/show_bug.cgi?id=1352396

Comment 23 Tejas 2016-07-15 16:34:18 UTC
I have already changed the ownership to ceph:ceph, unlike in https://bugzilla.redhat.com/show_bug.cgi?id=1352396

Attaching the radosgw log; I had enabled the debug params before doing the process restart.

Thanks,
Tejas
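
For reference, the ownership change mentioned above is typically done along these lines (the paths are the usual defaults and are an assumption here; BZ 1352396 has the authoritative steps):

# Make sure the Ceph directories are owned by the ceph user and group after the upgrade
chown -R ceph:ceph /var/lib/ceph /var/log/ceph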

Comment 24 Tejas 2016-07-15 16:37:25 UTC
Created attachment 1180207 [details]
Rados gw log

Comment 25 Harish NV Rao 2016-07-19 15:08:14 UTC
Moving this to assigned state based on comment 20. Please see comment 24 for logs.

Comment 26 Orit Wasserman 2016-07-20 08:13:46 UTC
It looks like there is a problem connecting to the monitor:
 10.8.128.80:0/2525835698 submit_message mon_subscribe({osdmap=73}) v2 remote, 10.8.128.58:6789/0, failed lossy con, dropping message 0x7f7f480102d0
2016-07-15 14:26:27.010541 7f7f610e7700  0 monclient: hunting for new mon

This is a different issue than the one described, could you open a new BZ?
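
A quick, generic way to check whether the RGW node can actually reach the monitors (standard ceph CLI commands, not taken from this BZ; they require a keyring with sufficient permissions on the node):

# From the RGW node, check overall cluster reachability and monitor status
ceph -s
ceph mon stat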

Comment 27 Ken Dreyer (Red Hat) 2016-07-20 13:50:11 UTC
Orit, does that mon contact failure have anything to do with RGW anymore?

Comment 28 Orit Wasserman 2016-07-20 14:13:41 UTC
Could it be that the OSDs were not restarted after the upgrade?
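
If that is the case, restarting the OSDs on the node would look roughly like this (the OSD IDs 3, 4, and 5 are taken from the ps output in comment 20; the systemd unit names are the usual ones and an assumption here):

# Restart the individual OSD daemons on magna080
systemctl restart ceph-osd@3 ceph-osd@4 ceph-osd@5
# or restart all OSDs on the node at once
systemctl restart ceph-osd.target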

Comment 29 John Poelstra 2016-07-20 16:00:12 UTC
Matt will arrange a meeting with Orit and the QE team.

Comment 30 Orit Wasserman 2016-07-20 16:39:31 UTC
After looking further into the log, I see we created a new bucket, bigbucket, and added an object, big.txt, at 14:34:17.

Maybe the I/O was started too soon?
Maybe you need to increase the timeout for the I/O ops?

But, as I said before, this is not the same issue; it should be a new BZ.

Comment 33 Tejas 2016-07-21 15:05:51 UTC
Created attachment 1182552 [details]
multipart upload script

Comment 36 Tejas 2016-07-22 12:53:37 UTC
hi,

   QE worked with Orit to reproduce this on a live setup. This is what we found from the meeting (a rough command-level sketch of these steps follows after this comment):
1. Created a 1.3.2 Ceph cluster with RGW on a separate node, with I/O in progress.
2. Stopped the RGW process, upgraded RGW, and rebooted the node.
3. The RGW process is running after the node comes up.
4. Bucket creation fails.
5. Restarted the RGW service.
6. Bucket creation works.

The RGW logs from today's testing are too big to be copied here.
Please take a local copy of the log from here:
root@magna080://var/log/ceph/ceph-client.rgw.magna080.log

Thanks,
Tejas
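
As a rough command-level sketch of the sequence in comment 36 (service and package names are assumptions for a RHEL 7 node and may differ between the 1.3.2 and 2.0 packaging; the bucket test uses s3cmd as an example client):

# 1-2. Stop RGW, upgrade the RGW packages, and reboot the node
systemctl stop ceph-radosgw@rgw.magna080
yum update ceph-radosgw
reboot

# 3-4. After the node is back up, RGW is running but bucket creation fails
systemctl status ceph-radosgw@rgw.magna080
s3cmd mb s3://newbucket        # fails at this point

# 5-6. Restart the RGW service; bucket creation then works
systemctl restart ceph-radosgw@rgw.magna080
s3cmd mb s3://newbucket        # succeeds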

Comment 37 Orit Wasserman 2016-07-22 13:04:40 UTC
(In reply to Tejas from comment #36)
> hi,
> 
>    QE worked with Orit to reproduce this on a live setup. This is what we
> found from the meeting:
> 1. Created a 1.3.2 Ceph cluster with RGW on a separate node, with I/O in
> progress.
> 2. Stopped the RGW process, upgraded RGW, and rebooted the node.
> 3. The RGW process is running after the node comes up.
> 4. Bucket creation fails.
> 5. Restarted the RGW service.
> 6. Bucket creation works.
> 
> The RGW logs from today's testing are too big to be copied here.
> Please take a local copy of the log from here:
> root@magna080://var/log/ceph/ceph-client.rgw.magna080.log
> 
> Thanks,
> Tejas

Thanks,
I copied it to my computer.

Comment 43 Orit Wasserman 2016-08-18 11:07:15 UTC
Looks good,
Orit

Comment 47 Vasishta 2016-11-08 14:12:58 UTC
Hi,

After upgrading the cluster to 2.0, I was able to create a new bucket and run I/O,
so I am moving this bug to the verified state.

Regards,
Vasishta

Comment 51 errata-xmlrpc 2016-11-22 19:28:17 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2016-2815.html

