Bug 1459919 - Swift rebalance might fail when adding/replacing disks or rebalance is not optimal
Status: CLOSED ERRATA
Product: Red Hat OpenStack
Classification: Red Hat
Component: puppet-swift
Version: 10.0 (Newton)
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: z4
Target Release: 10.0 (Newton)
Assigned To: Christian Schwede (cschwede)
Mike Abrams
Keywords: Triaged, ZStream
Duplicates: 1468030 1470789 1488290
Depends On:
Blocks:
Reported: 2017-06-08 10:43 EDT by Christian Schwede (cschwede)
Modified: 2017-11-09 22:39 EST
CC: 15 users

See Also:
Fixed In Version: puppet-swift-9.5.0-3.el7ost
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2017-09-06 13:09:30 EDT
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---




External Trackers
Tracker: OpenStack gerrit | ID: 472253 | Priority: None | Status: None | Summary: None | Last Updated: 2017-06-08 10:43 EDT

Description Christian Schwede (cschwede) 2017-06-08 10:43:54 EDT
Overcloud update fails when adding additional disks to Swift on OSP10. This has already been fixed upstream, but requires a backport to Newton.

This can be easily reproduced: deploy using 3 controllers, then update the deployment to include an additional disk for Swift using an environment file like this:

parameter_defaults:
  SwiftRawDisks: {"vdb": {}}

The stack update then fails during the ring rebalance.
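The reproduction above can be sketched as a small script. This is only an illustration: the file path /tmp/swift-disks.yaml is an arbitrary choice, and the actual overcloud update command (shown only as a comment) depends on the templates and environment files used for the original deploy.

```shell
# Write the environment file from the description above.
# The path /tmp/swift-disks.yaml is an arbitrary example.
cat > /tmp/swift-disks.yaml <<'EOF'
parameter_defaults:
  SwiftRawDisks: {"vdb": {}}
EOF

# Show what we wrote, for verification.
cat /tmp/swift-disks.yaml

# The update would then be rerun with this file passed via -e, e.g.:
#   openstack overcloud deploy --templates -e /tmp/swift-disks.yaml
```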
Comment 1 Red Hat Bugzilla Rules Engine 2017-06-08 10:45:20 EDT
This bugzilla has been removed from the release and needs to be reviewed and Triaged for another Target Release.
Comment 2 Christian Schwede (cschwede) 2017-07-24 04:48:24 EDT
This also applies if the rebalance is sub-optimal, i.e. the rebalance was executed, but the partition distribution could be improved. In that case another rebalance should be done later on. However, if that next rebalance is executed within the minimum rebalance time, it will again return an exit code of 1.

All of this should be fixed by the proposed patch.
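The core of the proposed patch is to treat exit code 1 from swift-ring-builder as a success (see `returns => [0, 1]` in the rebalance.pp shown in comment 8). A minimal shell sketch of that logic follows; the rebalance() function is a mock standing in for the real `swift-ring-builder ... rebalance` call, so nothing here touches an actual ring.

```shell
# Mock of "swift-ring-builder /etc/swift/<ring>.builder rebalance":
# exit 0 = rebalanced, 1 = warning (e.g. rerun within the minimum
# rebalance time), anything else = real error.
rebalance() { return "$1"; }

run_rebalance() {
  rebalance "$1"
  rc=$?
  # Mirror the Puppet fix (returns => [0, 1]): accept 0 and 1 as success.
  if [ "$rc" -le 1 ]; then
    echo "ok (exit $rc)"
  else
    echo "error (exit $rc)"
  fi
}

run_rebalance 0   # rebalanced
run_rebalance 1   # warning, but not a deployment failure
run_rebalance 2   # genuine error, still fails
```

Without the fix, the exit-1 case is what aborted the overcloud update.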
Comment 3 Andreas Karis 2017-07-24 13:50:10 EDT
Hi Christian,

The customer has a huge Ceph cluster, and according to them a rebalance there could take some time. Note that this is happening during a stack CREATE, not during an UPDATE. And frankly, I have no idea why Ceph would have an influence on the Swift parts:

++++++++++++++++++++++++++++++++++++++++++++

RH: 

Could you apply the following change to test this?

https://review.openstack.org/#/c/472253/1/manifests/ringbuilder/rebalance.pp

Apply this change on the controller nodes, in file /etc/puppet/modules/swift/manifests/ringbuilder/rebalance.pp

So after a failed deploy with this error message, ssh to the 3 controllers, apply the change, and kick off a stack update.

Let's see if that fixes the issue; if it does, I'll get you a hotfix.


Customer:

Update completed successfully after the change.
Comment 4 Andreas Karis 2017-07-24 13:51:13 EDT
Christian, 

Can we get a hotfix for https://review.openstack.org/#/c/472253/1/manifests/ringbuilder/rebalance.pp

The customer can then virt-customize their images and upload the RPM. I can also tell them to virt-customize the code change directly ... I'm not a fan of that, though.

Thanks!!!

- Andreas
Comment 6 Christian Schwede (cschwede) 2017-07-26 11:13:15 EDT
*** Bug 1470789 has been marked as a duplicate of this bug. ***
Comment 8 Andreas Karis 2017-07-27 01:48:35 EDT
Workaround until we get the fixed RPMs:

+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

Ugly hack:
======================

https://review.openstack.org/#/c/472253/1/manifests/ringbuilder/rebalance.pp

Apply this change on the controller nodes, in file /etc/puppet/modules/swift/manifests/ringbuilder/rebalance.pp

So after a failed deploy with this error message, ssh to the 3 controllers, apply the change, and kick off a stack update.

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

A bit less ugly hack:
========================

On the undercloud, go to the directory where overcloud-full.qcow2 resides. Create a file rebalance.pp with the following content:
~~~
# Swift::Ring::Rebalance
#   Rebalances the specified ring. Assumes that the ring already exists
#   and is stored at /etc/swift/${name}.builder
#
# == Parameters
#
# [*name*] Type of ring to rebalance. The ring file is assumed to be at the path
#   /etc/swift/${name}.builder
#
# [*seed*] Optional. Seed value used to seed Python's pseudo-random number generator for ring building.
define swift::ringbuilder::rebalance(
  $seed = undef
) {

  include ::swift::deps

  validate_re($name, '^object|container|account$')
  if $seed {
    validate_re($seed, '^\d+$')
  }

  exec { "rebalance_${name}":
    command     => strip("swift-ring-builder /etc/swift/${name}.builder rebalance ${seed}"),
    path        => ['/usr/bin'],
    refreshonly => true,
    before      => Anchor['swift::config::end'],
    returns     => [0, 1],
  }
}
~~~

Then, execute:
~~~
virt-customize -a overcloud-full.qcow2 --upload rebalance.pp:/etc/puppet/modules/swift/manifests/ringbuilder/rebalance.pp
~~~

Then, execute:
~~~
source stackrc
openstack overcloud image upload --update-existing --image-path .
~~~
Comment 10 Andreas Karis 2017-07-27 11:32:08 EDT
There may be another workaround:


On the next iteration, after I deleted the stack and the overcloud plan, I ran swift list and found:

[stack@devbclu001 ~]$ swift list
overcloud-swift-rings

Somehow these are being left behind. 

I deleted them and attempted a fresh deploy:
[stack@devbclu001 ~]$ swift delete overcloud-swift-rings
swift-rings.tar.gz
overcloud-swift-rings

And the deployment succeeded. 

I have tested one live migration and it worked fine.
I plan to redeploy once more and then test several migrations.

Will keep you posted.
Comment 11 Andreas Karis 2017-07-27 11:37:00 EDT
The above workaround only applies to new deployments after an old deployment was deleted.
Comment 12 Andreas Karis 2017-07-27 12:14:54 EDT
Hi,


If there are old rings, they will be updated, and therefore the rebalance will be executed. There is no rebalance if the rings didn't change, but the issue is still there even if you don't hit it then - it might happen later on. So the fix I sent you is still needed.

The above workaround (deleting the overcloud-swift-rings container) does work for new deployments. According to engineering, the left-over rings are tracked in another bugzilla. With the provided fix the old rings will be updated, but of course it would be better if they were cleaned up beforehand.

- Andreas
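The cleanup from comment 10 can be made defensive, only deleting the container when it actually exists. The sketch below mocks the swift client as a shell function so it can be read (and run) without a cluster; in practice you would drop the mock and run the same two commands against the undercloud's Swift, after sourcing stackrc.

```shell
# Mock of the swift CLI for illustration: "list" reports the leftover
# container seen in comment 10, "delete" pretends to remove it.
# Remove this function to run against a real undercloud.
swift() {
  case "$1" in
    list)   echo "overcloud-swift-rings" ;;
    delete) echo "deleted: $2" ;;
  esac
}

# Before a fresh deploy, remove the leftover container if present.
if swift list | grep -qx 'overcloud-swift-rings'; then
  swift delete overcloud-swift-rings
fi
```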
Comment 20 Christian Schwede (cschwede) 2017-08-31 09:18:20 EDT
*** Bug 1468030 has been marked as a duplicate of this bug. ***
Comment 25 Christian Schwede (cschwede) 2017-09-06 03:54:45 EDT
*** Bug 1488290 has been marked as a duplicate of this bug. ***
Comment 26 errata-xmlrpc 2017-09-06 13:09:30 EDT
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:2654
