Bug 1459919 - Swift rebalance might fail when adding/replacing disks or rebalance is not optimal
Summary: Swift rebalance might fail when adding/replacing disks or rebalance is not optimal
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: puppet-swift
Version: 10.0 (Newton)
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: z4
Target Release: 10.0 (Newton)
Assignee: Christian Schwede (cschwede)
QA Contact: Mike Abrams
URL:
Whiteboard:
Duplicates: 1468030 1470789 1488290 (view as bug list)
Depends On:
Blocks:
 
Reported: 2017-06-08 14:43 UTC by Christian Schwede (cschwede)
Modified: 2020-12-14 08:50 UTC (History)
15 users (show)

Fixed In Version: puppet-swift-9.5.0-3.el7ost
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-09-06 17:09:30 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
OpenStack gerrit 472253 0 None MERGED Fix raising an error on rebalance warnings 2020-07-02 13:33:06 UTC
Red Hat Product Errata RHBA-2017:2654 0 normal SHIPPED_LIVE Red Hat OpenStack Platform 10 director Bug Fix Advisory 2017-09-06 20:55:36 UTC

Internal Links: 1468030

Description Christian Schwede (cschwede) 2017-06-08 14:43:54 UTC
Overcloud updates fail when adding additional disks to Swift on OSP 10. This has already been fixed upstream, but requires a backport for Newton.

This can be easily reproduced: deploy using 3 controllers, then update the deployment to include an additional disk for Swift using an environment file like this:

parameter_defaults:
  SwiftRawDisks: {"vdb": {}}

The update then fails during the ring rebalance.
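For reference, a minimal sketch of the update step, assuming the parameter_defaults above were saved to a file named swift-disks.yaml (hypothetical name and path) and that any environment files from the original deploy command are repeated as well:

~~~
# Sketch only: re-run the original deploy command with the additional
# environment file appended to trigger the stack update.
source ~/stackrc
openstack overcloud deploy --templates \
  -e /home/stack/swift-disks.yaml
~~~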

Comment 1 Red Hat Bugzilla Rules Engine 2017-06-08 14:45:20 UTC
This bugzilla has been removed from the release and needs to be reviewed and Triaged for another Target Release.

Comment 2 Christian Schwede (cschwede) 2017-07-24 08:48:24 UTC
This also applies if the rebalance is sub-optimal, i.e. the rebalance was executed, but the partition distribution could still be improved. In that case another rebalance should be done later on. However, if the next rebalance is executed within the minimum rebalance time, it will again return an exit code of 1.

All of this should be fixed by the proposed patch.
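For illustration, swift-ring-builder reports warnings (such as a sub-optimal rebalance, or a rebalance attempted again within the minimum rebalance time) with exit code 1, which the unpatched puppet exec treats as a failure. A quick way to observe this on a controller, assuming an existing /etc/swift/object.builder:

~~~
# Run the rebalance twice in a row; the second run usually only prints a
# warning but still exits with status 1.
swift-ring-builder /etc/swift/object.builder rebalance
swift-ring-builder /etc/swift/object.builder rebalance
echo $?   # typically 1 on a warning; the fix makes puppet accept this code
~~~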

Comment 3 Andreas Karis 2017-07-24 17:50:10 UTC
Hi Christian,

The customer has a huge Ceph cluster, and according to them a rebalance could take some time there. Note that this is happening on a stack CREATE, not during an UPDATE. Frankly, I have no idea why Ceph would have an influence on the Swift part:

++++++++++++++++++++++++++++++++++++++++++++

RH: 

Could you apply the following change to test this?

https://review.openstack.org/#/c/472253/1/manifests/ringbuilder/rebalance.pp

Apply this change on the controller nodes, in file /etc/puppet/modules/swift/manifests/ringbuilder/rebalance.pp

So after a failed deploy with this error message, ssh to the 3 controllers, apply the change, and kick off a stack update.

Let's see if that fixes the issue; if it does, I'll get you a hotfix.


Customer:

Update completed successfully after the change.

Comment 4 Andreas Karis 2017-07-24 17:51:13 UTC
Christian, 

Can we get a hotfix for https://review.openstack.org/#/c/472253/1/manifests/ringbuilder/rebalance.pp

The customer can then virt-customize their images and upload the RPM. I can also tell them to virt-customize the code change directly ... I'm not a fan of that, though.

Thanks!!!

- Andreas

Comment 6 Christian Schwede (cschwede) 2017-07-26 15:13:15 UTC
*** Bug 1470789 has been marked as a duplicate of this bug. ***

Comment 8 Andreas Karis 2017-07-27 05:48:35 UTC
Workaround until we get the fixed RPMs:

+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

Ugly hack:
======================

https://review.openstack.org/#/c/472253/1/manifests/ringbuilder/rebalance.pp

Apply this change on the controller nodes, in file /etc/puppet/modules/swift/manifests/ringbuilder/rebalance.pp

So after a failed deploy with this error message, ssh to the 3 controllers, apply the change, and kick off a stack update.
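A rough sketch of how that could be scripted from the undercloud, assuming heat-admin access and placeholder controller IPs (both assumptions), with the patched rebalance.pp from the next workaround below saved locally:

~~~
# Placeholder controller addresses; substitute the real ones.
for ip in 192.0.2.10 192.0.2.11 192.0.2.12; do
  scp rebalance.pp heat-admin@${ip}:/tmp/rebalance.pp
  ssh heat-admin@${ip} "sudo cp /tmp/rebalance.pp /etc/puppet/modules/swift/manifests/ringbuilder/rebalance.pp"
done
~~~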

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

A bit less ugly hack:
========================

On the undercloud, go to the same directory where overcloud-full.qcow2 resides. Create a file named rebalance.pp:
~~~
# Swift::Ring::Rebalance
#   Rebalances the specified ring. Assumes that the ring already exists
#   and is stored at /etc/swift/${name}.builder
#
# == Parameters
#
# [*name*] Type of ring to rebalance. The ring file is assumed to be at the path
#   /etc/swift/${name}.builder
#
# [*seed*] Optional. Seed value used to seed Python's pseudo-random generator for ring building.
define swift::ringbuilder::rebalance(
  $seed = undef
) {

  include ::swift::deps

  validate_re($name, '^object|container|account$')
  if $seed {
    validate_re($seed, '^\d+$')
  }

  exec { "rebalance_${name}":
    command     => strip("swift-ring-builder /etc/swift/${name}.builder rebalance ${seed}"),
    path        => ['/usr/bin'],
    refreshonly => true,
    before      => Anchor['swift::config::end'],
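    # Accept exit code 1 as well: swift-ring-builder exits 1 for warnings,
    # e.g. a sub-optimal rebalance or a retry within the minimum rebalance time.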
    returns     => [0, 1],
  }
}
~~~

Then, execute:
~~~
virt-customize -a overcloud-full.qcow2 --upload rebalance.pp:/etc/puppet/modules/swift/manifests/ringbuilder/rebalance.pp
~~~

Then, execute:
~~~
source stackrc
openstack overcloud image upload --update-existing --image-path .
~~~
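To double-check that the image actually carries the patched manifest before re-running the deployment, something like this should work (a sketch using libguestfs' virt-cat):

~~~
# Print the manifest straight out of the image and confirm the relaxed
# exit code handling ("returns => [0, 1]") is present.
virt-cat -a overcloud-full.qcow2 \
  /etc/puppet/modules/swift/manifests/ringbuilder/rebalance.pp | grep returns
~~~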

Comment 10 Andreas Karis 2017-07-27 15:32:08 UTC
There may be another workaround:


On the next iteration, when I deleted the stack and the overcloud plan, I ran swift list and found:

[stack@devbclu001 ~]$ swift list
overcloud-swift-rings

Somehow these are being left behind. 

I deleted them and attempted a fresh deploy:
[stack@devbclu001 ~]$ swift delete overcloud-swift-rings
swift-rings.tar.gz
overcloud-swift-rings

And the deployment succeeded. 

I have tested one live migration and it worked fine.
I plan to redeploy once more and then test several migrations. 

Will keep you posted.

Comment 11 Andreas Karis 2017-07-27 15:37:00 UTC
The above workaround is only for new deployments after an old deployment was deleted.

Comment 12 Andreas Karis 2017-07-27 16:14:54 UTC
Hi,


If there are old rings, they will be updated and therefore the rebalance will be executed. There is no rebalance if the rings didn't change, but the issue is still there even if you don't hit it then; it might happen later on. So the fix I sent to you is still needed.

The above workaround (deleting the overcloud-swift-rings container from Swift) does work for new deployments. According to engineering, the left-over rings are being handled in another bugzilla; with the provided fix the old rings will be updated, but of course it would be better if they were cleaned up beforehand.
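Consolidating comment 10, a sketch of that pre-deployment cleanup on the undercloud:

~~~
# Before a fresh deployment: list and remove left-over ring containers
# from a previously deleted deployment (see comment 10).
source ~/stackrc
swift list
swift delete overcloud-swift-rings
~~~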

- Andreas

Comment 20 Christian Schwede (cschwede) 2017-08-31 13:18:20 UTC
*** Bug 1468030 has been marked as a duplicate of this bug. ***

Comment 25 Christian Schwede (cschwede) 2017-09-06 07:54:45 UTC
*** Bug 1488290 has been marked as a duplicate of this bug. ***

Comment 26 errata-xmlrpc 2017-09-06 17:09:30 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:2654

