Bug 1494441 - Inconsistent number of devices in swift ring
Summary: Inconsistent number of devices in swift ring
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-tripleo-common
Version: 10.0 (Newton)
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: z7
Target Release: 10.0 (Newton)
Assignee: Christian Schwede (cschwede)
QA Contact: Mike Abrams
URL:
Whiteboard:
Duplicates: 1494872
Depends On:
Blocks: 1507101
 
Reported: 2017-09-22 09:25 UTC by Petersingh Anburaj
Modified: 2022-08-16 12:32 UTC
CC List: 20 users

Fixed In Version: openstack-tripleo-common-5.4.7-1.el7ost
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-03-07 13:52:34 UTC
Target Upstream Version:
Embargoed:




Links
System ID Status Summary Last Updated
OpenStack gerrit 504886 MERGED Fix using old Swift rings when creating a new stack 2021-02-01 14:18:35 UTC
Red Hat Issue Tracker OSP-4707 None None 2022-08-16 12:32:03 UTC

Comment 1 Christian Schwede (cschwede) 2017-10-04 13:12:52 UTC
@Petersingh: was there an earlier deployment which used 172.19.0.13? Was the overcloud deleted and recreated in between? Which z-stream version was used?

Comment 3 Christian Schwede (cschwede) 2017-10-05 11:45:14 UTC
Petersingh: ah, ok. The problem is that the Swift rings from the previous deployment are kept and re-used even when the stack has been deleted. We're working on a backported fix at the moment (https://review.openstack.org/#/c/504886/).

In the meantime, the customer has to delete the rings from the old deployment before redeploying - but only if the stack is really deleted and recreated from scratch.

On the undercloud:

$ source stackrc
$ swift delete overcloud-swift-rings
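
A quick way to verify the rings container is gone afterwards (assuming the stackrc credentials are still sourced) is to check that it no longer shows up in the container listing:

$ swift list | grep overcloud-swift-rings    # should print nothing once the container is deleted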

If the customer wants to fix the existing deployment, this needs to be done manually:

$ source stackrc 
$ mkdir tmp && cd tmp
$ swift download overcloud-swift-rings swift-rings.tar.gz
$ tar xzvf swift-rings.tar.gz
# Repeat the following actions for account.builder, container.builder and object.builder. The first command lists the devices, the second command removes a device by IP
$ swift-ring-builder etc/swift/account.builder 
$ swift-ring-builder etc/swift/account.builder remove 172.19.0.13
$ swift-ring-builder etc/swift/account.builder rebalance
# Finally, update the tar file and upload it
$ tar cvzf swift-rings.tar.gz etc/
$ swift upload overcloud-swift-rings swift-rings.tar.gz

The updated rings will be used once the customer updates the overcloud again.
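
The same removal can also be scripted across all three builders - a minimal sketch, assuming 172.19.0.13 is the stale device IP reported by the list command (substitute the addresses that actually need to be removed):

$ for builder in account container object; do
>   swift-ring-builder etc/swift/${builder}.builder                      # list devices
>   swift-ring-builder etc/swift/${builder}.builder remove 172.19.0.13   # remove the stale device
>   swift-ring-builder etc/swift/${builder}.builder rebalance
> done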

Comment 4 Miro Halas 2017-10-08 07:02:22 UTC
This bug is particularly nasty as it leads to several delayed effects/failures which are hard to troubleshoot.

If the user just deletes the stack and then redeploys, and the user doesn't use deterministic IP mapping, TripleO generates new IP addresses for the controller/Swift nodes and then appends them to the existing ring.

For example, if the initial, correctly built single-controller deployment had this ring configuration

[root@overcloud-controller-0 swift]# swift-ring-builder account.builder
account.builder, build version 2
1024 partitions, 1.000000 replicas, 1 regions, 1 zones, 1 devices, 0.00 balance, 0.00 dispersion
The minimum number of hours before a partition can be reassigned is 1 (0:00:00 remaining)
The overload factor is 0.00% (0.000000)
Ring file account.ring.gz is up-to-date
Devices:   id region zone  ip address:port replication ip:port  name weight partitions balance flags meta
            0      1    1 172.24.1.15:6002    172.24.1.15:6002    d1 100.00       1024    0.00


after several redeployments the same single controller ends up with the following ring configuration

[root@overcloud-controller-0 swift]# swift-ring-builder account.builder
account.builder, build version 12
1024 partitions, 1.000000 replicas, 1 regions, 1 zones, 6 devices, 0.39 balance, 0.00 dispersion
The minimum number of hours before a partition can be reassigned is 1 (0:00:00 remaining)
The overload factor is 0.00% (0.000000)
Ring file account.ring.gz is up-to-date
Devices:   id region zone  ip address:port replication ip:port  name weight partitions balance flags meta
            0      1    1 172.24.1.16:6002    172.24.1.16:6002    d1 100.00        171    0.20
            1      1    1 172.24.1.20:6002    172.24.1.20:6002    d1 100.00        170   -0.39
            2      1    1 172.24.1.11:6002    172.24.1.11:6002    d1 100.00        171    0.20
            3      1    1 172.24.1.17:6002    172.24.1.17:6002    d1 100.00        170   -0.39
            4      1    1 172.24.1.15:6002    172.24.1.15:6002    d1 100.00        171    0.20
            5      1    1 172.24.1.14:6002    172.24.1.14:6002    d1 100.00        171    0.20

5 out of the 6 IPs are no longer valid, as no Swift services are running on them, which then causes several different types of failures, e.g.:

1. in glance when uploading image

2017-10-07 23:08:47.660 142469 INFO swiftclient [req-8e9b3ee5-c6cd-451a-9ffe-9ae8ca1def90 eb53ce62d2c844cbb1cca3322d1ba8d0 f05f5cfe55a24c73a7d76900b283d82d - default default] REQ: curl -i http://172.23.1.17:8080/v1/AUTH_95c22f6d51fc41a29880ca920db012ba/glance/e90061ac-58a0-4224-b62a-354fc18c5d89 -X PUT -H "Content-Length: 0" -H "ETag: d41d8cd98f00b204e9800998ecf8427e" -H "Content-Type: " -H "X-Object-Manifest: glance/e90061ac-58a0-4224-b62a-354fc18c5d89-" -H "X-Auth-Token: 182892dfcec24166..."
2017-10-07 23:08:47.661 142469 INFO swiftclient [req-8e9b3ee5-c6cd-451a-9ffe-9ae8ca1def90 eb53ce62d2c844cbb1cca3322d1ba8d0 f05f5cfe55a24c73a7d76900b283d82d - default default] RESP STATUS: 503 Service Unavailable

2. in openstack-swift-proxy.service

Oct 07 23:17:00 overcloud-controller-0 proxy-server[86865]: ERROR with Account server 172.24.1.15:6002/d1 re: Trying to HEAD /v1/AUTH_95c22f6d51fc41a29880ca920db012ba: ConnectionTimeout (0.5s) (txn: txad23260632c34f0dbd083-0059d9606a)
Oct 07 23:17:01 overcloud-controller-0 proxy-server[86865]: ERROR with Account server 172.24.1.17:6002/d1 re: Trying to HEAD /v1/AUTH_95c22f6d51fc41a29880ca920db012ba: ConnectionTimeout (0.5s) (txn: txad23260632c34f0dbd083-0059d9606a)
Oct 07 23:17:01 overcloud-controller-0 proxy-server[86865]: Account HEAD returning 503 for [] (txn: txad23260632c34f0dbd083-0059d9606a)

3. in openstack-swift-container-updater.service

Oct 07 23:03:14 overcloud-controller-0 container-server[87534]: Begin container update sweep
Oct 07 23:03:15 overcloud-controller-0 container-server[290429]: ERROR account update failed with 172.24.1.15:6002/d1 (will retry later): : ConnectionTimeout (0.5s)
Oct 07 23:03:15 overcloud-controller-0 container-server[87534]: Container update sweep completed: 0.53s

4. in openstack-swift-object-replicator.service

Oct 07 23:16:45 overcloud-controller-0 object-server[86738]: Starting object replication pass.
Oct 07 23:16:48 overcloud-controller-0 object-server[86738]: rsync: failed to connect to 172.24.1.20 (172.24.1.20): No route to host (113)
Oct 07 23:16:48 overcloud-controller-0 object-server[86738]: rsync error: error in socket IO (code 10) at clientserver.c(122) [sender=3.0.9]
Oct 07 23:16:48 overcloud-controller-0 object-server[86738]: Bad rsync return code: 10 <- ['rsync', '--recursive', '--whole-file', '--human-readable', '--xattrs', '--itemize-changes', '--ignore-existing', '--timeout=30', '--contimeout=30', '--bwlimit=0', '--exclude=.*.[0-9a-zA-Z][0-9a-zA-Z][0-9a-zA-Z][0-9a-zA-Z][0-9a-zA-Z][0-9a-zA-Z]', u'/srv/node/d1/objects/699/718', u'172.24.1.20::object/d1/objects/699']
Oct 07 23:16:48 overcloud-controller-0 object-server[86738]: 1/1 (100.00%) partitions replicated in 3.03s (0.33/sec, 0s remaining)
Oct 07 23:16:48 overcloud-controller-0 object-server[86738]: 0 successes, 0 failures
Oct 07 23:16:48 overcloud-controller-0 object-server[86738]: Object replication complete. (0.05 minutes)

5. in openstack-swift-container-replicator.service

Oct 07 23:16:31 overcloud-controller-0 container-server[87161]: no_change:1 ts_repl:0 diff:0 rsync:0 diff_capped:0 hashmatch:0 empty:0
Oct 07 23:16:59 overcloud-controller-0 container-server[87161]: Beginning replication run
Oct 07 23:17:01 overcloud-controller-0 container-server[87161]: ERROR reading HTTP response from {'index': 0, u'replication_port': 6001, u'weight': 100.0, u'zone': 1, u'ip': u'172.24.1.17', u'region': 1, u'id': 3, u'replication_ip': u'172.24.1.17', u'meta': u'', u'device': u'd1', u'port': 6001}: Host unreachable
Oct 07 23:17:01 overcloud-controller-0 container-server[87161]: Replication run OVER
Oct 07 23:17:01 overcloud-controller-0 container-server[87161]: Attempted to replicate 2 dbs in 2.70254 seconds (0.74005/s)
Oct 07 23:17:01 overcloud-controller-0 container-server[87161]: Removed 0 dbs
Oct 07 23:17:01 overcloud-controller-0 container-server[87161]: 1 successes, 1 failures

6. in openstack-swift-object-expirer.service

Oct 07 23:14:59 overcloud-controller-0 object-expirer[86780]: Pass beginning; 0 possible containers; 0 possible objects (txn: tx63ee551ebbb24c5db9cdc-0059d95ff3)
Oct 07 23:15:00 overcloud-controller-0 swift[86780]: ERROR with Account server 172.24.1.17:6002/d1 re: Trying to GET /v1/.expiring_objects: ConnectionTimeout (0.5s) (txn: txaba66fab721a4b53977df-0059d95ff3)
Oct 07 23:15:00 overcloud-controller-0 object-expirer[86780]: STDERR: ERROR:root:Error connecting to memcached: 127.0.0.1:11211#012Traceback (most recent call last):#012  File "/usr/lib/python2.7/site-packages/swift/common/memcached.py", line 214, in _get_conns#012    fp, sock = self._client_cache[server].get()#012  File "/usr/lib/python2.7/site-packages/swift/common/memcached.py", line 132, in get#012    fp, sock = self.create()#012  File "/usr/lib/python2.7/site-packages/swift/common/memcached.py", line 125, in create#012    sock.connect(sockaddr)#012  File "/usr/lib/python2.7/site-packages/eventlet/greenio/base.py", line 242, in connect#012    socket_checkerr(fd)#012  File "/usr/lib/python2.7/site-packages/eventlet/greenio/base.py", line 46, in socket_checkerr#012    raise socket.error(err, errno.errorcode[err])#012error: [Errno 111] ECONNREFUSED (txn: txaba66fab721a4b53977df-0059d95ff3)
Oct 07 23:15:00 overcloud-controller-0 object-expirer[86780]: Pass completed in 1s; 0 objects expired (txn: txaba66fab721a4b53977df-0059d95ff3)

The proposed fix addresses the problem; after deleting/renaming the rings object, redeployment results in a cleanly deployed Swift service.
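
For anyone debugging a similar deployment, a minimal way to spot such stale entries (assuming the account server listens on port 6002, as in the listings above) is to compare the device IPs in the ring against the addresses actually configured on the node:

[root@overcloud-controller-0 swift]# swift-ring-builder account.builder | grep -o '[0-9.]*:6002' | sort -u
[root@overcloud-controller-0 swift]# ip -4 addr show | grep 'inet '

Any IP returned by the first command that is not present on any Swift node is a leftover from a previous deployment.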

Comment 7 Filip Hubík 2017-11-21 10:36:01 UTC
I also hit this very annoying bug in an RHOS10 baremetal deployment (puddle 2017-11-08.3). The workaround mentioned above,
$ source stackrc
$ swift delete overcloud-swift-rings

helped, but I had to delete and redeploy the whole overcloud again.

Comment 18 Christian Schwede (cschwede) 2018-02-22 16:43:52 UTC
*** Bug 1494872 has been marked as a duplicate of this bug. ***

Comment 20 Lon Hohberger 2018-03-07 13:52:34 UTC
According to our records, this should be resolved by openstack-tripleo-common-5.4.7-1.el7ost.  This build is available now.
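
A simple way to confirm that build is what ended up on the undercloud (a plain rpm query, nothing OSP-specific):

$ rpm -q openstack-tripleo-common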

