Description of problem:
I am getting ChunkWriteTimeout errors from Swift when uploading images to Glance. This appears to be due to a slow disk combined with the default node_timeout of 10s in the app section of the Swift proxy-server.conf. Setting node_timeout=60 stops the problem.

Version-Release number of selected component (if applicable):
openstack-swift-2.3.0-1.el7ost.noarch
openstack-swift-account-2.3.0-1.el7ost.noarch
openstack-swift-container-2.3.0-1.el7ost.noarch
openstack-swift-object-2.3.0-1.el7ost.noarch
openstack-swift-plugin-swift3-1.7-3.el7ost.noarch
openstack-swift-proxy-2.3.0-1.el7ost.noarch

How reproducible:
Frequent. Sometimes I can go through a couple dozen attempts to upload an image without success; other times fewer.

Steps to Reproduce:
1. Run Swift on a controller hosted on a VM (this seems to be a good way to trigger the issue).
2. Upload an image to Glance (RHEL 7 or CFME is sufficient in size).

Actual results:
Glance returns a 500 Internal Server Error. Digging through the Glance and Swift logs points to Swift as the problem.

Expected results:
The upload should work.

Additional info:
Is there a way to configure this from the director? I did not see an option. It would be nice if the default for an install could be raised or adjusted via the plan.
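For reference, the workaround described above amounts to raising the timeout in the proxy app section of the Swift proxy configuration. A minimal sketch (the file path and surrounding lines are assumed; node_timeout itself is the standard Swift proxy option):

```ini
# /etc/swift/proxy-server.conf (path assumed)
[app:proxy-server]
use = egg:swift#proxy
# Default is 10 seconds; raising it to 60 avoids ChunkWriteTimeout on
# slow disks, at the cost of slower detection of failed storage nodes.
node_timeout = 60
```

The proxy service must be restarted for the change to take effect.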
If this bug requests changes in the Director, shouldn't it be assigned to the corresponding component? I think it's either rhel-osp-installer or rhel-osp-director. [BTW, the reason Swift sets the default node timeout to a rather low value of 10s is that it makes sense when you have many nodes. The proxy must reach quorum quickly, and if it has two broken nodes, the effective delay increases. Some clients only wait 60 seconds for the start of the data transfer, so all of this machinery has to act quickly. It's well known that Swift won't work well on nodes with long service times.]
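To illustrate why a large node_timeout is risky in a cluster with failed nodes, here is a rough back-of-the-envelope sketch (my own simplification for this discussion, not actual Swift code): if the proxy must time out on each unresponsive node before trying the next, each dead node can add up to node_timeout seconds before the request proceeds.

```python
def worst_case_delay(node_timeout: float, broken_nodes: int) -> float:
    """Rough upper bound on extra latency when the proxy has to wait
    out the timeout on each unresponsive node before moving on."""
    return node_timeout * broken_nodes

# With the default 10s timeout and 2 broken nodes: up to ~20s of delay.
# With a 60s timeout, the same failure scenario could stall for ~120s,
# longer than many clients will wait for the transfer to start.
print(worst_case_delay(10, 2))
print(worst_case_delay(60, 2))
```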
I'm fine with it being moved to the director component if a way to change the setting hasn't yet been implemented, and I take what you're saying as true. The problem I am hitting is that we are trying to automate installation of OSP and CloudForms for RHCI, and the CFME image upload almost always fails in testing. Simply setting a higher node timeout would just reverse the problem: it would work well for a few low-power nodes like those in a test environment, but poorly for several proper nodes in a production environment, which is in no way better. So if we had a way to make the change manually prior to deployment, I'd be fine with that as a solution.
This bug did not make the OSP 8.0 release. It is being deferred to OSP 10.
Clearing the OSP10 target; this goes into the list to triage for the next release. This may be related to a recent issue, or it may no longer be a bug at all, since it was originally reported against OSP 7.
Ideally, we should be able to set options in TripleO that differ per deployment type. For example, if this is a test (or PoC) deployment with very few nodes, the timeout could be increased; but for a real-world production deployment, the correct (Swift default) timeout should remain in place.
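One possible shape for this, as a sketch only: an optional TripleO environment file that operators include for small test deployments. This assumes the puppet-swift swift::proxy class exposes a node_timeout parameter and that it can be overridden via hieradata through ExtraConfig; the file name is hypothetical.

```yaml
# environments/swift-node-timeout.yaml (hypothetical file name)
parameter_defaults:
  ControllerExtraConfig:
    # Raise the Swift proxy node timeout for small PoC/test deployments
    # only; production clusters should keep the Swift default (10s).
    swift::proxy::node_timeout: 60
```

Operators would pass this with an extra -e argument to the overcloud deploy command for test environments and omit it in production.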
I think I filed this against OSP 7 originally. Since we moved to OSP 8 I have not encountered problems with this.
Marking this as MODIFIED; the increased timeout is included in the builds for OSP10.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHEA-2016-2948.html