Description of problem:
I am getting ChunkWriteTimeout errors from Swift when uploading images to Glance. This appears to be due to a slow disk combined with the default node_timeout of 10s in the app section of the Swift proxy-server.conf. Setting node_timeout=60 stops the problem.

Version-Release number of selected component (if applicable):
openstack-swift-2.3.0-1.el7ost.noarch
openstack-swift-account-2.3.0-1.el7ost.noarch
openstack-swift-container-2.3.0-1.el7ost.noarch
openstack-swift-object-2.3.0-1.el7ost.noarch
openstack-swift-plugin-swift3-1.7-3.el7ost.noarch
openstack-swift-proxy-2.3.0-1.el7ost.noarch

How reproducible:
Frequent. Sometimes I can go through a couple dozen attempts to upload an image without success; other times fewer.

Steps to Reproduce:
1. Run Swift on a controller hosted on a VM (this seems to be a good way to trigger the issue).
2. Upload an image to Glance (RHEL 7 or CFME is sufficient in size).

Actual results:
Glance returns a 500 Internal Server Error. Digging through the Glance and Swift logs points to Swift as the problem.

Expected results:
The upload should work.

Additional info:
Is there a way to configure this from the director? I did not see an option. It would be nice if the default for an install could be raised or adjusted via the plan.
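For reference, the workaround described above amounts to raising the timeout in the proxy app section of the Swift proxy configuration. A minimal sketch (the file path and surrounding lines are assumed; node_timeout itself is the standard Swift proxy option):

```ini
# /etc/swift/proxy-server.conf (path assumed)
[app:proxy-server]
use = egg:swift#proxy
# Default is 10 seconds; raising it to 60 avoids ChunkWriteTimeout on
# slow disks, at the cost of slower detection of failed storage nodes.
node_timeout = 60
```

The proxy service must be restarted for the change to take effect.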
If this bug requests changes in the Director, shouldn't it be assigned to the corresponding component? I think it's either rhel-osp-installer or rhel-osp-director. [BTW, the reason Swift sets the default node timeout to a rather low value of 10s is that it makes sense when you have many nodes. The proxy must reach quorum quickly, and if it has two broken nodes, the effective delay increases. Some clients only wait 60 seconds for the start of the data transfer, so all of this machinery has to act quickly. It's well known that Swift won't work well on nodes with long service times.]
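To illustrate why a large node_timeout is risky in a cluster with failed nodes, here is a rough back-of-the-envelope sketch (my own simplification for this discussion, not actual Swift code): if the proxy must time out on each unresponsive node before trying the next, each dead node can add up to node_timeout seconds before the request proceeds.

```python
def worst_case_delay(node_timeout: float, broken_nodes: int) -> float:
    """Rough upper bound on extra latency when the proxy has to wait
    out the timeout on each unresponsive node before moving on."""
    return node_timeout * broken_nodes

# With the default 10s timeout and 2 broken nodes: up to ~20s of delay.
# With a 60s timeout, the same failure scenario could stall for ~120s,
# longer than many clients will wait for the transfer to start.
print(worst_case_delay(10, 2))
print(worst_case_delay(60, 2))
```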
I'm fine with it being moved to the director component if a way to change the setting hasn't yet been implemented, and I take what you're saying as true. The problem I am hitting is that we are trying to automate installation of OSP and CloudForms for RHCI, and the CFME image upload almost always fails in testing. Simply setting a higher node timeout would just reverse the problem: it would work well for a few low-power nodes like those in a test environment, but poorly for several proper nodes in a production environment, which is in no way better. So if we had a way to make the change manually prior to deployment, I'd be fine with that as a solution.
This bug did not make the OSP 8.0 release. It is being deferred to OSP 10.
Clearing the OSP10 target; this goes into the list to triage for the next release. This may be related to a recent issue, or it may no longer be a bug at all, since it was originally reported against OSP 7.
Ideally, we should be able to set options in TripleO that differ per deployment type. For example, if this is a test (or PoC) deployment with very few nodes, the timeout could be increased; but for a real-world production deployment, the correct (Swift default) timeout should remain in place.
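One possible shape for this, as a sketch only: an optional TripleO environment file that operators include for small test deployments. This assumes the puppet-swift swift::proxy class exposes a node_timeout parameter and that it can be overridden via hieradata through ExtraConfig; the file name is hypothetical.

```yaml
# environments/swift-node-timeout.yaml (hypothetical file name)
parameter_defaults:
  ControllerExtraConfig:
    # Raise the Swift proxy node timeout for small PoC/test deployments
    # only; production clusters should keep the Swift default (10s).
    swift::proxy::node_timeout: 60
```

Operators would pass this with an extra -e argument to the overcloud deploy command for test environments and omit it in production.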
I think I filed this against OSP 7 originally. Since we moved to OSP 8 I have not encountered problems with this.
Marking this as MODIFIED; the increased timeout is included in the builds for OSP10.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHEA-2016-2948.html