Bug 1341350 - rhel-osp-director: registering the overcloud images fails on first attempt with "500 Internal Server Error: Failed to upload image 51672726-cc40-40ea-9ca0-1f8b2267313c (HTTP 500)"
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: instack-undercloud
Version: 9.0 (Mitaka)
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ga
Target Release: 9.0 (Mitaka)
Assignee: Jiri Stransky
QA Contact: Alexander Chuzhoy
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2016-05-31 22:31 UTC by Alexander Chuzhoy
Modified: 2016-08-16 03:58 UTC

Fixed In Version: instack-undercloud-4.0.0-8
Doc Type: Bug Fix
Doc Text:
Slow environments experienced timeouts when Glance tried to communicate with Swift as a backend. This caused some Glance operations, such as image uploads, to fail. This fix increases the Swift proxy server's default node_timeout value to 60 seconds. This increases the reliability of Glance image uploads on slow environments using Swift as an image storage backend.
Clone Of:
Environment:
Last Closed: 2016-08-11 11:31:46 UTC
Target Upstream Version:


Attachments
glance logs (80.12 KB, application/x-gzip)
2016-05-31 22:32 UTC, Alexander Chuzhoy


Links
System | ID | Priority | Status | Summary | Last Updated
Red Hat Product Errata | RHEA-2016:1599 | normal | SHIPPED_LIVE | Red Hat OpenStack Platform 9 director Release Candidate Advisory | 2016-08-11 15:25:37 UTC
OpenStack gerrit | 332014 | None | MERGED | Increase swift-proxy node_timeout | 2019-11-14 09:12:03 UTC
Launchpad | 1594724 | None | None | None | 2016-06-21 08:42:27 UTC

Description Alexander Chuzhoy 2016-05-31 22:31:54 UTC
rhel-osp-director:  registering the overcloud images fails on first attempt with "500 Internal Server Error: Failed to upload image 51672726-cc40-40ea-9ca0-1f8b2267313c (HTTP 500)"


Environment:
instack-undercloud-4.0.0-2.el7ost.noarch
python-glance-store-0.13.0-1.el7ost.noarch
openstack-tripleo-heat-templates-liberty-2.0.0-8.el7ost.noarch
openstack-tripleo-heat-templates-2.0.0-8.el7ost.noarch
python-glanceclient-2.0.0-1.el7ost.noarch
openstack-glance-12.0.0-1.el7ost.noarch
python-glance-12.0.0-1.el7ost.noarch
openstack-tripleo-heat-templates-kilo-2.0.0-8.el7ost.noarch

Steps to reproduce:
1. Deploy undercloud
2. Download the tarballs with the overcloud images and extract the tarballs.
3. Run 'openstack overcloud image upload'.

Result:
Image "overcloud-full-vmlinuz" was uploaded.
+--------------------------------------+------------------------+-------------+---------+--------+
|                  ID                  |          Name          | Disk Format |   Size  | Status |
+--------------------------------------+------------------------+-------------+---------+--------+
| fd2cc191-4022-4b78-8a73-67abf7214770 | overcloud-full-vmlinuz |     aki     | 5153536 | active |
+--------------------------------------+------------------------+-------------+---------+--------+
Image "overcloud-full-initrd" was uploaded.                                                       
+--------------------------------------+-----------------------+-------------+----------+--------+
|                  ID                  |          Name         | Disk Format |   Size   | Status |
+--------------------------------------+-----------------------+-------------+----------+--------+
| 42fa8e1b-5fcf-4562-a60c-d3867ac9e1d9 | overcloud-full-initrd |     ari     | 46766672 | active |
+--------------------------------------+-----------------------+-------------+----------+--------+
500 Internal Server Error: Failed to upload image caf1631b-98fe-4684-a7dc-8ba9290ba985 (HTTP 500) 


Workaround: rerun the command "openstack overcloud image upload":


Image "overcloud-full-vmlinuz" is up-to-date, skipping.
Image "overcloud-full-initrd" is up-to-date, skipping.
Image "overcloud-full" was uploaded.
+--------------------------------------+----------------+-------------+------------+--------+
|                  ID                  |      Name      | Disk Format |    Size    | Status |
+--------------------------------------+----------------+-------------+------------+--------+
| 35230f8c-6547-47e7-9585-da8fd73af10b | overcloud-full |    qcow2    | 1151860736 | active |
+--------------------------------------+----------------+-------------+------------+--------+
Image "bm-deploy-kernel" was uploaded.
+--------------------------------------+------------------+-------------+---------+--------+
|                  ID                  |       Name       | Disk Format |   Size  | Status |
+--------------------------------------+------------------+-------------+---------+--------+
| 61c3e229-265f-44b0-ae15-9aa12f8790aa | bm-deploy-kernel |     aki     | 5153536 | active |
+--------------------------------------+------------------+-------------+---------+--------+
Image "bm-deploy-ramdisk" was uploaded.
+--------------------------------------+-------------------+-------------+-----------+--------+
|                  ID                  |        Name       | Disk Format |    Size   | Status |
+--------------------------------------+-------------------+-------------+-----------+--------+
| f6d96003-31c3-4be7-b7cd-b6916b788960 | bm-deploy-ramdisk |     ari     | 406186290 | active |
+--------------------------------------+-------------------+-------------+-----------+--------+


Reproduced the issue several times.
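
Because a rerun skips the images that are already registered, the workaround above can be wrapped in a retry loop. The following is a minimal sketch and not part of any fix; the `retry` helper and its attempt and delay arguments are hypothetical names chosen for illustration.

```shell
# retry MAX DELAY CMD...: run CMD until it succeeds or MAX attempts are used.
# Hypothetical helper; it is not shipped with the director tooling.
retry() {
    retry_max="$1"
    retry_delay="$2"
    retry_attempt=1
    shift 2
    until "$@"; do
        if [ "$retry_attempt" -ge "$retry_max" ]; then
            echo "giving up after $retry_attempt attempts: $*" >&2
            return 1
        fi
        echo "attempt $retry_attempt failed, retrying in ${retry_delay}s" >&2
        retry_attempt=$((retry_attempt + 1))
        sleep "$retry_delay"
    done
}
```

For example, `retry 3 10 openstack overcloud image upload` would attempt the upload up to three times, ten seconds apart, and stop at the first success.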

Comment 2 Alexander Chuzhoy 2016-05-31 22:32:58 UTC
Created attachment 1163383
glance logs.

Comment 3 Marius Cornea 2016-06-08 20:26:59 UTC
From what I can tell, the image upload to Swift is timing out. The openstack-swift-proxy.service journal shows the following:

ERROR with Object server 192.0.2.1:6000/1 re: Trying to get final status of PUT to /v1/AUTH_4b8a69b9d00b41babd2041819f4bec39/glance/a9f16a3f-5ca2-4060-a303-476d696bcec7: Timeout (10.0s)
Object PUT returning 503 for [503] (txn: tx2c5dab211085411b857ad-0057587a8a) (client_ip: 192.0.2.1)

Note that I've only seen it on virt environments. In my testing, after switching the cache mode of the undercloud VM disk from unsafe to default, I could no longer reproduce the issue.
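
To check whether another environment is hitting the same timeout, the proxy journal can be filtered for the error quoted above. A minimal sketch follows; the `swift_put_timeouts` function name is hypothetical, and the pattern is derived only from the log lines in this comment, so other Swift versions may word the message differently.

```shell
# Read log lines on stdin and print those that report a timed-out object PUT.
# Hypothetical helper; the regular expression is based on the sample above.
swift_put_timeouts() {
    grep -E 'ERROR with Object server .*PUT.*Timeout \([0-9.]+s\)'
}
```

It could be used as `sudo journalctl -u openstack-swift-proxy --no-pager | swift_put_timeouts`; the number in parentheses at the end of each matching line is the timeout that expired.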

Comment 4 Jiri Stransky 2016-06-09 14:43:49 UTC
Is this still an issue? I've deployed a virt environment on Monday 6th June and didn't hit this. Please feel free to e-mail/irc me environment login details when we have it reproduced.

Comment 5 Alexander Chuzhoy 2016-06-09 17:14:13 UTC
I haven't reproduced it (yet) with the last build, despite:
[root@instack ~]# grep default_store /etc/glance/glance-api.conf
#default_store = file
default_store = swift

and
cache='unsafe'

Comment 6 Alexander Chuzhoy 2016-06-14 19:54:47 UTC
Reproduced with 
[root@instack ~]# grep default_store /etc/glance/glance-api.conf
#default_store = file
default_store = swift

and
cache='unsafe' in VM's xml.

Comment 7 Jiri Stransky 2016-06-16 10:11:05 UTC
I still didn't hit this, with default_store = swift, unsafe cache in VM, and virtual environment.

Sasha, can you please ping me with an environment where the issue appeared?

Comment 8 Jiri Stransky 2016-06-20 11:55:40 UTC
I think both the error that I had the opportunity to look at in the environment and the one Marius pasted above are timeouts controlled by the node_timeout setting of proxy-server (the 10.0s in the log matches that setting's default of 10 seconds). It could be that the virtual environment is just too slow for the default timeout values.

Sasha, you mentioned you can reproduce this fairly reliably. Could you please check whether it's still reproducible after running the following commands?

sudo crudini --set /etc/swift/proxy-server.conf app:proxy-server node_timeout 30
sudo systemctl restart openstack-swift-proxy
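
Before retrying the upload, it may be worth confirming that the new value actually landed in the [app:proxy-server] section. `crudini --get` with the same file, section and key should print it. Where crudini is not at hand, a small awk-based reader such as the hypothetical `ini_get` below is one way to do the same check; it is a sketch that handles only simple `key = value` lines.

```shell
# ini_get FILE SECTION KEY: print the value of KEY inside [SECTION], if set.
# Hypothetical helper; handles plain "key = value" lines only.
ini_get() {
    awk -v section="$2" -v key="$3" '
        /^[ \t]*[#;]/ { next }                  # skip comment lines
        /^[ \t]*\[/ {                           # section header
            gsub(/^[ \t]*\[|\][ \t]*$/, "")     # strip the brackets
            in_section = ($0 == section)
            next
        }
        in_section {
            pos = index($0, "=")
            if (pos == 0) next
            k = substr($0, 1, pos - 1)
            v = substr($0, pos + 1)
            gsub(/^[ \t]+|[ \t]+$/, "", k)      # trim whitespace around key
            gsub(/^[ \t]+|[ \t]+$/, "", v)      # trim whitespace around value
            if (k == key) value = v             # last assignment wins
        }
        END { if (value != "") print value }
    ' "$1"
}
```

After the two commands above, `ini_get /etc/swift/proxy-server.conf app:proxy-server node_timeout` should print 30.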

Comment 9 Ben Nemec 2016-06-20 17:32:34 UTC
I've run into this a number of times over the years in different environments, both virt and baremetal, but it's not necessarily reproducible even on the same hardware and software versions.

The other setting that _seems_ to help with this in my experience is this one from Glance:

# The size, in MB, that Glance will start chunking image files and do
# a large object manifest in Swift. (integer value)
#swift_store_large_object_size=5120

I set this to 500 or 1000 so Glance will upload the image in smaller chunks that don't seem to time out. Although again, this is a fairly intermittent problem, so it's hard to say whether changing that fixed the problem or I just got lucky. :-)
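
To illustrate why lowering the threshold changes the upload pattern, the sketch below estimates how many Swift objects one image turns into. It rests on two assumptions that are not stated in this report: that an image smaller than swift_store_large_object_size goes up as a single object, and that larger images are split into pieces of swift_store_large_object_chunk_size, taken here as 200 MB. The `swift_segments` name is hypothetical.

```shell
# swift_segments IMAGE_BYTES THRESHOLD_MB CHUNK_MB
# Estimate the number of objects written for one image (assumed behaviour).
swift_segments() {
    seg_image_bytes="$1"
    seg_threshold_bytes=$(( $2 * 1024 * 1024 ))
    seg_chunk_bytes=$(( $3 * 1024 * 1024 ))
    if [ "$seg_image_bytes" -lt "$seg_threshold_bytes" ]; then
        echo 1      # below the threshold: one single PUT
    else
        # at or above the threshold: ceiling division by the chunk size
        echo $(( (seg_image_bytes + seg_chunk_bytes - 1) / seg_chunk_bytes ))
    fi
}

swift_segments 1151860736 5120 200   # overcloud-full, default threshold: 1
swift_segments 1151860736 500 200    # overcloud-full, threshold of 500: 6
```

Under these assumptions the 1151860736-byte overcloud-full image is one long PUT with the default threshold, but six shorter PUTs once the threshold is 500.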

Changing the proxy timeout seems like a reasonable solution too, so +1 to making that change.  I would suggest doing it on the overcloud as well, since I and a few others have run into this there too.

Comment 10 Alexander Chuzhoy 2016-06-21 02:33:05 UTC
Jiri,
I reproduced the issue after setting:
sudo crudini --set /etc/swift/proxy-server.conf app:proxy-server node_timeout 30

But after setting:
sudo crudini --set /etc/swift/proxy-server.conf app:proxy-server node_timeout 60
The issue didn't reproduce.

Comment 12 Alexander Chuzhoy 2016-07-20 20:26:19 UTC
Environment:
instack-undercloud-4.0.0-8.el7ost.noarch

Wasn't able to reproduce the issue in the interim. Will try some more.

Comment 13 Alexander Chuzhoy 2016-07-26 14:45:30 UTC
Verified:
Per comment #12.

Comment 15 errata-xmlrpc 2016-08-11 11:31:46 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHEA-2016-1599.html

