Bug 1341350

Summary: rhel-osp-director: registering the overcloud images fails on first attempt with "500 Internal Server Error: Failed to upload image 51672726-cc40-40ea-9ca0-1f8b2267313c (HTTP 500)"
Product: Red Hat OpenStack
Reporter: Alexander Chuzhoy <sasha>
Component: instack-undercloud
Assignee: Jiri Stransky <jstransk>
Status: CLOSED ERRATA
QA Contact: Alexander Chuzhoy <sasha>
Severity: unspecified
Priority: unspecified
Version: 9.0 (Mitaka)
CC: bnemec, dbecker, dmacpher, jason.dobies, jcoufal, jjoyce, mburns, mcornea, morazi, rhel-osp-director-maint, tvignaud
Target Milestone: ga
Keywords: Triaged
Target Release: 9.0 (Mitaka)
Hardware: Unspecified
OS: Unspecified
Fixed In Version: instack-undercloud-4.0.0-8
Doc Type: Bug Fix
Doc Text:
Slow environments experienced timeouts when Glance tried to communicate with Swift as a backend. This caused some Glance operations, such as image uploads, to fail. This fix increases the Swift proxy server's default node_timeout value to 60 seconds. This increases the reliability of Glance image uploads on slow environments using Swift as an image storage backend.
Last Closed: 2016-08-11 11:31:46 UTC
Type: Bug
Attachments:
glance logs.

Description Alexander Chuzhoy 2016-05-31 22:31:54 UTC
rhel-osp-director: registering the overcloud images fails on first attempt with "500 Internal Server Error: Failed to upload image 51672726-cc40-40ea-9ca0-1f8b2267313c (HTTP 500)"


Environment:
instack-undercloud-4.0.0-2.el7ost.noarch
python-glance-store-0.13.0-1.el7ost.noarch
openstack-tripleo-heat-templates-liberty-2.0.0-8.el7ost.noarch
openstack-tripleo-heat-templates-2.0.0-8.el7ost.noarch
python-glanceclient-2.0.0-1.el7ost.noarch
openstack-glance-12.0.0-1.el7ost.noarch
python-glance-12.0.0-1.el7ost.noarch
openstack-tripleo-heat-templates-kilo-2.0.0-8.el7ost.noarch

Steps to reproduce:
1. Deploy undercloud
2. Download the tarballs with the overcloud images and extract the tarballs.
3. Run 'openstack overcloud image upload'.

Result:
Image "overcloud-full-vmlinuz" was uploaded.
+--------------------------------------+------------------------+-------------+---------+--------+
|                  ID                  |          Name          | Disk Format |   Size  | Status |
+--------------------------------------+------------------------+-------------+---------+--------+
| fd2cc191-4022-4b78-8a73-67abf7214770 | overcloud-full-vmlinuz |     aki     | 5153536 | active |
+--------------------------------------+------------------------+-------------+---------+--------+
Image "overcloud-full-initrd" was uploaded.                                                       
+--------------------------------------+-----------------------+-------------+----------+--------+
|                  ID                  |          Name         | Disk Format |   Size   | Status |
+--------------------------------------+-----------------------+-------------+----------+--------+
| 42fa8e1b-5fcf-4562-a60c-d3867ac9e1d9 | overcloud-full-initrd |     ari     | 46766672 | active |
+--------------------------------------+-----------------------+-------------+----------+--------+
500 Internal Server Error: Failed to upload image caf1631b-98fe-4684-a7dc-8ba9290ba985 (HTTP 500) 


Workaround: rerun the command "openstack overcloud image upload":


Image "overcloud-full-vmlinuz" is up-to-date, skipping.
Image "overcloud-full-initrd" is up-to-date, skipping.
Image "overcloud-full" was uploaded.
+--------------------------------------+----------------+-------------+------------+--------+
|                  ID                  |      Name      | Disk Format |    Size    | Status |
+--------------------------------------+----------------+-------------+------------+--------+
| 35230f8c-6547-47e7-9585-da8fd73af10b | overcloud-full |    qcow2    | 1151860736 | active |
+--------------------------------------+----------------+-------------+------------+--------+
Image "bm-deploy-kernel" was uploaded.
+--------------------------------------+------------------+-------------+---------+--------+
|                  ID                  |       Name       | Disk Format |   Size  | Status |
+--------------------------------------+------------------+-------------+---------+--------+
| 61c3e229-265f-44b0-ae15-9aa12f8790aa | bm-deploy-kernel |     aki     | 5153536 | active |
+--------------------------------------+------------------+-------------+---------+--------+
Image "bm-deploy-ramdisk" was uploaded.
+--------------------------------------+-------------------+-------------+-----------+--------+
|                  ID                  |        Name       | Disk Format |    Size   | Status |
+--------------------------------------+-------------------+-------------+-----------+--------+
| f6d96003-31c3-4be7-b7cd-b6916b788960 | bm-deploy-ramdisk |     ari     | 406186290 | active |
+--------------------------------------+-------------------+-------------+-----------+--------+
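
The rerun workaround above can be automated. What follows is a minimal Python sketch, not part of the director tooling: the helper name run_with_retry, the attempt count, and the delay are all assumptions made for illustration. It relies on the behaviour shown in the output above, where a rerun skips images that are already up-to-date.

```python
import subprocess
import time


def run_with_retry(cmd, attempts=3, delay=5.0, runner=subprocess.run):
    """Run cmd until it exits with status 0; return the attempt number that succeeded.

    run_with_retry is a hypothetical helper. `runner` defaults to
    subprocess.run and can be replaced by any callable that returns an
    object with a `returncode` attribute.
    """
    for attempt in range(1, attempts + 1):
        result = runner(cmd)
        if result.returncode == 0:
            return attempt
        if attempt < attempts:
            # Pause before the next try; the value is arbitrary.
            time.sleep(delay)
    raise RuntimeError("command still failing after %d attempts: %r" % (attempts, cmd))


# Example call on the undercloud (not executed here):
# run_with_retry(["openstack", "overcloud", "image", "upload"])
```

This only hides the symptom; it does not address whatever makes the first upload fail.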


Reproduced the issue several times.

Comment 2 Alexander Chuzhoy 2016-05-31 22:32:58 UTC
Created attachment 1163383 [details]
glance logs.

Comment 3 Marius Cornea 2016-06-08 20:26:59 UTC
From what I can tell, the image upload to Swift is timing out. The openstack-swift-proxy.service journal shows the following:

ERROR with Object server 192.0.2.1:6000/1 re: Trying to get final status of PUT to /v1/AUTH_4b8a69b9d00b41babd2041819f4bec39/glance/a9f16a3f-5ca2-4060-a303-476d696bcec7: Timeout (10.0s)
Object PUT returning 503 for [503] (txn: tx2c5dab211085411b857ad-0057587a8a) (client_ip: 192.0.2.1)

Note that I've only seen it on virt environments. As a test, I switched the cache mode of the undercloud VM disk from unsafe to default, and I could no longer reproduce the issue.
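
To check whether other environments hit the same proxy timeout, the journal can be scanned for lines of this shape. The following is a rough Python sketch; the regular expression is derived only from the single log line quoted above, and the function name find_put_timeouts is hypothetical, so other Swift versions may format the message differently.

```python
import re

# Pattern based solely on the journal line quoted in this comment (an assumption).
TIMEOUT_RE = re.compile(r"PUT to (?P<path>\S+): Timeout \((?P<seconds>[\d.]+)s\)")


def find_put_timeouts(lines):
    """Return (object_path, timeout_seconds) for each PUT timeout line found."""
    hits = []
    for line in lines:
        match = TIMEOUT_RE.search(line)
        if match:
            hits.append((match.group("path"), float(match.group("seconds"))))
    return hits
```

The lines could come, for example, from saved journal output for the openstack-swift-proxy unit, read with open(...).readlines().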

Comment 4 Jiri Stransky 2016-06-09 14:43:49 UTC
Is this still an issue? I deployed a virt environment on Monday, 6th June, and didn't hit this. Please feel free to send me the environment login details by e-mail or IRC once it is reproduced.

Comment 5 Alexander Chuzhoy 2016-06-09 17:14:13 UTC
I haven't reproduced it (yet) with the last build, despite:
[root@instack ~]# grep default_store /etc/glance/glance-api.conf
#default_store = file
default_store = swift

and
cache='unsafe'

Comment 6 Alexander Chuzhoy 2016-06-14 19:54:47 UTC
Reproduced with 
[root@instack ~]# grep default_store /etc/glance/glance-api.conf
#default_store = file
default_store = swift

and
cache='unsafe' in VM's xml.

Comment 7 Jiri Stransky 2016-06-16 10:11:05 UTC
I still haven't hit this with default_store = swift, unsafe cache in the VM, and a virtual environment.

Sasha, can you please ping me with an environment where the issue appeared?

Comment 8 Jiri Stransky 2016-06-20 11:55:40 UTC
I think both the error that I had the opportunity to look at in the environment and the one Marius pasted above are timeouts controlled by the node_timeout setting of proxy-server. It could be that the virtual environment is just too slow for the default timeout values.

Sasha, you mentioned you can reproduce this fairly reliably. Could you please check whether it is still reproducible after running the following commands?

sudo crudini --set /etc/swift/proxy-server.conf app:proxy-server node_timeout 30
sudo systemctl restart openstack-swift-proxy
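
Where crudini is not available, the same setting can be written with Python's standard configparser module. This is a minimal sketch under a few assumptions: the function name set_node_timeout is hypothetical, the file must be writable by the caller, and configparser discards comments when it rewrites the file (crudini keeps them). The proxy service still needs the restart shown above.

```python
import configparser


def set_node_timeout(path, seconds, section="app:proxy-server"):
    """Set node_timeout in the given section of an INI file; return the stored value.

    Hypothetical helper. Comments in the file are lost on rewrite.
    """
    # Interpolation is disabled so literal '%' characters in values are kept as-is.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read(path)
    if not parser.has_section(section):
        parser.add_section(section)
    parser.set(section, "node_timeout", str(seconds))
    with open(path, "w") as handle:
        parser.write(handle)
    return parser.get(section, "node_timeout")


# Example call (not executed here):
# set_node_timeout("/etc/swift/proxy-server.conf", 30)
```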

Comment 9 Ben Nemec 2016-06-20 17:32:34 UTC
I've run into this a number of times over the years in different environments, both virt and baremetal, but it's not necessarily reproducible even on the same hardware and software versions.

The other setting that _seems_ to help with this in my experience is this one from Glance:

# The size, in MB, that Glance will start chunking image files and do
# a large object manifest in Swift. (integer value)
#swift_store_large_object_size=5120

I set this to 500 or 1000 so Glance will upload the image in smaller chunks that don't seem to time out. Although again, this is a fairly intermittent problem, so it's hard to say if changing that fixed the problem or if I just got lucky. :-)
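
A rough Python sketch of why the threshold matters for the overcloud-full image from the description (1151860736 bytes). It assumes that Glance uploads an image smaller than swift_store_large_object_size as a single object and otherwise splits it into segments of swift_store_large_object_chunk_size, taken here to default to 200 MB, and that one MB means 1024 * 1024 bytes. The function name swift_segment_count is hypothetical, and the manifest object is not counted.

```python
import math

ONE_MB = 1024 * 1024  # assumption: sizes are counted in MiB


def swift_segment_count(image_bytes, large_object_size_mb=5120, chunk_size_mb=200):
    """Estimate how many data objects one image upload writes to Swift.

    Hypothetical model of the chunking rule; images of unknown size
    are not handled.
    """
    if image_bytes <= 0:
        raise ValueError("image size must be known and positive")
    if image_bytes < large_object_size_mb * ONE_MB:
        # Below the threshold: one PUT carries the whole image.
        return 1
    return math.ceil(image_bytes / (chunk_size_mb * ONE_MB))
```

Under these assumptions the image goes up as a single PUT of about 1.1 GB with the default threshold, and as several smaller segments once the threshold is lowered to 500 or 1000, so each PUT has less data to complete within node_timeout.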

Changing the proxy timeout seems like a reasonable solution too, so +1 to making that change.  I would suggest doing it on the overcloud as well, since I and a few others have run into this there too.

Comment 10 Alexander Chuzhoy 2016-06-21 02:33:05 UTC
Jiri,
I reproduced the issue after setting:
sudo crudini --set /etc/swift/proxy-server.conf app:proxy-server node_timeout 30

But after setting:
sudo crudini --set /etc/swift/proxy-server.conf app:proxy-server node_timeout 60
The issue didn't reproduce.

Comment 12 Alexander Chuzhoy 2016-07-20 20:26:19 UTC
Environment:
instack-undercloud-4.0.0-8.el7ost.noarch

Wasn't able to reproduce the issue in the interim. Will try some more.

Comment 13 Alexander Chuzhoy 2016-07-26 14:45:30 UTC
Verified:
Per comment #12.

Comment 15 errata-xmlrpc 2016-08-11 11:31:46 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHEA-2016-1599.html