Red Hat Bugzilla – Bug 853045
Getting 500 Internal Server Error when exporting/importing VM/Template - need to change httpd's config
Last modified: 2016-02-10 14:31:24 EST
Created attachment 608158 [details]
Description of problem:
After attempt to import VM or Template with url:/api/storagedomains/0ed780d2-258d-4f29-b627-abd0db0e0f30/vms/0a56c23e-6724-485c-b81f-d134f1a6eb52/import
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<title>500 Internal Server Error</title>
<h1>Internal Server Error</h1>
<p>The server encountered an internal error or
misconfiguration and was unable to complete
<p>Please contact the server administrator,
root@localhost and inform them of the time the error occurred,
and anything you might have done that may have
caused the error.</p>
<p>More information about this error may be available
in the server error log.</p>
<address>Apache/2.2.17 (Red Hat Enterprise Web Server) Server at jenkins-vm-06.qa.lab.tlv.redhat.com Port 443</address>
error in apache error log:
[Thu Aug 30 13:36:41 2012] [error] (70007)The timeout specified has expired: ajp_ilink_receive() can't receive header
Version-Release number of selected component (if applicable):
Steps to Reproduce:
1. Import VM or Template
There is similar bug for attaching domain bug 845832 and proposed settings for apache_mod bug 849442
Regression since 3.0, blocking all tests using import
Doesn't reproduce for me, I entered similar input:
And result was: 200 OK and the VM was successfully imported.
Perhaps this has to do with the size of the VM disks (I did not try with large disks)?
Also, it would help if you could attach only the relevant part of the log (the attached log file is large and it's difficult to pinpoint the problem)
jakub, do you have reproducer?
ori, did you saw the logs?
I looked at the log, it's a large file and I had difficulty finding the problem there. It would help if the issue would be reproduced and only the relevant portion of the log be attached
Created attachment 609305 [details]
Attaching backend log from the time of import. Note that this is different import (without collapse snapshot and name):
The VM itself is imported correctly after that (which is new for me).
(In reply to comment #9)
> The VM itself is imported correctly after that (which is new for me).
Which is not true every time - we have test where export fails with this 500 error and then vm is not found on export domain.
For comment 8 -
Right, in most cases the failure is easy to see, and then a full log better because it's more comprehensive. In this case a shorter log or a timestamp could have helped, so I asked for such.
seems like #845832, #818084 issue, same symptoms as in 
[Tue Sep 04 12:41:51 2012] [error] (111)Connection refused: proxy: AJP: attempt to connect to 127.0.0.1:8009 (localhost) failed
[Tue Sep 04 12:41:51 2012] [error] ap_proxy_connect_backend disabling worker for (localhost)
[Tue Sep 04 12:41:51 2012] [error] proxy: AJP: failed to make connection to backend: localhost
please try configuration changes suggested in 
=> anyway this is BE issue (BE runs out of resources on heavy loaded env.)
Suggested Michael's workaround in comment 12 solves the problem.
Jakub, what are the characteristics of the VM that you are trying to import? How many disks does it have and what is their size and type?
3 disks - The first disk containing OS is 20GB COW IDE
From VM containing only this one disk is created a template and from this template is created new VM.
To this newly created VM based on template are added two disks 3G COW IDE.
Template is exported and VM containing 3 disks is exported with collapse snapshot option enabled.
I created a VM with similar characteristics and imported it in my environment without problems with a similar RESTAPI request. However the disk of my VM was not used so the COW were really small (less than 1G each) and the import took less than 60 seconds. I reduced the Apache AJP timeout to try to force the issue:
ProxyPass / ajp://localhost:8702/ retry=5 timeout=5
And then I get the same error messages mentioned in the description of the bug. So I think that the solution is to increase this timeout:
ProxyPass / ajp://localhost:8702/ retry=5 timeout=300
The suggested solution is the same that for bug 845832:
This increases the timeout to 5 minutes, but there may be situations where the import takes more than that, specially if the disks are large. We should recommend users of the RESTAPI and SDK to avoid synchronous invocation of these potentially long operations.
continue failing at http://loki01.eng.lab.tlv.redhat.com/view/Automation/view/rhevm_3.1/view/lastest_si_build%20+%20latest%20vdsm%20built/job/rhevm_3.1_automation_export_import_iscsi_rpm/92/testReport/(root)/Positive_test_testcase_exportimportcollapsesnapshot/__068___Import_exported_VM_with_collapse_snapshots_option__iscsi_/
Upstream change to increase the time-out to one hour till we have a better solution:
The upstream change to one hour has been merged:
Haven't occurred on Jenkins jobs using rhevm-3.1.0-30.el6ev.noarch
The relevant message is in /var/log/http/ssl_error_log-20130714 in this case:
[Wed Jul 10 16:33:24 2013] [error] ajp_read_header: ajp_ilink_receive failed
[Wed Jul 10 16:33:24 2013] [error] (70007)The timeout specified has expired: proxy: dialog to [::1]:8702 (localhost) failed
This cerntainly looks like an expired timeout, but we configure that specific timeout for 3600 seconds in /etc/httpd/conf.d/z-ovirt-engine-proxy.conf:
# This is needed to make sure that connections to the application server
# are recovered in a short time interval (5 seconds at the moment)
# otherwise when the application server is restarted the web server will
# refuse to connect during 60 seconds.
# This is needed to make sure that long RESTAPI requests have time to
# finish before the web server aborts the request as the default timeout
# (controlled by the Timeout directive in httpd.conf) is 60 seconds.
This should apply to all the URLs redirected using the proxy. Can we double check that this configuration is applied when running the Jenkins job?
It could also happen that the configuration is only applied to ajp://localhost:8702 and not ajp://locahost/api, but I verified that in my environment and it works correctly. I would suggest to do the following test in the Jenkins machine where this is failing:
1. Create a /usr/share/ovirt-engine/engine.ear/root.war/test.jsp file with the following content:
Thread.sleep(100 * 1000);
This just waits 100 seconds and then returns a "Hello!".
2. Add the following to /etc/httpd/conf.d/z-ovirt-engine-proxy.conf:
# Remove the Expect headers from API requests (this is needed to fix a
# problem with some API clients):
RequestHeader unset Expect
This is a copy of the configuration that we use for the RESTAPI, so it should fail in the same way if the timeout is not applied.
3. Restart the engine and the web server, then try to access the test.jsp page (https://whatever/test.jsp).
In my environment this works correctly, I mean, the timeout applies to all the requests sent to the application server, including the test.jsp page.
Note that there are a lot of similar errors in the ssl_error_log of that machine, seems that the connection between the web server and the application server is fragile in that environment. Do we have other similar environments? Can we check the ssl_error_log files in those environments? Do they have the same type of errors?
I just performed the test described in comment #37 with the version of the web server installed in the Jenkins machine (httpd-2.2.22-18.ep6.el6.x86_64, note that this comes from the EAP channel, not from RHEL) and apparently it isn't taking into account the timeout configured with the ProxySet directive. Please modify the /etc/httpd/conf.d/z-ovirt-engine-proxy.conf file and add the timeout to line 36, like this:
ProxyPass ajp://localhost:8702/api timeout=3600
Then restart the web server and repeat the test. If this solves the issue we will need to make this change in the product.
As this is a timing issue it may fail in some iterations and succeed in others.
The root cause of this problem is a change in behavior from Apache 2.2 to Apache 2.4. Before version 2.4 Apache didn't copy the value of the timeout parameter from the Proxy directive to the ProxyPass directive. To fix it we are repeating the value of the timeout in the ProxyPass directive for the RESTAPI.
The change to add the timeout parameter to the ProxyPass directive has been merged upstream, so this will be fixed with the next rebase.
Ran export import test with
Unfortunately I can't find any reason for the timeout using the log files. In order to make progress with this I need the following:
1. The logs of the web server in /var/log/httpd. If possible truncate them before starting the job:
# for i in /var/log/httpd/*; do > $i; done
2. The configuration files of the web server, /etc/httpd/conf/httpd.conf and everything inside /etc/httpd/conf.d. These should be collected after running the setup of the engine.
3. A dump of the communications between the web server and the application server. This can be collected using the following command:
# tcpdump -i lo -w ajp.dump 'port 8702'
A dump of the traffic between the client and the web server can also help:
# tcpdump -i eth0 -w https.dump 'port 443'
This last dump is only useful if it can be decrypted, and in order to achieve that we would need the private key of the web server (by default /etc/pki/ovirt-engine/keys/apache.key.nopass) and also disable the Diffie-Hellman handshakes:
# echo "SSLCipherSuite AES256-SHA" >> /etc/httd/conf.d/ssl.conf
I would suggest to modify the Jenkins job so that it collects all that information, maybe not by default but at least as a configuration option.
Aharon, can you provide the information requested in comment 51?
I can't make progress with this without the information that I requested in comment 51, so I am closing it. If it appears again please collect the information and reopen it.