Bug 853045 - Getting 500 Internal Server Error when exporting/importing VM/Template - need to change httpd's config
Keywords:
Status: CLOSED INSUFFICIENT_DATA
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-engine-setup
Version: 3.3.0
Hardware: x86_64
OS: Linux
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: 3.3.0
Assignee: Juan Hernández
QA Contact: Tareq Alayan
URL:
Whiteboard: infra
Depends On: 845832
Blocks:
Reported: 2012-08-30 11:28 UTC by Jakub Libosvar
Modified: 2016-02-10 19:31 UTC
CC: 17 users

Fixed In Version: si23
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2013-10-23 10:03:24 UTC
oVirt Team: Infra
Target Upstream Version:
Embargoed:


Attachments
engine log (68.15 KB, application/x-xz)
2012-08-30 11:28 UTC, Jakub Libosvar
no flags
backend log (35.80 KB, text/x-log)
2012-09-03 09:20 UTC, Jakub Libosvar
no flags


Links
- Red Hat Bugzilla 845832 (CLOSED): [Scalability] Attachment of storage domains to data center via rhevm-sdk sometimes fails (last updated 2021-02-22 00:41:40 UTC)
- oVirt gerrit 16913

Internal Links: 845832

Description Jakub Libosvar 2012-08-30 11:28:26 UTC
Created attachment 608158 [details]
engine log

Description of problem:
After attempting to import a VM or Template with:

url: /api/storagedomains/0ed780d2-258d-4f29-b627-abd0db0e0f30/vms/0a56c23e-6724-485c-b81f-d134f1a6eb52/import

body:
<action>
    <async>false</async>
    <grace_period>
        <expiry>10</expiry>
    </grace_period>
    <storage_domain>
        <name>master_sd</name>
    </storage_domain>
    <cluster>
        <name>storage_sanity_31_cluster</name>
    </cluster>
    <vm>
        <name>twin_brother</name>
        <snapshots>
            <collapse_snapshots>true</collapse_snapshots>
        </snapshots>
    </vm>
    <clone>true</clone>
</action>

the following response is returned:
 <!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>500 Internal Server Error</title>
</head><body>
<h1>Internal Server Error</h1>
<p>The server encountered an internal error or
misconfiguration and was unable to complete
your request.</p>
<p>Please contact the server administrator,
 root@localhost and inform them of the time the error occurred,
and anything you might have done that may have
caused the error.</p>
<p>More information about this error may be available
in the server error log.</p>
<hr>
<address>Apache/2.2.17 (Red Hat Enterprise Web Server) Server at jenkins-vm-06.qa.lab.tlv.redhat.com Port 443</address>
</body></html>

Error in the Apache error log:
[Thu Aug 30 13:36:41 2012] [error] (70007)The timeout specified has expired: ajp_ilink_receive() can't receive header


Version-Release number of selected component (if applicable):
httpd-2.2.17-15.4.ep5.el6.x86_64
rhevm-backend-3.1.0-14.el6ev.noarch
httpd-tools-2.2.17-15.4.ep5.el6.x86_64
rhevm-restapi-3.1.0-14.el6ev.noarch


How reproducible:
Always

Steps to Reproduce:
1. Import a VM or Template
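
Step 1 can be driven directly through the RESTAPI; a minimal command sketch based on the request in the description (the hostname, credentials, and the XML body file are placeholders for the actual environment):

```shell
# Hypothetical reproduction; import-action.xml contains the <action>
# body shown in the description of this bug.
curl -k -u user@domain:password \
     -H "Content-Type: application/xml" \
     -d @import-action.xml \
     https://rhevm.example.com/api/storagedomains/0ed780d2-258d-4f29-b627-abd0db0e0f30/vms/0a56c23e-6724-485c-b81f-d134f1a6eb52/import
```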
  
Actual results:


Expected results:


Additional info:
There is a similar bug for attaching a storage domain, bug 845832, and there are proposed Apache mod_proxy settings in bug 849442.

This is a regression since 3.0 and blocks all tests that use import.

Comment 4 Ori Liel 2012-09-02 12:03:08 UTC
This doesn't reproduce for me; I entered similar input:

<action>
    <async>false</async>
    <grace_period>
        <expiry>10</expiry>
    </grace_period>
   <cluster>
      <name>Cluster3_1</name>
   </cluster>    
   <storage_domain>
      <name>data_2_smamit</name>
   </storage_domain>
   <clone>true</clone>
   <vm>
      <name>vm_2_clone_2</name>
      <snapshots>
          <collapse_snapshots>true</collapse_snapshots>
      </snapshots>
   </vm>
</action>

The result was 200 OK and the VM was imported successfully.

Perhaps this has to do with the size of the VM disks (I did not try with large disks)?

Also, it would help if you could attach only the relevant part of the log (the attached log file is large and it's difficult to pinpoint the problem).

Comment 5 Michael Pasternak 2012-09-02 14:55:57 UTC
Jakub, do you have a reproducer?

Comment 6 Michael Pasternak 2012-09-02 14:56:14 UTC
Ori, did you see the logs?

Comment 7 Ori Liel 2012-09-03 07:22:52 UTC
I looked at the log; it's a large file and I had difficulty finding the problem there. It would help if the issue were reproduced and only the relevant portion of the log attached.

Comment 9 Jakub Libosvar 2012-09-03 09:20:06 UTC
Created attachment 609305 [details]
backend log

Attaching the backend log from the time of the import. Note that this is a different import (without collapse snapshot and name):

url: /api/storagedomains/c4b54eb3-9bcd-4e44-869c-e348f05c79b3/vms/97929842-4391-415c-940d-abe668f1cadc/import

body:
<action>
    <async>false</async>
    <grace_period>
        <expiry>10</expiry>
    </grace_period>
    <storage_domain>
        <name>iscsiStorageTest_data_domain0</name>
    </storage_domain>
    <cluster>
        <name>iscsiStorageTest</name>
    </cluster>
</action>

The VM itself is imported correctly after that (which is new for me).

Comment 10 Jakub Libosvar 2012-09-03 09:26:54 UTC
(In reply to comment #9)
> The VM itself is imported correctly after that (which is new for me).

This is not true every time; we have a test where the export fails with this 500 error and then the VM is not found on the export domain.

Comment 11 Ori Liel 2012-09-03 10:55:41 UTC
Regarding comment 8:

Right, in most cases the failure is easy to see, and then a full log is better because it's more comprehensive. In this case a shorter log or a timestamp could have helped, so that is what I asked for.

Comment 12 Michael Pasternak 2012-09-04 12:22:47 UTC
This seems like the bug 845832 / bug 818084 issue; same symptoms as in [1]:

[Tue Sep 04 12:41:51 2012] [error] (111)Connection refused: proxy: AJP: attempt to connect to 127.0.0.1:8009 (localhost) failed
[Tue Sep 04 12:41:51 2012] [error] ap_proxy_connect_backend disabling worker for (localhost)
[Tue Sep 04 12:41:51 2012] [error] proxy: AJP: failed to make connection to backend: localhost

please try configuration changes suggested in [2]

[1] https://bugzilla.redhat.com/show_bug.cgi?id=818084#c18
[2] https://bugzilla.redhat.com/show_bug.cgi?id=818084#c19

In any case this is a backend issue (the backend runs out of resources on a heavily loaded environment).
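
The configuration changes in [2] are along the same lines as what was later merged for bug 845832: raising the AJP proxy timeout above Apache's 60-second default. A hedged sketch (the exact values in 818084#c19 may differ; port 8009 is the AJP port from the log above):

```apache
# Retry the backend quickly after an engine restart, and allow long
# RESTAPI operations to finish before Apache gives up on the AJP link.
ProxyPass / ajp://localhost:8009/ retry=5 timeout=300
```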

Comment 13 Jakub Libosvar 2012-09-05 12:29:56 UTC
Michael's workaround suggested in comment 12 solves the problem.

Comment 22 Juan Hernández 2012-09-11 10:40:59 UTC
Jakub, what are the characteristics of the VM that you are trying to import? How many disks does it have and what is their size and type?

Comment 23 Jakub Libosvar 2012-09-11 11:10:48 UTC
3 disks. The first disk, containing the OS, is 20GB COW IDE.
A template is created from a VM containing only this one disk, and a new VM is created from that template.
Two 3GB COW IDE disks are added to this newly created template-based VM.
The template is exported, and the VM containing 3 disks is exported with the collapse-snapshots option enabled.

Comment 26 Juan Hernández 2012-09-11 19:03:57 UTC
I created a VM with similar characteristics and imported it in my environment without problems, using a similar RESTAPI request. However, the disks of my VM were not used, so the COW files were really small (less than 1G each) and the import took less than 60 seconds. I reduced the Apache AJP timeout to try to force the issue:

ProxyPass / ajp://localhost:8702/ retry=5 timeout=5

I then got the same error messages mentioned in the description of the bug, so I think the solution is to increase this timeout:

ProxyPass / ajp://localhost:8702/ retry=5 timeout=300

The suggested solution is the same as for bug 845832:

http://gerrit.ovirt.org/7933

This increases the timeout to 5 minutes, but there may be situations where the import takes longer than that, especially if the disks are large. We should recommend that users of the RESTAPI and SDK avoid synchronous invocation of these potentially long operations.
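
As an illustration of that recommendation, the import action from the description could be made asynchronous by flipping the async element; a sketch (the names are the reporter's, the rest of the body is unchanged):

```xml
<action>
    <!-- Return immediately instead of holding the AJP connection open
         for the duration of the import -->
    <async>true</async>
    <storage_domain>
        <name>master_sd</name>
    </storage_domain>
    <cluster>
        <name>storage_sanity_31_cluster</name>
    </cluster>
</action>
```

The client can then poll the VM resource until the operation completes.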

Comment 27 Itamar Heim 2012-09-20 11:06:14 UTC
downstream patch:
https://gerrit.eng.lab.tlv.redhat.com/#/c/2190/

Comment 29 Juan Hernández 2012-10-22 09:04:28 UTC
Upstream change to increase the time-out to one hour till we have a better solution:

http://gerrit.ovirt.org/8716

Comment 30 Juan Hernández 2012-10-22 13:02:11 UTC
The upstream change to one hour has been merged:

http://gerrit.ovirt.org/gitweb?p=ovirt-engine.git;a=commit;h=90b97f31a89b8ad88376c3bc275ee47ceb04632e

Comment 33 Jakub Libosvar 2012-11-19 15:28:15 UTC
The problem hasn't occurred on Jenkins jobs using rhevm-3.1.0-30.el6ev.noarch.

Comment 37 Juan Hernández 2013-07-16 08:30:53 UTC
The relevant message in this case is in /var/log/httpd/ssl_error_log-20130714:

[Wed Jul 10 16:33:24 2013] [error] ajp_read_header: ajp_ilink_receive failed
[Wed Jul 10 16:33:24 2013] [error] (70007)The timeout specified has expired: proxy: dialog to [::1]:8702 (localhost) failed

This certainly looks like an expired timeout, but we configure that specific timeout to 3600 seconds in /etc/httpd/conf.d/z-ovirt-engine-proxy.conf:

<Proxy ajp://localhost:8702>
  # This is needed to make sure that connections to the application server
  # are recovered in a short time interval (5 seconds at the moment)
  # otherwise when the application server is restarted the web server will
  # refuse to connect during 60 seconds.
  ProxySet retry=5

  # This is needed to make sure that long RESTAPI requests have time to
  # finish before the web server aborts the request as the default timeout
  # (controlled by the Timeout directive in httpd.conf) is 60 seconds.
  ProxySet timeout=3600
</Proxy>

This should apply to all the URLs redirected using the proxy. Can we double check that this configuration is applied when running the Jenkins job?

It could also happen that the configuration is only applied to ajp://localhost:8702 and not to ajp://localhost/api, but I verified that in my environment and it works correctly. I would suggest doing the following test on the Jenkins machine where this is failing:

1. Create a /usr/share/ovirt-engine/engine.ear/root.war/test.jsp file with the following content:

<html>
<body>
<%
  System.out.println("Waiting ...");
  Thread.sleep(100 * 1000);
  System.out.println("Done.");
%>
Hello!
</body>
</html>

This just waits 100 seconds and then returns a "Hello!".

2. Add the following to /etc/httpd/conf.d/z-ovirt-engine-proxy.conf:

<Location /test.jsp>
  ProxyPass ajp://localhost:8702/test.jsp

  #
  # Remove the Expect headers from API requests (this is needed to fix a
  # problem with some API clients):
  #
  <IfModule headers_module>
    RequestHeader unset Expect
  </IfModule>

  <IfModule deflate_module>
    AddOutputFilterByType DEFLATE text/javascript text/css text/html text/xml text/json application/xml application/json application/x-yaml
  </IfModule>
</Location>

This is a copy of the configuration that we use for the RESTAPI, so it should fail in the same way if the timeout is not applied.

3. Restart the engine and the web server, then try to access the test.jsp page (https://whatever/test.jsp).

In my environment this works correctly; that is, the timeout applies to all requests sent to the application server, including the test.jsp page.
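
The test page above can be exercised with a simple timed request (the hostname is a placeholder, as in step 3):

```shell
# With the 3600-second proxy timeout applied, this should print "Hello!"
# after roughly 100 seconds; with a 60-second timeout Apache aborts the
# request with a proxy error instead.
time curl -k https://whatever/test.jsp
```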

Comment 38 Juan Hernández 2013-07-16 08:39:29 UTC
Note that there are a lot of similar errors in the ssl_error_log of that machine; it seems that the connection between the web server and the application server is fragile in that environment. Do we have other similar environments? Can we check the ssl_error_log files in those environments? Do they have the same type of errors?

Comment 39 Juan Hernández 2013-07-16 09:22:45 UTC
I just performed the test described in comment #37 with the version of the web server installed on the Jenkins machine (httpd-2.2.22-18.ep6.el6.x86_64; note that this comes from the EAP channel, not from RHEL) and apparently it isn't taking the timeout configured with the ProxySet directive into account. Please modify the /etc/httpd/conf.d/z-ovirt-engine-proxy.conf file and add the timeout to line 36, like this:

  ProxyPass ajp://localhost:8702/api timeout=3600

Then restart the web server and repeat the test. If this solves the issue we will need to make this change in the product.

Comment 41 Juan Hernández 2013-07-16 09:38:15 UTC
As this is a timing issue it may fail in some iterations and succeed in others.

Comment 42 Juan Hernández 2013-07-16 10:27:36 UTC
The root cause of this problem is a change in behavior from Apache 2.2 to Apache 2.4. Before version 2.4 Apache didn't copy the value of the timeout parameter from the Proxy directive to the ProxyPass directive. To fix it we are repeating the value of the timeout in the ProxyPass directive for the RESTAPI.
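
A sketch of the resulting configuration, combining the ProxySet directives quoted in comment 37 with the repeated timeout on the ProxyPass line (the comments are mine; the exact file contents in /etc/httpd/conf.d/z-ovirt-engine-proxy.conf may differ):

```apache
<Proxy ajp://localhost:8702>
  # Recover backend connections quickly after an engine restart.
  ProxySet retry=5
  # Honored by Apache 2.4, but not copied to ProxyPass by Apache 2.2.
  ProxySet timeout=3600
</Proxy>

<Location /api>
  # Repeat the timeout explicitly so Apache 2.2 applies it as well.
  ProxyPass ajp://localhost:8702/api timeout=3600
</Location>
```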

Comment 43 Juan Hernández 2013-07-16 13:29:50 UTC
The change to add the timeout parameter to the ProxyPass directive has been merged upstream, so this will be fixed with the next rebase.

Comment 44 Jakub Libosvar 2013-08-12 13:12:35 UTC
Ran the export/import test with
rhevm-3.3.0-0.14.master.el6ev.noarch

Comment 51 Juan Hernández 2013-10-09 09:54:27 UTC
Unfortunately I can't find any reason for the timeout in the log files. In order to make progress with this I need the following:

1. The logs of the web server in /var/log/httpd. If possible truncate them before starting the job:

  # for i in /var/log/httpd/*; do > $i; done

2. The configuration files of the web server, /etc/httpd/conf/httpd.conf and everything inside /etc/httpd/conf.d. These should be collected after running the setup of the engine.

3. A dump of the communications between the web server and the application server. This can be collected using the following command:

  # tcpdump -i lo -w ajp.dump 'port 8702'

A dump of the traffic between the client and the web server can also help:

  # tcpdump -i eth0 -w https.dump 'port 443'

This last dump is only useful if it can be decrypted; to achieve that we would need the private key of the web server (by default /etc/pki/ovirt-engine/keys/apache.key.nopass) and we would also need to disable the Diffie-Hellman handshakes:

  # echo "SSLCipherSuite AES256-SHA" >> /etc/httpd/conf.d/ssl.conf

I would suggest modifying the Jenkins job so that it collects all of that information, maybe not by default but at least as a configuration option.

Comment 52 Juan Hernández 2013-10-16 09:22:12 UTC
Aharon, can you provide the information requested in comment 51?

Comment 53 Juan Hernández 2013-10-23 10:03:24 UTC
I can't make progress with this without the information that I requested in comment 51, so I am closing the bug. If it appears again, please collect the information and reopen it.

