Bug 991742

Summary: Cant parse xml response - 500 Internal Server Error
Product: Red Hat Enterprise Virtualization Manager Reporter: Meital Bourvine <mbourvin>
Component: ovirt-engineAssignee: Spenser Shumaker <sshumake>
Status: CLOSED ERRATA QA Contact: Tareq Alayan <talayan>
Severity: high Docs Contact:
Priority: unspecified    
Version: 3.3.0CC: aberezin, acathrow, bazulay, cboyle, eedri, iheim, juan.hernandez, kroberts, lpeer, mbourvin, nlevinki, oramraz, pstehlik, Rhev-m-bugs, sshumake, yeylon
Target Milestone: ---Keywords: TestBlocker
Target Release: 3.3.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard: infra
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2014-01-21 16:58:06 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: Infra RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments:
Description Flags
Apache error log
none
Sniff none

Description Meital Bourvine 2013-08-04 08:50:55 UTC
Created attachment 782446 [details]
Apache error log

Description of problem:
Many tasks fails on:

2013-08-04 02:40:52,925 - MainThread - templates - DEBUG - Response body for action request is: <!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>500 Internal Server Error</title>
</head><body>
<h1>Internal Server Error</h1>
<p>The server encountered an internal error or
misconfiguration and was unable to complete
your request.</p>
<p>Please contact the server administrator,
 root@localhost and inform them of the time the error occurred,
and anything you might have done that may have
caused the error.</p>
<p>More information about this error may be available
in the server error log.</p>
<hr>
<address>Apache/2.2.22 (Red Hat Enterprise Web Server) Server at jenkins-vm-06.scl.lab.tlv.redhat.com Port 443</address>
</body></html>
 
2013-08-04 02:40:52,937 - MainThread - templates - ERROR - Cant parse xml response


For example:
http://jenkins.qa.lab.tlv.redhat.com:8080/view/Storage/view/3.3/job/3.3-storage_failed_import_template-iscsi-rest/lastSuccessfulBuild/artifact/logs/art_test_runner.log

Additional logs can be found here:
http://jenkins.qa.lab.tlv.redhat.com:8080/view/Storage/view/3.3/job/3.3-storage_failed_import_template-iscsi-rest/25/artifact/logs/jenkins-3.3-storage_failed_import_template-iscsi-rest-25__08042013_02-41-47.tar.bz2


From apache error log: 

[Sun Aug 04 02:23:24 2013] [notice] Apache/2.2.22 (Unix) DAV/2 mod_ssl/2.2.22 OpenSSL/1.0.0-fips configured -- resuming normal operations
[Sun Aug 04 02:24:21 2013] [error] (111)Connection refused: proxy: AJP: attempt to connect to 127.0.0.1:8702 (localhost) failed
[Sun Aug 04 02:24:21 2013] [error] ap_proxy_connect_backend disabling worker for (localhost)
[Sun Aug 04 02:24:21 2013] [error] proxy: AJP: failed to make connection to backend: localhost
[Sun Aug 04 02:24:25 2013] [error] proxy: AJP: disabled connection for (localhost)
[Sun Aug 04 02:24:38 2013] [error] (111)Connection refused: proxy: AJP: attempt to connect to 127.0.0.1:8702 (localhost) failed
[Sun Aug 04 02:24:38 2013] [error] ap_proxy_connect_backend disabling worker for (localhost)
[Sun Aug 04 02:24:38 2013] [error] proxy: AJP: failed to make connection to backend: localhost
[Sun Aug 04 02:24:42 2013] [error] proxy: AJP: disabled connection for (localhost)
[Sun Aug 04 02:40:52 2013] [error] (70007)The timeout specified has expired: ajp_ilink_receive() can't receive header
[Sun Aug 04 02:49:36 2013] [notice] caught SIGTERM, shutting down
[Sun Aug 04 02:51:54 2013] [notice] SELinux policy enabled; httpd running as context unconfined_u:system_r:unconfined_java_t:s0-s0:c0.c1023
[Sun Aug 04 02:51:54 2013] [notice] suEXEC mechanism enabled (wrapper: /usr/sbin/suexec)
[Sun Aug 04 02:51:54 2013] [notice] SSL FIPS mode disabled
[Sun Aug 04 02:51:54 2013] [notice] Digest: generating secret for digest authentication ...
[Sun Aug 04 02:51:54 2013] [notice] Digest: done
[Sun Aug 04 02:51:54 2013] [notice] SSL FIPS mode disabled
[Sun Aug 04 02:51:55 2013] [notice] Apache/2.2.22 (Unix) DAV/2 mod_ssl/2.2.22 OpenSSL/1.0.0-fips configured -- resuming normal operations
[Sun Aug 04 02:52:40 2013] [error] (111)Connection refused: proxy: AJP: attempt to connect to 127.0.0.1:8702 (localhost) failed
[Sun Aug 04 02:52:40 2013] [error] ap_proxy_connect_backend disabling worker for (localhost)
[Sun Aug 04 02:52:40 2013] [error] proxy: AJP: failed to make connection to backend: localhost
[Sun Aug 04 02:52:44 2013] [error] proxy: AJP: disabled connection for (localhost)
[Sun Aug 04 02:52:57 2013] [error] (111)Connection refused: proxy: AJP: attempt to connect to 127.0.0.1:8702 (localhost) failed
[Sun Aug 04 02:52:57 2013] [error] ap_proxy_connect_backend disabling worker for (localhost)
[Sun Aug 04 02:52:57 2013] [error] proxy: AJP: failed to make connection to backend: localhost
[Sun Aug 04 02:53:01 2013] [error] proxy: AJP: disabled connection for (localhost)
[Sun Aug 04 03:32:49 2013] [notice] caught SIGTERM, shutting down


Version-Release number of selected component (if applicable):
rhevm-3.3.0-0.13.master.el6ev.noarch

How reproducible:
Happened on 6 jobs since IS8

Comment 1 Juan Hernández 2013-08-06 08:27:22 UTC
We have had these timeouts many times, but currently the timeout should be 3600 seconds. Last time we changed this was done in the context of bug 853045. I am not sure if the version that caused this problem includes the fix or not. Can you check the content of the /etc/httpd/conf.d/z-ovirt-engine-proxy.conf file? It should contain the following:

    <Location /api>
        #
        # The timeout has to be specified here again because versions of
        # Apache older than 2.4 don't copy the setting from the Proxy
        # directive:
        #
        ProxyPass ajp://localhost:@JBOSS_AJP_PORT@/api timeout=3600

        <IfModule deflate_module>
            AddOutputFilterByType DEFLATE text/javascript text/css text/html text/xml text/json application/xml application/json application/x-yaml
        </IfModule>
    </Location>

The relevant parameter there is timeout=3600.

Comment 2 Meital Bourvine 2013-08-06 08:44:55 UTC
It's the same:

<Location /api>
        #
        # The timeout has to be specified here again because versions of
        # Apache older than 2.4 don't copy the setting from the Proxy
        # directive:
        #
        ProxyPass ajp://localhost:8702/api timeout=3600

        <IfModule deflate_module>
            AddOutputFilterByType DEFLATE text/javascript text/css text/html text/xml text/json application/xml application/json application/x-yaml
        </IfModule>
    </Location>

Comment 3 Juan Hernández 2013-08-07 18:58:43 UTC
Unfortunately I can't find any reason for this error. I will need more information to try to debug it. Is it possible to get a dump of the traffic between the web server and the application server while this is happening? That can be obtained with a command like this:

# tcpdump -i lo -w ajp.dump 'dst host localhost and port 8702'

This can be started before running the test and stopped after running the test. Then attach the resulting ajp.dump together with the logs. Include also the web server logs, if possible.

Comment 4 Meital Bourvine 2013-08-08 13:37:52 UTC
Do you think that it might be the same problem as https://bugzilla.redhat.com/show_bug.cgi?id=853045 ?

It's a bit problematic for me to run tcpdump, since we are working with host broker and I can't know which host it'll give us.

Comment 5 Juan Hernández 2013-08-08 13:52:02 UTC
That was my initial thought, that this was the same that bug 853045, but you have the timeouts configured correctly, so it can't be that.

Can't we run the tcpdump command from the same script that installs/starts the engine?

I can't think of another way to debug this. The only alternative is to have a clear reproducer, but I guess that is not possible at the moment.

Comment 9 Juan Hernández 2013-08-12 13:18:14 UTC
Note that bug 853045 was apparently similar to this one (a timeout problem as well) and has just been moved to VERIFIED after the change to add the timeout explicitly to the RESTAPI proxy directive. That is another indication that this is something different.

Comment 10 Meital Bourvine 2013-08-12 15:09:23 UTC
Created attachment 785748 [details]
Sniff

Comment 11 Juan Hernández 2013-08-12 15:22:19 UTC
Thanks Meital. Can you tell me what is the job corresponding to this dump so that I can make the connection between the logs and the exchange of packets?

Comment 12 Juan Hernández 2013-08-12 15:51:22 UTC
The first anomaly that I see in this log is that Apache is initially trying to contact the application server using IPv6, that is failing because JBoss doesn't use IPv6 by default (not sure if it can be enabled), then Apache tries again using IPv4 and that works. This happens because in the Apache configuration we use "localhost" as the address of the application server, and in most environments that translates both to the IPv6 ::1 and to the IPv4 127.0.0.1 addresses. As this initial IPv6 attempt is useless we should probably modify the Apache configuration to use "127.0.0.1" instead of "localhost":

# sed -i 's/localhost/127.0.0.1/g' /etc/httpd/conf.d/z-ovirt-engine-proxy.conf
# service httpd restart

It may be worth doing the change upstream: http://gerrit.ovirt.org/17980 .

However I don't think this is the root cause of the timeout. I need to study this deeper, and I need the rest of the logs of the job.

Comment 15 Juan Hernández 2013-08-13 13:21:24 UTC
I see in the logs that the test program requested a operation at 17:32:59:

2013-08-12 17:32:59,724 - Thread-79 - disks - DEBUG - Action request content is --  url:/api/vms/d4101640-5719-42c5-be91-5e25ed66bb5e/disks/6f77c4f0-4c29-4f4d-96fd-ac3a65578d7b/activate body:<action>
    <async>false</async>
    <grace_period>
        <expiry>10</expiry>
    </grace_period>
</action>

The same thread received the error response at 17:35:01:

2013-08-12 17:35:01,180 - Thread-79 - core_api - DEBUG - Request response time: 2.560
2013-08-12 17:35:01,180 - Thread-79 - disks - DEBUG - Response body for action request is: <!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>500 Internal Server Error</title>
</head><body>
<h1>Internal Server Error</h1>
<p>The server encountered an internal error or
misconfiguration and was unable to complete
your request.</p>
<p>Please contact the server administrator,
 root@localhost and inform them of the time the error occurred,
and anything you might have done that may have
caused the error.</p>
<p>More information about this error may be available
in the server error log.</p>
<hr>
<address>Apache/2.2.22 (Red Hat Enterprise Web Server) Server at jenkins-vm-04.scl.lab.tlv.redhat.com Port 443</address>
</body></html>

2013-08-12 17:35:01,220 - Thread-79 - disks - ERROR - Cant parse xml response

In the engine side the operation was started at 17:33:01 and finished at 17:35:13, note that it finished 12 seconds *after* the test program got the error response:

2013-08-12 17:33:01,068 INFO  [org.ovirt.engine.core.bll.HotPlugDiskToVmCommand] (ajp-/127.0.0.1:8702-9) Lock Acquired to object EngineLock [exclusiveLocks= , sharedLocks= key: d4101640-5719-42c5-be91-5e25ed66bb5e value: VM
2013-08-12 17:33:01,105 INFO  [org.ovirt.engine.core.bll.HotPlugDiskToVmCommand] (ajp-/127.0.0.1:8702-9) Running command: HotPlugDiskToVmCommand internal: false. Entities affected :  ID: d4101640-5719-42c5-be91-5e25ed66bb5e Type: VM
2013-08-12 17:33:01,110 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.HotPlugDiskVDSCommand] (ajp-/127.0.0.1:8702-9) START, HotPlugDiskVDSCommand(HostName = 10.35.160.51, HostId = a4ccc75b-8a76-4590-89cb-2e16178c7895, vmId=d4101640-5719-42c5-be91-5e25ed66bb5e, volumeId = 6f77c4f0-4c29-4f4d-96fd-ac3a65578d7b), log id: 1609c20a
2013-08-12 17:35:13,408 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.HotPlugDiskVDSCommand] (ajp-/127.0.0.1:8702-9) FINISH, HotPlugDiskVDSCommand, log id: 1609c20a
2013-08-12 17:35:13,513 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (ajp-/127.0.0.1:8702-9) Correlation ID: 50236a5b, Call Stack: null, Custom Event ID: -1, Message: VM vm_hotplug_hooks disk disk_to_plug_4 was plugged by admin@internal.
2013-08-12 17:35:13,514 INFO  [org.ovirt.engine.core.bll.HotPlugDiskToVmCommand] (ajp-/127.0.0.1:8702-9) Lock freed to object EngineLock [exclusiveLocks= , sharedLocks= key: d4101640-5719-42c5-be91-5e25ed66bb5e value: VM

Looking at the dump I see that te request was received by the application server at 17:33:01 (the times here are UTC), but the web server closed its side of the connection at 17:35:01 (this is when the test program received the error), after waiting approx 2 minutes for the response, but the other side was still open, till 17:35:13, when the application server started to send the response headers, and then the web server aborted the connection immediately:

   3114 2013-08-12 14:33:01.006156 127.0.0.1             127.0.0.1             AJP13    466    40:REQ:POST /api/vms/d4101640-5719-42c5-be91-5e25ed66bb5e/disks/6f77c4f0-4c29-4f4d-96fd-ac3a65578d7b/activate HTTP/1.1
   3115 2013-08-12 14:33:01.045626 127.0.0.1             127.0.0.1             AJP13    183    40:REQ:Body
   3116 2013-08-12 14:33:01.045765 127.0.0.1             127.0.0.1             TCP      66     8702 > 60061 [ACK] Seq=32519 Ack=1959 Win=39296 Len=0 TSval=335034010 TSecr=335033970
   3117 2013-08-12 14:33:01.046453 127.0.0.1             127.0.0.1             AJP13    73     40:RSP:GET BODY CHUNK
   3118 2013-08-12 14:33:01.046490 127.0.0.1             127.0.0.1             TCP      66     60061 > 8702 [ACK] Seq=1959 Ack=32526 Win=49152 Len=0 TSval=335034011 TSecr=335034011
   3119 2013-08-12 14:33:01.046532 127.0.0.1             127.0.0.1             AJP13    72     40:REQ:Body
   3126 2013-08-12 14:33:01.090213 127.0.0.1             127.0.0.1             TCP      66     8702 > 60061 [ACK] Seq=32526 Ack=1965 Win=39296 Len=0 TSval=335034054 TSecr=335034011
   3337 2013-08-12 14:35:01.155915 127.0.0.1             127.0.0.1             TCP      66     60061 > 8702 [FIN, ACK] Seq=1965 Ack=32526 Win=49152 Len=0 TSval=335154120 TSecr=335034054
   3338 2013-08-12 14:35:01.195230 127.0.0.1             127.0.0.1             TCP      66     8702 > 60061 [ACK] Seq=32526 Ack=1966 Win=39296 Len=0 TSval=335154160 TSecr=335154120
   3421 2013-08-12 14:35:13.516812 127.0.0.1             127.0.0.1             AJP13    108    40:RSP:SEND HEADERS:200 OK
   3422 2013-08-12 14:35:13.516848 127.0.0.1             127.0.0.1             TCP      54     60061 > 8702 [RST] Seq=1966 Win=0 Len=0

So it looks like for some reason the web server decided to close the connection when it was in the middle of receiving a response, and not an exceptionally long one, only two minutes. I need to study it further.

Comment 16 Juan Hernández 2013-08-13 16:26:05 UTC
After studying this a bit more I see that this happened simultaneously for 4 different connections. All of them were closed exactly 2 minutes after the application server sent the last message, and all of them were then reset when the application server finally sent the response. As Apache is in worker mode this can't be due to a crash of Apache, as 4 processes failing simultaneously isn't likely. I only can reproduce this behavior if I explicitly change the timeout of the proxy directive to 2 minutes:

  ProxyPass ajp://localhost:8702/test.jsp timeout=120

Then I get exactly the same result: the connection is closed after 2 minutes and reset when the application server finally sends the response. However this shouldn't happen in the tests installation, as the timeout is explicitly set to 3600 seconds.

I tested this with exactly the same version of Apache used in the test (httpd-2.2.22-18.ep6.el6.x86_64, from the EAP channel) and if the timeout is configured correctly it works correctly, for me.

The only reason I can imagine for this at the moment is Apache randomly taking the value of the timeout parameter from different ProxyPass directives. So I can suggest two approaches to see if we can fix it:

1. Explicitly add the timeout=3600 parameter to all the ProxyPass directives in the /etc/httpd/conf.d/z-ovirt-engine-proxy.conf file.

2. Change the Timeout parameter in the /etc/httpd/conf/httpd.conf file to 3600 seconds (this is where the 2 minutes timeout comes from).

Can we test these two changes?

Comment 17 Juan Hernández 2013-08-13 16:30:16 UTC
There is a mistake in the previous comment: Apache is in *prefork* mode, not worker. The rest is correct.

Comment 18 Juan Hernández 2013-08-13 17:43:42 UTC
Ok, I believe found the root cause for this. The Apache proxy module has a mechanism that allows sharing of connections to the application server for different re-directions. Lets use this as example:

  ProxyPass / http://localhost:8702 timeout=10
  ProxyPass /api http://localhost:8702/api timeout=20

This is very similar to what we have, but simplified. With this configuration Apache will use the same connection (worker in mod_proxy terminology) for both redirections, because they share a common prefix. But this shared worker has only one timeout parameter, and it is the one corresponding to the last ProxyPass directive, so in this case / will have a timeout of 20, not 10. This is ok, and expected, and we have the ProxyPass directives ordered in such a way that all of them will have the same timeout, in particular the RESTAPI ProxyPass directive is the very last one so that nothing overrides it.

But the recently added support plugin added a new configuration file to the /etc/httpd/conf.d directory, named z-redhat-support-plugin-rhev-proxy.conf. When alphabetically sorted that goes after z-ovirt-engine-proxy.conf. That file contains the following:

<IfModule proxy_ajp_module>
   <LocationMatch ^/redhat-support-plugin-rhev(/.*)?$>
      ProxyPassMatch ajp://localhost:8702
   </LocationMatch>
</IfModule>

As this has the same prefix, doesn't specify a timeout value and it appears after the RESTAPI ProxyPass directive it has the net effect of setting the timeout of the shared worker to the default 2 minutes.

I think we should rename the support plugin configuration file so that it goes before the z-ovirt-engine-proxy.conf file. That or add the explicit timeout there as well.

Keith, what do you think?

Comment 19 Keith Robertson 2013-08-13 18:18:31 UTC
(In reply to Juan Hernández from comment #18)
snip
> 
> I think we should rename the support plugin configuration file so that it
> goes before the z-ovirt-engine-proxy.conf file. That or add the explicit
> timeout there as well.
> 
> Keith, what do you think?

First, let me say ugh at a lexicographic loading of config files.

Second, I like the idea of renaming the support plugin configuration file so that is before ovirt-engine.  In this manner we will only have the timeout listed in one spot; hence, no duplication.  

Is this reasonable?

Comment 20 Keith Robertson 2013-08-13 18:24:11 UTC
Another thought, rename z-ovirt-engine-proxy.conf to Z-ovirt-engine-proxy.conf

Capital Z is higher up on the ASCII table; therefore, it should appear later than lowercase z IIUC.

Keith

Comment 21 Keith Robertson 2013-08-13 18:35:43 UTC
(In reply to Keith Robertson from comment #20)
> Another thought, rename z-ovirt-engine-proxy.conf to
> Z-ovirt-engine-proxy.conf
> 
> Capital Z is higher up on the ASCII table; therefore, it should appear later
> than lowercase z IIUC.
> 
> Keith

Ignore my previous comment. We will just remove the 'z-' prefix from 'z-redhat-support-plugin-rhev-proxy.conf'


Keith

Comment 23 Charlie 2013-11-28 01:34:10 UTC
This bug is currently attached to errata RHBA-2013:15513. If this change is not to be documented in the text for this errata please either remove it from the errata, set the requires_doc_text flag to 
minus (-), or leave a "Doc Text" value of "--no tech note required" if you do not have permission to alter the flag.

Otherwise to aid in the development of relevant and accurate release documentation, please fill out the "Doc Text" field above with these four (4) pieces of information:

* Cause: What actions or circumstances cause this bug to present.
* Consequence: What happens when the bug presents.
* Fix: What was done to fix the bug.
* Result: What now happens when the actions or circumstances above occur. (NB: this is not the same as 'the bug doesn't present anymore')

Once filled out, please set the "Doc Type" field to the appropriate value for the type of change made and submit your edits to the bug.

For further details on the Cause, Consequence, Fix, Result format please refer to:

https://bugzilla.redhat.com/page.cgi?id=fields.html#cf_release_notes 

Thanks in advance.

Comment 24 Charlie 2013-12-05 05:53:53 UTC
Does this need doc text? What twas the solution?

Thanks.

Comment 25 Spenser Shumaker 2013-12-05 14:59:37 UTC
Hi Charlie,

The plugin had a config file named improperly.  This bug was fixed by renaming that file.  I do not think doc text is needed for this bug

Thanks!

Comment 26 Charlie 2013-12-06 00:11:06 UTC
Thanks Spenser

Comment 27 errata-xmlrpc 2014-01-21 16:58:06 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2014-0082.html