Bug 991742
Summary: | Cant parse xml response - 500 Internal Server Error | ||||||||
---|---|---|---|---|---|---|---|---|---|
Product: | Red Hat Enterprise Virtualization Manager | Reporter: | Meital Bourvine <mbourvin> | ||||||
Component: | ovirt-engine | Assignee: | Spenser Shumaker <sshumake> | ||||||
Status: | CLOSED ERRATA | QA Contact: | Tareq Alayan <talayan> | ||||||
Severity: | high | Docs Contact: | |||||||
Priority: | unspecified | ||||||||
Version: | 3.3.0 | CC: | aberezin, acathrow, bazulay, cboyle, eedri, iheim, juan.hernandez, kroberts, lpeer, mbourvin, nlevinki, oramraz, pstehlik, Rhev-m-bugs, sshumake, yeylon | ||||||
Target Milestone: | --- | Keywords: | TestBlocker | ||||||
Target Release: | 3.3.0 | ||||||||
Hardware: | Unspecified | ||||||||
OS: | Unspecified | ||||||||
Whiteboard: | infra | ||||||||
Fixed In Version: | Doc Type: | Bug Fix | |||||||
Doc Text: | Story Points: | --- | |||||||
Clone Of: | Environment: | ||||||||
Last Closed: | 2014-01-21 16:58:06 UTC | Type: | Bug | ||||||
Regression: | --- | Mount Type: | --- | ||||||
Documentation: | --- | CRM: | |||||||
Verified Versions: | Category: | --- | |||||||
oVirt Team: | Infra | RHEL 7.3 requirements from Atomic Host: | |||||||
Cloudforms Team: | --- | Target Upstream Version: | |||||||
Embargoed: | |||||||||
Attachments: |
|
Description
Meital Bourvine
2013-08-04 08:50:55 UTC
We have had these timeouts many times, but currently the timeout should be 3600 seconds. Last time we changed this was done in the context of bug 853045. I am not sure if the version that caused this problem includes the fix or not. Can you check the content of the /etc/httpd/conf.d/z-ovirt-engine-proxy.conf file? It should contain the following: <Location /api> # # The timeout has to be specified here again because versions of # Apache older than 2.4 don't copy the setting from the Proxy # directive: # ProxyPass ajp://localhost:@JBOSS_AJP_PORT@/api timeout=3600 <IfModule deflate_module> AddOutputFilterByType DEFLATE text/javascript text/css text/html text/xml text/json application/xml application/json application/x-yaml </IfModule> </Location> The relevant parameter there is timeout=3600. It's the same: <Location /api> # # The timeout has to be specified here again because versions of # Apache older than 2.4 don't copy the setting from the Proxy # directive: # ProxyPass ajp://localhost:8702/api timeout=3600 <IfModule deflate_module> AddOutputFilterByType DEFLATE text/javascript text/css text/html text/xml text/json application/xml application/json application/x-yaml </IfModule> </Location> Unfortunately I can't find any reason for this error. I will need more information to try to debug it. Is it possible to get a dump of the traffic between the web server and the application server while this is happening? That can be obtained with a command like this: # tcpdump -i lo -w ajp.dump 'dst host localhost and port 8702' This can be started before running the test and stopped after running the test. Then attach the resulting ajp.dump together with the logs. Include also the web server logs, if possible. Do you think that it might be the same problem as https://bugzilla.redhat.com/show_bug.cgi?id=853045 ? It's a bit problematic for me to run tcpdump, since we are working with host broker and I can't know which host it'll give us. That was my initial thought, that this was the same that bug 853045, but you have the timeouts configured correctly, so it can't be that. Can't we run the tcpdump command from the same script that installs/starts the engine? I can't think of another way to debug this. The only alternative is to have a clear reproducer, but I guess that is not possible at the moment. Note that bug 853045 was apparently similar to this one (a timeout problem as well) and has just been moved to VERIFIED after the change to add the timeout explicitly to the RESTAPI proxy directive. That is another indication that this is something different. Created attachment 785748 [details]
Sniff
Thanks Meital. Can you tell me what is the job corresponding to this dump so that I can make the connection between the logs and the exchange of packets? The first anomaly that I see in this log is that Apache is initially trying to contact the application server using IPv6, that is failing because JBoss doesn't use IPv6 by default (not sure if it can be enabled), then Apache tries again using IPv4 and that works. This happens because in the Apache configuration we use "localhost" as the address of the application server, and in most environments that translates both to the IPv6 ::1 and to the IPv4 127.0.0.1 addresses. As this initial IPv6 attempt is useless we should probably modify the Apache configuration to use "127.0.0.1" instead of "localhost": # sed -i 's/localhost/127.0.0.1/g' /etc/httpd/conf.d/z-ovirt-engine-proxy.conf # service httpd restart It may be worth doing the change upstream: http://gerrit.ovirt.org/17980 . However I don't think this is the root cause of the timeout. I need to study this deeper, and I need the rest of the logs of the job. I see in the logs that the test program requested a operation at 17:32:59: 2013-08-12 17:32:59,724 - Thread-79 - disks - DEBUG - Action request content is -- url:/api/vms/d4101640-5719-42c5-be91-5e25ed66bb5e/disks/6f77c4f0-4c29-4f4d-96fd-ac3a65578d7b/activate body:<action> <async>false</async> <grace_period> <expiry>10</expiry> </grace_period> </action> The same thread received the error response at 17:35:01: 2013-08-12 17:35:01,180 - Thread-79 - core_api - DEBUG - Request response time: 2.560 2013-08-12 17:35:01,180 - Thread-79 - disks - DEBUG - Response body for action request is: <!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN"> <html><head> <title>500 Internal Server Error</title> </head><body> <h1>Internal Server Error</h1> <p>The server encountered an internal error or misconfiguration and was unable to complete your request.</p> <p>Please contact the server administrator, root@localhost and inform them of the time the error occurred, and anything you might have done that may have caused the error.</p> <p>More information about this error may be available in the server error log.</p> <hr> <address>Apache/2.2.22 (Red Hat Enterprise Web Server) Server at jenkins-vm-04.scl.lab.tlv.redhat.com Port 443</address> </body></html> 2013-08-12 17:35:01,220 - Thread-79 - disks - ERROR - Cant parse xml response In the engine side the operation was started at 17:33:01 and finished at 17:35:13, note that it finished 12 seconds *after* the test program got the error response: 2013-08-12 17:33:01,068 INFO [org.ovirt.engine.core.bll.HotPlugDiskToVmCommand] (ajp-/127.0.0.1:8702-9) Lock Acquired to object EngineLock [exclusiveLocks= , sharedLocks= key: d4101640-5719-42c5-be91-5e25ed66bb5e value: VM 2013-08-12 17:33:01,105 INFO [org.ovirt.engine.core.bll.HotPlugDiskToVmCommand] (ajp-/127.0.0.1:8702-9) Running command: HotPlugDiskToVmCommand internal: false. Entities affected : ID: d4101640-5719-42c5-be91-5e25ed66bb5e Type: VM 2013-08-12 17:33:01,110 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.HotPlugDiskVDSCommand] (ajp-/127.0.0.1:8702-9) START, HotPlugDiskVDSCommand(HostName = 10.35.160.51, HostId = a4ccc75b-8a76-4590-89cb-2e16178c7895, vmId=d4101640-5719-42c5-be91-5e25ed66bb5e, volumeId = 6f77c4f0-4c29-4f4d-96fd-ac3a65578d7b), log id: 1609c20a 2013-08-12 17:35:13,408 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.HotPlugDiskVDSCommand] (ajp-/127.0.0.1:8702-9) FINISH, HotPlugDiskVDSCommand, log id: 1609c20a 2013-08-12 17:35:13,513 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (ajp-/127.0.0.1:8702-9) Correlation ID: 50236a5b, Call Stack: null, Custom Event ID: -1, Message: VM vm_hotplug_hooks disk disk_to_plug_4 was plugged by admin@internal. 2013-08-12 17:35:13,514 INFO [org.ovirt.engine.core.bll.HotPlugDiskToVmCommand] (ajp-/127.0.0.1:8702-9) Lock freed to object EngineLock [exclusiveLocks= , sharedLocks= key: d4101640-5719-42c5-be91-5e25ed66bb5e value: VM Looking at the dump I see that te request was received by the application server at 17:33:01 (the times here are UTC), but the web server closed its side of the connection at 17:35:01 (this is when the test program received the error), after waiting approx 2 minutes for the response, but the other side was still open, till 17:35:13, when the application server started to send the response headers, and then the web server aborted the connection immediately: 3114 2013-08-12 14:33:01.006156 127.0.0.1 127.0.0.1 AJP13 466 40:REQ:POST /api/vms/d4101640-5719-42c5-be91-5e25ed66bb5e/disks/6f77c4f0-4c29-4f4d-96fd-ac3a65578d7b/activate HTTP/1.1 3115 2013-08-12 14:33:01.045626 127.0.0.1 127.0.0.1 AJP13 183 40:REQ:Body 3116 2013-08-12 14:33:01.045765 127.0.0.1 127.0.0.1 TCP 66 8702 > 60061 [ACK] Seq=32519 Ack=1959 Win=39296 Len=0 TSval=335034010 TSecr=335033970 3117 2013-08-12 14:33:01.046453 127.0.0.1 127.0.0.1 AJP13 73 40:RSP:GET BODY CHUNK 3118 2013-08-12 14:33:01.046490 127.0.0.1 127.0.0.1 TCP 66 60061 > 8702 [ACK] Seq=1959 Ack=32526 Win=49152 Len=0 TSval=335034011 TSecr=335034011 3119 2013-08-12 14:33:01.046532 127.0.0.1 127.0.0.1 AJP13 72 40:REQ:Body 3126 2013-08-12 14:33:01.090213 127.0.0.1 127.0.0.1 TCP 66 8702 > 60061 [ACK] Seq=32526 Ack=1965 Win=39296 Len=0 TSval=335034054 TSecr=335034011 3337 2013-08-12 14:35:01.155915 127.0.0.1 127.0.0.1 TCP 66 60061 > 8702 [FIN, ACK] Seq=1965 Ack=32526 Win=49152 Len=0 TSval=335154120 TSecr=335034054 3338 2013-08-12 14:35:01.195230 127.0.0.1 127.0.0.1 TCP 66 8702 > 60061 [ACK] Seq=32526 Ack=1966 Win=39296 Len=0 TSval=335154160 TSecr=335154120 3421 2013-08-12 14:35:13.516812 127.0.0.1 127.0.0.1 AJP13 108 40:RSP:SEND HEADERS:200 OK 3422 2013-08-12 14:35:13.516848 127.0.0.1 127.0.0.1 TCP 54 60061 > 8702 [RST] Seq=1966 Win=0 Len=0 So it looks like for some reason the web server decided to close the connection when it was in the middle of receiving a response, and not an exceptionally long one, only two minutes. I need to study it further. After studying this a bit more I see that this happened simultaneously for 4 different connections. All of them were closed exactly 2 minutes after the application server sent the last message, and all of them were then reset when the application server finally sent the response. As Apache is in worker mode this can't be due to a crash of Apache, as 4 processes failing simultaneously isn't likely. I only can reproduce this behavior if I explicitly change the timeout of the proxy directive to 2 minutes: ProxyPass ajp://localhost:8702/test.jsp timeout=120 Then I get exactly the same result: the connection is closed after 2 minutes and reset when the application server finally sends the response. However this shouldn't happen in the tests installation, as the timeout is explicitly set to 3600 seconds. I tested this with exactly the same version of Apache used in the test (httpd-2.2.22-18.ep6.el6.x86_64, from the EAP channel) and if the timeout is configured correctly it works correctly, for me. The only reason I can imagine for this at the moment is Apache randomly taking the value of the timeout parameter from different ProxyPass directives. So I can suggest two approaches to see if we can fix it: 1. Explicitly add the timeout=3600 parameter to all the ProxyPass directives in the /etc/httpd/conf.d/z-ovirt-engine-proxy.conf file. 2. Change the Timeout parameter in the /etc/httpd/conf/httpd.conf file to 3600 seconds (this is where the 2 minutes timeout comes from). Can we test these two changes? There is a mistake in the previous comment: Apache is in *prefork* mode, not worker. The rest is correct. Ok, I believe found the root cause for this. The Apache proxy module has a mechanism that allows sharing of connections to the application server for different re-directions. Lets use this as example: ProxyPass / http://localhost:8702 timeout=10 ProxyPass /api http://localhost:8702/api timeout=20 This is very similar to what we have, but simplified. With this configuration Apache will use the same connection (worker in mod_proxy terminology) for both redirections, because they share a common prefix. But this shared worker has only one timeout parameter, and it is the one corresponding to the last ProxyPass directive, so in this case / will have a timeout of 20, not 10. This is ok, and expected, and we have the ProxyPass directives ordered in such a way that all of them will have the same timeout, in particular the RESTAPI ProxyPass directive is the very last one so that nothing overrides it. But the recently added support plugin added a new configuration file to the /etc/httpd/conf.d directory, named z-redhat-support-plugin-rhev-proxy.conf. When alphabetically sorted that goes after z-ovirt-engine-proxy.conf. That file contains the following: <IfModule proxy_ajp_module> <LocationMatch ^/redhat-support-plugin-rhev(/.*)?$> ProxyPassMatch ajp://localhost:8702 </LocationMatch> </IfModule> As this has the same prefix, doesn't specify a timeout value and it appears after the RESTAPI ProxyPass directive it has the net effect of setting the timeout of the shared worker to the default 2 minutes. I think we should rename the support plugin configuration file so that it goes before the z-ovirt-engine-proxy.conf file. That or add the explicit timeout there as well. Keith, what do you think? (In reply to Juan Hernández from comment #18) snip > > I think we should rename the support plugin configuration file so that it > goes before the z-ovirt-engine-proxy.conf file. That or add the explicit > timeout there as well. > > Keith, what do you think? First, let me say ugh at a lexicographic loading of config files. Second, I like the idea of renaming the support plugin configuration file so that is before ovirt-engine. In this manner we will only have the timeout listed in one spot; hence, no duplication. Is this reasonable? Another thought, rename z-ovirt-engine-proxy.conf to Z-ovirt-engine-proxy.conf Capital Z is higher up on the ASCII table; therefore, it should appear later than lowercase z IIUC. Keith (In reply to Keith Robertson from comment #20) > Another thought, rename z-ovirt-engine-proxy.conf to > Z-ovirt-engine-proxy.conf > > Capital Z is higher up on the ASCII table; therefore, it should appear later > than lowercase z IIUC. > > Keith Ignore my previous comment. We will just remove the 'z-' prefix from 'z-redhat-support-plugin-rhev-proxy.conf' Keith This bug is currently attached to errata RHBA-2013:15513. If this change is not to be documented in the text for this errata please either remove it from the errata, set the requires_doc_text flag to minus (-), or leave a "Doc Text" value of "--no tech note required" if you do not have permission to alter the flag. Otherwise to aid in the development of relevant and accurate release documentation, please fill out the "Doc Text" field above with these four (4) pieces of information: * Cause: What actions or circumstances cause this bug to present. * Consequence: What happens when the bug presents. * Fix: What was done to fix the bug. * Result: What now happens when the actions or circumstances above occur. (NB: this is not the same as 'the bug doesn't present anymore') Once filled out, please set the "Doc Type" field to the appropriate value for the type of change made and submit your edits to the bug. For further details on the Cause, Consequence, Fix, Result format please refer to: https://bugzilla.redhat.com/page.cgi?id=fields.html#cf_release_notes Thanks in advance. Does this need doc text? What twas the solution? Thanks. Hi Charlie, The plugin had a config file named improperly. This bug was fixed by renaming that file. I do not think doc text is needed for this bug Thanks! Thanks Spenser Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. http://rhn.redhat.com/errata/RHBA-2014-0082.html |