Bug 1207299 - [scale] - webadmin request to server failed - unable to evaluate payload
Summary: [scale] - webadmin request to server failed - unable to evaluate payload
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-engine
Version: 3.5.1
Hardware: x86_64
OS: Linux
Priority: medium
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: Vojtech Szocs
QA Contact: guy chen
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2015-03-30 15:59 UTC by Eldad Marciano
Modified: 2022-06-27 11:48 UTC (History)
13 users (show)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-08-17 09:20:27 UTC
oVirt Team: UX
Target Upstream Version:
Embargoed:


Attachments
td1 (425.79 KB, text/plain), 2015-03-30 15:59 UTC, Eldad Marciano
td2 (435.95 KB, text/plain), 2015-03-30 16:00 UTC, Eldad Marciano
td3 (422.20 KB, text/plain), 2015-03-30 16:01 UTC, Eldad Marciano
vt14.1 WebAdmin / GWT symbol maps (5.67 MB, application/x-gzip), 2015-04-12 11:25 UTC, Vojtech Szocs
vt14.1 WebAdmin / GWT symbol maps (production build) (2.94 MB, application/x-gzip), 2015-04-12 11:50 UTC, Vojtech Szocs


Links
System ID Private Priority Status Summary Last Updated
Red Hat Bugzilla 972957 0 medium CLOSED Admin Portal error "Unable to evaluate payload" in Internet Explorer 2021-02-22 00:41:40 UTC
Red Hat Bugzilla 1218371 0 medium CLOSED [scale] JavaScriptException "elem is null" in Firefox 37 2021-02-22 00:41:40 UTC
Red Hat Issue Tracker RHV-46685 0 None None None 2022-06-27 11:48:18 UTC
Red Hat Knowledge Base (Solution) 395513 0 None None None Never

Internal Links: 972957 1218371

Description Eldad Marciano 2015-03-30 15:59:27 UTC
Description of problem:
This bug was already discovered in 3.4.1, when using the cluster tab at large scale:
https://bugzilla.redhat.com/show_bug.cgi?id=1139688

The bug now reproduces with a different flow; I am not sure what exactly triggers it.


Stack trace sampling shows AJP threads running some payload-related code.
However, the payload code seems to finish in under 5 seconds: thread dumps were taken every 5 seconds, and the payload code is run by a different thread in each dump.

Attaching thread dumps.

The logs contain nothing related to payload evaluation.
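
The sampling reasoning above can be sketched roughly as follows. This is only an illustration, not the tooling used for the attached dumps: it assumes jstack-style text where each thread starts with a quoted name, and the thread names, frame text and keyword in SAMPLE_DUMPS are made up for the example.

```python
import re
from collections import Counter

# A jstack-style thread section starts with the thread name in double quotes.
THREAD_HEADER = re.compile(r'^"(?P<name>[^"]+)"')


def threads_matching(dump_text, keyword):
    """Return names of threads whose stack frames mention `keyword`."""
    matches = []
    current = None  # name of the thread whose frames are being read
    hit = False
    for line in dump_text.splitlines():
        header = THREAD_HEADER.match(line)
        if header:
            if current is not None and hit:
                matches.append(current)
            current = header.group("name")
            hit = False
        elif current is not None and keyword.lower() in line.lower():
            hit = True
    if current is not None and hit:
        matches.append(current)
    return matches


def tally(dumps, keyword):
    """Count in how many dumps each thread was seen running `keyword` code."""
    counts = Counter()
    for dump in dumps:
        counts.update(threads_matching(dump, keyword))
    return counts


# Hypothetical dumps taken 5 seconds apart (names and frames are invented).
SAMPLE_DUMPS = [
    '"ajp-1" daemon\n\tat com.example.PayloadWriter.write(PayloadWriter.java:10)\n\n'
    '"ajp-2" daemon\n\tat java.lang.Object.wait(Native Method)\n',
    '"ajp-1" daemon\n\tat java.lang.Object.wait(Native Method)\n\n'
    '"ajp-2" daemon\n\tat com.example.PayloadWriter.write(PayloadWriter.java:10)\n',
]

if __name__ == "__main__":
    for name, seen in tally(SAMPLE_DUMPS, "payload").items():
        print(name, seen)
```

If no thread is counted in two consecutive dumps, that is consistent with (but does not prove) each payload call finishing within the sampling interval.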


Version-Release number of selected component (if applicable):
3.5.1 VT14.1

How reproducible:
100%

Steps to Reproduce:
1. Prepare a large-scale setup (500 hosts, 10K VMs).
2. Log in.
3. Click on the host tab.
4. Focus on some host.

Actual results:
An "Unable to evaluate payload" error dialog appears, and sometimes webadmin gets stuck or crashes.

Expected results:
webadmin should keep working in extreme cases.

Additional info:

Comment 1 Eldad Marciano 2015-03-30 15:59:57 UTC
Created attachment 1008577 [details]
td1

Comment 2 Eldad Marciano 2015-03-30 16:00:38 UTC
Created attachment 1008579 [details]
td2

Comment 3 Eldad Marciano 2015-03-30 16:01:01 UTC
Created attachment 1008581 [details]
td3

Comment 4 Liran Zelkha 2015-03-31 06:35:43 UTC
Eldad - can you run this with a profiler? Any info on slow queries on postgres?

Comment 5 Liran Zelkha 2015-03-31 06:52:14 UTC
Eldad, let's try something. 
Please try to run the 2 following queries (you'll need to get actual engine_sessions.id and ad_element_id values):

SELECT *
   FROM permissions_view
   WHERE (permissions_view.app_mode & 1) > 0
   AND (permissions_view.ad_element_id = 'fdfc627c-d875-11e0-90f0-83df133b58cc'
    OR    ad_element_id IN (SELECT * FROM GetSessionUserAndGroupsById('fdfc627c-d875-11e0-90f0-83df133b58cc', 1)))
   AND (NOT true OR EXISTS (SELECT 1 FROM user_permissions_permissions_view uv, engine_sessions WHERE uv.user_id = engine_sessions.user_id AND  engine_sessions.id = 1::bigint))

and 


SELECT *
   FROM permissions_view
   WHERE (permissions_view.app_mode & 1) > 0
   AND (permissions_view.ad_element_id = 'fdfc627c-d875-11e0-90f0-83df133b58cc'
    OR    ad_element_id IN (SELECT * FROM GetSessionUserAndGroupsById('fdfc627c-d875-11e0-90f0-83df133b58cc', 1)))
   AND (NOT true OR EXISTS (SELECT 1 FROM user_permissions_permissions_view uv, engine_sessions WHERE uv.user_id = engine_sessions.user_id AND  engine_sessions.id = 1))


Tell me which one runs faster.

Comment 6 Eldad Marciano 2015-03-31 09:07:03 UTC
(In reply to Liran Zelkha from comment #4)
> Eldad - can you run this with a profiler? Any info on slow queries on
> postgres?

Tried it already; there are no queries longer than 1.8 sec.

Comment 7 Eldad Marciano 2015-03-31 09:15:46 UTC
(In reply to Liran Zelkha from comment #5)
> Eldad, let's try something. 
> Please try to run the 2 following queries (you'll need to get actual
> engine_sessions.id and ad_element_id values):
> 
> SELECT *
>    FROM permissions_view
>    WHERE (permissions_view.app_mode & 1) > 0
>    AND (permissions_view.ad_element_id =
> 'fdfc627c-d875-11e0-90f0-83df133b58cc'
>     OR    ad_element_id IN (SELECT * FROM
> GetSessionUserAndGroupsById('fdfc627c-d875-11e0-90f0-83df133b58cc', 1)))
>    AND (NOT true OR EXISTS (SELECT 1 FROM user_permissions_permissions_view
> uv, engine_sessions WHERE uv.user_id = engine_sessions.user_id AND 
> engine_sessions.id = 1::bigint))
> 
> and 
> 
> 
> SELECT *
>    FROM permissions_view
>    WHERE (permissions_view.app_mode & 1) > 0
>    AND (permissions_view.ad_element_id =
> 'fdfc627c-d875-11e0-90f0-83df133b58cc'
>     OR    ad_element_id IN (SELECT * FROM
> GetSessionUserAndGroupsById('fdfc627c-d875-11e0-90f0-83df133b58cc', 1)))
>    AND (NOT true OR EXISTS (SELECT 1 FROM user_permissions_permissions_view
> uv, engine_sessions WHERE uv.user_id = engine_sessions.user_id AND 
> engine_sessions.id = 1))
> 
> 
> Tell me which one runs faster.

There is no such function 'GetSessionUserAndGroupsById'.

Comment 8 Eldad Marciano 2015-03-31 11:36:22 UTC
We ran another test in order to clarify the bug.
Since we have upgraded our setup through several versions, we removed and reinstalled rhevm to be sure we are using the correct packages, such as GWT.

"yum remove rhevm*"
"yum install rhevm" (specifying only the correct vt14.1 version).


The issue still exists, especially when using Firefox.

Comment 9 Eldad Marciano 2015-04-02 09:07:46 UTC
Additional findings:

When trying to reproduce this bug in an isolated lab on the same LAN, the bug does not reproduce.


It seems this bug reproduces only over WAN.
See the following output from the web console, captured when it reproduced.

"Thu Apr 02 12:02:42 GMT+300 2015 org.ovirt.engine.ui.frontend.Frontend
SEVERE: Failed to execute runQuery: com.google.gwt.user.client.rpc.IncompatibleRemoteServiceException: Unable to evaluate payload
com.google.gwt.user.client.rpc.IncompatibleRemoteServiceException: Unable to evaluate payload
	at Unknown.Zp(Unknown Source)
	at Unknown.pje(Unknown Source)
	at Unknown.G0d(Unknown Source)
	at Unknown.E0d(Unknown Source)
	at Unknown.R1d(Unknown Source)
	at Unknown.p1i(Unknown Source)
	at Unknown.fZ(Unknown Source)
	at Unknown.zZ(Unknown Source)
	at Unknown.kQe/c.onreadystatechange<(Unknown Source)
	at Unknown.Ir(Unknown Source)
	at Unknown.Lr(Unknown Source)
	at Unknown.Kr/<(Unknown Source)
	at Unknown.anonymous(Unknown Source)
Caused by: com.google.gwt.core.client.JavaScriptException: (InternalError) : switch statement too large
	at Unknown.anonymous(Unknown Source)
	at Unknown.G0d(Unknown Source)
	at Unknown.E0d(Unknown Source)
	at Unknown.R1d(Unknown Source)
	at Unknown.p1i(Unknown Source)
	at Unknown.fZ(Unknown Source)
	at Unknown.zZ(Unknown Source)
	at Unknown.kQe/c.onreadystatechange<(Unknown Source)
	at Unknown.Ir(Unknown Source)
	at Unknown.Lr(Unknown Source)
	at Unknown.Kr/<(Unknown Source)
	at Unknown.anonymous(Unknown Source)"

Comment 10 Einav Cohen 2015-04-02 12:24:32 UTC
sounds similar to bug 972957 (only in bug 972957 the complaint is on IE9). 

@Eldad:

- Does this reproduce *consistently* on a scale environment over WAN?
- Which browser (type/version) are you using?

Comment 11 Eldad Marciano 2015-04-02 12:32:49 UTC
(In reply to Einav Cohen from comment #10)
> sounds similar to bug 972957 (only in bug 972957 the complaint is on IE9). 
> 
> @Eldad:
> 
> - Does this reproduce *consistently* on a scale environment over WAN?
> - Which browser (type/version) are you using?

Not sure, as I have no smaller setup, but it reproduces consistently over WAN.
I have tested it on Firefox 33.1 and on Chrome version 39.0.2171.65.

Comment 12 Einav Cohen 2015-04-07 18:13:41 UTC
@Vojtech - can you please sit with Eldad and take a look at it? 
Alexander is unable to reproduce this. 

Thanks.

Comment 13 Vojtech Szocs 2015-04-12 08:47:05 UTC
(In reply to Einav Cohen from comment #12)
> @Vojtech - can you please sit with Eldad and take a look at it? 
> Alexander is unable to reproduce this. 
> 
> Thanks.

I'm on it. I talked with Eldad, I'm going to re-build VT14.1 locally to generate GWT symbol maps and correlate them with the obfuscated JavaScript stack trace.

I was able to reproduce this myself (Firefox 34.0) by logging into Eldad's VT14.1 WebAdmin and just waiting a few moments for the error dialog to appear.

Comment 14 Vojtech Szocs 2015-04-12 11:22:15 UTC
Attaching GWT symbol maps for vt14.1 WebAdmin [1] for Firefox and Chrome (two files, packed as tar.gz archive).

[1] frontend/webadmin/modules/webadmin/target/generated-gwt/WEB-INF/deploy/webadmin/symbolMaps

Comment 15 Vojtech Szocs 2015-04-12 11:25:42 UTC
Created attachment 1013632 [details]
vt14.1 WebAdmin / GWT symbol maps

Comment 16 Vojtech Szocs 2015-04-12 11:32:42 UTC
Sorry, I forgot to disable the GWT draft compile, need to re-build vt14.1 Engine, removing the attachment for now.

Comment 17 Vojtech Szocs 2015-04-12 11:50:28 UTC
Created attachment 1013634 [details]
vt14.1 WebAdmin / GWT symbol maps (production build)

Comment 18 Vojtech Szocs 2015-04-12 11:57:43 UTC
vt14.1 WebAdmin / GWT symbol maps (production build)

GWT build parameters:
- target browsers: Firefox + Chrome
- target locale: English (only)
- draft compile: no (production build)

GWT symbol maps:
- 5682425E7396046A02D4F3EF10BF73E9.symbolMap = Firefox
- 94E382FAE2D3E09D46A590DB55EEB491.symbolMap = Chrome

Comment 19 Vojtech Szocs 2015-04-12 13:22:33 UTC
Correlating with JS stack trace: http://pastebin.test.redhat.com/274822

at Unknown.G0d(Unknown Source)
  => com.google.gwt.rpc.client.impl.CommandClientSerializationStreamReader::$prepareToRead(Lcom/google/gwt/rpc/client/impl/CommandClientSerializationStreamReader;Ljava/lang/String;)V,com.google.gwt.rpc.client.impl.CommandClientSerializationStreamReader,$prepareToRead,com/google/gwt/rpc/client/impl/CommandClientSerializationStreamReader.java,87,0

at Unknown.E0d(Unknown Source)
  => com.google.gwt.rpc.client.impl.ClientWriterFactory::createReader(Ljava/lang/String;)Lcom/google/gwt/user/client/rpc/SerializationStreamReader;,com.google.gwt.rpc.client.impl.ClientWriterFactory,createReader,com/google/gwt/rpc/super/com/google/gwt/rpc/client/impl/ClientWriterFactory.java,31,0

at Unknown.R1d(Unknown Source)
  => com.google.gwt.rpc.client.impl.RpcCallbackAdapter::onResponseReceived(Lcom/google/gwt/http/client/Request;Lcom/google/gwt/http/client/Response;)V,com.google.gwt.rpc.client.impl.RpcCallbackAdapter,onResponseReceived,com/google/gwt/rpc/client/impl/RpcCallbackAdapter.java,72,0

at Unknown.p1i(Unknown Source)
  => org.ovirt.engine.ui.common.gin.BaseSystemModule$1$1::onResponseReceived(Lcom/google/gwt/http/client/Request;Lcom/google/gwt/http/client/Response;)V,org.ovirt.engine.ui.common.gin.BaseSystemModule$1$1,onResponseReceived,org/ovirt/engine/ui/common/gin/BaseSystemModule.java,127,0
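
The correlation above can be automated along these lines. This is a hedged sketch, not the procedure actually used: it assumes each symbolMap row is the obfuscated JS name followed by the six comma-separated fields shown after "=>" (JSNI identifier, class, member, source file, line, fragment), and the helper names are invented for the example.

```python
import re
from typing import NamedTuple, Optional


class Symbol(NamedTuple):
    js_name: str      # obfuscated name, e.g. "p1i"
    jsni_ident: str   # full JSNI identifier
    class_name: str
    member_name: str
    source_uri: str
    source_line: str
    fragment: str


def load_symbol_map(text):
    """Index symbolMap rows by obfuscated name (assumed 7-field layout)."""
    symbols = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment/header lines
        fields = line.split(",")
        if len(fields) != 7:
            continue  # row does not fit the layout assumed by this sketch
        sym = Symbol(*fields)
        symbols[sym.js_name] = sym
    return symbols


# Matches frames such as "at Unknown.p1i(Unknown Source)".
FRAME = re.compile(r"at Unknown\.([\w$]+)\(")


def deobfuscate(frame, symbols) -> Optional[str]:
    """Return 'Class.member (file:line)' for a frame, or None if unknown."""
    match = FRAME.search(frame)
    if not match:
        return None
    sym = symbols.get(match.group(1))
    if sym is None:
        return None
    return f"{sym.class_name}.{sym.member_name} ({sym.source_uri}:{sym.source_line})"


# One row rebuilt from the p1i mapping quoted in this comment.
SAMPLE_MAP = (
    "# comment lines are ignored\n"
    "p1i,org.ovirt.engine.ui.common.gin.BaseSystemModule$1$1::onResponseReceived"
    "(Lcom/google/gwt/http/client/Request;Lcom/google/gwt/http/client/Response;)V,"
    "org.ovirt.engine.ui.common.gin.BaseSystemModule$1$1,onResponseReceived,"
    "org/ovirt/engine/ui/common/gin/BaseSystemModule.java,127,0\n"
)

if __name__ == "__main__":
    table = load_symbol_map(SAMPLE_MAP)
    print(deobfuscate("at Unknown.p1i(Unknown Source)", table))
```

Frames whose obfuscated name is not in the loaded map (for example, when the wrong browser permutation's symbolMap is used) simply come back as None.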

Thrown exception is com.google.gwt.user.client.rpc.IncompatibleRemoteServiceException which typically occurs when GWT client is "out of sync" with GWT servlet. This generally shouldn't happen when using WebAdmin compiled for use with given Engine and assuming gwt-servlet.jar has the correct version, for 3.5.z this should be gwt-servlet-2.5.1.jar.

@Eldad, does this issue still occur when using gwt-servlet-2.5.1.jar?

Digging deeper into stack trace: BaseSystemModule.java,127,0 => XSRF token HTTP request failed @ BaseSystemModule.java line 127.

To me, it seems that GWT RPC (POST) request to /ovirt-engine/webadmin/xsrf returns data incompatible for WebAdmin client. (XsrfTokenServiceServlet as defined in frontend/webadmin/modules/webadmin/src/main/webapp/WEB-INF/web.xml)

@Alex, if you have any thoughts on this, please share.

Anyway, I can remotely debug Eldad's vt14.1 WebAdmin & Engine using my Firefox 34.0 & Java IDE. Staying in touch with Eldad.

Comment 20 Vojtech Szocs 2015-05-04 18:20:04 UTC
After further testing and analysis, we can conclude that:

* Firefox bug https://bugzilla.mozilla.org/show_bug.cgi?id=1083913 was indeed causing the "Switch statement too large" error (as reported in this BZ), it's now fixed in Firefox 36 and above; i.e. the original error cannot be reproduced with Firefox 36+ anymore

* relation to XsrfTokenServiceServlet (as described in comment #19) is not relevant, however there seems to be an application-level null pointer exception occurring in Firefox 37 (as described in bug 1218371)

Since the above mentioned Firefox bug is already fixed (which makes this BZ unable to reproduce in recent Firefox), I suggest to close this bug and fix the application-level null pointer exception as described in bug 1218371.

Comment 21 Einav Cohen 2015-05-04 20:34:52 UTC
(In reply to vszocs from comment #20)
> After further testing and analysis, we can conclude that:
> 
> * Firefox bug https://bugzilla.mozilla.org/show_bug.cgi?id=1083913 was
> indeed causing the "Switch statement too large" error (as reported in this
> BZ), it's now fixed in Firefox 36 and above; i.e. the original error cannot
> be reproduced with Firefox 36+ anymore
> 
> * relation to XsrfTokenServiceServlet (as described in comment #19) is not
> relevant, however there seems to be an application-level null pointer
> exception occurring in Firefox 37 (as described in bug 1218371)
> 
> Since the above mentioned Firefox bug is already fixed (which makes this BZ
> unable to reproduce in recent Firefox), I suggest to close this bug and fix
> the application-level null pointer exception as described in bug 1218371.

agreed. closing on NOT-A-BUG, as the originally reported issue ("unable to evaluate payload") is reproducible only on FF versions that are not / will not be supported for RHEV (33-36). 
the newly discovered issue (application-level NPE) will be handled in the context of bug 1218371.

Comment 23 Vojtech Szocs 2015-08-25 17:49:25 UTC
Problem analysis: https://bugzilla.redhat.com/show_bug.cgi?id=972957#c11

Each "A Request to the Server failed: xxx" error is logged by the GWT client [1].

[1] see org.ovirt.engine.ui.frontend.Frontend#failureEventHandler call hierarchy

The actual JS error, originating from eval() invocation, is not displayed. Therefore we need to analyze GWT client logs.

Please do the following when "A Request to the Server failed: xxx" error appears:
- open browser's developer tools (might need to reload page)
- inspect calls to "console.log" API, looking for following records:
  "Failed to execute runQuery: ..."
  "Failed to execute runMultipleQueries: ..."
  "Failed to execute runAction: ..."
  "Failed to execute runMultipleAction: ..."
  "Failed to login: ..."
  "Failed to execute logoff: ..."
- report the "..." part, which should contain details of the actual JS error (root cause)
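
If the console output is saved to a text file, the records listed above can be pulled out with a small script. This is a minimal sketch that assumes one record per line in a plain-text export; the function name is made up, and the sample line is taken from the console output in comment 9.

```python
# Record prefixes listed in this comment.
PREFIXES = [
    "Failed to execute runQuery: ",
    "Failed to execute runMultipleQueries: ",
    "Failed to execute runAction: ",
    "Failed to execute runMultipleAction: ",
    "Failed to login: ",
    "Failed to execute logoff: ",
]


def extract_failures(console_text):
    """Return (record, details) pairs for each matching console line."""
    results = []
    for line in console_text.splitlines():
        for prefix in PREFIXES:
            idx = line.find(prefix)
            if idx != -1:
                # Everything after the prefix is the "..." part to report.
                details = line[idx + len(prefix):].strip()
                results.append((prefix.rstrip(": "), details))
                break
    return results


SAMPLE_LOG = (
    "Thu Apr 02 12:02:42 GMT+300 2015 org.ovirt.engine.ui.frontend.Frontend\n"
    "SEVERE: Failed to execute runQuery: "
    "com.google.gwt.user.client.rpc.IncompatibleRemoteServiceException: "
    "Unable to evaluate payload\n"
)

if __name__ == "__main__":
    for record, details in extract_failures(SAMPLE_LOG):
        print(record, "->", details)
```

Only the first line of each record is captured; any stack trace lines that follow (including a "Caused by:" line) still need to be copied from the console by hand.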

Comment 35 Einav Cohen 2015-09-21 13:01:03 UTC
still working on reproducing the problem, pushing to 3.5.6 for now.

Comment 59 Yaniv Lavi 2016-05-09 10:56:43 UTC
oVirt 4.0 Alpha has been released, moving to oVirt 4.0 Beta target.

Comment 70 Yaniv Kaul 2017-01-22 09:26:24 UTC
Eldad - can you retest on latest 4.1?

Comment 71 Eldad Marciano 2017-01-22 09:29:44 UTC
(In reply to Yaniv Kaul from comment #70)
> Eldad - can you retest on latest 4.1?

I see the target release is 4.2;
in any case, we need to coordinate this bug's priority.

Comment 72 Vojtech Szocs 2017-01-24 17:50:38 UTC
In oVirt GWT UI, we recently switched to standard GWT RPC mechanism [1], moving away from GWT direct-eval RPC (alternative RPC mechanism).

[1] https://gerrit.ovirt.org/#/c/65735/

The `ovirt-engine-4.1` branch already contains [1], which means the reported (GWT RPC related) bug can be fixed there.

As Yaniv wrote in comment #70, please try to test on latest 4.1.

(In other words, 4.0 still uses the "old" RPC mechanism while 4.1 uses the "standard" RPC mechanism.)

Comment 73 Eldad Marciano 2017-03-13 22:03:31 UTC
Yaniv - it seems we can verify this bug on top of 4.1.x.
Is it acceptable to verify this bug with ~200 hosts and ~3K VMs?
Note: since this bug reproduced with 500 hosts and 10K VMs, we might miss it on a smaller topology.

Comment 74 Yaniv Kaul 2017-04-09 09:42:26 UTC
Makes sense to verify for 4.2.

Comment 75 mlehrer 2017-08-17 09:01:11 UTC
Previously it was suggested to use the latest 4.1, and also to use a smaller scale. We just want a bottom line from dev on what is required for reproduction.

Scale is attempting to reproduce this issue. Can you please confirm that the environment to reproduce is:
 
500 hosts and 10K VMs, and RHV 4.2 per Yaniv's comment,

and not 200 hosts and 3K VMs on the latest RHV 4.1.

