Bug 1392461 - Uncaught exception when increasing the number of VMs in a pool
Summary: Uncaught exception when increasing the number of VMs in a pool
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: ovirt-engine
Classification: oVirt
Component: Backend.Core
Version: 4.0.4.4
Hardware: Unspecified
OS: Unspecified
medium
medium
Target Milestone: ovirt-4.1.0-beta
: ---
Assignee: Fred Rolland
QA Contact: sefi litmanovich
URL:
Whiteboard:
Depends On:
Blocks: 1416337
TreeView+ depends on / blocked
 
Reported: 2016-11-07 14:46 UTC by nicolas
Modified: 2017-03-16 14:47 UTC (History)
10 users (show)

Fixed In Version:
Clone Of:
: 1416337 (view as bug list)
Environment:
Last Closed: 2017-03-16 14:47:53 UTC
oVirt Team: Storage
Embargoed:
rule-engine: ovirt-4.1+


Attachments (Terms of Use)
Requested log (ui.log) (3.70 KB, text/plain)
2016-11-08 08:08 UTC, nicolas
no flags Details
Screenshot of the disk allocation tab (45.57 KB, image/png)
2016-11-10 10:43 UTC, nicolas
no flags Details
Copy templates script (783 bytes, application/x-gzip)
2017-01-02 10:59 UTC, nicolas
no flags Details
engine log after trying to extend a VmPool (689.84 KB, application/x-gzip)
2017-01-05 15:31 UTC, nicolas
no flags Details


Links
System ID Private Priority Status Summary Last Updated
oVirt gerrit 69851 0 master MERGED webadmin: Edit VM-Pool storage domains logic 2017-01-09 14:37:30 UTC
oVirt gerrit 69858 0 ovirt-engine-4.1 MERGED webadmin: Edit VM-Pool storage domains logic 2017-01-09 18:54:58 UTC
oVirt gerrit 69859 0 ovirt-engine-4.0 MERGED webadmin: Edit VM-Pool storage domains logic 2017-01-09 18:58:50 UTC

Description nicolas 2016-11-07 14:46:44 UTC
Description of problem:

When increasing the number of VMs on an already existing pool, an exception is thrown.

2016-11-07 14:35:44,265 ERROR [org.ovirt.engine.ui.frontend.server.gwt.OvirtRemoteLoggingService] (default task-29) [] Permutation name: F3121572B0E0E92B5F00060D10B3F8B7
Uncaught exception: com.google.gwt.event.shared.UmbrellaException: Exception caught: (TypeError)
SEVERE: Uncaught exceptioncom.google.gwt.event.shared.UmbrellaException: Exception caught: (TypeError) 
 __gwt$exception: <skipped>: Cannot read property 'Vg' of undefined
	at Unknown.Ev(webadmin-0.js@25078)
	at Unknown.Mv(webadmin-0.js@41)
	at Unknown.X7(webadmin-0.js@19)
	at Unknown.$7(webadmin-0.js@19)
	at Unknown.i7(webadmin-0.js@117)
	at Unknown.oq(webadmin-0.js@26)
	at Unknown.yq(webadmin-0.js@23798)
	at Unknown.Y2(webadmin-0.js@149)
	at Unknown.qq(webadmin-0.js@112)
	at Unknown.R9e(webadmin-0.js@964)
	at Unknown.z$e(webadmin-0.js@85)
	at Unknown.B0e(webadmin-0.js@46)
	at Unknown.Sx(webadmin-0.js@29)
	at Unknown.Wx(webadmin-0.js@57)
	at Unknown.eval(webadmin-0.js@54)
	at Unknown.OC(webadmin-0.js@20)
	at Unknown.v9e(webadmin-0.js@98)
	at Unknown.lvl(webadmin-0.js@10635)
	at Unknown.R9e(webadmin-0.js@582)
	at Unknown.z$e(webadmin-0.js@85)
	at Unknown.y$e(webadmin-0.js@60)
	at Unknown.z0e(webadmin-0.js@52)
	at Unknown.Sx(webadmin-0.js@29)
	at Unknown.Wx(webadmin-0.js@57)
	at Unknown.eval(webadmin-0.js@54)

Hash Ev in permutation F3121572B0E0E92B5F00060D10B3F8B7 is:

Ev,java.lang.Throwable::fillInStackTrace()Ljava/lang/Throwable;,java.lang.Throwable,fillInStackTrace,com/google/gwt/emul/java/lang/Throwable.java,114,0

Version-Release number of selected component (if applicable):

4.0.4.4-1

How reproducible:

Always 

Steps to Reproduce:
1. Create a VmPool (with one VM, for example)
2. Edit the pool
3. In the "Increase VM pool size in..." field enter 1
4. Save

Actual results:

The exception above is thrown

Expected results:

The pool should have extended their size

Comment 1 Oved Ourfali 2016-11-08 06:26:38 UTC
Tomas - moving this to virt. If it turns out relevant to recent UX changes then please move it to UX.

Comment 2 Tomas Jelinek 2016-11-08 08:03:03 UTC
I can not simulate this on my env, please follow this instructions to provide the debug logs for the frontend:

- yum install ovirt-engine-webadmin-portal-debuginfo
- restart the engine
- simulate the issue again and provide the logs from var/log/ovirt-engine/ui.log

Thank you!

Comment 3 nicolas 2016-11-08 08:08:15 UTC
Created attachment 1218434 [details]
Requested log (ui.log)

Comment 4 Michal Skrivanek 2016-11-08 09:35:15 UTC
looks related to latest ux changes indeed

Comment 5 Oved Ourfali 2016-11-08 10:58:49 UTC
Actually it looks quite different.
Nothing in the call stack is related to the latest cleanup code.
Alexander - can you take a look and see if that's related?

Comment 6 Tomas Jelinek 2016-11-08 11:46:01 UTC
This is the obfuscated function of validation which fails:
function zAq(a,b){
  var c,d,e,f;
  _Ao(a,b);
   if(!a.i){return}
   f=true;
   for(d=new tVf(a.i);d.c<d.e.Zg();) {
     c=mfb(rVf(d),1612);
     e=c.V;
     if(!e.xeb().Vg().xk()||e.zeb()==null){
       c.V.Jc.Pg('Storage Domain must be specified.');
       Qtl(c.V,false);
       f=false
     }
     c.J.ueb(cfb(tEe,Epy,1691,[new J_r,new TSq]));
     f=f&&c.J.Oc}Qtl(a,f)
  }
The part which fails is e.xeb().Vg() which is the diskStorageDomains.getItems().iterator(). Long story short, the diskStorageDomains.getItems() returns null. 

@Tal, do you know about some changes in the storage code which could cause the diskStorageDomains.getItems() to return null?

Comment 7 Greg Sheremeta 2016-11-08 16:33:44 UTC
It looks like a cleanup regression because it's one Model referencing another Model that died. That's been our pattern.


Caused by: com.google.gwt.core.client.JavaScriptException: (TypeError) 
 __gwt$exception: <skipped>: Cannot read property 'Vg' of undefined
	at org.ovirt.engine.ui.uicommonweb.models.storage.DisksAllocationModel.$validateEntity(DisksAllocationModel.java:369)


So the validator is walking the DisksAllocationModel. That property it can't read is another Model (well, ListModel) that is null because -- my guess -- it got prematurely cleaned?


        boolean isModelValid = true;
        for (DiskModel diskModel : getDisks()) {
            ListModel diskStorageDomains = diskModel.getStorageDomain();
            if (!diskStorageDomains.getItems().iterator().hasNext() || diskStorageDomains.getSelectedItem() == null) {
                diskModel.getStorageDomain().getInvalidityReasons().add(
                        constants.storageDomainMustBeSpecifiedInvalidReason());
                diskModel.getStorageDomain().setIsValid(false);
                isModelValid = false;
            }
            diskModel.getAlias().validateEntity(new IValidation[] { new NotEmptyValidation(), new I18NNameValidation() });
            isModelValid = isModelValid && diskModel.getAlias().getIsValid();
        }
        setIsValid(isModelValid);


Need Vojtech or Alexander to chime in, as I'm not sure what the cleanup pattern is that caused the diskStorageDomains to get null.

Comment 8 Vojtech Szocs 2016-11-08 17:20:48 UTC
After switching to 4.0.4 branch [commit `build: post ovirt-engine-4.0.4.4`], I don't see *any* commits related to memory leak fixes. In other words, commit [1] isn't there.

[1] https://gerrit.ovirt.org/#/c/65357/ (lead commit for memory leak fixes)

This leads me to a conclusion that this bug isn't related to memory leak fixes.

Please switch to 4.0.4 branch, build Engine & reproduce the problem.

Comment 9 Oved Ourfali 2016-11-08 17:25:40 UTC
Moving back to virt as 4.0.4 indeed didn't include any cleanup fix.

Comment 10 Tomas Jelinek 2016-11-09 15:27:47 UTC
I can not simulate this on any branch and don't even see from code how could the diskStorageDomains.getItems() become null.

@Nicolas: Could you please provide also the engine logs from time this happens? Maybe some query failed and returned something the frontend did not expect...

Also, could you please give some more details about the storages you have? Both storage domains and what disks the template on which this VM is based on has?

Comment 11 nicolas 2016-11-09 17:49:15 UTC
Unfortunately, there's absolutely nothing in the engine logs. Just the ui.log shows the exception.

There are some additional facts I can provide: It's not needed to extend the VmPool, just clicking on Edit and OK just after (with no changes) throws the exception. So anything I do to edit the VmPool raises the exception.

The template on which the VmPool is based is an ubuntu-1404 template. As per storage domains we have currently 7, and each of them have a copy of the template. All of them are iSCSI based, 4 of them are iSCSI native and 3 of them are based on a gateway for Ceph. No Cinder installation here. The template only has one disk (20GB), thin provisioned.

If I can provide any additional info don't hesitate to ask.

Comment 12 Tomas Jelinek 2016-11-10 10:28:12 UTC
Any chance some of the storage domains on which the template disks are, are in maintenance/down? I came to a similar stack trace when this was the case.

Also, could you please provide a screenshot of the edit pool dialog's resource allocation side tab and the bottom where the "Disk Allocation" is shown?

Comment 13 nicolas 2016-11-10 10:43:09 UTC
Created attachment 1219306 [details]
Screenshot of the disk allocation tab

See snapshot attached.

I see 'Target' and 'Disk profile' fields are empty, could this be the reason of the exception? I see we have this issue in all of our VmPools, even if they use different templates.

We have a script that each night copies all the templates to all of the Storage Domains (to allow migrate any disk to any storage domain).

This is the snippet that does this action:

    template = api.templates.get(name='...')
    disk = template.disk.list()[0]          # Just an example
    action = params.Action(storage_domain=destination_storagedomain, async=False)
    disk.copy(action=action)

Maybe this action is not setting the profile and target fields?

Comment 14 nicolas 2016-11-10 10:52:09 UTC
BTW, forgot to tell: No DSs in maintenance/down.

Comment 15 Tomas Jelinek 2016-11-10 11:59:08 UTC
The NPE happens because the "target" is empty in the dialog.

Seems the api call done as explained in comment 13 causes some issues with the disks not having "target" filled in edit pool dialog.

Moving to storage for further investigation.

Comment 16 Fred Rolland 2016-12-21 14:42:13 UTC
(In reply to nicolas from comment #13)
> Created attachment 1219306 [details]
> Screenshot of the disk allocation tab
> 
> See snapshot attached.
> 
> I see 'Target' and 'Disk profile' fields are empty, could this be the reason
> of the exception? I see we have this issue in all of our VmPools, even if
> they use different templates.
> 
> We have a script that each night copies all the templates to all of the
> Storage Domains (to allow migrate any disk to any storage domain).
> 
> This is the snippet that does this action:
> 
>     template = api.templates.get(name='...')
>     disk = template.disk.list()[0]          # Just an example
>     action = params.Action(storage_domain=destination_storagedomain,
> async=False)
>     disk.copy(action=action)
> 
> Maybe this action is not setting the profile and target fields?

Hi,
What do you mean by "each night copies all the templates to all of the Storage Domains" ?
The disks should be already in all SD after the first run, no ?

Comment 17 Fred Rolland 2016-12-21 16:06:21 UTC
Can you describe also what is your script exactly is doing ?
If I try to copy a disk to a storage domain where it exists already I get an error :

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<action>
    <fault>
        <detail>[Cannot copy Virtual Disk. One of the Template Images already exists.]</detail>
        <reason>Operation Failed</reason>
    </fault>
    <status>failed</status>
    <storage_domain>
        <name>NFS_SD10</name>
    </storage_domain>
</action>

Comment 18 nicolas 2016-12-22 15:26:50 UTC
> Hi,
> What do you mean by "each night copies all the templates to all of the
> Storage Domains" ?
> The disks should be already in all SD after the first run, no ?

Yes for stateful disks, but we work with lots of VMPools and these need the template to be available on each of the DS if you want to be able to migrate their disk to any of them. We also have several iSCSI-based DS, usually not bigger than 500GB to avoid iSCSI performance issues (if you have more than X LVs per DS, performance starts degrading).

I'm on holidays so I have no access to the script that replicates the templates, but basically the algorithm is:

1. Iterate over each template, x
2. Iterate over each DS, y
3. If x is not in y:
4.    copy x to y

If you need to know exactly how it's done I can send it after holidays.

Comment 19 Fred Rolland 2017-01-02 10:20:37 UTC
(In reply to nicolas from comment #18)
> > Hi,
> > What do you mean by "each night copies all the templates to all of the
> > Storage Domains" ?
> > The disks should be already in all SD after the first run, no ?
> 
> Yes for stateful disks, but we work with lots of VMPools and these need the
> template to be available on each of the DS if you want to be able to migrate
> their disk to any of them. We also have several iSCSI-based DS, usually not
> bigger than 500GB to avoid iSCSI performance issues (if you have more than X
> LVs per DS, performance starts degrading).
> 
> I'm on holidays so I have no access to the script that replicates the
> templates, but basically the algorithm is:
> 
> 1. Iterate over each template, x
> 2. Iterate over each DS, y
> 3. If x is not in y:
> 4.    copy x to y
> 
> If you need to know exactly how it's done I can send it after holidays.

Hi,

Thanks for your reply.
What about the pools ? They are already existing when the script runs ?
Are you creating new storage domains, and then copy the template's disks ?
If you can send the script,it will be great.

Thanks

Comment 20 nicolas 2017-01-02 10:58:29 UTC
(In reply to Fred Rolland from comment #19)
> (In reply to nicolas from comment #18)
> > > Hi,
> > > What do you mean by "each night copies all the templates to all of the
> > > Storage Domains" ?
> > > The disks should be already in all SD after the first run, no ?
> > 
> > Yes for stateful disks, but we work with lots of VMPools and these need the
> > template to be available on each of the DS if you want to be able to migrate
> > their disk to any of them. We also have several iSCSI-based DS, usually not
> > bigger than 500GB to avoid iSCSI performance issues (if you have more than X
> > LVs per DS, performance starts degrading).
> > 
> > I'm on holidays so I have no access to the script that replicates the
> > templates, but basically the algorithm is:
> > 
> > 1. Iterate over each template, x
> > 2. Iterate over each DS, y
> > 3. If x is not in y:
> > 4.    copy x to y
> > 
> > If you need to know exactly how it's done I can send it after holidays.
> 
> Hi,
> 
> Thanks for your reply.
> What about the pools ? They are already existing when the script runs ?

Yes, we have plenty of VmPools since a long time ago because they're used for academic subjects, so all of them are already created when the script runs.

> Are you creating new storage domains, and then copy the template's disks ?

Yes. We're creating some SD lately and each time a SD is created, when the script runs, it sees which templates are not copied yet (all of them the first time) and then copy them to the new SD.

> If you can send the script,it will be great.
> 

Find the script attached to this BZ.

> Thanks

Comment 21 nicolas 2017-01-02 10:59:04 UTC
Created attachment 1236588 [details]
Copy templates script

Comment 22 Fred Rolland 2017-01-05 15:11:09 UTC
Thanks for the script, I used to try to reproduced, but I could not get to the same situation you encountered.

Can you provide the engine log? Maybe I will be able to find some clue there.

Thanks.

Comment 23 nicolas 2017-01-05 15:31:35 UTC
Created attachment 1237703 [details]
engine log after trying to extend a VmPool

Find the log attached.

I successfully managed to extend one of our pools which I seem not have tried before (it's called DEMO), so you'll able to find in the logs that it has been extended successfully.

Immediately afterwards I tried to extend a different pool (SIGATEST) and the exception showed up again. I believe it didn't generate any event in the engine log but hopefully you'll be able to find something relevant in it that might shed some light on the issue.

Comment 24 Fred Rolland 2017-01-08 15:44:51 UTC
Thanks for the log. I see there that the update of "Demo" succeeded, but no clues for the other.

Looking at the code , there might be a possibility that it is related to permissions.
Which user do you use for editing the pools ?
Which user created the storage domains?

BTW is remote debug an option ?

Thanks

Comment 25 nicolas 2017-01-08 15:59:22 UTC
Hey Fred,

It was all created and edited with the admin@internal user: Storage domains, pools, etc.

Remote debug might be an option, I'll ask my boss to enable an account with permissions for all the relevant servers and get back to you. We're a moderately big institution so maybe it will take some days but as soon as I get a response I'll reach you.

Comment 26 Fred Rolland 2017-01-09 14:23:29 UTC
Hi,

I succeeded to reproduce the issue finally.

- Create a few Storage Domains
- Create a Template with 1 disk
- Create a Pool from template with 1 VM prestarted
- Copy the template disk using the script provided by Nicolas
- Edit the VM pool

Comment 27 nicolas 2017-01-09 19:15:56 UTC
Nice!

So no remote debug is needed by now?

May I know what's actually the issue that makes the exception be thrown? It's kind of curious that I can reproduce it on one pool and can't on another, as you could see in the logs.

Comment 28 Fred Rolland 2017-01-15 06:58:17 UTC
(In reply to nicolas from comment #27)
> Nice!
> 
> So no remote debug is needed by now?
> 
> May I know what's actually the issue that makes the exception be thrown?
> It's kind of curious that I can reproduce it on one pool and can't on
> another, as you could see in the logs.

Hi,
No need for remote debug. Thanks for the goodwill !!
It is quite a combination like I described in the steps to reproduce,
For example if the pool vms are not prestarted , it did reproduced.

In any case, it is a UI side validation issue.
If you need to update the poll, it will work by REST API even without this fix.

Comment 29 Sandro Bonazzola 2017-01-25 07:55:01 UTC
4.0.6 has been the last oVirt 4.0 release, please re-target this bug.

Comment 30 sefi litmanovich 2017-02-14 13:55:13 UTC
Hi, just to clarify how to re produce / verify, regarding comment #26:
1. Does it work when you invoke copy disk yourself (from webadmin or API) without nicolas' script?
2. The edit vm is invoked during the disk copy? (in this case it just failed because disk is locked)

Comment 31 Fred Rolland 2017-02-14 14:37:09 UTC
1. It did not reproduce via the webadmin. I did using the script.
2. No, the edit VM is invoked after the disk copy

Comment 32 sefi litmanovich 2017-03-13 14:14:14 UTC
Verified twice according to comment #26 and the attached script.
Once verified on rhevm-4.0.7.4-0.1.el7ev.noarch and another time on rhevm-4.1.1.4-0.1.el7.noarch.

In both cases edit vm did not produce any exception.
The template's disk was copied between 5 storage domains.


Note You need to log in before you can comment on or make changes to this bug.