Description of problem: When increasing the number of VMs on an already existing pool, an exception is thrown. 2016-11-07 14:35:44,265 ERROR [org.ovirt.engine.ui.frontend.server.gwt.OvirtRemoteLoggingService] (default task-29) [] Permutation name: F3121572B0E0E92B5F00060D10B3F8B7 Uncaught exception: com.google.gwt.event.shared.UmbrellaException: Exception caught: (TypeError) SEVERE: Uncaught exceptioncom.google.gwt.event.shared.UmbrellaException: Exception caught: (TypeError) __gwt$exception: <skipped>: Cannot read property 'Vg' of undefined at Unknown.Ev(webadmin-0.js@25078) at Unknown.Mv(webadmin-0.js@41) at Unknown.X7(webadmin-0.js@19) at Unknown.$7(webadmin-0.js@19) at Unknown.i7(webadmin-0.js@117) at Unknown.oq(webadmin-0.js@26) at Unknown.yq(webadmin-0.js@23798) at Unknown.Y2(webadmin-0.js@149) at Unknown.qq(webadmin-0.js@112) at Unknown.R9e(webadmin-0.js@964) at Unknown.z$e(webadmin-0.js@85) at Unknown.B0e(webadmin-0.js@46) at Unknown.Sx(webadmin-0.js@29) at Unknown.Wx(webadmin-0.js@57) at Unknown.eval(webadmin-0.js@54) at Unknown.OC(webadmin-0.js@20) at Unknown.v9e(webadmin-0.js@98) at Unknown.lvl(webadmin-0.js@10635) at Unknown.R9e(webadmin-0.js@582) at Unknown.z$e(webadmin-0.js@85) at Unknown.y$e(webadmin-0.js@60) at Unknown.z0e(webadmin-0.js@52) at Unknown.Sx(webadmin-0.js@29) at Unknown.Wx(webadmin-0.js@57) at Unknown.eval(webadmin-0.js@54) Hash Ev in permutation F3121572B0E0E92B5F00060D10B3F8B7 is: Ev,java.lang.Throwable::fillInStackTrace()Ljava/lang/Throwable;,java.lang.Throwable,fillInStackTrace,com/google/gwt/emul/java/lang/Throwable.java,114,0 Version-Release number of selected component (if applicable): 4.0.4.4-1 How reproducible: Always Steps to Reproduce: 1. Create a VmPool (with one VM, for example) 2. Edit the pool 3. In the "Increase VM pool size in..." field enter 1 4. Save Actual results: The exception above is thrown Expected results: The pool should have extended their size
Tomas - moving this to virt. If it turns out relevant to recent UX changes then please move it to UX.
I can not simulate this on my env, please follow this instructions to provide the debug logs for the frontend: - yum install ovirt-engine-webadmin-portal-debuginfo - restart the engine - simulate the issue again and provide the logs from var/log/ovirt-engine/ui.log Thank you!
Created attachment 1218434 [details] Requested log (ui.log)
looks related to latest ux changes indeed
Actually it looks quite different. Nothing in the call stack is related to the latest cleanup code. Alexander - can you take a look and see if that's related?
This is the obfuscated function of validation which fails: function zAq(a,b){ var c,d,e,f; _Ao(a,b); if(!a.i){return} f=true; for(d=new tVf(a.i);d.c<d.e.Zg();) { c=mfb(rVf(d),1612); e=c.V; if(!e.xeb().Vg().xk()||e.zeb()==null){ c.V.Jc.Pg('Storage Domain must be specified.'); Qtl(c.V,false); f=false } c.J.ueb(cfb(tEe,Epy,1691,[new J_r,new TSq])); f=f&&c.J.Oc}Qtl(a,f) } The part which fails is e.xeb().Vg() which is the diskStorageDomains.getItems().iterator(). Long story short, the diskStorageDomains.getItems() returns null. @Tal, do you know about some changes in the storage code which could cause the diskStorageDomains.getItems() to return null?
It looks like a cleanup regression because it's one Model referencing another Model that died. That's been our pattern. Caused by: com.google.gwt.core.client.JavaScriptException: (TypeError) __gwt$exception: <skipped>: Cannot read property 'Vg' of undefined at org.ovirt.engine.ui.uicommonweb.models.storage.DisksAllocationModel.$validateEntity(DisksAllocationModel.java:369) So the validator is walking the DisksAllocationModel. That property it can't read is another Model (well, ListModel) that is null because -- my guess -- it got prematurely cleaned? boolean isModelValid = true; for (DiskModel diskModel : getDisks()) { ListModel diskStorageDomains = diskModel.getStorageDomain(); if (!diskStorageDomains.getItems().iterator().hasNext() || diskStorageDomains.getSelectedItem() == null) { diskModel.getStorageDomain().getInvalidityReasons().add( constants.storageDomainMustBeSpecifiedInvalidReason()); diskModel.getStorageDomain().setIsValid(false); isModelValid = false; } diskModel.getAlias().validateEntity(new IValidation[] { new NotEmptyValidation(), new I18NNameValidation() }); isModelValid = isModelValid && diskModel.getAlias().getIsValid(); } setIsValid(isModelValid); Need Vojtech or Alexander to chime in, as I'm not sure what the cleanup pattern is that caused the diskStorageDomains to get null.
After switching to 4.0.4 branch [commit `build: post ovirt-engine-4.0.4.4`], I don't see *any* commits related to memory leak fixes. In other words, commit [1] isn't there. [1] https://gerrit.ovirt.org/#/c/65357/ (lead commit for memory leak fixes) This leads me to a conclusion that this bug isn't related to memory leak fixes. Please switch to 4.0.4 branch, build Engine & reproduce the problem.
Moving back to virt as 4.0.4 indeed didn't include any cleanup fix.
I can not simulate this on any branch and don't even see from code how could the diskStorageDomains.getItems() become null. @Nicolas: Could you please provide also the engine logs from time this happens? Maybe some query failed and returned something the frontend did not expect... Also, could you please give some more details about the storages you have? Both storage domains and what disks the template on which this VM is based on has?
Unfortunately, there's absolutely nothing in the engine logs. Just the ui.log shows the exception. There are some additional facts I can provide: It's not needed to extend the VmPool, just clicking on Edit and OK just after (with no changes) throws the exception. So anything I do to edit the VmPool raises the exception. The template on which the VmPool is based is an ubuntu-1404 template. As per storage domains we have currently 7, and each of them have a copy of the template. All of them are iSCSI based, 4 of them are iSCSI native and 3 of them are based on a gateway for Ceph. No Cinder installation here. The template only has one disk (20GB), thin provisioned. If I can provide any additional info don't hesitate to ask.
Any chance some of the storage domains on which the template disks are, are in maintenance/down? I came to a similar stack trace when this was the case. Also, could you please provide a screenshot of the edit pool dialog's resource allocation side tab and the bottom where the "Disk Allocation" is shown?
Created attachment 1219306 [details] Screenshot of the disk allocation tab See snapshot attached. I see 'Target' and 'Disk profile' fields are empty, could this be the reason of the exception? I see we have this issue in all of our VmPools, even if they use different templates. We have a script that each night copies all the templates to all of the Storage Domains (to allow migrate any disk to any storage domain). This is the snippet that does this action: template = api.templates.get(name='...') disk = template.disk.list()[0] # Just an example action = params.Action(storage_domain=destination_storagedomain, async=False) disk.copy(action=action) Maybe this action is not setting the profile and target fields?
BTW, forgot to tell: No DSs in maintenance/down.
The NPE happens because the "target" is empty in the dialog. Seems the api call done as explained in comment 13 causes some issues with the disks not having "target" filled in edit pool dialog. Moving to storage for further investigation.
(In reply to nicolas from comment #13) > Created attachment 1219306 [details] > Screenshot of the disk allocation tab > > See snapshot attached. > > I see 'Target' and 'Disk profile' fields are empty, could this be the reason > of the exception? I see we have this issue in all of our VmPools, even if > they use different templates. > > We have a script that each night copies all the templates to all of the > Storage Domains (to allow migrate any disk to any storage domain). > > This is the snippet that does this action: > > template = api.templates.get(name='...') > disk = template.disk.list()[0] # Just an example > action = params.Action(storage_domain=destination_storagedomain, > async=False) > disk.copy(action=action) > > Maybe this action is not setting the profile and target fields? Hi, What do you mean by "each night copies all the templates to all of the Storage Domains" ? The disks should be already in all SD after the first run, no ?
Can you describe also what is your script exactly is doing ? If I try to copy a disk to a storage domain where it exists already I get an error : <?xml version="1.0" encoding="UTF-8" standalone="yes"?> <action> <fault> <detail>[Cannot copy Virtual Disk. One of the Template Images already exists.]</detail> <reason>Operation Failed</reason> </fault> <status>failed</status> <storage_domain> <name>NFS_SD10</name> </storage_domain> </action>
> Hi, > What do you mean by "each night copies all the templates to all of the > Storage Domains" ? > The disks should be already in all SD after the first run, no ? Yes for stateful disks, but we work with lots of VMPools and these need the template to be available on each of the DS if you want to be able to migrate their disk to any of them. We also have several iSCSI-based DS, usually not bigger than 500GB to avoid iSCSI performance issues (if you have more than X LVs per DS, performance starts degrading). I'm on holidays so I have no access to the script that replicates the templates, but basically the algorithm is: 1. Iterate over each template, x 2. Iterate over each DS, y 3. If x is not in y: 4. copy x to y If you need to know exactly how it's done I can send it after holidays.
(In reply to nicolas from comment #18) > > Hi, > > What do you mean by "each night copies all the templates to all of the > > Storage Domains" ? > > The disks should be already in all SD after the first run, no ? > > Yes for stateful disks, but we work with lots of VMPools and these need the > template to be available on each of the DS if you want to be able to migrate > their disk to any of them. We also have several iSCSI-based DS, usually not > bigger than 500GB to avoid iSCSI performance issues (if you have more than X > LVs per DS, performance starts degrading). > > I'm on holidays so I have no access to the script that replicates the > templates, but basically the algorithm is: > > 1. Iterate over each template, x > 2. Iterate over each DS, y > 3. If x is not in y: > 4. copy x to y > > If you need to know exactly how it's done I can send it after holidays. Hi, Thanks for your reply. What about the pools ? They are already existing when the script runs ? Are you creating new storage domains, and then copy the template's disks ? If you can send the script,it will be great. Thanks
(In reply to Fred Rolland from comment #19) > (In reply to nicolas from comment #18) > > > Hi, > > > What do you mean by "each night copies all the templates to all of the > > > Storage Domains" ? > > > The disks should be already in all SD after the first run, no ? > > > > Yes for stateful disks, but we work with lots of VMPools and these need the > > template to be available on each of the DS if you want to be able to migrate > > their disk to any of them. We also have several iSCSI-based DS, usually not > > bigger than 500GB to avoid iSCSI performance issues (if you have more than X > > LVs per DS, performance starts degrading). > > > > I'm on holidays so I have no access to the script that replicates the > > templates, but basically the algorithm is: > > > > 1. Iterate over each template, x > > 2. Iterate over each DS, y > > 3. If x is not in y: > > 4. copy x to y > > > > If you need to know exactly how it's done I can send it after holidays. > > Hi, > > Thanks for your reply. > What about the pools ? They are already existing when the script runs ? Yes, we have plenty of VmPools since a long time ago because they're used for academic subjects, so all of them are already created when the script runs. > Are you creating new storage domains, and then copy the template's disks ? Yes. We're creating some SD lately and each time a SD is created, when the script runs, it sees which templates are not copied yet (all of them the first time) and then copy them to the new SD. > If you can send the script,it will be great. > Find the script attached to this BZ. > Thanks
Created attachment 1236588 [details] Copy templates script
Thanks for the script, I used to try to reproduced, but I could not get to the same situation you encountered. Can you provide the engine log? Maybe I will be able to find some clue there. Thanks.
Created attachment 1237703 [details] engine log after trying to extend a VmPool Find the log attached. I successfully managed to extend one of our pools which I seem not have tried before (it's called DEMO), so you'll able to find in the logs that it has been extended successfully. Immediately afterwards I tried to extend a different pool (SIGATEST) and the exception showed up again. I believe it didn't generate any event in the engine log but hopefully you'll be able to find something relevant in it that might shed some light on the issue.
Thanks for the log. I see there that the update of "Demo" succeeded, but no clues for the other. Looking at the code , there might be a possibility that it is related to permissions. Which user do you use for editing the pools ? Which user created the storage domains? BTW is remote debug an option ? Thanks
Hey Fred, It was all created and edited with the admin@internal user: Storage domains, pools, etc. Remote debug might be an option, I'll ask my boss to enable an account with permissions for all the relevant servers and get back to you. We're a moderately big institution so maybe it will take some days but as soon as I get a response I'll reach you.
Hi, I succeeded to reproduce the issue finally. - Create a few Storage Domains - Create a Template with 1 disk - Create a Pool from template with 1 VM prestarted - Copy the template disk using the script provided by Nicolas - Edit the VM pool
Nice! So no remote debug is needed by now? May I know what's actually the issue that makes the exception be thrown? It's kind of curious that I can reproduce it on one pool and can't on another, as you could see in the logs.
(In reply to nicolas from comment #27) > Nice! > > So no remote debug is needed by now? > > May I know what's actually the issue that makes the exception be thrown? > It's kind of curious that I can reproduce it on one pool and can't on > another, as you could see in the logs. Hi, No need for remote debug. Thanks for the goodwill !! It is quite a combination like I described in the steps to reproduce, For example if the pool vms are not prestarted , it did reproduced. In any case, it is a UI side validation issue. If you need to update the poll, it will work by REST API even without this fix.
4.0.6 has been the last oVirt 4.0 release, please re-target this bug.
Hi, just to clarify how to re produce / verify, regarding comment #26: 1. Does it work when you invoke copy disk yourself (from webadmin or API) without nicolas' script? 2. The edit vm is invoked during the disk copy? (in this case it just failed because disk is locked)
1. It did not reproduce via the webadmin. I did using the script. 2. No, the edit VM is invoked after the disk copy
Verified twice according to comment #26 and the attached script. Once verified on rhevm-4.0.7.4-0.1.el7ev.noarch and another time on rhevm-4.1.1.4-0.1.el7.noarch. In both cases edit vm did not produce any exception. The template's disk was copied between 5 storage domains.