Bug 798775 - cannot launch rhevm instances w/ userdata
Summary: cannot launch rhevm instances w/ userdata
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: CloudForms Cloud Engine
Classification: Retired
Component: aeolus-configserver
Version: 1.0.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: beta4
Assignee: Greg Blomquist
QA Contact: dgao
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2012-02-29 20:09 UTC by dgao
Modified: 2012-08-30 17:17 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2012-08-30 17:17:12 UTC
Embargoed:


Attachments
deployable.xml (1.02 KB, text/xml)
2012-02-29 20:09 UTC, dgao
rhel_template (742 bytes, application/octet-stream)
2012-02-29 20:10 UTC, dgao

Description dgao 2012-02-29 20:09:33 UTC
RHEVM seems not to like the <services> block of the deployable.xml. If that block is enabled, the rhevm instance does not launch (the launch is retried indefinitely) and conductor reports a "pending" state.

The same deployable.xml works in ec2. With the <services> block removed, the same deployable.xml also launches a rhevm instance.

The best guess right now is that rhevm is not liking the userdata in the instance, but that could change as investigation continues. 

[root@dell-t7400-01 ~]# rpm -qa | egrep "imagefactory|iwhd|deltacloud|aeolus" | sort
aeolus-all-0.8.0-38.el6.noarch
aeolus-conductor-0.8.0-38.el6.noarch
aeolus-conductor-daemons-0.8.0-38.el6.noarch
aeolus-conductor-doc-0.8.0-38.el6.noarch
aeolus-configure-2.5.0-15.el6.noarch
deltacloud-core-0.5.0-5.el6.noarch
deltacloud-core-ec2-0.5.0-5.el6.noarch
deltacloud-core-rhevm-0.5.0-5.el6.noarch
deltacloud-core-vsphere-0.5.0-5.el6.noarch
imagefactory-1.0.0rc8-1.el6.noarch
imagefactory-jeosconf-ec2-fedora-1.0.0rc8-1.el6.noarch
imagefactory-jeosconf-ec2-rhel-1.0.0rc8-1.el6.noarch
iwhd-1.2-3.el6.x86_64
rubygem-aeolus-cli-0.3.0-11.el6.noarch
rubygem-aeolus-image-0.3.0-10.el6.noarch
rubygem-deltacloud-client-0.5.0-2.el6.noarch
rubygem-imagefactory-console-0.4.0-1.el6.noarch

Comment 1 dgao 2012-02-29 20:09:57 UTC
Created attachment 566630 [details]
deployable.xml

Comment 2 dgao 2012-02-29 20:10:16 UTC
Created attachment 566631 [details]
rhel_template

Comment 3 Greg Blomquist 2012-02-29 22:13:57 UTC
I was able to successfully launch this deployable with the rdu rhevm cluster that dradez built.  However, we realized that the vdsm-hook-floppyinject RPM was old (didn't contain the base64 decode code).

After updating the floppyinject hook on the hypervisors and restarting rhevm, the deployment failed the same way as dgao is reporting in this bug.

Comment 4 Greg Blomquist 2012-02-29 23:06:01 UTC
We redirected all output of the floppyinject hook to a log file and captured this output:

shahar: /bin/mount -o loop,uid=36,gid=36 /tmp/deltacloud-user-data.txt /tmp/tmpWsCn2Y
floppyinject: error /bin/mount: mount: could not find any free loop device
floppyinject: [unexpected error]: Traceback (most recent call last):
  File "/usr/libexec/vdsm/hooks/before_vm_start/50_floppyinject", line 138, in <module>
    createFloppy(filename, path, content)
  File "/usr/libexec/vdsm/hooks/before_vm_start/50_floppyinject", line 96, in createFloppy
    sys.exit(2)
SystemExit: 2

This is telling us that the hypervisor has eaten through all 8 of its available loopback devices.  I.e., by default, you can only launch 8 guests in rhevm that have "user_data" before you hit this problem.
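
One rough way to confirm this on a hypervisor is to count the entries reported by `losetup -a`. The sketch below assumes the usual `/dev/loopN: ...` line format of that command; the function names and the sample output in the usage note are hypothetical, and the limit of 8 is simply the default figure quoted above.

```python
import subprocess

# Default number of loop devices, as quoted in this report.
DEFAULT_LOOP_LIMIT = 8


def read_losetup():
    """Return the raw text of `losetup -a` (typically needs root)."""
    return subprocess.check_output(["losetup", "-a"]).decode()


def loop_devices_in_use(losetup_output):
    """Return the loop device paths listed in `losetup -a` style output."""
    devices = []
    for line in losetup_output.splitlines():
        line = line.strip()
        # Each attached device is assumed to appear as "/dev/loopN: ...".
        if line.startswith("/dev/loop") and ":" in line:
            devices.append(line.split(":", 1)[0])
    return devices


def loop_devices_exhausted(losetup_output, limit=DEFAULT_LOOP_LIMIT):
    """True when the number of attached loop devices has reached the limit."""
    return len(loop_devices_in_use(losetup_output)) >= limit
```

On a hypervisor, `loop_devices_exhausted(read_losetup())` would then indicate whether the next mount with `-o loop` is likely to fail with the error shown above.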

One possibility is to bump up the number of loopback devices, but that's only a temporary measure.  Ultimately, the floppyinject hook needs to clean up after itself.

It's just unclear how it can know when to clean up old loopbacks.

Comment 5 David Lutterkort 2012-03-01 04:18:36 UTC
The hook should only need a loopback dev while it's assembling the image; it should unmount (and make sure it does that under all error conditions) as soon as the image has been built.
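
A minimal sketch of that pattern, not the actual hook code: the mount options are copied from the hook output logged in comment 4, while `loop_mounted` and its `run` parameter are hypothetical names introduced here for illustration. The `finally` clause is what guarantees the unmount on every error path.

```python
import contextlib
import os
import subprocess
import tempfile


@contextlib.contextmanager
def loop_mounted(image_path, run=subprocess.check_call):
    """Loop-mount image_path on a temporary directory and always unmount it.

    `run` executes a command given as an argument list; it is a parameter
    only so the sketch can be exercised without root privileges.
    """
    mount_point = tempfile.mkdtemp()
    mounted = False
    try:
        run(["/bin/mount", "-o", "loop,uid=36,gid=36", image_path, mount_point])
        mounted = True
        yield mount_point
    finally:
        # Runs on normal exit and when the caller's block raises,
        # so the loop device is released either way.
        if mounted:
            run(["/bin/umount", mount_point])
        os.rmdir(mount_point)
```

A caller would write the user data inside `with loop_mounted(path) as mount_point:`. The loop device is then held only while the image is being assembled, regardless of how many guests are running.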

Comment 6 Michal Fojtik 2012-03-01 14:37:28 UTC
I agree with David. Audrey should eject the floppy once the user_data has been consumed inside the guest. It will not solve the problem (you will still not be able to launch more than 8 instances at once), but it should make the temporary VDSM workaround we're using more clever.

I would also suggest updating the floppy hook so that it unmounts the loop device once the instance is powered off (if it does not do so already).

Additionally, we can make this less error-prone by increasing the number of loopback devices on the hypervisor:

/etc/modprobe.conf:
options loop max_loop=64

(or kernel param)

and:

# Create loop device nodes 0-63 to match max_loop=64 above
for i in $(seq 0 63); do
  [ -e /dev/loop$i ] || mknod -m0660 /dev/loop$i b 7 $i
  chown root:disk /dev/loop$i
done

That should make us safe for 64 instances running in parallel.

Comment 8 Greg Blomquist 2012-03-06 21:42:03 UTC
git hash: 5f0fe23ea7e41f7306e19a77f4eaa9ffb9761f90
git repo: https://github.com/aeolusproject/vdsm-hook-floppyinject

Comment 10 dgao 2012-03-07 21:51:34 UTC
Marking this as verified since I was able to successfully launch 10+ rhevm instances w/ userdata

