798775 – cannot launch rhevm instances w/ userdata

Summary: cannot launch rhevm instances w/ userdata

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	CloudForms Cloud Engine
Classification:	Retired
Component:	aeolus-configserver
Sub Component:
Version:	1.0.0
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	high
Target Milestone:	beta4
Assignee:	Greg Blomquist
QA Contact:	dgao
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2012-02-29 20:09 UTC by dgao
Modified:	2012-08-30 17:17 UTC
CC List:	6 users
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2012-08-30 17:17:12 UTC
Embargoed:

Attachments
deployable.xml (1.02 KB, text/xml) 2012-02-29 20:09 UTC, dgao	no flags
rhel_template (742 bytes, application/octet-stream) 2012-02-29 20:10 UTC, dgao	no flags

Description dgao 2012-02-29 20:09:33 UTC

RHEVM does not appear to accept the <services> block of the deployable.xml. With that block present, the rhevm instance never launches (it retries indefinitely) and conductor reports a "pending" state.

The same deployable.xml works in ec2. Removing the <services> block from the same deployable.xml lets the rhevm instance launch.

The best guess right now is that rhevm does not like the userdata in the instance, but that could change as investigation continues.

[root@dell-t7400-01 ~]# rpm -qa | egrep "imagefactory|iwhd|deltacloud|aeolus" | sort
aeolus-all-0.8.0-38.el6.noarch
aeolus-conductor-0.8.0-38.el6.noarch
aeolus-conductor-daemons-0.8.0-38.el6.noarch
aeolus-conductor-doc-0.8.0-38.el6.noarch
aeolus-configure-2.5.0-15.el6.noarch
deltacloud-core-0.5.0-5.el6.noarch
deltacloud-core-ec2-0.5.0-5.el6.noarch
deltacloud-core-rhevm-0.5.0-5.el6.noarch
deltacloud-core-vsphere-0.5.0-5.el6.noarch
imagefactory-1.0.0rc8-1.el6.noarch
imagefactory-jeosconf-ec2-fedora-1.0.0rc8-1.el6.noarch
imagefactory-jeosconf-ec2-rhel-1.0.0rc8-1.el6.noarch
iwhd-1.2-3.el6.x86_64
rubygem-aeolus-cli-0.3.0-11.el6.noarch
rubygem-aeolus-image-0.3.0-10.el6.noarch
rubygem-deltacloud-client-0.5.0-2.el6.noarch
rubygem-imagefactory-console-0.4.0-1.el6.noarch

Comment 1 dgao 2012-02-29 20:09:57 UTC

Created attachment 566630
deployable.xml

Comment 2 dgao 2012-02-29 20:10:16 UTC

Created attachment 566631
rhel_template

Comment 3 Greg Blomquist 2012-02-29 22:13:57 UTC

I was able to successfully launch this deployable with the rdu rhevm cluster that dradez built.  However, we realized that the vdsm-hook-floppyinject RPM was old (didn't contain the base64 decode code).

After updating the floppyinject hook on the hypervisors and restarting rhevm, the deployment failed the same way as dgao is reporting in this bug.

Comment 4 Greg Blomquist 2012-02-29 23:06:01 UTC

We redirected all output of the floppyinject hook to a log file and captured this output:

shahar: /bin/mount -o loop,uid=36,gid=36 /tmp/deltacloud-user-data.txt /tmp/tmpWsCn2Y
floppyinject: error /bin/mount: mount: could not find any free loop device
floppyinject: [unexpected error]: Traceback (most recent call last):
  File "/usr/libexec/vdsm/hooks/before_vm_start/50_floppyinject", line 138, in <module>
    createFloppy(filename, path, content)
  File "/usr/libexec/vdsm/hooks/before_vm_start/50_floppyinject", line 96, in createFloppy
    sys.exit(2)
SystemExit: 2

This is telling us that the hypervisor has eaten through all 8 of its available loopback devices.  I.e., by default, you can only launch 8 guests in rhevm that have "user_data" before you hit this problem.

One possibility is to bump up the number of loopback devices, but that's only a temporary measure.  Ultimately, the floppyinject hook needs to clean up after itself.

It's just unclear how it can know when to clean up old loopbacks.
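
To check the diagnosis above on a hypervisor, the number of attached loop devices can be compared against the default limit of 8. The following is only a rough sketch: the helper names are hypothetical, and it assumes the text passed in is the output of `losetup -a`, where each line begins with the device path followed by a colon.

```python
import re


def attached_loop_devices(losetup_output):
    """Return the loop device paths that start each line of `losetup -a` output.

    Assumption: every attached device is reported on its own line in the
    form "/dev/loopN: ...". Lines that do not match are ignored.
    """
    devices = []
    for line in losetup_output.splitlines():
        match = re.match(r"(/dev/loop\d+):", line.strip())
        if match:
            devices.append(match.group(1))
    return devices


def free_loop_slots(losetup_output, max_loop=8):
    """Estimate how many loop devices are still free.

    max_loop defaults to 8, the limit mentioned in this bug; pass the
    configured value if the loop module was loaded with a different one.
    """
    used = len(attached_loop_devices(losetup_output))
    return max(0, max_loop - used)
```

Feeding it the captured output of `losetup -a` from a hypervisor that has already started 8 guests with user_data should yield 0 free slots if the hook is leaking loop devices as described.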

Comment 5 David Lutterkort 2012-03-01 04:18:36 UTC

The hook should only need a loopback dev while it's assembling the image; it should unmount (and make sure it does that under all error conditions) as soon as the image has been built.
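
A minimal sketch of that pattern is shown here. It is not the actual floppyinject code: `build_floppy` and its `run` parameter are hypothetical names, and only the mount options are taken from the log in comment 4. The point is the `try`/`finally`, which releases the mount (and with it the loop device) whether or not writing the content succeeds. `umount -d` is used on the assumption that the installed util-linux needs it to detach the loop device.

```python
import os
import subprocess
import tempfile


def build_floppy(image_path, filename, content, run=subprocess.check_call):
    """Write content into a floppy image, holding a loop device only briefly.

    run is injectable so the command sequence can be exercised without root;
    by default it is subprocess.check_call, which raises on a non-zero exit.
    """
    mount_dir = tempfile.mkdtemp()
    mounted = False
    try:
        # Same mount options as seen in the hook's log output.
        run(["/bin/mount", "-o", "loop,uid=36,gid=36", image_path, mount_dir])
        mounted = True
        with open(os.path.join(mount_dir, filename), "w") as handle:
            handle.write(content)
    finally:
        # Runs on success and on every error path after mkdtemp.
        if mounted:
            # -d also frees the loop device backing the mount.
            run(["/bin/umount", "-d", mount_dir])
        os.rmdir(mount_dir)
```

With this shape, a failure while writing the user data still unmounts the image, so a failed launch does not permanently consume one of the hypervisor's loop devices.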

Comment 6 Michal Fojtik 2012-03-01 14:37:28 UTC

I agree with David. Audrey should eject the floppy once the user_data has been consumed inside the guest. That will not solve the problem (you still will not be able to launch more than 8 instances at once), but it should make the temporary VDSM workaround we're using smarter.

I would also suggest updating the floppy hook so that it unmounts the loop device once the instance is powered off (if it does not already do so).

Additionally, we can make this less error-prone by increasing the number of loopback devices on the hypervisor:

/etc/modprobe.conf:
options loop max_loop=64

(or kernel param)

and:

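# Create device nodes loop0..loop63 to match max_loop=64 above
for i in $(seq 0 63); do
  # Skip nodes that already exist (loop0..loop7 are present by default)
  [ -e /dev/loop$i ] || mknod -m0660 /dev/loop$i b 7 $i
  chown root:disk /dev/loop$i
done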

That should make us safe for 64 instances running in parallel.

Comment 8 Greg Blomquist 2012-03-06 21:42:03 UTC

git hash: 5f0fe23ea7e41f7306e19a77f4eaa9ffb9761f90
git repo: https://github.com/aeolusproject/vdsm-hook-floppyinject

Comment 9 Greg Blomquist 2012-03-06 22:49:50 UTC

rhel5: http://brewweb.devel.redhat.com/brew/taskinfo?taskID=4122953
rhel6: http://brewweb.devel.redhat.com/brew/taskinfo?taskID=4122980

Comment 10 dgao 2012-03-07 21:51:34 UTC

Marking this as verified since I was able to successfully launch 10+ rhevm instances w/ userdata
