Bug 798775 - cannot launch rhevm instances w/ userdata
Status: CLOSED CURRENTRELEASE
Product: CloudForms Cloud Engine
Classification: Red Hat
Component: aeolus-configserver
Version: 1.0.0
Hardware: Unspecified Unspecified
Priority: unspecified Severity: high
Target Milestone: beta4
Target Release: ---
Assigned To: Greg Blomquist
QA Contact: dgao
Depends On:
Blocks:
Reported: 2012-02-29 15:09 EST by dgao
Modified: 2012-08-30 13:17 EDT

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2012-08-30 13:17:12 EDT


Attachments
deployable.xml (1.02 KB, text/xml)
2012-02-29 15:09 EST, dgao
rhel_template (742 bytes, application/octet-stream)
2012-02-29 15:10 EST, dgao

Description dgao 2012-02-29 15:09:33 EST
RHEVM does not seem to like the <services> block of the deployable.xml. If that block is enabled, the rhevm instance does not launch (it retries indefinitely) and conductor reports a "pending" state.

The same deployable.xml works in ec2. The same deployable.xml minus the <services> block allows the rhevm instance to launch.

The best guess right now is that rhevm does not like the userdata in the instance, but that could change as investigation continues.

[root@dell-t7400-01 ~]# rpm -qa | egrep "imagefactory|iwhd|deltacloud|aeolus" | sort
aeolus-all-0.8.0-38.el6.noarch
aeolus-conductor-0.8.0-38.el6.noarch
aeolus-conductor-daemons-0.8.0-38.el6.noarch
aeolus-conductor-doc-0.8.0-38.el6.noarch
aeolus-configure-2.5.0-15.el6.noarch
deltacloud-core-0.5.0-5.el6.noarch
deltacloud-core-ec2-0.5.0-5.el6.noarch
deltacloud-core-rhevm-0.5.0-5.el6.noarch
deltacloud-core-vsphere-0.5.0-5.el6.noarch
imagefactory-1.0.0rc8-1.el6.noarch
imagefactory-jeosconf-ec2-fedora-1.0.0rc8-1.el6.noarch
imagefactory-jeosconf-ec2-rhel-1.0.0rc8-1.el6.noarch
iwhd-1.2-3.el6.x86_64
rubygem-aeolus-cli-0.3.0-11.el6.noarch
rubygem-aeolus-image-0.3.0-10.el6.noarch
rubygem-deltacloud-client-0.5.0-2.el6.noarch
rubygem-imagefactory-console-0.4.0-1.el6.noarch
Comment 1 dgao 2012-02-29 15:09:57 EST
Created attachment 566630
deployable.xml
Comment 2 dgao 2012-02-29 15:10:16 EST
Created attachment 566631
rhel_template
Comment 3 Greg Blomquist 2012-02-29 17:13:57 EST
I was able to successfully launch this deployable with the rdu rhevm cluster that dradez built.  However, we realized that the vdsm-hook-floppyinject RPM was old (didn't contain the base64 decode code).

After updating the floppyinject hook on the hypervisors and restarting rhevm, the deployment failed the same way as dgao is reporting in this bug.
Comment 4 Greg Blomquist 2012-02-29 18:06:01 EST
We redirected all output of the floppyinject hook to a log file and captured this output:

shahar: /bin/mount -o loop,uid=36,gid=36 /tmp/deltacloud-user-data.txt /tmp/tmpWsCn2Y
floppyinject: error /bin/mount: mount: could not find any free loop device
floppyinject: [unexpected error]: Traceback (most recent call last):
  File "/usr/libexec/vdsm/hooks/before_vm_start/50_floppyinject", line 138, in <module>
    createFloppy(filename, path, content)
  File "/usr/libexec/vdsm/hooks/before_vm_start/50_floppyinject", line 96, in createFloppy
    sys.exit(2)
SystemExit: 2

This is telling us that the hypervisor has eaten through all 8 of its available loopback devices.  I.e., by default, you can only launch 8 guests in rhevm that have "user_data" before you hit this problem.

One possibility is to bump up the number of loopback devices, but that's only a temporary measure. Ultimately, the floppyinject hook needs to clean up after itself.

It's just unclear how it can know when to clean up old loopbacks.
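
To confirm that a hypervisor really has run out of loop devices, the list of attached devices can be inspected with `losetup -a`. The following is only a rough sketch: the helper names `parse_losetup` and `loop_devices_in_use` are made up for illustration, and it assumes each line of `losetup -a` output starts with the device path followed by a colon.

```python
import subprocess


def parse_losetup(output):
    """Return the loop device paths found in `losetup -a` style output.

    Assumes each line begins with the device path followed by ':'.
    """
    devices = []
    for line in output.splitlines():
        line = line.strip()
        if line.startswith("/dev/loop") and ":" in line:
            devices.append(line.split(":", 1)[0])
    return devices


def loop_devices_in_use():
    """Run `losetup -a` (normally needs root) and list attached devices."""
    output = subprocess.check_output(["losetup", "-a"])
    return parse_losetup(output.decode("utf-8", "replace"))
```

If `len(loop_devices_in_use())` equals the number of loop devices the hypervisor provides (8 by default, per the analysis above), the next mount with `-o loop` would be expected to fail in the same way as the logged error.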
Comment 5 David Lutterkort 2012-02-29 23:18:36 EST
The hook should only need a loopback dev while it's assembling the image; it should unmount (and make sure it does that under all error conditions) as soon as the image has been built.
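
A minimal sketch of that idea, not the actual hook code: the mount is wrapped in a context manager so the unmount runs even if building the image raises. The name `loop_mounted` and the injectable `run` parameter are hypothetical; the mount options are copied from the logged command in comment 4, and the sketch assumes that unmounting a `-o loop` mount also releases the loop device.

```python
import contextlib
import subprocess


@contextlib.contextmanager
def loop_mounted(image, mountpoint, run=subprocess.check_call):
    """Loop-mount `image` on `mountpoint` and always unmount afterwards.

    `run` executes a command list and raises on failure; it is a
    parameter only so the sketch can be exercised without root.
    """
    # If the mount itself fails, nothing was mounted, so no cleanup is needed.
    run(["/bin/mount", "-o", "loop,uid=36,gid=36", image, mountpoint])
    try:
        yield mountpoint
    finally:
        # Runs on success and on any exception raised inside the with-block.
        run(["/bin/umount", mountpoint])
```

The hook would then write the user data inside a `with loop_mounted(image, mountpoint):` block, so the loop device is held only while the image is being assembled.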
Comment 6 Michal Fojtik 2012-03-01 09:37:28 EST
I agree with David. Audrey should eject the floppy once the user_data has been consumed inside the guest. That will not solve the problem (you still cannot launch more than 8 instances at once), but it should make the temporary VDSM workaround we're using smarter.

I would also suggest updating the floppy hook so that it unmounts the loop device once the instance is powered off (if it does not already do so).

Additionally, we can make this less error-prone by increasing the number of loopback devices on the hypervisor:

/etc/modprobe.conf:
options loop max_loop=64

(or kernel param)

and:

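# Create device nodes loop0..loop63 to match max_loop=64
for i in $(seq 0 63); do
  # Skip nodes that already exist; mknod fails on an existing path
  [ -e /dev/loop$i ] || mknod -m0660 /dev/loop$i b 7 $i
  chown root:disk /dev/loop$i
done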

That should make us safe for 64 instances running in parallel.
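
After applying a change like this, it may be worth checking that the device nodes actually exist. A small sketch, assuming the nodes are named `loop0` through `loop63` under `/dev`; the function name `missing_loop_nodes` is hypothetical.

```python
import os


def missing_loop_nodes(count, dev_dir="/dev"):
    """Return the paths of loop0..loop<count-1> that do not exist in dev_dir."""
    missing = []
    for i in range(count):
        path = os.path.join(dev_dir, "loop%d" % i)
        if not os.path.exists(path):
            missing.append(path)
    return missing
```

Calling `missing_loop_nodes(64)` on the hypervisor should return an empty list once all 64 nodes are in place.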
Comment 8 Greg Blomquist 2012-03-06 16:42:03 EST
git hash: 5f0fe23ea7e41f7306e19a77f4eaa9ffb9761f90
git repo: https://github.com/aeolusproject/vdsm-hook-floppyinject
Comment 10 dgao 2012-03-07 16:51:34 EST
Marking this as verified since I was able to successfully launch 10+ rhevm instances w/ userdata