Bug 877428 - VM universe jobs not working on RHEL5 Xen 32
Summary: VM universe jobs not working on RHEL5 Xen 32
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Red Hat Enterprise MRG
Classification: Red Hat
Component: condor-vm-gahp
Version: Development
Hardware: i386
OS: Linux
Priority: medium
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: grid-maint-list
QA Contact: MRG Quality Engineering
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2012-11-16 14:04 UTC by Luigi Toscano
Modified: 2016-05-26 19:11 UTC (History)

Fixed In Version:
Doc Type: Known Issue
Doc Text:
Cause: Attempting to launch a virtual machine image larger than 2 GB using Condor's virtual machine universe on a 32-bit RHEL5 machine with the Xen hypervisor.
Consequence: The job fails to run.
Workaround (if any):
- use the 64-bit version of RHEL5
- create smaller disk images (less than 2 GB)
- use RHEL6
- use KVM
Result: With the known workarounds, virtual machine jobs of any size can be run.
Clone Of:
Environment:
Last Closed: 2016-05-26 19:11:34 UTC
Target Upstream Version:
Embargoed:


Links
System ID Private Priority Status Summary Last Updated
Red Hat Bugzilla 902339 1 None None None 2021-01-20 06:05:38 UTC

Internal Links: 902339

Description Luigi Toscano 2012-11-16 14:04:31 UTC
Description of problem:

Configure a 32-bit RHEL 5.x (x >= 8) system to support VM jobs:
VM_TYPE=xen
VM_GAHP_SERVER=$(SBIN)/condor_vm-gahp
VM_MEMORY=256*4
XEN_BOOTLOADER=/usr/bin/pygrub


Submit, using a normal non-privileged user, the following job:
--------------
Universe=vm
Log=log.$(cluster)
Executable=testvm
VM_TYPE=xen
VM_MEMORY=768
VM_DISK=/var/lib/xen/images/testvm.img:xvda:w
XEN_KERNEL=included
Queue
--------------
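
The VM_DISK value in the job above follows a file:device:permission layout. As a rough sketch of the kind of check that could produce the "cannot be modified" message in the log below, the following Python snippet splits such a value and tests whether the backing file is writable. The names `DiskSpec`, `parse_disk_spec` and `passes_write_check`, and the use of `os.access`, are illustrative assumptions and not the condor_vm-gahp implementation.

```python
import os
from typing import NamedTuple


class DiskSpec(NamedTuple):
    path: str
    device: str
    permission: str


def parse_disk_spec(spec: str) -> DiskSpec:
    """Split a 'file:device:permission' string (hypothetical helper)."""
    parts = spec.split(":")
    if len(parts) != 3:
        raise ValueError(f"expected file:device:permission, got {spec!r}")
    return DiskSpec(*parts)


def passes_write_check(spec: DiskSpec) -> bool:
    """Disks marked 'w' need a backing file the current user can write to."""
    if spec.permission != "w":
        return True
    return os.access(spec.path, os.W_OK)


spec = parse_disk_spec("/var/lib/xen/images/testvm.img:xvda:w")
print(spec.path, spec.device, spec.permission)
```

Running `passes_write_check(spec)` as the submitting user on the affected host would show whether a plain permission test already fails for that image.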



Result: the job is put on hold, and VMGahpLog reports:
11/15/12 14:30:17 VMGAHP[30095]: format = /var/lib/xen/images/testvm.img:xvda:w
11/15/12 14:30:17 VMGAHP[30095]: File(/var/lib/xen/images/testvm.img) can't be modified
11/15/12 14:30:17 VMGAHP[30095]: xen disk image file('/var/lib/xen/images/testvm.img') cannot be modified
11/15/12 14:30:17 VMGAHP[30095]: xen disk format(/var/lib/xen/images/testvm.img:xvda:w) is incorrect


The same behavior can be reproduced on both RHEL 5.8 and the snapshot of 5.9.

The same job works on RHEL5 Xen 64 bit, and with the proper modification (hypervisor type), on RHEL 5.9 KVM and RHEL 6 KVM.

All of the aforementioned configurations worked properly with condor-7.6.5-0.22, even on the latest RHEL.

A properly working system should show:
11/16/12 08:32:45 VMGAHP[4879]: format = /var/lib/xen/images/testvm.img:xvda:w
11/16/12 08:32:45 VMGAHP[4879]: CreateXenVMConfigFile
11/16/12 08:32:45 VMGAHP[4879]: In VirshType::CreateVirshConfigFile
11/16/12 08:32:45 VMGAHP[4879]: LIBVIRT_XML_SCRIPT_ARGS input_strings= VMPARAM_vm_Disk = "/var/lib/xen/images/testvm.img:xvda:w"

Version-Release number of selected component (if applicable):
python-condorutils-1.5-5
condor-classads-7.8.7-0.4
condor-debuginfo-7.8.7-0.4
condor-7.8.7-0.4
condor-vm-gahp-7.8.7-0.4

Comment 3 Timothy St. Clair 2012-11-16 19:52:23 UTC
Notes thus far trying to determine what has changed: 

1.) There are no meaningful changes in the vm_gahp that would lead me to believe that the source of error is there. 
2.) access is overridden via condor_fix_access access->access_euid
3.) The only delta in access_euid between 2.2 and 2.3 is: safe_fopen_wrapper -> safe_fopen_wrapper_follow (highly suspect, requires further digging)

Comment 4 Timothy St. Clair 2012-11-29 20:42:55 UTC
Tracing shows that it appears to call the base-level access function, not the redirector described in comment #3.

However, after adding some debug output to the logs, the reported errno is 27:
#define EFBIG           27      /* File too large */

11/29/12 12:49:17 VMGAHP[1399]: File(/var/lib/xen/images/testvm.img) can't be modified errno=27
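
For reference, the numeric errno in that log line can be decoded without consulting the kernel headers. A minimal sketch using Python's standard `errno` and `os` modules; the variable name `code` is just for illustration.

```python
import errno
import os

code = 27  # value printed by the extra VMGAHP debug line

# Symbolic name for the number, as defined on this platform.
print(errno.errorcode[code])

# Human-readable message supplied by the C library.
print(os.strerror(code))
```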

So I created a script and submitted it as the 'test' user to dump the ulimit values, and got:

core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 234405
max locked memory       (kbytes, -l) 32
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1024
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) unlimited
cpu time               (seconds, -t) unlimited
max user processes              (-u) 234405
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited
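
The "file size" row is the one that matters here, because a finite file size limit is one common source of EFBIG. The same value can be read programmatically; the following is a small sketch using Python's standard `resource` module (Unix only), with `describe` as a hypothetical formatting helper.

```python
import resource

# Soft and hard limits on the size of files the process may create.
soft, hard = resource.getrlimit(resource.RLIMIT_FSIZE)


def describe(limit: int) -> str:
    """Render a limit the way 'ulimit' does for the unlimited case."""
    return "unlimited" if limit == resource.RLIM_INFINITY else f"{limit} bytes"


print("file size soft limit:", describe(soft))
print("file size hard limit:", describe(hard))
```

With both values unlimited, as in the dump above, the resource limit can be ruled out as the cause.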

Comment 5 Timothy St. Clair 2012-11-29 21:17:35 UTC
Code paths appear similar in 7.8 and 7.6; nothing stands out.

Could you attempt to see if you can repro with a small image?

Comment 6 Timothy St. Clair 2012-11-29 22:26:55 UTC
To add another data point I did the following: 

mv testvm.img orig.testvm.img
touch testvm.img
chmod 755 testvm.img 
condor_release 13.0

And it got to the point where it calls into libvirt to start the VM. So all signs now point to image size; the operative question is "WHY NOW?".
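
One way to probe the size dependence without copying real images is to create sparse files of chosen sizes and release the held job against each. A minimal sketch in Python; `make_sparse_image` is a hypothetical helper, and whether an empty sparse file gets far enough through vm-gahp to exercise the same check is an assumption based on the `touch` experiment above.

```python
import os


def make_sparse_image(path: str, size_bytes: int) -> int:
    """Create (or overwrite) a file of size_bytes without writing data blocks.

    Returns the size reported by the filesystem afterwards.
    """
    with open(path, "wb") as f:
        f.truncate(size_bytes)
    return os.stat(path).st_size
```

For example, `make_sparse_image("testvm.img", 2**31)` followed by `chmod 755 testvm.img` and `condor_release` would test a file of exactly 2 GiB.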

Comment 7 Timothy St. Clair 2012-11-30 22:32:57 UTC
Given the remaining lifespan of Xen on 32-bit RHEL5, and the known workarounds:

- use 64 bit 
- create smaller disk images
- use el6 
- use kvm 

we're going to CLOSE WONTFIX on this one.

Comment 8 Luigi Toscano 2012-12-04 17:57:04 UTC
For reference, the threshold is around 2 GiB.
 - Images whose size is 2 GiB or more are not executed.
 - Images whose size is below 2 GiB (tested up to 2 GiB - 1 block, i.e. 2147479552 bytes) are executed.

This matches the limit of a signed 32-bit integer (2^31 - 1 = 2147483647).
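
The figures can be checked with a few lines of Python. The largest value a signed 32-bit integer can hold is 2^31 - 1, and the 4096-byte block size below is an assumption inferred from the quoted size of 2147479552 bytes.

```python
GIB = 2**30
BLOCK = 4096  # assumed block size; 2 GiB minus one such block is the quoted size

largest_working = 2 * GIB - BLOCK   # biggest image reported to run
smallest_failing = 2 * GIB          # smallest image reported to fail
int32_max = 2**31 - 1               # largest signed 32-bit value

print(largest_working)
print(int32_max)
# The working size fits in a signed 32-bit offset; the failing size does not.
print(largest_working <= int32_max < smallest_failing)
```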

Comment 11 Anne-Louise Tangring 2016-05-26 19:11:34 UTC
MRG-G is in maintenance only and only customer escalations will be addressed from this point forward. This issue can be re-opened if a customer escalation associated with this issue occurs.

