Description of problem:

Configure a 32-bit RHEL 5.x (x >= 8) system to support VM jobs:

VM_TYPE=xen
VM_GAHP_SERVER=$(SBIN)/condor_vm-gahp
VM_MEMORY=256*4
XEN_BOOTLOADER=/usr/bin/pygrub

Submit, as a normal non-privileged user, the following job:

--------------
Universe=vm
Log=log.$(cluster)
Executable=testvm
VM_TYPE=xen
VM_MEMORY=768
VM_DISK=/var/lib/xen/images/testvm.img:xvda:w
XEN_KERNEL=included
Queue
--------------

Result: the job is held, and VMGahpLog reports:

11/15/12 14:30:17 VMGAHP[30095]: format = /var/lib/xen/images/testvm.img:xvda:w
11/15/12 14:30:17 VMGAHP[30095]: File(/var/lib/xen/images/testvm.img) can't be modified
11/15/12 14:30:17 VMGAHP[30095]: xen disk image file('/var/lib/xen/images/testvm.img') cannot be modified
11/15/12 14:30:17 VMGAHP[30095]: xen disk format(/var/lib/xen/images/testvm.img:xvda:w) is incorrect

The same behavior can be reproduced on both RHEL 5.8 and the 5.9 snapshot. The same job works on 64-bit RHEL 5 Xen and, with the proper modification (hypervisor type), on RHEL 5.9 KVM and RHEL 6 KVM. All of the aforementioned configurations worked properly with condor-7.6.5-0.22, even on the latest RHEL.

A properly working system should show:

11/16/12 08:32:45 VMGAHP[4879]: format = /var/lib/xen/images/testvm.img:xvda:w
11/16/12 08:32:45 VMGAHP[4879]: CreateXenVMConfigFile
11/16/12 08:32:45 VMGAHP[4879]: In VirshType::CreateVirshConfigFile
11/16/12 08:32:45 VMGAHP[4879]: LIBVIRT_XML_SCRIPT_ARGS input_strings= VMPARAM_vm_Disk = "/var/lib/xen/images/testvm.img:xvda:w"

Version-Release number of selected component (if applicable):

python-condorutils-1.5-5
condor-classads-7.8.7-0.4
condor-debuginfo-7.8.7-0.4
condor-7.8.7-0.4
condor-vm-gahp-7.8.7-0.4
Notes thus far from trying to determine what has changed:

1.) There are no meaningful changes in the vm_gahp that would lead me to believe the source of the error is there.
2.) access is overridden via condor_fix_access: access -> access_euid.
3.) The only delta in access_euid between 2.2->2.3 is: safe_fopen_wrapper -> safe_fopen_wrapper_follow (highly suspect, requires further digging).
Tracing shows that it calls the base-level access function, not the redirector listed in comment #3. However, after adding some debug output to the logs, errno = 27:

#define EFBIG 27 /* File too large */

11/29/12 12:49:17 VMGAHP[1399]: File(/var/lib/xen/images/testvm.img) can't be modified errno=27

So I created a script, submitted it as the 'test' user to dump the ulimit, and got:

core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 234405
max locked memory       (kbytes, -l) 32
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1024
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) unlimited
cpu time               (seconds, -t) unlimited
max user processes              (-u) 234405
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited
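The same two checks (decoding the errno value and reading the file size limit) can be done from a small script run as the job user. This is only a sketch, assuming a Unix-like system with Python available; the helper names describe_errno and file_size_limit are made up for illustration and are not part of Condor.

```python
import errno
import os
import resource


def describe_errno(code):
    """Return (symbolic name, message) for an errno value on this platform."""
    return errno.errorcode.get(code, "UNKNOWN"), os.strerror(code)


def file_size_limit():
    """Return the soft RLIMIT_FSIZE in bytes, or None if it is unlimited."""
    soft, _hard = resource.getrlimit(resource.RLIMIT_FSIZE)
    return None if soft == resource.RLIM_INFINITY else soft


# errno 27 is the value seen in the VMGahpLog line above.
name, message = describe_errno(27)
print(name, message)
print("file size limit:", file_size_limit())
```

Note that getrlimit reports the limit in bytes, whereas the ulimit -f output above is in blocks.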
The code paths appear similar in 7.8 and 7.6; nothing stands out. Could you check whether you can reproduce this with a small image?
To add another data point, I did the following:

mv testvm.img orig.testvm.img
touch testvm.img
chmod 755 testvm.img
condor_release 13.0

And it got to the point where it calls into libvirt to start the VM. So all signs are pointing to image size now; the operative question is "WHY NOW?".
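To narrow down the size cutoff without copying real images, sparse files of chosen sizes could stand in for testvm.img, in the same spirit as the empty file created with touch. A minimal sketch, assuming a filesystem that supports sparse files; the helper name make_sparse_image is hypothetical, and such a file only exercises the access check, it will not boot.

```python
import os


def make_sparse_image(path, size_bytes):
    """Create (or resize) a file at path so its apparent size is size_bytes.

    On filesystems with sparse-file support, no data blocks are written,
    so even multi-GiB test files are cheap to create.
    Returns the apparent size reported by stat().
    """
    with open(path, "wb") as f:
        f.truncate(size_bytes)
    return os.stat(path).st_size


# Example (not run here): a stand-in image just under 2 GiB.
# make_sparse_image("testvm.img", 2 * 1024**3 - 4096)
```

After creating a stand-in of a given size, releasing the held job again shows whether that size passes the "can't be modified" check.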
Given the remaining lifespan of Xen on 32-bit RHEL 5 and the known workarounds:

- use 64 bit
- create smaller disk images
- use el6
- use kvm

we're going to CLOSE WONTFIX on this one.
For reference, the threshold is around 2 GiB.

- Images whose size is 2 GiB or more are not executed.
- Images whose size is up to 2 GiB - 1 block (2147479552 bytes) are executed.

This matches the limit of a signed 32-bit integer (2^31 - 1 = 2147483647).
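The numbers can be checked against the signed 32-bit range with a little arithmetic. This is an illustrative sketch only: the 4096-byte block size is an assumption inferred from 2 GiB - 2147479552 = 4096, and fits_in_int32 is a made-up helper, not anything in Condor.

```python
GIB = 1024 ** 3
BLOCK = 4096  # assumed block size: 2 GiB - 4096 gives the 2147479552 quoted above

INT32_MAX = 2 ** 31 - 1  # largest value a signed 32-bit integer can hold

largest_working = 2 * GIB - BLOCK   # largest image size reported to run
smallest_failing = 2 * GIB          # smallest image size reported to fail


def fits_in_int32(size_bytes):
    """True if a byte count can be stored in a signed 32-bit integer."""
    return size_bytes <= INT32_MAX


print(largest_working, fits_in_int32(largest_working))
print(smallest_failing, fits_in_int32(smallest_failing))
```

The working size fits in a signed 32-bit value and the failing size does not, which is consistent with a 32-bit file-size limit somewhere in the access path, though it does not by itself identify where.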
MRG-G is in maintenance only and only customer escalations will be addressed from this point forward. This issue can be re-opened if a customer escalation associated with this issue occurs.