Description of problem:

Saw two types of exceptions, both pointing to libvirt, so we may need to categorize (or split) this BZ:

(1) Started 5 rhevm builds with a 10-second delay between each start. 2 of them failed with:

2011-11-03 14:51:27,083 DEBUG imgfac.builders.BaseBuilder.Fedora_rhevm_Builder pid(6380) Message: Traceback (most recent call last):
  File "/usr/lib/python2.6/site-packages/imgfac/builders/Fedora_rhevm_Builder.py", line 65, in build_image
    self.build_upload(build_id)
  File "/usr/lib/python2.6/site-packages/imgfac/builders/Fedora_rhevm_Builder.py", line 112, in build_upload
    libvirt_xml = guest.install(self.app_config["timeout"])
  File "/usr/lib/python2.6/site-packages/oz/Guest.py", line 1427, in install
    return self._do_install(timeout, force, 0)
  File "/usr/lib/python2.6/site-packages/oz/Guest.py", line 1409, in _do_install
    self._wait_for_install_finish(dom, timeout)
  File "/usr/lib/python2.6/site-packages/oz/Guest.py", line 508, in _wait_for_install_finish
    rx_bytes, rx_packets, rx_errs, rx_drop, tx_bytes, tx_packets, tx_errs, tx_drop = libvirt_dom.interfaceStats(dev)
  File "/usr/lib64/python2.6/site-packages/libvirt.py", line 824, in interfaceStats
    if ret is None: raise libvirtError ('virDomainInterfaceStats() failed', dom=self)
libvirtError: internal error /proc/net/dev: Interface not found

(2) Started 5 rhevm builds and 2 vsphere builds, with a 10-second delay between each start. 2 of them failed with:

2011-11-03 16:13:29,814 DEBUG imgfac.builders.BaseBuilder.Fedora_rhevm_Builder pid(6380) Message: Traceback (most recent call last):
  File "/usr/lib/python2.6/site-packages/imgfac/builders/Fedora_rhevm_Builder.py", line 65, in build_image
    self.build_upload(build_id)
  File "/usr/lib/python2.6/site-packages/imgfac/builders/Fedora_rhevm_Builder.py", line 116, in build_upload
    self.output_descriptor = guest.customize_and_generate_icicle(libvirt_xml)
  File "/usr/lib/python2.6/site-packages/oz/RedHat.py", line 1032, in customize_and_generate_icicle
    return self._internal_customize(libvirt_xml, True)
  File "/usr/lib/python2.6/site-packages/oz/RedHat.py", line 1016, in _internal_customize
    self._collect_teardown(modified_xml)
  File "/usr/lib/python2.6/site-packages/oz/RedHat.py", line 269, in _collect_teardown
    g_handle = self._guestfs_handle_setup(libvirt_xml)
  File "/usr/lib/python2.6/site-packages/oz/Guest.py", line 825, in _guestfs_handle_setup
    doc = libxml2.parseDoc(self.libvirt_conn.lookupByID(domid).XMLDesc(0))
  File "/usr/lib64/python2.6/site-packages/libvirt.py", line 239, in XMLDesc
    if ret is None: raise libvirtError ('virDomainGetXMLDesc() failed', dom=self)
libvirtError: cannot send monitor command '{"execute":"query-balloon"}': Connection reset by peer

Version-Release number of selected component (if applicable):
0.5.0 stable aeolus repo
oz-0.7.0-3.el6.noarch (with this patch: https://github.com/clalancette/oz/pull/4)
imagefactory-0.8.0-1.el6.noarch
iwhd-0.991-1.el6.x86_64

How reproducible:
Sometimes.

Steps to Reproduce (for (2)):
aeolus-image build --target rhevm --template pulp.tdl; sleep 10;
aeolus-image build --target rhevm --template pulp.tdl; sleep 10;
aeolus-image build --target rhevm --template pulp.tdl; sleep 10;
aeolus-image build --target rhevm --template pulp.tdl; sleep 10;
aeolus-image build --target rhevm --template pulp.tdl; sleep 10;
aeolus-image build --target vsphere --template pulp.tdl; sleep 10;
aeolus-image build --target vsphere --template pulp.tdl;
I thought we'd fixed this one ages ago. Ian?
We (Richard actually) identified the root cause of a different concurrent build issue. This is new, though it again seems to be related to libguestfs (and, through it, to libvirt). Switching to ON_DEV and will investigate further.
adding ce-sprint-next bugs to ce-sprint
adding to ce-sprint
removing ce-sprint-next tracker
Created attachment 555875
image factory log w/ three builds

Three concurrent builds:

[root@qeblade32 ~]# cat /etc/imagefactory/imagefactory.conf
{
  "warehouse": "http://localhost:9090/",
  "warehouse_key": "rvL8Rz/WnJasdfC",
  "warehouse_secret": "MgJ2VKjvAjwUpasdfPc",
  "image_bucket": "images",
  "build_bucket": "builds",
  "target_bucket": "target_images",
  "template_bucket": "templates",
  "icicle_bucket": "icicles",
  "provider_bucket": "provider_images",
  "imgdir": "/var/lib/imagefactory/images",
  "ec2_build_style": "snapshot",
  "ec2_ami_type": "s3",
  "rhevm_image_format": "qcow2",
  "clients": {
    "fqoRpasdfZOw": "cl421T7PWUCFasdfLPnu"
  },
  "proxy_ami_id": "ami-id",
  "max_concurrent_local_sessions": 3,
  "max_concurrent_ec2_sessions": 1,
  "ec2-32bit-util": "m1.small",
  "ec2-64bit-util": "m1.large"
}

[root@qeblade32 ~]# aeolus-image build --target rhevm --template RHEL62RHEVMTools.tpl
Target Image                         Target Status   Image                                Build
------------------------------------ ------ -------- ------------------------------------ ------------------------------------
90638af4-3b05-4fc4-9213-068430438b5c rhevm  BUILDING 0355ce56-3de3-4096-894d-3353bba448b6 5270a156-9472-4ce3-8349-af52a1707606

[root@qeblade32 ~]# aeolus-image build --target rhevm --template RHEL62RHEVMTools.tpl
Target Image                         Target Status   Image                                Build
------------------------------------ ------ -------- ------------------------------------ ------------------------------------
fd4aaee8-cb0f-4cb0-b8a3-93b6f476369b rhevm  BUILDING 05e4c471-3cc9-4b46-83a2-7fd2d91a1fc7 fd09d9ac-04e4-4bfb-8d43-33cef590274a

[root@qeblade32 ~]# aeolus-image build --target rhevm --template RHEL62RHEVMTools.tpl
Target Image                         Target Status   Image                                Build
------------------------------------ ------ -------- ------------------------------------ ------------------------------------
48f94e0d-b845-4f31-a9a5-5ba78ffa73cb rhevm  BUILDING 4df4ffe0-61cf-4bf1-95d8-df3778b584c6 835bcf39-bf9b-45ed-8175-e12826bd9afe

[root@qeblade32 ~]# ./getTemplateStatus.py -i 90638af4-3b05-4fc4-9213-068430438b5c
BUILDING
[root@qeblade32 ~]# ./getTemplateStatus.py -i fd4aaee8-cb0f-4cb0-b8a3-93b6f476369b
BUILDING
[root@qeblade32 ~]# ./getTemplateStatus.py -i 48f94e0d-b845-4f31-a9a5-5ba78ffa73cb
BUILDING
[root@qeblade32 ~]# ./getTemplateStatus.py -i 90638af4-3b05-4fc4-9213-068430438b5c
COMPLETED
[root@qeblade32 ~]# ./getTemplateStatus.py -i 48f94e0d-b845-4f31-a9a5-5ba78ffa73cb
COMPLETED
[root@qeblade32 ~]# ./getTemplateStatus.py -i 48f94e0d-b845-4f31-a9a5-5ba78ffa73cb
COMPLETED
Now trying w/ five:

[root@qeblade32 ~]#
[root@qeblade32 ~]# date; aeolus-image build --target rhevm --template RHEL62RHEVMTools.tpl ; date
Tue Jan 17 17:50:05 EST 2012
Target Image                         Target Status   Image                                Build
------------------------------------ ------ -------- ------------------------------------ ------------------------------------
cf8e433f-bb5c-4296-85c3-66b4b1d5262c rhevm  BUILDING c3512d01-1573-4beb-81f9-56e2d04cd076 062ba039-dcd8-42d7-a498-35056614c248
Tue Jan 17 17:50:07 EST 2012

[root@qeblade32 ~]# date; aeolus-image build --target rhevm --template RHEL62RHEVMTools.tpl ; date
Tue Jan 17 17:50:14 EST 2012
Target Image                         Target Status   Image                                Build
------------------------------------ ------ -------- ------------------------------------ ------------------------------------
0f938193-1b65-4199-82d9-ce2adcb87741 rhevm  BUILDING 2769cdd1-e9fb-45c9-ae48-f954bc091e85 f959566d-f356-4375-ba21-3a9a67cc3fff
Tue Jan 17 17:50:17 EST 2012

[root@qeblade32 ~]# date; aeolus-image build --target rhevm --template RHEL62RHEVMTools.tpl ; date
Tue Jan 17 17:50:23 EST 2012
Target Image                         Target Status   Image                                Build
------------------------------------ ------ -------- ------------------------------------ ------------------------------------
046b259a-5606-4baa-82fa-110db687d9e1 rhevm  BUILDING e94c0e9c-16ce-4ff9-9869-94c77a16761c 0d6ac6e7-20bf-4c8d-a35a-729327e7742c
Tue Jan 17 17:50:24 EST 2012

[root@qeblade32 ~]# date; aeolus-image build --target rhevm --template RHEL62RHEVMTools.tpl ; date
Tue Jan 17 17:50:28 EST 2012
Target Image                         Target Status   Image                                Build
------------------------------------ ------ -------- ------------------------------------ ------------------------------------
56bdbae5-6dbc-4a8e-b80a-caf4d0e1b8db rhevm  BUILDING a85b78b3-87ea-4e27-8fa8-c4c2e2c6284b e5d144e5-23a8-48d3-9478-b54b35a51e21
Tue Jan 17 17:50:35 EST 2012

[root@qeblade32 ~]# date; aeolus-image build --target rhevm --template RHEL62RHEVMTools.tpl ; date
Tue Jan 17 17:50:38 EST 2012
Target Image                         Target Status   Image                                Build
------------------------------------ ------ -------- ------------------------------------ ------------------------------------
bb4d42c5-6cec-44dc-b55e-9b97e7740962 rhevm  BUILDING 6cdf3602-f020-4fbd-98f8-a40ae6c1acab cd1e6d6f-0547-407e-b3a5-61fe017a7464
Tue Jan 17 17:50:45 EST 2012
So, some offline debugging thoughts on these.

The first error looks similar to BZ 755830: https://bugzilla.redhat.com/show_bug.cgi?id=755830

It occurs in a block of code where we expect to see exceptions thrown if we attempt to query the status of a VM that has just shut down after finishing its Anaconda install. If you see this in the future, it would be helpful to have the full exception output, which provides additional details about the libvirt exception that we can use to expand the list of "expected exceptions".

The second error also looks like it may be related to VMs shutting down at the end of their install, and may well be a bug in Oz. This line in Oz initiates a loop through a list of guests running at a particular snapshot in time: https://github.com/clalancette/oz/blob/master/oz/Guest.py#L823 However, it's possible for one or more of these guests to shut down while the loop is executing (because Anaconda has finished running). We likely need to put this into a try: block and catch exceptions related to guest shutdown.
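To make both patterns concrete, here is a minimal sketch of the defensive handling described above (illustrative only; the function names are hypothetical and this is not the actual Oz patch):

import libvirt
import libxml2

def safe_interface_stats(dom, dev):
    # Error (1): interfaceStats() is expected to fail if the guest shut
    # down between polls; treat that libvirtError as "install finished"
    # rather than a fatal error.
    try:
        return dom.interfaceStats(dev)
    except libvirt.libvirtError:
        return None

def xml_for_running_guests(conn):
    # Error (2): listDomainsID() is only a snapshot in time; a guest can
    # shut down before we reach it, so lookupByID()/XMLDesc() can raise.
    docs = []
    for domid in conn.listDomainsID():
        try:
            xml = conn.lookupByID(domid).XMLDesc(0)
        except libvirt.libvirtError:
            continue  # guest went away mid-loop; skip it
        docs.append(libxml2.parseDoc(xml))
    return docs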
Tue Jan 17 17:50:45 EST 2012
[root@qeblade32 ~]# cat /proc/meminfo
MemTotal:       24595796 kB
MemFree:         3445648 kB
Buffers:          180160 kB
Cached:         15435984 kB
SwapCached:            0 kB
Active:         10940504 kB
Inactive:        9345476 kB
Active(anon):    4280644 kB
Inactive(anon):   393888 kB
Active(file):    6659860 kB
Inactive(file):  8951588 kB
Unevictable:           0 kB
Mlocked:               0 kB
SwapTotal:      26836984 kB
SwapFree:       26836984 kB
Dirty:             30772 kB
Writeback:             0 kB
AnonPages:       4669372 kB
Mapped:            51920 kB
Shmem:              5160 kB
Slab:             525484 kB
SReclaimable:     451780 kB
SUnreclaim:        73704 kB
KernelStack:        3192 kB
PageTables:        27244 kB
NFS_Unstable:          0 kB
Bounce:                0 kB
WritebackTmp:          0 kB
CommitLimit:    39134880 kB
Committed_AS:    6577800 kB
VmallocTotal:   34359738367 kB
VmallocUsed:      345980 kB
VmallocChunk:   34345125532 kB
HardwareCorrupted:     0 kB
AnonHugePages:   4323328 kB
HugePages_Total:       0
HugePages_Free:        0
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
DirectMap4k:        6144 kB
DirectMap2M:    25151488 kB

[root@qeblade32 ~]# iostat
Linux 2.6.32-220.el6.x86_64 (qeblade32.rhq.lab.eng.bos.redhat.com)  01/17/2012  _x86_64_  (8 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           5.68    0.00    1.60   15.85    0.00   76.87

Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
sda             137.98      8516.15     31026.23   38970000  141976356

[root@qeblade32 ~]# vmstat
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 1  3      0 3272376 181168 15659436    0    0   530  1931  270 1120  6  2 77 16  0
[root@qeblade32 ~]# ./getTemplateStatus.py -i cf8e433f-bb5c-4296-85c3-66b4b1d5262c
COMPLETED
[root@qeblade32 ~]# ./getTemplateStatus.py -i 0f938193-1b65-4199-82d9-ce2adcb87741
COMPLETED
[root@qeblade32 ~]# ./getTemplateStatus.py -i 56bdbae5-6dbc-4a8e-b80a-caf4d0e1b8db
COMPLETED
[root@qeblade32 ~]# ./getTemplateStatus.py -i bb4d42c5-6cec-44dc-b55e-9b97e7740962
COMPLETED
[root@qeblade32 ~]# ./getTemplateStatus.py -i cf8e433f-bb5c-4296-85c3-66b4b1d5262c
COMPLETED

5/5 worked for me.
So if we've demonstrated 5 concurrent builds working a few times, can we not close this bug? Yes, we still need to document what some best practices might be, but we no longer believe the factory will, in general, choke when doing multiple local builds. Correct?
One of my tests failed, with only one build hitting this error:

    raise SubprocessException("'%s' failed(%d): %s" % (cmd, retcode, stderr), retcode)
SubprocessException: 'ssh -i /etc/oz/id_rsa-icicle-gen -F /dev/null -o ServerAliveInterval=30 -o StrictHostKeyChecking=no -o ConnectTimeout=10 -o UserKnownHostsFile=/dev/null -o PasswordAuthentication=no root@192.168.122.134 rpm -qa' failed(255): Warning: Permanently added '192.168.122.134' (RSA) to the list of known hosts.
2012-01-18 10:35:07,195 DEBUG imgfac.BuildJob.BuildJob pid(28572) Message: Builder (27517e68-dbb0-4157-a97d-43b8b336559a) changed status from BUILDING to FAILED

I don't know if that is related. The command list in the test is:

["aeolus-image build --target rhevm --template templates/bug751209.tdl;",\
 "aeolus-image build --target rhevm --template templates/bug751209.tdl;",\
 "aeolus-image build --target rhevm --template templates/bug751209.tdl;",\
 "aeolus-image build --target rhevm --template templates/bug751209.tdl;",\
 "aeolus-image build --target rhevm --template templates/bug751209.tdl;",\
 "aeolus-image build --target vsphere --template templates/bug751209.tdl;",\
 "aeolus-image build --target vsphere --template templates/bug751209.tdl;"]

with a 10-second sleep between each start.

The test was run with these packages:

# rpm -qa | grep 'aeolus\|imagefactory-\|oz-\|iwhd'
rubygem-imagefactory-console-0.5.0-4.20110824113238gitd9debef.el6.noarch
aeolus-configure-2.6.0-0.20120112133633git8701011.el6.noarch
aeolus-conductor-doc-0.9.0-0.20120111153517gite55da85.el6.noarch
imagefactory-jeosconf-ec2-rhel-1.0.0rc2_10_gcf8da4e-1.el6.noarch
rubygem-ZenTest-4.3.3-2.aeolus.el6.noarch
rubygem-aeolus-cli-0.4.0-0.20120111093540git5ec8dc7.el6.noarch
aeolus-conductor-0.9.0-0.20120111153517gite55da85.el6.noarch
iwhd-1.2-3.el6.x86_64
imagefactory-jeosconf-ec2-fedora-1.0.0rc2_10_gcf8da4e-1.el6.noarch
aeolus-conductor-daemons-0.9.0-0.20120111153517gite55da85.el6.noarch
oz-0.8.0-0.20111219203204git5775e9d.el6.noarch
aeolus-all-0.9.0-0.20120111153517gite55da85.el6.noarch
imagefactory-1.0.0rc2_10_gcf8da4e-1.el6.noarch
rubygem-arel-2.0.10-0.aeolus.el6.noarch
rubygem-aeolus-image-0.4.0-0.20120111093551git67fdf6f.el6.noarch

Otherwise I'm not able to reproduce this issue. For further info (if there is any), see: http://hudson.rhq.lab.eng.bos.redhat.com:8080/hudson/view/DEV-CloudForms/job/ImageFactory-KocaTesting2/220/consoleFull
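For context, the harness runs that command list roughly like this (a sketch assuming subprocess and the fixed 10-second gap; not the actual Hudson job code):

import subprocess
import time

commands = [
    "aeolus-image build --target rhevm --template templates/bug751209.tdl",
    # ...five rhevm builds and two vsphere builds, as listed above...
    "aeolus-image build --target vsphere --template templates/bug751209.tdl",
]

for cmd in commands:
    subprocess.Popen(cmd, shell=True)  # start the build in the background
    time.sleep(10)                     # 10-second delay between starts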
Now I see the error looks related to this bug: https://bugzilla.redhat.com/show_bug.cgi?id=749653. So I don't think I have reproduced this bug with concurrent builds.
So, I'm restricting my work on this particular BZ number to the errors reported when this bug was first opened. We believe the root causes of both errors are addressed by this commit and the one before it: https://github.com/aeolusproject/oz/commit/bdac9e370924e94b5b2912af72e440b50a4cbe1e

Although the commit log suggests that these are "tests", they've actually worked well enough in our testing that we are likely to merge them into our aeolus/oz fork as-is. I will brew the updated Oz momentarily.
Brewed as oz-0.8.0-3. Switching to ON_QA.
I'd suggest increasing the "max_concurrent_local_sessions" in imagefactory.conf if you really want to stress test this. The original bug report tested with 5 concurrent builds. The current default is no more than 2. (I started having unrelated resource starvation issues when I tried 10 parallel builds on a host with 16 GB of RAM.)
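For example, in /etc/imagefactory/imagefactory.conf (the value 5 here simply matches the concurrency level exercised in the original report, not a recommendation):

"max_concurrent_local_sessions": 5,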
Successfully tested with configuration max_concurrent_local_sessions=2 and packages:

# rpm -qa | grep 'aeolus\|imagefactory-\|oz-\|iwhd'
oz-0.9.0-0.20120127190146git043d582.el6.noarch
aeolus-conductor-doc-0.9.0-0.20120118181603git71cd8bc.el6.noarch
aeolus-all-0.9.0-0.20120118181603git71cd8bc.el6.noarch
iwhd-1.2-3.el6.x86_64
rubygem-imagefactory-console-0.5.0-4.20110824113238gitd9debef.el6.noarch
aeolus-conductor-daemons-0.9.0-0.20120118181603git71cd8bc.el6.noarch
imagefactory-jeosconf-ec2-fedora-1.0.0rc3_5_g1041f75-1.el6.noarch
rubygem-arel-2.0.10-0.aeolus.el6.noarch
rubygem-rack-mount-0.7.1-3.aeolus.el6.noarch
rubygem-ZenTest-4.3.3-2.aeolus.el6.noarch
rubygem-aeolus-image-0.4.0-0.20120118121635git0d31a37.el6.noarch
rubygem-aeolus-cli-0.4.0-0.20120118121626git6fddd65.el6.noarch
aeolus-configure-2.6.0-0.20120118121620gita996371.el6.noarch
imagefactory-jeosconf-ec2-rhel-1.0.0rc3_5_g1041f75-1.el6.noarch
aeolus-conductor-0.9.0-0.20120118181603git71cd8bc.el6.noarch
imagefactory-1.0.0rc3_5_g1041f75-1.el6.noarch

Commands=["aeolus-image build --target rhevm --template templates/bug751209.tdl;",\
 "aeolus-image build --target rhevm --template templates/bug751209.tdl;",\
 "aeolus-image build --target rhevm --template templates/bug751209.tdl;",\
 "aeolus-image build --target rhevm --template templates/bug751209.tdl;",\
 "aeolus-image build --target rhevm --template templates/bug751209.tdl;",\
 "aeolus-image build --target vsphere --template templates/bug751209.tdl;",\
 "aeolus-image build --target vsphere --template templates/bug751209.tdl;"]

with 10 seconds between each run.

# aeolus-image status --targetimage 681372a9-89ca-460c-a3d9-16d6fbab06e2
Build Status: COMPLETE
[root@qeblade22 ~]# aeolus-image status --targetimage 8ed39891-2771-46fa-a08d-a9b9f13cfac6
Build Status: COMPLETE
[root@qeblade22 ~]# aeolus-image status --targetimage 24f4a4fc-e933-4f9d-9afb-7587e18205e3
Build Status: COMPLETE
[root@qeblade22 ~]# aeolus-image status --targetimage 1383dc72-5095-4b97-b939-7378e3563a57
Build Status: COMPLETE
[root@qeblade22 ~]# aeolus-image status --targetimage 15d5dd67-f528-4ec6-b375-6ff453c8ff34
Build Status: COMPLETE
[root@qeblade22 ~]# aeolus-image status --targetimage f787bb97-f3cf-4427-8ec5-f7c2e29567ae
Build Status: COMPLETE
[root@qeblade22 ~]# aeolus-image status --targetimage 74c4f096-9cfe-46c0-98e5-cc34d9c452df
Build Status: COMPLETE

However, before I move this bug to VERIFIED I will run one more test.
Another test passed. Moving the bug to VERIFIED; we will keep exercising this scenario in continuous integration testing.

Tested with packages:

# rpm -qa | grep -i "oz\|imagefactory\|aeolus\|iwhd"
rubygem-imagefactory-console-0.5.0-4.20110824113238gitd9debef.el6.noarch
imagefactory-1.0.0rc3_17_g72f7e61-1.el6.noarch
rubygem-aeolus-cli-0.4.0-0.20120118121626git6fddd65.el6.noarch
iwhd-1.2-3.el6.x86_64
aeolus-conductor-0.9.0-0.20120118181603git71cd8bc.el6.noarch
imagefactory-jeosconf-ec2-rhel-1.0.0rc3_17_g72f7e61-1.el6.noarch
aeolus-conductor-doc-0.9.0-0.20120118181603git71cd8bc.el6.noarch
oz-0.9.0-0.20120206095001git043d582.el6.noarch
aeolus-conductor-daemons-0.9.0-0.20120118181603git71cd8bc.el6.noarch
aeolus-all-0.9.0-0.20120118181603git71cd8bc.el6.noarch
rubygem-aeolus-image-0.4.0-0.20120118121635git0d31a37.el6.noarch
aeolus-configure-2.6.0-0.20120118121620gita996371.el6.noarch
imagefactory-jeosconf-ec2-fedora-1.0.0rc3_17_g72f7e61-1.el6.noarch