| Summary: | concurrent builds cause some builds to fail | ||||||
|---|---|---|---|---|---|---|---|
| Product: | [Retired] CloudForms Cloud Engine | Reporter: | Richard Su <rwsu> | ||||
| Component: | oz | Assignee: | Ian McLeod <imcleod> | ||||
| Status: | CLOSED CURRENTRELEASE | QA Contact: | Martin Kočí <mkoci> | ||||
| Severity: | high | Docs Contact: | |||||
| Priority: | unspecified | ||||||
| Version: | 1.0.0 | CC: | athomas, clalance, clalancette, jrd, mkoci, whayutin | ||||
| Target Milestone: | rc | ||||||
| Target Release: | --- | ||||||
| Hardware: | x86_64 | ||||||
| OS: | Linux | ||||||
| Whiteboard: | |||||||
| Fixed In Version: | Doc Type: | Bug Fix | |||||
| Doc Text: | Story Points: | --- | |||||
| Clone Of: | Environment: | ||||||
| Last Closed: | 2012-08-30 17:16:38 UTC | Type: | --- | ||||
| Regression: | --- | Mount Type: | --- | ||||
| Documentation: | --- | CRM: | |||||
| Verified Versions: | Category: | --- | |||||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||
| Cloudforms Team: | --- | Target Upstream Version: | |||||
| Attachments: | image factory log w/ three builds (attachment 555875) |
I thought we'd fixed this one ages ago. Ian?

We (Richard, actually) identified the root cause of a different concurrent build issue. This one is new, though it again seems to be related to libguestfs (and, through it, to libvirt). Switching to ON_DEV; will investigate further.

adding ce-sprint-next bugs to ce-sprint

adding to ce-sprint removing ce-sprint-next tracker

Created attachment 555875 [details]
image factory log w/ three builds

Three concurrent builds:

[root@qeblade32 ~]# cat /etc/imagefactory/imagefactory.conf
{
  "warehouse": "http://localhost:9090/",
  "warehouse_key": "rvL8Rz/WnJasdfC",
  "warehouse_secret": "MgJ2VKjvAjwUpasdfPc",
  "image_bucket": "images",
  "build_bucket": "builds",
  "target_bucket": "target_images",
  "template_bucket": "templates",
  "icicle_bucket": "icicles",
  "provider_bucket": "provider_images",
  "imgdir": "/var/lib/imagefactory/images",
  "ec2_build_style": "snapshot",
  "ec2_ami_type": "s3",
  "rhevm_image_format": "qcow2",
  "clients": {
    "fqoRpasdfZOw": "cl421T7PWUCFasdfLPnu"
  },
  "proxy_ami_id": "ami-id",
  "max_concurrent_local_sessions": 3,
  "max_concurrent_ec2_sessions": 1,
  "ec2-32bit-util": "m1.small",
  "ec2-64bit-util": "m1.large"
}

[root@qeblade32 ~]# aeolus-image build --target rhevm --template RHEL62RHEVMTools.tpl
Target Image                          Target  Status    Image                                 Build
------------------------------------  ------  --------  ------------------------------------  ------------------------------------
90638af4-3b05-4fc4-9213-068430438b5c  rhevm   BUILDING  0355ce56-3de3-4096-894d-3353bba448b6  5270a156-9472-4ce3-8349-af52a1707606

[root@qeblade32 ~]# aeolus-image build --target rhevm --template RHEL62RHEVMTools.tpl
Target Image                          Target  Status    Image                                 Build
------------------------------------  ------  --------  ------------------------------------  ------------------------------------
fd4aaee8-cb0f-4cb0-b8a3-93b6f476369b  rhevm   BUILDING  05e4c471-3cc9-4b46-83a2-7fd2d91a1fc7  fd09d9ac-04e4-4bfb-8d43-33cef590274a

[root@qeblade32 ~]# aeolus-image build --target rhevm --template RHEL62RHEVMTools.tpl
Target Image                          Target  Status    Image                                 Build
------------------------------------  ------  --------  ------------------------------------  ------------------------------------
48f94e0d-b845-4f31-a9a5-5ba78ffa73cb  rhevm   BUILDING  4df4ffe0-61cf-4bf1-95d8-df3778b584c6  835bcf39-bf9b-45ed-8175-e12826bd9afe

[root@qeblade32 ~]# ./getTemplateStatus.py -i 90638af4-3b05-4fc4-9213-068430438b5c
BUILDING
[root@qeblade32 ~]# ./getTemplateStatus.py -i fd4aaee8-cb0f-4cb0-b8a3-93b6f476369b
BUILDING
[root@qeblade32 ~]# ./getTemplateStatus.py -i 48f94e0d-b845-4f31-a9a5-5ba78ffa73cb
BUILDING
[root@qeblade32 ~]# ./getTemplateStatus.py -i 90638af4-3b05-4fc4-9213-068430438b5c
COMPLETED
[root@qeblade32 ~]# ./getTemplateStatus.py -i 48f94e0d-b845-4f31-a9a5-5ba78ffa73cb
COMPLETED
[root@qeblade32 ~]# ./getTemplateStatus.py -i 48f94e0d-b845-4f31-a9a5-5ba78ffa73cb
COMPLETED

Now trying w/ five:

[root@qeblade32 ~]#
[root@qeblade32 ~]# date; aeolus-image build --target rhevm --template RHEL62RHEVMTools.tpl ; date
Tue Jan 17 17:50:05 EST 2012
Target Image                          Target  Status    Image                                 Build
------------------------------------  ------  --------  ------------------------------------  ------------------------------------
cf8e433f-bb5c-4296-85c3-66b4b1d5262c  rhevm   BUILDING  c3512d01-1573-4beb-81f9-56e2d04cd076  062ba039-dcd8-42d7-a498-35056614c248
Tue Jan 17 17:50:07 EST 2012

[root@qeblade32 ~]# date; aeolus-image build --target rhevm --template RHEL62RHEVMTools.tpl ; date
Tue Jan 17 17:50:14 EST 2012
Target Image                          Target  Status    Image                                 Build
------------------------------------  ------  --------  ------------------------------------  ------------------------------------
0f938193-1b65-4199-82d9-ce2adcb87741  rhevm   BUILDING  2769cdd1-e9fb-45c9-ae48-f954bc091e85  f959566d-f356-4375-ba21-3a9a67cc3fff
Tue Jan 17 17:50:17 EST 2012

[root@qeblade32 ~]# date; aeolus-image build --target rhevm --template RHEL62RHEVMTools.tpl ; date
Tue Jan 17 17:50:23 EST 2012
Target Image                          Target  Status    Image                                 Build
------------------------------------  ------  --------  ------------------------------------  ------------------------------------
046b259a-5606-4baa-82fa-110db687d9e1  rhevm   BUILDING  e94c0e9c-16ce-4ff9-9869-94c77a16761c  0d6ac6e7-20bf-4c8d-a35a-729327e7742c
Tue Jan 17 17:50:24 EST 2012

[root@qeblade32 ~]# date; aeolus-image build --target rhevm --template RHEL62RHEVMTools.tpl ; date
Tue Jan 17 17:50:28 EST 2012
Target Image                          Target  Status    Image                                 Build
------------------------------------  ------  --------  ------------------------------------  ------------------------------------
56bdbae5-6dbc-4a8e-b80a-caf4d0e1b8db  rhevm   BUILDING  a85b78b3-87ea-4e27-8fa8-c4c2e2c6284b  e5d144e5-23a8-48d3-9478-b54b35a51e21
Tue Jan 17 17:50:35 EST 2012

[root@qeblade32 ~]# date; aeolus-image build --target rhevm --template RHEL62RHEVMTools.tpl ; date
Tue Jan 17 17:50:38 EST 2012
Target Image                          Target  Status    Image                                 Build
------------------------------------  ------  --------  ------------------------------------  ------------------------------------
bb4d42c5-6cec-44dc-b55e-9b97e7740962  rhevm   BUILDING  6cdf3602-f020-4fbd-98f8-a40ae6c1acab  cd1e6d6f-0547-407e-b3a5-61fe017a7464
Tue Jan 17 17:50:45 EST 2012

So, some offline debugging thoughts on these.

The first of these errors looks to be similar to BZ 755830: https://bugzilla.redhat.com/show_bug.cgi?id=755830 It occurs in a block of code where we expect to see exceptions thrown if we attempt to query the status of a VM that has just shut down after finishing its Anaconda install. If you see this in the future, it would be helpful to have the full exception output, which provides additional detail about the libvirt exception that we can use to expand the list of "expected exceptions".

The second error looks like it may also be related to VMs shutting down at the end of their install, and may well be a bug in Oz. This line in Oz initiates a loop through a list of guests running at a particular snapshot in time: https://github.com/clalancette/oz/blob/master/oz/Guest.py#L823 However, it's possible for one or more of these guests to shut down as the loop is executing (because Anaconda has finished running). We likely need to put this into a try: block and catch exceptions related to guest shutdown.
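As a rough sketch of that try/except approach (this is not the actual Oz patch; collect_running_guest_xml is a hypothetical helper invented here for illustration), the guest-enumeration loop could tolerate domains that disappear mid-iteration:

```python
# Minimal sketch only -- not the actual Oz code or fix.
import libvirt
import libxml2

def collect_running_guest_xml(conn):
    """Return parsed libvirt XML for guests that are still running.

    A guest can shut down between listDomainsID() and lookupByID()
    (because Anaconda finished); libvirt then raises libvirtError,
    which we treat as "this guest is gone" rather than a fatal error.
    """
    docs = []
    for domid in conn.listDomainsID():
        try:
            dom = conn.lookupByID(domid)
            docs.append(libxml2.parseDoc(dom.XMLDesc(0)))
        except libvirt.libvirtError:
            # Expected when a concurrent install finishes mid-loop.
            continue
    return docs
```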
[root@qeblade32 ~]# cat /proc/meminfo
MemTotal: 24595796 kB
MemFree: 3445648 kB
Buffers: 180160 kB
Cached: 15435984 kB
SwapCached: 0 kB
Active: 10940504 kB
Inactive: 9345476 kB
Active(anon): 4280644 kB
Inactive(anon): 393888 kB
Active(file): 6659860 kB
Inactive(file): 8951588 kB
Unevictable: 0 kB
Mlocked: 0 kB
SwapTotal: 26836984 kB
SwapFree: 26836984 kB
Dirty: 30772 kB
Writeback: 0 kB
AnonPages: 4669372 kB
Mapped: 51920 kB
Shmem: 5160 kB
Slab: 525484 kB
SReclaimable: 451780 kB
SUnreclaim: 73704 kB
KernelStack: 3192 kB
PageTables: 27244 kB
NFS_Unstable: 0 kB
Bounce: 0 kB
WritebackTmp: 0 kB
CommitLimit: 39134880 kB
Committed_AS: 6577800 kB
VmallocTotal: 34359738367 kB
VmallocUsed: 345980 kB
VmallocChunk: 34345125532 kB
HardwareCorrupted: 0 kB
AnonHugePages: 4323328 kB
HugePages_Total: 0
HugePages_Free: 0
HugePages_Rsvd: 0
HugePages_Surp: 0
Hugepagesize: 2048 kB
DirectMap4k: 6144 kB
DirectMap2M: 25151488 kB
[root@qeblade32 ~]# iostat
Linux 2.6.32-220.el6.x86_64 (qeblade32.rhq.lab.eng.bos.redhat.com) 01/17/2012 _x86_64_ (8 CPU)
avg-cpu: %user %nice %system %iowait %steal %idle
5.68 0.00 1.60 15.85 0.00 76.87
Device: tps Blk_read/s Blk_wrtn/s Blk_read Blk_wrtn
sda 137.98 8516.15 31026.23 38970000 141976356
[root@qeblade32 ~]# vmstat
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
1 3 0 3272376 181168 15659436 0 0 530 1931 270 1120 6 2 77 16 0
[root@qeblade32 ~]# ./getTemplateStatus.py -i cf8e433f-bb5c-4296-85c3-66b4b1d5262c
COMPLETED
[root@qeblade32 ~]# ./getTemplateStatus.py -i 0f938193-1b65-4199-82d9-ce2adcb87741
COMPLETED
[root@qeblade32 ~]# ./getTemplateStatus.py -i 56bdbae5-6dbc-4a8e-b80a-caf4d0e1b8db
COMPLETED
[root@qeblade32 ~]# ./getTemplateStatus.py -i bb4d42c5-6cec-44dc-b55e-9b97e7740962
COMPLETED
[root@qeblade32 ~]# ./getTemplateStatus.py -i cf8e433f-bb5c-4296-85c3-66b4b1d5262c
COMPLETED

5/5 worked for me.

So if we've demonstrated 5 concurrent builds a few times, can we not close this bug? Yes, we still need to document what some best practices might be, but we no longer believe the factory will in general choke when doing multiple local builds, correct?

One of my tests failed, with only one build hitting this error:
raise SubprocessException("'%s' failed(%d): %s" % (cmd, retcode, stderr), retcode)
SubprocessException: 'ssh -i /etc/oz/id_rsa-icicle-gen -F /dev/null -o ServerAliveInterval=30 -o StrictHostKeyChecking=no -o ConnectTimeout=10 -o UserKnownHostsFile=/dev/null -o PasswordAuthentication=no root@192.168.122.134 rpm -qa' failed(255): Warning: Permanently added '192.168.122.134' (RSA) to the list of known hosts.
2012-01-18 10:35:07,195 DEBUG imgfac.BuildJob.BuildJob pid(28572) Message: Builder (27517e68-dbb0-4157-a97d-43b8b336559a) changed status from BUILDING to FAILED
Dunno if that is related. The command list in the test is:
["aeolus-image build --target rhevm --template templates/bug751209.tdl;",\
"aeolus-image build --target rhevm --template templates/bug751209.tdl;",\
"aeolus-image build --target rhevm --template templates/bug751209.tdl;",\
"aeolus-image build --target rhevm --template templates/bug751209.tdl;",\
"aeolus-image build --target rhevm --template templates/bug751209.tdl;",\
"aeolus-image build --target vsphere --template templates/bug751209.tdl;",\
"aeolus-image build --target vsphere --template templates/bug751209.tdl;"]
with a 10 second sleep between each start.
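For reference, a minimal sketch of such a driver (this is not the actual Hudson/Jenkins job; the helper below is illustrative only, with the command list abbreviated):

```python
# Illustrative driver only -- approximates the test described above:
# start each build command with a 10 second gap between starts.
import subprocess
import time

commands = [
    "aeolus-image build --target rhevm --template templates/bug751209.tdl",
    # ...remaining rhevm and vsphere invocations from the list above...
]

for cmd in commands:
    subprocess.Popen(cmd, shell=True)  # fire off the build asynchronously
    time.sleep(10)                     # 10 seconds between each start
```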
The test was run with these packages:
#rpm -qa | grep 'aeolus\|imagefactory-\|oz-\|iwhd'
rubygem-imagefactory-console-0.5.0-4.20110824113238gitd9debef.el6.noarch
aeolus-configure-2.6.0-0.20120112133633git8701011.el6.noarch
aeolus-conductor-doc-0.9.0-0.20120111153517gite55da85.el6.noarch
imagefactory-jeosconf-ec2-rhel-1.0.0rc2_10_gcf8da4e-1.el6.noarch
rubygem-ZenTest-4.3.3-2.aeolus.el6.noarch
rubygem-aeolus-cli-0.4.0-0.20120111093540git5ec8dc7.el6.noarch
aeolus-conductor-0.9.0-0.20120111153517gite55da85.el6.noarch
iwhd-1.2-3.el6.x86_64
imagefactory-jeosconf-ec2-fedora-1.0.0rc2_10_gcf8da4e-1.el6.noarch
aeolus-conductor-daemons-0.9.0-0.20120111153517gite55da85.el6.noarch
oz-0.8.0-0.20111219203204git5775e9d.el6.noarch
aeolus-all-0.9.0-0.20120111153517gite55da85.el6.noarch
imagefactory-1.0.0rc2_10_gcf8da4e-1.el6.noarch
rubygem-arel-2.0.10-0.aeolus.el6.noarch
rubygem-aeolus-image-0.4.0-0.20120111093551git67fdf6f.el6.noarch
Otherwise I'm not able to reproduce this issue.
For further info (if there is some) see: http://hudson.rhq.lab.eng.bos.redhat.com:8080/hudson/view/DEV-CloudForms/job/ImageFactory-KocaTesting2/220/consoleFull
Now I see the error looks related to this bug: https://bugzilla.redhat.com/show_bug.cgi?id=749653. So I don't think I have reproduced this bug with concurrent builds.

So, I'm restricting my work on this particular BZ number to the errors reported in the initial opening of this bug. We believe the root causes of both errors are addressed by this commit and the one before it: https://github.com/aeolusproject/oz/commit/bdac9e370924e94b5b2912af72e440b50a4cbe1e Although the initial commit log suggests that these are "tests", they've actually worked well enough in our testing that we are likely to merge them into our aeolus/oz fork as-is. I will brew the updated Oz momentarily.

Brewed as oz-0.8.0-3; switching to ON_QA.

I'd suggest increasing "max_concurrent_local_sessions" in imagefactory.conf if you really want to stress test this. The original bug report was tested with 5 concurrent builds; the current default is no more than 2. (I started having unrelated resource starvation issues when I tried 10 parallel builds on a host with 16 GB of RAM.)

Successfully tested with configuration max_concurrent_local_sessions=2 and packages:

# rpm -qa | grep 'aeolus\|imagefactory-\|oz-\|iwhd'
oz-0.9.0-0.20120127190146git043d582.el6.noarch
aeolus-conductor-doc-0.9.0-0.20120118181603git71cd8bc.el6.noarch
aeolus-all-0.9.0-0.20120118181603git71cd8bc.el6.noarch
iwhd-1.2-3.el6.x86_64
rubygem-imagefactory-console-0.5.0-4.20110824113238gitd9debef.el6.noarch
aeolus-conductor-daemons-0.9.0-0.20120118181603git71cd8bc.el6.noarch
imagefactory-jeosconf-ec2-fedora-1.0.0rc3_5_g1041f75-1.el6.noarch
rubygem-arel-2.0.10-0.aeolus.el6.noarch
rubygem-rack-mount-0.7.1-3.aeolus.el6.noarch
rubygem-ZenTest-4.3.3-2.aeolus.el6.noarch
rubygem-aeolus-image-0.4.0-0.20120118121635git0d31a37.el6.noarch
rubygem-aeolus-cli-0.4.0-0.20120118121626git6fddd65.el6.noarch
aeolus-configure-2.6.0-0.20120118121620gita996371.el6.noarch
imagefactory-jeosconf-ec2-rhel-1.0.0rc3_5_g1041f75-1.el6.noarch
aeolus-conductor-0.9.0-0.20120118181603git71cd8bc.el6.noarch
imagefactory-1.0.0rc3_5_g1041f75-1.el6.noarch

Commands=["aeolus-image build --target rhevm --template templates/bug751209.tdl;",\
"aeolus-image build --target rhevm --template templates/bug751209.tdl;",\
"aeolus-image build --target rhevm --template templates/bug751209.tdl;",\
"aeolus-image build --target rhevm --template templates/bug751209.tdl;",\
"aeolus-image build --target rhevm --template templates/bug751209.tdl;",\
"aeolus-image build --target vsphere --template templates/bug751209.tdl;",\
"aeolus-image build --target vsphere --template templates/bug751209.tdl;"]

with 10s between each run.

# aeolus-image status --targetimage 681372a9-89ca-460c-a3d9-16d6fbab06e2
Build Status: COMPLETE
[root@qeblade22 ~]# aeolus-image status --targetimage 8ed39891-2771-46fa-a08d-a9b9f13cfac6
Build Status: COMPLETE
[root@qeblade22 ~]# aeolus-image status --targetimage 24f4a4fc-e933-4f9d-9afb-7587e18205e3
Build Status: COMPLETE
[root@qeblade22 ~]# aeolus-image status --targetimage 1383dc72-5095-4b97-b939-7378e3563a57
Build Status: COMPLETE
[root@qeblade22 ~]# aeolus-image status --targetimage 15d5dd67-f528-4ec6-b375-6ff453c8ff34
Build Status: COMPLETE
[root@qeblade22 ~]# aeolus-image status --targetimage f787bb97-f3cf-4427-8ec5-f7c2e29567ae
Build Status: COMPLETE
[root@qeblade22 ~]# aeolus-image status --targetimage 74c4f096-9cfe-46c0-98e5-cc34d9c452df
Build Status: COMPLETE

However, before I move this bug to VERIFIED I will run one more test. Another test passed.
Moving bug to VERIFIED; we will keep testing this bug in continuous integration testing. Tested with packages:

# rpm -qa | grep -i "oz\|imagefactory\|aeolus\|iwhd"
rubygem-imagefactory-console-0.5.0-4.20110824113238gitd9debef.el6.noarch
imagefactory-1.0.0rc3_17_g72f7e61-1.el6.noarch
rubygem-aeolus-cli-0.4.0-0.20120118121626git6fddd65.el6.noarch
iwhd-1.2-3.el6.x86_64
aeolus-conductor-0.9.0-0.20120118181603git71cd8bc.el6.noarch
imagefactory-jeosconf-ec2-rhel-1.0.0rc3_17_g72f7e61-1.el6.noarch
aeolus-conductor-doc-0.9.0-0.20120118181603git71cd8bc.el6.noarch
oz-0.9.0-0.20120206095001git043d582.el6.noarch
aeolus-conductor-daemons-0.9.0-0.20120118181603git71cd8bc.el6.noarch
aeolus-all-0.9.0-0.20120118181603git71cd8bc.el6.noarch
rubygem-aeolus-image-0.4.0-0.20120118121635git0d31a37.el6.noarch
aeolus-configure-2.6.0-0.20120118121620gita996371.el6.noarch
imagefactory-jeosconf-ec2-fedora-1.0.0rc3_17_g72f7e61-1.el6.noarch
Description of problem:
Saw two types of exceptions, all pointing to libvirt, so we may need to categorize this BZ:

(1) Started 5 rhevm builds with a 10 second delay between each start. 2 of them failed with:

2011-11-03 14:51:27,083 DEBUG imgfac.builders.BaseBuilder.Fedora_rhevm_Builder pid(6380) Message: Traceback (most recent call last):
  File "/usr/lib/python2.6/site-packages/imgfac/builders/Fedora_rhevm_Builder.py", line 65, in build_image
    self.build_upload(build_id)
  File "/usr/lib/python2.6/site-packages/imgfac/builders/Fedora_rhevm_Builder.py", line 112, in build_upload
    libvirt_xml = guest.install(self.app_config["timeout"])
  File "/usr/lib/python2.6/site-packages/oz/Guest.py", line 1427, in install
    return self._do_install(timeout, force, 0)
  File "/usr/lib/python2.6/site-packages/oz/Guest.py", line 1409, in _do_install
    self._wait_for_install_finish(dom, timeout)
  File "/usr/lib/python2.6/site-packages/oz/Guest.py", line 508, in _wait_for_install_finish
    rx_bytes, rx_packets, rx_errs, rx_drop, tx_bytes, tx_packets, tx_errs, tx_drop = libvirt_dom.interfaceStats(dev)
  File "/usr/lib64/python2.6/site-packages/libvirt.py", line 824, in interfaceStats
    if ret is None: raise libvirtError ('virDomainInterfaceStats() failed', dom=self)
libvirtError: internal error /proc/net/dev: Interface not found

(2) Started 5 rhevm builds and 2 vsphere builds, with a 10 second delay between each start. 2 of them failed with:

2011-11-03 16:13:29,814 DEBUG imgfac.builders.BaseBuilder.Fedora_rhevm_Builder pid(6380) Message: Traceback (most recent call last):
  File "/usr/lib/python2.6/site-packages/imgfac/builders/Fedora_rhevm_Builder.py", line 65, in build_image
    self.build_upload(build_id)
  File "/usr/lib/python2.6/site-packages/imgfac/builders/Fedora_rhevm_Builder.py", line 116, in build_upload
    self.output_descriptor = guest.customize_and_generate_icicle(libvirt_xml)
  File "/usr/lib/python2.6/site-packages/oz/RedHat.py", line 1032, in customize_and_generate_icicle
    return self._internal_customize(libvirt_xml, True)
  File "/usr/lib/python2.6/site-packages/oz/RedHat.py", line 1016, in _internal_customize
    self._collect_teardown(modified_xml)
  File "/usr/lib/python2.6/site-packages/oz/RedHat.py", line 269, in _collect_teardown
    g_handle = self._guestfs_handle_setup(libvirt_xml)
  File "/usr/lib/python2.6/site-packages/oz/Guest.py", line 825, in _guestfs_handle_setup
    doc = libxml2.parseDoc(self.libvirt_conn.lookupByID(domid).XMLDesc(0))
  File "/usr/lib64/python2.6/site-packages/libvirt.py", line 239, in XMLDesc
    if ret is None: raise libvirtError ('virDomainGetXMLDesc() failed', dom=self)
libvirtError: cannot send monitor command '{"execute":"query-balloon"}': Connection reset by peer

Version-Release number of selected component (if applicable):
0.5.0 stable aeolus repo
oz-0.7.0-3.el6.noarch (with this patch: https://github.com/clalancette/oz/pull/4)
imagefactory-0.8.0-1.el6.noarch
iwhd-0.991-1.el6.x86_64

How reproducible:
sometimes

Steps to Reproduce:
For (2):
aeolus-image build --target rhevm --template pulp.tdl; sleep 10; aeolus-image build --target rhevm --template pulp.tdl; sleep 10; aeolus-image build --target rhevm --template pulp.tdl; sleep 10; aeolus-image build --target rhevm --template pulp.tdl; sleep 10; aeolus-image build --target rhevm --template pulp.tdl; sleep 10; aeolus-image build --target vsphere --template pulp.tdl; sleep 10; aeolus-image build --target vsphere --template pulp.tdl;
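For error (1) above, a minimal sketch of the mitigation direction discussed in the comments (poll_interface_stats is a made-up helper, not Oz's actual fix): treat a libvirtError from interfaceStats() during the install-wait poll as a sign the guest has probably shut down, instead of failing the build:

```python
# Sketch only -- not the actual Oz fix referenced in the comments above.
import libvirt

def poll_interface_stats(dom, dev):
    """Poll NIC counters for a guest that may shut down at any moment.

    When Anaconda finishes and the guest goes away, the tap device
    disappears and interfaceStats() raises libvirtError ("Interface
    not found"); returning None lets the caller treat that like a
    normal end-of-install instead of a failed build.
    """
    try:
        return dom.interfaceStats(dev)
    except libvirt.libvirtError:
        return None
```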