Description of problem:
Attempt to catch bad queries before we send them to MySQL and get stuck waiting forever.

Version-Release number of selected component (if applicable):
0.6.16

Capture some known bad hostRequires and add them to nosetests so we know when we have a workaround.
Bug 590723 has more details about the MySQL problem.
I think the most recent scheduler failure is actually not related to bug 590723 and those optimizer_search_depth shenanigans. The recipe which caused it has this <hostRequires/>:

<hostRequires>
    <and>
        <key_value key="MEMORY" value="1400" op=">"></key_value>
        <key_value key="DISK" value="7999" op=">"></key_value>
        <key_value key="MEMORY" value="1400" op=">"></key_value>
        <key_value key="DISK" value="7999" op=">"></key_value>
    </and>
    <and>
        <key_value key="MEMORY" value="1400" op=">"></key_value>
        <key_value key="DISK" value="7999" op=">"></key_value>
        <key_value key="MEMORY" value="1400" op=">"></key_value>
        <key_value key="DISK" value="7999" op=">"></key_value>
    </and>
</hostRequires>

The duplicated elements are caused by bug 706838. This produces a query like the following:

2011-08-09 15:42:32,060 sqlalchemy.engine.base.Engine.0x..10 INFO
SELECT count(1) AS count_1
FROM system_arch_map AS system_arch_map_1, system
LEFT OUTER JOIN key_value_int AS key_value0 ON key_value0.key_id = %s AND system.id = key_value0.system_id
LEFT OUTER JOIN key_value_int AS key_value1 ON key_value1.key_id = %s AND system.id = key_value1.system_id
LEFT OUTER JOIN key_value_int AS key_value2 ON key_value2.key_id = %s AND system.id = key_value2.system_id
LEFT OUTER JOIN key_value_int AS key_value3 ON key_value3.key_id = %s AND system.id = key_value3.system_id
LEFT OUTER JOIN key_value_int AS key_value4 ON key_value4.key_id = %s AND system.id = key_value4.system_id
LEFT OUTER JOIN key_value_int AS key_value5 ON key_value5.key_id = %s AND system.id = key_value5.system_id
LEFT OUTER JOIN key_value_int AS key_value6 ON key_value6.key_id = %s AND system.id = key_value6.system_id
LEFT OUTER JOIN key_value_int AS key_value7 ON key_value7.key_id = %s AND system.id = key_value7.system_id
LEFT OUTER JOIN system_group AS system_group_1 ON system.id = system_group_1.system_id
LEFT OUTER JOIN tg_group AS tg_group_1 ON tg_group_1.group_id = system_group_1.group_id
LEFT OUTER JOIN user_group AS user_group_1 ON tg_group_1.group_id = user_group_1.group_id
LEFT OUTER JOIN tg_user AS tg_user_1 ON tg_user_1.user_id = user_group_1.user_id
INNER JOIN lab_controller ON lab_controller.id = system.lab_controller_id
WHERE key_value0.key_value > %s
AND key_value1.key_value > %s
AND key_value2.key_value > %s
AND key_value3.key_value > %s
AND key_value4.key_value > %s
AND key_value5.key_value > %s
AND key_value6.key_value > %s
AND key_value7.key_value > %s
AND %s = system.status_id
AND (system.owner_id = %s OR system.loan_id = %s
    OR system.shared = %s AND NOT (EXISTS (SELECT 1 FROM system_group WHERE system.id = system_group.system_id))
    OR system.shared = %s AND tg_user_1.user_id = %s)
AND system.id = system_arch_map_1.system_id AND %s = system_arch_map_1.arch_id
AND NOT (EXISTS (SELECT 1 FROM exclude_osmajor
    WHERE system.id = exclude_osmajor.system_id AND %s = exclude_osmajor.osmajor_id AND %s = exclude_osmajor.arch_id))
AND NOT (EXISTS (SELECT 1 FROM exclude_osversion
    WHERE system.id = exclude_osversion.system_id AND %s = exclude_osversion.osversion_id AND %s = exclude_osversion.arch_id))

2011-08-09 15:42:32,060 sqlalchemy.engine.base.Engine.0x..10 INFO
[12L, 9L, 12L, 9L, 12L, 9L, 12L, 9L, u'1400', u'7999', u'1400', u'7999', u'1400', u'7999', u'1400', u'7999', 1L, 1L, 1L, 1, 1, 1L, 1L, 1L, 1L, 1L, 1L]

MySQL runs forever trying to answer that query, so I tried cutting the number of joins down from 8 to 4, and the count came back as 926127 rows(!). Worse still, after issuing the COUNT(1) statement beakerd would then fetch all of those rows.

The problem here is that we are using a JOIN instead of an EXISTS clause, giving us a huge number of duplicate rows in the result set. The number of duplicate rows explodes exponentially with every <key_value/> in the job. So this is a bad query we are generating; it is not a MySQL bug. The good news is, we can fix it.
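The JOIN-versus-EXISTS difference can be demonstrated in miniature. This is only a sketch, using sqlite3 in place of MySQL, with hypothetical cut-down system and key_value_int tables (not Beaker's real schema):

```python
# Sketch: why one LEFT JOIN per <key_value/> condition multiplies rows,
# while an EXISTS subquery does not. Uses sqlite3 purely for illustration.
import sqlite3

conn = sqlite3.connect(":memory:")
c = conn.cursor()
c.execute("CREATE TABLE system (id INTEGER PRIMARY KEY)")
c.execute("CREATE TABLE key_value_int "
          "(system_id INTEGER, key_name TEXT, key_value INTEGER)")
c.execute("INSERT INTO system VALUES (1)")
# One system with three DISK values, all of which match the filter
for v in (30718, 140011, 1048570):
    c.execute("INSERT INTO key_value_int VALUES (1, 'DISK', ?)", (v,))

# JOIN-based filter: every matching key_value row duplicates the system row
join_count = c.execute("""
    SELECT COUNT(1) FROM system
    LEFT OUTER JOIN key_value_int kv ON kv.system_id = system.id
    WHERE kv.key_name = 'DISK' AND kv.key_value > 9000
""").fetchone()[0]

# EXISTS-based filter: each system row appears at most once
exists_count = c.execute("""
    SELECT COUNT(1) FROM system
    WHERE EXISTS (SELECT 1 FROM key_value_int kv
                  WHERE kv.system_id = system.id
                    AND kv.key_name = 'DISK' AND kv.key_value > 9000)
""").fetchone()[0]

print(join_count, exists_count)  # prints: 3 1
```

With two joined conditions the duplicates multiply (matches of one times matches of the other), which is how 278 distinct systems can balloon into 926127 result rows.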
Here is a test case that shows the problem:

--- a/IntegrationTests/src/bkr/inttest/server/test_model.py
+++ b/IntegrationTests/src/bkr/inttest/server/test_model.py
@@ -568,6 +568,30 @@ def test_key_notequal(self):
         self.assert_(with_cciss not in systems)
         self.assert_(without_cciss in systems)
 
+    # https://bugzilla.redhat.com/show_bug.cgi?id=729156
+    def test_keyvalue_does_not_cause_duplicate_rows(self):
+        system = data_setup.create_system(arch=u'i386', shared=True,
+                lab_controller=self.lc)
+        disk_key = Key.by_name(u'DISK')
+        system.key_values_int.extend([
+            Key_Value_Int(disk_key, 30718),
+            Key_Value_Int(disk_key, 140011),
+            Key_Value_Int(disk_key, 1048570)])
+        session.flush()
+        query = self.distro.systems_filter(self.user, """
+            <hostRequires>
+                <and>
+                    <hostname op="=" value="%s" />
+                    <key_value key="DISK" op=">" value="9000" />
+                </and>
+            </hostRequires>
+            """ % system.fqdn)
+        self.assertEquals(len(query.all()), 1)
+        # with the bug this count comes out as 3 instead of 1,
+        # which doesn't sound so bad...
+        # but when it's 926127 instead of 278, that's bad
+        self.assertEquals(query.count(), 1)
+
     # https://bugzilla.redhat.com/show_bug.cgi?id=714974
     def test_hypervisor(self):
         baremetal = data_setup.create_system(arch=u'i386', shared=True,
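The direction of the fix can be sketched with SQLAlchemy itself: a relationship's .any() comparator renders an EXISTS subquery rather than a join, so count() agrees with len(query.all()). This is a hypothetical cut-down model, not Beaker's actual code:

```python
# Sketch (assumed models, not Beaker's schema): filtering via
# relationship .any() emits EXISTS, so no duplicate system rows.
from sqlalchemy import Column, ForeignKey, Integer, String, create_engine
from sqlalchemy.orm import declarative_base, relationship, sessionmaker

Base = declarative_base()

class System(Base):
    __tablename__ = 'system'
    id = Column(Integer, primary_key=True)
    key_values_int = relationship('KeyValueInt')

class KeyValueInt(Base):
    __tablename__ = 'key_value_int'
    id = Column(Integer, primary_key=True)
    system_id = Column(Integer, ForeignKey('system.id'))
    key_name = Column(String)
    key_value = Column(Integer)

engine = create_engine('sqlite://')
Base.metadata.create_all(engine)
session = sessionmaker(engine)()
session.add(System(key_values_int=[
    KeyValueInt(key_name=u'DISK', key_value=v)
    for v in (30718, 140011, 1048570)]))
session.flush()

# .any() produces: WHERE EXISTS (SELECT 1 FROM key_value_int WHERE ...)
query = session.query(System).filter(System.key_values_int.any(
    (KeyValueInt.key_name == u'DISK') & (KeyValueInt.key_value > 9000)))
assert query.count() == len(query.all()) == 1
```

With the join-based formulation this same filter would return the system once per matching key_value row; the EXISTS formulation returns it exactly once regardless of how many values match.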
Beaker 0.7.1 has been released.