Bug 729156

Summary: <key_value/> uses JOIN instead of EXISTS, causing many duplicate rows to be returned
Product: [Retired] Beaker Reporter: Bill Peck <bpeck>
Component: schedulerAssignee: Dan Callaghan <dcallagh>
Status: CLOSED CURRENTRELEASE QA Contact:
Severity: unspecified Docs Contact:
Priority: unspecified    
Version: 0.6CC: bpeck, dcallagh, mcsontos, rmancy, stl
Target Milestone: ---   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2011-09-08 02:42:53 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Bill Peck 2011-08-08 21:15:08 UTC
Description of problem:
Attempt to catch bad queries before we send them to mysql and get stuck waiting forever.

Version-Release number of selected component (if applicable):
0.6.16


Capture some known bad hostRequires and add them to nosetests so we can know when we have a workaround.

Comment 1 Dan Callaghan 2011-08-09 03:26:43 UTC
Bug 590723 has more details about the MySQL problem.

Comment 2 Dan Callaghan 2011-08-09 06:08:51 UTC
I think the most recent scheduler failure is actually not related to bug 590723 and those optimizer_search_depth shenanigans. The recipe which caused it has this <hostRequires/>:

<hostRequires>
<and>
<key_value key="MEMORY" value="1400" op=">"></key_value>
<key_value key="DISK" value="7999" op=">"></key_value>
<key_value key="MEMORY" value="1400" op=">"></key_value>
<key_value key="DISK" value="7999" op=">"></key_value>
</and>
<and>
<key_value key="MEMORY" value="1400" op=">"></key_value>
<key_value key="DISK" value="7999" op=">"></key_value>
<key_value key="MEMORY" value="1400" op=">"></key_value>
<key_value key="DISK" value="7999" op=">"></key_value>
</and>
</hostRequires>

the duplicates being caused by bug 706838. This produces a query like the following:

2011-08-09 15:42:32,060 sqlalchemy.engine.base.Engine.0x..10 INFO SELECT count(1) AS count_1 
FROM system_arch_map AS system_arch_map_1, system LEFT OUTER JOIN key_value_int AS key_value0 ON key_value0.key_id = %s AND system.id = key_value0.system_id LEFT OUTER JOIN key_value_int AS key_value1 ON key_value1.key_id = %s AND system.id = key_value1.system_id LEFT OUTER JOIN key_value_int AS key_value2 ON key_value2.key_id = %s AND system.id = key_value2.system_id LEFT OUTER JOIN key_value_int AS key_value3 ON key_value3.key_id = %s AND system.id = key_value3.system_id LEFT OUTER JOIN key_value_int AS key_value4 ON key_value4.key_id = %s AND system.id = key_value4.system_id LEFT OUTER JOIN key_value_int AS key_value5 ON key_value5.key_id = %s AND system.id = key_value5.system_id LEFT OUTER JOIN key_value_int AS key_value6 ON key_value6.key_id = %s AND system.id = key_value6.system_id LEFT OUTER JOIN key_value_int AS key_value7 ON key_value7.key_id = %s AND system.id = key_value7.system_id LEFT OUTER JOIN system_group AS system_group_1 ON system.id = system_group_1.system_id LEFT OUTER JOIN tg_group AS tg_group_1 ON tg_group_1.group_id = system_group_1.group_id LEFT OUTER JOIN user_group AS user_group_1 ON tg_group_1.group_id = user_group_1.group_id LEFT OUTER JOIN tg_user AS tg_user_1 ON tg_user_1.user_id = user_group_1.user_id INNER JOIN lab_controller ON lab_controller.id = system.lab_controller_id 
WHERE key_value0.key_value > %s AND key_value1.key_value > %s AND key_value2.key_value > %s AND key_value3.key_value > %s AND key_value4.key_value > %s AND key_value5.key_value > %s AND key_value6.key_value > %s AND key_value7.key_value > %s AND %s = system.status_id AND (system.owner_id = %s OR system.loan_id = %s OR system.shared = %s AND NOT (EXISTS (SELECT 1 
FROM system_group 
WHERE system.id = system_group.system_id)) OR system.shared = %s AND tg_user_1.user_id = %s) AND system.id = system_arch_map_1.system_id AND %s = system_arch_map_1.arch_id AND NOT (EXISTS (SELECT 1 
FROM exclude_osmajor 
WHERE system.id = exclude_osmajor.system_id AND %s = exclude_osmajor.osmajor_id AND %s = exclude_osmajor.arch_id)) AND NOT (EXISTS (SELECT 1 
FROM exclude_osversion 
WHERE system.id = exclude_osversion.system_id AND %s = exclude_osversion.osversion_id AND %s = exclude_osversion.arch_id))
2011-08-09 15:42:32,060 sqlalchemy.engine.base.Engine.0x..10 INFO [12L, 9L, 12L, 9L, 12L, 9L, 12L, 9L, u'1400', u'7999', u'1400', u'7999', u'1400', u'7999', u'1400', u'7999', 1L, 1L, 1L, 1, 1, 1L, 1L, 1L, 1L, 1L, 1L]

MySQL runs forever trying to answer that query, so I tried cutting down the number of joins from 8 to 4 and the count came back as 926127 rows(!). Worse still, after issuing the COUNT(1) statement beakerd would then fetch all those rows.

The problem here is that we are using a JOIN instead of an EXISTS clause, giving us a huge number of duplicate rows in the result set. The number of duplicate rows will explode exponentially with every <key_value/> in the job.

So this is a bad query we are generating, it's not a MySQL bug. The good news is, we can fix it.

Comment 4 Dan Callaghan 2011-08-10 02:11:12 UTC
Here is a test case that shows the problem:

--- a/IntegrationTests/src/bkr/inttest/server/test_model.py
+++ b/IntegrationTests/src/bkr/inttest/server/test_model.py
@@ -568,6 +568,30 @@ def test_key_notequal(self):
         self.assert_(with_cciss not in systems)
         self.assert_(without_cciss in systems)
 
+    # https://bugzilla.redhat.com/show_bug.cgi?id=729156
+    def test_keyvalue_does_not_cause_duplicate_rows(self):
+        system = data_setup.create_system(arch=u'i386', shared=True,
+                lab_controller=self.lc)
+        disk_key = Key.by_name(u'DISK')
+        system.key_values_int.extend([
+                Key_Value_Int(disk_key, 30718),
+                Key_Value_Int(disk_key, 140011),
+                Key_Value_Int(disk_key, 1048570)])
+        session.flush()
+        query = self.distro.systems_filter(self.user, """
+            <hostRequires>
+                <and>
+                    <hostname op="=" value="%s" />
+                    <key_value key="DISK" op="&gt;" value="9000" />
+                </and>
+            </hostRequires>
+            """ % system.fqdn)
+        self.assertEquals(len(query.all()), 1)
+        # with the bug this count comes out as 3 instead of 1,
+        # which doesn't sound so bad...
+        # but when it's 926127 instead of 278, that's bad
+        self.assertEquals(query.count(), 1)
+
     # https://bugzilla.redhat.com/show_bug.cgi?id=714974
     def test_hypervisor(self):
         baremetal = data_setup.create_system(arch=u'i386', shared=True,

Comment 9 Dan Callaghan 2011-09-08 02:42:53 UTC
Beaker 0.7.1 has been released.