Redshift slow WHERE IN performance

— 2 minute read

Redshift is an amazing service when you need to run analytics on your data. Redshift uses columnar data storage, which means, it is very good when you run aggregation queries for reporting purposes.

Considering anaytical reports are generally genarated on a daily or weekly basis, Redshift uses query compilation to optimize query timing. This could impact the some dynamic queries (that you need to run very often with dynamic set of conditions). Say, you want to run a query with big set of static WHERE IN conditions, at this time, query compilation will slow down the query execution time even though the same query will not be run again.

Example:

SELECT SUM(score) from results WHERE student_id IN (LONG_LIST_OF_IDS)

You can find the compilation state of any query using the following query,

select userid, xid,  pid, query, segment, locus, datediff(ms, starttime, endtime) as duration, compile
from svl_compile
where query = QUERY_ID
order by query, segment;

If you see compile=1, it means this query's some part of execution is being compiled for future use. This internal feature may not be prevented. As an workaround, you can use temporary table and do inner join.

CREATE TEMPORARY TABLE temp_student_ids (id int); INSERT INTO temp_student_ids values ((ID_1),)((ID_2)))

SELECT SUM(SCORE) from results INNER JOIN temp_student_ids ON (student_id=id);