I'm trying to create an index over a table with roughly 100 million rows. Right now I have the following columns:
name="date", type="DATE"
name="account_id", type="INT64"
name="group_id", type="INT64"
name="run_id", type="INT64"
name="page_id", type="INT64"
name="page_name", type="LONGVARCHAR"
name="countable", type="INT64"
name="viewing_object_id", type="INT64"
name="viewing_object_title", type="LONGVARCHAR"
Each unique group_id belongs to a single account_id (each account has multiple groups), and each unique run_id belongs to a single group_id (each group has multiple runs). In comparison, viewing_object_id can be associated with multiple
My most common query is going to be of the form:
SELECT
    MAX_BY(viewing_object_title, date) AS viewing_object_title,
    MAX_BY(page_name, date) AS page_name,
    SUM(countable) AS countable
FROM my_table
WHERE account_id = %d
  AND date >= DATE(%s)
  AND date <= DATE(%s)
GROUP BY viewing_object_id
But sometimes I also include AND page_id = %d. And other times I also filter on WHERE account_id = %d AND group_id = %d AND run_id = %d
I'm thinking that I have a few options:
1) Use the following non-clustered indices with no clustered index (keep the heap structure).
[
  [account_id, viewing_object_id, date],
  [account_id, page_id, viewing_object_id, date],
  [account_id, group_id, run_id, viewing_object_id, date],
  [account_id, group_id, page_id, viewing_object_id, date]
]
I'm thinking that this would take up a ton of space. But surely it wouldn't be the same amount of space as having 500 million rows, right? These indices match all of the query types that I would want to run. And since data is added daily in a pipeline, I'm not that concerned about write performance.
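To make option 1 concrete, this is roughly the DDL it implies. It is only a sketch in generic CREATE INDEX syntax: the index names are made up for illustration, and depending on the engine the date column may need to be quoted.

```sql
-- Hypothetical index names; my_table and the column lists come from option 1.
-- Every index leads with account_id because every query filters on it.
CREATE INDEX ix_acct_obj_date
    ON my_table (account_id, viewing_object_id, date);

CREATE INDEX ix_acct_page_obj_date
    ON my_table (account_id, page_id, viewing_object_id, date);

CREATE INDEX ix_acct_group_run_obj_date
    ON my_table (account_id, group_id, run_id, viewing_object_id, date);

CREATE INDEX ix_acct_group_page_obj_date
    ON my_table (account_id, group_id, page_id, viewing_object_id, date);
```

Each index stores only its key columns plus a pointer back to the row, so none of them carries page_name, viewing_object_title, or countable. Queries that select those columns still have to look up the base rows.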
2) Use the next clustered index:
[account_id, date, page_id].
This lets me have really fast searches when I'm specifically loading everything associated with an account over a date range, including when I filter by the page_id. But it seems like if I were to use run_id I'd end up having to do a full table scan.
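A sketch of option 2, assuming SQL Server-style syntax (the engine isn't named here, and other engines declare clustering differently). The index name is hypothetical, and the literal values in the second statement are placeholders standing in for the %d parameters.

```sql
-- Assumed T-SQL syntax; ix_account_date_page is a made-up name.
CREATE CLUSTERED INDEX ix_account_date_page
    ON my_table (account_id, date, page_id);

-- The run-level filter in question. group_id and run_id are not part of
-- the clustered key, so they cannot be used to seek within the index.
SELECT SUM(countable) AS countable
FROM my_table
WHERE account_id = 1   -- example value
  AND group_id = 2     -- example value
  AND run_id = 3       -- example value
GROUP BY viewing_object_id;
```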
I'm not really sure how to make this trade-off, or whether I'm thinking about this the right way. Is one of these options clearly wrong for how I'd want to index my data?