Cardinality estimation in parallel
Introduction 🧮
First, what is cardinality? It’s the count of unique elements in a field; in SQL, something like `SELECT DISTINCT field FROM table`. You can definitely count unique elements this way, but the trouble begins when you want to do quick analyses on big data. If you want to explore the relationship between two fields, it may take multiple `GROUP BY`s and other operations for every pair of fields you want to compare. This gets especially tedious and expensive if you explore every combination of fields: with n fields, checking all pairs is roughly O(n²) queries, and it grows toward O(nᵐ) if you compare every combination of m fields at a time.

With probabilistic data structures like HyperLogLog and MinHash, we can instead build one small sketch per column in a single pass over the data, so the cost is only around O(n). See the references listed at the bottom for in-depth discussions of probabilistic data structures, the class of data structures that HyperLogLog and MinHash belong to.
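To make the "one small sketch per column" idea concrete, here is a minimal toy HyperLogLog in Python. This is a hypothetical sketch for illustration, not a production implementation (it uses only the basic estimator plus the standard small-range correction, and omits the other bias corrections): each element is hashed exactly once, and only a tiny array of registers is kept per column.

```python
import hashlib
import math

class HyperLogLog:
    """Toy HyperLogLog: estimates distinct count from 2**p registers."""

    def __init__(self, p=10):
        self.p = p                 # precision: 2**p registers
        self.m = 1 << p
        self.registers = [0] * self.m

    def add(self, value):
        # Hash once; use the first p bits to pick a register.
        h = int.from_bytes(
            hashlib.sha1(str(value).encode()).digest()[:8], "big"
        )
        idx = h >> (64 - self.p)
        rest = h & ((1 << (64 - self.p)) - 1)
        # Rank = position of the leftmost 1-bit in the remaining bits.
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def count(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)   # standard bias constant
        z = 1.0 / sum(2.0 ** -r for r in self.registers)
        estimate = alpha * self.m * self.m * z
        # Small-range correction: fall back to linear counting.
        zeros = self.registers.count(0)
        if estimate <= 2.5 * self.m and zeros:
            estimate = self.m * math.log(self.m / zeros)
        return int(estimate)

hll = HyperLogLog()
for i in range(100_000):
    hll.add(i)
print(hll.count())   # an estimate close to 100,000 (a few percent error)
```

Note the payoff: the sketch stays at 2**p registers no matter how many rows you scan, so one pass over the table yields an approximate distinct count per column without any `GROUP BY` at all.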