How can we rank data in BigQuery ?
BigQuery(BQ) is very useful for data analysis or processing.
It is good at handling huge data. It returns summary result in short time.
When we handle data, sometimes we would like to do numbering.
For example, extract from first to third record in each category.
In order to get ranking in whole data, we should use order by
clause.
But how can we get ranking number ?
And also if 4 records have same score, it is difficult to choose 3 records.
How can we number records without duplication ?
So today I will introduce about "How to rank data considering duplication in BigQuery".
How to rank in BigQuery
There are some methods to rank records in BigQuery.
- RANK
- DENSE_RANK
- ROW_NUMBER
They are different functions.
So introduce each function.
Prepare Data
Anyway, we should prepare data.
Import this CSV file as rank_sample
table.
id,val
001,100
002,120
003,130
004,100
005,110
006,160
007,170
008,100
009,120
010,200
With using this table, check each numbering functions.
RANK
As its name, RANK
is ranking function.
We can use it like RANK() OVER(ORDER BY <column name that you want to rank>)
.
SQL
SELECT
*,
RANK() OVER(ORDER BY val) as rank
FROM test.rank_sample
order by id
Result
Row | id | val | rank |
---|---|---|---|
1 | 1 | 100 | 1 |
2 | 2 | 120 | 5 |
3 | 3 | 130 | 7 |
4 | 4 | 100 | 1 |
5 | 5 | 110 | 4 |
6 | 6 | 160 | 8 |
7 | 7 | 170 | 9 |
8 | 8 | 100 | 1 |
9 | 9 | 120 | 5 |
10 | 10 | 200 | 10 |
According to result, we can see 1st to 10th in rank column.
The val=100
records are all 1st, and next val=110
record is 4th.
It means that rank
function considers same score as same rank and skip next number.
DENSE_RANK
DENSE_RANK
is numbering function that doesn't skip duplicated rank.
We can use it like DENSE_RANK() OVER(ORDER BY <column name>)
.
SQL
SELECT
*,
DENSE_RANK() OVER(ORDER BY val) as dense_rank
FROM test.rank_sample
order by id
Result
Row | id | val | dense_rank |
---|---|---|---|
1 | 1 | 100 | 1 |
2 | 2 | 120 | 3 |
3 | 3 | 130 | 4 |
4 | 4 | 100 | 1 |
5 | 5 | 110 | 2 |
6 | 6 | 160 | 5 |
7 | 7 | 170 | 6 |
8 | 8 | 100 | 1 |
9 | 9 | 120 | 3 |
10 | 10 | 200 | 7 |
We can see that rank is from 1st to 7th.
The difference from RANK()
is that val=110
record is not 4th but 2nd.
So DENSE_RANK
does not skip duplicated rank.
ROW_NUMBER
ROW_NUMBER
is not for ranking, but for setting row number.
But we can use it like ROW_NUMBER() OVER(ORDER BY <column name>)
as same as RANK()
function.
SQL
SELECT
*,
ROW_NUMBER() OVER(ORDER BY val) as dense_rank
FROM test.rank_sample
order by id
Result
Row | id | val | row_number |
---|---|---|---|
1 | 1 | 100 | 1 |
2 | 2 | 120 | 5 |
3 | 3 | 130 | 7 |
4 | 4 | 100 | 2 |
5 | 5 | 110 | 4 |
6 | 6 | 160 | 8 |
7 | 7 | 170 | 9 |
8 | 8 | 100 | 3 |
9 | 9 | 120 | 6 |
10 | 10 | 200 | 10 |
According to the result, we can see from 1st to 10th.
Unique point is that there are 2nd and 3rd records in val=100
data.
ROW_NUMBER
just puts row number. So they have different number even in same score records.
Comparison
We classify RANK
, DENSE_RANK
, ROW_NUMBER
like below.
Function | Description |
---|---|
RANK | Same rank in same score. If there are three 1st records,next is 4th. |
DENSE_RANK | Same rank in same score. If there are three 1st records,next is 2nd. |
ROW_NUMBER | Even in same score, it numbers different number. |
SQL
SELECT
*,
RANK() OVER(ORDER BY val) as rank,
DENSE_RANK() OVER(ORDER BY val) as dense_rank,
ROW_NUMBER() OVER(ORDER BY val) as row_number
FROM test.rank_sample
order by id
Result
Row | id | val | rank | dense_rank | row_number |
---|---|---|---|---|---|
1 | 1 | 100 | 1 | 1 | 1 |
2 | 2 | 120 | 5 | 3 | 5 |
3 | 3 | 130 | 7 | 4 | 7 |
4 | 4 | 100 | 1 | 1 | 2 |
5 | 5 | 110 | 4 | 2 | 4 |
6 | 6 | 160 | 8 | 5 | 8 |
7 | 7 | 170 | 9 | 6 | 9 |
8 | 8 | 100 | 1 | 1 | 3 |
9 | 9 | 120 | 5 | 3 | 6 |
10 | 10 | 200 | 10 | 7 | 10 |
Conclusion
Today I described about "How to rank data considering duplication in BigQuery".
There are some ranking functions below.
- RANK
- DENSE_RANK
- ROW_NUMBER
Each feature is following.
Function | Description |
---|---|
RANK | Same rank in same score. If there are three 1st records,next is 4th. |
DENSE_RANK | Same rank in same score. If there are three 1st records,next is 2nd. |
ROW_NUMBER | Even in same score, it numbers different number. |
It is impressive to use ROW_NUMBER
for avoiding duplication.