How to rank data considering duplication in BigQuery

How can we rank data in BigQuery ?

BigQuery(BQ) is very useful for data analysis or processing.
It is good at handling huge data. It returns summary result in short time.

When we handle data, sometimes we would like to do numbering.
For example, extract from first to third record in each category.
In order to get ranking in whole data, we should use order by clause.

But how can we get ranking number ?

And also if 4 records have same score, it is difficult to choose 3 records.

How can we number records without duplication ?

So today I will introduce about "How to rank data considering duplication in BigQuery".

TOC

How to rank in BigQuery

There are some methods to rank records in BigQuery.

  • RANK
  • DENSE_RANK
  • ROW_NUMBER

They are different functions.

So introduce each function.

Prepare Data

Anyway, we should prepare data.

Import this CSV file as rank_sample table.

id,val
001,100
002,120
003,130
004,100
005,110
006,160
007,170
008,100
009,120
010,200

With using this table, check each numbering functions.

RANK

As its name, RANK is ranking function.

We can use it like RANK() OVER(ORDER BY <column name that you want to rank>).

SQL

SELECT
*,
RANK() OVER(ORDER BY val) as rank
FROM test.rank_sample
order by id

Result

Rowidvalrank
111001
221205
331307
441001
551104
661608
771709
881001
991205
101020010
Rank result

According to result, we can see 1st to 10th in rank column.

The val=100 records are all 1st, and next val=110 record is 4th.

It means that rank function considers same score as same rank and skip next number.

Numbering functions | BigQuery

DENSE_RANK

DENSE_RANK is numbering function that doesn't skip duplicated rank.

We can use it like DENSE_RANK() OVER(ORDER BY <column name>).

SQL

SELECT
*,
DENSE_RANK() OVER(ORDER BY val) as dense_rank
FROM test.rank_sample
order by id

Result

Rowidvaldense_rank
111001
221203
331304
441001
551102
661605
771706
881001
991203
10102007
DENSE_RANK result

We can see that rank is from 1st to 7th.

The difference from RANK() is that val=110 record is not 4th but 2nd.

So DENSE_RANK does not skip duplicated rank.

Numbering functions | DENSE_RANK

ROW_NUMBER

ROW_NUMBER is not for ranking, but for setting row number.

But we can use it like ROW_NUMBER() OVER(ORDER BY <column name>) as same as RANK() function.

SQL

SELECT
*,
ROW_NUMBER() OVER(ORDER BY val) as dense_rank
FROM test.rank_sample
order by id

Result

Rowidvalrow_number
111001
221205
331307
441002
551104
661608
771709
881003
991206
101020010
ROW_NUMBER result

According to the result, we can see from 1st to 10th.

Unique point is that there are 2nd and 3rd records in val=100 data.

ROW_NUMBER just puts row number. So they have different number even in same score records.

Numbering functions | ROW_NUMBER

Comparison

We classify RANKDENSE_RANKROW_NUMBER like below.

FunctionDescription
RANKSame rank in same score. If there are three 1st records,next is 4th.
DENSE_RANKSame rank in same score. If there are three 1st records,next is 2nd.
ROW_NUMBEREven in same score, it numbers different number.

SQL

SELECT
*,
RANK() OVER(ORDER BY val) as rank,
DENSE_RANK() OVER(ORDER BY val) as dense_rank,
ROW_NUMBER() OVER(ORDER BY val) as row_number
FROM test.rank_sample
order by id

Result

Rowidvalrankdense_rankrow_number
11100111
22120535
33130747
44100112
55110424
66160858
77170969
88100113
99120536
101020010710

Conclusion

Today I described about "How to rank data considering duplication in BigQuery".

There are some ranking functions below.

  • RANK
  • DENSE_RANK
  • ROW_NUMBER

Each feature is following.

FunctionDescription
RANKSame rank in same score. If there are three 1st records,next is 4th.
DENSE_RANKSame rank in same score. If there are three 1st records,next is 2nd.
ROW_NUMBEREven in same score, it numbers different number.

It is impressive to use ROW_NUMBER for avoiding duplication.

If you like this article, please
Like or Follow !

If you like this article, please share !

Author

karasanのアバター karasan System engineer

Mid-career engineer (AI, Data science, Salesforce, etc.).
Good at Python and SQL.

TOC