How to Calculate the Uniqueness Ratio of a Column


Along with completeness, uniqueness is an important data profiling and data quality metric. Let’s see how we can compute it in Python using pandas.

Let’s import the package we need:

# Import pandas (the only package used below)
import pandas as pd

Next, we create a small dataset to play with:

# Let's manually create some sample data, making sure that the id column (built from ref) is not fully unique: 'id_3' appears twice.
ref = ['id_1', 'id_2', 'id_3', 'id_4', 'id_3']
val = [1, 2, 5, 6, 9]

df = pd.DataFrame({'id': ref, 'values': val})
     id  values
0  id_1       1
1  id_2       2
2  id_3       5
3  id_4       6
4  id_3       9

Computing uniqueness

We can easily find the unique ids with the drop_duplicates method, which keeps only the first occurrence of each value:

unique_ids = df.iloc[:,[0]].drop_duplicates()
     id
0  id_1
1  id_2
2  id_3
3  id_4
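As a side note, pandas also offers Series-level shortcuts for the same information. The sketch below rebuilds the toy data from above and uses `Series.unique()` and `Series.nunique()`; the variable names `distinct_values` and `distinct_count` are just illustrative. One caveat worth keeping in mind: `nunique()` ignores missing values by default, whereas `drop_duplicates()` keeps a missing value as one distinct entry, so the two can differ on columns containing NaN.

```python
import pandas as pd

# Same toy data as above
df = pd.DataFrame({'id': ['id_1', 'id_2', 'id_3', 'id_4', 'id_3'],
                   'values': [1, 2, 5, 6, 9]})

# Distinct values of the column, in order of first appearance
distinct_values = df['id'].unique()

# Number of distinct values (missing values are ignored by default)
distinct_count = df['id'].nunique()

# On this data (no missing values), both agree with drop_duplicates
same_as_drop_duplicates = distinct_count == len(df[['id']].drop_duplicates())
```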

Now calculating uniqueness is as simple as:

# 4 distinct ids out of 5 rows gives a uniqueness ratio of 0.8
uniqueness = len(unique_ids) / len(df['id'])
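If you need this for more than one column, the same calculation can be wrapped in a small function. This is only a minimal sketch: `uniqueness_ratio` is a hypothetical helper name, and returning NaN for an empty column is an assumption made here to avoid a division by zero, not something pandas prescribes.

```python
import pandas as pd

def uniqueness_ratio(column: pd.Series) -> float:
    """Share of distinct values among all rows of a column (hypothetical helper)."""
    if len(column) == 0:
        # Assumption: the ratio is undefined for an empty column
        return float('nan')
    # Same logic as above: distinct rows divided by total rows
    return len(column.drop_duplicates()) / len(column)

# Same toy data as above
df = pd.DataFrame({'id': ['id_1', 'id_2', 'id_3', 'id_4', 'id_3'],
                   'values': [1, 2, 5, 6, 9]})

# Uniqueness ratio of every column in the DataFrame
ratios = {name: uniqueness_ratio(df[name]) for name in df.columns}
```

Here the 'id' column has a ratio below 1 because 'id_3' is repeated, while 'values' is fully unique.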

Et voilà!
