How to Normalize a DataFrame

python

Imagine you have a form builder tool and you are collecting the data related to how your users are creating their forms for their website. The dataset at your disposal might be denormalized and contains the form id, for every id there could be one or several fields created (name, address …) and the type of the field (integer, string …):

Normalizing a Dataframe

What we want to achieve is have a row per form. In other words we need to group all the field names and field types into one array and obtain a dataset that would look like the following:

Normalizing a Dataframe

This is the job at hand so let’s get to it!

Importing some packages

# Import
import pandas as pd
import numpy as np

We create a dataset to play with

# Create some data
df = pd.DataFrame({"Form Id": ["Form1","Form1","Form2","Form2","Form2"],
                  "Field Name": ["Name","Email","Email","Company Size","Subscribed"],
                  "Field Type": ["String","String","String","Integer","Boolean"]})
df
Form Id Field Name Field Type
0 Form1 Name String
1 Form1 Email String
2 Form2 Email String
3 Form2 Company Size Integer
4 Form2 Subscribed Boolean

Let’s normalize the dataset!

The goal here is not to explain in details how the below code works but merely provide a recipe that can be applied. So in a nutshell here is what we do: we group the DataFrame by the field we want to make unique (Form Id in our case). Then by leveraging agg() and lambda functions we create lists out of the other dimensions. We also show as a bonus how we can surface the count of fields created in every forms.

# Before Pandas 0.25
df_transformed = df.groupby(['Form Id']).agg({
    'Field Type': [lambda x: list(x)],
    'Field Name': {
        'list' : lambda x: list(x),
        'count' : 'count'
    }
}).reset_index()

df_transformed.columns = ['Form Id', 'Field Type List', 'Field Name List', 'Field Name Count']
df_transformed
# After Pandas 0.25
df_transformed = df.groupby(['Form Id']).agg(
    Field_Type = pd.NamedAgg(column = 'Field Type', aggfunc = lambda x: list(x)),
    Field_Name_List = pd.NamedAgg(column = 'Field Name', aggfunc = lambda x: list(x)),
    Field_Name_Count = pd.NamedAgg(column = 'Field Name', aggfunc = 'count')
).reset_index()

df_transformed
Form Id Field_Type Field_Name_List Field_Name_Count
0 Form1 [String, String] [Name, Email] 2
1 Form2 [String, Integer, Boolean] [Email, Company Size, Subscribed] 3

Et voila!

Share this post: