How to join two dataframes with different size

In this post you will learn how to join two dataframes with different size in Pandas.

This post will explain the use of the join function when combining dataframes of different sizes.

The join function offers 4 different methods for joining dataframes: left, right, inner and outer. Each of these methods will connect dataframes that have different dimensions in a different way and, as a result, a different resulting dataframe will be created.

These are the two dataframes I prepared as an example.

import pandas as pd

my_df1 = pd.DataFrame({'Column1': [5, 9, 12],
                      'Column2': [15, 5, 8],
                       'Column3': [1, 4, 3]},
                      index=['2024-01-01', '2024-01-02', '2024-01-03'])

my_df2 = pd.DataFrame({'Column4': [5, 12],
                       'Column5': [7, 6],
                       'Column6': [4, 5]},
                      index=['2024-01-02', '2024-01-04'])

print(f'This is my left dataframe: \n {my_df1}')
print(f'This is my right dataframe: \n {my_df2}')

how to join dataframes with different size

As you can see the dataframes differ in size. Another difference is also the indexes, because both dataframes contain data for different days. Only one day is common.

How to left-join two dataframes with different size

In the first example, we will discuss left join. The left join method is used when you want to keep the entire dataframe on the left and add the part of the dataframe on the right that is common to both dataframes. Data from the right dataframe that has no match in the dataframe on the left will be deleted.

import pandas as pd

my_df1 = pd.DataFrame({'Column1': [5, 9, 12],
                      'Column2': [15, 5, 8],
                       'Column3': [1, 4, 3]},
                      index=['2024-01-01', '2024-01-02', '2024-01-03'])

my_df2 = pd.DataFrame({'Column4': [5, 12],
                       'Column5': [7, 6],
                       'Column6': [4, 5]},
                      index=['2024-01-02', '2024-01-04'])

my_df = my_df1.join(my_df2, how='left')

print(f'This is my left-joined dataframe: \n{my_df}')

how to left join dataframes with different size

As you can see the data on the right dataframe from January 4th has not been included. Also, data from January 1 and January 3 were entered as NaN (not a number), because in the right dataframe there were none at all.

How to right-join two dataframes with different size

The right join method works in the same way as left join. We expect the entire dataframe on the right and a matching part on the left dataframe.

import pandas as pd

my_df1 = pd.DataFrame({'Column1': [5, 9, 12],
                      'Column2': [15, 5, 8],
                       'Column3': [1, 4, 3]},
                      index=['2024-01-01', '2024-01-02', '2024-01-03'])

my_df2 = pd.DataFrame({'Column4': [5, 12],
                       'Column5': [7, 6],
                       'Column6': [4, 5]},
                      index=['2024-01-02', '2024-01-04'])

my_df = my_df1.join(my_df2, how='right')

print(f'This is my right-joined dataframe: \n{my_df}')

how to right join dataframes with different size

Indeed, the dataframe on the right has been fully merged. From the dataframe on the left, it was only inserted on January 2nd. There was no data from January 4, so NaN values were entered.

How to inner-join two dataframes with different size

Inner join is a method that returns only the common part of all joined dataframes.

import pandas as pd

my_df1 = pd.DataFrame({'Column1': [5, 9, 12],
                      'Column2': [15, 5, 8],
                       'Column3': [1, 4, 3]},
                      index=['2024-01-01', '2024-01-02', '2024-01-03'])

my_df2 = pd.DataFrame({'Column4': [5, 12],
                       'Column5': [7, 6],
                       'Column6': [4, 5]},
                      index=['2024-01-02', '2024-01-04'])

my_df = my_df1.join(my_df2, how='inner')

print(f'This is my inner-joined dataframe: \n{my_df}')

how to inner join dataframes with different size

The dataframe, which was created after joining two dataframes using the inner join method, contains data for only one day. By using the inner join method you can expect the resulting dataframe to be smaller than the input dataframes.

How to outer-join two dataframes with different size

The outer join method is, in short, only used when you don’t want to lose any data.

import pandas as pd

my_df1 = pd.DataFrame({'Column1': [5, 9, 12],
                      'Column2': [15, 5, 8],
                       'Column3': [1, 4, 3]},
                      index=['2024-01-01', '2024-01-02', '2024-01-03'])

my_df2 = pd.DataFrame({'Column4': [5, 12],
                       'Column5': [7, 6],
                       'Column6': [4, 5]},
                      index=['2024-01-02', '2024-01-04'])

my_df = my_df1.join(my_df2, how='outer')

print(f'This is my outer-joined dataframe: \n{my_df}')

how to outer join dataframes with different size

The output dataframe contains all the data that is visible in the input dataframes.

The downside of this solution is the multitude of NaN values, because Pandas inserts NaN values in all missing places.

I hope this post was helpful and you now know how to combine dataframes with different sizes.

See also:
Documentation of a Join function

This Post Has 3 Comments

Leave a Reply