Pandas DataFrame Merge
Merging data is a common operation in data analysis, where you combine data from multiple sources into a single DataFrame. Pandas, a powerful data manipulation library in Python, provides various functions to perform merging operations similar to database-style joins. In this article, we will explore different ways to merge DataFrames using Pandas, with comprehensive examples to illustrate each method.
Introduction to DataFrame Merge
In Pandas, the primary function for merging two data sets is merge(). It supports inner, outer, left, and right joins similar to SQL operations. The merge can be performed on columns or indices; overlapping non-key column names are disambiguated with suffixes, and rows without a match are filled with NaN.
Before diving into examples, let's first understand the key parameters of the pd.merge() function:
left: The DataFrame on the left side of the merge.
right: The DataFrame on the right side of the merge.
how: Type of merge to be performed. It can be 'left', 'right', 'outer', or 'inner'. Default is 'inner'.
on: Column or index level names to join on. Must be found in both DataFrames.
left_on: Columns from the left DataFrame to use as keys.
right_on: Columns from the right DataFrame to use as keys.
left_index: If True, use the index from the left DataFrame as the join key.
right_index: If True, use the index from the right DataFrame as the join key.
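To make the effect of the how parameter concrete, the following minimal sketch counts the rows each join type produces. The frames named left and right are hypothetical sample data chosen so that only IDs 3 and 4 appear on both sides.

```python
import pandas as pd

# Hypothetical sample data: IDs 3 and 4 appear in both frames.
left = pd.DataFrame({'ID': [1, 2, 3, 4],
                     'Name': ['Alice', 'Bob', 'Charlie', 'David']})
right = pd.DataFrame({'ID': [3, 4, 5, 6],
                      'Age': [25, 30, 35, 40]})

# Count the rows produced by each join type on the shared 'ID' column.
row_counts = {
    how: len(pd.merge(left, right, on='ID', how=how))
    for how in ('inner', 'left', 'right', 'outer')
}
print(row_counts)  # {'inner': 2, 'left': 4, 'right': 4, 'outer': 6}
```

The inner join keeps only the two shared IDs, the left and right joins keep every row of their own side, and the outer join keeps all six distinct IDs.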
Example 1: Basic Inner Merge
Let’s start with a basic example of an inner merge, where we combine two DataFrames based on a common column.
import pandas as pd
# Create two DataFrames
df1 = pd.DataFrame({
    'ID': [1, 2, 3, 4],
    'Name': ['Alice', 'Bob', 'Charlie', 'David']
})
df2 = pd.DataFrame({
    'ID': [3, 4, 5, 6],
    'Age': [25, 30, 35, 40]
})
# Perform an inner merge
result = pd.merge(df1, df2, on='ID')
print(result)
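The result has two rows, for IDs 3 (Charlie, 25) and 4 (David, 30), because these are the only IDs present in both DataFrames.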
Example 2: Left Merge
A left merge, or left join, keeps all rows from the left DataFrame and includes matching rows from the right DataFrame. Rows in the left DataFrame that do not have a match in the right DataFrame will have NaN values for the columns from the right DataFrame.
import pandas as pd
# Create two DataFrames
df1 = pd.DataFrame({
    'ID': [1, 2, 3, 4],
    'Name': ['Alice', 'Bob', 'Charlie', 'David']
})
df2 = pd.DataFrame({
    'ID': [2, 3, 5, 6],
    'Age': [20, 25, 35, 40]
})
# Perform a left merge
result = pd.merge(df1, df2, on='ID', how='left')
print(result)
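The result keeps all four rows of df1. Bob (ID 2) and Charlie (ID 3) receive ages 20 and 25, while Alice (ID 1) and David (ID 4) have NaN in the Age column because df2 has no matching ID.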
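One side effect of a left merge is worth seeing directly: NaN is a floating-point value, so an integer column that gains missing values is upcast to float. The sketch below reuses the data from this example and shows one possible remedy, converting to the nullable Int64 dtype; whether that suits your data is an assumption you should check.

```python
import pandas as pd

df1 = pd.DataFrame({'ID': [1, 2, 3, 4],
                    'Name': ['Alice', 'Bob', 'Charlie', 'David']})
df2 = pd.DataFrame({'ID': [2, 3, 5, 6],
                    'Age': [20, 25, 35, 40]})

result = pd.merge(df1, df2, on='ID', how='left')

# IDs 1 and 4 have no match, so their Age is NaN and the column becomes float.
missing = result['Age'].isna().tolist()
print(missing)              # [True, False, False, True]
print(result['Age'].dtype)  # float64

# One option: pandas' nullable integer dtype keeps whole numbers plus <NA>.
result['Age'] = result['Age'].astype('Int64')
print(result)
```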
Example 3: Right Merge
Right merge, or right join, is similar to left merge but keeps all rows from the right DataFrame while including matching rows from the left DataFrame.
import pandas as pd
# Create two DataFrames
df1 = pd.DataFrame({
    'ID': [1, 2, 3, 4],
    'Name': ['Alice', 'Bob', 'Charlie', 'David']
})
df2 = pd.DataFrame({
    'ID': [2, 3, 5, 6],
    'Age': [20, 25, 35, 40]
})
# Perform a right merge
result = pd.merge(df1, df2, on='ID', how='right')
print(result)
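The result has four rows, one for each ID in df2 (2, 3, 5, 6). IDs 5 and 6 have no match in df1, so their Name is NaN.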
Example 4: Outer Merge
An outer merge, or full outer join, returns all rows from both DataFrames. Where there are missing matches, it fills with NaN.
import pandas as pd
# Create two DataFrames
df1 = pd.DataFrame({
    'ID': [1, 2, 3, 4],
    'Name': ['Alice', 'Bob', 'Charlie', 'David']
})
df2 = pd.DataFrame({
    'ID': [2, 3, 5, 6],
    'Age': [20, 25, 35, 40]
})
# Perform an outer merge
result = pd.merge(df1, df2, on='ID', how='outer')
print(result)
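The result has six rows, one for each ID from 1 to 6. Age is NaN for IDs 1 and 4 (absent from df2), and Name is NaN for IDs 5 and 6 (absent from df1).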
Example 5: Merge with Different Key Columns
Sometimes, the key columns in each DataFrame have different names. In this case, you can specify left_on and right_on to indicate the key columns in each DataFrame.
import pandas as pd
# Create two DataFrames
df1 = pd.DataFrame({
    'EmployeeID': [1, 2, 3, 4],
    'Name': ['Alice', 'Bob', 'Charlie', 'David']
})
df2 = pd.DataFrame({
    'ID': [2, 3, 5, 6],
    'Age': [20, 25, 35, 40]
})
# Perform a merge with different key columns
result = pd.merge(df1, df2, left_on='EmployeeID', right_on='ID', how='inner')
print(result)
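The result has two rows, for Bob (EmployeeID 2, age 20) and Charlie (EmployeeID 3, age 25). Both key columns, EmployeeID and ID, appear in the result.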
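Because left_on and right_on name different columns, the merged frame carries both of them. A minimal sketch of tidying this up, using the same data as above (the variable names merged and tidy are only illustrative):

```python
import pandas as pd

df1 = pd.DataFrame({'EmployeeID': [1, 2, 3, 4],
                    'Name': ['Alice', 'Bob', 'Charlie', 'David']})
df2 = pd.DataFrame({'ID': [2, 3, 5, 6],
                    'Age': [20, 25, 35, 40]})

merged = pd.merge(df1, df2, left_on='EmployeeID', right_on='ID', how='inner')
print(list(merged.columns))  # ['EmployeeID', 'Name', 'ID', 'Age']

# After an inner merge both key columns hold the same values,
# so the redundant one can be dropped.
tidy = merged.drop(columns='ID')
print(list(tidy.columns))    # ['EmployeeID', 'Name', 'Age']
```

Dropping the duplicate key is only safe for inner merges; in a left, right, or outer merge the two key columns can differ where one side has no match.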
Example 6: Merge on Index
In some cases, you might want to merge DataFrames based on their indices rather than columns. You can achieve this by setting left_index and right_index to True.
import pandas as pd
# Create two DataFrames
df1 = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Salary': [50000, 60000, 70000, 80000]
}, index=[1, 2, 3, 4])
df2 = pd.DataFrame({
    'Age': [25, 30, 35, 40]
}, index=[3, 4, 5, 6])
# Perform a merge on index
result = pd.merge(df1, df2, left_index=True, right_index=True, how='inner')
print(result)
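The result has two rows, indexed 3 (Charlie, 70000, 25) and 4 (David, 80000, 30), the index labels shared by both DataFrames.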
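For index-on-index merges, DataFrame.join offers a shorter spelling. The sketch below, using the same data, suggests the two calls give the same result here. Note that join defaults to how='left', whereas merge defaults to how='inner', so the join type is passed explicitly.

```python
import pandas as pd

df1 = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie', 'David'],
                    'Salary': [50000, 60000, 70000, 80000]},
                   index=[1, 2, 3, 4])
df2 = pd.DataFrame({'Age': [25, 30, 35, 40]}, index=[3, 4, 5, 6])

# Index-on-index merge, as in the example above.
merged = pd.merge(df1, df2, left_index=True, right_index=True, how='inner')

# The same operation written with DataFrame.join.
joined = df1.join(df2, how='inner')

print(joined)
print(merged.equals(joined))
```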
Example 7: Merge with Suffixes
When merging DataFrames that have overlapping column names (other than the key columns), Pandas automatically adds suffixes to the overlapping column names to distinguish them ('_x' and '_y' by default). You can customize these suffixes using the suffixes parameter.
import pandas as pd
# Create two DataFrames
df1 = pd.DataFrame({
    'ID': [1, 2, 3, 4],
    'Value': [10, 20, 30, 40]
})
df2 = pd.DataFrame({
    'ID': [3, 4, 5, 6],
    'Value': [15, 25, 35, 45]
})
# Perform a merge with custom suffixes
result = pd.merge(df1, df2, on='ID', how='inner', suffixes=('_Left', '_Right'))
print(result)
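The result has the columns ID, Value_Left, and Value_Right, with two rows: ID 3 (30 and 15) and ID 4 (40 and 25).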
Example 8: Merge with Multiple Keys
You can merge DataFrames based on multiple columns by passing a list of column names to the on parameter.
import pandas as pd
# Create two DataFrames
df1 = pd.DataFrame({
    'ID': [1, 2, 3, 4],
    'Group': ['A', 'B', 'A', 'B'],
    'Data': [100, 200, 300, 400]
})
df2 = pd.DataFrame({
    'ID': [3, 4, 5, 6],
    'Group': ['A', 'B', 'A', 'B'],
    'Value': [10, 20, 30, 40]
})
# Perform a merge with multiple keys
result = pd.merge(df1, df2, on=['ID', 'Group'], how='inner')
print(result)
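The result has two rows, because only the pairs (3, 'A') and (4, 'B') occur in both DataFrames: ID 3, Group A, Data 300, Value 10, and ID 4, Group B, Data 400, Value 20.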
Example 9: Conditional Merge
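In some scenarios, you might want to merge DataFrames based on a condition other than equality. merge() only matches rows whose key values are equal, so one workaround is to compute a boolean column in each DataFrame that records whether the row satisfies a condition, and then merge on that column. Note that this pairs every row of df1 with every row of df2 that carries the same flag, including rows where both flags are False.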
import pandas as pd
# Create two DataFrames
df1 = pd.DataFrame({
    'ID': [1, 2, 3, 4],
    'Value': [10, 20, 30, 40]
})
df2 = pd.DataFrame({
    'ID': [3, 4, 5, 6],
    'Value': [15, 25, 35, 45]
})
# Add a key for merging
df1['Key'] = df1['Value'] > 15
df2['Key'] = df2['Value'] < 40
# Perform a merge based on the condition
result = pd.merge(df1, df2, on='Key', suffixes=('_Left', '_Right'))
print(result)
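The result has 10 rows. The three df1 rows with Key True (IDs 2, 3, 4) are each paired with the three df2 rows with Key True (IDs 3, 4, 5), giving 9 rows, and the single False row on each side (ID 1 in df1, ID 6 in df2) forms one more.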
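When the condition compares a column from one DataFrame against a column from the other, a common alternative is to build all row pairings and then filter them. The sketch below assumes pandas 1.2 or later (for how='cross'); the condition Value_Left > Value_Right is a hypothetical example and is not taken from the code above. A cross join produces len(df1) * len(df2) rows, so this approach suits small inputs only.

```python
import pandas as pd

df1 = pd.DataFrame({'ID': [1, 2, 3, 4],
                    'Value': [10, 20, 30, 40]})
df2 = pd.DataFrame({'ID': [3, 4, 5, 6],
                    'Value': [15, 25, 35, 45]})

# Build every pairing of rows (4 x 4 = 16 rows).
pairs = pd.merge(df1, df2, how='cross', suffixes=('_Left', '_Right'))

# Keep only the pairs that satisfy the (hypothetical) inequality condition.
condition = pairs['Value_Left'] > pairs['Value_Right']
result = pairs[condition].reset_index(drop=True)

print(len(pairs), len(result))  # 16 6
print(result)
```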
Example 10: Merge with Indicator
The indicator parameter adds a special column, _merge, to the output DataFrame, which indicates the source of each row. The column can take three values: 'left_only', 'right_only', or 'both', depending on whether the row's key was found only in the left DataFrame, only in the right DataFrame, or in both.
import pandas as pd
# Create two DataFrames
df1 = pd.DataFrame({
    'ID': [1, 2, 3, 4],
    'Name': ['Alice', 'Bob', 'Charlie', 'David']
})
df2 = pd.DataFrame({
    'ID': [3, 4, 5, 6],
    'Age': [25, 30, 35, 40]
})
# Perform a merge with indicator
result = pd.merge(df1, df2, on='ID', how='outer', indicator=True)
print(result)
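The result has six rows. The _merge column is 'left_only' for IDs 1 and 2, 'both' for IDs 3 and 4, and 'right_only' for IDs 5 and 6.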
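A frequent use of the _merge column is an anti-join, that is, selecting the rows of one DataFrame that have no counterpart in the other. A minimal sketch with the same data (the name left_only is just an illustrative variable):

```python
import pandas as pd

df1 = pd.DataFrame({'ID': [1, 2, 3, 4],
                    'Name': ['Alice', 'Bob', 'Charlie', 'David']})
df2 = pd.DataFrame({'ID': [3, 4, 5, 6],
                    'Age': [25, 30, 35, 40]})

merged = pd.merge(df1, df2, on='ID', how='outer', indicator=True)

# Keep the rows whose key exists only in df1, then drop the helper column.
left_only = merged[merged['_merge'] == 'left_only'].drop(columns='_merge')
print(left_only['ID'].tolist())  # [1, 2]
```

Filtering on 'right_only' or 'both' works the same way.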
Example 11: Merge with Validation
The validate parameter checks whether the merge keys form a one-to-one, one-to-many, many-to-one, or many-to-many relationship. If the data violates the specified relationship, a MergeError is raised.
import pandas as pd
# Create two DataFrames
df1 = pd.DataFrame({
    'ID': [1, 2, 3, 4],
    'Name': ['Alice', 'Bob', 'Charlie', 'David']
})
df2 = pd.DataFrame({
    'ID': [2, 2, 5, 6],
    'Age': [20, 25, 35, 40]
})
# Perform a merge with validation
try:
    result = pd.merge(df1, df2, on='ID', validate='one_to_one')
except pd.errors.MergeError as e:
    # Raised because ID 2 is duplicated in df2
    print(e)
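The merge raises a MergeError and the script prints its message, because ID 2 appears twice in df2, so the keys in the right DataFrame are not unique.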
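To see how the choice of validate value changes the outcome, the sketch below runs the same data through two checks. It is only an illustration: 'one_to_one' fails, while 'one_to_many' passes because the keys are unique in df1, which is all that setting requires.

```python
import pandas as pd

df1 = pd.DataFrame({'ID': [1, 2, 3, 4],
                    'Name': ['Alice', 'Bob', 'Charlie', 'David']})
df2 = pd.DataFrame({'ID': [2, 2, 5, 6],
                    'Age': [20, 25, 35, 40]})

# ID 2 repeats in df2, so the one-to-one check fails.
try:
    pd.merge(df1, df2, on='ID', validate='one_to_one')
    one_to_one_ok = True
except pd.errors.MergeError:
    one_to_one_ok = False

# Keys are unique in df1 only, which satisfies 'one_to_many'.
result = pd.merge(df1, df2, on='ID', validate='one_to_many')

print(one_to_one_ok, len(result))  # False 2
```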
Example 12: Merge with Sort
The sort parameter sorts the result DataFrame by the join keys in lexicographical order. By default (sort=False), the keys are not sorted, and their order depends on the join type.
import pandas as pd
# Create two DataFrames
df1 = pd.DataFrame({
    'ID': [4, 3, 2, 1],
    'Name': ['David', 'Charlie', 'Bob', 'Alice']
})
df2 = pd.DataFrame({
    'ID': [6, 5, 4, 3],
    'Age': [40, 35, 30, 25]
})
# Perform a merge with sort
result = pd.merge(df1, df2, on='ID', how='outer', sort=True)
print(result)
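The result has six rows ordered by ID from 1 to 6. Age is NaN for IDs 1 and 2, and Name is NaN for IDs 5 and 6.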
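The effect of sort is easiest to observe on an inner merge, where the pandas documentation describes the unsorted result as following the order of the left keys. A small sketch with the same data:

```python
import pandas as pd

df1 = pd.DataFrame({'ID': [4, 3, 2, 1],
                    'Name': ['David', 'Charlie', 'Bob', 'Alice']})
df2 = pd.DataFrame({'ID': [6, 5, 4, 3],
                    'Age': [40, 35, 30, 25]})

# Without sort, the key order follows the input data.
unsorted_ids = pd.merge(df1, df2, on='ID', how='inner')['ID'].tolist()

# With sort=True, the keys come back in ascending order.
sorted_ids = pd.merge(df1, df2, on='ID', how='inner', sort=True)['ID'].tolist()

print(unsorted_ids)
print(sorted_ids)  # [3, 4]
```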
Example 13: Merge with Different Index Levels
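To merge a DataFrame that has a MultiIndex with one that has a single-level index, give the index levels names so they can be matched. Passing left_index=True and right_index=True with unnamed levels raises an error, because Pandas cannot tell which level of the MultiIndex corresponds to the other index. A straightforward approach is to move the index levels into columns with reset_index() and merge on the shared name.
import pandas as pd
# Create two DataFrames with named index levels
df1 = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David']
}, index=pd.MultiIndex.from_tuples(
    [(1, 'A'), (2, 'B'), (3, 'A'), (4, 'B')], names=['ID', 'Group']))
df2 = pd.DataFrame({
    'Age': [25, 30, 35, 40]
}, index=pd.Index([1, 2, 3, 4], name='ID'))
# Move the index levels into columns, then merge on the shared 'ID' key
result = pd.merge(df1.reset_index(), df2.reset_index(), on='ID')
# The result has four rows with the columns ID, Group, Name, and Age
print(result)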
Example 14: Merge with Overlapping Columns
When merging DataFrames with overlapping columns (other than the key columns), you can specify a suffix to append to the overlapping column names in the result DataFrame.
import pandas as pd
# Create two DataFrames
df1 = pd.DataFrame({
    'ID': [1, 2, 3, 4],
    'Value': [10, 20, 30, 40]
})
df2 = pd.DataFrame({
    'ID': [3, 4, 5, 6],
    'Value': [15, 25, 35, 45]
})
# Perform a merge with overlapping columns
result = pd.merge(df1, df2, on='ID', suffixes=('_df1', '_df2'))
print(result)
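The result has the columns ID, Value_df1, and Value_df2, with two rows: ID 3 (30 and 15) and ID 4 (40 and 25).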
Example 15: Merge with Duplicated Keys
Pandas handles duplicated keys in a merge operation by performing a cartesian product of the rows. That is, for each duplicated key, every combination of the rows with that key will appear in the result DataFrame.
import pandas as pd
# Create two DataFrames
df1 = pd.DataFrame({
    'ID': [1, 2, 2, 3],
    'Name': ['Alice', 'Bob', 'Bob', 'Charlie']
})
df2 = pd.DataFrame({
    'ID': [2, 2, 3, 4],
    'Age': [20, 25, 30, 35]
})
# Perform a merge with duplicated keys
result = pd.merge(df1, df2, on='ID')
print(result)
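The result has five rows. ID 2 occurs twice in each DataFrame, producing 2 x 2 = 4 rows (Bob with ages 20 and 25, twice each), and ID 3 contributes one row (Charlie, 30).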
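Because duplicated keys multiply rows, it can help to estimate the size of the result before merging. The sketch below is one way to do that for an inner merge, under the assumption that a single key column is used: each shared key contributes (occurrences on the left) times (occurrences on the right) rows.

```python
import pandas as pd

df1 = pd.DataFrame({'ID': [1, 2, 2, 3],
                    'Name': ['Alice', 'Bob', 'Bob', 'Charlie']})
df2 = pd.DataFrame({'ID': [2, 2, 3, 4],
                    'Age': [20, 25, 30, 35]})

# Count how often each key occurs on each side.
left_counts = df1['ID'].value_counts()
right_counts = df2['ID'].value_counts()

# Keys missing from one side become NaN in the product and are dropped.
expected_rows = int((left_counts * right_counts).dropna().sum())

result = pd.merge(df1, df2, on='ID')
print(expected_rows, len(result))  # 5 5
```

If the expected row count is far larger than either input, the keys are probably less unique than intended, and the validate parameter can turn that assumption into an explicit check.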
In conclusion, the merge() function in Pandas provides a powerful and flexible way to combine DataFrames. By understanding its parameters and how they work, you can perform a wide range of merge operations to suit your data analysis needs.