Pandas DataFrame Merge

Merging data is a common operation in data analysis, where you combine data from multiple sources into a single DataFrame. Pandas, a powerful data manipulation library in Python, provides various functions to perform merging operations similar to database-style joins. In this article, we will explore different ways to merge DataFrames using Pandas, with comprehensive examples to illustrate each method.

Introduction to DataFrame Merge

In Pandas, the primary function for merging two DataFrames is merge(), available both as pd.merge() and as the DataFrame.merge() method. It supports inner, outer, left, and right joins similar to SQL operations. A merge can be performed on columns or indices; overlapping non-key column names receive suffixes, and rows without a match are filled with NaN.

Before diving into examples, let’s first understand the key parameters of the pd.merge() function:

  • left: The DataFrame on the left side of the merge.
  • right: The DataFrame on the right side of the merge.
  • how: Type of merge to be performed. It can be ‘left’, ‘right’, ‘outer’, ‘inner’, or ‘cross’ (pandas 1.2 and later). Default is ‘inner’.
  • on: Column or index level names to join on. Must be found in both DataFrames.
  • left_on: Columns from the left DataFrame to use as keys.
  • right_on: Columns from the right DataFrame to use as keys.
  • left_index: If True, use the index from the left DataFrame as the join key.
  • right_index: If True, use the index from the right DataFrame as the join key.
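As a quick orientation before the individual examples, the following sketch runs one pair of DataFrames through each of the four classic how values and counts the rows in each result. The data and the row_counts name are illustrative assumptions, not part of the Pandas API.

```python
import pandas as pd

# Illustrative data: IDs 3 and 4 appear in both DataFrames
left = pd.DataFrame({'ID': [1, 2, 3, 4],
                     'Name': ['Alice', 'Bob', 'Charlie', 'David']})
right = pd.DataFrame({'ID': [3, 4, 5, 6],
                      'Age': [25, 30, 35, 40]})

# Count the rows produced by each join type
row_counts = {
    how: len(pd.merge(left, right, on='ID', how=how))
    for how in ['inner', 'left', 'right', 'outer']
}
print(row_counts)
```

An inner merge keeps only the shared IDs, left and right merges keep every row from one side, and an outer merge keeps every ID from either side.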

Example 1: Basic Inner Merge

Let’s start with a basic example of an inner merge, where we combine two DataFrames based on a common column.

import pandas as pd

# Create two DataFrames
df1 = pd.DataFrame({
    'ID': [1, 2, 3, 4],
    'Name': ['Alice', 'Bob', 'Charlie', 'David']
})

df2 = pd.DataFrame({
    'ID': [3, 4, 5, 6],
    'Age': [25, 30, 35, 40]
})

# Perform an inner merge
result = pd.merge(df1, df2, on='ID')
print(result)

The result contains only the two rows whose ID appears in both DataFrames: ID 3 (Charlie, 25) and ID 4 (David, 30).

Example 2: Left Merge

A left merge, or left join, keeps all rows from the left DataFrame and includes matching rows from the right DataFrame. Rows in the left DataFrame that do not have a match in the right DataFrame will have NaN values for the columns from the right DataFrame.

import pandas as pd

# Create two DataFrames
df1 = pd.DataFrame({
    'ID': [1, 2, 3, 4],
    'Name': ['Alice', 'Bob', 'Charlie', 'David']
})

df2 = pd.DataFrame({
    'ID': [2, 3, 5, 6],
    'Age': [20, 25, 35, 40]
})

# Perform a left merge
result = pd.merge(df1, df2, on='ID', how='left')
print(result)

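The result keeps all four rows of df1. IDs 2 and 3 get ages 20 and 25, while IDs 1 and 4 have NaN in the Age column. Because of the NaN values, Age is stored as a floating-point column.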

Example 3: Right Merge

Right merge, or right join, is similar to left merge but keeps all rows from the right DataFrame while including matching rows from the left DataFrame.

import pandas as pd

# Create two DataFrames
df1 = pd.DataFrame({
    'ID': [1, 2, 3, 4],
    'Name': ['Alice', 'Bob', 'Charlie', 'David']
})

df2 = pd.DataFrame({
    'ID': [2, 3, 5, 6],
    'Age': [20, 25, 35, 40]
})

# Perform a right merge
result = pd.merge(df1, df2, on='ID', how='right')
print(result)

The result keeps all four rows of df2. IDs 2 and 3 get the names Bob and Charlie, while IDs 5 and 6 have NaN in the Name column.

Example 4: Outer Merge

An outer merge, or full outer join, returns all rows from both DataFrames. Where there are missing matches, it fills with NaN.

import pandas as pd

# Create two DataFrames
df1 = pd.DataFrame({
    'ID': [1, 2, 3, 4],
    'Name': ['Alice', 'Bob', 'Charlie', 'David']
})

df2 = pd.DataFrame({
    'ID': [2, 3, 5, 6],
    'Age': [20, 25, 35, 40]
})

# Perform an outer merge
result = pd.merge(df1, df2, on='ID', how='outer')
print(result)

The result has six rows, one for each ID from 1 to 6. Name is NaN for IDs 5 and 6, and Age is NaN for IDs 1 and 4.

Example 5: Merge with Different Key Columns

Sometimes, the key columns in each DataFrame have different names. In this case, you can specify left_on and right_on to indicate the key columns in each DataFrame.

import pandas as pd

# Create two DataFrames
df1 = pd.DataFrame({
    'EmployeeID': [1, 2, 3, 4],
    'Name': ['Alice', 'Bob', 'Charlie', 'David']
})

df2 = pd.DataFrame({
    'ID': [2, 3, 5, 6],
    'Age': [20, 25, 35, 40]
})

# Perform a merge with different key columns
result = pd.merge(df1, df2, left_on='EmployeeID', right_on='ID', how='inner')
print(result)

The result has two rows, for Bob and Charlie, and keeps both key columns: EmployeeID and ID.
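After an inner merge with left_on and right_on, the two key columns hold identical values, so one of them is usually redundant. A minimal sketch of dropping it, where the name tidy is only illustrative:

```python
import pandas as pd

df1 = pd.DataFrame({'EmployeeID': [1, 2, 3, 4],
                    'Name': ['Alice', 'Bob', 'Charlie', 'David']})
df2 = pd.DataFrame({'ID': [2, 3, 5, 6],
                    'Age': [20, 25, 35, 40]})

merged = pd.merge(df1, df2, left_on='EmployeeID', right_on='ID', how='inner')

# 'ID' repeats 'EmployeeID' after an inner merge, so drop it
tidy = merged.drop(columns='ID')
print(tidy)
```

For left, right, or outer merges the two key columns can differ (one side may be NaN), so check before dropping either.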

Example 6: Merge on Index

In some cases, you might want to merge DataFrames based on their indices rather than columns. You can achieve this by setting left_index and right_index to True.

import pandas as pd

# Create two DataFrames
df1 = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Salary': [50000, 60000, 70000, 80000]
}, index=[1, 2, 3, 4])

df2 = pd.DataFrame({
    'Age': [25, 30, 35, 40]
}, index=[3, 4, 5, 6])

# Perform a merge on index
result = pd.merge(df1, df2, left_index=True, right_index=True, how='inner')
print(result)

The result has two rows, indexed 3 and 4: Charlie (salary 70000, age 25) and David (salary 80000, age 30).
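For index-on-index merges, DataFrame.join() is a common shorthand. The sketch below compares the two calls on the same data; it assumes the DataFrames have no overlapping column names, since join() would otherwise need its lsuffix and rsuffix arguments.

```python
import pandas as pd

df1 = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie', 'David'],
                    'Salary': [50000, 60000, 70000, 80000]},
                   index=[1, 2, 3, 4])
df2 = pd.DataFrame({'Age': [25, 30, 35, 40]}, index=[3, 4, 5, 6])

# join() aligns on the index by default
joined = df1.join(df2, how='inner')
merged = pd.merge(df1, df2, left_index=True, right_index=True, how='inner')

# Check that both approaches give the same DataFrame
print(joined.equals(merged))
```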

Example 7: Merge with Suffixes

When merging DataFrames that have overlapping column names (other than the key columns), Pandas automatically adds suffixes to the overlapping column names to distinguish them (‘_x’ and ‘_y’ by default). You can customize these suffixes using the suffixes parameter.

import pandas as pd

# Create two DataFrames
df1 = pd.DataFrame({
    'ID': [1, 2, 3, 4],
    'Value': [10, 20, 30, 40]
})

df2 = pd.DataFrame({
    'ID': [3, 4, 5, 6],
    'Value': [15, 25, 35, 45]
})

# Perform a merge with custom suffixes
result = pd.merge(df1, df2, on='ID', how='inner', suffixes=('_Left', '_Right'))
print(result)

The result has the columns ID, Value_Left, and Value_Right, with two rows for IDs 3 and 4.
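For comparison, here is a short sketch of the same merge without the suffixes argument, which falls back to the default suffixes:

```python
import pandas as pd

df1 = pd.DataFrame({'ID': [1, 2, 3, 4], 'Value': [10, 20, 30, 40]})
df2 = pd.DataFrame({'ID': [3, 4, 5, 6], 'Value': [15, 25, 35, 45]})

# No suffixes given: overlapping 'Value' columns get '_x' and '_y'
result = pd.merge(df1, df2, on='ID')
print(list(result.columns))
```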

Example 8: Merge with Multiple Keys

You can merge DataFrames based on multiple columns by passing a list of column names to the on parameter.

import pandas as pd

# Create two DataFrames
df1 = pd.DataFrame({
    'ID': [1, 2, 3, 4],
    'Group': ['A', 'B', 'A', 'B'],
    'Data': [100, 200, 300, 400]
})

df2 = pd.DataFrame({
    'ID': [3, 4, 5, 6],
    'Group': ['A', 'B', 'A', 'B'],
    'Value': [10, 20, 30, 40]
})

# Perform a merge with multiple keys
result = pd.merge(df1, df2, on=['ID', 'Group'], how='inner')
print(result)

The result has two rows, for the key pairs (3, 'A') and (4, 'B'), with both the Data and Value columns.

Example 9: Conditional Merge

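In some scenarios, you might want to merge DataFrames based on a condition other than equality. merge() itself only matches keys by equality, so one workaround is to compute a Boolean key column in each DataFrame and merge on it. Note that this pairs every row whose Key is True on the left with every row whose Key is True on the right (and likewise for False); it does not compare values between the two DataFrames.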

import pandas as pd

# Create two DataFrames
df1 = pd.DataFrame({
    'ID': [1, 2, 3, 4],
    'Value': [10, 20, 30, 40]
})

df2 = pd.DataFrame({
    'ID': [3, 4, 5, 6],
    'Value': [15, 25, 35, 45]
})

# Add a key for merging
df1['Key'] = df1['Value'] > 15
df2['Key'] = df2['Value'] < 40

# Perform a merge based on the condition
result = pd.merge(df1, df2, on='Key', suffixes=('_Left', '_Right'))
print(result)

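The result has ten rows: the three rows of df1 with Key True are each paired with the three rows of df2 with Key True (nine rows), and the single row of df1 with Key False is paired with the single row of df2 with Key False.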
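When the condition must compare a value on the left with a value on the right, a commonly used alternative is a cross join followed by a filter. The sketch below is one possible approach, not a built-in conditional join. It assumes pandas 1.2 or later for how='cross', and the condition Value_Left > Value_Right is chosen purely for illustration.

```python
import pandas as pd

df1 = pd.DataFrame({'ID': [1, 2, 3, 4], 'Value': [10, 20, 30, 40]})
df2 = pd.DataFrame({'ID': [3, 4, 5, 6], 'Value': [15, 25, 35, 45]})

# Pair every left row with every right row (4 x 4 = 16 pairs)
pairs = pd.merge(df1, df2, how='cross', suffixes=('_Left', '_Right'))

# Keep only the pairs that satisfy the illustrative condition
result = pairs[pairs['Value_Left'] > pairs['Value_Right']].reset_index(drop=True)
print(result)
```

A cross join produces len(df1) * len(df2) intermediate rows, so this approach is only practical for modestly sized inputs.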

Example 10: Merge with Indicator

The indicator parameter adds a special column _merge to the output DataFrame, which indicates the source of each row. The column can have three values: ‘left_only’, ‘right_only’, or ‘both’, depending on whether the row comes from the left DataFrame, the right DataFrame, or both.

import pandas as pd

# Create two DataFrames
df1 = pd.DataFrame({
    'ID': [1, 2, 3, 4],
    'Name': ['Alice', 'Bob', 'Charlie', 'David']
})

df2 = pd.DataFrame({
    'ID': [3, 4, 5, 6],
    'Age': [25, 30, 35, 40]
})

# Perform a merge with indicator
result = pd.merge(df1, df2, on='ID', how='outer', indicator=True)
print(result)

In the result, _merge is ‘left_only’ for IDs 1 and 2, ‘both’ for IDs 3 and 4, and ‘right_only’ for IDs 5 and 6.
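One practical use of the _merge column is an anti-join, which selects the rows of the left DataFrame that have no match on the right. A minimal sketch, where the name left_only_rows is illustrative:

```python
import pandas as pd

df1 = pd.DataFrame({'ID': [1, 2, 3, 4],
                    'Name': ['Alice', 'Bob', 'Charlie', 'David']})
df2 = pd.DataFrame({'ID': [3, 4, 5, 6],
                    'Age': [25, 30, 35, 40]})

# A left merge with indicator marks unmatched rows as 'left_only'
merged = pd.merge(df1, df2, on='ID', how='left', indicator=True)

# Keep the unmatched rows and drop the helper columns
left_only_rows = (
    merged[merged['_merge'] == 'left_only']
    .drop(columns=['_merge', 'Age'])
)
print(left_only_rows)
```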

Example 11: Merge with Validation

The validate parameter checks whether the merge is a one-to-one, one-to-many, many-to-one, or many-to-many merge. If the merge violates the specified validation, an exception will be raised.

import pandas as pd

# Create two DataFrames
df1 = pd.DataFrame({
    'ID': [1, 2, 3, 4],
    'Name': ['Alice', 'Bob', 'Charlie', 'David']
})

df2 = pd.DataFrame({
    'ID': [2, 2, 5, 6],
    'Age': [20, 25, 35, 40]
})

# Perform a merge with validation; df2 has a duplicated ID, so this fails
try:
    result = pd.merge(df1, df2, on='ID', validate='one_to_one')
except pd.errors.MergeError as e:
    print(e)

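Because ID 2 appears twice in df2, the merge raises an error reporting that the merge keys are not unique in the right dataset, and the message is printed.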
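For contrast, the same data passes validation when the expected relationship matches the data. In this sketch the keys are unique on the left and repeated on the right, so validate='one_to_many' succeeds:

```python
import pandas as pd

df1 = pd.DataFrame({'ID': [1, 2, 3, 4],
                    'Name': ['Alice', 'Bob', 'Charlie', 'David']})
df2 = pd.DataFrame({'ID': [2, 2, 5, 6],
                    'Age': [20, 25, 35, 40]})

# Left keys are unique, right keys may repeat: validation passes
result = pd.merge(df1, df2, on='ID', validate='one_to_many')
print(result)
```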

Example 12: Merge with Sort

Setting sort=True sorts the result DataFrame by the join keys in lexicographical order. The default is sort=False, in which case the order of the join keys depends on the join type.

import pandas as pd

# Create two DataFrames
df1 = pd.DataFrame({
    'ID': [4, 3, 2, 1],
    'Name': ['David', 'Charlie', 'Bob', 'Alice']
})

df2 = pd.DataFrame({
    'ID': [6, 5, 4, 3],
    'Age': [40, 35, 30, 25]
})

# Perform a merge with sort
result = pd.merge(df1, df2, on='ID', how='outer', sort=True)
print(result)

The result has six rows ordered by ID from 1 to 6, even though both inputs list their IDs in descending order.
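The effect of sort is easier to see on an inner merge. With sort=False, an inner merge is documented to preserve the order of the left keys. The sketch below collects the key order for both settings; the variable names are illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({'ID': [4, 3, 2, 1],
                    'Name': ['David', 'Charlie', 'Bob', 'Alice']})
df2 = pd.DataFrame({'ID': [6, 5, 4, 3],
                    'Age': [40, 35, 30, 25]})

# Key order without and with sorting
unsorted_ids = pd.merge(df1, df2, on='ID', sort=False)['ID'].tolist()
sorted_ids = pd.merge(df1, df2, on='ID', sort=True)['ID'].tolist()

print(unsorted_ids)
print(sorted_ids)
```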

Example 13: Merge with Different Index Levels

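You can merge a DataFrame that has a MultiIndex with one that has a single-level index by setting left_index and right_index to True, provided the indexes share a level name. Pandas then aligns the rows on the overlapping level (here, ID). Without named levels, this merge fails because Pandas cannot tell which level to match.

import pandas as pd

# Create a DataFrame with a two-level index named ('ID', 'Group')
df1 = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David']
}, index=pd.MultiIndex.from_tuples(
    [(1, 'A'), (2, 'B'), (3, 'A'), (4, 'B')], names=['ID', 'Group']))

# Create a DataFrame with a single-level index named 'ID'
df2 = pd.DataFrame({
    'Age': [25, 30, 35, 40]
}, index=pd.Index([1, 2, 3, 4], name='ID'))

# Merge on the shared index level 'ID'
result = pd.merge(df1, df2, left_index=True, right_index=True)
print(result)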
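An alternative that sidesteps index-alignment rules is to move the index levels into columns, merge on a column, and then restore the index. The sketch below assumes the index levels are named ID and Group; those names are illustrative.

```python
import pandas as pd

left = pd.DataFrame(
    {'Name': ['Alice', 'Bob', 'Charlie', 'David']},
    index=pd.MultiIndex.from_tuples(
        [(1, 'A'), (2, 'B'), (3, 'A'), (4, 'B')], names=['ID', 'Group']),
)
right = pd.DataFrame({'Age': [25, 30, 35, 40]},
                     index=pd.Index([1, 2, 3, 4], name='ID'))

# Turn index levels into columns, merge on 'ID', then rebuild the MultiIndex
result = (
    pd.merge(left.reset_index(), right.reset_index(), on='ID')
    .set_index(['ID', 'Group'])
)
print(result)
```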

Example 14: Merge with Overlapping Columns

When merging DataFrames with overlapping columns (other than the key columns), you can specify a pair of suffixes, one for each DataFrame, to append to the overlapping column names in the result DataFrame.

import pandas as pd

# Create two DataFrames
df1 = pd.DataFrame({
    'ID': [1, 2, 3, 4],
    'Value': [10, 20, 30, 40]
})

df2 = pd.DataFrame({
    'ID': [3, 4, 5, 6],
    'Value': [15, 25, 35, 45]
})

# Perform a merge with overlapping columns
result = pd.merge(df1, df2, on='ID', suffixes=('_df1', '_df2'))
print(result)

The result has the columns ID, Value_df1, and Value_df2, with two rows for IDs 3 and 4.

Example 15: Merge with Duplicated Keys

Pandas handles duplicated keys in a merge operation by performing a cartesian product of the rows. That is, for each duplicated key, every combination of the rows with that key will appear in the result DataFrame.

import pandas as pd

# Create two DataFrames
df1 = pd.DataFrame({
    'ID': [1, 2, 2, 3],
    'Name': ['Alice', 'Bob', 'Bob', 'Charlie']
})

df2 = pd.DataFrame({
    'ID': [2, 2, 3, 4],
    'Age': [20, 25, 30, 35]
})

# Perform a merge with duplicated keys
result = pd.merge(df1, df2, on='ID')
print(result)

The result has five rows: four for ID 2 (two rows in df1 times two rows in df2) and one for ID 3.
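Because each duplicated key contributes the product of its row counts on both sides, the size of an inner merge can be estimated before running it. A rough sketch of that check, where the names left_counts, right_counts, and expected_rows are illustrative:

```python
import pandas as pd

df1 = pd.DataFrame({'ID': [1, 2, 2, 3],
                    'Name': ['Alice', 'Bob', 'Bob', 'Charlie']})
df2 = pd.DataFrame({'ID': [2, 2, 3, 4],
                    'Age': [20, 25, 30, 35]})

# Rows per key on each side
left_counts = df1['ID'].value_counts()
right_counts = df2['ID'].value_counts()

# Multiply counts key by key; keys missing on one side become NaN and are dropped
expected_rows = int((left_counts * right_counts).dropna().sum())

result = pd.merge(df1, df2, on='ID')
print(expected_rows, len(result))
```

Running a check like this first can flag an unexpectedly large many-to-many merge before it consumes memory.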

In conclusion, the merge() function in Pandas provides a powerful and flexible way to combine DataFrames. By understanding its parameters and how they work, you can perform a wide range of merge operations to suit your data analysis needs.