Pandas Concat Example
Pandas is a powerful data manipulation library in Python, and one of its most useful features is the ability to combine multiple DataFrames using the concat
function. This article will provide a comprehensive guide to using pandas.concat()
, covering various scenarios and use cases with detailed examples.
1. Introduction to pandas.concat()
The pandas.concat()
function is used to concatenate pandas objects along a particular axis. It can be used to combine DataFrames vertically (stacking rows) or horizontally (joining columns). This function is particularly useful when you need to merge multiple datasets or combine data from different sources.
Let’s start with a basic example to illustrate how concat
works:
import pandas as pd
# Create two sample DataFrames
df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2'],
'B': ['B0', 'B1', 'B2']},
index=['K0', 'K1', 'K2'])
df2 = pd.DataFrame({'C': ['C0', 'C1', 'C2'],
'D': ['D0', 'D1', 'D2']},
index=['K0', 'K1', 'K2'])
# Concatenate the DataFrames horizontally
result = pd.concat([df1, df2], axis=1)
print("Result of horizontal concatenation:")
print(result)
# Concatenate the DataFrames vertically
result_vertical = pd.concat([df1, df2])
print("\nResult of vertical concatenation:")
print(result_vertical)
Output:
In this example, we create two DataFrames df1
and df2
with different columns but the same index. We then use pd.concat()
to combine them horizontally (axis=1) and vertically (default axis=0). The horizontal concatenation joins the columns side by side, while the vertical concatenation stacks the rows on top of each other.
2. Concatenating DataFrames with Different Indexes
When concatenating DataFrames with different indexes, pandas will include all index values in the result. Let’s see an example:
import pandas as pd
# Create two DataFrames with different indexes
df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2'],
'B': ['B0', 'B1', 'B2']},
index=['K0', 'K1', 'K2'])
df2 = pd.DataFrame({'C': ['C0', 'C1', 'C2'],
'D': ['D0', 'D1', 'D2']},
index=['K1', 'K2', 'K3'])
# Concatenate the DataFrames
result = pd.concat([df1, df2], axis=1)
print("Result of concatenation with different indexes:")
print(result)
# Check for missing values
print("\nMissing values:")
print(result.isnull().sum())
Output:
In this example, df1
and df2
have different index values. When we concatenate them, pandas includes all unique index values from both DataFrames. As a result, some cells in the concatenated DataFrame will contain NaN (Not a Number) values where data is missing for a particular index.
3. Handling Missing Values in Concatenation
When concatenating DataFrames with different indexes or columns, you may want to handle missing values in specific ways. Pandas provides options to fill missing values or drop them entirely.
import pandas as pd
# Create two DataFrames with different indexes and columns
df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2'],
'B': ['B0', 'B1', 'B2']},
index=['K0', 'K1', 'K2'])
df2 = pd.DataFrame({'C': ['C0', 'C1', 'C2'],
'D': ['D0', 'D1', 'D2']},
index=['K1', 'K2', 'K3'])
# Concatenate with inner join
result_inner = pd.concat([df1, df2], axis=1, join='inner')
print("Result of inner join concatenation:")
print(result_inner)
# Concatenate with outer join (default) and fill missing values
result_outer_filled = pd.concat([df1, df2], axis=1, join='outer').fillna('N/A')
print("\nResult of outer join concatenation with filled values:")
print(result_outer_filled)
# Concatenate and drop rows with any missing values
result_dropna = pd.concat([df1, df2], axis=1).dropna()
print("\nResult of concatenation with dropped missing values:")
print(result_dropna)
Output:
This example demonstrates three different approaches to handling missing values:
- Using
join='inner'
to keep only the intersection of indexes. - Using the default
join='outer'
and filling missing values with a specific value (in this case, ‘N/A’). - Dropping rows that contain any missing values after concatenation.
4. Concatenating DataFrames with MultiIndex
Pandas allows you to work with multi-level indexes, also known as hierarchical indexes. When concatenating DataFrames with MultiIndex, you need to be careful about how the levels are aligned. Let’s look at an example:
import pandas as pd
# Create two DataFrames with MultiIndex
idx1 = pd.MultiIndex.from_product([['A', 'B'], [0, 1]], names=['letter', 'number'])
df1 = pd.DataFrame({'X': range(4), 'Y': range(4, 8)}, index=idx1)
idx2 = pd.MultiIndex.from_product([['B', 'C'], [1, 2]], names=['letter', 'number'])
df2 = pd.DataFrame({'Z': range(4)}, index=idx2)
print("DataFrame 1:")
print(df1)
print("\nDataFrame 2:")
print(df2)
# Concatenate the DataFrames
result = pd.concat([df1, df2], axis=1)
print("\nResult of concatenation with MultiIndex:")
print(result)
# Concatenate and sort the index
result_sorted = pd.concat([df1, df2], axis=1).sort_index()
print("\nResult of concatenation with sorted MultiIndex:")
print(result_sorted)
Output:
In this example, we create two DataFrames with MultiIndex and concatenate them. The resulting DataFrame includes all unique combinations of the index levels. We also demonstrate how to sort the MultiIndex after concatenation for better organization.
5. Concatenating DataFrames with Different Column Names
When concatenating DataFrames vertically (along axis=0) with different column names, pandas will include all unique column names in the result. Let’s see an example:
import pandas as pd
# Create two DataFrames with different column names
df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2'],
'B': ['B0', 'B1', 'B2']})
df2 = pd.DataFrame({'B': ['B3', 'B4', 'B5'],
'C': ['C3', 'C4', 'C5']})
# Concatenate the DataFrames vertically
result = pd.concat([df1, df2], ignore_index=True)
print("Result of vertical concatenation with different column names:")
print(result)
# Check for missing values
print("\nMissing values:")
print(result.isnull().sum())
Output:
In this example, df1
and df2
have different column names. When we concatenate them vertically, pandas includes all unique column names from both DataFrames. The resulting DataFrame will have NaN values where data is missing for a particular column.
6. Using keys in pandas.concat()
The keys
parameter in pandas.concat()
allows you to add a new level to the index, which can be useful for identifying the source of each row in the concatenated DataFrame. Here’s an example:
import pandas as pd
# Create three DataFrames
df1 = pd.DataFrame({'A': ['A0', 'A1'],
'B': ['B0', 'B1']})
df2 = pd.DataFrame({'A': ['A2', 'A3'],
'B': ['B2', 'B3']})
df3 = pd.DataFrame({'A': ['A4', 'A5'],
'B': ['B4', 'B5']})
# Concatenate the DataFrames with keys
result = pd.concat([df1, df2, df3], keys=['X', 'Y', 'Z'])
print("Result of concatenation with keys:")
print(result)
# Access data from a specific key
print("\nData from key 'Y':")
print(result.loc['Y'])
Output:
In this example, we concatenate three DataFrames and use the keys
parameter to assign labels to each DataFrame. The resulting DataFrame has a hierarchical index where the first level corresponds to the keys we provided.
7. Concatenating Series Objects
The concat
function can also be used to combine pandas Series objects. Here’s an example:
import pandas as pd
# Create three Series objects
s1 = pd.Series(['A0', 'A1', 'A2'], name='A')
s2 = pd.Series(['B0', 'B1', 'B2'], name='B')
s3 = pd.Series(['C0', 'C1', 'C2'], name='C')
# Concatenate the Series objects
result = pd.concat([s1, s2, s3], axis=1)
print("Result of concatenating Series objects:")
print(result)
# Concatenate Series objects vertically
result_vertical = pd.concat([s1, s2, s3], axis=0)
print("\nResult of vertical concatenation of Series objects:")
print(result_vertical)
Output:
This example shows how to concatenate Series objects both horizontally and vertically. When concatenating horizontally, the Series names become column names in the resulting DataFrame.
8. Concatenating with Hierarchical Indexing
Hierarchical indexing, also known as MultiIndexing, allows you to have multiple levels of indexes in your DataFrame. Here’s an example of how to use concat
with hierarchical indexing:
import pandas as pd
# Create two DataFrames with hierarchical indexing
idx1 = pd.MultiIndex.from_product([['A', 'B'], [0, 1]], names=['letter', 'number'])
df1 = pd.DataFrame({'X': range(4), 'Y': range(4, 8)}, index=idx1)
idx2 = pd.MultiIndex.from_product([['B', 'C'], [1, 2]], names=['letter', 'number'])
df2 = pd.DataFrame({'Z': range(4)}, index=idx2)
# Concatenate the DataFrames
result = pd.concat([df1, df2], axis=1)
print("Result of concatenation with hierarchical indexing:")
print(result)
# Concatenate and sort the index
result_sorted = pd.concat([df1, df2], axis=1).sort_index()
print("\nResult of concatenation with sorted hierarchical index:")
print(result_sorted)
Output:
In this example, we create two DataFrames with hierarchical indexes and concatenate them. The resulting DataFrame preserves the hierarchical structure of the indexes.
9. Concatenating with Custom Index Labels
You can specify custom index labels when concatenating DataFrames using the keys
parameter along with names
. Here’s an example:
import pandas as pd
# Create two DataFrames
df1 = pd.DataFrame({'A': ['A0', 'A1'],
'B': ['B0', 'B1']})
df2 = pd.DataFrame({'A': ['A2', 'A3'],
'B': ['B2', 'B3']})
# Concatenate with custom index labels
result = pd.concat([df1, df2], keys=['First', 'Second'], names=['Source', 'Row'])
print("Result of concatenation with custom index labels:")
print(result)
# Access data from a specific source
print("\nData from 'Second' source:")
print(result.loc['Second'])
Output:
In this example, we use keys
to specify custom labels for each DataFrame and names
to give names to the index levels. This creates a hierarchical index in the resulting DataFrame.
10. Concatenating with Ignore Index
Sometimes you may want to ignore the original index of the DataFrames being concatenated and create a new sequential index. You can do this using the ignore_index
parameter:
import pandas as pd
# Create two DataFrames with different indexes
df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2'],
'B': ['B0', 'B1', 'B2']},
index=[0, 1, 2])
df2 = pd.DataFrame({'A': ['A3', 'A4', 'A5'],
'B': ['B3', 'B4', 'B5']},
index=[3, 4, 5])
# Concatenate with ignore_index=True
result = pd.concat([df1, df2], ignore_index=True)
print("Result of concatenation with ignore_index=True:")
print(result)
Output:
In this example, we concatenate two DataFrames with different index values. By setting ignore_index=True
, we create a new sequential index for the resulting DataFrame.
11. Concatenating DataFrames with Different Data Types
When concatenating DataFrames with columns of different data types, pandas will try to find a common data type that can accommodate all values. Here’s an example:
import pandas as pd
# Create two DataFrames with different data types
df1 = pd.DataFrame({'A': [1, 2, 3],
'B': ['a', 'b', 'c']})
df2 = pd.DataFrame({'A': ['4', '5', '6'],
'B': [4.0, 5.0, 6.0]})
# Concatenate the DataFrames
result = pd.concat([df1, df2], ignore_index=True)
print("Result of concatenation with different data types:")
print(result)
# Check the data types of the resulting DataFrame
print("\nData types of the resulting DataFrame:")
print(result.dtypes)
Output:
In this example, we concatenate two DataFrames where the data types of columns ‘A’ and ‘B’ differ between the DataFrames. Pandas will automatically convert the data types to accommodate all values, which may result in type upcasting.
12. Concatenating DataFrames with Different Column Orders
When concatenating DataFrames with columns in different orders, pandas will align the columns based on their names. Here’s an example:
import pandas as pd
# Create two DataFrames with columns in different orders
df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2'],
'B': ['B0', 'B1', 'B2'],
'C': ['C0', 'C1', 'C2']})
df2 = pd.DataFrame({'C': ['C3', 'C4', 'C5'],
'A': ['A3', 'A4', 'A5'],
'B': ['B3', 'B4', 'B5']})
# Concatenate the DataFrames
result = pd.concat([df1, df2], ignore_index=True)
print("Result of concatenation with different column orders:")
print(result)
Output:
In this example, the columns in df1
and df2
are in different orders. When concatenated, pandas aligns the columns based on their names, maintaining a consistent order in the resulting DataFrame.
13. Concatenating DataFrames with Mixed Data Types and Missing Values
When dealing with real-world data, you might encounter situations where you need to concatenate DataFrames with mixed data types and missing values. Here’s an example of how to handle such a scenario:
import pandas as pd
import numpy as np
# Create two DataFrames with mixed data types and missing values
df1 = pd.DataFrame({
'A': [1, 2, np.nan],
'B': ['a', 'b', 'c'],
'C': [1.1, 2.2, 3.3]
})
df2 = pd.DataFrame({
'A': ['4', '5', '6'],
'B': [4.0, np.nan, 6.0],
'C': ['d', 'e', 'f']
})
# Concatenate the DataFrames
result = pd.concat([df1, df2], ignore_index=True)
print("Result of concatenation with mixed data types and missing values:")
print(result)
# Check the data types of the resulting DataFrame
print("\nData types of the resulting DataFrame:")
print(result.dtypes)
# Handle missing values
result_filled = result.fillna('Unknown')
print("\nResult after handling missing values:")
print(result_filled)
Output:
In this example, we concatenate two DataFrames with mixed data types and missing values. The resulting DataFrame will have columns with mixed types, and some cells will contain NaN values. We then demonstrate how to handle missing values by filling them with a specific value.
14. Concatenating DataFrames with Different Column Subsets
When concatenating DataFrames that have different subsets of columns, pandas will include all unique columns in the result and fill missing values with NaN. Here’s an example:
import pandas as pd
# Create three DataFrames with different column subsets
df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2'],
'B': ['B0', 'B1', 'B2']})
df2 = pd.DataFrame({'B': ['B3', 'B4', 'B5'],
'C': ['C3', 'C4', 'C5']})
df3 = pd.DataFrame({'A': ['A6', 'A7', 'A8'],
'C': ['C6', 'C7', 'C8'],
'D': ['D6', 'D7', 'D8']})
# Concatenate the DataFrames
result = pd.concat([df1, df2, df3], ignore_index=True)
print("Result of concatenation with different column subsets:")
print(result)
# Check for missing values
print("\nMissing values:")
print(result.isnull().sum())
Output:
In this example, we concatenate three DataFrames with different column subsets. The resulting DataFrame includes all unique columns from the input DataFrames, with NaN values where data is missing.
15. Concatenating DataFrames with Date Ranges
When working with time series data, you might need to concatenate DataFrames with date ranges. Here’s an example of how to do this:
import pandas as pd
# Create two DataFrames with date ranges
date_range1 = pd.date_range(start='2023-01-01', end='2023-01-05', freq='D')
df1 = pd.DataFrame({'Date': date_range1, 'Value': range(len(date_range1))})
date_range2 = pd.date_range(start='2023-01-04', end='2023-01-08', freq='D')
df2 = pd.DataFrame({'Date': date_range2, 'Value': range(10, 10 + len(date_range2))})
# Set the 'Date' column as the index
df1.set_index('Date', inplace=True)
df2.set_index('Date', inplace=True)
# Concatenate the DataFrames
result = pd.concat([df1, df2])
print("Result of concatenation with date ranges:")
print(result)
# Sort the index and handle duplicate dates
result_sorted = result.sort_index()
result_sorted = result_sorted[~result_sorted.index.duplicated(keep='first')]
print("\nResult after sorting and handling duplicate dates:")
print(result_sorted)
Output:
In this example, we create two DataFrames with overlapping date ranges, concatenate them, and then sort the result by date. We also demonstrate how to handle duplicate dates by keeping the first occurrence.
16. Concatenating DataFrames with Different Column Data Types
When concatenating DataFrames with columns that have different data types, pandas will attempt to find a common data type that can accommodate all values. Here’s an example:
import pandas as pd
# Create two DataFrames with different column data types
df1 = pd.DataFrame({'A': [1, 2, 3],
'B': ['a', 'b', 'c'],
'C': [1.1, 2.2, 3.3]})
df2 = pd.DataFrame({'A': ['4', '5', '6'],
'B': [4, 5, 6],
'C': ['d', 'e', 'f']})
# Concatenate the DataFrames
result = pd.concat([df1, df2], ignore_index=True)
print("Result of concatenation with different column data types:")
print(result)
# Check the data types of the resulting DataFrame
print("\nData types of the resulting DataFrame:")
print(result.dtypes)
Output:
In this example, we concatenate two DataFrames where the data types of columns ‘A’, ‘B’, and ‘C’ differ between the DataFrames. Pandas will automatically convert the data types to accommodate all values, which may result in type upcasting.
17. Concatenating DataFrames with MultiIndex Columns
When working with complex data structures, you might encounter DataFrames with MultiIndex columns. Here’s an example of how to concatenate such DataFrames:
import pandas as pd
# Create two DataFrames with MultiIndex columns
columns1 = pd.MultiIndex.from_tuples([('A', 'X'), ('A', 'Y'), ('B', 'X')])
df1 = pd.DataFrame([[1, 2, 3], [4, 5, 6]], columns=columns1)
columns2 = pd.MultiIndex.from_tuples([('B', 'Y'), ('C', 'X'), ('C', 'Y')])
df2 = pd.DataFrame([[7, 8, 9], [10, 11, 12]], columns=columns2)
# Concatenate the DataFrames
result = pd.concat([df1, df2], axis=1)
print("Result of concatenation with MultiIndex columns:")
print(result)
# Access data using MultiIndex
print("\nAccessing data using MultiIndex:")
print(result['A']['X'])
Output:
In this example, we create two DataFrames with MultiIndex columns and concatenate them horizontally. The resulting DataFrame preserves the MultiIndex structure of the columns.
18. Concatenating DataFrames with Different Frequencies
When working with time series data, you might need to concatenate DataFrames with different frequencies. Here’s an example:
import pandas as pd
# Create two DataFrames with different frequencies
date_range1 = pd.date_range(start='2023-01-01', end='2023-01-05', freq='D')
df1 = pd.DataFrame({'Date': date_range1, 'Value': range(len(date_range1))})
date_range2 = pd.date_range(start='2023-01-06', end='2023-01-10', freq='2D')
df2 = pd.DataFrame({'Date': date_range2, 'Value': range(10, 10 + len(date_range2))})
# Set the 'Date' column as the index
df1.set_index('Date', inplace=True)
df2.set_index('Date', inplace=True)
# Concatenate the DataFrames
result = pd.concat([df1, df2])
print("Result of concatenation with different frequencies:")
print(result)
# Resample the result to a common frequency
result_resampled = result.resample('D').asfreq()
print("\nResult after resampling to daily frequency:")
print(result_resampled)
Output:
In this example, we concatenate two DataFrames with different frequencies (daily and every two days). After concatenation, we demonstrate how to resample the result to a common frequency.
19. Concatenating DataFrames with Custom Index Names
When concatenating DataFrames, you might want to preserve or set custom names for the resulting index levels. Here’s an example:
import pandas as pd
# Create two DataFrames with named indexes
df1 = pd.DataFrame({'A': ['A0', 'A1'],
'B': ['B0', 'B1']},
index=pd.Index(['I0', 'I1'], name='Index1'))
df2 = pd.DataFrame({'A': ['A2', 'A3'],
'B': ['B2', 'B3']},
index=pd.Index(['I2', 'I3'], name='Index2'))
# Concatenate the DataFrames with keys and custom names
result = pd.concat([df1, df2], keys=['First', 'Second'], names=['Source', 'Original_Index'])
print("Result of concatenation with custom index names:")
print(result)
# Access data using the custom index names
print("\nAccessing data using custom index names:")
print(result.loc['First'])
Output:
In this example, we concatenate two DataFrames with named indexes and add an additional level to the resulting index. We use the names
parameter to set custom names for the index levels.
20. Concatenating DataFrames with Different Column Levels
When working with MultiIndex columns, you might need to concatenate DataFrames with different column levels. Here’s an example:
import pandas as pd
# Create two DataFrames with different column levels
columns1 = pd.MultiIndex.from_tuples([('A', 'X', '1'), ('A', 'Y', '2'), ('B', 'X', '1')])
df1 = pd.DataFrame([[1, 2, 3], [4, 5, 6]], columns=columns1)
columns2 = pd.MultiIndex.from_tuples([('B', 'Y', '2'), ('C', 'X', '1'), ('C', 'Y', '2')])
df2 = pd.DataFrame([[7, 8, 9], [10, 11, 12]], columns=columns2)
# Concatenate the DataFrames
result = pd.concat([df1, df2], axis=1)
print("Result of concatenation with different column levels:")
print(result)
# Flatten the MultiIndex columns
result.columns = ['_'.join(col).strip() for col in result.columns.values]
print("\nResult with flattened column names:")
print(result)
Output:
In this example, we concatenate two DataFrames with different MultiIndex column structures. After concatenation, we demonstrate how to flatten the MultiIndex columns into a single level of column names.
Pandas Concat Example Conclusion
The pandas.concat()
function is a versatile tool for combining DataFrames and Series objects in various ways. Throughout this article, we’ve explored numerous examples and scenarios, demonstrating how to use concat()
effectively in different situations.
Some key points to remember when using pandas.concat()
:
- You can concatenate DataFrames vertically (axis=0) or horizontally (axis=1).
- The
keys
parameter allows you to add a new level to the index, which is useful for identifying the source of each row. - You can handle missing values using different join types (‘inner’ or ‘outer’) and methods like
fillna()
ordropna()
. - When concatenating DataFrames with different column names or data types, pandas will attempt to accommodate all unique columns and find common data types.
- MultiIndex and hierarchical indexing can be preserved and manipulated during concatenation.
- Time series data with different frequencies can be concatenated and resampled as needed.
By mastering the concat()
function and understanding its various parameters and use cases, you’ll be well-equipped to handle complex data manipulation tasks in pandas. Remember to always check the resulting DataFrame after concatenation to ensure it meets your expectations, especially when dealing with mixed data types, missing values, or complex index structures.