Pandas Concat Example

Pandas Concat Example

Pandas is a powerful data manipulation library in Python, and one of its most useful features is the ability to combine multiple DataFrames using the concat function. This article will provide a comprehensive guide to using pandas.concat(), covering various scenarios and use cases with detailed examples.

1. Introduction to pandas.concat()

The pandas.concat() function is used to concatenate pandas objects along a particular axis. It can be used to combine DataFrames vertically (stacking rows) or horizontally (joining columns). This function is particularly useful when you need to merge multiple datasets or combine data from different sources.

Let’s start with a basic example to illustrate how concat works:

import pandas as pd

# Create two sample DataFrames
df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2'],
                    'B': ['B0', 'B1', 'B2']},
                   index=['K0', 'K1', 'K2'])

df2 = pd.DataFrame({'C': ['C0', 'C1', 'C2'],
                    'D': ['D0', 'D1', 'D2']},
                   index=['K0', 'K1', 'K2'])

# Concatenate the DataFrames horizontally
result = pd.concat([df1, df2], axis=1)

print("Result of horizontal concatenation:")
print(result)

# Concatenate the DataFrames vertically
result_vertical = pd.concat([df1, df2])

print("\nResult of vertical concatenation:")
print(result_vertical)

Output:

Pandas Concat Example

In this example, we create two DataFrames df1 and df2 with different columns but the same index. We then use pd.concat() to combine them horizontally (axis=1) and vertically (default axis=0). The horizontal concatenation joins the columns side by side, while the vertical concatenation stacks the rows on top of each other.

2. Concatenating DataFrames with Different Indexes

When concatenating DataFrames with different indexes, pandas will include all index values in the result. Let’s see an example:

import pandas as pd

# Create two DataFrames with different indexes
df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2'],
                    'B': ['B0', 'B1', 'B2']},
                   index=['K0', 'K1', 'K2'])

df2 = pd.DataFrame({'C': ['C0', 'C1', 'C2'],
                    'D': ['D0', 'D1', 'D2']},
                   index=['K1', 'K2', 'K3'])

# Concatenate the DataFrames
result = pd.concat([df1, df2], axis=1)

print("Result of concatenation with different indexes:")
print(result)

# Check for missing values
print("\nMissing values:")
print(result.isnull().sum())

Output:

Pandas Concat Example

In this example, df1 and df2 have different index values. When we concatenate them, pandas includes all unique index values from both DataFrames. As a result, some cells in the concatenated DataFrame will contain NaN (Not a Number) values where data is missing for a particular index.

3. Handling Missing Values in Concatenation

When concatenating DataFrames with different indexes or columns, you may want to handle missing values in specific ways. Pandas provides options to fill missing values or drop them entirely.

import pandas as pd

# Create two DataFrames with different indexes and columns
df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2'],
                    'B': ['B0', 'B1', 'B2']},
                   index=['K0', 'K1', 'K2'])

df2 = pd.DataFrame({'C': ['C0', 'C1', 'C2'],
                    'D': ['D0', 'D1', 'D2']},
                   index=['K1', 'K2', 'K3'])

# Concatenate with inner join
result_inner = pd.concat([df1, df2], axis=1, join='inner')

print("Result of inner join concatenation:")
print(result_inner)

# Concatenate with outer join (default) and fill missing values
result_outer_filled = pd.concat([df1, df2], axis=1, join='outer').fillna('N/A')

print("\nResult of outer join concatenation with filled values:")
print(result_outer_filled)

# Concatenate and drop rows with any missing values
result_dropna = pd.concat([df1, df2], axis=1).dropna()

print("\nResult of concatenation with dropped missing values:")
print(result_dropna)

Output:

Pandas Concat Example

This example demonstrates three different approaches to handling missing values:

  1. Using join='inner' to keep only the intersection of indexes.
  2. Using the default join='outer' and filling missing values with a specific value (in this case, ‘N/A’).
  3. Dropping rows that contain any missing values after concatenation.

4. Concatenating DataFrames with MultiIndex

Pandas allows you to work with multi-level indexes, also known as hierarchical indexes. When concatenating DataFrames with MultiIndex, you need to be careful about how the levels are aligned. Let’s look at an example:

import pandas as pd

# Create two DataFrames with MultiIndex
idx1 = pd.MultiIndex.from_product([['A', 'B'], [0, 1]], names=['letter', 'number'])
df1 = pd.DataFrame({'X': range(4), 'Y': range(4, 8)}, index=idx1)

idx2 = pd.MultiIndex.from_product([['B', 'C'], [1, 2]], names=['letter', 'number'])
df2 = pd.DataFrame({'Z': range(4)}, index=idx2)

print("DataFrame 1:")
print(df1)
print("\nDataFrame 2:")
print(df2)

# Concatenate the DataFrames
result = pd.concat([df1, df2], axis=1)

print("\nResult of concatenation with MultiIndex:")
print(result)

# Concatenate and sort the index
result_sorted = pd.concat([df1, df2], axis=1).sort_index()

print("\nResult of concatenation with sorted MultiIndex:")
print(result_sorted)

Output:

Pandas Concat Example

In this example, we create two DataFrames with MultiIndex and concatenate them. The resulting DataFrame includes all unique combinations of the index levels. We also demonstrate how to sort the MultiIndex after concatenation for better organization.

5. Concatenating DataFrames with Different Column Names

When concatenating DataFrames vertically (along axis=0) with different column names, pandas will include all unique column names in the result. Let’s see an example:

import pandas as pd

# Create two DataFrames with different column names
df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2'],
                    'B': ['B0', 'B1', 'B2']})

df2 = pd.DataFrame({'B': ['B3', 'B4', 'B5'],
                    'C': ['C3', 'C4', 'C5']})

# Concatenate the DataFrames vertically
result = pd.concat([df1, df2], ignore_index=True)

print("Result of vertical concatenation with different column names:")
print(result)

# Check for missing values
print("\nMissing values:")
print(result.isnull().sum())

Output:

Pandas Concat Example

In this example, df1 and df2 have different column names. When we concatenate them vertically, pandas includes all unique column names from both DataFrames. The resulting DataFrame will have NaN values where data is missing for a particular column.

6. Using keys in pandas.concat()

The keys parameter in pandas.concat() allows you to add a new level to the index, which can be useful for identifying the source of each row in the concatenated DataFrame. Here’s an example:

import pandas as pd

# Create three DataFrames
df1 = pd.DataFrame({'A': ['A0', 'A1'],
                    'B': ['B0', 'B1']})

df2 = pd.DataFrame({'A': ['A2', 'A3'],
                    'B': ['B2', 'B3']})

df3 = pd.DataFrame({'A': ['A4', 'A5'],
                    'B': ['B4', 'B5']})

# Concatenate the DataFrames with keys
result = pd.concat([df1, df2, df3], keys=['X', 'Y', 'Z'])

print("Result of concatenation with keys:")
print(result)

# Access data from a specific key
print("\nData from key 'Y':")
print(result.loc['Y'])

Output:

Pandas Concat Example

In this example, we concatenate three DataFrames and use the keys parameter to assign labels to each DataFrame. The resulting DataFrame has a hierarchical index where the first level corresponds to the keys we provided.

7. Concatenating Series Objects

The concat function can also be used to combine pandas Series objects. Here’s an example:

import pandas as pd

# Create three Series objects
s1 = pd.Series(['A0', 'A1', 'A2'], name='A')
s2 = pd.Series(['B0', 'B1', 'B2'], name='B')
s3 = pd.Series(['C0', 'C1', 'C2'], name='C')

# Concatenate the Series objects
result = pd.concat([s1, s2, s3], axis=1)

print("Result of concatenating Series objects:")
print(result)

# Concatenate Series objects vertically
result_vertical = pd.concat([s1, s2, s3], axis=0)

print("\nResult of vertical concatenation of Series objects:")
print(result_vertical)

Output:

Pandas Concat Example

This example shows how to concatenate Series objects both horizontally and vertically. When concatenating horizontally, the Series names become column names in the resulting DataFrame.

8. Concatenating with Hierarchical Indexing

Hierarchical indexing, also known as MultiIndexing, allows you to have multiple levels of indexes in your DataFrame. Here’s an example of how to use concat with hierarchical indexing:

import pandas as pd

# Create two DataFrames with hierarchical indexing
idx1 = pd.MultiIndex.from_product([['A', 'B'], [0, 1]], names=['letter', 'number'])
df1 = pd.DataFrame({'X': range(4), 'Y': range(4, 8)}, index=idx1)

idx2 = pd.MultiIndex.from_product([['B', 'C'], [1, 2]], names=['letter', 'number'])
df2 = pd.DataFrame({'Z': range(4)}, index=idx2)

# Concatenate the DataFrames
result = pd.concat([df1, df2], axis=1)

print("Result of concatenation with hierarchical indexing:")
print(result)

# Concatenate and sort the index
result_sorted = pd.concat([df1, df2], axis=1).sort_index()

print("\nResult of concatenation with sorted hierarchical index:")
print(result_sorted)

Output:

Pandas Concat Example

In this example, we create two DataFrames with hierarchical indexes and concatenate them. The resulting DataFrame preserves the hierarchical structure of the indexes.

9. Concatenating with Custom Index Labels

You can specify custom index labels when concatenating DataFrames using the keys parameter along with names. Here’s an example:

import pandas as pd

# Create two DataFrames
df1 = pd.DataFrame({'A': ['A0', 'A1'],
                    'B': ['B0', 'B1']})

df2 = pd.DataFrame({'A': ['A2', 'A3'],
                    'B': ['B2', 'B3']})

# Concatenate with custom index labels
result = pd.concat([df1, df2], keys=['First', 'Second'], names=['Source', 'Row'])

print("Result of concatenation with custom index labels:")
print(result)

# Access data from a specific source
print("\nData from 'Second' source:")
print(result.loc['Second'])

Output:

Pandas Concat Example

In this example, we use keys to specify custom labels for each DataFrame and names to give names to the index levels. This creates a hierarchical index in the resulting DataFrame.

10. Concatenating with Ignore Index

Sometimes you may want to ignore the original index of the DataFrames being concatenated and create a new sequential index. You can do this using the ignore_index parameter:

import pandas as pd

# Create two DataFrames with different indexes
df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2'],
                    'B': ['B0', 'B1', 'B2']},
                   index=[0, 1, 2])

df2 = pd.DataFrame({'A': ['A3', 'A4', 'A5'],
                    'B': ['B3', 'B4', 'B5']},
                   index=[3, 4, 5])

# Concatenate with ignore_index=True
result = pd.concat([df1, df2], ignore_index=True)

print("Result of concatenation with ignore_index=True:")
print(result)

Output:

Pandas Concat Example

In this example, we concatenate two DataFrames with different index values. By setting ignore_index=True, we create a new sequential index for the resulting DataFrame.

11. Concatenating DataFrames with Different Data Types

When concatenating DataFrames with columns of different data types, pandas will try to find a common data type that can accommodate all values. Here’s an example:

import pandas as pd

# Create two DataFrames with different data types
df1 = pd.DataFrame({'A': [1, 2, 3],
                    'B': ['a', 'b', 'c']})

df2 = pd.DataFrame({'A': ['4', '5', '6'],
                    'B': [4.0, 5.0, 6.0]})

# Concatenate the DataFrames
result = pd.concat([df1, df2], ignore_index=True)

print("Result of concatenation with different data types:")
print(result)

# Check the data types of the resulting DataFrame
print("\nData types of the resulting DataFrame:")
print(result.dtypes)

Output:

Pandas Concat Example

In this example, we concatenate two DataFrames where the data types of columns ‘A’ and ‘B’ differ between the DataFrames. Pandas will automatically convert the data types to accommodate all values, which may result in type upcasting.

12. Concatenating DataFrames with Different Column Orders

When concatenating DataFrames with columns in different orders, pandas will align the columns based on their names. Here’s an example:

import pandas as pd

# Create two DataFrames with columns in different orders
df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2'],
                    'B': ['B0', 'B1', 'B2'],
                    'C': ['C0', 'C1', 'C2']})

df2 = pd.DataFrame({'C': ['C3', 'C4', 'C5'],
                    'A': ['A3', 'A4', 'A5'],
                    'B': ['B3', 'B4', 'B5']})

# Concatenate the DataFrames
result = pd.concat([df1, df2], ignore_index=True)

print("Result of concatenation with different column orders:")
print(result)

Output:

Pandas Concat Example

In this example, the columns in df1 and df2 are in different orders. When concatenated, pandas aligns the columns based on their names, maintaining a consistent order in the resulting DataFrame.

13. Concatenating DataFrames with Mixed Data Types and Missing Values

When dealing with real-world data, you might encounter situations where you need to concatenate DataFrames with mixed data types and missing values. Here’s an example of how to handle such a scenario:

import pandas as pd
import numpy as np

# Create two DataFrames with mixed data types and missing values
df1 = pd.DataFrame({
    'A': [1, 2, np.nan],
    'B': ['a', 'b', 'c'],
    'C': [1.1, 2.2, 3.3]
})

df2 = pd.DataFrame({
    'A': ['4', '5', '6'],
    'B': [4.0, np.nan, 6.0],
    'C': ['d', 'e', 'f']
})

# Concatenate the DataFrames
result = pd.concat([df1, df2], ignore_index=True)

print("Result of concatenation with mixed data types and missing values:")
print(result)

# Check the data types of the resulting DataFrame
print("\nData types of the resulting DataFrame:")
print(result.dtypes)

# Handle missing values
result_filled = result.fillna('Unknown')

print("\nResult after handling missing values:")
print(result_filled)

Output:

Pandas Concat Example

In this example, we concatenate two DataFrames with mixed data types and missing values. The resulting DataFrame will have columns with mixed types, and some cells will contain NaN values. We then demonstrate how to handle missing values by filling them with a specific value.

14. Concatenating DataFrames with Different Column Subsets

When concatenating DataFrames that have different subsets of columns, pandas will include all unique columns in the result and fill missing values with NaN. Here’s an example:

import pandas as pd

# Create three DataFrames with different column subsets
df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2'],
                    'B': ['B0', 'B1', 'B2']})

df2 = pd.DataFrame({'B': ['B3', 'B4', 'B5'],
                    'C': ['C3', 'C4', 'C5']})

df3 = pd.DataFrame({'A': ['A6', 'A7', 'A8'],
                    'C': ['C6', 'C7', 'C8'],
                    'D': ['D6', 'D7', 'D8']})

# Concatenate the DataFrames
result = pd.concat([df1, df2, df3], ignore_index=True)

print("Result of concatenation with different column subsets:")
print(result)

# Check for missing values
print("\nMissing values:")
print(result.isnull().sum())

Output:

Pandas Concat Example

In this example, we concatenate three DataFrames with different column subsets. The resulting DataFrame includes all unique columns from the input DataFrames, with NaN values where data is missing.

15. Concatenating DataFrames with Date Ranges

When working with time series data, you might need to concatenate DataFrames with date ranges. Here’s an example of how to do this:

import pandas as pd

# Create two DataFrames with date ranges
date_range1 = pd.date_range(start='2023-01-01', end='2023-01-05', freq='D')
df1 = pd.DataFrame({'Date': date_range1, 'Value': range(len(date_range1))})

date_range2 = pd.date_range(start='2023-01-04', end='2023-01-08', freq='D')
df2 = pd.DataFrame({'Date': date_range2, 'Value': range(10, 10 + len(date_range2))})

# Set the 'Date' column as the index
df1.set_index('Date', inplace=True)
df2.set_index('Date', inplace=True)

# Concatenate the DataFrames
result = pd.concat([df1, df2])

print("Result of concatenation with date ranges:")
print(result)

# Sort the index and handle duplicate dates
result_sorted = result.sort_index()
result_sorted = result_sorted[~result_sorted.index.duplicated(keep='first')]

print("\nResult after sorting and handling duplicate dates:")
print(result_sorted)

Output:

Pandas Concat Example

In this example, we create two DataFrames with overlapping date ranges, concatenate them, and then sort the result by date. We also demonstrate how to handle duplicate dates by keeping the first occurrence.

16. Concatenating DataFrames with Different Column Data Types

When concatenating DataFrames with columns that have different data types, pandas will attempt to find a common data type that can accommodate all values. Here’s an example:

import pandas as pd

# Create two DataFrames with different column data types
df1 = pd.DataFrame({'A': [1, 2, 3],
                    'B': ['a', 'b', 'c'],
                    'C': [1.1, 2.2, 3.3]})

df2 = pd.DataFrame({'A': ['4', '5', '6'],
                    'B': [4, 5, 6],
                    'C': ['d', 'e', 'f']})

# Concatenate the DataFrames
result = pd.concat([df1, df2], ignore_index=True)

print("Result of concatenation with different column data types:")
print(result)

# Check the data types of the resulting DataFrame
print("\nData types of the resulting DataFrame:")
print(result.dtypes)

Output:

Pandas Concat Example

In this example, we concatenate two DataFrames where the data types of columns ‘A’, ‘B’, and ‘C’ differ between the DataFrames. Pandas will automatically convert the data types to accommodate all values, which may result in type upcasting.

17. Concatenating DataFrames with MultiIndex Columns

When working with complex data structures, you might encounter DataFrames with MultiIndex columns. Here’s an example of how to concatenate such DataFrames:

import pandas as pd

# Create two DataFrames with MultiIndex columns
columns1 = pd.MultiIndex.from_tuples([('A', 'X'), ('A', 'Y'), ('B', 'X')])
df1 = pd.DataFrame([[1, 2, 3], [4, 5, 6]], columns=columns1)

columns2 = pd.MultiIndex.from_tuples([('B', 'Y'), ('C', 'X'), ('C', 'Y')])
df2 = pd.DataFrame([[7, 8, 9], [10, 11, 12]], columns=columns2)

# Concatenate the DataFrames
result = pd.concat([df1, df2], axis=1)

print("Result of concatenation with MultiIndex columns:")
print(result)

# Access data using MultiIndex
print("\nAccessing data using MultiIndex:")
print(result['A']['X'])

Output:

Pandas Concat Example

In this example, we create two DataFrames with MultiIndex columns and concatenate them horizontally. The resulting DataFrame preserves the MultiIndex structure of the columns.

18. Concatenating DataFrames with Different Frequencies

When working with time series data, you might need to concatenate DataFrames with different frequencies. Here’s an example:

import pandas as pd

# Create two DataFrames with different frequencies
date_range1 = pd.date_range(start='2023-01-01', end='2023-01-05', freq='D')
df1 = pd.DataFrame({'Date': date_range1, 'Value': range(len(date_range1))})

date_range2 = pd.date_range(start='2023-01-06', end='2023-01-10', freq='2D')
df2 = pd.DataFrame({'Date': date_range2, 'Value': range(10, 10 + len(date_range2))})

# Set the 'Date' column as the index
df1.set_index('Date', inplace=True)
df2.set_index('Date', inplace=True)

# Concatenate the DataFrames
result = pd.concat([df1, df2])

print("Result of concatenation with different frequencies:")
print(result)

# Resample the result to a common frequency
result_resampled = result.resample('D').asfreq()

print("\nResult after resampling to daily frequency:")
print(result_resampled)

Output:

Pandas Concat Example

In this example, we concatenate two DataFrames with different frequencies (daily and every two days). After concatenation, we demonstrate how to resample the result to a common frequency.

19. Concatenating DataFrames with Custom Index Names

When concatenating DataFrames, you might want to preserve or set custom names for the resulting index levels. Here’s an example:

import pandas as pd

# Create two DataFrames with named indexes
df1 = pd.DataFrame({'A': ['A0', 'A1'],
                    'B': ['B0', 'B1']},
                   index=pd.Index(['I0', 'I1'], name='Index1'))

df2 = pd.DataFrame({'A': ['A2', 'A3'],
                    'B': ['B2', 'B3']},
                   index=pd.Index(['I2', 'I3'], name='Index2'))

# Concatenate the DataFrames with keys and custom names
result = pd.concat([df1, df2], keys=['First', 'Second'], names=['Source', 'Original_Index'])

print("Result of concatenation with custom index names:")
print(result)

# Access data using the custom index names
print("\nAccessing data using custom index names:")
print(result.loc['First'])

Output:

Pandas Concat Example

In this example, we concatenate two DataFrames with named indexes and add an additional level to the resulting index. We use the names parameter to set custom names for the index levels.

20. Concatenating DataFrames with Different Column Levels

When working with MultiIndex columns, you might need to concatenate DataFrames with different column levels. Here’s an example:

import pandas as pd

# Create two DataFrames with different column levels
columns1 = pd.MultiIndex.from_tuples([('A', 'X', '1'), ('A', 'Y', '2'), ('B', 'X', '1')])
df1 = pd.DataFrame([[1, 2, 3], [4, 5, 6]], columns=columns1)

columns2 = pd.MultiIndex.from_tuples([('B', 'Y', '2'), ('C', 'X', '1'), ('C', 'Y', '2')])
df2 = pd.DataFrame([[7, 8, 9], [10, 11, 12]], columns=columns2)

# Concatenate the DataFrames
result = pd.concat([df1, df2], axis=1)

print("Result of concatenation with different column levels:")
print(result)

# Flatten the MultiIndex columns
result.columns = ['_'.join(col).strip() for col in result.columns.values]

print("\nResult with flattened column names:")
print(result)

Output:

Pandas Concat Example

In this example, we concatenate two DataFrames with different MultiIndex column structures. After concatenation, we demonstrate how to flatten the MultiIndex columns into a single level of column names.

Pandas Concat Example Conclusion

The pandas.concat() function is a versatile tool for combining DataFrames and Series objects in various ways. Throughout this article, we’ve explored numerous examples and scenarios, demonstrating how to use concat() effectively in different situations.

Some key points to remember when using pandas.concat():

  1. You can concatenate DataFrames vertically (axis=0) or horizontally (axis=1).
  2. The keys parameter allows you to add a new level to the index, which is useful for identifying the source of each row.
  3. You can handle missing values using different join types (‘inner’ or ‘outer’) and methods like fillna() or dropna().
  4. When concatenating DataFrames with different column names or data types, pandas will attempt to accommodate all unique columns and find common data types.
  5. MultiIndex and hierarchical indexing can be preserved and manipulated during concatenation.
  6. Time series data with different frequencies can be concatenated and resampled as needed.

By mastering the concat() function and understanding its various parameters and use cases, you’ll be well-equipped to handle complex data manipulation tasks in pandas. Remember to always check the resulting DataFrame after concatenation to ensure it meets your expectations, especially when dealing with mixed data types, missing values, or complex index structures.