Pandas apply return multiple columns
Pandas is a powerful Python library for data manipulation and analysis. One of its core features is the ability to apply functions to the rows or columns of a DataFrame. Often, you need a single apply operation to produce several new columns. This article explores how to use the apply method in pandas to return multiple columns, with detailed examples and explanations.
Introduction to Pandas Apply
The apply method in pandas can be used on a DataFrame to apply a function along an axis of the DataFrame, that is, to each column or to each row. It is highly versatile and suits a wide range of data manipulation tasks, and it is particularly useful when you need to derive multiple new columns from existing ones.
Basic Syntax of Apply
In simplified form (omitting less common parameters such as raw and result_type), the syntax of the apply method is:
DataFrame.apply(func, axis=0, args=(), **kwds)
func: function to apply to each column or row.
axis: axis along which the function is applied (0 applies the function to each column, 1 to each row).
args: tuple of positional arguments to pass to the function.
kwds: additional keyword arguments to pass to the function.
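As a rough illustration of these parameters, the sketch below applies one function column-wise and row-wise, forwarding an extra argument once through args and once as a keyword. The sample data and the helper name scaled_range are invented for this example.

```python
import pandas as pd

# Hypothetical sample data, for illustration only
df = pd.DataFrame({'A': [1, 2, 3], 'B': [10, 20, 30]})

def scaled_range(values, factor):
    """Return (max - min) of a Series, multiplied by factor."""
    return (values.max() - values.min()) * factor

# axis=0 (the default): the function receives each column as a Series
per_column = df.apply(scaled_range, axis=0, args=(2,))

# axis=1: the function receives each row as a Series;
# factor is forwarded to the function as a keyword argument
per_row = df.apply(scaled_range, axis=1, factor=2)

print(per_column)
print(per_row)
```

Here the function returns a single value, so each call to apply produces a Series: one entry per column for axis=0 and one entry per row for axis=1.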
Using Apply to Return Multiple Columns
To return multiple columns using apply, your function should return a Series with multiple values. Each value in the returned Series corresponds to a column in the resulting DataFrame, which you can then assign to new columns of the original DataFrame.
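One way to see this expansion is to give the returned Series an index: pandas then uses those labels as the column names of the result. The following is a minimal sketch with invented data and a hypothetical helper, name_parts.

```python
import pandas as pd

# Hypothetical sample data, for illustration only
df = pd.DataFrame({'Full': ['Ada Lovelace', 'Alan Turing']})

def name_parts(row):
    first, last = row['Full'].split(' ', 1)
    # The index labels of the returned Series become column names
    return pd.Series({'First': first, 'Last': last})

# One Series per row is stacked into a DataFrame with columns First and Last
parts = df.apply(name_parts, axis=1)
print(parts.columns.tolist())

# Attach the new columns to the original frame
df = df.join(parts)
print(df)
```

Naming the values inside the function keeps the column names next to the logic that produces them, instead of relying on the order of a list on the left-hand side of an assignment.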
Example 1: Splitting Text into Multiple Columns
Suppose you have a DataFrame with a column of concatenated strings, and you want to split these strings into separate columns.
import pandas as pd
# Sample DataFrame
data = {'Info': ['Name:pandasdataframe.com Age:10', 'Name:pandasdataframe.com Age:20']}
df = pd.DataFrame(data)
# Function to split Info into Name and Age
def split_info(row):
    name, age = row['Info'].split()
    return pd.Series([name.split(':')[1], age.split(':')[1]])
# Applying function
df[['Name', 'Age']] = df.apply(split_info, axis=1)
print(df)
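Output: the DataFrame gains two columns, Name and Age. Name is 'pandasdataframe.com' in both rows, and Age is '10' and '20' (stored as strings).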
Example 2: Calculating Multiple Aggregate Metrics
Imagine you need to calculate multiple aggregate metrics from a DataFrame’s numerical columns.
import pandas as pd
# Sample DataFrame
data = {'Sales': [100, 200, 300], 'Cost': [80, 150, 210]}
df = pd.DataFrame(data)
# Function to calculate profit and profit margin
def financial_metrics(row):
    profit = row['Sales'] - row['Cost']
    profit_margin = profit / row['Sales']
    return pd.Series([profit, profit_margin])
# Applying function
df[['Profit', 'Profit Margin']] = df.apply(financial_metrics, axis=1)
print(df)
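Output: Profit is 20, 50 and 90, and Profit Margin is 0.20, 0.25 and 0.30. Both new columns are floats, because each returned Series holds the profit alongside a float margin.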
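A possible variation, not part of the example above: DataFrame.apply accepts result_type='expand', which turns list-like return values into columns, so the function can return a plain tuple instead of building a Series. The sketch reuses the same Sales and Cost data; the variable name metrics is an assumption of this example.

```python
import pandas as pd

df = pd.DataFrame({'Sales': [100, 200, 300], 'Cost': [80, 150, 210]})

def financial_metrics(row):
    profit = row['Sales'] - row['Cost']
    # Return a plain tuple; apply expands it into columns
    return profit, profit / row['Sales']

# result_type='expand' spreads each tuple across columns labelled 0 and 1
metrics = df.apply(financial_metrics, axis=1, result_type='expand')
metrics.columns = ['Profit', 'Profit Margin']

# Place the new columns next to the originals
df = pd.concat([df, metrics], axis=1)
print(df)
```

Without result_type='expand', returning a tuple from a row-wise apply would yield a single Series of tuples rather than separate columns.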
Example 3: Conditional Operations Returning Multiple Columns
Sometimes, you might want to perform operations that depend on the values of the DataFrame’s columns.
import pandas as pd
# Sample DataFrame
data = {'Temperature': [20, 35, 15], 'Humidity': [30, 45, 25]}
df = pd.DataFrame(data)
# Function to check comfort level
def comfort_level(row):
    if row['Temperature'] > 30 and row['Humidity'] < 50:
        return pd.Series(['Hot', 'Moderate'])
    else:
        return pd.Series(['Normal', 'High'])
# Applying function
df[['Comfort', 'Humidity Level']] = df.apply(comfort_level, axis=1)
print(df)
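Output: only the second row (temperature 35, humidity 45) meets the condition and is labelled 'Hot' and 'Moderate'; the other two rows get 'Normal' and 'High'.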
Example 4: Extracting Domain and Suffix from Email
If you have a DataFrame with email addresses, you might want to extract the domain and suffix from each email.
import pandas as pd
# Sample DataFrame with placeholder addresses
data = {'Email': ['alice@example.com', 'bob@example.org']}
df = pd.DataFrame(data)
# Function to extract domain and suffix
def extract_email_parts(email):
    domain = email.split('@')[1].split('.')[0]
    suffix = email.split('.')[-1]
    return pd.Series([domain, suffix])
# Applying function
df[['Domain', 'Suffix']] = df['Email'].apply(extract_email_parts)
print(df)
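Output: the DataFrame gains two columns, Domain (the text between '@' and the next dot) and Suffix (the text after the last dot).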
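The function above assumes that every value contains an '@'; a value without one would raise an IndexError. A defensive variant might look like the sketch below, which returns missing values for malformed input. The addresses are placeholders, and the second one is deliberately malformed.

```python
import pandas as pd

# Placeholder data; the second value is deliberately malformed
df = pd.DataFrame({'Email': ['alice@example.com', 'not-an-email']})

def extract_email_parts(email):
    # str.partition never raises; sep is '' when '@' is absent
    _, sep, host = email.partition('@')
    if not sep or '.' not in host:
        return pd.Series({'Domain': None, 'Suffix': None})
    return pd.Series({
        'Domain': host.split('.')[0],
        'Suffix': host.rsplit('.', 1)[-1],
    })

df[['Domain', 'Suffix']] = df['Email'].apply(extract_email_parts)
print(df)
```

Returning a Series with the same labels in both branches keeps the shape of the result consistent, so the assignment to two columns works for every row.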
Example 5: Converting Timestamps to Different Time Features
Working with time series data often requires extracting specific time features from timestamps.
import pandas as pd
# Sample DataFrame
data = {'Timestamp': pd.to_datetime(['2021-01-01 12:00', '2021-06-01 15:00'])}
df = pd.DataFrame(data)
# Function to extract year, month, and day
def extract_time_features(timestamp):
    year = timestamp.year
    month = timestamp.month
    day = timestamp.day
    return pd.Series([year, month, day])
# Applying function
df[['Year', 'Month', 'Day']] = df['Timestamp'].apply(extract_time_features)
print(df)
Output: Year is 2021 in both rows, Month is 1 and 6, and Day is 1 in both rows.
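As a sanity check on this kind of extraction, the apply result can be compared with the attributes exposed by the .dt accessor on a datetime column. The sketch below is only an illustration; the names via_apply and via_dt are assumptions of this example.

```python
import pandas as pd

df = pd.DataFrame(
    {'Timestamp': pd.to_datetime(['2021-01-01 12:00', '2021-06-01 15:00'])}
)

def extract_time_features(timestamp):
    # A labelled Series gives named columns in the result
    return pd.Series({
        'Year': timestamp.year,
        'Month': timestamp.month,
        'Day': timestamp.day,
    })

via_apply = df['Timestamp'].apply(extract_time_features)

# The same features read directly from the datetime column
via_dt = pd.DataFrame({
    'Year': df['Timestamp'].dt.year,
    'Month': df['Timestamp'].dt.month,
    'Day': df['Timestamp'].dt.day,
})

print(via_apply)
print((via_apply.values == via_dt.values).all())
```

When a built-in accessor covers the features you need, it avoids calling a Python function once per element; apply remains the more general tool when the logic has no built-in equivalent.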
Conclusion
Using the apply method to return multiple columns in pandas is a flexible technique for data transformation and feature engineering. By writing custom functions that return pandas Series objects, you can derive several new columns in a single pass over your data. The examples in this article cover a range of scenarios where the technique applies, from simple text operations to conditional logic and the extraction of time features.