Pandas loc Condition
Pandas is a powerful data manipulation library for Python, widely used in data analysis and machine learning. One of its core features is selecting data by condition, most commonly through the loc indexer. This article walks through several ways to filter a DataFrame with loc and boolean conditions, with examples that show how to select and manipulate data.
Introduction to Pandas loc
The loc indexer accesses a group of rows and columns by label or by a boolean array. It is primarily label-based: you name the rows and columns you want. It also accepts a boolean array or Series that marks which rows to include in the output, and that is what makes conditional filtering possible.
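The two modes can be sketched side by side. This is only an illustration: the small frame below and its row labels 'a', 'b', 'c' are made up and are not part of the article's sample data.

```python
import pandas as pd

# Hypothetical frame with string row labels, used only for illustration
demo = pd.DataFrame(
    {'Age': [25, 30, 35], 'City': ['Oslo', 'Lima', 'Pune']},
    index=['a', 'b', 'c'],
)

# Label-based access: rows 'a' and 'c', column 'Age'
by_label = demo.loc[['a', 'c'], 'Age']

# Boolean access: one True/False per row, and the True rows are kept
by_mask = demo.loc[[True, False, True]]

print(by_label.tolist())       # [25, 35]
print(by_mask.index.tolist())  # ['a', 'c']
```

Both calls return the same two rows here; the boolean form is the one that the conditional examples build on.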
Before diving into the examples, let’s first set up a basic Pandas DataFrame that we will use throughout this article:
import pandas as pd
# Sample DataFrame
data = {
'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
'Age': [25, 30, 35, 40, 45],
'Email': ['[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]']
}
df = pd.DataFrame(data)
print(df)
Output: a DataFrame with five rows (index 0 to 4) and the columns Name, Age, and Email.
Basic Usage of loc
Selecting Rows by Condition
The simplest form of condition is selecting rows based on the value of a column. Here’s how you can select rows where the age is greater than 30:
import pandas as pd
# Sample DataFrame
data = {
'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
'Age': [25, 30, 35, 40, 45],
'Email': ['[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]']
}
df = pd.DataFrame(data)
# Select rows where age is greater than 30
result = df.loc[df['Age'] > 30]
print(result)
Output: the rows for Charlie, David, and Eve (ages 35, 40, and 45).
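It can help to look at the condition on its own. The comparison produces a boolean Series aligned with the DataFrame's index, and loc keeps the rows marked True. A minimal sketch (the variable name mask is illustrative, and the Email column is left out for brevity):

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'Age': [25, 30, 35, 40, 45],
})

# The condition by itself is a Series of True/False values, one per row
mask = df['Age'] > 30
print(mask.tolist())  # [False, False, True, True, True]

# Passing the mask to loc keeps only the rows where it is True
print(df.loc[mask, 'Name'].tolist())  # ['Charlie', 'David', 'Eve']
```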
Selecting Specific Columns with Condition
You can also specify the columns you want to retrieve along with the condition:
import pandas as pd
# Sample DataFrame
data = {
'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
'Age': [25, 30, 35, 40, 45],
'Email': ['[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]']
}
df = pd.DataFrame(data)
# Select only the Name and Email of persons older than 30
result = df.loc[df['Age'] > 30, ['Name', 'Email']]
print(result)
Output: the Name and Email columns for Charlie, David, and Eve.
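The same row-and-column form can also appear on the left-hand side of an assignment, which is one way to modify only the rows that match a condition. The sketch below assumes a new column named Group with the labels 'junior' and 'senior'; both are hypothetical and chosen only for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'Age': [25, 30, 35, 40, 45],
})

# Start with a default label for every row (hypothetical column)
df['Group'] = 'junior'

# Overwrite the label only where the condition holds
df.loc[df['Age'] > 30, 'Group'] = 'senior'

print(df['Group'].tolist())
# ['junior', 'junior', 'senior', 'senior', 'senior']
```

Assigning through a single loc call, rather than chaining two indexing steps, writes to the original DataFrame directly.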
Advanced Conditional Selections
Using AND (&) Condition
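You can combine multiple conditions with the & operator. Wrap each condition in parentheses, because & binds more tightly than comparison operators such as >: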
import pandas as pd
# Sample DataFrame
data = {
'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
'Age': [25, 30, 35, 40, 45],
'Email': ['[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]']
}
df = pd.DataFrame(data)
# Select rows where age is greater than 30 and name starts with 'C'
result = df.loc[(df['Age'] > 30) & (df['Name'].str.startswith('C'))]
print(result)
Output: a single row, Charlie (age 35).
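One way to keep combined conditions readable is to name each mask before combining them. The sketch below also shows why Python's and keyword cannot stand in for &: it asks for a single truth value, which a multi-row Series cannot provide. The variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'Age': [25, 30, 35, 40, 45],
})

# Name each mask so the combined condition reads clearly
is_over_30 = df['Age'] > 30
starts_with_c = df['Name'].str.startswith('C')

result = df.loc[is_over_30 & starts_with_c]
print(result['Name'].tolist())  # ['Charlie']

# `and` tries to reduce the whole Series to one bool and raises ValueError
try:
    df.loc[is_over_30 and starts_with_c]
    error_raised = False
except ValueError:
    error_raised = True
print(error_raised)  # True
```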
Using OR (|) Condition
Similarly, use the | operator to combine conditions with OR logic:
import pandas as pd
# Sample DataFrame
data = {
'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
'Age': [25, 30, 35, 40, 45],
'Email': ['[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]']
}
df = pd.DataFrame(data)
# Select rows where age is less than 35 or name starts with 'D'
result = df.loc[(df['Age'] < 35) | (df['Name'].str.startswith('D'))]
print(result)
Output: the rows for Alice, Bob, and David.
Using NOT (~) Condition
To select rows that do not match a condition, use the ~ operator:
import pandas as pd
# Sample DataFrame
data = {
'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
'Age': [25, 30, 35, 40, 45],
'Email': ['[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]']
}
df = pd.DataFrame(data)
# Select rows where name does not start with 'A'
result = df.loc[~(df['Name'].str.startswith('A'))]
print(result)
Output: the rows for Bob, Charlie, David, and Eve.
Complex Conditions
Using isin Method
The isin method is useful for filtering data based on a list of values:
import pandas as pd
# Sample DataFrame
data = {
'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
'Age': [25, 30, 35, 40, 45],
'Email': ['[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]']
}
df = pd.DataFrame(data)
# Select rows where name is either 'Alice' or 'Bob'
result = df.loc[df['Name'].isin(['Alice', 'Bob'])]
print(result)
Output: the rows for Alice and Bob.
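Combining isin with the ~ operator gives a "not in" filter. A minimal sketch on the same names (the variable name excluded is illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'Age': [25, 30, 35, 40, 45],
})

# Names to leave out of the result
excluded = ['Alice', 'Bob']

# ~ flips the boolean mask, keeping rows whose name is NOT in the list
result = df.loc[~df['Name'].isin(excluded)]
print(result['Name'].tolist())  # ['Charlie', 'David', 'Eve']
```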
Using between Method
The between method is handy for selecting rows where column values fall within a range. Both endpoints are included by default:
import pandas as pd
# Sample DataFrame
data = {
'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
'Age': [25, 30, 35, 40, 45],
'Email': ['[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]']
}
df = pd.DataFrame(data)
# Select rows where age is between 30 and 40
result = df.loc[df['Age'].between(30, 40)]
print(result)
Output: the rows for Bob, Charlie, and David (ages 30, 35, and 40).
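If the endpoints should be excluded, between takes an inclusive argument. The sketch below assumes pandas 1.3 or later, where inclusive accepts the strings 'both', 'neither', 'left', and 'right'; older versions took a boolean instead.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'Age': [25, 30, 35, 40, 45],
})

# Exclude both endpoints: 30 < Age < 40
strict = df.loc[df['Age'].between(30, 40, inclusive='neither')]
print(strict['Name'].tolist())  # ['Charlie']

# Include only the left endpoint: 30 <= Age < 40
left_only = df.loc[df['Age'].between(30, 40, inclusive='left')]
print(left_only['Name'].tolist())  # ['Bob', 'Charlie']
```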
Combining Conditions Across Different Columns
You can also combine conditions that involve multiple columns:
import pandas as pd
# Sample DataFrame
data = {
'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
'Age': [25, 30, 35, 40, 45],
'Email': ['[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]']
}
df = pd.DataFrame(data)
# Select rows where age is greater than 30 and email includes 'pandasdataframe.com'
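# regex=False matches the text literally, so the '.' is not treated as a regex wildcard
result = df.loc[(df['Age'] > 30) & (df['Email'].str.contains('pandasdataframe.com', regex=False))]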
print(result)
Output: the rows where Age is greater than 30 and Email contains 'pandasdataframe.com'.
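Two details of str.contains are worth keeping in mind when it feeds a loc condition. By default the pattern is treated as a regular expression, so a dot matches any character, and missing values produce a missing result unless na is set. The sketch below uses made-up email addresses purely for illustration; they are not the article's data.

```python
import pandas as pd

# Hypothetical addresses: one real match, one near miss, one missing value
emails = pd.Series([
    'alice@pandasdataframe.com',
    'bob@pandasdataframeXcom',
    None,
])

# Default regex matching: '.' matches any character, so the near miss also matches
as_regex = emails.str.contains('pandasdataframe.com', na=False)
print(as_regex.tolist())  # [True, True, False]

# Literal matching: only the exact text matches; na=False treats missing as no match
as_literal = emails.str.contains('pandasdataframe.com', regex=False, na=False)
print(as_literal.tolist())  # [True, False, False]
```

Setting na=False keeps the mask strictly boolean, so it can be passed to loc even when the column has gaps.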
Pandas loc Condition Conclusion
Using loc with conditions in Pandas provides a robust way to filter and manipulate DataFrame rows based on complex logic. This functionality is essential for data preprocessing, analysis, and feature engineering in Python data science projects. The examples provided here should help you get started with using loc effectively in your own data analysis tasks.
By mastering these techniques, you can efficiently explore and manipulate large datasets, allowing you to extract meaningful insights from your data.