Pandas DataFrame Filter
Pandas is a Python library for data manipulation and analysis. Its most commonly used data structure is the DataFrame, a two-dimensional labeled structure whose columns can hold different types.
Filtering is one of the most frequent operations performed on a DataFrame. It allows you to select specific rows or columns from a DataFrame based on some condition. In this article, we will explore different ways to filter data in a Pandas DataFrame.
1. Using Boolean Indexing
Boolean indexing selects rows from a DataFrame based on a Boolean condition evaluated for each row. Here is an example:
import pandas as pd
data = {'Name': ['John', 'Anna', 'Peter', 'Linda'],
        'Age': [28, 24, 35, 32],
        'Country': ['USA', 'USA', 'UK', 'Canada']}
df = pd.DataFrame(data)
# Filter rows where Age is greater than 30
filtered_df = df[df['Age'] > 30]
print(filtered_df)
Output:
    Name  Age Country
2  Peter   35      UK
3  Linda   32  Canada
In the above example, df['Age'] > 30 returns a Boolean Series where each element is True if the corresponding age is greater than 30, and False otherwise. The DataFrame df is then indexed with this Boolean Series, returning only the rows where the condition is True.
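Conditions can also be combined. The following is a minimal sketch, assuming the same sample data as in the example above, that uses the element-wise operators & (and) and | (or). Each condition is wrapped in parentheses because these operators bind more tightly than comparisons. The variable names are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['John', 'Anna', 'Peter', 'Linda'],
    'Age': [28, 24, 35, 32],
    'Country': ['USA', 'USA', 'UK', 'Canada'],
})

# Rows where Age is greater than 25 AND Country is 'USA'
adults_in_usa = df[(df['Age'] > 25) & (df['Country'] == 'USA')]

# Rows where Age is less than 25 OR Country is 'UK'
young_or_uk = df[(df['Age'] < 25) | (df['Country'] == 'UK')]

print(adults_in_usa)
print(young_or_uk)
```

Python's own and / or keywords do not work here, because they try to reduce a whole Series to a single True or False value.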
2. Using the query Method
The query method allows you to filter data using a query string, an expression in which column names can be used directly. Here is an example:
import pandas as pd
data = {'Name': ['John', 'Anna', 'Peter', 'Linda'],
        'Age': [28, 24, 35, 32],
        'Country': ['USA', 'USA', 'UK', 'Canada']}
df = pd.DataFrame(data)
# Filter rows where Age is greater than 30
filtered_df = df.query('Age > 30')
print(filtered_df)
Output:
    Name  Age Country
2  Peter   35      UK
3  Linda   32  Canada
In the above example, the query string 'Age > 30' is used to filter the DataFrame. The query method returns a new DataFrame containing only the rows where the condition is True.
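A query string can also refer to Python variables by prefixing them with @. The sketch below assumes the same sample data; min_age is a hypothetical variable introduced only for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['John', 'Anna', 'Peter', 'Linda'],
    'Age': [28, 24, 35, 32],
    'Country': ['USA', 'USA', 'UK', 'Canada'],
})

# Hypothetical threshold held in an ordinary Python variable
min_age = 30

# '@min_age' refers to the local variable; 'and' combines two conditions
result = df.query('Age > @min_age and Country != "UK"')
print(result)
```

This keeps the threshold out of the string itself, which can be convenient when the value comes from user input or a configuration file.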
3. Using the loc and iloc Indexers
The loc and iloc indexers select data by label and by integer position, respectively. In addition to labels, loc accepts a Boolean Series. Here are examples:
import pandas as pd
data = {'Name': ['John', 'Anna', 'Peter', 'Linda'],
        'Age': [28, 24, 35, 32],
        'Country': ['USA', 'USA', 'UK', 'Canada']}
df = pd.DataFrame(data)
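# Filter rows where Age is greater than 30 using loc
over_30 = df.loc[df['Age'] > 30]
print(over_30)
# Select the first three rows using iloc
first_three = df.iloc[0:3]
print(first_three)
Output:
    Name  Age Country
2  Peter   35      UK
3  Linda   32  Canada
    Name  Age Country
0   John   28     USA
1   Anna   24     USA
2  Peter   35      UK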
In the first example, df['Age'] > 30 returns a Boolean Series, which is used to filter the DataFrame using loc. In the second example, df.iloc[0:3] returns the first three rows of the DataFrame (positions 0, 1, and 2; the end position 3 is excluded).
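Both indexers also accept a second argument that selects columns. The following sketch, which assumes the same sample data and uses illustrative variable names, filters rows and columns in a single step.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['John', 'Anna', 'Peter', 'Linda'],
    'Age': [28, 24, 35, 32],
    'Country': ['USA', 'USA', 'UK', 'Canada'],
})

# loc: rows where Age > 30, keeping only the 'Name' and 'Country' columns
names_over_30 = df.loc[df['Age'] > 30, ['Name', 'Country']]

# iloc: first two rows and first two columns, selected by position
top_left = df.iloc[0:2, 0:2]

print(names_over_30)
print(top_left)
```

Note that slices in loc include the end label, while slices in iloc exclude the end position.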
4. Using the isin Method
The isin method allows you to filter data based on whether each element of a column (or of a whole DataFrame) is contained in a list of values. Here is an example:
import pandas as pd
data = {'Name': ['John', 'Anna', 'Peter', 'Linda'],
        'Age': [28, 24, 35, 32],
        'Country': ['USA', 'USA', 'UK', 'Canada']}
df = pd.DataFrame(data)
# Filter rows where Country is either USA or UK
filtered_df = df[df['Country'].isin(['USA', 'UK'])]
print(filtered_df)
Output:
    Name  Age Country
0   John   28     USA
1   Anna   24     USA
2  Peter   35      UK
In the above example, df['Country'].isin(['USA', 'UK']) returns a Boolean Series where each element is True if the corresponding country is either 'USA' or 'UK', and False otherwise. The DataFrame df is then indexed with this Boolean Series, returning only the rows where the condition is True.
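The same Boolean Series can be inverted with the ~ operator to exclude values instead of keeping them. A minimal sketch, assuming the same sample data (the variable name is illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['John', 'Anna', 'Peter', 'Linda'],
    'Age': [28, 24, 35, 32],
    'Country': ['USA', 'USA', 'UK', 'Canada'],
})

# ~ flips True and False, so this keeps rows whose Country is NOT in the list
outside_usa_uk = df[~df['Country'].isin(['USA', 'UK'])]
print(outside_usa_uk)
```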
5. Using the filter Method
The filter method selects rows or columns by their labels, not by the values they contain. By default, it operates on the columns of a DataFrame. Here is an example:
import pandas as pd
data = {'Name': ['John', 'Anna', 'Peter', 'Linda'],
        'Age': [28, 24, 35, 32],
        'Country': ['USA', 'USA', 'UK', 'Canada']}
df = pd.DataFrame(data)
# Filter columns that contain the string 'Name'
filtered_df = df.filter(like='Name')
print(filtered_df)
Output:
    Name
0   John
1   Anna
2  Peter
3  Linda
In the above example, df.filter(like='Name') returns a new DataFrame containing only the columns whose labels contain the string 'Name'.
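Besides like, the filter method accepts items (an exact list of labels) and regex (a regular expression), and axis=0 applies the match to row labels instead of column labels. The sketch below assumes the same sample data; setting 'Name' as the index is done here only to give the rows meaningful labels.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['John', 'Anna', 'Peter', 'Linda'],
    'Age': [28, 24, 35, 32],
    'Country': ['USA', 'USA', 'UK', 'Canada'],
})

# Keep exactly the listed columns
by_items = df.filter(items=['Name', 'Age'])

# Keep columns whose label matches a regular expression (starts with 'C')
by_regex = df.filter(regex='^C')

# Match row labels instead: use 'Name' as the index, then keep
# rows whose index label contains the letter 'n'
by_row_label = df.set_index('Name').filter(like='n', axis=0)

print(by_items)
print(by_regex)
print(by_row_label)
```

Only one of items, like, and regex can be passed in a single call.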
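In conclusion, Pandas provides several ways to filter data in a DataFrame: Boolean indexing and loc for conditions on values, query for conditions written as strings, isin for membership tests, iloc for selection by position, and filter for selection by label. Which one you choose depends on your specific needs and the nature of your data.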