Pandas Aggregation: Distinct Count

Pandas Aggregation: Distinct Count

In data analysis, understanding the unique values within your dataset is crucial for recognizing patterns, identifying anomalies, and making informed decisions. Pandas, a powerful data manipulation library in Python, provides various methods to perform aggregations, one of which includes counting distinct values. This article explores how to use the agg function in Pandas to perform distinct counts, which is a common requirement in data analysis tasks.

Introduction to Pandas Aggregation

Pandas offers a wide range of functionalities for data manipulation and analysis. Aggregation is a process in which we combine multiple pieces of data into a single result, typically to get a summary statistic like count, sum, average, minimum, or maximum. When dealing with large datasets, these aggregations become essential for summarizing data and extracting meaningful insights.

Basic Setup

Before diving into the examples, ensure you have the Pandas library installed in your Python environment. If not, you can install it using pip:

pip install pandas

Now, import Pandas and create a simple DataFrame to work with:

import pandas as pd

data = {
    'Domain': ['pandasdataframe.com', 'pandasdataframe.com', 'pandasdataframe.com', 'example.com', 'example.com'],
    'User': ['Alice', 'Bob', 'Alice', 'Charlie', 'Charlie'],
    'Visits': [1, 2, 1, 3, 1]
}
df = pd.DataFrame(data)
print(df)

Output:

Pandas Aggregation: Distinct Count

Counting Distinct Values Using agg

The agg function in Pandas allows you to apply one or more operations over the specified axis. It is particularly useful when you need to perform multiple aggregations at once or when you want to apply a specific function.

Example 1: Basic Distinct Count

To count distinct values in a single column:

import pandas as pd

data = {
    'Domain': ['pandasdataframe.com', 'pandasdataframe.com', 'pandasdataframe.com', 'example.com', 'example.com'],
    'User': ['Alice', 'Bob', 'Alice', 'Charlie', 'Charlie'],
    'Visits': [1, 2, 1, 3, 1]
}
df = pd.DataFrame(data)

result = df['User'].agg(pd.Series.nunique)
print(result)

Output:

Pandas Aggregation: Distinct Count

Example 2: Distinct Count on Grouped Data

Grouping data based on certain criteria and then counting distinct values within each group:

import pandas as pd

data = {
    'Domain': ['pandasdataframe.com', 'pandasdataframe.com', 'pandasdataframe.com', 'example.com', 'example.com'],
    'User': ['Alice', 'Bob', 'Alice', 'Charlie', 'Charlie'],
    'Visits': [1, 2, 1, 3, 1]
}
df = pd.DataFrame(data)

result = df.groupby('Domain')['User'].agg(pd.Series.nunique)
print(result)

Output:

Pandas Aggregation: Distinct Count

Example 3: Multiple Aggregations

Applying multiple aggregation functions at once, including distinct count:

import pandas as pd

data = {
    'Domain': ['pandasdataframe.com', 'pandasdataframe.com', 'pandasdataframe.com', 'example.com', 'example.com'],
    'User': ['Alice', 'Bob', 'Alice', 'Charlie', 'Charlie'],
    'Visits': [1, 2, 1, 3, 1]
}
df = pd.DataFrame(data)

result = df.agg({'User': pd.Series.nunique, 'Visits': sum})
print(result)

Example 4: Distinct Count with Conditions

Counting distinct values where certain conditions are met:

import pandas as pd

data = {
    'Domain': ['pandasdataframe.com', 'pandasdataframe.com', 'pandasdataframe.com', 'example.com', 'example.com'],
    'User': ['Alice', 'Bob', 'Alice', 'Charlie', 'Charlie'],
    'Visits': [1, 2, 1, 3, 1]
}
df = pd.DataFrame(data)

result = df[df['Visits'] > 1]['User'].agg(pd.Series.nunique)
print(result)

Output:

Pandas Aggregation: Distinct Count

Example 5: Distinct Count Across Multiple Columns

Counting distinct combinations of values across multiple columns:

import pandas as pd

data = {
    'Domain': ['pandasdataframe.com', 'pandasdataframe.com', 'pandasdataframe.com', 'example.com', 'example.com'],
    'User': ['Alice', 'Bob', 'Alice', 'Charlie', 'Charlie'],
    'Visits': [1, 2, 1, 3, 1]
}
df = pd.DataFrame(data)

result = df.groupby(['Domain', 'User']).size().reset_index(name='Counts')
distinct_count = result['Domain'].agg(pd.Series.nunique)
print(distinct_count)

Output:

Pandas Aggregation: Distinct Count

Example 6: Using nunique Directly

Sometimes, using nunique directly on the DataFrame or Series can be more straightforward:

import pandas as pd

data = {
    'Domain': ['pandasdataframe.com', 'pandasdataframe.com', 'pandasdataframe.com', 'example.com', 'example.com'],
    'User': ['Alice', 'Bob', 'Alice', 'Charlie', 'Charlie'],
    'Visits': [1, 2, 1, 3, 1]
}
df = pd.DataFrame(data)

result = df['User'].nunique()
print(result)

Output:

Pandas Aggregation: Distinct Count

Example 7: Distinct Count in a Pivot Table

Creating a pivot table and then applying a distinct count:

import pandas as pd

data = {
    'Domain': ['pandasdataframe.com', 'pandasdataframe.com', 'pandasdataframe.com', 'example.com', 'example.com'],
    'User': ['Alice', 'Bob', 'Alice', 'Charlie', 'Charlie'],
    'Visits': [1, 2, 1, 3, 1]
}
df = pd.DataFrame(data)

pivot_table = df.pivot_table(index='Domain', columns='User', values='Visits', aggfunc='sum')
distinct_count = pivot_table.agg(pd.Series.nunique, axis=1)
print(distinct_count)

Output:

Pandas Aggregation: Distinct Count

Example 8: Custom Aggregation Function

Defining a custom aggregation function to count distinct values:

import pandas as pd

data = {
    'Domain': ['pandasdataframe.com', 'pandasdataframe.com', 'pandasdataframe.com', 'example.com', 'example.com'],
    'User': ['Alice', 'Bob', 'Alice', 'Charlie', 'Charlie'],
    'Visits': [1, 2, 1, 3, 1]
}
df = pd.DataFrame(data)

def distinct_count(series):
    return series.nunique()

result = df['User'].agg(distinct_count)
print(result)

Output:

Pandas Aggregation: Distinct Count

Example 9: Distinct Count on Filtered Data

Applying a distinct count on a subset of data filtered by another column:

import pandas as pd

data = {
    'Domain': ['pandasdataframe.com', 'pandasdataframe.com', 'pandasdataframe.com', 'example.com', 'example.com'],
    'User': ['Alice', 'Bob', 'Alice', 'Charlie', 'Charlie'],
    'Visits': [1, 2, 1, 3, 1]
}
df = pd.DataFrame(data)

filtered_data = df[df['Domain'] == 'pandasdataframe.com']
result = filtered_data['User'].agg(pd.Series.nunique)
print(result)

Output:

Pandas Aggregation: Distinct Count

Example 10: Distinct Count After Concatenation

Concatenating two DataFrames and then counting distinct values:

import pandas as pd

data = {
    'Domain': ['pandasdataframe.com', 'pandasdataframe.com', 'pandasdataframe.com', 'example.com', 'example.com'],
    'User': ['Alice', 'Bob', 'Alice', 'Charlie', 'Charlie'],
    'Visits': [1, 2, 1, 3, 1]
}
df = pd.DataFrame(data)

df2 = pd.DataFrame({
    'Domain': ['pandasdataframe.com', 'newsite.com'],
    'User': ['Alice', 'Zara'],
    'Visits': [1, 1]
})
combined_df = pd.concat([df, df2])
result = combined_df['User'].agg(pd.Series.nunique)
print(result)

Output:

Pandas Aggregation: Distinct Count

Pandas agg distinct count conclusion

Counting distinct values is a fundamental task in data analysis, helping to understand the diversity and frequency of data. Pandas provides versatile tools like agg, groupby, and nunique to make these tasks easier and more intuitive. By mastering these techniques, you can efficiently summarize and analyze large datasets, gaining deeper insights into your data.

In this article, we explored various ways to count distinct values using Pandas, from simple distinct counts to more complex aggregations involving conditions and custom functions. These examples serve as a foundation for performing advanced data analysis tasks in Python using Pandas.