Pandas Create DataFrame

Pandas is a Python library for data manipulation and analysis. Its core structure is the DataFrame: a two-dimensional, size-mutable table with labeled axes (rows and columns) whose columns may hold different data types. Creating a DataFrame is a fundamental skill for any data scientist or analyst working with pandas. This article walks through the common ways to create DataFrames, with a code example and an explanation for each.

1. Introduction to DataFrames

A DataFrame is essentially a table where data is aligned in rows and columns. Each column in a DataFrame can be of a different type, such as numeric, string, or boolean. This flexibility makes DataFrames suitable for a wide range of data manipulation tasks.
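
As a small illustration of mixed column types, the sketch below builds a DataFrame with a string, an integer, and a boolean column and inspects the type of each. The column names and values are made up for this example.

```python
import pandas as pd

# Hypothetical sample data: one string, one integer, and one boolean column
df = pd.DataFrame({
    'Name': ['Alice', 'Bob'],
    'Age': [24, 27],
    'Active': [True, False]
})

print(df.dtypes)  # one dtype per column
print(df.shape)   # (number of rows, number of columns)
```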

2. Creating DataFrames from Lists

One of the simplest ways to create a DataFrame is from a list of lists. Each inner list represents a row in the DataFrame.

Example 1: Creating DataFrame from a List of Lists

import pandas as pd

data = [
    ['Alice', 24, 'Engineer'],
    ['Bob', 27, 'Doctor'],
    ['Charlie', 22, 'Artist']
]
df = pd.DataFrame(data, columns=['Name', 'Age', 'Profession'])
print(df)

Explanation

  • We first import the pandas library.
  • We define a list of lists where each inner list contains data for one row.
  • We create the DataFrame using pd.DataFrame(), passing the data and column names.
  • The DataFrame df now contains three rows and three columns.
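
If the columns argument is left out, pandas falls back to integer column labels. The following sketch, using a shortened version of the data above, shows the default labels and one way to assign names afterwards.

```python
import pandas as pd

rows = [['Alice', 24], ['Bob', 27]]

# Without the columns argument, pandas labels the columns 0, 1, ...
df_default = pd.DataFrame(rows)
default_labels = list(df_default.columns)
print(default_labels)

# Column labels can also be assigned after construction
df_default.columns = ['Name', 'Age']
print(df_default)
```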

Example 2: Creating DataFrame from a List of Dictionaries

import pandas as pd

data = [
    {'Name': 'Alice', 'Age': 24, 'Profession': 'Engineer'},
    {'Name': 'Bob', 'Age': 27, 'Profession': 'Doctor'},
    {'Name': 'Charlie', 'Age': 22, 'Profession': 'Artist'}
]
df = pd.DataFrame(data)
print(df)

Explanation

  • We define a list of dictionaries where each dictionary represents a row.
  • We create the DataFrame directly from the list of dictionaries.
  • Pandas automatically infers the column names from the keys of the dictionaries.
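
The dictionaries do not need to share the same keys. As a rough sketch with made-up records, the example below shows that pandas uses the union of all keys as columns and marks absent entries as missing.

```python
import pandas as pd

# Hypothetical records: each dictionary lacks one of the keys
records = [
    {'Name': 'Alice', 'Age': 24},
    {'Name': 'Bob', 'Profession': 'Doctor'}
]
df = pd.DataFrame(records)

print(df)
print(df.isna())  # True wherever a record had no value for that column
```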

3. Creating DataFrames from Dictionaries

Dictionaries provide a versatile way to create DataFrames, especially when columns have different data types.

Example 3: Creating DataFrame from a Dictionary of Lists

import pandas as pd

data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [24, 27, 22],
    'Profession': ['Engineer', 'Doctor', 'Artist']
}
df = pd.DataFrame(data)
print(df)

Explanation

  • We define a dictionary where each key represents a column and each value is a list of column values.
  • We create the DataFrame using pd.DataFrame().
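
One caveat with a dictionary of lists is that every list must have the same length. The sketch below, with deliberately mismatched data, shows that pandas raises a ValueError otherwise.

```python
import pandas as pd

# Deliberately mismatched lengths: two names but only one age
bad_data = {
    'Name': ['Alice', 'Bob'],
    'Age': [24]
}

raised = False
try:
    pd.DataFrame(bad_data)
except ValueError as exc:
    raised = True
    print('Could not build DataFrame:', exc)
```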

Example 4: Creating DataFrame from a Dictionary of Series

import pandas as pd

data = {
    'Name': pd.Series(['Alice', 'Bob', 'Charlie']),
    'Age': pd.Series([24, 27, 22]),
    'Profession': pd.Series(['Engineer', 'Doctor', 'Artist'])
}
df = pd.DataFrame(data)
print(df)

Explanation

  • We define a dictionary where each key represents a column and each value is a pandas Series.
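  • Each Series carries its own data type and index, so this method allows more control over both; pandas aligns the Series on their index labels when building the DataFrame.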
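
To make the role of the index concrete, here is a minimal sketch in which the two Series have different, made-up index labels. Pandas lines the values up by label and leaves a missing value where a label has no match.

```python
import pandas as pd

# Hypothetical labels: the Age Series has no entry for label 'b'
names = pd.Series(['Alice', 'Bob', 'Charlie'], index=['a', 'b', 'c'])
ages = pd.Series([24, 27], index=['a', 'c'])

df = pd.DataFrame({'Name': names, 'Age': ages})
print(df)  # row 'b' has a missing Age
```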

4. Creating DataFrames from NumPy Arrays

NumPy arrays can be used to create DataFrames, especially when working with numerical data.

Example 5: Creating DataFrame from a 2D NumPy Array

import pandas as pd
import numpy as np

data = np.array([
    ['Alice', 24, 'Engineer'],
    ['Bob', 27, 'Doctor'],
    ['Charlie', 22, 'Artist']
])
df = pd.DataFrame(data, columns=['Name', 'Age', 'Profession'])
print(df)

Explanation

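  • We import NumPy and create a 2D NumPy array. Because a NumPy array holds a single data type, the mixed values are all converted to strings, so the Age column contains strings such as '24' rather than integers.
  • We create the DataFrame from this array, specifying the column names.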
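
A NumPy array stores one data type, so an array that mixes names and numbers ends up holding strings. The sketch below uses a shortened version of the data to show this and one possible way to restore a numeric column with astype().

```python
import numpy as np
import pandas as pd

data = np.array([
    ['Alice', 24],
    ['Bob', 27]
])
print(data.dtype)  # a Unicode string dtype, not a numeric one

df = pd.DataFrame(data, columns=['Name', 'Age'])
age_before = df['Age'].iloc[0]  # the string '24'

# Convert the Age column back to integers
df['Age'] = df['Age'].astype(int)
print(df.dtypes)
```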

Example 6: Creating DataFrame from a 1D NumPy Array

import pandas as pd
import numpy as np

data = np.array(['Alice', 24, 'Engineer'])
df = pd.DataFrame([data], columns=['Name', 'Age', 'Profession'])
print(df)

Explanation

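  • We create a 1D NumPy array representing a single row of data. Because a NumPy array holds a single data type, the mixed values are stored as strings.
  • We wrap this array in a list so that it is treated as one row; a 1D array passed on its own is interpreted as a single column.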

5. Creating DataFrames from Series

Pandas Series can be combined to form a DataFrame. This is useful when dealing with time series data or any data that naturally fits into a single column.

Example 7: Creating DataFrame from Multiple Series

import pandas as pd

name_series = pd.Series(['Alice', 'Bob', 'Charlie'])
age_series = pd.Series([24, 27, 22])
profession_series = pd.Series(['Engineer', 'Doctor', 'Artist'])

df = pd.DataFrame({
    'Name': name_series,
    'Age': age_series,
    'Profession': profession_series
})
print(df)

Explanation

  • We create three pandas Series, each representing a column.
  • We create the DataFrame by passing a dictionary of these Series.
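
An alternative to the dictionary approach is pd.concat() with axis=1, which places Series side by side and takes the column labels from each Series' name. This is a minimal sketch with shortened data, not a replacement for the method above.

```python
import pandas as pd

# Each Series is given a name, which becomes its column label
name_series = pd.Series(['Alice', 'Bob'], name='Name')
age_series = pd.Series([24, 27], name='Age')

# axis=1 combines the Series as columns rather than stacking them as rows
df = pd.concat([name_series, age_series], axis=1)
print(df)
```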

6. Creating DataFrames from Other DataFrames

Sometimes, it is necessary to create a new DataFrame from an existing one, either by filtering, selecting specific columns, or performing other operations.

Example 8: Creating DataFrame by Selecting Specific Columns

import pandas as pd

data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [24, 27, 22],
    'Profession': ['Engineer', 'Doctor', 'Artist']
}
df = pd.DataFrame(data)
df_new = df[['Name', 'Age']]
print(df_new)

Explanation

  • We create a DataFrame from a dictionary.
  • We create a new DataFrame df_new by selecting specific columns from the original DataFrame df.
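
When the new DataFrame will be modified later, calling .copy() makes it explicit that it is independent of the original. The sketch below illustrates this with shortened data; the replacement value 99 is arbitrary.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob'],
    'Age': [24, 27],
    'Profession': ['Engineer', 'Doctor']
})

# Take an explicit copy of the selected columns
df_new = df[['Name', 'Age']].copy()

# Changing the copy leaves the original DataFrame untouched
df_new.loc[0, 'Age'] = 99
print(df.loc[0, 'Age'])      # still the original value
print(df_new.loc[0, 'Age'])
```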

Example 9: Creating DataFrame by Filtering Rows

import pandas as pd

data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [24, 27, 22],
    'Profession': ['Engineer', 'Doctor', 'Artist']
}
df = pd.DataFrame(data)
df_new = df[df['Age'] > 23]
print(df_new)

Explanation

  • We create a DataFrame from a dictionary.
  • We create a new DataFrame df_new by filtering rows based on a condition.
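
Conditions can be combined with & (and) or | (or), with each condition wrapped in parentheses. The following sketch uses the same data with an extra, illustrative condition and renumbers the rows of the result with reset_index().

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [24, 27, 22],
    'Profession': ['Engineer', 'Doctor', 'Artist']
})

# Each condition needs its own parentheses when combined with & or |
mask = (df['Age'] > 23) & (df['Profession'] != 'Doctor')

# drop=True discards the old index instead of keeping it as a column
df_new = df[mask].reset_index(drop=True)
print(df_new)
```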

7. Creating DataFrames from CSV Files

Reading data from CSV files is a common task. Pandas provides the read_csv function to load data from a CSV file into a DataFrame.

Example 10: Creating DataFrame from CSV File

import pandas as pd

df = pd.read_csv('path_to_your_csv_file.csv')
print(df)

Explanation

  • We use the pd.read_csv() function to read data from a CSV file.
  • The file path is specified as an argument, and the function returns a DataFrame.
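
To try read_csv without a file on disk, the sketch below reads CSV text from an in-memory buffer. The CSV content is invented for illustration; usecols and dtype are optional read_csv parameters for selecting columns and setting their types.

```python
import io
import pandas as pd

# Hypothetical CSV content held in memory instead of a file
csv_text = (
    'Name,Age,Profession\n'
    'Alice,24,Engineer\n'
    'Bob,27,Doctor\n'
)

df = pd.read_csv(
    io.StringIO(csv_text),
    usecols=['Name', 'Age'],   # load only these columns
    dtype={'Age': 'int64'}     # set the type of the Age column explicitly
)
print(df)
```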

8. Creating DataFrames from Excel Files

Excel files are widely used in data analysis. Pandas provides the read_excel function to load data from Excel files into a DataFrame. Reading .xlsx files requires an Excel engine such as openpyxl to be installed.

Example 11: Creating DataFrame from Excel File

import pandas as pd

df = pd.read_excel('path_to_your_excel_file.xlsx')
print(df)

Explanation

  • We use the pd.read_excel() function to read data from an Excel file.
  • The file path is specified as an argument, and the function returns a DataFrame.

9. Creating DataFrames from SQL Databases

Data stored in SQL databases can be loaded into pandas DataFrames using the read_sql function, which takes a query and an open database connection.

Example 12: Creating DataFrame from SQL Query

import pandas as pd
import sqlite3

# Establishing a connection to the SQLite database
conn = sqlite3.connect('example.db')

# SQL query to fetch data
query = "SELECT * FROM your_table_name"

# Creating DataFrame
df = pd.read_sql(query, conn)
print(df)

# Closing the connection
conn.close()

Explanation

  • We establish a connection to the SQLite database using sqlite3.connect().
  • We define an SQL query to fetch data from the database.
  • We use pd.read_sql() to execute the query and load the data into a DataFrame.
  • After loading the data, we close the database connection.
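
The same steps can be tried without an existing database file by using an in-memory SQLite database. In this sketch the table name people, its columns, and its rows are all made up; the query passes its threshold through the params argument rather than building the SQL string by hand.

```python
import sqlite3
import pandas as pd

# ':memory:' creates a temporary database that exists only for this connection
conn = sqlite3.connect(':memory:')
try:
    # Hypothetical table and rows, created only for this example
    conn.execute('CREATE TABLE people (name TEXT, age INTEGER)')
    conn.executemany(
        'INSERT INTO people VALUES (?, ?)',
        [('Alice', 24), ('Bob', 27), ('Charlie', 22)]
    )

    # The ? placeholder is filled from params
    query = 'SELECT name, age FROM people WHERE age > ? ORDER BY age'
    df = pd.read_sql(query, conn, params=(23,))
finally:
    # Close the connection even if the query fails
    conn.close()

print(df)
```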

10. Creating DataFrames with Custom Index and Columns

Pandas allows creating DataFrames with custom index and column labels, providing flexibility in data representation.

Example 13: Creating DataFrame with Custom Index

import pandas as pd

data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [24, 27, 22],
    'Profession': ['Engineer', 'Doctor', 'Artist']
}
index = ['a', 'b', 'c']
df = pd.DataFrame(data, index=index)
print(df)

Explanation

  • We define a dictionary with data.
  • We specify a custom index list.
  • We create the DataFrame using pd.DataFrame() and pass the custom index.
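
A custom index is mainly useful for looking rows up by label. As a brief sketch with the same data, the example below reads values with .loc and also shows set_index(), which turns an existing column into the index.

```python
import pandas as pd

data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [24, 27, 22],
    'Profession': ['Engineer', 'Doctor', 'Artist']
}
df = pd.DataFrame(data, index=['a', 'b', 'c'])

# Look up a single value by row label and column label
print(df.loc['b', 'Name'])

# Alternatively, use an existing column as the index
df_by_name = df.set_index('Name')
print(df_by_name.loc['Charlie', 'Age'])
```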

Example 14: Creating DataFrame with Custom Columns

import pandas as pd

data = [
    ['Alice', 24, 'Engineer'],
    ['Bob', 27, 'Doctor'],
    ['Charlie', 22, 'Artist']
]
columns = ['Person Name', 'Age in Years', 'Job Title']
df = pd.DataFrame(data, columns=columns)
print(df)

Explanation

  • We define a list of lists with data.
  • We specify custom column names.
  • We create the DataFrame using pd.DataFrame() and pass the custom columns.

11. Handling Missing Data in DataFrames

Real-world data often contains missing values. Pandas provides several methods to handle missing data.

Example 15: Creating DataFrame with Missing Values

import pandas as pd

data = {
    'Name': ['Alice', 'Bob', 'Charlie', None],
    'Age': [24, 27, None, 22],
    'Profession': ['Engineer', None, 'Artist', 'Doctor']
}
df = pd.DataFrame(data)
print(df)

Explanation

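  • We define a dictionary with some missing values (None). In the numeric Age column, pandas stores the missing value as NaN, which makes the column floating-point.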
  • We create the DataFrame using pd.DataFrame().
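
Before deciding how to handle missing values, it often helps to count them. The sketch below uses the same data and isna().sum() to report the number of missing entries per column.

```python
import pandas as pd

data = {
    'Name': ['Alice', 'Bob', 'Charlie', None],
    'Age': [24, 27, None, 22],
    'Profession': ['Engineer', None, 'Artist', 'Doctor']
}
df = pd.DataFrame(data)

# isna() marks missing entries; sum() counts them per column
missing_per_column = df.isna().sum()
print(missing_per_column)
print(df['Age'].dtype)  # floating-point because of the missing value
```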

Example 16: Dropping Rows with Missing Values

import pandas as pd

data = {
    'Name': ['Alice', 'Bob', 'Charlie', None],
    'Age': [24, 27, None, 22],
    'Profession': ['Engineer', None, 'Artist', 'Doctor']
}
df = pd.DataFrame(data)
df_dropped = df.dropna()
print(df_dropped)

Explanation

  • We create a DataFrame with missing values.
  • We use the dropna() method to remove rows with any missing values.
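
With this data, dropna() keeps only the first row, because every other row has one missing value. When that is too aggressive, the subset parameter limits the check to chosen columns, as in this sketch.

```python
import pandas as pd

data = {
    'Name': ['Alice', 'Bob', 'Charlie', None],
    'Age': [24, 27, None, 22],
    'Profession': ['Engineer', None, 'Artist', 'Doctor']
}
df = pd.DataFrame(data)

# Drop a row if any column is missing
df_any = df.dropna()

# Drop a row only if Name is missing
df_named = df.dropna(subset=['Name'])

print(len(df_any), len(df_named))
```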

Example 17: Filling Missing Values

import pandas as pd

data = {
    'Name': ['Alice', 'Bob', 'Charlie', None],
    'Age': [24, 27, None, 22],
    'Profession': ['Engineer', None, 'Artist', 'Doctor']
}
df = pd.DataFrame(data)
df_filled = df.fillna('Unknown')
print(df_filled)

Explanation

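  • We create a DataFrame with missing values.
  • We use the fillna() method to replace missing values with 'Unknown'. Note that this also places the string in the Age column, so that column no longer holds only numbers.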
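
To keep numeric columns numeric, fillna() also accepts a dictionary that maps column names to fill values. The sketch below fills the text columns with 'Unknown' and, as one possible choice for illustration, fills Age with the column median.

```python
import pandas as pd

data = {
    'Name': ['Alice', 'Bob', 'Charlie', None],
    'Age': [24, 27, None, 22],
    'Profession': ['Engineer', None, 'Artist', 'Doctor']
}
df = pd.DataFrame(data)

# A different fill value per column; the median is just one option for Age
fill_values = {
    'Name': 'Unknown',
    'Age': df['Age'].median(),
    'Profession': 'Unknown'
}
df_filled = df.fillna(fill_values)
print(df_filled)
```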

12. Conclusion

Creating DataFrames is a fundamental operation in pandas, providing a foundation for various data manipulation and analysis tasks. This article covered different methods to create DataFrames from lists, dictionaries, numpy arrays, Series, other DataFrames, CSV files, Excel files, and SQL databases. We also explored creating DataFrames with custom indices and handling missing data. The provided examples and explanations should equip you with the necessary knowledge to effectively create and work with DataFrames in pandas.