| Feature | Pandas | Polars | PySpark |
|---|---|---|---|
| Dataset Size | Small to Medium (<10GB, fits in memory) | Medium to Large (GBs to 100GBs, single machine) | Massive (100GB to PBs, distributed) |
| Execution | Eager, single-threaded | Eager/Lazy, multi-threaded | Lazy, distributed |
| Performance | Good for small data, struggles with scale | Very fast on single machine, memory efficient | Scalable and fault-tolerant for big data |
| Complexity | Simple, intuitive API, low learning curve | Pythonic API, moderate learning curve | Complex setup and management |
| Primary Use | EDA, prototyping, ML integration | Performance-critical single-machine tasks | Enterprise ETL, large-scale ML, streaming |
Best for: Exploratory Data Analysis (EDA), quick analysis, and prototyping on small to medium
datasets that fit comfortably in your computer's RAM (typically under 10GB). Pandas integrates
seamlessly with popular machine learning libraries such as scikit-learn, as sketched below.
* While powerful, its single-threaded nature and in-memory operations make it a bottleneck for
larger datasets.
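A minimal sketch of that scikit-learn hand-off (the toy columns and model here are illustrative, not taken from the examples that follow): a pandas DataFrame can be passed directly to an estimator.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Illustrative data: a tiny DataFrame fed straight into scikit-learn
toy = pd.DataFrame({
    'years_experience': [1, 3, 5, 7, 9, 11],
    'salary': [45000, 60000, 72000, 83000, 91000, 105000],
})
X_train, X_test, y_train, y_test = train_test_split(
    toy[['years_experience']], toy['salary'], test_size=0.33, random_state=42
)
model = LinearRegression().fit(X_train, y_train)
print(model.predict(X_test))
```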
```python
import pandas as pd
import numpy as np

data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve', 'Frank', 'Grace'],
    'Age': [25, 30, np.nan, 22, 28, 45, 30],
    'City': ['New York', 'Los Angeles', 'Chicago', 'New York', 'New York', 'Los Angeles', np.nan],
    'Salary': [70000, 90000, 60000, 55000, 80000, 120000, 90000]
}
df = pd.DataFrame(data)
```
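For instance, a handful of the everyday operations on this df could look like the following (a quick sketch, not the full set of examples):

```python
# A few frequently used operations on the df defined above
print(df.head())                                  # preview the first rows
df.info()                                         # dtypes and non-null counts
print(df.isna().sum())                            # missing values per column
df['Age'] = df['Age'].fillna(df['Age'].median())  # fill missing ages
print(df.groupby('City')['Salary'].mean())        # average salary per city
print(df.sort_values('Salary', ascending=False))  # rank rows by salary
```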
Example (Pandas): Most frequently used pandas functions, complete with practical examples using the hypothetical DataFrame named df created above.

Polars is a library for data manipulation. It is built around an OLAP query engine implemented in Rust that uses the Apache Arrow Columnar Format as its memory model. Although the engine is written in Rust, Polars provides Python, Node.js, R, and SQL API interfaces.

The daily_weather.parquet file used in the example below has 27.6 million records.
Performance:
Pandas read_parquet time (total over 10 runs): 66.4953 seconds
Polars eager read_parquet time (total over 10 runs): 33.3602 seconds
Polars lazy scan_parquet time (total over 10 runs): 32.3216 seconds
```python
import pandas as pd
import polars as pl
import numpy as np
import timeit
import os

# --- 1. Create a sample Parquet file ---
file_path = '/content/drive/MyDrive/Colab Notebooks/daily_weather.parquet'

if not os.path.exists(file_path):
    # Create a large DataFrame for a meaningful comparison
    data = {
        'id': np.arange(1000000),
        'value': np.random.rand(1000000),
        'category': np.random.choice(['A', 'B', 'C'], size=1000000)
    }
    df_create = pd.DataFrame(data)
    df_create.to_parquet(file_path, index=False)
    print(f"Created a sample parquet file: {file_path}\n")

# --- 2. Define functions for timeit ---
def read_pandas():
    """Read the parquet file using pandas."""
    # Use the pyarrow engine for better performance and compatibility
    df = pd.read_parquet(file_path, engine='pyarrow')
    return df

def read_polars_eager():
    """Read the parquet file using Polars (eagerly)."""
    df = pl.read_parquet(file_path)
    return df

def read_polars_lazy():
    """Read the parquet file using Polars (lazily, then collect)."""
    df = pl.scan_parquet(file_path).collect()
    return df

# --- 3. Time the operations ---
# timeit.timeit runs each function `number` times and returns the total elapsed time.

# Time Pandas read_parquet
pandas_time = timeit.timeit(read_pandas, number=10)  # Run 10 times
print(f"Pandas read_parquet time (total over 10 runs): {pandas_time:.4f} seconds")

# Time Polars eager read_parquet
polars_eager_time = timeit.timeit(read_polars_eager, number=10)  # Run 10 times
print(f"Polars eager read_parquet time (total over 10 runs): {polars_eager_time:.4f} seconds")

# Time Polars lazy scan_parquet and collect
polars_lazy_time = timeit.timeit(read_polars_lazy, number=10)  # Run 10 times
print(f"Polars lazy scan_parquet time (total over 10 runs): {polars_lazy_time:.4f} seconds")

# --- 4. Clean up (optional) ---
# os.remove(file_path)
# print(f"\nRemoved the sample file: {file_path}")
```
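The raw-read timings above only scratch the surface: the lazy API pays off most when a whole query is expressed up front, so Polars' optimizer can push filters and column selection down into the Parquet scan. A minimal sketch against the sample file created above (columns id, value, and category) might look like:

```python
import polars as pl

file_path = '/content/drive/MyDrive/Colab Notebooks/daily_weather.parquet'

# Build a lazy query: nothing is read or computed yet
lazy_query = (
    pl.scan_parquet(file_path)
      .filter(pl.col('category') == 'A')                 # predicate pushed down into the scan
      .group_by('category')
      .agg(pl.col('value').mean().alias('avg_value'))    # only the needed columns are read
)

# Inspect the optimized plan, then execute it
print(lazy_query.explain())
print(lazy_query.collect())
```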
PySpark is a Python API for Apache Spark, a distributed computing framework for big data processing.
It allows users to write Spark applications using Python, leveraging Spark's capabilities for
parallel processing, fault tolerance, and in-memory computation.
Key Components and Concepts:
```python
import findspark
findspark.init()  # Initializes findspark to locate the Spark installation

import pyspark
from pyspark.sql import SparkSession
from pyspark.sql.functions import when, col

# Create a SparkSession
spark = SparkSession.builder \
    .appName("PySparkJupyterExample") \
    .master("local[*]") \
    .getOrCreate()

print("SparkSession created successfully!")
```
Example (PySpark): Most frequently used PySpark functions, complete
with practical examples using a hypothetical DataFrame named df.
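As a minimal sketch (reusing the SparkSession and the when/col imports from the snippet above, with illustrative data), such a hypothetical df and a few commonly used operations could look like:

```python
# Build a small hypothetical DataFrame with the SparkSession created above
df = spark.createDataFrame(
    [("Alice", 25, 70000), ("Bob", 30, 90000), ("Charlie", 45, 120000)],
    ["Name", "Age", "Salary"],
)

# Common operations: select, filter, derived columns, aggregation
df.select("Name", "Salary").show()
df.filter(col("Age") > 28).show()
df = df.withColumn("Seniority", when(col("Age") >= 40, "Senior").otherwise("Junior"))
df.groupBy("Seniority").avg("Salary").show()
```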