| Feature | Pandas | Polars | PySpark |
|---|---|---|---|
| Dataset Size | Small to Medium (<10GB, fits in memory) | Medium to Large (GBs to 100GBs, single machine) | Massive (100GB to PBs, distributed) |
| Execution | Eager, single-threaded | Eager/Lazy, multi-threaded | Lazy, distributed |
| Performance | Good for small data, struggles with scale | Very fast on single machine, memory efficient | Scalable and fault-tolerant for big data |
| Complexity | Simple, intuitive API, low learning curve | Pythonic API, moderate learning curve | Complex setup and management |
| Primary Use | EDA, prototyping, ML integration | Performance-critical single-machine tasks | Enterprise ETL, large-scale ML, streaming |
Best for: Exploratory Data Analysis (EDA), quick analysis, and prototyping on small to medium
datasets that fit comfortably in your computer's RAM (typically under 10GB). Pandas integrates
seamlessly with popular machine learning libraries such as scikit-learn, as sketched below.
* While powerful, its single-threaded nature and in-memory operations make it a bottleneck for
larger datasets.
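A minimal sketch of that scikit-learn hand-off (the toy columns and model here are illustrative, not taken from the examples that follow): a pandas DataFrame can be passed directly to an estimator.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Illustrative data: a tiny DataFrame fed straight into scikit-learn
toy = pd.DataFrame({
    'years_experience': [1, 3, 5, 7, 9, 11],
    'salary': [45000, 60000, 72000, 83000, 91000, 105000],
})
X_train, X_test, y_train, y_test = train_test_split(
    toy[['years_experience']], toy['salary'], test_size=0.33, random_state=42
)
model = LinearRegression().fit(X_train, y_train)
print(model.predict(X_test))
```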
```python
import pandas as pd
import numpy as np

data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve', 'Frank', 'Grace'],
    'Age': [25, 30, np.nan, 22, 28, 45, 30],
    'City': ['New York', 'Los Angeles', 'Chicago', 'New York', 'New York', 'Los Angeles', np.nan],
    'Salary': [70000, 90000, 60000, 55000, 80000, 120000, 90000]
}
df = pd.DataFrame(data)
```
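For instance, a handful of the everyday operations on this df could look like the following (a quick sketch, not the full set of examples):

```python
# A few frequently used operations on the df defined above
print(df.head())                                  # preview the first rows
df.info()                                         # dtypes and non-null counts
print(df.isna().sum())                            # missing values per column
df['Age'] = df['Age'].fillna(df['Age'].median())  # fill missing ages
print(df.groupby('City')['Salary'].mean())        # average salary per city
print(df.sort_values('Salary', ascending=False))  # rank rows by salary
```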
Example (Pandas): Most frequently used pandas functions, complete with practical examples using the hypothetical DataFrame named df created above.

Polars is a library for data manipulation. It is built around an OLAP query engine implemented in Rust that uses the Apache Arrow Columnar Format as its memory model. Although the engine is written in Rust, Polars provides Python, Node.js, R, and SQL API interfaces.

The daily_weather.parquet file used in the example below has 27.6 million records.
Performance:
Pandas read_parquet time (total over 10 runs): 66.4953 seconds
Polars eager read_parquet time (total over 10 runs): 33.3602 seconds
Polars lazy scan_parquet time (total over 10 runs): 32.3216 seconds
```python
import pandas as pd
import polars as pl
import numpy as np
import timeit
import os

# --- 1. Create a sample Parquet file ---
file_path = '/content/drive/MyDrive/Colab Notebooks/daily_weather.parquet'

if not os.path.exists(file_path):
    # Create a large DataFrame for a meaningful comparison
    data = {
        'id': np.arange(1000000),
        'value': np.random.rand(1000000),
        'category': np.random.choice(['A', 'B', 'C'], size=1000000)
    }
    df_create = pd.DataFrame(data)
    df_create.to_parquet(file_path, index=False)
    print(f"Created a sample parquet file: {file_path}\n")

# --- 2. Define functions for timeit ---
def read_pandas():
    """Read the parquet file using pandas."""
    # Use the pyarrow engine for better performance and compatibility
    df = pd.read_parquet(file_path, engine='pyarrow')
    return df

def read_polars_eager():
    """Read the parquet file using Polars (eagerly)."""
    df = pl.read_parquet(file_path)
    return df

def read_polars_lazy():
    """Read the parquet file using Polars (lazily, then collect)."""
    df = pl.scan_parquet(file_path).collect()
    return df

# --- 3. Time the operations ---
# timeit.timeit runs each function `number` times and returns the total elapsed time.

# Time Pandas read_parquet
pandas_time = timeit.timeit(read_pandas, number=10)  # Run 10 times
print(f"Pandas read_parquet time (total over 10 runs): {pandas_time:.4f} seconds")

# Time Polars eager read_parquet
polars_eager_time = timeit.timeit(read_polars_eager, number=10)  # Run 10 times
print(f"Polars eager read_parquet time (total over 10 runs): {polars_eager_time:.4f} seconds")

# Time Polars lazy scan_parquet and collect
polars_lazy_time = timeit.timeit(read_polars_lazy, number=10)  # Run 10 times
print(f"Polars lazy scan_parquet time (total over 10 runs): {polars_lazy_time:.4f} seconds")

# --- 4. Clean up (optional) ---
# os.remove(file_path)
# print(f"\nRemoved the sample file: {file_path}")
```
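The raw-read timings above only scratch the surface: the lazy API pays off most when a whole query is expressed up front, so Polars' optimizer can push filters and column selection down into the Parquet scan. A minimal sketch against the sample file created above (columns id, value, and category) might look like:

```python
import polars as pl

file_path = '/content/drive/MyDrive/Colab Notebooks/daily_weather.parquet'

# Build a lazy query: nothing is read or computed yet
lazy_query = (
    pl.scan_parquet(file_path)
      .filter(pl.col('category') == 'A')                 # predicate pushed down into the scan
      .group_by('category')
      .agg(pl.col('value').mean().alias('avg_value'))    # only the needed columns are read
)

# Inspect the optimized plan, then execute it
print(lazy_query.explain())
print(lazy_query.collect())
```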
PySpark is a Python API for Apache Spark, a distributed computing framework for big data processing.
It allows users to write Spark applications using Python, leveraging Spark's capabilities for
parallel processing, fault tolerance, and in-memory computation.
Key Components and Concepts:
```python
import findspark
findspark.init()  # Initializes findspark to locate the Spark installation

import pyspark
from pyspark.sql import SparkSession
from pyspark.sql.functions import when, col

# Create a SparkSession
spark = SparkSession.builder \
    .appName("PySparkJupyterExample") \
    .master("local[*]") \
    .getOrCreate()

print("SparkSession created successfully!")
```
Example (PySpark): Most frequently used PySpark functions, complete
with practical examples using a hypothetical DataFrame named df.
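As a minimal sketch (reusing the SparkSession and the when/col imports from the snippet above, with illustrative data), such a hypothetical df and a few commonly used operations could look like:

```python
# Build a small hypothetical DataFrame with the SparkSession created above
df = spark.createDataFrame(
    [("Alice", 25, 70000), ("Bob", 30, 90000), ("Charlie", 45, 120000)],
    ["Name", "Age", "Salary"],
)

# Common operations: select, filter, derived columns, aggregation
df.select("Name", "Salary").show()
df.filter(col("Age") > 28).show()
df = df.withColumn("Seniority", when(col("Age") >= 40, "Senior").otherwise("Junior"))
df.groupBy("Seniority").avg("Salary").show()
```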