Data engineers and scientists primarily interact with structured data that is stored and exchanged using
various file formats. Python, particularly using the pandas library, offers robust and efficient ways to read
these files into memory for analysis and processing.
Here are the 7 most common data formats used for data loading, with Python examples using pandas where applicable:
CSV is arguably the most ubiquitous format due to its simplicity, human-readability, and universal compatibility across platforms. It is the default standard for transferring basic tabular data.
import pandas as pd
# Load data from a CSV file
df_csv = pd.read_csv('sales_data.csv')
print("--- Loaded CSV Data ---")
print(df_csv.head())
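Real-world CSVs rarely load cleanly with the defaults. Below is a small sketch of the read_csv options you will reach for most often (a non-comma delimiter, explicit dtypes, and date parsing at load time); the column names and the in-memory CSV are purely illustrative:

```python
import io
import pandas as pd

# In-memory CSV standing in for a file on disk (illustrative data)
raw = io.StringIO(
    "order_id;order_date;amount\n"
    "1001;2024-01-15;19.99\n"
    "1002;2024-02-03;5.50\n"
)

# Common read_csv options: a custom delimiter, explicit dtypes,
# and automatic date parsing
df_opts = pd.read_csv(
    raw,
    sep=";",
    dtype={"order_id": "int64"},
    parse_dates=["order_date"],
)
print(df_opts.dtypes)
```

Parsing dates at load time saves a separate pd.to_datetime() pass later and makes datetime operations available immediately.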
JSON is a lightweight, semi-structured data interchange format. It is widely used in web APIs and NoSQL databases and is excellent for representing hierarchical or nested data structures.
import pandas as pd
# Load data from a JSON file (handling potential nesting)
# orient='records' often works well for standard JSON lines files
df_json = pd.read_json('user_events.json', orient='records', lines=True)
print("\n--- Loaded JSON Data ---")
print(df_json.head())
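When the JSON is genuinely nested rather than flat records, pd.json_normalize() flattens inner objects into dotted column names. A minimal sketch with a hypothetical API payload:

```python
import pandas as pd

# Hypothetical nested records, e.g. from a web API response;
# json_normalize flattens the inner 'address' dict into
# 'address.city' and 'address.zip' columns
records = [
    {"id": 1, "name": "Ada", "address": {"city": "London", "zip": "NW1"}},
    {"id": 2, "name": "Grace", "address": {"city": "Arlington", "zip": "22201"}},
]
df_flat = pd.json_normalize(records)
print(df_flat.columns.tolist())
```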
Parquet is a highly efficient, columnar storage format optimized for large-scale analytical queries (OLAP).
It offers superior compression and encoding schemes, making it the preferred format in modern big data
ecosystems like Apache Spark. You first need to install one of the two supported engines:
pip install pyarrow
# or
pip install fastparquet
import pandas as pd
# Load data from a Parquet file
# Requires 'pyarrow' or 'fastparquet' engine to be installed
df_parquet = pd.read_parquet('optimized_logs.parquet')
print("\n--- Loaded Parquet Data ---")
print(df_parquet.head())
Despite its limitations in automation compared to open formats, Microsoft Excel files remain a primary source of data input in many business environments, used heavily by non-technical stakeholders.
pip install openpyxl
import pandas as pd
# Load data from an Excel file
# Requires 'openpyxl' library for .xlsx files (`pip install openpyxl`)
df_excel = pd.read_excel('quarterly_report.xlsx', sheet_name='Q3_Sales')
print("\n--- Loaded Excel Data ---")
print(df_excel.head())
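Business workbooks often spread data across several sheets. Passing sheet_name=None loads every sheet at once into a dict keyed by sheet name; a sketch assuming openpyxl is installed, with an in-memory two-sheet workbook standing in for a real file:

```python
import io
import pandas as pd

# Build a small two-sheet workbook in memory (requires openpyxl)
buf = io.BytesIO()
with pd.ExcelWriter(buf, engine="openpyxl") as writer:
    pd.DataFrame({"rev": [100]}).to_excel(writer, sheet_name="Q3_Sales", index=False)
    pd.DataFrame({"rev": [120]}).to_excel(writer, sheet_name="Q4_Sales", index=False)
buf.seek(0)

# sheet_name=None returns {sheet_name: DataFrame} for all sheets
sheets = pd.read_excel(buf, sheet_name=None)
print(list(sheets))
```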
Avro is a row-oriented data serialization format that relies on a schema defined in JSON. It is favored in data streaming platforms like Apache Kafka because individual records can be written and read very efficiently, which is crucial for high-throughput messaging systems. Pandas has no direct read_avro() function, so the records are first read with the avro library and then converted to a DataFrame.
pip install avro-python3
import pandas as pd
from avro.datafile import DataFileReader
from avro.io import DatumReader
# Requires 'avro-python3' (`pip install avro-python3`)
with open("user_data.avro", "rb") as f:
    data_list = list(DataFileReader(f, DatumReader()))
df_avro = pd.DataFrame(data_list)
print("\n--- Loaded Avro Data ---")
print(df_avro.head())
While not a "file format," loading data directly from a relational database management system (RDBMS)
via a connection is a fundamental data loading pattern. The read_sql() function in pandas requires
an active database connection, created either with a SQLAlchemy engine or a database-specific
DB-API driver (psycopg2 for Postgres, the built-in sqlite3, etc.).
import pandas as pd
import sqlite3
# Create a connection to a new in-memory database
conn = sqlite3.connect(':memory:')
# The queried table must already exist; create a small sample one here
conn.execute("CREATE TABLE customers (name TEXT, region TEXT)")
conn.execute("INSERT INTO customers VALUES ('Ann', 'West'), ('Bob', 'East')")
# Load the result of a SQL query into a DataFrame
sql_query = "SELECT * FROM customers WHERE region = 'West';"
df_sql = pd.read_sql_query(sql_query, conn)
print("\n--- Loaded Data from SQL Query ---")
print(df_sql.head())
conn.close()
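With a SQLAlchemy engine, the same pattern works unchanged against Postgres, MySQL, or any other supported backend just by swapping the connection URL. A minimal sketch, assuming SQLAlchemy is installed and using an in-memory SQLite URL and a sample table for illustration:

```python
import pandas as pd
from sqlalchemy import create_engine

# In-memory SQLite stands in for a real connection string such as
# "postgresql+psycopg2://user:pass@host/db"
engine = create_engine("sqlite:///:memory:")

# Seed a sample table so the query below has something to read
pd.DataFrame(
    {"name": ["Ann", "Bob"], "region": ["West", "East"]}
).to_sql("customers", engine, index=False)

df_engine = pd.read_sql("SELECT name FROM customers WHERE region = 'West'", engine)
print(df_engine)
```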
Pickle is Python's native binary serialization format. It allows you to save any Python object (including an entire pandas DataFrame with all its data types intact) to disk and load it back instantly. It is excellent for temporary storage between Python scripts but should generally be avoided for long-term or cross-language data storage due to security and compatibility issues.
import pandas as pd
# Assuming a DataFrame 'df' exists, you can save it:
# df.to_pickle('my_dataframe.pkl')
# Load the saved DataFrame back into a variable
df_pickle = pd.read_pickle('my_dataframe.pkl')
print("\n--- Loaded Pickle Data ---")
print(df_pickle.head())
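To see the dtype preservation in practice, here is a small round-trip sketch (file path and data are illustrative): a datetime column comes back exactly as it was saved, which a CSV round trip would not guarantee without re-parsing.

```python
import os
import tempfile
import pandas as pd

# A DataFrame with a datetime column whose dtype we want to preserve
df_src = pd.DataFrame({
    "when": pd.to_datetime(["2024-01-01", "2024-06-15"]),
    "n": [1, 2],
})

# Save to a temp file and load it back
path = os.path.join(tempfile.mkdtemp(), "df.pkl")
df_src.to_pickle(path)
df_back = pd.read_pickle(path)

print(df_back.dtypes.equals(df_src.dtypes))  # True
```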