Data engineers and scientists primarily interact with structured data that is stored and exchanged using
various file formats. Python, particularly using the pandas library, offers robust and efficient ways to read
these files into memory for analysis and processing.
Here are the 7 most common data formats used for data loading, with Python examples using pandas where applicable:
CSV is arguably the most ubiquitous format due to its simplicity, human-readability, and universal compatibility across platforms. It is the default standard for transferring basic tabular data.
import pandas as pd
# Load data from a CSV file
df_csv = pd.read_csv('sales_data.csv')
print("--- Loaded CSV Data ---")
print(df_csv.head())
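Real-world CSVs rarely load cleanly with the defaults. Below is a small sketch of the read_csv options you will reach for most often (a non-comma delimiter, explicit dtypes, and date parsing at load time); the column names and the in-memory CSV are purely illustrative:

```python
import io
import pandas as pd

# In-memory CSV standing in for a file on disk (illustrative data)
raw = io.StringIO(
    "order_id;order_date;amount\n"
    "1001;2024-01-15;19.99\n"
    "1002;2024-02-03;5.50\n"
)

# Common read_csv options: a custom delimiter, explicit dtypes,
# and automatic date parsing
df_opts = pd.read_csv(
    raw,
    sep=";",
    dtype={"order_id": "int64"},
    parse_dates=["order_date"],
)
print(df_opts.dtypes)
```

Parsing dates at load time saves a separate pd.to_datetime() pass later and makes datetime operations available immediately.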
JSON is a lightweight, semi-structured data interchange format. It is widely used in web APIs and NoSQL databases and is excellent for representing hierarchical or nested data structures.
import pandas as pd
# Load data from a JSON file (handling potential nesting)
# orient='records' often works well for standard JSON lines files
df_json = pd.read_json('user_events.json', orient='records', lines=True)
print("\n--- Loaded JSON Data ---")
print(df_json.head())
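When the JSON is genuinely nested rather than flat records, pd.json_normalize() flattens inner objects into dotted column names. A minimal sketch with a hypothetical API payload:

```python
import pandas as pd

# Hypothetical nested records, e.g. from a web API response;
# json_normalize flattens the inner 'address' dict into
# 'address.city' and 'address.zip' columns
records = [
    {"id": 1, "name": "Ada", "address": {"city": "London", "zip": "NW1"}},
    {"id": 2, "name": "Grace", "address": {"city": "Arlington", "zip": "22201"}},
]
df_flat = pd.json_normalize(records)
print(df_flat.columns.tolist())
```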
Parquet is a highly efficient, columnar storage format optimized for large-scale analytical queries (OLAP).
It offers superior compression and encoding schemes, making it the preferred format in modern big data
ecosystems like Apache Spark. You first need to install one of the two supported engines:
pip install pyarrow
# or
pip install fastparquet
import pandas as pd
# Load data from a Parquet file
# Requires 'pyarrow' or 'fastparquet' engine to be installed
df_parquet = pd.read_parquet('optimized_logs.parquet')
print("\n--- Loaded Parquet Data ---")
print(df_parquet.head())
Despite its limitations in automation compared to open formats, Microsoft Excel files remain a primary source of data input in many business environments, used heavily by non-technical stakeholders.
pip install openpyxl
import pandas as pd
# Load data from an Excel file
# Requires 'openpyxl' library for .xlsx files (`pip install openpyxl`)
df_excel = pd.read_excel('quarterly_report.xlsx', sheet_name='Q3_Sales')
print("\n--- Loaded Excel Data ---")
print(df_excel.head())
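Business workbooks often spread data across several sheets. Passing sheet_name=None loads every sheet at once into a dict keyed by sheet name; a sketch assuming openpyxl is installed, with an in-memory two-sheet workbook standing in for a real file:

```python
import io
import pandas as pd

# Build a small two-sheet workbook in memory (requires openpyxl)
buf = io.BytesIO()
with pd.ExcelWriter(buf, engine="openpyxl") as writer:
    pd.DataFrame({"rev": [100]}).to_excel(writer, sheet_name="Q3_Sales", index=False)
    pd.DataFrame({"rev": [120]}).to_excel(writer, sheet_name="Q4_Sales", index=False)
buf.seek(0)

# sheet_name=None returns {sheet_name: DataFrame} for all sheets
sheets = pd.read_excel(buf, sheet_name=None)
print(list(sheets))
```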
Avro is a row-oriented data serialization format that relies on a schema defined in JSON. It is favored in data streaming platforms like Apache Kafka because individual records can be written and read very efficiently, which is crucial for high-throughput messaging systems. Pandas has no direct read_avro() function, so the records are first read with the avro library and then converted to a DataFrame.
pip install avro-python3
import pandas as pd
from avro.datafile import DataFileReader
from avro.io import DatumReader
# Requires 'avro-python3' (`pip install avro-python3`)
with open("user_data.avro", "rb") as f:
    data_list = list(DataFileReader(f, DatumReader()))
df_avro = pd.DataFrame(data_list)
print("\n--- Loaded Avro Data ---")
print(df_avro.head())
While not a "file format," loading data directly from a relational database management system (RDBMS)
via a connection is a fundamental data loading pattern. The read_sql() function in pandas requires
an active database connection, created either with a SQLAlchemy engine or a database-specific
DB-API driver (psycopg2 for Postgres, the built-in sqlite3, etc.).
import pandas as pd
import sqlite3
# Create a connection to a new in-memory database
conn = sqlite3.connect(':memory:')
# The queried table must already exist; create a small sample one here
conn.execute("CREATE TABLE customers (name TEXT, region TEXT)")
conn.execute("INSERT INTO customers VALUES ('Ann', 'West'), ('Bob', 'East')")
# Load the result of a SQL query into a DataFrame
sql_query = "SELECT * FROM customers WHERE region = 'West';"
df_sql = pd.read_sql_query(sql_query, conn)
print("\n--- Loaded Data from SQL Query ---")
print(df_sql.head())
conn.close()
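With a SQLAlchemy engine, the same pattern works unchanged against Postgres, MySQL, or any other supported backend just by swapping the connection URL. A minimal sketch, assuming SQLAlchemy is installed and using an in-memory SQLite URL and a sample table for illustration:

```python
import pandas as pd
from sqlalchemy import create_engine

# In-memory SQLite stands in for a real connection string such as
# "postgresql+psycopg2://user:pass@host/db"
engine = create_engine("sqlite:///:memory:")

# Seed a sample table so the query below has something to read
pd.DataFrame(
    {"name": ["Ann", "Bob"], "region": ["West", "East"]}
).to_sql("customers", engine, index=False)

df_engine = pd.read_sql("SELECT name FROM customers WHERE region = 'West'", engine)
print(df_engine)
```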
Pickle is Python's native binary serialization format. It allows you to save any Python object (including an entire pandas DataFrame with all its data types intact) to disk and load it back instantly. It is excellent for temporary storage between Python scripts but should generally be avoided for long-term or cross-language data storage due to security and compatibility issues.
import pandas as pd
# Assuming a DataFrame 'df' exists, you can save it:
# df.to_pickle('my_dataframe.pkl')
# Load the saved DataFrame back into a variable
df_pickle = pd.read_pickle('my_dataframe.pkl')
print("\n--- Loaded Pickle Data ---")
print(df_pickle.head())
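To see the dtype preservation in practice, here is a small round-trip sketch (file path and data are illustrative): a datetime column comes back exactly as it was saved, which a CSV round trip would not guarantee without re-parsing.

```python
import os
import tempfile
import pandas as pd

# A DataFrame with a datetime column whose dtype we want to preserve
df_src = pd.DataFrame({
    "when": pd.to_datetime(["2024-01-01", "2024-06-15"]),
    "n": [1, 2],
})

# Save to a temp file and load it back
path = os.path.join(tempfile.mkdtemp(), "df.pkl")
df_src.to_pickle(path)
df_back = pd.read_pickle(path)

print(df_back.dtypes.equals(df_src.dtypes))  # True
```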