How to Read and Write HDF5 Files in Pandas

Pandas offers excellent support for working with HDF5 (Hierarchical Data Format version 5) files, a highly efficient format for storing and retrieving large datasets. Under the hood, pandas relies on the PyTables library, so the tables package must be installed. HDF5 is particularly useful when dealing with data that exceeds the available RAM, because it lets you access portions of the data without loading the entire file into memory.

To read data from an HDF5 file, you use the pd.read_hdf() function. This function takes the file path as its primary argument. You will usually also specify the key parameter, which identifies the specific dataset within the HDF5 file that you want to read; HDF5 files can contain multiple datasets, each identified by a unique key. If the file contains only a single pandas object, the key can be omitted.

For example:

import pandas as pd

# Load the dataset stored under the key 'data_table' into a DataFrame
df = pd.read_hdf('my_data.h5', key='data_table')
print(df)

This code snippet reads the dataset named ‘data_table’ from the file ‘my_data.h5’ and loads it into a Pandas DataFrame.
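Because a single HDF5 file can hold several datasets, it is often handy to see which keys are present before reading. A minimal way to inspect them, assuming the example file 'my_data.h5' from above, is to open the file with pd.HDFStore:

import pandas as pd

# Open the file read-only and list every dataset key it contains
with pd.HDFStore('my_data.h5', mode='r') as store:
    print(store.keys())  # e.g. ['/data_table']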

When writing data to an HDF5 file, you use the df.to_hdf() method. Again, you provide the file path and the key parameter. This method allows you to store a DataFrame as a dataset within an HDF5 file.

Here’s an example:

import pandas as pd

data = {'col1': [1, 2, 3], 'col2': ['A', 'B', 'C']}
df = pd.DataFrame(data)

# Store the DataFrame under the key 'data_table' in 'my_data.h5'
df.to_hdf('my_data.h5', key='data_table')

This code stores the DataFrame df as a dataset named ‘data_table’ in the file ‘my_data.h5’.
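Note that to_hdf uses the 'fixed' format by default, which is fast to write but cannot be appended to or queried. If you plan to add rows later, you can opt into the 'table' format instead. The snippet below is a small sketch of that workflow, reusing the example file and key names from above:

import pandas as pd

df = pd.DataFrame({'col1': [1, 2, 3], 'col2': ['A', 'B', 'C']})

# Write in the appendable 'table' format, replacing any existing file
df.to_hdf('my_data.h5', key='data_table', format='table', mode='w')

# Later, append additional rows to the same dataset
more_rows = pd.DataFrame({'col1': [4, 5], 'col2': ['D', 'E']})
more_rows.to_hdf('my_data.h5', key='data_table', format='table', append=True)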

HDF5 files offer several advantages. They support compression, which can significantly reduce file size. They also allow you to store metadata alongside the data, providing context and documentation. Moreover, they enable random access: when the data is stored in the 'table' format, you can read specific portions of it without reading the entire file.
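To make this concrete, here is a brief sketch (again using the example file and key names from above, which you can change freely): compression is requested through the complevel and complib arguments of to_hdf, simple metadata can be attached through the underlying PyTables attributes, and partial reads of a 'table'-format dataset can be expressed with the where argument of read_hdf:

import pandas as pd

df = pd.DataFrame({'col1': range(10), 'col2': list('ABCDEFGHIJ')})

# Store with zlib compression; format='table' plus data_columns makes 'col1' queryable
df.to_hdf('my_data.h5', key='data_table', format='table', mode='w',
          complevel=9, complib='zlib', data_columns=['col1'])

# Attach a free-form description to the stored dataset as metadata
with pd.HDFStore('my_data.h5') as store:
    store.get_storer('data_table').attrs.description = 'example dataset'

# Read only the rows matching the condition, without loading the whole dataset
subset = pd.read_hdf('my_data.h5', key='data_table', where='col1 > 5')
print(subset)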
