Want faster Pandas code? Use the strategies below to optimize performance, memory usage, and runtime when working with large or complex DataFrames in Python.
1. Use Vectorized Operations
Avoid slow Python loops:
# Bad
for i in df.index:
    df.loc[i, 'new_col'] = df.loc[i, 'old_col'] * 2
Use vectorization instead:
# Good
df['new_col'] = df['old_col'] * 2
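Vectorization also covers conditional logic. A minimal sketch using numpy.where in place of a row-by-row if/else (the flag column and its labels are invented for illustration):
import numpy as np
# Row-wise if/else replaced by a single vectorized call
df['flag'] = np.where(df['old_col'] > 2, 'high', 'low')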
2. Optimize Data Types
Reduce memory usage and speed up processing by converting columns:
df['int_col'] = df['int_col'].astype('Int32')     # nullable 32-bit integers instead of int64
df['cat_col'] = df['cat_col'].astype('category')  # repeated strings stored as compact codes
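To check how much a conversion actually saves, compare the column's footprint in both representations; a quick sketch reusing cat_col from above (the numbers depend on your data):
raw = df['cat_col'].astype('object').memory_usage(deep=True)
packed = df['cat_col'].astype('category').memory_usage(deep=True)
print(f'object: {raw:,} bytes, category: {packed:,} bytes')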
3. Use Efficient Indexes
Set indexes before merging or joining:
df1 = df1.set_index('id')
df2 = df2.set_index('id')
merged = df1.join(df2, how='inner')
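A self-contained version of the same join with toy DataFrames (the columns a and b are invented for illustration):
import pandas as pd
df1 = pd.DataFrame({'id': [1, 2, 3], 'a': [10, 20, 30]})
df2 = pd.DataFrame({'id': [2, 3, 4], 'b': [200, 300, 400]})
# Index-based join avoids re-hashing the key column on every merge
merged = df1.set_index('id').join(df2.set_index('id'), how='inner')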
4. Try pandas.eval() for Arithmetic
Evaluate expressions faster using pandas.eval():
df['sum'] = pd.eval('df.a + df.b')
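DataFrame.eval() uses the same expression engine and lets you reference columns by name; a minimal sketch (the new column name total is chosen just for this example):
# Assign a new column inside the expression; returns a new DataFrame
df = df.eval('total = a + b')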
5. Use Numba or Cython for Heavy Loops
If you must loop, try compiling with Numba:
from numba import njit
@njit  # compiled to machine code on first call
def fast_sum(a, b):
    return a + b
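A plain addition is not where Numba pays off; it shines on explicit loops. A hedged sketch of a loop-heavy function fed with NumPy arrays pulled out of the DataFrame (the function name, its logic, and column a are illustrative):
import numpy as np
from numba import njit

@njit
def running_total(values):
    # Explicit Python loop, compiled to machine code on first call
    out = np.empty(len(values))
    acc = 0.0
    for i in range(len(values)):
        acc += values[i]
        out[i] = acc
    return out

# Pass raw NumPy arrays, not pandas objects, into the jitted function
df['running_total'] = running_total(df['a'].to_numpy())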
6. Parallelize with multiprocessing
from multiprocessing import Pool
import numpy as np
import pandas as pd

def process_chunk(chunk):
    return chunk.assign(sum=chunk['a'] + chunk['b'])

if __name__ == '__main__':  # guard needed where workers are spawned (Windows/macOS)
    chunks = np.array_split(df, 4)
    with Pool(4) as pool:
        df = pd.concat(pool.map(process_chunk, chunks))
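Parallelism pays off only when the per-chunk work outweighs the cost of shipping chunks between processes; for simple arithmetic like the example above, plain vectorization is usually faster, so benchmark both before committing.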
7. Monitor Memory
Use df.info() and df.memory_usage(deep=True) to inspect usage. Limit columns on import:
df = pd.read_csv('data.csv', usecols=['id','a','b'], dtype={'id': 'int32'})
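A quick inspection pattern along those lines:
# Per-column byte counts, including object-dtype string storage
print(df.memory_usage(deep=True))
# Summary with an accurate memory-usage line
df.info(memory_usage='deep')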
8. Benchmark with %timeit
Use Jupyter’s %timeit magic to compare speeds:
%timeit df['a'] + df['b']
%timeit pd.eval('df.a + df.b')
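Outside a notebook, the standard timeit module gives comparable numbers; a minimal sketch (assumes df is defined at module level so pd.eval can resolve it):
import timeit
# Time each approach over 100 runs
t_plain = timeit.timeit(lambda: df['a'] + df['b'], number=100)
t_eval = timeit.timeit(lambda: pd.eval('df.a + df.b'), number=100)
print(f'vectorized: {t_plain:.4f}s  eval: {t_eval:.4f}s')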
Quick Checklist
- Avoid loops – prefer vectorized operations
- Convert columns to efficient dtypes
- Use indexes for joins
- Try pandas.eval() and Numba for speed
- Use multiprocessing for large DataFrames
- Profile with %timeit and monitor memory
Example Workflow
import pandas as pd

# Load only the needed columns with compact dtypes
df = pd.read_csv('data.csv', usecols=['a', 'b'], dtype={'a': 'float32', 'b': 'float32'})
df['sum'] = pd.eval('df.a + df.b')