The first step in most data science projects is importing your data. One of the file formats data scientists use most often is the comma-separated values (CSV) file. In this tutorial, you’ll see how to read CSV files in Pandas and how to use the read_csv() function to deal with common issues when importing data.
How to read a CSV file in Pandas
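import pandas as pd

# Read the file into a DataFrame
df = pd.read_csv('my_data.csv')

# to_string() renders every row and column, so use it only on small files
print(df.to_string())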
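To try read_csv() without a file on disk, the same call can be pointed at an in-memory text buffer. The following is a minimal sketch; the sample rows and column names are made up for illustration and stand in for my_data.csv.

```python
import io
import pandas as pd

# Hypothetical sample data standing in for my_data.csv
csv_text = "name,age,city\nAlice,30,Paris\nBob,25,Berlin\n"

# read_csv accepts a file path or any file-like object
df = pd.read_csv(io.StringIO(csv_text))

print(df.shape)    # (number of rows, number of columns)
print(df.dtypes)   # column types inferred by pandas
print(df.head())   # first rows only, which is safer than to_string() on large files
```

On real data, head() and shape are usually a quicker sanity check than printing the whole DataFrame.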
How to deal with headers
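The most common problem when importing files into a DataFrame is the header row.
The header parameter of read_csv() controls which row supplies the column names. By default, the first row of the file is used.
If your data does not contain a header row, set header to None. The first row is then read as data, and the columns are numbered starting from 0.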
import pandas as pd

df = pd.read_csv('my_data.csv', header=None)
print(df.to_string())
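Numbered columns are rarely what you want in the end, so header=None is often combined with the names parameter to supply your own labels. Here is a small sketch of both variants, using made-up headerless data in an in-memory buffer instead of a real file.

```python
import io
import pandas as pd

# Hypothetical data with no header row
csv_text = "Alice,30\nBob,25\n"

# header=None: the first line is treated as data, columns are numbered
df_numbered = pd.read_csv(io.StringIO(csv_text), header=None)

# names=[...]: provide column labels yourself
df_named = pd.read_csv(io.StringIO(csv_text), header=None, names=["name", "age"])

print(df_numbered.columns.tolist())  # [0, 1]
print(df_named.columns.tolist())     # ['name', 'age']
```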
Specifying the delimiter
By default, the read_csv function assumes that the delimiter is a comma (,). If your CSV file uses a different delimiter, you can specify it using the delimiter or sep parameter. Here’s an example:
import pandas as pd

df = pd.read_csv("my_data.csv", delimiter=";")
This code will create a Pandas DataFrame df from the my_data.csv file using a semicolon (;) as the delimiter.
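A quick way to spot a wrong delimiter is that the whole line ends up in a single column. The sketch below uses made-up semicolon- and tab-separated data in memory to show that symptom and the fix with sep.

```python
import io
import pandas as pd

# Hypothetical sample data in two different delimiter styles
semicolon_text = "name;score\nAlice;9.5\nBob;7.0\n"
tab_text = "name\tscore\nAlice\t9.5\nBob\t7.0\n"

# Wrong delimiter: everything lands in one column
df_wrong = pd.read_csv(io.StringIO(semicolon_text))
print(df_wrong.columns.tolist())  # ['name;score']

# Correct delimiter for each file
df_semicolon = pd.read_csv(io.StringIO(semicolon_text), sep=";")
df_tab = pd.read_csv(io.StringIO(tab_text), sep="\t")

# Both reads produce the same two-column DataFrame
print(df_semicolon.equals(df_tab))
```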
Specifying the index
By default, the read_csv function does not use any column of the CSV file as the index. Instead, it creates a default integer index starting at 0. If you want to use one of the columns as the index, you can specify it using the index_col parameter, either by position or by column name. Here’s an example:
import pandas as pd

df = pd.read_csv("my_data.csv", index_col=0)
This code will create a Pandas DataFrame df from the my_data.csv file with the first column as the index. A DataFrame always has an index, so setting index_col to None (the default) simply keeps the automatic integer index:
import pandas as pd

df = pd.read_csv("my_data.csv", index_col=None)
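To make the effect of index_col concrete, the following sketch reads the same made-up data three ways: with no index_col, with a column position, and with a column name. The id, name and age columns are hypothetical.

```python
import io
import pandas as pd

# Hypothetical sample data
csv_text = "id,name,age\n101,Alice,30\n102,Bob,25\n"

df_default = pd.read_csv(io.StringIO(csv_text))                     # automatic integer index
df_by_pos = pd.read_csv(io.StringIO(csv_text), index_col=0)         # first column as index
df_by_name = pd.read_csv(io.StringIO(csv_text), index_col="name")   # index chosen by label

print(df_default.index.tolist())      # [0, 1]
print(df_by_pos.index.tolist())       # [101, 102]
print(df_by_name.loc["Bob", "age"])   # label-based lookup through the index
```

A column used as the index is removed from the regular columns, so df_by_pos has only name and age left as columns.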
Handling missing values
CSV files often contain missing values, typically as empty cells. By default, the read_csv function treats empty fields and a set of common placeholder strings such as NA, N/A, NaN and null as missing, and stores them as NaN (Not a Number) in the DataFrame. If your CSV file uses a different representation for missing values, you can add it using the na_values parameter. Here’s an example:
import pandas as pd

df = pd.read_csv("my_data.csv", na_values=["N/A", "?"])
This code will create a Pandas DataFrame df from the my_data.csv file, treating N/A and ? as missing values. N/A is already on the default list, so only ? is new here. Listing N/A explicitly is harmless but not required.
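The difference na_values makes is easiest to see by counting missing cells per column. This is a small sketch with made-up data, in which "?" and "missing" are assumed to be the file's custom placeholders.

```python
import io
import pandas as pd

# Hypothetical data: one empty cell, one "?" and one "missing" placeholder
csv_text = "name,age,city\nAlice,,Paris\nBob,?,missing\nCara,41,Oslo\n"

# Defaults only: the empty cell is NaN, but "?" and "missing" stay as text
df_default = pd.read_csv(io.StringIO(csv_text))

# Custom placeholders are added to the default list
df_custom = pd.read_csv(io.StringIO(csv_text), na_values=["?", "missing"])

print(df_default.isna().sum())  # missing cells per column with defaults
print(df_custom.isna().sum())   # missing cells per column with na_values
print(df_custom["age"].dtype)   # age is numeric once "?" is treated as missing
```

Without na_values, the stray "?" also forces the age column to be read as text rather than numbers, which is a common source of surprises later on.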
Specifying the encoding
By default, the read_csv function assumes that the CSV file is encoded using UTF-8. If your CSV file uses a different encoding, you can specify it using the encoding parameter. Here’s an example:
import pandas as pd

df = pd.read_csv("my_data.csv", encoding="ISO-8859-1")
This code will create a Pandas DataFrame df from the my_data.csv file, using the ISO-8859-1 encoding.
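An encoding mismatch usually shows up as a UnicodeDecodeError. The sketch below reproduces this with made-up ISO-8859-1 bytes held in memory (a stand-in for a real file) and then reads them correctly by passing encoding.

```python
import io
import pandas as pd

# Hypothetical data with accented characters, encoded as ISO-8859-1
raw_bytes = "city,country\nZürich,Schweiz\nMálaga,España\n".encode("ISO-8859-1")

# Reading with the default UTF-8 encoding fails on the accented bytes
try:
    pd.read_csv(io.BytesIO(raw_bytes))
except UnicodeDecodeError as err:
    print("UTF-8 decoding failed:", err.reason)

# Passing the matching encoding decodes the text correctly
df = pd.read_csv(io.BytesIO(raw_bytes), encoding="ISO-8859-1")
print(df)
```

If you do not know a file's encoding, ISO-8859-1 will decode any byte sequence without raising an error, but the characters may still be wrong, so it is worth checking a few non-ASCII values after loading.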