
How do I read a large CSV file with pandas?


Pandas offers several ways to manage memory usage and maintain processing speed when reading large CSV files. The following are some commonly used strategies:

1. Using read_csv Parameters

Chunked Reading

For very large files, use the chunksize parameter to read the file in chunks. This allows you to process smaller data segments incrementally, avoiding loading the entire file into memory at once.

```python
import pandas as pd

chunk_size = 10000  # number of rows per chunk
chunks = pd.read_csv('large_file.csv', chunksize=chunk_size)
for chunk in chunks:
    process(chunk)  # process each chunk (process is your own function)
```
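As a concrete illustration, here is a minimal sketch of aggregating across chunks without ever holding the whole dataset in memory. An in-memory `StringIO` buffer stands in for a large file on disk, and the `value` column is hypothetical; with a real file you would pass the path instead:

```python
import io
import pandas as pd

# Small in-memory CSV standing in for a large file on disk
csv_data = io.StringIO("value\n1\n2\n3\n4\n5\n")

total = 0
for chunk in pd.read_csv(csv_data, chunksize=2):
    # Aggregate each chunk immediately instead of keeping all rows
    total += chunk['value'].sum()

print(total)  # 15
```

Because each chunk is reduced to a running total before the next one is read, peak memory stays proportional to the chunk size rather than the file size.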

Reading Only Specific Columns

If you only need specific columns, using the usecols parameter can significantly reduce memory usage.

```python
columns = ['col1', 'col2', 'col3']  # columns to read
df = pd.read_csv('large_file.csv', usecols=columns)
```
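A quick self-contained check of this behavior, again using a `StringIO` buffer with made-up column names as a stand-in for a file on disk:

```python
import io
import pandas as pd

csv_data = io.StringIO("col1,col2,col3,col4\n1,2,3,4\n5,6,7,8\n")

# Only the listed columns are parsed; the others never reach memory
df = pd.read_csv(csv_data, usecols=['col1', 'col3'])
print(list(df.columns))  # ['col1', 'col3']
```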

2. Data Type Optimization

Specifying memory-efficient data types at read time can reduce memory consumption. For example, if you know the values fit in a smaller range, use int32 or float32 instead of the default int64 or float64; low-cardinality string columns can be stored as category.

```python
dtypes = {'col1': 'int32', 'col2': 'float32', 'col3': 'category'}
df = pd.read_csv('large_file.csv', dtype=dtypes)
```
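You can verify the saving with `Series.memory_usage`. This sketch builds a small frame in memory (the column name and data are hypothetical) and compares the default int64 against int32, which uses exactly half the bytes per value:

```python
import pandas as pd

df64 = pd.DataFrame({'col1': range(1000)})   # default int64: 8 bytes/value
df32 = df64.astype({'col1': 'int32'})        # int32: 4 bytes/value

bytes64 = df64['col1'].memory_usage(index=False)
bytes32 = df32['col1'].memory_usage(index=False)
print(bytes64, bytes32)  # 8000 4000
```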

3. Row-by-Row Reading

Reading the file line by line with plain Python bypasses pandas entirely. Although this is usually slower, it keeps memory usage minimal, which is useful for initial data exploration or for files too large to load as a DataFrame.

```python
with open('large_file.csv') as file:
    for line in file:
        process(line)  # process is your own function
```
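One caveat with raw line splitting is quoted fields containing commas. The standard library's `csv.reader` handles quoting correctly while still streaming row by row; here is a minimal sketch with hypothetical data in a `StringIO` buffer:

```python
import csv
import io

csv_text = io.StringIO('name,amount\n"Smith, Jane",10\nBob,20\n')

rows = []
reader = csv.reader(csv_text)
header = next(reader)      # consume the header row
for row in reader:
    rows.append(row)       # each row is a list of strings

print(rows)  # [['Smith, Jane', '10'], ['Bob', '20']]
```

Note the quoted comma in "Smith, Jane" stays inside one field, which a naive `line.split(',')` would break.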

4. Using Dask or Other Libraries

For very large datasets, Pandas might not be the optimal solution. Consider libraries like Dask, which is designed for parallel and out-of-core computing and can process datasets larger than memory.

```python
import dask.dataframe as dd

df = dd.read_csv('large_file.csv')
```

Example Application Scenario

Suppose you work at an e-commerce company and need to process a large CSV file containing millions of orders. Each order has multiple attributes, but you only need OrderID, UserID, and Amount. You can use read_csv with usecols and dtype to optimize the reading process:

```python
columns = ['OrderID', 'UserID', 'Amount']
dtypes = {'OrderID': 'int32', 'UserID': 'int32', 'Amount': 'float32'}
df = pd.read_csv('orders.csv', usecols=columns, dtype=dtypes)
```

This approach significantly reduces memory usage and improves processing speed.

Answered July 20, 2024, 14:46
