Chunk Read A Large File in Python
When a file is in the big-boy league, it needs to be handled chunk by chunk
Background
Recently, I needed to write a Python script for a data-wrangling task on a well-formatted JSON file. The data was benign and the wrangling not too difficult, but the size of the file was significant: 450 MB. Not too ridiculous, but surely in the big-boy league. I tried to open it in VS Code, and that absolutely killed my laptop. Since all the data science I had done before was based on toy-sized data, I had no practical experience handling something this big. I knew I had to learn to read the JSON file chunk by chunk.
Solution
One tutorial lists quite a few techniques to read a big file in chunks. Essentially, they fall into two categories: pass a chunk size to the built-in read function, or read the file line by line. The first method dictates how many characters (or bytes, if the file is opened in binary mode) to read in one go, whereas the second leverages the fact that a Python file object is a lazy iterator, so only the current line needs to be held in memory. Both methods achieve chunk reading, but in my use case with a well-formatted JSON file, the second is clearly better.
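As a minimal illustration of the two categories (the file name sample.txt and the chunk size of 8 are throwaway choices for the demo, not from any particular tutorial):

```python
# Build a small sample file so the snippet is self-contained.
with open('sample.txt', 'w') as f:
    f.write('line one\nline two\nline three\n')

# Category 1: pass a chunk size to read(); each call returns at most
# that many characters (bytes, in binary mode).
pieces = []
with open('sample.txt', 'r') as f:
    while True:
        piece = f.read(8)
        if not piece:  # an empty string signals end of file
            break
        pieces.append(piece)

# Category 2: iterate line by line; a file object is a lazy iterator,
# so only the current line is held in memory at any time.
lines = []
with open('sample.txt', 'r') as f:
    for line in f:
        lines.append(line)
```

Both loops visit the entire file; they differ only in how the content is sliced along the way.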
Then, I found this SO answer, which cleverly uses a generator to handle chunk read. It also shows the logic for determining on which line to deliver the current chunk. Based on that code, I created the following function to support generic chunk read.
This chunk_read function delivers a chunk once max_sentinel lines containing the sentinel pattern have been encountered. In my use case, since the JSON file is a list of dicts, I can use the pattern of the first key-value line in each dict as the sentinel. Then, I just need to decide how many dicts I feel comfortable ingesting in one chunk. Let's see some examples of chunk_read in action.
Example 1: Direct Sentinel Match
Suppose our large JSON file test_1.json looks like this
[
{
"ID": 1,
"Name": "FOO-001",
"Os": "Bar1"
},
{
"ID": 2,
"Name": "FOO-002",
"Os": "Bar2"
},
{
"ID": 3,
"Name": "FOO-003",
"Os": "Bar3"
},
{
"ID": 4,
"Name": "FOO-004",
"Os": "Bar4"
},
{
"ID": 5,
"Name": "FOO-005",
"Os": "Bar5"
}
]
Use the chunk_read function as shown below
with open('test_1.json', 'r') as f_obj:
    for chunk in chunk_read(f_obj, 'ID', 3):
        print('new line:', chunk)
we have output
new line: [{"ID": 1,"Name": "FOO-001","Os": "Bar1"},{"ID": 2,"Name": "FOO-002","Os": "Bar2"},{"ID": 3,"Name": "FOO-003","Os": "Bar3"},{
new line: "ID": 4,"Name": "FOO-004","Os": "Bar4"},{"ID": 5,"Name": "FOO-005","Os": "Bar5"}]
As expected, the first chunk contains three dicts, whereas the second contains only two. Of course, we cannot directly convert each chunk to a JSON object, since it is not valid JSON on its own. Yet, fixing it is not difficult; the only broken parts of the string are at the beginning and/or the end.
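As a sketch of that repair step (the helper name parse_chunk is mine, not from the original post, and it assumes the source file is a JSON list of flat dicts, as in these examples):

```python
import json

def parse_chunk(chunk):
    """Turn one raw chunk from chunk_read into a list of dicts.
    Assumes the source file is a JSON list of flat dicts."""
    s = chunk.strip()
    # Trim the enclosing list brackets on the first/last chunk.
    if s.startswith('['):
        s = s[1:]
    if s.endswith(']'):
        s = s[:-1]
    # A non-final chunk ends with a dangling ',{' opening the next dict.
    if s.endswith(',{'):
        s = s[:-2]
    # A non-first chunk starts mid-dict, missing its opening brace.
    if not s.startswith('{'):
        s = '{' + s
    # Re-wrap as a JSON list and parse.
    return json.loads('[' + s + ']')
```

Since the breakage can only occur at the two ends of the string, four edge checks are enough to make each chunk parseable.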
Example 2: Regex Match
Suppose another big JSON file test_2.json looks like this
[
{
"IP_Address": "0.0.0.0",
"ID": 1,
"Name": "FOO-001",
"Os": "Bar"
},
{
"IP-Address": "0.0.0.1",
"ID": 2,
"Name": "FOO-002",
"Os": "Bar"
},
{
"ipaddress": "0.0.0.2",
"ID": 3,
"Name": "FOO-001",
"Os": "Bar"
},
{
"ipAddress": "0.0.0.3",
"ID": 4,
"Name": "FOO-002",
"Os": "Bar"
},
{
"Ip_Address": "0.0.0.4",
"ID": 5,
"Name": "FOO-001",
"Os": "Bar"
}
]
Notice that the first key-value lines differ across the dicts, yet they are all pattern-matchable via a regex sentinel. We can chunk_read the file as follows
with open('test_2.json', 'r') as f_obj:
    for chunk in chunk_read(f_obj, '(?i)ip.?address', 2):
        print('new line:', chunk)
The regex sentinel is (?i)ip.?address, which matches the first key-value line of every dict. We have output
new line: [{"IP_Address": "0.0.0.0","ID": 1,"Name": "FOO-001","Os": "Bar"},{"IP-Address": "0.0.0.1","ID": 2,"Name": "FOO-002","Os": "Bar"},{
new line: "ipaddress": "0.0.0.2","ID": 3,"Name": "FOO-001","Os": "Bar"},{"ipAddress": "0.0.0.3","ID": 4,"Name": "FOO-002","Os": "Bar"},{
new line: "Ip_Address": "0.0.0.4","ID": 5,"Name": "FOO-001","Os": "Bar"}]
Since our chunk size is 2 now, each chunk contains two dicts, except for the last one. Again, further string formatting is needed before these chunks can be properly parsed.
Closing Thoughts
Chunk-reading a large data file is a good skill for a data engineer to have. Although there are many ways and tools to do it, knowing the principles behind it always helps. I hope that by sharing my approach as a template, others will be able to adapt it to their use cases and readily create their own version of chunk read.