Chunk Read A Large File in Python
When a file is in the big-boy league, it needs to be handled chunk by chunk
Background
Recently, I needed to write a Python script for a data-wrangling task on a well-formatted JSON file. The data was benign and the wrangling not too difficult, but the size of the file was significant: 450 MB. Not too ridiculous, but surely in the big-boy league. I tried to open it in VS Code, and that absolutely killed my laptop. Since all the data science I had done before was based on toy-sized data, I had no practical experience handling something this big. I knew I had to learn to read the JSON file chunk by chunk.
Solution
One tutorial lists quite a few techniques to read a big file in chunks. Essentially, they fall into two categories: pass a chunk size to the built-in read function, or read the file line by line. The first method dictates how many characters (or bytes, if the file is opened in binary mode) to read in one go, whereas the second leverages the fact that a Python file object is a lazy iterator, so only the current line needs to be held in memory. Both methods achieve chunk reading, but in my use case with a well-formatted JSON file, the second is clearly better.
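As a minimal illustration of the two categories (the file name sample.txt and the chunk size of 8 are throwaway choices for the demo, not from any particular tutorial):

```python
# Build a small sample file so the snippet is self-contained.
with open('sample.txt', 'w') as f:
    f.write('line one\nline two\nline three\n')

# Category 1: pass a chunk size to read(); each call returns at most
# that many characters (bytes, in binary mode).
pieces = []
with open('sample.txt', 'r') as f:
    while True:
        piece = f.read(8)
        if not piece:  # an empty string signals end of file
            break
        pieces.append(piece)

# Category 2: iterate line by line; a file object is a lazy iterator,
# so only the current line is held in memory at any time.
lines = []
with open('sample.txt', 'r') as f:
    for line in f:
        lines.append(line)
```

Both loops visit the entire file; they differ only in how the content is sliced along the way.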
Then, I found this SO answer, which cleverly uses a generator to handle chunk read. It also shows the logic for determining on which line to deliver the current chunk. Based on that code, I created the following function to support generic chunk read.
This chunk_read function delivers a chunk once max_sentinel lines containing the sentinel pattern have been encountered. In my use case, since the JSON file is a list of dicts, I can use the pattern of the first key-value line in each dict as the sentinel. Then, I just need to decide how many dicts I feel comfortable ingesting in one chunk. Let's see some examples of chunk_read in action.
Example 1: Direct Sentinel Match
Suppose our large JSON file test_1.json looks like this
[
{
"ID": 1,
"Name": "FOO-001",
"Os": "Bar1"
},
{
"ID": 2,
"Name": "FOO-002",
"Os": "Bar2"
},
{
"ID": 3,
"Name": "FOO-003",
"Os": "Bar3"
},
{
"ID": 4,
"Name": "FOO-004",
"Os": "Bar4"
},
{
"ID": 5,
"Name": "FOO-005",
"Os": "Bar5"
}
]
Use the chunk_read function as shown below
with open('test_1.json', 'r') as f_obj:
    for chunk in chunk_read(f_obj, 'ID', 3):
        print('new line:', chunk)
we have output
new line: [{"ID": 1,"Name": "FOO-001","Os": "Bar1"},{"ID": 2,"Name": "FOO-002","Os": "Bar2"},{"ID": 3,"Name": "FOO-003","Os": "Bar3"},{
new line: "ID": 4,"Name": "FOO-004","Os": "Bar4"},{"ID": 5,"Name": "FOO-005","Os": "Bar5"}]
As expected, the first chunk contains three dicts, whereas the second contains only two. Of course, we cannot directly convert each chunk to a JSON object, since it is not valid JSON on its own. Yet, fixing it is not difficult; the only broken parts of the string are at the beginning and/or the end.
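As a sketch of that repair step (the helper name parse_chunk is mine, not from the original post, and it assumes the source file is a JSON list of flat dicts, as in these examples):

```python
import json

def parse_chunk(chunk):
    """Turn one raw chunk from chunk_read into a list of dicts.
    Assumes the source file is a JSON list of flat dicts."""
    s = chunk.strip()
    # Trim the enclosing list brackets on the first/last chunk.
    if s.startswith('['):
        s = s[1:]
    if s.endswith(']'):
        s = s[:-1]
    # A non-final chunk ends with a dangling ',{' opening the next dict.
    if s.endswith(',{'):
        s = s[:-2]
    # A non-first chunk starts mid-dict, missing its opening brace.
    if not s.startswith('{'):
        s = '{' + s
    # Re-wrap as a JSON list and parse.
    return json.loads('[' + s + ']')
```

Since the breakage can only occur at the two ends of the string, four edge checks are enough to make each chunk parseable.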
Example 2: Regex Match
Suppose another big JSON file test_2.json looks like this
[
{
"IP_Address": "0.0.0.0",
"ID": 1,
"Name": "FOO-001",
"Os": "Bar"
},
{
"IP-Address": "0.0.0.1",
"ID": 2,
"Name": "FOO-002",
"Os": "Bar"
},
{
"ipaddress": "0.0.0.2",
"ID": 3,
"Name": "FOO-001",
"Os": "Bar"
},
{
"ipAddress": "0.0.0.3",
"ID": 4,
"Name": "FOO-002",
"Os": "Bar"
},
{
"Ip_Address": "0.0.0.4",
"ID": 5,
"Name": "FOO-001",
"Os": "Bar"
}
]
Notice that the first key-value lines differ across the dicts, yet they are all pattern-matchable via a regex sentinel. We can chunk_read the file as follows
with open('test_2.json', 'r') as f_obj:
    for chunk in chunk_read(f_obj, '(?i)ip.?address', 2):
        print('new line:', chunk)
The regex sentinel is (?i)ip.?address, which matches the first key-value line of every dict. We have output
new line: [{"IP_Address": "0.0.0.0","ID": 1,"Name": "FOO-001","Os": "Bar"},{"IP-Address": "0.0.0.1","ID": 2,"Name": "FOO-002","Os": "Bar"},{
new line: "ipaddress": "0.0.0.2","ID": 3,"Name": "FOO-001","Os": "Bar"},{"ipAddress": "0.0.0.3","ID": 4,"Name": "FOO-002","Os": "Bar"},{
new line: "Ip_Address": "0.0.0.4","ID": 5,"Name": "FOO-001","Os": "Bar"}]
Since our chunk size is 2 now, each chunk contains two dicts, except for the last one. Again, further string formatting is needed before these chunks can be properly parsed.
Closing Thoughts
Chunk-reading a large data file is a good skill for a data engineer to have. Although there are many ways and tools to do it, knowing the principles behind it always helps. I hope that by sharing my approach as a template, others will be able to adapt it to their use cases and readily create their own version of chunk read.