Using Ruby Enumerators for streaming big gzipped CSV files from FTP

Jan Bajena
2 min read · Jan 29, 2020

Today I’d like to share with you a solution to a problem that gave me a headache recently, so you can spend your time on something more interesting (have a ☕ or something).

The task

The goal was to write a rake task that’d:
1. Fetch a big (1GB gzipped) CSV file from FTP
2. Ungzip it (18GB ungzipped)
3. Parse each row
4. Insert it into an SQL database

I’ll describe my approach to steps 1–3. Point 4 is material for another story 😅

The solution

Of course, loading an 18-gigabyte file into memory wasn’t the best idea, so I decided to try streaming the file in chunks from FTP, ungzipping each chunk and then collecting the full lines. This should be much more merciful to RAM and would, for example, allow reading a few records from the file without having to download the whole thing first.

Fortunately this approach is possible, because:
1. The Net::FTP library supports streaming a file in chunks (the getbinaryfile method).
2. The Zlib library supports buffered ungzipping (both building blocks are demonstrated in the snippet below).
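
Here’s a minimal, self-contained demonstration of those two primitives on their own. The host, credentials, file name and chunk size are all placeholders:

```ruby
require "net/ftp"
require "zlib"

# MAX_WBITS + 32 tells zlib to expect and auto-detect a gzip header
inflater = Zlib::Inflate.new(Zlib::MAX_WBITS + 32)

Net::FTP.open("ftp.example.com", "user", "secret") do |ftp|
  # Passing nil as the local file name makes getbinaryfile yield each
  # downloaded chunk to the block instead of saving the file to disk.
  ftp.getbinaryfile("huge.csv.gz", nil, 64 * 1024) do |chunk|
    print inflater.inflate(chunk) # decompressed CSV text, as it arrives
  end
end

inflater.close
```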

OK, so how do I glue it all together in Ruby? It turns out that Enumerator and Enumerator::Lazy come in handy here…

The Loader class (sketched below) uses a set of nested Enumerators:
- stream_file_from_ftp — loads a chunk of the file from FTP and yields it to the next enumerator.
- ungzip — gathers the loaded FTP chunks into decompressible gzip blocks, decompresses them and yields the decoded chunks to the next enumerator.
- split_lines — collects the decompressed gzip blocks, splits them on the newline character and yields full lines to the next enumerator.
- main_enum — preprocesses each loaded CSV line by removing quotation marks and splitting on commas.
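
Here’s a minimal sketch of how these pieces can fit together. The class shape, chunk size and the naive quote-stripping line parsing are my assumptions; getbinaryfile, Zlib::Inflate and the lazy call are the parts the approach actually relies on:

```ruby
require "net/ftp"
require "zlib"

class Loader
  CHUNK_SIZE = 1024 * 1024 # 1 MB per FTP read; an arbitrary choice

  def initialize(host:, user:, password:, path:)
    @host = host
    @user = user
    @password = password
    @path = path
  end

  # A lazy enumerator of parsed rows (arrays of strings).
  def each_row
    main_enum
  end

  private

  # Yields raw gzipped chunks as they arrive from the FTP server.
  def stream_file_from_ftp
    Enumerator.new do |yielder|
      Net::FTP.open(@host, @user, @password) do |ftp|
        # nil as the local file name: yield chunks instead of saving to disk
        ftp.getbinaryfile(@path, nil, CHUNK_SIZE) { |chunk| yielder << chunk }
      end
    end
  end

  # Incrementally inflates the gzipped chunks and yields decoded text.
  def ungzip
    Enumerator.new do |yielder|
      # MAX_WBITS + 32 tells zlib to auto-detect the gzip header
      inflater = Zlib::Inflate.new(Zlib::MAX_WBITS + 32)
      stream_file_from_ftp.each { |chunk| yielder << inflater.inflate(chunk) }
      inflater.close
    end
  end

  # Buffers decoded text and yields complete lines, one at a time.
  def split_lines
    Enumerator.new do |yielder|
      buffer = +""
      ungzip.each do |decoded|
        buffer << decoded
        while (newline_at = buffer.index("\n"))
          yielder << buffer.slice!(0..newline_at).chomp
        end
      end
      yielder << buffer unless buffer.empty? # last line without a newline
    end
  end

  # Naive CSV preprocessing: strip quotes, split on commas. The lazy call
  # is what keeps the pipeline from materializing every row at once.
  def main_enum
    split_lines.lazy.map { |line| line.delete('"').split(",") }
  end
end
```

With this in place, a call like loader.each_row.first(5) downloads and inflates only as many FTP chunks as it takes to produce five rows.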

Note that we called lazy on the split_lines enumerator. This is the most important part of the solution: it allows, for example, loading the first row without having to download and ungzip the full file, only the part necessary for that one row. Without it, our program would load all rows into memory just to return the first of them.
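
You can see the difference in isolation on a toy infinite enumerator rather than the FTP stream:

```ruby
numbers = Enumerator.new do |yielder|
  n = 0
  loop { yielder << (n += 1) } # endless source, like a stream of chunks
end

numbers.lazy.map { |n| n * 2 }.first(3) # => [2, 4, 6]; only 3 items produced
# numbers.map { |n| n * 2 }.first(3)    # never returns: the eager map tries
#                                       # to realize the whole enumeration first
```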

Summary

I recommend remembering this pattern of combining Enumerators with the lazy enumeration method. It will come in handy any time you write code that processes streamable data.

But wait, there’s more…

The solution I presented has one drawback: it keeps the TCP socket open while consecutive rows are being processed. This may sometimes cause a socket timeout when processing really huge files.

In this case I’d suggest downloading the file to disk before processing it. This can be achieved using Ruby’s OpenURI library, which supports the FTP protocol.

Feel free to use a sketch along these lines (the FTP URL, credentials and paths below are placeholders):
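
```ruby
require "open-uri"
require "zlib"

local_path = "/tmp/huge.csv.gz"

# OpenURI understands ftp:// URLs, so the download is a single copy_stream.
URI.open("ftp://user:secret@ftp.example.com/huge.csv.gz") do |remote|
  File.open(local_path, "wb") { |file| IO.copy_stream(remote, file) }
end

# With the file on disk there is no FTP socket left to time out, and
# Zlib::GzipReader can still stream it line by line.
Zlib::GzipReader.open(local_path) do |gz|
  rows = gz.each_line.lazy.map { |line| line.chomp.delete('"').split(",") }
  rows.each { |row| p row } # or hand each row to the database insert step
end
```

The trade-off is the extra disk space for the 1 GB archive, but the lazy line-by-line processing stays exactly the same.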

