Using Ruby Enumerators for streaming big gzipped CSV files from FTP
Today I’d like to share a solution to a problem that gave me a headache recently, so you can spend your time on something more interesting (have a ☕ or something).
The goal was to write a rake task that’d:
1. Fetch a big (1GB gzipped) CSV file from FTP
2. Ungzip it (18GB ungzipped)
3. Parse each row
4. Insert it into an SQL database
I’ll describe my approach to steps 1–3. Step 4 is material for another story 😅
Of course, loading an 18-gigabyte file into memory wasn’t the best idea, so I decided to try streaming the file in chunks from FTP, ungzipping each chunk and then collecting the full lines. This should be easier on RAM and would, for example, allow reading a few records from the file without having to load the whole file first.
The Loader class uses a set of nested Enumerators:
stream_file_from_ftp: loads a chunk of the file from FTP and yields it to the next enumerator.
ungzip: gathers the loaded FTP chunks into decompressible gzip blocks, decompresses them and yields the decoded chunks to the next enumerator.
split_lines: collects decompressed gzip blocks, splits them on the newline character and yields full lines to the next enumerator.
main_enum: preprocesses each loaded CSV line by removing quotation marks and splitting on commas.
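The chain described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the actual Loader class: the FTP step is replaced by a hypothetical stream_chunks helper that reads any IO in fixed-size chunks (with Net::FTP the chunks would instead arrive in the block passed to getbinaryfile), and the gzip step assumes a single-member gzip stream.

```ruby
require 'zlib'
require 'stringio'

# Stand-in for stream_file_from_ftp (hypothetical helper): yields the
# compressed payload in fixed-size chunks read from any IO-like object.
def stream_chunks(io, chunk_size)
  Enumerator.new do |y|
    while (chunk = io.read(chunk_size))
      y << chunk
    end
  end
end

# Feeds compressed chunks to a streaming inflater and yields decoded data.
def ungzip(chunks)
  Enumerator.new do |y|
    inflater = Zlib::Inflate.new(Zlib::MAX_WBITS + 16) # +16 selects gzip framing
    begin
      chunks.each do |chunk|
        decoded = inflater.inflate(chunk)
        y << decoded unless decoded.empty?
      end
    ensure
      inflater.close
    end
  end
end

# Buffers decoded blocks and yields complete lines without the newline.
def split_lines(blocks)
  Enumerator.new do |y|
    buffer = String.new
    blocks.each do |block|
      buffer << block
      while (newline = buffer.index("\n"))
        y << buffer.slice!(0..newline).chomp
      end
    end
    y << buffer unless buffer.empty? # last line without a trailing newline
  end
end

# Naive CSV handling, as in the description: strip quotes, split on commas.
def main_enum(io, chunk_size: 16)
  split_lines(ungzip(stream_chunks(io, chunk_size)))
    .lazy
    .map { |line| line.delete('"').split(',') }
end

payload = Zlib.gzip("id,name\n\"1\",\"Ann\"\n\"2\",\"Bob\"\n")
p main_enum(StringIO.new(payload)).first(2) # => [["id", "name"], ["1", "Ann"]]
```

Each stage only pulls from the previous one when it needs more data, so asking for two rows stops reading chunks as soon as two full lines are available.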
Note that we called lazy on the split_lines enumerator. This is the most important part of this solution: it allows, for example, loading the first row without having to download and ungzip the full file, only the part necessary for loading one row. Without it, our program would load all rows into memory only to return the first of them.
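To make the difference concrete, here is a small sketch that counts how many lines the source has to produce before the first row comes back. The first_row helper and the sample data are made up for illustration.

```ruby
# Returns the first parsed row together with the number of lines the
# source enumerator had to produce to get it.
def first_row(lazy:)
  produced = 0
  lines = Enumerator.new do |y|
    ['a,b', 'c,d', 'e,f'].each do |line|
      produced += 1
      y << line
    end
  end

  source = lazy ? lines.lazy : lines
  row = source.map { |line| line.split(',') }.first
  [row, produced]
end

p first_row(lazy: true)  # => [["a", "b"], 1]
p first_row(lazy: false) # => [["a", "b"], 3]
```

With lazy, map hands over one element at a time, so first can stop the whole chain after a single line. Without it, map builds the full array before first even runs.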
I recommend remembering this pattern of combining Enumerators with the lazy enumeration method. It will come in handy whenever you write code that processes streamable data.
But wait, there’s more…
The solution I presented has one drawback: it keeps a TCP socket connection open while processing consecutive rows. This may sometimes cause a socket timeout when processing really huge files.
In this case I’d suggest downloading the file to disk before processing. This can be achieved using Ruby’s OpenURI library, which supports the FTP protocol.
Feel free to use the code below:
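A minimal sketch of this approach, with some assumptions: the download and each_row helper names, the URL and the local path are placeholders, the CSV handling is the same naive quote stripping and comma splitting as before, and on newer Ruby versions FTP support in OpenURI may need the net-ftp gem installed.

```ruby
require 'open-uri'
require 'zlib'

# Downloads the remote file to local_path. OpenURI finishes the transfer
# before yielding, so the socket is no longer needed while rows are processed.
def download(url, local_path)
  URI.open(url) { |remote| IO.copy_stream(remote, local_path) }
end

# Lazily yields parsed rows from a gzipped CSV file on disk.
def each_row(local_path)
  Enumerator.new do |y|
    Zlib::GzipReader.open(local_path) do |gz|
      gz.each_line { |line| y << line.chomp.delete('"').split(',') }
    end
  end.lazy
end

# Hypothetical usage (URL and path are placeholders):
# download('ftp://user:password@ftp.example.com/export.csv.gz', '/tmp/export.csv.gz')
# each_row('/tmp/export.csv.gz').first(3).each { |row| p row }
```

Reading from disk still goes line by line, so memory use stays low. The trade-off is the disk space for the compressed file and having to wait for the full download before the first row is available.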