252411: The Horrors of Very Large Datasets

During the last week, I worked on the largest project I have built by myself (so far). I received an email from my future CS professor about an opportunity to work on a web app for an upcoming event hosted by the New England Center for Investigative Reporting at the WGBH News studios. Although I was told another developer was already working on it, I began writing my own app just in case. It turned out I made the right choice: something came up, the original project wasn't ready in time, and mine was chosen. However, even before the near catastrophe that was the app going live, there were quite a few hiccups along the way.


A Very Large Dataset

I decided that for this particular project, I was going to make use of Google Firebase. Its concept was very enticing. Not having to worry about setting up or maintaining a backend, having user authentication handled automatically, and getting a NoSQL database that was already set up seemed like the perfect way to finish the app as quickly as possible (at this point I was still trying to finish my version before the event's original developer). And at first, Firebase really did seem like the perfect solution. I was able to create a proof of concept in a few hours that did what I thought was needed, i.e. handle user authentication through Google accounts and allow users to edit information stored in a database. However, I made this proof of concept after emailing the event coordinator and before he emailed back his response, so I really had no idea what exactly they wanted save for some vague details from my CS professor. Most importantly, I had no idea what data would be used.
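
For anyone who hasn't used Firebase, the kind of thing I mean looked roughly like this (a simplified sketch, not the proof of concept's actual code; the config values and the reports/demo path are placeholders, and it assumes the Firebase web SDK of the time):

// Sign a user in with their Google account, then write to the database.
firebase.initializeApp({
  apiKey: "YOUR_API_KEY",                          // placeholder
  authDomain: "your-app.firebaseapp.com",          // placeholder
  databaseURL: "https://your-app.firebaseio.com"   // placeholder
});

var provider = new firebase.auth.GoogleAuthProvider();

firebase.auth().signInWithPopup(provider)
  .then(function (result) {
    // Once signed in, the user can edit information stored in the database.
    return firebase.database().ref("reports/demo").set({ reviewed: false });
  })
  .catch(function (error) {
    console.error("Sign-in or write failed:", error);
  });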

When I received an email back containing the project specifications, I was pointed to this webpage and told to use the "All Registered Filer Report Data." After converting it to a CSV so I wouldn't be forced to deal with Microsoft Access, I realized just how big the file was: 252,411 rows. Converting all of that to JSON would produce a massive 500 megabyte file, but there were issues before even reaching that point. The problems truly began when I came across a tool developed by Firebase themselves for importing large files, named Firebase Streaming Import.

Not All That Glitters Is Gold

Glancing over the readme for the Streaming Import tool, it seemed to do just what I wanted. Not to mention, the readme for the normal Firebase Import tool directed users to the Streaming Import tool if they wanted to upload files larger than 250 megabytes. 500 megabytes shouldn't be an issue then, right? Well, I'd later find out that 500 megabytes was apparently too much for it to handle, but the file's size was not the first problem I encountered during the uploading process.

The first issue was converting the CSV to JSON so that the Streaming Import tool could use it. Conversion is a pretty simple task considering there are tons of libraries out there that already do it. I settled on this one and within a few minutes was ready to test it out. But the first run started throwing errors about unclosed quotes, and it was then that I realized the CSV data I had was quite flawed.

You see, the CSV to JSON converter allowed commas to be part of a cell's value, provided that the cell's value was wrapped in quotes. For example, a particular cell might contain the phrase "Doe, John". Since the converter is smart enough to know that not every comma separates values, it notices that the comma is surrounded by quotes and treats it as part of the cell's value. However, the CSV I was working with had some columns designated for notes and extra details about that particular row, so cells in those columns had a pretty good chance of being multiline messes, such as:

"We don't know exactly where this contributor came from, maybe remove this report?
Wait for further information, we might figure it out just yet"

While the converter would normally be able to tell that the commas above were part of that cell's value, it couldn't here because of the newline in the middle. Upon reaching it, the converter would just throw an error and skip that line, leading to quite a bit of missing data the first time I ran it. After running it a few more times and realizing the fault lay not with the converter but with the data itself, I took a closer look at the CSV and found that there were indeed many cells with multiline text. Not good.
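
To see why this trips up line-oriented converters (newlines inside quoted cells are actually legal CSV), here is a small sketch, not the converter I used, of the difference between splitting records naively and tracking quote state:

// A naive text.split("\n") treats the newline inside the quoted cell above as
// the end of a record; tracking whether we're inside quotes does not.
function splitCsvRecords(text) {
  var records = [];
  var current = "";
  var inQuotes = false;
  for (var i = 0; i < text.length; i++) {
    var ch = text[i];
    if (ch === '"') {
      inQuotes = !inQuotes;   // entering or leaving a quoted cell
      current += ch;
    } else if (ch === "\n" && !inQuotes) {
      records.push(current.replace(/\r$/, ""));  // unquoted newline ends the record
      current = "";
    } else {
      current += ch;
    }
  }
  if (current.length > 0) records.push(current.replace(/\r$/, ""));
  return records;
}

Fed the two lines from the example above, a splitter like this returns a single record instead of two broken ones.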

Tidying Up

Some quick Googling brought me to this answer on StackOverflow, which did the trick for compressing all those multiline cells in one fell swoop. Yet, upon running the converter again, I received the same error! The script was telling me that there were still newlines in some cells, even though I could not see them in Excel. After some more Googling and some asking in a StackOverflow chatroom, I was advised to view the CSV in Notepad++, which can display non-printable characters. Opening the data in Notepad++ and jumping to the first line mentioned in an error message, I saw the culprit: right in the middle of the row there was a CRLF (Carriage Return, Line Feed). I figured just replacing all of those characters with a space would do the trick, but in my haste I forgot that every single line ends with a CRLF, as it must so that a new line can begin. And so, when I saved the file and tried to run the script again without even checking what had happened to the data, I got even more errors, as the entire CSV file had now become one line.

Fortunately, I did have the foresight to create a backup of the CSV in case my attempts at cleaning it went wrong, so reverting my mistake wasn't an issue. What was an issue, however, was finding a way to get rid of those stray CRLFs that were messing with my data. I kept searching the same query over and over, hoping Google would turn up something other than an Excel macro built around the CLEAN() function. Although CLEAN() did exactly what I was looking for, removing all non-printable characters from a cell (including CRLFs), it could only be run on one cell at a time. That meant if I wanted to run it on the entire spreadsheet, I was going to have to learn some VBA and write a macro, which I just did not want to do at that point.

Eventually, however, I gave in; after an hour or so of fruitlessly looking for an alternative, I found none. Google led me to this StackOverflow answer, and a very helpful user named Ben N from the SuperUser Root Access chat helped me modify it to work on my spreadsheet. It turned out that creating a macro was pretty simple, and probably something I should have looked into immediately once it seemed there was no other way. Not to mention, the macro was able to clean 252,411 rows in about five minutes, so it was a pretty efficient solution.

Now that I had a clean CSV with no extraneous newlines to muck up the data, the converter created a JSON file without any complaints. But now there was the problem of getting all that data to Firebase.

Out of the Frying Pan, into yet Another Frying Pan

Firebase Streaming Import, although seemingly my best bet, unfortunately was anything but. The first problem came in the form of this error:

started at 1469471482.77
Traceback (most recent call last):
  File "import.py", line 90, in <module>
    main(argParser.parse_args())
  File "import.py", line 20, in main
    for prefix, event, value in parser:
  File "R:\Python27\lib\site-packages\ijson\common.py", line 65, in parse
    for event, value in basic_events:
  File "R:\Python27\lib\site-packages\ijson\backends\python.py", line 185, in basic_parse
    for value in parse_value(lexer):
  File "R:\Python27\lib\site-packages\ijson\backends\python.py", line 127, in parse_value
    raise UnexpectedSymbol(symbol, pos)
ijson.backends.python.UnexpectedSymbol: Unexpected symbol u'\ufeff' at 0

I discovered that the unexpected symbol in question was a Byte Order Mark, which ijson (the JSON parsing library that Firebase Streaming Import uses) did not like. Fortunately, there are a multitude of ways to remove a BOM from a file, from using Vim to simply saving the file in Notepad with ANSI as the encoding.
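
For what it's worth, it can also be done with a few lines of Node (a sketch; reports.json is a stand-in for the real filename):

// The UTF-8 BOM is the three bytes EF BB BF at the very start of the file.
var fs = require("fs");

var buf = fs.readFileSync("reports.json");
if (buf[0] === 0xEF && buf[1] === 0xBB && buf[2] === 0xBF) {
  fs.writeFileSync("reports.json", buf.slice(3));  // drop the BOM, keep the rest
}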

With the BOM removed, I tried the Import tool once again, and ran into a new error:

Traceback (most recent call last):
  File "import.py", line 90, in <module>
    main(argParser.parse_args())
  File "import.py", line 20, in main
    for prefix, event, value in parser:
  File "R:\Python27\lib\site-packages\ijson\common.py", line 65, in parse
    for event, value in basic_events:
  File "R:\Python27\lib\site-packages\ijson\backends\python.py", line 185, in basic_parse
    for value in parse_value(lexer):
  File "R:\Python27\lib\site-packages\ijson\backends\python.py", line 116, in parse_value
    for event in parse_array(lexer):
  File "R:\Python27\lib\site-packages\ijson\backends\python.py", line 138, in parse_array
    for event in parse_value(lexer, symbol, pos):
  File "R:\Python27\lib\site-packages\ijson\backends\python.py", line 119, in parse_value
    for event in parse_object(lexer):
  File "R:\Python27\lib\site-packages\ijson\backends\python.py", line 170, in parse_object
    pos, symbol = next(lexer)
  File "R:\Python27\lib\site-packages\ijson\backends\python.py", line 51, in Lexer
    buf += data
MemoryError

Now it would seem that ijson simply could not handle a file of that size, despite the tool being billed as the way to handle files greater than 250 megabytes. Perhaps it had something to do with how large one particular row was, but regardless, it was at this point that I gave up on trying to use Firebase Streaming Import.

Out of That Frying Pan, Making My Own Frying Pan, and Jumping into It

Throughout this whole ordeal, one of my friends, who is far more knowledgeable than I am when it comes to programming, had been suggesting that I simply write my own script to handle uploading the data. Considering that there was a Firebase API for Node.js, I could just convert each row to a JSON object and upload it as it was made. Despite that, I had waited until I exhausted all other possibilities.

When I reached that point, it turned out that writing my own script was far simpler and quicker than all the time wasted on Firebase Streaming Import. However, nothing is ever truly simple. I fired up my homemade uploader and waited for it to finish, which it surprisingly did very quickly. I could tell because the Firebase API had callbacks for pretty much every method, including uploading, so I made sure to console.log the ID of each uploaded report inside the upload method's callback. I assumed that when I saw the ID of the final report, it would be done. However, what I saw was a whole bunch of IDs for the final report, as if every callback was using the same ID. I ignored it for the moment as a minor mistake on my part and checked Firebase to see if the data had uploaded.
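
For reference, stripped of the CSV handling, that uploader boiled down to something like this (a simplified sketch, not my exact script; it assumes the rows are already parsed into objects with a report_id field, a name I'm using just for illustration, and uses the legacy Firebase Node SDK with a placeholder database URL):

var Firebase = require("firebase");

// Placeholder URL; each report is stored under /reports/<report_id>.
var reportsRef = new Firebase("https://your-app.firebaseio.com/reports");

function uploadReports(rows) {
  rows.forEach(function (row) {
    reportsRef.child(row.report_id).set(row, function (error) {
      if (error) {
        console.log("Failed to upload " + row.report_id + ": " + error);
      } else {
        // forEach gives every callback its own copy of row, so this logs the
        // right ID. Logging a shared var loop variable instead is one easy
        // way to end up printing the final report's ID over and over.
        console.log("Uploaded report " + row.report_id);
      }
    });
  });
}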

At first, everything seemed fine, as there were a ton of reports on Firebase. However, when the full dataset is 252,411 reports big, one ton of reports isn't enough. I soon noticed that nearly half of the data was missing, and so began another stream of problems.

Sometimes All You Need Is Patience

With half the data missing, my first thought was to simply split the CSV in two and try each half separately. Yet, in doing so, I was only able to upload half of each. Splitting them further resulted in the same problem, netting less and less data. That's when I thought that perhaps uploading to Firebase took far longer than converting a CSV row to JSON, so some rows might simply have been dropped. My more knowledgeable friend advised me to implement an upload queue to ensure that every row was uploaded, and to wait for the queue to empty before exiting the program.
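
His advice amounted to tracking how many writes were still outstanding and not treating the job as done until that count hit zero, roughly like this (a sketch reusing the placeholder names from the uploader above, not his exact suggestion or my exact code):

var Firebase = require("firebase");

var reportsRef = new Firebase("https://your-app.firebaseio.com/reports");
var pending = 0;  // writes sent to Firebase that haven't been acknowledged yet

function uploadWithQueue(rows) {
  rows.forEach(function (row) {
    pending++;
    reportsRef.child(row.report_id).set(row, function (error) {
      if (error) console.log("Failed to upload " + row.report_id);
      pending--;
      if (pending === 0) {
        // Only once every callback has come back is the "queue" empty and
        // the process (in theory) safe to let exit.
        console.log("All writes acknowledged.");
      }
    });
  });
}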

Yet, that still ended up with about half the data missing, leaving us both stumped for quite some time. Eventually, he suggested that I just try leaving the program running for a while, even after it said it had uploaded the final report. So I did just that. I started the process of converting and uploading and went to take a shower. Twenty minutes later, I returned to my computer and voilà! All the reports had been uploaded successfully!

It would seem that with the massive amount of data I was attempting to upload at once, the connection had to persist while Firebase was still in the process of saving the data it received, lest it drop what it hadn't finished. Although the final report had been converted and uploaded from my computer, it wasn't necessarily true that Firebase had finished storing it (at least, that's my theory).

So now the data had finally been uploaded to Firebase, after 3 days (yes, the events I outlined did take place over 3 days) of wrangling a CSV, scouring the internet, and crying whenever I got an error. Yet, throughout all this, I never considered whether Firebase, or any NoSQL database in general, was the right tool for this particular project.

Make the Best of What You've Got

For a NoSQL database like Firebase, the proper way to structure data is as flatly as possible, i.e. one should not recklessly nest data such that a single node becomes ridiculously large. However, I did just that, as all of my reports were children of the /reports/ node. If I had structured the data in a flat manner, I would have had a root-level node for each column in the CSV and split the data of each report into their respective nodes. To make sure the data matched up, I would key each piece of a report by the report ID. To build a finished report I would, for example, make queries to /committee_name/report_ID/ or /expenditures/report_ID/, and so on. However, in my haste to get a working project ready, I did not do this.
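
To make the difference concrete, here is roughly the shape of each approach (field names and IDs are made up for illustration):

// What I actually had: every full report nested under one giant /reports/ node.
{
  "reports": {
    "12345": { "committee_name": "...", "expenditures": "...", "notes": "..." },
    "12346": { "committee_name": "...", "expenditures": "...", "notes": "..." }
  }
}

// The flatter structure described above: one root-level node per column,
// each keyed by report ID, so a client only downloads the pieces it asks for.
{
  "committee_name": { "12345": "...", "12346": "..." },
  "expenditures":   { "12345": "...", "12346": "..." }
}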

Since all the reports were in one node, and I had already wasted 3 days getting to that point, I decided to create new nodes holding just report IDs: one for reports that hadn't been reviewed, one for reports that had been reviewed, and one for reports that had been reviewed by both a user and an admin (double-checked, essentially). This way, users would only have to download the report IDs. Yet, the massive number of unreviewed IDs that existed at the beginning was still large enough to freeze the webpage for a few seconds while it downloaded.
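
The result looked something like this (the node names are my shorthand here, not necessarily what the app used; storing true as the value is just the usual trick for representing a set of IDs in Firebase):

// ID-only index nodes alongside the full /reports/ node.
{
  "unreviewed":     { "12345": true, "12346": true },
  "reviewed":       { "12347": true },
  "double_checked": { "12348": true }
}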

Despite this, I was in too deep, and I had to begin work on the front-end so that I would have something to show soon. (I forgot to mention that I had promised a proof of concept about 3 days earlier, before I realized I would have to work with so much data.) So, I forged onwards and ended up having something ready about 4 days later. It was at that point that I was told that the data I had been using so far was just an example, and I was then given the actual data the app would need to use. Said data came in the form of CSVs with no multiline cells and no extraneous CRLFs. And most importantly, only 3,602 rows.

All that time I had been working with such a ridiculously large dataset, cleaning it, getting it to convert properly, uploading every single report to Firebase, wasted. The actual dataset was about 1.43% the size of what I had thought was needed, and took about 10 seconds to successfully upload to Firebase. 10 seconds.

sigh.

The only thing to do at that point was to just make the best of what I had.

Part Two