Getting data out of SQL Server is easy; you don't need special tools. With just a few lines of code, you can send the results of a query to a file. When you're building a prototype, this is also the fastest way to work: you create quick and dirty tools that get the job done. ETL (extract, transform, load) processes are no exception. During a recent project working with SQL Server and Hive, I assumed that most of the pain would involve Hive. I was sorely mistaken: SQL Server and dirty data were the source of our problems.
Extracting the Errors
Getting data out of SQL Server seemed painless. It only took a few SQL statements and several lines of C# before I had a working data extractor. Data came out of SQL Server and was written to disk in a delimited file format—to make life easy, I decided to use a tab character as the delimiter. The hardest part of the extraction process was checking the data for funny characters.
Hive doesn’t support quoted strings; this made it especially important to make sure that the source data was as clean as possible. Hive’s schema on read makes it possible to support files with a different number of fields in each record, but the downside of this flexibility is that stray delimiter characters will be picked up as a new column.
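Because Hive treats every delimiter literally, a quick pre-flight check of an extract can catch rows with the wrong field count before they reach HDFS. Here is a minimal sketch of that idea in Python (the original extractor was written in C#; the function name and sample rows are illustrative):

```python
# Scan a tab-delimited extract and report rows whose field count
# doesn't match the expected schema width. Stray tabs or newlines
# in the source data show up here as extra or missing fields.
def find_bad_rows(lines, expected_fields):
    bad = []
    for lineno, line in enumerate(lines, start=1):
        count = len(line.rstrip("\n").split("\t"))
        if count != expected_fields:
            bad.append((lineno, count))
    return bad

rows = [
    "1\tWidget\t9.99\n",
    "2\tGad\tget\t4.50\n",   # a user typed a tab inside "Gadget"
    "3\tSprocket\t1.25\n",
]
print(find_bad_rows(rows, expected_fields=3))  # row 2 has 4 fields
```

Running a check like this on the first few million rows would have saved me the debugging session described below.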
The initial portion of the data looked correct, so I proceeded with the data loads. As we loaded more and more data, problems started appearing in the queries. Data wasn't where it should be and some of the aggregations were failing: Hive was reporting errors converting strings to numbers. Perplexed, I started looking at the data in the source system. Everything looked correct. So I turned to the extracted data instead, and rapidly got lost in a sea of massive files, since Hive's MapReduce jobs create a detailed log of all activity.
While digging through the chain of MapReduce logs, I found references to the problem data. Ultimately, I tracked it down to several rows where an enterprising user had put a tab character in a string. Not to be outdone, a few other users had managed to sneak newline characters into text as well. After creating additional data cleansing rules, I reloaded all of the source data. Somewhat surprisingly, everything began working at this point. The users and their desire to put random characters in text had been defeated by a simple REPLACE.
A Transformative Experience
Applying data transformations in Hive was anticlimactic. Really. The process of transforming source data into a queryable format was painless—barring date and time bugs, there’s nothing to report. Hive was able to fly through the source data, pivot several hundred gigabytes of tables, and produce complex aggregations in a matter of minutes.
The only difficulty came from figuring out the right techniques for aggregating data. That, however, is a different story.
The Letdown
There’s always a letdown. After a reasonable run of success building a prototype, something had to break and break it did.
After getting data out of SQL Server, pushing data into Hive, running queries, and pulling results out of Hive, you would assume that the hard part was over. After all, the only thing left to do was load the results into SQL Server. And that’s where the problems started.
There are many ways to load data into SQL Server, and each poses its own problems. The most efficient techniques use bcp or BULK INSERT, but both require that SQL Server itself can access the files. That works, but troubleshooting a combination of Active Directory and SQL Server security is nobody's idea of a good time; sorting out the required permissions proved time consuming and difficult. Once the security issues were resolved, a new error cropped up: BULK INSERT may encounter problems loading data into tables created with LOB data types. The fix was simply to create the target tables with the correct data types and appropriate precision, but it still slowed down our end-to-end testing.
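For reference, the load side of a pipeline like this is a BULK INSERT whose terminators match the extract format. A small Python helper that builds the statement (the table name and file path are placeholders, not the project's actual names):

```python
# Build a BULK INSERT statement whose field and row terminators
# match a tab-delimited extract. SQL Server itself must be able to
# read data_file, which is what made the security setup painful.
def bulk_insert_sql(table, data_file):
    return (
        f"BULK INSERT {table} "
        f"FROM '{data_file}' "
        "WITH (FIELDTERMINATOR = '\\t', ROWTERMINATOR = '\\n', TABLOCK);"
    )

print(bulk_insert_sql("dbo.HiveResults", r"\\fileserver\extracts\results.tsv"))
```

Generating the statement this way keeps the terminators in one place, so the extract writer and the loader can't drift apart.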
In the end…
After a lot of troubleshooting of security and data formatting issues, we were able to resolve our problems and get an end-to-end test running. It's rewarding to watch data move through a processing pipeline, especially after struggling with multiple ETL problems.
ETL remains the most difficult part of database projects. Moving data between different instances of SQL Server can be problematic, depending on the methods used; moving data between SQL Server and other databases can cause problems in ways that you didn't expect. Building end-to-end systems is difficult. If you haven't done it before, make sure you pad your schedule; I was able to solve these problems quickly through previous experience and by reaching out to my network, but not everyone will be so lucky. ETL can be problematic even for the most seasoned SQL Server professional.
Curious whether you might have a problem suitable for Hadoop, but don't want to run into some of the same problems with your implementation? I'll be at SQLIntersection in April presenting Reporting in Production: A Hadoop Case Study, talking about the problems we ran into implementing a Hadoop-driven solution and how we solved them. Head on over, look at the sessions, and register to get your learn on in Vegas.