MSSQL JDBC Driver behavior
My latest project involves talking to MS SQL Server from Java using the JDBC driver. We set up the database connection and wrote a simple SQL statement to fetch the first name and last name for a unique user id from the application_user table:
SELECT first_name,last_name FROM application_user WHERE user_id = ?
Given the above SQL, we did not think much about performance, as user_id was indexed. The Java code below was used to run the SQL.
Connection conn = db.conn(DATABASE_URL);
try {
    PreparedStatement stmt = prepare(conn,
        "SELECT first_name, last_name " +
        "FROM application_user " +
        "WHERE user_id = ?");
    stmt.setString(1, username);
    ResultSet resultSet = stmt.executeQuery();
    return extractResults(resultSet);
} catch (SQLException e) {
    e.printStackTrace();
}
return null;
When writing integration tests we noticed that the SQL was taking about 6 seconds to execute, while the same SQL ran in under 100 milliseconds in the MSSQL Query Analyzer. The friendly DBAs on our team pointed out that the SQL was doing a data type conversion: the user_id column is of type VARCHAR, but the JDBC driver sent the parameter as NVARCHAR. Because of this the index was not being used, and the SQL took more than 6 seconds to execute. Researching the topic further, we decided to cast the variable to VARCHAR, as shown below.
Connection conn = db.conn(DATABASE_URL);
try {
    PreparedStatement stmt = prepare(conn,
        "SELECT first_name, last_name " +
        "FROM application_user " +
        "WHERE user_id = cast(? AS VARCHAR)");
    stmt.setString(1, username);
    ResultSet resultSet = stmt.executeQuery();
    return extractResults(resultSet);
} catch (SQLException e) {
    e.printStackTrace();
}
return null;
The above code executed in under 100 milliseconds, confirming that the parameter data type did not match the data type of the column in the database. We later found out that the MS SQL JDBC driver does this to properly deal with Unicode characters. The behavior can be turned off using the sendStringParametersAsUnicode flag on the database connection. Once this flag is set to false on the connection, none of the SQL we issue needs the cast.
Connection conn = db.conn(DATABASE_URL + ";sendStringParametersAsUnicode=false"); try { PreparedStatement stmt = prepare(conn, "SELECT first_name, last_name " + "FROM application_user " + "WHERE user_id = ? "); stmt.setString(1, username); ResultSet resultSet = stmt.executeQuery(); return extractResults(resultSet); } catch (SQLException e) { e.printStackTrace(); } return null;
Of course, this only works if there is no Unicode data in your database; if there is any Unicode data, we have to revert to casting in individual SQL statements.
Back to blogging
There has been a long pause in my blogging activity. I was trying to finish off my latest writing engagement on NoSQL. Working with Martin Fowler on NoSQL Distilled was really fun; the book aims to be a concise, easy-to-understand text on the rise of the NoSQL movement and on the kinds of trade-offs that need to be made while working with NoSQL.
The book should soon be available in print and e-book formats. Martin has written more about it here.
With so much pain, why are stored procedures used so much
- Writing stored procedure code is fraught with danger, as there are no modern IDEs that support refactoring or flag code smells like "variable not used" or "variable out of scope".
- Finding usages of a given stored procedure or function usually means doing a text search of the whole code base for its name, so renaming is painful. As a result, names that do not make any sense get propagated, causing pain and loss of developer productivity.
- To compile stored procedure code, you need a database. This usually means a large database install on your desktop or laptop, the other option being to connect to a central database server; either way, developers have to carry a lot of dependent systems just to compile their code. This could be solved by database vendors providing a way to compile the code outside of the database.
- Code complexity tools such as PMD and Checkstyle are very rare for stored procedures, making metrics around stored procedure code very hard, if not impossible, to visualize.
- Unit testing stored procedures using the *Unit testing frameworks out there, like pl/sql unit, ounit, and tsql unit, is hard, since these frameworks need to run inside the database, and integrating them with Continuous Integration further exacerbates the problem.
- Order of creation of stored procedures becomes important as you create lots of them and they become interdependent. While creating them in a brand new database, false warnings are thrown about missing stored procedures. To get around this problem, I have seen teams maintain a master ordered list of stored procedures for creation, or simply recompile all stored procedures once they are created ("ALTER ... RECOMPILE" was built for this). Both of these solutions have their own overhead.
- While running CPU-intensive stored procedures, the database engine is the only runtime (like a JVM) available to the code. If you want to start more processes to handle more requests, that is not possible without another database engine, so the only solution left is to get a bigger box (vertical scaling).
Replica sets in MongoDB
Replica sets are MongoDB's feature for automatic failover. In this setup there is a primary server and the rest are secondary servers. If the primary server goes down, the secondary servers choose a new primary via an election process. Each server can be assigned a number of votes, so that you can influence the choice of the next primary based on data-center location, machine properties, etc. You can also start mongo database processes that act only as election tie-breakers; these are known as arbiters. Arbiters never hold data, they just act as agents that break the tie.
All operations are directed at the primary server. The primary writes each operation to its operation log (also known as the oplog), and the secondary servers pull updates from the primary. Since data is written to the primary first and replicated to the secondaries later, if the primary goes down after a write but before that write is replicated, you will lose the data that was written to the primary but never replicated to the secondary servers. You can get around this by specifying how many servers must have the data before the write is considered good.
db.runCommand( { getlasterror : 1 , w : 3 } )

In the above command you are saying that a write to the database is considered good only if it has been propagated to at least 3 servers. Of course, doing this for every write is going to be very expensive, so you should batch all the writes for a user action and then issue getlasterror.
This is how you start the mongod servers in a replica set. They can run on any machine and any port, as long as they can talk to each other over the network and all of them have the same "--replSet" parameter; in the example below it is "prod".
mongod --replSet prod --port 27017 --dbpath /data/node1
mongod --replSet prod --port 27027 --dbpath /data/node2
mongod --replSet prod --port 27037 --dbpath /data/node3
Once the three servers are up, you have to create a replica set configuration as shown below. If you use localhost as a server name, then all the members of the replica set have to be on localhost; if the mongo servers are on different machines, you should use distinct machine names and not localhost for any of them. Once the replica config is defined, you initiate the replica set using that configuration.
replica_config = {_id: 'prod',
                  members: [
                    {_id: 0, host: 'localhost:27017'},
                    {_id: 1, host: 'localhost:27027'},
                    {_id: 2, host: 'localhost:27037'}]}
// Now initiate the replica set with the configuration
rs.initiate(replica_config);

When you are connecting to a replica set, you have to connect to at least one server that is alive. Using the Ruby driver you can connect to more than one server with the "multi" method. One thing you should be careful about: say you list all the members of the replica set in your connection string, but one of the members is down; you will get connection failures. The best thing to do is give the members of the replica set that are up, and the driver will discover the other servers as they come online or go offline. Here is a sample Ruby program that finds a document in a loop.
#!/usr/bin/env ruby
require 'mongo'

@connection = Mongo::Connection.multi([
  ['localhost', 27017],
  ['localhost', 27027],
  ['localhost', 27037]])
@collection = @connection.db("sales").collection("products")
product = { "name" => "Refactoring",
            "code" => "023XX3",
            "type" => "book",
            "in_stock" => 100 }
@collection.insert(product)
100.times do
  sleep 0.5
  begin
    product = @collection.find_one "code" => "023XX3"
    puts "Found Book: " + product["name"]
  rescue Exception => e
    puts e.message
    next
  end
end

While the Ruby program is running, you can kill the current primary and you will see that the program gets connection exceptions while the replica set figures out the next primary. Once the next primary is elected, the program goes back to finding the same data from the newly elected primary. Here is a screencast of replica sets in action: Replica Sets screencast
Schema-less databases and their ramifications
In the NoSQL world, schema-less is a powerful feature that is advertised a lot. Schema-less basically means you don't have to worry about column names and table names in the traditional sense; if you want to change a column name, you just start saving the data using the new name. Let's say you have a document database like MongoDB and a JSON document as shown below.
{ "_id":"4bc9157e201f254d204226bf", "FIRST_NAME":"JOHN", "MIDDLE_NAME":"D", "LAST_NAME":"DOE", "CREATED":"2010-10-12" }
You have some corresponding code to read the documents from the database, and let's say you have lots of data in the database, in the order of millions of documents. If you want to change the names of some attributes at this point, the new JSON would look like
{ "_id":"4bc9157e201f254d204226bf", "first_name":"JOHN", "middle_name":"D", "last_name":"DOE", "created":"2010-10-12" }
You will have to either change every document in the database to match the new attribute names, or make sure your code can handle both sets of attribute names, like
first_name = doc["first_name"] first_name = doc["FIRST_NAME"] unless !first_name.nil? middle_name = doc["middle_name"] middle_name = doc["MIDDLE_NAME"] unless !middle_name.nil? last_name = doc["last_name"] last_name = doc["LAST_NAME"] unless !last_name.nil?
The attribute name change also affects the indexes created in MongoDB. Since the change is not across all the documents, an index created with
db.people.ensureIndex({first_name:1})
will not index documents where the attribute name is FIRST_NAME, so you have to create another index for the old attribute name
db.people.ensureIndex({FIRST_NAME:1})
As you can see, this gets really complicated if you do multiple refactorings over a period of time. So when you hear schema-less, make sure you understand the ramifications of refactoring attribute names at will, and the effect on the code base and the database.
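One way out of the dual-name code path is a one-time migration that rewrites the old attribute names in every document. The sketch below is plain Ruby operating on a document as a hash; the RENAMES map and the normalize_keys helper are my own names for illustration, and in a real migration you would iterate the collection with the MongoDB driver and save each rewritten document back.

```ruby
# Map of legacy attribute names to their new names (assumed, for illustration).
RENAMES = {
  "FIRST_NAME"  => "first_name",
  "MIDDLE_NAME" => "middle_name",
  "LAST_NAME"   => "last_name",
  "CREATED"     => "created"
}

# Rewrite a single document's keys; keys not in the map pass through untouched.
def normalize_keys(doc)
  doc.each_with_object({}) do |(key, value), renamed|
    renamed[RENAMES.fetch(key, key)] = value
  end
end

old_doc = { "_id" => "4bc9157e201f254d204226bf",
            "FIRST_NAME" => "JOHN", "LAST_NAME" => "DOE" }
new_doc = normalize_keys(old_doc)
puts new_doc["first_name"]  # => JOHN
```

Once every document has been rewritten, the code and the indexes only ever have to deal with one set of attribute names.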
Effective use of data for better customer experience.
For more than seven years I have been getting credit card offers from airlines and banks. One particular bank has been sending me these solicitations for more than seven years. That is 12 mailings per year, more than 72 mailings so far; remember, these are physical paper mailings, not the electronic kind. I don't like the junk, it hurts the environment, and worst of all I think it's not a good use of the data they have. How hard is it to design a system around that data?
Let's say they have a table of all the targeted customers they want to send credit card applications to. Why not have an attribute on the table counting how many times a solicitation was sent, or even just the date the first solicitation was sent?
Customer | Name | Address | City | FirstSolicitationSent

or

Customer | Name | Address | City | Solicitations
So they could say: if the number of days between today and FirstSolicitationSent is more than 90, do not send another solicitation; or if the number of solicitations is more than three, do not send another one.
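The rule is simple enough to sketch in a few lines of Ruby; the method name, field names, and thresholds below are my own invention for illustration, not anything the bank actually runs.

```ruby
require 'date'

MAX_SOLICITATIONS = 3   # assumed cutoff on total mailings
MAX_DAYS          = 90  # assumed cutoff on days since the first mailing

# Decide whether another mailing should go out for a prospect record.
def send_solicitation?(first_sent_on, solicitations_sent)
  return false if solicitations_sent >= MAX_SOLICITATIONS
  return false if first_sent_on && (Date.today - first_sent_on) > MAX_DAYS
  true
end

puts send_solicitation?(Date.today - 30, 1)   # within 90 days, under 3 mailings
puts send_solicitation?(Date.today - 400, 2)  # first mailing over a year ago
```

A nightly batch job could run this check over the prospects table before printing the next round of mailings.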
This would stop the solicitations after a point, instead of mailing for years and ultimately losing the customer. I understand the argument that the customer needs time to react to a solicitation, but seven years of trying to convert a prospect is a pure waste of time and effort. The data available can be used in better ways.
Schema design in a document database
{ "_id" : ObjectId("4bd8ae97c47016442af4a580"), "customerid" : 99999, "name" : "Foo Sushi Inc", "type" : "Good", "since" : "12/12/2001", "addresses" : [{ "address" : "4821 Big Street", "city" : "Stone", "state" : "IL", "country" : "USA" }, { "address" : "1248 Barlow Ln", "city" : "Hedgestone", "country" : "UK" } ], "emails" : [ {"email" : "foousa@sushi.com"}, {"email" : "foouk@sushi.com"} ], "phones" : [ {"phone" : "773-7777-7777"}, {"phone" : "020-6666-6666"} ] }So Instead of 1 Row for customer, 2 rows for address, phone and email each, you get one Customer document. If you want to query for customers in USA. Using RDBMS you would do
SELECT customer.name
FROM customer, address
WHERE customer.customerid = address.customerid
AND address.country = "USA"

The same query in mongo would look like
db.customers.find({"addresses.country":"USA"},{"name":true})where customers is the collection in which we store our customers.
My experience with MongoDB
./mongod --dbpath=/user/data/db

is all you need, where /user/data/db is the path where you want mongo to create its data files. There are many other options you can use to customize the mongo instance. Each mongo instance has databases, and each database has many collections. Mapping back to Oracle, a mongo database is like an Oracle schema and a mongo collection is like an Oracle table; the difference is that each collection can hold any type of object, so basically every row can be different. An example connection to the database using Java looks like this:
Mongo mongo = new Mongo("localhost"); db = mongo.getDB("mydatabase");If the "mydatabase" does not exist, it will be created. When you want to put objects in the database, you need to have a collection which holds the objects.
users = db.getCollection("applicationusers");if the "applicationusers" collection does not exist, it will be created, at this point you are ready to put objects into the collection.
BasicDBObject userDocument = new BasicDBObject();
userDocument.put("name", "jack");
userDocument.put("type", "super");
users.insert(userDocument);

You create a document using BasicDBObject and put in attribute names and their values; in the above example "name" is the attribute and "jack" is the value. users.insert takes the document and inserts it into the "applicationusers" collection. At this point you have a JSON object in the database. You can query for the object using the mongo query tool, or via the RESTful API mongo provides when you start mongod with the --rest flag; visiting 127.0.0.1:28017/mydatabase/applicationusers/ should give you
{ "offset" : 0, "rows": [ { "_id" : { "$oid" : "4bc9157e201f254d204226bf" }, "name" : "jack", "type" : "super" } ], "total_rows" : 1 , "query" : {} , "millis" : 0 }Every object you insert, gets a auto generated id, more about update, delete and complex objects in next blog post.
Workshop at Enterprise Data World 2010
Doing a workshop on Agile Database Development at Enterprise Data World 2010 in SF. See you there.
Testing in conversion projects
When working on projects involving conversion or migration of data from a legacy database, the testing effort is enormous and takes a lot of time; some test automation can help.
Since data is moved or changed from a source database to a destination database, we can write SQL that provides results for the types of tests we want to perform. For example: a SQL statement that gives us the number of customers, or one that gives us the account balance for a specific account.
These SQL statements can be run on both the source and the destination database, and the results compared programmatically, giving us an easy way to compare the state of the data before and after conversion/migration. The tests can be run by a CI engine, turning them into a regression test suite.
Here is an example implementation using Ruby.
We have two databases, SOURCE and DESTINATION, and two SQL files named source.sql and destination.sql. The Ruby program picks up SQL from these two files and runs each statement against its own database, i.e. SQL from source.sql runs against the SOURCE database and SQL from destination.sql runs against the DESTINATION database. The results of the two statements are compared, and a failure is raised when they do not match.
results = []
statement = get_sql_statement_to_execute
begin
  source_statement = statement[0]
  destination_statement = statement[1]
  source_rows = exec_sql_in_source_return_rows(source_statement)
  destination_rows = exec_sql_in_destination_return_rows(destination_statement)
  result = compare_rows(source_rows, destination_rows,
                        destination_statement, source_statement)
  results << result if result
rescue
  Log.log("Could not process: " + statement)
end

if results.size > 0
  Log.log("Results do not match in source and destination")
end
The sample Ruby code above shows how the solution can be implemented, enabling automation of database conversion/migration testing.
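The helpers in the listing (exec_sql_in_source_return_rows, compare_rows, and so on) are left to the reader; a minimal compare_rows could look like the sketch below, assuming each row set is an array of arrays of column values. The exact signature and messages are my own illustration, not part of the original program.

```ruby
# Compare two row sets and report the first difference found.
# Returns nil when the row sets match, or a description of the mismatch.
def compare_rows(source_rows, destination_rows, destination_statement, source_statement)
  if source_rows.size != destination_rows.size
    return "Row count mismatch (#{source_rows.size} vs #{destination_rows.size}) " \
           "for: #{source_statement} / #{destination_statement}"
  end
  source_rows.each_with_index do |row, i|
    if row != destination_rows[i]
      return "Row #{i} differs for: #{source_statement} / #{destination_statement}"
    end
  end
  nil
end

puts compare_rows([[1, "a"]], [[1, "a"]], "dest.sql", "source.sql").inspect  # => nil
```

Returning nil on a match keeps the results list empty when everything lines up, so the final mismatch check only fires when a real difference was found.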