
"Trends and Information on New Data Management Technologies, Innovation."

This is the Industry Watch blog. To see the complete ODBMS.org
website with useful articles, downloads and industry information, please click here.

Apr 16 12

Integrating Enterprise Search with Analytics. Interview with Jonathan Ellis.

by Roberto V. Zicari

“Enterprise Search implies being able to search multiple types of data generated by an enterprise. DSE2 takes this to the next level by integrating this with a real time database AND powerful analytics.” — Jonathan Ellis.

I wanted to learn more about DataStax Enterprise 2.0, the new release of the commercial version of Cassandra. I interviewed Jonathan Ellis, CTO and co-founder of DataStax and project chair of Apache Cassandra.

RVZ

Q1. What are the new main features of DataStax Enterprise 2.0 (DSE 2.0)?

Jonathan Ellis: The one I’m most excited about is the integration of Solr support. This is not a side-by-side system like Solandra, where Solr has to maintain a separate copy of the data to be indexed, but full integration with Cassandra: you insert your data once, and access it via Cassandra, Solr, or Hadoop.
Search is an increasingly important ingredient for modern applications, and DSE2 is the first to offer fully integrated, scalable search to developers.

DSE2 also includes Apache Sqoop for easy migration of relational data into Cassandra, and plug-and-play log indexing for your application.

Q2. How does the integration of Cassandra with Apache Hadoop and Apache Solr work technically?

Jonathan Ellis: Basically, Cassandra offers a pluggable index api, and we created a Solr-backed index implementation with this.
We wrote about the technical details here.
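The shape of that pluggable-index hook can be sketched in miniature (Python used purely for illustration; all names here are hypothetical, and the real implementation lives in Cassandra’s Java codebase). The point is that the store calls the index on every write, so there is no second copy of the data to keep in sync:

```python
class InvertedIndex:
    """A toy Solr-style index: maps each term in a column value to row keys."""
    def __init__(self):
        self.postings = {}  # term -> set of row keys

    def index_row(self, row_key, columns):
        # Called by the store on every write, so the index never holds
        # a separate copy of the data -- only term -> key postings.
        for value in columns.values():
            for term in str(value).lower().split():
                self.postings.setdefault(term, set()).add(row_key)

    def search(self, term):
        return self.postings.get(term.lower(), set())


class ColumnStore:
    """Minimal store with a pluggable index hook, loosely mirroring
    the idea of a secondary-index API."""
    def __init__(self, index=None):
        self.rows = {}
        self.index = index

    def insert(self, row_key, columns):
        self.rows.setdefault(row_key, {}).update(columns)
        if self.index is not None:
            self.index.index_row(row_key, columns)  # single write path


store = ColumnStore(index=InvertedIndex())
store.insert("user:1", {"name": "Ada Lovelace", "city": "London"})
store.insert("user:2", {"name": "Alan Turing", "city": "London"})

print(sorted(store.index.search("london")))  # ['user:1', 'user:2']
```

One insert updates both the row data and the search view, which is the contrast with a side-by-side system like Solandra.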

Q3. What exactly is Enterprise Search, and why choose Apache Solr?

Jonathan Ellis: Enterprise Search implies being able to search multiple types of data generated by an enterprise. DSE2 takes this to the next level by integrating this with a real time database AND powerful analytics.

Solr is the gold standard for search, much the same way that Hadoop is the gold standard for big data map/reduce and analytics. There’s an ecosystem of tools that build on Solr, so offering true Solr support is much more powerful than implementing a proprietary full-text search engine.

Q4. What are the main business benefits of such integration?

Jonathan Ellis: First, developers and administrators have one database and vendor to concern themselves with instead of multiple databases and many software suppliers. Second, the built-in technical benefits of running both Solr and Hadoop on top of Cassandra yield continuous uptime for critical applications, and future-proof those same apps against growing data volumes and increased user traffic.

Finally, customers save anywhere from 80-90% over traditional RDBMS vendors by going with DSE. For example, Constant Contact estimated that a new project they had in the works would take $2.5 million and 9 months on traditional relational technology, but with Cassandra, they delivered it in 3 months for $250,000. That’s one third the time and one tenth the cost; not bad!

Q5. It looks like you are attempting to compete with Google. Is this correct?

Jonathan Ellis: DSE2 is about providing search as a building block for applications, not necessarily delivering an off-the-shelf search appliance.
Compared to Google’s AppEngine product, it’s fair to say that DSE 2.0 provides a similar, scalable platform to build applications on. DSE 2.0 is actually ahead of the game there: Google has announced but not yet delivered full-text search for AppEngine.

Another useful comparison is to Amazon Web Services: DSE 2.0 gives you the equivalent of Amazon’s DynamoDB, S3, Elastic Map/Reduce, and CloudSearch in a single, integrated stack. So instead of having to insert documents once in S3 and again in CloudSearch, you just add them to DSE (with any of the Solr, Cassandra, or Hadoop APIs) without having to write code that keeps multiple copies in sync when updates happen.

Q6. How do you manage to run real-time, analytics and search operations in the same database cluster, without performance or resource contention problems?

Jonathan Ellis: DSE offers elastic workload partitioning: your analytics jobs run against their own copies of the data, kept in sync by Cassandra replication, so they don’t interfere with your real time queries. When your workload changes, you can re-provision existing nodes from the real time side to the analytical, or vice versa.

Q7. You do not require ETL software to move data between systems. How does it work instead?

Jonathan Ellis: All DSE nodes are part of a single logical Cassandra cluster. DSE tells Cassandra how many copies to keep for which workload partitions, and Cassandra keeps them in sync with its battle-tested replication.
So your real time nodes will have access to new analytical output almost instantly, and you never have to write ETL code to move real time data into your analytical cluster.
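A toy model of that arrangement (illustrative Python, not DSE’s actual mechanism): every write is replicated to a node in each workload partition, so the analytics side sees new data without any ETL step, and reads are routed to nodes provisioned for that workload.

```python
class Node:
    def __init__(self, name, workload):
        self.name = name
        self.workload = workload      # "realtime" or "analytics"
        self.data = {}


class Cluster:
    """One logical cluster; every write is replicated to every node,
    so analytics nodes see new data without a separate ETL pipeline."""
    def __init__(self, nodes):
        self.nodes = nodes

    def write(self, key, value):
        for node in self.nodes:       # replication keeps copies in sync
            node.data[key] = value

    def read(self, key, workload):
        # Route the read to a node provisioned for that workload, so
        # long analytics scans never touch the real-time replicas.
        for node in self.nodes:
            if node.workload == workload:
                return node.data.get(key)


cluster = Cluster([Node("rt1", "realtime"), Node("an1", "analytics")])
cluster.write("order:42", {"total": 99.5})
print(cluster.read("order:42", "analytics"))  # {'total': 99.5}
```

Real Cassandra replication is asynchronous and far more sophisticated; the sketch only shows why the two workloads don’t contend for the same copies of the data.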

Q8. Could you give us some examples of Big Data applications that are currently powered by DSE 2.0?

Jonathan Ellis:
A recent example is Healthx, which develops and manages online portals and applications for the healthcare market. They handle things such as enrollment, reporting, claims management, and business intelligence.
They have to manage countless health groups, individual members, doctors, diagnoses, and a lot more. Data comes in very fast, from all over, changes constantly, and is accessed all the time.

Healthx especially likes the new search capabilities in DSE 2.0. In addition to being able to handle real-time and analytic work, their users can now easily perform lightning-fast searches like ‘find me a podiatrist who is female, speaks German, and has an office close to where I live.’

Q9. What about Big Data applications which also need to use Relational Data? Is it possible to integrate DSE 2.0 with a Relational Database? If yes, how? How do you handle query of data from various sources?

Jonathan Ellis: Most customers start by migrating their highest-volume, most-frequently-accessed data to DSE (e.g. with the Sqoop tool I mentioned), and leave the rest in a relational system. So RDBMS interoperability is very common at that level.

It’s also possible to perform analytical queries that mix data from DSE and relational sources, or even a legacy HDFS cluster.

Q10. How can developers use DSE 2.0 for storing, indexing and searching web logs?

Jonathan Ellis: We ship a log4j appender with DSE2, so if your log data is coming from Java, it’s trivial to start streaming and indexing that into DSE. For non-Java systems, we’re looking at supporting ingestion through tools like Flume.
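The idea of an appender that streams log records straight into an indexed store can be mimicked with Python’s standard `logging` module (a conceptual stand-in, not the actual log4j appender DSE ships):

```python
import logging


class IndexingHandler(logging.Handler):
    """Rough analogue of a database-backed log appender: every record
    is streamed into a store and indexed by severity level."""
    def __init__(self):
        super().__init__()
        self.records = []
        self.by_level = {}

    def emit(self, record):
        message = record.getMessage()
        self.records.append(message)
        self.by_level.setdefault(record.levelname, []).append(message)


log = logging.getLogger("app")
log.setLevel(logging.INFO)
handler = IndexingHandler()
log.addHandler(handler)

log.info("user 7 logged in")
log.error("payment gateway timeout")

print(handler.by_level["ERROR"])  # ['payment gateway timeout']
```

Pointing an application at such a handler is a one-line configuration change, which is what makes the log4j route “trivial” for Java systems.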

Q11. How do you adjust performance and capacity for various workloads depending on the application needs?

Jonathan Ellis: Currently reprovisioning nodes for different workloads is a manual, operator-driven procedure, made easy with our OpsCenter management tool. We’re looking at delivering automatic adaptation to changing workloads in a future release.

Q12. How is DSE 2.0 influenced by the DataStax partnership with Pentaho Corporation (February 28, 2012) and their Pentaho Kettle?

Jonathan Ellis: A question we get frequently is, “I’m sold on Cassandra and DSE, but I need to not only move data from my existing RDBMS’s to you, but transform the data so that it fits into my new Cassandra data model. How can I do that?” With Sqoop, we can extract and load, but nothing else. The free Pentaho solution provides very powerful transformation capabilities to massage the incoming data in nearly every way under the sun before it’s inserted into Cassandra. It does this very fast, too, and with a visual user interface.
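The extract-transform-load division of labor he describes can be sketched as three small steps (illustrative Python with a made-up row mapping; Kettle expresses the same pipeline visually):

```python
def extract(rows):
    """'E': rows as they come from a relational table."""
    for row in rows:
        yield row


def transform(row):
    """'T': reshape a flat relational row into the wide-row layout a
    new Cassandra data model might expect (hypothetical mapping)."""
    return (
        "customer:%d" % row["customer_id"],           # row key
        {"name": row["first"] + " " + row["last"],    # merged column
         "country": row["country"].upper()},
    )


def load(store, pairs):
    """'L': write the transformed rows into the target store."""
    for key, columns in pairs:
        store[key] = columns


relational = [
    {"customer_id": 1, "first": "Ada", "last": "Lovelace", "country": "uk"},
]
cassandra_like = {}
load(cassandra_like, (transform(r) for r in extract(relational)))
print(cassandra_like["customer:1"]["country"])  # UK
```

Sqoop covers the E and L steps; the T step in the middle is what a tool like Kettle adds.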

Q13. Anything else to add?

Jonathan Ellis: DSE 2.0 is available for download now and is free to use, without any restrictions, for development purposes. Once you move to production, we do require a subscription, but I think you’ll find that the cost associated with DSE is much less than any RDBMS vendor.

_____________

Jonathan Ellis.
Jonathan Ellis is CTO and co-founder of DataStax (formerly Riptano), the commercial leader in products and support for Apache Cassandra. Prior to DataStax, Jonathan built a multi-petabyte, scalable storage system based on Reed-Solomon encoding for backup provider Mozy. Jonathan is project chair of Apache Cassandra.

Related Posts

- Interview with Jonathan Ellis, project chair of Apache Cassandra (May 16, 2011).

- Analytics at eBay. An interview with Tom Fastner (October 6, 2011).

- On Big Data: Interview with Dr. Werner Vogels, CTO and VP of Amazon.com (November 2, 2011).

Related Resources

- Big Data and Analytical Data Platforms – Articles.

- NoSQL Data Stores – Articles, Papers, Presentations.

Mar 28 12

Publishing Master and PhD Thesis on Big Data.

by Roberto V. Zicari

To help disseminate the work of young students and researchers in the area of databases, I have started publishing Master’s and PhD theses on ODBMS.ORG.

Published Master’s and PhD theses are available for free download (as .pdf) to all visitors of ODBMS.ORG (50,000+ visitors/month).

Copyright of the theses remains with the authors.

The process of submission is quite simple. Please send (at any time) by email to editor AT odbms.org:

1) a .pdf of your work

2) the filled in template below:
___________________________
Title of the work:
Language (English preferable):
Author:
Affiliation:
Short Abstract (max 2-3 sentences of text):
Type of work (PhD, Master):
Area (see classification below):
No of Pages:
Year of completion:
Name of supervisor/affiliation:

________________________________

To qualify for publication in ODBMS.ORG, the thesis should have been completed and accepted by the respective University/Research Center in 2011 or later, and it should address one or more of the following areas:

> Big Data: Analytics, Storage Platforms
> Cloud Data Stores
> Entity Framework (EF)
> Graphs and Data Stores
> In-Memory Databases
> Object Databases
> NewSQL Data Stores
> NoSQL Data Stores
> Object-Relational Technology
> Relational Databases: Benchmarking, Data Modeling

For any questions, please do not hesitate to contact me.

Hope this helps.

Best Regards

Roberto V. Zicari
Editor
ODBMS.ORG
ODBMS Industry Watch Blog

Mar 16 12

In-memory database systems. Interview with Steve Graves, McObject.

by Roberto V. Zicari

“Application types that benefit from an in-memory database system are those for which eliminating latency is a key design goal, and those that run on systems that simply have no persistent storage, like network routers and low-end set-top boxes” — Steve Graves.

On the topic of in-memory database systems, I interviewed one of our experts, Steve Graves, co-founder and CEO of McObject.

RVZ

Q1. What is an in-memory database system (IMDS)?

Steve Graves: An in-memory database system (IMDS) is a database management system (DBMS) that uses main memory as its primary storage medium.
A “pure” in-memory database system is one that requires no disk or file I/O, whatsoever.
In contrast, a conventional DBMS is designed around the assumption that records will ultimately be written to persistent storage (usually hard disk or flash memory).
Obviously, disk or flash I/O is expensive, in performance terms, and therefore retrieving data from RAM is faster than fetching it from disk or flash, so IMDSs are very fast.
An IMDS also offers a more streamlined design. Because it is not built around the assumption of storage on hard disk or flash memory, the IMDS can eliminate the various DBMS sub-systems required for persistent storage, including cache management, file management and others. For this reason, an in-memory database is also faster than a conventional database that is either fully-cached or stored on a RAM-disk.

In other areas (not related to persistent storage) an IMDS can offer the same features as a traditional DBMS. These include SQL and/or native language (C/C++, Java, C#, etc.) programming interfaces; formal data definition language (DDL) and database schemas; support for relational, object-oriented, network or combination data designs; transaction logging; database indexes; client/server or in-process system architectures; security features, etc. The list could go on and on. In-memory database systems are a sub-category of DBMSs, and should be able to do everything that entails.

Q2. What are the significant differences between an in-memory database and a database that happens to be in memory (e.g. deployed on a RAM-disk)?

Steve Graves: We use the comparison to illustrate IMDSs’ contribution to performance beyond the obvious elimination of disk I/O. If IMDSs’ sole benefit stemmed from getting rid of physical I/O, then we could get the same performance by deploying a traditional DBMS entirely in memory – for example, using a RAM-disk in place of a hard drive.

We tested an application performing the same tasks with three storage scenarios: using an on-disk DBMS with a hard drive; the same on-disk DBMS with a RAM-disk; and an IMDS (McObject’s eXtremeDB). Moving the on-disk database to a RAM drive resulted in nearly 4x improvement in database reads, and more than 3x improvement in writes. But the IMDS (using main memory for storage) outperformed the RAM-disk database by 4x for reads and 420x for writes.

Clearly, factors other than eliminating disk I/O contribute to the IMDS’s performance – otherwise, the DBMS-on-RAM-disk would have matched it. The explanation is that even when using a RAM-disk, the traditional DBMS is still performing many persistent storage-related tasks.
For example, it is still managing a database cache – even though the cache is now entirely redundant, because the data is already in RAM. And the DBMS on a RAM-disk is transferring data to and from various locations, such as a file system, the file system cache, the database cache and the client application, compared to an IMDS, which stores data in main memory and transfers it only to the application. These sources of processing overhead are hard-wired into on-disk DBMS design, and persist even when the DBMS uses a RAM-disk.

An in-memory database system also uses the storage space (memory) more efficiently.
A conventional DBMS can use extra storage space in a trade-off to minimize disk I/O (the assumption being that disk I/O is expensive, and storage space is abundant, so it’s a reasonable trade-off). Conversely, an IMDS needs to maximize storage efficiency because memory is not abundant in the way that disk space is. So a 10 gigabyte traditional database might only be 2 gigabytes when stored in an in-memory database.

Q3. What is in your opinion the current status of the in-memory database technology market?

Steve Graves: The best word for the IMDS market right now is “confusing.” “In-memory database” has become a hot buzzword, with seemingly every DBMS vendor now claiming to have one. Often these purported IMDSs are simply the providers’ existing disk-based DBMS products, which have been tweaked to keep all records in memory – and they more closely resemble a 100% cached database (or a DBMS that is using a RAM-disk for storage) than a true IMDS. The underlying design of these products has not changed, and they are still burdened with DBMS overhead such as caching, data transfer, etc. (McObject has published a white paper, Will the Real IMDS Please Stand Up?, about this proliferation of claims to IMDS status.)

Only a handful of vendors offer IMDSs that are built from scratch as in-memory databases. If you consider these to comprise the in-memory database technology market, then the status of the market is mature. The products are stable, have existed for a decade or more and are deployed in a variety of real-time software applications, ranging from embedded systems to real-time enterprise systems.

Q4. What are the application types that benefit from the use of an in-memory database system?

Steve Graves: Application types that benefit from an IMDS are those for which eliminating latency is a key design goal, and those that run on systems that simply have no persistent storage, like network routers and low-end set-top boxes. Sometimes these types overlap, as in the case of a network router that needs to be fast, and has no persistent storage. Embedded systems often fall into the latter category, in fields such as telco and networking gear, avionics, industrial control, consumer electronics, and medical technology. What we call the real-time enterprise sector is represented in the first category, encompassing uses such as analytics, capital markets (algorithmic trading, order matching engines, etc.), real-time cache for e-commerce and other Web-based systems, and more.

Software that must run with minimal hardware resources (RAM and CPU) can also benefit.
As discussed above, IMDSs eliminate sub-systems that are part-and-parcel of on-disk DBMS processing. This streamlined design results in a smaller database system code size and reduced demand for CPU cycles. When it comes to hardware, IMDSs can “do more with less.” This means that the manufacturer of, say, a set-top box that requires a database system for its electronic programming guide, may be able to use a less powerful CPU and/or less memory in each box when it opts for an IMDS instead of an on-disk DBMS. These manufacturing cost savings are particularly desirable in embedded systems products targeting the mass market.

Q5. McObject offers an in-memory database system called eXtremeDB, and an open source embedded DBMS, called Perst. What is the difference between the two? Is there any synergy between the two products?

Steve Graves: Perst is an object-oriented embedded database system.
It is open source and available in Java (including Java ME) and C# (.NET) editions. The design goal for Perst is to provide persistence for Java and C# objects that is as nearly transparent as practically possible within the normal Java and .NET frameworks. In other words, no special tools, byte codes, or virtual machine are needed. Perst should provide persistence to Java and C# objects while changing the way a programmer uses those objects as little as possible.

eXtremeDB is not an object-oriented database system, though it does have attributes that give it an object-oriented “flavor.” The design goals of eXtremeDB were to provide a full-featured, in-memory DBMS that could be used right across the computing spectrum: from resource-constrained embedded systems to high-end servers used in systems that strive to squeeze out every possible microsecond of latency. McObject’s eXtremeDB in-memory database system product family has features including support for multiple APIs (SQL ODBC/JDBC & native C/C++, Java and C#), varied database indexes (hash, B-tree, R-tree, KD-tree, and Patricia Trie), ACID transactions, multi-user concurrency (via both locking and “optimistic” transaction managers), and more. The core technology is embodied in the eXtremeDB IMDS edition. The product family includes specialized editions, built on this core IMDS, with capabilities including clustering, high availability, transaction logging, hybrid (in-memory and on-disk) storage, 64-bit support, and even kernel mode deployment. eXtremeDB is not open source, although McObject does license the source code.

The two products do not overlap. There is no shared code, and there is no mechanism for them to share or exchange data. Perst for Java is written in Java, Perst for .NET is written in C#, and eXtremeDB is written in C, with optional APIs for Java and .NET. Perst is a candidate for Java and .NET developers that want an object-oriented embedded database system, have no need for the more advanced features of eXtremeDB, do not need to access their database from C/C++ or from multiple programming languages (a Perst database is compatible with Java or C#), and/or prefer the open source model. Perst has been popular for smartphone apps, thanks to its small footprint and smart engineering that enables Perst to run on mobile platforms such as Windows Phone 7 and Java ME.
eXtremeDB will be a candidate when eliminating latency is a key concern (Perst is quite fast, but not positioned for real-time applications), when the target system doesn’t have a JVM (or sufficient resources for one), when the system needs to support multiple programming languages, and/or when any of eXtremeDB’s advanced features are required.

Q6. What are the current main technological developments for in-memory database systems?

Steve Graves: At McObject, we’re excited about the potential of IMDS technology to scale horizontally, across multiple hardware nodes, to deliver greater scalability and fault-tolerance while enabling more cost-effective system expansion through the use of low-cost (i.e. “commodity”) servers. This enthusiasm is embodied in our new eXtremeDB Cluster edition, which manages data stores across distributed nodes. Among eXtremeDB Cluster’s advantages is that it eliminates any performance ceiling from being CPU-bound on a single server.

Scaling across multiple hardware nodes is receiving a lot of attention these days with the emergence of NoSQL solutions. But database system clustering actually has much deeper roots. One of the application areas where it is used most widely is in telecommunications and networking infrastructure, where eXtremeDB has always been a strong player. And many emerging application categories – ranging from software-as-a-service (SaaS) platforms to e-commerce and social networking applications – can benefit from a technology that marries IMDSs’ performance and “real” DBMS features with a distributed system model.

Q7. What are the similarities and differences between current various database clustering solutions? In particular, let’s look at dimensions such as scalability, ACID vs. CAP, intended/applicable problem domains, structured vs. unstructured, and complexity of implementation.

Steve Graves: ACID support vs. “eventual consistency” is a good place to start looking at the differences between clustering database solutions (including some cluster-like NoSQL products). ACID-compliant transactions will be Atomic, Consistent, Isolated and Durable; consistency implies the transaction will bring the database from one valid state to another and that every process will have a consistent view of the database. ACID-compliance enables an on-line bookstore to ensure that a purchase transaction updates the Customers, Orders and Inventory tables of its DBMS. All other things being equal, this is desirable: updating Customers and Orders while failing to change Inventory could potentially result in other orders being taken for items that are no longer available.
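The bookstore example can be condensed into a sketch of atomicity (plain Python standing in for the DBMS’s transaction machinery): either all three “tables” change, or none do.

```python
class OutOfStock(Exception):
    pass


def purchase(db, customer, book, qty):
    """All three table updates commit together or not at all; a failed
    inventory check must leave every table untouched."""
    snapshot = {t: dict(db[t]) for t in db}   # naive undo log
    try:
        if db["inventory"][book] < qty:
            raise OutOfStock(book)
        db["inventory"][book] -= qty
        db["orders"][customer] = db["orders"].get(customer, 0) + qty
        db["customers"][customer] = "active"
    except Exception:
        db.clear()
        db.update(snapshot)                   # rollback to the snapshot
        raise


db = {"customers": {}, "orders": {}, "inventory": {"dune": 1}}
purchase(db, "alice", "dune", 1)
try:
    purchase(db, "bob", "dune", 1)            # no stock left -> rolls back
except OutOfStock:
    pass
print(db["inventory"]["dune"], db["orders"])  # 0 {'alice': 1}
```

Without the rollback step, a failure between the Orders and Inventory updates would leave exactly the half-updated state the interview warns about.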

However, enforcing the ACID properties becomes more of a challenge with distributed solutions, such as database clusters, because the node initiating a transaction has to wait for acknowledgement from the other nodes that the transaction can be successfully committed (i.e. there are no conflicts with concurrent transactions on other nodes). To speed up transactions, some solutions have relaxed their enforcement of these rules in favor of an “eventual consistency” that allows portions of the database (typically on different nodes) to become temporarily out-of-synch (inconsistent).

Systems embracing eventual consistency will be able to scale horizontally better than ACID solutions – it boils down to their asynchronous rather than synchronous nature.

Eventual consistency is, obviously, a weaker consistency model, and implies some process for resolving consistency problems that will arise when multiple asynchronous transactions give rise to conflicts. Resolving such conflicts increases complexity.
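A minimal picture of that divergence and convergence, using a last-writer-wins rule (one common conflict-resolution strategy; illustrative Python only):

```python
class Replica:
    """Last-writer-wins register: the newest timestamp wins, a simple
    conflict-resolution rule used by some eventually consistent stores."""
    def __init__(self):
        self.value, self.stamp = None, -1

    def write(self, value, stamp):
        if stamp > self.stamp:         # newer write wins
            self.value, self.stamp = value, stamp

    def merge(self, other):
        self.write(other.value, other.stamp)


a, b = Replica(), Replica()
a.write("x=1", stamp=1)   # client writes to node a
b.write("x=2", stamp=2)   # concurrent write to node b: replicas diverge
print(a.value, b.value)   # x=1 x=2  (temporarily inconsistent)

a.merge(b)
b.merge(a)                # anti-entropy exchange
print(a.value, b.value)   # x=2 x=2  (converged)
```

Even this trivial rule silently discards the losing write, which is the kind of resolution complexity the paragraph above alludes to.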

Another area where clustering solutions differ is along the lines of shared-nothing vs. shared-everything approaches. In a shared-nothing cluster, each node has its own set of data.
In a shared-everything cluster, each node works on a common copy of database tables and rows, usually stored in a fast storage area network (SAN). Shared-nothing architecture is naturally more complex: if the data in such a system is partitioned (each node has only a subset of the data) and a query requests data that “lives” on another node, there must be code to locate and fetch it. If the data is not partitioned (each node has its own copy) then there must be code to replicate changes to all nodes when any node commits a transaction that modifies data.
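Hash partitioning, the usual way a shared-nothing cluster decides where a key “lives,” can be sketched as (illustrative Python):

```python
import hashlib

NODES = ["node-a", "node-b", "node-c"]


def owner(key):
    """Each key 'lives' on exactly one node, so a query for that key
    must be routed (or forwarded) to its owner."""
    digest = hashlib.md5(key.encode()).hexdigest()
    return NODES[int(digest, 16) % len(NODES)]


cluster = {n: {} for n in NODES}


def put(key, value):
    cluster[owner(key)][key] = value


def get(key):
    return cluster[owner(key)].get(key)   # fetch from the owning node only


put("user:1", "Ada")
put("user:2", "Alan")
print(get("user:1"), get("user:2"))       # Ada Alan
```

The routing code in `owner` is exactly the extra machinery a partitioned shared-nothing design needs that a shared-everything design avoids.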

NoSQL solutions emerged in the past several years to address challenges that occur when scaling the traditional RDBMS. To achieve scale, these solutions generally embrace eventual consistency (consistent with the CAP Theorem, which holds that a distributed system cannot simultaneously guarantee Consistency, Availability and Partition tolerance). And this choice defines the intended/applicable problem domains. Specifically, it eliminates systems that must have consistency. However, many systems don’t have this strict consistency requirement – an on-line retailer such as the bookstore mentioned above may accept the occasional order for a non-existent inventory item as a small price to pay for being able to meet its scalability goals. Conversely, transaction processing systems typically demand absolute consistency.

NoSQL is often described as a better choice for so-called unstructured data. Whereas RDBMSs have a data definition language that describes a database schema and becomes recorded in a database dictionary, NoSQL databases are often schema-less, storing opaque “documents” that are keyed by one or more attributes for subsequent retrieval. Proponents argue that schema-less solutions free us from the rigidity imposed by the relational model and make it easier to adapt to real-world changes. Opponents argue that schema-less systems are for lazy programmers, create a maintenance nightmare, and that there is no equivalent to relational calculus or the ANSI standard for SQL. But the entire structured or unstructured discussion is tangential to database cluster solutions.

Q8. Are in-memory database systems an alternative to classical disk-based relational database systems?

Steve Graves: In-memory database systems are an ideal alternative to disk-based DBMSs when performance and efficiency are priorities. However, this explanation is a bit fuzzy, because what programmer would not claim speed and efficiency as goals? To nail down the answer, it’s useful to ask, “When is an IMDS not an alternative to a disk-based database system?”

Volatility is pointed to as a weak point for IMDSs. If someone pulls the plug on a system, all the data in memory can be lost. In some cases, this is not a terrible outcome. For example, if a set-top box programming guide database goes down, it will be re-provisioned from the satellite transponder or cable head-end. In cases where volatility is more of a problem, IMDSs can mitigate the risk. For example, an IMDS can incorporate transaction logging to provide recoverability. In fact, transaction logging is unavoidable with some products, such as Oracle’s TimesTen (it is optional in eXtremeDB). Database clustering and other distributed approaches (such as master/slave replication) contribute to database durability, as does use of non-volatile RAM (NVRAM, or battery-backed RAM) as storage instead of standard DRAM. Hybrid IMDS technology enables the developer to specify persistent storage for selected record types (presumably those for which the “pain” of loss is highest) while all other records are managed in memory.

However, all of these strategies require some effort to plan and implement. The easiest way to reduce volatility is to use a database system that implements persistent storage for all records by default – and that’s a traditional DBMS. So, the IMDS use-case occurs when the need to eliminate latency outweighs the risk of data loss or the cost of the effort to mitigate volatility.

It is also the case that flash and, especially, spinning disk are much less expensive than DRAM, which puts an economic lid on very large in-memory databases for all but the richest users. And, riches notwithstanding, it is not yet possible to build a system with hundreds of terabytes, let alone petabytes or exabytes, of memory, whereas disk-based storage has no such limitation.

By continuing to use traditional databases for most applications, developers and end-users are signaling that DBMSs’ built-in persistence is worth its cost in latency. But the growing role of IMDSs in real-time technology ranging from financial trading to e-commerce, avionics, telecom/Netcom, analytics, industrial control and more shows that the need for speed and efficiency often outweighs the convenience of a traditional DBMS.

———–
Steve Graves is co-founder and CEO of McObject, a company specializing in embedded Database Management System (DBMS) software. Prior to McObject, Steve was president and chairman of Centura Solutions Corporation and vice president of worldwide consulting for Centura Software Corporation.

Related Posts

- A super-set of MySQL for Big Data. Interview with John Busch, Schooner.

- Re-thinking Relational Database Technology. Interview with Barry Morris, Founder & CEO NuoDB.

- On Data Management: Interview with Kristof Kloeckner, GM IBM Rational Software.

- vFabric SQLFire: Better than RDBMS and NoSQL?

Related Resources

ODBMS.ORG: Free Downloads and Links:
Object Databases
NoSQL Data Stores
Graphs and Data Stores
Cloud Data Stores
Object-Oriented Programming
Entity Framework (EF) Resources
ORM Technology
Object-Relational Impedance Mismatch
Databases in general
Big Data and Analytical Data Platforms

Mar 3 12

Data Modeling for Analytical Data Warehouses. Interview with Michael Blaha.

by Roberto V. Zicari

“Many data warehouses contain sensitive data such as personal data. There are legal and ethical concerns with accessing such data. So the data must be secured and access controlled as well as logged for audits” — Michael Blaha.

This is the third interview with our expert Dr. Michael Blaha on the topic of database modeling. This time we look at the issue of data design for analytical data warehouses.

In previous interviews we looked at how good UML is for database design, and how good Use Cases are for database modeling.

Hope you’ll find this interview interesting. I encourage the community to post comments.

RVZ

Q1: What is the difference between data warehouses and day-to-day business applications?

Michael Blaha: Operational (day-to-day business) applications serve the routine needs of a business: handling orders, scheduling manufacturing runs, servicing patients, and generating financial statements.

Operational applications have many short transactions that must process quickly. The transactions both read and write.
Well-written applications pay attention to data quality, striving to ensure correct data and avoid errors.

In contrast, analytical (data warehouse) applications step back from the business routine and analyze data that accumulates over time. The idea is to gain insight into business patterns that are overlooked when responding to routine needs. Data warehouse queries can have a lengthy execution time as they process reams of data, searching for underlying patterns.

End users read from a data warehouse, but they don’t write to it. Rather, writing occurs as the operational applications supply new data that is added to the data warehouse.

Q2: How do you approach data modeling for data warehouse problems?

Michael Blaha: For operational applications, I use the UML class model for conceptual data modeling. (I often use Enterprise Architect.) The notation is more succinct than conventional database notations and promotes abstract thinking.
In addition, the UML class model is understandable for business customers as it defers database design details. And, of course, the UML reaches out to the programming side of development.

In contrast, for analytical applications, I go straight to a database notation. (I often use ERwin.) Data warehouses revolve around facts and dimensions. The structure of a data warehouse model is so straightforward (unlike the model of an operational application) that a database notation alone suffices.

For a business user, the UML model and the conventional data model look much the same for a data warehouse.
The programmers of a data warehouse (the ETL developers) are accustomed to database notations (unlike the developers in day-to-day applications).

As an aside, I note that in a past book (A Manager’s Guide to Database Technology) I used a UML class model for analytical modeling. In retrospect I now realize that was a forced fit. The class model does not deliver any benefits for data warehouses and it’s an unfamiliar technology for data warehouse developers, so there’s no point in using it there.

Q3: Is there any synergy between non-relational databases (NoSQL, Object Databases) and data warehouses?

Michael Blaha: Not for conventional data warehouses that are set-oriented. Mass quantities of data must be processed in bulk. Set-oriented data processing is a strength of relational databases and the SQL language. Furthermore, tables are a good metaphor for facts and dimensions and the data is intrinsically strongly typed.
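The set-oriented fact/dimension pattern is easy to see in miniature (illustrative Python mirroring what a SQL GROUP BY over a star schema computes):

```python
# Fact table: one row per sale; dimension table: product attributes.
sales = [                               # (product_id, units)
    (1, 3), (2, 5), (1, 2),
]
products = {1: "book", 2: "dvd"}        # dimension lookup

# Set-oriented aggregation, the bulk work SQL engines excel at:
#   SELECT p.name, SUM(s.units)
#   FROM sales s JOIN products p ON s.product_id = p.id
#   GROUP BY p.name
totals = {}
for product_id, units in sales:
    name = products[product_id]         # join fact row to its dimension
    totals[name] = totals.get(name, 0) + units

print(totals)                           # {'book': 5, 'dvd': 5}
```

A relational engine runs this kind of join-and-aggregate over the whole fact table in bulk, which is the strength Blaha is pointing to.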

NoSQL (Hadoop) is being used for mining Web data. Web data is by its nature unstructured and much different from conventional data warehouses.
