Lately I have often been asked about the differences between the popular big data initiatives and traditional data warehousing concepts. A few months ago I tried to distinguish the two in the context of the overall information management paradigm in a blog post that I wrote for the “IBM.Talking about” blog from IBM Croatia. In the lines that follow, I bring you the English translation of that text.
For those of you who are unfamiliar with it, the term big data refers to the overall phenomenon of dramatic growth in the amount of available data, driven by the global adoption of the Internet, the enormous volumes of data coming from social networks and mobile technologies, billions of sensors worldwide and the massive digitalization of practically everything we do. Big data also refers to the technologies that are able to process such large amounts of data coming from heterogeneous sources, regardless of whether the data is structured, such as database transaction records, or unstructured, such as pictures or videos. These technologies are based on specific repositories that are able to store different types of files in their native formats and use the principles of massively parallel processing to analyze and process such data.
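The idea of splitting work across many workers and then combining the partial results can be sketched in a few lines of Python. This is only a toy illustration under stated assumptions: the sample chunks and the function names (`map_chunk`, `reduce_counts`, `count_words`) are hypothetical, and a thread pool on one machine stands in for what would be many nodes in a real cluster.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Hypothetical sample "files" kept in their native form:
# free text, a CSV fragment and a log line.
CHUNKS = [
    "big data big clusters",
    "id,name\n1,data",
    "2014-01-01 INFO big job done",
]


def map_chunk(chunk):
    """Map step: count the words of one chunk, independently of the others."""
    words = chunk.replace(",", " ").split()
    return Counter(word.lower() for word in words)


def reduce_counts(partials):
    """Reduce step: merge the partial counts into one overall result."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


def count_words(chunks, workers=3):
    """Run the map step on a pool of workers, then reduce the results."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(map_chunk, chunks))
    return reduce_counts(partials)


if __name__ == "__main__":
    print(count_words(CHUNKS).most_common(3))
```

Because each chunk is processed without knowledge of the others, adding more chunks only requires adding more workers, which is the property that distributed clusters exploit.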
Structure versus free form; pharmacy versus antique store
Compared to data warehouses, big data solutions handle data which is “cheap” per terabyte. It is dirty, not standardized, lacks a data dictionary and is scattered across different formats. It is “cheap” partly because it takes considerably less effort to load it into repositories based on technologies such as Apache Hadoop (itself an open source project), and partly because of the relatively inexpensive processors and storage in distributed clusters with relatively high fault tolerance. The data in our traditional data warehouses is, on the other hand, pretty “expensive”. It has to pass substantial control, cleansing and standardization before it even gets the chance to knock on the door of a well-structured data warehouse. Next to a big data cluster, a data warehouse looks like a pharmacy next to a grocery store. In fact, even the mention of a grocery store is too generous; the better comparison is between a pharmacy and something without structured and standardized content, more like a flea market or an antique store.
Boring and exploring
After dealing with pharmacies (here a synonym for data warehouses) for so long, in the era of big data we are starting to visit places that are supplied with a variety of items at a low entry price, without a fancy supply chain, without traceability and without complicated regulatory requirements. However, in such places a connoisseur can come away with surprisingly lucrative finds … just as a data scientist can by analyzing large amounts of varied data inside a big data repository. With their certified sellers (pharmacists), strictly certified products (pharmaceuticals) and certified points of sale (in many countries pharmacies are licensed according to population density), pharmacies are expensive places per unit sold. We enter them with a prescription (or it has already been “brought there” through an IT system) and with unambiguous motives (e.g. stopping the pain). Such structure disappears in an antique store. We rarely enter one with a particular intention. On top of that, these usually inexpensively furnished shops offer all sorts of things: art and books, dishes, useful little things that nobody needs, and precious objects from the distant past.
Unlike in a pharmacy, you usually will not know in advance what your purchase will turn out to be. You might be holding something really valuable in your hands. Perhaps, with further research, you will realize that the painting you have just purchased is actually worth a fortune and that you no longer need to play the lottery. I mean never again! Your visit to the pharmacy will certainly never end with the idea of not playing the lottery again or closing down your private business. The outcomes from a traditional data warehouse are just like that: boring and predictable. Rare exceptions aside, a DWH is generally built with the outcomes known in advance. Users are left to search for relationships, understand trends and identify extremes. It will rarely become a journey into the unknown, combining the incompatible and correlating distant phenomena. That part of the job we leave to big data.
Comparing the two
Let’s try once again to quickly compare, still generalizing, some aspects of traditional data warehouse and big data solutions. The technology implementation alone is generally easier with big data. To build the repository it is not necessary to design an extremely detailed data schema and have an exact spot ready for each byte of stored data based on its type and place in the hierarchy. The logistics of data supply (ETL and data governance) are again much more complicated in a traditional DWH. Administration is similar, as is the learning curve in adopting the technology.
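The difference in up-front design effort can be sketched in a few lines of Python. This is a minimal illustration and not a description of any particular product: the `sales` table, the record fields and the helper names are all hypothetical, and the standard-library `sqlite3` module merely stands in for a warehouse, while a plain list of raw lines stands in for a big data repository.

```python
import json
import sqlite3

# Hypothetical incoming records; the second one does not fit the expected shape.
RAW_RECORDS = [
    '{"customer": "Ana", "amount": 120.5, "currency": "EUR"}',
    '{"customer": "Ivan", "comment": "photo attached", "tags": ["mobile"]}',
    '{"customer": "Maja", "amount": 80, "currency": "EUR"}',
]

REQUIRED_FIELDS = ("customer", "amount", "currency")


def load_schema_on_write(lines):
    """Warehouse style: the schema comes first, only conforming rows get in."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE sales ("
        "customer TEXT NOT NULL, amount REAL NOT NULL, currency TEXT NOT NULL)"
    )
    rejected = []
    for line in lines:
        record = json.loads(line)
        if all(field in record for field in REQUIRED_FIELDS):
            row = tuple(record[field] for field in REQUIRED_FIELDS)
            conn.execute("INSERT INTO sales VALUES (?, ?, ?)", row)
        else:
            # In a real ETL flow this record would go to cleansing or be dropped.
            rejected.append(record)
    return conn, rejected


def load_schema_on_read(lines):
    """Big data style: keep every record as-is and interpret it only later."""
    return list(lines)


def total_amount_on_read(store):
    """The structure is applied at query time; missing fields are skipped."""
    total = 0.0
    for line in store:
        record = json.loads(line)
        total += record.get("amount", 0)
    return total


if __name__ == "__main__":
    conn, rejected = load_schema_on_write(RAW_RECORDS)
    print(conn.execute("SELECT SUM(amount) FROM sales").fetchone()[0], rejected)
    print(total_amount_on_read(load_schema_on_read(RAW_RECORDS)))
```

In the first loader the effort sits in the design and the load step, and the non-conforming record never reaches the table. In the second, loading is trivial and nothing is lost, but every later query has to cope with whatever shape the data happens to have.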
In the case of a traditional DWH, no experts are needed for the (already prepared) analytics, as data is usually packed into a predefined syntax and predefined analytical processes that are used by ordinary users: business people, scientists, analysts, … With big data this part is much more complicated. The massive amount of collected data needs someone who knows how to filter it in order to reach the value that resides within. Common big data scenarios (e.g. marketing targeting) are often based on “chewing” through the data over and over again across different dimensions and unstructured attributes. Someone has to distinguish the important from the unimportant, coincidences from rules. That person should know filtering techniques and data modeling, and be familiar with different tools and algorithms, such as those that are able to connect a person to an image, recognize written text in a picture or understand natural language semantics… Due to its unfiltered nature, big data is significantly “more expensive” at this stage.
Who leaves and who stays?
And finally, a little disappointment for all of those who are fed up with continuously optimizing data warehousing models, with the immense work required whenever changes or additions occur, and with worries about unruly derived data and ever-changing data sources. Big data and DWH are here to stay side by side, each in its own role, just like a flea market and a pharmacy, complementing each other… at least for some time*.
*It is very likely that in the future we will have a single platform for both structured and massive unstructured data. To get to this point, some basic technologies, such as fast SQL queries on unstructured repositories, still have to be developed. On the other hand, the convergence between the two will be further supported by emerging infrastructure technologies such as in-memory databases, various high performance computing technologies, flash storage and specialized compute architectures.