Drawing I made to help explain how the existing BI paradigm changes with the introduction of Big Data
Big Data and traditional data warehouses. How to compare them? What are the differences and the roles? Who wins? No one actually.
Lately I’ve often been asked about the differences between the ever-popular big data initiatives and traditional data warehousing concepts. A few months ago I tried to distinguish the two in the context of the overall information management paradigm in a blog post that I wrote for the “IBM.Talking about” blog from IBM Croatia. What follows is the English translation of that text.
For those of you who are unfamiliar with it, the term big data refers to the overall phenomenon of dramatic growth in the amount of available data, driven by global acceptance of the Internet, incredible amounts of data from social networks and mobile technologies, billions of sensors worldwide and the massive digitalization of practically everything we do. Big data also refers to the technologies that are able to process such large amounts of data coming from heterogeneous sources, regardless of whether the data is structured, such as database transaction records, or unstructured, such as pictures or videos. These technologies are based on specific repositories that are able to store different types of files in their native formats and use the principles of massively parallel processing to analyze and process such data.
Structure against the free form; pharmacy against antique store
Compared to data warehouses, big data solutions handle data which is “cheap” per terabyte. It is dirty, not standardized, without a dictionary, scattered across different formats. It is “cheap” also because it takes considerably less effort to load it into repositories based on technologies such as Apache Hadoop (which itself is an open source project), and because of the relatively inexpensive processor units and storage capacity of distributed clusters with relatively high fault tolerance. The data in our traditional data warehouses, on the other hand, is pretty “expensive”. It has to pass substantial control, cleansing and standardization before it even gets the chance to knock on the door of a well-structured data warehouse. Next to a big data cluster, a data warehouse looks like a pharmacy next to a grocery store. In fact, even the mention of a grocery store is too generous: we are really comparing pharmacies with something whose content is neither structured nor standardized – more like a flea market or an antique store.
Boring and exploring
After dealing with pharmacies (here a synonym for data warehouses) for so long, in the era of big data we are starting to visit places that are stocked with a variety of items at a low entry price, without a fancy supply chain, without traceability and complicated regulatory requirements. In such places, however, a connoisseur could achieve surprisingly lucrative outcomes … just as a data scientist could by analyzing large amounts of data in a variety of forms inside a big data repository. From certified sellers (pharmacists), through strictly certified products (pharmaceuticals), to certified points of sale (in many countries pharmacies are licensed according to population density), pharmacies are expensive places per unit sold. We enter with a prescription (or it has already been “brought there” through an IT system) and with unambiguous motives (e.g. stopping the pain). In an antique store, on the other hand, such structure disappears. We rarely enter with a particular intention. On top of that, these usually inexpensively furnished shops offer all sorts of things – from art and books, through dishes, to useful little things that nobody needs and precious objects from the distant past.
Unlike in a pharmacy, you usually will not know in advance what your purchase will turn out to be. You might be holding something really valuable in your hands. Perhaps, with further research, you will realize that the painting you have just purchased is actually worth a fortune and that you no longer need to play the lottery. I mean never again! Your visit to the pharmacy will certainly never end with the idea of not playing the lottery again or of closing the private business you run. The outcomes from a traditional data warehouse are just like that – boring and predictable. Rare exceptions aside, a DWH is generally built with the outcomes known in advance. Users are left to search for relations, understand trends and identify extremes. It will rarely become a journey into the unknown, combining the incompatible and correlating distant phenomena. That part of the job we leave to big data.
Comparing the two
Let’s try once again to quickly compare, still generalizing, some aspects of traditional data warehouse and big data solutions. The technology implementation alone is generally easier with big data. To build the repository it is not necessary to design an extremely detailed data schema and have an exact spot ready for each byte of stored data based on its type and place in the hierarchy. The logistics of data supply (ETL and data governance) are again much more complicated in a traditional DWH. Administration is similar, as is the learning curve in adopting the technology.
In the case of a traditional DWH, no experts are needed for the (already prepared) analytics, as the data is usually packed into a predefined syntax and predefined analytical processes which are used by ordinary users – business people, scientists, analysts, … With big data this part is much more complicated. The massive amount of collected data needs someone who knows how to filter it in order to reach the value that resides within. Common big data scenarios (e.g. marketing targeting) are often based on “chewing” the data all over again across different dimensions and unstructured attributes. Someone has to distinguish the important from the unimportant, coincidences from rules. They should know filtering techniques and data modeling, and be familiar with different tools and algorithms, such as those that are able to connect a person to an image, recognize script in a picture or understand natural language semantics… Due to its unfiltered nature, big data is significantly “more expensive” at this stage.
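To make the contrast a bit more tangible, here is a minimal sketch of the two loading styles, written in Python. It is only an illustration under my own assumptions: the sample events, the `sales` table and all function names are hypothetical, and an in-memory SQLite table merely stands in for a real warehouse while a plain list stands in for a big data repository.

```python
import json
import sqlite3

# Hypothetical raw events, as they might arrive from different sources.
RAW_EVENTS = [
    '{"customer": "C1", "amount": 120.5, "currency": "EUR"}',
    '{"customer": "C2", "amount": "n/a"}',
    '{"user": "C3", "comment": "great product", "photo": "img_001.jpg"}',
]


def load_warehouse(raw_events):
    """Warehouse style: every record must fit the schema before it gets in."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE sales ("
        "customer TEXT NOT NULL, amount REAL NOT NULL, currency TEXT NOT NULL)"
    )
    rejected = []
    for line in raw_events:
        record = json.loads(line)
        try:
            row = (record["customer"], float(record["amount"]), record["currency"])
        except (KeyError, ValueError, TypeError):
            # The record fails control and never reaches the warehouse.
            rejected.append(line)
            continue
        conn.execute("INSERT INTO sales VALUES (?, ?, ?)", row)
    return conn, rejected


def load_repository(raw_events):
    """Big data style: keep everything in its native form, no checks at load."""
    return list(raw_events)


def total_amount(repository):
    """The interpretation effort moves to read time, to whoever analyzes the data."""
    total = 0.0
    for line in repository:
        record = json.loads(line)
        try:
            total += float(record.get("amount"))
        except (TypeError, ValueError):
            continue  # skip records without a usable amount
    return total
```

Loading the repository is trivial and nothing is lost, but every later question has to deal with the mess again. The warehouse pays that price once, at the door, and turns away whatever does not fit.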
Who leaves and who stays?
And finally, a little disappointment for all of those who are fed up with continuously optimizing data warehousing models, with the immense work whenever changes or additions occur, and with worries about unruly derived data and ever-changing data sources. Big data and DWH are here to stay side by side, each in its own role, just like a flea market and a pharmacy, complementing each other… at least for some time*.
*It is very likely that in the future we will have a single platform for both structured and massive unstructured data. To get to this point, some basic technologies, such as fast SQL queries on unstructured repositories, still need to be developed. In addition, the convergence between the two will be further supported by emerging infrastructure technologies, such as in-memory databases, different high performance computing technologies, flash storage and specialized compute architectures.
Big Data is a big buzzword. Although it brings huge opportunities, the “hype talks” might mislead you into the wrong conclusion that Big Data is a cure-all. The truth is pretty simple – Big Data can give only those answers that are hidden within the analyzed data set.
Spring 2010: in a small Croatian town, an unusual meeting is going on in a factory that we will call PPP. The meeting room on the first floor of the administrative building, right next to the gray production halls, hosts a group of about 10 different people. The individuals in the room are taking part in a fiery debate about the presentation projected on the wall. The discussion among a group of people in white coats, most probably production and development engineers, is obviously driven by its two most active members, who often argue from conflicting views. There are a few people dressed in suits. One would guess that those are consultants and the company’s management team. Part of that group is quiet; they are just listening, nodding from time to time to show their mental presence in the debate. Two people from the group in suits ask a lot of questions. Some other participants are very active as well. They draw on the board and answer the questions with lively gestures. Those are mostly members of the academic community who are taking part in one of the EU “cross-border cooperation” projects, which is actually the reason for this colorful meeting.
Let’s add sensors
In order to improve the efficiency of the production process and product quality, PPP set out to enrich certain phases of production with additional sensors, PLC and SCADA elements. By increasing the number of sensors from 12 to 35 per production machine, PPP started one of the numerous initiatives around the world that contribute to the enormous global growth of machine-generated data, the kind we like to call Big Data. At one point, a temperamental professor with a French beard took the stage. He passionately explained to the group the recent results gathered from mining the newly established data sets based on the increased number of sensors. No matter how clear the colorful graphs were, and despite insight that went well beyond the previous findings, it was hard not to notice the indifference on the faces of the other participants in the meeting. Something was missing!
Data model or a Swiss cheese?
The whole initiative should provide, if not revolutionary, then at least usable insights. “We need to close the circle!” All of a sudden, the eyes of the participants turned to the consultant who had been silent so far. “We need to close the information circle. You have all the parameters of the machine, but you really should start from the goals. You have to ensure traceability and link the quality of the products with the different stages of the production process and their parameters. Otherwise the new parameters won’t have much to say.” It is difficult to attach IT tags to the hot metal castings produced by the machines at PPP, so the data that was supposed to link the quality achieved and the level of waste with the 35 newly established parameters was simply missing.
Big Data: new methods, old constraints
Concluding superficially, Big Data might be perceived as a cure for everything: “now that we have so much information available, it is enough to develop mathematical algorithms and we will find all the answers.” But the truth is exactly the opposite. Today we have plenty of mathematical algorithms – from those that recognize your face, the tone of your voice or your fingerprint to those which understand the context of human speech – but the ways in which we traditionally collect data (processes) are not aligned with the technological capabilities of finding patterns in data and filtering it through massively parallel processing (technology). More specifically, Big Data technologies will surely find patterns in a large amount of data, but those patterns will not always answer your problem or give you new relevant insights. In the same way, the data mining at PPP provided insight into machine behavior, such as the stability patterns of certain parameters during the production cycles, including some insightful deviations. But it offered no answers about how those deviations and patterns affected the only thing that really mattered – the quality of the product. The answers must be contained somewhere within the data set that we explore. They have to be part of the meta model of the entity that we analyze, or we must be able to deduce them from the attributes of other entities that are similar enough to the one we study (e.g. the data on product quality from hundreds of similar or identical machines worldwide, in the case of PPP).
You can read more in the May 2013 issue of the Mreža magazine (Croatian language only), or later in the year, translated into English, at Alen’s Thing Place.
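The missing link can be shown with a small Python sketch. Everything in it is hypothetical and invented for illustration (the readings, the casting identifiers, the `trace_links` mapping and the function names); it is not PPP’s data or method. It only shows that sensor readings can be related to quality when something ties a production cycle to the casting it produced.

```python
from statistics import mean

# Hypothetical sensor readings, one per production cycle.
SENSOR_READINGS = [
    {"cycle_id": 1, "temperature": 700.0},
    {"cycle_id": 2, "temperature": 705.0},
    {"cycle_id": 3, "temperature": 740.0},
    {"cycle_id": 4, "temperature": 745.0},
]

# Hypothetical quality control results, one per casting.
QUALITY_RESULTS = [
    {"casting_id": "A", "defect": False},
    {"casting_id": "B", "defect": False},
    {"casting_id": "C", "defect": True},
    {"casting_id": "D", "defect": True},
]

# The traceability link: which cycle produced which casting (assumed here).
TRACE_LINKS = {1: "A", 2: "B", 3: "C", 4: "D"}


def link_quality(readings, quality, trace_links):
    """Attach the quality outcome to each reading that can be traced to a casting."""
    defect_by_casting = {q["casting_id"]: q["defect"] for q in quality}
    linked = []
    for reading in readings:
        casting_id = trace_links.get(reading["cycle_id"])
        if casting_id in defect_by_casting:
            linked.append({**reading, "defect": defect_by_casting[casting_id]})
    return linked


def mean_temperature_by_outcome(linked):
    """Average temperature for defective and non-defective castings."""
    groups = {True: [], False: []}
    for row in linked:
        groups[row["defect"]].append(row["temperature"])
    return {outcome: mean(values) for outcome, values in groups.items() if values}


# With traceability, the parameters can be compared against quality.
with_links = mean_temperature_by_outcome(
    link_quality(SENSOR_READINGS, QUALITY_RESULTS, TRACE_LINKS)
)

# Without traceability, nothing can be linked, however many sensors there are.
without_links = mean_temperature_by_outcome(
    link_quality(SENSOR_READINGS, QUALITY_RESULTS, {})
)
```

With an empty mapping, `link_quality` returns nothing at all. The readings still contain patterns, but none of them can be related to whether a casting was good or bad.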
This work is Copyright of Alen Gojceta. You are not allowed to use the article, or any of its part in commercial or academic work without citing the author and this link.