What's The Definition Of Big Data? Who Cares?
It has been entertaining to watch so many people argue over how to define big data. There is always another nuance that can be suggested. There is always another potential exception to any rule that is offered. In the end, I don’t think the energy being put into these discussions has much tangible value from a business perspective; it is largely an academic exercise. Let’s explore why.
The goal of analytics is to leverage data to make better business decisions. It is all about business value. Identifying data as “big” or not doesn’t add any business value. What organizations need to worry about is very simple: Is there a data source that isn’t currently being collected that has high potential value? If so, then it needs to be collected and analyzed. That’s all a business person should worry about. They need not care whether it is big, small, or something in between.
Let’s imagine a scenario where business and IT people come together in a large conference room to discuss a new data source. As part of the conversation, they reach an agreement that the new data source should (or should not) be considered big data. What has that done to help them move the ball forward? Nothing. What moves the ball forward is the business team agreeing that the new data is useful and worth analyzing. What moves the ball forward is the IT team deciding how best to make the data available based on the characteristics of the data. Progress is made with a focus on putting the data to work, not on semantics.
With that said, once I’ve decided that a data source is important, the characteristics of that data source can impact how I go about acquiring it and feeding it into my analytic processes. If the data is unusually big and/or unstructured, for example, I may need to leverage some techniques commonly associated with big data. However, that is a technical implementation consideration. The big decision as to whether the data is valuable enough to collect has nothing to do with what definitional bucket we might place the data source in.
Another common error is equating big data with the use of certain tools or techniques. However, the tools and techniques often apply more broadly than just for big data. For example, if I want to do sentiment analysis against all the social media commentary for a global organization, I may have quite a lot of data to deal with. I’ll also need some complex text analysis tools and sentiment algorithms. Now let’s assume I want to do a sentiment analysis on 10 comments about me personally. Guess what? I need the exact same text analysis tools and sentiment algorithms. I just don’t need them to scale to the same extent.
The point above leads to a broader one: much of what is being associated with “big data” is actually a function of “different data”. Text data requires different tools and techniques. Semi-structured data requires different handling than traditional structured data. However, these data types require different handling whether the volume is big or small.
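To make the sentiment example concrete, here is a minimal sketch in Python. It is a toy, not a real sentiment engine: the word lists and the function names `score_comment` and `score_all` are assumptions invented purely for illustration. The point is only that the same scoring logic runs unchanged on 10 comments or on a stream of millions; what changes is how the comments are fed in.

```python
from typing import Iterable, Iterator

# Hypothetical toy lexicon, assumed for illustration only.
# A real project would use a proper sentiment model or lexicon.
POSITIVE = {"great", "love", "helpful", "excellent"}
NEGATIVE = {"bad", "hate", "slow", "terrible"}


def score_comment(text: str) -> int:
    """Return (# positive words) - (# negative words) for one comment."""
    words = (w.strip(".,!?;:").lower() for w in text.split())
    return sum((w in POSITIVE) - (w in NEGATIVE) for w in words)


def score_all(comments: Iterable[str]) -> Iterator[int]:
    """Score any iterable of comments lazily, one comment at a time.

    The input can be a 10-item list or a generator reading a huge feed;
    the scoring logic is identical either way.
    """
    for comment in comments:
        yield score_comment(comment)


# Small case: a handful of comments held in memory.
few_comments = [
    "I love this, it is great!",
    "Terrible and slow.",
    "It is a product.",
]
small_scores = list(score_all(few_comments))

# "Big" case, simulated: a generator standing in for a large feed.
large_feed = (few_comments[i % 3] for i in range(30_000))
total = sum(score_all(large_feed))
```

Scaling the second case to real volumes is a technical implementation matter (distributing the work, storing the results). The analysis itself, and the decision that the comments are worth analyzing, is the same in both cases.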
For those responsible for the technical implementation of big data, the exercise of understanding what makes it different and how it might be defined does have some value. I am not suggesting that all efforts in this area are a waste of time. How can you develop a tool or technique to handle data if you don’t understand what it contains? I am simply suggesting that too much emphasis has been put on the topic for audiences, such as business users, who really don’t need to worry about it.
The next time somebody asks you how you define big data or whether a certain data source should be considered big data, consider how you answer. Do you really need to have that discussion? Or do you need to change direction and focus the discussion on what the value of the data might be and how it can be leveraged for analysis? I believe you’ll usually make far more progress by going in the latter direction.