It is common to hear about Big Data these days, largely because of the growing recognition of the insights that can be obtained from data. The term is so widespread that it is discussed even in non-specialized media such as magazines and news portals. In 2012, the US Government announced the use of Big Data as a way to strengthen national security and transform teaching and learning.
Even after this popularization, some misconceptions remain. The most common is that the amount of data is the only thing that matters. The following are basic concepts associated with Big Data. I will not present tools; the purpose is to understand the fundamentals so that we can evaluate when to use it.
Types of Data
Data can be classified into:
- Structured Data: Data with a high degree of organization that can be represented in rows and columns and easily sorted and processed by simple algorithms.
- Unstructured Data: Data without an identifiable internal structure, such as PDF files, videos, audio, social media posts, and email. Devices such as sensors, tablets, and cell phones are examples of sources of this type of data. The Internet of Things also tends to contribute considerably to its generation.
A few years ago we had practically only structured data. Today, an estimated 85% of the data produced is unstructured.
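To make the distinction concrete, here is a minimal sketch in Python. The sample table and the social media post are invented for illustration. The structured data can be sorted directly, while the unstructured text needs a structure imposed on it first (here, with naive regular expressions).

```python
import csv
import io
import re

# Structured data: rows and columns, so sorting is a single call.
# The table below is hypothetical sample data.
table = io.StringIO("customer,amount\nana,120\nbruno,80\ncarla,200\n")
rows = list(csv.DictReader(table))
by_amount = sorted(rows, key=lambda r: int(r["amount"]), reverse=True)

# Unstructured data: free text with no predefined fields.
# Before any analysis we have to extract something usable from it.
post = "Loved the new phone! Paid 799 dollars, battery lasts 2 days. #happy"
hashtags = re.findall(r"#(\w+)", post)
numbers = [int(n) for n in re.findall(r"\d+", post)]

print(by_amount[0]["customer"])
print(hashtags, numbers)
```

Note that the regular expressions recover the numbers but not what they mean (a price, a duration), which is exactly what makes unstructured data harder to process.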
What Defines Big Data
The term Big Data was coined to describe the increase in data generated by society and collected by organizations. It has been applied successfully in areas such as politics, sales, search tools, and even sports: it has been argued that Germany’s secret to winning the 2014 World Cup was the use of Big Data tools. The first feature that comes to mind when we talk about Big Data is a large amount of data. However, the concept is based on three dimensions: Volume, Variety, and Velocity (speed), known as the 3 V’s.
The TechAmerica Foundation defines Big Data as:
Big data is a term that describes large volumes of high velocity, complex and variable data that require advanced techniques and technologies to enable the capture, storage, distribution, management, and analysis of the information. (TechAmerica Foundation’s Federal Big Data Commission, 2012)
- Volume corresponds to the magnitude of the data, which may vary depending on the frequency and type of data being recorded. In addition, what is considered big today may change in the near future as storage capacity increases.
- Variety refers to the heterogeneity of the data in a repository, which may mix structured and unstructured formats.
- Velocity (speed) refers to the rate at which data is generated and the speed at which it must be analyzed and presented. Digital devices, such as smartphones and sensors, are responsible for a high growth rate of data creation that requires real-time analysis. Traditional data management systems cannot handle this huge amount of data arriving at once. This is where Big Data technologies come in, enabling companies to build real-time intelligence.
In addition to the 3 V’s, other dimensions have been cited:
- Veracity: Covers cases where the reliability of the data is not guaranteed, such as social media messages. Dealing with uncertain data is another task for Big Data, using tools and analysis to manage and mine it.
- Variability: The variation in the rate at which data arrives, with periods of high and low flow.
- Value: The data by itself adds nothing. The value is obtained by analyzing a large volume of data.
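Velocity and variability can be illustrated with a few lines of Python. This is only a sketch: the arrival times are made-up values, and `rate_per_second` is a hypothetical helper written for this example. It counts events in one-second buckets and compares the peak rate with the average.

```python
from collections import Counter


def rate_per_second(timestamps):
    """Count how many events fall into each one-second bucket."""
    return Counter(int(t) for t in timestamps)


# Hypothetical arrival times in seconds: a burst during second 1,
# nothing during second 2, and a single event in second 3.
arrivals = [0.1, 1.0, 1.1, 1.2, 1.5, 1.9, 3.4]

rates = rate_per_second(arrivals)
peak = max(rates.values())              # highest rate observed (velocity)
duration = int(max(arrivals)) + 1       # number of one-second buckets covered
average = len(arrivals) / duration      # mean rate over the whole period

print(f"peak={peak}/s average={average:.2f}/s")
```

A system sized only for the average rate would fall behind during the burst, which is why variability is treated as a dimension of its own.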
Data Life Cycle
The potential of Big Data is only harnessed when used to drive decision making. Therefore, in addition to the data, an efficient process is necessary to obtain significant insights from a large volume of diverse and dynamic data.
The process of extracting information from Big Data can be divided into the following phases:
- Acquisition: Data sources generate a great amount of information, much of it useless. The great challenge of this step is to apply filters that discard the useless information without losing what is relevant. These filters should be applied in real time, as it would be very costly to store all of this data only to delete it later.
- Extraction: The data acquired and filtered is normally not ready to be read. As previously mentioned, it exists in several formats: audio, video, and text, among others. This requires an extraction strategy that integrates data originating from different repositories into a format that can be consumed. Extract-Transform-Load (ETL) is the process that covers the stages of collecting data, adjusting it to the appropriate format, and storing it.
- Analysis: Technological advances are making it possible and cost-effective to analyze unstructured data. Distributed computing on easily scalable architectures, frameworks for massive processing of non-relational data, and parallelism in relational databases are redefining data governance and management.
- Interpretation: The most important aspect in the success of a Big Data system is the presentation of the data in a smart, friendly and reusable format.
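The phases above can be sketched as a small Python pipeline. This is an illustration under stated assumptions, not a real Big Data implementation: the sensor readings, the two input formats, and all function names (`acquire`, `extract`, `analyze`, `interpret`, `is_relevant`) are hypothetical.

```python
import json
from collections import defaultdict


def is_relevant(raw):
    """Hypothetical filter: drop empty lines and heartbeat messages."""
    return raw not in ("", "heartbeat")


def acquire(stream, keep):
    """Acquisition: filter records as they arrive instead of storing everything."""
    for raw in stream:
        if keep(raw):
            yield raw


def extract(raw):
    """Extraction: map inputs in different formats onto one common format."""
    if raw.startswith("{"):                  # JSON source
        rec = json.loads(raw)
        return {"sensor": rec["id"], "value": float(rec["temp"])}
    sensor, value = raw.split(",")           # comma-separated source
    return {"sensor": sensor.strip(), "value": float(value)}


def analyze(records):
    """Analysis: compute the average value per sensor."""
    totals = defaultdict(lambda: [0.0, 0])
    for r in records:
        totals[r["sensor"]][0] += r["value"]
        totals[r["sensor"]][1] += 1
    return {s: total / n for s, (total, n) in totals.items()}


def interpret(averages):
    """Interpretation: present the result in a readable form."""
    return [f"{s}: {avg:.1f}" for s, avg in sorted(averages.items())]


stream = ['{"id": "a", "temp": 20.0}', "heartbeat", "b, 30", "a, 22", ""]
records = (extract(raw) for raw in acquire(stream, is_relevant))
report = interpret(analyze(records))
print(report)
```

Because `acquire` is a generator, irrelevant records are discarded as they arrive and are never stored, which mirrors the real-time filtering described in the acquisition phase.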
At the moment, most of the solutions operate only with structured data, so, considering the 3 V’s, the adoption of Big Data is not yet justified. However, the use of social networks is growing rapidly, and in the future it will be possible to capture and analyze data from these sources. Even if we do not adopt this solution, we consider it relevant to present the knowledge acquired and to clarify that Big Data should not be associated only with data volume. When thinking about adopting Big Data, it is important to remember at least the 3 V’s: volume, velocity (speed), and variety.
How about you? Share some of your experience with Big Data in the comments below!
Contributions are always welcome!