Bilkent University – At the “Big Data in Business Life” conference held by the TUSIAD Information Society Forum, we looked at big data from a different angle through the stimulating speech of Open Insight President and CEO Dr. Usama Fayyad.
What is Big Data?
Dr. Fayyad explains it as follows: “Big Data is a new generation of data technology that promises to make it easier to address the issues rapidly and affordably. Today, expectations and reality do not fit together in business.” Big data is actually a mixture of structured, semi-structured and unstructured data. The messiness in the data is fixed by data scientists’ creative solutions. He describes the role this way: “a data scientist is a person who knows a lot more software engineering than a statistician and a lot more statistics than a software engineer”. Data scientist is a critical, very highly paid position in Silicon Valley, and such people are rare.
What are the critical factors of Big Data?
They are called the critical 3 V’s of Big Data: volume, velocity and variety. When we hear the term “Big Data”, the first thing that comes to mind is volume. Volume is important, but big data cannot be defined by volume alone. Velocity, the rate of data arrival, imposes real-time constraints. Variety is another critical factor because the data is mostly unstructured or semi-structured, and there is no standard formula for interpreting it.
A huge amount of data has no value if you cannot formulate it and extract meaning from it.
Data has to be accessible, have a single definition, be fit for its defined purpose and be traceable back to its source. Only then can it carry meaning and generate value.
What is a Data Lake, Hadoop?
A data lake is a method of storing data within a system that facilitates the collocation of data in various schemata and structural forms, usually object blobs or files. Hadoop, Azure Storage and the Amazon S3 platform can be used to build data lake repositories. (https://en.wikipedia.org/wiki/Data_lake)
And who has the nastiest data lake in the world on the variety scale? His answer: “Amazon”.
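The defining property above, collocating raw objects of any schema and imposing structure only when the data is read, can be sketched in a few lines. This is a minimal, illustrative in-memory sketch, not how any real data lake product is implemented; the class and key names are invented for the example.

```python
import json


class DataLakeSketch:
    """Toy data lake: raw objects of any format stored as blobs under
    string keys, with no schema enforced at write time (schema-on-read).
    Real lakes use stores like HDFS, Azure Storage, or Amazon S3."""

    def __init__(self):
        self._blobs = {}  # key -> raw bytes, stored as-is

    def put(self, key: str, data: bytes) -> None:
        self._blobs[key] = data  # no validation: CSV, JSON, binary all welcome

    def get(self, key: str) -> bytes:
        return self._blobs[key]


lake = DataLakeSketch()
# Heterogeneous formats land side by side in the same store.
lake.put("logs/2016-01-01.csv", b"ts,event\n1,login\n")
lake.put("profiles/u42.json", json.dumps({"id": 42}).encode())

# Structure is imposed only on read, by whoever consumes the blob.
csv_rows = lake.get("logs/2016-01-01.csv").decode().splitlines()
profile = json.loads(lake.get("profiles/u42.json"))
```

The point of the sketch is the contrast with a data warehouse: nothing about the `put` path knows or cares what format the bytes are in.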
Hadoop is an open-source software framework for distributed storage and distributed processing of very large data sets on computer clusters built from commodity hardware. All the modules in Hadoop are designed with a fundamental assumption that hardware failures are common and should be automatically handled by the framework.
(https://en.wikipedia.org/wiki/Apache_Hadoop)
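Hadoop's distributed processing model is MapReduce: mappers emit key-value pairs from input splits, the framework shuffles them by key, and reducers aggregate each group. A toy single-machine word count in that style (illustrative only; Hadoop runs these phases across a cluster and re-schedules tasks when hardware fails):

```python
from collections import defaultdict
from itertools import chain


def map_phase(document: str):
    # Each mapper emits (word, 1) pairs for its input split.
    return [(word, 1) for word in document.split()]


def shuffle(pairs):
    # The framework groups values by key between map and reduce.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    # Each reducer aggregates the values for one key.
    return {word: sum(counts) for word, counts in grouped.items()}


# Two "splits", as if the input file were partitioned across nodes.
splits = ["big data big value", "data lake"]
pairs = chain.from_iterable(map_phase(s) for s in splits)
counts = reduce_phase(shuffle(pairs))
# counts == {"big": 2, "data": 2, "value": 1, "lake": 1}
```

Because each map and reduce call touches only its own slice of the data, the same program parallelizes naturally over commodity hardware, which is what makes the framework's fault-tolerance assumption workable.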
Shall we use Hadoop, then?
Dr. Fayyad’s answer: “the cost of premium storage is $20k to $50k per terabyte, and it drops to $1k with Hadoop. Hadoop should be preferred as the new data landscape.”
According to GE, its data lake system made the analysis of 3 million flights – 340 TB worth of data – 2,000 times faster and 10 times cheaper than traditional analytical processes would have.
So, what are the real issues today?
Dr. Fayyad lists the real issues as:
- data management and governance
- the shortage of data scientists, who are rare but critical
- the fact that data and analytics are specialisms that need know-how, not fields for generalists
Data gets bigger each day while the time available gets shorter. The challenge is to extract value from this data quickly. Today, many companies across sectors that do not want to fall behind in their businesses are focusing on digital transformation. Even in traditional, “safe” sectors like utilities, we hear about the necessity of digital transformation.
I believe we will keep talking about these topics a lot.
Author: Irem Sokullu