The market capitalisation of many web-era companies, such as Google and Facebook, exceeds that of traditional companies. The user base of some of these companies exceeds the population of many countries; Facebook, for instance, has over two billion users. The data that these companies accumulate fuels their growth and exceeds the capabilities of traditional systems such as relational databases.
These companies use non-traditional system architectures and methods, such as NoSQL and Big Data, to manage their vast data sets. They need distributed and scalable ways to manage, visualise and make sense of their data.
The term Big Data describes a data set so large that traditional systems cannot cope. Consider the problems of data movement, storage, query, visualisation, configuration and search, to name a few. In data sets this large it is difficult to manage the incoming data streams and ensure their quality.
New approaches are needed to tackle these problems, and the Apache tool set provides at least some of the answers. It offers a distributed and fault-tolerant set of systems such as Hadoop, Spark and Mesos.
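The core idea behind Hadoop, the MapReduce model, can be sketched in miniature in plain Python. This is a single-process illustration only; a real cluster distributes the map and reduce phases across many machines and moves data between them in the shuffle step:

```python
from collections import defaultdict

def map_phase(documents):
    # Map: emit a (word, 1) pair for every word in every document.
    for doc in documents:
        for word in doc.split():
            yield (word.lower(), 1)

def shuffle_phase(pairs):
    # Shuffle: group the emitted counts by key (the word).
    groups = defaultdict(list)
    for word, count in pairs:
        groups[word].append(count)
    return groups

def reduce_phase(groups):
    # Reduce: sum the counts for each word.
    return {word: sum(counts) for word, counts in groups.items()}

docs = ["big data is big", "data moves fast"]
counts = reduce_phase(shuffle_phase(map_phase(docs)))
```

The value of the model is that the map and reduce functions are independent per key, so the framework can run them in parallel over partitions of an arbitrarily large data set.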
A Big Data system can be characterised by attributes that describe its data, for instance:
Velocity describes the rate at which data arrives at a Big Data system. It is often associated with ETL processes and tools.
Variety describes the range of data types, both structured and unstructured, that might be imported into a Big Data system.
Volume describes the size of the data set within a Big Data system, which might be expected to start in the high terabyte (TB) range and increase to the petabyte (PB) range and beyond.
Veracity describes the quality of the data; as Velocity, Volume and Variety increase, Veracity may decrease due, for instance, to data quality issues.
Value indicates that the data stored in a Big Data system should be evaluated to ensure that it has worth. This can be difficult, as value might only become apparent at a future date.
The image below shows some of the functions that might be considered when designing a Big Data system.
For instance, ETL (Extract, Transform and Load) tools like MuleESB might be needed to move data from a variety of sources to and from the system. Big Data storage and database systems like HDFS and Hive might be needed to store the data. A system like Kafka might be needed to offer distributed queueing. Many forms of cluster control might be considered, such as Yarn, Mesos, Spark or DCOS. See our functions page for more information.
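Independent of any particular tool, the ETL pattern itself is simple to sketch. The record layout and field names below are invented for illustration; a real pipeline would extract from external feeds and load into a system such as HDFS or Hive rather than an in-memory dictionary:

```python
import csv
import io

# Extract: read raw records from a source.
# An in-memory CSV string stands in for a real external feed.
raw = "id,amount\n1,10.5\n2,bad\n3,7.0\n"
rows = list(csv.DictReader(io.StringIO(raw)))

# Transform: clean and convert, discarding records that fail validation.
def transform(row):
    try:
        return {"id": int(row["id"]), "amount": float(row["amount"])}
    except ValueError:
        return None  # a veracity check: drop malformed records

clean = [r for r in (transform(row) for row in rows) if r is not None]

# Load: write the cleaned records into the target store
# (here a dict keyed by id stands in for the destination system).
store = {r["id"]: r["amount"] for r in clean}
```

Note how the transform step doubles as a quality gate: the malformed second record is dropped rather than loaded, which is one way pipelines defend Veracity as Velocity and Volume grow.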