The aim of this page is to examine the attributes of a Big Data system. The five V's were shown on the previous diagram, but they will not be described here because they were covered on a previous page.
A Big Data system needs to be able to scale to meet its processing needs. This might imply manual or dynamic scaling, both to increase and to decrease cluster size, and might involve adding machines or slave nodes to the cluster. It might also involve re-organising data across the cluster as it grows.
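As a minimal sketch of dynamic scaling, the Spark configuration below lets the framework grow and shrink the number of executors with the workload. It assumes a YARN cluster with the external shuffle service available, and the executor bounds are illustrative values only, not a recommendation.

    import org.apache.spark.sql.SparkSession

    // A sketch only: enable dynamic executor allocation so the job grows and
    // shrinks with the workload. The executor bounds are illustrative values.
    val spark = SparkSession.builder()
      .appName("ElasticJob")
      .config("spark.dynamicAllocation.enabled", "true")
      .config("spark.dynamicAllocation.minExecutors", "2")
      .config("spark.dynamicAllocation.maxExecutors", "50")
      .config("spark.shuffle.service.enabled", "true") // needed by dynamic allocation on YARN
      .getOrCreate()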
A Big Data system needs to be distributed in order to spread both data and functionality across a cluster and so speed up processing.
IoT (Internet of Things) and real-time requirements need to be considered when building a Big Data system. For instance, when designing data flows, are there any timing requirements between data source and destination? Would a system based on Hadoop / YARN be fast enough to meet real-time needs?
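As a rough illustration of the latency question, the Spark Streaming sketch below (the host name and port stand in for a hypothetical IoT feed) processes incoming data in two-second micro-batches, so end-to-end latency is at best a few seconds; good enough for near real time, but not for hard millisecond deadlines.

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    // Spark Streaming works in micro-batches, here every 2 seconds, so
    // end-to-end latency is seconds rather than milliseconds.
    val conf = new SparkConf().setAppName("SensorFeed")
    val ssc = new StreamingContext(conf, Seconds(2))

    // Hypothetical socket source standing in for an IoT feed.
    val readings = ssc.socketTextStream("sensor-gateway", 9999)
    readings.count().print()

    ssc.start()
    ssc.awaitTermination()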
If you need to process graphs at Big Data scale, you might consider using Spark GraphX or a TinkerPop-based database such as OrientDB or Titan.
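As a minimal GraphX sketch, the code below builds a tiny graph of users and "follows" edges (the vertices and edges are made up purely for illustration) and then runs PageRank over it.

    import org.apache.spark.graphx.{Edge, Graph}
    import org.apache.spark.rdd.RDD
    import org.apache.spark.sql.SparkSession

    // Build a tiny, made-up graph of users and "follows" edges, then rank the vertices.
    val spark = SparkSession.builder().appName("GraphSketch").getOrCreate()
    val sc = spark.sparkContext

    val vertices: RDD[(Long, String)] =
      sc.parallelize(Seq((1L, "alice"), (2L, "bob"), (3L, "carol")))
    val edges: RDD[Edge[String]] =
      sc.parallelize(Seq(Edge(1L, 2L, "follows"), Edge(2L, 3L, "follows"), Edge(3L, 1L, "follows")))

    val graph = Graph(vertices, edges)
    // Run PageRank until the scores change by less than the given tolerance.
    graph.pageRank(0.001).vertices.collect().foreach(println)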
You might consider using Spark's machine learning library, MLlib, which contains a range of ML functions to help process your data. If you build Spark yourself you can also extend this functionality.
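As a minimal MLlib sketch, the code below clusters a handful of made-up two-dimensional points with K-means; the data and parameter values are illustrative only.

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.mllib.clustering.KMeans
    import org.apache.spark.mllib.linalg.Vectors

    // Cluster a handful of made-up two-dimensional points into two groups.
    val sc = new SparkContext(new SparkConf().setAppName("MLlibSketch"))
    val points = sc.parallelize(Seq(
      Vectors.dense(0.0, 0.0), Vectors.dense(0.1, 0.1),
      Vectors.dense(9.0, 9.0), Vectors.dense(9.1, 9.2)))

    val model = KMeans.train(points, 2, 20)   // k = 2 clusters, 20 iterations
    model.clusterCenters.foreach(println)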
Cost is a factor to be considered when building Big Data systems. Will you build in the cloud or on physical clusters? Either choice should be supported by a cost model. Remember that replication and working data mean you will store many times the volume of your raw data set. Remember also, if choosing the cloud, that there will be end-of-life costs when moving off of, or between, cloud providers.
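As a purely illustrative calculation: 10 TB of raw data stored in HDFS with the default replication factor of 3 already occupies 30 TB of disk; add staging, intermediate and working copies on top of that and the real footprint can easily reach five or six times the raw volume.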
Open source Big Data system components need to be backed by a large-scale community to foster support and development and to help the product evolve. Big Data systems supplied by Apache seem to evolve and be released faster than traditional systems, and I think that is due to Apache and its communities.
I mention AI here because artificial intelligence systems require a lot of data to be intelligent. Systems like H2O integrate with Spark, for example via H2O's Sparkling Water layer, to provide AI functionality on a Big Data platform.
Big Data systems need to be portable. At end of life the data might be ported to a new platform, but it would also be useful if the system itself could be ported easily between servers and cloud platforms. This may not be so if cloud services are used; vendor lock-in would not seem to be desirable.
Security and ownership of systems and data need to be considered. Are there rules or laws which inhibit the storage or retention of data? Could any requirement limit access to the data, given that it may be sensitive?
How can analytics be supported in a Big Data system? Zeppelin and Hue provide some answers. Zeppelin offers a collaborative, notebook-based method of accessing Big Data; it integrates with many systems to do this and uses Spark Scala by default. Hue is a data query tool which integrates with Hive, Spark, MapReduce and many other Big Data tools to provide multiple query access methods in a single interface.
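As a minimal sketch of what a Zeppelin paragraph might contain, assuming a Spark 2.x interpreter that exposes a SparkSession named spark, and with a purely illustrative file path and column names:

    %spark
    // A Zeppelin paragraph using the default Spark Scala interpreter.
    // The file path and column names below are purely illustrative.
    val sales = spark.read.json("/data/sales.json")
    sales.createOrReplaceTempView("sales")
    spark.sql("select region, sum(amount) as total from sales group by region").show()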