The aim of this page is to examine some of the Big Data system functions that were not covered on the previous page.
For instance, when building a Big Data system, what scheduler should you use? Might it be YARN in a Hadoop stack (plus cron?), or Marathon and Chronos in a DC/OS stack? Might there be prioritisation and resourcing issues across the job load when using multiple schedulers?
For stream processing, Spark's streaming module might be considered, or the streaming capabilities of Kafka.
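To make the Spark option concrete, the sketch below is a minimal Spark Streaming word count in Scala. The socket source, host and port are placeholder assumptions for local experimentation; a real job would more likely consume from Kafka and run under a cluster manager.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamCount {
  def main(args: Array[String]): Unit = {
    // Local two-thread context for illustration; a real deployment would use a YARN or Mesos master.
    val conf = new SparkConf().setMaster("local[2]").setAppName("StreamCount")
    val ssc  = new StreamingContext(conf, Seconds(10))

    // Read lines from a socket (host and port are hypothetical) and count words per 10-second batch.
    val lines  = ssc.socketTextStream("localhost", 9999)
    val counts = lines.flatMap(_.split("\\s+"))
                      .map(word => (word, 1))
                      .reduceByKey(_ + _)
    counts.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```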
Huge data volumes are of little use without the ability to visualise the data; this is where systems like Zeppelin might be considered.
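As a rough sketch, a Zeppelin notebook paragraph using the Spark interpreter might look like the following. The Parquet path and the event_date column are purely hypothetical, while spark (the SparkSession) and z (the ZeppelinContext) are provided by Zeppelin inside a paragraph.

```scala
// Zeppelin %spark paragraph: load a (hypothetical) Parquet dataset and aggregate it.
val events = spark.read.parquet("/data/events.parquet")

val daily = events.groupBy("event_date")
                  .count()
                  .orderBy("event_date")

// z.show hands the DataFrame to Zeppelin's display system, which renders it as a
// table with built-in bar, line and pie chart options.
z.show(daily)
```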
Large-scale Big Data systems will often run on very large clusters, so cluster monitoring needs to be considered; Datadog is a great tool for the job.
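As a sketch of how application-level metrics might reach Datadog alongside its host-level monitoring, the snippet below pushes a custom counter to a local DogStatsD agent over UDP. The metric name, tags, and the assumption that an agent is listening on 127.0.0.1:8125 (the default DogStatsD port) are all illustrative.

```scala
import java.net.{DatagramPacket, DatagramSocket, InetAddress}

object DatadogMetric {
  // DogStatsD plain-text format: "metric.name:value|type|#tag:value,..."
  def sendCounter(name: String, value: Long, tags: Seq[String]): Unit = {
    val payload = s"$name:$value|c|#${tags.mkString(",")}"
    val bytes   = payload.getBytes("UTF-8")
    val socket  = new DatagramSocket()
    try {
      // Assumes a Datadog agent with DogStatsD enabled on the default port 8125.
      val packet = new DatagramPacket(bytes, bytes.length,
                                      InetAddress.getByName("127.0.0.1"), 8125)
      socket.send(packet)
    } finally socket.close()
  }

  def main(args: Array[String]): Unit =
    sendCounter("etl.jobs.completed", 1, Seq("env:dev", "cluster:test"))
}
```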
Apache Spark might be considered as a general-purpose processing engine (along with Scala?). I use this combination for much of my processing.
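A minimal batch job in that combination might look like the sketch below; the CSV path and the customer and amount columns are assumptions purely for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object BatchJob {
  def main(args: Array[String]): Unit = {
    // Local master for illustration; on a cluster the master is normally set by spark-submit.
    val spark = SparkSession.builder()
      .appName("BatchJob")
      .master("local[*]")
      .getOrCreate()

    // Hypothetical CSV input with "customer" and "amount" columns.
    val orders = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("/data/orders.csv")

    // Total spend per customer, largest first.
    val totals = orders.groupBy("customer")
                       .agg(sum("amount").as("total_spend"))
                       .orderBy(desc("total_spend"))

    totals.show(20)
    spark.stop()
  }
}
```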
How will you handle release management for your Big Data system? It seems obvious that you will have multiple platforms for development, UAT and production, but systems like Apache Brooklyn offer the possibility to model, release and monitor the components that you wish to use.
Finally, how will you test the system that you have created, in terms of both data validity and overall system function?
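One possible approach to the data-validity side, sketched below with ScalaTest and Spark, is to assert basic rules over the data a pipeline produces. The sample rows, column names and rules here are hypothetical stand-ins for whatever your own jobs actually guarantee; in practice the DataFrame would be read from the data lake rather than built in the test.

```scala
import org.apache.spark.sql.SparkSession
import org.scalatest.funsuite.AnyFunSuite

class DataValiditySuite extends AnyFunSuite {

  private val spark = SparkSession.builder()
    .appName("DataValiditySuite")
    .master("local[*]")
    .getOrCreate()

  import spark.implicits._

  // Hypothetical output of an upstream job, built in-line for the sake of the example.
  private val orders = Seq(
    ("c1", 10.5), ("c2", 99.0), ("c3", 0.5)
  ).toDF("customer", "amount")

  test("customer keys are present and unique") {
    assert(orders.filter($"customer".isNull).count() === 0)
    assert(orders.select("customer").distinct().count() === orders.count())
  }

  test("amounts are positive") {
    assert(orders.filter($"amount" <= 0).count() === 0)
  }
}
```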