The ability to manually scale, utilize initialization actions, and leverage preemptible virtual machines means Cloud Dataproc clusters can be tailored to individual needs. Easily create and scale clusters to run native Apache Spark and Apache Hadoop workloads.
With a low price and minute-by-minute billing, customers no longer need to worry about the economics of running a persistent Spark or Hadoop cluster to unlock the benefits of fast, efficient, and reliable data processing.
Integration with Google Cloud Platform provides immense scalability, ease of use, and multiple channels for cluster interaction and management. Seamless integration with other Google Cloud products means data is more accessible, operations are easy to monitor, and scaling is a non-issue.
A customer processes 50 gigabytes of text log data per day to produce aggregated metrics. They have used a persistent on-premises cluster to store the logs and process them with MapReduce.
How Dataproc addresses this need
Google Cloud Storage can serve as a landing zone for the log data, providing low-cost, high-durability storage. A Dataproc cluster can be created in under two minutes to process the data with the customer's existing MapReduce jobs. Once processing is finished, the cluster can be deleted immediately.
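This create-process-delete cycle can be sketched with the gcloud CLI. The cluster name, region, bucket, and jar paths below are illustrative placeholders, not values from the scenario:

```shell
# Create a short-lived Dataproc cluster (name, region, and sizing are illustrative).
gcloud dataproc clusters create log-processing \
    --region=us-central1 \
    --num-workers=2

# Submit the existing MapReduce job against logs staged in Cloud Storage
# (bucket and jar paths are hypothetical).
gcloud dataproc jobs submit hadoop \
    --cluster=log-processing \
    --region=us-central1 \
    --jar=gs://example-bucket/jobs/log-aggregator.jar \
    -- gs://example-bucket/logs/ gs://example-bucket/output/

# Delete the cluster as soon as the job finishes, so billing stops.
gcloud dataproc clusters delete log-processing \
    --region=us-central1 --quiet
```

Because the input and output both live in Cloud Storage rather than on cluster-local HDFS, nothing is lost when the cluster is deleted, and the same job can be rerun on a fresh cluster the next day.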
Instead of running all the time and incurring costs even when idle, the Dataproc cluster runs only while processing the logs, which saves money and reduces complexity.