Enterprises have a new problem: they are on the receiving end of never-ending streams of data. Staying on top of this deluge is not easy. The exponential rise in information has left even the best data scientists flummoxed. The IoT revolution is exciting for sales and marketing teams, but it portends a daunting future for data scientists.
The Challenge They Face: Processing and analysing terabytes of streaming information every minute to generate meaningful insights.
What Is Cloud Dataflow?
Dataflow is a fully managed, no-ops service from Google that aims to make data processing and analytics easy and accessible to everyone. Cloud Dataflow complements the rest of the Google Cloud Platform and works particularly well with BigQuery.
Timeline Of Development
Google’s first major release in the data processing arena was the Google File System (GFS). A series of improvements and new algorithms followed, and its latest offering in this arena is Cloud Dataflow. With Dataflow, Google has addressed issues with its MapReduce service. Dataflow borrows from and builds upon key elements of earlier products such as FlumeJava and MillWheel. Here is a timeline of Google’s data processing efforts so far:
What Types Of Data It Accepts
- In batch mode, it accepts data from databases or file systems.
- In Pub/Sub mode, it accepts real-time streamed data from Google Cloud Pub/Sub middleware feeds.
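The difference between the two input modes can be illustrated with a small plain-Python sketch. It does not use the Dataflow SDK: a list stands in for a bounded batch source, a generator stands in for a Pub/Sub-style feed, and the helper names (`count_events`, `fixed_windows`, `event_feed`) are hypothetical. The point is only that the same counting logic can serve both modes once the stream is cut into windows.

```python
from collections import Counter
from itertools import islice
from typing import Iterable, Iterator, List


def count_events(events: Iterable[str]) -> Counter:
    """Shared logic: count how often each event name occurs."""
    return Counter(events)


def fixed_windows(stream: Iterable[str], size: int) -> Iterator[List[str]]:
    """Cut a (possibly endless) stream into consecutive windows of `size` items."""
    iterator = iter(stream)
    while True:
        window = list(islice(iterator, size))
        if not window:
            return
        yield window


def event_feed() -> Iterator[str]:
    """Hypothetical stand-in for a real-time message feed (no real client used)."""
    for event in ["view", "view", "click", "view", "purchase"]:
        yield event


# Batch mode: a bounded data set, e.g. rows read from a file or database.
batch_input = ["click", "view", "click", "purchase"]
batch_result = count_events(batch_input)

# Streaming mode: the same logic, applied window by window as events arrive.
stream_results = [count_events(w) for w in fixed_windows(event_feed(), size=2)]

print(batch_result)
print(stream_results)
```

In a real service, windows would more likely be defined by time than by item count; the fixed-size window here is an assumption made to keep the sketch short.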
Languages It Supports
The initial limited release supported only Java, but efforts are under way to add more languages. Work on Python support is proceeding quickly: an alpha Python SDK is now available alongside the Java SDK, but only for batch execution (it does not support live-stream inputs).
How To Know If You Need Google Cloud Dataflow?
If your processing requirement can be expressed as an SQL query, then Dataflow is not for you. Go straight to BigQuery instead: load your data and run the query to get the required result.
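As a rough illustration of a requirement that fits in a single SQL query, the sketch below totals sales per region. It uses Python's built-in `sqlite3` module purely as a local stand-in for BigQuery; the `sales` table and its columns are made up for the example, and no BigQuery API is called.

```python
import sqlite3

# In-memory SQLite database, used only as a local stand-in for BigQuery.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales (region, amount) VALUES (?, ?)",
    [("north", 120.0), ("south", 80.0), ("north", 30.0)],
)

# The whole requirement is one declarative query: total sales per region.
query = """
    SELECT region, SUM(amount) AS total
    FROM sales
    GROUP BY region
    ORDER BY region
"""
totals = conn.execute(query).fetchall()
conn.close()

print(totals)
```

When the logic cannot be written this way, for example because it needs custom parsing or per-event stateful handling, a pipeline-based service is the more natural fit.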
How It Accomplishes Data Processing
Google Cloud Dataflow uses abstractions that decouple application code from the implementation of storage engines and runtime environments, which makes big data analytics easier. The idea behind a fully managed service is to let developers focus on writing the code and leave the provisioning and management of computing resources to the Dataflow service. This high level of abstraction lets data scientists think in terms of what should happen to the data rather than how it is executed.
Cloud Dataflow includes a collection of SDKs that let developers build batch or stream-based data processing pipelines quickly. The Dataflow service then generates the appropriate ‘execution graph’ to run these parallel data processing pipelines and produce the desired output.
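The separation between defining a pipeline and executing it can be sketched in plain Python. This is a toy model, not the Dataflow SDK: the `Pipeline` and `LocalRunner` classes and their methods are hypothetical names chosen for illustration. The pipeline only records its steps as a simple linear graph, and a separate runner decides how to execute them.

```python
from typing import Any, Callable, Iterable, List, Tuple

Transform = Callable[[Iterable[Any]], Iterable[Any]]
Step = Tuple[str, Transform]


class Pipeline:
    """Records transforms as an ordered graph; nothing runs at definition time."""

    def __init__(self) -> None:
        self.steps: List[Step] = []

    def apply(self, name: str, fn: Transform) -> "Pipeline":
        self.steps.append((name, fn))
        return self  # returning self allows chained definitions

    def graph(self) -> str:
        """Describe the recorded steps in execution order."""
        return " -> ".join(name for name, _ in self.steps)


class LocalRunner:
    """Executes the recorded steps in order, inside the current process."""

    def run(self, pipeline: Pipeline, source: Iterable[Any]) -> List[Any]:
        data: Iterable[Any] = source
        for _, fn in pipeline.steps:
            data = fn(data)
        return list(data)


# Definition: describes what should happen, without touching any data.
pipeline = (
    Pipeline()
    .apply("ParseInt", lambda rows: (int(r) for r in rows))
    .apply("KeepEven", lambda nums: (n for n in nums if n % 2 == 0))
    .apply("Square", lambda nums: (n * n for n in nums))
)

print(pipeline.graph())

# Execution: a runner turns the recorded graph into actual work.
result = LocalRunner().run(pipeline, ["1", "2", "3", "4"])
print(result)
```

Because the definition never refers to the runner, a different runner (for example one that distributes the steps across machines) could execute the same pipeline object unchanged, which is the idea behind letting the service handle parallelization.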
Applications For Enterprises
1. To generate ‘Real-time Business Insights’
Real-time stream analysis forms the basis of the recommendation engines of several e-tailers (both product and service retailers). Sites like Pinterest use real-time analytics to serve users related pins.
2. Allows better ‘Database Management’
Several companies that capture terabytes of data daily have to convert their unstructured data into structured data to prepare it for further analysis. They do so by running ETL (extract, transform, load) processes using services like Dataflow.
3. Use it for pre-processing in Machine Learning
Dataflow is integrated with Google’s machine learning platform. You can create machine learning applications that work on your enterprise data using Google’s TensorFlow framework. You can also retrieve data from Google Cloud Storage and Google BigQuery and pre-process it in Dataflow before using it for machine learning.
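The ETL step mentioned under ‘Database Management’ above, turning unstructured input into structured records, can be sketched in plain Python. The log format, the field names, and the `extract`/`transform`/`load` helpers are all assumptions made for this example; a production job would read from and write to real storage rather than in-memory lists.

```python
import re
from typing import Dict, Iterable, List, Optional

# Hypothetical raw log format: "<ISO date> <user id> <action> <amount>"
LINE_PATTERN = re.compile(
    r"^(?P<date>\d{4}-\d{2}-\d{2})\s+(?P<user>\w+)\s+"
    r"(?P<action>\w+)\s+(?P<amount>\d+(?:\.\d+)?)$"
)


def extract(raw_lines: Iterable[str]) -> Iterable[str]:
    """Extract: strip whitespace and skip blank lines."""
    return (line.strip() for line in raw_lines if line.strip())


def transform(line: str) -> Optional[Dict[str, object]]:
    """Transform: parse one line into a typed record, or None if malformed."""
    match = LINE_PATTERN.match(line)
    if match is None:
        return None
    record: Dict[str, object] = dict(match.groupdict())
    record["amount"] = float(match.group("amount"))
    return record


def load(records: Iterable[Optional[Dict[str, object]]]) -> List[Dict[str, object]]:
    """Load: keep valid records; a real job would write them to a table."""
    return [r for r in records if r is not None]


raw = [
    "2016-02-01 u42 purchase 19.99",
    "",
    "corrupted line without structure",
    "2016-02-01 u7 refund 5",
]
rows = load(transform(line) for line in extract(raw))
print(rows)
```

Malformed lines are dropped silently here to keep the sketch short; a real pipeline would more likely route them to a separate output for inspection.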
What Type Of Enterprises May Use It
Enterprises with private clouds might stick with Hadoop for now. However, companies focused on IoT and wearables will need to analyse large data streams in real time, and Dataflow might be their first choice because of its data-centric model. The wider Google Cloud Platform suite is also attractive for enterprises with data-intensive services, given Google’s strength in data management.
Competitors in Cloud Data Processing Segment
Google Cloud Platform’s main competitors, Amazon and Microsoft, also offer stream processing services. Google’s USP is that Dataflow provides a unified programming model that caters to both batch and stream processing needs. With the launch of Dataflow, Google is fast closing the gap with its rival and the market leader, Amazon’s cloud platform services.
Advantages Of Using Google Cloud Dataflow
1. Easy Development Toolkit: It offers the right level of abstraction for data scientists. The Dataflow SDK makes developing data pipelines quick and easy using open-source languages, libraries and tools.
2. Focus on Pipeline Definition, Not Runtime Characteristics: It lets programmers concentrate on writing the code, and handles parallelization and scalability for them.
3. Better Use of Network Resources: It generates less network traffic by optimising the data processing pipelines.
4. Unified Model: It offers a unified programming model for batch and stream processing.
5. Integrated with Other Analytics Tools: It supports SQL queries via BigQuery, Google’s cloud analytics service.
6. Runner-Agnostic Coding: Data pipelines created using the Dataflow SDK can be executed by runners in different environments and on different execution engines.
7. Portability: Pipelines can run in multiple execution environments besides Cloud Dataflow, including local machines (provided they can handle the computing requirements).
8. Forget Trade-offs: Google Cloud Dataflow requires no compromises. Here’s how things change with Google Cloud Dataflow:
Future of Cloud Dataflow
In future, Google Cloud Dataflow is poised to extend its reach further by supporting more programming languages. It is likely to become the data processing service of choice if it continues to open up to other execution engines. The acceptance of the Dataflow SDK into the Apache Incubator in February 2016 (as Apache Beam) is a step towards Dataflow becoming a fully evolved data processing service.
A fully evolved service is one that lets users define a single data pipeline for their multiple processing needs without significant trade-offs, and that can execute on a number of runtimes: on-premises, in the cloud, or locally.
As it continues to extend its capabilities, Dataflow will gain wider adoption among enterprises that are currently outside the Google Cloud Platform.