What is Google Cloud Dataprep

This is the introduction to a series about Google Cloud and its products for Big Data / Data & Analytics solutions, starting with the Cloud Dataprep service. And yes, to answer your question, I do plan to cover other cloud vendors and their offerings in the data & analytics area as well.

 

Overview

Cloud Dataprep aims to be an intelligent service for visually exploring, cleaning, and preparing structured and unstructured data for analysis. It was first introduced at Google Cloud Next 2017, and as of the time of writing (7 January 2018) it is in public BETA stage.

(Image taken from: Easily prepare data for analysis with Google Cloud – YouTube video)

This product is a browser-based offering of an integrated partner service from Trifacta, a company that also delivers services for other ecosystems such as AWS and is ranked #1 by analysts in Gartner Peer Insights.

It comes packed with quite a lot of features, among others:

  • Instant data exploration – a point-and-click interface that aims to be visually appealing and easy for users to interact with;
  • Intelligent data cleansing – automated anomaly detection and a guided ETL approach based on usage patterns;
  • Serverless service – no need to manage infrastructure;
  • Support for common data sources – structured or unstructured, ranging from CSV and JSON to relational table formats;
  • Powerful backend – built on top of the Cloud Dataflow service, with an integrated auto-scaling mechanism;
  • Support for BigQuery and Google Cloud Storage.

 

Limitations

As the product is still in BETA, the list of limitations may change in the future. The list below is just a handful of key known limitations selected from the documentation:

  • Sample sizes are limited to 10 MB.
  • Random samples are derived from up to the first 1 GB of the source file.
  • Sharing is not supported.
  • Integrations with data stores other than BigQuery, Google Cloud Storage, and the local filesystem are not supported.
  • Sort transform is not supported.
  • User-defined functions are not supported.
  • Custom dictionaries and custom data types are not supported.
  • User access to administrator functions is not supported.

 

How Much

Please note that the pricing described here is accurate at the time of writing this article; it may change in future releases.

Use of the application itself is free of charge; however, the user will be charged when the created preparation rules are executed as a Cloud Dataprep job (which runs on Google Cloud Dataflow). Exporting them is free.

When the user decides to run them as a Cloud Dataprep job, the execution price is calculated using the following equation:

1.16 * (cost of the Cloud Dataflow job that executed the Cloud Dataprep flow)
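As a quick illustration of this formula, here is a minimal sketch in Python; the $10.00 Dataflow job cost is an assumed figure for the example, not taken from the documentation:

```python
# Illustrative only: assume a Cloud Dataflow job that cost $10.00 to execute.
dataflow_job_cost = 10.00  # USD, hypothetical figure

# Cloud Dataprep bills 1.16x the cost of the underlying Dataflow job.
dataprep_charge = 1.16 * dataflow_job_cost

print(f"Dataprep charge: ${dataprep_charge:.2f}")  # Dataprep charge: $11.60
```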

Example costs of a typical Cloud Dataflow configuration for the Iowa region:

| Dataflow Worker Type | vCPU (per hour) | Memory (per GB per hour) | Local storage – Persistent Disk (per GB per hour) | Local storage – SSD based (per GB per hour) | Dataflow Shuffle (per GB per hour) |
| --- | --- | --- | --- | --- | --- |
| Batch | $0.056 | $0.003557 | $0.000054 | $0.000298 | $0.0216 |
| Streaming | $0.069 | $0.003557 | $0.000054 | $0.000298 | N/A |
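Putting the batch rates from the table together with the 1.16 multiplier, a rough cost estimate can be sketched as below. The worker count, per-worker resources, and runtime are assumptions for illustration only, not official defaults:

```python
# Batch worker rates from the table above (Iowa region)
VCPU_RATE = 0.056        # USD per vCPU per hour
MEMORY_RATE = 0.003557   # USD per GB of memory per hour
PD_RATE = 0.000054       # USD per GB of Persistent Disk per hour
DATAPREP_MULTIPLIER = 1.16  # Dataprep charge on top of the Dataflow cost


def estimate_dataprep_cost(workers, vcpus_per_worker, mem_gb_per_worker,
                           pd_gb_per_worker, hours):
    """Rough Cloud Dataprep job cost: 1.16 x the Dataflow resource cost."""
    dataflow_cost = workers * hours * (
        vcpus_per_worker * VCPU_RATE
        + mem_gb_per_worker * MEMORY_RATE
        + pd_gb_per_worker * PD_RATE
    )
    return DATAPREP_MULTIPLIER * dataflow_cost


# Hypothetical job: 3 workers, each 1 vCPU / 3.75 GB RAM / 250 GB PD, 1 hour
cost = estimate_dataprep_cost(3, 1, 3.75, 250, 1)
print(f"Estimated charge: ${cost:.2f}")  # Estimated charge: $0.29
```

Note that this covers only worker resources; Dataflow Shuffle usage (and any BigQuery or Cloud Storage consumption) would be billed on top, as described below.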

 

In addition, further charges may apply when the job consumes Google Cloud Storage or Google BigQuery resources (based on those services' pricing).

 

 More about the product:

 

Have you had a chance to try it out already? What is your opinion about this solution?

Thanks,

Krzysiek
