Friday, September 21, 2012

Promising BigQuery platform

Big Data is a growing buzzword that many companies are leveraging to define various business strategies. My interpretation in simple language: how to manage (store, control access to, read, analyse) massive information or data, generated on the order of tera-, peta-, exa- or zettabytes.

Google has been dealing with data at this scale, as have other major companies such as EMC, IBM, Microsoft and Yahoo. Google built an internal engine called Dremel, a scalable, interactive ad-hoc query system for analysis of read-only nested data sets. This engine makes large data sets look very small (by returning meaningful result sets to the application quickly). BigQuery wraps this engine with APIs for developers. Wired.com has a good article on this.

My team gets excited about any new offering from Google, especially for its enterprise customers, so we got our hands dirty to learn what it is all about. Reading through the Google Developers site and watching the session from the I/O event got us started, followed by a small POC done by me and the team. Here is our understanding and learning.

Architecture
Existing Models
Relational databases work amazingly well when queries are executed against primary and foreign keys, because these databases internally store their indexes as B-Trees. e.g. customer IDs, which form the basis for profile information, transactional data and the like.

Challenge
For a query that seeks information based on a column that is neither a primary nor a foreign key, the entire table has to be scanned to retrieve the result set. e.g. identifying the list of users from a particular geography: here the data set is indexed on customer-id, not geography, so the entire table must be scanned to check each record's geography. With Big Data (tables of at least TB scale), this gets expensive and time-consuming.
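
To make this concrete, here is a toy Python sketch (hypothetical data, nothing to do with BigQuery internals): a dict plays the role of the B-Tree-style index on customer-id, while the geography filter has no index to use and must touch every row.

    # Toy illustration of keyed lookup vs. full table scan (not BigQuery code).
    rows = [
        {"customer_id": 1, "name": "Asha",  "geography": "APAC"},
        {"customer_id": 2, "name": "Brian", "geography": "EMEA"},
        {"customer_id": 3, "name": "Chen",  "geography": "APAC"},
    ]

    # Index on the primary key: one direct lookup, no scan.
    by_customer_id = {row["customer_id"]: row for row in rows}
    print(by_customer_id[2])

    # No index on geography: every single row must be examined.
    apac_users = [row for row in rows if row["geography"] == "APAC"]
    print(apac_users)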

Alternatives
  • Avoid the table scan altogether
  • Speed up the table scan (Dremel leverages this concept)
Dremel leverages the following: instead of record-oriented storage, it uses column-oriented storage, i.e. each column is stored in a separate file. There are two advantages to column-oriented storage.
  1. Read only the columns needed for the information requested.
  2. Leverage compression algorithms on the content: a column's values are of a similar nature, so compression works to a great extent (see the sketch below).
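
Here is a toy Python sketch of the column-oriented idea (illustrative only; Dremel's actual storage format is far more sophisticated): each column lives in its own file, a query touching one column reads only that file, and the homogeneous content compresses well.

    # Toy column-oriented store: one file per column (illustrative only).
    import json
    import zlib

    table = {
        "customer_id": [1, 2, 3],
        "name": ["Asha", "Brian", "Chen"],
        "geography": ["APAC", "EMEA", "APAC"],
    }

    # Advantage 2: values within a column are similar, so they compress well.
    for column, values in table.items():
        with open(column + ".col", "wb") as f:
            f.write(zlib.compress(json.dumps(values).encode()))

    # Advantage 1: a query on geography reads one file, not the whole table.
    with open("geography.col", "rb") as f:
        geographies = json.loads(zlib.decompress(f.read()))
    print(geographies.count("APAC"))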

[Figure: Dremel serving tree, with Mixers at each level fanning out to leaf Shards]

In the figure above, the Mixers and Shards at each level are compute machines with their own computational power, disk and RAM. As soon as a query is requested, parallel requests are fired to the next level of nodes (the children) to read, process and aggregate the data. Each result is passed back to the parent, where it is reduced as relevant. These machines talk to each other via RPC over high-bandwidth links.
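
A minimal Python sketch of this scatter-gather pattern (a toy model of the tree, not Dremel's implementation): shards compute partial aggregates in parallel, and the mixer reduces the partial results.

    # Toy scatter-gather: shards aggregate their slice, the mixer reduces.
    from concurrent.futures import ThreadPoolExecutor

    # Each shard holds one slice of the (hypothetical) geography column.
    shards = [
        ["APAC", "EMEA", "APAC"],
        ["AMER", "APAC"],
        ["EMEA", "EMEA", "AMER", "APAC"],
    ]

    def count_on_shard(slice_, value):
        # Work done locally on the shard: scan only its own slice.
        return sum(1 for v in slice_ if v == value)

    # The mixer fires parallel requests to its children, then reduces.
    with ThreadPoolExecutor() as pool:
        partials = pool.map(lambda s: count_on_shard(s, "APAC"), shards)
    print(sum(partials))  # total APAC count across all shards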

Execution
Managing data
Data has to be uploaded in a denormalized form. A CSV table can be uploaded via the BigQuery tool or through Google Cloud Storage (GCS).
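
As a sketch, a CSV staged in GCS can be loaded with an insert-job call through the Python client of the REST API (assuming credentials are already configured; the project, bucket, dataset and table names below are placeholders, and error handling is omitted):

    # Sketch: load a CSV staged in GCS via the BigQuery v2 REST API.
    # Assumes credentials are already configured; all names are placeholders.
    from googleapiclient.discovery import build

    service = build("bigquery", "v2")  # picks up default credentials

    load_job = {
        "configuration": {
            "load": {
                "sourceUris": ["gs://my-bucket/customers.csv"],
                "sourceFormat": "CSV",
                "destinationTable": {
                    "projectId": "my-project",
                    "datasetId": "mydataset",
                    "tableId": "customers",
                },
                "schema": {"fields": [
                    {"name": "customer_id", "type": "INTEGER"},
                    {"name": "name",        "type": "STRING"},
                    {"name": "geography",   "type": "STRING"},
                    {"name": "revenue",     "type": "FLOAT"},
                ]},
            }
        }
    }
    job = service.jobs().insert(projectId="my-project", body=load_job).execute()
    print(job["jobReference"]["jobId"])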


Data Schematics
While importing data, the data schema has to be defined explicitly, so a text file with the field names and schema definitions comes in handy. (The schema is pretty simple: FieldName:DataType,FieldName2:DataType.) The data types are integer, float and string.
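
For example, a one-line schema file for the hypothetical customers table used in the sketches above:

    customer_id:integer,name:string,geography:string,revenue:float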

Update: when I started playing with the tool, not many third-party ETL tools were available, but now I do see a bunch of them.

Query
BigQuery follows a SQL dialect with a basic set of SQL constructs. As it is developed for analytics purposes, it does not support UPDATE or DELETE requests on data records.
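
For example, an ad-hoc aggregation in this dialect looks like plain SQL (the table name is hypothetical, matching the schema file above):

    SELECT geography, COUNT(*) FROM mydataset.customers GROUP BY geography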

The browser-based BigQuery tool is very handy for getting your hands dirty: running a few ad-hoc queries and seeing how the system behaves.

Integration
There are two sets of integration points needed to make a full-fledged system: (1) pull in data from an existing data warehouse; (2) visually represent the business intelligence derived by running BigQuery over a huge data set.

The great element of the integration story is the availability of BigQuery through a REST interface, with a rich set of client libraries (Java, JavaScript, Python, PHP, .Net, Apps Script).
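
A minimal sketch of firing a query over the REST interface with the Python client (same assumptions and placeholder names as the load example above):

    # Sketch: run a synchronous query through the REST interface.
    from googleapiclient.discovery import build

    service = build("bigquery", "v2")  # assumes default credentials
    body = {"query":
            "SELECT geography, COUNT(*) FROM mydataset.customers "
            "GROUP BY geography"}
    response = service.jobs().query(projectId="my-project", body=body).execute()

    for row in response.get("rows", []):
        print([cell["v"] for cell in row["f"]])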

Data Connectors
I do see ways to connect existing data sets to the BigQuery platform, but have not explored them in detail.

Visualization & Business Intelligence
The JSON response notation varies with the library used to interact with BigQuery. Use the sample applications shipped with the libraries to get started, and learn more using the browser debugger. As the data set received is JSON, it can be plugged into any third-party tool for visualization and business representation. For my playground, I used the Google Charting APIs together with the JavaScript library to draw charts.
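
For reference, here is a sketch of flattening such a response in Python; the payload below is a made-up sample in the shape the v2 query API returns (rows of cells under "f"/"v"):

    # Flatten a BigQuery JSON query response into a list of dicts.
    response = {  # made-up sample in the shape of a v2 query response
        "schema": {"fields": [{"name": "geography"}, {"name": "users"}]},
        "rows": [
            {"f": [{"v": "APAC"}, {"v": "4"}]},
            {"f": [{"v": "EMEA"}, {"v": "3"}]},
        ],
    }

    names = [field["name"] for field in response["schema"]["fields"]]
    records = [
        dict(zip(names, (cell["v"] for cell in row["f"])))
        for row in response["rows"]
    ]
    print(records)  # ready to hand to a charting library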

Security
BigQuery follows the standard project hosting on Google projects, which comes with the standard security measures.

Business Applicability
As claimed, this platform seems well worth experimenting with, given Google's credentials on scalability and speed of execution. We could set it up and get it running very quickly. With a low investment, leaning on the cloud platform, we could see results almost instantaneously.

For a business entity, insight into its business data is vital for growth. Analytics can be derived quickly on the BigQuery platform, without heavy investment in infrastructure and setup costs. Most importantly, mining information from a large data set, and refining it incrementally, is pretty much instantaneous (versus waiting to configure a system, pull out data, define a model and so on).

Applicability: this platform applies to any industry vertical, from telco to retail to an online eCommerce platform to an automobile company.

It's worth exploring the benefits of this promising platform.

Learnings & Challenges
  • After adding OAuth2 client IDs, all new users (via their Gmail IDs) who tried to access the web application (and thereby trigger a BigQuery request) received an error about not having accepted the BigQuery terms of service (not sure why end users have to accept it). I have yet to find a way to fix this behavior smoothly.
  • In the data set used for the POC, a TimeStamp data type is missing; such values need to be parsed and formatted for better representation.
  • Documentation for the different BigQuery libraries and their JSON response notations is not detailed.
  • For experimentation, do refer to the quota policy to check that it covers your needs.
  • The BigQuery tool restricts uploads of larger files; use GCS for managing larger chunks of data.
