Archive

Posts Tagged ‘Data warehousing’

Introduction to R


R is not short for “Rishu”, as I made it out to be when I first heard of this data mining tool. Initially, I assumed R to be yet another tool like Pentaho. But my assumptions fell apart when I clicked on http://www.r-project.org/, which states its definition up front:

“R is a language and environment for statistical computing and graphics. It is a GNU project which is similar to the S language and environment”

So there it is: R is a language. To me, it comes across as a data mining tool more than anything else. If you have ever worked with MATLAB, the format and syntax will look familiar. What makes R special is how simple and easy it makes complex mathematical queries and computations. Creating graphs and plots has never been easier. For example, take the image below (a screenshot of code I wrote):

[Image: screenshot of the R code and its output]

The code is pretty simple. I assigned values (as vectors) to two separate variables, “a” and “b”, where “b” is the square of “a”. As you can see, mathematical computations are done with simple commands: I calculated the mean and variance of “b” using mean(b) and var(b). The variable “c_lm” holds the linear regression model of “b” against “a”.
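For readers who cannot see the screenshot, here is a minimal sketch of what that code looks like. The actual values from my screenshot are not preserved, so the numbers below are just placeholders.

```r
# Minimal sketch of the example described above; the vector values are placeholders.
a <- c(1, 2, 3, 4, 5)   # assign a numeric vector to "a"
b <- a^2                # "b" is the square of "a"

mean(b)                 # mean of b
var(b)                  # variance of b

c_lm <- lm(b ~ a)       # linear regression of b against a
summary(c_lm)           # inspect the fitted model

plot(a, b)              # plotting is just as simple
abline(c_lm)            # overlay the fitted regression line
```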

And there is loads more. People have gone ahead and created things like “Google Trends” on top of it. Though Google has its own GUI built over R, nothing is stopping us from creating one either.

Sources: http://www.r-project.org/; Google Trends

Big Data – Hadoop HDFS and MapReduce

September 27, 2012

The big data buzz is growing day by day, so here is a more detailed look at the two core pieces of Hadoop: HDFS and MapReduce.

HDFS, or the Hadoop Distributed File System, is designed to store large amounts of data across the servers of a cluster. The definition of “large” needs no explanation (especially when we are talking Big Data). Data in a Hadoop cluster is broken down into small blocks (64 MB by default) and distributed across the cluster.

Blocks are placed in the cluster using a rack-aware block placement algorithm. The rack-aware algorithm determines on which nodes each block and its copies are placed, based on the replication factor, which is 3 by default.
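To make the block and replication arithmetic concrete, here is a toy calculation in R (this is only arithmetic on a hypothetical file size, not the actual placement algorithm):

```r
# Toy arithmetic only -- not the real HDFS placement logic.
file_mb     <- 200                           # hypothetical file size in MB
block_mb    <- 64                            # default HDFS block size
replication <- 3                             # default replication factor

n_blocks <- ceiling(file_mb / block_mb)      # 200 MB -> 4 blocks
n_copies <- n_blocks * replication           # 12 block copies stored in the cluster
c(blocks = n_blocks, copies = n_copies)
```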

The basic architecture of HDFS cluster consists of two major nodes namely:

1. Name Node:

This is almost like the Master Node in a Greenplum database – the “master” in the master-slave concept. The name node manages the file system namespace: it maintains the file system tree and the metadata for all the files and directories in the tree. This information is stored persistently on the local disk in the form of two files: the namespace image and the edit log.

Now the question arises: what happens if this single name node crashes (we have only one primary name node)? Since the primary name node is a single point of failure (SPOF), Hadoop provides a secondary (or backup) name node, which copies the FsImage and EditLog from the name node at regular intervals so that this data is preserved.

2. Data Node:

These are the major working blocks of HDFS. They store and retrieve blocks when they are told to (by the name node), and they report back to the name node periodically with lists of the blocks they are storing. The data nodes are where the majority of the data resides.

 

MapReduce is the second major part of the Hadoop architecture – the programming logic, or the brain, as I like to call it. The MapReduce model was created by Google and is based on parallel processing; Hadoop’s implementation of it is written in Java.

The MapReduce programming model works in two parts – the mapping part (done by the Mapper) and the reduction part (done by the Reducer).

The Mapper works on the blocks of data available in the data nodes and tries to get its share of the job done. You can think of a Mapper as an individual worker (in the master-slave concept), working to produce the data required by the client.

The major task that remains is to aggregate the results produced by the individual Mappers. This is the Reducer’s job: it iterates over the intermediate result data and sends back a combined output value.

A MapReduce job goes through several intermediate stages. Let’s have a look at the following diagram:

From the diagram above we can see that the user gives something as input – in this case a question and its corresponding answer. These files are stored in the data nodes of HDFS. The MapReduce program reads the given data and breaks it into an intermediate stage consisting of key/value pairs, so the file data becomes many key-value records. [If you studied compiler design in college, this key-value stage may remind you of lexical analysis, semantic analysis, and so on.] After this stage, the sorting, or shuffling, of the data takes place. It is hard to see from the diagram alone, but if you look at the second part of the picture you will understand why the sorting phase is needed: the intermediate data is spread across various servers or nodes, and MapReduce makes sure it is shuffled and sorted by key. Then comes the reducer phase, which accepts the data coming from the sort/shuffle phase and combines it into a smaller set of values. This result is sent back to the user/client.
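The whole map -> shuffle/sort -> reduce flow can be sketched in a few lines of plain R. To be clear, this is only a conceptual illustration of the pattern (a word count over made-up input lines), not the Hadoop Java API:

```r
# Conceptual word-count sketch of map -> shuffle/sort -> reduce, in plain R.
lines <- c("big data is big", "hadoop handles big data")   # made-up input

# Map: emit a (key, value) pair -- here (word, 1) -- for every word.
pairs <- unlist(lapply(strsplit(lines, " "), function(words) {
  setNames(rep(1, length(words)), words)
}))

# Shuffle/sort: group all values belonging to the same key (word).
grouped <- split(unname(pairs), names(pairs))

# Reduce: combine each key's values into a single output value.
counts <- sapply(grouped, sum)
print(counts)   # big = 3, data = 2, hadoop = 1, handles = 1, is = 1
```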

This entire process is controlled by a JobTracker, which coordinates the job run and makes sure everything goes smoothly, while TaskTrackers run the individual tasks that the job has been split into.

So that is a brief description of HDFS and MapReduce. I didn’t go very deep into the core functionality of MapReduce, since that requires a working knowledge of the Java programming language, but I hope this gives a short yet reasonably detailed picture of Hadoop. Thanks and take care.

Big Data: Parallelism and Hadoop Basics


 

Let me start this blog by putting up two scenarios in front you:

Scenario I: You are given a bucket full of mixed fruit. There are three different kinds of fruit, say apples, mangoes and bananas. How would you count the number of apples, mangoes and bananas in the bucket?

The simplest answer would be to count the fruit one by one and arrive at the required result.

Scenario II: Now suppose that instead of a bucket of fruit, you are given a truck full of mixed fruit. How would you count each kind of fruit this time?

The most feasible approach would be to divide the work (instead of counting the entire truck one by one). We would hand a basket of mixed fruit to each of several people [WORKERS/SLAVES]. Each person counts their own basket (with no communication between them), and at the end we [the MASTER] sum the results from all the baskets to get the total. Using this approach we save time and effort [if you agree].

Well, if you are still wondering why I started off with this scenario, it is because HADOOP is built on this simple principle. The scenario above describes what is called, in technical terms, parallel processing or distributed programming. Parallel processing systems use a master-worker concept: the master divides the work, each worker does its allotted share, and the work done by each worker is sent back to the master.
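Here is a toy sketch of the same master-worker idea in R, using the built-in parallel package on a single machine. The “truck” is just a randomly generated vector of fruit names:

```r
# Toy master-worker illustration with R's parallel package (single machine).
library(parallel)

truck   <- sample(c("apple", "mango", "banana"), 1e6, replace = TRUE)  # the "truck"
baskets <- split(truck, rep(1:4, length.out = length(truck)))          # divide the work

cl <- makeCluster(4)                      # four "workers"
partial <- parLapply(cl, baskets, table)  # each worker counts its own basket
stopCluster(cl)

total <- Reduce(`+`, partial)             # the "master" sums the partial counts
print(total)
```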

The situation with BIG DATA is similar. There is more data available (just like the truck of fruit) than one machine can handle alone, and, most importantly, there is the 3-V factor of BIG DATA [volume, variety and velocity]. To handle such situations, Apache came up with HADOOP – a high-performance distributed storage and processing system that can store any kind of data from any source at very large scale and can run very sophisticated analysis on that BIG DATA.

Hadoop architecture is mainly based on the following two components:

1. HDFS [Hadoop Distributed File System]:

This is the storage layer of Hadoop. Whenever data arrives at the cluster*, the HDFS software breaks it into pieces and distributes them across the participating servers in the cluster.

2. MapReduce:

As the data is stored as fragments across various servers, MapReduce applies its programming logic to run the required job against each server’s data and later returns the results to the master server. The computation happens locally and in parallel across all servers in the cluster [the master-worker concept].

The picture above describes the Hadoop ecosystem, which will be explained in detail in later blogs. I hope the parallel, distributed concept is clear – it will be useful in understanding the architecture of Hadoop.

[A bit of History on Hadoop: Hadoop was created by Doug Cutting, who named it after his son’s elephant toy. Hadoop was derived from Google’s MapReduce and Google File System (GFS) papers. Hadoop is a top-level Apache project being built and used by a global community of contributors, written in the Java programming language. Yahoo! has been the largest contributor to the project, and uses Hadoop extensively across its businesses.]

FAQ:

*cluster: A computer cluster consists of a set of loosely connected computers that work together so that in many respects they can be viewed as a single system. The components of a cluster are usually connected to each other through fast local area networks, with each node (a computer used as a server) running its own instance of an operating system. Computer clusters emerged as a result of the convergence of a number of computing trends, including the availability of low-cost microprocessors, high-speed networks, and software for high-performance distributed computing.

[Source: Wikipedia [Hadoop History] and Google]


MicroStrategy Intelligence Server

September 12, 2010

Before we get into the bits and pieces of the MicroStrategy architecture, we need to know a little about Intelligence Server. MicroStrategy Intelligence Server™ is an analytical server optimized for enterprise querying and reporting as well as OLAP analysis. It processes report requests from all users of the MicroStrategy Business Intelligence platform through Windows, web, and wireless interfaces. These reports range from simple performance indicators, such as quarterly sales by product, to sophisticated hypothesis testing using a chi-square test. The results are then returned to the users, who can further interact with the data and run more reports. Following are the benefits of Intelligence Server:

Features:

Dynamic SQL Generation: MicroStrategy Intelligence Server stores information about the database tables in metadata. MicroStrategy Intelligence Server uses this metadata to generate optimized SQL for the database. Because the metadata is schema independent, these reports, queries and analyses are generated from your current physical schema without any modifications.

Advanced Caching: MicroStrategy Intelligence Server caches all user requests. Not only are reports cached, but the individual report pages requested by users are also cached. As a result, no redundant processing occurs on the MicroStrategy Intelligence Server or on the database.

Built-in Software-level Clustering and Failover: MicroStrategy Intelligence Server lets you cluster many different individual servers together without any additional software or hardware components. Built-in failover support ensures that if a server experiences a hardware failure, the remaining MicroStrategy Intelligence Servers will pick up failed jobs.

Integrated Aggregations, OLAP, Financial and Statistical Analysis: MicroStrategy Intelligence Server provides simple analysis such as basic performance indicators, as well as more sophisticated analyses such as market basket, churn, retention and deciling analyses. Other analyses include hypothesis testing, regressions, extrapolations and bond calculations.
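Just to make the statistical vocabulary above concrete (this is plain R, not MicroStrategy functionality, and the data are made-up placeholders), a chi-square test and a simple trend regression look like this:

```r
# Illustrative only: the kinds of analyses mentioned above, expressed in R.
sales <- matrix(c(120, 80, 95, 105), nrow = 2,
                dimnames = list(region  = c("North", "South"),
                                quarter = c("Q1", "Q2")))
chisq.test(sales)                    # hypothesis test on a contingency table

revenue <- data.frame(year  = 2001:2010,
                      total = c(10, 12, 15, 14, 18, 21, 25, 24, 30, 33))
lm(total ~ year, data = revenue)     # regression / extrapolation of a trend
```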

Business Intelligence Architecture

September 8, 2010

A business intelligence architecture using MicroStrategy is shown in the following diagram:

The Architecture has the following components:

  • Source System (OLTP):

Source systems are typically databases or mainframes that store transaction processing data; as such, they are Online Transaction Processing (OLTP) systems. Transaction processing involves the simple recording of transactions such as sales, inventory, withdrawals, deposits and so forth.

  • Data Warehouse (OLAP):

A well designed and robust data warehouse lies at the heart of the business intelligence system and enables its users to leverage the competitive advantage that business intelligence provides. A data warehouse is an example of an Online Analytical Processing (OLAP) system.

Analytical Processing involves manipulating transactional records to calculate sales trends, growth patterns, percent to total contributions, trend reporting, profit analysis etc.

  • ETL Processes:

The extraction, transformation and loading (ETL) process facilitates the transfer of data from the source systems to the data warehouse. We discussed this in detail in a previous post.

  • Metadata:

The metadata database contains information that facilitates the retrieval of data from the data warehouse when using MicroStrategy applications. It stores MicroStrategy object definitions and information about the data warehouse in a proprietary format, and maps MicroStrategy objects to the data warehouse structures and content.

  • MicroStrategy Application:

The MicroStrategy applications allow you to interact with the business intelligence system. They let you organize data logically and hierarchically so you can quickly and easily create, calculate, and analyze complex data relationships. They also provide the ability to look at the data from different perspectives.

A variety of grid and graph formats are available for superior report presentation. You can even build documents, which enable you to combine multiple reports with text and graphics.

An Intro to MicroStrategy

September 5, 2010

What exactly is MicroStrategy, and how is it related to data warehousing? I hope this post will explain it.

As we discussed previously, we need ETL tools (e.g. Informatica) to build a data warehouse. The ETL (Extract-Transform-Load) process extracts the data from OLTP and various other data sources, transforms it in the staging area according to business needs, and finally loads it into the data warehouse. Now, one common question is how to distinguish between a database and a data warehouse. A compact answer goes like this – “A data warehouse is also a database. When a database stores historical data (data from the same system, taken at different points in time), it becomes a data warehouse.”

So we have historical data in the data warehouse. Now, what is the use of this data? It can be used for business analysis, and for that it needs to be represented in different formats according to business needs. Suppose a business owner wants to see the trend of the last 10 years of revenue, represented as a bar graph. You can fetch the last 10 years of data from the database using SQL, but can you present the same data graphically? This is where reporting tools come in, and MicroStrategy is a powerful leader among them.
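As an aside, here is the same idea in a few lines of R rather than MicroStrategy – once the yearly revenue rows have been fetched (for example via SQL), a reporting layer simply turns them into a picture. The numbers are made-up placeholders:

```r
# Illustration only: turning fetched revenue rows into a bar graph.
revenue <- data.frame(year  = 2001:2010,
                      total = c(12, 14, 13, 17, 19, 22, 21, 26, 29, 33))

barplot(revenue$total, names.arg = revenue$year,
        xlab = "Year", ylab = "Revenue",
        main = "Revenue trend over the last 10 years")
```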

The purpose of reporting tools is to fetch data from the data warehouse and present it according to business requirements. MicroStrategy has a huge number of powerful features to support this, which we will hopefully get to know in the upcoming posts. The following snapshot is of the MicroStrategy Desktop window.

Fig1: Snapshot of MicroStrategy Desktop

Fig2: Different Types of Project in MicroStrategy

As we can see from the snapshot above, there are two types of projects in MicroStrategy – 3-tier and 2-tier. We will discuss these in detail in the next post, where I will also try to give some information about the MicroStrategy architecture.

Understanding ETL

November 29, 2009

The Extract-Transform-Load (ETL) system is the foundation of the data warehouse. A properly designed ETL system extracts data from the source systems, performs transformation and cleansing, and delivers the data in a presentation-ready format, after which it is loaded into the warehouse. The following figure is a schematic description of the ETL process.

[Image: schematic of the ETL process]

The four steps of the ETL process are explained below:

Extracting: In this phase, data from different types of source systems is fetched into the staging area. The source systems can be mainframes, production databases or any other OLTP sources, and the source formats can differ as well: data may be stored in relational tables or in flat files (e.g. plain text files). The first job of ETL is to fetch data from these different sources.

Cleansing: In most cases the level of data quality in an OLTP source system is different from what is required in a data warehouse. To reach the required quality, cleansing has to be applied. Data cleansing consists of many discrete steps, including checking for valid values, ensuring consistency across values, removing duplicates, and checking that complex business rules and procedures have been applied.

Conforming: Data conformation is required whenever two or more data sources are merged in the data warehouse. Separate data sources cannot be queried together unless some or all of the textual labels in these sources have been made identical and similar numerical measures have been rationalized.

Delivering: The main goal of this step is to make the data ready for querying. It involves physically structuring the data into a set of simple, symmetric schemas known as dimensional models, or star schemas (we will discuss star schemas and dimensional models later). These schemas are the necessary basis for building an OLAP system.
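To tie the four steps together, here is a deliberately tiny sketch in R. The file names, column names and transformations are hypothetical, chosen only to illustrate the flow; a real ETL pipeline would use a dedicated tool such as Informatica:

```r
# Hypothetical end-to-end mini-ETL: extract, cleanse, conform, deliver.
orders_a <- read.csv("source_a_orders.csv")      # Extract: flat-file source A
orders_b <- read.csv("source_b_orders.csv")      # Extract: flat-file source B
# (both files are assumed to have country, year and amount columns)

orders_a <- unique(orders_a)                     # Cleanse: remove duplicates
orders_a <- orders_a[!is.na(orders_a$amount), ]  # Cleanse: keep valid values only

orders_a$country <- toupper(orders_a$country)    # Conform: identical textual labels
orders_b$country <- toupper(orders_b$country)    #          across both sources

fact_sales <- aggregate(amount ~ country + year, # Deliver: a simple, query-ready
                        data = rbind(orders_a, orders_b), FUN = sum)  # fact table
write.csv(fact_sales, "fact_sales.csv", row.names = FALSE)
```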
