Archive

Archive for the ‘OpenSource’ Category

Big Data: Parallelism and Hadoop Basics


 

Let me start this blog by putting two scenarios in front of you:

Scenario I: You are given a bucket full of mixed fruits. There are 3 different kinds of fruits, say apples, mangoes and bananas. Now how would you count the total number of apples, mangoes and bananas in the bucket?

The simplest answer would be to count the fruits one by one and, in the end, arrive at the required result.

Scenario II: Now suppose that instead of a bucket of fruits, you are given a truck full of mixed fruits. How would you count the total number of each fruit this time?

The most feasible approach would be to divide the work (instead of counting the entire fruit truck one by one). We would fill one basket each with mixed fruits and give it to different people [WORKERS/SLAVES]. Each person counts his or her own basket (without any communication between them), and in the end we [the MASTER] sum the results of each basket to get the final count. Using this approach we save time and effort [if you would agree].

Well, if you are still wondering why I started off with these scenarios, then I have to say that HADOOP is built on this simple, basic principle. The second scenario describes what is known in technical terminology as parallel processing, or distributed programming. There is a Master-Worker concept in a parallel processing system: the Master divides the work, the Workers do the work allotted to them, and the work done by each Worker is sent back to the Master.
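To make the idea concrete, here is a tiny single-machine sketch of the same master-worker split, written for a GNU/Linux shell and assuming a hypothetical file fruits.txt with one fruit name per line (it only illustrates the principle; it is not Hadoop):

# MASTER: split the big file into 4 smaller "baskets" (basket_aa, basket_ab, ...)
split -n l/4 fruits.txt basket_

# WORKERS: each one counts its own basket independently, in parallel
for basket in basket_*; do
  ( sort "$basket" | uniq -c > "$basket.count" ) &
done
wait   # the master waits for every worker to finish

# MASTER: merge the partial counts into the final total per fruit
cat basket_*.count | awk '{sum[$2] += $1} END {for (f in sum) print f, sum[f]}'

Each background job plays the role of a worker counting its own basket, and the final awk line is the master adding up the partial results.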

Similar is the situation with BIG DATA. There is plenty of data available (just like the truck of fruits) which one cannot handle alone, and, most importantly, there is the 3-V factor of BIG DATA [volume, variety and velocity]. To handle such a situation, Apache came up with HADOOP, a high-performance distributed storage and processing system that can store any kind of data from any source at a very large scale and can do very sophisticated analysis of BIG DATA.

Hadoop architecture is mainly based on the following two components:

1. HDFS [Hadoop Distributed File System]:

It is more of a storage area for Hadoop. Whenever data arrives at the cluster*, the HDFS software breaks it into pieces and distributes them to the participating servers in the cluster (a concrete example is sketched after the MapReduce description below).

2. MapReduce:

As the data is stored as fragments across various servers, MapReduce uses its programming logic to run the required job against the data held on those servers and later return the results to the master server. The computation happens locally and in parallel across all servers in the cluster [the Master-Worker concept again], as sketched below.
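As a rough, hypothetical sketch of how the two components work together, the fruit-counting job could be run roughly like this, assuming a working cluster, the hdfs and hadoop commands on the PATH (older releases use hadoop fs instead of hdfs dfs), made-up HDFS paths, and a Hadoop Streaming jar whose exact location varies between Hadoop versions:

# First let HDFS store the data: it splits the file into blocks and
# replicates them across the servers of the cluster
hdfs dfs -mkdir -p /user/demo/fruits
hdfs dfs -put fruits.txt /user/demo/fruits/
hdfs fsck /user/demo/fruits/fruits.txt -files -blocks -locations

# Then run a MapReduce job over the stored blocks with Hadoop Streaming:
# the mapper emits every fruit name as a key, Hadoop sorts and groups the
# map output, and uniq -c on the reducer side produces the per-fruit totals
hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
    -input /user/demo/fruits/fruits.txt \
    -output /user/demo/fruit-counts \
    -mapper /bin/cat \
    -reducer '/usr/bin/uniq -c'

# Read the results written by the reducers
hdfs dfs -cat /user/demo/fruit-counts/part-*

The map step runs locally on the servers that hold the data blocks, and the reduce step merges the partial counts, exactly like the workers and the master in the fruit-truck story.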

The diagram accompanying this post shows the Hadoop ecosystem, which will be explained in detail in my later blogs. I hope the parallel, distributed concept is clear by now; it will be useful in understanding the architecture of Hadoop.

[A bit of history on Hadoop: Hadoop was created by Doug Cutting, who named it after his son's toy elephant. Hadoop was derived from Google's MapReduce and Google File System (GFS) papers. It is a top-level Apache project, written in the Java programming language, built and used by a global community of contributors. Yahoo! has been the largest contributor to the project and uses Hadoop extensively across its businesses.]

FAQ:

*cluster: A computer cluster consists of a set of loosely connected computers that work together so that in many respects they can be viewed as a single system. The components of a cluster are usually connected to each other through fast local area networks, with each node (a computer used as a server) running its own instance of an operating system. Computer clusters emerged as a result of the convergence of a number of computing trends, including the availability of low-cost microprocessors, high-speed networks, and software for high-performance distributed computing.

[Source: Wikipedia [Hadoop History] and Google]

 

 

 

 

Big Data: An Introduction


Hey guys, I am back to blogging after a pretty long gap. Since my last blog I have been going through data warehousing material. In the midst of learning data warehousing techniques, I came to know about a bigger issue that is troubling IT companies. It's called BIG DATA. So I thought I would share my knowledge of this advanced area of business analytics with you guys.

If you are thinking BIG DATA deals with "data which are big in nature", then I have to say you are perfectly correct. But if your mental picture is limited to database tables with 1,000 to 100K rows, then I am afraid BIG DATA is something bigger and messier than that. A formal definition of BIG DATA would go as follows:

Big data is a term applied to data sets, both structured and unstructured, whose volume exceeds the capacity of commonly used software tools to capture, manage, and process the data with usual database and software techniques within an acceptable time.

Today, companies face a serious issue. They have access to lots and lots of data and no idea what to do with it. An IBM survey shows that over half of business leaders today realize that they don't have access to the insights they need to do their jobs. This data is normally generated from log files, IM chats, Facebook chats, emails, sensors, etc. It is raw in nature and is something you won't find in database-table (row-column) format; it accumulates from the day-to-day activity of each and every associate. Companies are trying to tap this data store to derive business intelligence and strategies. BIG DATA is not about relational databases but about data that has no fixed relations at all.

BIG DATA can basically be classified into three categories based on its data characteristics:

1. VOLUME:

There is a huge amount of data being stored in the world. In the year 2000, around 800,000 petabytes (1 PB = 10^15 bytes) of data were stored worldwide, and the volume is growing rapidly. Companies have no idea what to do with this data or how to process it. Twitter alone generates more than 7 petabytes of data every day, and Facebook generates around 10 PB on its own. This volume is growing exponentially. Some enterprises generate terabytes of data every hour of every day of the year. It would not be wrong to say that we are drowning deep in an ocean of data. By 2020, the total is expected to reach 35 zettabytes (1 ZB = 10^21 bytes).

2. VARIETY:

With a huge volume of data comes another problem: variety. With the onset of rapid technology usage, data is no longer limited to relational databases; it has grown to include raw, unstructured and semi-structured data coming mainly from web pages, log files, emails, chats, etc. Traditional systems struggle to store this information and perform the analytics required to gain intelligence, because most of it doesn't lend itself to traditional database technologies.

3. VELOCITY:

Velocity is the characteristic of BIG DATA that deals with how fast data is generated, stored and used for analytics. In BIG DATA terminology we are also looking at the volume and variety aspects, so the rate of arrival of data, combined with its volume and variety, is something a traditional database technology could hardly handle. According to surveys, around 2.9 million emails are sent every second, 20 hours of video are uploaded to YouTube every minute, and around 50 million tweets are posted on Twitter every day. So I think you can imagine the velocity at which data comes at you.

There is also another characteristic of BIG DATA: VALUE. The value aspect of big data is what all companies are really looking forward to. Unless you are able to derive some business intelligence and value from the data at hand, there is no use in collecting it. In simple terms, value deals with turning the present unstructured raw data into meaningful statistics that can be used to make proper business decisions.

Companies are trying to extract all the information possible and derive better intelligence from it, in order to gain a better understanding of their customers, the marketplace and the business. A few technical solutions, like HADOOP (which I will explain in my next blog), NoSQL, DKVS databases, etc., are combating the BIG DATA problem.

For now, all I can conclude is that the right use of BIG DATA will allow analysts to spot trends and gain niche insights that help create value and innovation much faster than conventional methods. It will also help in better meeting consumer demand and facilitating growth.

Cloud Computing: Architecture


Hey guys!!! I hope everyone is clear about the overview of cloud computing, which I discussed in my previous blog. Our discussion of cloud computing will not be complete until we discuss the architecture and the technical side of this system. So, without wasting much time on idle chatter, let's begin our discussion of the architecture of cloud computing.

Cloud architecture, the systems architecture of the software systems involved in the delivery of cloud computing, typically involves multiple cloud components communicating with each other over a loose coupling mechanism such as a messaging queue. When talking about a cloud computing system, it’s helpful to divide it into two sections:

1. The Front End:
The front end includes the client's computer (or computer network) and the application required to access the cloud computing system. Not all cloud computing systems have the same user interface. Services like Web-based e-mail programs leverage existing Web browsers such as Internet Explorer or Firefox. Other systems have unique applications that provide network access to clients.

[Image: Cloud Computing Architecture]

2. The Back End:
On the back end of the system are the various computers, servers and data storage systems that create the "cloud" of computing services. In theory, a cloud computing system could include practically any computer program you can imagine, from data processing to video games. Usually, each application has its own dedicated server.

[N.B: Cloud engineering is the application of engineering disciplines to cloud computing. It brings a systematic approach to the high level concerns of commercialisation, standardisation, and governance in conceiving, developing, operating and maintaining cloud computing systems. It is a multidisciplinary method encompassing contributions from diverse areas such as systems, software, web, performance, information, security, platform, risk, and quality engineering.]

If a cloud computing company has a lot of clients, there is likely to be a high demand for a lot of storage space. Some companies require hundreds of digital storage devices. A cloud computing system needs at least twice the number of storage devices it would otherwise require to keep all its clients' information stored, because these devices, like all computers, occasionally break down. A cloud computing system must make a copy of all its clients' information and store it on other devices. The copies enable the central server to access backup machines to retrieve data that would otherwise be unreachable. Making copies of data as a backup is called redundancy.

The architecture of the cloud is evolving rapidly. Hopefully, in the near future of computing, we will be able to say "we build our home in the cloud". There are also many issues, such as privacy, data maintenance, etc., but there are loads of advantages too. We will discuss them in later blogs. Stay tuned for more!!!

Uninstalling GRUB Boot loader:

July 18, 2009

  1. Format the Linux partitions to create unallocated space, if you haven't already.
  2. Change the BIOS so that your computer boots your CD drive first.
  3. Insert the Windows XP disc and reboot. It may take a few minutes to load.
  4. Select ‘Recovery Console’ by pressing ‘r’.
  5. Select the Windows system to log on to. The default option is ‘1’.
  6. Press enter to bypass the administrative password prompt.
  7. Type fixboot and press enter.
  8. Type fixmbr and press enter.

    Hope these steps fix the MBR.

Categories: Linux/Unix, Windows

OpenSSL Upgrade Procedure:

April 23, 2009

Every Linux operating system comes with an OpenSSL version. If you want to upgrade it to the latest version, follow these steps:

[The steps described below were tested on CentOS 5 (stable).]

Steps for upgrading OpenSSL:

  • Remove the previous versions of OpenSSL using the following command:

#rpm --erase --nodeps openssl

  • Fetch the latest version of openssl from http://openssl.org/source. [Latest version is openssl-0.9.8k]
  • Unzip the tar file to /usr using the following command:

#tar -zxvf openssl-0.9.8k.tar.gz -C /usr

  • Move to the /usr/openssl-0.9.8k directory

#cd /usr/openssl-0.9.8k

  • Install the OpenSSL using the following commands:

#./config shared

#make

#make test

#make install

  • Link the new files using the following commands:

#cd /lib

#ln -s /usr/openssl-0.9.8k/libssl.so.0.9.8 libssl.so.0.9.8b

#ln -s /usr/openssl-0.9.8k/libssl.so libssl.so.6

#ln -s /usr/openssl-0.9.8k/libcrypto.so.0.9.8 libcrypto.so.0.9.8b

#ln -s /usr/openssl-0.9.8k/libcrypto.so libcrypto.so.6

#cd /usr/lib

#rm libssl.so

#rm libcrypto.so

#ln -s /usr/openssl-0.9.8k/libssl.so libssl.so

#ln -s /usr/openssl-0.9.8k/libcrypto.so libcrypto.so

#ln -s /usr/local/ssl/include/ /usr/include/ssl

#cd /usr/include

#rm -rf openssl

#ln -s /usr/local/ssl/include/openssl openssl

  • Rerun ldconfig
  • Perform the following steps:

#cd /etc

#rm ld.so.cache

Open the ld.so.conf file in vi editor and add the following lines:

/usr/local/ssl/lib

/usr/local/lib

Run ldconfig.

  • Change the Environment Path Variable

Open .bash_profile file in vi editor

#vi /root/.bash_profile

Add the following line before export PATH

PATH=$PATH:/usr/openssl-0.9.8k/apps

Save the file and exit the vi editor.

  • Reboot.
  • Done

[Note: The symbolic link names may differ depending on the operating system and the version of OpenSSL.]
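As a quick sanity check after the reboot (assuming the PATH change above has taken effect), commands like the following should report the new 0.9.8k build and show which libraries the openssl binary is actually linked against:

#openssl version

#which openssl

#ldd `which openssl` | grep -E 'libssl|libcrypto'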

OpenSSL: Introduction

April 9, 2009

OpenSSL is an open source implementation of the SSL and TLS protocols. The core library (written in the C programming language) implements the basic cryptographic functions and provides various utility functions. Wrappers allowing the use of the OpenSSL library in a variety of computer languages are available.
OpenSSL is based on the excellent SSLeay library developed by Eric A. Young and Tim J. Hudson. The OpenSSL toolkit is licensed under an Apache-style licence, which basically means that you are free to get and use it for commercial and non-commercial purposes subject to some simple license conditions.
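The bundled openssl command-line tool exposes much of this functionality directly. A few illustrative invocations (the exact options may differ slightly between versions, and the file names here are just placeholders):

openssl version
openssl genrsa -out server.key 2048
openssl req -new -x509 -key server.key -out server.crt -days 365
openssl s_client -connect example.com:443 < /dev/null

The first prints the installed version, the next two generate an RSA private key and a self-signed certificate (the req command prompts for the certificate details), and s_client opens a TLS connection so that a server's certificate chain can be inspected.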

Versions are available for most Unix-like operating systems (including Solaris, Linux, Mac OS X and the four open source BSD operating systems), OpenVMS and Microsoft Windows.

FIPS 140-2 Compliance:

OpenSSL is one of the few open source programs to be validated under the FIPS 140-2 computer security standard by the National Institute of Standards and Technology's Cryptographic Module Validation Program.
[Note: FIPS stands for Federal Information Processing Standard]

Present Versions:

openssl-1.0.0-beta1 [ Works under FIPS mode as well as NON-FIPS mode (Beta Version) ]

openssl-0.9.8k [Works under FIPS mode as well as NON-FIPS mode (Stable Version)]

openssl-fips-1.2 [ Works Under FIPS Mode ]

The other versions of the setup files, documentation and other information can be obtained from http://www.openssl.org/

Never Upgrade any Software!


Every year we see various software companies release new software. Some of it is so useful that we forget it will also need to be updated or upgraded.

For example, Nero. After opening the software on your machine you can see its version. It should show 6.x.x or 7.x.x or something similar. Right? Now, just Google 'Nero'. You will find that version 9.x.x is on the market.

Take another example, Adobe Acrobat Reader. We use this software for reading PDF files. Now, open it. What version is it showing? 5? 6? 7? Or 8? Now, again, Google 'Adobe Acrobat Reader'. It is on version 9, man!! What are you doing?

So you go online to upgrade all this software. Open Nero's website. Just a minute... what is it showing? The setup file of version 9.0 of Nero is 370.5 MB??? Lol! Are they mad? If the setup is nearly 400 MB, then what will the size be once it installs??? Believe it or not, it is 1.03 GB!!! Yes, I am telling you from my personal experience. How horrible!!

On their site, they say that this time Nero has become more user friendly and blah... blah... blah... But in practice, if you look at it, you will be even more upset. What have they done? Is this our old Nero? It has become more user-'foe'-ly! Not only Nero, take any software: Adobe Acrobat Reader, DivX, iTunes, Power DVD, Adobe Photoshop, etc. ... you name it! These are all our everyday software. Not just applications; nowadays operating systems are also becoming monsters. Windows Vista? The OS from Microsoft Corporation? I have also tested Windows 7. The same problem.

Now, let's take a small tour of the cons of upgrading software.

1. More and More Space: They want more and more space every day. Earlier, I gave the example of Nero. Now, look at this table:

Name of the Software      Previous / Most Used Version (Size)      Most Recent Version (Size)
Nero                      V6.x (<100 MB)                           V9.x (370.5 MB)
Adobe Acrobat Reader      V5.0 (<10 MB)                            V9.0 (>26 MB)
DivX                      V4.0 (0.7 MB)                            V6.8 (19.8 MB)
iTunes                    V4.1 (19.1 MB)                           V8.0 (65.6 MB)
Power DVD                 V1.5 (2.6 MB)                            V8.2 (76 MB)

These are the setup file sizes. After installation, they occupy space like this: Nero 9.x = 1.03 GB, Adobe Acrobat Reader 9.0 = 230 MB, DivX 6.8 = 50 MB, iTunes 8.0 = 75 MB (without QuickTime; 100 MB with QuickTime). So, what will you do to your hard disk?

2. Unnecessary Functions: 99.9% of Nero users have one primary objective: burning a CD or DVD. But in the 9.0 version you will find several tools that you will never use. For example, Nero Home. I am sure you will never use it if you have used Windows Media Center even once. Moreover, they bundle a 'Photoshop'-like program which is much harder to use than the original Adobe Photoshop. You will find more applications like Nero Media Player, the 'worst' media player I (and probably you too!) have ever seen. Sometimes they provide a more 'visually' beautiful user interface (e.g. Nero 9), which looks great but performs poorly. Not only do these tools fail to do the job in time, they also put pressure on the computer's hardware, which brings us to the next point.

3. High Requirements: Windows Vista said it would require at least 512-768 MB of RAM; however, 1 GB is really the lowest configuration that works well. Windows 7 has raised the bar a step higher: it demands a minimum of 1 GB of RAM, and they say 2 GB is a good low-end configuration. And that is only the demand for RAM; there is more hardware to consider: processor, hard disk space, motherboard, optical drive, etc. Adobe's latest Photoshop CS4 states a minimum requirement of 512 MB of RAM, but in practice I found that 1 GB works well rather than 512 MB. If you remember, the CS2 version had a recommended requirement of 512 MB of RAM, and in my own experience 320 MB was more than sufficient for it. My Intel 845GVSR motherboard with a P4 2.4 GHz processor ran the application very fast!

4. Customers Are the Testers: When you buy software, they will assure you in 1,000 ways that it has been tested 10,000 times and rated "best value" by some 'XXX' magazine! You look at that, think well of it, and become the scapegoat. When you come back home and start installing it, the string of problems begins. And until the software company provides a 'patch' or 'update', the problems persist. I think this is why, from 2001 until today, Windows XP has been receiving 'Service Pack 3', 'security updates' and other 'useful updates'. Not only XP; their last OS, Vista, received 'Service Pack 1' in 2008 after launching in mid 2007. So why would you buy new software that has not been properly tested? Why don't you use the previous versions of that software?? Which brings us to our last point.

5. Expensive: While Nero's original burning software used to be given away free with CD writers, the latest Nero 9 will cost you around US$200 (probably). Even if you had purchased Adobe Photoshop CS2 or CS3 previously, you still have to buy the new version for nearly US$1000 (probably), with some US$100 discount. WHY? Why should I buy these software monsters for a thousand bucks??

So, from next time, think twice before upgrading any software or OS. Don't just follow the crowd chasing the 'latest versions' and buy them. Apply your brain: is there really any need to buy a new software version? Can Nero 9 burn 'scratched' discs?? You know the answer: no. Then why do you buy or upgrade it, spending both your income and your Internet bandwidth??? Rather, learn to use alternatives: open source products. For example, instead of the Vista OS, use Fedora Core / Ubuntu / CentOS. These are FREE, OPEN SOURCE and FREELY UPGRADABLE OSes which solve all the problems stated above. Try 'Foxit Reader' v2.3 (current), free of cost, with a 3 MB setup file and all the features of Adobe Reader 8 and 9. Use VLC media player, another open source media player that plays nearly all types of files. Yes, it is true that VLC is updated every month, but that requires neither lots of Internet bandwidth nor hundreds or thousands of bucks, as it is free and the latest version, 0.9.8a (probably), has a setup file of less than 20 MB; it can replace iTunes, DivX and Power DVD. So apply your brain and think again before any software/OS upgrade.