Tuesday, December 20, 2011

Installing R/Rattle

The halo around R continues to grow, and more organizations are now beginning to explore building capabilities in R programming, as it can potentially deliver cost savings. More on the comparison of R and SAS in our earlier blog entry.

In this post we will take you through the installation of R and Rattle on a Windows 7 machine. Here is a YouTube video showing the capabilities of R on a small credit scoring dataset.
  1. Download R from the website. The link is for the Windows installation; the setup file is the same for 32 bit and 64 bit systems, so you need not worry about which one to pick.
  2. The setup file is an executable. Simply run it and follow the instructions; it should install the basic R software on your system.
  3. An icon should be created on your desktop. On 64 bit systems two icons get created (one for 32 bit R, the other for 64 bit R). If you have a 64 bit system, double click on the Rx64 2.XX icon, where XX is the version number.
  4. The R software window should open up. Type in the following commands one after the other, pressing Enter after each statement: install.packages("RGtk2") and then install.packages("rattle"). After the first command, a window will open up asking for a CRAN mirror to be selected, as below. You can select any CRAN mirror to download the packages from (to be safe, select any US or western Europe mirror, try and avoid Iranian mirrors!).
  5. Now run the following commands: library(rattle) followed by rattle()
  6. This is where most errors regarding Rattle installation pop up. In a lot of cases R will throw an error such as "GTK not found" or an error with GTK+, and it will offer to download GTK for you. But even that option will not work after the download. Fear not, follow the instructions below to resolve it. If your Rattle window launches, congratulations, it's working.
  7. For those with GTK problems, follow the bullet point steps below.
  • 32 Bit systems open this link, 64 bit systems open this link.
  • On the page, scroll down to the GTK+ packages and select GTK+ Version 2.24.8 (32 bit Runtime) for 32 bit systems, or GTK+ Version 2.22.1 (64 bit Binaries) for 64 bit systems.
  • Copy the download to the root of the C drive and extract the ZIP file as it is. For example, I created the folder C:\gtk+_2.22.1-1_win64
  • Now right click on My Computer and then click on Properties (alternatively, go via Control Panel > System & Security > System). A new window will open up; on the left hand side, click on "Advanced system settings".
  • A new window as below will open up

  • Click on Environment Variables near the bottom. A new window will again pop up; within the System variables section, scroll down to Path and click on Edit.
  • An "Edit System Variable" window will open up with the variable name "Path". Within the variable value you will see a number of folder paths separated by semicolons.
  • Go to the beginning of the variable value and add the path to the bin folder inside the GTK folder we extracted, for example C:\gtk+_2.22.1-1_win64\bin, followed by a semicolon. (Note: make sure this path actually exists, i.e. that the bin folder is really inside the folder you extracted into.)
  • Close all the windows and restart the R software.
  • Type in library(rattle), press Enter, and then type rattle()
  • The Rattle window should now open up, and you are ready to shake, rattle and roll your data. Install all the packages which Rattle prompts you to; this is done automatically after you press OK. Check out our Rattle demonstration post for a flavor of what Rattle can do.
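
The steps above can also be sketched from inside the R console. This is a minimal sketch under a few assumptions: the folder is the example path from this post, the helper names prepend_to_path and launch_rattle are made up for illustration, and the mirror URL is just one possible choice. Changing PATH this way only lasts for the current R session, so it is a quick way to test the GTK folder before editing the system variable; the Control Panel route above is still needed to make it permanent.

```r
# Example folder from this post; change it to wherever you extracted GTK+.
gtk_bin <- "C:\\gtk+_2.22.1-1_win64\\bin"

# Hypothetical helper: put a folder at the front of PATH for this R session only.
# Returns the previous PATH invisibly so it can be restored later.
prepend_to_path <- function(dir) {
  old_path <- Sys.getenv("PATH")
  Sys.setenv(PATH = paste(dir, old_path, sep = .Platform$path.sep))
  invisible(old_path)
}

# Hypothetical helper: install whatever is missing, then load and launch Rattle.
# The mirror URL is an assumption; any CRAN mirror can be substituted.
launch_rattle <- function(repos = "https://cloud.r-project.org") {
  for (pkg in c("RGtk2", "rattle")) {
    if (!requireNamespace(pkg, quietly = TRUE)) {
      install.packages(pkg, repos = repos)
    }
  }
  library(rattle)
  rattle()
}

if (file.exists(gtk_bin)) {
  prepend_to_path(gtk_bin)
  launch_rattle()
} else {
  message("GTK+ bin folder not found, check the path: ", gtk_bin)
}
```

If the folder check fails, nothing is installed or launched; fix the path first and run the last block again.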
 

Do let us know if the post was helpful in solving your Rattle installation issues, especially the pesky GTK/RGTK2 error. Feel free to comment even if you still face installation issues, we will try and solve them!

LearnAnalytics Team.

Sunday, December 18, 2011

How to enter the Analytics Industry?

We have been in the field of Analytics training for over 4 years now (current and previous organization) and have personally trained over 1000 students, including both retail and corporate clients, in SAS programming as well as advanced analytics.

One of the most repeated queries I field from my students is "How do I get in?" or "How do I convince an Analytics company to hire me?" or "I have 10 years experience in so-and-so industry, how do I make a switch to Analytics?". If only I had a rupee for every time I was asked this question, I guess I could have retired by now! (or maybe 10 Rs/question).

Well, there is no single answer or approach to entering the industry. Off-campus freshers face a tortuous task in breaking in: to get hired you need to have experience in Analytics, and to get experience in Analytics you need to be already hired somewhere. It's an age old challenge.

Back in the day, all technical trades were controlled by guilds (e.g. carpentry, masonry or blacksmithing), which acted both as facilitators to the chosen and as entry barriers to the upstarts. To enter the field, a young man (or person) would have to grovel before an established artisan to get an apprenticeship in return for lodging and food. This was generally unpaid labor, but in the bargain the apprentice gained experience and the artisan free labor. After a few years the apprentice would be granted membership of the guild and be free to set up on his own.

Jump to the 21st century and transplant this to Analytics: how do YOU break in? The challenge, a thousand years after the establishment of guilds, remains the same: to be hired you need experience under your belt, and to be experienced you need to get hired.

A few clarifications for people trying to break in

  • First off, there is no formal qualification or degree required to be an Analytics professional. (You don't need fancy Maths/Stats/Engg degrees; I have seen arts graduates become Subject Matter Experts in collections analytics.)
  • Secondly, there is no age limit; 40 year olds have made the jump and done extremely well.
  • It's a job seeker's market, provided you have reached the magic figure of 12 months experience. Experience in analytics is King today; people who have experience can literally dictate hiring terms.

But how to get that initial experience? Therein lies the heartbreak, though there is one way, just like it was centuries ago: apprenticeships (we call them internships now). You have to convince a company, any analytics company, to hire you either at a very low salary or even at no salary in the beginning. Face it, you need them more than they need you at this stage. Anything to get that valuable CV line about experience in. This is typically a call which freshers just out of college are able to take easily, but those coming with previous industry experience will find it difficult to make the jump. Whether to make this jump or not is a decision you have to take.

I have seen 30 year olds leave stable jobs to start in Analytics at salaries of Rs 12,000/month. A year later they are already at their previous levels. A 32 year old housewife who took up a SAS programming course was offered a Rs 10,000/month contract for 3 months at a small analytics company; she is now a middle level manager in the analytics arm of a major MNC. The pattern is evident: do whatever it takes to get working once you have acquired a few skills, whether through training courses or self study.

Analytics companies do not care about your educational qualification or formal background. If you have prior analytics experience, you are a rare commodity and you will be snapped up.

That said, what are the skills that one needs to even get a foot in the door? I have one word for that: "SAS". SAS programming jobs are probably one of the easiest ways to get a foothold in the industry today. SAS certifications (the exam costs some USD 200, quite cheap for the benefits it provides) on your CV can act as a substitute for SAS programming experience. Companies will treat you as a known commodity if you have cleared the certification exams, which will increase your chances of being shortlisted. (I will insert a disclaimer here: since our organization specializes in SAS training for certification, this opinion piece may be considered biased to convince the reader to enroll for SAS training. That is not the objective; I am merely stating an observation.)

Secondly, there are SAS jobs and then there are SAS Analytics jobs. One mistake that people can make is to take an initially higher paying pure SAS programming job over a SAS Analytics profile. Candidates need to be very aware of the nature of the work they are getting into; a company which offers work in predictive modeling or data mining using SAS should always be preferred over a pure SAS programming job. A mistake here could mean a career of reporting and ad hoc requests versus the definitely more glamorous side of predictive modeling. Beg, borrow, steal, kill or even pay, but get experience of predictive modeling under your belt. It is a small difference in the beginning, but over a 30 year career it can mean totally divergent paths.

A note here: Typically startups are more likely to hire people based on attitude/aptitude rather than a CV. They provide the best opportunities for people who really want to break in.

In sum :


  • Study a lot. It takes time to master any new technical skill; reaching an employable skill level in SAS and basic analytics will typically require up to 500 hours of training, practice and self study. (Target 4 hours a day over a period of 3-4 months.)
  • Be prepared to spend time in the trenches. You have to be mentally ready to take a salary cut, maybe a huge one, to get that elusive initial experience. (Target internships and startups here, and aim to have 8-12 months experience under your belt before you start looking around.)
  • Make intelligent choices in your career; even a 1 week project in predictive modeling using, say, regression could make all the difference. (Beware of pure SAS programming jobs, they may only be on the reporting side. Keep trying to gain experience in data modeling projects.)
  • Every big MNC has an analytics arm. If you are already working in such a company, pull all strings to get into an allied project which you can leverage for experience, or to get an internal transfer. (I know a guy who was in the BPO arm of a major MNC; he bugged his reporting chain for 1.5 years before they finally relented. Today he is travelling all over the world as an SCM analytics consultant!)
  • Those who play it like they have nothing to lose are the ones who win big. Bring your attitude with you.

Do let us know if you found this post helpful. For any queries regarding analytics careers or analytics training, drop in a mail at info@learnanalytics.in or check out our website Learn Analytics.
We are interested in learning how you made the jump into the analytics industry; drop us a note in the comments section for the other readers.

Introduction to Rattle - A Simple Video on Credit Scoring



Today, we are going to introduce a very powerful data mining tool called Rattle. An interesting feature of Rattle is that it is a GUI which sits on top of R. What this means is that it gives users a point and click interface to build data mining projects, predictive models etc. without writing a single line of R code.

In the featured video we have built various predictive models on a credit scoring dataset and compared their performance against each other using ROC curves. The models built are:

Decision Trees
Random Forests
Adaptive Boosting
Support Vector Machines
Logistic Regression
Neural Networks
This was done without writing any R code (except to launch Rattle). The total video length is about 17 minutes; it takes you through data import in Rattle, variable exploration, model building and model evaluation using ROC curves.
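
For readers curious about what an ROC comparison boils down to outside the GUI, here is a minimal sketch in base R. It is not the code Rattle generates and it does not use the dataset from the video: the data is simulated, the column names are made up, and the auc helper is a hand-written area-under-the-ROC-curve calculation using the rank-sum formula. Two logistic regressions stand in for the longer list of models above.

```r
set.seed(42)

# Simulated stand-in for a credit scoring dataset (hypothetical columns).
n <- 1000
credit <- data.frame(
  income        = rnorm(n, mean = 50, sd = 15),
  debt          = rnorm(n, mean = 20, sd = 8),
  late_payments = rpois(n, lambda = 1)
)
log_odds <- -1 - 0.04 * credit$income + 0.08 * credit$debt +
  0.6 * credit$late_payments
credit$default <- rbinom(n, size = 1, prob = plogis(log_odds))

# Simple train/test split: first 700 rows to fit, last 300 to evaluate.
train <- credit[1:700, ]
test  <- credit[701:1000, ]

# Area under the ROC curve via the rank-sum (Mann-Whitney) formula.
auc <- function(score, label) {
  r     <- rank(score)
  n_pos <- sum(label == 1)
  n_neg <- sum(label == 0)
  (sum(r[label == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Two candidate models: all predictors versus income only.
full_model  <- glm(default ~ income + debt + late_payments,
                   data = train, family = binomial)
small_model <- glm(default ~ income, data = train, family = binomial)

auc_full  <- auc(predict(full_model,  newdata = test, type = "response"),
                 test$default)
auc_small <- auc(predict(small_model, newdata = test, type = "response"),
                 test$default)
print(c(full = auc_full, small = auc_small))
```

An AUC of 0.5 means the model ranks defaulters no better than chance and 1 means a perfect ranking, so the model with the higher AUC on the test rows is the one whose ROC curve encloses the larger area.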

This video is for people from an advanced analytics background, as we have not explained much of the methodology behind the techniques, merely how to do it in Rattle. Those of you who can understand the methodology and are not working in the analytics industry should immediately jump ship, greener pastures are awaiting. (Seriously, if you understand even 40% of this, you cannot be unemployed!)

For those who want to understand and learn the stuff shown in the video, check out our website www.learnanalytics.in. We specialize in Analytics training for students worldwide, and provide SAS, R and Advanced Analytics trainings.

For doubts/queries, batch timings, drop in a mail to info@learnanalytics.in .



Click here to download R
Click here to download Rattle
Click here to download the dataset discussed in the video


To install Rattle, simply follow the instructions on the website linked above. If you have problems installing, drop us a mail and we will be glad to help you out. We will be following up with a detailed post on R and Rattle installation with troubleshooting.

Drop in comments to give us feedback!!

Learn Analytics Team

Friday, December 16, 2011

R vs SAS (Comparison and Opinion)

1. Background

PC or Mac, Windows or Linux, Intel or AMD, we geeks simply love comparing things. This particular comparison, although not known in popular culture, is an oft repeated argument in the Analytics industry.

SAS needs no introduction; those who need one can check out the Wikipedia article as well as the LearnAnalytics SAS training section.

R, or rather the R statistical package, is very simply put the open source equivalent of SAS. For what it’s worth, R can do pretty much everything SAS can do in terms of statistical analysis, and there are some pretty cool things R can do which SAS can’t. Say you want to build a predictive model using logistic regression: well, R can do it. ARIMA models, yes; decision trees, yes; association rule mining, yes; and so on.

Anything you envisage using SAS STAT for statistical analysis and data mining, R can do it.
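
To give a flavor, here is a minimal sketch of two of the techniques just mentioned, using only functions and example datasets that ship with a standard R installation (glm and arima from the stats package, mtcars and lh from the datasets package). The choice of datasets and model orders is purely for illustration.

```r
# Logistic regression on the built-in mtcars data:
# probability of a manual gearbox (am = 1) given car weight (wt).
logit_fit <- glm(am ~ wt, data = mtcars, family = binomial)
print(coef(logit_fit))

# ARIMA(1,0,0) on the built-in lh hormone time series,
# followed by a 3-step-ahead forecast.
arima_fit <- arima(lh, order = c(1, 0, 0))
forecast3 <- predict(arima_fit, n.ahead = 3)$pred
print(forecast3)
```

Decision trees and association rules live in add-on packages rather than in the base installation, so they are left out of this sketch.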

What makes R Special?

So what if R can do everything SAS can? There are others, like SPSS, Statistica and so on, which can also do pretty much what SAS can do.

Yes, but is the other software free?

Therein lies the crux of the whole argument: R is free. It’s an open source project, initially started in New Zealand, and is now considered one of the best statistical analysis tools in the world.

What’s the argument, isn't R always better?

It’s not that simple. Linux can do everything Windows can and more, but Windows still dominates. One of the biggest reasons for continued Windows dominance is momentum and an easier user experience. In spite of all the advantages Linux offers (better security, no viruses, comparable user experience especially in the Ubuntu variants), the common man still prefers Windows. That is not to say Linux doesn’t have its die hard following and a vibrant support community.

The same goes for R. I have used both SAS and R extensively and am going to discuss the pros and cons of both packages below.

2. Statistical Capability

SAS Stat and other SAS packages pack a powerful punch and cover almost the whole gamut of statistical analysis and techniques. However, since R is open source and people can submit their own packages/libraries, the latest cutting edge techniques are invariably released in R first. To date R has got almost 15,000 packages in the CRAN repository (Comprehensive R Archive Network, the network of sites which hosts R and its contributed packages).

Some of the latest techniques such as GLMNET, Random Forests and AdaBoost are available for use in R and not in SAS. Many experimental packages are also available in R. In fact, in most Kaggle competitions (which require a blog post of their own), the winners (who are amongst the world’s best data miners) have almost invariably used R to build their models.

In this aspect R is the hands down winner. However, a word does need to be put in for SAS: since SAS is paid software with support, any new innovation or statistical technique has to be vetted and accepted. SAS is used in many mission critical assignments where merely experimental techniques cannot be allowed to creep in. While this is necessary for the environment SAS works in, it also means that SAS will keep playing catch-up with R in terms of the latest innovations. On the other hand, since anybody can upload a package in R, user beware!

Therefore in terms of pure statistical capabilities, I rate R higher.

3. Data Handling

Data handling is the bugbear of R. The single largest drawback of R is the way it allocates and handles memory, by trying to load the whole dataset into memory. This can cause severe problems when working on a combination of large datasets and small computers (which it always is, your data is always huge and your computer is always puny!).

SAS excels in handling large datasets. In fact, server editions of SAS can chew through terabytes of data without any issues, whereas R is very likely to throw out of memory errors or become unresponsive and die.

That is not to say that R cannot handle big data, it can. But say I have a laptop with 2 GB of RAM and a dataset running into millions of records: for the same exercise which SAS can do in 30 seconds, R might take up to a few minutes or even die.

However computing power is cheap and getting cheaper by the day, given enough RAM and computing power, R can also crunch through large datasets efficiently, especially on 64 bit machines.
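
A quick way to get a feel for this is to measure how much RAM a small data frame takes and scale the figure up. The sketch below is a back-of-envelope estimate on made-up numeric columns; the column names and the estimate_mb helper are hypothetical, and real data with text or factor columns will behave differently. It also ignores the extra working copies R tends to make during modeling, so treat the result as a lower bound.

```r
set.seed(1)

# Build a small sample table and measure its in-memory size.
n_rows <- 1e5
sample_df <- data.frame(
  id      = seq_len(n_rows),                                  # integer column
  balance = runif(n_rows),                                    # double column
  flag    = sample(c(TRUE, FALSE), n_rows, replace = TRUE)    # logical column
)
size_bytes    <- as.numeric(object.size(sample_df))
bytes_per_row <- size_bytes / n_rows

# Hypothetical helper: scale the per-row cost up to a larger row count (in MB).
estimate_mb <- function(rows) rows * bytes_per_row / 1024^2

print(format(object.size(sample_df), units = "MB"))
print(estimate_mb(5e6))  # rough footprint of 5 million similar rows
```

If the estimate for the full dataset is anywhere near the RAM on the machine, that is the point where R is likely to struggle.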

But for now in terms of Data handling, I rate SAS higher.

4. Ease of Use

One of the biggest reasons Linux has never been the runaway success as compared to Windows is that it was so damn difficult to use, install or troubleshoot. Now take that problem and multiply by 10, and you get the idea of R. There is no easy way to put it, but R is not for the faint of heart. It is damn difficult to learn as compared to SAS.

SAS programming syntax can be considered as a high level language which is intuitive and easy to learn, additionally it was designed as a DML (Data Manipulation Language). On the other hand R programming is a monster.

For example, consider that you have to do a simple data manipulation task such as sorting a few tables and joining them together. It would be a piece of cake to do this in SQL (any SQL package or even PROC SQL) or in any of the SAS data steps. Now consider doing this in C++ (makes your blood run cold, doesn’t it?).

If SAS programming is high level, more akin to SQL, then R is a low level language closer to C++. Even simple tasks can mean writing lengthy pieces of obfuscated code.
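
For reference, here is a minimal base R sketch of the sort-and-join task on two made-up tables; the table names, column names and values are all hypothetical. Note the df[order(...), ] indexing idiom used for sorting, which is the kind of syntax that newcomers from SQL or SAS tend to trip over.

```r
# Two small made-up tables sharing a cust_id key.
customers <- data.frame(
  cust_id = c(3, 1, 2),
  name    = c("Chitra", "Amit", "Bela"),
  stringsAsFactors = FALSE
)
loans <- data.frame(
  cust_id = c(2, 3, 3, 4),
  amount  = c(500, 1200, 300, 900)
)

# Sort each table: customers by id, loans by id and then by amount descending.
customers_sorted <- customers[order(customers$cust_id), ]
loans_sorted     <- loans[order(loans$cust_id, -loans$amount), ]

# Inner join on the shared key (only ids present in both tables survive).
joined <- merge(customers_sorted, loans_sorted, by = "cust_id")
print(joined)
```

merge() does an inner join by default; its all.x and all.y arguments switch it to left, right or full outer joins.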

Learning R is definitely more challenging than SAS, but since R is a true programming language it gives more flexibility and power than SAS to the programmer. But for mere mortals like the rest of us, we would prefer to use the SAS programming language.

Support for R is another issue; obscure error messages can literally suck the life blood out of somebody who is fairly new to R. There are support groups and forums on the internet, but if you are using a new package and it throws an error, you are on your own.

All in all, for true programmers R is closer to the heart but for the rest of us, who just want to get our work done, SAS is the winner by a mile in terms of ease of use.

5. Recommendation

I have used both R and SAS, and there is no straightforward answer to this. For example, since R is free, it should technically be cheaper to use, shouldn’t it? Well, the answer is: not always.

The TCO (Total Cost of Ownership) of using R might actually be higher than that of SAS. For example, say an Analytics company decides to use R exclusively, figuring that since they don’t have to pay for SAS licenses, their cost of project delivery will go down: better profit margins, lower billing to the client, better competitiveness in the market. Win-win, right?

Except now they have to train their consultants on R, or hire outside talent. R programmers are in short supply, so this drives up your cost of resources for one. Now take into account the learning curve and the deployment cost as well as code migrations of client legacy systems, not to mention the obscure tantrums that R can throw at you. And you can’t call anyone for support, since there is none; it’s free software. At least if SAS doesn’t work, you can hold them by their throats. (For the kind of licensing fee they demand, it’d better work!)

On the other hand, say you have a startup: a small team of really smart people. Investing in a SAS license may not make sense at this point, so they will simply use what I call the RUM stack (R-Ubuntu-MySQL); it’s a pun on the LAMP stack.

That is, use MySQL for heavy data manipulation and use R only for statistical analysis, on a machine running Ubuntu Linux. Everything for free! While this solution may work for a small company with high calibre programmers, it is not scalable for a 25,000 man consulting organization which is run by processes and adherence, not individual brilliance.

My choice: if you are small and hungry, go for R. If you are a big organization where budget is not an issue, close your eyes and buy SAS licenses, everybody will be happy (but install R on your laptop nonetheless).

Disclaimer: The author has a dual boot machine with Win 7 and Ubuntu but almost never uses Ubuntu, and he has R installed on both but uses Rattle.

For more information on Analytics Training (on SAS, R and Data Mining), check out our website www.learnanalytics.in