Wednesday, November 05, 2014

Getting and Cleaning Data

One of the key steps in getting insights from data is obtaining the data itself and cleaning the raw data to turn it into processed data. Data resides in public websites, APIs, databases, local files and handwritten documents. Fetching this data requires tools and techniques that enterprise data analysis product teams rarely talk about. Enterprise analytics experts focus on data analysis, which is just one step in the data science process. A few tools focus exclusively on cleaning data, but the vast majority focus just on visualization.
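As a minimal sketch of what "getting" data looks like in practice, here is how a raw file might be fetched and inspected in R. The URL and filename are hypothetical placeholders, not real data sources.

```r
# Hypothetical source URL; replace with an actual public data location
url <- "https://example.com/raw-data.csv"
download.file(url, destfile = "raw-data.csv")

# Load the raw file without converting strings to factors,
# so the data can be inspected and cleaned as-is
raw <- read.csv("raw-data.csv", stringsAsFactors = FALSE)
str(raw)  # examine column types and sample values before cleaning
```

The point is that every step, from download to inspection, is written down and can be re-run, which is the foundation of repeatable cleaning.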

I am currently taking a four-week course on "Getting and Cleaning Data", taught by professors at the Bloomberg School of Public Health at Johns Hopkins University and delivered via Coursera. It is the third course in the data science track. The tool I use to get and clean data is R. You might wonder why we need a programming tool such as R when we could use Microsoft Excel. The main difference is R's flexibility and power to handle large volumes of data, and to clean and manipulate that data in an efficient, repeatable manner. There is a good post explaining the differences between Excel and R.
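To illustrate the repeatability point, here is a small base-R sketch using a made-up data frame: one line of code reproduces what would be a manual pivot-table operation in Excel, and it re-runs unchanged if the data grows to millions of rows.

```r
# Hypothetical data frame standing in for a spreadsheet of sales records
sales <- data.frame(
  region = c("East", "West", "East", "West"),
  amount = c(100, 250, 175, 90)
)

# Sum amounts by region -- the scripted equivalent of an Excel pivot table
totals <- aggregate(amount ~ region, data = sales, FUN = sum)
print(totals)
```

Unlike a sequence of clicks in a spreadsheet, this script documents exactly what was done and can be applied to new data with no extra effort.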

Going through these courses and completing the associated exercises and quizzes has given me a conceptual understanding of the tools and practical experience in getting, cleaning and manipulating data. It is as much craft as it is science. I am still working with sample data provided by the professors, but I plan to start fetching my own data in the coming months. For example, housing market data available from public sources can be analyzed to find good real estate investment opportunities and to uncover insights that may not be readily available to others.
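A sketch of what that housing analysis might look like in R, assuming a hypothetical `housing.csv` file (with `price` and `neighborhood` columns) downloaded from a public source; the column names and file are illustrative, not real data.

```r
# 'housing.csv' is a hypothetical file from a public housing-data source
housing <- read.csv("housing.csv", stringsAsFactors = FALSE)

# Cleaning step: drop records with a missing sale price
housing <- housing[!is.na(housing$price), ]

# Summarize the median price by neighborhood and list the cheapest areas first
median_by_area <- aggregate(price ~ neighborhood, data = housing, FUN = median)
head(median_by_area[order(median_by_area$price), ])
```

Even this small pipeline shows the pattern the course teaches: fetch, clean, then summarize, all in a script that can be rerun as new data is published.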