I am currently taking a four-week course, "Getting and Cleaning Data," taught by professors at the Bloomberg School of Public Health at Johns Hopkins University and delivered via Coursera. This is the third course in the data science track. The tool I use to get and clean data is 'R'. You might wonder why we need a programming tool such as 'R' when we can use Microsoft Excel. The main difference is the flexibility and power of 'R': it can handle large volumes of data, and it can clean and manipulate that data in an efficient and repeatable manner. There is a good post explaining the differences between Excel and 'R'.
Going through these courses and completing the associated exercises and quizzes has given me a conceptual understanding of the tools and practical experience in getting, cleaning and manipulating data. It is as much craft as it is science. I am still working with sample data provided by the professors, but I plan to start fetching my own data in the coming months. For example, you can analyze housing market data that is available from public sources to find good real estate investment opportunities and gain insights that may not be readily available to others.