Resources to Learn Data Science

By Martin Goycoolea, Thu 20 October 2016, in category Resources

data science, education, machine learning, statistics

Data science is by definition (or lack thereof) a domain composed of several disciplines. The subject is so multidisciplinary that I sometimes find myself lost in the abundance of resources. This page is a collection of sites, information, courses, blogs, etc. that have helped me improve on different aspects of my data journey. They were helpful to me and I hope they are helpful to you too.

I'm not affiliated with any of these sites, but I want to give credit where credit is due. I wouldn't know as much as I know now if it weren't for the people who took the time to write / explain / teach about the subjects that we care about.

This list is a work in progress and will be constantly updated.

Computer Science

Python

Learn Python the Hard Way: Great textbook resource for beginners. Goes through most of the basics (Control Flow, Data Structures, Functions, Object-Oriented Programming, Scripts) very clearly. Exercises are thorough, but sometimes tedious.

Codewars: Other than having a great name, this site allows you to practice problem solving with fun challenges. Useful if you are a beginner, and fun if you want to kill some time while doing something productive.

Pandas Documentation: The official documentation for the pandas library. Dense with information, but I would recommend that everyone who uses this package read through it. Having an idea of what tools are available is a great advantage when working on new problems.

Seaborn Documentation: My favourite plotting library. Nice default plotting styles, thorough documentation and examples, and an overwhelming variety of easy-to-access plot styles, from histograms to swarm plots. I use this if possible, especially for exploratory analysis.

R language

I haven't learned much R yet. It's a popular language and is definitely on my list of things to do. A lot of content, toolboxes, and modules are written exclusively for R. Even if you prefer one language over the other, it's still a good idea to be conversant in the other: you might need to work in a team that uses R, or you might want to understand some research someone else wrote.

Scala

Scala seems to be a language people are talking about a lot. Its functional programming capabilities and its reliance on the Java Virtual Machine (JVM) make Scala a good language for data engineering.

Coursera's Functional Programming in Scala Course: I just enrolled. Apparently this course is extremely well taught and has very good reviews. Will update once I finish.

Machine Learning

Andrew Ng's Coursera Course on Machine Learning: Great course, accessible for everyone, yet it goes through most algorithms. Most importantly, it teaches you how to think about Machine Learning: prioritizing work, debugging, diagnostics, etc. This guy is the ancient master of the subject. I can't be thankful enough to him.

Kaggle.com: Great site that hosts Machine Learning competitions. Also hosts many data sets, and has a vibrant community of people interested in data science.

scikit-learn Site: One of the most widely used Python packages for Machine Learning. The tutorial is very good, and they do a great job of explaining simple algorithms and diagnostics to beginners.

Statistics

Good ol' Stats

MITx Stochastic Systems Class: Great course on statistics and how to apply them. The language of instruction is Python, so it helps you get up to speed with the language as well. Topics include Probability, Conditional Probability, Stochastic Processes, and Random Variables. I'm only halfway through the course, but revisiting these topics with more of a purpose has been more rewarding and fun than when I took the class when I was younger.

Experimentation - A/B Testing

A subdomain that requires a focus of its own. Building experiments is a very rigorous procedure, and most of it should be second nature to the successful data scientist.

Google's Course on A/B Testing - Udacity: Great course, very accessible and slowly paced. I enjoyed watching the instructors discuss the caveats of setting up experiments and tell their stories. The course emphasises the importance of creativity in defining metrics, and the difficulty of choosing which experiments best measure a specific result. Overall, sensible guidelines, easy-to-follow exercises, and real-life techniques.
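To make the idea of comparing a metric between two groups a bit more concrete, here is a minimal sketch of a two-proportion z-test in plain Python. This is my own illustration and is not taken from the course: the function name and the visitor and conversion counts are hypothetical, and it assumes samples large enough for the normal approximation to hold.

```python
import math


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two-sided p-value) for H0: both groups share one conversion rate.

    Hypothetical helper for illustration; relies on the normal approximation.
    """
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled rate under the null hypothesis that A and B convert equally.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    # Standard error of the difference in proportions under the null.
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal: 2 * (1 - Phi(|z|)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Made-up counts: 1000 visitors per group, 100 vs. 150 conversions.
z, p = two_proportion_z_test(100, 1000, 150, 1000)
print(f"z = {z:.2f}, p = {p:.4f}")
```

A small p-value only says the difference is unlikely under the null hypothesis. Whether the metric was the right one to measure is still the harder, more creative question.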

Dealing with writing and text

Sounds like a joke, right? Why is text editing even mentioned here? Well, it so happens that as a data scientist you will spend a significant amount of time programming (not to the extent a dev would, but still). Thus your working environment is an important aspect to consider to make your life better and more efficient. I also found that through learning about text editing, I learned a lot more about code documentation, plugins, and the command line. I also type faster now and am more efficient and more thoughtful about how I structure my code.

I opted to use vim after trying both emacs and vim. Both are good, but I like the ubiquity of vim, and how simple it can be at times. Emacs is definitely more customizable / powerful, but I don't need that now (although sometimes I use Org-Mode on emacs).

Vim

Vim Tutorial: Do the tutorial on vim itself. It's really all you need to get started.

Vim Adventures: Only did the free trial, but it was so much fun to play and the idea is great. If you have disposable cash and time and want to play a game, this might be a good investment.

I tried to set up everything on my computer to use vim so I would get used to the commands and the modal paradigm more easily. The first thing I downloaded was VimFx for Firefox (there is a similar extension for Chrome, I believe). Not only did it help improve my vim fluency, but my browsing is much more efficient now: I almost never reach for the mouse and I don't get distracted.

Vim plugin for Jupyter Notebook: Similar to VimFx, this allows you to navigate and edit text in your notebook using vim's commands. It does conflict with VimFx, so make sure you configure an easy way to disable VimFx while on the notebook.

Word of caution: I found vim frustrating in the beginning, and addictive after a while. Don't overthink it. You will become better at text editing and coding. Try not to obsess over it; in the end it's just a tool.

Blogging

I recommend starting a blog if you want to share your research. Having a blog motivates me. Also, having your work subject to public scrutiny improves your writing in several ways:

  • Increases your chance to get feedback
  • Makes you more meticulous and detail-oriented
  • Displays your ideas to a wider audience and may generate conversation

In the end, blogging emulates what real work is like: you must present clear, coherent, and reproducible results.

DataQuest's Blog Setup: DataQuest does a fabulous job guiding you through setting up your blog. They use Pelican, a Python-based static site generator. I'll let them explain the process, but it's fairly straightforward (fairly, because I struggled a few hours with template details and getting used to the workflow). I followed this method to set up my blog.

List of curated Data Science Blogs: Reading other blogs will help you see and understand how other people work and present research. Reading content also gives you ideas on what to research / build, and who knows, you might even start a dialogue with another researcher. I aspire to one day be on that list.

General use of Medium: Medium has great articles on data science, and many other topics. Sometimes I will post my content there. The coolest aspect of Medium is the engaged community.