Side-effects of Python Machine Learning
Super-useful software and packages that I’ve picked up in my Nanodegree
I’m currently studying for a Machine Learning Nanodegree with Udacity. One of the great things about it is how much exposure you get to a tonne of awesome software packages. Even as a seasoned Python coder, I found myself using libraries I’d never heard of.
This works wonders for you guys who love your porky pie Resumes.
When I was studying Chem Eng, we had to learn to use a bunch of mathematical software packages: Maple, which is a lesser-known package, and of course Matlab, which you’ve probably heard of. They’re both powerful and useful in their own right.
Other than the very domain-specific use of Simulink, there wasn’t really a reason we needed these particular packages. We probably used them because the university got a good deal on educational licenses, or something to that effect.
But why not Python?
As I say, I was already familiar with Python, but had never used it in a scientific context. I really wish I had. It’s such a wonderful language for getting quick scripts up and running, and it has a really elegant syntax. And when you combine it with mathematical packages, you get all that sexy math and can easily integrate it into a larger software project.
Over the course of my Udacity Nanodegree, I’ve been exposed to a number of really powerful packages that I genuinely believe have changed my (massively nerdy) life forever. The Python ecosystem is full of mature, stable tools that work really well.
If you’re just learning Data Science or Machine Learning, it’s great to get an idea of these packages. I got thrown all of them at once, and only with time did I learn their individual purposes and how beautifully they piece together. Hopefully this helps you understand what each component does.
Udacity recommend you use the Anaconda distribution over the standard Python distribution, and I can see why. It makes getting all your mathematical Python software up and running a breeze. Out of the box, it includes everything you’ll need to get going (i.e. the stuff I list below). Check it out.
IPython is one of those buzzwords that I never thought I’d understand or learn. I’d feel stupid if someone asked me if I knew IPython. It’s pretty simple really: it’s basically an interactive Python shell with rich output. You see the results of your Python code immediately. This is immensely useful if you want to print mathematical results in a REPL style.
The notebook side of IPython has since been spun out into Jupyter, which makes sense because it supports other languages, while IPython lives on as the Python shell underneath. Jupyter ships with an absolutely wonderful piece of software, Jupyter Notebook:
Jupyter Notebook is an IDE/editor for IPython notebooks, and it is a joy to use. It boasts quality of editing and visuals that’ll make Google Docs sniffle. It runs entirely within your web browser, and feels close to running a full-blown Python IDE. You write code in cells and run them to display the result. You can also annotate and do ‘report-writing’ à la Maple/Matlab using Markdown.
Even for small scripts to do quick tasks, I still find myself loving to develop them in Jupyter Notebook because it’s so easy to use and visualize your output, rather than the ‘save in text editor and relaunch’ procedure.
Pandas is another one that was completely alien to me (P.S. who names these libraries!?). I’ve done so much work with CSVs in Python, but all with the standard csv module. From now on I will always use Pandas instead. Pandas provides a ‘DataFrame’ modelled on R’s data frame. That’s a fancy name for a table-like data structure with powerful indexing and slicing capabilities. If that means nothing to you: think Excel spreadsheet in code. If you do
dataframe['column'] you end up with an object for that specific column and can do operations on it. More interestingly, do
dataframe[dataframe['column'] > 5] and you end up with only the rows where that column’s value is greater than 5. It’s super confusing at first, but once you get some practice you realize how awesome it all is.
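The two expressions above can be tried on a small table. This is only a sketch: the data and the 'name' column are invented for illustration.

```python
import pandas as pd

# Hypothetical data, made up purely for this example
dataframe = pd.DataFrame({
    "name": ["a", "b", "c", "d"],
    "column": [3, 8, 5, 12],
})

# Selecting a single column gives you a pandas Series
col = dataframe["column"]
col_mean = col.mean()

# The comparison builds a True/False mask, one entry per row;
# indexing with that mask keeps only the rows where it is True
big = dataframe[dataframe["column"] > 5]

print(col_mean)
print(big)
```

Note that the filtered result is still a full DataFrame: it keeps every column, just fewer rows.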
NumPy really speaks for itself. It does lots of the typical computations that mathematicians need, such as the mean or median. The core is all about array-based data structures. For example, if I have all my values in an array, I can easily do mathematical computations such as the mean.
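As a rough illustration of the array idea, here is a minimal sketch with made-up values:

```python
import numpy as np

# Hypothetical values, chosen only for illustration
values = np.array([2.0, 4.0, 6.0, 8.0])

# Summary statistics are one method call away
mean = values.mean()
median = np.median(values)

# Arithmetic applies to every element at once, no loop needed
doubled = values * 2

print(mean, median, doubled)
```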
Matplotlib is pretty straightforward too: it’s a super useful package for producing simple and complex plots of your data. Indispensable for any exploratory or evaluation task.
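A simple plot of this kind might look like the following sketch using Matplotlib. The sine curve is just placeholder data, not something from the Nanodegree.

```python
import matplotlib.pyplot as plt
import numpy as np

# Placeholder data: one period of a sine wave
x = np.linspace(0, 2 * np.pi, 100)
y = np.sin(x)

# Create a figure with a single set of axes and draw the curve
fig, ax = plt.subplots()
ax.plot(x, y, label="sin(x)")
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.legend()
```

In Jupyter Notebook the figure typically appears right under the cell. In a plain script you would call plt.show() or fig.savefig(...) to actually see it.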
scikit-learn is the core of the Machine Learning Nanodegree. It takes all the DataFrames and NumPy stuff you’ll be working on and helps you build your decision trees, support vector machines and all things ML related. The heart of much of the ML you do in the Nanodegree and in industry lies here.
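To give a feel for the workflow, here is a minimal sketch that trains a decision tree with scikit-learn. The iris dataset, the 25% test split and the random seeds are my own choices for illustration, not anything prescribed by the Nanodegree.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Small built-in dataset: features X and class labels y as NumPy arrays
X, y = load_iris(return_X_y=True)

# Hold back a quarter of the data to check the model on unseen examples
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Fit a decision tree on the training portion
clf = DecisionTreeClassifier(random_state=0)
clf.fit(X_train, y_train)

# Fraction of test examples classified correctly
accuracy = clf.score(X_test, y_test)
print(accuracy)
```

Swapping in a different model, for example sklearn.svm.SVC, changes only the line that creates clf, since the fit and score calls stay the same.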
- Keras: An awesome high-level wrapper around packages like TensorFlow.
- statsmodels: A package to do typical statistical analyses.
- Seaborn: A library built on Matplotlib that helps you build good-looking plots, fast.
It all links so well together
All these libraries are a testament to how well open source can work. They all work together, and combine to form, in my mind, the best mathematical package available, with all the typical powers of Python one would expect.
And all for free. My Udacity Nanodegree helped me learn all of these packages and become pretty comfortable with them. In developing nations, where Chemical Engineering students can only dream of funds for Matlab or Maple, it’s amazing to think how far Python has come, and I urge greater adoption. Scratch that, I want to see Python as the main math computation language used in universities in the UK too.