# The Data Science Blog

Blog of Gary Short

One of the things that makes data science hard is that it has a foundation in statistics, and one of the things that makes statistics hard is that it can be counter-intuitive. A great illustration of that is the Monty Hall Problem. Monty Hall was a US game show host who presented a show called “Let’s Make a Deal”, and the Monty Hall Problem is modelled on that show. It goes something like this:

As a contestant, you are presented with three doors. Behind one of the doors is a car and behind the other two there are goats. You are asked to pick a door. Monty will then open one of the other doors, revealing a goat and will then ask you if you want to swap your pick to the remaining closed door. The dilemma is, should you stick or swap, and does it really matter?

When faced with this dilemma, most people get it wrong, mainly because the correct answer runs counter to intuition. Intuition runs something like this:

There are three doors, behind one of the doors there is a car. Each door is equally likely, therefore there is a 1/3 chance of guessing correctly at this stage.

Monty then opens one of the remaining doors, showing you a goat. There are now two doors left, one hiding a car, the other a goat. Each is equally likely, so the chance of guessing correctly at this stage is 50:50. Since the odds are equal, there is neither benefit nor harm in switching, and so it doesn’t matter whether you switch or not.

That’s the intuitive thinking, and it’s wrong. In fact if you switch, you will win twice as often as you lose. What! I hear you say. (Literally, I can hear you say that). Yeah I know, it’s hard to believe isn’t it? See, I told you it was counter-intuitive. So, let me prove it to you.

Firstly, let’s state some assumptions. Sometimes these assumptions are left implied when the problem is stated, but we’ll make them explicit here for the purposes of clarity. Our assumptions are:

1. One door has a car behind it.
2. The other two doors have goats behind them.
3. The contestant doesn’t know what is behind each door.
4. Monty knows where the car is.
5. The contestant wants to win the car not the goat. (Seems obvious but, you know…)
6. Monty must reveal a goat.
7. If Monty has a choice of which door to open, he picks with equal likelihood.

Now let’s have a concrete example to work through. Let’s say you are the contestant and you pick door 1, then Monty shows you a goat behind door 2. The question now is should you swap to door 3 or stick with door 1, and I say you’ll win twice as often as you’ll lose if you swap to door 3.

Let’s use a tree diagram to work through this example, as we have a lot of information to process:

There are a couple of variables we have to condition on: where the car is, and which door Monty shows us. It’s that second condition that intuition ignores; remember, Monty *must* show us a goat. So looking at the diagram we can see we choose door 1 and the car can be behind doors 1, 2 or 3 with equal probability; so, 1/3, 1/3, 1/3.

Next, let’s condition on the door Monty shows us. So if we pick 1 and the car is behind 1, he can show us doors 2 or 3, with equal likelihood, so we’ll label them 1/2, 1/2. If we pick 1 and the car is behind door 2, Monty has no choice but to show us door 3, so we’ll label that 1, and finally, if we pick 1 and the car is behind 3 then Monty must show us door 2, so again, we’ll label that 1.

Now, we said in our example that Monty shows us door 2, so we must be on either the top branch or the bottom branch (circled in red). To work out the probabilities we just multiply along the branches, so the probability of the top branch is 1/3 × 1/2 = 1/6 and on the bottom branch it’s 1/3 × 1 = 1/3. Having done that, we must re-normalise so that the probabilities add to 1. The two branches sum to 1/6 + 1/3 = 1/2, so we divide each by 1/2 (that is, multiply each by 2), giving us 1/3 and 2/3, making 1 in total.

So now, if we just follow the branches along, we see that if we pick door 1 and Monty shows us door 2, there is a 2/3 probability that the car is behind door 3 and only a 1/3 probability that it is behind door 1. So we should swap, and if we do so, we’ll win twice as often as we lose.
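For readers who prefer symbols to trees, the same calculation can be written out with Bayes’ theorem. This is only a restatement of the branches described above: the symbols $C_i$ (the car is behind door $i$) and $M_2$ (Monty opens door 2) are notation introduced here for illustration, and we assume throughout that you picked door 1.

$$
P(C_3 \mid M_2) = \frac{P(M_2 \mid C_3)\,P(C_3)}{\sum_{i=1}^{3} P(M_2 \mid C_i)\,P(C_i)} = \frac{1 \cdot \tfrac{1}{3}}{\tfrac{1}{2} \cdot \tfrac{1}{3} + 0 \cdot \tfrac{1}{3} + 1 \cdot \tfrac{1}{3}} = \frac{1/3}{1/2} = \frac{2}{3}
$$

$$
P(C_1 \mid M_2) = \frac{\tfrac{1}{2} \cdot \tfrac{1}{3}}{\tfrac{1}{2}} = \frac{1/6}{1/2} = \frac{1}{3}
$$

The denominator, 1/2, is the total probability that Monty opens door 2 at all, which is exactly the re-normalising step in the tree.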

The good thing about living in the age of computers is that we now have the number crunching abilities to prove this kind of thing by brute force. Below is some code to run this simulation 1,000,000 times and then state the percentage of winners:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

namespace ConsoleApplication2
{
    // Define a door
    internal class Door
    {
        public bool HasCar { get; set; }
        public int Number { get; set; }
    }

    public class Program
    {
        public static void Main()
        {
            // Create a tally of winners and losers
            List<int> tally = new List<int>();

            // We'll need some random values
            var rand = new Random(DateTime.Now.Millisecond);

            // Run our simulation 1,000,000 times
            for (int i = 0; i < 1000000; i++)
            {
                // First create three numbered doors
                List<Door> doors = new List<Door>();
                for (int j = 1; j < 4; j++)
                {
                    doors.Add(new Door { Number = j });
                }

                // Randomly assign one a car
                doors[rand.Next(0, 3)].HasCar = true;

                // Next the contestant picks one
                Door contestantChoice = doors[rand.Next(0, 3)];

                // Then Monty shows a goat door. Find returns the first match
                // rather than a random one, which doesn't change how often
                // swapping wins overall.
                Door montyChoice = doors.Find(x =>
                    x != contestantChoice && !x.HasCar);

                // Then the contestant swaps
                contestantChoice = doors.Find(x =>
                    x != contestantChoice && x != montyChoice);

                // Record a 1 for a win and a 0 for a loss
                tally.Add(contestantChoice.HasCar ? 1 : 0);
            }

            // State winners as a percentage
            Console.WriteLine(tally.Count(x =>
                x == 1) / (float)tally.Count() * 100);
        }
    }
}
```

When I run this code on my machine (YMMV), the percentage of winners it prints is pretty much bang on the 2/3 (about 66.7%) we predicted.
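The simulation measures how often swapping wins overall. As a cross-check on the specific example (you pick door 1, Monty opens door 2), here is a minimal sketch that enumerates the branches of the tree exactly, with no randomness. The class `MontyTree` and the method `BranchWeightInSixths` are names made up for this illustration, and the weights are counted in sixths purely to keep the arithmetic in whole numbers.

```csharp
using System;

// Illustrative sketch only: exact enumeration of the tree for one scenario.
public static class MontyTree
{
    // Weight, in sixths, of the branch "contestant picked `pick`, the car is
    // behind `car`, Monty opens `opened`". Each car position starts at 1/3 = 2/6.
    public static int BranchWeightInSixths(int pick, int car, int opened)
    {
        // Monty never opens the contestant's door or the car's door.
        if (opened == pick || opened == car) return 0;

        // If the contestant holds the car, Monty chooses between two goat
        // doors with equal likelihood (assumption 7): 2/6 * 1/2 = 1/6.
        // Otherwise he is forced to open the one remaining goat door: 2/6 * 1.
        return car == pick ? 1 : 2;
    }

    public static void Main()
    {
        const int pick = 1, opened = 2;

        // Total weight of every branch in which Monty opens door 2.
        int total = 0;
        for (int car = 1; car <= 3; car++)
        {
            total += BranchWeightInSixths(pick, car, opened);
        }

        int stick = BranchWeightInSixths(pick, 1, opened); // car behind door 1
        int swap = BranchWeightInSixths(pick, 3, opened);  // car behind door 3

        // Dividing by the total is the re-normalising step.
        Console.WriteLine($"Stick with door 1: {stick}/{total}"); // Stick with door 1: 1/3
        Console.WriteLine($"Swap to door 3: {swap}/{total}");     // Swap to door 3: 2/3
    }
}
```

Unlike the simulation, this version does rely on assumption 7: if Monty favoured one goat door over the other, the weights on the top branch would change.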

Well that’s all for this post, until next time, keep crunching those numbers.

Having won this week’s award for the longest blog title here at Black Marble, I’d better get on with the actual post.

Every data scientist is going to have to work with large data files at some point in their careers, and right now, the de facto standard for doing so is Hadoop. There are lots of ways to gain access to Hadoop, from complete “roll your own” solutions, right up to pre-packaged and ready-to-go solutions from people like Hortonworks.

For the purposes of this post we are going to take the easiest possible route to getting up and running with Hadoop, and that’s to use the “sandbox” VM from Hortonworks. So what is this sandbox? Well according to the Hortonworks site it’s:

A personal, portable Hadoop environment that comes with a dozen interactive Hadoop tutorials. All packaged up in a virtual environment that you can get up and running in 15 minutes!

And that’s pretty accurate to be honest, if you discount the download time of the 1.9GB VM of course.

The first thing you’re going to want to do is to ensure that you have Hyper-V installed on your Windows 8 machine. To do that, hit the Windows key and search for “Programs and Features”, then select the app of that name from the results:

Over on the left hand side, you’ll see a link entitled “Turn Windows features on or off”. Click on that and, in the dialog that appears, check the box beside “Hyper-V” (or confirm that it is already checked), click OK, and you’re done.

Having done that, trot across to the Hortonworks website and download the sandbox VM, by clicking on the “Get Sandbox” button…

then selecting the VM of your choice…

We’re going to take the Hyper-V one. Once you have the zip file downloaded, unzip it to a convenient location.

Now, the instructions for installing the VM tell you to create an internal virtual switch within Hyper-V, and that’s okay to get things up and running and for experimentation. However, the sandbox comes with a series of tutorials, which are updated by Hortonworks from time to time, and there’s a button to allow you to download the latest version of them, so we’re going to want to configure our VM to be able to connect to the Internet. To do that, open Hyper-V Manager, and click on the “Virtual Switch Manager” link under “Actions”, on the right hand side of the window:

From there, go ahead and create a new “External Virtual Switch”…

And bridge it with the NIC you have connected to the Internet. As you can see, I’ve got mine connected to my WiFi, and that’s going to cause us a little problem that we’re going to have to fix later on, but I’ll come to that…

Next, from the “Action” menu, select “Import Virtual Machine” and navigate to the location where you extracted the download…

Step through the wizard, and when you have done so, you’ll have the VM installed…

Now we’re going to start up the VM and configure it to access the Internet…

By default, the VM is configured to have a static IP address of 192.168.56.101, which isn’t any use to us. We’ll have to configure our VM for Internet access, so go ahead and follow the instructions on the screen and press <Alt+F5> to log into the machine.

The default credentials for the sandbox VM are UID: root PWD: hadoop.

Having logged in, we want to edit the config file at /etc/sysconfig/network-scripts/ifcfg-eth0 and set it up for DHCP…

And that’s where things get interesting. If you have a wired Internet connection then you are probably going to be okay; if, like me, you want to use a WiFi connection, then things get tricky. Virtualised WiFi connectivity is problematic and it proved to be so in my case. Now, in theory, all we have to do is to edit the file to have the following settings:

DEVICE=eth0
ONBOOT=yes
BOOTPROTO=dhcp
NAME="System eth0"

then that should be an end to it. To be fair, when I’m in the office that works perfectly, with the office DHCP server issuing me a 10.X address. However, when I come home and connect my laptop to my home network, the VM can’t seem to get an IP address. This is caused by an issue deep in the WiFi stack that I neither understand nor know how to fix, if it even can be fixed. As you can see from the screen shot above, my workaround is simple: when I’m in the office I comment out the parts required to set a static IP address, and when I’m at home I comment out the DHCP parts, save the file, reboot the VM and I’m connected to the Internet. Confirm this by pinging some known host…
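Since the screen shot isn’t reproduced here, the fragment below is a rough sketch of what a file set up for this kind of switching might look like. It is not a copy of my actual file: the `IPADDR`, `NETMASK` and `GATEWAY` values are placeholders that you would replace with addresses valid on your own network, and I’m assuming the standard key names used by the network scripts on the sandbox’s Linux distribution.

```
# /etc/sysconfig/network-scripts/ifcfg-eth0
# Sketch only: the addresses below are placeholders, not real settings.
DEVICE=eth0
ONBOOT=yes
NAME="System eth0"

# Office: let the DHCP server hand out an address.
BOOTPROTO=dhcp

# Home: comment out the BOOTPROTO=dhcp line above and
# uncomment the static settings below instead.
#BOOTPROTO=none
#IPADDR=192.168.1.50
#NETMASK=255.255.255.0
#GATEWAY=192.168.1.1
```

Only one of the two groups should be active at a time; after changing them, save the file and reboot the VM as described above.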

Now that we have our sandbox VM connected to the Internet, we can point a browser at the VM’s IP address (either the one given by your DHCP server, and shown in the VM launch screen, or the one you set statically, if you have the same issue as I do at home)…

Hit the “Start” button to run the tutorials…

and click the “Update” button, to ensure you have the “latest and greatest” version of the tutorials…

And that’s it, you’re now ready to work your way through the tutorials and become a Hadoop expert!

Well that’s all for this post, until next time, happy number crunching!

Last time, we spoke about how critical thinking was an aspect of data science that was often overlooked. In this post, we’re going to examine some of the fundamental building blocks of critical thinking: claims, explanations and inferences.

## Claims

Firstly, claims. Claims are really just assertions:

• The crime rate has fallen this year.
• Unemployment is up.
• Major corporations aren’t paying enough tax.

And they come in four main flavours:

1. Evidence-based claims. These are claims stated as facts that can be checked by someone. For example: “Crime fell by 8% last year.”
2. Prediction-based claims, which are claims that state something will happen in the future. For example: “The UK is condemned to a decade of washed out summers.”
3. Recommendation-based claims, which are those that make recommendations. For example: “We should drink 1.2 litres of water per day.”
4. Principle-based claims, which are those that express an opinion on what ought (or ought not) to be done. For example: “Major corporations should pay more tax.”

So what should we do when we come across these claims in our work as data scientists? Well as critical thinkers, we should always question them, asking ourselves questions like “Is this claim reasonable?”, “Is it significant?” and “What else do I need to know to make a judgement regarding this claim?”.

## Explanations

Explanations are the things that sit between the claim and the inference. We want to get to the inference because that’s the thing that contains the action point or the conclusion to the argument, but without one or more explanations we can’t get there. Often the explanation sentence will start with “because…” or “due to…”, for example: the UK is condemned to a decade of washed out summers, due to global warming.

They often come in the form of a claim, as in this case. There are implied claims in this explanation, namely: global warming exists, global warming causes weather change and that the UK is affected by this weather change. The same questions asked of claims should be asked of these kinds of explanations too, and you should follow the claim –> explanation “rabbit hole” until it “bottoms out”, or until you satisfy yourself that the explanation is right or wrong.

Explanations can come in the form of single explanations, multiple independent explanations and joint explanations. We’ve covered single explanations; multiple independent explanations are just where more than one explanation can lead from the claim to the inference. For example: I will buy flowers because it is my wife’s birthday and because she likes flowers. Either explanation can be used to explain why flowers will be bought. Joint explanations are where two or more explanations are used, jointly, to explain a claim. In this case, however, the explanations are not independent, and if any one of them is false, then the claim falls. For example: I am going to get wet when leaving work because it is going to rain and I have no umbrella. Here if it doesn’t rain, or someone lends me an umbrella, then the claim falls and I shan’t get wet.
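For the programmers among us, the difference between independent and joint explanations is roughly the difference between a logical OR and a logical AND. The sketch below is a deliberately crude model that reduces each explanation to true or false; the class `Explanations` and both method names are invented for this illustration.

```csharp
using System;
using System.Linq;

// Illustrative model only: each explanation is reduced to a true/false value.
public static class Explanations
{
    // Multiple independent explanations: any one that holds supports the claim.
    public static bool SupportedIndependently(params bool[] explanations)
        => explanations.Any(e => e);

    // Joint explanations: every one must hold, or the claim falls.
    public static bool SupportedJointly(params bool[] explanations)
        => explanations.All(e => e);

    public static void Main()
    {
        // "I will buy flowers": suppose it is not her birthday (false),
        // but she does like flowers (true). The claim still stands.
        Console.WriteLine(SupportedIndependently(false, true)); // True

        // "I am going to get wet": it is going to rain (true), but someone
        // lends me an umbrella, so "I have no umbrella" is false. The claim falls.
        Console.WriteLine(SupportedJointly(true, false));       // False
    }
}
```

Real explanations are rarely this clean, of course, but the model is a handy reminder of which explanations you need to knock down to defeat a claim.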

## Inference

An inference is the conclusion to an argument and often contains an action point. It follows the logical steps of claim –> explanation –> inference. Often the inference sentence will start with “So…” or “Therefore…”, for example: There is a huge demand for thingummies in the US, because legislation has been passed requiring every citizen to carry a thingummy, therefore we should increase thingummy sales to the US in the coming quarter.

So now you have been furnished with the basic steps that you should run through when you see claims made as the result of your own, or others’, data science. If the output is in the form of a claim (sales are up in the north west region), look for explanations that can support the claim and an inference that can help the business move forward.

Next time we’ll continue our exploration of critical thinking, until then, crunch those numbers!

Hello there, my name’s Gary Short. If you’re a follower of Black Marble news, you’ll already know that I’ve recently joined the company as Head of Data Science, with the task of creating a flourishing Data Science Practice within the company.

And that is all I plan to say about that.

With the traditional “first post introduction” out of the way, I’d like to spend the rest of this post talking about something far more interesting… data science, and in particular an aspect of data science that I don’t hear a lot of people talking about, and that’s critical thinking. Head over to Wikipedia (no not now, after you’ve finished reading this!) and look up the definition of data science and you’ll see it’s defined as the intersection of a lot of Cool Stuff™.

Whilst all of these are valid, and I agree that they are all part of what makes a good data scientist, I do believe that this definition is missing the key aspect of critical thinking. So what is critical thinking? Well, if we pop over to Wikipedia again (not yet! What’s wrong with you people?), we can see that critical thinking is defined thusly:

Different sources define critical thinking variously as:

• "reasonable reflective thinking focused on deciding what to believe or do"[2]
• "the intellectually disciplined process of actively and skillfully conceptualizing, applying, analyzing, synthesizing, or evaluating information gathered from, or generated by, observation, experience, reflection, reasoning, or communication, as a guide to belief and action"[4]
• "purposeful, self-regulatory judgment which results in interpretation, analysis, evaluation, and inference, as well as explanation of the evidential, conceptual, methodological, criteriological, or contextual considerations upon which that judgment is based"[5]
• "includes a commitment to using reason in the formulation of our beliefs"[6]

Umm, yeah, well that’s clear then, isn’t it? Well no, not really. When I see something defined like this, and by “like this” I mean having several definitions, I immediately decide that the reason that there isn’t one clear definition is that the definition is context sensitive, that is, it very much depends on the point of view of the observer. Since we’re going to be focusing on critical thinking from the point of view of the data scientist, I don’t think it helps us to layer on yet another definition. Instead, it will be much more productive if we just jump in and look at critical thinking by example instead of definition. So, for the remainder of this post, and in a number of future posts, we’ll do just that.

For us data scientists, critical thinking can best be thought of as the process of examining the significance and meaning of the claims made by the results of our statistical analysis. That’s all well and good, but what exactly is a claim? Well, much like we programmers are used to dealing with claims-based authentication, where a “claim” is made about a person or an organisation and a system must check the veracity of such a claim, in data science a claim may arise from the output of an analysis, and we must ascertain the veracity of such a claim.

Let’s take a recent example from the news. Here we’re told that a “Joint study shows extent of Scottish under-employment”, and that…

The extent of under-employment in Scotland has been revealed in a study published by the Scottish government. An analysis jointly prepared with the STUC shows more than 250,000 workers want to work longer hours. That is a rise of 80,000 on 2008, before the downturn got under way. That makes 256,000 people, or more than 10% of the entire workforce.

Critical thinking will teach us to examine each of the claims in this article and to ask questions about them. So let’s do that now, by way of example:

Firstly, the article claims “The extent of under-employment in Scotland has been revealed”. Has it? The article doesn’t say how the figures were obtained, so we can’t draw our own conclusions with regard to sample bias.

The article claims that “250,000 workers want to work longer hours”. Do these people actually want to work longer hours, or do they want to earn more money for the full time work they do? Do these people want to work longer hours, or do they want to move from part time work into full time work?
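A quick arithmetic check on the quoted figures is also part of questioning a claim. The lines below use only the numbers in the quote; $W$, standing for the size of the workforce, is notation introduced here, and the results are implications of the article’s figures rather than anything the article itself states.

$$
256{,}000 - 80{,}000 = 176{,}000 \quad \text{(implied 2008 figure)}
$$

$$
\frac{80{,}000}{176{,}000} \approx 45\% \quad \text{(implied rise since 2008)}
$$

$$
\frac{256{,}000}{W} > 0.10 \;\Rightarrow\; W < 2{,}560{,}000
$$

So the “rise of 80,000” would be a rise of roughly 45% on the implied 2008 figure, and “more than 10% of the entire workforce” implies a workforce of fewer than about 2.56 million. Both are numbers we could check against official statistics before giving the claim any weight.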

After we have examined such explicit claims, we can also examine the implicit claims of the article. An item such as this, on the BBC news site, comes with an implied claim that there is no bias, just a straight reporting of facts. We can apply critical thinking here too, and ask ourselves such questions as, do the Scottish Government and the STUC gain from maximising or minimising the reported number? If they do, is there evidence, in the article or elsewhere, that they have done such a thing?

The article quotes from the research, without linking to it, and from a spokesman from one of the report’s authors, but there is no balancing quote from the opposition, so we should ask ourselves, are these figures undisputed and therefore no balancing quote is required? If not, is the journalist showing bias here by excluding an opposing view or did the supporters of an opposing view decline to comment? If the former, does this journalist have a history of doing so, or is this a one off? If the latter, who holds the opposing view, and were they given enough time to formulate a response? The answers to these questions will go a long way in helping us to contextualise the output of this analysis and to give it the appropriate weight in our decision making process.

As you can see, critical thinking is a key aspect of data science, the correct application of which will allow us to take full advantage of the output of our analysis and will allow us to interpret it properly for our end audience, or for our own benefit when consuming other data scientists’ output. In future blog posts we’ll delve further into this fascinating aspect of data science and I hope you’ll join me for those posts, until then, keep crunching those numbers!