Did You Survive the Titanic?

15 April 1912: the RMS Titanic sinks in the North Atlantic Ocean after a collision with an iceberg. Everyone knows the story: countless books, movies and artistic works have been created about this disaster. The picture above is supposedly the last one taken of the ship while still afloat.
Of the more than 2200 passengers and crew, only about 710 survived. But if you had been there, would you have survived?
Supplies

We're not going to try to recreate the conditions of the disaster, nor to test a model of the Titanic! So, don't go looking for your duck buoy...
No, we are going to simulate the probabilities of survival, according to some chosen features.
All you need here is:
- An ESP32 board
- The Arduino IDE...
- and some mathematical background if you want to know how this works (otherwise, jump directly to Step 3)
Artificial Intelligence and Bayes Theorem

One simple way of calculating the survival probability is just to divide the number of survivors by the total number of people onboard:
- 2224 passengers and crew
- 710 survived (a list is available online)
710 / 2224 = 0.319
So there was statistically a 32% chance. But if you have seen James Cameron's movie (99% chance, I think :) you know that it's not that simple. Rose is a healthy rich woman; Jack is a poor young man who won his ticket in a poker game. She had a first class cabin, he had a third class ticket. She survived, he died (although he was the king of the world).
That's what we called 'features' in the previous step: a feature is a piece of data that characterizes a sample, a measurable property of the object you're analyzing. Features are the essence of Artificial Intelligence algorithms.
If you want to know more about AI on microcontrollers, please have a look at my other Instructables. AI requires data to be able to extract or build patterns and generalize them to unknown data. Here, we want to link our features to 2 categories: 'survived' or 'died'. Categories are usually also called 'classes', I'll use either word later on.
There are many kinds of AI algorithms, some are supervised (meaning they train on already categorized data), some are not. For the problem at hand, the first class (of algorithms, not boat tickets) is best suited, and a dataset is already available online. It can be found here for example.
This dataset is a csv (Comma-separated values) file, which can be opened either using Excel or a simple text editor.
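A csv file is just lines of values separated by commas, with the column names on the first line. The snippet below shows how such a row maps to named fields in Python; the column names and order here are illustrative, not the exact layout of the file.

```python
import csv
import io

# A hypothetical two-line excerpt in the spirit of the dataset
# (column names and order are illustrative, not the exact file layout)
sample = "Survived,Pclass,Sex,Age,Fare\n1,1,female,22,400\n"

# DictReader uses the first line as keys for every following row
rows = list(csv.DictReader(io.StringIO(sample)))
print(rows[0]["Sex"], rows[0]["Fare"])  # → female 400
```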
The features are in columns, and some of them are very interesting:
- Survived : you already know. It's binary, 1 or 0.
- Sex: male or female. An important feature: you have already heard the phrase "Women and children first"
- Fare: the price of the ticket... A passenger who paid a higher price was treated better by the crew.
- Pclass: passenger class. There were 3 classes, distributed in the boat as shown below.
- Age.
The other features (such as the passenger's name, her ticket number, etc) are less important, and I made a lighter version of the dataset which is available below, keeping only 4 features. It will be used by the algorithm.
It is important to know that these features do not determine for sure whether a passenger survived or not. But they do reveal tendencies: we can easily imagine that a young rich woman had a better chance than a poor old man. The algorithm will examine the dataset, extract these tendencies, and finally estimate a probability of survival.
Downloads
Naive Bayes Method


The keyword here is 'probability'. Bayes' theorem describes the probability of an event, based on prior knowledge of conditions that might be related to the event. Its formula is written above (in blue neon). Its secret lies in three probabilistic functions that allow us to calculate a fourth one.
It reads as follows: the probability of A knowing B is equal to the probability of B knowing A multiplied by the probability of A divided by the probability of B.
(breathe, and take a moment to think about it...)
Here, A and B are events, the results of an observation. 'Events'? For example, if you roll a die, 'getting a 3' is an event.
By confronting two events with each other, the formula quantifies the probability for one to induce the other, thus going back from the consequences to the causes. As can be seen on the second image above, we can rewrite the formula as follows
P(A/B) * P(B) = P(B/A) * P(A)
and explain all symbols:
- writing P(x) as the probability of the event 'x'
- P(A/B) is the Posterior probability : the probability of A being TRUE, knowing that the event B is TRUE
- P(B) is called Evidence (sometimes called 'predictor' or 'marginalization'): it is the probability of B being TRUE, independently of any other event. Basically, it is what we know.
- P(B/A) is called the Likelihood : the probability of B being TRUE, knowing that the event A is TRUE
- P(A) is the Prior probability : the probability of A being TRUE
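As a sanity check, the die example from above can be plugged into the formula. With A = 'getting a 3' and B = 'the result is odd', the posterior P(A/B) should be 1/3: once we know the roll is odd, a 3 is one of only three possibilities. A few lines of Python (just to illustrate the arithmetic, the project itself runs as an Arduino sketch) confirm it:

```python
# Bayes' theorem on the die example: A = "rolled a 3", B = "the result is odd"
p_a = 1 / 6          # prior P(A): one face out of six
p_b = 1 / 2          # evidence P(B): three odd faces out of six
p_b_given_a = 1.0    # likelihood P(B/A): a 3 is always odd

p_a_given_b = p_b_given_a * p_a / p_b   # posterior P(A/B)
print(p_a_given_b)   # → 0.333...
```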
How can we relate this theory to our classification problem?
We want to estimate the probability of a class knowing a set of features. We can then re-write the formula as:
P(class / features) * P(features) = P(features / class) * P(class)
For example, we have 2 classes : 'survived' and 'died'. We have seen above that 32% of the passengers survived, then
- P('survived') = 32%
- P('died') = 68%
Actually, these numbers are for the whole set of passengers and crew on the ship when the disaster occurred. But the dataset is not complete and does not provide exhaustive information for everyone: some features are missing for some passengers. I had to remove the lines with incomplete information, ending up with 707 records. Among them, 289 are in the class 'survived' (remember this for later :). So for our application, the prior probabilities P(class) are:
- P('survived') = 289 / 707 = 40.9%
- P('died') = 418 / 707 = 59.1%
Now, let's suppose that we want to estimate the probability that a young rich woman survived. Let's say she is 22 years old (the age of Kate Winslet when the movie was released) and has a first class ticket that she bought for £400. The feature set is
{'age:22', 'sex:female', 'Pclass:1', 'fare:400'}
We now need the evidence P(features) and the likelihood P(features / class). But, in the first formula, B is a single event and we have a set of 4 events. How can we deal with it?
The answer is in the name of the method: the NAIVE Bayes method. Here, the term 'naive' means that the events are considered independent. As a consequence, the probability of a set of independent events is just the product of the probabilities of the individual events. The likelihood becomes:
P({'age:22', 'sex:female', 'Pclass:1', 'fare:400'} / 'survived') = P('age:22' / 'survived') * P('sex:female' / 'survived') * P('Pclass:1' / 'survived') * P('fare:400' / 'survived')
This is not always true: we can imagine, for example, that the price of a first class ticket is higher than the price of a third class ticket, so fare and class are not really independent. But not all first class tickets were sold at the same price (Wikipedia says they ranged from £23 to £870; it is very likely that the higher-priced cabins were located on the upper decks, where people could reach the lifeboats more easily).
For the evidence:
P(features) = sum [ P(features / class) * P(class) ]
where the sum is taken over all the classes. And the conditional probability is decomposed using the product rule:
P(features / class) = product P(individual feature / class)
We get therefore:
P({'age:22', 'sex:female', 'Pclass:1', 'fare:400'}) =
P('age:22' / 'survived') * P('sex:female' / 'survived') * P('Pclass:1' / 'survived') * P('fare:400' / 'survived') * P('survived') +
P('age:22' / 'died') * P('sex:female' / 'died') * P('Pclass:1' / 'died') * P('fare:400' / 'died') * P('died')
Finally, we obtain:
P('survived' / {'age:22', 'sex:female', 'Pclass:1', 'fare:400'}) =
P('age:22' / 'survived') * P('sex:female' / 'survived') * P('Pclass:1' / 'survived') * P('fare:400' / 'survived') * P('survived')
/
[ P('age:22' / 'survived') * P('sex:female' / 'survived') * P('Pclass:1' / 'survived') * P('fare:400' / 'survived') * P('survived') +
P('age:22' / 'died') * P('sex:female' / 'died') * P('Pclass:1' / 'died') * P('fare:400' / 'died') * P('died') ]
Estimate the probabilities:
We then need to calculate all these probabilities, using the well-known formula: the probability of an event is the number of occurrences of the event divided by the total number of samples. So, if among our population of 707, only 183 have a first class ticket, then
P('Pclass:1') = 183 / 707 = 0.2588 = 25.9%
If among them (183), 122 survived, then:
- P('Pclass:1' / 'survived') = 122 / 289 = 0.422 = 42.2%
- P('Pclass:1' / 'died') = 61 / 418 = 0.146 = 14.6% (61 = 183 - 122 first class passengers who did not survive)
Similarly, there are 27 people (in the dataset) who are 22 years old, 11 of whom survived; and 261 women, of whom 197 survived, leading to:
- P('age:22') = 27 / 707 = 3.8%
- P('age:22' / 'survived') = 11 / 289 = 3.8%
- P('age:22' / 'died') = 16 / 418 = 3.8%
- P('sex:female') = 261 / 707 = 36.9%
- P('sex:female' / 'survived') = 197 / 289 = 68.2%
- P('sex:female' / 'died') = 64 / 418 = 15.3%
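Plugging these numbers into the naive Bayes formula gives the posterior for our young first class passenger. The Python sketch below does the arithmetic (the project itself is an Arduino sketch). The fare is left out here, since no passenger in the dataset paid exactly £400, and the per-class 'died' counts (16, 64, 61) are derived from the totals quoted above.

```python
# Naive Bayes posterior for {'age:22', 'sex:female', 'Pclass:1'},
# using the counts quoted in the text (fare left out for now)
n, n_surv, n_died = 707, 289, 418

# per-class likelihoods: product of the individual feature probabilities
like_surv = (11 / n_surv) * (197 / n_surv) * (122 / n_surv)
# died counts derived from the totals: 16 = 27-11, 64 = 261-197, 61 = 183-122
like_died = (16 / n_died) * (64 / n_died) * (61 / n_died)

numerator = like_surv * (n_surv / n)                 # likelihood * prior
evidence = numerator + like_died * (n_died / n)      # sum over both classes
posterior = numerator / evidence
print(posterior)  # about 0.90: a very good chance, even before counting the fare
```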
What about the price? Well, there is no sample with 400£ in our dataset... so we cannot compute the probabilities using the same process. Here, we have 2 solutions:
The simplest solution is to count the samples that have a feature value close to the one we search, and use this number to compute the probability.
In another solution, the naive Bayes method makes a new assumption: the values associated with each class are distributed according to a normal (or Gaussian) distribution. We then need to calculate the mean and standard deviation of this feature for each class, and apply the Gaussian formula:
P(xi / y) = 1 / sqrt(2 * pi * sigma_y^2) * exp( -(xi - mu_y)^2 / (2 * sigma_y^2) )
Here xi is the feature value, y is the class, and mu_y and sigma_y are the mean and standard deviation of the feature over the samples of class y.
This is what is actually implemented in the code: if a feature value is not found in the dataset, the mean and variance of that feature are calculated for each class, and the likelihood of this particular value is estimated using a Gaussian distribution.
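For our £400 fare, that Gaussian estimate looks like the sketch below. The mean and standard deviation used here are made-up placeholders, NOT values computed from the real dataset; the actual sketch computes them per class from the file.

```python
import math

def gaussian_likelihood(x, mu, sigma):
    """P(xi / y) when the feature is assumed normally distributed within class y."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / math.sqrt(2 * math.pi * sigma ** 2)

# Placeholder per-class statistics (hypothetical, not from the real dataset):
# suppose survivors' fares had a mean of 48£ and a standard deviation of 67£.
print(gaussian_likelihood(400.0, 48.0, 67.0))  # a tiny but non-zero likelihood
```

An out-of-range value like £400 lands far in the tail of the bell curve, so its likelihood is very small but never exactly zero, which keeps the products in the Bayes formula well defined.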
Running the Program

I assume that you have the ESP32 and that the Arduino IDE is ready for this device. If not, please refer for example to this website.
First you need to install 2 libraries I wrote:
- csv_Vector: used to read the dataset
- NaiveBayes-for-ESP32
On each of the above GitHub pages, there is a green button reading 'Code'. Click on it to deploy a menu and select 'Download ZIP'. Find the downloaded file on your computer and unzip it into your folder 'Arduino/libraries', and remove the '-main' from the name of the new folder.
You should now have 2 new folders in the 'Arduino/libraries' called 'csvVector' and 'NaiveBayes-for-ESP32'.
Now go back to the Arduino folder and create a new folder called Titanic. In this folder, download the attached file 'Titanic.ino'. Then create a 'data' folder where you put the dataset file from Step 1.
Connect your ESP32 to your computer, restart the Arduino IDE and load the Titanic.ino file. Before running it, you have to upload the dataset file into the ESP32 file system. This can be done using a plugin called 'ESP32 Sketch Data Upload'. If it doesn't appear in the Tools menu, you need to install it: please refer to this website. Once it is installed, restart the IDE: all you need to do is run the plugin (make sure that the serial monitor is not open) to upload the dataset file to the ESP32.
When it is done, open the serial monitor, select 115200 baud in the menu and upload the sketch...
Enter your age, sex (M or F), class of your ticket, and fare. All the numbers should be integers. After some thinking time, you get the result. Good luck!
_______________________________________
So, how about Rose? The sketch says that she had 79.4% chance to survive.
Jack was 23 years old (the age of Leonardo DiCaprio when the movie was released). If he was in third class with a ticket fare of £10... the poor guy had only a 6.4% chance of surviving!
And me? With a second class 50£ ticket: I