In 2000, Enron was one of the largest companies in the United States. By 2002, it had collapsed into bankruptcy due to widespread corporate fraud. In the resulting federal investigation, a significant amount of typically confidential information entered the public record, including tens of thousands of emails and detailed financial data for top executives. Here, I build a supervised learning algorithm to identify fraudulent employees using the Enron dataset.
Understanding the Dataset and Question¶
Summarize for us the goal of this project and how machine learning is useful in trying to accomplish it. As part of your answer, give some background on the dataset and how it can be used to answer the project question. Were there any outliers in the data when you got it, and how did you handle those?
Introduction¶
The Enron fraud is the largest case of corporate fraud in American history. Founded in 1985, Enron Corporation went bankrupt by the end of 2001 due to widespread corporate fraud and corruption. Before its fall, Fortune magazine had named Enron "America's most innovative company" for six consecutive years. So what happened? Who were the culprits?
In this project, I play detective and build a classification algorithm to predict the person-of-interest (POI) label from the email and financial features in the combined dataset. A POI is anyone who was indicted, settled without admitting guilt, or testified in exchange for immunity. We check the predicted POIs against the actual POIs in the dataset to evaluate our predictions.
import numpy as np
import pandas as pd
import sys
import pickle
sys.path.append("../tools/")
from feature_format import featureFormat, targetFeatureSplit
from tester_plus import dump_classifier_and_data
### Load the dictionary containing the dataset
with open("final_project_dataset.pkl", "r") as data_file:
data_dict = pickle.load(data_file)
enron_data = pd.DataFrame.from_dict(data_dict, orient = 'index')
Data Exploration¶
Let's print a few lines from the dataset.
enron_data.head()
print "There are a total of {} people in the dataset." .format(len(enron_data.index))
print "Out of which {} are POI and {} Non-POI." .format(enron_data['poi'].value_counts()[True],
enron_data['poi'].value_counts()[False])
print "Total number of email plus financial features are {}. 'poi' column is our label." \
.format(len(enron_data.columns)-1)
The Enron dataset is messy and has many missing values (NaN). Every feature has missing values, and some features are missing for more than 50% of the people, as the frequency of NaN in the table below shows. NaNs are coerced to 0 before the algorithm is trained later on.
enron_data.describe().transpose()
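To quantify the missingness more directly, the placeholders can be tallied per feature. This is a small sketch of my own; at this point the missing values are still the literal string 'NaN' from the raw dictionary.
# Fraction of people with a missing ('NaN') value, per feature
nan_fraction = (enron_data == 'NaN').mean().sort_values(ascending=False)
print nan_fraction.head(10)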
enron_data.replace(to_replace='NaN', value=0.0, inplace=True)
Outlier investigation¶
Visualization is one of the most powerful tools for finding outliers. Upon plotting salary against bonus, one outlier pops out immediately: "TOTAL" (hover over the points in the scatterplot to examine them). The spreadsheet added all the data points up into this row, so we need to remove it. On closer examination I found one more entry that is not the name of a real person, "THE TRAVEL AGENCY IN THE PARK"; it is also dropped from the dataset, as are entries whose features are all 'NaN'.
# Import plotly and set your credentials
from plotly import tools
#tools.set_credentials_file(username='*******', api_key='*********')
import plotly.plotly as py
import plotly.graph_objs as go
# Make scatterplot before outlier removal
trace0 = go.Scatter(
x=enron_data.salary,
y=enron_data.bonus,
text = enron_data.index,
mode = 'markers'
)
# Remove Outlier
enron_data.drop(['TOTAL'], axis = 0, inplace= True)
# Make scatterplot after outlier removal
trace1 = go.Scatter(
x=enron_data.salary,
y=enron_data.bonus,
text = enron_data.index,
mode = 'markers'
)
# Layout the plots together side by side
fig = tools.make_subplots(rows=1, cols=2, subplot_titles=('Before outlier removal', 'After outlier removal'))
fig.append_trace(trace0, 1, 1)
fig.append_trace(trace1, 1, 2)
fig['layout']['xaxis1'].update(title='salary')
fig['layout']['xaxis2'].update(title='salary')
fig['layout']['yaxis1'].update(title='bonus')
fig['layout']['yaxis2'].update(title='bonus')
py.iplot(fig)
enron_data.drop(['THE TRAVEL AGENCY IN THE PARK'], axis = 0, inplace= True)
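The remaining cleanup mentioned above, dropping entries whose features are all missing, can be done along these lines. This is a sketch of one way to do it, assuming the NaNs were already coerced to 0 in the earlier cell; excluding 'poi' and 'email_address' from the check is my choice.
# Drop people whose feature values are all missing (all zeros after the
# earlier NaN coercion); the 'poi' label and 'email_address' are excluded
# from the check.
feature_cols = [c for c in enron_data.columns if c not in ('poi', 'email_address')]
all_missing = (enron_data[feature_cols] == 0).all(axis=1)
enron_data = enron_data[~all_missing]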
Optimize Feature Selection/Engineering¶
What features did you end up using in your POI identifier, and what selection process did you use to pick them? Did you have to do any scaling? Why or why not? As part of the assignment, you should attempt to engineer your own feature that does not come ready-made in the dataset -- explain what feature you tried to make, and the rationale behind it. In your feature selection step, if you used an algorithm like a decision tree, please also give the feature importances of the features that you use, and if you used an automated feature selection function like SelectKBest, please report the feature scores and reasons for your choice of parameter values.
Feature engineering¶
Feature engineering involves using human intuition to hypothesize which features might contain patterns that can be exploited by machine learning, coding up the new feature, visualizing it, and repeating the process. Our hypothesis here: POIs sent emails to each other at a higher rate than non-POIs did. I coded up two new features: the fraction of emails this person receives from POIs (fraction_from_poi) and the fraction of emails this person sends to POIs (fraction_to_poi).
$$fraction\_from\_poi =\frac{number\ of\ emails\ from\ POI\ to\ this\ person}{total\ number\ of\ messages\ to\ this\ person} = \frac{from\_poi\_to\_this\_person}{to\_messages}$$
$$fraction\_to\_poi = \frac{number\ of\ emails\ from\ this\ person\ to\ POI}{total\ number\ of\ messages\ from\ this\ person} = \frac{from\_this\_person\_to\_poi}{from\_messages}$$
Visualizing these new features shows that they provide discriminating power between POI and non-POI: in the plot on the right, there is a sizeable region below 0.2 on the fraction_to_poi axis that contains no POIs.
#Create new feature(s)
enron_data["fraction_from_poi"] = enron_data["from_poi_to_this_person"].\
divide(enron_data["to_messages"], fill_value = 0)
enron_data["fraction_to_poi"] = enron_data["from_this_person_to_poi"].\
divide(enron_data["from_messages"], fill_value = 0)
enron_data["fraction_from_poi"] = enron_data["fraction_from_poi"].fillna(0.0)
enron_data["fraction_to_poi"] = enron_data["fraction_to_poi"].fillna(0.0)
from bokeh.plotting import figure, show
from bokeh.io import output_notebook
from bokeh.layouts import row
from bokeh.models import ColumnDataSource
output_notebook()
colormap = {False: 'blue', True: 'red'}
colors = [colormap[x] for x in enron_data['poi']]
labelmap = {False: 'Non-POI', True: 'POI'}
labels = [labelmap[x] for x in enron_data['poi']]
source = ColumnDataSource(dict(
x1=enron_data["from_poi_to_this_person"],
y1=enron_data["from_this_person_to_poi"],
x2=enron_data["fraction_from_poi"],
y2=enron_data["fraction_to_poi"],
color=colors,
label=labels
))
# Before feature engineering
s1 = figure(plot_width=450, plot_height=400)
s1.xaxis.axis_label = 'no. of emails from POI to this person'
s1.yaxis.axis_label = 'no. of emails from this person to POI'
s1.circle('x1', 'y1', size = 10, alpha = 0.5,
color='color', legend = 'label', source = source)
# After feature engineering
s2 = figure(plot_width=450, plot_height=400)
s2.xaxis.axis_label = 'fraction of emails this person gets from POI'
s2.yaxis.axis_label = 'fraction of emails this person sends to POI'
s2.circle('x2', 'y2', size = 10, alpha = 0.5,
color='color', legend = 'label', source = source)
show(row(s1, s2))
Feature Scaling¶
I used a decision tree as my final algorithm. Algorithms like decision trees and linear regression do not require feature scaling, whereas Support Vector Machines (SVM) and k-means clustering do.
SVM and k-means clustering compute Euclidean distances between points. If one feature has a much larger range than the others, the distance is dominated by that feature; these algorithms are not invariant to rescaling of the features.
In the case of linear regression, each feature has its own coefficient. If a feature has a large range that does not affect the label, the regression simply makes the corresponding coefficient small. With tree-based algorithms, splits are made on one feature at a time, so the scale of one dimension never interacts with another. These algorithms are therefore invariant to feature rescaling.
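If an SVM or k-means were used instead, the features would need rescaling first. Below is a minimal, self-contained sketch using scikit-learn's MinMaxScaler; the toy numbers are illustrative and this is not part of my final pipeline.
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy example: a dollar-scale feature next to a fraction-scale feature
X = np.array([[200000.0, 0.2],
              [1000000.0, 0.6],
              [450000.0, 0.1]])
# Each column is rescaled to [0, 1], so neither feature dominates
# Euclidean-distance computations.
print MinMaxScaler().fit_transform(X)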
Feature Selection¶
# Store to my_dataset for easy export below.
# create a dictionary from the dataframe
my_dataset = enron_data.to_dict('index')
Features $\neq$ Information. We want the minimum number of features that capture the trends and patterns in our data, and we want to get rid of features that give us no information. A machine learning algorithm is only as good as the features we put into it, so the feature selection methodology must be systematic and thorough rather than driven by intuition alone.
First, I manually removed the features that had more than 50% of their values missing (NaN); then I ran SelectKBest on the remaining features and selected the eight features with scores greater than 2.
from sklearn.feature_selection import SelectKBest, f_classif
features_list = ["poi", "bonus", "exercised_stock_options", "expenses", "other", "restricted_stock", "salary",
"shared_receipt_with_poi", "total_payments", "total_stock_value", "fraction_to_poi",
"fraction_from_poi"]
data = featureFormat(my_dataset, features_list, sort_keys = True)
labels, features = targetFeatureSplit(data)
# Perform feature selection
selector = SelectKBest(f_classif, k=5)
selector.fit(features, labels)
# Get the raw p-values for each feature, and transform from p-values into scores
scores = -np.log10(selector.pvalues_)
# Bokeh Barplots
from bkcharts import Bar, show
data = {'scores':scores, 'features':features_list[1:]}
bar = Bar(data, label='features', values='scores', title="Select K Best",
legend = None, plot_width = 450, plot_height = 450)
show(bar)
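To make the "scores greater than 2" cut-off explicit, the surviving features can be listed directly. This sketch reuses the scores and features_list variables from the cells above.
# Keep only the features whose -log10(p-value) score exceeds 2
selected = [(f, round(s, 2)) for f, s in zip(features_list[1:], scores) if s > 2]
print selected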
Finally, I used the feature_importances_ attribute of my decision tree classifier to select the four features (bonus, exercised_stock_options, fraction_to_poi and shared_receipt_with_poi) that maximized my F1 score. Running my final algorithm without the newly engineered feature fraction_to_poi dropped the F1 score by 74% (from 0.47547 to 0.12157), which shows the effect of this feature on the final algorithm's performance.
features_list = ["poi", "bonus", "exercised_stock_options",
"fraction_to_poi", "shared_receipt_with_poi"]
data = featureFormat(my_dataset, features_list, sort_keys = True)
labels, features = targetFeatureSplit(data)
# These are the feature importances I obtained from running my final classifier and testing it with tester.py.
# Check Algorithm Performance section below.
feature_importances = [0.13442256, 0.0433088, 0.48651064, 0.335758]
from bkcharts import Bar, show
data = {'importance':feature_importances, 'features':features_list[1:]}
bar = Bar(data, label='features', values='importance', title="Feature importances from fitted Decision Tree",
legend = None, plot_width = 400, plot_height = 400)
show(bar)
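As a rough cross-check (a sketch, not the evaluation used for the numbers above), the importances can also be obtained by fitting the final classifier once on the selected features; the exact values depend on the train/test splits used by tester_plus.
from sklearn.tree import DecisionTreeClassifier

dt = DecisionTreeClassifier(criterion='entropy', max_depth=2,
                            min_samples_split=2, min_samples_leaf=6)
dt.fit(features, labels)
# Pair each selected feature with its importance from this single fit
print zip(features_list[1:], dt.feature_importances_)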
Pick and Tune an Algorithm¶
What algorithm did you end up using? What other one(s) did you try? How did model performance differ between algorithms?
Pick an algorithm¶
Several algorithms were tried using the four selected features: bonus, exercised_stock_options, fraction_to_poi and shared_receipt_with_poi.
# Not restricting the maximum width in characters of a column
pd.options.display.max_colwidth = 0
data = {"Algorithms":["DecisionTreeClassifier",
"RandomForestClassifier",
"AdaBoostClassifier",
"GaussianNB"],
"Parameters":["criterion='entropy', max_depth =2, min_samples_split=2, min_samples_leaf=6",
"n_estimators=150, min_samples_split=5",
"n_estimators=150",
"Default"],
"Accuracy":[0.83962,0.85031,0.81823,0.82769],
"Precision":[ 0.47848,0.52246,0.40066,0.41215],
"Recall":[0.47250,0.31400,0.36600,0.28150],
"F1":[0.47547,0.39225,0.38255,0.33452],
"F2":[0.47368,0.34123,0.37244,0.30056]}
algorithms = pd.DataFrame(data, columns = ["Algorithms", "Parameters", "Accuracy", "Precision", "Recall", "F1", "F2"])
algorithms
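The metrics in the table were produced with tester_plus. A rough, self-contained way to compare the same classifiers on stratified shuffle splits would look something like the sketch below (older scikit-learn API, F1 only, fewer folds), so its numbers will not match the table exactly.
from sklearn.cross_validation import StratifiedShuffleSplit, cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.naive_bayes import GaussianNB

# 100 stratified random train/test splits over the POI labels
cv = StratifiedShuffleSplit(labels, n_iter=100, random_state=42)
candidates = [("DecisionTree", DecisionTreeClassifier(criterion='entropy', max_depth=2,
                                                      min_samples_split=2, min_samples_leaf=6)),
              ("RandomForest", RandomForestClassifier(n_estimators=150, min_samples_split=5)),
              ("AdaBoost", AdaBoostClassifier(n_estimators=150)),
              ("GaussianNB", GaussianNB())]
for name, model in candidates:
    f1_scores = cross_val_score(model, features, labels, cv=cv, scoring='f1')
    print name, round(f1_scores.mean(), 3)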
What does it mean to tune the parameters of an algorithm, and what can happen if you don’t do this well? How did you tune the parameters of your particular algorithm?
Hyperparameter optimization¶
Tuning the parameters of an algorithm means selecting a good, robust set of parameter values that optimize its performance. Default parameters are not tailored to the particular dataset and can result in poor performance. Scikit-learn provides two methods for parameter tuning / hyperparameter optimization: GridSearchCV and RandomizedSearchCV.
I used GridSearchCV to perform an exhaustive search over the parameter grid and find the best combination.
I used 'f1' as the "scoring" parameter so that the search balances false positives and false negatives.
For the "cv" parameter, I passed a cross-validation object (StratifiedShuffleSplit) so that the search results are validated in a way suited to the characteristics of this dataset.
For my final DecisionTreeClassifier I tuned the parameters as shown below:
# Imports needed in this cell (assuming the older scikit-learn API, which
# matches the StratifiedShuffleSplit(labels, n_iter=...) call below)
from sklearn.tree import DecisionTreeClassifier
from sklearn.cross_validation import StratifiedShuffleSplit
from sklearn.grid_search import GridSearchCV
# Specify parameters of the algorithm
clf_params= {'criterion': ['gini', 'entropy'],
'max_depth': [None, 1, 2, 5, 10],
'min_samples_split': [2, 3, 4, 5],
'min_samples_leaf': [1, 2, 3, 4, 5, 6, 7, 8]
}
# Specify algorithm
dt = DecisionTreeClassifier()
# GridSearchCV
cv = StratifiedShuffleSplit(labels, n_iter = 1000,random_state = 42)
clf = GridSearchCV(dt, param_grid = clf_params,cv = cv, scoring = 'f1')
clf.fit(features,labels)
# pick a winner
best_clf = clf.best_estimator_
print best_clf
max_depth limits how deep the tree can grow, i.e. when the splitting of decision tree nodes stops. min_samples_split controls the number of observations required in an internal node; if the threshold is not reached (e.g. a minimum of 5 people), the node is not split further. Very deep trees fit quirks in the training data and perform well on it, but perform worse on unseen test data; we want our model to generalize better.
Below is an example of how my F1 score varied as the 'max_depth' parameter of my final DecisionTreeClassifier changed, with 'min_samples_split' held constant at 5. The F1 scores were obtained using the tester.py module.
from bkcharts import Scatter, show
max_depth = [1,2,5,10,15,20]
F1_score = [0.06125, 0.21550, 0.39598, 0.41575, 0.41767, 0.42029]
data = {'max_depth':max_depth, 'F1-score':F1_score}
p = Scatter(data, x='max_depth', y='F1-score',
title="DecisionTreeClassifier(max_depth = x, min_samples_split=5)",
xlabel="max_depth", ylabel="F1-score",
plot_width = 500, plot_height = 400)
show(p)
Validate and Evaluate¶
What is validation, and what’s a classic mistake you can make if you do it wrong? How did you validate your analysis?
Validation is the strategy for evaluating the performance of a model on unseen data. A classic mistake is to evaluate an algorithm on the same data it was trained on: this makes the algorithm look better than it actually is, while telling us nothing about how it performs on unseen data.
It is essential practice to keep a subset of the data aside as holdout (test) data. We train the model on the training data and examine its generalization performance on the test data: we hide the true labels of the test data from the model, let it predict them, and then compare its predictions with the hidden true values. We can also use a more sophisticated holdout procedure called cross-validation.
In our case, I used a variation of k-fold cross-validation called StratifiedShuffleSplit. It creates many randomly chosen training and test sets (the data is shuffled and then split into a training/test pair each time) and the results are averaged over all of them. Stratification ensures that each training and test split has a class distribution (POI vs. non-POI) that reflects the overall data, which is well suited to our case because of the class imbalance (18 POIs vs. 128 non-POIs).
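A minimal sketch of this validation scheme with the older scikit-learn API; the fold count and test size here are illustrative, and the actual fitting and averaging is handled by tester_plus.
from sklearn.cross_validation import StratifiedShuffleSplit

# 1000 random, stratified train/test splits over the POI labels
sss = StratifiedShuffleSplit(labels, n_iter=1000, test_size=0.1, random_state=42)
for train_idx, test_idx in sss:
    # fit on train_idx, evaluate on test_idx, then average the metrics
    pass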
Give at least 2 evaluation metrics and your average performance for each of them. Explain an interpretation of your metrics that says something human-understandable about your algorithm’s performance.
There are a number of evaluation metrics for classification problems. For balanced classes, where all labels are roughly equally represented, classification accuracy is a reasonable measure of model performance.
$$Accuracy = \frac{Number\ of\ labels\ predicted\ correctly}{Total\ number\ of\ predictions} = \frac{True\ Positives + True\ Negatives}{Total\ Predictions}$$
However, accuracy is not ideal for skewed classes. In our case, the number of POIs is small compared to non-POIs (18 vs. 128): we could achieve high accuracy by classifying many non-POIs correctly without classifying a single POI correctly. For imbalanced classes like ours, precision and recall are the common measures of model performance.
$$Precision = \frac{True\ Positive}{True\ Positive + False\ Positive}\ \ \ \ \ Recall = \frac{True\ Positive}{True\ Positive + False\ Negative}$$
Good precision means that whenever a POI gets flagged in my test set, I can be confident that it is very likely a real POI and not a false alarm. Low precision indicates a large number of false positives, where non-POIs get flagged as POIs.
Good recall means I am able to identify nearly every POI that shows up in the test cases. Low recall indicates many false negatives, where POIs do not get flagged.
The F1 score conveys the balance between precision and recall: it is their harmonic mean.
$$F1\ score = \frac{2*(precision*recall)}{precision + recall}$$
A good F1 score means both false positives and false negatives are low: I can identify POIs reliably and accurately. If my classifier flags a POI, the person is almost certainly a POI, and if it does not flag someone, they are almost certainly not a POI.
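A toy illustration of these metrics with scikit-learn's metric functions; the labels below are made up and are not the project's results.
from sklearn.metrics import precision_score, recall_score, f1_score

y_true = [0, 0, 1, 1, 1, 0, 1, 0]   # 1 = POI
y_pred = [0, 1, 1, 0, 1, 0, 1, 0]
print precision_score(y_true, y_pred)  # 0.75: 3 of the 4 flagged people are real POIs
print recall_score(y_true, y_pred)     # 0.75: 3 of the 4 real POIs were flagged
print f1_score(y_true, y_pred)         # 0.75: harmonic mean of the two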
The F2 score is a weighted harmonic mean of precision and recall that weights recall more heavily than precision:
$$F2\ score = \frac{5*(precision*recall)}{4*precision + recall}$$
Algorithm Performance¶
from sklearn.tree import DecisionTreeClassifier
clf = DecisionTreeClassifier(criterion='entropy', max_depth =2, min_samples_split=2, min_samples_leaf=6)
dump_classifier_and_data(clf, my_dataset, features_list)
# Get confusion matrix (cm)
from tester_plus import test_classifier
cm = test_classifier(clf, my_dataset, features_list)
# Seaborn and matplotlib library
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
# View confusion matrix with a heatmap
sns.heatmap(cm, annot=True, fmt = 'd', cmap='Reds', xticklabels=['no', 'yes'], yticklabels=['no', 'yes'])
plt.ylabel('True label')
plt.xlabel('Predicted label')
plt.title('Confusion matrix for:\n{}'.format(clf.__class__.__name__));