Project Details
You are a business analyst consultant and your client is a new movie production company looking to make a new movie. The client wants to make sure it’s successful to help make a name for the new company. They are relying on you to help understand movie trends to help inform their decision making. They’ve given you guidance to look into three specific areas:

Question 1: How have movies based on novels performed relative to movies not based on novels?
Question 2: How do the attributes differ between Universal Pictures and Paramount Pictures?
Question 3: How have movie genres changed over time?

In this blog post, I will clean the data and get the dataset ready for Tableau visualization. In the second part of the post, I will attempt to answer these questions using Tableau Dashboard and Stories.

poster_tmdb
Source:- TheMovieDB

Getting Started¶

In [1]:

# Import the libraries needed
import pandas as pd
import numpy as np
import re
import calendar

The original data used for the analysis can be downloaded from the following link. When I tried to open the file directly using pandas read_csv module, I ran into following error- "UnicodeDecodeError: 'utf-8' codec can't decode byte 0xcc in position 12: invalid continuation byte". So I had to open the file in Sublime Text and save it with utf-8 encoding. This allowed me to open the file using pandas read_csv module.

In [3]:

# Load the data into a pandas dataframe
movies = pd.read_csv("data/movies.csv")
movies.head()

Out[3]:

	id	imdb_id	popularity	budget	revenue	original_title	cast	homepage	director	tagline	...	overview	runtime	genres	production_companies	release_date	vote_count	vote_average	release_year	budget_adj	revenue_adj
0	135397	tt0369610	32.985763	150000000	1513528810	Jurassic World	Chris Pratt\|Bryce Dallas Howard\|Irrfan Khan\|Vi...	http://www.jurassicworld.com/	Colin Trevorrow	The park is open.	...	Twenty-two years after the events of Jurassic ...	124	Action\|Adventure\|Science Fiction\|Thriller	Universal Studios\|Amblin Entertainment\|Legenda...	2015-06-09	5562	6.5	2015	1.379999e+08	1.392446e+09
1	76341	tt1392190	28.419936	150000000	378436354	Mad Max: Fury Road	Tom Hardy\|Charlize Theron\|Hugh Keays-Byrne\|Nic...	http://www.madmaxmovie.com/	George Miller	What a Lovely Day.	...	An apocalyptic story set in the furthest reach...	120	Action\|Adventure\|Science Fiction\|Thriller	Village Roadshow Pictures\|Kennedy Miller Produ...	2015-05-13	6185	7.1	2015	1.379999e+08	3.481613e+08
2	262500	tt2908446	13.112507	110000000	295238201	Insurgent	Shailene Woodley\|Theo James\|Kate Winslet\|Ansel...	http://www.thedivergentseries.movie/#insurgent	Robert Schwentke	One Choice Can Destroy You	...	Beatrice Prior must confront her inner demons ...	119	Adventure\|Science Fiction\|Thriller	Summit Entertainment\|Mandeville Films\|Red Wago...	2015-03-18	2480	6.3	2015	1.012000e+08	2.716190e+08
3	140607	tt2488496	11.173104	200000000	2068178225	Star Wars: The Force Awakens	Harrison Ford\|Mark Hamill\|Carrie Fisher\|Adam D...	http://www.starwars.com/films/star-wars-episod...	J.J. Abrams	Every generation has a story.	...	Thirty years after defeating the Galactic Empi...	136	Action\|Adventure\|Science Fiction\|Fantasy	Lucasfilm\|Truenorth Productions\|Bad Robot	2015-12-15	5292	7.5	2015	1.839999e+08	1.902723e+09
4	168259	tt2820852	9.335014	190000000	1506249360	Furious 7	Vin Diesel\|Paul Walker\|Jason Statham\|Michelle ...	http://www.furious7.com/	James Wan	Vengeance Hits Home	...	Deckard Shaw seeks revenge against Dominic Tor...	137	Action\|Crime\|Thriller	Universal Pictures\|Original Film\|Media Rights ...	2015-04-01	2947	7.3	2015	1.747999e+08	1.385749e+09

5 rows × 21 columns

Data clean up¶

If we look at the dataset the information needed to answer the questions isn't readily available. We need to do some preprocessing on relevant columns to dig out the information needed to answer the questions.
We will start by fixing the columns that pertain to the question we are trying to address.

Question 1: How have movies based on novels performed relative to movies not based on novels?
To address this question we need to find movies that are based on novels. This information is primarily available in keywords column and in tagline column. Instead of fixing the genres column for the entire dataset, I select few columns and make a subset of data to work with it.

In [389]:

movies_novels = movies[["id", "original_title", "keywords", "tagline"]].reset_index(drop = True)
movies_novels.head(10)

Out[389]:

	id	original_title	keywords	tagline
0	135397	Jurassic World	monster\|dna\|tyrannosaurus rex\|velociraptor\|island	The park is open.
1	76341	Mad Max: Fury Road	future\|chase\|post-apocalyptic\|dystopia\|australia	What a Lovely Day.
2	262500	Insurgent	based on novel\|revolution\|dystopia\|sequel\|dyst...	One Choice Can Destroy You
3	140607	Star Wars: The Force Awakens	android\|spaceship\|jedi\|space opera\|3d	Every generation has a story.
4	168259	Furious 7	car race\|speed\|revenge\|suspense\|car	Vengeance Hits Home
5	281957	The Revenant	father-son relationship\|rape\|based on novel\|mo...	(n. One who has returned, as if from the dead.)
6	87101	Terminator Genisys	saving the world\|artificial intelligence\|cybor...	Reset the future
7	286217	The Martian	based on novel\|mars\|nasa\|isolation\|botanist	Bring Him Home
8	211672	Minions	assistant\|aftercreditsstinger\|duringcreditssti...	Before Gru, they had a history of bad bosses
9	150540	Inside Out	dream\|cartoon\|imaginary friend\|animation\|kid	Meet the little voices inside your head.

As I went through the keywords column, I noticed that there are multiple phrases that could tell whether a movie is based on novel - 'based on novel', 'based on graphic novel', 'inspired by novel' etc. In order to list all of the possible phrases we could be dealing with, I write a regular expression script to search for the first occurrence of phrase with 'novel' in each row. The function returns either the phrase with 'novel' in it or None.

In [390]:

def find_novel_from_keywords(keywords):
    try:
        match = re.search("(\|{0,1}[\w\s]*novel[\w\s]*\|{0,1})", keywords)
        if match:
            return match.group(0)
        else:
            return None
    except TypeError:
        return None
    
phrases_with_novel = set()
for text in set(movies_novels["keywords"].tolist()):
    pattern = find_novel_from_keywords(text)
    phrases_with_novel.add(pattern)

print(phrases_with_novel)

{'|inspired by novel', '|novelist', '|stolen novel|', '|based on graphic novel', '|based on novel', '|based on novel|', '|based on graphic novel|', '|novelist|', 'based on novel', '|tell all novel|', None, 'based on graphic novel|', 'based on novel|'}

Hmm! there are several phrases with 'novel' in it, but only three phrases tell us if a movie is based on a novel. The three phrases are - 'based on novel', 'based on graphic novel', and 'inspired by novel'. Next, we will create a column based_on_novel_0 that looks for those three phrases from keywords column and returns True if the movie is novel based.

In [391]:

def find_novel_based_movie(keywords):
    try:
        match1 = re.search("(\|{0,1}based[\w\s]*novel\|{0,1})", keywords)
        match2 = re.search("(\|{0,1}inspired[\w\s]*novel\|{0,1})", keywords)
        if match1 or match2:
            return True
        else:
            return False
    except TypeError:
        return None
    
movies_novels["based_on_novel_0"] = movies_novels["keywords"].apply(find_novel_based_movie)  
movies_novels.head()

Out[391]:

	id	original_title	keywords	tagline	based_on_novel_0
0	135397	Jurassic World	monster\|dna\|tyrannosaurus rex\|velociraptor\|island	The park is open.	False
1	76341	Mad Max: Fury Road	future\|chase\|post-apocalyptic\|dystopia\|australia	What a Lovely Day.	False
2	262500	Insurgent	based on novel\|revolution\|dystopia\|sequel\|dyst...	One Choice Can Destroy You	True
3	140607	Star Wars: The Force Awakens	android\|spaceship\|jedi\|space opera\|3d	Every generation has a story.	False
4	168259	Furious 7	car race\|speed\|revenge\|suspense\|car	Vengeance Hits Home	False

It is also quite likely that information for whether a movie is based on novel is in tagline column. Let us search tagline column for any phrase with novel in it.

In [392]:

def find_novel_from_tagline(tagline):
    try:
        match = re.search("(.*novel.*)", tagline)
        if match:
            return match.group(0)
        else:
            return None
    except TypeError:
        return None
    
phrases_with_novel_1 = set()
for text in set(movies_novels["tagline"].tolist()):
    pattern = find_novel_from_tagline(text)
    phrases_with_novel_1.add(pattern)

print(phrases_with_novel_1)

{'Based on the best-selling novel', 'Based on the novel of Chico Xavier', 'Based on the novel by Henry James', 'The #1 novel of the year - now a motion picture!', None}

There were four movies based on novel for which the information was present in tagline column. Next, we will create a column based_on_novel_1 that looks for those phrases from tagline column and returns True if the movie is novel based.

In [393]:

def find_novel_based_movie_1(tagline):
    try:
        match = re.search("(.*novel.*)", tagline)
        if match:
            return True
        else:
            return False
    except TypeError:
        return None
    
movies_novels["based_on_novel_1"] = movies_novels["tagline"].apply(find_novel_based_movie_1)  
movies_novels.head()

Out[393]:

	id	original_title	keywords	tagline	based_on_novel_0	based_on_novel_1
0	135397	Jurassic World	monster\|dna\|tyrannosaurus rex\|velociraptor\|island	The park is open.	False	False
1	76341	Mad Max: Fury Road	future\|chase\|post-apocalyptic\|dystopia\|australia	What a Lovely Day.	False	False
2	262500	Insurgent	based on novel\|revolution\|dystopia\|sequel\|dyst...	One Choice Can Destroy You	True	False
3	140607	Star Wars: The Force Awakens	android\|spaceship\|jedi\|space opera\|3d	Every generation has a story.	False	False
4	168259	Furious 7	car race\|speed\|revenge\|suspense\|car	Vengeance Hits Home	False	False

We are only interested in finding whether a movie is based on novel. We are not concerned whether that information comes from keywords column or tagline column. We will create a single column based_on_novel from based_on_novel_0 and based_on_novel_1. At the same time, we will now drop columns we don't need anymore.

In [394]:

movies_novels["based_on_novel"] = movies_novels["based_on_novel_0"] + movies_novels["based_on_novel_1"]
movies_novels.drop(["keywords", "tagline", "based_on_novel_0", "based_on_novel_1"], axis = 1, inplace = True)
movies_novels.head()

Out[394]:

	id	original_title	based_on_novel
0	135397	Jurassic World	0
1	76341	Mad Max: Fury Road	0
2	262500	Insurgent	1
3	140607	Star Wars: The Force Awakens	0
4	168259	Furious 7	0

We will now convert the based_on_novel column from integers to boolean. Before that, we need to check the different integer values that are there in the column beside 0 and 1. It is quite likely that information for movie based on a novel is present on both the columns- keywords and tagline. As a result, some of the values for based_on_novel might be two. Let's check that out.

In [395]:

print(movies_novels["based_on_novel"].unique())

[0 1 nan 2]

In [396]:

movies_novels[movies_novels['based_on_novel'] == 2]

Out[396]:

	id	original_title	based_on_novel
10660	10671	Airport	2

In [397]:

movies.loc[movies['original_title'] == "Airport", ["original_title", "keywords", "tagline"]]

Out[397]:

	original_title	keywords	tagline
10660	Airport	bomb\|based on novel\|airport\|desperation\|snow s...	The #1 novel of the year - now a motion picture!

In [398]:

int_to_bool = {0: 'False', 1: 'True', 2: 'True'}
movies_novels['based_on_novel'] = movies_novels['based_on_novel'].apply(int_to_bool.get)
movies_novels.head()

Out[398]:

	id	original_title	based_on_novel
0	135397	Jurassic World	False
1	76341	Mad Max: Fury Road	False
2	262500	Insurgent	True
3	140607	Star Wars: The Force Awakens	False
4	168259	Furious 7	False

Now we have our new column based_on_novel in a format that can be used for Tableau visualization. Let's drop the keywords and tagline columns in the original dataset and merge the movies_novels data subsets with the original movie dataset using id and original_title.

In [399]:

movies.drop(["keywords", "tagline"], axis = 1, inplace = True)

# We will use the merge function in pandas to merge the scraped data subsets and original movies dataset. 
# The merge function in pandas is similar to join in SQL.
movies = pd.merge(movies,
                  movies_novels,
                  how = "left",
                  left_on=["id", "original_title"],
                  right_on=["id", "original_title"])

movies.head()

Out[399]:

	id	imdb_id	popularity	budget	revenue	original_title	cast	homepage	director	overview	runtime	genres	production_companies	release_date	vote_count	vote_average	release_year	budget_adj	revenue_adj	based_on_novel
0	135397	tt0369610	32.985763	150000000	1513528810	Jurassic World	Chris Pratt\|Bryce Dallas Howard\|Irrfan Khan\|Vi...	http://www.jurassicworld.com/	Colin Trevorrow	Twenty-two years after the events of Jurassic ...	124	Action\|Adventure\|Science Fiction\|Thriller	Universal Studios\|Amblin Entertainment\|Legenda...	2015-06-09	5562	6.5	2015	1.379999e+08	1.392446e+09	False
1	76341	tt1392190	28.419936	150000000	378436354	Mad Max: Fury Road	Tom Hardy\|Charlize Theron\|Hugh Keays-Byrne\|Nic...	http://www.madmaxmovie.com/	George Miller	An apocalyptic story set in the furthest reach...	120	Action\|Adventure\|Science Fiction\|Thriller	Village Roadshow Pictures\|Kennedy Miller Produ...	2015-05-13	6185	7.1	2015	1.379999e+08	3.481613e+08	False
2	262500	tt2908446	13.112507	110000000	295238201	Insurgent	Shailene Woodley\|Theo James\|Kate Winslet\|Ansel...	http://www.thedivergentseries.movie/#insurgent	Robert Schwentke	Beatrice Prior must confront her inner demons ...	119	Adventure\|Science Fiction\|Thriller	Summit Entertainment\|Mandeville Films\|Red Wago...	2015-03-18	2480	6.3	2015	1.012000e+08	2.716190e+08	True
3	140607	tt2488496	11.173104	200000000	2068178225	Star Wars: The Force Awakens	Harrison Ford\|Mark Hamill\|Carrie Fisher\|Adam D...	http://www.starwars.com/films/star-wars-episod...	J.J. Abrams	Thirty years after defeating the Galactic Empi...	136	Action\|Adventure\|Science Fiction\|Fantasy	Lucasfilm\|Truenorth Productions\|Bad Robot	2015-12-15	5292	7.5	2015	1.839999e+08	1.902723e+09	False
4	168259	tt2820852	9.335014	190000000	1506249360	Furious 7	Vin Diesel\|Paul Walker\|Jason Statham\|Michelle ...	http://www.furious7.com/	James Wan	Deckard Shaw seeks revenge against Dominic Tor...	137	Action\|Crime\|Thriller	Universal Pictures\|Original Film\|Media Rights ...	2015-04-01	2947	7.3	2015	1.747999e+08	1.385749e+09	False

Question 2: How do the attributes differ between Universal Pictures and Paramount Pictures?
To answer this question we need to tidy up the production_companies column. Each row of production_companies column contains several production companies separated by | and there can be a maximum of five companies. We are interested in finding out whether a movie was produced by Universal Pictures, or Paramount Pictures, or None.

In [400]:

movies_production_companies = movies[["id", "original_title", "production_companies"]].reset_index(drop = True)
movies_production_companies.head()

Out[400]:

	id	original_title	production_companies
0	135397	Jurassic World	Universal Studios\|Amblin Entertainment\|Legenda...
1	76341	Mad Max: Fury Road	Village Roadshow Pictures\|Kennedy Miller Produ...
2	262500	Insurgent	Summit Entertainment\|Mandeville Films\|Red Wago...
3	140607	Star Wars: The Force Awakens	Lucasfilm\|Truenorth Productions\|Bad Robot
4	168259	Furious 7	Universal Pictures\|Original Film\|Media Rights ...

As I skimmed through the production_companies column, I noticed that Universal Studios was not named consistently - Universal Studios, Universal Pictures, Universal etc. In order to find out all the possible names of Universal Studios we could be dealing with, I wrote a script to search for the first occurrence of phrase with 'Universal' in each row. The function returns either the phrase with word Universal in it or None.

In [401]:

def find_Universal(production_company):
    try:
        match = re.search("(\|{0,1}[\w\s]*Universal[\w\s]*\|{0,1})", production_company)
        if match:
            return match.group(0)
        else:
            return None
    except TypeError:
        return None
    
words_with_Universal = set()
for text in set(movies_production_companies["production_companies"].tolist()):
    pattern = find_Universal(text)
    words_with_Universal.add(pattern)

print(words_with_Universal)

{'Universal Pictures', 'Universal Cable Productions|', 'Universal Cartoon Studios|', 'Universal 1440 Entertainment', '|Universal Home Entertainment', '|Universal Studios Home Entertainment|', '|Universal International Pictures ', '|Universal Cartoon Studios|', '|Universal 1440 Entertainment', 'Universal', '|Universal City Studios|', '|Universal CGI|', 'Universal TV', '|Universal Pictures International ', 'Universal Pictures Germany GmbH', 'Universal TV|', '|Universal Pictures', 'Universal Pictures International ', '|Universal Studios Home Entertainment', '|Universal Television', 'Universal Pictures Corporation', '|Universal City Studios', '|Universal Music', 'NBC Universal Television|', '|NBC Universal Global Networks|', 'Universal Home Entertainment', 'Universal Pictures|', '|Universal Family and Home Entertainment', '|Universal Studios Sound Facilities', 'Universal Studios|', '|Universal|', '|Universal Pictures|', 'Universal Studios Home Entertainment Family Productions|', '|Universal Network Television|', 'Universal Studios', 'NBC Universal Television', 'Universal Productions France S', 'Universal Cartoon Studios', 'Universal Studios Home Entertainment', None, '|Universal 1440 Entertainment|', '|Universal Home Video'}

Huh! there are several phrases with Universal in it. I googled to figure out if these were all divisions of the same company. The wikipedia page states - "Universal Pictures (also referred to as Universal Studios or simply Universal) is an American film studio owned by Comcast through the Universal Filmed Entertainment Group division of its wholly owned subsidiary NBCUniversal." An extensive search revealed that these are all part of American multinational media conglomerate NBCUniversal.
I also wanted to go ahead and check if it was similar case for Paramount Pictures. I wrote a similar script to search for the first occurrence of phrase with 'Paramount' in each row.

In [402]:

def find_Paramount(production_company):
    try:
        match = re.search("(\|{0,1}[\w\s]*Paramount[\w\s]*\|{0,1})", production_company)
        if match:
            return match.group(0)
        else:
            return None
    except TypeError:
        return None
    
words_with_Paramount = set()
for text in set(movies_production_companies["production_companies"].tolist()):
    pattern = find_Paramount(text)
    words_with_Paramount.add(pattern)

print(words_with_Paramount)

{'|Paramount Television|', '|Paramount Pictures', 'Paramount Famous Productions', '|Paramount Classics', '|Paramount Home Entertainment', 'Paramount Vantage', 'Paramount Classics', 'Paramount Vantage|', '|Paramount Vantage|', 'Paramount Pictures|', 'Paramount Pictures', '|Paramount Classics|', 'Paramount|', '|Paramount Vantage', '|Paramount Pictures Digital Entertainment|', None, '|Paramount Animation', 'Paramount Home Entertainment', 'Paramount Pictures Digital Entertainment'}

As stated in Wikipedia - "Paramount Pictures Corporation (also known as Paramount Pictures and simply Paramount) is an American film studio based in Hollywood, California, that has been a subsidiary of the American media conglomerate Viacom since 1994." These above-listed names are all subsidiaries of the same company.
Now we can write some code to tag all the companies with Universal in their names as Universal and all the companies with Paramount in their names as Paramount, and neither as None for our purpose. To do this, I used a regular expression that searches for the first occurrence of Universal or Paramount in each row of production_companies column.

In [403]:

def find_Universal_or_Paramount(production_company):
    try:
        universal = re.search("(\|{0,1}[\w\s]*Universal[\w\s]*\|{0,1})", production_company)
        paramount = re.search("(\|{0,1}[\w\s]*Paramount[\w\s]*\|{0,1})", production_company)
        if universal:
            return "Universal"
        elif paramount:
            return "Paramount"
        else:
            return None
    except TypeError:
        return None

# Transform the original 'production_companies' to only keep relevant information
movies["production_companies"] = movies["production_companies"].apply(
    find_Universal_or_Paramount)

# We can see that the production_companies column has been transformed in the original dataset
movies.head()

Out[403]:

	id	imdb_id	popularity	budget	revenue	original_title	cast	homepage	director	overview	runtime	genres	production_companies	release_date	vote_count	vote_average	release_year	budget_adj	revenue_adj	based_on_novel
0	135397	tt0369610	32.985763	150000000	1513528810	Jurassic World	Chris Pratt\|Bryce Dallas Howard\|Irrfan Khan\|Vi...	http://www.jurassicworld.com/	Colin Trevorrow	Twenty-two years after the events of Jurassic ...	124	Action\|Adventure\|Science Fiction\|Thriller	Universal	2015-06-09	5562	6.5	2015	1.379999e+08	1.392446e+09	False
1	76341	tt1392190	28.419936	150000000	378436354	Mad Max: Fury Road	Tom Hardy\|Charlize Theron\|Hugh Keays-Byrne\|Nic...	http://www.madmaxmovie.com/	George Miller	An apocalyptic story set in the furthest reach...	120	Action\|Adventure\|Science Fiction\|Thriller	None	2015-05-13	6185	7.1	2015	1.379999e+08	3.481613e+08	False
2	262500	tt2908446	13.112507	110000000	295238201	Insurgent	Shailene Woodley\|Theo James\|Kate Winslet\|Ansel...	http://www.thedivergentseries.movie/#insurgent	Robert Schwentke	Beatrice Prior must confront her inner demons ...	119	Adventure\|Science Fiction\|Thriller	None	2015-03-18	2480	6.3	2015	1.012000e+08	2.716190e+08	True
3	140607	tt2488496	11.173104	200000000	2068178225	Star Wars: The Force Awakens	Harrison Ford\|Mark Hamill\|Carrie Fisher\|Adam D...	http://www.starwars.com/films/star-wars-episod...	J.J. Abrams	Thirty years after defeating the Galactic Empi...	136	Action\|Adventure\|Science Fiction\|Fantasy	None	2015-12-15	5292	7.5	2015	1.839999e+08	1.902723e+09	False
4	168259	tt2820852	9.335014	190000000	1506249360	Furious 7	Vin Diesel\|Paul Walker\|Jason Statham\|Michelle ...	http://www.furious7.com/	James Wan	Deckard Shaw seeks revenge against Dominic Tor...	137	Action\|Crime\|Thriller	Universal	2015-04-01	2947	7.3	2015	1.747999e+08	1.385749e+09	False

While we are here, let us also extract day, month, and day of the week information from the date column. We will create these three new columns in the original movies dataset itself. These attributes can be used for exploring the seasonal effect on movie's performance.

In [404]:

movies["release_month"] = pd.DatetimeIndex(movies["release_date"]).month
movies["release_month"] = movies["release_month"].apply(lambda x: calendar.month_abbr[x])
movies["release_day"] = pd.DatetimeIndex(movies["release_date"]).day
movies["release_dayofweek"] = pd.DatetimeIndex(movies["release_date"]).weekday_name
movies.head()

Out[404]:

	id	imdb_id	popularity	budget	revenue	original_title	cast	homepage	director	overview	...	release_date	vote_count	vote_average	release_year	budget_adj	revenue_adj	based_on_novel	release_month	release_day	release_dayofweek
0	135397	tt0369610	32.985763	150000000	1513528810	Jurassic World	Chris Pratt\|Bryce Dallas Howard\|Irrfan Khan\|Vi...	http://www.jurassicworld.com/	Colin Trevorrow	Twenty-two years after the events of Jurassic ...	...	2015-06-09	5562	6.5	2015	1.379999e+08	1.392446e+09	False	Jun	9	Tuesday
1	76341	tt1392190	28.419936	150000000	378436354	Mad Max: Fury Road	Tom Hardy\|Charlize Theron\|Hugh Keays-Byrne\|Nic...	http://www.madmaxmovie.com/	George Miller	An apocalyptic story set in the furthest reach...	...	2015-05-13	6185	7.1	2015	1.379999e+08	3.481613e+08	False	May	13	Wednesday
2	262500	tt2908446	13.112507	110000000	295238201	Insurgent	Shailene Woodley\|Theo James\|Kate Winslet\|Ansel...	http://www.thedivergentseries.movie/#insurgent	Robert Schwentke	Beatrice Prior must confront her inner demons ...	...	2015-03-18	2480	6.3	2015	1.012000e+08	2.716190e+08	True	Mar	18	Wednesday
3	140607	tt2488496	11.173104	200000000	2068178225	Star Wars: The Force Awakens	Harrison Ford\|Mark Hamill\|Carrie Fisher\|Adam D...	http://www.starwars.com/films/star-wars-episod...	J.J. Abrams	Thirty years after defeating the Galactic Empi...	...	2015-12-15	5292	7.5	2015	1.839999e+08	1.902723e+09	False	Dec	15	Tuesday
4	168259	tt2820852	9.335014	190000000	1506249360	Furious 7	Vin Diesel\|Paul Walker\|Jason Statham\|Michelle ...	http://www.furious7.com/	James Wan	Deckard Shaw seeks revenge against Dominic Tor...	...	2015-04-01	2947	7.3	2015	1.747999e+08	1.385749e+09	False	Apr	1	Wednesday

5 rows × 23 columns

We will also calculate two important metrics that we will use later for Tableau visualization.

In [405]:

movies["profit"] = movies["revenue"] - movies["budget"]
movies["profit (%)"] = (movies["revenue"] - movies["budget"])*100/movies["budget"]
movies["profit_adj"] = movies["revenue_adj"] - movies["budget_adj"]

In [406]:

movies.head()

Out[406]:

	id	imdb_id	popularity	budget	revenue	original_title	cast	homepage	director	overview	...	release_year	budget_adj	revenue_adj	based_on_novel	release_month	release_day	release_dayofweek	profit	profit (%)	profit_adj
0	135397	tt0369610	32.985763	150000000	1513528810	Jurassic World	Chris Pratt\|Bryce Dallas Howard\|Irrfan Khan\|Vi...	http://www.jurassicworld.com/	Colin Trevorrow	Twenty-two years after the events of Jurassic ...	...	2015	1.379999e+08	1.392446e+09	False	Jun	9	Tuesday	1363528810	909.019207	1.254446e+09
1	76341	tt1392190	28.419936	150000000	378436354	Mad Max: Fury Road	Tom Hardy\|Charlize Theron\|Hugh Keays-Byrne\|Nic...	http://www.madmaxmovie.com/	George Miller	An apocalyptic story set in the furthest reach...	...	2015	1.379999e+08	3.481613e+08	False	May	13	Wednesday	228436354	152.290903	2.101614e+08
2	262500	tt2908446	13.112507	110000000	295238201	Insurgent	Shailene Woodley\|Theo James\|Kate Winslet\|Ansel...	http://www.thedivergentseries.movie/#insurgent	Robert Schwentke	Beatrice Prior must confront her inner demons ...	...	2015	1.012000e+08	2.716190e+08	True	Mar	18	Wednesday	185238201	168.398365	1.704191e+08
3	140607	tt2488496	11.173104	200000000	2068178225	Star Wars: The Force Awakens	Harrison Ford\|Mark Hamill\|Carrie Fisher\|Adam D...	http://www.starwars.com/films/star-wars-episod...	J.J. Abrams	Thirty years after defeating the Galactic Empi...	...	2015	1.839999e+08	1.902723e+09	False	Dec	15	Tuesday	1868178225	934.089113	1.718723e+09
4	168259	tt2820852	9.335014	190000000	1506249360	Furious 7	Vin Diesel\|Paul Walker\|Jason Statham\|Michelle ...	http://www.furious7.com/	James Wan	Deckard Shaw seeks revenge against Dominic Tor...	...	2015	1.747999e+08	1.385749e+09	False	Apr	1	Wednesday	1316249360	692.762821	1.210949e+09

5 rows × 26 columns

Write the dataframe to csv file for Tableau visualization.

In [407]:

movies.to_csv('data/movies_cleaned.csv',
             index = False)

Question 3: How have movie genres changed over time?
Problem: The genres column consists of multiple genres in the same cell separated by "|". To tidy this data up, we need to have one genre per row associated with their respective id. Instead of fixing the genres column for the entire dataset, I select few columns and make a subset of data to work with it.

In [408]:

movies_genres = movies[["id", "original_title", "genres"]].reset_index(drop = True)
movies_genres.head()

Out[408]:

	id	original_title	genres
0	135397	Jurassic World	Action\|Adventure\|Science Fiction\|Thriller
1	76341	Mad Max: Fury Road	Action\|Adventure\|Science Fiction\|Thriller
2	262500	Insurgent	Adventure\|Science Fiction\|Thriller
3	140607	Star Wars: The Force Awakens	Action\|Adventure\|Science Fiction\|Fantasy
4	168259	Furious 7	Action\|Crime\|Thriller

First, we separate multiple genres associated with each movie into multiple columns. We know from the data specifications that there can be a maximum of five genres for each movie, so we will separate the genres column into five columns.

In [409]:

movies_genres[['genre1', 'genre2', 'genre3', 'genre4', 'genre5']] = movies_genres['genres'].str.split(
    "|", expand = True)
del movies_genres['genres']

The genres variable is now spread out across multiple columns and the data is in wide format. As a final tidying step, we will use pandas melt function to convert the dataset from wide to tall format.

In [410]:

movies_genres = pd.melt(movies_genres, id_vars = ["id", "original_title"],
                       value_name = "genres", var_name = "genre_n")

# Drop rows with no values in genres column
movies_genres.dropna(axis = 0, subset = ['genres'], inplace = True)

movies_genres.head()

Out[410]:

	id	original_title	genre_n	genres
0	135397	Jurassic World	genre1	Action
1	76341	Mad Max: Fury Road	genre1	Action
2	262500	Insurgent	genre1	Adventure
3	140607	Star Wars: The Force Awakens	genre1	Action
4	168259	Furious 7	genre1	Action

In [411]:

movies[movies["original_title"] == "3 Ninjas"]

Out[411]:

	id	imdb_id	popularity	budget	revenue	original_title	cast	homepage	director	overview	...	release_year	budget_adj	revenue_adj	based_on_novel	release_month	release_day	release_dayofweek	profit	profit (%)	profit_adj
8342	16314	tt0103596	0.233105	0	0	3 Ninjas	Victor Wong\|Michael Treanor\|Max Elliott Slade\|...	NaN	Jon Turteltaub	Each year, three brothers Samuel, Jeffrey and ...	...	1992	0.0	0.0	False	Aug	7	Friday	0	NaN	0.0

1 rows × 26 columns

In [412]:

# Delete the original genres column in the movies dataset
del movies["genres"]

# Merge the dataframes
movies_cleaned_genres = pd.merge(movies,
                                 movies_genres,
                                 how = "right",
                                 left_on=["id", "original_title"],
                                 right_on=["id", "original_title"])

# Write the dataframe to csv file for Tableau visualization.
movies_cleaned_genres.to_csv('data/movies_cleaned_genres.csv',
                             index = False)