Project Details
You are a business analyst consultant and your client is a new movie production company looking to make a new movie. The client wants to make sure it’s successful to help make a name for the new company. They are relying on you to help understand movie trends to help inform their decision making. They’ve given you guidance to look into three specific areas:

Question 1: How have movies based on novels performed relative to movies not based on novels?
Question 2: How do the attributes differ between Universal Pictures and Paramount Pictures?
Question 3: How have movie genres changed over time?

In this blog post, I will clean the data and get the dataset ready for Tableau visualization. In the second part of the post, I will attempt to answer these questions using Tableau Dashboard and Stories.

poster_tmdb
Source:- TheMovieDB

Getting Started

In [1]:
# Import the libraries needed
import pandas as pd
import numpy as np
import re
import calendar

The original data used for the analysis can be downloaded from the following link. When I tried to open the file directly using pandas read_csv module, I ran into following error- "UnicodeDecodeError: 'utf-8' codec can't decode byte 0xcc in position 12: invalid continuation byte". So I had to open the file in Sublime Text and save it with utf-8 encoding. This allowed me to open the file using pandas read_csv module.

In [3]:
# Load the data into a pandas dataframe
movies = pd.read_csv("data/movies.csv")
movies.head()
Out[3]:
id imdb_id popularity budget revenue original_title cast homepage director tagline ... overview runtime genres production_companies release_date vote_count vote_average release_year budget_adj revenue_adj
0 135397 tt0369610 32.985763 150000000 1513528810 Jurassic World Chris Pratt|Bryce Dallas Howard|Irrfan Khan|Vi... http://www.jurassicworld.com/ Colin Trevorrow The park is open. ... Twenty-two years after the events of Jurassic ... 124 Action|Adventure|Science Fiction|Thriller Universal Studios|Amblin Entertainment|Legenda... 2015-06-09 5562 6.5 2015 1.379999e+08 1.392446e+09
1 76341 tt1392190 28.419936 150000000 378436354 Mad Max: Fury Road Tom Hardy|Charlize Theron|Hugh Keays-Byrne|Nic... http://www.madmaxmovie.com/ George Miller What a Lovely Day. ... An apocalyptic story set in the furthest reach... 120 Action|Adventure|Science Fiction|Thriller Village Roadshow Pictures|Kennedy Miller Produ... 2015-05-13 6185 7.1 2015 1.379999e+08 3.481613e+08
2 262500 tt2908446 13.112507 110000000 295238201 Insurgent Shailene Woodley|Theo James|Kate Winslet|Ansel... http://www.thedivergentseries.movie/#insurgent Robert Schwentke One Choice Can Destroy You ... Beatrice Prior must confront her inner demons ... 119 Adventure|Science Fiction|Thriller Summit Entertainment|Mandeville Films|Red Wago... 2015-03-18 2480 6.3 2015 1.012000e+08 2.716190e+08
3 140607 tt2488496 11.173104 200000000 2068178225 Star Wars: The Force Awakens Harrison Ford|Mark Hamill|Carrie Fisher|Adam D... http://www.starwars.com/films/star-wars-episod... J.J. Abrams Every generation has a story. ... Thirty years after defeating the Galactic Empi... 136 Action|Adventure|Science Fiction|Fantasy Lucasfilm|Truenorth Productions|Bad Robot 2015-12-15 5292 7.5 2015 1.839999e+08 1.902723e+09
4 168259 tt2820852 9.335014 190000000 1506249360 Furious 7 Vin Diesel|Paul Walker|Jason Statham|Michelle ... http://www.furious7.com/ James Wan Vengeance Hits Home ... Deckard Shaw seeks revenge against Dominic Tor... 137 Action|Crime|Thriller Universal Pictures|Original Film|Media Rights ... 2015-04-01 2947 7.3 2015 1.747999e+08 1.385749e+09

5 rows × 21 columns

Data clean up

If we look at the dataset the information needed to answer the questions isn't readily available. We need to do some preprocessing on relevant columns to dig out the information needed to answer the questions.
We will start by fixing the columns that pertain to the question we are trying to address.

Question 1: How have movies based on novels performed relative to movies not based on novels?
To address this question we need to find movies that are based on novels. This information is primarily available in keywords column and in tagline column. Instead of fixing the genres column for the entire dataset, I select few columns and make a subset of data to work with it.

In [389]:
movies_novels = movies[["id", "original_title", "keywords", "tagline"]].reset_index(drop = True)
movies_novels.head(10)
Out[389]:
id original_title keywords tagline
0 135397 Jurassic World monster|dna|tyrannosaurus rex|velociraptor|island The park is open.
1 76341 Mad Max: Fury Road future|chase|post-apocalyptic|dystopia|australia What a Lovely Day.
2 262500 Insurgent based on novel|revolution|dystopia|sequel|dyst... One Choice Can Destroy You
3 140607 Star Wars: The Force Awakens android|spaceship|jedi|space opera|3d Every generation has a story.
4 168259 Furious 7 car race|speed|revenge|suspense|car Vengeance Hits Home
5 281957 The Revenant father-son relationship|rape|based on novel|mo... (n. One who has returned, as if from the dead.)
6 87101 Terminator Genisys saving the world|artificial intelligence|cybor... Reset the future
7 286217 The Martian based on novel|mars|nasa|isolation|botanist Bring Him Home
8 211672 Minions assistant|aftercreditsstinger|duringcreditssti... Before Gru, they had a history of bad bosses
9 150540 Inside Out dream|cartoon|imaginary friend|animation|kid Meet the little voices inside your head.

As I went through the keywords column, I noticed that there are multiple phrases that could tell whether a movie is based on novel - 'based on novel', 'based on graphic novel', 'inspired by novel' etc. In order to list all of the possible phrases we could be dealing with, I write a regular expression script to search for the first occurrence of phrase with 'novel' in each row. The function returns either the phrase with 'novel' in it or None.

In [390]:
def find_novel_from_keywords(keywords):
    try:
        match = re.search("(\|{0,1}[\w\s]*novel[\w\s]*\|{0,1})", keywords)
        if match:
            return match.group(0)
        else:
            return None
    except TypeError:
        return None
    
phrases_with_novel = set()
for text in set(movies_novels["keywords"].tolist()):
    pattern = find_novel_from_keywords(text)
    phrases_with_novel.add(pattern)

print(phrases_with_novel)
{'|inspired by novel', '|novelist', '|stolen novel|', '|based on graphic novel', '|based on novel', '|based on novel|', '|based on graphic novel|', '|novelist|', 'based on novel', '|tell all novel|', None, 'based on graphic novel|', 'based on novel|'}

Hmm! there are several phrases with 'novel' in it, but only three phrases tell us if a movie is based on a novel. The three phrases are - 'based on novel', 'based on graphic novel', and 'inspired by novel'. Next, we will create a column based_on_novel_0 that looks for those three phrases from keywords column and returns True if the movie is novel based.

In [391]:
def find_novel_based_movie(keywords):
    try:
        match1 = re.search("(\|{0,1}based[\w\s]*novel\|{0,1})", keywords)
        match2 = re.search("(\|{0,1}inspired[\w\s]*novel\|{0,1})", keywords)
        if match1 or match2:
            return True
        else:
            return False
    except TypeError:
        return None
    
movies_novels["based_on_novel_0"] = movies_novels["keywords"].apply(find_novel_based_movie)  
movies_novels.head()
Out[391]:
id original_title keywords tagline based_on_novel_0
0 135397 Jurassic World monster|dna|tyrannosaurus rex|velociraptor|island The park is open. False
1 76341 Mad Max: Fury Road future|chase|post-apocalyptic|dystopia|australia What a Lovely Day. False
2 262500 Insurgent based on novel|revolution|dystopia|sequel|dyst... One Choice Can Destroy You True
3 140607 Star Wars: The Force Awakens android|spaceship|jedi|space opera|3d Every generation has a story. False
4 168259 Furious 7 car race|speed|revenge|suspense|car Vengeance Hits Home False

It is also quite likely that information for whether a movie is based on novel is in tagline column. Let us search tagline column for any phrase with novel in it.

In [392]:
def find_novel_from_tagline(tagline):
    try:
        match = re.search("(.*novel.*)", tagline)
        if match:
            return match.group(0)
        else:
            return None
    except TypeError:
        return None
    
phrases_with_novel_1 = set()
for text in set(movies_novels["tagline"].tolist()):
    pattern = find_novel_from_tagline(text)
    phrases_with_novel_1.add(pattern)

print(phrases_with_novel_1)
{'Based on the best-selling novel', 'Based on the novel of Chico Xavier', 'Based on the novel by Henry James', 'The #1 novel of the year - now a motion picture!', None}

There were four movies based on novel for which the information was present in tagline column. Next, we will create a column based_on_novel_1 that looks for those phrases from tagline column and returns True if the movie is novel based.

In [393]:
def find_novel_based_movie_1(tagline):
    try:
        match = re.search("(.*novel.*)", tagline)
        if match:
            return True
        else:
            return False
    except TypeError:
        return None
    
movies_novels["based_on_novel_1"] = movies_novels["tagline"].apply(find_novel_based_movie_1)  
movies_novels.head()
Out[393]:
id original_title keywords tagline based_on_novel_0 based_on_novel_1
0 135397 Jurassic World monster|dna|tyrannosaurus rex|velociraptor|island The park is open. False False
1 76341 Mad Max: Fury Road future|chase|post-apocalyptic|dystopia|australia What a Lovely Day. False False
2 262500 Insurgent based on novel|revolution|dystopia|sequel|dyst... One Choice Can Destroy You True False
3 140607 Star Wars: The Force Awakens android|spaceship|jedi|space opera|3d Every generation has a story. False False
4 168259 Furious 7 car race|speed|revenge|suspense|car Vengeance Hits Home False False

We are only interested in finding whether a movie is based on novel. We are not concerned whether that information comes from keywords column or tagline column. We will create a single column based_on_novel from based_on_novel_0 and based_on_novel_1. At the same time, we will now drop columns we don't need anymore.

In [394]:
movies_novels["based_on_novel"] = movies_novels["based_on_novel_0"] + movies_novels["based_on_novel_1"]
movies_novels.drop(["keywords", "tagline", "based_on_novel_0", "based_on_novel_1"], axis = 1, inplace = True)
movies_novels.head()
Out[394]:
id original_title based_on_novel
0 135397 Jurassic World 0
1 76341 Mad Max: Fury Road 0
2 262500 Insurgent 1
3 140607 Star Wars: The Force Awakens 0
4 168259 Furious 7 0

We will now convert the based_on_novel column from integers to boolean. Before that, we need to check the different integer values that are there in the column beside 0 and 1. It is quite likely that information for movie based on a novel is present on both the columns- keywords and tagline. As a result, some of the values for based_on_novel might be two. Let's check that out.

In [395]:
print(movies_novels["based_on_novel"].unique())
[0 1 nan 2]
In [396]:
movies_novels[movies_novels['based_on_novel'] == 2]
Out[396]:
id original_title based_on_novel
10660 10671 Airport 2
In [397]:
movies.loc[movies['original_title'] == "Airport", ["original_title", "keywords", "tagline"]]
Out[397]:
original_title keywords tagline
10660 Airport bomb|based on novel|airport|desperation|snow s... The #1 novel of the year - now a motion picture!
In [398]:
int_to_bool = {0: 'False', 1: 'True', 2: 'True'}
movies_novels['based_on_novel'] = movies_novels['based_on_novel'].apply(int_to_bool.get)
movies_novels.head()
Out[398]:
id original_title based_on_novel
0 135397 Jurassic World False
1 76341 Mad Max: Fury Road False
2 262500 Insurgent True
3 140607 Star Wars: The Force Awakens False
4 168259 Furious 7 False

Now we have our new column based_on_novel in a format that can be used for Tableau visualization. Let's drop the keywords and tagline columns in the original dataset and merge the movies_novels data subsets with the original movie dataset using id and original_title.

In [399]:
movies.drop(["keywords", "tagline"], axis = 1, inplace = True)

# We will use the merge function in pandas to merge the scraped data subsets and original movies dataset. 
# The merge function in pandas is similar to join in SQL.
movies = pd.merge(movies,
                  movies_novels,
                  how = "left",
                  left_on=["id", "original_title"],
                  right_on=["id", "original_title"])

movies.head()
Out[399]:
id imdb_id popularity budget revenue original_title cast homepage director overview runtime genres production_companies release_date vote_count vote_average release_year budget_adj revenue_adj based_on_novel
0 135397 tt0369610 32.985763 150000000 1513528810 Jurassic World Chris Pratt|Bryce Dallas Howard|Irrfan Khan|Vi... http://www.jurassicworld.com/ Colin Trevorrow Twenty-two years after the events of Jurassic ... 124 Action|Adventure|Science Fiction|Thriller Universal Studios|Amblin Entertainment|Legenda... 2015-06-09 5562 6.5 2015 1.379999e+08 1.392446e+09 False
1 76341 tt1392190 28.419936 150000000 378436354 Mad Max: Fury Road Tom Hardy|Charlize Theron|Hugh Keays-Byrne|Nic... http://www.madmaxmovie.com/ George Miller An apocalyptic story set in the furthest reach... 120 Action|Adventure|Science Fiction|Thriller Village Roadshow Pictures|Kennedy Miller Produ... 2015-05-13 6185 7.1 2015 1.379999e+08 3.481613e+08 False
2 262500 tt2908446 13.112507 110000000 295238201 Insurgent Shailene Woodley|Theo James|Kate Winslet|Ansel... http://www.thedivergentseries.movie/#insurgent Robert Schwentke Beatrice Prior must confront her inner demons ... 119 Adventure|Science Fiction|Thriller Summit Entertainment|Mandeville Films|Red Wago... 2015-03-18 2480 6.3 2015 1.012000e+08 2.716190e+08 True
3 140607 tt2488496 11.173104 200000000 2068178225 Star Wars: The Force Awakens Harrison Ford|Mark Hamill|Carrie Fisher|Adam D... http://www.starwars.com/films/star-wars-episod... J.J. Abrams Thirty years after defeating the Galactic Empi... 136 Action|Adventure|Science Fiction|Fantasy Lucasfilm|Truenorth Productions|Bad Robot 2015-12-15 5292 7.5 2015 1.839999e+08 1.902723e+09 False
4 168259 tt2820852 9.335014 190000000 1506249360 Furious 7 Vin Diesel|Paul Walker|Jason Statham|Michelle ... http://www.furious7.com/ James Wan Deckard Shaw seeks revenge against Dominic Tor... 137 Action|Crime|Thriller Universal Pictures|Original Film|Media Rights ... 2015-04-01 2947 7.3 2015 1.747999e+08 1.385749e+09 False

Question 2: How do the attributes differ between Universal Pictures and Paramount Pictures?
To answer this question we need to tidy up the production_companies column. Each row of production_companies column contains several production companies separated by | and there can be a maximum of five companies. We are interested in finding out whether a movie was produced by Universal Pictures, or Paramount Pictures, or None.

In [400]:
movies_production_companies = movies[["id", "original_title", "production_companies"]].reset_index(drop = True)
movies_production_companies.head()
Out[400]:
id original_title production_companies
0 135397 Jurassic World Universal Studios|Amblin Entertainment|Legenda...
1 76341 Mad Max: Fury Road Village Roadshow Pictures|Kennedy Miller Produ...
2 262500 Insurgent Summit Entertainment|Mandeville Films|Red Wago...
3 140607 Star Wars: The Force Awakens Lucasfilm|Truenorth Productions|Bad Robot
4 168259 Furious 7 Universal Pictures|Original Film|Media Rights ...

As I skimmed through the production_companies column, I noticed that Universal Studios was not named consistently - Universal Studios, Universal Pictures, Universal etc. In order to find out all the possible names of Universal Studios we could be dealing with, I wrote a script to search for the first occurrence of phrase with 'Universal' in each row. The function returns either the phrase with word Universal in it or None.

In [401]:
def find_Universal(production_company):
    try:
        match = re.search("(\|{0,1}[\w\s]*Universal[\w\s]*\|{0,1})", production_company)
        if match:
            return match.group(0)
        else:
            return None
    except TypeError:
        return None
    
words_with_Universal = set()
for text in set(movies_production_companies["production_companies"].tolist()):
    pattern = find_Universal(text)
    words_with_Universal.add(pattern)

print(words_with_Universal)
{'Universal Pictures', 'Universal Cable Productions|', 'Universal Cartoon Studios|', 'Universal 1440 Entertainment', '|Universal Home Entertainment', '|Universal Studios Home Entertainment|', '|Universal International Pictures ', '|Universal Cartoon Studios|', '|Universal 1440 Entertainment', 'Universal', '|Universal City Studios|', '|Universal CGI|', 'Universal TV', '|Universal Pictures International ', 'Universal Pictures Germany GmbH', 'Universal TV|', '|Universal Pictures', 'Universal Pictures International ', '|Universal Studios Home Entertainment', '|Universal Television', 'Universal Pictures Corporation', '|Universal City Studios', '|Universal Music', 'NBC Universal Television|', '|NBC Universal Global Networks|', 'Universal Home Entertainment', 'Universal Pictures|', '|Universal Family and Home Entertainment', '|Universal Studios Sound Facilities', 'Universal Studios|', '|Universal|', '|Universal Pictures|', 'Universal Studios Home Entertainment Family Productions|', '|Universal Network Television|', 'Universal Studios', 'NBC Universal Television', 'Universal Productions France S', 'Universal Cartoon Studios', 'Universal Studios Home Entertainment', None, '|Universal 1440 Entertainment|', '|Universal Home Video'}

Huh! there are several phrases with Universal in it. I googled to figure out if these were all divisions of the same company. The wikipedia page states - "Universal Pictures (also referred to as Universal Studios or simply Universal) is an American film studio owned by Comcast through the Universal Filmed Entertainment Group division of its wholly owned subsidiary NBCUniversal." An extensive search revealed that these are all part of American multinational media conglomerate NBCUniversal.
I also wanted to go ahead and check if it was similar case for Paramount Pictures. I wrote a similar script to search for the first occurrence of phrase with 'Paramount' in each row.

In [402]:
def find_Paramount(production_company):
    try:
        match = re.search("(\|{0,1}[\w\s]*Paramount[\w\s]*\|{0,1})", production_company)
        if match:
            return match.group(0)
        else:
            return None
    except TypeError:
        return None
    
words_with_Paramount = set()
for text in set(movies_production_companies["production_companies"].tolist()):
    pattern = find_Paramount(text)
    words_with_Paramount.add(pattern)

print(words_with_Paramount)
{'|Paramount Television|', '|Paramount Pictures', 'Paramount Famous Productions', '|Paramount Classics', '|Paramount Home Entertainment', 'Paramount Vantage', 'Paramount Classics', 'Paramount Vantage|', '|Paramount Vantage|', 'Paramount Pictures|', 'Paramount Pictures', '|Paramount Classics|', 'Paramount|', '|Paramount Vantage', '|Paramount Pictures Digital Entertainment|', None, '|Paramount Animation', 'Paramount Home Entertainment', 'Paramount Pictures Digital Entertainment'}

As stated in Wikipedia - "Paramount Pictures Corporation (also known as Paramount Pictures and simply Paramount) is an American film studio based in Hollywood, California, that has been a subsidiary of the American media conglomerate Viacom since 1994." These above-listed names are all subsidiaries of the same company.
Now we can write some code to tag all the companies with Universal in their names as Universal and all the companies with Paramount in their names as Paramount, and neither as None for our purpose. To do this, I used a regular expression that searches for the first occurrence of Universal or Paramount in each row of production_companies column.

In [403]:
def find_Universal_or_Paramount(production_company):
    try:
        universal = re.search("(\|{0,1}[\w\s]*Universal[\w\s]*\|{0,1})", production_company)
        paramount = re.search("(\|{0,1}[\w\s]*Paramount[\w\s]*\|{0,1})", production_company)
        if universal:
            return "Universal"
        elif paramount:
            return "Paramount"
        else:
            return None
    except TypeError:
        return None

# Transform the original 'production_companies' to only keep relevant information
movies["production_companies"] = movies["production_companies"].apply(
    find_Universal_or_Paramount)

# We can see that the production_companies column has been transformed in the original dataset
movies.head()
Out[403]:
id imdb_id popularity budget revenue original_title cast homepage director overview runtime genres production_companies release_date vote_count vote_average release_year budget_adj revenue_adj based_on_novel
0 135397 tt0369610 32.985763 150000000 1513528810 Jurassic World Chris Pratt|Bryce Dallas Howard|Irrfan Khan|Vi... http://www.jurassicworld.com/ Colin Trevorrow Twenty-two years after the events of Jurassic ... 124 Action|Adventure|Science Fiction|Thriller Universal 2015-06-09 5562 6.5 2015 1.379999e+08 1.392446e+09 False
1 76341 tt1392190 28.419936 150000000 378436354 Mad Max: Fury Road Tom Hardy|Charlize Theron|Hugh Keays-Byrne|Nic... http://www.madmaxmovie.com/ George Miller An apocalyptic story set in the furthest reach... 120 Action|Adventure|Science Fiction|Thriller None 2015-05-13 6185 7.1 2015 1.379999e+08 3.481613e+08 False
2 262500 tt2908446 13.112507 110000000 295238201 Insurgent Shailene Woodley|Theo James|Kate Winslet|Ansel... http://www.thedivergentseries.movie/#insurgent Robert Schwentke Beatrice Prior must confront her inner demons ... 119 Adventure|Science Fiction|Thriller None 2015-03-18 2480 6.3 2015 1.012000e+08 2.716190e+08 True
3 140607 tt2488496 11.173104 200000000 2068178225 Star Wars: The Force Awakens Harrison Ford|Mark Hamill|Carrie Fisher|Adam D... http://www.starwars.com/films/star-wars-episod... J.J. Abrams Thirty years after defeating the Galactic Empi... 136 Action|Adventure|Science Fiction|Fantasy None 2015-12-15 5292 7.5 2015 1.839999e+08 1.902723e+09 False
4 168259 tt2820852 9.335014 190000000 1506249360 Furious 7 Vin Diesel|Paul Walker|Jason Statham|Michelle ... http://www.furious7.com/ James Wan Deckard Shaw seeks revenge against Dominic Tor... 137 Action|Crime|Thriller Universal 2015-04-01 2947 7.3 2015 1.747999e+08 1.385749e+09 False

While we are here, let us also extract day, month, and day of the week information from the date column. We will create these three new columns in the original movies dataset itself. These attributes can be used for exploring the seasonal effect on movie's performance.

In [404]:
movies["release_month"] = pd.DatetimeIndex(movies["release_date"]).month
movies["release_month"] = movies["release_month"].apply(lambda x: calendar.month_abbr[x])
movies["release_day"] = pd.DatetimeIndex(movies["release_date"]).day
movies["release_dayofweek"] = pd.DatetimeIndex(movies["release_date"]).weekday_name
movies.head()
Out[404]:
id imdb_id popularity budget revenue original_title cast homepage director overview ... release_date vote_count vote_average release_year budget_adj revenue_adj based_on_novel release_month release_day release_dayofweek
0 135397 tt0369610 32.985763 150000000 1513528810 Jurassic World Chris Pratt|Bryce Dallas Howard|Irrfan Khan|Vi... http://www.jurassicworld.com/ Colin Trevorrow Twenty-two years after the events of Jurassic ... ... 2015-06-09 5562 6.5 2015 1.379999e+08 1.392446e+09 False Jun 9 Tuesday
1 76341 tt1392190 28.419936 150000000 378436354 Mad Max: Fury Road Tom Hardy|Charlize Theron|Hugh Keays-Byrne|Nic... http://www.madmaxmovie.com/ George Miller An apocalyptic story set in the furthest reach... ... 2015-05-13 6185 7.1 2015 1.379999e+08 3.481613e+08 False May 13 Wednesday
2 262500 tt2908446 13.112507 110000000 295238201 Insurgent Shailene Woodley|Theo James|Kate Winslet|Ansel... http://www.thedivergentseries.movie/#insurgent Robert Schwentke Beatrice Prior must confront her inner demons ... ... 2015-03-18 2480 6.3 2015 1.012000e+08 2.716190e+08 True Mar 18 Wednesday
3 140607 tt2488496 11.173104 200000000 2068178225 Star Wars: The Force Awakens Harrison Ford|Mark Hamill|Carrie Fisher|Adam D... http://www.starwars.com/films/star-wars-episod... J.J. Abrams Thirty years after defeating the Galactic Empi... ... 2015-12-15 5292 7.5 2015 1.839999e+08 1.902723e+09 False Dec 15 Tuesday
4 168259 tt2820852 9.335014 190000000 1506249360 Furious 7 Vin Diesel|Paul Walker|Jason Statham|Michelle ... http://www.furious7.com/ James Wan Deckard Shaw seeks revenge against Dominic Tor... ... 2015-04-01 2947 7.3 2015 1.747999e+08 1.385749e+09 False Apr 1 Wednesday

5 rows × 23 columns

We will also calculate two important metrics that we will use later for Tableau visualization.

In [405]:
movies["profit"] = movies["revenue"] - movies["budget"]
movies["profit (%)"] = (movies["revenue"] - movies["budget"])*100/movies["budget"]
movies["profit_adj"] = movies["revenue_adj"] - movies["budget_adj"]
In [406]:
movies.head()
Out[406]:
id imdb_id popularity budget revenue original_title cast homepage director overview ... release_year budget_adj revenue_adj based_on_novel release_month release_day release_dayofweek profit profit (%) profit_adj
0 135397 tt0369610 32.985763 150000000 1513528810 Jurassic World Chris Pratt|Bryce Dallas Howard|Irrfan Khan|Vi... http://www.jurassicworld.com/ Colin Trevorrow Twenty-two years after the events of Jurassic ... ... 2015 1.379999e+08 1.392446e+09 False Jun 9 Tuesday 1363528810 909.019207 1.254446e+09
1 76341 tt1392190 28.419936 150000000 378436354 Mad Max: Fury Road Tom Hardy|Charlize Theron|Hugh Keays-Byrne|Nic... http://www.madmaxmovie.com/ George Miller An apocalyptic story set in the furthest reach... ... 2015 1.379999e+08 3.481613e+08 False May 13 Wednesday 228436354 152.290903 2.101614e+08
2 262500 tt2908446 13.112507 110000000 295238201 Insurgent Shailene Woodley|Theo James|Kate Winslet|Ansel... http://www.thedivergentseries.movie/#insurgent Robert Schwentke Beatrice Prior must confront her inner demons ... ... 2015 1.012000e+08 2.716190e+08 True Mar 18 Wednesday 185238201 168.398365 1.704191e+08
3 140607 tt2488496 11.173104 200000000 2068178225 Star Wars: The Force Awakens Harrison Ford|Mark Hamill|Carrie Fisher|Adam D... http://www.starwars.com/films/star-wars-episod... J.J. Abrams Thirty years after defeating the Galactic Empi... ... 2015 1.839999e+08 1.902723e+09 False Dec 15 Tuesday 1868178225 934.089113 1.718723e+09
4 168259 tt2820852 9.335014 190000000 1506249360 Furious 7 Vin Diesel|Paul Walker|Jason Statham|Michelle ... http://www.furious7.com/ James Wan Deckard Shaw seeks revenge against Dominic Tor... ... 2015 1.747999e+08 1.385749e+09 False Apr 1 Wednesday 1316249360 692.762821 1.210949e+09

5 rows × 26 columns

Write the dataframe to csv file for Tableau visualization.

In [407]:
movies.to_csv('data/movies_cleaned.csv',
             index = False)

Question 3: How have movie genres changed over time?
Problem: The genres column consists of multiple genres in the same cell separated by "|". To tidy this data up, we need to have one genre per row associated with their respective id. Instead of fixing the genres column for the entire dataset, I select few columns and make a subset of data to work with it.

In [408]:
movies_genres = movies[["id", "original_title", "genres"]].reset_index(drop = True)
movies_genres.head()
Out[408]:
id original_title genres
0 135397 Jurassic World Action|Adventure|Science Fiction|Thriller
1 76341 Mad Max: Fury Road Action|Adventure|Science Fiction|Thriller
2 262500 Insurgent Adventure|Science Fiction|Thriller
3 140607 Star Wars: The Force Awakens Action|Adventure|Science Fiction|Fantasy
4 168259 Furious 7 Action|Crime|Thriller

First, we separate multiple genres associated with each movie into multiple columns. We know from the data specifications that there can be a maximum of five genres for each movie, so we will separate the genres column into five columns.

In [409]:
movies_genres[['genre1', 'genre2', 'genre3', 'genre4', 'genre5']] = movies_genres['genres'].str.split(
    "|", expand = True)
del movies_genres['genres']

The genres variable is now spread out across multiple columns and the data is in wide format. As a final tidying step, we will use pandas melt function to convert the dataset from wide to tall format.

In [410]:
movies_genres = pd.melt(movies_genres, id_vars = ["id", "original_title"],
                       value_name = "genres", var_name = "genre_n")

# Drop rows with no values in genres column
movies_genres.dropna(axis = 0, subset = ['genres'], inplace = True)

movies_genres.head()
Out[410]:
id original_title genre_n genres
0 135397 Jurassic World genre1 Action
1 76341 Mad Max: Fury Road genre1 Action
2 262500 Insurgent genre1 Adventure
3 140607 Star Wars: The Force Awakens genre1 Action
4 168259 Furious 7 genre1 Action
In [411]:
movies[movies["original_title"] == "3 Ninjas"]
Out[411]:
id imdb_id popularity budget revenue original_title cast homepage director overview ... release_year budget_adj revenue_adj based_on_novel release_month release_day release_dayofweek profit profit (%) profit_adj
8342 16314 tt0103596 0.233105 0 0 3 Ninjas Victor Wong|Michael Treanor|Max Elliott Slade|... NaN Jon Turteltaub Each year, three brothers Samuel, Jeffrey and ... ... 1992 0.0 0.0 False Aug 7 Friday 0 NaN 0.0

1 rows × 26 columns

In [412]:
# Delete the original genres column in the movies dataset
del movies["genres"]

# Merge the dataframes
movies_cleaned_genres = pd.merge(movies,
                                 movies_genres,
                                 how = "right",
                                 left_on=["id", "original_title"],
                                 right_on=["id", "original_title"])

# Write the dataframe to csv file for Tableau visualization.
movies_cleaned_genres.to_csv('data/movies_cleaned_genres.csv',
                             index = False)

Tableau Data Visualization

In [1]:
%%HTML
In [2]:
%%HTML
In [3]:
%%HTML
In [4]:
%%HTML

Comments

comments powered by Disqus