Sentiment Analysis with Python (Part 1)

Reviews with 5 or 6 stars were left out.

Step 1: Download and Combine Movie Reviews

If you haven’t yet, go to IMDb Reviews and click on “Large Movie Review Dataset v1.0”. Double-clicking the downloaded file should be sufficient to unpack it (at least on a Mac); otherwise, running gunzip -c aclImdb_v1.tar.gz | tar xopf - in a terminal will do it.

Unpacking and Merging

Follow these steps or run the shell script here: Preprocessing Script

Move the tar file to the directory where you want this data to be stored.

Open a terminal window and cd to the directory that you put aclImdb_v1.tar.gz in.

Then run the following commands to unpack the archive and merge the individual review files into one file per split, one review per line:

    gunzip -c aclImdb_v1.tar.gz | tar xopf -
    cd aclImdb && mkdir movie_data
    for split in train test; do for sentiment in pos neg; do for file in $split/$sentiment/*; do cat "$file" >> movie_data/full_${split}.txt; echo >> movie_data/full_${split}.txt; done; done; done;

Step 2: Read into Python

For most of what we want to do in this walkthrough we’ll only need our reviews to be in a Python list. Make sure the path you pass to open points to the directory where you put the movie data.

Step 3: Clean and Preprocess

The raw text is pretty messy for these reviews, so before we can do any analytics we need to clean things up. If you’re unfamiliar with regular expressions, perhaps start here: Regex Tutorial

And this is what the same review looks like now:

"this isnt the comedic robin williams nor is it the quirky insane robin williams of recent thriller fame this is a hybrid of the classic drama without over dramatization mixed with robins new love of the thriller but this isnt a thriller per se this is more a mystery suspense vehicle through which williams attempts to locate a sick boy and his keeper also starring sandra oh and rory culkin this suspense drama plays pretty much like a news report until williams character gets close to achieving his goal i must say that i was highly entertained though this movie fails to teach guide inspect or amuse it felt more like i was watching a guy williams as he was actually performing the actions from a third person perspective in other words it felt real and i was able to subscribe to the premise of the story all in all its worth a watch though its definitely not friday saturday night fare it rates a from the fiend"

Note: There are a lot of different and more sophisticated ways to clean text data that would likely produce better results than what I’ve done here. Also, I generally think it’s best to get baseline predictions with the simplest possible solution before spending time doing potentially unnecessary transformations.

Vectorization

In order for this data to make sense to our machine learning algorithm, we’ll need to convert each review to a numeric representation, which we call vectorization.

The simplest form of this is to create one very large matrix with one column for every unique word in your corpus (where the corpus is all 50k reviews in our case).
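
To make Steps 2 and 3 and the vectorization idea concrete, here is a minimal sketch in plain Python. It is not the code from this walkthrough: the function names (load_reviews, clean_review, build_vocabulary, vectorize) are hypothetical, the cleaning rules are an assumption chosen to produce output similar to the cleaned review shown above, and the 0/1 "word is present" encoding is one simple choice that the text does not spell out.

```python
import re

# Assumed cleaning rules (not taken from the article): replace HTML line
# breaks, hyphens and slashes with a space, then drop everything that is
# not a lowercase ASCII letter or whitespace (punctuation, digits, accents).
BREAKS_AND_SEPARATORS = re.compile(r"<br\s*/?>|[-/]")
NOT_LETTER_OR_SPACE = re.compile(r"[^a-z\s]")


def load_reviews(path):
    """Read a merged file with one review per line into a Python list."""
    with open(path, encoding="utf-8") as handle:
        return [line.strip() for line in handle]


def clean_review(review):
    """Lowercase a review and strip markup, punctuation and digits."""
    lowered = review.lower()
    spaced = BREAKS_AND_SEPARATORS.sub(" ", lowered)
    letters_only = NOT_LETTER_OR_SPACE.sub("", spaced)
    # Collapse the runs of whitespace left behind by the substitutions.
    return " ".join(letters_only.split())


def build_vocabulary(reviews):
    """Map every unique word in the corpus to a column index."""
    words = sorted({word for review in reviews for word in review.split()})
    return {word: column for column, word in enumerate(words)}


def vectorize(reviews, vocabulary):
    """Return one row per review, with 1 where the column's word occurs."""
    rows = []
    for review in reviews:
        row = [0] * len(vocabulary)
        for word in review.split():
            column = vocabulary.get(word)
            if column is not None:  # ignore words outside the vocabulary
                row[column] = 1
        rows.append(row)
    return rows


# Tiny made-up corpus; with the real data you would instead call
# load_reviews("movie_data/full_train.txt") (path assumed from Step 1).
raw_reviews = [
    "A great, GREAT movie!<br /><br />Loved it.",
    "Not great... I rate it 3/10.",
]
cleaned_reviews = [clean_review(review) for review in raw_reviews]
vocabulary = build_vocabulary(cleaned_reviews)
matrix = vectorize(cleaned_reviews, vocabulary)

print(cleaned_reviews)
print(sorted(vocabulary))
print(matrix)
```

Dense Python lists like these are only practical for a toy corpus. With 50k reviews and one column per unique word, almost every entry is zero, so a sparse representation is the usual choice; scikit-learn's CountVectorizer, for example, builds a sparse matrix and has a binary option for this kind of present/absent encoding.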
