To perform the text analysis, we recommend using tidyverse tools (see packages below) and getting inspiration from the book Text Mining with R: A Tidy Approach (by Silge & Robinson):

library(tidyverse)
library(tidytext)

2) Content

The content of this repo is divided in three directories, each one containing different types of files.

csv-data-file/ contains the text of all Harry Potter books in a single CSV file.
rda-data-files/ contains the seven Harry Potter books stored in R-Data (binary) files—one file per book.
sentiment-lexicons/ contains a handful of sentiment lexicons, also stored in R-Data (binary) files—one file per lexicon.

2.1) Harry Potter CSV file

The data of all the books is available in csv format—in a single file: harry_potter_books.csv.

Assuming that this file is in your working directory, you can import it—via tidyverse’s readr()—as follows:

# requires package tidyverse
hp_books = read_csv("harry_potter_books.csv", col_types = "ccc")

This data set is fairly simple—in terms of its structure—although the text content is far from being tidy. The dataset has 95085 rows and 3 columns:

text: text content
book: title of associated book
chapter: associated chapter number

2.2) Harry Potter R-Data Files

The data of each book is also available in its own R-Data rda file (see rda-data-files/):

"philosophers_stone.rda"
"chamber_of_secrets.rda"
"prisoner_of_azkaban.rda"
"goblet_of_fire.rda"
"order_of_the_phoenix.rda"
"half_blood_prince.rda"
"deathly_hallows.rda"

These files come from the R package "harrypotter" by Bradley Boehmke

https://github.com/bradleyboehmke/harrypotter

To import these files use the load() function. For example, consider the first book “Harry Potter and the Philosopher’s Stone”; here’s how to load() it in R:

# assuming that the rda file is in your working directory
load("philosophers_stone.rda")

Assuming that "philosophers_stone.rda" has been loaded, the text of this book is available in the homonym character vector philosophers_stone

# text is in a character vector
# (with as many elements as chapters in the book)
length(philosophers_stone)
#> [1] 17

The number of elements in philosophers_stone corresponds to the number of chapters in this book: 17 chapters.

You may want to use these files to perform bigram analysis (or other type of n-gram analysis).

2.3) Sentiment Lexicons

In addition to the Harry Potter text, you can also find data for a handful of sentiment lexicons from the R package "textdata" (by Hvitfeldt and Silge):

"bing": Bing Liu’s General purpose English sentiment lexicon that categorizes words in a binary fashion, either positive or negative
"afinn": AFINN is a lexicon of English words rated for valence with an integer between minus five (negative) and plus five (positive). The words have been manually labeled by Finn Årup Nielsen in 2009-2011.
"nrc": General purpose English sentiment/emotion lexicon. This lexicon labels words with six possible sentiments or emotions: “negative”, “positive”, “anger”, “anticipation”, “disgust”, “fear”, “joy”, “sadness”, “surprise”, or “trust”. The annotations were manually done through Amazon’s Mechanical Turk.
"loughran": English sentiment lexicon created for use with financial documents. This lexicon labels words with six possible sentiments important in financial contexts: “negative”, “positive”, “litigious”, “uncertainty”, “constraining”, or “superfluous”.

These lexicons come in rda data files (see sentiment-lexicons/):

bing.rda
afinn.rda
bing.rda
loughran.rda

To import them in R, use the load() function. For example, here’s how to import the Bing lexicon:

# assuming that the rda files are in your working directory
load("bing.rda")

Assuming that you’ve loaded the file "bing.rda", the associated lexicon is available in the homonym tibble bing

bing
#> # A tibble: 6,786 × 2
#>    word        sentiment
#>    <chr>       <chr>    
#>  1 2-faces     negative 
#>  2 abnormal    negative 
#>  3 abolish     negative 
#>  4 abominable  negative 
#>  5 abominably  negative 
#>  6 abominate   negative 
#>  7 abomination negative 
#>  8 abort       negative 
#>  9 aborted     negative 
#> 10 aborts      negative 
#> # ℹ 6,776 more rows

README.md Unescape Escape

1) About

2) Content

2.1) Harry Potter CSV file

2.2) Harry Potter R-Data Files

2.3) Sentiment Lexicons

README.md