197 lines
5.2 KiB
Plaintext
197 lines
5.2 KiB
Plaintext
---
|
|
title: "Harry Potter Data"
|
|
output:
|
|
md_document:
|
|
variant: markdown_github
|
|
---
|
|
|
|
<!-- README.md is generated from README.Rmd. Please edit that file -->
|
|
|
|
```{r include = FALSE}
|
|
knitr::opts_chunk$set(
|
|
collapse = TRUE,
|
|
comment = "#>"
|
|
)
|
|
|
|
library(tidyverse)
|
|
library(tidytext)
|
|
```
|
|
|
|
|
|
|
|
## About
|
|
|
|
This repository contains various data files that can be used to perform a
|
|
text analysis of [Harry Potter](https://en.wikipedia.org/wiki/Harry_Potter)
|
|
books, written by Joanne Kathleen Rowling:
|
|
|
|
1) Harry Potter and the Philosopher's Stone
|
|
|
|
2) Harry Potter and the Chamber of Secrets
|
|
|
|
3) Harry Potter and the Prisoner of Azkaban
|
|
|
|
4) Harry Potter and the Goblet of Fire
|
|
|
|
5) Harry Potter and the Order of the Phoenix
|
|
|
|
6) Harry Potter and the Half-Blood Prince
|
|
|
|
7) Harry Potter and the Deathly Hallows
|
|
|
|
|
|
To perform the text analysis, we recommend using _tidyverse_ tools (see
|
|
packages below) and getting inspiration from the book
|
|
[Text Mining with R: A Tidy Approach](https://www.tidytextmining.com/index.html)
|
|
(by Silge & Robinson):
|
|
|
|
```r
|
|
library(tidyverse)
|
|
library(tidytext)
|
|
```
|
|
|
|
|
|
## Content
|
|
|
|
The content of this repo is divided in three directories, each one containing
|
|
different types of files.
|
|
|
|
- [csv-data-file/](csv-data-file) contains the text of all Harry Potter books
|
|
in a single CSV file.
|
|
|
|
- [rda-data-files/](rda-data-files) contains the seven Harry Potter books
|
|
stored in R-Data (binary) files---one filer per book.
|
|
|
|
- [sentiment-lexicons/](sentiment-lexicons) contains a handful of sentiment
|
|
lexicons, also stored in R-Data (binary) files---one file per lexicon.
|
|
|
|
|
|
|
|
## Harry Potter CSV file
|
|
|
|
The data of all the books is available in `csv` format:
|
|
`harry_potter_books.csv`.
|
|
|
|
Assuming that this file is in your working directory, you can import it---via
|
|
tidyverse's `readr()`---as follows:
|
|
|
|
```r
|
|
# requires package tidyverse
|
|
hp_books = read_csv("harry_potter_books.csv", col_types = "ccc")
|
|
```
|
|
|
|
```{r echo = FALSE}
|
|
hp_books = read_csv("csv-data-file/harry_potter_books.csv", col_types = "ccc")
|
|
```
|
|
|
|
This data set is fairly simple---in terms of its structure---although the
|
|
text content is far from being tidy. The dataset has `r nrow(hp_books)` rows
|
|
and `r ncol(hp_books)` columns:
|
|
|
|
1) `text`: text content
|
|
|
|
2) `book`: title of associated book
|
|
|
|
3) `chapter`: associated chapter number
|
|
|
|
|
|
|
|
## Harry Potter R-Data Files
|
|
|
|
The data of each book is also available in its own R-Data `rda` file
|
|
(see [rda-data-files/](rda-data-files)):
|
|
|
|
- `"philosophers_stone.rda"`
|
|
- `"chamber_of_secrets.rda"`
|
|
- `"prisoner_of_azkaban.rda"`
|
|
- `"goblet_of_fire.rda"`
|
|
- `"order_of_the_phoenix.rda"`
|
|
- `"half_blood_prince.rda"`
|
|
- `"deathly_hallows.rda"`
|
|
|
|
These files come from the R package `"harrypotter"` by Bradley Boehmke
|
|
|
|
[https://github.com/bradleyboehmke/harrypotter](https://github.com/bradleyboehmke/harrypotter)
|
|
|
|
To import these files use the `load()` function. For example, consider the
|
|
first book "Harry Potter and the Philosopher's Stone"; here's how to `load()`
|
|
it in R:
|
|
|
|
```r
|
|
# assuming that the rda file is in your working directory
|
|
load("philosophers_stone.rda")
|
|
```
|
|
|
|
```{r echo = FALSE}
|
|
load("rda-data-files/philosophers_stone.rda")
|
|
```
|
|
|
|
Assuming that `"philosophers_stone.rda"` has been loaded, the text of this
|
|
book is available in the homonym character vector `philosophers_stone`
|
|
|
|
```{r}
|
|
# text is in a character vector
|
|
# (with as many elements as chapters in the book)
|
|
length(philosophers_stone)
|
|
```
|
|
|
|
The number of elements in `philosophers_stone` corresponds
|
|
to the number of chapters in this book: 17 chapters.
|
|
|
|
You may want to use these files to perform bigram analysis (or other type of
|
|
n-gram analysis).
|
|
|
|
|
|
|
|
## Sentiment Lexicons
|
|
|
|
In addition to the Harry Potter text, you can also find data for a handful of
|
|
sentiment lexicons from the R package `"textdata"` (by Hvitfeldt and Silge):
|
|
|
|
- `"bing"`: Bing Liu's General purpose English sentiment lexicon that
|
|
categorizes words in a binary fashion, either positive or negative
|
|
|
|
- `"afinn"`: AFINN is a lexicon of English words rated for valence with an
|
|
integer between minus five (negative) and plus five (positive). The words have
|
|
been manually labeled by Finn Årup Nielsen in 2009-2011.
|
|
|
|
- `"nrc"`: General purpose English sentiment/emotion lexicon. This lexicon
|
|
labels words with six possible sentiments or emotions: "negative", "positive",
|
|
"anger", "anticipation", "disgust", "fear", "joy", "sadness", "surprise", or
|
|
"trust". The annotations were manually done through Amazon's Mechanical Turk.
|
|
|
|
- `"loughran"`: English sentiment lexicon created for use with financial
|
|
documents. This lexicon labels words with six possible sentiments important in
|
|
financial contexts: "negative", "positive", "litigious", "uncertainty",
|
|
"constraining", or "superfluous".
|
|
|
|
|
|
These lexicons come in `rda` data files (see
|
|
[sentiment-lexicons/](sentiment-lexicons)):
|
|
|
|
- `bing.rda`
|
|
- `afinn.rda`
|
|
- `bing.rda`
|
|
- `loughran.rda`
|
|
|
|
To import them in R, use the `load()` function. For example, here's how to
|
|
import the Bing lexicon:
|
|
|
|
```r
|
|
# assuming that the rda files are in your working directory
|
|
load("bing.rda")
|
|
```
|
|
|
|
```{r echo = FALSE}
|
|
# assuming that the rda file is in your working directory
|
|
load("sentiment-lexicons/bing.rda")
|
|
```
|
|
|
|
Assuming that you've loaded the file `"bing.rda"`, the associated lexicon is
|
|
available in the homonym tibble `bing`
|
|
|
|
```{r}
|
|
bing
|
|
```
|
|
|