1
0
harry-potter-data/README.md
2024-05-15 15:40:24 -07:00

184 lines
5.5 KiB
Markdown
Raw Permalink Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

<!-- README.md is generated from README.Rmd. Please edit that file -->
# 1) About
This repository contains various data files that can be used to perform
a text analysis of [Harry
Potter](https://en.wikipedia.org/wiki/Harry_Potter) books, written by
Joanne Kathleen Rowling:
1. Harry Potter and the Philosophers Stone
2. Harry Potter and the Chamber of Secrets
3. Harry Potter and the Prisoner of Azkaban
4. Harry Potter and the Goblet of Fire
5. Harry Potter and the Order of the Phoenix
6. Harry Potter and the Half-Blood Prince
7. Harry Potter and the Deathly Hallows
To perform the text analysis, we recommend using *tidyverse* tools (see
packages below) and getting inspiration from the book [Text Mining with
R: A Tidy Approach](https://www.tidytextmining.com/index.html) (by Silge
& Robinson):
``` r
library(tidyverse)
library(tidytext)
```
------------------------------------------------------------------------
# 2) Content
The content of this repo is divided in **three directories**, each one
containing different types of files.
- [csv-data-file/](csv-data-file) contains the text of all Harry
Potter books in a single CSV file.
- [rda-data-files/](rda-data-files) contains the seven Harry Potter
books stored in R-Data (binary) files—one file per book.
- [sentiment-lexicons/](sentiment-lexicons) contains a handful of
sentiment lexicons, also stored in R-Data (binary) files—one file
per lexicon.
------------------------------------------------------------------------
## 2.1) Harry Potter CSV file
The data of all the books is available in `csv` format—in a single file:
`harry_potter_books.csv`.
Assuming that this file is in your working directory, you can import
it—via tidyverses `readr()`—as follows:
``` r
# requires package tidyverse
hp_books = read_csv("harry_potter_books.csv", col_types = "ccc")
```
This data set is fairly simple—in terms of its structure—although the
text content is far from being tidy. The dataset has 95085 rows and 3
columns:
1. `text`: text content
2. `book`: title of associated book
3. `chapter`: associated chapter number
------------------------------------------------------------------------
## 2.2) Harry Potter R-Data Files
The data of each book is also available in its own R-Data `rda` file
(see [rda-data-files/](rda-data-files)):
- `"philosophers_stone.rda"`
- `"chamber_of_secrets.rda"`
- `"prisoner_of_azkaban.rda"`
- `"goblet_of_fire.rda"`
- `"order_of_the_phoenix.rda"`
- `"half_blood_prince.rda"`
- `"deathly_hallows.rda"`
These files come from the R package `"harrypotter"` by Bradley Boehmke
<https://github.com/bradleyboehmke/harrypotter>
To import these files use the `load()` function. For example, consider
the first book “Harry Potter and the Philosophers Stone”; heres how to
`load()` it in R:
``` r
# assuming that the rda file is in your working directory
load("philosophers_stone.rda")
```
Assuming that `"philosophers_stone.rda"` has been loaded, the text of
this book is available in the homonym character vector
`philosophers_stone`
``` r
# text is in a character vector
# (with as many elements as chapters in the book)
length(philosophers_stone)
#> [1] 17
```
The number of elements in `philosophers_stone` corresponds to the number
of chapters in this book: 17 chapters.
You may want to use these files to perform bigram analysis (or other
type of n-gram analysis).
------------------------------------------------------------------------
## 2.3) Sentiment Lexicons
In addition to the Harry Potter text, you can also find data for a
handful of sentiment lexicons from the R package `"textdata"` (by
Hvitfeldt and Silge):
- `"bing"`: Bing Lius General purpose English sentiment lexicon that
categorizes words in a binary fashion, either positive or negative
- `"afinn"`: AFINN is a lexicon of English words rated for valence
with an integer between minus five (negative) and plus five
(positive). The words have been manually labeled by Finn Årup
Nielsen in 2009-2011.
- `"nrc"`: General purpose English sentiment/emotion lexicon. This
lexicon labels words with six possible sentiments or emotions:
“negative”, “positive”, “anger”, “anticipation”, “disgust”, “fear”,
“joy”, “sadness”, “surprise”, or “trust”. The annotations were
manually done through Amazons Mechanical Turk.
- `"loughran"`: English sentiment lexicon created for use with
financial documents. This lexicon labels words with six possible
sentiments important in financial contexts: “negative”, “positive”,
“litigious”, “uncertainty”, “constraining”, or “superfluous”.
These lexicons come in `rda` data files (see
[sentiment-lexicons/](sentiment-lexicons)):
- `bing.rda`
- `afinn.rda`
- `bing.rda`
- `loughran.rda`
To import them in R, use the `load()` function. For example, heres how
to import the Bing lexicon:
``` r
# assuming that the rda files are in your working directory
load("bing.rda")
```
Assuming that youve loaded the file `"bing.rda"`, the associated
lexicon is available in the homonym tibble `bing`
``` r
bing
#> # A tibble: 6,786 × 2
#> word sentiment
#> <chr> <chr>
#> 1 2-faces negative
#> 2 abnormal negative
#> 3 abolish negative
#> 4 abominable negative
#> 5 abominably negative
#> 6 abominate negative
#> 7 abomination negative
#> 8 abort negative
#> 9 aborted negative
#> 10 aborts negative
#> # 6,776 more rows
```