Add readme
parent 3e704a9d9e
commit a9ef33f01e
README.Rmd (normal file, 196 lines added)
@@ -0,0 +1,196 @@
---
title: "Harry Potter Data"
output:
  md_document:
    variant: markdown_github
---

<!-- README.md is generated from README.Rmd. Please edit that file -->

```{r include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

library(tidyverse)
library(tidytext)
```


## About

This repository contains data files that can be used to perform a text
analysis of the [Harry Potter](https://en.wikipedia.org/wiki/Harry_Potter)
books, written by Joanne Kathleen Rowling:

1) Harry Potter and the Philosopher's Stone

2) Harry Potter and the Chamber of Secrets

3) Harry Potter and the Prisoner of Azkaban

4) Harry Potter and the Goblet of Fire

5) Harry Potter and the Order of the Phoenix

6) Harry Potter and the Half-Blood Prince

7) Harry Potter and the Deathly Hallows


To perform the text analysis, we recommend using _tidyverse_ tools (see the
packages below) and drawing inspiration from the book
[Text Mining with R: A Tidy Approach](https://www.tidytextmining.com/index.html)
(by Silge & Robinson):

```r
library(tidyverse)
library(tidytext)
```


### Content

The content of this repo is divided into three directories, each containing
a different type of file.

- [csv-data-file/](csv-data-file) contains the text of all Harry Potter books
  in a single CSV file.

- [rda-data-files/](rda-data-files) contains the seven Harry Potter books
  stored in R-Data (binary) files, one file per book.

- [sentiment-lexicons/](sentiment-lexicons) contains a handful of sentiment
  lexicons, also stored in R-Data (binary) files, one file per lexicon.


### Harry Potter CSV file

The data of all the books is available in `csv` format:
`harry_potter_books.csv`.

Assuming that this file is in your working directory, you can import it with
`read_csv()` from the tidyverse's readr package as follows:

```r
# requires package tidyverse (read_csv() comes from readr)
hp_books <- read_csv("harry_potter_books.csv", col_types = "ccc")
```

```{r echo = FALSE}
hp_books <- read_csv("csv-data-file/harry_potter_books.csv", col_types = "ccc")
```

This dataset has a fairly simple structure, although the text content is far
from tidy. The dataset has `r nrow(hp_books)` rows
and `r ncol(hp_books)` columns:

1) `text`: text content

2) `book`: title of associated book

3) `chapter`: associated chapter number
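
A common first step with a table shaped like this is to split `text` into one word per row. The following is a minimal sketch, assuming a small made-up tibble (`hp_sample`) with the same three columns; its text and its `book` and `chapter` values are placeholders, not the values stored in `harry_potter_books.csv`:

```r
library(tidyverse)
library(tidytext)

# Hypothetical stand-in with the same three columns as hp_books;
# the text, book, and chapter values are placeholders
hp_sample <- tibble(
  text    = c("Harry looked at the owl.",
              "The owl looked back at Harry."),
  book    = "book-1",
  chapter = "1"
)

# One row per word; unnest_tokens() lowercases and strips punctuation by default
hp_words <- hp_sample %>%
  unnest_tokens(output = word, input = text)

# Drop common stop words (tidytext's stop_words), then count the remaining words
word_counts <- hp_words %>%
  anti_join(stop_words, by = "word") %>%
  count(book, word, sort = TRUE)

word_counts
```

With the real data, `hp_books` would take the place of `hp_sample`.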


### Harry Potter R-Data Files

The data of each book is also available in its own R-Data `rda` file
(see [rda-data-files/](rda-data-files)):

- `"philosophers_stone.rda"`
- `"chamber_of_secrets.rda"`
- `"prisoner_of_azkaban.rda"`
- `"goblet_of_fire.rda"`
- `"order_of_the_phoenix.rda"`
- `"half_blood_prince.rda"`
- `"deathly_hallows.rda"`

These files come from the R package `"harrypotter"` by Bradley Boehmke:

[https://github.com/bradleyboehmke/harrypotter](https://github.com/bradleyboehmke/harrypotter)

To import these files, use the `load()` function. For example, consider the
first book, "Harry Potter and the Philosopher's Stone"; here's how to `load()`
it in R:

```r
# assuming that the rda file is in your working directory
load("philosophers_stone.rda")
```

```{r echo = FALSE}
load("rda-data-files/philosophers_stone.rda")
```

Assuming that `"philosophers_stone.rda"` has been loaded, the text of this
book is available in the character vector of the same name,
`philosophers_stone`:

```{r}
# text is in a character vector
# (with as many elements as chapters in the book)
length(philosophers_stone)
```

The number of elements in `philosophers_stone` corresponds
to the number of chapters in this book: 17 chapters.
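
Because `load()` restores objects under the names they were saved with, it can help to check which names a file provides. A minimal sketch, using a made-up object (`demo_book`) written to a temporary file rather than one of the repository's files:

```r
# Made-up stand-in for a book: one element per chapter
demo_book <- c("chapter one text", "chapter two text")

# Save it to a temporary .rda file and remove it from the workspace
rda_path <- file.path(tempdir(), "demo_book.rda")
save(demo_book, file = rda_path)
rm(demo_book)

# load() recreates the object under its saved name and
# invisibly returns the names of everything it restored
loaded_names <- load(rda_path)
loaded_names

# The restored vector is available again under the same name
length(demo_book)
```

The same pattern should apply to any of the `rda` files listed above.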

You may want to use these files to perform bigram analysis (or other types of
n-gram analysis).
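
As a rough illustration of bigram extraction, here is a minimal sketch that assumes a made-up character vector (`book_sample`) with one element per chapter, mirroring the shape of the book vectors; the text is invented for the example:

```r
library(tidyverse)
library(tidytext)

# Hypothetical stand-in for a book vector: one element per chapter
book_sample <- c(
  "The boy who lived. The boy had a scar.",
  "The vanishing glass. The glass had vanished."
)

# Put the chapters in a tibble, then split the text into overlapping word pairs
bigrams <- tibble(chapter = seq_along(book_sample), text = book_sample) %>%
  unnest_tokens(output = bigram, input = text, token = "ngrams", n = 2)

# Most frequent bigrams across all chapters
bigrams %>%
  count(bigram, sort = TRUE)
```

Changing `n` gives other n-gram sizes, for example `n = 3` for trigrams.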


### Sentiment Lexicons

In addition to the Harry Potter text, you can also find data for a handful of
sentiment lexicons from the R package `"textdata"` (by Hvitfeldt and Silge):

- `"bing"`: Bing Liu's general-purpose English sentiment lexicon, which
  categorizes words in a binary fashion, as either positive or negative.

- `"afinn"`: AFINN is a lexicon of English words rated for valence with an
  integer between minus five (negative) and plus five (positive). The words
  were manually labeled by Finn Årup Nielsen in 2009-2011.

- `"nrc"`: General-purpose English sentiment/emotion lexicon. This lexicon
  labels words with ten possible sentiments or emotions: "negative",
  "positive", "anger", "anticipation", "disgust", "fear", "joy", "sadness",
  "surprise", or "trust". The annotations were done manually through Amazon's
  Mechanical Turk.

- `"loughran"`: English sentiment lexicon created for use with financial
  documents. This lexicon labels words with six possible sentiments important
  in financial contexts: "negative", "positive", "litigious", "uncertainty",
  "constraining", or "superfluous".


These lexicons come in `rda` data files (see
[sentiment-lexicons/](sentiment-lexicons)):

- `bing.rda`
- `afinn.rda`
- `nrc.rda`
- `loughran.rda`

To import them in R, use the `load()` function. For example, here's how to
import the Bing lexicon:

```r
# assuming that the rda files are in your working directory
load("bing.rda")
```

```{r echo = FALSE}
# the rda file lives in the sentiment-lexicons/ directory of this repo
load("sentiment-lexicons/bing.rda")
```

Assuming that you've loaded the file `"bing.rda"`, the associated lexicon is
available in the tibble of the same name, `bing`:

```{r}
bing
```
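
A lexicon with `word` and `sentiment` columns can be joined to tokenized text to tally sentiments. The following is a minimal sketch under stated assumptions: `lexicon_sample` and `text_sample` are made-up stand-ins, and the lexicon entries are illustrative rather than taken from the real `bing` lexicon:

```r
library(tidyverse)
library(tidytext)

# Made-up lexicon with the same two columns as `bing` (word, sentiment);
# these entries are illustrative, not taken from the real lexicon
lexicon_sample <- tibble(
  word      = c("happy", "brave", "afraid"),
  sentiment = c("positive", "positive", "negative")
)

# Made-up text, one row per chapter
text_sample <- tibble(
  chapter = c(1, 2),
  text    = c("Harry was happy and brave.", "Harry was afraid.")
)

# Tokenize, keep only words found in the lexicon, count sentiments per chapter
sentiment_counts <- text_sample %>%
  unnest_tokens(output = word, input = text) %>%
  inner_join(lexicon_sample, by = "word") %>%
  count(chapter, sentiment)

sentiment_counts
```

With the real data, `bing` would take the place of `lexicon_sample`.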

README.md (normal file, 175 lines added)
@@ -0,0 +1,175 @@
<!-- README.md is generated from README.Rmd. Please edit that file -->

## About

This repository contains data files that can be used to perform a text
analysis of the [Harry
Potter](https://en.wikipedia.org/wiki/Harry_Potter) books, written by
Joanne Kathleen Rowling:

1. Harry Potter and the Philosopher’s Stone

2. Harry Potter and the Chamber of Secrets

3. Harry Potter and the Prisoner of Azkaban

4. Harry Potter and the Goblet of Fire

5. Harry Potter and the Order of the Phoenix

6. Harry Potter and the Half-Blood Prince

7. Harry Potter and the Deathly Hallows

To perform the text analysis, we recommend using *tidyverse* tools (see
the packages below) and drawing inspiration from the book [Text Mining
with R: A Tidy Approach](https://www.tidytextmining.com/index.html) (by
Silge & Robinson):

``` r
library(tidyverse)
library(tidytext)
```

### Content

The content of this repo is divided into three directories, each
containing a different type of file.

- [csv-data-file/](csv-data-file) contains the text of all Harry
  Potter books in a single CSV file.

- [rda-data-files/](rda-data-files) contains the seven Harry Potter
  books stored in R-Data (binary) files, one file per book.

- [sentiment-lexicons/](sentiment-lexicons) contains a handful of
  sentiment lexicons, also stored in R-Data (binary) files, one file
  per lexicon.

### Harry Potter CSV file

The data of all the books is available in `csv` format:
`harry_potter_books.csv`.

Assuming that this file is in your working directory, you can import
it with `read_csv()` from the tidyverse’s readr package as follows:

``` r
# requires package tidyverse (read_csv() comes from readr)
hp_books <- read_csv("harry_potter_books.csv", col_types = "ccc")
```

This dataset has a fairly simple structure, although the text content
is far from tidy. The dataset has 95085 rows and 3 columns:

1. `text`: text content

2. `book`: title of associated book

3. `chapter`: associated chapter number

### Harry Potter R-Data Files

The data of each book is also available in its own R-Data `rda` file
(see [rda-data-files/](rda-data-files)):

- `"philosophers_stone.rda"`
- `"chamber_of_secrets.rda"`
- `"prisoner_of_azkaban.rda"`
- `"goblet_of_fire.rda"`
- `"order_of_the_phoenix.rda"`
- `"half_blood_prince.rda"`
- `"deathly_hallows.rda"`

These files come from the R package `"harrypotter"` by Bradley Boehmke:

<https://github.com/bradleyboehmke/harrypotter>

To import these files, use the `load()` function. For example, consider
the first book, “Harry Potter and the Philosopher’s Stone”; here’s how
to `load()` it in R:

``` r
# assuming that the rda file is in your working directory
load("philosophers_stone.rda")
```

Assuming that `"philosophers_stone.rda"` has been loaded, the text of
this book is available in the character vector of the same name,
`philosophers_stone`:

``` r
# text is in a character vector
# (with as many elements as chapters in the book)
length(philosophers_stone)
#> [1] 17
```

The number of elements in `philosophers_stone` corresponds to the number
of chapters in this book: 17 chapters.

You may want to use these files to perform bigram analysis (or other
types of n-gram analysis).

### Sentiment Lexicons

In addition to the Harry Potter text, you can also find data for a
handful of sentiment lexicons from the R package `"textdata"` (by
Hvitfeldt and Silge):

- `"bing"`: Bing Liu’s general-purpose English sentiment lexicon, which
  categorizes words in a binary fashion, as either positive or
  negative.

- `"afinn"`: AFINN is a lexicon of English words rated for valence
  with an integer between minus five (negative) and plus five
  (positive). The words were manually labeled by Finn Årup Nielsen in
  2009-2011.

- `"nrc"`: General-purpose English sentiment/emotion lexicon. This
  lexicon labels words with ten possible sentiments or emotions:
  “negative”, “positive”, “anger”, “anticipation”, “disgust”, “fear”,
  “joy”, “sadness”, “surprise”, or “trust”. The annotations were done
  manually through Amazon’s Mechanical Turk.

- `"loughran"`: English sentiment lexicon created for use with
  financial documents. This lexicon labels words with six possible
  sentiments important in financial contexts: “negative”, “positive”,
  “litigious”, “uncertainty”, “constraining”, or “superfluous”.

These lexicons come in `rda` data files (see
[sentiment-lexicons/](sentiment-lexicons)):

- `bing.rda`
- `afinn.rda`
- `nrc.rda`
- `loughran.rda`

To import them in R, use the `load()` function. For example, here’s how
to import the Bing lexicon:

``` r
# assuming that the rda files are in your working directory
load("bing.rda")
```

Assuming that you’ve loaded the file `"bing.rda"`, the associated
lexicon is available in the tibble of the same name, `bing`:

``` r
bing
#> # A tibble: 6,786 × 2
#>    word        sentiment
#>    <chr>       <chr>
#>  1 2-faces     negative
#>  2 abnormal    negative
#>  3 abolish     negative
#>  4 abominable  negative
#>  5 abominably  negative
#>  6 abominate   negative
#>  7 abomination negative
#>  8 abort       negative
#>  9 aborted     negative
#> 10 aborts      negative
#> # … with 6,776 more rows
```