diff --git a/README.Rmd b/README.Rmd new file mode 100644 index 0000000..3124e27 --- /dev/null +++ b/README.Rmd @@ -0,0 +1,196 @@ +--- +title: "Harry Potter Data" +output: + md_document: + variant: markdown_github +--- + + + +```{r include = FALSE} +knitr::opts_chunk$set( + collapse = TRUE, + comment = "#>" +) + +library(tidyverse) +library(tidytext) +``` + + + +## About + +This repository contains data files that can be used to perform a +text analysis of the [Harry Potter](https://en.wikipedia.org/wiki/Harry_Potter) +books, written by Joanne Kathleen Rowling: + +1) Harry Potter and the Philosopher's Stone + +2) Harry Potter and the Chamber of Secrets + +3) Harry Potter and the Prisoner of Azkaban + +4) Harry Potter and the Goblet of Fire + +5) Harry Potter and the Order of the Phoenix + +6) Harry Potter and the Half-Blood Prince + +7) Harry Potter and the Deathly Hallows + + +To perform the text analysis, we recommend using _tidyverse_ tools (see +packages below) and getting inspiration from the book +[Text Mining with R: A Tidy Approach](https://www.tidytextmining.com/index.html) +(by Silge & Robinson): + +```r +library(tidyverse) +library(tidytext) +``` + + +### Content + +The content of this repo is divided into three directories, each one containing +a different type of file. + +- [csv-data-file/](csv-data-file) contains the text of all Harry Potter books +in a single CSV file. + +- [rda-data-files/](rda-data-files) contains the seven Harry Potter books +stored in R-Data (binary) files, one file per book. + +- [sentiment-lexicons/](sentiment-lexicons) contains a handful of sentiment +lexicons, also stored in R-Data (binary) files, one file per lexicon. + + + +### Harry Potter CSV file + +The text of all the books is available in CSV format: +`harry_potter_books.csv`. 
+ +Assuming that this file is in your working directory, you can import it with +`read_csv()` from the readr package (part of the tidyverse) as follows: + +```r +# requires package tidyverse +hp_books = read_csv("harry_potter_books.csv", col_types = "ccc") +``` + +```{r echo = FALSE} +hp_books = read_csv("csv-data-file/harry_potter_books.csv", col_types = "ccc") +``` + +This data set has a simple structure, although the text content is far from +tidy. It has `r nrow(hp_books)` rows and `r ncol(hp_books)` columns: + +1) `text`: text content + +2) `book`: title of associated book + +3) `chapter`: associated chapter number + + + +### Harry Potter R-Data Files + +The text of each book is also available in its own R-Data `rda` file +(see [rda-data-files/](rda-data-files)): + +- `"philosophers_stone.rda"` +- `"chamber_of_secrets.rda"` +- `"prisoner_of_azkaban.rda"` +- `"goblet_of_fire.rda"` +- `"order_of_the_phoenix.rda"` +- `"half_blood_prince.rda"` +- `"deathly_hallows.rda"` + +These files come from the R package `"harrypotter"` by Bradley Boehmke: + +[https://github.com/bradleyboehmke/harrypotter](https://github.com/bradleyboehmke/harrypotter) + +To import these files, use the `load()` function. For example, here's how to +`load()` the first book, "Harry Potter and the Philosopher's Stone", in R: + +```r +# assuming that the rda file is in your working directory +load("philosophers_stone.rda") +``` + +```{r echo = FALSE} +load("rda-data-files/philosophers_stone.rda") +``` + +Once `"philosophers_stone.rda"` has been loaded, the text of this +book is available in the character vector of the same name, `philosophers_stone`: + +```{r} +# text is in a character vector +# (with as many elements as chapters in the book) +length(philosophers_stone) +``` + +The number of elements in `philosophers_stone` corresponds +to the number of chapters in this book: 17 chapters. + +You may want to use these files to perform bigram analysis (or other types of +n-gram analysis). 
+ + + +### Sentiment Lexicons + +In addition to the Harry Potter text, you can also find data for a handful of +sentiment lexicons from the R package `"textdata"` (by Hvitfeldt and Silge): + +- `"bing"`: Bing Liu's general-purpose English sentiment lexicon that +categorizes words in a binary fashion, either positive or negative. + +- `"afinn"`: AFINN is a lexicon of English words rated for valence with an +integer between minus five (negative) and plus five (positive). The words were +manually labeled by Finn Årup Nielsen in 2009-2011. + +- `"nrc"`: General-purpose English sentiment/emotion lexicon. This lexicon +labels words with ten possible sentiments or emotions: "negative", "positive", +"anger", "anticipation", "disgust", "fear", "joy", "sadness", "surprise", or +"trust". The annotations were done manually through Amazon's Mechanical Turk. + +- `"loughran"`: English sentiment lexicon created for use with financial +documents. This lexicon labels words with six possible sentiments important in +financial contexts: "negative", "positive", "litigious", "uncertainty", +"constraining", or "superfluous". + + +These lexicons come in `rda` data files (see +[sentiment-lexicons/](sentiment-lexicons)): + +- `bing.rda` +- `afinn.rda` +- `nrc.rda` +- `loughran.rda` + +To import them in R, use the `load()` function. 
For example, here's how to +import the Bing lexicon: + +```r +# assuming that the rda files are in your working directory +load("bing.rda") +``` + +```{r echo = FALSE} +# path relative to the repository root +load("sentiment-lexicons/bing.rda") +``` + +Once you've loaded the file `"bing.rda"`, the associated lexicon is +available in the tibble of the same name, `bing`: + +```{r} +bing +``` + diff --git a/README.md b/README.md new file mode 100644 index 0000000..e1078fb --- /dev/null +++ b/README.md @@ -0,0 +1,175 @@ + + +## About + +This repository contains data files that can be used to perform +a text analysis of the [Harry +Potter](https://en.wikipedia.org/wiki/Harry_Potter) books, written by +Joanne Kathleen Rowling: + +1. Harry Potter and the Philosopher’s Stone + +2. Harry Potter and the Chamber of Secrets + +3. Harry Potter and the Prisoner of Azkaban + +4. Harry Potter and the Goblet of Fire + +5. Harry Potter and the Order of the Phoenix + +6. Harry Potter and the Half-Blood Prince + +7. Harry Potter and the Deathly Hallows + +To perform the text analysis, we recommend using *tidyverse* tools (see +packages below) and getting inspiration from the book [Text Mining with +R: A Tidy Approach](https://www.tidytextmining.com/index.html) (by Silge +& Robinson): + +``` r +library(tidyverse) +library(tidytext) +``` + +### Content + +The content of this repo is divided into three directories, each one +containing a different type of file. + +- [csv-data-file/](csv-data-file) contains the text of all Harry + Potter books in a single CSV file. + +- [rda-data-files/](rda-data-files) contains the seven Harry Potter + books stored in R-Data (binary) files, one file per book. + +- [sentiment-lexicons/](sentiment-lexicons) contains a handful of + sentiment lexicons, also stored in R-Data (binary) files, one file + per lexicon. + +### Harry Potter CSV file + +The text of all the books is available in CSV format: +`harry_potter_books.csv`. 
+ +Assuming that this file is in your working directory, you can import +it with `read_csv()` from the readr package (part of the tidyverse) as follows: + +``` r +# requires package tidyverse +hp_books = read_csv("harry_potter_books.csv", col_types = "ccc") +``` + +This data set has a simple structure, although the +text content is far from tidy. It has 95085 rows and 3 +columns: + +1. `text`: text content + +2. `book`: title of associated book + +3. `chapter`: associated chapter number + +### Harry Potter R-Data Files + +The text of each book is also available in its own R-Data `rda` file +(see [rda-data-files/](rda-data-files)): + +- `"philosophers_stone.rda"` +- `"chamber_of_secrets.rda"` +- `"prisoner_of_azkaban.rda"` +- `"goblet_of_fire.rda"` +- `"order_of_the_phoenix.rda"` +- `"half_blood_prince.rda"` +- `"deathly_hallows.rda"` + +These files come from the R package `"harrypotter"` by Bradley Boehmke: + +https://github.com/bradleyboehmke/harrypotter + +To import these files, use the `load()` function. For example, here’s how to +`load()` the first book, “Harry Potter and the Philosopher’s Stone”, in R: + +``` r +# assuming that the rda file is in your working directory +load("philosophers_stone.rda") +``` + +Once `"philosophers_stone.rda"` has been loaded, the text of +this book is available in the character vector of the same name, +`philosophers_stone`: + +``` r +# text is in a character vector +# (with as many elements as chapters in the book) +length(philosophers_stone) +#> [1] 17 +``` + +The number of elements in `philosophers_stone` corresponds to the number +of chapters in this book: 17 chapters. + +You may want to use these files to perform bigram analysis (or other +types of n-gram analysis). 
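The bigram analysis mentioned above could look roughly like the following sketch. It assumes the `dplyr`, `tibble`, and `tidytext` packages are installed, and it uses a made-up two-element vector named `chapters` as a stand-in for a real book vector such as `philosophers_stone` (one string per chapter), so that it runs without the data files.

```r
library(dplyr)
library(tibble)
library(tidytext)

# Hypothetical stand-in for a book vector such as `philosophers_stone`:
# one string per chapter. These sentences are invented for illustration.
chapters <- c(
  "the boy who lived in the cupboard",
  "the boy who found the vanishing glass"
)

# One row per chapter, then one row per bigram, then bigram frequencies
bigrams <- tibble(chapter = seq_along(chapters), text = chapters) %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
  count(bigram, sort = TRUE)

bigrams
```

To try this on a real book, replace `chapters` with the loaded vector, for example `philosophers_stone`.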
+ +### Sentiment Lexicons + +In addition to the Harry Potter text, you can also find data for a +handful of sentiment lexicons from the R package `"textdata"` (by +Hvitfeldt and Silge): + +- `"bing"`: Bing Liu’s general-purpose English sentiment lexicon that + categorizes words in a binary fashion, either positive or negative. + +- `"afinn"`: AFINN is a lexicon of English words rated for valence + with an integer between minus five (negative) and plus five + (positive). The words were manually labeled by Finn Årup + Nielsen in 2009-2011. + +- `"nrc"`: General-purpose English sentiment/emotion lexicon. This + lexicon labels words with ten possible sentiments or emotions: + “negative”, “positive”, “anger”, “anticipation”, “disgust”, “fear”, + “joy”, “sadness”, “surprise”, or “trust”. The annotations were + done manually through Amazon’s Mechanical Turk. + +- `"loughran"`: English sentiment lexicon created for use with + financial documents. This lexicon labels words with six possible + sentiments important in financial contexts: “negative”, “positive”, + “litigious”, “uncertainty”, “constraining”, or “superfluous”. + +These lexicons come in `rda` data files (see +[sentiment-lexicons/](sentiment-lexicons)): + +- `bing.rda` +- `afinn.rda` +- `nrc.rda` +- `loughran.rda` + +To import them in R, use the `load()` function. For example, here’s how +to import the Bing lexicon: + +``` r +# assuming that the rda files are in your working directory +load("bing.rda") +``` + +Once you’ve loaded the file `"bing.rda"`, the associated +lexicon is available in the tibble of the same name, `bing`: + +``` r +bing +#> # A tibble: 6,786 × 2 +#> word sentiment +#> <chr> <chr> +#> 1 2-faces negative +#> 2 abnormal negative +#> 3 abolish negative +#> 4 abominable negative +#> 5 abominably negative +#> 6 abominate negative +#> 7 abomination negative +#> 8 abort negative +#> 9 aborted negative +#> 10 aborts negative +#> # … with 6,776 more rows +```
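
A lexicon with `word` and `sentiment` columns, like `bing`, can be joined against tokenized text to count sentiment words per chapter. The sketch below is only an illustration: `toy_lexicon` and `toy_text` are invented stand-ins for `bing` and `hp_books`, and it assumes the `dplyr`, `tibble`, and `tidytext` packages are installed.

```r
library(dplyr)
library(tibble)
library(tidytext)

# Hypothetical four-word lexicon with the same columns as `bing`
toy_lexicon <- tibble(
  word      = c("happy", "brave", "dark", "afraid"),
  sentiment = c("positive", "positive", "negative", "negative")
)

# Hypothetical text with the same kind of columns as `hp_books`
toy_text <- tibble(
  chapter = c(1, 2),
  text    = c("Harry was happy and brave",
              "The dark corridor made Ron afraid")
)

# Tokenize into lowercase words, keep only words present in the lexicon,
# then count sentiment labels per chapter
sentiment_counts <- toy_text %>%
  unnest_tokens(word, text) %>%
  inner_join(toy_lexicon, by = "word") %>%
  count(chapter, sentiment)

sentiment_counts
```

With the real data, `toy_lexicon` would be replaced by `bing` and `toy_text` by `hp_books`; words absent from the lexicon are dropped by the inner join.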