Add readme

2023-04-07 07:28:43 -07:00
parent 3e704a9d9e
commit a9ef33f01e
2 changed files with 371 additions and 0 deletions
@@ -0,0 +1,196 @@
+---
+title: "Harry Potter Data"
+output: 
+  md_document:
+    variant: markdown_github
+---
+
+<!-- README.md is generated from README.Rmd. Please edit that file -->
+
+```{r include = FALSE}
+knitr::opts_chunk$set(
+  collapse = TRUE,
+  comment = "#>"
+)
+
+library(tidyverse)
+library(tidytext)
+```
+
+
+
+## About
+
+This repository contains various data files that can be used to perform a
+text analysis of [Harry Potter](https://en.wikipedia.org/wiki/Harry_Potter)
+books, written by Joanne Kathleen Rowling:
+
+1) Harry Potter and the Philosopher's Stone
+
+2) Harry Potter and the Chamber of Secrets
+
+3) Harry Potter and the Prisoner of Azkaban
+
+4) Harry Potter and the Goblet of Fire
+
+5) Harry Potter and the Order of the Phoenix
+
+6) Harry Potter and the Half-Blood Prince
+
+7) Harry Potter and the Deathly Hallows
+
+
+To perform the text analysis, we recommend using _tidyverse_ tools (see
+packages below) and getting inspiration from the book 
+[Text Mining with R: A Tidy Approach](https://www.tidytextmining.com/index.html)
+(by Silge & Robinson):
+
+```r
+library(tidyverse)
+library(tidytext)
+```
+
+
+### Content
+
+The content of this repo is divided into three directories, each one containing
+different types of files.
+
+- [csv-data-file/](csv-data-file) contains the text of all Harry Potter books
+in a single CSV file.
+
+- [rda-data-files/](rda-data-files) contains the seven Harry Potter books 
+stored in R-Data (binary) files, one file per book.
+
+- [sentiment-lexicons/](sentiment-lexicons) contains a handful of sentiment 
+lexicons, also stored in R-Data (binary) files---one file per lexicon.
+
+
+
+### Harry Potter CSV file
+
+The data of all the books is available in `csv` format: 
+`harry_potter_books.csv`.
+
+Assuming that this file is in your working directory, you can import it with
+`read_csv()`, from the tidyverse's readr package, as follows:
+
+```r
+# requires package tidyverse
+hp_books = read_csv("harry_potter_books.csv", col_types = "ccc")
+```
+
+```{r echo = FALSE}
+hp_books = read_csv("csv-data-file/harry_potter_books.csv", col_types = "ccc")
+```
+
+This data set is fairly simple---in terms of its structure---although the 
+text content is far from being tidy. The dataset has `r nrow(hp_books)` rows 
+and `r ncol(hp_books)` columns: 
+
+1) `text`: text content
+
+2) `book`: title of associated book
+
+3) `chapter`: associated chapter number
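+
+As a starting point for tidying the text, the sketch below splits each row into
+one word per row with `unnest_tokens()` from tidytext, then counts the words
+that remain after removing stop words. The tibble `hp_toy` is a made-up
+stand-in with the same three columns (it is not part of this repository); swap
+in `hp_books` to run the same steps on the real data:
+
+```r
+library(dplyr)
+library(tidytext)
+
+# hypothetical stand-in for hp_books: same columns, invented text
+hp_toy <- tibble(
+  text = c("The owl flew over the castle.", "The castle was quiet."),
+  book = "Toy Book",
+  chapter = c("1", "2")
+)
+
+# one row per word; words are lowercased and punctuation is dropped
+hp_words <- hp_toy %>%
+  unnest_tokens(output = word, input = text)
+
+# word frequencies after removing tidytext's stop words
+word_counts <- hp_words %>%
+  anti_join(stop_words, by = "word") %>%
+  count(word, sort = TRUE)
+```
+
+The `book` and `chapter` columns are carried along by `unnest_tokens()`, so the
+same counts can be broken down per book or per chapter by adding those columns
+to `count()`.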
+
+
+
+### Harry Potter R-Data Files
+
+The data of each book is also available in its own R-Data `rda` file
+(see [rda-data-files/](rda-data-files)):
+
+- `"philosophers_stone.rda"`
+- `"chamber_of_secrets.rda"`
+- `"prisoner_of_azkaban.rda"`
+- `"goblet_of_fire.rda"`
+- `"order_of_the_phoenix.rda"`
+- `"half_blood_prince.rda"`
+- `"deathly_hallows.rda"`
+
+These files come from the R package `"harrypotter"` by Bradley Boehmke:
+
+[https://github.com/bradleyboehmke/harrypotter](https://github.com/bradleyboehmke/harrypotter)
+
+To import these files use the `load()` function. For example, consider the
+first book "Harry Potter and the Philosopher's Stone"; here's how to `load()` 
+it in R:
+
+```r
+# assuming that the rda file is in your working directory
+load("philosophers_stone.rda")
+```
+
+```{r echo = FALSE}
+load("rda-data-files/philosophers_stone.rda")
+```
+
+Assuming that `"philosophers_stone.rda"` has been loaded, the text of this 
+book is available in the character vector of the same name, `philosophers_stone`:
+
+```{r}
+# text is in a character vector
+# (with as many elements as chapters in the book)
+length(philosophers_stone)
+```
+
+The number of elements in `philosophers_stone` corresponds
+to the number of chapters in this book: 17 chapters.
+
+You may want to use these files to perform bigram analysis (or other types of
+n-gram analysis).
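+
+A bigram analysis along these lines could be sketched as follows. The vector
+`toy_book` is an invented stand-in shaped like `philosophers_stone` (one
+string per chapter); the column names `chapter`, `text`, and `bigram` are
+choices made for this example, not something the data files prescribe:
+
+```r
+library(dplyr)
+library(tidytext)
+
+# hypothetical stand-in for philosophers_stone: one element per chapter
+toy_book <- c("the owl flew over the castle", "the owl slept")
+
+# wrap the vector in a tibble, keeping track of the chapter number
+book_tbl <- tibble(chapter = seq_along(toy_book), text = toy_book)
+
+# one row per pair of consecutive words within each chapter
+bigrams <- book_tbl %>%
+  unnest_tokens(output = bigram, input = text, token = "ngrams", n = 2)
+
+# most frequent bigrams across all chapters
+bigram_counts <- bigrams %>%
+  count(bigram, sort = TRUE)
+```
+
+Changing `n = 2` to `n = 3` gives trigrams instead.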
+
+
+
+### Sentiment Lexicons
+
+In addition to the Harry Potter text, you can also find data for a handful of 
+sentiment lexicons from the R package `"textdata"` (by Hvitfeldt and Silge):
+
+- `"bing"`: Bing Liu's general-purpose English sentiment lexicon, which 
+categorizes words in a binary fashion, as either positive or negative.
+
+- `"afinn"`: AFINN is a lexicon of English words rated for valence with an
+integer between minus five (negative) and plus five (positive). The words have 
+been manually labeled by Finn Årup Nielsen in 2009-2011.
+
+- `"nrc"`: General purpose English sentiment/emotion lexicon. This lexicon 
+labels words with ten possible sentiments or emotions: "negative", "positive",
+"anger", "anticipation", "disgust", "fear", "joy", "sadness", "surprise", or
+"trust". The annotations were manually done through Amazon's Mechanical Turk.
+
+- `"loughran"`: English sentiment lexicon created for use with financial
+documents. This lexicon labels words with six possible sentiments important in
+financial contexts: "negative", "positive", "litigious", "uncertainty",
+"constraining", or "superfluous".
+
+
+These lexicons come in `rda` data files (see
+[sentiment-lexicons/](sentiment-lexicons)):
+
+- `bing.rda`
+- `afinn.rda`
+- `nrc.rda`
+- `loughran.rda`
+
+To import them in R, use the `load()` function. For example, here's how to
+import the Bing lexicon:
+
+```r
+# assuming that the rda files are in your working directory
+load("bing.rda")
+```
+
+```{r echo = FALSE}
+# assuming that the rda file is in your working directory
+load("sentiment-lexicons/bing.rda")
+```
+
+Assuming that you've loaded the file `"bing.rda"`, the associated lexicon is 
+available in the tibble of the same name, `bing`:
+
+```{r}
+bing
+```
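+
+One common way to use such a lexicon is to join it to a one-word-per-row table
+and count the matches. The sketch below uses two invented tibbles so that it
+runs on its own: `toy_bing` mimics the `word` and `sentiment` columns of
+`bing`, and `toy_words` stands in for tokenized book text. Neither object
+exists in this repository:
+
+```r
+library(dplyr)
+
+# hypothetical stand-in for the bing lexicon (same two columns)
+toy_bing <- tibble(
+  word = c("happy", "abnormal", "abolish"),
+  sentiment = c("positive", "negative", "negative")
+)
+
+# hypothetical tokenized text: one word per row, tagged by chapter
+toy_words <- tibble(
+  chapter = c(1, 1, 2),
+  word = c("happy", "owl", "abnormal")
+)
+
+# keep only words found in the lexicon, then tally sentiments per chapter
+sentiment_counts <- toy_words %>%
+  inner_join(toy_bing, by = "word") %>%
+  count(chapter, sentiment)
+```
+
+Words that are absent from the lexicon (here, "owl") are dropped by
+`inner_join()`, so the counts only reflect words the lexicon covers.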
+
@@ -0,0 +1,175 @@
+<!-- README.md is generated from README.Rmd. Please edit that file -->
+
+## About
+
+This repository contains various data files that can be used to perform
+a text analysis of [Harry
+Potter](https://en.wikipedia.org/wiki/Harry_Potter) books, written by
+Joanne Kathleen Rowling:
+
+1.  Harry Potter and the Philosopher’s Stone
+
+2.  Harry Potter and the Chamber of Secrets
+
+3.  Harry Potter and the Prisoner of Azkaban
+
+4.  Harry Potter and the Goblet of Fire
+
+5.  Harry Potter and the Order of the Phoenix
+
+6.  Harry Potter and the Half-Blood Prince
+
+7.  Harry Potter and the Deathly Hallows
+
+To perform the text analysis, we recommend using *tidyverse* tools (see
+packages below) and getting inspiration from the book [Text Mining with
+R: A Tidy Approach](https://www.tidytextmining.com/index.html) (by Silge
+& Robinson):
+
+``` r
+library(tidyverse)
+library(tidytext)
+```
+
+### Content
+
+The content of this repo is divided into three directories, each one
+containing different types of files.
+
+-   [csv-data-file/](csv-data-file) contains the text of all Harry
+    Potter books in a single CSV file.
+
+-   [rda-data-files/](rda-data-files) contains the seven Harry Potter
+    books stored in R-Data (binary) files, one file per book.
+
+-   [sentiment-lexicons/](sentiment-lexicons) contains a handful of
+    sentiment lexicons, also stored in R-Data (binary) files—one file
+    per lexicon.
+
+### Harry Potter CSV file
+
+The data of all the books is available in `csv` format:
+`harry_potter_books.csv`.
+
+Assuming that this file is in your working directory, you can import it
+with `read_csv()`, from the tidyverse’s readr package, as follows:
+
+``` r
+# requires package tidyverse
+hp_books = read_csv("harry_potter_books.csv", col_types = "ccc")
+```
+
+This data set is fairly simple—in terms of its structure—although the
+text content is far from being tidy. The dataset has 95085 rows and 3
+columns:
+
+1.  `text`: text content
+
+2.  `book`: title of associated book
+
+3.  `chapter`: associated chapter number
+
+### Harry Potter R-Data Files
+
+The data of each book is also available in its own R-Data `rda` file
+(see [rda-data-files/](rda-data-files)):
+
+-   `"philosophers_stone.rda"`
+-   `"chamber_of_secrets.rda"`
+-   `"prisoner_of_azkaban.rda"`
+-   `"goblet_of_fire.rda"`
+-   `"order_of_the_phoenix.rda"`
+-   `"half_blood_prince.rda"`
+-   `"deathly_hallows.rda"`
+
+These files come from the R package `"harrypotter"` by Bradley Boehmke:
+
+<https://github.com/bradleyboehmke/harrypotter>
+
+To import these files use the `load()` function. For example, consider
+the first book “Harry Potter and the Philosopher’s Stone”; here’s how to
+`load()` it in R:
+
+``` r
+# assuming that the rda file is in your working directory
+load("philosophers_stone.rda")
+```
+
+Assuming that `"philosophers_stone.rda"` has been loaded, the text of
+this book is available in the character vector of the same name,
+`philosophers_stone`:
+
+``` r
+# text is in a character vector
+# (with as many elements as chapters in the book)
+length(philosophers_stone)
+#> [1] 17
+```
+
+The number of elements in `philosophers_stone` corresponds to the number
+of chapters in this book: 17 chapters.
+
+You may want to use these files to perform bigram analysis (or other
+types of n-gram analysis).
+
+### Sentiment Lexicons
+
+In addition to the Harry Potter text, you can also find data for a
+handful of sentiment lexicons from the R package `"textdata"` (by
+Hvitfeldt and Silge):
+
+-   `"bing"`: Bing Liu’s general-purpose English sentiment lexicon, which
+    categorizes words in a binary fashion, as either positive or negative.
+
+-   `"afinn"`: AFINN is a lexicon of English words rated for valence
+    with an integer between minus five (negative) and plus five
+    (positive). The words have been manually labeled by Finn Årup
+    Nielsen in 2009-2011.
+
+-   `"nrc"`: General purpose English sentiment/emotion lexicon. This
+    lexicon labels words with ten possible sentiments or emotions:
+    “negative”, “positive”, “anger”, “anticipation”, “disgust”, “fear”,
+    “joy”, “sadness”, “surprise”, or “trust”. The annotations were
+    manually done through Amazon’s Mechanical Turk.
+
+-   `"loughran"`: English sentiment lexicon created for use with
+    financial documents. This lexicon labels words with six possible
+    sentiments important in financial contexts: “negative”, “positive”,
+    “litigious”, “uncertainty”, “constraining”, or “superfluous”.
+
+These lexicons come in `rda` data files (see
+[sentiment-lexicons/](sentiment-lexicons)):
+
+-   `bing.rda`
+-   `afinn.rda`
+-   `nrc.rda`
+-   `loughran.rda`
+
+To import them in R, use the `load()` function. For example, here’s how
+to import the Bing lexicon:
+
+``` r
+# assuming that the rda files are in your working directory
+load("bing.rda")
+```
+
+Assuming that you’ve loaded the file `"bing.rda"`, the associated
+lexicon is available in the tibble of the same name, `bing`:
+
+``` r
+bing
+#> # A tibble: 6,786 × 2
+#>    word        sentiment
+#>    <chr>       <chr>    
+#>  1 2-faces     negative 
+#>  2 abnormal    negative 
+#>  3 abolish     negative 
+#>  4 abominable  negative 
+#>  5 abominably  negative 
+#>  6 abominate   negative 
+#>  7 abomination negative 
+#>  8 abort       negative 
+#>  9 aborted     negative 
+#> 10 aborts      negative 
+#> # … with 6,776 more rows
+```