Add readme

2023-04-07 07:28:43 -07:00 · 2023-04-07 07:28:43 -07:00 · a9ef33f01e
commit a9ef33f01e
parent 3e704a9d9e
2 changed files with 371 additions and 0 deletions
--- a/README.Rmd
+++ b/README.Rmd
@ -0,0 +1,196 @@
 ---
 title: "Harry Potter Data"
 output: 
  md_document:
    variant: markdown_github
 ---
 <!-- README.md is generated from README.Rmd. Please edit that file -->
 ```{r include = FALSE}
 knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
 )
 library(tidyverse)
 library(tidytext)
 ```
 ## About
 This repository contains various data files that can be used to perform a
 text analysis of [Harry Potter](https://en.wikipedia.org/wiki/Harry_Potter)
 books, written by Joanne Kathleen Rowling:
 1) Harry Potter and the Philosopher's Stone
 2) Harry Potter and the Chamber of Secrets
 3) Harry Potter and the Prisoner of Azkaban
 4) Harry Potter and the Goblet of Fire
 5) Harry Potter and the Order of the Phoenix
 6) Harry Potter and the Half-Blood Prince
 7) Harry Potter and the Deathly Hallows
 To perform the text analysis, we recommend using _tidyverse_ tools (see
 packages below) and getting inspiration from the book 
 [Text Mining with R: A Tidy Approach](https://www.tidytextmining.com/index.html)
 (by Silge & Robinson):
 ```r
 library(tidyverse)
 library(tidytext)
 ```
 ### Content
 The content of this repo is divided in three directories, each one containing
 different types of files.
 - [csv-data-file/](csv-data-file) contains the text of all Harry Potter books
 in a single CSV file.
 - [rda-data-files/](rda-data-files) contains the seven Harry Potter books 
 stored in R-Data (binary) files---one filer per book.
 - [sentiment-lexicons/](sentiment-lexicons) contains a handful of sentiment 
 lexicons, also stored in R-Data (binary) files---one file per lexicon.
 ### Harry Potter CSV file
 The data of all the books is available in `csv` format: 
 `harry_potter_books.csv`.
 Assuming that this file is in your working directory, you can import it---via
 tidyverse's `readr()`---as follows:
 ```r
 # requires package tidyverse
 hp_books = read_csv("harry_potter_books.csv", col_types = "ccc")
 ```
 ```{r echo = FALSE}
 hp_books = read_csv("csv-data-file/harry_potter_books.csv", col_types = "ccc")
 ```
 This data set is fairly simple---in terms of its structure---although the 
 text content is far from being tidy. The dataset has `r nrow(hp_books)` rows 
 and `r ncol(hp_books)` columns: 
 1) `text`: text content
 2) `book`: title of associated book
 3) `chapter`: associated chapter number
 ### Harry Potter R-Data Files
 The data of each book is also available in its own R-Data `rda` file
 (see [rda-data-files/](rda-data-files)):
 - `"philosophers_stone.rda"`
 - `"chamber_of_secrets.rda"`
 - `"prisoner_of_azkaban.rda"`
 - `"goblet_of_fire.rda"`
 - `"order_of_the_phoenix.rda"`
 - `"half_blood_prince.rda"`
 - `"deathly_hallows.rda"`
 These files come from the R package `"harrypotter"` by Bradley Boehmke
 [https://github.com/bradleyboehmke/harrypotter](https://github.com/bradleyboehmke/harrypotter)
 To import these files use the `load()` function. For example, consider the
 first book "Harry Potter and the Philosopher's Stone"; here's how to `load()` 
 it in R:
 ```r
 # assuming that the rda file is in your working directory
 load("philosophers_stone.rda")
 ```
 ```{r echo = FALSE}
 load("rda-data-files/philosophers_stone.rda")
 ```
 Assuming that `"philosophers_stone.rda"` has been loaded, the text of this 
 book is available in the homonym character vector `philosophers_stone`
 ```{r}
 # text is in a character vector
 # (with as many elements as chapters in the book)
 length(philosophers_stone)
 ```
 The number of elements in `philosophers_stone` corresponds
 to the number of chapters in this book: 17 chapters.
 You may want to use these files to perform bigram analysis (or other type of
 n-gram analysis).
 ### Sentiment Lexicons
 In addition to the Harry Potter text, you can also find data for a handful of 
 sentiment lexicons from the R package `"textdata"` (by Hvitfeldt and Silge):
 - `"bing"`: Bing Liu's General purpose English sentiment lexicon that 
 categorizes words in a binary fashion, either positive or negative
 - `"afinn"`: AFINN is a lexicon of English words rated for valence with an
 integer between minus five (negative) and plus five (positive). The words have 
 been manually labeled by Finn Årup Nielsen in 2009-2011.
 - `"nrc"`: General purpose English sentiment/emotion lexicon. This lexicon 
 labels words with six possible sentiments or emotions: "negative", "positive",
 "anger", "anticipation", "disgust", "fear", "joy", "sadness", "surprise", or
 "trust". The annotations were manually done through Amazon's Mechanical Turk.
 - `"loughran"`: English sentiment lexicon created for use with financial
 documents. This lexicon labels words with six possible sentiments important in
 financial contexts: "negative", "positive", "litigious", "uncertainty",
 "constraining", or "superfluous".
 These lexicons come in `rda` data files (see
 [sentiment-lexicons/](sentiment-lexicons)):
 - `bing.rda`
 - `afinn.rda`
 - `bing.rda`
 - `loughran.rda`
 To import them in R, use the `load()` function. For example, here's how to
 import the Bing lexicon:
 ```r
 # assuming that the rda files are in your working directory
 load("bing.rda")
 ```
 ```{r echo = FALSE}
 # assuming that the rda file is in your working directory
 load("sentiment-lexicons/bing.rda")
 ```
 Assuming that you've loaded the file `"bing.rda"`, the associated lexicon is 
 available in the homonym tibble `bing`
 ```{r}
 bing
 ```
--- a/README.md
+++ b/README.md
@ -0,0 +1,175 @@
 <!-- README.md is generated from README.Rmd. Please edit that file -->
 ## About
 This repository contains various data files that can be used to perform
 a text analysis of [Harry
 Potter](https://en.wikipedia.org/wiki/Harry_Potter) books, written by
 Joanne Kathleen Rowling:
 1.  Harry Potter and the Philosopher’s Stone
 2.  Harry Potter and the Chamber of Secrets
 3.  Harry Potter and the Prisoner of Azkaban
 4.  Harry Potter and the Goblet of Fire
 5.  Harry Potter and the Order of the Phoenix
 6.  Harry Potter and the Half-Blood Prince
 7.  Harry Potter and the Deathly Hallows
 To perform the text analysis, we recommend using *tidyverse* tools (see
 packages below) and getting inspiration from the book [Text Mining with
 R: A Tidy Approach](https://www.tidytextmining.com/index.html) (by Silge
 & Robinson):
 ``` r
 library(tidyverse)
 library(tidytext)
 ```
 ### Content
 The content of this repo is divided in three directories, each one
 containing different types of files.
 -   [csv-data-file/](csv-data-file) contains the text of all Harry
    Potter books in a single CSV file.
 -   [rda-data-files/](rda-data-files) contains the seven Harry Potter
    books stored in R-Data (binary) files—one filer per book.
 -   [sentiment-lexicons/](sentiment-lexicons) contains a handful of
    sentiment lexicons, also stored in R-Data (binary) files—one file
    per lexicon.
 ### Harry Potter CSV file
 The data of all the books is available in `csv` format:
 `harry_potter_books.csv`.
 Assuming that this file is in your working directory, you can import
 it—via tidyverse’s `readr()`—as follows:
 ``` r
 # requires package tidyverse
 hp_books = read_csv("harry_potter_books.csv", col_types = "ccc")
 ```
 This data set is fairly simple—in terms of its structure—although the
 text content is far from being tidy. The dataset has 95085 rows and 3
 columns:
 1.  `text`: text content
 2.  `book`: title of associated book
 3.  `chapter`: associated chapter number
 ### Harry Potter R-Data Files
 The data of each book is also available in its own R-Data `rda` file
 (see [rda-data-files/](rda-data-files)):
 -   `"philosophers_stone.rda"`
 -   `"chamber_of_secrets.rda"`
 -   `"prisoner_of_azkaban.rda"`
 -   `"goblet_of_fire.rda"`
 -   `"order_of_the_phoenix.rda"`
 -   `"half_blood_prince.rda"`
 -   `"deathly_hallows.rda"`
 These files come from the R package `"harrypotter"` by Bradley Boehmke
 <https://github.com/bradleyboehmke/harrypotter>
 To import these files use the `load()` function. For example, consider
 the first book “Harry Potter and the Philosopher’s Stone”; here’s how to
 `load()` it in R:
 ``` r
 # assuming that the rda file is in your working directory
 load("philosophers_stone.rda")
 ```
 Assuming that `"philosophers_stone.rda"` has been loaded, the text of
 this book is available in the homonym character vector
 `philosophers_stone`
 ``` r
 # text is in a character vector
 # (with as many elements as chapters in the book)
 length(philosophers_stone)
 #> [1] 17
 ```
 The number of elements in `philosophers_stone` corresponds to the number
 of chapters in this book: 17 chapters.
 You may want to use these files to perform bigram analysis (or other
 type of n-gram analysis).
 ### Sentiment Lexicons
 In addition to the Harry Potter text, you can also find data for a
 handful of sentiment lexicons from the R package `"textdata"` (by
 Hvitfeldt and Silge):
 -   `"bing"`: Bing Liu’s General purpose English sentiment lexicon that
    categorizes words in a binary fashion, either positive or negative
 -   `"afinn"`: AFINN is a lexicon of English words rated for valence
    with an integer between minus five (negative) and plus five
    (positive). The words have been manually labeled by Finn Årup
    Nielsen in 2009-2011.
 -   `"nrc"`: General purpose English sentiment/emotion lexicon. This
    lexicon labels words with six possible sentiments or emotions:
    “negative”, “positive”, “anger”, “anticipation”, “disgust”, “fear”,
    “joy”, “sadness”, “surprise”, or “trust”. The annotations were
    manually done through Amazon’s Mechanical Turk.
 -   `"loughran"`: English sentiment lexicon created for use with
    financial documents. This lexicon labels words with six possible
    sentiments important in financial contexts: “negative”, “positive”,
    “litigious”, “uncertainty”, “constraining”, or “superfluous”.
 These lexicons come in `rda` data files (see
 [sentiment-lexicons/](sentiment-lexicons)):
 -   `bing.rda`
 -   `afinn.rda`
 -   `bing.rda`
 -   `loughran.rda`
 To import them in R, use the `load()` function. For example, here’s how
 to import the Bing lexicon:
 ``` r
 # assuming that the rda files are in your working directory
 load("bing.rda")
 ```
 Assuming that you’ve loaded the file `"bing.rda"`, the associated
 lexicon is available in the homonym tibble `bing`
 ``` r
 bing
 #> # A tibble: 6,786 × 2
 #>    word        sentiment
 #>    <chr>       <chr>    
 #>  1 2-faces     negative 
 #>  2 abnormal    negative 
 #>  3 abolish     negative 
 #>  4 abominable  negative 
 #>  5 abominably  negative 
 #>  6 abominate   negative 
 #>  7 abomination negative 
 #>  8 abort       negative 
 #>  9 aborted     negative 
 #> 10 aborts      negative 
 #> # … with 6,776 more rows
 ```