Add readme
parent 3e704a9d9e
commit a9ef33f01e
196
README.Rmd
Normal file
@@ -0,0 +1,196 @@
---
title: "Harry Potter Data"
output:
  md_document:
    variant: markdown_github
---

<!-- README.md is generated from README.Rmd. Please edit that file -->

```{r include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

library(tidyverse)
library(tidytext)
```

## About

This repository contains various data files that can be used to perform a
text analysis of [Harry Potter](https://en.wikipedia.org/wiki/Harry_Potter)
books, written by Joanne Kathleen Rowling:

1) Harry Potter and the Philosopher's Stone

2) Harry Potter and the Chamber of Secrets

3) Harry Potter and the Prisoner of Azkaban

4) Harry Potter and the Goblet of Fire

5) Harry Potter and the Order of the Phoenix

6) Harry Potter and the Half-Blood Prince

7) Harry Potter and the Deathly Hallows


To perform the text analysis, we recommend using _tidyverse_ tools (see
packages below) and getting inspiration from the book
[Text Mining with R: A Tidy Approach](https://www.tidytextmining.com/index.html)
(by Silge & Robinson):

```r
library(tidyverse)
library(tidytext)
```

### Content

The content of this repo is divided into three directories, each one containing
different types of files.

- [csv-data-file/](csv-data-file) contains the text of all Harry Potter books
  in a single CSV file.

- [rda-data-files/](rda-data-files) contains the seven Harry Potter books
  stored in R-Data (binary) files, one file per book.

- [sentiment-lexicons/](sentiment-lexicons) contains a handful of sentiment
  lexicons, also stored in R-Data (binary) files, one file per lexicon.

### Harry Potter CSV file

The data of all the books is available in `csv` format:
`harry_potter_books.csv`.

Assuming that this file is in your working directory, you can import it with
`read_csv()` from tidyverse's readr package as follows:

```r
# requires package tidyverse
hp_books = read_csv("harry_potter_books.csv", col_types = "ccc")
```

```{r echo = FALSE}
hp_books = read_csv("csv-data-file/harry_potter_books.csv", col_types = "ccc")
```

This data set is fairly simple in terms of its structure, although the
text content is far from being tidy. The dataset has `r nrow(hp_books)` rows
and `r ncol(hp_books)` columns:

1) `text`: text content

2) `book`: title of associated book

3) `chapter`: associated chapter number

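As a minimal sketch of a first step with a table of this shape, the chunk
below splits the `text` column into one word per row with `unnest_tokens()`
from tidytext and then counts words per book. The tibble `hp_toy` and its
contents are made up for illustration (they only mimic the three character
columns of `hp_books`), so the chunk runs without the CSV file:

```r
library(dplyr)
library(tidytext)

# hypothetical stand-in for hp_books: same three character columns
hp_toy <- tibble(
  text    = c("The boy who lived.", "Mr. and Mrs. Dursley were proud."),
  book    = c("Philosopher's Stone", "Philosopher's Stone"),
  chapter = c("1", "1")
)

# one row per word; by default unnest_tokens() lowercases the text,
# strips punctuation, and drops the input column
hp_words <- hp_toy %>%
  unnest_tokens(output = word, input = text)

# word frequencies per book, most frequent first
word_counts <- hp_words %>%
  count(book, word, sort = TRUE)
```

With the real data you would pass `hp_books` instead of `hp_toy`; the other
columns (`book`, `chapter`) are carried along for every word.
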
### Harry Potter R-Data Files

The data of each book is also available in its own R-Data `rda` file
(see [rda-data-files/](rda-data-files)):

- `"philosophers_stone.rda"`
- `"chamber_of_secrets.rda"`
- `"prisoner_of_azkaban.rda"`
- `"goblet_of_fire.rda"`
- `"order_of_the_phoenix.rda"`
- `"half_blood_prince.rda"`
- `"deathly_hallows.rda"`

These files come from the R package `"harrypotter"` by Bradley Boehmke:

[https://github.com/bradleyboehmke/harrypotter](https://github.com/bradleyboehmke/harrypotter)

To import these files, use the `load()` function. For example, consider the
first book, "Harry Potter and the Philosopher's Stone"; here's how to `load()`
it in R:

```r
# assuming that the rda file is in your working directory
load("philosophers_stone.rda")
```

```{r echo = FALSE}
load("rda-data-files/philosophers_stone.rda")
```

Once `"philosophers_stone.rda"` has been loaded, the text of this
book is available in the character vector of the same name,
`philosophers_stone`:

```{r}
# text is in a character vector
# (with as many elements as chapters in the book)
length(philosophers_stone)
```

The number of elements in `philosophers_stone` corresponds
to the number of chapters in this book: 17 chapters.

You may want to use these files to perform bigram analysis (or other types of
n-gram analysis).

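A bigram analysis of a per-chapter character vector could be sketched as
follows. The vector `toy_book` is a made-up stand-in for something like
`philosophers_stone` (one string per chapter); the approach assumes you first
wrap the vector in a tibble and then call `unnest_tokens()` with
`token = "ngrams"`:

```r
library(dplyr)
library(tidytext)

# hypothetical stand-in for a book vector: one element per chapter
toy_book <- c(
  "the boy who lived",
  "the vanishing glass"
)

# wrap the vector in a tibble, keeping track of the chapter number
chapters <- tibble(chapter = seq_along(toy_book), text = toy_book)

# one row per bigram; bigrams never span two chapters because
# each chapter is tokenized as its own row
bigrams <- chapters %>%
  unnest_tokens(output = bigram, input = text, token = "ngrams", n = 2)

# bigram frequencies, most frequent first
bigram_counts <- bigrams %>%
  count(bigram, sort = TRUE)
```

Changing `n = 2` to another value gives other n-gram sizes.
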
### Sentiment Lexicons

In addition to the Harry Potter text, you can also find data for a handful of
sentiment lexicons from the R package `"textdata"` (by Hvitfeldt and Silge):

- `"bing"`: Bing Liu's general-purpose English sentiment lexicon that
  categorizes words in a binary fashion, either positive or negative.

- `"afinn"`: AFINN is a lexicon of English words rated for valence with an
  integer between minus five (negative) and plus five (positive). The words have
  been manually labeled by Finn Årup Nielsen in 2009-2011.

- `"nrc"`: General-purpose English sentiment/emotion lexicon. This lexicon
  labels words with ten possible sentiments or emotions: "negative", "positive",
  "anger", "anticipation", "disgust", "fear", "joy", "sadness", "surprise", or
  "trust". The annotations were manually done through Amazon's Mechanical Turk.

- `"loughran"`: English sentiment lexicon created for use with financial
  documents. This lexicon labels words with six possible sentiments important in
  financial contexts: "negative", "positive", "litigious", "uncertainty",
  "constraining", or "superfluous".


These lexicons come in `rda` data files (see
[sentiment-lexicons/](sentiment-lexicons)):

- `bing.rda`
- `afinn.rda`
- `nrc.rda`
- `loughran.rda`

To import them in R, use the `load()` function. For example, here's how to
import the Bing lexicon:

```r
# assuming that the rda files are in your working directory
load("bing.rda")
```

```{r echo = FALSE}
# path relative to the repository root
load("sentiment-lexicons/bing.rda")
```

Once you've loaded the file `"bing.rda"`, the associated lexicon is
available in the tibble of the same name, `bing`:

```{r}
bing
```

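One common way to use a word/sentiment lexicon of this shape is to tokenize
the text and join it against the lexicon. The chunk below is only a sketch
under that assumption: `toy_bing` and `toy_text` are made-up stand-ins (a
three-word lexicon with the same `word` and `sentiment` columns as `bing`, and
two short lines of text), so it runs without any of the `rda` files:

```r
library(dplyr)
library(tidytext)

# hypothetical miniature lexicon with the same columns as bing
toy_bing <- tibble(
  word      = c("abnormal", "proud", "happy"),
  sentiment = c("negative", "positive", "positive")
)

# hypothetical text, one row per line, tagged with a chapter number
toy_text <- tibble(
  chapter = c(1, 1),
  text    = c("they were proud to say they were perfectly normal",
              "nothing abnormal ever happened")
)

# tokenize, keep only words found in the lexicon,
# then count sentiments per chapter
sentiment_counts <- toy_text %>%
  unnest_tokens(word, text) %>%
  inner_join(toy_bing, by = "word") %>%
  count(chapter, sentiment)
```

Because `inner_join()` drops words that are absent from the lexicon, the
counts only reflect words the lexicon knows about.
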
175
README.md
Normal file
@@ -0,0 +1,175 @@
<!-- README.md is generated from README.Rmd. Please edit that file -->

## About

This repository contains various data files that can be used to perform
a text analysis of [Harry
Potter](https://en.wikipedia.org/wiki/Harry_Potter) books, written by
Joanne Kathleen Rowling:

1.  Harry Potter and the Philosopher’s Stone

2.  Harry Potter and the Chamber of Secrets

3.  Harry Potter and the Prisoner of Azkaban

4.  Harry Potter and the Goblet of Fire

5.  Harry Potter and the Order of the Phoenix

6.  Harry Potter and the Half-Blood Prince

7.  Harry Potter and the Deathly Hallows

To perform the text analysis, we recommend using *tidyverse* tools (see
packages below) and getting inspiration from the book [Text Mining with
R: A Tidy Approach](https://www.tidytextmining.com/index.html) (by Silge
& Robinson):

``` r
library(tidyverse)
library(tidytext)
```

### Content

The content of this repo is divided into three directories, each one
containing different types of files.

- [csv-data-file/](csv-data-file) contains the text of all Harry
  Potter books in a single CSV file.

- [rda-data-files/](rda-data-files) contains the seven Harry Potter
  books stored in R-Data (binary) files, one file per book.

- [sentiment-lexicons/](sentiment-lexicons) contains a handful of
  sentiment lexicons, also stored in R-Data (binary) files, one file
  per lexicon.

### Harry Potter CSV file

The data of all the books is available in `csv` format:
`harry_potter_books.csv`.

Assuming that this file is in your working directory, you can import
it with `read_csv()` from tidyverse’s readr package as follows:

``` r
# requires package tidyverse
hp_books = read_csv("harry_potter_books.csv", col_types = "ccc")
```

This data set is fairly simple in terms of its structure, although the
text content is far from being tidy. The dataset has 95085 rows and 3
columns:

1.  `text`: text content

2.  `book`: title of associated book

3.  `chapter`: associated chapter number

### Harry Potter R-Data Files

The data of each book is also available in its own R-Data `rda` file
(see [rda-data-files/](rda-data-files)):

- `"philosophers_stone.rda"`
- `"chamber_of_secrets.rda"`
- `"prisoner_of_azkaban.rda"`
- `"goblet_of_fire.rda"`
- `"order_of_the_phoenix.rda"`
- `"half_blood_prince.rda"`
- `"deathly_hallows.rda"`

These files come from the R package `"harrypotter"` by Bradley Boehmke:

<https://github.com/bradleyboehmke/harrypotter>

To import these files, use the `load()` function. For example, consider
the first book, “Harry Potter and the Philosopher’s Stone”; here’s how to
`load()` it in R:

``` r
# assuming that the rda file is in your working directory
load("philosophers_stone.rda")
```

Once `"philosophers_stone.rda"` has been loaded, the text of
this book is available in the character vector of the same name,
`philosophers_stone`:

``` r
# text is in a character vector
# (with as many elements as chapters in the book)
length(philosophers_stone)
#> [1] 17
```

The number of elements in `philosophers_stone` corresponds to the number
of chapters in this book: 17 chapters.

You may want to use these files to perform bigram analysis (or other
types of n-gram analysis).

### Sentiment Lexicons

In addition to the Harry Potter text, you can also find data for a
handful of sentiment lexicons from the R package `"textdata"` (by
Hvitfeldt and Silge):

- `"bing"`: Bing Liu’s general-purpose English sentiment lexicon that
  categorizes words in a binary fashion, either positive or negative.

- `"afinn"`: AFINN is a lexicon of English words rated for valence
  with an integer between minus five (negative) and plus five
  (positive). The words have been manually labeled by Finn Årup
  Nielsen in 2009-2011.

- `"nrc"`: General-purpose English sentiment/emotion lexicon. This
  lexicon labels words with ten possible sentiments or emotions:
  “negative”, “positive”, “anger”, “anticipation”, “disgust”, “fear”,
  “joy”, “sadness”, “surprise”, or “trust”. The annotations were
  manually done through Amazon’s Mechanical Turk.

- `"loughran"`: English sentiment lexicon created for use with
  financial documents. This lexicon labels words with six possible
  sentiments important in financial contexts: “negative”, “positive”,
  “litigious”, “uncertainty”, “constraining”, or “superfluous”.

These lexicons come in `rda` data files (see
[sentiment-lexicons/](sentiment-lexicons)):

- `bing.rda`
- `afinn.rda`
- `nrc.rda`
- `loughran.rda`

To import them in R, use the `load()` function. For example, here’s how
to import the Bing lexicon:

``` r
# assuming that the rda files are in your working directory
load("bing.rda")
```

Once you’ve loaded the file `"bing.rda"`, the associated
lexicon is available in the tibble of the same name, `bing`:

``` r
bing
#> # A tibble: 6,786 × 2
#>    word        sentiment
#>    <chr>       <chr>
#>  1 2-faces     negative
#>  2 abnormal    negative
#>  3 abolish     negative
#>  4 abominable  negative
#>  5 abominably  negative
#>  6 abominate   negative
#>  7 abomination negative
#>  8 abort       negative
#>  9 aborted     negative
#> 10 aborts      negative
#> # … with 6,776 more rows
```