02-going-beyond-bow.qmd

---
format:
  revealjs:
    logo: "https://cms-cdn.lmu.de/assets/img/Logo_LMU.svg"
    footer: "[Advanced Text Analysis - SICSS-Munich 2023 - Valerie Hase](https://github.com/valeriehase)"
    center-title-slide: false
    theme: ["theme/q-theme.scss"]
    highlight-style: atom-one
    code-fold: show
    code-copy: true
    self-contained: true
    number-sections: false
    smaller: true
    progress: true
execute:
  echo: true
bibliography: "references/references.bib"
csl: references/apa.csl
editor: 
  markdown: 
    wrap: 72
---

<h1>Advanced Text Analysis</h1>

<h2>SICSS-Munich, Day 4</h2>

<hr>

Session 2️⃣: Going beyond bag-of-words: An introduction

Valerie Hase (LMU Munich)

`r fontawesome::fa("github", "black")`
[github.com/valeriehase](https://github.com/valeriehase)

`r fontawesome::fa("globe", "black")`
[valerie-hase.com](https://valerie-hase.com/)

## Agenda

-   Limitations of bow-approaches
-   Identifying meaning through ngrams
    -   Keywords-in-context
    -   Collocations
-   Identifying meaning through syntax
    -   Part-of-speech tagging
    -   Dependency parsing

## The "bag-of-words" (bow) assumption

::: incremental
Likely ❌ wrong assumption that:

-   "treat every word as having a distinct, unique meaning" [@grimmer_text_2022, pp.
    79]
-   We can represent text "as if it were a bag of words, that is, an
    unordered set of words with their position ignored, keeping only
    their frequency in the document." [@jurafsky_speech_2023, pp. 60]
-   In short: Assumption that we can **ignore** the context of words and still
    understand their meaning.
:::

## Repetition: "bag-of-words" (bow)

-   Disassembling texts into tokens is the foundation for the [bag of
    words](https://en.wikipedia.org/wiki/Bag-of-words_model){target="_blank"}
    model (bow)

-   Bow as simplified representation of text where only token
    frequencies are considered

![](figures/bag-of-words.png){fig-align="center" width="75%"}

Note. Figure from [@jurafsky_speech_2023, pp. 60]

## Repetition: Document-Feature Matrix in R `r fontawesome::fa("hand", "black")`

-   This assumption is best illustrated by any analyses based on the
    [Document-Feature-Matrix(DFM)](https://en.wikipedia.org/wiki/Document-term_matrix){target="_blank"}.
-   In DFM-based approaches, context is ignored (unless you explicitly
    include e.g. ngrams as features).

```{r, message = FALSE, warning = FALSE}
library("quanteda")
library("tidyverse")
sentences <- c("I like programming", "I do not like programming")
sentences %>% 
  tokens() %>% 
  dfm()
```

## 

::: {style="font-size: 200%;text-align:center;"}
*Can you come up with examples for when this assumption is violated?* 🤔
:::

## Bag-of-words: A valid assumption?

::: incremental
Likely ❌ violated / not helpful when dealing with...

-   **Polysemy**: "I love this sound." vs. "Sound solution!"
-   **Negation**: "Not bad!"
-   **Named Entities**: "United States", "Olaf Scholz"
-   **Features with similar meanings**: "I like greens." vs. "I like
    vegetables."
:::

## 

::: {style="font-size: 200%;text-align:center;"}
*Have you learned about any methods that relax/do not rely on the bag-of-word
assumption?* 🤔
:::

## Going beyond bag-of-words

-   **Identifying meaning through ngrams** (Session 2️⃣)
-   Identifying meaning through syntax (Session 2️⃣)
-   Identifying meaning through semantic spaces (Session 3️⃣, 4️⃣)

## First dataset for today

-   We'll use data provided by the `quanteda.corpora` package (install
    directly via Github using `devtools`)
-   US State of the Union addresses from 1790 to present
-   Corpus contains *N* = 241 speeches

```{r}
library("devtools")
devtools::install_github("quanteda/quanteda.corpora")
library("quanteda.corpora")
corpus_sotu <- data_corpus_sotu
```

## Identifying meaning through ngrams in R `r fontawesome::fa("hand", "black")`

-   [ngram](https://en.wikipedia.org/wiki/N-gram){target="_blank"}:
    sequence of *n* successive features in a corpus

    -   Bigram: "that is"
    -   Trigram: "that is great"
    -   etc.

-   Let's check out examples from our corpus:

    ```{r}
    tokens(corpus_sotu) %>%
      tokens_ngrams() %>%
      head(1)
    ```

## Repetition: Keywords-in-Context in R `r fontawesome::fa("hand", "black")`

-   [Keywords-in-context](https://en.wikipedia.org/wiki/Key_Word_in_Context){target="_blank"}
    (KWIC) as a way of displaying *concordandes*, i.e., specific
    features and their context, as a type of ngrams.

-   Let's remember how they work:

```{r}
library("quanteda.textstats")
corpus_sotu %>%
  tokens() %>%
  kwic(pattern = c("United"),
       window = 3) %>%
  head(3)
```

## Co-Occurrence Matrix in R `r fontawesome::fa("hand", "black")`

-   Columns & rows denote features
-   Cells indicate how often a feature **co-occurs** together with
    another feature in the same document
-   Upper triangle: lower diagonal sparse (i.e., 0)

```{r}
corpus_sotu %>%
  tokens() %>%
  dfm() %>%
  fcm() %>%
  head(2)
```

## Repetition: Collocations in R `r fontawesome::fa("hand", "black")`

-   [Collocations](https://en.wikipedia.org/wiki/Collocation){target="_blank"}
    as sequences of features which symbolize shared semantic meaning and
    often co-occur, e.g. "United States"
-   Indicated by co-occurrence of these features in similar contexts
    (document, sentence)

```{r}
corpus_sotu %>% 
  textstat_collocations(min_count = 500) %>% 
  arrange(-lambda) %>%
  head(3)
```

## 

::: {style="font-size: 200%;text-align:center;"}
**How could we use these methods for social science questions?** 🤔
:::

## Identifying meaning through ngrams: Overview 📚

-   **Methods**: Keywords-in-context, collocations,
    [ngram-shingling](https://en.wikipedia.org/wiki/W-shingling){target="_blank"}
    (not discussed here)

-   **Use for**: Detecting text similarities, text reuse, stereotypical
    associations

-   **Examplary studies**:

    -   for collocations: @arendt_content_2017
    -   for n-gram shingling: @nicholls_detecting_2019

-   **Tutorials**: @puschmann_automated_2019, @schweinberger2023coll,
    @watanabe_quanteda_2023

-   **Packages**:
    [quanteda](https://cran.r-project.org/web/packages/quanteda), [textreuse](https://cran.r-project.org/web/packages/spacyr) and
    related publication [@mullen_2020]

## Going beyond bag-of-words

-   Identifying meaning through ngrams (Session 2️⃣)
-   **Identifying meaning through syntax** (Session 2️⃣)
-   Identifying meaning through semantic spaces (Session 3️⃣, 4️⃣)

## Identifying meaning through syntax

-   We can also rely on information provided by syntax to better
    identify the meaning of language

-   Here, we will focus on two approaches:

    -   Part-of-speech tagging
    -   Dependency parsing

## Part-of-Speech Tagging (PoS): Introduction

-   [PoS](https://en.wikipedia.org/wiki/Part-of-speech_tagging){target="_blank"}:
    "process of assigning a part-of-speech to each word in a text"
    [@jurafsky_speech_2023, pp. 163]
-   Tags based on feature & context
-   Often used for [Named Entity
    Recognition](https://en.wikipedia.org/wiki/Named-entity_recognition)

## Part-of-Speech Tagging (PoS): Introduction

-   [PoS](https://en.wikipedia.org/wiki/Part-of-speech_tagging){target="_blank"}:
    "process of assigning a part-of-speech to each word in a text"
    [@jurafsky_speech_2023, pp. 163]
-   Tags based on feature & context
-   Often used for [Named Entity
    Recognition](https://en.wikipedia.org/wiki/Named-entity_recognition)

![](figures/pos_tag.png){fig-align="left" width="20%"
fig-alt="Image of a PoS-tagged sentence"}

Note. Figure from @jurafsky_speech_2023 [p. 164].

For explanation of tags, see @de_marneffe_universal_2021.

## Part-of-Speech Tagging in R `r fontawesome::fa("hand", "black")`

-   In R, usually via the `spacyr` package (but requires Python, installation somewhat [complicated](https://cran.r-project.org/web/packages/spacyr/vignettes/using_spacyr.html))
-   For simplicity, here via `udpipe` package
-   But check out the comparison between both
    [here](https://www.r-bloggers.com/2018/02/a-comparison-between-spacy-and-udpipe-for-natural-language-processing-for-r-users)
    and
    [here](https://github.com/jwijffels/udpipe-spacy-comparison/tree/master)

```{r, eval = FALSE}
library("udpipe")
corpus_sotu %>%
  
  #change format for udpipe package
  as_tibble() %>%
  mutate(doc_id = paste0("text", 1:n())) %>%
  rename(text = value) %>%
  
  #for simplicity, run for fewer documents
  slice_head %>%
  
  #part-of-speech tagging, include only related variables
  udpipe("english") %>% 
  select(doc_id, sentence_id, token_id, token, upos) %>%
  head(5)
```

```{r, eval = TRUE, echo = FALSE}
#Ignore this code, small adaptation for quarto
library("udpipe")
corpus_sotu %>%
  
  #change format for udpipe package
  as_tibble() %>%
  mutate(doc_id = paste0("text", 1:n())) %>%
  rename(text = value) %>%
  
  #for simplicity, run for fewer documents
  slice_head %>%
  
  #part-of-speech tagging, include only related variables
  udpipe(object = "H:/Lehre/SICSS - München - 2023/english-ewt-ud-2.5-191206.udpipe") %>% 
  select(doc_id, sentence_id, token_id, token, upos) %>%
  head(5)
```

## Dependency parsing: Introduction

-   [Dependency
    parsing](https://en.wikipedia.org/wiki/Dependency_grammar){target="_blank"}:
    describing "the syntactic structure of a sentence \[...\] in terms
    of directed binary grammatical relations between the words"
    [@jurafsky_speech_2023, pp. 381]
-   Define syntactic meaning of features by relation to "root"
-   Use as semantic proxy

![](figures/dependency_tree.png){fig-align="left" width="20%"
fig-alt="Image of a Dependency Tree"}

Note. Figure from [@jurafsky_speech_2023, pp. 381].

For explanation of tags, see @de_marneffe_universal_2021.

## Dependency parsing in R `r fontawesome::fa("hand", "black")`

-   In R, usually via the `spacyr` package (but requires Python)
-   For simplicity, here via `udpipe` package

```{r, eval = FALSE}
library("udpipe")
corpus_sotu %>%
  
  #change format for udpipe package
  as_tibble() %>%
  mutate(doc_id = paste0("text", 1:n())) %>%
  rename(text = value) %>%
  
  #for simplicity, run for fewer documents
  slice_head %>%
  
  #dependency parsing, include only related variables
  udpipe("english") %>% 
  select(doc_id, sentence_id, token_id, token, head_token_id, dep_rel) %>%
  head(5)
```

```{r, eval = TRUE, echo = FALSE}
#Ignore this code, small adaptation for quarto
library("udpipe")
corpus_sotu %>%
  
  #change format for udpipe package
  as_tibble() %>%
  mutate(doc_id = paste0("text", 1:n())) %>%
  rename(text = value) %>%
  
  #for simplicity, run for fewer documents
  slice_head %>%
  
  #dependency parsing, include only related variables
  udpipe(object = "H:/Lehre/SICSS - München - 2023/english-ewt-ud-2.5-191206.udpipe") %>% 
  select(doc_id, sentence_id, token_id, token, head_token_id, dep_rel) %>%
  head(5)
```

## Dependency parsing in R `r fontawesome::fa("hand", "black")`

-   Using the `rsyntax` package, we can even plot this to better
    understand these relations!

```{r, eval = FALSE}
library("rsyntax")
sentence <- udpipe("My only goal in life is to understand dependency parsing", "english") %>%
  as_tokenindex() %>%
  plot_tree(., token, lemma, upos)
```

![](figures/dependency_tree_example.png){fig-alt="Image of a Dependency Tree"
fig-align="left" width="3.6cm"}

## 

::: {style="font-size: 200%;text-align:center;"}
**How could we use these methods for social science questions?** 🤔
:::

## Identifying meaning through syntax: Overview 📚

-   **Methods**: Part-of-speech tagging, dependency parsing

-   **Use for**: Detecting entities, entity-specific sentiment, sources, etc.

-   **Examplary studies**:

    -   for PoS: @burggraaff_through_2020
    -   for dependency parsing: @van_atteveldt_clause_2017,
        @fogel-dror_role-based_2019

-   **Tutorials**: @benoit_guide_2020, @schweinberger2023postag

-   **Packages**:
    - [spacyr](https://cran.r-project.org/web/packages/spacyr)
    - [udpipe](https://cran.r-project.org/web/packages/udpipe)
    - [rsyntax](https://cran.r-project.org/web/packages/rsyntax) and
    related publication [@welbers_extracting_2021]

## 

::: {style="font-size: 400%;text-align:center;"}
**Any questions?** 🤔
:::

## References