Parliament demo update, CANDOR corpus conversion notebook #201

Merged
merged 24 commits into from
Oct 6, 2023
e38c973
fix use of mutability in Coordination transformer.
seanzhangkx8 Apr 12, 2023
331c379
run black formatter
seanzhangkx8 Apr 12, 2023
6781ed1
fixed coordination with efficient implementation
seanzhangkx8 Apr 20, 2023
7eac9fc
comments for changes
seanzhangkx8 Apr 20, 2023
b2d41b1
metadata field deepcopy
seanzhangkx8 May 4, 2023
5f371d1
documentation and website update for V3.0
seanzhangkx8 May 25, 2023
d396492
get dataframe mutation fix
seanzhangkx8 May 26, 2023
2d83839
fix get dataframe mutability
seanzhangkx8 Jun 1, 2023
a9d9d38
modify 3.0 documentation
seanzhangkx8 Jun 2, 2023
6cb1074
revert get dataframe fixes
seanzhangkx8 Jun 3, 2023
18567d4
pairer maximize pair mode fix
seanzhangkx8 Jun 8, 2023
406a04c
backendMapper, config documentation
seanzhangkx8 Jun 22, 2023
46feed0
goodbye to python3.7
seanzhangkx8 Jul 9, 2023
4ab48c7
release date update
seanzhangkx8 Jul 11, 2023
dc2c6cd
Merge branch 'master' into convokit-3.0
cristiandnm Jul 11, 2023
1be1132
remove all storage reference
seanzhangkx8 Jul 16, 2023
cc506d7
Merge branch 'CornellNLP:master' into convokit-3.0
seanzhangkx8 Jul 16, 2023
177f1cd
update release date
seanzhangkx8 Jul 17, 2023
78bfca0
Merge branch 'convokit-3.0' of https://github.com/seanzhangkx8/ConvoK…
seanzhangkx8 Jul 17, 2023
1d032b4
Merge branch 'CornellNLP:master' into convokit-3.0
seanzhangkx8 Jul 26, 2023
a300a96
updated setup.py, README
seanzhangkx8 Jul 26, 2023
e7f9071
updated parliament_demo with correct cluster names, added CANDOR corp…
seanzhangkx8 Sep 27, 2023
dfee880
add CANDOR corpus request url
seanzhangkx8 Sep 27, 2023
485ddec
CANDOR Corpus documentation and conversion code
seanzhangkx8 Oct 3, 2023
2 changes: 1 addition & 1 deletion README.md
@@ -180,7 +180,7 @@ Name for download: `spolin-corpus`
In addition to the provided datasets, you may also use ConvoKit with your own custom datasets by loading them into a `convokit.Corpus` object. [This example script](https://github.com/CornellNLP/ConvoKit/blob/master/examples/converting_movie_corpus.ipynb) shows how to construct a Corpus from custom data.

## Installation
- This toolkit requires Python >= 3.7.
+ This toolkit requires Python >= 3.8.

1. Download the toolkit: `pip3 install convokit`
2. Download Spacy's English model: `python3 -m spacy download en`
2,069 changes: 965 additions & 1,104 deletions convokit/expected_context_framework/demos/parliament_demo.ipynb

Large diffs are not rendered by default.

118 changes: 118 additions & 0 deletions docs/source/candor.rst
@@ -0,0 +1,118 @@
CANDOR Corpus
=============
The CANDOR corpus is a dataset of 1650 conversations that strangers had over video chat, with rich metadata obtained from pre-conversation and post-conversation surveys. The corpus is available by request from the authors (`BetterUp CANDOR Corpus <https://betterup-data-requests.herokuapp.com/>`_), and ConvoKit contains code for converting the transcripts into ConvoKit format, as detailed below.

A full description of the dataset can be found here: `Andrew Reece et al., The CANDOR corpus: Insights from a large multimodal dataset of naturalistic conversation. Sci. Adv. 9, eadf3197 (2023). <https://www.science.org/doi/10.1126/sciadv.adf3197>`_
Please cite this paper when using CANDOR in your research.

Usage
-----

Request the CANDOR Corpus (transcripts only) from: `BetterUp CANDOR Corpus <https://betterup-data-requests.herokuapp.com/>`_

Convert the CANDOR Corpus into ConvoKit format using this notebook `Converting CANDOR Corpus to ConvoKit Format <https://github.com/CornellNLP/ConvoKit/blob/master/examples/dataset-examples/CANDOR/candor_to_convokit.ipynb>`_

You will need to pick a transcription type when converting the CANDOR corpus to ConvoKit format; this choice affects the ConvoKit Utterance metadata. See the Utterance-level information section below for more detail.

Dataset details
---------------

All ConvoKit metadata attributes preserve the names used in the original corpus, as detailed in the `BetterUp CANDOR Corpus Data Dictionary <https://docs.google.com/spreadsheets/d/1ADoaajRsw63WpM3zS2xyGC1YS5WM_IuhFZ94W84DDls/edit#gid=997152539>`_

Speaker-level information
^^^^^^^^^^^^^^^^^^^^^^^^^

There were 1454 unique participants from a broad range of backgrounds.

Metadata for each speaker include:

* sex: gender of speaker
* politics: political persuasion the speaker most identifies with (from very conservative to very liberal)
* race: race/ethnicity of speaker
* edu: highest level of school the speaker has completed or received
* employ: current employment situation of speaker
* age: age of speaker

Utterance-level information
^^^^^^^^^^^^^^^^^^^^^^^^^^^

The paper describes three different algorithms for segmenting transcripts into speaker turns (utterances): Audiophile, Cliffhanger, and Backbiter. Please refer to the paper for a more detailed description of how the three algorithms are implemented.

- Audiophile: A turn lasts from when one speaker starts talking until the other speaker starts speaking.
- Cliffhanger: A turn is one full sentence said by one speaker, delimited by terminal punctuation marks (periods, question marks, and exclamation points).
- Backbiter: A turn lasts from when one speaker starts talking until the other speaker says a non-backchannel word (example backchannel words: "mhm", "yeah", "exactly", etc.).

You can pick the transcript processing algorithm in the ConvoKit conversion code by changing the TRANSCRIPTION_TYPE variable. Note that the Utterance-level metadata differs depending on which algorithm is used to process the transcripts.
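To make the difference between the turn definitions concrete, here is a minimal toy sketch of Cliffhanger-style and Backbiter-style segmentation. It is not the paper's implementation and is not part of the conversion notebook: the function names, the toy transcript, and the backchannel vocabulary (limited to the three example words above) are all assumptions for illustration, and timing information is ignored.

```python
import re

# Hypothetical toy transcript: (speaker_id, text) pairs in time order.
segments = [
    ("A", "Hi there. How are you?"),
    ("B", "mhm"),
    ("A", "I just moved here!"),
    ("B", "Oh nice. Where from?"),
]

# Assumed backchannel vocabulary, taken only from the examples in the text.
BACKCHANNELS = {"mhm", "yeah", "exactly"}


def cliffhanger_like(segments):
    """Split each segment into sentences at '.', '?' or '!'."""
    turns = []
    for speaker, text in segments:
        for sentence in re.findall(r"[^.?!]+[.?!]*", text):
            sentence = sentence.strip()
            if sentence:
                turns.append((speaker, sentence))
    return turns


def is_backchannel(text):
    """True if every word in the text is in the assumed backchannel set."""
    words = re.findall(r"[a-z']+", text.lower())
    return bool(words) and all(w in BACKCHANNELS for w in words)


def backbiter_like(segments):
    """Keep a turn going while the other speaker only backchannels."""
    turns = []
    for speaker, text in segments:
        if turns and turns[-1][0] == speaker:
            # Same speaker continues: extend the current turn.
            turns[-1] = (speaker, turns[-1][1] + " " + text)
        elif turns and is_backchannel(text):
            # The other speaker only backchannels: the turn is not interrupted.
            # (This sketch drops the backchannel; the corpus records it in metadata.)
            continue
        else:
            turns.append((speaker, text))
    return turns


cliff_turns = cliffhanger_like(segments)
back_turns = backbiter_like(segments)
```

On this toy input the Cliffhanger-style split yields one turn per sentence, while the Backbiter-style merge folds speaker A's two segments into a single turn because speaker B only said "mhm" in between.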

For each utterance we provide:

* id: Unique identifier for an utterance.
* conversation_id: Utterance id corresponding to the first utterance of the conversation.
* reply_to: Utterance id of the previous utterance in the conversation.
* speaker: Speaker object corresponding to the author of this utterance.
* text: Textual content of the utterance.
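As a rough illustration of how these fields relate to each other, the sketch below links a list of turns into a reply chain. `ToyUtterance`, `link_turns`, and the id scheme are hypothetical stand-ins, not ConvoKit classes or the notebook's code; in ConvoKit the `speaker` field holds a Speaker object rather than a plain string.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ToyUtterance:
    """Hypothetical stand-in mirroring the fields listed above."""

    id: str
    conversation_id: str
    reply_to: Optional[str]
    speaker: str  # a plain id here; ConvoKit uses a Speaker object
    text: str


def link_turns(turns, prefix):
    """Build utterances where each one replies to the previous one."""
    utterances = []
    for i, (speaker, text) in enumerate(turns):
        utterances.append(
            ToyUtterance(
                id=f"{prefix}_{i}",  # hypothetical id scheme
                conversation_id=f"{prefix}_0",  # id of the first utterance
                reply_to=utterances[-1].id if utterances else None,
                speaker=speaker,
                text=text,
            )
        )
    return utterances


utts = link_turns([("A", "Hi."), ("B", "Hello!"), ("A", "How are you?")], "convo1")
```

The first utterance has no `reply_to` and its id doubles as the `conversation_id` shared by every utterance in the conversation.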

Metadata for each utterance include:

* turn_id: The id of the turn in the current conversation.
* speaker: Speaker id of the speaker of this turn.
* start: The time that the turn starts in the conversation (in seconds).
* stop: The time that the turn ends in the conversation (in seconds).
* backchannel: The text of any backchannels that occur during this conversational turn. (For "backbiter" transcription type only)
* backchannel_count: The number of backchannel instances (as defined in the paper) that occur during this conversational turn. Backchannel instances can be multiple tokens. (Method "backbiter" only)
* backchannel_speaker: The user_id of the person backchanneling. (For "backbiter" transcription type only)
* backchannel_start: The start time of the first backchannel during this turn. (For "backbiter" transcription type only)
* backchannel_stop: The end time of the last backchannel during this turn. (For "backbiter" transcription type only)
* interval: The time between the end of the last turn and the start of this turn in seconds. Can be negative if turns overlap.
* delta: The length of the turn (i.e., stop-start) in seconds.
* questions: The number of question marks that appear in the utterance.
* end_question: Indicates if the utterance ends with a question mark.
* overlap: Indicates if interval is negative.
* n_words: The number of words in the utterance.
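Several of these fields follow directly from the text and the start/stop times. The sketch below recomputes them from the definitions above; it is only an illustration, since the corpus already ships these values. The function name is hypothetical, whitespace tokenization for `n_words` is an assumption, and using `None` for the interval of the first turn is also an assumption.

```python
def derive_turn_fields(text, start, stop, prev_stop=None):
    """Recompute the derived fields from the definitions in the list above."""
    # interval: gap between the end of the previous turn and the start of this one.
    interval = None if prev_stop is None else start - prev_stop
    return {
        "delta": stop - start,  # length of the turn in seconds
        "interval": interval,
        "overlap": interval is not None and interval < 0,
        "questions": text.count("?"),
        "end_question": text.rstrip().endswith("?"),
        "n_words": len(text.split()),  # assumed whitespace tokenization
    }


# Hypothetical turn that starts 0.5 s before the previous turn ended.
fields = derive_turn_fields("Really? Where from?", start=12.5, stop=14.0, prev_stop=13.0)
```

Here the turn starts before the previous one stops, so `interval` is negative and `overlap` is true.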

Conversation-level information
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Conversation metadata contains the survey responses of each participant, keyed by survey field name; each value maps speaker ids to that speaker's answer.

For each conversation we provide:

* id: id of the conversation

Metadata for each conversation corresponds to the answers the two speakers gave in the surveys before and after that conversation.
Each conversation is a video call between two people and each participant completed one survey, so there are two surveys per conversation. The metadata is organized in the following way:

convo.meta = {"survey field name" : {speaker_id_x : answer by speaker id speaker_id_x, speaker_id_y : answer by speaker id speaker_id_y} ... }

* i_like_you: How much did you like your conversation partner?
* convo.meta['i_like_you'] = {speaker_id_x : answer by speaker id speaker_id_x, speaker_id_y : answer by speaker id speaker_id_y}
* you_like_me: How much do you think your conversation partner liked you?
* i_am_funny: How funny were you in the conversation you just had?
* you_are_funny: How funny was your conversation partner?
* i_am_polite: How polite were you during the conversation?
* you_are_polite: How polite was your conversation partner?
* my_isolation_pre_covid: Prior to the Covid-19 outbreak, how socially isolated did you feel?
* my_isolation_post_covid: SINCE the Covid-19 outbreak, how socially isolated have you felt?
* in_common: How much did you and your partner have in common with one another?
* about 200 other survey fields detailed in the `BetterUp CANDOR Corpus Data Dictionary <https://docs.google.com/spreadsheets/d/1ADoaajRsw63WpM3zS2xyGC1YS5WM_IuhFZ94W84DDls/edit#gid=997152539/>`_
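The nested layout described above (survey field name, then speaker id, then answer) can be sketched with a plain dictionary. The speaker ids, answer values, and helper functions below are hypothetical; only the two field names and the nesting come from the description above.

```python
# Hypothetical speaker ids and answers; real values come from the CANDOR surveys.
convo_meta = {
    "i_like_you": {"speaker_x": 6, "speaker_y": 4},
    "you_like_me": {"speaker_x": 5, "speaker_y": 5},
}


def partner_of(speaker_id, answers):
    """Return the other speaker id in a two-person answer dict."""
    (other,) = [sid for sid in answers if sid != speaker_id]
    return other


def liking_gap(meta, speaker_id):
    """Partner's actual liking minus the speaker's guess about it."""
    partner = partner_of(speaker_id, meta["i_like_you"])
    return meta["i_like_you"][partner] - meta["you_like_me"][speaker_id]


gap_x = liking_gap(convo_meta, "speaker_x")
gap_y = liking_gap(convo_meta, "speaker_y")
```

Keying each survey field by speaker id keeps both participants' answers side by side, so comparisons between partners need only one lookup per field. With a converted corpus, a dictionary of this shape would presumably be read from each Conversation's `meta` attribute.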


Statistics about the dataset
------------------------------

* Number of Speakers: 1454
* Number of Utterances: 527869 (if TRANSCRIPTION_TYPE = "cliffhanger")
* Number of Conversations: 1650

Additional note
---------------
Data License
^^^^^^^^^^^^

ConvoKit is not distributing the corpus separately, and thus no additional data license is applicable. The license of the original distribution applies.

Contact
^^^^^^^

Questions about the conversion into ConvoKit format should be directed to Sean Zhang <kz88@cornell.edu>

Questions about the CANDOR corpus should be directed to the corresponding authors of the original paper: <andrew.reece@betterup.com> (A.R.) and <guscooney@gmail.com> (G.C.).
1 change: 1 addition & 0 deletions docs/source/datasets.rst
@@ -5,6 +5,7 @@ Datasets
Conversations Gone Awry Dataset (Wikipedia version) <awry.rst>
Conversations Gone Awry Dataset (Reddit CMV version) <awry_cmv.rst>
Cornell Movie-Dialogs Corpus <movie.rst>
CANDOR Corpus <candor.rst>
Parliament Question Time Corpus <parliament.rst>
Wikipedia Talk Pages Corpus <wiki.rst>
Tennis Interviews <tennis.rst>
4 changes: 2 additions & 2 deletions docs/source/troubleshooting.rst
@@ -59,9 +59,9 @@ The two recommended fixes are to run:

and if that doesn't fix the issue, then run:

- >>> open /Applications/Python\ 3.7/Install\ Certificates.command
+ >>> open /Applications/Python\ 3.8/Install\ Certificates.command

- (Substitute 3.7 in the above command with your current Python version (e.g. 3.8 or 3.9) if necessary.)
+ (Substitute 3.8 in the above command with your current Python version (e.g. 3.9 or 3.10) if necessary.)

Immutability of Metadata Fields
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^