Code related to a working paper first presented at the AFSP Annual Meeting in Paris, 2013. See Section 1 of the paper and its appendix, or read the HOWTO below for a technical summary.

## DATA

The scraper currently collects slightly over 6,300 articles from:

- ecrans.fr (including articles from liberation.fr)
- lemonde.fr (first lines only for paid content)
- lesechos.fr (left-censored to December 2011)
- lefigaro.fr (first lines only for paid content)
- numerama.com (including old articles from ratiatium.com)
- zdnet.fr

## HOWTO

The entry point is `make.r`, which provides the following functions (a short usage sketch follows the list):

- `get_articles` scrapes the news sources (to update the data, adjust the page counters to the current number of search result pages on each website)
- `get_corpus` extracts all entities and lists the most common ones (set the minimum frequency with `threshold`; defaults to 10)
- `get_ranking` exports the top 15 central nodes of the co-occurrence network to the `tables` folder, in Markdown format
- `get_network` returns the co-occurrence network, optionally trimmed to its top quantile of weighted edges (set with `threshold`; defaults to 0)
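
A minimal sketch of how these might be called from an R session. The `threshold` argument names come from the list above; everything else (calling the functions with no other arguments, whether each step returns an object or only writes files) is an assumption, not a documented interface:

```r
# Hypothetical end-to-end run; argument names other than `threshold` are assumed.
source("make.r")

get_articles()                        # scrape the six news sources
get_corpus(threshold = 10)            # extract entities, keep those with frequency >= 10
net <- get_network(threshold = 0)     # co-occurrence network, no edge trimming
get_ranking()                         # write the top 15 central nodes to tables/
```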

## Tables

- `corpus.terms.csv` – a list of all entities, ordered by their raw counts
- `corpus.freqs.csv` – a list of entities found in each article
- `corpus.edges.csv` – a list of undirected weighted network ties (see the loading sketch below)
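
The column layout of these files is not documented above, so the following is only a sketch: it assumes `corpus.edges.csv` holds two entity columns followed by a weight column, and uses the igraph package to rebuild the network from it.

```r
# Assumption: first two columns are the entity pair, third column is the weight.
library(igraph)

edges <- read.csv("corpus.edges.csv", stringsAsFactors = FALSE)
names(edges)[3] <- "weight"          # igraph treats a 'weight' column as edge weights

g <- graph_from_data_frame(edges, directed = FALSE)
summary(g)                                       # vertex and edge counts
head(sort(strength(g), decreasing = TRUE), 15)   # weighted degree (see Notes)
```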

## Notes

- Edge weights are inversely proportional to the number of entity pairs in each article.
- Weighted degree is computed with Tore Opsahl's formula, using an alpha parameter of 1 (see the sketch below).
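
A minimal numeric sketch of both notes. It assumes that each article contributes 1 divided by its number of entity pairs to every pair it mentions, and it uses Opsahl's generalized degree, k^(1 − α) · s^α, where k is the number of neighbours of a node and s is the sum of its incident edge weights; with α = 1 this reduces to s alone.

```r
# Per-article contribution to each co-occurring entity pair: 1 / choose(n, 2),
# where n is the number of entities found in the article (assumption, see above).
pair_weight <- function(n_entities) 1 / choose(n_entities, 2)
pair_weight(4)   # 4 entities -> 6 pairs -> each pair gains a weight of 1/6

# Opsahl's generalized degree centrality: k^(1 - alpha) * s^alpha.
opsahl_degree <- function(k, s, alpha = 1) k^(1 - alpha) * s^alpha
opsahl_degree(k = 3, s = 0.9)   # with alpha = 1 this is simply the strength s = 0.9
```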