#############################################################
## This data repository contains all code/data necessary
## to replicate "The Colour of Finance Words" by Diego
## García, Xiaowen Hu, and Maximilian Rohrer, forthcoming
## in the Journal of Financial Economics.
##
## Dated: 20221108.
#############################################################

#############################################################
## We share our output in four different files, each with a
## different purpose.
##
## 1. ML dictionaries (dictionary.zip). We provide the lists
## of unigrams and bigrams in our final specifications, using
## both the training and the full samples. These are small
## files; the unigram lists should be plug-and-play, while for
## the bigrams you should use our text-normalizing routines.
##
## 2. Robust MNIR output (robustMNIR.zip). We provide the
## loadings estimated by the robust MNIR model for all
## unigrams/bigrams, both for the training sample and for the
## full sample. These are also small files, and researchers
## can choose different cutoffs as inclusion criteria for
## their own final dictionaries.
##
## 3. Code (code.zip). We provide the code we use in the
## paper, from the estimation of the robust MNIR model and
## the construction of the dictionaries to the generation of
## the tables we present in the final version of the paper.
##
## 4. Data (data.zip). We provide document-term matrices
## (dtms) for all the analyses in the paper (earnings calls,
## 10-Ks, and WSJ articles), as well as a blueprint of the
## metadata we use (without variables that we cannot share,
## such as stock prices). We provide GVKEYs/PERMNOs, so
## researchers should be able to link the files easily. We
## also include the LM dictionaries we use in the paper. Note
## that this file is large (~3 GB).
#############################################################

#############################################################
## 1.
## ML dictionaries (dictionary.zip)
#############################################################
## This file contains 8 different dictionaries, which form
## the core of our output. The non-dated files contain the
## dictionaries we produce using the full sample. The files
## dated *20151231* contain the dictionaries we produce using
## the pre-2016 data.
#############################################################

#############################################################
## 2. Robust MNIR output (robustMNIR.zip)
#############################################################
## This file contains 4 different csv files with the robust
## MNIR scores (% positive/negative, freq). The non-dated
## files contain the MNIR scores we produce using the full
## sample. The files dated *20151231* contain the scores we
## produce using the pre-2016 data.
#############################################################

#############################################################
## 3. Code (code.zip)
#############################################################
## This file contains 3 different R scripts.
##
## The first shows how we estimate the robust MNIR model.
##
## The second converts the MNIR output into dictionaries.
##
## The third generates the tables in the paper.
#############################################################

#############################################################
## 4. Data (data.zip)
#############################################################
## This file contains 10 different files.
##
## The six files named dtm* contain the document-term
## matrices for the different corpora we study.
##
## The three meta* files contain the metadata we use in our
## code. Note that we are not sharing fields that come from
## proprietary datasets (e.g. CRSP/Compustat), but we trust
## that researchers with access to such data can find them
## easily.
##
## The meta* and dtm* files match on rows (for each of the
## corpora).
##
## The LM_2021* file contains the Loughran and McDonald
## (2011) dictionaries we use (downloaded in 2021).
#############################################################
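## As an illustration of how the robust MNIR output in
## robustMNIR.zip might be turned into a dictionary by
## choosing a cutoff, here is a minimal Python sketch. The
## column names (term, pct_positive, pct_negative, freq) and
## the toy rows are assumptions for illustration only; check
## the headers of the actual csv files before use.

```python
# Hypothetical sketch: filtering robust MNIR scores into positive and
# negative word lists using a researcher-chosen cutoff. The csv layout
# below is an assumption, not the authors' guaranteed format.
import csv
import io

# Toy stand-in for a few rows of a robust MNIR output csv.
toy_csv = """term,pct_positive,pct_negative,freq
strong,0.92,0.03,1500
weak,0.04,0.88,900
quarter,0.51,0.49,120000
"""

def build_dictionary(rows, cutoff=0.8):
    """Keep terms whose positive (negative) score share exceeds the cutoff."""
    positive, negative = [], []
    for row in rows:
        if float(row["pct_positive"]) >= cutoff:
            positive.append(row["term"])
        elif float(row["pct_negative"]) >= cutoff:
            negative.append(row["term"])
    return positive, negative

rows = csv.DictReader(io.StringIO(toy_csv))
pos, neg = build_dictionary(rows, cutoff=0.8)
print(pos, neg)  # ['strong'] ['weak']
```

## Raising the cutoff yields smaller, higher-conviction word
## lists; lowering it trades precision for coverage.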
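## Because the dtm* and meta* files match on rows, documents
## can be scored and linked to their metadata by position
## alone. A minimal Python sketch under toy assumptions (the
## term list, counts, metadata fields, and positive word list
## are invented for illustration):

```python
# Hypothetical sketch: row-aligning a document-term matrix with its
# metadata and computing a simple positive-word share per document.
# All values below are toy data, not from the actual data.zip files.

# Toy document-term matrix: one row per document, one column per term.
terms = ["strong", "weak", "quarter"]
dtm = [
    [3, 0, 5],   # document 1
    [0, 2, 4],   # document 2
]

# Toy metadata, matching the dtm row for row (as in the repository).
meta = [
    {"gvkey": "001234", "date": "2015-02-01"},
    {"gvkey": "005678", "date": "2015-02-03"},
]
assert len(dtm) == len(meta)  # dtm* and meta* files match on rows

positive = {"strong"}  # an illustrative positive word list

def positive_share(counts):
    """Share of a document's tokens that fall in the positive list."""
    total = sum(counts)
    pos_count = sum(c for c, t in zip(counts, terms) if t in positive)
    return pos_count / total if total else 0.0

scores = [positive_share(row) for row in dtm]
print(scores)  # [0.375, 0.0]
```

## The scores can then be merged onto the metadata by row
## index, and from there onto outside data via GVKEY/PERMNO.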