Download here (Apache Parquet file; as of 25.01.23) | The data is preliminary and may change as we update the paper/model!

We currently provide the recovered percentiles of missing firm characteristics in the file raw_infill.pq. The file has 10 columns:


  • date: in format YYYY-MM-01

  • id: in format crsp_PERMNO

  • char: name of the characteristic, following the convention in Jensen, Kelly, Pedersen (2021)

  • perc: recovered percentile of the missing entry

  • lower: lower raw value of the recovered percentile

  • upper: upper raw value of the recovered percentile

  • mid: midpoint of lower and upper, serving as an estimate of the raw value of the recovered characteristic

  • mean: mean of the raw observed entries for other firms within the recovered percentile

  • median: median of the raw observed entries for other firms within the recovered percentile

  • missingness [%]: share of firms for which the target characteristic is missing in the given month (date)

NOTE: We only provide information about the recovered missing entries.
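
As a rough illustration of how the long-format table described above could be used, the sketch below reshapes it into one row per (date, id) and one column per characteristic. It assumes the column names match the list above exactly and that pandas is installed (reading Parquet additionally needs pyarrow or fastparquet). The helper name pivot_infill and the toy values are hypothetical and are not taken from raw_infill.pq.

```python
import pandas as pd


def pivot_infill(df: pd.DataFrame, value_col: str = "mid") -> pd.DataFrame:
    """Reshape the long table: one row per (date, id), one column per characteristic.

    value_col picks which raw-value estimate to use ("mid", "mean" or "median").
    """
    return df.pivot_table(
        index=["date", "id"], columns="char", values=value_col, aggfunc="first"
    )


# To work with the real file (assumed to be in the working directory):
# df = pd.read_parquet("raw_infill.pq")

# Toy stand-in with made-up numbers, only to show the expected layout.
toy = pd.DataFrame(
    {
        "date": ["2000-01-01", "2000-01-01", "2000-02-01"],
        "id": ["crsp_10001", "crsp_10001", "crsp_10002"],
        "char": ["be_me", "ret_12_1", "be_me"],
        "lower": [0.4, 0.10, 0.8],
        "upper": [0.6, 0.20, 1.0],
    }
)
# mid is described as the midpoint of lower and upper.
toy["mid"] = (toy["lower"] + toy["upper"]) / 2

wide = pivot_infill(toy)
print(wide)
```

Characteristics that were not recovered for a given firm-month show up as NaN in the wide table, since the file only covers recovered entries.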


We now also provide, for each missing entry in the raw dataset, the estimated probability distribution across the percentiles of the target characteristic. Because these files are quite large, we have split them by decade:

Link to folder (Apache Parquet files; as of 25.01.23)

CAUTION: the files are large (15 GB in total).
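
One possible way to condense a single estimated distribution into point summaries is sketched below. The layout of the distribution files is not described here, so everything about the input is an assumption: the sketch takes one entry's probabilities as a plain vector over K equally wide percentile bins on [0, 1] and uses bin centers as the percentile values. The function name summarize_distribution and the toy probabilities are hypothetical.

```python
import numpy as np


def summarize_distribution(probs) -> dict:
    """Return the most likely and the expected percentile for one entry.

    Assumes probs[i] is the probability of the i-th of K equally wide
    percentile bins covering [0, 1]; this layout is an assumption.
    """
    p = np.asarray(probs, dtype=float)
    p = p / p.sum()  # renormalize in case of rounding
    k = p.size
    centers = (np.arange(k) + 0.5) / k  # bin midpoints on [0, 1]
    return {
        "mode": float(centers[p.argmax()]),
        "mean": float(centers @ p),
    }


# Made-up probabilities over four bins, for illustration only.
toy_probs = [0.1, 0.2, 0.4, 0.3]
summary = summarize_distribution(toy_probs)
print(summary)
```

Given the total size of the files, it is probably sensible to process one decade at a time rather than loading everything into memory at once.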