  Annals of Applied Statistics
Replication data for: A Correlated Topic Model of Science
 
Citation Information
How to Cite
David M. Blei; John D. Lafferty, 2007, "Replication data for: A Correlated Topic Model of Science", hdl:1902.1/10646 Institute for Mathematical Statistics [Distributor]
Study Global ID: hdl:1902.1/10646
Authors: David M. Blei (Princeton University); John D. Lafferty (Carnegie Mellon University)
Production Date: 2007
Distributor: Institute for Mathematical Statistics
Distribution Date: 2007
Deposit Date: October 01, 2007
Replication For: David M. Blei and John D. Lafferty. 2007. "A Correlated Topic Model of Science." Ann. Appl. Statist. Volume 1, Number 1 (2007), 17-35.
Abstract and Scope
Abstract

Topic models, such as latent Dirichlet allocation (LDA), can be useful tools for the statistical analysis of document collections and other discrete data. The LDA model assumes that the words of each document arise from a mixture of topics, each of which is a distribution over the vocabulary. A limitation of LDA is the inability to model topic correlation even though, for example, a document about genetics is more likely to also be about disease than X-ray astronomy. This limitation stems from the use of the Dirichlet distribution to model the variability among the topic proportions. In this paper we develop the correlated topic model (CTM), where the topic proportions exhibit correlation via the logistic normal distribution [J. Roy. Statist. Soc. Ser. B 44 (1982) 139–177]. We derive a fast variational inference algorithm for approximate posterior inference in this model, which is complicated by the fact that the logistic normal is not conjugate to the multinomial. We apply the CTM to the articles from Science published from 1990–1999, a data set that comprises 57M words. The CTM gives a better fit of the data than LDA, and we demonstrate its use as an exploratory tool of large document collections.
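
The generative step the abstract describes, topic proportions drawn from a logistic normal rather than a Dirichlet, can be illustrated with a minimal sketch. This is not the authors' replication code and it does not implement their variational inference algorithm; it only simulates one document under assumed parameters. The names `sample_ctm_document`, `mu`, `sigma`, and `beta`, along with the toy sizes and covariance values, are hypothetical choices made for illustration.

```python
import numpy as np


def softmax(eta):
    """Map a real vector to the probability simplex (logistic transformation)."""
    shifted = eta - eta.max()  # subtract the max for numerical stability
    weights = np.exp(shifted)
    return weights / weights.sum()


def sample_ctm_document(mu, sigma, beta, n_words, rng):
    """Simulate one document from a correlated-topic-style generative process.

    mu, sigma : mean and covariance of the Gaussian over topic log-weights
                (assumed values, not estimates from the Science corpus)
    beta      : K x V matrix, each row a topic's distribution over the vocabulary
    """
    # Logistic normal draw: a Gaussian vector pushed through the softmax.
    # Off-diagonal entries of sigma let topic proportions co-vary, which a
    # Dirichlet prior cannot express.
    eta = rng.multivariate_normal(mu, sigma)
    theta = softmax(eta)

    # For each word, pick a topic from theta, then a word from that topic.
    topics = rng.choice(len(theta), size=n_words, p=theta)
    words = np.array([rng.choice(beta.shape[1], p=beta[k]) for k in topics])
    return theta, topics, words


rng = np.random.default_rng(0)
K, V = 3, 8  # toy numbers of topics and vocabulary terms
mu = np.zeros(K)
# Topics 0 and 1 are positively correlated; topic 2 is independent of both.
sigma = np.array([[1.0, 0.8, 0.0],
                  [0.8, 1.0, 0.0],
                  [0.0, 0.0, 1.0]])
beta = rng.dirichlet(np.ones(V), size=K)  # random topic-word distributions

theta, topics, words = sample_ctm_document(mu, sigma, beta, n_words=20, rng=rng)
```

In this sketch a document that places high weight on topic 0 tends to place high weight on topic 1 as well, which mirrors the genetics and disease example in the abstract.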

Keywords: Hierarchical models; approximate posterior inference; variational methods; text analysis
Terms of Use
Network Terms of Use: IQSS Dataverse Network Terms and Conditions

By downloading these Materials, I agree to the following:

  1. I will not use the Materials to
    1. obtain information that could directly or indirectly identify subjects.
    2. produce links among the Distributor's datasets or among the Distributor's data and other datasets that could identify individuals or organizations.
    3. obtain information about, or further contact with, subjects known to me except where the use and/or release of such identifying information has no potential for constituting an unwarranted invasion of privacy and/or breach of confidentiality.
  2. I agree not to download any Materials where prohibited by applicable law.
  3. I agree not to use the Materials in any way prohibited by applicable law.
  4. I agree that any books, articles, conference papers, theses, dissertations, reports, or other publications that I create that employ the data will reference the bibliographic citation accompanying this data. These citations include the data authors, data identifier, and other information in accord with the Recommended Standard (http://thedata.org/citation/standard) for social science data.
  5. THE DISTRIBUTOR MAKES NO WARRANTIES, EXPRESS OR IMPLIED, BY OPERATION OF LAW OR OTHERWISE, REGARDING OR RELATING TO THE DATASET.

BY CLICKING THE "I AGREE" CHECKBOX BELOW, I CONFIRM THAT I HAVE READ AND UNDERSTOOD EACH AND EVERY TERM SET FORTH IN THE TERMS AND CONDITIONS FOR THE USE OF DATA FOUND ABOVE, AND I AGREE TO BE BOUND BY ALL OF SUCH TERMS AND CONDITIONS.

IF I DO NOT UNDERSTAND OR AGREE TO ALL OF THE TERMS AND CONDITIONS, I MUST NOT DOWNLOAD THE MATERIALS.
