The massive increase in text available in digital formats presents enormous opportunities for social scientists. Yet systematically hand coding a significant share of the available blogs, speeches, emails, web pages, government records, newspapers, or other digitized texts is infeasible. Although computer scientists have developed effective methods for automated content analysis, those methods aim to classify individual documents correctly, whereas social scientists are usually interested in generalizations about the population of documents, such as the proportion in a given category. Unfortunately, even classifiers that categorize individual documents with high accuracy can be hugely biased when estimating category proportions. By directly optimizing for the broader goal of many social scientists, we develop a method that gives approximately unbiased estimates of the category proportions. We illustrate the method with several diverse data sources, including the daily expressed opinions of hundreds of thousands of people about the U.S. presidency. We also make available easy-to-use software that implements our methods and large corpora of text for further analysis. You may also be interested in the ReadMe: Software for Automated Content Analysis.
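The claim that an accurate classifier can still give biased category proportions can be checked with simple arithmetic. The sketch below is a hypothetical illustration, not the estimator developed in the paper: it assumes a single category with a true share of 10% and a classifier that is right 90% of the time on documents both inside and outside that category. All names and numbers are made up for the example. Counting the classifier's labels gives 0.10 × 0.90 + 0.90 × 0.10 = 0.18, nearly double the true share, even though individual-level accuracy is 90%. The last function applies a standard misclassification correction, which assumes the error rates measured on a hand-coded sample also hold in the full population; it is shown only to make the idea of targeting the proportion directly concrete.

```python
def accuracy(true_share, sensitivity, specificity):
    """Fraction of individual documents the classifier labels correctly."""
    return true_share * sensitivity + (1 - true_share) * specificity


def classify_and_count(true_share, sensitivity, specificity):
    """Share of documents the classifier assigns to the category.

    True members are kept at rate `sensitivity`; non-members are wrongly
    added at rate `1 - specificity`.
    """
    return true_share * sensitivity + (1 - true_share) * (1 - specificity)


def corrected_share(observed_share, sensitivity, specificity):
    """Invert the misclassification equation to recover the true share.

    Assumption: sensitivity and specificity estimated on a labeled sample
    are the same in the population being measured.
    """
    false_positive_rate = 1 - specificity
    return (observed_share - false_positive_rate) / (sensitivity - false_positive_rate)


# Hypothetical values chosen for illustration only.
TRUE_SHARE = 0.10
SENSITIVITY = 0.90
SPECIFICITY = 0.90

acc = accuracy(TRUE_SHARE, SENSITIVITY, SPECIFICITY)
naive = classify_and_count(TRUE_SHARE, SENSITIVITY, SPECIFICITY)
fixed = corrected_share(naive, SENSITIVITY, SPECIFICITY)

print(f"individual accuracy:      {acc:.2f}")
print(f"true category share:      {TRUE_SHARE:.2f}")
print(f"classify-and-count share: {naive:.2f}")
print(f"corrected share:          {fixed:.2f}")
```

The bias arises because the rare category gains more false positives from the large remainder than it loses through false negatives, so the errors do not cancel in the aggregate.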