Latent Dirichlet Allocation
Latent Dirichlet Allocation (LDA) is an intriguing statistical approach to building a mathematical representation of a text document. Unlike naive Bayes classifiers, LDA takes into account which words co-occur within documents, so it represents document structure in a way that goes beyond simple vocabulary. The LDA representation of a document can be used for clustering, classification, and retrieval. In fall 2007, David Blei, one of the pioneers of LDA, participated in the NSF IPAM group on the Mathematics of Knowledge and Search Engines. I am currently exploring the possibility of using his approach to develop this type of representation of the Tang Kristensen corpus.
An important aspect of the current work on LDA is the discovery of topics in the corpus. These topics are related to underlying words in the corpus. For example, the topic ghost would have a certain probability of generating the words haunt, dead, cemetery, and minister, while the topic minister would have a certain probability of generating the words church, pig, cow, revenant, and conjure. Topic discovery can occur dynamically across the corpus. A dynamic topic model, such as the one discussed in Blei (2006), would further allow one to see how topics change over time, an interesting prospect given the remarkably fast changes in the social, cultural, and physical landscape of late 19th-century Denmark. Such a model can be applied to most folklore corpora, as the date of collection can often be discovered. In the ETK corpus, these dates of collection are well established.
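The relationship between topics and words can be illustrated with a minimal sketch of the generative process that LDA assumes: each document draws its own mixture of topics, and each word is produced by first picking a topic from that mixture and then picking a word from that topic. The topic names follow the example above, but the probabilities, the `alpha` value, and the function names are invented for illustration; nothing here is estimated from the ETK corpus.

```python
import random

# Hypothetical topic-word probabilities, invented for illustration only.
# In a real model these would be estimated from the corpus.
TOPICS = {
    "ghost": {"haunt": 0.4, "dead": 0.3, "cemetery": 0.2, "minister": 0.1},
    "minister": {"church": 0.4, "pig": 0.1, "cow": 0.1,
                 "revenant": 0.2, "conjure": 0.2},
}


def sample_topic_mixture(alpha, rng):
    """Draw per-document topic proportions from a symmetric Dirichlet(alpha).

    A Dirichlet draw is obtained by normalizing independent Gamma draws.
    """
    draws = [rng.gammavariate(alpha, 1.0) for _ in TOPICS]
    total = sum(draws)
    return [d / total for d in draws]


def generate_document(n_words, alpha=0.5, seed=None):
    """Generate one synthetic document under the assumed topics."""
    rng = random.Random(seed)
    names = list(TOPICS)
    theta = sample_topic_mixture(alpha, rng)  # this document's topic mixture
    words = []
    for _ in range(n_words):
        # Pick a topic for this word position, then a word from that topic.
        topic = rng.choices(names, weights=theta)[0]
        vocab = TOPICS[topic]
        word = rng.choices(list(vocab), weights=list(vocab.values()))[0]
        words.append(word)
    return theta, words


if __name__ == "__main__":
    theta, words = generate_document(12, seed=1)
    print("topic mixture:", dict(zip(TOPICS, theta)))
    print("document:", " ".join(words))
```

Fitting LDA to a corpus runs this process in reverse: given only the observed words, it infers topic-word probabilities and per-document mixtures that plausibly generated them.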
Probabilistic Latent Semantic Analysis (PLSA) takes a somewhat different approach: it provides a generative model of the existing documents in a corpus but not of new documents, a shortcoming that LDA addresses. PLSA has also been criticized for overfitting, because its number of parameters grows linearly with the number of documents in the corpus.
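The parameter growth can be made concrete with a short calculation. The sketch below uses the commonly cited counts for k topics, V vocabulary words, and M documents: kV + kM parameters for PLSA (one topic mixture per training document) against k + kV for LDA. The function names and the corpus sizes are hypothetical and chosen only for illustration.

```python
def plsa_parameter_count(n_topics, vocab_size, n_documents):
    """Topic-word probabilities plus one topic mixture per training document."""
    return n_topics * vocab_size + n_topics * n_documents


def lda_parameter_count(n_topics, vocab_size):
    """Dirichlet prior parameters plus topic-word probabilities."""
    return n_topics + n_topics * vocab_size


if __name__ == "__main__":
    n_topics, vocab_size = 10, 5000  # hypothetical sizes
    for n_documents in (100, 1000, 10000):
        print(
            n_documents,
            plsa_parameter_count(n_topics, vocab_size, n_documents),
            lda_parameter_count(n_topics, vocab_size),
        )
```

The PLSA count increases with every document added to the corpus, while the LDA count depends only on the number of topics and the vocabulary size.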