Visualizations and Analytics

A key advance in this project is the application of advanced word study, text categorization, knowledge discovery, and visualization tools to the study of disparate digital assets. In the first instance, these analytic techniques will focus on text. In later iterations, as more images and sound files become available, other techniques will be added.

Mutual Information Models

Mutual Information (MI) models offer an interesting possibility for developing robust feature sets for the documents in the ETK corpus. Most approaches to text treat a document as nothing more than a bag of words. Simple word counts are calculated for each document, and each row of the resulting [documents x words] matrix becomes the feature vector for the corresponding document. Similarity between documents is then calculated on these vectors. MI models hold the promise of allowing the development of far more sophisticated feature sets than a simple bag of words, or even a lemmatized bag of words.
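
To make the bag-of-words baseline concrete, here is a minimal sketch in Python. It assumes whitespace tokenization and cosine similarity as the comparison measure; neither choice is specified above, and the toy documents are invented for illustration rather than drawn from the ETK corpus.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Lowercase the text, split on whitespace, and count each token."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse word-count vectors."""
    # Only words present in both documents contribute to the dot product.
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_a * norm_b)


# Toy documents, invented for illustration (not from the ETK corpus).
docs = [
    "the farmer saw a troll by the mound",
    "a troll lived in the mound",
    "the priest read from the book",
]

# One sparse count vector per document: the rows of [documents x words].
vectors = [bag_of_words(d) for d in docs]

for i in range(len(docs)):
    for j in range(i + 1, len(docs)):
        sim = cosine_similarity(vectors[i], vectors[j])
        print(f"doc {i} vs doc {j}: {sim:.3f}")
```

In this sketch the two troll stories score as more similar to each other than either does to the third document, but only because they share surface words. Word order and word association play no part, which is the limitation MI-based features are meant to address.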

I am currently exploring several MI models for application to the Tang Kristensen folklore texts. Tony Davis's applications COWS and YAK incorporate calculations of collocation and other syntactic detail as part of an MI model. During the current year, I am experimenting with these approaches and hope to incorporate advances in MI to develop more sophisticated and robust features on which to base similarity measures for the textual analytics.
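
As a rough illustration of how mutual information can capture collocation, the sketch below scores adjacent word pairs by pointwise mutual information (PMI), a standard measure. It is not a description of how COWS or YAK work. The function name `pmi_scores`, the `min_count` parameter, and the sample sentence are all hypothetical, chosen only for this example.

```python
import math
from collections import Counter


def pmi_scores(tokens, min_count=1):
    """Pointwise mutual information for each adjacent word pair.

    PMI(x, y) = log2( p(x, y) / (p(x) * p(y)) ), with probabilities
    estimated from raw unigram and bigram frequencies.
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())

    scores = {}
    for (x, y), count in bigrams.items():
        if count < min_count:
            continue  # skip pairs seen fewer than min_count times
        p_xy = count / n_bi
        p_x = unigrams[x] / n_uni
        p_y = unigrams[y] / n_uni
        scores[(x, y)] = math.log2(p_xy / (p_x * p_y))
    return scores


# Invented sample sentence, for illustration only.
tokens = "the old troll met the old farmer and the young priest".split()
scores = pmi_scores(tokens)

# List pairs from most to least strongly associated.
for pair, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(pair, round(score, 2))
```

A pair scores highly when its words occur together more often than their individual frequencies would predict. PMI estimated from raw counts tends to overrate rare pairs, so on a real corpus a higher `min_count` or a smoothed estimate would probably be needed. Scores of this kind could then serve as features alongside, or in place of, simple word counts.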