Mutual Information Models
Mutual Information (MI) models offer an interesting possibility for developing robust feature sets for the documents in the ETK corpus. Most approaches to text treat a document as nothing more than a bag of words. Simple word counts are calculated for each document, and each row of the resulting [documents x words] matrix becomes the feature vector for the corresponding document. Similarity between documents is then calculated on these vectors. MI models hold the promise of allowing the development of far more sophisticated feature sets than a simple bag of words, or even a lemmatized bag of words.
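To make the bag-of-words baseline concrete, here is a minimal sketch in Python. It assumes raw word counts and cosine similarity as the similarity measure; the text above does not name a specific measure, so cosine is an illustrative choice. The function names and the toy sentences are hypothetical and are not drawn from the ETK corpus.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Return raw word counts for one document (lowercased, word order discarded)."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse count vectors (assumed measure)."""
    # Counter returns 0 for missing words, so only shared words contribute.
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical toy documents, not taken from the ETK corpus.
docs = [
    "the farmer saw the ghost",
    "the ghost saw the farmer",
    "a troll lived under the hill",
]
vectors = [bag_of_words(d) for d in docs]

sim_same_words = cosine_similarity(vectors[0], vectors[1])
sim_different = cosine_similarity(vectors[0], vectors[2])
```

The first two toy documents contain exactly the same words in a different order, so their count vectors are identical even though the sentences describe different events. This is the kind of information a plain bag of words discards.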
I am currently exploring several MI models for application to the Tang Kristensen folklore texts. Tony Davis's interesting applications COWS and YAK incorporate calculations of collocation and other syntactic detail as part of an MI model. During the current year, I am experimenting with these approaches and hope to incorporate advances in MI to develop more sophisticated and robust features on which to base similarity measures for the textual analytics.
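As a rough illustration of how a collocation score can come out of a mutual information calculation, the sketch below computes pointwise mutual information (PMI) for adjacent word pairs. This is a generic textbook formulation, not a description of how COWS or YAK work; the function name `pmi_scores`, the `min_count` parameter, and the sample sentence are all hypothetical.

```python
import math
from collections import Counter


def pmi_scores(tokens, min_count=1):
    """Pointwise mutual information (base 2) for each adjacent word pair.

    PMI(x, y) = log2( p(x, y) / (p(x) * p(y)) ), with probabilities
    estimated from simple relative frequencies (an assumption of this sketch).
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_tokens = len(tokens)
    n_bigrams = max(len(tokens) - 1, 1)
    scores = {}
    for (x, y), count in bigrams.items():
        if count < min_count:
            continue  # skip pairs too rare to estimate reliably
        p_xy = count / n_bigrams
        p_x = unigrams[x] / n_tokens
        p_y = unigrams[y] / n_tokens
        scores[(x, y)] = math.log2(p_xy / (p_x * p_y))
    return scores


# Hypothetical sample text, not taken from the folklore collection.
tokens = "the old man saw the old church and the old man left".split()
scores = pmi_scores(tokens)
```

Pairs that occur together more often than their individual frequencies would predict receive a positive score. PMI tends to inflate the scores of rare pairs, which is why the sketch exposes a `min_count` threshold. Scores of this kind could, under these assumptions, supplement plain word counts as document features.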