Visualizations and Analytics

A key advance in this project is the application of word study, text categorization, knowledge discovery, and visualization tools to the study of disparate digital assets. Initially, these analytic techniques will focus on text. In later iterations, as more images and sound files become available, other techniques will be added.

Text Classification

Text classification systems are generally considered to be instantiations of supervised machine learning. Many classifiers work on binary classification schemes, while some experimental approaches incorporate multiclass classification (a brief discussion of this problem can be found in Rennie and Rifkin 2001). The project will incorporate two experimental text classification machines, one based on naïve Bayes classification and another based on a linear Support Vector Machine (SVM). Both make use of a training set of 1000 texts from the collection. While it may seem redundant to have two text categorizers, comparisons between linear SVM and naïve Bayes text classifiers suggest that each tool has clear uses for different types of problems (Yu and Unsworth 2007; Joachims 2002). Additional work from applied mathematics suggests that linear SVMs return even more accurate results in less time (Joachims 2001). Rather than limiting these classifiers to a "bag of words" feature set, the tools will be able to work with various feature sets for each document, including lemmatized word frequencies and word collocations (see mutual information).
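
As an illustration of the first of these two approaches, the following is a minimal sketch of a multinomial naïve Bayes classifier over a "bag of words" feature set, written in plain Python. It is not the project's implementation: the function names, the whitespace tokenizer (a stand-in for lemmatization), the add-one smoothing, and the four toy training texts with their labels are all assumptions made for the example.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and split on whitespace (a stand-in for real lemmatization)."""
    return text.lower().split()


def train_naive_bayes(labeled_texts):
    """Count documents per class and word occurrences per class."""
    class_docs = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for text, label in labeled_texts:
        class_docs[label] += 1
        for word in tokenize(text):
            word_counts[label][word] += 1
            vocabulary.add(word)
    return class_docs, word_counts, vocabulary


def classify(text, model):
    """Return the class with the highest log posterior under add-one smoothing."""
    class_docs, word_counts, vocabulary = model
    total_docs = sum(class_docs.values())
    best_label, best_score = None, float("-inf")
    for label in sorted(class_docs):
        # Log prior: share of training documents carrying this label.
        score = math.log(class_docs[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in tokenize(text):
            if word not in vocabulary:
                continue  # words never seen in training are ignored
            smoothed = (word_counts[label][word] + 1) / (total_words + len(vocabulary))
            score += math.log(smoothed)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical toy training set; the labels are illustrative only.
TRAINING = [
    ("the ship sailed across the sea", "voyage"),
    ("the captain steered the ship to harbour", "voyage"),
    ("the sermon praised faith and grace", "religion"),
    ("faith and prayer filled the chapel", "religion"),
]
MODEL = train_naive_bayes(TRAINING)
print(classify("the ship left the harbour", MODEL))
```

Replacing `tokenize` with a lemmatizer or a collocation extractor would change the feature set without touching the classifier itself, which is the kind of flexibility described above.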

Feature selection is an important consideration in all Humanities collections. Scoring of features can lead to significant bias in the classification of documents. Forman (2004) has noted some of these pitfalls and proposed a possible solution to some of them. Mutual information approaches to the documents in the collection hold significant promise for developing robust feature sets, while Latent Dirichlet Allocation (LDA) may allow us to see how topics change over time or geographic location. Probabilistic Latent Semantic Analysis (PLSA) similarly offers interesting possibilities for understanding these documents.
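
To make the idea of feature scoring concrete, the sketch below computes the mutual information, in bits, between the presence of a term in a document and the document's class label. This is one common formulation and is offered only as an assumption about how such scoring might look; the function name and the toy documents are hypothetical, and the project may score features differently (for example, along the lines Forman proposes).

```python
import math
from collections import Counter


def mutual_information(docs, term):
    """Mutual information (bits) between presence of `term` and the class label.

    `docs` is a list of (set_of_tokens, label) pairs.
    """
    n = len(docs)
    # Joint counts over (term present?, label).
    joint = Counter((term in tokens, label) for tokens, label in docs)
    term_marginal = Counter()
    label_marginal = Counter()
    for (present, label), count in joint.items():
        term_marginal[present] += count
        label_marginal[label] += count
    mi = 0.0
    for (present, label), count in joint.items():
        p_joint = count / n
        p_independent = (term_marginal[present] / n) * (label_marginal[label] / n)
        mi += p_joint * math.log2(p_joint / p_independent)
    return mi


# Hypothetical toy documents, already reduced to sets of tokens.
DOCS = [
    ({"ship", "sea", "the"}, "voyage"),
    ({"ship", "harbour", "the"}, "voyage"),
    ({"faith", "grace", "the"}, "religion"),
    ({"faith", "chapel", "the"}, "religion"),
]

for term in ("ship", "sea", "the"):
    print(term, round(mutual_information(DOCS, term), 3))
```

In this toy data, "ship" separates the two classes perfectly and scores highest, while "the" appears in every document and carries no information about the label.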

The tools for this project will be developed using the D2K (T2K) framework, making the modules easily redeployable to other Humanities projects. We are exploring the possibility of user choice, particularly with regard to which subset of the ETK domain to use as the target set. Furthermore, the machines could allow the user to choose the classification scheme, from single binary classification operations to a multiclass classification operation. We will explore developing a 2-D visualization applet that gives a comprehensive overview of the multiclass classification of the texts, such as that described by di Nunzio (2006).
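
One way to move from binary to multiclass classification is a one-vs-rest reduction: train one binary classifier per class and pick the class whose classifier scores highest. The sketch below illustrates that reduction only. It assumes a simple perceptron as the binary learner, standing in for the project's naïve Bayes or SVM machines, and every name and toy text in it is hypothetical.

```python
from collections import Counter


def features(text):
    """Bag-of-words counts for a text."""
    return Counter(text.lower().split())


def score(weights, feats):
    """Linear score of a document under one binary classifier."""
    return sum(weights[word] * count for word, count in feats.items())


def train_perceptron(examples, epochs=10):
    """Train a binary perceptron on (features, is_positive) pairs."""
    weights = Counter()
    for _ in range(epochs):
        for feats, is_positive in examples:
            target = 1 if is_positive else -1
            # Update only when the example is misclassified or on the boundary.
            if score(weights, feats) * target <= 0:
                for word, count in feats.items():
                    weights[word] += target * count
    return weights


def train_one_vs_rest(labeled_texts):
    """Train one binary classifier per label: that label versus all others."""
    labels = sorted({label for _, label in labeled_texts})
    return {
        label: train_perceptron(
            [(features(text), other == label) for text, other in labeled_texts]
        )
        for label in labels
    }


def predict(classifiers, text):
    """Return the label whose binary classifier gives the highest score."""
    feats = features(text)
    return max(sorted(classifiers), key=lambda label: score(classifiers[label], feats))


# Hypothetical three-class toy data.
TEXTS = [
    ("ship sea", "voyage"),
    ("ship harbour", "voyage"),
    ("faith grace", "religion"),
    ("faith chapel", "religion"),
    ("law court", "law"),
    ("law judge", "law"),
]
CLASSIFIERS = train_one_vs_rest(TEXTS)
print(predict(CLASSIFIERS, "law court judge"))
```

A single binary operation corresponds to using just one entry of `CLASSIFIERS`; the multiclass operation compares all of them.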