Visualizations and Analytics

A key advance of this project is the application of advanced word study, text categorization, knowledge discovery, and visualization tools to the study of disparate digital assets. Initially, these analytic techniques will focus on text. In later iterations, as more images and sound files become available, other techniques will be added.

Clustering

Unlike supervised machine learning techniques such as text categorization or mutual information systems, clustering algorithms are generally categorized as unsupervised machine learning: they group documents without requiring labeled training examples. Clustering documents and visualizing the results can reveal patterns that are not otherwise apparent. Most "clustering" of humanities documents relies on a single classifier (e.g., "this is a ghost story") and is not based on a multi-dimensional consideration of the document. In clustering, one calculates a feature vector for each document, and these vectors are then grouped according to one of various algorithms. Well-known algorithms include nearest neighbor algorithms, normalized cut, and weighted kernel k-means.
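The two steps described above (compute a feature vector per document, then cluster the vectors) can be illustrated with a minimal sketch. It is not the method used by this project's applets: TF-IDF weighting is an assumed choice of feature vector, plain k-means stands in for the normalized cut and weighted kernel k-means algorithms named above, and the function names and toy documents are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one unit-length TF-IDF feature vector per document."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(tok for toks in tokenized for tok in set(toks))
    vectors = []
    for toks in tokenized:
        term_freq = Counter(toks)
        # Smoothed inverse document frequency (an assumed weighting).
        vec = [
            term_freq[t] * (math.log((1 + n_docs) / (1 + doc_freq[t])) + 1)
            for t in vocab
        ]
        norm = math.sqrt(sum(x * x for x in vec)) or 1.0
        vectors.append([x / norm for x in vec])
    return vectors


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(vectors, k, max_iter=100):
    """Plain k-means with deterministic farthest-first initialization."""
    centroids = [vectors[0]]
    while len(centroids) < k:
        # Next centroid: the vector farthest from all centroids chosen so far.
        centroids.append(
            max(vectors, key=lambda v: min(sq_dist(v, c) for c in centroids))
        )
    labels = None
    for _ in range(max_iter):
        # Assignment step: each vector joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: sq_dist(v, centroids[j]))
            for v in vectors
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so we have converged
        labels = new_labels
        # Update step: each centroid becomes the mean of its members.
        for j in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Hypothetical toy corpus: two ghost stories and two sea stories.
docs = [
    "ghost haunted house ghost night",
    "haunted ghost castle night",
    "ship sea voyage captain",
    "captain ship storm sea",
]
labels = kmeans(tfidf_vectors(docs), k=2)
print(labels)
```

On this toy corpus the two ghost stories share a cluster label, as do the two sea stories. In practice the feature vectors would have many more dimensions than word counts alone, which is what distinguishes this kind of clustering from sorting by a single classifier.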

Work by Kendall Giles and Mauro Magioni is being incorporated into clustering applets for this project. Each approach has its own advantages and is built on a different clustering algorithm. Both already include visualization applets that allow for drill-down to the underlying documents.
An image from Magioni 2007 showing clusters for topics from 100 Science News articles.