Textual and Visual Analytics
Using visualizations to explore relationships between texts and within texts holds great promise. The relevant new tools include text categorization tools (naive Bayes and linear SVM), clustering tools (nearest neighbor, weighted kernel k-means), and other analytic tools based on mutual information models or latent Dirichlet allocation of topics. These approaches, which rely on advances in both supervised and unsupervised machine learning, may allow us to answer previously unanswerable questions and to pose entirely new questions to our collections of expressive culture. The project seeks to make some of these tools available to a broader humanities audience by making document preprocessing easier and by making visualizations of the output of these tools easier to grasp. Similarly, the tools developed for this project are intended to give the user considerable freedom in choosing which documents within the corpus to analyze and which feature sets of those documents to include in the analysis.
There are several barriers to widespread experimentation among humanities scholars with these emerging approaches to information retrieval and knowledge discovery. Preparing documents for use with these systems is often difficult. Because many of these tools are experimental (that is, developed by applied mathematicians or computer scientists to address specific problems in the field of machine learning), their code can often be quirky. Similarly, the output from these tools can be difficult to work with. In the best cases, the output is plotted as 2-D or 3-D graphs and may allow for drill-down to underlying texts. In many cases, the output is simply a numerical matrix. This project aims to address these problems by developing four tools that make use of XML texts coded according to the TEI guidelines. Several excellent word study applications, grouped under the title Vishnu, already exist in GDL 3.0. These tools will be expanded to allow retrieved documents to be mapped onto historical maps as well as into a 2-D research space. An additional indexing application will run on all of the corpus documents, creating a series of extended feature sets for each document, including the "bag of words" (BOW) already included with Vishnu, a lemmatized BOW, probabilistic calculations of collocation, index numbers and keywords. Current experiments with the corpus are helping us identify a series of core feature sets that could be extended to any humanities corpus.
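As an illustration of what such per-document feature sets might look like, the following is a minimal sketch in Python and not the project's indexing application. It builds a BOW, a lemmatized BOW, and collocation scores for a plain-text document. All function names are hypothetical, the `lemma_table` lookup stands in for a real lemmatizer, and pointwise mutual information (PMI) over adjacent word pairs is assumed here as one possible "probabilistic calculation of collocation". Extraction of text from the TEI XML is assumed to have happened already.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def bag_of_words(text):
    """Count how often each surface word form occurs in the document."""
    return Counter(tokenize(text))


def lemmatized_bag_of_words(text, lemma_table):
    """Count lemmas instead of surface forms.

    lemma_table is a hypothetical dict mapping surface forms to lemmas;
    words missing from the table are counted under their surface form.
    """
    return Counter(lemma_table.get(token, token) for token in tokenize(text))


def collocation_pmi(text, min_count=2):
    """Score adjacent word pairs by pointwise mutual information (base 2).

    PMI is an assumed choice of collocation measure. Pairs seen fewer
    than min_count times are skipped, since PMI is unreliable for rare pairs.
    """
    tokens = tokenize(text)
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_unigrams = len(tokens)
    n_bigrams = max(len(tokens) - 1, 1)

    scores = {}
    for (first, second), count in bigrams.items():
        if count < min_count:
            continue
        p_pair = count / n_bigrams
        p_first = unigrams[first] / n_unigrams
        p_second = unigrams[second] / n_unigrams
        scores[(first, second)] = math.log2(p_pair / (p_first * p_second))
    return scores


def extract_features(text, lemma_table=None):
    """Bundle the feature sets for one document into a single dict."""
    return {
        "bow": bag_of_words(text),
        "lemmatized_bow": lemmatized_bag_of_words(text, lemma_table or {}),
        "collocations": collocation_pmi(text),
    }
```

Calling `extract_features(document_text, lemma_table)` on each document would yield one dictionary of feature sets per document, from which a user could select the sets to pass to a categorization or clustering tool.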
A series of applets, developed according to the D2K framework, will allow for integration of the analytic tools into the Presentation, Research and Visualization interface. These will include visualization applets that allow users to drill down to underlying documents and to map document clusters onto geographic maps using existing geo-referenced placenames in the underlying texts.