Visualizations

A key advance in this project is the application of word-study and visualization tools to disparate digital assets. In the first instance, these visualizations and data-mining techniques will focus on text. In later iterations, as more images and sound files become available, other visualization and data-mining techniques will be added.

Visualization and Word Study

Using visualizations to explore relationships between texts and within texts holds great promise. These new tools allow us to get answers to previously unanswerable questions, and to pose entirely new questions to our collections of expressive culture. The project seeks to take advantage of word-study and visualization tools developed for various digital library platforms, integrating them into the web-enabled presentation platform.

During acquisition and editing, text assets will be placed into the proper databases, and structured properly, to take advantage of these tools. The general architecture of this system is as follows:

images/arch.jpg

These systems include standard word-study tools such as word frequency counts and concordances. They also include far more advanced clustering routines whose results are displayed in a visually rich manner. In later versions, we will incorporate morphological analysis and word lookup tools. Morphological tagging makes searches on large text corpora written in an inflected language far more accurate, and it improves the accuracy of the visualizations as well. A word lookup tool that links texts to the Dictionary of Jutlandic Dialects also holds great promise.
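As a rough illustration of the simplest of these tools, the following sketch counts word frequencies in a short passage. It is a minimal example written for this description, not code from any of the platforms named here; the function names and the sample sentence are our own assumptions. It also shows why morphological tagging matters: without it, inflected forms such as "mand" and "manden" are counted as unrelated words.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and return its word tokens.

    The pattern matches runs of Unicode letters, so Danish
    characters such as æ, ø and å are kept inside words.
    """
    return re.findall(r"[^\W\d_]+", text.lower())


def word_frequencies(text):
    """Return a Counter mapping each word form to its number of occurrences."""
    return Counter(tokenize(text))


# Hypothetical sample sentence, used only for illustration.
sample = "Der var en gang en mand, og manden havde en gård."
freqs = word_frequencies(sample)

# Print the word forms from most to least frequent.
for word, count in freqs.most_common():
    print(word, count)
```

A morphological tagger would let the two forms of "mand" be grouped under one lemma before counting, which is the improvement in accuracy described above.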

During the initial phase, we will take advantage of the visualization tools available through GDL 3.0 and through TextArc.

GDL 3.0 Visualizations

Greenstone Digital Library 3.0 incorporates three very promising visualizations for the study of large textual corpora. These include a Radial view, in which a calculation of keywords allows one to arrange keyword nodes in a circle around document nodes in the center. The keywords act as anchors, pulling the documents toward them with a strength equal to the statistical weight of the keyword in the document.

images/radial_active.bmp
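The anchoring idea behind the Radial view can be sketched as follows. This is not Greenstone's own code, and we do not claim it reproduces Greenstone's layout; it is a simplified weighted-centroid approximation in which keyword anchors sit evenly on a circle and each document settles at the weighted average of the anchors that pull on it. The function names and the keyword weights are hypothetical.

```python
import math


def anchor_positions(keywords, radius=1.0):
    """Place the keyword anchors evenly on a circle around the origin."""
    n = len(keywords)
    return {
        kw: (radius * math.cos(2 * math.pi * i / n),
             radius * math.sin(2 * math.pi * i / n))
        for i, kw in enumerate(keywords)
    }


def document_position(weights, anchors):
    """Return the (x, y) point where a document comes to rest.

    `weights` maps keyword -> statistical weight in the document.
    Each anchor pulls in proportion to its weight, so the resting
    point is the weighted average of the anchor positions. A document
    with no weighted keywords stays at the center.
    """
    total = sum(weights.get(kw, 0.0) for kw in anchors)
    if total == 0:
        return (0.0, 0.0)
    x = sum(weights.get(kw, 0.0) * ax for kw, (ax, _) in anchors.items()) / total
    y = sum(weights.get(kw, 0.0) * ay for kw, (_, ay) in anchors.items()) / total
    return (x, y)


# Hypothetical keywords and weights, for illustration only.
anchors = anchor_positions(["troll", "church", "farm", "sea"])
only_troll = document_position({"troll": 0.8}, anchors)
balanced = document_position({"troll": 0.5, "farm": 0.5}, anchors)
print(only_troll, balanced)
```

A document dominated by a single keyword lands on that keyword's anchor, while a document drawn equally toward opposite anchors stays near the center.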

A Sammon map produces a two-dimensional map of document clusters. In the center of each circle is the most frequent keyword for the cluster, and clicking on the circle allows one to drill down to the underlying documents.

images/sammon.bmp

A Dendro map produces a two-dimensional image of document clusters, in which document nodes appear as leaves on a tree.

images/dendro.bmp

TextArc Visualizations

TextArc is an experimental platform that allows one to visualize relationships within a text. In this project, the idea will be to consider "text" to have different possible components: (a) the folkloric expressions of a tradition group as a whole; (b) the folklore repertoire of an individual informant or group of informants, perhaps sorted by class or gender; (c) all of the folk expressions in a specific genre; and (d) a published collection or series of collections. These visualizations, including the GDL 3.0 visualizations and word-study tools described above, can also be used on the memoirs.

TextArc incorporates a number of intriguing views that also allow for the drill-down so important in data-mining. These views include one that shows the relationship between a keyword and all other words in the text based on a statistical weighting of word forms. In one visualization, words that appear more than once are drawn only once, and then connected to all the places the word appears in the text.

images/alice2.gif
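The "drawn once, connected to every occurrence" idea can be sketched with a simple occurrence index. This is our own illustration and not TextArc's code. The layout rule used here, which spreads the text around a circle and draws each word at the average of its occurrence points, is an assumption made for the example; the function names are hypothetical.

```python
import math
import re
from collections import defaultdict


def occurrence_index(text):
    """Map each word form to the list of token positions where it occurs."""
    index = defaultdict(list)
    for pos, word in enumerate(re.findall(r"[^\W\d_]+", text.lower())):
        index[word].append(pos)
    return dict(index)


def arc_point(pos, total):
    """Point on the unit circle for token number `pos` out of `total` tokens."""
    angle = 2 * math.pi * pos / total
    return (math.cos(angle), math.sin(angle))


def word_position(positions, total):
    """Draw a word once, at the average of all of its occurrence points (assumed rule)."""
    points = [arc_point(p, total) for p in positions]
    return (sum(x for x, _ in points) / len(points),
            sum(y for _, y in points) / len(points))


# Hypothetical sample text, for illustration only.
sample = "Alice saw the rabbit and the rabbit saw Alice"
index = occurrence_index(sample)
total = sum(len(positions) for positions in index.values())

for word, positions in index.items():
    # Each word is drawn once and linked to every place it occurs.
    print(word, word_position(positions, total), positions)
```

Under this rule a word that occurs once sits on the circle at its place in the text, while a word that recurs in several places is pulled toward the interior.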

In another view, a curved line connects the words in the order that they appear in the text.

images/alice4.gif

In yet another view, a concordance shows how many times a word appears in a text.

images/alice6.gif

A final view provides a keyword-in-context (KWIC) view into the text.

images/alice5.jpg
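A keyword-in-context listing of this kind can be sketched in a few lines. The example below is a minimal illustration under our own assumptions, not TextArc's implementation; the function name, the context width, and the sample sentence are hypothetical.

```python
import re


def kwic(text, keyword, width=3):
    """Return (left context, keyword, right context) for each occurrence.

    `width` is the number of words of context shown on each side.
    Matching ignores case; the original spelling is kept in the output.
    """
    tokens = re.findall(r"[^\W\d_]+", text)
    lines = []
    for i, token in enumerate(tokens):
        if token.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, token, right))
    return lines


# Hypothetical sample sentence, for illustration only.
sample = "Alice was beginning to get very tired of sitting by her sister"
hits = kwic(sample, "tired", width=2)

for left, word, right in hits:
    # Right-align the left context so the keywords line up in a column.
    print(f"{left:>20}  {word}  {right}")
```

The number of lines returned is also the count that a concordance view would report for the keyword.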