I have written before about Martin Wattenberg's Arc Diagrams for visualizing structure within strings. They are an intriguing way of visualizing repetition at varying scales within a linear sequence. When applied to music, they produce beautiful images that illustrate its structure. I noted that these diagrams likely wouldn't work well for most narrative text, which lacks regular repetition, but that it might be fruitful to explore some lower-dimensional feature derived from the text.
In my recent exploration of ways to visualize arbitrary text documents, I tried out something visually inspired by Wattenberg's Arc Diagrams. Rather than using arcs to connect identical patterns within a document, I connect segments that contain similar words. Here is the algorithm:
Update: The interactive application for Document Arc Diagrams is now available.
Here are a few sample diagrams:
Despite the arbitrary nature of the segmentation, the technique appears to reveal some aspect of the document structure in a visually interesting manner. In Alice in Wonderland, for example, it shows what appear to be four distinct scenes in the second half of the text. The third scene is highlighted in orange, and its high-frequency words are Alice, Mock, Turtle, and Gryphon. The third sample diagram is for the lyrics of a song and shows darker lines because the similarity between segments is stronger. It also shows regular patterns that repeat multiple times, which isn't surprising for song lyrics. It would be interesting to use a line-based or syllable/phoneme-based segmentation for song lyrics rather than the simplistic approach taken here.
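To make the idea concrete, here is a minimal Python sketch of how segments that share similar words could be paired up. It is an illustration under assumptions, not the code behind the diagrams above: the equal-sized word segments, the cosine similarity over word counts, the `threshold` cutoff, and the names `split_into_segments`, `cosine_similarity`, and `document_arcs` are all hypothetical choices of mine.

```python
import math
import re
from collections import Counter


def split_into_segments(text, num_segments):
    """Split text into at most num_segments runs of lowercase words.

    Assumption: equal-sized word runs stand in for the arbitrary segmentation.
    """
    words = re.findall(r"[a-z']+", text.lower())
    size = max(1, math.ceil(len(words) / num_segments))
    return [words[i:i + size] for i in range(0, len(words), size)]


def cosine_similarity(a, b):
    """Cosine similarity between two word-count Counters (0.0 if either is empty)."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def document_arcs(text, num_segments=20, threshold=0.3, stop_words=frozenset()):
    """Return (i, j, similarity) for each pair of segments worth connecting.

    Assumption: an arc is drawn only when similarity reaches the threshold;
    the similarity value could then drive how dark the arc is drawn.
    """
    segments = split_into_segments(text, num_segments)
    counts = [Counter(w for w in seg if w not in stop_words) for seg in segments]
    arcs = []
    for i in range(len(counts)):
        for j in range(i + 1, len(counts)):
            sim = cosine_similarity(counts[i], counts[j])
            if sim >= threshold:
                arcs.append((i, j, sim))
    return arcs


if __name__ == "__main__":
    lyrics = "alpha beta alpha beta gamma delta gamma delta alpha beta alpha beta"
    for i, j, sim in document_arcs(lyrics, num_segments=3):
        print(f"segments {i} and {j}: similarity {sim:.2f}")
```

In this toy input the first and last segments contain the same words and the middle one shares none, so only one arc is produced. For song lyrics, `split_into_segments` could be swapped for a function that splits on line breaks, which is one way to try the line-based segmentation mentioned above.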
I will post an interactive application soon that will let anyone explore a fixed set of documents.