Transcript Analyzer

By: Jeff Clark Date: Fri, 02 Nov 2007

Transcripts or scripts can be very rich data sets if you are comfortable writing code to analyze text. I have created an interactive Transcript Analyzer for exploring the transcript of the recent Democratic debate in the US. One thing I focussed on was illustrating 'who said how much and when', which I noted as a weakness in the NYT tool in my earlier post.

Refer to the image below. The top section shows the distribution of some selected words within the text across a 'timeline' which goes from left to right. Each speech segment is the same width, and the height of each small white bar shows the number of occurrences of that word in that segment. You can add new words with the text box in the top right corner, or remove existing words by clicking on them.
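
As a rough sketch of how per-segment word counts like these could be computed (this is not the tool's actual code; the toy transcript, the tokenizing regex and the function name are all made up for illustration):

```python
import re
from collections import Counter

# Hypothetical toy transcript: (speaker, text) pairs in spoken order.
segments = [
    ("Williams", "Senator, what would you do about the war in Iraq?"),
    ("Clinton", "The war in Iraq must end, and Iran must not get a weapon."),
    ("Russert", "Would you raise any tax to pay for that?"),
    ("Dodd", "No new tax. The war has cost us enough."),
]

def word_counts_per_segment(segments, word):
    """Return one count per segment: how often `word` occurs in it."""
    target = word.lower()
    counts = []
    for _speaker, text in segments:
        # Lowercase and split into simple alphabetic tokens (an assumption;
        # a real tool might tokenize differently).
        tokens = re.findall(r"[a-z']+", text.lower())
        counts.append(Counter(tokens)[target])
    return counts

# One list per tracked word; each entry would be one bar height.
war_counts = word_counts_per_segment(segments, "war")
tax_counts = word_counts_per_segment(segments, "tax")
```

Each list has one entry per speech segment, so plotting it left to right with equal-width bars gives one row of the word display described above.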

Right below the word distribution graphs is a similar coloured set showing a spectral decomposition of the text based on who spoke and how much was said. In this case the bar heights give the amount of text for each segment. Click and drag the mouse left to right to move along the timeline and show the actual text for 3 consecutive segments. Mousing into this lower region will cause the blocks to expand and show more text.
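
A minimal sketch of the data behind this lower display, assuming the 'amount of text' is measured in words (the tool may well measure it differently). The toy transcript and both function names are hypothetical:

```python
# Hypothetical toy transcript: (speaker, text) pairs in spoken order.
segments = [
    ("Williams", "What about Iran?"),
    ("Clinton", "We need vigorous diplomacy with Iran first."),
    ("Russert", "Senator Dodd?"),
    ("Dodd", "I agree with that."),
]

def bar_heights(segments):
    """Pair each segment's speaker with its length in words."""
    return [(speaker, len(text.split())) for speaker, text in segments]

def visible_window(segments, position, size=3):
    """Return `size` consecutive segments starting at `position`.

    The start is clamped so the window never runs past the end
    of the transcript.
    """
    start = max(0, min(position, len(segments) - size))
    return segments[start:start + size]

heights = bar_heights(segments)
window = visible_window(segments, 1)
```

The speaker would select the bar's colour and row, the word count its height, and the window corresponds to the three text blocks shown while dragging along the timeline.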

image only - click here for interactive version

I think the separated or spectral timeline might be an effective approach to showing this kind of information. From the display in the image above we can glean:

the terms 'war', 'Iraq', and 'Iran' are highly correlated in the text and dominate the first part of the debate
'tax' is discussed in the middle part of the debate
'ufo's were discussed towards the end with Kucinich, and the discussion drew laughter
the first half of the debate had longer, more substantive answers than the end
the two moderators/questioners (Williams & Russert) were active throughout the debate and typically alternated segments with the candidates
one exception to this is a direct dialog between Clinton and Dodd about 3/4 of the way through (where you see many short bursts of orange and green)
another exception is that there are a few double blocks by Williams. Examination with the tool shows that these occurred before and after announcement/advertisement breaks
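
The correlation between terms noted in the list above was read off the display by eye. If you wanted to put a number on it, one option (an assumption on my part, not something the tool does) is a Pearson correlation between the per-segment counts of two words. The count lists below are invented purely for illustration:

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation of two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant series has no defined correlation
    return cov / sqrt(var_x * var_y)

# Hypothetical per-segment counts, one entry per speech segment.
war = [2, 1, 3, 0, 0, 0]
iraq = [1, 1, 2, 0, 0, 0]
tax = [0, 0, 0, 2, 1, 0]

war_vs_iraq = pearson(war, iraq)  # strongly positive for this toy data
war_vs_tax = pearson(war, tax)    # negative for this toy data
```

Words that rise and fall in the same segments score close to 1, while words discussed in different parts of the debate score near or below 0.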

This is far from perfect but I wanted to post it while the subject matter is still current. Of course the basic concept is generally applicable for any script. Some weaknesses of the tool are:

the bars and text are quite small and not adjustable
the text areas don't scroll
the 'add word' entry field doesn't support OR syntax or word stemming
the 'add word' entry field does not support stop words or short words
there should be a simple bar chart showing who says the most overall
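
The last item in the list above, an overall 'who says the most' chart, could be sketched roughly as follows. This assumes speech is measured in words and uses a made-up toy transcript and hypothetical function names; it prints a plain-text chart rather than anything graphical:

```python
from collections import Counter

# Hypothetical toy transcript: (speaker, text) pairs in spoken order.
segments = [
    ("Williams", "What about Iran?"),
    ("Clinton", "We need vigorous diplomacy with Iran first."),
    ("Williams", "Senator Dodd?"),
    ("Dodd", "I agree with that."),
    ("Clinton", "Thank you."),
]

def words_per_speaker(segments):
    """Total number of words spoken by each speaker."""
    totals = Counter()
    for speaker, text in segments:
        totals[speaker] += len(text.split())
    return totals

def text_bar_chart(totals, width=40):
    """Render totals as text bars, largest first, scaled to `width`."""
    if not totals:
        return ""
    largest = max(totals.values())
    lines = []
    for speaker, count in totals.most_common():
        bar = "#" * max(1, round(width * count / largest))
        lines.append(f"{speaker:<10} {bar} {count}")
    return "\n".join(lines)

print(text_bar_chart(words_per_speaker(segments)))
```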
