Introducing News Spectrum! It is a visualization of the words used for two topics in the latest results from Google News. One topic is coloured blue, the other red, and the associated words are coloured and positioned based on how strongly they are associated with each of the two topics. Click on any word to see the related Google News results.
This is a generalization of my recent Obama McCain News Spectrum that allows you to enter your own terms of interest. Press the 'Enter' key to generate the spectrum after entering your words. The layout algorithm has also been improved to minimize the number of overlapping words. Give News Spectrum a try! As always, feedback is welcome.
Thanks to Google News for the data, Processing.org for the tools, and Chris Harrison for the inspiration behind the design.
News Spectrum (static image)
I was thinking about the Word Association Spectrums created by Chris Harrison and thought it might be interesting to create something similar using live data. I've come up with a little application that gets the latest Google News results for two terms of interest and generates a word spectrum based on the words found in the results. I removed stop words in order to highlight the words more likely to be of interest. An obvious drawback is that there are often many overlapping words that are hard to decipher, but it's kind of fun to play with nevertheless. This initial version shows a news spectrum related to the terms 'Obama' and 'McCain'.
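To make the idea concrete, here is a rough Python sketch of how words from two sets of news results could be placed along a spectrum. It is not the code behind the application: the tiny stop list, the function names, and the scoring formula (the share of a word's occurrences that come from the second topic) are all assumptions for illustration, and fetching the results from Google News is left out.

```python
import re
from collections import Counter

# Tiny illustrative stop list (an assumption); a real one would be much longer.
STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "on", "for", "is", "at", "with"}


def count_words(texts):
    """Count lowercase word tokens across all texts, skipping stop words."""
    counts = Counter()
    for text in texts:
        for word in re.findall(r"[a-z']+", text.lower()):
            if word not in STOP_WORDS:
                counts[word] += 1
    return counts


def spectrum_positions(texts_a, texts_b):
    """Map each word to a position in [0, 1].

    0.0 means the word appears only in the results for topic A,
    1.0 means it appears only in the results for topic B.
    """
    counts_a = count_words(texts_a)
    counts_b = count_words(texts_b)
    positions = {}
    for word in set(counts_a) | set(counts_b):
        a, b = counts_a[word], counts_b[word]
        positions[word] = b / (a + b)  # a + b >= 1 for every word in the union
    return positions


if __name__ == "__main__":
    # Made-up headlines standing in for live news results.
    obama = ["Obama wins the vote", "Obama on economy"]
    mccain = ["McCain on economy", "McCain visits Iraq"]
    for word, pos in sorted(spectrum_positions(obama, mccain).items(), key=lambda kv: kv[1]):
        print(f"{pos:.2f}  {word}")
```

A word's position could then drive both its horizontal placement and a blend between the blue and red topic colours.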
Obama McCain News Spectrum (static image)
The New York Times has published an interesting interactive diagram depicting the relationship between various diseases and the genes that are known to affect them. The large circle in the image below is zoomed in on one part of the diagram. [via FlowingData]
Chris Harrison has a wonderful collection of visualizations, one of which I featured recently in More Color Name Graphics.
Chris recently posted a set of beautiful Word Association Spectrums based on an extremely large dataset from Google containing word bigram distributions. The example shown below is for the words 'war' and 'peace'. The horizontal position of each word indicates whether it more frequently follows 'war' or 'peace' in the analyzed text. So the word 'memorial' is positioned very close to the left (at the bottom) because the bigram 'war memorial' occurs much more often (normalized by overall counts) than 'peace memorial'. The vertical position is random.
My own Document Contrast Diagrams also stretch out words along a horizontal axis based on the strength of association between two poles. My diagrams try to express a lot more information as well - probably too much. Chris's Word Association Spectrums carry less information. This simplicity allows for a much more elegant design. He has generated spectrums for other interesting word pairs like 'kids:adults', 'good:evil', and 'american:chinese'. I might like to see versions that don't show the common prepositions so that the nouns, verbs, and adjectives stand out more.
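The normalization step is worth spelling out. The Python sketch below is only my guess at the general approach, not Chris's actual method: each follower count is divided by the total count for its pole word, and the horizontal position is the second pole's share of the two normalized values. The function name and the counts in the example are invented for illustration, and every count is assumed to be positive.

```python
import random


def spectrum_layout(followers_a, followers_b, seed=0):
    """Return {word: (x, y)} for words that follow either pole word.

    followers_a / followers_b map a following word to its bigram count
    after pole word A / pole word B. x is in [0, 1]: 0 means the word
    follows only A, 1 means it follows only B. y is random.
    """
    total_a = sum(followers_a.values())
    total_b = sum(followers_b.values())
    rng = random.Random(seed)  # seeded so the layout is repeatable
    layout = {}
    for word in sorted(set(followers_a) | set(followers_b)):
        # Normalize by overall counts so a more common pole word
        # does not pull every follower toward its side.
        share_a = followers_a.get(word, 0) / total_a
        share_b = followers_b.get(word, 0) / total_b
        x = share_b / (share_a + share_b)
        layout[word] = (x, rng.random())
    return layout


if __name__ == "__main__":
    # Invented counts, not taken from the Google dataset.
    after_war = {"memorial": 90, "treaty": 10}
    after_peace = {"memorial": 5, "treaty": 45}
    for word, (x, y) in spectrum_layout(after_war, after_peace).items():
        print(f"{word}: x={x:.2f} y={y:.2f}")
```

With these invented counts 'memorial' lands near the 'war' end even though 'war' has twice as many bigrams overall, which is the effect the normalization is meant to produce.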
Word Association Spectrum for War and Peace (click to visit Chris Harrison's Post)
I took the speeches delivered by both Obama and Clinton last night after the May 6th primary results and used them to build a Document Contrast Diagram. See the link for a description of how to interpret the diagram.
May 6th Primary Speech Contrast Diagram (click to see larger version)
I have taken the speeches delivered by both Obama and Clinton last night after the May 6th primary results and used them to build a Document Cloud Comparison. It shows which words were used together by each speaker using linked word clouds. A static image is shown below for references to the word 'change' to give you a flavour but the real fun comes with exploring the interactive application.
If you enter a blank focus string in the application it shows a standard word cloud and colors words that are unique to one speaker or the other. The top words used by Obama and not by Clinton include 'side', 'down', 'government', 'values', 'yes', 'lead', 'life', 'kind', 'trust', and 'united'. Those used by Clinton uniquely include 'keep', 'feel', 'journey', 'working', 'invisible', 'west', and 'story'.
'change' Associations and References (static image)
Give it a try yourself. The application is written in Java so you may have to wait a few seconds for it to start up.
As I pointed out in my last post, Directed Sentence Drawings generated from a text make it extremely difficult to see in what order the various topics were discussed. A simple bar for each sentence, drawn in the order the sentences occurred in the text and coloured by topic, would be much better in most respects. I've built a graphic to show what I mean. I have also added the most frequent topic words for each set of 10 consecutive sentences.
State of the Union - Sentence Bars with Topic Colours (click to see larger version)
In my post earlier today about Sentence Drawings I mentioned that the overall shape of the graphic doesn't really express anything useful. I have come up with a variation on the idea that tries to address this.
In the sentence drawings produced by Stephanie Posavec or David Sparks each line segment is turned 90 degrees to the right relative to the previous one. This makes the overall shape highly sensitive to minor variations in the text which is why the overall shape doesn't carry much meaning - it's almost random.
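The turn-right rule is simple enough to sketch in a few lines of Python. This is only an illustration of the idea as I understand it, not the code used by Stephanie Posavec or David Sparks: the function names are my own, sentence length is assumed to be measured in words, the sentence splitting is deliberately crude, and the topic colouring is left out.

```python
import re


def sentence_lengths(text):
    """Split text on ., ! and ? and return the word count of each sentence."""
    sentences = [s.strip() for s in re.split(r"[.!?]+", text)]
    return [len(s.split()) for s in sentences if s]


def sentence_path(lengths):
    """Return the corner points of a sentence drawing.

    Start at the origin heading east. For each sentence, move forward
    by its length, then turn 90 degrees to the right.
    """
    x, y = 0, 0
    dx, dy = 1, 0  # heading east, with y pointing up
    points = [(x, y)]
    for length in lengths:
        x, y = x + dx * length, y + dy * length
        points.append((x, y))
        dx, dy = dy, -dx  # right turn: east -> south -> west -> north
    return points


if __name__ == "__main__":
    text = "One two three. Four five! Six? Seven eight nine ten."
    print(sentence_path(sentence_lengths(text)))
```

Changing the length of a single early sentence shifts every later point, which is why the overall shape is so sensitive to small variations in the text.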
I call my diagrams Directed Sentence Drawings because the direction of each line segment is a function of its topic. As before, each sentence is assigned a topic or remains neutral based on the vocabulary it contains. I place a neutral point in the middle of the diagram and four other topic points form a diamond shape around it (see below). For the State of the Union diagrams produced below I used the four topics Government, Domestic, Economy, and Security. The algorithm is as follows:
The diagram immediately below is constructed from the State of the Union Address for the year 2000. It shows there were many sentences about both Domestic and Economic issues, a fair number concerning Government and fewer about Security. The dominant colours give this away but also the overall shape makes it obvious. There is a greater density of lines near the Domestic and Economic topic nodes.
Directed Sentence Drawing for SOTU 2000
This next diagram is for the SOTU of 2001, the first delivered by George W. Bush. It's obvious that it was much shorter, had even less discussion of Security issues than Clinton's in 2000, and also not much sustained discussion about Domestic issues.
Directed Sentence Drawing for SOTU 2001
The SOTU for 2002 was delivered after 9/11 and clearly shows that Security has become the predominant concern.
Directed Sentence Drawing for SOTU 2002
This last diagram is for the SOTU of 2008 and shows that Security is still very topical but that Economic and Governmental issues are starting to recapture attention.
I posted a few weeks back on Stephanie Posavec's interesting graphics constructed from the text of Kerouac’s On the Road. One of her pieces featured Sentence Drawings that were generated using each sentence in sequence with line segments coloured to reflect the topic and sized based on the length of the sentence.
David Sparks has constructed a set of similar sentence drawings for the State of the Union addresses delivered by Bush over his 8 years in office.
David Sparks's Sentence Drawing for SOTU 2008 (click to see graphic with all 8 addresses delivered by Bush)
I find these interesting to look at. However, the dominant visual feature is the overall shape of the graphic and I don't think it really expresses anything useful.
Dolores Labs has posted an update on how people have used their color name data in various ways. They linked to my own Color Names Explorer - thank you very much! Their post is called Color flowers, networks, photos, and even 3D and has several more interesting views of this data. The one that really caught my eye was by Chris Harrison, who created a flower-like image by rendering the names in their associated color and varying the position by hue along the radius. I don't think many of these images, including my own, are particularly useful, but they sure are interesting to look at!
Chris Harrison's Color Name Flower (click to see larger version in original article)
Color Name Flower Closeup
There is a new Portfolio link available from all pages on my weblog. It links to a simple index of my most interesting or useful applications and gives a pretty good idea of the kinds of things I like to create.
I'm currently available for data analysis or visualization projects if anybody is interested in working together. I live near Toronto, Canada but I'm open to projects done remotely. I would be happy with creative projects that vary in size from a few days to a few months of work. Send me an email if you are interested.
I have taken the words spoken by both Obama and Clinton during the Pennsylvanian Democratic debate held on April 16th, 2008 and constructed from them a Document Cloud Comparison. Basically, it lets you see which words were used together by each speaker using linked word clouds. A few static images are shown below to give you a flavour but the real fun comes with exploring the interactive application.
If you enter a blank focus string in the application it shows a standard word cloud and colors words that are unique to one speaker or the other. The top words used by Obama and not by Clinton include 'politics', 'decade', 'election', 'economic', 'somehow', 'generation', 'mission', 'forward', and 'problem'. Those used by Clinton uniquely include 'york', 'begin', 'world', 'best', 'support', 'administration', 'police', and 'hope'.
'Country' Associations and References (static image)
'jobs' Associations and References (static image)
Give it a try yourself. The application is written in Java so you may have to wait a few seconds for it to start up.
I have taken the words spoken by both Obama and Clinton during the Pennsylvanian Democratic debate held on April 16th, 2008 and constructed from them a Document Contrast Diagram. See the link for a description of how to interpret the diagram.
It shows that they spoke roughly the same number of words, with Obama speaking slightly more. Both were slightly positive in overall emotional tone with some areas of negativity related to guns and security for Clinton and taxes for Obama. There was a great deal of overlap in the words used by the two speakers with the words 'kind', 'Democrats', 'important', 'country', 'make', 'work', 'president', 'can', 'take', 'right', and 'guns' being frequently used by both. 'Know' was used a lot by both but more often by Clinton. They both spoke each other's names much more than their own but Obama used Clinton's name more often than the reverse.
Key words used frequently and uniquely or much more often by Obama included 'true', 'statement', 'economic', 'issues', 'election', 'confident', 'George', 'American', 'policy', 'politics', 'income', 'change', 'General', 'ideas', 'Chicago', and 'individuals'. Words used frequently and uniquely or much more often by Clinton included 'decisions', 'stay', 'withdraw', 'Iran', 'failed', 'begin', 'world', 'military', 'best', 'York', 'administration', 'Philadelphia', 'impose', 'order', 'police', and 'oil'.
Pennsylvanian Debate Contrast Diagram (click to see larger version)
I added the transcript for the Pennsylvanian Democratic debate held on April 16, 2008 to the interactive Transcript Analyzer. The image below is smaller (and more blurry) than from the application but gives a rough idea of what was discussed by which candidate and when. Here are the primary topics covered in order:
Notable by their absence were the words 'immigration' and 'nafta'.
Democrat Debate - Apr 16th, 2008 (click for interactive application)
One small refinement was made to the application. The counts and bars for the various words will now also include simple plural variations. So references to 'jobs' will also include 'job', and references to 'gun' would also include 'guns'.
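As a minimal sketch of what counting simple plural variations might look like (the analyzer is not written in Python, and the function name and the trailing-'s' rule here are assumptions for illustration):

```python
import re


def count_term(text, term):
    """Count occurrences of a term together with its simple plural or singular.

    Only a trailing 's' is handled, so 'job' and 'jobs' are merged, but
    irregular plurals ('child' / 'children') and words that merely end
    in 's' ('glass') are not treated specially.
    """
    term = term.lower()
    singular = term[:-1] if term.endswith("s") else term
    variants = {singular, singular + "s"}
    words = re.findall(r"[a-z']+", text.lower())
    return sum(1 for word in words if word in variants)


if __name__ == "__main__":
    sample = "Jobs matter. A good job beats two bad jobs. One gun, many guns."
    print(count_term(sample, "jobs"))
    print(count_term(sample, "gun"))
```

The same count comes back whether the singular or the plural form is entered, which matches the behaviour described above.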
Give the Transcript Analyzer a try yourself and, as always, feedback is welcome!
One of the areas I have been exploring here on Neoformix is the notion of constructing graphics in an algorithmic fashion from textual data. The site NOTCOT has just published an article on some interesting work by Stephanie Posavec that explores this same idea. She has constructed a number of different works based on the text of Kerouac’s On the Road. From NOTCOT's article:
The maps visually represent the rhythm and structure of Kerouac’s literary space, creating works that are not only gorgeous from the point of view of graphic design, but also exhibit scientific rigor and precision in their formulation: meticulous scouring the surface of the text, highlighting and noting sentence length, prosody and themes, Posavec’s approach to the text is not unlike that of a surveyor.
Here are a few images that will give you a taste and a rough idea of what they mean. Although definitely more on the artistic side of information visualization, I like these images and the ideas behind them a great deal.
Recently both Clinton and Obama delivered speeches related to the economy. Clinton's was more focussed specifically on the housing crisis. I took the text of Clinton's Halting the Housing Crisis and Obama's Renewing the American Economy and created a Document Contrast Diagram.
It clearly shows that they were about the same length, both slightly positive in overall emotional tone but Clinton's text varied more in tone. The large blue word circles for 'mortgage', 'housing', 'crisis', 'families', 'foreclosure' show the primary topic of interest for Clinton. Obama's mostly unique key terms were 'American', 'financial', 'risk', 'system', 'regulatory', and 'institutions'. The blue segments in the middle of Obama's speech show that he used words in that section more strongly associated with Clinton overall. This is where he discussed the housing crisis.
Obama/Clinton Economic Speech Contrast Diagram (click to see larger version)
Dolores Labs recently did an interesting experiment where they showed many people samples of colors and asked them what they should be called. They posted a graphic that showed the color names that people used for the various colors.
Dolores Labs' Color Name Cloud (click to see larger version in original article)
They also posted the raw data for other people to play with. Martin Wattenberg at IBM Research took the data and created a much more beautiful graphic. Nathan at FlowingData discusses the design differences in the post A Little Bit of Design Goes a Long Way With Infographics.
Wattenberg's Version of the Color Name Cloud (click to see larger version in original article)
I decided to try my hand at building a simple interactive 3D explorer for the data as well. I combined entries with the same name and found the average RGB values. The frequency count was used to highlight the more common names by scaling the size of the text in a manner likely similar to that used by Wattenberg. I then plotted the names in 3D using the red (x), green (y), and blue (z) components of the color value.
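The data preparation described above can be sketched roughly as follows in Python. The explorer itself is not written in Python; the function names, the lowercase name matching, and the square-root font scaling are assumptions made for illustration, and the sample entries are invented rather than taken from the Dolores Labs data.

```python
import math
from collections import defaultdict


def combine_colour_names(entries):
    """Merge (name, (r, g, b)) entries that share a name.

    Returns {name: {"count": n, "rgb": (mean_r, mean_g, mean_b)}}.
    Names are compared after stripping whitespace and lowercasing.
    """
    grouped = defaultdict(list)
    for name, rgb in entries:
        grouped[name.strip().lower()].append(rgb)
    combined = {}
    for name, colours in grouped.items():
        n = len(colours)
        mean_rgb = tuple(sum(c[i] for c in colours) / n for i in range(3))
        combined[name] = {"count": n, "rgb": mean_rgb}
    return combined


def font_size(count, base=10.0):
    """Scale text size with the square root of frequency (an assumed rule)."""
    return base * math.sqrt(count)


if __name__ == "__main__":
    # Invented sample entries.
    entries = [("teal", (0, 128, 128)), ("Teal", (0, 130, 126)), ("red", (255, 0, 0))]
    for name, info in combine_colour_names(entries).items():
        x, y, z = info["rgb"]  # red -> x, green -> y, blue -> z
        print(name, (x, y, z), font_size(info["count"]))
```

The averaged RGB triple doubles as the 3D position, so each name sits at its own colour inside the colour cube.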
Color Name Cloud - initial view

Color Name Cloud - zoomed in view
The initial view is similar to Wattenberg's but not spaced out as nicely. My version also suffers from the fact that the size of the name depends on both frequency of use and how much blue the color happens to contain since the more blue a color has the closer it is drawn to the front of the display.
You can try out the color name explorer below. Can you find the shade somebody called 'baby poop'?
I'm a proud citizen of Canada and have decided to include a bit more analysis of Canadian-themed data and text in the future.
Yesterday the 2008 Ontario budget speech, which outlines the government's priorities for the coming year, was delivered. I have constructed a Document Contrast Diagram from the text of the 2007 Ontario Budget Speech and the 2008 Ontario Budget Speech.
Document Contrast Diagram for 2007/2008 Ontario Budget Speeches (click to see larger version)
My first post on Document Contrast Diagrams will give some guidance on how to interpret the image. Here are a few things I noticed that are illustrated by the diagram. You may have to view the larger version to see some of these details.
The image below shows the Document Contrast Diagram from the remarks made by both Clinton and Obama after the Super Tuesday primaries on Feb 5th.
Document Contrast Diagram for Clinton/Obama Super Tuesday Remarks (click to see larger version)
My first post on Document Contrast Diagrams will give some guidance on how to interpret the image. Here are a few things I noticed that are illustrated by the diagram. You may have to view the larger version to see some of these details.
A Document Contrast Diagram is a visual summary of the content of two text documents that illustrates shared words, words that are unique to one document or the other, word frequency, relative size of the two documents, distribution of emotional tone within the documents, related words based on co-occurrence, and the most common word in each document segment. Have a look below at the Document Contrast Diagram for the 2007 and 2008 US State of the Union (SOTU) Addresses. If you wish you can click on the image to see a larger version.
I'm hoping that much of the following is reasonably intuitive but here are a number of points regarding interpretation:
I've been consumed lately by the idea of taking two distinct documents and creating a large, visually interesting, static image that compares and contrasts them. I don't have time at the moment to explain how to interpret these but have a look at the images below. The blue text is the State of The Union Address for 2007 and the red is that for 2008.
Click on the images to see larger versions. The idea needs work still but it's starting to look promising.
Last week the New York Times published an interactive graphic called The Ebb and Flow of Movies: Box Office Receipts 1986-2007. It does a pretty good job of showing how the revenue of various movies rose and fell over time as well as more global patterns. The design does make it hard to directly compare movies against each other. It would be neat to pick a bunch of movies and see a set of traditional line graphs starting from the same point. Here is a close up:
And here are 4 years of data with labels showing the summer blockbuster periods. You can also clearly spot the peaks at the end of each year.
Lee Byron has done some other interesting work. One that really caught my eye when I first saw it is this stream-like visualization of music listening habits over time. The data comes from the Last.fm records for a particular user.
In the author's words:
After thinking about how I could show this whole sum in a presentable form, I decided on a sort of layered histogram. Each colored sliver represents a different artist listened to in the last 18 months. The sliver moves through time left to right growing thicker where it was more popular and thinner where it was less. The color indicates the first time the artist was listened to, warmer colors being more recent and cooler being further back. As a new artist is listened to it is put onto the outsides of the graph. The result is a wiggling tour through your listening history past.
Lee describes it as 'a sort of layered histogram' but I think of it as a 'stream graph' - it nicely shows how something varies over time and looks like a stream to me.
Back in 2006 I wrote about Martin Wattenberg's work called The Shape of Song and how it illustrates the repetitive patterns in music using translucent arches that connect identical passages of notes. At the time I mused about doing something similar for text:
Perhaps poetry or lyrics from songs might have an interesting structure but I suspect most text data wouldn't have enough repetition at the token or word level for this idea to be fruitful.
I did eventually develop the idea into Document Arc Diagrams, which use similarity of vocabulary.
I just stumbled across Children's Poetry & Limerick Visualizations by Lee Byron which stems from Wattenberg's concept as well. Lee describes the image below with these words:
The arcs represent rhyme, alliteration, homophone and repetition. Steps underneath the line represent rhythm. You can see these elements clearly represented in the classic childrens poem: "Hickory Dickory Dock".
Interesting work.
I added the transcript for the Ohio Democratic debate held on February 26, 2008 to the interactive Transcript Analyzer. The image below is smaller (and more blurry) than from the application but gives a rough idea of what was discussed by which candidate and when. Here are the primary topics covered in order:
It's also interesting that 'immigration' was mentioned in passing only once during this debate but was a primary topic in Texas. Also 'education' was not given any real attention.
Democrat Debate - Feb 26th, 2008
I have made another minor refinement to the application. Beside the word lines, the application now shows the number of times each word was used by each candidate. For example, in this debate 'Iraq' was used 8 times by Obama and 5 times by Clinton.
Give the Transcript Analyzer a try yourself and, as always, feedback is welcome!
I'm sure almost everybody reading this entry is aware of the tight race in the US Democratic primary between Clinton and Obama. There is a huge amount of coverage of this exciting and extremely important contest. A much-discussed concept lately is momentum. I've created a simple graphic to try to visualize the momentum.
The darker blue area shows Clinton's delegate counts over time. The lighter blue shows how much Obama's counts exceed those of Clinton. The small numbers show the actual difference at a point in time. For example, after Feb 5th (Super Tuesday), Obama had 30 more regular pledged delegates than Clinton - not counting super delegates.
Hillary Clinton currently has an advantage in super-delegates (241 to 181 for Obama) and this makes the race closer than depicted above. However, super-delegate support is not fixed - they are free to change who they support up until the time of the convention.
I just added the transcript for the Texas Democratic debate held on February 21, 2008 to the interactive Transcript Analyzer. The image below is smaller (and more blurry) than from the application but should give a rough idea of what was discussed by which candidate and when. Here are the primary topics covered in order:
Democrat Debate - Feb 21st, 2008
I did make a minor refinement to the application. The bars for the words of interest are now coloured to show the speaker. This makes it easy to tell, for example, that Obama used the word 'Iraq' in 7 separate segments but Clinton only used it in one segment.
Give the Transcript Analyzer a try yourself and, as always, feedback is welcome!
Pixish is a new site devoted to connecting visual artists with people interested in exploring and possibly using their work. You can sign up and post 'Assignments' that describe what you are looking for or you can submit designs to fulfill assignments. The site is still in beta mode and had a few hiccups when I played with it yesterday but it's an interesting idea.
Just for fun, I created a couple of designs with a modified version of Word Hearts and entered them into an assignment. They are looking for a T-Shirt design for typography lovers. Here are small versions of my 2 entries:
In a few of my previous interactive applications, namely Digg Explorer and the Race Results Analyzer, I have used small 'data objects' that get smoothly animated between different locations. Sometimes the set of data objects represent a data graphic - a pie chart or histogram for example.
I have just come across a research paper and video by Jeffrey Heer and George Robertson where they investigate the effectiveness of animated transitions in statistical data graphics. Their conclusion was that animated transitions can significantly improve graphical perception. The video is high quality and explains the ideas and results very well.
Note that this research did not use multiple constituent data objects as in my applications but the conclusion is likely valid in this context as well.
I have taken the remarks made by both Obama and McCain after the Potomac Primary results were in and constructed another Document Cloud Comparison. As before, a few static images are shown below to give you a flavour but the real fun comes with exploring the interactive application.
If you enter a blank focus string in the application it shows a standard word cloud and colors words that are unique to one speaker or the other. The top words used by Obama and not by McCain include 'change', 'tax', 'health', 'college', 'bush', 'lobbyists', 'jobs', 'rich', and 'iraq'. Those used by McCain uniquely include 'promise', 'serve', 'friends', 'strength', 'faith', 'dreams', and 'challenges'.
'Hope' Associations and References (static image)
'War' Associations and References (static image)
Give it a try yourself. The application is written in Java so you may have to wait a few seconds for it to start up.
I've been playing around with words and shapes again and just posted a little application I call Word Hearts that lets you generate heart shapes filled with words. Here are two sample images:

It's just in time for Valentine's Day so have some fun!
I have taken the remarks made by both Clinton and Obama after the Super Tuesday results were in and constructed a Document Cloud Comparison. A few static images are shown below to give you a flavour but the real fun comes with exploring the interactive application.
Most Common Words (static image)
This first image shows part of the list of most common words for both speeches. Clinton mentions 'America' most frequently, Obama the word 'can'. Clinton uses the terms 'god', 'auto', 'veteran', and 'economy' which aren't mentioned at all by Obama. Interestingly, Obama's top unique words are 'time' and 'change'.
'Hope' Associations and References (static image)
The references to the word 'hope' clearly show Obama's use of repetition and rhythm. This is shown again in his use of the words 'time' and 'change' as shown below.
'Time' & 'Change' Associations and References (static image)
The last reference to 'change' caught my eye - We are the change that we seek. It's the declarative form of a famous quote by Gandhi - You must be the change you want to see in the world.
It's much more interesting to try it out yourself. Click on 'more' to give it a try. The application is written in Java so you will have to wait a few seconds for it to start up.
Word Association Clouds appear to be an interesting way to navigate within a document and get an understanding of the concepts discussed. I've also been playing around with the idea of using two of them linked together in order to explore the similarities and differences between two different documents.
The image below shows an example using the State of the Union addresses for both 2007 and 2008. The two clouds show the words related to the focus word in both documents in the same manner as for the single Word Association Cloud. The only difference is that colour is used to indicate words that are unique to one document or another. The words in blue on the left are unique to the 2007 SOTU and those in red on the right are unique to the 2008 SOTU. As before, you can click on a word to bring it in focus or click on the top edit box to change it. The clouds are linked in this case so that they always show the same word for both documents.
Document Cloud Comparison (static image)
We show here the words associated with 'energy' in both of the transcripts. The word 'supply' is most highly associated with 'energy' in the 2007 version and the blue colour shows that it isn't even used in the 2008 address. You can also easily see that 'wind', 'solar', 'electric' and 'vehicles' were all used in relation to 'energy' in 2007 but were not even mentioned in 2008. In 2008 the word 'security' is the most highly associated term. It does appear in 2007 but is not as prominent in relation to 'energy'.
It's much more interesting to try it out yourself. Click on the image or 'more' to give it a try.
The image below is a Document Arc Diagram generated from the text of the State of the Union Address for 2008. There is some interesting structure evident. There are two very distinctive groupings of arcs. The first is focused on domestic issues and arises from repeated use of the terms America, Americans, Congress, trust, tax, veto, health, housing, technology, and jobs. The second group of arcs is based on repeated use of the terms America, Qaeda, troops, terrorists, Iraq, Iraqi, and Afghanistan.
You can enter your own text for analysis with the Document Arc Diagram Application.
State of the Union Address, 2008
I just added the transcript for the California Democratic debate held on January 31, 2008 to the interactive Transcript Analyzer.
Democrat Debate - Jan 31st, 2008
I have adapted my recent Digg Trends tool so that it can analyze data about weblog posts. A new version exists called Boing Boing Word Trends that loads summaries of the latest 500 posts from Boing Boing and lets you explore which words are used together and how usage has varied over the recent past.
Give Boing Boing Word Trends a try!
The American president recently presented the State of the Union Address for 2008. I noticed this Tag Cloud representation of the text. I'm sure there are several others already on the web as this is a standard analysis these days for any text of interest. It does a pretty good job of summarizing the content by listing the top keywords with a font scaled to their frequency.
In my recent tool Digg Trends I introduced something I call a Word Association Cloud. Visually, a Word Association Cloud looks like a standard Tag Cloud except the topmost word is made distinct in some manner. I've been using a faint block of color behind it. Rather than using font size to represent a simple word frequency the size here illustrates how good the correlation is with the primary word.
Word Association Cloud (static image)
In this example the primary word is 'Afghanistan' and the cloud clearly shows that the major words associated with it are 'iraq', 'america', 'freedom', 'pakistan' etc. The references within the text are also shown. I'm basically counting how often the various words occur near 'Afghanistan' but I'm also weighting this count based on how far apart the words are. You can click the primary word to enter edit mode and change it to whatever you wish. Or you can simply click on one of the associated words to make it the new primary word. This lets you navigate around easily to explore different words. If you change the primary word to a blank then a standard tag cloud is presented.
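Here is a minimal Python sketch of distance-weighted co-occurrence counting. It is not the code behind the application: the function name, the window size, and the 1/distance weighting are assumptions chosen for illustration, since the exact weighting is not spelled out above.

```python
import re
from collections import defaultdict


def association_scores(text, focus, window=3):
    """Score words by how often and how closely they occur near the focus word.

    Every word within `window` positions of an occurrence of the focus
    word adds 1 / distance to its score, so adjacent words count more
    than words a few positions away.
    """
    words = re.findall(r"[a-z']+", text.lower())
    focus = focus.lower()
    scores = defaultdict(float)
    for i, word in enumerate(words):
        if word != focus:
            continue
        start = max(0, i - window)
        end = min(len(words), i + window + 1)
        for j in range(start, end):
            if j != i and words[j] != focus:
                scores[words[j]] += 1.0 / abs(i - j)
    return dict(scores)


if __name__ == "__main__":
    sample = "Afghanistan and Iraq need freedom in Afghanistan"
    ranked = sorted(association_scores(sample, "Afghanistan").items(),
                    key=lambda kv: kv[1], reverse=True)
    for word, score in ranked:
        print(f"{score:.2f}  {word}")
```

In a cloud, these scores would set the font size of each associated word, and stop words would normally be filtered out first.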
It's a simple idea but seems to give a useful perspective. I'm guessing somebody somewhere has done this before but I'm not aware of any examples. Please let me know if you find some. Give the Word Association Cloud for the State of the Union Address a try!
The design of the Digg Election Story Analyzer has been improved and generalized so that it can be used for all the topics and subtopics available on Digg. I'm calling the result Digg Trends. The tool loads the latest 500 popular stories for the desired topic and analyzes the text found in the story titles and descriptions. The image below shows the current results for the 'Technology' topic.
Static image - click it to launch the interactive application
The Digg Trends analysis focuses on four words at any given time. A different color is used for each. The graph at the top shows how the number of references to each of the four words varies over time. You can turn off the 'Stacked' checkbox to show a line graph which does a better job of showing which word is referenced the most at any given time.

For much of this past month of January, 2008 Apple has had much greater attention within the Digg community than Microsoft, Google, or Digg itself. There was a large spike in Apple references around Jan 16th which corresponds to the announcement of the MacBook Air at MacWorld. Attention to Digg was higher than Apple around January 23rd.
Recently I completed a small freelance project for the site ButterBeeHappy.com . Basically, the site lets you easily keep a journal of those things that make you happy or that you are grateful for. There is research to suggest that doing so has psychological benefit. The site is free to use and I've been enjoying using it the last couple of months.
My small piece of the puzzle is called the Honeycomb Navigator. It lets you see the words used most often in your entries as well as which other words are associated with them. You can also explore the things that made you, or other people, happy by hovering over and clicking on particular cells.
HoneyComb Navigator (static image)
In the example image the central hex on the left shows a particular user, in this case it's me - jclark. The outer hexes show the words most commonly used in my recent entries: julia, soccer, today, leanne etc. The middle ring of hexes on the left are the other users that most often used these same words in their entries. If you mouse over a user hex the right-hand area shows a random entry from that user. If you mouse over a word hex the right-hand area shows a random entry containing that word. You can click on a word or user to make it central.
You can try out the navigator by itself below or visit ButterBeeHappy.com to sign up yourself !
I have added the transcript for the South Carolina Democratic debate held on January 21, 2008 to the interactive Transcript Analyzer.
Democrat Debate - Jan 21st, 2008

Here are a few simple patterns that I noticed:
I have just posted another tool to my projects section. This one is called the Digg Election Story Analyzer and shows the trends in word usage over time and word associations for stories that reached popular status in the Digg US Elections 2008 topic. The tool loads the latest 500 popular stories and analyzes the text found in the story titles and descriptions. An 'attention timeline' and tag clouds of related words are then displayed.
Here are a couple of images to give you a taste. Of course, it's always more fun to just give the Digg Election Story Analyzer a try!


I have updated the Transcript Analyzer so that you can view different transcripts. Both the Democrat & Republican debates in New Hampshire on January 5th are available. There are two other debates as well.
Democrat Debate - Jan 5th, 2008

Republican Debate - Jan 5th, 2008

These images are a little compressed compared to the actual application but a few things still immediately jump out at me:
Here are the top ten posts on Neoformix that were visited the most often by people during 2007. All but two of them (6 and 9) are interactive applications written in Java/Processing and allow you to explore some data or create an interesting image.
Thanks to everyone who visited the site over the year, especially those who sent me feedback or linked to my content. Best wishes to everyone and may you have a happy, productive, and interesting 2008 !
Like lots of people this time of year I've been thinking about snow. Actually more than thinking - I've been shovelling it, walking in it, driving in it, and playing in it. My latest Text Toy stays with the snow theme. It allows you to generate snowflake-like graphics from a few words or phrases.
Check out the interactive application for the Text Snowflake Creator. You can enter your own text to generate images like:

Lots of people have been having fun with the Big Small application I posted a couple of weeks ago. In fact, I've had a couple of days with more than 25,000 pageviews. Not too bad for such a simple application !
The information provided by the Digg API is quite rich and very relevant to the community of Digg users. I've created a second visualization using the API, this one focussed on the relationships between the latest popular stories. The Digg Story Graph is an interactive visualization that shows the relationships between recent popular stories on Digg through the use of node and link diagrams. Stories can be visually connected through shared vocabulary, common topics, domain, submitter, or date submitted.



There is also a large version of the Digg Story Graph available. It requires 900x800 pixels for proper display and a decent CPU for good responsiveness. The smaller version shows the 100 latest popular Digg stories. The larger version will show 200 and support more word nodes.
Give it a try !
My Digg Explorer has attracted some attention of late culminating in it reaching the front page of Digg late last night. In the span of a couple of hours it received about 5,600 views, and reached a total of more than 7,000 views for the day. To put this in perspective, it's more than my site usually gets in a month. The application and my server handled the load with no trouble and remained very responsive throughout.
I did have a little bit of excitement when Digg decided to add new features to their site immediately before my app went popular. I was a little concerned they might break my application but the only impact was that two new top level categories had no predefined colour and appeared white. Within a few minutes I added colours for the new categories and had it posted to my server.

Thanks very much to everyone who dugg my little application, especially Reg 'Zaibatsu', Muhammad Saleem, and Andrew Sorcini who really got things rolling yesterday. I would also like to thank Stan Schroeder for the write-up in Mashable - Beautiful Digg Tool Provides Wealth Of Interesting Data. Thanks also to Daniel Burka, the creative director at Digg, and Tom Carden of Stamen Design for triggering the attention I got within their organizations. Stamen partnered with Digg to produce the very popular Digg Labs visualizations of Digg data.

I have posted the interactive application called Big Small in my projects section. Now you can enter your own text to generate images like:

Something big made from something small. This is a simple static image from my latest Processing experiment.

I have added a new graph of the top Users to the Digg Explorer. I also fixed a bug in the domain parsing logic.
This is a tool for exploration of the 500 most popular recent stories from Digg. Visually it is very similar to my recent Race Results Analyzer in that it uses small circles to represent items of interest which are fluidly positioned in various ways to emphasize patterns of interest.


Have a look at the Digg Explorer.
I ran another race this past weekend, this time a 10K. The race results are online in a simple tabular format provided by SportsStats. The data set is fairly rich and contains an athlete's name, city, age bracket, gender, and time. I have created a little tool to help explore this type of data. Some sample graphs are shown below.


The little circles represent athletes: red for women, blue for men, and green for selected. You can click on any circle to select it (or de-select if already selected). You can also click and drag the mouse to change the selection status of everything within the selection rectangle. The little circles smoothly transition from their old to new locations when a new graph type is chosen.
Give it a try !
Transcripts or scripts can be very rich data sets if you are comfortable with writing code to analyze text. I have created an interactive Transcript Analyzer for exploring the transcript of the recent Democratic debate in the US. One thing I focussed on was to illustrate 'who said how much and when'. I noted this as a weakness in the NYT tool in my earlier post.
Refer to the image below. The top section shows the distribution of some selected words within the text across a 'timeline' which goes from left to right. Each speech segment is the same width and the height of the small white bars shows the number of occurrences of that word for that segment. You can add new words with the text box in the top right corner or you can remove existing words by clicking on them.
Right below the word distribution graphs is a similar coloured set showing a spectral decomposition of the text based on who spoke and how much was said. In this case the bar heights give the amount of text for each segment. Click and drag the mouse left to right to move along the timeline and show the actual text for 3 consecutive segments. Mousing into this lower region will cause the blocks to expand and show more text.
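The two sets of bars described above boil down to simple per-segment counts. Here is a minimal Python sketch, assuming the transcript has already been split into `(speaker, text)` pairs. That input format and the function names `word_bars` and `speaker_bars` are hypothetical, not taken from the Transcript Analyzer itself.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def word_bars(segments, word):
    """Bar heights for one word: its number of occurrences in each segment."""
    target = word.lower()
    return [tokenize(text).count(target) for _speaker, text in segments]

def speaker_bars(segments):
    """One row of bar heights per speaker: the amount of text (word count)
    in each segment, and 0 for segments spoken by someone else."""
    speakers = sorted({speaker for speaker, _text in segments})
    return {
        s: [len(tokenize(text)) if speaker == s else 0
            for speaker, text in segments]
        for s in speakers
    }

# Made-up segments with placeholder speakers "A" and "B".
segments = [
    ("A", "Health care matters. Health first."),
    ("B", "We need change."),
    ("A", "Health care for all."),
]
health = word_bars(segments, "health")
by_speaker = speaker_bars(segments)
```

Keeping every segment the same width, as in the tool, means these lists can be drawn directly as aligned rows of bars.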
I think the separated or spectral timeline might be an effective approach to showing this kind of information. From the display in the image above we can glean:
The New York Times has produced an interactive transcript visualizer that allows exploration of the transcript for the recent Democratic debate in the US. It shows word count by speaker with a simple bar graph and illustrates the size of the various speech segments with a multi-strip rectangular region. It also supports highlighting a search term within the transcript. The tool doesn't do a great job of showing who spoke when. I would also like to see the capability of highlighting multiple search strings in the display. On the whole, I think it is quite well done.
[Link discovered via the open house project]

Patrick Dinnen over at Hogtown Consulting has produced an interesting visualization of election results. It's an interactive application built using Processing, my favourite toy of the moment. Currently the data used is for the Ontario 2003 election but the idea could, of course, be applied more generally.
I did have trouble running the application using IE 7 on Win XP - it shows a tiny window rather than the desired size but it works fine for me using Firefox.

Here are a couple of simple bar charts showing which countries consume the most electricity and oil per person. I only included countries with more than 1 million people in the analysis. The data comes from the CIA World Factbook.
This first one shows the top countries for electricity consumption per capita. The top countries are mostly rich and cold with the exception of some oil-rich nations.

This second graph shows the countries having the highest per capita consumption of oil. Some heavy oil producers show up (Kuwait, UAE, US, Canada) as well as some smaller highly developed countries (Singapore, Hong Kong, Taiwan).

One of the reasons I haven't been posting much lately is that I was training for a half-marathon. Running about 40 miles a week does tend to cut into your free time for other activities. As a solitary sport it was possible to squeeze it in here and there without too much impact on the time I value so much with my family.
I ran the Toronto Half Marathon this past Sunday. It went pretty well - no rain, nice and cool, no blisters, no serious pain, and I ran it a bit faster than planned. My final time was 1:48:33 and I finished 813th out of 3494 competitors. Not bad for my first try. I'm feeling pretty good after the race although walking down stairs is a bit of a problem. On the whole it has been a very rewarding experience.
Here is a picture taken by my daughter with perfect timing at the finish. The second graphic has some information provided by runpix. I like the little finish line visualization - in the live version you can hover over the dots and get details about who the people were that finished around you. The runners all wear timing chips so they have all these details available.

Another reason I haven't been posting much lately is that for the last month I have been a member of a jury for a criminal trial. We arrived at our verdict last week and found the defendant guilty of 5 separate charges, the most serious of which was impaired driving causing death.
It was strange and difficult having responsibility over a decision that would have such a large impact on a person as well as his family and friends. It was also difficult hearing and seeing, day after day, all the detailed words and images related to such a tragedy. The person killed in the accident was a good friend of the accused so our verdict found the driver criminally responsible for the death of his friend. I certainly felt some sympathy for the man but our decision had to be based on evidence only - 'without sympathy or prejudice'. I'm thankful that our duty did not include any sentencing.
It was certainly an emotionally powerful experience - one that I will never forget. I take away a great many positive things including a stronger appreciation for my own personal freedom and newfound respect for our law enforcement and judicial systems in Ontario, Canada. However, I think the most positive aspect of the whole experience for me was the other people that served with me as members of the jury. The fact that 12 randomly selected people, of various ages, from all walks of life, would turn out to be so intelligent, friendly, funny, and supportive has made a deep impression on me.
Anil Dash has a short post pointing out an interesting graphic that illustrates the relationships between Indo-European languages. It's from the American Heritage Dictionary of the English Language and is a variation on the Radial Treemap idea I described last year.

A couple of weeks ago The Economist published a report giving a 'Democracy Index' for the various countries of the world. It's an interesting set of data but the various references to it that I saw only included short lists of the top and bottom ranked countries. I have created a few graphs that might prove interesting based on this information plus some other data from the CIA World Factbook.
This first graph shows the number of countries having a democracy index in the given range. I counted how many were in each 0.5-sized bucket. For example, the first bar shows that there was only 1 country (North Korea) with an index in the range 1.0-1.5.

The second graph shows the number of people living with a democracy index in the given range. The large spike at 2.5-3.0 is due to China and the one at 7.5-8.0 to India.
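The bucketing behind these first two graphs can be sketched in a few lines of Python. This is only an illustration: the function name `bucket_totals`, the placeholder country names, and all the numbers in the sample data are made up, not values from the Economist report or the CIA World Factbook.

```python
import math
from collections import defaultdict

def bucket_totals(index_by_country, weight_by_country=None, width=0.5):
    """Total per bucket of the given width.

    With no weights, each country counts as 1 (first graph).
    With populations as weights, the totals are people per bucket
    (second graph).
    """
    totals = defaultdict(float)
    for country, value in index_by_country.items():
        low = math.floor(value / width) * width
        weight = 1 if weight_by_country is None else weight_by_country[country]
        totals[(low, low + width)] += weight
    return dict(sorted(totals.items()))

# Made-up index values and populations (in millions) for placeholder countries.
index = {"A": 1.2, "B": 2.6, "C": 2.9, "D": 7.7}
population = {"A": 10, "B": 100, "C": 50, "D": 200}

countries_per_bucket = bucket_totals(index)
people_per_bucket = bucket_totals(index, population)
```

Weighting by population is what lets a single very large country produce a spike in the second graph.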

The fact that the top ranked countries are all relatively small (Sweden, Iceland, Netherlands, Norway, Denmark...) suggests that there may be a relationship between the size of a country and its level of democracy. This third graph is a scatterplot of the democracy index vs the population for each country. The population is on a log scale because of the huge variation in country size.
I don't believe that I've mentioned yet the excellent resource Many Eyes. It was created by IBM's Visual Communication Lab. In their words:
We believe that visualizations gain power when multiple people use them to communicate, and that communication gains power when multiple people can visualize and explore information together. We want to democratize visualization, enabling anyone on the internet to publish powerful interactive visualizations and start their own data conversations. Many Eyes is designed to bring that power to you.
The Visual Communication Lab was created by the brilliant Martin Wattenberg and includes the amazing Fernanda Viegas. This product of the lab shows the quality of the people behind it.
The latest style of visualization unveiled at Many Eyes is called a Word Tree which is a method to visualize the different contexts that a word or phrase presents within a body of text. The example below shows an analysis of the phrase 'young king' in 'The Compleat Grimm Fairy Tales'. Click on the image to see the live visualization which lets you easily navigate to other words or phrases.

Introducing Home Planet Defense ! It's a strategy game where you build and upgrade bases to protect your home planet from alien ships. I hope you have as much fun playing it as I did creating it !

I have fixed a problem in the Shared Word Diagram application. If the relative frequency of the words in one column was much less than that in another then you would see a large number of words overlapping that were impossible to read. It's changed now so that there is a minimum spacing in such situations so that the words are readable. This makes the tool much more useful for comparing different versions of the same document.

Thanks to Stewart McKie for pointing out the problem. Check out his site scriptcloud which lets you create content clouds from a screenplay.
I have just posted another application for exploring the structure of text documents. This one lets you compare and contrast two documents by showing both the unique and shared vocabulary and the distribution across the documents. Here is an example static image:

The two columns of squares represent the two documents. The longest document will be shown with 50 segments. In this case, the rightmost blue column is the larger of the two and represents the American State of the Union Address for 2007. The trivial (stop) words were discarded before analysis. For this example the topmost square segment covers the first 51 words of a document, the second segment the next 51 etc.
The leftmost column of word circles shows the high frequency words that are present in document 1 (State of the Union 2002) but are not present at all in the second document. The rightmost column of words shows those unique to the second document and the central column has the words common to both. The bigger the circle the more frequent the word. The circles are ordered in each column by average position of the word in the documents where they appear which roughly minimizes the number of connection crossings.
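The three columns of word circles can be approximated with a short Python sketch. It is a simplified illustration under stated assumptions: the tiny `STOP_WORDS` set, the function names, and the use of mean relative position for ordering are stand-ins, and the real application also limits each column to high frequency words.

```python
import re
from collections import defaultdict

# Tiny illustrative stop-word list; a real one would be much longer.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "that", "for"}

def tokenize(text):
    """Lowercase word tokens with stop words discarded."""
    return [w for w in re.findall(r"[a-z']+", text.lower())
            if w not in STOP_WORDS]

def word_stats(tokens):
    """Map each word to (count, mean relative position in the document)."""
    positions = defaultdict(list)
    for i, word in enumerate(tokens):
        positions[word].append(i / len(tokens))
    return {w: (len(p), sum(p) / len(p)) for w, p in positions.items()}

def shared_word_columns(doc1, doc2):
    """Return (only in doc1, shared, only in doc2) as lists of (word, count),
    each ordered by average position of the word."""
    s1, s2 = word_stats(tokenize(doc1)), word_stats(tokenize(doc2))
    left = sorted((w for w in s1 if w not in s2), key=lambda w: s1[w][1])
    right = sorted((w for w in s2 if w not in s1), key=lambda w: s2[w][1])
    shared = sorted((w for w in s1 if w in s2),
                    key=lambda w: (s1[w][1] + s2[w][1]) / 2)
    return ([(w, s1[w][0]) for w in left],
            [(w, s1[w][0] + s2[w][0]) for w in shared],
            [(w, s2[w][0]) for w in right])

left, shared, right = shared_word_columns(
    "terror war freedom war", "freedom economy economy health")
```

The counts would set the circle sizes, and the ordering by average position is what keeps most connecting lines from crossing.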
Hovering over a word (in the interactive application, not this static image), in this case 'terrorists', will show which segments of the documents contain the word. Darker connecting lines indicate more occurrences in that segment. It will also highlight with colour the other words occurring in the same segments. So, for this example, we can easily see that:
The interactive application is available here for Shared Word Diagrams. This version lets you enter your own text for analysis - see the form at the bottom of the application. Have fun and let me know if you discover any especially interesting examples.
There is a new version of the Document Arc Diagram tool that allows anyone to enter their own text and generate diagrams. Visit the project page, fill in the form in the bottom of the page, and press the button. Have fun !
I have posted the interactive application for Document Arc Diagrams. There are 10 documents available for analysis at the moment. I hope to allow processing of arbitrary user text within the week.
I have written before about Martin Wattenberg's Arc Diagrams for visualizing structure within strings. They are an intriguing way of visualizing repetition at varying scales within a linear sequence. When applied to music they produce beautiful images that illustrate the structure. I noted that for most narrative text these diagrams likely wouldn't work very well because of the lack of regular repetition but that it might be fruitful to explore some lower dimensional derived feature of the text.
In my recent exploration of ways to visualize arbitrary text documents I tried out something visually inspired by Wattenberg's Arc Diagrams. Rather than using arcs to connect identical patterns within a document I'm connecting instead segments that contain similar words. Here is the algorithm:
Update: The interactive application is available now for Document Arc Diagrams.
Here are a few sample diagrams:


Despite the arbitrary nature of the segmentation the technique appears to reveal some aspect of the document structure in a visually interesting manner. In Alice in Wonderland, for example, it shows what appears to be four distinct scenes present in the last half of the text. The third is highlighted in orange and has as high frequency words Alice, Mock, Turtle, and Gryphon. The third example is for the lyrics of a song and shows darker lines because the similarity between segments is stronger. There are also regular patterns that repeat multiple times which isn't surprising for song lyrics. It would be interesting to use a line-based or syllable/phoneme-based segmentation for song lyrics rather than the simplistic approach taken here.
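As a rough illustration of the general idea of connecting segments that contain similar words, here is a minimal Python sketch. It is not the exact procedure used by the tool: equal-sized word segments, Jaccard similarity of the word sets, and the `threshold` cut-off are all assumptions made for this example, as are the function names.

```python
import math
import re

def segment_words(text, n_segments):
    """Split the text into roughly equal runs of words, one word set per run."""
    words = re.findall(r"[a-z']+", text.lower())
    size = max(1, math.ceil(len(words) / n_segments))
    return [set(words[i:i + size]) for i in range(0, len(words), size)]

def arcs(text, n_segments=20, threshold=0.2):
    """Return (i, j, similarity) for each pair of segments whose word sets
    overlap enough. Similarity is Jaccard: shared words / all words."""
    segments = segment_words(text, n_segments)
    result = []
    for i in range(len(segments)):
        for j in range(i + 1, len(segments)):
            union = segments[i] | segments[j]
            similarity = len(segments[i] & segments[j]) / len(union) if union else 0.0
            if similarity >= threshold:
                result.append((i, j, similarity))
    return result

# Made-up lyric-like text with a repeated line.
lyrics = "la la love you rain falls down hard la la love you"
connections = arcs(lyrics, n_segments=3, threshold=0.5)
```

Each returned triple would be drawn as an arc between segments `i` and `j`, with a darker line for a higher similarity, which is why repeated song lines show up so strongly.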
I will post an interactive application soon that will let anyone explore a fixed set of documents.
I recently came across an interesting searchable database of information graphics built by the Parsons Institute for Information Mapping. The database contains over 1200 examples of information visualization images. Their stated goal is to build the most comprehensive, manually annotated (and taxonomically classified) information graphics database in the world.
Here are a few sample images taken from the results of the search for graph:
The idea of an interactive tool to explore the structure of a text document has always intrigued me. Visually highlighting key terms from a document and the relationships between them might be an effective way to gain new insights. I have been playing around for a while creating such a tool and have decided it's interesting enough to show here. There are quite a few things I don't like about it but I'm going to set it aside for a bit.
I don't like to embed Java applications directly in my feed so the real application can be found farther down this post - the part that you have to read directly on my site. Here is just an image:

The top left set of connected circles represents a partial view of a graph showing inter-relationships between words. There is a central ring of the primary words of interest and a secondary outer ring of some other words related to the central set. Click on an inner word to remove it from the central ring. Click on an outer word to add it to the central ring. In either case the words on the secondary ring are dynamically adjusted to show the 'most important words' related to the central set. The strength of the connections between the inner words and all the others are shown with simple lines. You can also hold down the number '1' key while clicking to make that word the only central word.
The top right shows a collection of bar graphs giving the distribution of the primary words across the entire document. Underneath it is a small map showing the distribution of the words across the entire document. The bottom right gives a list of other interesting words that aren't already in the circle diagram. By 'interesting' I mean high frequency but modified so that capitalized words are boosted. These words can be clicked on to add them to the central diagram. The bottom left gives excerpts for the word last hovered over. There are 5 or 6 files you can explore by clicking on the upper left '?' icon.
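The 'interesting word' score mentioned above can be sketched along these lines in Python. The `boost` factor of 2, the small `STOP_WORDS` set, and the function name are assumptions for illustration only. Note that this simple version also boosts ordinary words that happen to start a sentence.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list; a real one would be much longer.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "that", "for"}

def interesting_words(text, boost=2.0, top_n=5):
    """Rank words by frequency, multiplying the score of any word that
    appears capitalized somewhere in the text by `boost`."""
    counts = Counter()
    capitalized = set()
    for token in re.findall(r"[A-Za-z']+", text):
        key = token.lower()
        if key in STOP_WORDS:
            continue
        counts[key] += 1
        if token[0].isupper():
            capitalized.add(key)
    scores = {w: n * (boost if w in capitalized else 1.0)
              for w, n in counts.items()}
    # Highest score first, ties broken alphabetically.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]

# Made-up sample text.
sample = "Alice saw the rabbit. The rabbit ran. Alice followed the rabbit."
top = interesting_words(sample, top_n=2)
```

Boosting capitalized words is a cheap way to favour names of people and places without doing real named-entity detection.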
Give it a try !
I have borrowed some aspects of the visual design of Elastic Tag Maps for a new interactive version of a word frequency graph. Here is a simple image of the results in case you have trouble running the Java application. It shows a word graph for the 2007 State of the Union Address that I used as an example in Word Frequency Graphs. This time, however, I've done away with the ellipses and only draw the connections when you hover over a node.

I mentioned in Optimal Representation of Text Documents that a tag cloud can be used to illustrate high frequency terms in a document but doesn't show any real structure of that document. One way to improve this is to position the tags in a cloud so that tags which are used together in the document appear close together in the cloud. Tag clouds usually show their terms in alphabetical order or are sometimes sorted in increasing order of frequency.
Moritz Stefaner at Well-formed data.net has developed something he calls Elastic Tag Maps which have the property of related tags being positioned near each other. In his words:
Tag clouds are ordered the wrong way: Tags denote concepts. As such, they have meaningful relations to each other. Tag clouds are ordered alphabetically or by size - it would be much more effective, if tags that belong together could also be presented together. Some of these relations can be deduced automatically, by observing how tags are used: Some tags might always appear together, others sometimes and others never. If tags co-occur frequently or have many common 'neighbors', you can be sure the concepts denoted will be related in some manner.
Here is an example of his that gives the idea. It's interactive so be sure to play with it to get the full effect.

In my last post I used the text of this year's State of the Union address as an example. Brad Borevitz has created an interesting visualization of the entire corpus of the State of the Union addresses from 1790 to 2007.

You can see which specific terms are more prominent in a given address relative to the entire corpus. The horizontal position of the words on the graph gives the average position within the specific address being viewed. You can compare any two documents or see a great many statistical details about specific terms. It's certainly an intriguing application.
Here is a rough attempt at illustrating the meaning of some text with an automatically generated diagram.

Even without an understanding of how the document was constructed much can be understood from the words that are present. A quick glance at some of the words in the bigger ovals suggests a rough idea of the topic: america, iraq, help, health, congress. The connections between some of the words give more hints: federal-government, health-insurance, fight-enemy, american-forces, united-states, ask-congress, qaeda-terrorists, iraqi-security. Notice how my brain ordered them in the way that makes the most sense, united-states rather than states-united, even though no direction is evident in the diagram connections.
The text this diagram was based on obviously includes information related to the American government and the security situation in Iraq. The fact that 'health-insurance' is prominent together with the presence of other terms like 'children' and 'congress' suggests the document wasn't focussed exclusively on the situation in Iraq. In fact, this diagram was constructed from a transcript of the 2007 State of the Union address.
Given a text document, what is the 'best' way to concisely represent the content within say - a 600x600 pixel region ? One procedure that would probably give good output is this:
This would be a time-consuming and expensive option. What is the best automated way to solve the same problem ? Perhaps software that reads the text and automatically produces a summary ? I don't have any experience with the state-of-the-art in auto-summarization but I suspect it often doesn't work very well.
How about tag clouds of the most frequent non-trivial words ? They would highlight high-frequency words but don't show any real structure within the document. I'm sure we can do better.
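For reference, a basic tag cloud of this kind needs little more than a word count. The Python sketch below is a minimal version under stated assumptions: the small `STOP_WORDS` set, the linear mapping from count to font size, and the point sizes are all illustrative choices.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list; a real one would be much longer.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "that", "for"}

def tag_cloud_sizes(text, top_n=20, min_pt=10, max_pt=40):
    """Map the most frequent non-trivial words to font sizes, scaled
    linearly between `min_pt` and `max_pt` by their counts."""
    words = [w for w in re.findall(r"[a-z']+", text.lower())
             if w not in STOP_WORDS]
    top = Counter(words).most_common(top_n)
    if not top:
        return {}
    highest, lowest = top[0][1], top[-1][1]
    span = max(highest - lowest, 1)  # avoid dividing by zero when all counts match
    return {w: min_pt + (n - lowest) * (max_pt - min_pt) / span
            for w, n in top}

# Made-up sample text.
sizes = tag_cloud_sizes("iraq iraq iraq health health congress")
```

Nothing in this calculation looks at where words occur or which words appear together, which is exactly why the result shows no structure within the document.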
I suspect software that detects named entities (people, places, organizations, products etc) might be a useful component of a solution. Perhaps something that creates a diagram illustrating the key entities and relationships between them would be useful.
Any ideas ?
My recent post on Boing Boing featured an example Multi-level Pie Chart. Michael Janssen has written an interesting post entitled Learned Bad Ideas that was prompted by his reaction to the graphic. As you can likely guess from the title of his post he didn't like it very much.
Michael starts with some discussion of bar charts and the fact that they are great for comparing the relative size of different quantities. No argument there. He then discusses Pie Charts which includes -
"Pie charts are the bad seed of the graph world. They aren't very useful, hang out a lot, and don't help you much. The worst thing about pie charts is that they aren't even good at the thing they're supposed to be the best at: comparing relative sizes."
He isn't alone here. For many reasons they are often rejected outright by people with education in information design. The