Spot

By: Jeff Clark    Date: Thu, 12 Jan 2012

Spot is an interactive real-time Twitter visualization that uses a particle metaphor to represent tweets. The tweet particles are called spots and get organized in various configurations to illustrate information about the topic of interest.

Spot has an entry field at the lower-left corner where you can type any valid Twitter search query. The latest 200 tweets will be gathered and used for the visualization. Note that Twitter search results only go back about a week so a search for a rare topic may only return a few. When you enter a query the URL is changed so you can easily bookmark it or send it to someone. The query brainpicker gives you a display something like this:

At the top left, next to the logo, are five icons to access the different views. The first is called Group mode and is shown above. Basically, tweets that share a lot of the same words are grouped together inside larger circles. Tweets are often grouped because they are retweets of the same original content but this doesn't have to be the case. They may be tweets from different people that don't even know each other but happen to be discussing the same thing. The intent is to show quickly the most popular things people are saying about a particular topic. Tweets that are more unique are placed in the phyllotaxy spiral to the right.
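The grouping idea can be illustrated with a small sketch. This is not the code behind Spot; it assumes a simple Jaccard word-overlap measure and a greedy pass, and the names `words`, `jaccard`, `group_tweets` and the 0.5 threshold are all hypothetical.

```python
import re

def words(tweet):
    """Lowercased set of word tokens in a tweet."""
    return set(re.findall(r"[a-z0-9']+", tweet.lower()))

def jaccard(a, b):
    """Fraction of words two sets share (0.0 to 1.0)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def group_tweets(tweets, threshold=0.5):
    """Greedy grouping: a tweet joins the first group whose seed tweet is similar enough."""
    word_sets = [words(t) for t in tweets]
    groups = []  # each group is a list of tweet indices; the first index is the seed
    for i, ws in enumerate(word_sets):
        for g in groups:
            if jaccard(ws, word_sets[g[0]]) >= threshold:
                g.append(i)
                break
        else:
            groups.append([i])  # no similar group found: start a new one
    return groups
```

Groups holding a single index would correspond to the more unique tweets that end up in the spiral.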

All the tweet spots show an image of the sender and at any time can be clicked on to see the tweet details. Clicking on the text of an open tweet will show the original in another browser window. Click on the background or an open tweet spot to close it or you can directly click on another spot.

The Different Views

Here is a complete list of the views and what they show:

  1. Group View (speech bubble icon) places tweets that share common words inside large circles

  2. Timeline View (watch icon) places tweets along a timeline based on when they were sent

  3. User View (person icon) shows a bar chart with the people sending the most tweets in the set

  4. Word View (Word Circle icon) directly shows word bubbles with tweets attracted to the words they contain

  5. Source View (Megaphone icon) a bar chart showing the tool used to send the tweets (or sometimes the news source)

The Word View, again for the query brainpicker:


User and Twitter List Queries

The string 'brainpicker' matches the wonderful twitter account by Maria Popova and the results shown above are mainly retweets of or discussions about the tweets she has sent. You can also do a search for @brainpicker including the @ sign to see the latest tweets sent from that account. This uses the standard Twitter API to get the data and so can go back farther in time. The Word View for this query clearly shows the Brainpicker focus on books, reading, writing, art, and maps.

You can also retrieve the latest tweets from a twitter list. Here is an example for a list I created by analyzing who was on various lists created about data visualization. In the search field enter @Top100in/datavis and you should get something like this for the User View:


Technology and Credits

I was inspired to create this when playing with the wonderful Twitter visualization called Revisit by Moritz Stefaner. Another influence was the Stamen work on Digg swarm which is no longer active but there is a video. My academic background in physics makes it natural for me to think in terms of interacting particles.

This application was created with the wonderful Processing.js which is the javascript-based extension of the Processing tool I have used in the past. Thanks to Ben Fry, Casey Reas, John Resig, David Humphrey and the other people in the Centre for Development of Open Technology at Seneca College. Thanks also to Jim Bumgardner for the excellent tutorial on phyllotaxy spirals and to The Noun Project for four of the icons. Thanks also of course to Twitter and all the people who fill it with great content!

Performance is pretty good with the Chrome browser, and decent in Firefox and Safari. It will not work in Internet Explorer (except perhaps the new IE 9). It seems to work reasonably well on the newer iPads although the search field is currently broken in that environment. The application will go out and get new tweets periodically. For popular queries the analysis and display of those tweets will often cause noticeable lag.

Obama Mosaic Portrait

By: Jeff Clark    Date: Wed, 30 Nov 2011

Here is a Multiscale Mosaic of Obama created from hundreds of pictures taken during his time in office.

The Van Gogh Portrait Mosaics were fun but I wanted to try an example that uses photographs as opposed to paintings. I settled on a portrait of Obama because of the widespread availability of photographs of him that are free of copyright restrictions. The subimages for this design are taken from the White House's Flickr photostream and seem to have been primarily taken by Pete Souza. I downloaded the 1000 most 'interesting' photos from the stream and used those as input to my process. I also manually selected and hand-centered about 10 interesting regions from these images to augment the set.

Here is a close-up showing the detail near the eye and nose.

Van Gogh Mosaic Portraits

By: Jeff Clark    Date: Wed, 23 Nov 2011

Here are four mosaic portraits of Vincent Van Gogh. The primary images and all the various component tiles are regions of paintings by Van Gogh.







A few more details on the multiscale mosaic process can be found in the post Multiscale Mosaics. The portrait images are all from WikiMedia Commons. The other Van Gogh paintings came from here. I created these by writing custom code in Processing.

Multiscale Mosaics

By: Jeff Clark    Date: Tue, 22 Nov 2011

I have been further refining my multiscale mosaic technique in search of the overriding goal of reconstructing an image from sub-images in such a way that balances the clarity of the large target image and the sub-images. I have tried out lots of ideas and the ones that seem to have the most potential for creating interesting multiscale mosaics are:

  • Allow use of lightened and darkened versions of the sub-images
  • Allow manual adjustment of the detail level (size of sub-images used) in different regions of the image
  • When matching sub-images to regions consider how often each sub-image has already been used in order to increase the number of different sub-images used in the final product
  • Do some limited blending of the target image with the sub-images
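As a rough illustration of the third idea in the list above, here is a minimal sketch that picks a sub-image by average colour and adds a penalty for each previous use. The cost function, the `penalty` value and the name `pick_tile` are assumptions made for this example, not the method actually used for these mosaics.

```python
def pick_tile(region_colour, tile_colours, use_counts, penalty=15.0):
    """Return the index of the tile with the lowest colour distance plus reuse penalty.

    region_colour and each entry of tile_colours are (r, g, b) averages.
    use_counts is updated in place so later picks see the new usage.
    """
    def cost(i):
        dist = sum((a - b) ** 2 for a, b in zip(region_colour, tile_colours[i])) ** 0.5
        return dist + penalty * use_counts[i]

    best = min(range(len(tile_colours)), key=cost)
    use_counts[best] += 1
    return best
```

With two similar tiles, repeated calls for the same region colour alternate between them instead of always choosing the single closest match.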

I have used a cropped region of Vincent Van Gogh's painting Self-Portrait With Grey Felt Hat as my target image while developing these ideas. The sub-images are sections of Van Gogh paintings. They are either the central squares or a few are manually selected square regions that focus on some interesting detail.

These techniques do seem capable of producing interesting mosaic images that can carry meaning at multiple visual scales.

Phyllotaxy Spiral Mosaics

By: Jeff Clark    Date: Tue, 15 Nov 2011

The post Mona Mosaics showed a number of ways to segment a flat surface and build mosaics by filling regions with the average colour for that region in some underlying image. Here is another example of the same technique but this time using a Phyllotaxy spiral, sometimes called a Fibonacci spiral. It's an arrangement commonly found in plant growth - for example in the Sunflower.

Jim Bumgardner has an excellent tutorial where he develops the idea and gives code for producing the pattern and several variations. I'm using something based on his Example 10 code to produce the mosaic below from a simple radial gradient. I love the swirling spirals in opposite directions found in the pattern.

And of course we must apply it to the Mona Lisa image as well.
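For readers who want to experiment, the basic spiral layout can be sketched in a few lines. This uses the common formulation in which point n sits at angle n times the golden angle and at a radius proportional to the square root of n; it is not Jim Bumgardner's Example 10 code, and `spacing` is a hypothetical parameter.

```python
import math

GOLDEN_ANGLE = math.pi * (3 - math.sqrt(5))  # about 137.5 degrees, in radians

def phyllotaxy_points(count, spacing=1.0):
    """Return (x, y) positions for `count` points on a phyllotaxy spiral."""
    points = []
    for n in range(count):
        r = spacing * math.sqrt(n)   # radius grows with sqrt(n) so density stays even
        theta = n * GOLDEN_ANGLE     # each point is rotated by the golden angle
        points.append((r * math.cos(theta), r * math.sin(theta)))
    return points
```

A mosaic would then sample the average colour of the underlying image around each point.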

Mona Pizza

By: Jeff Clark    Date: Thu, 10 Nov 2011

In the previous posts Mona Mosaics, Recursive Mona, and Blended Mona I played around with some ideas for reconstructing the famous Mona Lisa image in different ways. One of the things I did was to build up the image from smaller versions of itself. I was using simple image tinting and blending to get reasonable results.

This time I'm going to select sub-images from a set of pictures and use those to build the large image. This has been done for many years now and there are various tools to support it but I thought it would be interesting to try it myself. For this test rendering I'm using a small set of 23 images related to pizza. For simplicity they are all square images so they map well to the square regions determined by my algorithm. The algorithm selects the best-matching sub-image for each region and if the match isn't very good then it sub-divides the region and tries again at a smaller scale. This version uses blending to try and balance clarity of both the sub-images and the global picture.

Mona Pizza

For purposes of comparison here is the same image with no blending applied. You can see the sub-images more clearly but the overall image is only vaguely defined. This could be improved by using smaller sub-image pixels or a larger collection of sub-images to choose from.

Blended Mona

By: Jeff Clark    Date: Wed, 09 Nov 2011

The previous post, Recursive Mona, showed an image of the Mona Lisa constructed from smaller versions of itself. One of the things I don't like about that image, and most other 'photographic mosaic' type images, is that the grid structure controlling the sub-images is so visually prominent. Using multiple scales as I did helps to some degree but the regularity detracts from the overall image.

I've tried to improve this by breaking down the squares that require a more detailed rendering into subsquares in a more varied fashion. There are now 5 or 6 different splitting algorithms used to get the sub-components. This reduces the number of places where you see large numbers of consecutive tiles with the same geometry.

Another technique I've tried out is to blend the sub-images into the overall image at their edges. This tends to smooth out the edges between adjacent sub-images so it looks more natural and also has the effect of strengthening the overall global image. Here is Mona again with both of these techniques applied.

Recursive Mona

By: Jeff Clark    Date: Mon, 07 Nov 2011

One of the ideas presented in Mona Mosaics was to break down an image into square areas at different scales where the colour doesn't vary much. A natural extension of this is to redraw a tinted version of the original image inside each square. Repeat a few times and you get a version of the starting image built recursively from smaller and smaller versions of itself. Here is an example of the concept applied again to the Mona Lisa.

Iconic Faces

By: Jeff Clark    Date: Mon, 07 Nov 2011

Here are a few iconic faces that I have reconstructed with triangles. Source images came from 100+ Portraits of Iconic People of All Time. The faces are Che Guevara, Salvador Dali, and Audrey Hepburn.






Mona Mosaics

By: Jeff Clark    Date: Sat, 05 Nov 2011

A couple of years ago I explored reconstructing images based on Delaunay triangulation and Voronoi decomposition. Inspired by the work of Jonathan Puckey and Andy Gilmore I've revisited the idea of rebuilding images using some geometric-based simplification.

The source image for all these examples is the Mona Lisa. The first rendering is a simple square grid where the colour of each square is the average colour in that region of the underlying image. By using a smaller grid size one can obviously get more detail than is shown here.

The image beside it is much more interesting. I start by looking at large square regions to see how much the colour varies. If it is fairly consistent then that implies there is less detail in that region and I can draw it as a simple large square. If the colour variation is higher than some threshold I look at the smaller subsquares and repeat the process recursively until some lower size is reached. This gives us a version of the image that has smaller more detailed squares where the image varies a lot and larger blocks of colour elsewhere.
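The recursive subdivision just described can be sketched as follows. This is a simplified stand-in rather than the original Processing code: it assumes a grayscale image stored as a list of rows, a square region whose side is a power of two, and the max-minus-min spread as the measure of colour variation. The function names are hypothetical.

```python
def region_stats(img, x, y, size):
    """Mean value and max-min spread of a square region of a grayscale image."""
    vals = [img[j][i] for j in range(y, y + size) for i in range(x, x + size)]
    return sum(vals) / len(vals), max(vals) - min(vals)

def subdivide(img, x, y, size, threshold, min_size, out):
    """Append (x, y, size, mean) squares to `out`, splitting where variation is high."""
    mean, spread = region_stats(img, x, y, size)
    if spread <= threshold or size <= min_size:
        out.append((x, y, size, mean))  # uniform enough, or too small to split further
        return
    half = size // 2
    for dy in (0, half):
        for dx in (0, half):
            subdivide(img, x + dx, y + dy, half, threshold, min_size, out)
```

Calling `subdivide(img, 0, 0, size, threshold, 1, squares)` fills `squares` with small squares in busy areas and large ones elsewhere.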

Images 3 and 4 are similar but use triangular regions rather than squares. Another wrinkle which I added to the recursive process is to define a location on the base image that shows the 'center of attention'. I then vary the colour consistency threshold based on distance from that point. This allows for manually defining, to a limited degree, where the regenerated image will be more detailed. For these examples I used a point in the middle of the Mona Lisa's face.

The next 2 versions use circular regions which don't fill all the space so a background colour shows through.

These 2 fill the background of each circle with the average colour of that region and this gives a much more pleasing result.

This last image uses a recursive triangle decomposition as well but the sub-triangles are defined in a more varied fashion.

Sparklines for MLS Season

By: Jeff Clark    Date: Tue, 25 Oct 2011

Edward Tufte defines Sparklines as intense, simple, word-sized graphics, that should also be high-resolution. They are a very useful technique, especially when combined with the idea of small multiples.

I generated the example below based on the results of the 2011 Major League Soccer regular season. In this case, a whisker-style sparkline was generated for each team to show the complete Win-Loss-Tie sequence for the season. A small upward blue bar shows a win, a grey bar in the middle a tie, and a downward red bar is, of course, a loss.

The graphic succinctly illustrates how each team did over the season. A few interesting tidbits:

  • Los Angeles was consistently strong over the entire season
  • Real Salt Lake ended the season poorly with 4 losses then 2 ties
  • Sporting KC had a horrible start going 1-6-1 but then recovered well
  • DC United had no wins in their last 6 games
  • Vancouver had many more ties in the first half of the season than the second half
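The whisker encoding described above (up for a win, middle for a tie, down for a loss) is easy to mock up in plain text. This is only a sketch of the encoding, not the code used for the graphic, and the result string in the usage note is made up for illustration.

```python
def whisker_rows(results):
    """Render a string of W/T/L results as three text rows.

    Wins are marked on the top row, ties on the middle row, losses on the bottom row.
    """
    top = "".join("|" if r == "W" else " " for r in results)
    mid = "".join("-" if r == "T" else " " for r in results)
    bot = "".join("|" if r == "L" else " " for r in results)
    return [top, mid, bot]
```

Printing `"\n".join(whisker_rows("WWLTL"))` gives one team's row; stacking one such block per team gives the small-multiples layout.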

Radial Scans

By: Jeff Clark    Date: Fri, 21 Oct 2011

Here are a couple of portraits done with a simple radial scan technique. Arc segments are drawn that are coloured by sampling an image source.

Top Ten Cars in the UK

By: Jeff Clark    Date: Mon, 30 May 2011

I created some print graphics for Live Magazine back in February. I enjoyed the project a great deal and would be very happy to tackle more print projects. Send me an email at web1@neoformix.com if you are interested.

The graphic shows a streamgraph illustrating the top selling automobiles in the UK from 1973 until 2010. The various series were sorted to group the same brands together as much as possible and to add the newer brands to the outside of the graph.

Click on the image to see a larger version

I used custom code created with Processing to create vector output in PDF format and then fine-tuned the graphics with Adobe Illustrator.

Minor Site Changes

By: Jeff Clark    Date: Tue, 17 May 2011

I made a couple of minor changes to the Neoformix.com website. The first was that I removed the google Ads. They made virtually no money and cluttered the display up unnecessarily. The second change was that I added a 'Tweet' button at the bottom of every article page to make it easier to share my content on Twitter.

War and Peace

By: Jeff Clark    Date: Fri, 13 May 2011

I've created two new Word Portraits titled War and Peace. Both the template images of Hitler and Gandhi are from the wonderful Wikimedia Commons.

I experimented a bit with adding a more 3D impression to the image by using a tool to bring forward the brighter parts of the image. This was done more for the Gandhi image since the highlighted parts of Hitler didn't correspond very well to depth. The tool I used was DeepImage by Daniel Hawkes.

Explore Twitter Lists

By: Jeff Clark    Date: Sat, 30 Apr 2011

It has been very gratifying to see the interest in my recently launched Tweet Topic Explorer. In the week since it was made available there have been posts about it on Infosthetics, FlowingData, Cool Infographics, and many other places. It has also had over 1,200 tweets sent about it. Thank you everyone for trying it out and telling your friends!

Much of the initial attention came from people in Europe looking at non-English accounts. The tool was enhanced a few days after launch to ignore stop words in German, Italian, Spanish, French, and Dutch. It's not a perfect implementation and of course misses many common languages but it does make the tool more useful for many more people.

Another request for improvement that I was able to deliver was the capacity to analyze the tweets from Twitter Lists. You can now enter a list name in the field to see a Word Cluster Diagram for the latest tweets from the people on the list. The volume of tweets on a list is usually pretty high so the last 800 tweets (which is how many are used by the tool) will not go very far back in time. When using the Tweet Topic Explorer with a list the tweets on the right are enhanced to include the account and icon for the author of each tweet.

Here is the result for the Twitter List @Top100In/DataVis:

And here are a few others without the tweet list shown. @mashable/marketing:

And @Scobleizer/iphone-and-ipad:

Tweet Topic Explorer

By: Jeff Clark    Date: Tue, 19 Apr 2011

One problem I face on a daily basis is to decide for a given Twitter account whether I want to follow it or not. I consider many factors when making the decision such as language of their tweets, frequency, whether they interact on twitter with other people I admire, or if I have some personal or geographic connection with them. But the most critical factor for me is whether they tweet about things that match my interests. Sometimes you can get a hint about this by looking at their short one line twitter bio but the best way is usually to scan their latest tweets.

I have created a new tool to help see which topics a person tweets about most often. It also shows the other twitter users that are mentioned most frequently in their tweets. I call it the Tweet Topic Explorer. I'm using the recently described Word Cluster Diagrams to show the most frequently used words in their tweets and how they are grouped together. This example below is for my own account, @JeffClark, and shows one word cluster containing twitter, data, visualization, list, venn, and streamgraph. Another group has word, cloud, shaped, post, etc. It's a bit hard to see in this small image but there is a cluster about Toronto where I live and mentions of run, marathon, soccer. Also, there are bubbles for some of the people on Twitter I mention the most often: @flowingdata, @eagereyes, @blprnt, @moritz_stefaner, @dougpete.

For all these images below you can click on them to go to a live version of the tool.

Here is another example showing the full tool. This one is for one of my favourite accounts to follow, @brainpicker, by Maria Popova. In this case the word 'book' has been highlighted with a click and the list to the right shows the tweets that contain the word. The words in the tweet list are coloured if they appear in the word cluster diagram. Clicking a different word bubble will select that word instead. You can click on any twitter @ID in the tweet list to load the data for that account. The tool is currently configured to load the last 800 tweets. For my account this goes back a couple of years in time but for more prolific tweeters it may only span a few weeks. The entry field at the lower left lets you explore the tweets for any twitter user.

Here are a few more examples of the word cluster diagrams generated from some twitter accounts. @acarvin is doing an extraordinary job of covering the events in the Middle East.



Word Cluster Diagram

By: Jeff Clark    Date: Mon, 18 Apr 2011

A few years back I introduced the idea of Clustered Word Clouds which use word size to indicate frequency but also use positioning and word colour to group words together that were highly correlated in the text. It works reasonably well I think. See the example below:

I've come up with a new variation on this idea that tries to improve a couple of things. In many word clouds, including those generated by Wordle and my clustered clouds, the font size of each word is proportional to the word frequency. This has the effect that words with many letters (for example 'indisposed') cover a much greater area than a word with fewer letters (say 'ill') if they have the same word count. Some word clouds are constructed so that the area of the word is proportional to the word count rather than font height. This often has the opposite effect of unnaturally emphasizing words with fewer letters. My new design uses solid circles of colour whose area is proportional to the count. I think they may do a slightly better job of giving the proper visual emphasis to the words.
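Making the area, rather than the radius, proportional to the count means the radius grows with the square root of the count. A minimal sketch, where `area_per_occurrence` is a hypothetical scale factor:

```python
import math

def circle_radius(count, area_per_occurrence=100.0):
    """Radius of a circle whose area is count * area_per_occurrence."""
    return math.sqrt(count * area_per_occurrence / math.pi)
```

A word used four times therefore gets a circle with twice the radius, and four times the area, of a word used once.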

By using larger blocks of colour I think it's also easier to visually distinguish the groups in a clustered cloud. I'm calling this new variation a 'Word Cluster Diagram'. The one below is for the same text as the older style above but the clustering algorithm, and stop word list are a bit different so they aren't directly comparable. I think it has some promise although it's not as space efficient as using the words on their own.

Five Years

By: Jeff Clark    Date: Fri, 08 Apr 2011

Five years ago today, I published my first entry on Neoformix.com. I wasn't really sure if anyone would pay attention. You have, and for that I thank you all. Thanks especially to everyone who has written about my work or passed it along to your friends.

Except for the first few months, virtually all the images, interactive applications, and analysis presented on this blog were created using code I wrote with Processing. Thanks very much to Casey Reas, Ben Fry, and the community around that wonderful tool. Thanks to all the amazing researchers, coders, artists, and designers that have most directly influenced my work, especially: Ben Shneiderman, Martin Wattenberg, Fernanda Viégas, Ben Fry, Casey Reas, Chris Harrison, Nathan Yau, Lee Byron, Moritz Stefaner, Jonathan Feinberg, Gui Borchet, Jer Thorp, Robert Kosara, Andrew Vande Moere, Manuel Lima, Frederik Vanhoutte, Mario Klingemann, Robert Hodgin, and Tom Carden.

I've selected images from a few representative posts from the past five years. Click on the image to visit the respective post. Thanks again everyone and I'm looking forward to what the next five years will bring!











Love and Hate on Twitter

By: Jeff Clark    Date: Mon, 14 Feb 2011

I have been collecting tweets containing the words 'love' and 'hate' for a couple of years now and decided to analyze them to see what could be discovered. It was a fun project that I finished just in time for Valentine's Day. I hope you love it!

Click to enlarge

For the data I chose to use every tenth tweet containing the word 'love' and every tenth tweet containing the word 'hate' from all of 2010. This yielded 658,391 love tweets and 503,489 hate tweets. Incidentally, this means there were roughly 6.5 million tweets last year containing 'love' and about 5 million containing 'hate'.

The first set of diagrams in the graphic show the love/hate ratio for various sets of related words. Basically, I counted the number of times a word appeared together with 'love' and together with 'hate'. A simple percentage of 'love' associations out of the total gives a basic measure of sentiment - let's call it the Love Quotient ;) A value near 100% means the word is used almost exclusively with 'love' and never with 'hate' and the graph will show hearts all the way to the right side. Each full heart represents 5% over the 50% neutral point so, for example, 'amazon' has six and a bit hearts showing so its Love Quotient is about 82%.
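The arithmetic can be written out as a short sketch. The function names are hypothetical and the counts in the usage note are invented; only the formula and the 5%-per-heart scale come from the description above.

```python
def love_quotient(love_count, hate_count):
    """Percentage of a word's love/hate co-occurrences that are with 'love'."""
    total = love_count + hate_count
    if total == 0:
        raise ValueError("word never appears with 'love' or 'hate'")
    return 100.0 * love_count / total

def hearts(quotient):
    """Number of hearts shown: one per 5 points above the 50% neutral point."""
    return (quotient - 50.0) / 5.0
```

For example, a word seen 82 times with 'love' and 18 times with 'hate' has a quotient of 82% and 6.4 hearts. Values below 50% come out negative here; how the graphic draws those is not covered by this sketch.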

Using simple word association is a pretty crude measure of sentiment. It obviously would be fooled by a sarcastic tweet like: Ugg - liver and onions again. Don't you just love the food in the cafeteria? Even so, by looking at large quantities of data it seems to give reasonable results in many cases. The data definitely settles the age-old question: pie > cake!

The diagram with all the photos is actually a Treemap. Surprisingly, this is the first treemap to appear on Neoformix since my second post back in April of 2006 about The Map of the Market. This one shows the people who were mentioned most frequently with the word 'love'. It's dominated by celebrities, mostly singers who appeal to young teenagers.

The StreamGraph shows how the word 'love' was used together with various sports over the course of 2010. The term 'football' combines references to both American football and international football (soccer). You can see the peak in June for the World Cup and peaks for both hockey and skating during the Winter Olympics in February.

Text analysis and creation of the various graphics was done with custom code created in Processing. The Treemap diagram used the Treemap library created by Benjamin B. Bederson and Martin Wattenberg. Thanks!

State of the Union 2011

By: Jeff Clark    Date: Wed, 26 Jan 2011

President Obama delivered the State of the Union speech last night for 2011. I've created a few diagrams that compare it with the speech from last year to try and understand how it differs.

First we have two Sentence Bar Diagrams for the speeches from 2010 and 2011. Sentence Bar diagrams use color coding to show the topic of the various sentences in the text and bar length to show how long the sentences are. In these diagrams I did combine adjacent pairs of sentences so it wouldn't be too long. These two texts are almost the same length, have a very similar breakdown over the four topics, and both have a segment towards the end about security issues. The 2011 speech has slightly more emphasis on the domestic issues of education and less on economic matters.

This next diagram shows the words that were used much more frequently in 2010 vs 2011. For example, the word 'families', third down the list, was used 17 times in 2010 but only 2 times in 2011. Other prominent words from last year compared to this year: bill, businesses, security, national, recovery, act, banks, energy, and insurance.

This one below shows the words used much more often this year than last year: new, world, race, future, high, technology, research, education, progress, and innovation.
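A comparison like the two word lists above can be approximated by counting words in each text and keeping those used far more often in one than the other. The ratio and minimum-count cutoffs below are assumptions for illustration, not the criteria used for the diagrams; comparing raw counts is reasonable here only because the two speeches are almost the same length.

```python
import re
from collections import Counter

def count_words(text):
    """Lowercased word counts for a text."""
    return Counter(re.findall(r"[a-z']+", text.lower()))

def overused(a, b, min_ratio=3.0, min_count=5):
    """Words in counter `a` used at least `min_ratio` times as often as in `b`.

    Returns (word, count_in_a, count_in_b) tuples, biggest difference first.
    """
    rows = [(w, n, b[w]) for w, n in a.items()
            if n >= min_count and n >= min_ratio * max(b[w], 1)]
    return sorted(rows, key=lambda row: row[1] - row[2], reverse=True)
```

Swapping the arguments gives the list for the other speech.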

Finally, we have a Document Contrast Diagram comparing the two speeches.

Click to enlarge

Simple Visual Designs

By: Jeff Clark    Date: Fri, 19 Nov 2010

I've been exploring algorithmic generation of images from combinations of simple shapes. I'm using alpha-blending with grayscale sub-components and then taking the various shades of gray created through overlap and recoloring based on a palette. Here are a couple that I think turned out pretty well.



Designs by Juan Osborne

By: Jeff Clark    Date: Wed, 03 Nov 2010

If you enjoy my Word Portraits you should definitely take a look at the work of Juan Osborne. He has some wonderful designs. Here are a couple of samples:

Six Ways to Find Value in Twitter's Noise

By: Jeff Clark    Date: Mon, 07 Jun 2010

The June 2010 issue of the Harvard Business Review contains a small data visualization piece by Scott Berinato and me. It's called Six Ways to Find Value in Twitter's Noise and has a StreamGraph showing tweets about the iPad during the launch weekend. I collected and analyzed the data and created the StreamGraph. Scott did a great job picking out some interesting features and explaining what it all means. It was a fun project and it's great to see my work in such a prestigious print magazine. Thanks for the opportunity Scott!

StreamGraph for Makers

By: Jeff Clark    Date: Sat, 15 May 2010

A few weeks ago I had the pleasure of reading Makers, a novel by Cory Doctorow. It's an interesting story, well told, and filled with stimulating ideas related to technology, creative culture, and intellectual property.

Cory makes his work available for free download so I was able to create a Document StreamGraph based on the text of the book. The document is split up into 24 equal sized segments and the word counts are done within each segment. These segments are used in place of time along the horizontal axis of the StreamGraph. I chose to show capitalized words and the resulting image does a reasonable job of illustrating the ebb and flow of the various characters within the narrative.

Click for larger version
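The segment-and-count step behind a Document StreamGraph can be sketched like this. It is a simplified stand-in for the original processing: the tokenizer is an assumption, and counting every capitalized token also picks up ordinary words at the start of sentences, which a real analysis would probably filter out.

```python
import re
from collections import Counter

def capitalized_counts_by_segment(text, segments=24):
    """Split the text into near-equal runs of words and count capitalized words in each."""
    tokens = re.findall(r"[A-Za-z']+", text)
    n = len(tokens)
    counts = []
    for s in range(segments):
        # Integer boundaries keep the segments as equal in size as possible.
        chunk = tokens[s * n // segments:(s + 1) * n // segments]
        counts.append(Counter(w for w in chunk if w[0].isupper()))
    return counts
```

Each Counter in the returned list supplies one column of the StreamGraph, with the segment index standing in for time.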

Some Thoughts on Flash

By: Jeff Clark    Date: Thu, 29 Apr 2010

Steve Jobs published some thoughts today about why Apple isn't supporting Flash on their mobile platforms. The Shaped Word Cloud below was created from the text.

Just for fun I made a Clustered Word Cloud as well.

The Art of Tatiana Plakhova

By: Jeff Clark    Date: Wed, 21 Apr 2010

I really like the work of Tatiana Plakhova and have been following her Flickr stream since last year. Some of her images make me think of alien life forms or cities of the distant future. The one on the top right here below reminds me of Cerenkov Radiation.

Using her image from the top left above as inspiration I created a simple animation that tries to recreate her style. This video isn't great quality but seems to get the idea across.

Blue Flow 1 from Jeff Clark on Vimeo.

NHL Points Over Career

By: Jeff Clark    Date: Mon, 19 Apr 2010

One charting technique that I really like is to take time series for related data that occurred over different time periods and align them to a common starting point so they can more easily be compared. One good example is this graph comparing this recession to the last five in terms of employment decline. Another one, this time interactive, is from the NYT and depicts Paths to the Top of the Home Run Charts.

I have created a couple of simple line charts showing cumulative point production (goals + assists) for selected NHL players over their careers. I'm actually using Adjusted Points which try to control for the fact that teams played fewer games in the past and rule changes and other factors impact the ease of scoring goals over time. Data is from Hockey-Reference.com.
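Aligning careers on age and accumulating points takes only a few lines. This sketch assumes each player's data is a list of (age, points) pairs; the numbers in the usage note are invented and are not taken from Hockey-Reference.com.

```python
from itertools import accumulate

def cumulative_by_age(seasons):
    """Map each age to the running points total, given (age, points) pairs in any order."""
    ordered = sorted(seasons)
    ages = [age for age, _ in ordered]
    totals = accumulate(points for _, points in ordered)
    return dict(zip(ages, totals))
```

For example, `cumulative_by_age([(19, 50), (18, 30), (20, 20)])` gives `{18: 30, 19: 80, 20: 100}`. Plotting these dictionaries for several players against a shared age axis puts their careers on a common starting point.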

This first chart shows many of the top players from the past. I only showed data up until age 43. Gordie Howe did get points in the NHL at age 51 but they were negligible in the overall results other than to illustrate his amazing longevity as a player. The graph clearly shows why Wayne Gretzky is called the 'Great One'. You can also see the various plateaus due to injury for Lemieux, and early career end for Bobby Orr (who is also the only defenceman shown here).

The second graph keeps Gretzky and Richard for comparison but highlights many of today's top stars. Crosby appears to have a legitimate chance to match Gretzky but has a long way to go...

Tweets Containing Love

By: Jeff Clark    Date: Fri, 09 Apr 2010

I have been collecting tweets containing the word 'love' for more than a year now and just analyzed a sample to see what other words are being used in conjunction with 'love'. I naively assumed I'd see lots of company or product names as the top non-generic terms. There were a few near the top - iphone, ipod, and starbucks for example. The most commonly used non-generic terms were actually almost all Twitter accounts for singers. The person with the most references was @justinbieber. Note that I analyzed 1 out of every 50 tweets so the counts shown here are ~50 times smaller than the real totals for the year.

During the last few months the total for @justinbieber exceeded the next top 14 combined. The streamgraph also shows a strong decrease for @mileycyrus and @ddlovato. References to @jonasbrothers seem to have split into separate streams for both @nickjonas and @joejonas.

Here is a PDF version of the streamgraph.

Inline Images for Twitter Clients

By: Jeff Clark    Date: Wed, 03 Mar 2010

Wouldn't it be cool if your twitter client could directly show tweets with small embedded images? Things like stock charts, graphical weather reports, server status, traffic reports, graphical emoticons expressing the emotional state of your friends, mini-graphical movie ratings with thumbs up/down or stars, sports record summaries, or a million others that I haven't thought of? Perhaps something like this?

This shouldn't be very hard. In fact, I think all that's required is the following:

  1. Somebody create a new URL shortener that by convention is only for links to images of dimensions 234x60 pixels or smaller. It should verify at the time of link creation that images fulfill the size constraint. I'll call it inpic for now but any short name would work.

  2. Twitter clients that want to support inline images in tweets are modified to recognize tweets with links to http://inpic.com/ABCD and display the image inline rather than the text link. Twitter clients that don't support inlining would show the text link and people could see the image with a click as they do now.

Step 1 is easy. There are hundreds of URL shorteners already in existence. We just need to adopt one that indicates by its name that it points to a small embeddable image. An alternative that would avoid having to get different companies to adopt the same convention would be to use a special hashtag to indicate the same thing: have all tweets with any link and the tag #inlinedimage handled by showing the image inline. If the link is invalid or doesn't point to a small image then the twitter client should revert to showing the text form.

Step 2 is easy as well since Twitter clients already show images in tweets - the user avatar images. I chose the size constraint by measuring the space used by TweetDeck to show the text of a tweet - I got about 237x62 pixels. This is just slightly bigger than the standard half banner size of 234x60 used for online advertising so I chose that instead.
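As a rough sketch of the client-side check described above, the following Python shows one way a twitter client might decide whether a tweet qualifies for inline display. The host name inpic.com and the tag #inlinedimage come from the proposal; the function names and the regular expression are assumptions made for illustration, and fetching the image to measure it is left out.

```python
import re
from urllib.parse import urlparse

MAX_WIDTH, MAX_HEIGHT = 234, 60   # half-banner size chosen in the post
INLINE_HOST = "inpic.com"         # placeholder shortener name from the post
INLINE_TAG = "#inlinedimage"      # alternative tag-based convention

URL_PATTERN = re.compile(r"https?://\S+")

def find_inline_candidate(tweet_text):
    """Return the link a client should try to render inline, or None."""
    links = URL_PATTERN.findall(tweet_text)
    if not links:
        return None
    # Convention 1: the link goes through the dedicated shortener.
    for link in links:
        if urlparse(link).netloc.lower() == INLINE_HOST:
            return link
    # Convention 2: any link plus the special tag (simple whitespace split).
    if INLINE_TAG in tweet_text.lower().split():
        return links[0]
    return None

def fits_inline(width, height):
    """True if an image of this size is small enough to show inline."""
    return 0 < width <= MAX_WIDTH and 0 < height <= MAX_HEIGHT
```

A client would call find_inline_candidate first, fetch the image, and fall back to the plain text link whenever fits_inline returns False or the fetch fails.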

Here are a few more things that could be added to make this even more useful:

  1. The URL shortener service (inpic or whatever it gets called) would host images in a manner similar to twitpic.com

  2. Twitter clients would support letting people easily embed graphical emoticons.

  3. If a second link in an inline image tweet is provided it would act as a browser target link if the inline image is clicked on. So an inline image in a tweet would give summary information and when clicked on the user would see more details inside a browser window.

  4. Twitter clients that support this might have an option to turn it off for anyone who prefers to always see text.

I think many people would find this valuable and it seems quite simple to accomplish. Come on TweetDeck, Twhirl, and other Twitter Client companies - get to work!

Where this idea came from

This morning I came across the interesting post Visualizing time series data embedded in tweets by Chris McDowall. The basic idea he discusses is to send time series data in tweets and have twitter clients recognize the format and present it as a small graph (or Sparkline) embedded in the tweet stream rather than just text. Chris seems to have been inspired by the Twitter Data proposal.

It's an intriguing idea and Chris created a proof of concept twitter client called the Twitter Sparkline Visualizer.

One problem I see is that a twitter client that doesn't recognize the special data format would show the cryptic form which would probably be undesirable in most cases. Also, the 140 character limit of a tweet would put a fairly tight boundary on how much could be encoded. In a comment on the post, Tom Carden suggested looking at the Google Charts API as a "good example of a concise vocabulary for passing chart data around using URLs".

Tom's suggestion triggered an idea for me: Use any RESTful API like Google Charts to encode small charts in a URL, then use a URL shortener to construct a tweetable link representing the chart. Furthermore, we can use a specially named URL shortener that indicates to a twitter client that all of its links point to small inline charts. This lets a twitter client determine efficiently that a given link can be rendered inline.
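To make the "chart in a URL" idea concrete, here is a minimal sketch that encodes a short series as a sparkline URL in the style of the Google image charts API. The endpoint shown is the historical one and is an assumption here (the image charts service has since been retired); the parameters cht (chart type), chs (size) and chd (data, text encoding on a 0-100 scale) follow that API's conventions. The function name is hypothetical.

```python
from urllib.parse import urlencode

# Historical image-chart endpoint; treated as an assumption in this sketch.
CHART_ENDPOINT = "http://chart.apis.google.com/chart"

def sparkline_url(values, width=234, height=60):
    """Encode a series as a sparkline chart URL small enough to inline."""
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1                      # avoid dividing by zero
    # Text encoding expects values on a 0-100 scale.
    scaled = [round(100 * (v - lo) / span) for v in values]
    params = {
        "cht": "ls",                           # sparkline-style line chart
        "chs": f"{width}x{height}",            # image size in pixels
        "chd": "t:" + ",".join(str(s) for s in scaled),
    }
    return CHART_ENDPOINT + "?" + urlencode(params, safe=":,")
```

The resulting URL would then be passed through the specially named shortener to get something short enough to tweet.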

It makes sense to generalize the idea further to support use of any small image rather than charts in particular.

Profile in Harvard Business Review

By: Jeff Clark    Date: Sat, 27 Feb 2010

About ten days ago I was contacted by Scott Berinato, an editor at the Harvard Business Review, who was interested in writing up some of my visualization work for the HBR Research blog. We had a nice chat and he subsequently published Four Ways of Looking at Twitter which profiled my four twitter visualization tools.

He did a wonderful job and the article got lots of attention on Twitter. I've been tracking many of the tweets about the article and there have been at least 1500 tweets sent by various people telling their friends to read it. All the extra attention has made this the busiest week on Neoformix over the past year. Thank you to Scott for creating the article and thanks also to everybody who passed it along to all their friends!

Apple Logo from Products

By: Jeff Clark    Date: Tue, 02 Feb 2010

I was looking for pictures of the new Apple iPad and stumbled across this image of Apple Form Factor Evolution. It's got lots of images of Apple products on a nice simple white background and was perfect fodder to use with the Image Foam Technique so I made this version of the Apple logo from the product sub-images.

SOTU 2010 Word Cloud Map

By: Jeff Clark    Date: Thu, 28 Jan 2010

Last night President Obama delivered the State of the Union Address. The Shaped Word Cloud below was created from the text.

More Visualization Links on Twitter

By: Jeff Clark    Date: Sat, 23 Jan 2010

In a recent post I showed the Top 20 Individual Data Visualizations Mentioned on Twitter and remarked that many of the most frequently mentioned twitter links were to collections of visualizations. Shown below is a meta list of the top collection-type data visualization or infographic links.

Top Collections of Data Visualization Links

  1. 50 Great Examples of Data Visualization - Webdesigner Depot

  2. Data Visualization and Infographics Resources - Smashing Magazine

  3. 15 Stunning Examples of Data Visualization - Web Design Ledger

  4. 20 Essential Infographics & Data Visualization Blogs - Inspired Magazine

  5. Is Information Visualization the Next Frontier for Design? - Fast Company

  6. 28 Rich Data Visualization Tools - InsideRIA

  7. The Beauty of Infographics and Data Visualization - Abduzeedo

  8. 50 Great Examples of Data Visualization - Sun Yat-Sen University

  9. 20 Inspiring Uses of Data Visualization - SingleFunction

  10. 5 Best Data Visualization Projects of the Year – 2009 - FlowingData

  11. Data Visualization: Stories for the Information Age - BusinessWeek

  12. Data Visualization: Modern Approaches - Smashing Magazine

  13. The 21 Heroes of Data Visualization: - BusinessWeek

  14. 20+ CSS Data Visualization Techniques - tripwire magazine

  15. MEDIA ARTS MONDAYS:Data Visualization Tools - PSFK

  16. 37 Data-ish Blogs You Should Know About - FlowingData

  17. 5 Best Data Visualization Projects of the Year - FlowingData

  18. 30 new outstanding examples of data visualization - FrancescoMugnai.com

  19. Infosthetics: the beauty of data visualization - PingMag

  20. 5 Beautiful Social Media Videos - Mashable

Here are the top product type links in the field according to Twitter data between March 24 and Dec 31, 2009.

Top Data Visualization Product Links Mentioned on Twitter

  1. Axiis : Data Visualization Framework

  2. The JavaScript InfoVis Toolkit

  3. Microsoft - What is Pivot?

  4. Many Eyes

  5. Roambi - Your Data, iPhone-Style

  6. Flare - Data Visualization for the Web

  7. Gapminder.org - For a fact based world view.

  8. SpatialKey - Location Intelligence for Decision Makers

  9. Tableau Software - Data Visualization and Business Intelligence

  10. SIMILE Widgets

and finally:

Top Data Visualization Websites Mentioned on Twitter

  1. Information Is Beautiful | Ideas, issues, concepts, subjects - visualized!

  2. FlowingData | Data Visualization and Statistics

  3. Information Aesthetics | Information Visualization & Visual Communication

  4. visualcomplexity.com | A visual exploration on mapping complex networks

  5. DataViz on Tumblr

Charting the Beatles

By: Jeff Clark    Date: Mon, 18 Jan 2010

Michael Deal has published an interesting collection of graphics in his Charting the Beatles project. This first snippet below shows the beginnings of a graph illustrating authorship and collaboration in songwriting throughout their song collection. The full graphic clearly shows the trend towards less collaboration over time in songwriting, the increasing contribution from George, and increasing contribution by outside contributors.

This second image is from a chart showing references in Beatles songs to earlier songs. There are full images and several other interesting graphics on his site.

Top 20 Data Visualizations Mentioned on Twitter

By: Jeff Clark    Date: Mon, 18 Jan 2010

For many people Twitter has become the best place for discovering the latest and most interesting work in a variety of fields. In my twitter client I keep a search column open that gets constantly updated with the latest tweets pertaining to data visualization or infographics and I see lots of beautiful content flow by. I've been collecting these tweets for quite a while and thought it would be interesting to analyze them and see which visualizations were shared through twitter the most often.

Many of the top links in the domain were articles containing collections of visualizations chosen to be the 'Top NNN' by some panel of experts. For example, the most shared link was 50 Great Examples of Data Visualization by Web Designer Depot. I will have another post in the near future that lists the most popular of these types of links as well as separate lists for products/frameworks and news/analysis. For this list I chose to focus instead on references to individual data visualizations or infographics.

Here are the top 20 ordered by popularity. Click on either the link or image to go to the original article.

1. Historical Browser Statistics - Axiis



2. Stunning data visualization in the AlloSphere - Video on TED.com



3. Worldwide Real-Time Firefox Downloads



4. The Geography of Jobs - TIP Strategies



5. Realtime Downloads from the App Store - Michael Lebowitz



6. Manhattan's Population By Day vs Manhattan's Population By Night - Manhattan population - Gizmodo



7. Take a new look at health - GE



8. The Billion Dollar Gram - Information Is Beautiful



9. Death and Taxes 2009 - WallStats



10. Turning a Corner? - NYTimes.com

Note that the link made popular on Twitter for #9 Death and Taxes was actually a link to an image on imageshack and I have used instead a link to the original source of the material.

The tweets for this entire analysis were collected from March 24, 2009 until December 31, 2009. Only the first link to a specific item from each Twitter ID was counted so that one person did not unfairly impact the results by tweeting frequently about the same thing.
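The counting rule above (one vote per Twitter ID per item) can be sketched in a few lines of Python. The input format, a sequence of (twitter_id, link) pairs with links already resolved to the item they point at, is an assumption for illustration.

```python
from collections import Counter

def count_link_shares(tweets):
    """Count how many distinct Twitter IDs shared each link.

    tweets: iterable of (twitter_id, link) pairs (hypothetical input format).
    Only the first mention of a link by a given ID is counted.
    """
    seen = set()
    counts = Counter()
    for twitter_id, link in tweets:
        key = (twitter_id.lower(), link)
        if key in seen:
            continue          # this ID already voted for this link
        seen.add(key)
        counts[link] += 1
    return counts
```

Calling count_link_shares(tweets).most_common(20) would then give a top 20 ordered by popularity.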

Items 11-20 are listed below.


(More...)

Twitter Word Map for Android

By: Jeff Clark    Date: Sat, 16 Jan 2010

Here is a Shaped Word Cloud for tweets containing 'android' from 2009. I removed the tokens 'android' and '#android' from the analysis. You can click on the words to jump to Twitter Search and see the matching tweets. It's pretty clear that android is a 'google' 'phone' and is related to 'iphone' and 'htc'.

Obama 2009 Tweets and #tcot

By: Jeff Clark    Date: Mon, 11 Jan 2010

I've taken another look at the set of tweets from 2009 that contain 'Obama'. This time I started by focusing on the most popular hashtags that were used. This graph shows the top 10 hashtags, their distribution over the course of 2009, and the total references to them. The top hashtag by far was #tcot which stands for 'Top Conservatives on Twitter'.

How do tweets that contain #tcot differ from those that don't have it? What words seem especially associated with the tag? What topics do people using the tag seem to be focusing on?

I've done an analysis of the word frequency inside tweets containing the tag versus tweets without it. The chart below shows the words that are used much more frequently in the #tcot tweets compared to the baseline. Words on the left like 'CARE' and 'BUSH' are used at a rate of around 100-120% of the baseline rate. Words on the right like 'BHO' (shorthand for Barack Hussein Obama) and 'RASMUSSEN' are used at around 500% of the baseline rate - or, in other words, they occur around five times as often in #tcot tweets as they do in non-#tcot tweets.
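A comparison of this kind can be sketched as follows. The post does not say exactly how the rate is defined, so this sketch assumes a word's rate is its share of all word occurrences in a set of tweets, and it uses a naive whitespace tokenizer; both are assumptions, as are the function names.

```python
from collections import Counter

def word_rates(tweets):
    """Share of all word occurrences taken by each word (upper-cased)."""
    counts = Counter(word for t in tweets for word in t.upper().split())
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def relative_usage(tagged_tweets, baseline_tweets):
    """Rate in the tagged set as a percentage of the baseline rate.

    Words missing from the baseline are skipped because their ratio
    is undefined.
    """
    tagged = word_rates(tagged_tweets)
    base = word_rates(baseline_tweets)
    return {w: 100.0 * tagged[w] / base[w] for w in tagged if w in base}
```

A value of 500 for a word means it occurs about five times as often in the tagged tweets as in the baseline.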

The chart is an interesting collection of terms and is an attempt at distilling what the people who use the tag #tcot are saying in relation to Obama. Some notable words in the set are 'DANGEROUS', 'SOCIALIZED', 'EXPOSE', 'RADICALS', 'ARROGANT', 'MARXIST', 'COMMUNIST', 'CLIMATEGATE'.

Tweets About Obama in 2009

By: Jeff Clark    Date: Thu, 07 Jan 2010

I collected all the public tweets containing 'Obama' during 2009. There were over 5 million recorded during the course of the year. I've done some analysis on a sample containing every 20th tweet. This first graph simply shows the distribution over the course of the year of the number of times the name 'Obama' was used. The curve has a big peak during the inauguration, a few smaller ones in February and March and is then remarkably level for the rest of the year.

This set of graphs shows other words that were used frequently in the tweets about Obama and that had distributions with a high concentration near specific dates during the year. When ordered by the peak date for each graph they give an interesting graphical narrative of Obama-related events during 2009.
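One possible way to pick out words whose usage is concentrated near a specific date, and then order them into a narrative, is sketched below. The concentration measure (share of a word's mentions within a few days of its busiest day) and the threshold are assumptions chosen for illustration, not necessarily what was used for the graphs.

```python
def peak_concentration(daily_counts, window=3):
    """Return (peak_day, share of mentions within `window` days of the peak)."""
    total = sum(daily_counts)
    if total == 0:
        return None, 0.0
    peak_day = max(range(len(daily_counts)), key=daily_counts.__getitem__)
    lo = max(0, peak_day - window)
    hi = min(len(daily_counts), peak_day + window + 1)
    return peak_day, sum(daily_counts[lo:hi]) / total

def narrative_order(series_by_word, min_share=0.5, window=3):
    """Words with a concentrated peak, ordered by the day of that peak."""
    picked = []
    for word, counts in series_by_word.items():
        day, share = peak_concentration(counts, window)
        if day is not None and share >= min_share:
            picked.append((day, word))
    return [word for _, word in sorted(picked)]
```

Words that are used evenly all year fall below the threshold and drop out, leaving only the event-driven terms.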







Snow Doves

By: Jeff Clark    Date: Tue, 05 Jan 2010

It's been snowing where I live for the last month or so and I've been playing around with generating a dove image from snowflake constituents. This first image is constructed from smaller snowflakes built using the Text Snowflake Creator based on the words PEACE, LOVE, and TRUTH. The dove image is from Wikimedia Commons.

This second version uses the three unicode snowflake characters in the font Arial Unicode MS. I've also applied a small variation in color.

Neoformix Review 2009

By: Jeff Clark    Date: Mon, 04 Jan 2010

Thank you everybody for your interest in Neoformix over the past year. I wish you all a Wonderful and Happy 2010!

These are the 20 most popular posts published on Neoformix during 2009 ordered by their popularity. There are a large number of popular posts based on the Shaped Word Cloud concept and a few more on the related Image Foam Technique.

1. Iran Election Word Cloud



2. September 11 Pager Data Visualization



3. Butterfly Plane



4. Oscar Chatter on Twitter



5. Hudson River Landing



6. Fish Tank



7. Butterfly Falcon



8. Shaped Word Clouds



9. TED Shaped Word Cloud



10. The Raven



11. Apple Twitter Word Map



12. Obama Twitter Word Map



13. Earth Day Twitter Map



14. Peace Dove



15. World News Clustered Word Cloud



16. Word Portrait: Michael Jackson



17. Obama Inauguration Speech



18. Twitter List Profile Clouds



19. Toronto Twitter Community



20. Temporal Correlation for Words in Tweets



Note that many of the most popular parts of Neoformix visited during the past year were for projects published prior to 2009 and include Twitter StreamGraphs, Twitter Venn, Big Small, and Word Hearts.

Twitter Venn Birthday

By: Jeff Clark    Date: Thu, 17 Dec 2009

One year ago today I launched Twitter Venn. Those of you who have not used it before or have forgotten about it might want to check it out. The image below is an example of what it produces.

Launch Twitter Venn

ACM Crossroads Cover

By: Jeff Clark    Date: Tue, 15 Dec 2009

I'm very pleased to announce that an image from my Twitter StreamGraphs tool was chosen as the cover for the current issue of ACM Crossroads - the Student Journal of the Association for Computing Machinery. There is also a small writeup inside about the image. It depicts the streamgraph for the phrase 'data visualization' and suits the issue well since it is dedicated to the Social Web. The entire issue is available online.

Thanks to Chris Harrison, the editor-in-chief, for inviting me to contribute the image and to Senior Editor Jill Duffy for sending me some copies of the issue.

Climate Change Clouds

By: Jeff Clark    Date: Mon, 07 Dec 2009

Fifty-six papers in forty-five countries published a front page article today calling for action at the climate summit in Copenhagen. I've taken the text of the article and created a couple of images. The first is a Clustered Word Cloud which shows the more prominent words from the article grouped into clusters based on whether they were used together.

This second image takes the word clusters and arranges them in a starburst type pattern. The visual form was influenced by the Word Associations work by Chris Harrison. It's a little more interesting to look at and makes the groupings more obvious but has the drawback that the words are smaller than in the first format.

Animated Word Clouds

By: Jeff Clark    Date: Wed, 02 Dec 2009

Last night Obama outlined the new policy in Afghanistan in a speech at West Point entitled The Way Forward in Afghanistan and Pakistan. Like many people, I have mixed feelings towards a larger military effort in the region. I have tried to represent that ambivalence with an animated word cloud based on the speech that transitions from one symbol to another.

This was created with custom code written in Processing. The two images came from here and here.

If you like this work you might want to    Follow JeffClark on Twitter

9/11 Pager Data Visualization

By: Jeff Clark    Date: Sat, 28 Nov 2009

The organization Wikileaks recently published a data set of pager intercepts from the 9/11 tragedy. As described on their website:

Text pagers are usually carried by persons operating in an official capacity. Messages in the archive range from Pentagon, FBI, FEMA and New York Police Department exchanges, to computers reporting faults at investment banks inside the World Trade Center

The archive is a completely objective record of the defining moment of our time. We hope that its entrance into the historical record will lead to a nuanced understanding of how this event led to death, opportunism and war.

I have taken this data and done an analysis for 100 phrases selected to summarize the events of that horrible day. I have focused on the time period from 8am until 8pm, September 11th, 2001.

This video below shows a Phrase Burst Visualization of the data. The larger the text the more frequently it was used during the 12 hour period. Text appears bright during the times of high usage and fades away otherwise. The color hues are cosmetic. This phrase burst visualization is basically a word cloud where the brightness of the words varies according to how prominent the words were during specific periods of time. You can drag the playhead for the video around to examine specific times.

Pager Data from 9/11 - Phrase Cloud Visualization from Jeff Clark on Vimeo.

Perhaps a more useful view of the data is provided by this set of timeline graphs. They are ordered by the time of the highest peak for the phrase and in this arrangement provide a narrative of the events.





Video, graphing, and analysis done with custom code created with Processing.


Swine Flu Deaths - Altered

By: Jeff Clark    Date: Tue, 24 Nov 2009

I believe that the recent Swine Flu pandemic has been dramatically overplayed in the media. This morning I came across the image below on dataviz.tumblr.com that shows the number of deaths in the last 300 days from various causes including Swine Flu. There are a lot of things done really well here - the most important of which is that the deaths due to swine flu are put in a proper context.

Unfortunately the choice of using a solid red bar for emphasis beside the bar graph for Swine Flu deaths confuses the message because at first glance the bar can be interpreted as an extension of the bar graph itself. The first impression (and for some viewers the only impression) is that the deaths due to Swine Flu are exceptionally high - the very myth that the graphic is trying to dispel.


I have made a small intervention to the graphic that I believe makes the message less likely to be confused. The bar has been replaced with a text label and three arrows that can't be confused with an extension of the graph itself but still draw attention to the relatively small number of deaths for Swine Flu.


Unfortunately there is no reference on dataviz.tumblr.com to either the source of the original graphic or the data depicted. If anyone knows then send me a note and I'll add proper attribution here.

Creating Topical Twitter Lists

By: Jeff Clark    Date: Sat, 21 Nov 2009

In a recent post I defined the idea of Twitter ListMates as IDs that are frequently grouped together on the same twitter lists. The listmates for some starting ID give an interesting perspective on how that ID is perceived by others and are in some sense similar to it.

If the starting 'seed' ID is highly characteristic of some particular domain then the highest ranking listmates will also be characteristic of that domain. As a concrete example, let's start from infosthetics, the twitter account for one of the central websites in the area of data visualization. The top ranking listmates are: flowingdata, datavis, and infobeautiful which are all very important voices in the domain.

If we start with all four of these IDs, find the lists they are on, and see who else appears on the same lists the most often we can get an excellent quality list of twitter IDs for the field of data visualization. By starting with a small set of IDs rather than just one we introduce less bias into the result. Another technique that can be used to improve quality is to only use twitter lists whose name matches the domain as well - for example include the members of a list called 'datavis' but not of one called 'friends' when determining the listmates.
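The seeded approach with a list-name filter might look something like this sketch. It assumes list membership has already been fetched into a dictionary keyed by "owner/list-name"; that in-memory format, the keyword matching, and the scoring (one point per seed found on a list) are all assumptions for illustration.

```python
from collections import Counter

def topical_listmates(lists, seeds, name_keywords=None):
    """Rank accounts that share lists with any of the seed IDs.

    lists: dict mapping "owner/list-name" -> set of member IDs
           (hypothetical in-memory form of the list data).
    name_keywords: if given, only lists whose name contains one of these
           substrings are used, e.g. ["datavis", "visualization"].
    """
    seeds = set(seeds)
    scores = Counter()
    for name, members in lists.items():
        if name_keywords and not any(k in name.lower() for k in name_keywords):
            continue                      # list name is off-topic
        overlap = len(seeds & members)
        if overlap == 0:
            continue                      # no seed appears on this list
        for member in members - seeds:
            scores[member] += overlap     # one point per seed on the list
    return scores.most_common()
```

With a name filter in place, a list called 'friends' contributes nothing even if a seed ID appears on it.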

I have used this technique to define a number of twitter lists for various domains and saved them under the twitter ID Top100in. The lists defined so far are:

These meta-lists seem to be filled with interesting accounts for the various topics although the datavis one does have a few IDs that are more focused on digital art and design rather than visualization in particular. Feel free to follow them!

Twitter StreamGraph Supports Lists

By: Jeff Clark    Date: Mon, 16 Nov 2009

I have updated Twitter StreamGraphs to support the new twitter lists. You just enter a list in the standard format in the text box to see the graph for the latest 1000 tweets from all members of the list. The standard format looks like this: @scobleizer/web-innovators.

The Twitter StreamGraph for the list @scobleizer/web-innovators (click to launch application)

More Twitter ListMates

By: Jeff Clark    Date: Mon, 16 Nov 2009

In Twitter ListMates I introduced a name for the idea of people who are often grouped together on Twitter lists. The idea has value because listmates have been grouped together by multiple people who independently decided that those accounts are similar in some sense. Doing this type of analysis starting from my account, JeffClark, helped me find new people to follow.

I have repeated the process for four other accounts to try to confirm that this technique is indeed useful. The results are shown below.

For Robert Scoble (scobleizer) we get:
  1. guykawasaki
  2. mashable
  3. techcrunch
  4. kevinrose
  5. leolaporte
  6. jason
  7. chrisbrogan
  8. google
  9. veronica
  10. timoreilly
  11. chrispirillo
  12. garyvee
  13. ev
  14. jowyang
  15. davewiner
  16. wired
  17. arrington
  18. tweetdeck
  19. problogger
  20. briansolis
  21. therealdvorak
  22. rww
  23. joelcomm
  24. engadget
  25. patricknorton
For Shaquille O'Neal (THE_REAL_SHAQ) we get:
  1. aplusk
  2. lancearmstrong
  3. oprah
  4. dwighthoward
  5. taylorswift13
  6. jimmyfallon
  7. ogochocinco
  8. iamdiddy
  9. theellenshow
  10. terrellowens
  11. ryanseacrest
  12. johncmayer
  13. reallamarodom
  14. mrskutcher
  15. reggie_bush
  16. paulpierce34
  17. britneyspears
  18. the_real_nash
  19. serenajwilliams
  20. chrisbosh
  21. mariahcarey
  22. barackobama
  23. nba
  24. qbkilla
  25. tonyhawk
For John Mayer (johncmayer) we get:
  1. taylorswift13
  2. katyperry
  3. aplusk
  4. ladygaga
  5. britneyspears
  6. jtimberlake
  7. oprah
  8. mrskutcher
  9. theellenshow
  10. pink
  11. jason_mraz
  12. mariahcarey
  13. coldplay
  14. perezhilton
  15. nicolerichie
  16. ryanseacrest
  17. ashleytisdale
  18. therealjordin
  19. johnlegend
  20. markhoppus
  21. jessicasimpson
  22. iamdiddy
  23. jimmyfallon
  24. kimkardashian
  25. ashsimpsonwentz
And for Alex Payne (al3x), an engineer at Twitter:
  1. ev
  2. jack
  3. dhh
  4. rsarver
  5. jeresig
  6. scobleizer
  7. codinghorror
  8. biz
  9. thomasfuchs
  10. ginatrapani
  11. loic
  12. rasmus
  13. blaine
  14. dalmaer
  15. mashable
  16. veronica
  17. timoreilly
  18. dougw
  19. ijustine
  20. kevinrose
  21. photomatt
  22. leahculver
  23. kevinmarks
  24. shanselman
  25. jasonfried

Again, it seems to give good results: Scoble is grouped with other influential people in the field of technology; Shaq with a mixture of athletes and other celebrities; John Mayer with musicians and celebrities; and Alex with a mixture of developers, other twitter employees, and people influential in technology.

Twitter ListMates

By: Jeff Clark    Date: Thu, 12 Nov 2009

In the recent post called Twitter List Profile Clouds I explored how the Twitter list names to which a person has been added can reveal how they are perceived across the twittersphere. Another interesting idea is that when somebody adds an account to a list they are implicitly defining a relation between that account and every other account on the same list. They are essentially making a declaration that all the members of the list share some characteristic. The name of the list usually offers a clue about how all the list members are related.

So, for example, the fact that datavis and flowingdata both appear on a list together means that somebody thinks they are similar in some sense. And if the list is called 'datavisualization' then that reveals how the list creator thinks they are similar.

I think of two accounts that appear on a list together as 'listmates'. It seems a reasonable name for the concept and follows the pattern of schoolmates, roommates, teammates etc. If you take all the Twitter Lists that an account is listed on and find all the members of those lists you can define a set of users related to the starting account. Keep track of how many times they appear in total and you also get a numeric score for how similar they are.
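The counting described above is simple enough to sketch directly. The input here is assumed to be a collection of member sets, one per Twitter list, already retrieved by some other means; the function name is hypothetical.

```python
from collections import Counter

def listmates(memberships, account, top=25):
    """Score accounts by how many lists they share with `account`.

    memberships: iterable of sets of account names, one set per list
                 (hypothetical pre-fetched data).
    """
    scores = Counter()
    for members in memberships:
        if account in members:
            # Every other member of this list gains one co-occurrence.
            scores.update(m for m in members if m != account)
    return scores.most_common(top)
```

The counts double as a crude similarity score: the more lists two accounts share, the more people have independently grouped them together.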

I tried out the idea using my own account, JeffClark, as a starting point. Here are my top 25 Twitter Listmates:

  1. datavis
  2. flowingdata
  3. ben_fry
  4. infosthetics
  5. moritz_stefaner
  6. stamen
  7. colorfuldata
  8. infobeautiful
  9. pitchinteractiv
  10. reas
  11. visup
  12. krees
  13. blprnt
  14. mslima
  15. eagereyes
  16. nbrgraphs
  17. jcukier
  18. vizworld
  19. mcristia
  20. infojocks
  21. infochimps
  22. datamasher
  23. teamswivel
  24. sunlightlabs
  25. densitydesign

The list is a who's who of people I respect and admire in the field of data visualization and I'm very pleased that others have grouped us together. I believe this technique has promise for finding interesting new accounts to follow.

Two Sides of the Same Story

By: Jeff Clark    Date: Mon, 09 Nov 2009

Jer Thorp has been doing some amazing work over the last couple of years. He just wrote an excellent post called Two Sides of the Same Story: Laskas & Gladwell on CTE & the NFL where he introduces a small visualization tool to look at the similarities and differences between two articles published in October about head injuries and the NFL. The articles are Game Brain, by Jeanne Marie Laskas and Offensive Play, by Malcolm Gladwell. The image below shows an example of what his tool can do.

I have previously explored the idea of comparing and contrasting document pairs with my Document Contrast Diagrams. The diagram below was created from the same two articles that Jer used in his analysis. There are obviously a lot of differences between the two visualizations both in appearance and in the technical means of constructing the diagrams but the underlying organizational metaphor is the same:

  1. Size of words reflects frequency of use
  2. Horizontal position reflects which document uses the word the most
  3. Vertical position reflects where the words are used in the documents the most
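The three-part layout metaphor listed above can be expressed as a small sketch. This is not the code behind either tool; the tokenizer, the 0-to-1 coordinate conventions, and the function name are assumptions made to illustrate the mapping.

```python
def contrast_layout(doc_a, doc_b):
    """Map each word to size, x and y values for a two-document diagram.

    size: total number of uses across both documents
    x:    0.0 if only document A uses the word, 1.0 if only document B does
    y:    average relative position of the uses (0.0 = start, 1.0 = end)
    """
    uses = {}   # word -> list of (document index, relative position)
    for doc_index, text in enumerate((doc_a, doc_b)):
        words = text.lower().split()
        last = max(len(words) - 1, 1)
        for i, word in enumerate(words):
            uses.setdefault(word, []).append((doc_index, i / last))
    layout = {}
    for word, hits in uses.items():
        in_b = sum(doc_index for doc_index, _ in hits)
        layout[word] = {
            "size": len(hits),
            "x": in_b / len(hits),
            "y": sum(pos for _, pos in hits) / len(hits),
        }
    return layout
```

A real diagram would also need to normalize for document length and resolve overlaps between word labels, which this sketch leaves out.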

Jer's tool seems designed more for interactive exploration whereas mine is focused more on creating static diagrams that try to show more information all at once. Mine also tries to illustrate emotional tone (with the little coloured triangles), the overall document size difference, and the fraction of unique or shared vocabulary.


Just to be clear, I'm NOT suggesting Jer used my work as a starting point for his own - although I'd be flattered if he did! It's just a case of two people tackling the same problem and independently coming up with a fairly obvious approach to represent the information. Those of you who like my work should check out his blog blprnt.com. Jer has recently published the source code for a number of his projects and has plans to set free the code for this tool as well.

Twitter List Profile Clouds

By: Jeff Clark    Date: Sun, 08 Nov 2009

Twitter recently introduced the Twitter List feature which lets people define sets of user accounts that are related in some manner. The lists are given a name and can be followed by other people who are interested in seeing all the tweets from the accounts in the list. Popular twitter users such as Robert Scoble appear on thousands of lists - 3963 for Robert at this time. My twitter ID, JeffClark, appears on a more modest 40 lists for comparison.

The act of assigning someone to a list is a type of tagging operation and the name of the list gives a clue regarding how that person is regarded by others. I've used the new Twitter List API to get the names of all the lists that Scoble currently appears on. Some simple counting (using code of course) gives us a table showing the most common names for lists that he appears on. The first few entries are:

  • tech: 567
  • social-media: 127
  • technology: 116
  • socialmedia: 87
  • bloggers: 51
  • geeks: 43
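The "simple counting" step might look like the sketch below. It assumes the list identifiers have already been retrieved as strings in the @owner/list-name form; the retrieval itself and the function name are not from the original and are left as assumptions.

```python
from collections import Counter

def list_name_counts(list_paths):
    """Count how often each list name occurs, ignoring the owner.

    list_paths: strings such as "@scobleizer/web-innovators"
                (assumed to be fetched elsewhere).
    """
    names = (path.rsplit("/", 1)[-1].lower() for path in list_paths)
    return Counter(names)
```

The resulting name-to-count mapping is exactly the kind of weighted word list a word cloud generator needs as input.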

I have used these names and frequency counts to generate a Shaped Word Cloud that illustrates the various list names that list creators associate with Scoble.

Here is another Twitter List Profile Cloud below - this one is for Guy Kawasaki. It has many similarities to the one for Scoble but there are some names much more prominent for Guy: marketing, business, and entrepeneurs for example.

And here is a third. Can you guess who it's for?

The icons used for the word clouds are from here: man, woman.

More Abstract Images

By: Jeff Clark    Date: Sat, 07 Nov 2009

Here is another Delaunay Image, this one created from a well-known photograph by Steve McCurry for the National Geographic. The subject was Sharbat Gula and a retrospective on her life done by National Geographic can be found here.

Here are a couple more Voronoi designs based on the same image.

I created these images with custom software written in Processing that relies heavily on the Mesh library by Lee Byron. I also used the Mesh demo created by Marius Watz as a starting point for my code. Thanks!

Delaunay and Voronoi Mona Lisa

By: Jeff Clark    Date: Sat, 31 Oct 2009

One reason the images I referenced in my previous post caught my eye was that I've been playing around with a similar technique for a couple of months now. I dusted off the code and improved it to support Delaunay images as well as to do shading of the triangles or polygons.

Image 1 below shows a Delaunay image constructed from the Mona Lisa. The triangles in the first image are coloured evenly and the shade is the average colour of the three vertices. Image 2 is the same except I'm colouring the triangle pixels based on a function of how far they are from the various vertices and the colours at those vertices. It gives a much more realistic image.
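The two shading styles can be sketched in Python. The flat version follows the description directly. For the smooth version the sketch assumes barycentric interpolation, which is one common way to blend vertex colours by position; the exact function used for image 2 is not stated, so treat `barycentric_weights` and `shaded_colour` as illustrative names and an illustrative choice.

```python
def flat_colour(c0, c1, c2):
    """One shade for the whole triangle: the average of the three vertex colours."""
    return tuple((a + b + c) / 3.0 for a, b, c in zip(c0, c1, c2))

def barycentric_weights(p, v0, v1, v2):
    """Weights of point p relative to the triangle's vertices (they sum to 1)."""
    (px, py), (x0, y0), (x1, y1), (x2, y2) = p, v0, v1, v2
    det = (y1 - y2) * (x0 - x2) + (x2 - x1) * (y0 - y2)
    if det == 0:
        raise ValueError("degenerate triangle")
    w0 = ((y1 - y2) * (px - x2) + (x2 - x1) * (py - y2)) / det
    w1 = ((y2 - y0) * (px - x2) + (x0 - x2) * (py - y2)) / det
    return w0, w1, 1.0 - w0 - w1

def shaded_colour(p, verts, colours):
    """Assumed smooth shading: blend vertex colours using barycentric weights."""
    weights = barycentric_weights(p, *verts)
    return tuple(sum(w * c[k] for w, c in zip(weights, colours)) for k in range(3))

verts = ((0.0, 0.0), (1.0, 0.0), (0.0, 1.0))
colours = ((255, 0, 0), (0, 255, 0), (0, 0, 255))
print(flat_colour(*colours))
print(shaded_colour((0.25, 0.25), verts, colours))
```

With this choice a pixel at a vertex takes exactly that vertex's colour, and a pixel at the centroid takes the same colour as the flat version.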

I've removed the triangle edges in image 3 and image 4 is the original for reference. I like this technique because you can easily control where the resulting image is more detailed by just using more control points in that region or by shading the triangles differently.

There is a related type of diagram that is based on Voronoi cells. This next image is the Voronoi diagram using the same control points as above. The regions are polygons of arbitrary number of sides rather than triangles. The last image uses more control points to get more details from the underlying base image.

I created these images with custom software written in Processing that relies heavily on the Mesh library by Lee Byron. I also used the Mesh demo created by Marius Watz as a starting point for my code. Thanks!

Delaunay Images

By: Jeff Clark    Date: Fri, 30 Oct 2009

I really like these Delaunay Images created by Jonathan Puckey. The expressiveness derived from a few well-chosen triangles is quite impressive. The link above shows a few more as well as a video showing one being created.

Random Tiles

By: Jeff Clark    Date: Sun, 25 Oct 2009

I stumbled across this image by Hugo Dechesne and liked the sense of depth suggested by the stacked tiles. Click on his image to see a higher resolution version.

Monks mosaic

I've tried to recreate the technique and applied it to a more famous image. The second version below just uses smaller tiles. I'm pretty happy with how it came out for such a simple technique but I still prefer the shading in Hugo's images. I think he's using a more diffuse and rounded shadow.

Alphabeasties

By: Jeff Clark    Date: Sun, 25 Oct 2009

I love typographic designs. When I was doing my first work with Word Portraits a year ago it occurred to me that I could probably make a really cool children's ABC book where the representative images were constructed with words or letterforms. I thought it might be visually interesting and that potentially there might even be an educational benefit for the kids learning to read if the images helped them remember the beginning letter for the word. I haven't pursued the idea yet but I just stumbled across a beautiful example of the same idea. It's called alphabeasties: and other Amazing Types and was created by Werner Design Werks.

Here are a couple of images from the book:

Alphabeasties Cover
 
Monkey and Newt
 
E is for Elephant
 

I encountered this via grain edit.

Cameron/Brown Contrast Diagram

By: Jeff Clark    Date: Tue, 13 Oct 2009

Last week I produced several Document Contrast Diagrams comparing speeches by various political leaders in the UK. The diagrams were used in an article for The Times called How the party leaders' speeches compare. See the article for all three diagrams and a description of how to interpret the diagram. The one for David Cameron and Gordon Brown is shown below.

Thanks to Jonathan Richards and The Times for the opportunity to get some exposure for the technique.

Cameron/Brown Speech Contrast Diagram (click to see larger version)

Tundra Trek

By: Jeff Clark    Date: Thu, 08 Oct 2009

A couple of months ago I attended the grand opening of a new exhibit at the Toronto Zoo called Tundra Trek. While I was there I noticed they were promoting it with a cool composite design made from symbols of local landmarks. I couldn't find it online at the time but just looked again and found it at adsoftheworld.com. Design by Lowe Roche, Canada.

Thanks to Joe Sapiano, a long-time zoo volunteer (and my father-in-law), for the invitation to the event.

Composite Panda Design

By: Jeff Clark    Date: Thu, 08 Oct 2009

This is a composite design based on the famous logo for the World Wildlife Fund. Internal shapes are from the Animals font by Alan Carr.

Peace Dove

By: Jeff Clark    Date: Fri, 02 Oct 2009

Here is a typographical piece about peace. It's called 'Peace Dove' and uses the word 'peace' translated into 21 different languages - English, Hindi, Chinese, French, Russian, Dutch, Hebrew, German, Greek, Czech, Filipino, Arabic, Hungarian, Icelandic, Indonesian, Irish, Italian, Japanese, Korean, Portuguese, and Swahili. How many can you find? The dove image is from Wikimedia Commons and I used Google Translate to get the word in the different languages.

My original image had the characters shown in reverse order for both Arabic and Hebrew. The image shown below has been corrected. Thanks to Ori Folger for pointing out the problem.

Apple Fruit References in Tweets

By: Jeff Clark    Date: Fri, 02 Oct 2009

This is the fourth part in a series analyzing a year's worth of tweets containing the word 'apple'. The first three sections are:

  1. Apple Brand References in Tweets
  2. Company References in Tweets
  3. Simplistic Sentiment Mining from Tweets

This section looks at use of the word 'apple' as a fruit rather than as the company Apple.

The graphs below showing distribution over time have been normalized to remove the trend of increasing number of tweets over time. This helps show the underlying patterns related to the specific term of interest. Note that the month labels are positioned at the beginning of the month.

Here are a few observations:

  • 'pie' has a series of small peaks in the fall with a huge peak at the end of Nov for American Thanksgiving and strong peaks as well for Christmas and July 4th
  • 'cider' is used primarily in the fall with a huge drop-off right after Christmas/New Years
  • both 'juice' and 'sauce' are fairly consistent during the year although 'juice' seems to have an increase in references during the warmer months of Mar-Aug
  • 'crumble' has roughly 4 times as many references as 'cobbler'
  • 'picking' shows a very strong regular pattern of peaks localized to September and October. They appear to correspond to fall weekends where people talk about going apple picking.
  • There is a strong peak for apple + 'cookies' at the end of April and beginning of May that falls off fairly slowly. I'm not sure why. Google Trends shows a similar pattern although not as strong.

Additions to Portfolio

By: Jeff Clark    Date: Fri, 02 Oct 2009

I added five more links in my Portfolio section a couple of days ago. The link is (currently) found near the top right of all the pages on Neoformix. If you are looking for a post of mine based on the memory of an image it might prove to be a useful starting point since all the links have small thumbnail images associated with them.

In case you weren't aware, the Archive link brings you to a page showing all the posts on Neoformix. It does take a while to load and I will likely reorganize them by year before 2010 begins.

Obama UN Speech StreamGraph

By: Jeff Clark    Date: Thu, 24 Sep 2009

Here is a StreamGraph prepared from the text of Obama's speech to the UN. I've tried to show more words than in some of my other text-based StreamGraphs but I'm not sure it is successful. More words means more slices and less of a chance that you can follow an individual slice through the speech to see the rise and fall of its frequency.

Click image to see a high-resolution PDF version

Simplistic Sentiment Mining from Tweets

By: Jeff Clark    Date: Tue, 22 Sep 2009

This is the third part in a series analyzing aspects of a year's worth of tweets containing the word 'apple'. The first part of the series discussed Apple Brand References in Tweets and showed which Apple brands were referenced the most and their distribution over time. It also included word clouds showing the terms most often associated with each of the primary brands. One of these is shown below for 'ipod'.

It's interesting and gives some indication of the other topical words related to 'ipod' and their relative frequency. One thing it doesn't do is show what people feel about ipods. Do they love them? Hate them? Can we figure it out from all this data?

One simple method of approaching this problem is to see which emotion-laden adjectives or declarations occur together with the various brands in tweets. This is a crude form of sentiment mining that makes no attempt at detecting sarcasm or the even more common inversion due to modifiers like 'not'. The size limitations of tweets mean that they seldom express ideas in a subtle or linguistically complex fashion so it might be appropriate to use such a simplistic approach - especially when we are dealing with large volumes of tweets like we are here (570,464).
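A minimal Python sketch of this kind of co-occurrence counting follows. The brand subset, the `FEELING_WORDS` list, and the function name `feeling_counts` are illustrative assumptions, not the actual term list or code behind the charts.

```python
import re
from collections import Counter, defaultdict

BRANDS = {"iphone", "ipod", "mac", "itunes"}                  # subset, for illustration
FEELING_WORDS = {"love", "hate", "cool", "awesome", "sucks"}  # hypothetical term list

def feeling_counts(tweets):
    """Count how often each feeling word appears in the same tweet as each brand."""
    counts = defaultdict(Counter)
    for text in tweets:
        words = set(re.findall(r"[a-z0-9']+", text.lower()))
        for brand in BRANDS & words:
            for feeling in FEELING_WORDS & words:
                counts[brand][feeling] += 1
    return counts

tweets = [
    "I love my apple iPhone",
    "apple itunes sucks",
    "not cool, apple: I hate iTunes",
]
counts = feeling_counts(tweets)
print(dict(counts["itunes"]))
```

The third sample tweet is counted under 'cool' even though it says 'not cool', which is exactly the modifier inversion that this crude approach ignores.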

I have repeated the word association analysis done in Apple Brand References in Tweets but have restricted the words of interest to a small set of terms that are often used to express feelings. Have a look:


There appears to be considerable variation in the spectrum for the different brands. People seem to find 'iphone', 'ipod', 'nano', and 'shuffle' to be cool and interesting. They love the 'mac' and are much more negative towards 'itunes'. I suspect this technique might indeed be valuable.

Company References in Tweets

By: Jeff Clark    Date: Mon, 21 Sep 2009

This is the second installment in a series analyzing aspects of a year's worth of tweets containing the word 'apple'. The previous post showed which Apple brands were referenced the most and their distribution over time. This one focuses on the other companies mentioned in tweets containing 'apple'. The data is from Sep 1st, 2008 until Aug 31, 2009 and I collected a total of 2,852,320 tweets in this time frame and analyzed every fifth tweet emitted (570,464 of them) to get the results below.

Apart from Apple itself, the most frequently mentioned company in the data was Google followed closely by Microsoft. The spread over time is very spiky for all the companies but some exhibit very little attention apart from the spikes - Dell, Adobe, and Facebook for example. Verizon shows a significant drop off in attention over this time span and both AT&T and Palm have become discussed more often over time. As in my previous post these distribution graphs have been normalized using the number of tweets in each time period to remove the overall trend of increasing twitter use from the picture. They are also scaled independently in the vertical direction in order to show the most detail for each graph.

I've created accentuated word clouds that show the words used in conjunction with the various companies to give some idea of what was being discussed in relation to Apple and the respective company. In an accentuated word cloud the sizes of the words are a function of both the frequency with which they occur and their prominence relative to a baseline text.


Apple Brand References in Tweets

By: Jeff Clark    Date: Fri, 18 Sep 2009

Which Apple brands are most discussed on Twitter? I have analyzed a large set of tweets that contain the word 'apple' sent out over the course of an entire year - from Sep 1st, 2008 until Aug 31, 2009. I collected a total of 2,852,320 tweets in this time frame and analyzed every fifth tweet emitted (570,464 of them) to get the results below.

The following graph shows the distribution over time of the number of tweets containing the word 'apple'. There is an obvious overall rising trend as we would expect since the use of twitter has grown greatly over the course of that year. There are also many large peaks throughout the year and at least one large trough in late Aug 2009 that was likely due to a failure in my data collection infrastructure. There also appears to be a relative slowdown in activity in July and August but examining data from multiple years might be necessary to confirm this.

The most frequently mentioned Apple brand in the data was 'iphone' by far, with 97,166 references in the 570,464 tweets (17%). The bar graphs below show the total number of references for some other Apple brands and how they compare to 'iphone'. Also shown is the distribution of the brand usage over time. These distribution graphs have been normalized using the number of tweets in each time period to remove the overall trend from the picture. They are also scaled independently in the vertical direction in order to show the most detail for each graph.

Note that these results are for tweets containing 'apple' and the brand in question. There are obviously a lot of tweets that mention these brands without explicitly referencing 'apple' but they are not a part of this analysis.

I lined up the initial graph showing the total number of tweets with the brand distribution graphs and you can see that several of the peaks in number of tweets in March correspond to big spikes in references to 'iphone', 'ipod', and 'mac'. The brands 'safari', 'shuffle', 'ilife', and 'iwork' have surprisingly few references apart from the big spikes - people just aren't tweeting about them. All the top 6 brands, together with 'nano', and 'itouch', seem to have more consistent chatter about them in the twittersphere. The term 'leopard' (as in Snow Leopard) is obviously of more recent interest.

These graphs above give a great idea of how often the various brands were mentioned and the distribution over time. What are people actually saying about these brands? I've created accentuated word clouds that show the words used in conjunction with the various brands.

In an accentuated word cloud the sizes of the words are a function of both the frequency with which they occur and their prominence relative to a baseline text. For example, the word 'new' may be used quite frequently in tweets about 'iphone' but if it is used proportionally less often than in other tweets it will be made smaller. Similarly, a word like '3gs' may appear much more frequently together with 'iphone' than in other tweets and so its size is increased.
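Here is one way such a weighting could be sketched in Python. The formula below (raw count multiplied by the ratio of the word's rate in the target text to its smoothed rate in the baseline) is an assumption chosen for illustration; the post does not spell out the actual function, and `accentuated_weights` is a made-up name.

```python
from collections import Counter

def accentuated_weights(target_words, baseline_words, smoothing=1.0):
    """Weight each target word by its count and its prominence versus a baseline."""
    target = Counter(target_words)
    baseline = Counter(baseline_words)
    t_total = sum(target.values())
    b_total = sum(baseline.values())
    vocab = len(set(target) | set(baseline))
    weights = {}
    for word, count in target.items():
        t_rate = count / t_total
        # Additive smoothing keeps words absent from the baseline from dividing by zero.
        b_rate = (baseline[word] + smoothing) / (b_total + smoothing * vocab)
        weights[word] = count * (t_rate / b_rate)
    return weights

# Made-up counts: both words are equally frequent alongside 'iphone',
# but 'new' is common everywhere while '3gs' is rare in the baseline.
iphone_words = ["new"] * 4 + ["3gs"] * 4
baseline_words = ["new"] * 8 + ["3gs"] * 1 + ["other"] * 11
weights = accentuated_weights(iphone_words, baseline_words)
print(sorted(weights, key=weights.get, reverse=True))
```

With these inputs '3gs' ends up larger than 'new', matching the behaviour described above.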


Obama School Remarks

By: Jeff Clark    Date: Tue, 08 Sep 2009

Today, in Arlington, Virginia, Obama delivered some Back to School remarks to the students of America. Here are a few choice snippets:

Where you are right now doesn’t have to determine where you’ll end up. No one’s written your destiny for you. Here in America, you write your own destiny. You make your own future.
...
No one’s born being good at things, you become good at things through hard work.
...
So today, I want to ask you, what’s your contribution going to be? What problems are you going to solve? What discoveries will you make? What will a president who comes here in twenty or fifty or one hundred years say about what all of you did for this country?

I have constructed the Shaped Word Cloud shown below from the complete text. The red apple image template came from Wikimedia Commons.

True Blood Twitter Spam

By: Jeff Clark    Date: Mon, 31 Aug 2009

One of the trending phrases on twitter lately has been 'True Blood' due to the popularity of the True Blood TV series. I've noticed lately that most trending terms in twitter have quite a large number of spam tweets and this is no exception. I've used Twitter Venn to try and get a feel for what proportion of the tweets on this topic are spam. A quick glance at the search results showed large numbers of spam tweets mentioning free grocery money or gift certificates so I did a twitter Venn of 'True Blood' versus 'grocery'.

Based on the tweets at this time there are 8597 tweets/day for 'True Blood' that don't mention 'grocery' and 3781 that mention both. This gives us a spam proportion of approximately 3781 / (3781 + 8597) = 31% without even including spam that doesn't mention grocery. If you look at the red word cloud for 'True Blood' without 'grocery' you can see that there are several other spammy words that are fairly prominent - 'won', 'free', 'cash', 'gift', 'cards'. This suggests that the amount of spam for this topic is even higher.

These numbers do change quickly because they are based on the latest tweets only. To do an accurate analysis would require looking at more data over a greater period of time.

Tweet Words By Week Day

By: Jeff Clark    Date: Fri, 28 Aug 2009

I have been having fun recently exploring how the use of words in tweets varies over the time of day ( #1, #2, #3, #4, and #5 ). A minor change in the code I use for the analysis of the text in the tweets lets me look instead at how use of words varies over the course of a week. The dataset contains over a million tweets sent from Toronto during June and July, 2009 so we have roughly 8 weeks of data. I've binned the data into 2 hour segments by day of the week.
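The binning can be sketched in Python as follows. It assumes each tweet comes as a (local timestamp, text) pair; 7 days times 12 two-hour segments gives 84 bins. The function names and the whitespace tokenization are illustrative assumptions.

```python
from collections import Counter
from datetime import datetime

def week_bin(ts):
    """Map a local datetime to one of 84 bins: 7 days x 12 two-hour segments."""
    return ts.weekday() * 12 + ts.hour // 2   # Monday is weekday 0

def word_week_series(tweets, word):
    """Count tweets containing `word` in each day-of-week bin."""
    counts = Counter(
        week_bin(ts) for ts, text in tweets if word in text.lower().split()
    )
    return [counts[b] for b in range(84)]

tweets = [
    (datetime(2009, 6, 26, 17, 0), "TGIF everyone"),        # a Friday
    (datetime(2009, 6, 21, 10, 30), "brunch with father"),  # a Sunday
]
series = word_week_series(tweets, "tgif")
print(series.index(1))
```

Because the bins ignore the calendar date, counts from all eight weeks pile up in the same 84 slots.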

As in the charts below, many of the time series show obvious daily patterns with no apparent variation across the different days. Note that the day of week labels are positioned at noon of the respective day.

Other words show strong peaks for certain days of the week. The terms 'tgif' (Thank God It's Friday), '#followfriday', and 'mondays' appear in the expected locations. Why is 'father' localized to Sunday? And 'michael' on Thursday?

Let's check out the terms that have similar shaped curves to these words. For 'father' we get:

From these temporally related terms I suspect the tight association between 'father' and Sunday is because of Father's Day, which fell on Sunday, June 21st this year and is within the range of data used for this analysis.

Similarly for 'michael' we get the graphs below and it's easy to see that Michael Jackson died on a Thursday.

Here are a few terms that seem relatively high on weekends:

Overall the technique seems to work well for analyzing day of week patterns. As is often the case, much of what gets revealed seems obvious in retrospect. I suspect, however, that this type of analysis could discover non-obvious patterns as well.

Normalized Word Time Series

By: Jeff Clark    Date: Fri, 28 Aug 2009

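Here is a fifth post in a series looking at word usage by time of day in tweets. The first four posts are useful background material if you haven't read them yet:

  1. Time Series for Word Counts in Tweets
  2. Temporal Correlation for Words in Tweets
  3. Some Word Usage Time Series
  4. Time of Day Word Correlations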

If you look at the time series for the top ten words you will notice that many of them have a very similar shape. The words 'lol', 'new', 'time', 'love', 'know', 'great', and 'twitter' all seem to peak around 1-2am, drop off to a lowest point between 3-5am, and gradually rise during the day. Why should there be a relationship between the curves for these words? Do lots of people write tweets that use these words together? Or is there some special temporal relationship between these words?

The answer is much simpler. One of my readers, Kyle McDonald, posed an interesting question: is tweet density (tweets over time) relatively constant throughout a day? The data I'm using here all comes from Toronto. It's a single location and is therefore from a single time zone which is important when looking into the time of day that the words were used. If we look at the curve for number of tweets over time of day for this data we get this:

So, no, the tweet density is not relatively constant throughout the day for a specific location. This curve is very similar to the common shape we see for the set of words listed above. The counts for these words are basically just tracking the number of tweets. Or, in Kyle's terms, the word count density over time is just tracking the tweet density. So the interesting features in the curve for the word 'love' seem to arise because more tweets are getting sent out during those times of day and are not due to any special temporal property of the word itself.

Kyle goes on to suggest that it would be really helpful to see the same plots normalized by tweet density. Here are the normalized curves for the same set as above:

Many of these normalized plots are basically flat except for noise. Those for 'new', 'time', 'know', and 'twitter' seem to show no special relationship with time that isn't accounted for by the simple fact that more tweets are occurring in total during certain periods. Several of the other words still show strong peaks, 'lol', 'day', and 'today' for example. The series for 'toronto' now has a jagged set of peaks evident just before 6am which were not apparent in the raw time series shown in blue. This technique does indeed appear to be useful in highlighting those words that are used preferentially during certain times of day.
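Normalizing by tweet density amounts to dividing each word count by the total number of tweets in the same time bucket. A minimal Python sketch, with a made-up function name and toy numbers:

```python
def normalize_by_tweet_density(word_counts, tweet_counts):
    """Divide each bucket's word count by the number of tweets in that bucket."""
    if len(word_counts) != len(tweet_counts):
        raise ValueError("series must have the same number of buckets")
    # Buckets with no tweets at all get 0.0 rather than a division error.
    return [w / t if t else 0.0 for w, t in zip(word_counts, tweet_counts)]

# Toy data: the word count triples over the day, but so does the tweet volume.
word_counts = [10, 20, 30]
tweet_counts = [100, 200, 300]
print(normalize_by_tweet_density(word_counts, tweet_counts))
```

A word whose raw curve merely tracks tweet volume comes out flat after this step, while a word that is used preferentially at certain times keeps its peaks.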

Time of Day Word Correlations

By: Jeff Clark    Date: Fri, 21 Aug 2009

This is another post in a set looking at word usage by time of day in tweets. This time the data includes all the tweets sent from Toronto in June and July of 2009. The post Temporal Correlation for Words in Tweets probably has the most relevant background.

Each of these sets below consists of 5 line graphs showing usage of the word in tweets with the time of day along the horizontal axis. The first series, in black, is the word of interest. The next 2, in blue, are highly correlated with the focus word - the words tend to be used during the same times of day as the word of interest. The last 2 words, in red, have a negative correlation.


Note that these aren't necessarily the words with the strongest correlation. From the stronger matches I've selected the ones that seem most insightful. Many of the strongest positive correlations arise because the words are often used together within the same tweets. For example there are quite a few tweets that talk about eating 'pancakes' or 'eggs' at 'brunch' so it isn't especially surprising that their time of day profiles are similar. The combination 'yoga' and 'pets' seems a bit more surprising. I've checked in the actual tweets and can't find any that contain both words at once.

The negative correlation between 'yoga' and 'guns' isn't very strong but I find it kind of amusing. The strong correlation between 'drunk' and 'ill' and the negative match with 'gym' seems appropriate.

Shaped Word Search: Perfumes

By: Jeff Clark    Date: Thu, 20 Aug 2009

A mysterious person calling herself the perfumeladi contacted me a few weeks ago and asked nicely for a Shaped Word Search puzzle for perfumes. Here it is:

Click on the image below to get a high-quality PDF version to print:

The bottle is for a Vera Wang product and the names are a subset of those found in Haute-Couture Brands on osMoz.com.

Some Word Usage Time Series

By: Jeff Clark    Date: Thu, 20 Aug 2009

I'm continuing my exploration of how frequently words are used in tweets during the various times of day. If you haven't seen them already, you might want to read Time Series for Word Counts in Tweets and Temporal Correlation for Words in Tweets for background information and details about the dataset.

Here are some word graphs for a few different beverages. 'Coffee' shows the strongest time dependence and is of course at its peak during the morning hours. Both 'beer' and 'wine' rise gradually from about noon until 2-3am. Both 'tea' and 'water' show pretty flat (but noisy) graphs.

Tweet Word Time of Day Traces: Beverages

Some more collections of graphs follow. You can spot the trends yourself so I won't describe them all. Note that many of these charts are quite noisy. They could obviously be improved by using more data although I am already analyzing half a million tweets to get these results. Using 30 minute time slices rather than the 15 minute slices I'm currently using would smooth out the graphs as well.
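Moving from 15 minute to 30 minute slices is just a matter of summing adjacent buckets, which turns a 96-value series into a 48-value one. A small Python sketch (the function name `rebin` is made up for illustration):

```python
def rebin(series, factor=2):
    """Sum each run of `factor` adjacent buckets into one coarser bucket."""
    if len(series) % factor:
        raise ValueError("series length must be a multiple of factor")
    return [sum(series[i:i + factor]) for i in range(0, len(series), factor)]

fifteen_minute = list(range(96))     # stand-in for a 96-bucket word series
thirty_minute = rebin(fifteen_minute)
print(len(thirty_minute), thirty_minute[:3])
```

The total count is unchanged; only the time resolution drops, which is the trade-off for smoother curves.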

Tweet Word Time of Day Traces: Foods
 
 
Tweet Word Time of Day Traces: Acronyms
 
 
Tweet Word Time of Day Traces: Feelings

The graph for 'happy' has some unusual peaks that look like they occur around 10am, 11am, noon, 1pm, and 2pm. I'm not sure what causes this regularity. These tweets are from Toronto during the month of July, which includes Canada Day on July 1st. Here are the graphs for the words highly correlated with 'happy':

Tweet Word Time of Day Traces: Highly Correlated with 'Happy'

Temporal Correlation for Words in Tweets

By: Jeff Clark    Date: Wed, 19 Aug 2009

In my last post, Time Series for Word Counts in Tweets, I showed some graphs illustrating how often a word was used in tweets during the various times of day. I'm using the same data here, 575,962 tweets sent from the Toronto area in the month of July 2009. Some of the graphs show very similar shapes, for example 'morning', 'breakfast', and 'coffee' in the set below.

We can spot these visually but if we are analyzing a large number of words, say 1000 or more, it would be useful to be able to calculate the similarity of the curves in order to find matches automatically. We want 'scale invariant' matches - curves with the same shape but not necessarily the same scale. Our curves are just plots of 96 numbers - since I'm summing the counts within 15 minute time buckets and 24 hours * 4 (buckets/hour) = 96 buckets. We can compare two curves by looking at the correlation between their time series values. If the curves go up and down in the same places then they are visually similar and the correlation gives us a way to quantify this.
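The comparison can be sketched in Python with a plain Pearson correlation, which ignores scale and so matches curves by shape. The helper `most_similar` and the tiny 4-bucket series are made up for illustration; real series would have 96 values each.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    if n == 0 or n != len(ys):
        raise ValueError("series must be non-empty and the same length")
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0   # a perfectly flat series has no defined correlation
    return cov / math.sqrt(vx * vy)

def most_similar(target, series_by_word, top=5):
    """Rank the other words by correlation with the target word's series."""
    scores = [
        (pearson(series_by_word[target], s), w)
        for w, s in series_by_word.items() if w != target
    ]
    return sorted(scores, reverse=True)[:top]

series_by_word = {
    "morning": [1, 5, 9, 2],
    "coffee": [10, 50, 90, 20],   # same shape, ten times the scale
    "bored": [9, 5, 1, 8],        # roughly the opposite shape
}
print(most_similar("morning", series_by_word))
```

Sorting the same scores in ascending order would surface the most negatively correlated words instead.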

If we select a time series for a word of interest we can calculate the correlation between that series and each of the others in turn. Then we can show the graphs with the highest correlation to see those with the most similar profile over time of day. Here are the top matches for 'morning':

The correlation coefficient is shown to the right of the graph. A value of '1' means perfect correlation, around '0' is no correlation, and a value of '-1' means an inverse or negative relationship. Below are shown some series that show no correlation at all with 'morning'. I was surprised to see that 'bed' isn't used in tweets around the same time of day as 'morning'.

Here are a few examples of negatively correlated words. The relationship isn't quite as strong as for the best positive matches: -0.55 compared to +0.90.

So the word with the strongest inverse relationship with 'morning' is 'bored'. Interesting - I guess people don't get bored in the morning as much as the rest of the day.

Time Series for Word Counts in Tweets

By: Jeff Clark    Date: Tue, 18 Aug 2009

I have been playing around with a fairly large collection of tweets looking into the patterns of word usage over the time of day. The dataset contains 575,962 tweets that were sent out from accounts located within 50 miles of Toronto during the month of July, 2009. For each of the most common 1000 words (except for stop words) I counted how often they were used in each 15 minute period of the day. The counts for all the days in July were simply added together so the shape of the series is for a typical July day. The following graph shows the time series plotted for the most common word - 'lol'.
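The counting described here can be sketched in Python. The sketch assumes tweets arrive as (local timestamp, text) pairs; the tiny stop word list, the whitespace tokenization, and the function names are illustrative only.

```python
from collections import defaultdict
from datetime import datetime

STOP_WORDS = {"the", "a", "to", "and", "is"}   # tiny illustrative list

def bucket_of(ts):
    """Index of the 15 minute bucket within the day, from 0 to 95."""
    return ts.hour * 4 + ts.minute // 15

def word_time_series(tweets):
    """Build a 96-bucket count series per word, summed over all days."""
    series = defaultdict(lambda: [0] * 96)
    for ts, text in tweets:
        bucket = bucket_of(ts)
        for word in text.lower().split():
            if word not in STOP_WORDS:
                series[word][bucket] += 1
    return series

tweets = [
    (datetime(2009, 7, 1, 8, 20), "coffee is the best"),
    (datetime(2009, 7, 2, 8, 25), "more coffee"),   # different day, same bucket
]
series = word_time_series(tweets)
print(series["coffee"][33])
```

Restricting the result to the 1000 most common words would be a separate step on top of this.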

Both the beginning and end of the horizontal axis represent midnight and noon is in the middle. This graph shows a peak around roughly 2-3am and a low point around 6am.

If we look at the traces for the #1, #10, and #100 most popular words and keep the vertical scale the same we don't have any detail in the smaller series (for 'girl').

If we scale each graph independently so that the fine details are present for each series then we can no longer tell when looking at a set of graphs which ones have the larger counts.

I've been experimenting with drawing both the absolute and independently scaled versions on the same graph so that both the detail and overall magnitude are evident.

It seems to work pretty well. I've used the darker line with the filled area underneath for the absolute scale to give it more prominence.

Here is a set of graphs for some obviously time-dependent terms:

These series seem more interesting than those with a more even distribution over time. Rather than visually scanning a large set of graphs to find these candidates I constructed a metric that measures the clumpiness of each series and used that to focus my search.
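The metric itself isn't specified here, so the following Python sketch is purely an assumption about what a clumpiness score could look like: one minus the normalized entropy of the series, which is near 0 for an even distribution and 1 when all the counts fall in a single bucket.

```python
import math

def clumpiness(series):
    """Hypothetical clumpiness score in [0, 1] based on normalized entropy."""
    total = sum(series)
    n = len(series)
    if total == 0 or n < 2:
        return 0.0
    entropy = -sum((c / total) * math.log(c / total) for c in series if c > 0)
    return 1.0 - entropy / math.log(n)

flat = [5] * 96             # evenly spread over the day
spike = [0] * 95 + [10]     # everything in one 15 minute bucket
print(round(clumpiness(flat), 3), clumpiness(spike))
```

Sorting words by a score like this, highest first, is one way to focus on the time-dependent candidates.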

There is an obvious similarity evident in the curves for 'morning', 'breakfast', and 'coffee'. In a future post I will describe a technique for detecting these matching curves automatically and show some results based on it.

Shaped Word Cloud: Apple

By: Jeff Clark    Date: Fri, 07 Aug 2009

I just recently finished gathering a complete year of tweets containing the word 'apple' - from Aug 7th, 2008 until Aug 6th, 2009. There were approximately 2.7 million public tweets over that year containing the word. I have used a sample consisting of every 10th tweet of the complete set to create a shaped word cloud showing the words most frequently used. This is a re-creation of a shaped word cloud visualization I did in January that only included tweets from Jan 20-21, 2009.

The dominant words don't seem too surprising. You can click on the words to jump to Twitter Search and see the matching tweets.

City Differences in Tweet Content

By: Jeff Clark    Date: Fri, 07 Aug 2009

In Word Clouds from Adjusted Counts I introduced the idea of accentuated word clouds and mentioned the possibility of breaking down a collection of tweets by geographic origin and contrasting the word counts to uncover geographic patterns. I've done something similar with a large collection of tweets sent from either Toronto, London, or San Francisco. They are actually a 1% sample of all the public tweets sent within 50 miles of the respective city centers during the month of July, 2009.

The three blocks of words reflect those words used frequently and proportionally more often in tweets being sent from the respective cities. Apart from the city names, some prominent words are:

  • San Francisco: hella, humidity, oakland, collision, winds, hotjobs, giants
  • London: like, good, new, news, morning, bbc, work, flu
  • Toronto: lol, good, like, love, canada, know, strike, pumper

The prominence of 'pumper' for Toronto puzzled me a bit so I looked into the data more closely. There is a series of twitter accounts similar to ToFireE that pump out alerts for every emergency fire unit dispatched in the city. They include reason for dispatch, location information, and also the vehicle which is often named pumper-nnn where nnn is some number.

Another interesting thing that you can pick out from the clouds is that San Francisco tweets contain a lot more hashtags than those from London or Toronto. Those that seem largest are: #science, #gaming, #loss, #prop8, #discount, #ffs, #weight, #wine, #sfgiants. It might be interesting to more carefully examine the proportion of tweets that contain hashtags and whether it is changing over time.

Shaped Word Search: Animals

By: Jeff Clark    Date: Fri, 17 Jul 2009

I have created another set of Shaped Word Search puzzles. This set of 26 puzzles is in black and white and will print nicely on a black and white printer. The theme is animals and the simple silhouette images are from the freeware font called 'Animals' by Alan Carr.

All 26 puzzles are found in a single PDF file. There are actually two versions: easy and hard. The hard versions use a smaller font size so there are more letters, add more partially matching distractors, and have more of the words in reverse order.

Feel free to print these off for your own personal use but don't post the PDF anywhere else or try and sell it. Have fun!

Shaped Word Search: Vehicles

By: Jeff Clark    Date: Fri, 17 Jul 2009

This new collection of Shaped Word Search puzzles is based on vehicle designs by cemagraphics. They all use the same transportation-related word list which I constructed with a little help from Google Sets.

Click on the images below or use these links to get high-quality PDF versions to print: VW bug, bus, truck, and ferrari. They look great when printed in colour but probably not so good in grayscale.

Have fun!


 

Shaped Word Search: Insects

By: Jeff Clark    Date: Thu, 16 Jul 2009

Here is a collection of four Shaped Word Search puzzles based on insect shapes. The insect images are from Iconshock and all the puzzles use the same word list derived from this list of insects.

Click on the images below or use these links to get high-quality PDF versions to print: ladybug, dragonfly, mantis, and ant. They look great when printed in colour but probably not so good in grayscale.

Have fun!


 

Differences in News Coverage

By: Jeff Clark    Date: Mon, 13 Jul 2009

I'm continuing to explore the idea of accentuated Word Clouds that I introduced in the previous post about New Testament Word Clouds. This time I compared the news coverage from four different sources about Obama's recent speech delivered in Ghana. The source texts are from the New York Times, Fox News, Al Jazeera, and AllAfrica.com.

The first word cloud was created from the text of all four articles put together and does a reasonable job of showing the key words for the event. The top words are 'Obama', 'Africa', 'Ghana', 'president', 'life', 'future', etc.

These four accentuated clouds below are created by comparing each source article in turn against the overall collection. They illustrate the words that are used frequently and proportionally more often in that particular text.



Here are a few prominent words that I notice from a quick glance at these clouds:

  1. NYT - 'take', 'kept', 'cairo', 'effort', 'muslims', 'bill', 'rich'
  2. Fox News - 'great', 'need', 'gym', 'hour', 'peace', 'blame', 'hotel', 'stem', 'cell', 'pope'
  3. Al Jazeera - 'set', 'based', 'oil', 'jazeera', 'london', 'investment', 'gold', 'cocoa'
  4. AllAfrica.com - 'civil', 'control', 'brother', 'returning', 'speaking', 'map', 'liberation'

New Testament Word Clouds

By: Jeff Clark    Date: Sun, 12 Jul 2009

The word cloud below was created from the text of the four gospels of the New Testament of the Christian Bible. I used the King James Version from the wonderful Project Gutenberg. The primary words of emphasis are not surprising - 'jesus', 'son', 'father', 'lord', and 'god'.

Lately I have been exploring the idea of using clouds built from relative word frequency counts to emphasize the differences between a text and some baseline text. I'm leaning toward calling these accentuated word clouds.

I have created four separate accentuated word clouds for each of the gospels and show them below. The baseline text was all four gospels together so each cloud shows which words are used frequently and proportionally more often in that text versus the overall collection. This kind of cloud illustrates the unique aspects of that particular text.



Let's look at a word that is very prominent in one of the clouds. In the gospel of John, the word 'jews' seems central but it either doesn't appear or is very small in the other three. The number of times it appears in the four gospels is 5, 6, 5, and 67 for Matthew, Mark, Luke, and John respectively. If you calculate the number of occurrences per 1000 lines to account for the different sizes of the various texts then you get 1.4, 2.6, 1.3, and 23.2 times/1000 lines.

These accentuated word clouds appear to be doing a good job of highlighting the terms that are characteristic of the various gospels. It is certainly possible to design a visualization that more directly shows the relative frequency of the key words in different texts but the visual simplicity of these accentuated word clouds has some advantages.
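
The per-1000-lines normalisation used for 'jews' above is simple arithmetic. Here is a minimal sketch; the function name is hypothetical and the numbers in the example are purely illustrative, not the actual line counts of the gospels.

```python
def per_thousand_lines(occurrences, total_lines):
    """Return the number of occurrences per 1000 lines of text."""
    if total_lines <= 0:
        raise ValueError("total_lines must be positive")
    return 1000.0 * occurrences / total_lines


# Illustrative numbers only; these are NOT the gospel line counts.
rate = per_thousand_lines(occurrences=5, total_lines=2000)
print(rate)
```

Dividing by the text length is what makes counts from a short text and a long text comparable.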

Michael Jackson Flower Portrait

By: Jeff Clark    Date: Wed, 08 Jul 2009

Here is a flower portrait of Michael Jackson created from one of the images on his album Number Ones. The flower images are from Wikimedia Commons.

Word Clouds from Adjusted Counts

By: Jeff Clark    Date: Tue, 07 Jul 2009

When trying to understand something it is often very useful to compare and contrast the data of interest with some related data. This can serve to emphasize the unique characteristics of the data you are studying. Another way of thinking about it is that you are filtering out the background noise in order to clarify the signal.

I mentioned in the recent post Shaped Word Cloud: Canada that I had adjusted the word counts according to how frequently they occurred in a baseline dataset. In this post I give a graphic example of the effects of this type of adjustment.

The data used is a collection of 16,504 tweets gathered during the month of June, 2009 and containing the word 'starbucks'. They are every 10th tweet of the full 165,040 that I collected during this time period. I also discarded the tweets that were obviously non-English. The words 'starbucks', 'coffee', and any twitter ID were not used in the analysis.

The following word cloud was constructed from the word frequencies found. It includes stop words and the cloud shows that 'in', 'to', 'at', 'is' and many other small words are frequently used in the text. The problem is that this is true for any sizable amount of English text and so this word cloud doesn't illustrate any real useful information specific to 'starbucks'. For this reason, stop words are almost always excluded from word clouds.

This next cloud was generated from the same data and the only change was that stop words were excluded. Now we can start to see some interesting emotion-laden words like 'love', 'good', 'work', 'like' as well as some that are obviously characteristic of the search term like 'hot', 'cup', 'mocha', 'frap', and 'drinking'.

To reveal more detail specific to 'starbucks' I have adjusted the word counts in this final cloud based on how frequently the words occurred in a baseline data set. The baseline I used here was a collection of tweets containing the word 'coffee' taken over the same time period as the original starbucks tweets. I won't describe the math in detail but, basically, I boosted the counts for words by a factor that is a function of the word frequency rate in the two data sets. If a word is used much more frequently in the starbucks data than the coffee data then its count is elevated so that it becomes more prominent in the cloud.

This word cloud is much more revealing of those things discussed in tweets together with 'starbucks'. Some of the large terms include, '#starbucks', several variations on 'frap', 'ruling', 'fructose', 'lemonade', 'venti', 'card', and 'sponsorship'.

By choosing different baseline datasets it is possible to accentuate different perspectives of the original data. For example, breaking down a collection of tweets by geographic origin and contrasting the data using this technique would let you uncover geographic patterns. What are people saying about Starbucks in San Francisco that is different from what they say in New York, or London? If you break up the tweet collection by time you can answer questions like: What are people saying about Starbucks at lunchtime versus in the morning? Or, What are they saying on Tuesdays versus Saturdays?

I believe this technique may prove very useful in revealing information from large amounts of text.
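
The post does not give the exact math behind the adjustment, so the following is only a sketch of one plausible approach, not the method actually used. It assumes the boost factor is the ratio of smoothed relative frequencies in the target and baseline word lists; the function name, the smoothing constant, and the toy word lists are all hypothetical.

```python
from collections import Counter


def adjusted_counts(target_words, baseline_words, smoothing=1.0):
    """Scale each target count by how much more often the word appears
    in the target text than in the baseline text (an assumed formula)."""
    target = Counter(target_words)
    baseline = Counter(baseline_words)
    target_total = sum(target.values())
    baseline_total = sum(baseline.values())
    vocab_size = len(set(target) | set(baseline))

    adjusted = {}
    for word, count in target.items():
        # Add-one style smoothing so words missing from the baseline
        # do not cause a division by zero.
        target_rate = (count + smoothing) / (target_total + smoothing * vocab_size)
        baseline_rate = (baseline[word] + smoothing) / (baseline_total + smoothing * vocab_size)
        adjusted[word] = count * (target_rate / baseline_rate)
    return adjusted


# Toy data standing in for the 'starbucks' and 'coffee' tweet collections.
starbucks = "venti frap love love cup venti card".split()
coffee = "love love love cup cup morning hot".split()
scores = adjusted_counts(starbucks, coffee)
```

With this toy data 'venti' ends up with a larger adjusted count than 'love', even though both occur twice in the target list, because 'love' is also common in the baseline.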

Declaration of Independence

By: Jeff Clark    Date: Sat, 04 Jul 2009

The blog Computational Legal Studies has a word cloud using the text of the Declaration of Independence created with Wordle. I liked the idea and so to help all my American readers celebrate the 4th of July I've created a word cloud using the same text in the shape of the US map. I added some stars to fill out the shape better. The word colors are random.

Click on the image for a larger view.

Shaped Word Cloud: Canada

By: Jeff Clark    Date: Wed, 01 Jul 2009

Happy Canada Day! This is a Shaped Word Cloud created from the text of approximately 168,000 tweets containing the word 'canada'. The tweets were gathered over an 11 month period from July 31, 2008 to June 30, 2009.

Basically, the larger the word the more frequently it appears in the text. Stop words were discarded. I also adjusted the size based on the relative frequency of the word in the 'canada' dataset versus a baseline dataset containing tweets about India and China. A word like 'country' or 'travel' is used about as often for Canada as for India and China and so will be de-emphasized. Words like 'hockey', 'canadian', 'snow' and place names within Canada will appear bigger. Because of the baseline content the result will not properly reflect any strong associations between Canada and India or Canada and China. As usual you can click on a word to see the current twitter search results.

Word Search: Canada Map

By: Jeff Clark    Date: Tue, 30 Jun 2009

Here is another Shaped Word Search in honour of Canada Day tomorrow, July 1st. This one is in the shape of a map of Canada and uses Provinces, Territories, and cities in the word list. Click on the image or here for the PDF version.

Feel free to print this in any newspaper or magazine. I only ask that you keep the reference to http://neoformix.com and that you send me an email letting me know.

Click on the image to download a hi-res PDF version suitable for printing

Word Search: Maple Leaf

By: Jeff Clark    Date: Tue, 30 Jun 2009

In honour of Canada Day tomorrow, July 1st, I have created a Shaped Word Search with a maple leaf design and words I associate with Canada. I improved my tool slightly to sort the words in alphabetical order so it is more convenient to look them up. Thanks to Joe S. for the suggestion. Click on the image or here for the PDF version.

Feel free to print this in any newspaper or magazine. I only ask that you keep the reference to http://neoformix.com and that you send me an email letting me know.

Click on the image to download a hi-res PDF version suitable for printing

Word Portrait: Michael Jackson

By: Jeff Clark    Date: Sat, 27 Jun 2009

Here is a Word Portrait of Michael Jackson created from the titles of many of his top songs.

Click on the image to see a larger version

Twitter Venn: Celebrity Deaths

By: Jeff Clark    Date: Fri, 26 Jun 2009

Here is a Venn Diagram made with Twitter Venn that shows the relative frequency of tweets made about the recent deaths of three celebrities - Michael Jackson, Farrah Fawcett, and Ed McMahon. This analysis was done around 7am EST today and the absolute numbers for tweets/day will certainly increase as more people in the US come online. I expect the proportions among the various combination regions to stay roughly the same.

A couple of points of interest:

  • Celebrity interest ranked by number of tweets is Michael > Farrah > Ed with ratios 62:6:1
  • Ed was mentioned together with both Michael and Farrah more often than he was by himself

To explore the data using the interactive application click on the image below or this link: Twitter Venn for #michaeljackson, #farrahfawcett, and #edmcmahon.

Twitter Employee Clusters

By: Jeff Clark    Date: Thu, 25 Jun 2009

Here is a different view of the relationships between the Twitter employee accounts first presented in this post. I measured the similarity between all the twitter employee accounts based on the overlap in vocabulary used in their last 200 tweets. A clustering algorithm was then used to group them together based on the pairwise similarity scores. The algorithm was tuned to limit clusters to have a maximum of 8 members.
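
The exact similarity measure is not spelled out here, so this is only a sketch of one common way to score vocabulary overlap: the Jaccard index, i.e. shared words divided by all distinct words used by either account. The function names and the tokenizing rule are assumptions.

```python
import re


def vocabulary(tweets):
    """Collect the set of lower-cased words used across a list of tweets."""
    words = set()
    for tweet in tweets:
        words.update(re.findall(r"[a-z']+", tweet.lower()))
    return words


def vocabulary_similarity(tweets_a, tweets_b):
    """Jaccard overlap between two accounts' vocabularies (0.0 to 1.0)."""
    vocab_a, vocab_b = vocabulary(tweets_a), vocabulary(tweets_b)
    union = vocab_a | vocab_b
    if not union:
        return 0.0
    return len(vocab_a & vocab_b) / len(union)


similarity = vocabulary_similarity(["radio game tonight"], ["game studio radio"])
```

In practice stop words would probably be removed first so that common words like 'the' do not inflate every score.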

The image below was created from the cluster members data, the similarity between clusters, and the similarity within each cluster. To minimize line clutter I am only drawing a connection if it is one of the top 2 strongest for either end node. The clustering and layout code is based on what I used for the Toronto Twitter Community project but has been recently enhanced to support some new client work.
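
The line-clutter rule described above (keep a connection only if it ranks in the top 2 for either end node) can be sketched as follows. The data layout, a dict from node pairs to scores, and the function name are assumptions made for illustration.

```python
def strongest_edges(similarity, top_n=2):
    """Keep an edge if it is among the top_n strongest edges of either endpoint.

    similarity maps (node_a, node_b) tuples to a similarity score.
    """
    by_node = {}
    for (a, b), score in similarity.items():
        by_node.setdefault(a, []).append((score, (a, b)))
        by_node.setdefault(b, []).append((score, (a, b)))

    kept = set()
    for edges in by_node.values():
        # Strongest connections first for this node.
        edges.sort(key=lambda item: item[0], reverse=True)
        kept.update(edge for _, edge in edges[:top_n])
    return kept


scores = {
    ("a", "b"): 0.9, ("a", "c"): 0.8, ("a", "d"): 0.1,
    ("b", "c"): 0.7, ("b", "d"): 0.6, ("c", "d"): 0.5,
}
kept = strongest_edges(scores)
```

In this toy graph only the weak a-d link is dropped, since it is not in the top 2 for either a or d.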

Here is the PDF version of the Twitter Employee Clusters.

Shaped Word Search - Twitter

By: Jeff Clark    Date: Mon, 22 Jun 2009

Here is another example of a Shaped Word Search. This one uses a Twitter Bird as the image and a list of words related to twitter. I also experimented a bit with adding distractors in order to make the puzzle more difficult. There are a couple of partial matches for each word mixed in to the letter matrix. Click on the image or here for the PDF version.

Click on the image to download a hi-res PDF version suitable for printing

A Shaped Word Search - Malta

By: Jeff Clark    Date: Sun, 21 Jun 2009

I celebrated Father's Day this weekend with my wife's parents. While there, I spent a frustrating and unsuccessful 15 minutes looking for one of the few remaining words in a giant word search my father-in-law was working on. We found out later by checking online that there was an error and the word wasn't even present in the puzzle!

Much more enjoyable was the hour or so we spent doing a virtual tour of Malta using Google Earth. My father-in-law was born there and we had great fun zooming in with the aerial views finding the house where he lived, the church where he was baptized, etc. We were also able to easily see wonderful pictures of the many famous churches and natural features like the Blue Grotto. It's a beautiful and fascinating place and I'd love to visit sometime.

Well, the ideas of Malta, word search puzzles, and the usual mishmash from my coding projects mixed together in my brain while I was sleeping and I woke up early realizing I could easily write a tool to create 'Shaped Word Search Puzzles'. Basically, I can take a template image and a list of words and automatically construct a word search puzzle shaped and coloured to match the image.

The first example is below and uses a Maltese Cross with a list of words related to Malta. Most of the words are place names but there are a few other things mixed in as well. For example, Pastizzi are one of my favourite Maltese foods.

Click on the image to download a hi-res PDF version suitable for printing

IranElection Tweets Phrase Net

By: Jeff Clark    Date: Sat, 20 Jun 2009

I have uploaded the set of tweets I used to create the Iran Election Word Cloud to the wonderful Many Eyes and created a Phrase Net visualization for the data. This image below shows the net for the pattern word1 and word2. So, for example, the arrow connecting 'police' to 'riot' means there were lots of instances of the phrase 'police and riot'.

Static image of the phrase net for #IranElection Tweet Data (see below for interactive version)

See below for the interactive version.



Iran Election Tweet Narrative II

By: Jeff Clark    Date: Sat, 20 Jun 2009

I have updated my Tweet Narrative about the Iran election. This one uses 141,000 tweets from the time period June 14-20th, 2009. I have also improved the algorithm that selects the characteristic tweets. The changes are difficult to describe succinctly but did reduce the number of tweets that started with 'RT'. This helps meet my primary goal of constructing a readable summary of the content. For this analysis I also only counted the first 10 tweets from any particular account which helps prevent the tweets from a few individual accounts from dominating the results.

Date | Characteristic Tweet
Jun 14, 20:12 gmt | WTF! They're bringing tanks on the streets in Tehran #iranelection
Jun 15, 00:51 gmt | @Change_for_Iran 5:17am people outside are burning Saderat bank building or as it seems from this far #iranelection
Jun 15, 07:13 gmt | @IranNewsNow HUGE NEWS!!!! CNN reports that GRAND AYATOLLA SANAI has issued FATWA to resist govt that steals #IranElection
Jun 15, 10:24 gmt | Iran supreme leader orders probe of vote fraud #iranelection
Jun 15, 18:43 gmt | BEST FILTER SHEKAN: www.julo.free4r.com/prox.html #IranElection
Jun 15, 21:26 gmt | Please postpone maintenance! #nomaintenance #iranelection
Jun 16, 01:52 gmt | Twitter Reschedules Maintenance Around #IranElection Controversy
Jun 16, 05:09 gmt | Iran has blocked "#iranelection" Use #Tehran or #Iranians
Jun 16, 11:18 gmt | #iranelection cyberwar guide for beginners
Jun 16, 16:32 gmt | unconfmd major incident at Azadi - shooting - fires - ppl running #Iranelection
Jun 16, 22:28 gmt | pls everyone change your location on tweeter to IRAN inc timezone GMT+3.30 hrs - #Iranelection - cont....
Jun 17, 03:44 gmt | NYT publishing sensitive names of Iranians on Twitter. Get them to stop! #NYTfail #iranelection
Jun 17, 05:52 gmt | BLOCK @serv_ SPREADING MISINFOMATIONS #iranelection
Jun 17, 09:29 gmt | Tehran march TODAY 5pm - 7Tir Sq - Meydan 7 Tir - silent - sea of green - #Iranelection
Jun 17, 15:17 gmt | Show support for #iranelection add green overlay to your Twitter avatar with 1-click - http://helpiranelection.com/
Jun 17, 18:56 gmt | news - Mousavi & Khatami have delivered joint letter to Ministry of Justice demanding release of protestors - #Iranelection
Jun 18, 02:15 gmt | "Change does not roll in on the wheels of inevitability, but comes through continuous struggle." -Dr.Martin Luther King #iranelection
Jun 18, 05:00 gmt | DOA Remix (Death of the Ayatollahs). Theme song for #IranElection www.myspace.com/revolutionofthemindhiphop
Jun 18, 11:00 gmt | Today - Sea of Green - Imam Khomeine Sq - 4pm - Tehran - All wear BLACK - we pray together - #Iranelection
Jun 18, 14:46 gmt | MOUSAVI - 25% inflation means IGNORANCE - THIEVING - CORRUPTION - where is the wealth of my nation? #Iranelection RT
Jun 18, 21:28 gmt | RT @andylevy BREAKING: Faulty #iranelection results attributed to Clerical errors.
Jun 18, 23:15 gmt | confirmed - Saeed Rajaie's (a prominent Iranian wartime martyr) wife has been arrested while praying in Qom - #Iranelection
Jun 19, 04:29 gmt | [Mashable] Facebook Releases Persian Translation for #IranElection Crisis http://tinyurl.com/kuzmc4
Jun 19, 09:31 gmt | #iranelection Khamenei: (summery) (( correction )) Crowed yell: Death to england
Jun 19, 13:21 gmt | situation in Iran is now CRITICAL - nation is heartbroken - suppression is iminent - #Iranelection
Jun 19, 21:06 gmt | Mousavi's offices are trashed, Mousavi's staff in police custody, Mousavi is missing. #iranelection #gr88 #clarification
Jun 19, 23:22 gmt | #IranElection Must watch video & read transcript at the same time. Chills Pls RT after you watch. http://bit.ly/10qe5H
Jun 20, 06:44 gmt | whenwill we all stand together ascitizens of thewrld and demandour elected officials tohelp? one day wecould be in that crowd #iranelection
Jun 20, 08:28 gmt | Google Earth to update satellite images of Tehran #Iranelection http://twitition.com/csfeo
Jun 20, 13:26 gmt | Unconfirmed: Bomb Blast in Khomeini's shrine #iranelection
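
The per-account cap mentioned in this post (only the first 10 tweets from any account are counted) is easy to sketch. The tuple layout and function name below are assumptions; the input is taken to be in chronological order.

```python
from collections import defaultdict


def cap_per_account(tweets, limit=10):
    """Keep only the first `limit` tweets from each account.

    tweets is an iterable of (account, text) pairs in chronological order.
    """
    seen = defaultdict(int)
    kept = []
    for account, text in tweets:
        if seen[account] < limit:
            kept.append((account, text))
            seen[account] += 1
    return kept


# Hypothetical data: one very active account and one quiet one.
sample = [("busy_account", f"tweet {i}") for i in range(15)]
sample.append(("quiet_account", "only tweet"))
capped = cap_per_account(sample)
```

A cap like this stops a handful of prolific accounts from dominating the word statistics that drive the selection.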

Iran Election Word Cloud

By: Jeff Clark    Date: Thu, 18 Jun 2009

This is a Shaped Word Cloud created from the text of approximately 84,000 tweets containing the term #iranelection. The larger the word the more frequently it appears in the text. As usual you can click on a word to see the current twitter search results.

Feel free to follow JeffClark on Twitter to get more updates on my work.

Iran Election Tweet Narrative

By: Jeff Clark    Date: Tue, 16 Jun 2009

The world is watching with great interest the demonstrations in Iran related to the recent election. The twittersphere is filled with discussion of the event and, of course, much of it is redundant. I have built a Tweet Narrative based on a collection of ~60,000 tweets containing the tag #IranElection. Basically, I divided the tweets into 30 groups based on the time they were published and then statistically selected the one tweet most representative for that time slot.

Date | Characteristic Tweet
Jun 14, 20:52 gmt | RT @StopAhmadi WTF! They're bringing tanks on the streets in Tehran #iranelection
Jun 14, 22:49 gmt | We people of iran want peace! #CNNfail #iranelection
Jun 15, 00:09 gmt | RT @persiankiwi students being killed in tehran uni dorm in amirabad right now. this must stop. #Iranelection
Jun 15, 00:50 gmt | Follow @Change_for_Iran 5:17am people outside are burning Saderat bank building or as it seems from this far #iranelection
Jun 15, 02:49 gmt | RT @parinaz AhmadiN revoked all permits of foreign media & has instructed them to stop reporting or they will face jail time. #IRANelection
Jun 15, 05:00 gmt | Will you wear green tomorrow to support freedom in Iran? #iranelection #greenscream
Jun 15, 05:38 gmt | RT @greenscreamiran: World to wear green tomorrow for freedom in Iran. RT please. #IranElection #greenscream
Jun 15, 07:11 gmt | RT @IranNewsNow: HUGE NEWS!!!! CNN reports that GRAND AYATOLLA SANAI has issued FATWA to resist govt that steals #IranElection RT THIS
Jun 15, 08:42 gmt | RT @persiankiwi March is NOT CANCELLED today. Mousavi is in danger of being killed. #Iranelection
Jun 15, 11:25 gmt | RT @persiankiwi: March Started: ADVICE - carry photos of imam khomeini. they cannot shoot at us with these. #Iranelection
Jun 15, 11:54 gmt | RT @persiankiwi for later we need proxy address to upload film. we have no upload possibility now, can anyone help? #Iranelection
Jun 15, 13:27 gmt | RT @persiankiwi: Valli Asr st closed to traffic - tens of thousands marching - unbelievable sight. #Iranelection
Jun 15, 15:54 gmt | RT @herrcafe RT @phelo Telegraph reports of Iranian Interior Ministry leak that Ahmedinajad came in thir #IranElection - http://bit.ly/GGUy2
Jun 15, 17:15 gmt | RT @persiankiwi: streets very dangerous now. groups of militia on motorbikes searching for protesters. #Iranelection
Jun 15, 18:36 gmt | RT @stephenfry Functioning Iran proxies 218.128.112.18:8080 218.206.94.132:808 218.253.65.99:808 219.50.16.70:8080 #IranElection
Jun 15, 20:00 gmt | RT @persiankiwi Gohardasht in Karaj - confirmed - people in street batles with militia - #Iranelection
Jun 15, 21:57 gmt | RT @IranRiggedElect: Please postpone Twitter maintenance #IranElection @twitter @ev @bs @ded @ej @lg @nk @rk @vl @al3x @stop #nomaintenance
Jun 15, 23:23 gmt | RT @nttajohn maintenance is postponed, twitter will be posting press release soon #nomaintenance #iranelection
Jun 16, 00:34 gmt | RT IRAN: we are moving location - seperating - situation in Tehran is tense - cant explain #Iranelection
Jun 16, 03:01 gmt | RT From Iran: CONF: #IRANELECTION tag/string is not filtered in #iran. Plz KEEP USING IT! #iran9
Jun 16, 03:59 gmt | People in Iran, use https://twitter.com/ instead of http://twitter.com/ to avoid hashtag filtering #Iran9 #IranElection #tehran #iranians
Jun 16, 06:48 gmt | RT from inside Iran: rumour spreading Tehran - Army Generals have met in secret - Army considering position #Iranelection #iran9
Jun 16, 07:16 gmt | RT @stephenfry @arashamel Pls get this out to your followers. #iranelection has been blocked in Iran. Switch to #Iranians , #Tehran, #Iran9
Jun 16, 08:29 gmt | RT @stephenfry RT: pls get this out to your followers. #iranelection has been blocked in Iran. Switch to #Iranians , #Tehran, and #Iran9 ...
Jun 16, 10:33 gmt | RT @persiankiwi only official march today is valli asr. others may be a trap - avoid others - #Iranelection #gr88
Jun 16, 12:51 gmt | #iranelection Iran has banned all foreign journalists from reporting on the sts.
Jun 16, 13:50 gmt | RT @twistedchick: RT URGENT: Army forces entering Tehran. Barricade streets where protests are on. Now. #iranelection #gr88
Jun 16, 15:05 gmt | RUMOUR: the former prince of #Iran, Reza Pahlavi has announced returning to #Tehran in 36h. #IranElection #GR88
Jun 16, 16:32 gmt | RT [redacted]: unconfmd major incident at Azadi - shooting - fires - ppl running #Iranelection #gr88
Jun 16, 19:38 gmt | RT @PCMag: The U.S. State Department asked Twitter to delay downtime to help with #IranElection.
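
The selection statistic is not described in this post, so the sketch below is only one plausible reading of "divide into time groups, then pick the most representative tweet". It assumes equal-width time slots, numeric timestamps, and a score equal to the average slot-wide frequency of a tweet's words; all of these, and the function name, are assumptions.

```python
from collections import Counter


def tweet_narrative(tweets, slots=30):
    """Pick one representative tweet per time slot.

    tweets is a list of (timestamp, text) pairs with numeric timestamps.
    """
    if not tweets:
        return []
    tweets = sorted(tweets)
    start, end = tweets[0][0], tweets[-1][0]
    width = (end - start) / slots or 1  # avoid zero width if all times match

    groups = [[] for _ in range(slots)]
    for timestamp, text in tweets:
        index = min(int((timestamp - start) / width), slots - 1)
        groups[index].append((timestamp, text))

    narrative = []
    for group in groups:
        if not group:
            continue
        # How many tweets in this slot contain each word.
        freq = Counter(w for _, text in group for w in set(text.lower().split()))

        def score(item):
            words = set(item[1].lower().split())
            return sum(freq[w] for w in words) / len(words) if words else 0

        narrative.append(max(group, key=score))
    return narrative


sample = [
    (0, "tanks on streets"),
    (1, "tanks on streets in tehran"),
    (2, "lunch was nice"),
    (100, "march started"),
]
story = tweet_narrative(sample, slots=2)
```

Averaging over the tweet's own words favours tweets made up almost entirely of words the rest of the slot is also using, which is one way to approximate "most representative".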

Twitter StreamGraph Update II

By: Jeff Clark    Date: Mon, 15 Jun 2009

I have posted a small update to the Twitter StreamGraphs application to make it more useful. Previously it used Twitter Search to get results for simple queries of the type 'from:twitterid'. Twitter Search currently only gives results going back about 14 days - it used to be much longer. For most people who don't tweet frequently this resulted in a poor quality streamgraph because there weren't many results to work with.

I'm now using the standard Twitter API to retrieve the tweets for any simple user query and it will graph up to a maximum of 1000 tweets regardless of how far back they go. The difference is shown below for Clay Shirky. The second image shows the new improved results which, for him, go back almost a year. The graph is much richer than the first one which can only base the graph on tweets in the last two weeks.

Previous results limited to approximately 14 days due to Twitter Search limitation
 
New results for simple queries of the type from:twitterid

Chinese Ideogram for Flower

By: Jeff Clark    Date: Sun, 14 Jun 2009

Here is another design made with the flower images from Wikimedia Commons. It's the Chinese ideogram for 'flower' rendered with flowers.

Others in this series: FlowerTank, FlowerCycle, and John Lennon Flower Portrait.

Venn: Iran, Iraq, Afghanistan

By: Jeff Clark    Date: Sun, 14 Jun 2009

Here is the result of a Twitter Venn query for Iran, Iraq, and Afghanistan. The recent controversial elections in Iran have obviously grabbed a lot of attention in the Twittersphere. It's interesting that the number of tweets mentioning both Iran and Iraq is roughly the same as the number mentioning both Afghanistan and Iraq even though tweets about Iran are so dominant.

Click on the image to see the current Twitter venn diagram for these three terms.

Celebrity Twitter Accounts

By: Jeff Clark    Date: Sun, 14 Jun 2009

I recently made some improvements in my graph display code for a client and have used it to create a new graph showing the vocabulary relationships between many celebrities on Twitter. The post More Twitter Account Graphs explains a little about what the similarity is based on.

The central people in this set appear to be RyanSeacrest, PaulaAbdul, and TheEllenShow. The similarity score between Ryan and Paula is 19.8% and the top words connecting them together are: 'radio', 'game', 'guys', 'adam', 'movie', 'coast', 'studio', and their respective Twitter IDs.

Another interesting grouping is BarackObama, schwarzenegger, and timoreilly. The similarity score between Obama and Schwarzenegger is 16.7% with the top connecting words being 'health', 'care', 'video', 'president', 'address', 'vote', and 'event'.

I included jtimberlake in the analysis as well but he was removed from the final graph because he wasn't connected strongly enough with anybody else. His closest match was only 4.5% and was with Oprah.

Beetles

By: Jeff Clark    Date: Thu, 11 Jun 2009

After my previous John Lennon Flower Portrait I had the Beatles on my brain and stumbled across a lovely set of photographs of beetles on COLOURlovers. I have tried creating an image of The Beatles using beetles but haven't yet come up with a decent design. Instead I made this beetle outline image from 24 different species. I have seen a lovely physical display of beetles arranged in this manner but I'm not sure where it was. It may have been at the Royal Ontario Museum.

Click image to see larger version

John Lennon Flower Portrait

By: Jeff Clark    Date: Tue, 09 Jun 2009

Here is a flower portrait of John Lennon created from the image on the page 100 Portraits of Iconic People of all time. The flower images are from Wikimedia Commons.

John Lennon Word Portrait

By: Jeff Clark    Date: Tue, 09 Jun 2009

It has been a while since I've created a Word Portrait. Here is one of John Lennon created from the image on the page 100 Portraits of Iconic People of all time.

Here are links to Word Portraits of Obama and Einstein.

Cairo Speech Word Graph

By: Jeff Clark    Date: Thu, 04 Jun 2009

Here is another way to look at Obama's speech in Cairo calling for A New Beginning with Muslims. It uses a standard node link graph to show which words were used near each other in the text. There are virtual springs connecting words that are used frequently together and forces pushing apart nodes so they don't overlap too much. The nodes in orange have been fixed to a certain location and the other nodes move based on the springs and forces until a stable configuration is reached. This allows us to stretch out the graph and easily see where terms lie along a spectrum between 2 or more words of interest.

This first view shows that there was more discussion of 'peace' than 'war' and that words like 'palestinian', 'israel', and 'god' were highly associated with 'peace' relative to the other highlighted words.

Click image to see a larger version

This second view below is of the same graph but with different words pegged in place. The terms 'nuclear' 'weapons' and 'united' 'states' are both closer to 'iran' than the other countries. Similarly, 'women' 'denied' 'equal' is more associated with 'afghanistan'.

Click image to see a larger version

An obvious way to improve these would be to use word stemming to combine different forms of the same word. For example, 'muslim' and 'muslims' would use one node, as would 'peaceful' and 'peace'. This would reduce the number of nodes and probably more clearly expose any relationships.
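
As a rough illustration of the stemming idea, here is a deliberately tiny suffix stripper that merges the two example pairs. It is a hypothetical toy, not a real stemmer; an established algorithm such as Porter's would handle far more cases and make fewer mistakes.

```python
from collections import Counter

# Hypothetical, deliberately tiny rule set for illustration only.
SUFFIXES = ("ful", "s")


def naive_stem(word):
    """Strip one known suffix, keeping at least three leading letters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def merged_counts(words):
    """Count words after collapsing different forms onto one stem."""
    return Counter(naive_stem(w) for w in words)


counts = merged_counts(["muslim", "muslims", "peace", "peaceful", "states"])
```

Each stem then becomes a single graph node, so 'muslim' and 'muslims' share one node and their connections are pooled.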

The code to construct these was written with Processing and makes use of the excellent Traer Physics library.
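
The original was built with Processing and Traer Physics; the block below is not that code, just a self-contained sketch of the same general idea (springs between related words, repulsion between all nodes, some nodes pinned in place). The function name, force constants, and step clamp are all assumptions chosen for the example.

```python
import math
import random


def spring_layout(nodes, springs, pinned, steps=500, rest_length=1.0,
                  spring_k=0.05, repel_k=0.1, max_step=0.1, seed=0):
    """Iteratively relax a 2D layout; pinned nodes never move.

    nodes is a list of names, springs a list of (a, b) pairs,
    pinned a dict from node name to a fixed (x, y) position.
    """
    rng = random.Random(seed)
    pos = {}
    for node in nodes:
        if node in pinned:
            pos[node] = list(pinned[node])
        else:
            pos[node] = [rng.uniform(-1, 1), rng.uniform(-1, 1)]

    for _ in range(steps):
        force = {node: [0.0, 0.0] for node in nodes}

        # Repulsion pushes every pair of nodes apart.
        for i, a in enumerate(nodes):
            for b in nodes[i + 1:]:
                dx = pos[a][0] - pos[b][0]
                dy = pos[a][1] - pos[b][1]
                dist = math.hypot(dx, dy) or 1e-6
                push = repel_k / (dist * dist)
                force[a][0] += push * dx / dist
                force[a][1] += push * dy / dist
                force[b][0] -= push * dx / dist
                force[b][1] -= push * dy / dist

        # Springs pull connected nodes toward rest_length apart.
        for a, b in springs:
            dx = pos[b][0] - pos[a][0]
            dy = pos[b][1] - pos[a][1]
            dist = math.hypot(dx, dy) or 1e-6
            pull = spring_k * (dist - rest_length)
            force[a][0] += pull * dx / dist
            force[a][1] += pull * dy / dist
            force[b][0] -= pull * dx / dist
            force[b][1] -= pull * dy / dist

        # Move free nodes, clamping the step so the layout stays stable.
        for node in nodes:
            if node in pinned:
                continue
            fx, fy = force[node]
            size = math.hypot(fx, fy)
            scale = min(1.0, max_step / size) if size > 0 else 0.0
            pos[node][0] += fx * scale
            pos[node][1] += fy * scale

    return {node: tuple(p) for node, p in pos.items()}


layout = spring_layout(
    nodes=["peace", "war", "israel"],
    springs=[("israel", "peace")],
    pinned={"peace": (-5, 0), "war": (5, 0)},
)
```

Here 'peace' and 'war' are pinned at opposite ends and 'israel', connected only to 'peace', settles near it, which is the "spectrum between words of interest" effect in miniature.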

Obama Cairo Speech StreamGraph

By: Jeff Clark    Date: Thu, 04 Jun 2009

Obama just delivered a speech in Cairo calling for A New Beginning with Muslims. Here is a StreamGraph prepared from the text. It does a reasonable job of illustrating which major themes were covered at the various points in the speech.

Click image to see a larger version

Google Squared

By: Jeff Clark    Date: Wed, 03 Jun 2009

datavisualization.ch has a quick review of a new Google offering called Google Squared. It allows you to see the results of a query organized in a table. One of the suggested queries is 'dog breeds' which seemed to work pretty well. The next one I tried was 'mammals' and it seemed OK as well until I looked more closely at the images shown for 'jaguar' and 'wolverine'...

Twitter Employee Account Similarity

By: Jeff Clark    Date: Tue, 02 Jun 2009

Dave Winer recently investigated Who do the people of Twitter follow?. He looked at which twitter accounts were followed by the most employees of Twitter and was curious about how that might be related to the accounts suggested to new Twitter users when they sign up.

His idea sparked one of my own - what are the relationships between Twitter employees themselves with respect to similarity of the vocabulary used in their tweets? Here is the graph created using the same layout technique described in my recent post Twitter Account Graphs.

As a whole, the group of twitter employees seems to be well connected based on this vocabulary similarity metric. There are a few people floating around on their own - thuske, akshay_abd, jeremy, lukester, and em33. There is also a doublet separated from the others - keerthi and mikelimondba. They both only have about 40 tweets so this link is more tenuous than the others which are based on the latest 200 tweets. The bottom right shows a fairly cohesive subgroup connected to most of the rest through ej or perhaps mzsanford/abdur. Co-founder biz seems to be a more central figure by this measure than CEO ev.

WeFollow Twitter Directory

By: Jeff Clark    Date: Mon, 01 Jun 2009

WeFollow has quickly become one of the primary directories of Twitter users. The site lets people assign up to 3 tags to their own account in order to describe their interests. People visiting WeFollow can then see for each tag the list of matching accounts sorted by number of followers.

When you categorize yourself on WeFollow, it sends out a tweet to all your followers having the form: 'Just added myself to the http://wefollow.com twitter directory under: #tag1, #tag2, #tag3'. This automatic viral message has helped WeFollow spread across the twittersphere. Some people have complained that they see too many of these and call them spam. Personally, I find it interesting to see how the people I'm following classify themselves.

These automatic registration messages can be tracked using Twitter Search and reveal lots of information about WeFollow that isn't publicly available on their own site. I have analyzed the set of WeFollow registration tweets for the two month period Mar 28 - May 28, 2009. There were 144,506 tweets matching my search pattern in this time frame, or roughly 2400 new people added to the directory per day. Here is the graph over time:

The peak during this time frame occurred at the end of March and was about 6000. The time period for the analysis was shortly after the WeFollow launch which likely accounts for the rough gradual decline shown. It would be nice to see the data for the launch date but unfortunately limitations in Twitter Search prevent me from accessing this data. There appears to be a new peak showing up at the end of May and there are two obvious troughs around April 10th and 22nd. I've checked other data streams I'm monitoring and they don't show troughs or 'holes' during these two dates so it looks pretty likely that there was a problem with WeFollow infrastructure during those periods rather than it being a data collection problem.

The main page of WeFollow shows the 'top tags' but bases this on the number of followers of the people using those tags rather than the tag count itself. Which tags are actually used most often? An analysis of our sample gives this graph:

The top three tags by follower count on the WeFollow site are Celebrity, TV, and Entrepreneur. When ranking instead by the number of people who actually self-assign these tags, the ranks become 12 for Celebrity, 44 for TV, and 3 for Entrepreneur. This shows quite clearly that the average account tagged Celebrity or TV has more followers than, say, those tagged with Blogger.
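The tag counting described above could be sketched roughly as follows. This is a minimal illustration and not the code used for the analysis: the regular expressions, the function names `extract_tags` and `count_tags`, and the sample tweets are all assumptions, based only on the registration message format quoted earlier.

```python
import re
from collections import Counter

# Assumed pattern: everything after "directory under:" is the tag list.
REGISTRATION = re.compile(r"wefollow\.com twitter directory under:\s*(.*)$", re.IGNORECASE)
TAG = re.compile(r"#(\w+)")

def extract_tags(tweet_text):
    """Return the lowercase tags from one registration tweet, or [] if it does not match."""
    match = REGISTRATION.search(tweet_text)
    if not match:
        return []
    return [tag.lower() for tag in TAG.findall(match.group(1))]

def count_tags(tweets):
    """Count how many registration tweets used each tag."""
    counts = Counter()
    for text in tweets:
        # set() so a tag repeated inside one tweet is counted only once
        counts.update(set(extract_tags(text)))
    return counts

# Made-up sample tweets, for illustration only.
sample = [
    "Just added myself to the http://wefollow.com twitter directory under: #blogger, #music, #tech",
    "Just added myself to the http://wefollow.com twitter directory under: #blogger #photography",
    "Unrelated tweet about #blogger",
]
tag_counts = count_tags(sample)
print(tag_counts.most_common(3))
```

Ranking by this count rather than by follower totals is what separates tags that many people choose from tags held by a few very popular accounts.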

The WeFollow registration tweets also show which tags are used together. I've constructed a couple of different types of graphics to illustrate the tag similarity relationships. This first one is a Clustered Word Cloud and shows colored groups of tags that are frequently used together. The big blue group in the middle seems to contain many of the most frequently used tags and doesn't appear particularly cohesive. Many of the others do, at least subjectively, seem to make sense. Here are a couple of example clusters from the image: (church, conservative, christian, pastor, tcot), (publishing, poetry, books, writing, poet).
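One simple way to measure which tags are "used together" is to count how often each pair of tags appears in the same registration tweet. The sketch below assumes that approach; the clustering actually used for the word cloud is not described here, and the function name and sample data are hypothetical.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_counts(tag_lists):
    """Count how often each unordered pair of tags appears in the same registration."""
    pairs = Counter()
    for tags in tag_lists:
        # Sort so each pair is always stored in the same (alphabetical) order.
        unique = sorted(set(tags))
        pairs.update(combinations(unique, 2))
    return pairs

# Made-up registrations, loosely modelled on the example clusters above.
registrations = [
    ["church", "christian", "pastor"],
    ["christian", "pastor", "tcot"],
    ["books", "writing", "poetry"],
    ["writing", "poetry"],
]
pair_counts = cooccurrence_counts(registrations)
print(pair_counts.most_common(2))
```

Pairs with high counts relative to how often each tag is used on its own are candidates for the same cluster.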

This last image was created using the same layout technique as my recent Twitter Account Graphs. Basically, the tag nodes are positioned near others that they are 'similar to' in the sense that they are often used together.

Click on this to see the larger version

North Korean Flag Word Cloud

By: Jeff Clark    Date: Thu, 28 May 2009

The world is watching carefully the things happening in North Korea and there are lots of tweets discussing the issue. I have created a Shaped Word Cloud using 4000 tweets from the last few days and using the North Korean flag as a template. As usual you can click on a word to see the current twitter search results.

More Twitter Account Graphs

By: Jeff Clark    Date: Thu, 28 May 2009

Here is another graph showing a larger set of Twitter accounts and their relationships based on a measure of shared vocabulary. The middle left cluster contains many Twitter accounts that discuss web technology including Twitter itself. I'm familiar with many of these accounts and know that the ones around my own icon (JeffClark) discuss data visualization (eagereyes, flowingdata, datavis, infosthetics). At the bottom right is a cluster of accounts that I follow which are focused on computational art (blprnt, flight404, toxi, mariuswatz, golan, reas, natzke). The group at the very top contains accounts with an interest in music or entertainment.

To create this graph I'm connecting nodes with a virtual spring if their similarity is greater than 9%. The stronger the similarity the shorter the spring. There are also long springs connecting extremely dissimilar nodes to push them apart but these are not shown. I've tried to avoid the usual tangled mess by not connecting nodes of medium similarity and also by only connecting two nodes if the link is one of the three strongest for either node.
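The edge selection rule above (similarity over 9%, and among the three strongest links for at least one of the two nodes) might look something like this. It is a sketch under those stated rules only; the function name, the data structure, and the sample scores are hypothetical, and the spring layout itself is not shown.

```python
def select_edges(similarity, threshold=0.09, top_k=3):
    """Pick the links to draw.

    similarity maps (node_a, node_b) tuples to scores between 0 and 1.
    A link is kept if its score exceeds the threshold and it is one of
    the top_k strongest links for either of its two nodes.
    """
    strongest = {}
    for (a, b), score in similarity.items():
        if score > threshold:
            strongest.setdefault(a, []).append((score, b))
            strongest.setdefault(b, []).append((score, a))
    edges = set()
    for node, links in strongest.items():
        for score, other in sorted(links, reverse=True)[:top_k]:
            edges.add(frozenset((node, other)))
    return edges

# Made-up scores: a and e each have three stronger links than the a-e link.
scores = {
    ("a", "b"): 0.50, ("a", "c"): 0.40, ("a", "d"): 0.30, ("a", "e"): 0.10,
    ("e", "b"): 0.45, ("e", "c"): 0.35, ("e", "d"): 0.25,
    ("b", "c"): 0.05,
}
edges = select_edges(scores)
print(sorted(sorted(edge) for edge in edges))
```

In this sample the a-e link is above the threshold but is dropped because it is not among the three strongest for either a or e, while b-c is dropped for falling below 9%.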

Tweet Stream Similarity Graph

By: Jeff Clark    Date: Thu, 28 May 2009

At the end of the previous post, Tweet Stream Similarity, I suggested using a network graph to visualize the similarity relationships between the twitter accounts. Here is such a graph for the same small set of accounts I looked at before:

It nicely shows the small group of technology-related accounts (techcrunch, timoreilly, cshirky), the (britneyspears, mariahcarey) entertainment link, and the fact that the nfl account is not closely related to these others. It's interesting that the Twitter CEO, ev, is connected to both the technology group and the entertainment group.

The mariahcarey link to the nba surprised me a bit and I looked into the details. Some of the shared words that caused the link are 'basket' (as in Easter basket for mariah, and basketball basket for the nba), and 'shoot' (as in photo shoot for mariah and shoot the ball for the nba). It's obvious my metric will confuse different senses of the same word. There are many other shared words between these two accounts like friends, guys, baby, twitter, vegas, and everybody. I'm currently using the latest 200 tweets for each user in the analysis. Using more tweets might give better results.

Tweet Stream Similarity

By: Jeff Clark    Date: Sat, 23 May 2009

In my recent Twitter Spam post I showed two Twitter accounts that had an almost identical set of tweets. Being able to detect this situation automatically might have obvious benefit in detecting invalid accounts that should be disabled. We can do this by calculating a text similarity measure between the set of tweets coming from the two accounts. A high degree of similarity (say > 80%) is suggestive of automated duplication. This, coupled with some other likely indicators of spam (lots of links to commercial websites, high rate of updates, very low followers/following ratio, lots of followers showing spam-like behaviour) should be good enough for Twitter to find lots of spam accounts automatically.

A tweet stream similarity metric has some other potential uses as well. Given a set of accounts, we could group them into clusters based on similarity of tweet content. Or we could help a twitter user find new people to follow that seem to have shared interests based on tweet content.

There are lots of different functions that can be used to calculate text similarity. The current one I have designed is based on word frequency and excludes standard stop words (the, of, and...), ignores URLs, ignores some words extremely common in tweets (RT, via), and discounts some other words found often in tweets (like, good, day, thanks...). This metric is fairly crude and can be refined over time. It completely ignores word order, for example, and does not consider the semantics of the text at all. I'm hoping it is useful for detecting similarities at a broad topical level.
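To make the ingredients above concrete, here is a minimal sketch of a word-frequency similarity. The exact function behind the scores in this post is not specified, so cosine similarity is swapped in as an assumption, and the stop word list, the 0.5 discount weight, and the tokenizing pattern are all hypothetical choices.

```python
import math
import re
from collections import Counter

STOP_WORDS = {"the", "of", "and", "a", "to", "in", "is", "for", "on", "it"}  # abbreviated list
TWITTER_NOISE = {"rt", "via"}
DISCOUNTED = {"like": 0.5, "good": 0.5, "day": 0.5, "thanks": 0.5}  # weights are assumed

URL = re.compile(r"https?://\S+")
WORD = re.compile(r"[a-z']+")

def word_weights(tweets):
    """Build a weighted word-frequency table for one account's tweets."""
    counts = Counter()
    for text in tweets:
        text = URL.sub(" ", text.lower())  # ignore URLs
        for word in WORD.findall(text):
            if word in STOP_WORDS or word in TWITTER_NOISE:
                continue
            counts[word] += DISCOUNTED.get(word, 1.0)
    return counts

def similarity(tweets_a, tweets_b):
    """Cosine similarity between two weighted word-frequency tables (0.0 to 1.0)."""
    a, b = word_weights(tweets_a), word_weights(tweets_b)
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Made-up tweets, for illustration only.
account_a = ["Data visualization of tweets http://example.com", "RT a new word cloud"]
account_b = ["Basketball game tonight", "thanks for the good game"]
print(round(similarity(account_a, account_a), 3), round(similarity(account_a, account_b), 3))
```

Because word order is discarded, two accounts that reuse the same text score close to 1.0, which is what makes a high cutoff such as 80% a plausible signal of duplication.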

I have used my metric to calculate the tweet stream similarity between all pairs of 9 fairly well known Twitter personalities. I used the last 200 tweets from each account for the analysis with the exception of britneyspears who only has 144 at this time. The lowest similarity score was 2.8% for ev (the Twitter CEO) vs nfl (news about the National Football League). The highest was 20.3% and was between cshirky (Clay Shirky - American writer, consultant and teacher on the social and economic effects of Internet technologies) and timoreilly (Tim O'Reilly - founder and CEO of O'Reilly Media). The highest score for THE_REAL_SHAQ (Shaquille O'Neal) was with the nba Twitter account. The highest score for MariahCarey was with britneyspears. The metric seems to be doing a reasonable job. Here is the complete list:

  1. Sim(cshirky, timoreilly) = 20.0%
  2. Sim(cshirky, techcrunch) = 16.6%
  3. Sim(timoreilly, techcrunch) = 15.8%
  4. Sim(timoreilly, ev) = 14.2%
  5. Sim(cshirky, ev) = 13.3%
  6. Sim(MariahCarey, britneyspears) = 12.9%
  7. Sim(THE_REAL_SHAQ, nba) = 11.8%
  8. Sim(MariahCarey, ev) = 11.6%
  9. Sim(ev, techcrunch) = 10.9%
  10. Sim(MariahCarey, nba) = 10.8%
  11. Sim(cshirky, MariahCarey) = 10.7%
  12. Sim(MariahCarey, timoreilly) = 9.6%
  13. Sim(ev, britneyspears) = 9.2%
  14. Sim(timoreilly, nba) = 9.1%
  15. Sim(cshirky, nba) = 9.1%
  16. Sim(THE_REAL_SHAQ, ev) = 9.0%
  17. Sim(ev, nba) = 9.0%
  18. Sim(THE_REAL_SHAQ, MariahCarey) = 8.2%
  19. Sim(britneyspears, techcrunch) = 8.1%
  20. Sim(nba, britneyspears) = 7.8%
  21. Sim(MariahCarey, techcrunch) = 7.7%
  22. Sim(cshirky, britneyspears) = 7.5%
  23. Sim(cshirky, THE_REAL_SHAQ) = 7.5%
  24. Sim(timoreilly, britneyspears) = 7.2%
  25. Sim(THE_REAL_SHAQ, timoreilly) = 6.5%
  26. Sim(THE_REAL_SHAQ, britneyspears) = 6.4%
  27. Sim(nba, techcrunch) = 6.4%
  28. Sim(nba, nfl) = 4.5%
  29. Sim(THE_REAL_SHAQ, techcrunch) = 3.9%
  30. Sim(timoreilly, nfl) = 3.9%
  31. Sim(nfl, techcrunch) = 3.7%
  32. Sim(MariahCarey, nfl) = 3.6%
  33. Sim(cshirky, nfl) = 3.6%
  34. Sim(THE_REAL_SHAQ, nfl) = 3.4%
  35. Sim(nfl, britneyspears) = 3.2%
  36. Sim(ev, nfl) = 2.8%

An obvious next step is to use a better way to visualize this information. I'm thinking of using a network layout with nodes positioned closely and connected for high similarity scores and positioned far apart for low similarity scores. I'm hoping that it would illustrate nicely any structure within the group.

American Idol Tweet Narrative

By: Jeff Clark    Date: Thu, 21 May 2009

I have taken the collection of tweets I gathered for the American Idol StreamGraph and run them through my tool for creating a Characteristic Tweets Summary to produce the following output. My initial attempt included some obvious spam tweets so I had to refine my technique a little bit. Basically, a twitter spammer who repeated the same text over and over was highly likely to have one of their tweets selected as the 'characteristic tweet' for the time period containing the spam. The refinement was to only analyze one tweet per user per time period.
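The refinement of analyzing only one tweet per user per time period could be sketched as below. This assumes the time period is one calendar day (as in the table) and that the earliest tweet is the one kept; both are assumptions, as are the function name and the sample data. The selection of the characteristic tweet itself is not shown.

```python
from datetime import datetime

def one_tweet_per_user_per_day(tweets):
    """Keep only the earliest tweet for each (user, day) pair.

    tweets is an iterable of (user, timestamp, text) tuples.
    """
    seen = set()
    kept = []
    for user, timestamp, text in sorted(tweets, key=lambda t: t[1]):
        key = (user, timestamp.date())
        if key in seen:
            continue  # this user already contributed a tweet for this day
        seen.add(key)
        kept.append((user, timestamp, text))
    return kept

# Made-up tweets: one account repeats the same text three times in a day.
tweets = [
    ("spammer", datetime(2009, 5, 19, 9, 0), "Win American Idol tickets now"),
    ("spammer", datetime(2009, 5, 19, 9, 5), "Win American Idol tickets now"),
    ("spammer", datetime(2009, 5, 19, 9, 10), "Win American Idol tickets now"),
    ("fan", datetime(2009, 5, 19, 20, 0), "getting ready to watch american idol"),
    ("spammer", datetime(2009, 5, 20, 9, 0), "Win American Idol tickets now"),
]
filtered = one_tweet_per_user_per_day(tweets)
print(len(filtered))
```

After filtering, a repeated message counts once per account per day, so it can no longer dominate the statistics for that period.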

In the output table I also de-emphasized the twitter account for each tweet since they are statistically selected to be representative of an aggregate. The trailing '*' is a link to the original tweet which, of course, shows the proper attribution.

Date            Characteristic Tweet
May 03, 2009    American Idol winner David Cook's brother dies of cancer. *
May 04, 2009    'American Idol' star David Cook's brother Adam dies of brain cancer! *
May 05, 2009    getting ready to watch american idol. *
May 06, 2009    Headed home for american idol *
May 07, 2009    very mad because Allison Iraheta got off American Idol *
May 08, 2009    tickets for the american idol tour go on sale saturday @ 10!!!!!!!!! *
May 09, 2009    Just got tickets to the American Idol tour!!!! *
May 10, 2009    Tickets for the American Idol 2009 Summer tour on Sale|Tour Dates ... http://tinyurl.com/rdmcyl *
May 11, 2009    Can't wait to see American Idol!!!! *
May 12, 2009    getting ready to watch American Idol *
May 13, 2009    American Idol i'm waiting for who is going home tonight !!!! *
May 14, 2009    @jordanknight who cares about american idol...you're my american idol! *
May 15, 2009    RT @kingsthings: who do you want to win American Idol? *
May 16, 2009    What is the difference between the American Idol and Eurovision? *
May 17, 2009    Clouds on horizon for "American Idol" juggernaut? (Reuters) http://ow.ly/7q1O *
May 18, 2009    britney to perform on American Idol finale? *
May 19, 2009    getting ready to watch american idol. come on,kris! *
May 20, 2009    American Idol finale!!!! come on kris!!! even though adam has it, i really want you to win!!!! *
May 21, 2009    Kris won the american idol *

Fish Tank

By: Jeff Clark    Date: Thu, 21 May 2009

Sorry - I couldn't resist. The fish images are Reef Fish of the Commonwealth of the Northern Mariana Islands and the tank outline comes from the free font Tanks-WW2.

American Idol StreamGraph

By: Jeff Clark    Date: Thu, 21 May 2009

Here is a Twitter StreamGraph created from the query "American Idol" OR #idol in the date range of May 3-21, 2009. I had to use a custom version of my tool that used tweet data harvested in a different manner from the online version, which is limited to viewing the last 1000 tweets only. Given such a popular topic, 1000 tweets only go back a few minutes, which makes for an uninteresting graph.

A couple of observations:

  • Note the large spikes for 'David', 'Cook', and 'brother' around May 3rd. This occurred because the contestant David Cook's brother had just passed away from cancer.
  • The eventual winner (