{"id":101,"date":"2012-01-10T11:01:14","date_gmt":"2012-01-10T11:01:14","guid":{"rendered":"http:\/\/blog.fellstat.com\/?p=101"},"modified":"2012-01-10T11:01:14","modified_gmt":"2012-01-10T11:01:14","slug":"words-in-politics-some-extensions-of-the-word-cloud","status":"publish","type":"post","link":"https:\/\/blog.fellstat.com\/?p=101","title":{"rendered":"Words in Politics: Some extensions of the word cloud"},"content":{"rendered":"<p>The word cloud is a commonly used plot to visualize a speech or set of documents in a succinct way. I really like them. They can be\u00a0extremely\u00a0visually pleasing, and you can spend a lot of time poring over the words and gaining new insights.<\/p>\n<p>That said, they don&#8217;t convey a great deal of information. From a statistical perspective, a word cloud is\u00a0equivalent\u00a0to a bar chart of univariate frequencies, but makes it more difficult for the viewer to estimate the relative frequency of two words. For example, here is a bar chart and a word cloud of the State of the Union addresses for 2010 and 2011 combined.<\/p>\n<figure id=\"attachment_102\" aria-describedby=\"caption-attachment-102\" style=\"width: 300px\" class=\"wp-caption aligncenter\"><a href=\"http:\/\/blog.fellstat.com\/wp-content\/uploads\/2012\/01\/sotu_bar.png\"><img loading=\"lazy\" decoding=\"async\" class=\"size-medium wp-image-102\" title=\"sotu_bar\" src=\"http:\/\/blog.fellstat.com\/wp-content\/uploads\/2012\/01\/sotu_bar-300x230.png\" alt=\"\" width=\"300\" height=\"230\" \/><\/a><figcaption id=\"caption-attachment-102\" class=\"wp-caption-text\">Bar chart of the State of the Union addresses for 2010-11<\/figcaption><\/figure>\n<figure id=\"attachment_103\" aria-describedby=\"caption-attachment-103\" style=\"width: 300px\" class=\"wp-caption aligncenter\"><a href=\"http:\/\/blog.fellstat.com\/wp-content\/uploads\/2012\/01\/cloud.png\"><img loading=\"lazy\" decoding=\"async\" class=\"size-medium wp-image-103\" title=\"cloud\" 
src=\"http:\/\/blog.fellstat.com\/wp-content\/uploads\/2012\/01\/cloud-300x241.png\" alt=\"\" width=\"300\" height=\"241\" \/><\/a><figcaption id=\"caption-attachment-103\" class=\"wp-caption-text\">Word cloud of the State of the Union addresses for 2010-11<\/figcaption><\/figure>\n<p>Notice that the bar chart contains more information, since the exact frequencies can be read off the y axis. Also, in the word cloud the size of a word reflects both its frequency and its number of characters (longer words take up more space in the plot). This could confuse the viewer. We can therefore see that, from a\u00a0statistical\u00a0perspective, the bar chart is superior.<\/p>\n<p>&#8230; Except it isn&#8217;t &#8230;.<\/p>\n<p>The word cloud looks better. There is a reason why every infographic on the web uses word clouds: they strike a balance between presenting the quantitative information and keeping the reader interested with good design. Below I am going to present some extensions of the basic word cloud that help visualize the differences and commonalities between documents.<\/p>\n<h1>The Comparison Cloud<\/h1>\n<p>The previous plots both pooled the two speeches together. Using standard word clouds, that is as far as we can go. What if we want to compare the speeches? Did they talk about different things? 
If so, are certain words\u00a0associated\u00a0with those subjects?<\/p>\n<p>This is where the comparison cloud comes in.<\/p>\n<figure id=\"attachment_105\" aria-describedby=\"caption-attachment-105\" style=\"width: 720px\" class=\"wp-caption aligncenter\"><a href=\"http:\/\/blog.fellstat.com\/wp-content\/uploads\/2012\/01\/sotu_diff1.png\"><img loading=\"lazy\" decoding=\"async\" class=\"size-full wp-image-105\" title=\"sotu_diff1\" src=\"http:\/\/blog.fellstat.com\/wp-content\/uploads\/2012\/01\/sotu_diff1.png\" alt=\"\" width=\"720\" height=\"616\" \/><\/a><figcaption id=\"caption-attachment-105\" class=\"wp-caption-text\">Comparison plot<\/figcaption><\/figure>\n<p>Word size is mapped to the difference between the rates at which the word occurs in each document. So we see that Obama was much more concerned with economic issues in 2010, and in 2011 focused more on education and the future. This generalizes fairly naturally to more than two documents. The next figure shows a comparison cloud for the Republican primary debate in New Hampshire.<\/p>\n<p style=\"text-align: left;\"><a href=\"http:\/\/blog.fellstat.com\/wp-content\/uploads\/2012\/01\/repub_diff.png\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-full wp-image-107\" title=\"repub_diff\" src=\"http:\/\/blog.fellstat.com\/wp-content\/uploads\/2012\/01\/repub_diff.png\" alt=\"\" width=\"600\" height=\"484\" \/><\/a>One thing that you can notice in this plot is that Paul, Perry and Huntsman have larger words than the top-tier candidates, meaning that they deviate from the mean frequencies more. On the one hand this may be due to a single-minded focus on a few differentiating issues (..cough.. 
Ron Paul), but it may also reflect that the top tier candidates were asked more questions and thus focused on a more diverse set of issues.<\/p>\n<h1 style=\"text-align: left;\">The Commonality Cloud<\/h1>\n<p>Where the comparison cloud highlights differences, the commonality cloud highlights words common to all documents\/speakers. Here is one for the two state of the union addresses.<\/p>\n<p>&nbsp;<\/p>\n<figure id=\"attachment_112\" aria-describedby=\"caption-attachment-112\" style=\"width: 550px\" class=\"wp-caption aligncenter\"><a href=\"http:\/\/blog.fellstat.com\/wp-content\/uploads\/2012\/01\/sotu_com1.png\"><img loading=\"lazy\" decoding=\"async\" class=\"size-large wp-image-112\" title=\"sotu_com1\" src=\"http:\/\/blog.fellstat.com\/wp-content\/uploads\/2012\/01\/sotu_com1-1024x876.png\" alt=\"\" width=\"550\" height=\"470\" \/><\/a><figcaption id=\"caption-attachment-112\" class=\"wp-caption-text\">Commonality cloud for the 2010-11 SOTU<\/figcaption><\/figure>\n<p>Here, word size is mapped to its minimum frequency across documents. So if a word is missing from any document it has size=0 (i.e. it is not shown). 
We can also do this on the primary debate data&#8230;<\/p>\n<figure id=\"attachment_113\" aria-describedby=\"caption-attachment-113\" style=\"width: 300px\" class=\"wp-caption aligncenter\"><a href=\"http:\/\/blog.fellstat.com\/wp-content\/uploads\/2012\/01\/repub_com.png\"><img loading=\"lazy\" decoding=\"async\" class=\"size-medium wp-image-113\" title=\"repub_com\" src=\"http:\/\/blog.fellstat.com\/wp-content\/uploads\/2012\/01\/repub_com-300x244.png\" alt=\"\" width=\"300\" height=\"244\" \/><\/a><figcaption id=\"caption-attachment-113\" class=\"wp-caption-text\">Republican primary commonality cloud<\/figcaption><\/figure>\n<p>From this we can infer that what\u00a0politicians\u00a0like more than anything else is people \ud83d\ude42<\/p>\n<p>&nbsp;<\/p>\n<h1>The wordcloud package<\/h1>\n<p>Version 2.0 of wordcloud (just released to CRAN) implements these two types of graphs, and the code below reproduces them.<\/p>\n<pre>library(wordcloud)\nlibrary(tm)\ndata(SOTU)\ncorp &lt;- SOTU\ncorp &lt;- tm_map(corp, removePunctuation)\ncorp &lt;- tm_map(corp, tolower)\ncorp &lt;- tm_map(corp, removeNumbers)\ncorp &lt;- tm_map(corp, function(x)removeWords(x,stopwords()))\n\nterm.matrix &lt;- TermDocumentMatrix(corp)\nterm.matrix &lt;- as.matrix(term.matrix)\ncolnames(term.matrix) &lt;- c(\"SOTU 2010\",\"SOTU 2011\")\ncomparison.cloud(term.matrix,max.words=300,random.order=FALSE)\ncommonality.cloud(term.matrix,random.order=FALSE)\n\nlibrary(tm)\nlibrary(wordcloud)\nlibrary(stringr)\nlibrary(RColorBrewer)\nrepub &lt;- paste(readLines(\"repub_debate.txt\"),collapse=\"\\n\")\nr2 &lt;- strsplit(repub,\"GREGORY\\\\:\")[[1]]\nsplitat &lt;- str_locate_all(repub,\n\t\"(PAUL|HILLER|DISTASOS|PERRY|HUNTSMAN|GINGRICH|SANTORUM|ROMNEY|ANNOUNCER|GREGORY)\\\\:\")[[1]]\nspeaker &lt;- str_sub(repub,splitat[,1],splitat[,2])\ncontent &lt;- str_sub(repub,splitat[,2]+1,c(splitat[-1,1]-1,nchar(repub)))\nnames(content) &lt;- speaker\ntmp &lt;- 
list()\nfor(sp in c(\"GINGRICH:\"  ,\"ROMNEY:\"  ,  \"SANTORUM:\",\"PAUL:\" ,     \"PERRY:\",     \"HUNTSMAN:\")){\n\ttmp[sp] &lt;- paste(content[sp==speaker],collapse=\"\\n\")\n}\ncollected &lt;- unlist(tmp)\n\nrcorp &lt;- Corpus(VectorSource(collected))\nrcorp &lt;- tm_map(rcorp, removePunctuation)\nrcorp &lt;- tm_map(rcorp, removeNumbers)\nrcorp &lt;- tm_map(rcorp, stripWhitespace)\nrcorp &lt;- tm_map(rcorp, tolower)\nrcorp &lt;- tm_map(rcorp, function(x)removeWords(x,stopwords()))\nrterms &lt;- TermDocumentMatrix(rcorp)\nrterms &lt;- as.matrix(rterms)\ncomparison.cloud(rterms,max.words=Inf,random.order=FALSE)\ncommonality.cloud(rterms)<\/pre>\n<p>&nbsp;<\/p>\n<p><a href=\"http:\/\/neolab.stat.ucla.edu\/cranstats\/repub_debate.txt\">Link to republican debate transcript<\/a><\/p>\n<p>&nbsp;<\/p>\n<p>&nbsp;<\/p>\n<p>&nbsp;<\/p>\n<p>&nbsp;<\/p>\n<p>&nbsp;<\/p>\n","protected":false},"excerpt":{"rendered":"<p>The word cloud is a commonly used plot to visualize a speech or set of documents in a succinct way. I really like them. They can be\u00a0extremely\u00a0visually pleasing, and you can spend a lot of time perusing over the words gaining new insights. That said, they don&#8217;t convey a great deal of information. 
From a [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[6,11],"tags":[],"class_list":["post-101","post","type-post","status-publish","format-standard","hentry","category-r","category-wordcloud"],"_links":{"self":[{"href":"https:\/\/blog.fellstat.com\/index.php?rest_route=\/wp\/v2\/posts\/101","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/blog.fellstat.com\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/blog.fellstat.com\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/blog.fellstat.com\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/blog.fellstat.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=101"}],"version-history":[{"count":0,"href":"https:\/\/blog.fellstat.com\/index.php?rest_route=\/wp\/v2\/posts\/101\/revisions"}],"wp:attachment":[{"href":"https:\/\/blog.fellstat.com\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=101"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/blog.fellstat.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=101"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/blog.fellstat.com\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=101"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}