Tag cloud for the R Graph Gallery

This post has two goals: announcing that the graph gallery has gained a tag cloud, and showing how it was built.

The cloud is a simple tag cloud built from the words in the titles of the graphics included in the gallery. For this purpose, I am using an XML dump of the main table of the gallery database. Here, for example, is the information for graph 12:

<graph>
    <id>12</id>
    <titre>Conditionning plots</titre>
    <titre_fr>graphique conditionnel</titre_fr>
    <comments>Conditioning plots</comments>
    <comments_fr>graphique conditionnel</comments_fr>
    <demo>graphics</demo>
    <notemoy>0.56769596199524</notemoy>
    <nbNote>421</nbNote>
    <nbKeywords>0</nbKeywords>
    <boolForum>0</boolForum>
    <px_w>500</px_w>
    <px_h>400</px_h>
</graph>

We are interested in the titre element of each graph element. That is straightforward to get with the R4X package (I will do a post specifically on R4X soon).

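library( XML )   # provides xmlTreeParse
library( R4X )   # provides the x["graph/titre/#"] extraction syntax
x <- xmlTreeParse( "/tmp/rgraphgallery.xml" )$doc$children[[1]]
titles <- x["graph/titre/#"]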
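
For readers who do not have R4X at hand, the same extraction can probably be done with the XML package alone. This is only a sketch: the tiny inline dump, its gallery root element and the second record are made up for illustration, and the XPath query is my assumption of an equivalent to the R4X expression, not something taken from the gallery code.

```r
library( XML )

# hypothetical miniature of the dump; only the first title comes from the post
dump <- '<gallery>
  <graph><id>12</id><titre>Conditionning plots</titre></graph>
  <graph><id>13</id><titre>Made-up second title</titre></graph>
</gallery>'

# parse the string, then collect the text of every titre under a graph
doc <- xmlParse( dump, asText = TRUE )
titles <- xpathSApply( doc, "//graph/titre", xmlValue )
print( titles )
```

On the real dump, the file name would replace the inline string and asText would be dropped.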

Next, we want to extract the words of the titles. We need to be careful to replace the <br> tags that appear in some of the titles, remove any character that is not a letter or a space, and then separate by spaces. For that, we will use the operators package like this:

library( operators )   # provides %-~% (remove matches) and %/~% (split on a pattern)
words <- gsub( "<br>", " ", titles )
words <- words %-~% "[^[:alpha:][:space:]]" %/~% "[[:space:]]"
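
The same cleaning can be sketched in base R, assuming that %-~% removes every match of the pattern and that %/~% splits on it. The second title below is made up for illustration. The last line is an addition of mine: splitting on single spaces leaves empty strings wherever two spaces are adjacent, and dropping them keeps "" out of the word counts.

```r
# first title is from the dump excerpt, second one is hypothetical
titles <- c( "Conditionning plots", "Scatter <br>plot, version 2 (made up)" )

# replace <br> by a space, as in the post
words <- gsub( "<br>", " ", titles )

# drop everything that is neither a letter nor whitespace
words <- gsub( "[^[:alpha:][:space:]]", "", words )

# split on single whitespace characters and flatten into one vector
words <- unlist( strsplit( words, "[[:space:]]" ) )

# extra step (not in the original script): discard empty strings
words <- words[ nzchar( words ) ]
print( words )
```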

Next, we convert everything to lower case and extract the 100 most used words:

words <- casefold( words )
w100 <- tail( sort( table( words ) ), 100 )
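
To see what the counting step does, here is a toy run on six made-up words, keeping the top 2 instead of the top 100. table counts each distinct word, sort puts the counts in increasing order, and tail keeps the last (largest) ones.

```r
# hypothetical words, with mixed case on purpose
words <- c( "Plot", "plot", "Density", "plot", "density", "map" )

# fold to lower case so that "Plot" and "plot" are counted together
words <- casefold( words )

# count, sort by increasing frequency, keep the two most frequent
counts <- sort( table( words ) )
top2 <- tail( counts, 2 )
print( top2 )
```

When several words share the count at the cutoff, which of them make it into the selection depends on the sort order, so the boundary of the top 100 is somewhat arbitrary.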

Finally, we generate the (fairly simple) HTML code:

# put the words in alphabetical order
w100 <- w100[ order( names( w100 ) ) ]

# one link per word, with a font size that grows with the log of the count
html <- sprintf( '
<a href="search.php?engine=RGG&q=%s">
    <span style="font-size:%dpt">%s</span>
</a>
',
    names(w100),
    round( 20*log(w100, base = 5) ),
    names(w100) )
cat( html, file = "cloud.html"  )
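
The font size formula is easier to read on a few numbers. With the made-up counts below, each fivefold increase in frequency adds 20pt. Note that a word appearing only once would get log(1) = 0, hence a 0pt font, so the formula relies on the selected words being used more than once.

```r
# hypothetical words and counts, chosen as powers of 5
w <- c( "plot", "density", "map" )
counts <- c( 125, 25, 5 )

# same formula as in the script: 20pt per factor of 5
sizes <- round( 20 * log( counts, base = 5 ) )

# sprintf is vectorised, so this builds one span per word
html <- sprintf( '<span style="font-size:%dpt">%s</span>', sizes, w )
cat( html, sep = "\n" )
```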

And that's it. You can see it on the gallery front page. Here is the full script:

### packages used below
library( XML )         # xmlTreeParse
library( R4X )         # x["graph/titre/#"] and xml()
library( operators )   # %-~% and %/~%

### read the xml dump
x <- xmlTreeParse( "rgraphgallery.xml" )$doc$children[[1]]

### extract the titles
titles <- x["graph/titre/#"]

### clean them up
words <- gsub( "<br>", " ", titles )
words <- words %-~% "[^[:alpha:][:space:]]" %/~% "[[:space:]]"

### get the 100 most used words
words <- casefold( words )
w100 <- tail( sort( table( words ) ), 100 )
w100 <- w100[ order( names( w100 ) ) ]

### generate the html using sprintf
html <- sprintf( '
<a href="search.php?engine=RGG&q=%s">
    <span style="font-size:%dpt">%s</span>
</a>
',
    names(w100),
    round( 20*log(w100, base = 5) ),
    names(w100) )
cat( html, file = "cloud.html"  )

### or using R4X again
# - we need an enclosing tag for that
# - note the &amp; instead of & to make the XML parser happy
w <- names(w100)
sizes <-  round( 20*log(w100, base = 5) )
xhtml <- '##((xml
    <div id="cloud">
        <@i|100>
            <a href="search.php?q={ w[i] }&amp;engine=RGG">
                <span style="font-size:{sizes[i]}pt" >{ w[i] }</span>
            </a>
        </@>
    </div>'##xml))
html <- xml( xhtml )

Romain François