This post has two goals: announcing that the graph gallery has gained a tag cloud, and showing how it was built.
The cloud is a simple tag cloud of the words in the titles of the graphics included in the gallery. For this purpose, I am using an XML dump of the main table of the gallery database. Here, for example, is the information for graph 12:

    <graph>
      <id>12</id>
      <titre>Conditionning plots</titre>
      <titre_fr>graphique conditionnel</titre_fr>
      <comments>Conditioning plots</comments>
      <comments_fr>graphique conditionnel</comments_fr>
      <demo>graphics</demo>
      <notemoy>0.56769596199524</notemoy>
      <nbNote>421</nbNote>
      <nbKeywords>0</nbKeywords>
      <boolForum>0</boolForum>
      <px_w>500</px_w>
      <px_h>400</px_h>
    </graph>

We are interested in the titre tag of each graph tag. That is straightforward to get with the R4X package (I will do a post specifically on R4X soon).

    x <- xmlTreeParse( "/tmp/rgraphgallery.xml" )$doc$children[[1]]
    titles <- x["graph/titre/#"]

Next, we want to extract the words of the titles. We need to be careful to remove the <br> tags that appear in some of the titles, then remove any character that is not a letter or a space, and finally separate by spaces. For that, we will use the operators package like this:

    words <- gsub( "<br>", " ", titles )
    words <- words %-~% "[^[:alpha:][:space:]]" %/~% "[[:space:]]"

Next, we convert everything to lower case, and extract the 100 most used words:

    words <- casefold( words )
    w100 <- tail( sort( table( words ) ), 100 )

And finally, we generate the (fairly simple) HTML code:

    w100 <- w100[ order( names( w100 ) ) ]
    html <- sprintf( '
    <a href="search.php?engine=RGG&q=%s">
    <span style="font-size:%dpt">%s</span>
    </a>
    ',
      names(w100),
      round( 20*log(w100, base = 5) ),
      names(w100) )
    cat( html, file = "cloud.html" )

And that's it. You can see it on the gallery front page. Here is the full script:

    ### packages used below
    library( XML )
    library( R4X )
    library( operators )

    ### read the xml dump
    x <- xmlTreeParse( "rgraphgallery.xml" )$doc$children[[1]]

    ### extract the titles
    titles <- x["graph/titre/#"]

    ### clean them up
    words <- gsub( "<br>", " ", titles )
    words <- words %-~% "[^[:alpha:][:space:]]" %/~% "[[:space:]]"

    ### get the 100 most used words
    words <- casefold( words )
    w100 <- tail( sort( table( words ) ), 100 )
    w100 <- w100[ order( names( w100 ) ) ]

    ### generate the html using sprintf
    html <- sprintf( '
    <a href="search.php?engine=RGG&q=%s">
    <span style="font-size:%dpt">%s</span>
    </a>
    ',
      names(w100),
      round( 20*log(w100, base = 5) ),
      names(w100) )
    cat( html, file = "cloud.html" )

    ### or using R4X again
    # - we need an enclosing tag for that
    # - note the &amp; instead of & to make the XML parser happy
    w <- names(w100)
    sizes <- round( 20*log(w100, base = 5) )
    xhtml <- '##((xml
    <div id="cloud">
    <@i|100>
    <a href="search.php?q={ w[i] }&amp;engine=RGG">
    <span style="font-size:{sizes[i]}pt" >{ w[i] }</span>
    </a>
    </@>
    </div>'##xml))
    html <- xml( xhtml )
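
For readers who do not have R4X or operators installed, the cleaning, counting and sizing steps can be tried out in base R. The following is only a sketch under assumptions: the titles vector is made up for illustration (it is not taken from the gallery dump), size_for is a hypothetical helper name, and gsub and strsplit stand in for the %-~% and %/~% operators as described above (remove what matches, then split on spaces).

```r
# Hypothetical titles standing in for the real gallery dump
titles <- c( "Conditionning plots",
             "Box plots<br>with notches",
             "Scatter plots (2)",
             "Box and whisker" )

# replace <br> by a space, then drop anything that is not a letter or a space
clean <- gsub( "<br>", " ", titles, fixed = TRUE )
clean <- gsub( "[^[:alpha:][:space:]]", "", clean )

# split on runs of whitespace, convert to lower case, drop empty strings
words <- casefold( unlist( strsplit( clean, "[[:space:]]+" ) ) )
words <- words[ nzchar( words ) ]

# count the words, sorted by increasing frequency
counts <- sort( table( words ) )

# same font size formula as in the script above (hypothetical helper name)
size_for <- function( n ) round( 20 * log( n, base = 5 ) )
sizes <- size_for( as.vector( counts ) )
names( sizes ) <- names( counts )
print( sizes )
```

One thing this makes visible: with this formula a word that occurs only once gets a font size of 0, since log(1) is 0. On a small data set you may therefore want a floor on the size, for instance with pmax().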