25 Apr 2012

Reproducible Research: Running odfWeave with 7-zip

odfWeave is an R package used for making dynamic reports by Sweave processing of Open Document Format (ODF) files. For anyone new to report generation and without knowledge of markup languages, this might be a good starting point, or even a true alternative to Sweave / LaTeX and the like.
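For readers who have never seen Sweave notation: the code chunks are typed directly into the text of the ODT file and are replaced by their output when the document is weaved. A minimal, made-up chunk (chunk name and options are arbitrary) looks like this:

<<example-chunk, echo = TRUE>>=
summary(cars)
@

Inline results can be inserted with \Sexpr{}, e.g. \Sexpr{round(mean(cars$speed), 1)}.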

Now, anyone who recently tried to install the zipping program required by odfWeave might have noticed that there are currently no Info-ZIP executables available (zip and unzip from Info-ZIP are the programs suggested in the odfWeave manual). There are several other free zipping programs, but if you use one of these, the default syntax for odfWeave changes: looking into the internals reveals that the OS command specified for running the zipping program has to be adapted. There are some postings on the R-help mailing list on this topic, but none of them worked for me. After some trial and error I managed to get around the problem by using 7-Zip with an adapted syntax, which I share here:

# write an in-file and save it to a folder:
dir()
[1] "Example_1_in.odt"

# testing the zip program on the archive:
system("\"C:\\Program Files\\7-Zip\\7z.exe\" t -tzip Example_1_in.odt")

7-Zip 9.20  Copyright (c) 1999-2010 Igor Pavlov  2010-11-18

Processing archive: Example_1_in.odt

Testing     mimetype
Testing     Configurations2\statusbar
Testing     Configurations2\accelerator\current.xml
Testing     Configurations2\floater
Testing     Configurations2\popupmenu
Testing     Configurations2\progressbar
Testing     Configurations2\menubar
Testing     Configurations2\toolbar
Testing     Configurations2\images\Bitmaps
Testing     content.xml
Testing     manifest.rdf
Testing     styles.xml
Testing     meta.xml
Testing     Thumbnails\thumbnail.png
Testing     settings.xml
Testing     META-INF\manifest.xml

Everything is Ok

Folders: 7
Files: 9
Size:       30139
Compressed: 9572

# setting up the zip/unzip commands for odfWeave:
library(odfWeave)
ctrl <- odfWeaveControl(zipCmd =
           c("\"C:\\Program Files\\7-Zip\\7z.exe\" a $$file$$",
             "\"C:\\Program Files\\7-Zip\\7z.exe\" x -tzip $$file$$"))
# running:
odfWeave("Example_1_in.odt", "Example_1_out.odt", control = ctrl)

  Copying  Example_1_in.odt
  Setting wd to  C:\Users\Kay\AppData\Local\Temp\RtmpghYef5/odfWeave2223195249
  Unzipping ODF file using "C:\Program Files\7-Zip\7z.exe" x -tzip "Example_1_in.odt"

7-Zip 9.20  Copyright (c) 1999-2010 Igor Pavlov  2010-11-18

Processing archive: Example_1_in.odt

Extracting  mimetype
Extracting  Configurations2\statusbar
Extracting  Configurations2\accelerator\current.xml
Extracting  Configurations2\floater
Extracting  Configurations2\popupmenu
Extracting  Configurations2\progressbar
Extracting  Configurations2\menubar
Extracting  Configurations2\toolbar
Extracting  Configurations2\images\Bitmaps
Extracting  content.xml
Extracting  manifest.rdf
Extracting  styles.xml
Extracting  meta.xml
Extracting  Thumbnails\thumbnail.png
Extracting  settings.xml
Extracting  META-INF\manifest.xml

Everything is Ok

Folders: 7
Files: 9
Size:       30139
Compressed: 9572

  Removing  Example_1_in.odt
  Creating a Pictures directory

  Pre-processing the contents
  Sweaving  content.Rnw

  Writing to file content_1.xml
  Processing code chunks ...

  'content_1.xml' has been Sweaved

  Removing content.xml

  Post-processing the contents
  Removing content.Rnw
  Removing styles.xml
  Renaming styles_2.xml to styles.xml
  Removing manifest.xml
  Renaming manifest_2.xml to manifest.xml
  Removing extra files

  Packaging file using "C:\Program Files\7-Zip\7z.exe" a "Example_1_in.odt"

7-Zip 9.20  Copyright (c) 1999-2010 Igor Pavlov  2010-11-18
Scanning

Creating archive Example_1_in.odt

Compressing  Configurations2\accelerator\current.xml
Compressing  content.xml
Compressing  manifest.rdf
Compressing  META-INF\manifest.xml
Compressing  meta.xml
Compressing  mimetype
Compressing  settings.xml
Compressing  styles.xml
Compressing  Thumbnails\thumbnail.png

Everything is Ok
  Copying  Example_1_in.odt
  Resetting wd
  Removing  C:\Users\Kay\AppData\Local\Temp\RtmpghYef5/odfWeave2223195249

  Done

# see the result:
dir()
[1] "Example_1_in.odt"  "Example_1_out.odt"

20 Apr 2012

Reproducible Research: Export Regression Table to MS Word

Here's a quick tip for anyone wishing to export results, say a regression table, from R to MS Word:
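A minimal sketch of one possible approach (assuming the xtable package, which is also used further down in this post): write the regression table to an HTML file, which MS Word can open directly:

library(xtable)

# one possible route (sketch): export the coefficient table as HTML,
# then open the file in MS Word and save it as .doc/.docx
fit <- lm(mpg ~ wt + hp, data = mtcars)   # example regression
tab <- xtable(summary(fit))               # coefficient table
print(tab, type = "html", file = "regression_table.html")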

6 Apr 2012

R-Bloggers' Web-Presence

We love them, we hate them: RANKINGS!

Rankings are an inevitable tool to keep the human rat race going. In this regard I'll pick up my last two posts (HERE & HERE) and have some fun with them by using them to analyse R-Bloggers' web presence. I will use the number of hits in a Google search as an indicator.

I searched for URLs like this: https://www.google.com/search?q="http://www.twotorials.com" - meaning that only the exact blog URL is searched for.

Blogs NoHits
http://google-opensource.blogspot.com 82300
http://www.programmingr.com 73500
http://googleresearch.blogspot.com 58000
http://dirk.eddelbuettel.com 53000
http://borasky-research.net 33100
http://casoilresource.lawr.ucdavis.edu 32500
http://andrewgelman.com 30000
http://yihui.name 29600
http://xianblog.wordpress.com 27900
http://nsaunders.wordpress.com 27600
http://chem-bla-ics.blogspot.com 26600
http://plindenbaum.blogspot.com 24600
http://blog.ouseful.info 24300
http://www.vcasmo.com 24200
http://yz.mit.edu 23500
http://romainfrancois.blog.free.fr 22700
http://blog.revolutionanalytics.com 21000
http://robjhyndman.com 18400
http://freakonometrics.blog.free.fr 16100
http://perfdynamics.blogspot.com 15400
http://www.stubbornmule.net 14800
http://zoonek.free.fr 14800
http://jackman.stanford.edu 13900
http://www.bytemining.com 13700
http://learnr.wordpress.com 12600
http://tommy.chheng.com 12200
http://mazamascience.com 12000
http://www.investuotojas.eu 11500
http://www.r-statistics.com 11300
http://www.franklincenterhq.org 10800
http://gettinggeneticsdone.blogspot.com 10700
http://mpastell.com 9930
http://pineda-krch.com 9780
http://blog.saush.com 9220
http://www.premiersoccerstats.com 8950
http://developmentality.wordpress.com 7250
http://www.dataspora.com 7200
http://blog.hiremebecauseimsmart.com 7050
http://isomorphismes.tumblr.com 7040
http://www.mathfinance.cn 6930
http://blog.nguyenvq.com 6150
http://www.drewconway.com 5970
http://www.carlboettiger.info 5520
http://www.statisticsblog.com 5110
http://www.decisionsciencenews.com 4950
http://www.r-chart.com 4810
http://chartsgraphs.wordpress.com 4480
http://www.portfolioprobe.com 4410
http://procomun.wordpress.com 4330
http://jeromyanglim.blogspot.com 4080
http://spatialanalysis.co.uk 4080
http://www.theresearchkitchen.com 4080
http://www.forex-bloggers.com 4070
https://www.rmetrics.org 4050
http://princeofslides.blogspot.com 3900
http://www.cybaea.net 3740
http://www.cerebralmastication.com 3710
http://ygc.name 3670
http://ryouready.wordpress.com 3450
http://jeffreybreen.wordpress.com 3410
http://systematicinvestor.wordpress.com 3400
http://sgsong.blogspot.com 3310
http://industrialengineertools.blogspot.com 3290
http://www.r-tutor.com 3270
http://fishlab.ucdavis.edu 3270
http://ggorjan.blogspot.com 3250
http://blog.ynada.com 3220
http://farmacokratia.blogspot.com 3170
http://4dpiecharts.com 3130
http://heuristically.wordpress.com 3040
http://blog.rtwilson.com 2910
http://www.wekaleamstudios.co.uk 2890
http://www.dataists.com 2840
http://ikanb.wordpress.com 2750
http://shape-of-code.coding-guidelines.com 2730
http://onertipaday.blogspot.com 2710
http://blog.fosstrading.com 2700
http://blog.echen.me 2690
http://www.theusrus.de 2670
http://cloudnumbers.com 2630
http://paulbutler.org 2620
http://biostatmatt.com 2460
http://www.johnmyleswhite.com 2430
http://dataninja.wordpress.com 2360
http://realizationsinbiostatistics.blogspot.com 2340
http://statisfaction.wordpress.com 2300
http://uxblog.idvsolutions.com 2250
http://timelyportfolio.blogspot.com 2210
http://radfordneal.wordpress.com 2200
http://sas-and-r.blogspot.com 2200
http://pairach.com 2110
http://yusung.blogspot.com 2050
http://blog.flacso.edu.mx 2010
http://www.rensenieuwenhuis.nl 2000
http://michaeldhealy.com 1990
http://freigeist.devmag.net 1950
http://www.fernandohrosa.com.br 1920
http://statbandit.wordpress.com 1870
http://www.win-vector.com 1840
http://lukemiller.org 1830
http://ropensci.org 1720
http://www.eggwall.com 1650
http://benmazzotta.wordpress.com 1620
http://bms.zeugner.eu 1610
http://cartesianfaith.wordpress.com 1580
http://linkedscience.org 1570
http://stevemosher.wordpress.com 1550
http://intelligenttradingtech.blogspot.com 1520
http://www.imachordata.com 1480
http://blog.diegovalle.net 1470
http://jermdemo.blogspot.com 1430
http://nortalktoowise.com 1420
http://ekonometrics.blogspot.com 1340
http://digitheadslabnotebook.blogspot.com 1320
http://flyordie.sin.khk.be 1310
http://schamberlain.github.com 1230
http://gribblelab.org 1180
http://www.quantf.com 1130
http://offensivepolitics.net 1020
http://www.markmfredrickson.com 981
http://blog.mckuhn.de 948
http://erehweb.wordpress.com 889
http://confounding.net 886
http://simplystatistics.tumblr.com 875
http://www.babelgraph.org 859
http://csgillespie.wordpress.com 857
http://joewheatley.net 844
http://helmingstay.blogspot.com 843
http://theaverageinvestor.wordpress.com 825
http://quantitative-ecology.blogspot.com 785
http://zvfak.blogspot.com 776
http://ucfagls.wordpress.com 766
http://opendatagroup.com 760
http://cameron.bracken.bz 740
http://rtutorialseries.blogspot.com 738
http://opencpu.org 708
http://novicemetrics.blogspot.com 700
http://lamages.blogspot.com 680
http://nir-quimiometria.blogspot.com 679
http://tonybreyal.wordpress.com 677
http://brokeringclosure.wordpress.com 658
http://socialdatablog.com 643
http://dancingeconomist.blogspot.com 629
http://www.rtexttools.com 603
http://danganothererror.wordpress.com 589
http://thebiobucket.blogspot.com 567
http://holtmeier.de 531
http://val-systems.blogspot.com 519
http://thelogcabin.wordpress.com 489
http://dcemri.blogspot.com 484
http://rdatamining.wordpress.com 477
http://bridgewater.wordpress.com 460
http://www.rcasts.com 444
http://dsparks.wordpress.com 436
http://pr.cloudst.at 422
http://polstat.org 409
http://www.compmath.com 401
http://techno-realism.blogspot.com 399
http://www.backsidesmack.com 395
http://geotheory.org 393
http://miraisolutions.wordpress.com 367
http://econometricsense.blogspot.com 352
http://blog.binfalse.de 344
http://rforcancer.drupalgardens.com 317
http://blog.rstudio.org 316
http://mcfromnz.wordpress.com 309
http://www.quantumforest.com 309
http://blog.quanttrader.org 303
http://chrisladroue.com 293
http://www.michaelbommarito.com 289
http://procrun.com 280
http://mikeksmith.posterous.com 279
http://bio7.org 278
http://kbroman.wordpress.com 278
http://martynplummer.wordpress.com 272
http://bryer.org 268
http://www.funjackals.com 265
http://www.harlan.harris.name 252
http://www.milktrader.net 248
http://www.surefoss.org 241
http://rigorousanalytics.blogspot.com 231
http://www.jameskeirstead.ca 229
http://programming-r-pro-bro.blogspot.com 225
http://plausibel.blogspot.com 224
http://statistic-on-air.blogspot.com 217
http://mintgene.wordpress.com 212
http://moderntoolmaking.blogspot.com 205
http://quantitativeecology.blogspot.com 199
http://www.sigmafield.org 199
http://www.ancienteco.com 194
http://worldofrcraft.blogspot.com 191
http://rappster.wordpress.com 190
http://stotastic.com 189
http://evolvingspaces.blogspot.com 184
http://strugglingthroughproblems.blogspot.com 166
http://sharpstatistics.co.uk 161
http://leftcensored.skepsi.net 160
http://omegahat.wordpress.com 156
http://drunks-and-lampposts.com 155
http://amathew.com 152
http://onlinelabor.blogspot.com 147
http://johnramey.net 144
http://gossetsstudent.wordpress.com 138
http://tomhopper.wordpress.com 135
http://ggobi.blogspot.com 134
http://blog.fellstat.com 131
http://www.openanalytics.eu 130
http://www.numbertheory.nl 127
http://stats.blogoverflow.com 127
http://the-praise-of-insects.blogspot.com 122
http://lpenz.github.com 118
http://christophergandrud.blogspot.com 118
http://f.giorlando.org 112
http://bayesianbiologist.com 110
http://www.graphoftheweek.org 109
http://oneliner.soma20.com 109
http://inundata.org 107
http://geokook.wordpress.com 104
http://blog.datapunks.com 102
http://eranraviv.com 102
http://www.compbiome.com 101
http://www.techpolicy.ca 99
http://www.psychwire.co.uk 97
http://blog.carlislerainey.com 93
http://vasishth-statistics.blogspot.com 93
http://www.statsravingmad.com 93
http://using-r-project.blogspot.com 93
http://www.nikhilgopal.com 92
http://thedatamonkey.blogspot.com 92
http://jeffreyhorner.tumblr.com 90
http://menugget.blogspot.com 88
http://www.twotorials.com 88
http://dataexcursions.wordpress.com 84
http://viksalgorithms.blogspot.com 83
http://exploringdatablog.blogspot.com 81
http://sachaepskamp.com 81
http://aphysicistinwallstreet.blogspot.com 77
http://lastresortsoftware.blogspot.com 75
http://www.nomad.priv.at 72
http://applyr.blogspot.com 71
http://www.knowledgediscovery.jp 71
http://weitaiyun.blogspot.com 71
http://xmphforex.wordpress.com 71
http://statsadventure.blogspot.com 70
http://davenportspatialanalytics.squarespace.com 70
http://anandram.wordpress.com 69
http://rpint.wordpress.com 68
http://datadebrief.blogspot.com 66
http://blog.cloudstat.org 64
http://www.r-podcast.org 64
http://rmkrug.wordpress.com 62
http://denishaine.wordpress.com 61
http://expansed.com 58
http://r.andrewredd.us 57
http://isseing333.blogspot.com 57
http://solomonmessing.wordpress.com 57
http://rtricks.wordpress.com 57
http://anrprogrammer.wordpress.com 56
http://arungaikwad.wordpress.com 56
http://geolabs.wordpress.com 55
http://lookingatdata.blogspot.com 55
http://factbased.blogspot.com 54
http://severity.blogspot.com 54
http://swordofcrom.wordpress.com 53
http://librestats.wordpress.com 51
http://marcinkula.wordpress.com 51
http://gsoc2010r.wordpress.com 47
http://psyccomputing.blogspot.com 46
http://fabiomarroni.wordpress.com 45
http://jedifran.com 45
http://alstatr.blogspot.com 43
http://r-video-tutorial.blogspot.com 42
http://alexfarquhar.posterous.com 40
http://bmb-common.blogspot.com 40
http://rdataviz.wordpress.com 40
http://mypapertrades.blogspot.com 38
http://pitchrx.blogspot.com 38
http://simonmueller.net 38
http://statisfactions.wordpress.com 37
http://nzprimarysectortrade.wordpress.com 36
http://seanmulcahy.blogspot.com 36
http://www.speakingstatistically.com 35
http://joshpaulson.wordpress.com 34
http://learningrbasic.blogspot.com 34
http://mockquant.blogspot.com 33
http://costaleconomist.blogspot.com 32
http://rsnippets.blogspot.com 31
http://statmethods.wordpress.com 29
http://aviadklein.wordpress.com 28
http://obeautifulcode.com 28
http://blog.cloudst.at 24
http://rstats.posterous.com 23
http://notebookonthewebs.tumblr.com 22
http://0utlier.blogspot.com 21
http://gjkerns.github.com 21
http://eigensomething.blogspot.com 10
http://brocktibert.wordpress.com 9
http://toddjobe.blogspot.com 9
http://mickeymousemodels.blogspot.com 9
http://forgetfulfunctor.blogspot.com 9
http://rocknrblog.wordpress.com 9
http://dmbates.blogspot.com 8
http://blog.nextbiomotif.com 8
http://indiacrunchin.wordpress.com 8
http://blog.trenthauck.com 8
http://mikescnc.blogspot.com 8
http://jeroldhaas.blogspot.com 8
http://tlevine.tumblr.com 8
http://empty-moon-9726.heroku.com 8
http://www.proc-x.com 7
http://jointposterior.blogspot.com 7
http://gastonsanchez.wordpress.com 7
http://mlt-thinks.blogspot.com 7
http://rstats.wordpress.com 7
http://playingwithr.blogspot.com 7
http://scottmutchler.blogspot.com 6
http://iamdata.wordpress.com 6
http://sfchaos.blogspot.com 6
http://nightlordtw.wordpress.com 5
http://pleasepasstheroc.blogspot.com 5
http://wiekvoet.blogspot.com 5
http://d7.stattler.com 4
http://yetanotherrblog.blogspot.com 4
http://blog.iwanluijks.nl:80 3
https://rlearner.wordpress.com 3
http://margintale.blogspot.com 1

When checking the results manually I discovered slight deviations in the numbers, and admittedly I have no clue why this is. Sorry if any blog is under- or overrepresented due to such an error - please report it!

Here is the R-script:

require(XML)
library(stringr)
library(RCurl)
library(xtable)

# GoogleHits.1(): query Google for an exact phrase and return the
# reported number of search hits
GoogleHits.1 <- function(input)
   {
    url <- paste("https://www.google.com/search?q=\"",
                 input, "\"", sep = "")

    # CA certificate bundle shipped with RCurl, needed for https:
    CAINFO = paste(system.file(package = "RCurl"), "/CurlSSL/ca-bundle.crt", sep = "")
    script <- getURL(url, followlocation = TRUE, cainfo = CAINFO)
    doc <- htmlParse(script)
    # grab the element holding the hit count and strip all non-digits:
    res <- xpathSApply(doc, "//div[@id='subform_ctrl']/*", xmlValue)[[2]]
    return(as.integer(gsub("[^0-9]", "", res)))
   }

# Example:
GoogleHits.1("R%Statistical%Software")

###################### Begin get r-blogger's URLs: ###########################################
# get blogger urls with XML:
script <- getURL("www.r-bloggers.com")
doc <- htmlParse(script)
li <- getNodeSet(doc, "//ul[@class='xoxo blogroll']//a")
urls <- sapply(li, xmlGetAttr, "href")

# extract sensible blog urls:
# get ids for those with only 2 slashes (no 3rd in the end):
id <- which(nchar(gsub("[^/]", "", urls )) == 2)
slash_2 <- urls[id]

# find position of 3rd slash occurrence in strings:
slash_stop <- unlist(lapply(str_locate_all(urls, "/"),"[[", 3))
slash_3 <- substring(urls, first = 1, last = slash_stop - 1)

# replace the ones with 2 slashes:
blogs <- slash_3; blogs[id] <- slash_2

# dismiss:
blogs <- blogs[blogs != "http://domain"]
###################### End get r-blogger's URLs: #############################

###################### Begin Google Search: ##################################
# running lapply over all blogs at once makes Google complain about
# automated (robot) access - I get blocked at around the 300th request:
# unlist(lapply(blogs, GoogleHits.1))

# splitting into two batches alone does not help (blocked just as before):
res1 <- unlist(lapply(blogs[1:170], GoogleHits.1))
res2 <- unlist(lapply(blogs[171:334], GoogleHits.1))

# workaround: do it in 2 sessions (saving the first result), or manually
# re-connect to the host before the second batch:
df1 <- data.frame(Blogs = blogs[1:170], NoHits = res1, row.names = NULL)
save(df1, file = "df1.RData")
load("df1.RData"); unlink("df1.RData")

# second run:
df2 <- data.frame(Blogs = blogs[171:334], NoHits = res2, row.names = NULL)

# bind dfs, sort by NoHits:
finres <- as.data.frame(rbind(df1, df2)); finres$Blogs <- as.character(finres$Blogs)
(finres <- finres[order(finres$NoHits, decreasing = T), ])

htmltab <- xtable(finres)
print(htmltab, type = "html", include.rownames=FALSE, file = "Bloggers.Google.Hits.htm")
###################### End Google Search #####################################

###################### Begin Plot: ###########################################
pdf("RBloggersWebPresence.pdf")
par(mar = c(4.5, 4.5, 3, 2), ylog = F)
plot(finres$NoHits, cex = 0.5, col = 3, 
     ylab = "No. of Hits in Google Search",
     xlab = "Blogs", log = "y")
set.seed(19)
rid <- sample(13:nrow(finres), 15)
text(x = rid, y = finres$NoHits[rid], 
     labels = finres$Blogs[rid],
     cex = 0.75, srt = 90, pos = 4, offset = -1) 
title(main = "R-Bloggers' Web Presence")
dev.off()
###################### End Plot ##############################################

5 Apr 2012

A Little Web Scraping Exercise with XML-Package

Some months ago I posted an example of how to get the links of the contributing blogs on the R-Bloggers site. I used readLines() and did some string processing with regular expressions.

With the XML package this can be shortened drastically - see this:
# get blogger urls with XML:
library(RCurl)
library(XML)
script <- getURL("www.r-bloggers.com")
doc <- htmlParse(script)
li <- getNodeSet(doc, "//ul[@class='xoxo blogroll']//a")
urls <- sapply(li, xmlGetAttr, "href")
With only a few lines of code this gives the same result as in the original post! Here I will also process the URLs to retrieve a link to each blog's start page:
# get ids for those with only 2 slashes (no 3rd in the end):
id <- which(nchar(gsub("[^/]", "", urls )) == 2)
slash_2 <- urls[id]

# find position of 3rd slash occurrence in strings
# (str_locate_all() comes from the stringr package):
library(stringr)
slash_stop <- unlist(lapply(str_locate_all(urls, "/"), "[[", 3))
slash_3 <- substring(urls, first = 1, last = slash_stop - 1)

# final result, replace the ones with 2 slashes,
# which are lacking in slash_3:
blogs <- slash_3; blogs[id] <- slash_2
p.s.: Thanks to Vincent Zoonekynd for helping out with the XML syntax.