
Moderna Pfizer Vaccine Update

In a previous post we looked at the potential effectiveness of the Pfizer-Biontech vaccine candidate. Today Moderna announced interim results from their study. I have to say that their press release was quite a bit more informative than the Pfizer release.

They report that only 5 of the 95 cases came from the vaccine group (94% efficacy!). This allows us to update our efficacy probability plots to include the Moderna vaccine. Recall that because Pfizer reported only that efficacy is “greater than 90%,” we don’t know whether that means the point estimate is greater than 90%, or that they are 95% certain that efficacy is above 90%. For completeness, we will include both in our analysis, with “point” indicating that the point estimate is slightly greater than 90%, and “bound” indicating the result if they are 95% certain that efficacy is above 90%. We’ll use the weakly informative prior from the Pfizer study design, though any weak prior will give similar results.


library(ggplot2)

# reference: https://pfe-pfizercom-d8-prod.s3.amazonaws.com/2020-09/C4591001_Clinical_Protocol.pdf

# prior interval (matches prior interval on page 103)
qbeta(c(.025,.975),.700102,1)


# posterior pfizer (bound)
cases_treatment <- 3
cases_control <- 94 - cases_treatment
theta_ci <- qbeta(c(.025,.975),cases_treatment+.700102,cases_control+1)
rate_ratio_ci <- theta_ci / (1-theta_ci)

# effectiveness
100 * (1 - rate_ratio_ci)

xx <- (1:90)/500
yy <- sapply(xx, function(x) dbeta(x,cases_treatment+.700102,cases_control+1))
xx <- 100 * (1 - xx / (1 - xx))
ggplot() + 
  geom_area(aes(x=xx,y=yy)) + 
  theme_bw() + 
  xlab("Vaccine Effectiveness") + 
  ylab("Posterior Density")

# posterior pfizer (point)
cases_treatment <- 8
cases_control <- 94 - cases_treatment
theta_ci <- qbeta(c(.025,.975),cases_treatment+.700102,cases_control+1)
rate_ratio_ci <- theta_ci / (1-theta_ci)

# effectiveness
100 * (1 - rate_ratio_ci)

xx1 <- (1:90)/500
yy1 <- sapply(xx1, function(x) dbeta(x,cases_treatment+.700102,cases_control+1))
xx1 <- 100 * (1 - xx1 / (1 - xx1))
ggplot() + 
  geom_area(aes(x=xx1,y=yy1)) + 
  theme_bw() + 
  xlab("Vaccine Effectiveness") + 
  ylab("Posterior Density")




# posterior moderna
cases_treatment <- 5
cases_control <- 95 - cases_treatment
theta_ci <- qbeta(c(.025,.975),cases_treatment+.700102,cases_control+1)
rate_ratio_ci <- theta_ci / (1-theta_ci)

# effectiveness
100 * (1 - rate_ratio_ci)



xx2 <- (1:90)/500
yy2 <- sapply(xx2, function(x) dbeta(x,cases_treatment+.700102,cases_control+1))
xx2 <- 100 * (1 - xx2 / (1 - xx2))
ggplot() + 
  geom_area(aes(x=xx2,y=yy2)) + 
  theme_bw() + 
  xlab("Vaccine Effectiveness") + 
  ylab("Posterior Density")


df <- rbind(
  data.frame(xx=xx,yy=yy,Company="Pfizer-Biontech (bound)"),
  data.frame(xx=xx1,yy=yy1,Company="Pfizer-Biontech (point)"),
  data.frame(xx=xx2,yy=yy2,Company="Moderna")
)
ggplot(df) + 
  geom_area(aes(x=xx,y=yy,fill=Company),alpha=.25,position = "identity") + 
  geom_line(aes(x=xx,y=yy,color=Company),size=1) + 
  theme_bw() + 
  xlab("Vaccine Effectiveness") + 
  ylab("Posterior Density")

[Plot: posterior densities of vaccine effectiveness for Moderna and both readings of the Pfizer-Biontech result]

Whether Moderna’s efficacy is likely to be higher or lower than Pfizer’s depends on how one interprets the Pfizer press release. Regardless, both show fantastic levels of protection.
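To make that comparison concrete, here is a rough sketch that draws from each posterior above and compares the case-split proportions. This is my own illustration, not anything from the press releases, and it ignores differences between the trials in size and follow-up time, so treat it as indicative only.

set.seed(1)
theta_moderna      <- rbeta(1e5, 5 + .700102, 90 + 1)
theta_pfizer_point <- rbeta(1e5, 8 + .700102, 86 + 1)
theta_pfizer_bound <- rbeta(1e5, 3 + .700102, 91 + 1)

# a smaller share of cases in the treatment arm means higher effectiveness
mean(theta_moderna < theta_pfizer_point)  # P(Moderna better), "point" reading
mean(theta_moderna < theta_pfizer_bound)  # P(Moderna better), "bound" reading

Additionally, Moderna reported some safety data.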

“A review of solicited adverse events indicated that the vaccine was generally well tolerated. The majority of adverse events were mild or moderate in severity. Grade 3 (severe) events greater than or equal to 2% in frequency after the first dose included injection site pain (2.7%), and after the second dose included fatigue (9.7%), myalgia (8.9%), arthralgia (5.2%), headache (4.5%), pain (4.1%) and erythema/redness at the injection site (2.0%). These solicited adverse events were generally short-lived. These data are subject to change based on ongoing analysis of further Phase 3 COVE study data and final analysis.”

To my eye, these also look fantastic. Based on my personal experience, they appear to be in the same ballpark as a flu shot. None of them are anything I'd mind experiencing yearly or twice yearly. Notably absent from this list is fever, which appears to be relatively common in other candidates and could really put a damper on vaccine uptake.

Another pressing question is whether the vaccine protects against severe disease. Moderna found 11 severe cases in the control group versus 0 in the treatment group. That difference is significant on its face, but it should be interpreted with care. Given that the vaccine is effective, the pressing question is whether subjects are less likely to develop severe disease given that they become infected. That is to say, in addition to preventing infections, does the vaccine make the disease milder if you do get infected?

> x <- matrix(c(0,5,11,90-11),nrow=2)
> x
     [,1] [,2]
[1,]    0   11
[2,]    5   79
> fisher.test(x)

	Fisher's Exact Test for Count Data

data:  x
p-value = 1
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
 0.00000 8.88491
sample estimates:
odds ratio 
         0 

None of the five cases in the treatment group were severe, versus 11 of the 90 in the placebo group. This is certainly trending in the right direction, but it is not even in the neighborhood of statistical significance yet (Fisher's test p-value = 1).


The Pfizer-Biontech Vaccine May Be A Lot More Effective Than You Think

tl;dr: The point estimate for vaccine effectiveness may be 97%, which is a lot higher than 90%.

Yesterday an announcement went out that the SARS-CoV-2 vaccine candidate developed by Pfizer and Biontech was determined to be effective during an interim analysis. This is fantastic news. Perhaps the best news of the year. It is, however, another example of science via press release. The release contains very limited information, and one can only wonder why they couldn’t take the time to write up a two-page report for the scientific community.

That said, we can draw some inferences from the release that may help put this in context. From the press release we know that a total of 94 COVID-19 cases were recorded.

“Upon the conclusion of those discussions, the evaluable case count reached 94 and the DMC performed its first analysis on all cases.”

However, we don’t know how many of these come from the control group, and how many come from the treatment group. We also don’t know how many total subjects are in the treatment and control arms. We do get two important quotes regarding efficacy.

“Vaccine candidate was found to be more than 90% effective in preventing COVID-19 in participants without evidence of prior SARS-CoV-2 infection in the first interim efficacy analysis

The case split between vaccinated individuals and those who received the placebo indicates a vaccine efficacy rate above 90%, at 7 days after the second dose.”

How should we interpret these? Was the observed rate of infection 90% lower in the treatment group, or are we to infer that the true (population parameter) efficacy is at least 90%? I would argue that the wording supports the latter. If they were just providing a point estimate, why express it as a bound? Why would they phrase it as “indicates a vaccine efficacy rate above 90%” if there was a reasonable probability that the actual vaccine efficacy rate is below 90%?

We can get some additional insight by looking at the study design, which specifies how the interim analysis is to be done. Specifically, on pages 102-103, it calls for a Bayesian analysis using a beta-binomial model with a weakly informative prior.

To me, the most compatible statistical translation of their press release is that we are sure with 95% probability that the vaccine’s efficacy is greater than 90%. Why “95% probability”? Well, 95% probability intervals are standard for the medical literature if you are doing Bayesian analysis (deal with it), and 95% intervals with 2.5% probabilities on each tail are littered throughout the design document. They are going to the FDA with these claims, so they will likely stick to the standard evidentiary rules.

Assuming my interpretation is correct, let’s back out how many cases were in the treatment group. Conditional on the total number of infections, the number of infections in the treatment group is binomially distributed. Combining this binomial likelihood with the beta prior gives the posterior, which we then transform from the binomial proportion to vaccine effectiveness. Vaccine effectiveness is one minus the infection rate ratio between the two groups, and the rate ratio is related to the binomial proportion as the odds.
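Written out (a sketch of the model the code below implements, using the protocol’s Beta(0.700102, 1) prior), with $x$ treatment cases out of $n = 94$ total:

$$\theta \mid x \sim \mathrm{Beta}(x + 0.700102,\ n - x + 1), \qquad \mathrm{VE} = 1 - \frac{\theta}{1 - \theta},$$

where $\theta$ is the probability that a given case came from the treatment arm, so that $\theta/(1-\theta)$ is the infection rate ratio under 1:1 randomization.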

> # reference: https://pfe-pfizercom-d8-prod.s3.amazonaws.com/2020-09/C4591001_Clinical_Protocol.pdf
> 
> # prior interval (matches prior interval on page 103)
> qbeta(c(.025,.975),.700102,1)
[1] 0.005148448 0.964483043
> 
> 
> # posterior
> cases_treatment <- 3
> cases_control <- 94 - cases_treatment
> theta_ci <- qbeta(c(.025,.975),cases_treatment+.700102,cases_control+1)
> rate_ratio_ci <- theta_ci / (1-theta_ci)
> 
> # effectiveness
> 100 * (1 - rate_ratio_ci)
[1] 98.98688 90.68447
> library(ggplot2)
> xx <- (0:60)/500
> yy <- sapply(xx, function(x) dbeta(x,cases_treatment+.700102,cases_control+1))
> xx <- 100 * (1 - xx / (1 - xx))
> ggplot() + 
+   geom_area(aes(x=xx,y=yy)) + 
+   theme_bw() + 
+   xlab("Vaccine Effectiveness") + 
+   ylab("Posterior Density")

The largest number of treatment cases that would have a lower bound greater than 90% is 3, corresponding to 91 cases in the control group. The estimated effectiveness of the vaccine is then 97%, with a probability interval from 90.7% to 99.0%. So sure, the effectiveness could be 90% or so, but odds are that it is a lot higher, as the posterior plot below shows.

[Plot: posterior density of vaccine effectiveness under the bound interpretation]
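As a quick check on the claim that 3 is the largest compatible treatment-case count, we can scan the 95% lower bound on effectiveness over possible counts (same prior and case total as above; this check is mine, not from the release):

# 95% lower bound on effectiveness for k treatment cases out of 94
sapply(0:5, function(k){
  theta_hi <- qbeta(.975, k + .700102, 94 - k + 1)
  100 * (1 - theta_hi / (1 - theta_hi))
})

The bound should drop below 90% at k = 4.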

To put this in perspective, consider the rate at which a 97% effective vaccine fails to provide protection, leading to an infection. A 90% effective vaccine has a 3.3 times higher failure rate, so if you vaccinated a population with a 90% effective vaccine and everyone were exposed, you’d expect to see 3.3 times more infections than if you had used a 97% effective vaccine.
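The arithmetic is just the ratio of the two failure rates:

(1 - 0.90) / (1 - 0.97)  # ~3.3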

I do note that the analysis plan calls for sequential stopping rules that preserve type I error; however, I don’t believe that any reported statistics would be adjusted for that. Unlike frequentist intervals, Bayesian intervals are unchanged no matter how many interim analyses you do.

There is a lot we don’t know, and hopefully we will get more scientific clarity in the coming weeks. As it stands now, it seems like this vaccine has efficacy way above my baseline expectations, perhaps even in the 97% range or higher.

I could be wrong in my interpretation of the press release, and they are in fact talking about the sample effectiveness rather than the true effectiveness. In that case, 8 of the 94 cases would have been in the treatment group, and the interval for the true effectiveness would be between 81.6% and 95.6%. The posterior distribution would look pretty darn good, but not quite as nice as the previous one.

[Plot: posterior density of vaccine effectiveness under the point interpretation]

It is important to have realistic expectations though. Efficacy is not the only metric that matters in determining how useful the vaccine is. Because the study population has only been followed for a few months, we do not know how long the vaccine provides protection. There is significant evidence of COVID-19 reinfection, so the expectation is that a vaccine will not provide permanent immunity. If the length of immunity is very short (e.g. 3 months), then it won’t be the silver bullet we are looking for. I’d be happy to see a year of immunity and ecstatic if it lasts two.

Additionally, there are the side effects. We’ll have to see what the results are from this trial, but in the phase II trial, something like 8% or 17% of subjects (I’m unsure of the dosage for the phase III) experienced a fever after their booster. It is likely that you’ll want to take the day after you get the second shot off work in case you don’t feel well. The rate of side effects may harm vaccine uptake.


Vivid: Toward A Next Generation Statistical User Interface

We are announcing the development of a new statistical user interface for R. I’m really excited about it, and I hope that you will be too. Vivid is a rich document format deeply integrated with RStudio that mixes user interface elements, code and output. I firmly believe that the next million R users are going to fall in love with R first through the lens of a statistical user interface. Vivid aims to be that interface.

Vivid is in its developmental infancy, but we would love feedback and collaborators. If you are interested in testing and/or helping us build this tool, let us know!

https://github.com/fellstat/vivid


Long Running Tasks With Shiny: Challenges and Solutions

One of the great additions to the R ecosystem in recent years is RStudio’s Shiny package. With it, you can easily whip up and share a user interface for a new statistical method in just a few hours. Today I want to share some of the methods and challenges that come up when the actual computation of a result takes a non-trivial amount of time (e.g., >5 seconds).

First Attempt

Shiny operates in a reactive programming framework. Fundamentally, this means that any time a UI element that affects the result changes, the result is recomputed. This happens automatically, with your analysis code running every time a widget changes. In many cases this is exactly what you want, and it makes Shiny programs concise and easy to write; however, for long running processes it can lead to frozen UI elements and a frustrating user experience.

The easiest solution is to use an Action Button and only run the analysis code when the button is clicked. Another important component is providing your user with feedback as to how long the analysis is going to take. Shiny has nice built-in progress indicators that allow you to do this.

library(shiny)

# Define the application UI
ui <- fluidPage(
   
   # Application title
   titlePanel("Long Run"),
   
   # Sidebar with the Run button
   sidebarLayout(
      sidebarPanel(
        actionButton('run', 'Run')
      ),
      
      # Show the analysis result
      mainPanel(
         tableOutput("result")
      )
   )
)

server <- function(input, output) {
  N <- 10
  
  result_val <- reactiveVal()
  observeEvent(input$run,{
    result_val(NULL)
    withProgress(message = 'Calculation in progress', {
      for(i in 1:N){
        
        # Long Running Task
        Sys.sleep(1)
        
        # Update progress
        incProgress(1/N)
      }
      result_val(quantile(rnorm(1000)))
    })
  })
   output$result <- renderTable({
     result_val()
   })
}

# Run the application 
shinyApp(ui = ui, server = server)

 

[Screenshot: the app showing the progress bar during the computation]

 

The above implementation has some of the things we want out of our interface:

  • The long running analysis is only executed when "Run" is clicked.
  • Progress is clearly displayed to the user.

It does have some serious downsides though:

  • If "Run" is clicked multiple times, the analysis is run back to back. A frustrated user can easily end up having to abort their session because they clicked too many times.
  • There is no way to cancel the calculation. The session's UI is locked while the computation takes place. Often a user will realize that some of the options they've selected are incorrect and will want to restart the computation. With this interface, they will have to wait however long the computation takes before they can fix the issue.
  • The whole server is blocked while the computation takes place. If multiple users are working with the app, the UIs of all users are frozen while any one user has an analysis in progress.

A Second Attempt With Shiny Async

Having the whole server blocked is a big issue if you want your app to scale beyond a single concurrent user. Fortunately, Shiny's new support for asynchronous processing can be used to remove this behavior. Instead of assigning a value to the reactive value result_val directly, we create a promise that executes the analysis in the future (using the future function) and assigns the result to result_val when it completes (using %...>%).

library(shiny)
library(promises)
library(future)
plan(multisession)  # multiprocess is deprecated in recent versions of future

# Define the application UI
ui <- fluidPage(
  
  # Application title
  titlePanel("Long Run Async"),
  
  # Sidebar with the Run button
  sidebarLayout(
    sidebarPanel(
      actionButton('run', 'Run')
    ),
    
    # Show the analysis result
    mainPanel(
      tableOutput("result")
    )
  )
)

server <- function(input, output) {
  N <- 10
  
  result_val <- reactiveVal()
  observeEvent(input$run,{
    result_val(NULL)
    future({
      print("Running...")
      for(i in 1:N){
        Sys.sleep(1)
      }
      quantile(rnorm(1000))
    }) %...>% result_val()
  })
  output$result <- renderTable({
    req(result_val())
  })
}

# Run the application 
shinyApp(ui = ui, server = server)

 

[Screenshot: the async app showing its result table]

 

When "Run" is clicked, the UI is now blocked only for the individual performing the analysis. Other users will be able to perform analyses of their own concurrently. That said, we still have some undesirable properties:

  • If "Run" is clicked multiple times, the analysis is run back to back.
  • There is no way to cancel the calculation.
  • The user cannot monitor progress. Shiny's progress bar cannot be updated from inside a future, so we've had to remove the progress bar from the UI. This is not a huge problem for tasks that take a few seconds, but for those that take minutes or hours, not knowing how long until the results show up can be very frustrating.

Third Time Is the Charm

In order to solve the cancellation and monitoring problems, we need to be able to communicate between the app and the inside of the promise. This can be accomplished with a file through which progress and interrupt requests are read and written. If the user clicks the cancel button, "interrupt" is written to the file. During the computation, the analysis code checks whether an interrupt has been signaled and, if so, throws an error. If no interrupt has been requested, the analysis code writes its progress to the file. If "Check Status" is clicked, Shiny reads the file and notifies the user of its contents.

The last addition to the code is a reactive value, nclicks, that prevents the Run button from triggering multiple concurrent analyses.

 

library(shiny)
library(promises)
library(future)
plan(multisession)  # multiprocess is deprecated in recent versions of future

ui <- fluidPage(
  titlePanel("Long Run Stoppable Async"),
  sidebarLayout(
    sidebarPanel(
      actionButton('run', 'Run'),
      actionButton('cancel', 'Cancel'),
      actionButton('status', 'Check Status')
    ),
    mainPanel(
      tableOutput("result")
    )
  )
)

server <- function(input, output) {
  N <- 10
  
  # Status File
  status_file <- tempfile()
  
  get_status <- function(){
    scan(status_file, what = "character",sep="\n")
  }
  
  set_status <- function(msg){
    write(msg, status_file)
  }
  
  fire_interrupt <- function(){
    set_status("interrupt")
  }
  
  fire_ready <- function(){
    set_status("Ready")
  }

  fire_running <- function(perc_complete){
    if(missing(perc_complete))
      msg <- "Running..."
    else
      msg <- paste0("Running... ", perc_complete, "% Complete")
    set_status(msg)
  }
  
  interrupted <- function(){
    get_status() == "interrupt"
  }

  # Delete file at end of session
  onStop(function(){
    print(status_file)
    if(file.exists(status_file))
      unlink(status_file)
  })
  
  # Create Status File
  fire_ready()
  
  
  nclicks <- reactiveVal(0)
  result_val <- reactiveVal()
  observeEvent(input$run,{
    
    # Don't do anything if analysis is already being run
    if(nclicks() != 0){
      showNotification("Already running analysis")
      return(NULL)
    }
    
    # Increment clicks and prevent concurrent analyses
    nclicks(nclicks() + 1)
    
    result_val(data.frame(Status="Running..."))
    
    fire_running()
    
    result <- future({
      print("Running...")
      for(i in 1:N){
        
        # Long Running Task
        Sys.sleep(1)
        
        # Check for user interrupts
        if(interrupted()){ 
          print("Stopping...")
          stop("User Interrupt")
        }
        
        # Notify status file of progress
        fire_running(100*i/N)
      }
      
      #Some results
      quantile(rnorm(1000))
    }) %...>% result_val()
    
    # Catch interrupt (or any other error) and notify user
    result <- catch(result,
                    function(e){
                      result_val(NULL)
                      print(e$message)
                      showNotification(e$message)
                    })
    
    # After the promise has been evaluated set nclicks to 0 to allow for another Run
    result <- finally(result,
                      function(){
                        fire_ready() 
                        nclicks(0)
                      })
    
    # Return something other than the promise so shiny remains responsive
    NULL
  })
  
  output$result <- renderTable({
    req(result_val())
  })
  
  # Register user interrupt
  observeEvent(input$cancel,{
    print("Cancel")
    fire_interrupt()
  })
  
  # Let user get analysis progress
  observeEvent(input$status,{
    print("Status")
    showNotification(get_status())
  })
}

# Run the application 
shinyApp(ui = ui, server = server)

 

[Screenshots: the running-status notification and cancellation in the final app]

 

All three of the problems with the original async code have been solved with this implementation. That said, some care should be taken when using async operations like this. It is possible for race conditions to occur, especially if you have multiple "Run" buttons in a single app.

 

Final Thoughts

It is great that Shiny now supports asynchronous programming. It allows applications to be scaled much more easily, especially when long running processes are present. Making use of these features does add some complexity: the final implementation has roughly three times more lines of code than the first (naive) attempt.

It is less than ideal that the user has to click a button to get the status of the computation. I'd much prefer to remove this button and just have a progress bar; however, this is currently not possible within Shiny proper, though it might be achievable by injecting some kludgy JavaScript magic.


New Book on Text Analysis for R

Books on R tend to be highly focused on either statisticians or programmers. There is a dearth of material to help those in typically less quantitative fields access the powerful tools in the R ecosystem. Enter Text Analysis with R for Students of Literature.

I haven’t done a deep read of the book, but the goal of opening up R to the Literature community is a laudable one. Plus the wordcloud package gets a shout-out!


New Version of the OpenStreetMap R Package

A new version of the OpenStreetMap package has been released to CRAN. OpenStreetMap 0.3.3 contains several minor improvements. I’ve removed the CloudMade tile set types, as they seem to have gone out of business. MapQuest has also been removed as they have moved to a new API. The mapbox type has been updated to use their new API.

The most important update is the ability to use custom tile servers. Suppose you have a server using the tile map service specification, with access URLs looking something like

http://api.someplace.com/{z}/{x}/{y}.png

where {z} is the map’s integer zoom level, {x} is the tile’s integer x position and {y} is the tile’s integer y position. You may pass this URL into openmap’s type argument to obtain a map using this tileset.
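As a minimal sketch (the server URL is the hypothetical one above, so this won’t actually fetch tiles unless such a server exists; the bounding coordinates are the same New York area ones used in the examples below):

library(OpenStreetMap)
upperLeft  <- c(40.9, -74.5)
lowerRight <- c(40.1, -73.2)
url <- "http://api.someplace.com/{z}/{x}/{y}.png"  # hypothetical tile server
map <- openmap(upperLeft, lowerRight, minNumTiles = 4, type = url)
plot(map)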

For example, if you need to access MapQuest’s tilesets, which now require a developer API key, you can use the new custom tileserver facility to access them. Below is an example with my (free) API key.

 

ul <- c(40.9,-74.5)
lr <- c(40.1,-73.2)
par(mfrow=c(2,3))
url <- "https://a.tiles.mapbox.com/v4/mapquest.streets-mb/{z}/{x}/{y}.png?access_token=pk.eyJ1IjoibWFwcXVlc3QiLCJhIjoiY2Q2N2RlMmNhY2NiZTRkMzlmZjJmZDk0NWU0ZGJlNTMifQ.mPRiEubbajc6a5y9ISgydg"
map <- openmap(ul,lr,minNumTiles=4, type=url)
plot(map)


url <- "https://a.tiles.mapbox.com/v4/mapquest.satellitenolabels/{z}/{x}/{y}.png?access_token=pk.eyJ1IjoibWFwcXVlc3QiLCJhIjoiY2Q2N2RlMmNhY2NiZTRkMzlmZjJmZDk0NWU0ZGJlNTMifQ.mPRiEubbajc6a5y9ISgydg"
map <- openmap(ul,lr,minNumTiles=4, type=url)
plot(map)


url <- "https://a.tiles.mapbox.com/v4/mapquest.satellite/{z}/{x}/{y}.png?access_token=pk.eyJ1IjoibWFwcXVlc3QiLCJhIjoiY2Q2N2RlMmNhY2NiZTRkMzlmZjJmZDk0NWU0ZGJlNTMifQ.mPRiEubbajc6a5y9ISgydg"
map <- openmap(ul,lr,minNumTiles=4, type=url)
plot(map)

url <- "https://a.tiles.mapbox.com/v4/mapquest.dark/{z}/{x}/{y}.png?access_token=pk.eyJ1IjoibWFwcXVlc3QiLCJhIjoiY2Q2N2RlMmNhY2NiZTRkMzlmZjJmZDk0NWU0ZGJlNTMifQ.mPRiEubbajc6a5y9ISgydg"
map <- openmap(ul,lr,minNumTiles=4, type=url)
plot(map)

url <- "https://a.tiles.mapbox.com/v4/mapquest.light/{z}/{x}/{y}.png?access_token=pk.eyJ1IjoibWFwcXVlc3QiLCJhIjoiY2Q2N2RlMmNhY2NiZTRkMzlmZjJmZDk0NWU0ZGJlNTMifQ.mPRiEubbajc6a5y9ISgydg"
map <- openmap(ul,lr,minNumTiles=4, type=url)
plot(map)

 

[Figure: the five MapQuest tile styles rendered by the code above]


Insert and Remove Performance of Boost’s flat_set vs. std::set

The standard way to represent an ordered set of numbers is with a binary tree. This offers a good mix of performance properties as the number of elements gets large. In particular, it offers O(log n) insertion and deletion, and O(log n) search for an element. Finding the ith element of the set takes more time, O(n) operations, compared to a vector, which has O(1) complexity for random access.

In C++, std::set implements an ordered set as a binary tree, hiding all the ugly details for you. But std::set is not always the right structure to use. Unless you are doing lots of inserts and removals in large sets, binary trees can be inefficient: they have to chase pointers, and the data is not stored in a contiguous block, leading to cache misses. This is why some have declared that you shouldn’t use set.

An alternative is to use a sorted std::vector as the set data structure. This improves random access to O(1) and worsens insertion/deletion to O(n). Boost provides a handy drop-in replacement for std::set called flat_set. You can mostly take any code using set and switch it to flat_set with no changes to the logic.
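Since the interface mirrors std::set, switching is mostly a matter of changing the type. A minimal sketch:

#include <boost/container/flat_set.hpp>

int main() {
    // swap in std::set<int> here and nothing else needs to change
    boost::container::flat_set<int> s;
    s.insert(7);
    s.insert(3);
    bool present = s.count(7) > 0; // lookup, as with std::set
    s.erase(3);                    // removal, as with std::set
    return present ? 0 : 1;
}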

Recently I was in a situation where I had performance critical code using many small sets (typically between 0 and 20 elements). This code does lots of insertions and deletions, so one would initially think that flat_set is not a good option with its O(n) complexity, but remember that complexity is an asymptotic statement as n grows, and I was relatively certain that my n was small. So which should be used? The only way to find out was to do some simulations.

For each set size, I generated a set<int> and a flat_set<int> containing equally spaced integers from 0 to 100,000,000, then inserted and removed 500,000 random integers in the same range and recorded the timings. Compiler optimization was set to high (-O3).

size  flat_set  std::set
2     0.02791   0.070053
4     0.031647  0.07919
8     0.038431  0.08474
16    0.0528    0.091744
32    0.0697    0.104424
64    0.085957  0.129225
128   0.1176    0.129537
256   0.15825   0.138755
512   0.253153  0.148642
1024  0.401831  0.156223
2048  0.718302  0.166177
4096  1.35207   0.176593
8192  2.5926    0.19331

Times are in seconds here, so for small sets (2-16 elements) flat_set is twice as fast, and it beats std::set all the way up through 128 elements. By 4096 elements, however, we are paying the asymptotic cost, and flat_set is >10x slower. flat_set is vector backed, so we know that it will be much faster at other operations like random access, iteration and find because it is in a contiguous block of memory. The surprising thing is that it is even faster at insertions and deletions, provided the set size is modest.
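To visualize the crossover, here is a quick ggplot2 sketch of the timings from the table above (data transcribed by hand):

library(ggplot2)
timings <- data.frame(
  size = 2^(1:13),
  container = rep(c("flat_set", "std::set"), each = 13),
  seconds = c(0.02791, 0.031647, 0.038431, 0.0528, 0.0697, 0.085957, 0.1176,
              0.15825, 0.253153, 0.401831, 0.718302, 1.35207, 2.5926,
              0.070053, 0.07919, 0.08474, 0.091744, 0.104424, 0.129225,
              0.129537, 0.138755, 0.148642, 0.156223, 0.166177, 0.176593,
              0.19331)
)
ggplot(timings, aes(x = size, y = seconds, color = container)) +
  geom_line() +
  geom_point() +
  scale_x_log10() +
  scale_y_log10() +
  xlab("Set Size") +
  ylab("Time (seconds)")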

If you are an R user, you can use flat_set very easily now with the new BH package. Simply add it to the LinkingTo field of your package’s DESCRIPTION (LinkingTo: BH) and Bob is your uncle. Below is the code that I used for the simulations (it uses Rcpp, but that can easily be taken out).

// [[Rcpp::depends(BH)]]
#include <boost/container/flat_set.hpp>
#include <set>
#include <cmath>
#include <ctime>
#include <iostream>
#include <Rcpp.h>

// [[Rcpp::export]]
void benchmarkSets() {
    GetRNGstate();
    std::clock_t start;
    double d1, d2;
    boost::container::flat_set<int> fs;
    std::set<int> ss;
    int n = 500000;      // number of insert/remove operations per container
    int max = 100000000; // integers are drawn from [0, max)
    for (int i = 1; i < 14; i++) {
        int s = (int)round(pow(2.0, i)); // set size: 2, 4, ..., 8192
        d1 = 0.0;
        d2 = 0.0;

        // fill both sets with s equally spaced integers
        fs.clear();
        fs.reserve(s * 2);
        ss.clear();
        for (int j = 0; j < s; j++) {
            int val = (int)round(((j + 1) / (double)s) * max);
            fs.insert(val);
            ss.insert(val);
        }

        // time n random insertions (each new value immediately erased) on flat_set
        start = std::clock();
        for (int j = 0; j < n; j++) {
            int rand = (int)floor(Rf_runif(0.0, max));
            bool ins = fs.insert(rand).second;
            if (ins)
                fs.erase(rand);
        }
        d1 += (std::clock() - start) / (double)CLOCKS_PER_SEC;

        // the same workload on std::set
        start = std::clock();
        for (int j = 0; j < n; j++) {
            int rand = (int)floor(Rf_runif(0.0, max));
            bool ins = ss.insert(rand).second;
            if (ins)
                ss.erase(rand);
        }
        d2 += (std::clock() - start) / (double)CLOCKS_PER_SEC;

        std::cout << s << ", " << d1 << ", " << d2 << "," << std::endl;
    }
    PutRNGstate();
}


The OpenStreetMap Package Opens Up

A new version of the OpenStreetMap package is now up on CRAN, and should propagate to all the mirrors in the next few days. The primary purpose of the package is to provide high resolution map/satellite imagery for use in your R plots. The package supports base graphics and ggplot2, as well as transformations between spatial coordinate systems.

The biggest change in the new version is the addition of dozens of tile servers, giving the user the option of many different map looks, including those from Bing, MapQuest and Apple.

nm = c("osm", "maptoolkit-topo",
"waze", "mapquest", "mapquest-aerial",
"bing", "stamen-toner", "stamen-terrain",
"stamen-watercolor", "osm-german", "osm-wanderreitkarte",
"mapbox", "esri", "esri-topo",
"nps", "apple-iphoto", "skobbler",
 "opencyclemap", "osm-transport",
"osm-public-transport", "osm-bbike", "osm-bbike-german")

png(width=1200,height=2200)
par(mfrow=c(6,4))
for(i in 1:length(nm)){
	print(nm[i])
	map = openmap(c(lat= 38.05025395161289,   lon= -123.03314208984375),
			c(lat= 36.36822190085111,   lon= -120.69580078125),
			minNumTiles=9,type=nm[i])
	plot(map)
	title(nm[i],cex.main=4)
}
dev.off()

You can also render maps from cloudmade.com, which hosts tons of map tilings. Simply set the “type” parameter to “cloudmade-<id>”, where <id> is the CloudMade identifier for the map you want to use. Here is a sample:

nm = c("cloudmade-2","cloudmade-999","cloudmade-998",
"cloudmade-7","cloudmade-1960","cloudmade-1155",
"cloudmade-12284")
#setCloudMadeKey("<your key>")
png("RPlot002.png",width=1300,height=800)
par(mfrow=c(2,4))
for(i in 1:length(nm)){
	print(nm[i])
	map = openmap(c(lat= 38.05025395161289,   lon= -123.03314208984375),
			c(lat= 36.36822190085111,   lon= -120.69580078125),
			minNumTiles=9,type=nm[i])
	plot(map)
	title(nm[i],cex.main=4)
}
dev.off()

Maps are initially put in a spherical Mercator projection, which is the standard for most (all?) map tiling systems, but they can easily be translated to long-lat (or any other projection) using the openproj function. Maps can be plotted in ggplot2 using the autoplot function.

mapLatLon = openproj(map)
autoplot(mapLatLon)

The package also has a Java GUI to help with map type selection and specification of the coordinates to bound your map. Clicking on the map will give you the latitude and longitude of the point clicked.

launchMapHelper()

Probably the main alternative to OpenStreetMap is the ggmap package. ggmap is an excellent package, and it is somewhat unfortunate that there is a significant duplication of effort between it and OpenStreetMap. That said, there are some differences that may help you decide which to use:

Reasons to favor OpenStreetMap:

  • More maps: OpenStreetMap supports more map types.
  • Better image resolution: ggmap only fetches one png from the server, and thus is limited to the resolution of that png, whereas OpenStreetMap can download many map tiles and stitch them together to get an arbitrarily high image resolution.
  • Transformations: OpenStreetMap can be used with any map coordinate system, whereas ggmap is limited to long-lat.
  • Base graphics: Both packages support ggplot2, but OpenStreetMap also supports base graphics.

Reasons to favor ggmap:
  • No Java dependency: ggmap does not require Java to be installed.
  • Geocoding: ggmap has functions to do reverse geocoding.
  • Google Maps: While OpenStreetMap has more map types, it currently does not support Google Maps.


Interview by DecisionStats

Ajay Ohri interviewed me on his popular DecisionStats blog. Topics discussed ranged widely from Fellows Statistics, to Deducer, to statnet, to Poker A.I., to Big Data.


Quickly Profiling Compiled Code within R on the Mac

This is a quick note on profiling your compiled code on the Mac. It is important not to guess when figuring out where the bottlenecks in your code are, and for this reason the R manual has several suggestions on how to profile compiled code running within R. All of the methods are platform dependent, with Linux requiring command line tools (sprof and oprofile). On the Mac, there are convenient built-in GUI tools to help you out.

Our test code is drawn from the ergm package, and fits a network model to a small graph. This model is fit using Markov chain Monte Carlo, so we expect most of the time to be spent drawing MCMC samples.

library(ergm)
flomarriage = network(flo,directed=FALSE)
flomarriage %v% "wealth" = c(10,36,27,146,55,44,20,8,42,103,48,49,10,48,32,3)
for(i in 1:10) gest = ergm(flomarriage ~ kstar(1:2) + absdiff("wealth") + triangle)

The quickest way to get an idea of where your program is spending time is to open Activity Monitor.app and select the R process while the above code is running. Then click Sample, which should result in something like:

In order to get down to the interesting code, you’ll have to wade through a bunch of recursive Rf_eval and do_begin calls used by R. Finally, though, we see the MCMCSample function in ergm.so, which contains almost all of the samples. So it is as we expected: the MCMC sampling takes up all the time. Underneath that we see calls to:

  • ChangeStats: calculates the network statistics
  • MH_TNT: calculates the metropolis proposal
  • ToggleEdge: alters the network to add (or remove) an edge

But how much these contribute is a bit difficult to see here. Enter the Time Profiler in Instruments.app. You can launch this program from within Xcode.app under Xcode – Open Developer Tool – Instruments. Once Instruments is open, click Library and drag the Time Profiler into the program. Then, while the R code is running, select ‘Attach to Process’ – R and click Record. You should get something like:
Like in the Activity Monitor, we have to drill down past all of the recursive Rf_eval calls, but once we do, we see that ChangeStats is taking 44.6% of the time, MH_TNT takes 30.9% and ToggleEdges takes 11.7%. We could further expand these out to identify possible targets for optimization within each of these functions.

So there you have it. Apple makes it really easy to profile your compiled code. One issue that I ran into: make sure that your Xcode is up to date and that Instruments is version 4.0 or above, as I experienced some crashes profiling R with older versions.