Despite the fact that Excel is the most widespread application for data manipulation and (perhaps) analysis, R's support for the xls and xlsx file formats has left a lot to be desired. Fortunately, the XLConnect package was created to fill this void, and JGR 1.7-8 now integrates with XLConnect to load .xls and .xlsx documents into R.
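For those who prefer to skip the GUI, here is a minimal sketch of reading a worksheet directly with XLConnect (the file name "sales.xlsx" is a hypothetical example; XLConnect requires Java):

```r
# Load a worksheet from an Excel file into a data.frame.
# "sales.xlsx" is a placeholder; substitute your own file.
library(XLConnect)

wb  <- loadWorkbook("sales.xlsx")
dat <- readWorksheet(wb, sheet = 1)  # first sheet, header row assumed
str(dat)
```

The same workbook object can also be written back to with writeWorksheet, which is handy for round-tripping results to Excel users.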
Not fancy, but very useful.
Last week a question came up on Stack Overflow about determining whether a variable is distributed normally. Some of the answers reminded me of a common and pervasive misconception about how to apply tests against normality. I felt the topic was general enough to reproduce my comments here (with minor edits).
Misconception: If your statistical analysis requires normality, it is a good idea to use a preliminary hypothesis test to screen for departures from normality.
Shapiro's test, Anderson-Darling, and others are null hypothesis tests against the assumption of normality. These should not be used to determine whether to use normal theory statistical procedures. In fact, they are of virtually no value to the data analyst. Under what conditions are we interested in rejecting the null hypothesis that the data are normally distributed? I, personally, have never come across a situation where a normality test is the right thing to do. The problem is that when the sample size is small, even big departures from normality are not detected, and when your sample size is large, even the smallest deviation from normality will lead to a rejected null.
Let's look at a small sample example:
> set.seed(100)
> x <- rbinom(15,5,.6)
> shapiro.test(x)

        Shapiro-Wilk normality test

data:  x
W = 0.8816, p-value = 0.0502

> x <- rlnorm(20,0,.4)
> shapiro.test(x)

        Shapiro-Wilk normality test

data:  x
W = 0.9405, p-value = 0.2453
In both of these cases (binomial and lognormal variates) the p-value is > 0.05, causing a failure to reject the null (that the data are normal). Does this mean we should conclude that the data are normal? (hint: the answer is no). Failure to reject is not the same thing as accepting. This is hypothesis testing 101.
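We can put a rough number on how often small samples slip through. The sketch below (base R only; the 0.05 cutoff and exponential alternative are my choices for illustration) estimates how often Shapiro-Wilk rejects clearly non-normal exponential data at n = 15 versus n = 5000:

```r
# Estimate the rejection rate of shapiro.test() against exponential
# (strongly skewed, decidedly non-normal) data at two sample sizes.
set.seed(1)
reject_rate <- function(n, reps = 200) {
  mean(replicate(reps, shapiro.test(rexp(n))$p.value < 0.05))
}

small <- reject_rate(15)    # misses a large fraction of the time
large <- reject_rate(5000)  # rejects essentially always
```

At n = 15 the test misses a gross departure from normality a substantial fraction of the time, while at n = 5000 it flags essentially every sample, which is exactly the pathology described above.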
But what about larger sample sizes? Let's take the case where the distribution is very nearly normal.
> library(nortest)
> x <- rt(500000,200)
> ad.test(x)

        Anderson-Darling normality test

data:  x
A = 1.1003, p-value = 0.006975

> qqnorm(x)
> hist(x,breaks=100)
Here we are using a t-distribution with 200 degrees of freedom. The Q-Q and histogram plots show the distribution is closer to normal than any distribution you are likely to see in the real world, yet the test rejects normality with a very high degree of confidence.
Does the significant test against normality mean that we should not use normal theory statistics in this case? (Another hint: the answer is no 🙂 )
- In R run: install.packages('rgdal',repos="http://www.stats.ox.ac.uk/pub/RWin")
- Download and install GDAL 1.8 Complete and PROJ framework v4.7.0-2 from: http://www.kyngchaos.com/software/frameworks
- Download the latest version of rgdal from CRAN. Currently 0.7-1
- Run Terminal.app, and in the same directory as rgdal_0.7-1.tar.gz run:
R CMD INSTALL --configure-args='--with-gdal-config=/Library/Frameworks/GDAL.framework/Programs/gdal-config --with-proj-include=/Library/Frameworks/PROJ.framework/Headers --with-proj-lib=/Library/Frameworks/PROJ.framework/unix/lib' rgdal_0.7-1.tar.gz
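Once the install finishes, a quick sanity check from within R confirms that rgdal actually linked against the frameworks (both functions below are part of rgdal itself):

```r
# Verify that rgdal loaded and report the GDAL/PROJ versions it was built against.
library(rgdal)
getGDALVersionInfo()      # should report the GDAL 1.8.x release
projInfo(type = "proj")   # lists the PROJ.4 projections available
```

If library(rgdal) fails with a message about a missing shared library, the configure arguments most likely pointed at the wrong framework paths.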
Another titan of computer science died yesterday. Dennis Ritchie, father of C and co-creator of UNIX, passed away at 70. I enjoyed this post by Herb Sutter, which gives an idea of the scale of the cognitive leap he had to make in order to believe he could make a universal programming language competitive with platform-specific assembly.
For me it raises the modern question: is it possible to build a dynamic modern programming language (a la Python, Ruby, R, etc.) that performs competitively with C? Most examples that I can think of are off by two orders of magnitude.
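You can see the gap within R itself by comparing an interpreted loop with the equivalent C-backed vectorized call. A rough sketch (absolute timings will vary by machine; the point is the ratio):

```r
# Sum ten million numbers with an interpreted R loop vs the C-backed sum().
x <- runif(1e7)

loop_time <- system.time({
  s <- 0
  for (v in x) s <- s + v
})["elapsed"]

vec_time <- system.time(s2 <- sum(x))["elapsed"]

# Same answer, but the interpreted loop is typically one to two
# orders of magnitude slower than the compiled implementation.
```

This is essentially the overhead any dynamic language has to claw back (via JITs, vectorization, or dropping down to C) to compete.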
There have been some exciting developments in the Deducer ecosystem over the summer, which should make it into a CRAN release in the next few months. Today I'm going to give a quick sneak peek at an OpenStreetMap-R connection with accompanying GUI. This post will just show the non-GUI components.
The first part of the project was to create a way to download and plot OpenStreetMap data from either Mapnik or Bing in a local R instance. Before we can do that, however, we need to install DeducerSpatial.
install.packages(c("Deducer","sp","rgdal","maptools"))
install.packages("UScensus2000")

#get development versions
install.packages(c("JGR","Deducer"), , "http://rforge.net", type="source")
install.packages("DeducerSpatial", , "http://r-forge.r-project.org", type="source")
Note that you will need rgdal. And you will need your development tools to install the development versions of JGR, Deducer and DeducerSpatial.
Plot an Open Street Map Image
We are going to take a look at the median age of households in the 2000 California census. First, let's see if we can get the OpenStreetMap image for that area.
#load packages
library(DeducerSpatial)
library(UScensus2000)

#create an OpenStreetMap image
#(upper-left and lower-right corners given as lat/lon pairs)
lat <- c(43.834526782236814, 30.334953881988564)
lon <- c(-131.0888671875, -107.8857421875)
southwest <- openmap(c(lat[1],lon[1]), c(lat[2],lon[2]), zoom=5, 'osm')
plot(southwest, raster=TRUE)
Note that plot has an argument 'raster' which determines if the image is plotted as a raster image. 'zoom' controls the level of detail in the image. Some care needs to be taken in choosing the right level of zoom, as you can end up trying to pull street level images for the entire world if you are not careful.
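To see why zoom matters, note that the standard slippy-map tiling scheme (this figure comes from the OSM tiling convention, not from the package itself) covers the world with 4^zoom tiles, so each extra zoom level quadruples the number of tiles for a fixed area:

```r
# Number of 256x256 map tiles covering the whole world at each zoom level.
zoom  <- 0:18
tiles <- 4^zoom

tiles[zoom == 5]   # zoom 5: about a thousand tiles worldwide
tiles[zoom == 18]  # zoom 18 (street level): tens of billions of tiles
```

Requesting a continent-sized bounding box at street-level zoom therefore means downloading an enormous number of tiles, which is the trap the warning above refers to.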
Make a Choropleth Plot
Next, we can add a choropleth to the plot. We bring in the census data from the UScensus2000 package and transform it to the Mercator projection using spTransform.
#load in California data and transform coordinates to the Mercator projection
data(california.tract)
california.tract <- spTransform(california.tract, osm())

#add median age choropleth
choro_plot(california.tract, dem = california.tract@data[,'med.age'],
    legend.title = 'Median Age')
Use Aerial Imagery
We can also easily use Bing satellite imagery instead.
southwest <- openmap(c(lat[1],lon[1]), c(lat[2],lon[2]), 5, 'bing')
plot(southwest, raster=FALSE)
choro_plot(california.tract, dem = california.tract@data[,'med.age'], alpha=.8,
    legend.title = 'Median Age')
One other fun thing to note is that the image tiles are cached so they do not always need to be re-downloaded.
> system.time(southwest <- openmap(c(lat[1],lon[1]),c(lat[2],lon[2]),zoom=7,'bing'))
   user  system elapsed 
  9.502   0.547  17.166 
> system.time(southwest <- openmap(c(lat[1],lon[1]),c(lat[2],lon[2]),zoom=7,'bing'))
   user  system elapsed 
  9.030   0.463   9.169 
Notice how the elapsed time in the second call is about half that of the first; this is due to caching.
This just scratches the surface of what is possible with the new package. You can plot any type of spatial data (points, lines, etc.) supported by the sp package, so long as it is in the Mercator projection. Also, there is a full-featured GUI to help you load your data and make plots, but I'll talk about that in a later post.