Well that was painful

By Holy Zarquon's Singing Fish on August 11, 2010 7:16 AM

I just blew my time budget on the Perl survey stuff for looking at programming language info, and I want to offload what I had to do to get to this point.

Firstly the tl;dr: You can see the complete results for language usage at the survey website.

Here are the gory details. (A note on the R syntax: the <- operator assigns, a bit like the = operator in other languages, and there's also a rightward -> assignment operator, which means you can do all sorts of clever things in 1 line of code.)
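If the assignment arrows are new to you, here is a tiny sketch of how they behave. The variable names are made up for illustration and have nothing to do with the survey data.

```r
# Leftward assignment: the usual way to bind a name in R
x <- c(3, 1, 2)

# Rightward assignment: value on the left, name on the right
sort(x) -> x.sorted

# '=' also assigns at the top level, though '<-' is the convention
y = length(x)

print(x.sorted)
print(y)
```

The rightward form is handy at the end of a long expression built up interactively, which is how a lot of this analysis was written.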

What we asked for was the 5 programming languages that you use most, roughly in order of how much they're used. Then we asked where perl came in that list.

I'm doing this in R because it's optimised for statistical calculations. In this case, though, the operations required were a mixture of the kind of thing that R is good at and the kind of thing that perl is good at. I decided that dropping out to perl whenever it suited wasn't going to be a terribly good option, as I wanted the analysis file to be self-contained. Anyway, here's the source code (all of it in R); let me talk you through it.

So if we think about the original data file as if it's a database, then we have the language name stored in its rank column. So the most used language is in the column data$language_1, the second most used is in the column data$language_2, and so on through to 5. Finally, the position of perl is stored in a separate column (data$where_perl_belongs_in_list) so that eventually we can do a shuffle-along-one procedure to splice perl into the list correctly.

Source code time:

load("10_language_info.RData") # grab data we sliced out of the raw data earlier
for (i in 1:5) data[,i] <- tolower(data[,i]) # make the text all lower case

Here we're normalising the list of languages. The raw data has about 350 different language names. In fact there are only around 67 unique languages, but people are inconsistent in how they write them:

## languages <- sort(unique(unlist(c(sapply(data[1:5],levels)),use.names=FALSE))) 
# write.csv(languages,file="lang.csv")
#  some manual processing to normalise the languages list a bit 
languages <- read.csv(file="lang.csv")


for (i in 1:length(languages[,1])) data <- replace(data, data==toString(languages[i,1]), toString(languages[i,2]))
data <- replace(data, data=="rubh", "ruby") # a typo we missed in normalisation process
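To make the lookup idea concrete, here is a minimal sketch of the same kind of normalisation on toy data. The lookup table contents, the `responses` data frame and the `normalise` helper are all invented for illustration; the real lang.csv was edited by hand and isn't shown here. It uses match() rather than repeated replace() calls, which is a different technique from the loop above but gives the same sort of result.

```r
# Hypothetical lookup table: raw spelling on the left, normalised name on the right
lookup <- data.frame(raw   = c("perl5", "perl 5", "js", "javascript"),
                     clean = c("perl", "perl", "javascript", "javascript"),
                     stringsAsFactors = FALSE)

# Toy responses standing in for the survey's rank columns
responses <- data.frame(language_1 = c("perl5", "js", "ruby"),
                        language_2 = c("javascript", "perl 5", "c"),
                        stringsAsFactors = FALSE)

# Map a column through the lookup; spellings not in the table are left alone
normalise <- function(col) {
    hit <- match(col, lookup$raw)
    ifelse(is.na(hit), col, lookup$clean[hit])
}

# Assigning into responses[] keeps the data frame structure
responses[] <- lapply(responses, normalise)
print(responses)
```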

Now rather than having one column per rank, we need one column per language. For each rank, the following code explodes the column into as many columns as there are languages mentioned in that column (a standard statistical procedure called creating dummy variables):

library(dummies) # like use Whatever; in perl

# it's like a bomb in a mannequin factory!
l1 <- dummy.data.frame(as.data.frame(data$language_1))
l2 <- dummy.data.frame(as.data.frame(data$language_2))
l3 <- dummy.data.frame(as.data.frame(data$language_3))
l4 <- dummy.data.frame(as.data.frame(data$language_4))
l5 <- dummy.data.frame(as.data.frame(data$language_5))

# this is just to make the column names prettier for display
names(l1) <- sub("^data.language_.","",names(l1))
names(l2) <- sub("^data.language_.","",names(l2))
names(l3) <- sub("^data.language_.","",names(l3))
names(l4) <- sub("^data.language_.","",names(l4))
names(l5) <- sub("^data.language_.","",names(l5))

At this point each new data frame (the container we put the exploded variables in if you like) is just numbers 1 and 0 - we want to make this number the actual rank if it isn't zero:

l1 <- replace(l1,l1==1,1)
l2 <- replace(l2,l2==1,2)
l3 <- replace(l3,l3==1,3)
l4 <- replace(l4,l4==1,4)
l5 <- replace(l5,l5==1,5)
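If you don't have the dummies package to hand, the explode-and-rank idea can be tried on a toy column with base R. This is only a sketch: it uses model.matrix() instead of dummy.data.frame(), and `second.choice` and `l2.toy` are made-up names standing in for one of the real rank columns.

```r
# Toy stand-in for one rank column (the second most used language)
second.choice <- factor(c("perl", "python", "perl", "c"))

# Base-R dummy coding: one 0/1 column per level, with no intercept column
l2.toy <- as.data.frame(model.matrix(~ second.choice - 1))
names(l2.toy) <- sub("^second.choice", "", names(l2.toy))

# Turn the 1s into the rank number; the zeros stay zero
l2.toy <- l2.toy * 2
print(l2.toy)
```

Multiplying by the rank does the same job as the replace() calls above, because every cell is either 0 or 1 at that point.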

And we need a complete list of all languages mentioned.

languages.list <- unique(names(c(l1,l2,l3,l4,l5)))

This is a vector of zeros with one entry per response, which we'll use to fill in any language columns that a rank is missing:

cases <- rep(0,length(l1[,1]))

Then we need to make sure there are the same number of columns for each data frame:

for ( i in 1:length(languages.list) ) {
    if ( length (l1[[languages.list[i]]]) == 0 ) {
        l1[[languages.list[i]]] <- cases
    }
    if ( length (l2[[languages.list[i]]]) == 0 ) {
        l2[[languages.list[i]]] <- cases
    }
    if ( length (l3[[languages.list[i]]]) == 0 ) {
        l3[[languages.list[i]]] <- cases
    }
    if ( length (l4[[languages.list[i]]]) == 0 ) {
        l4[[languages.list[i]]] <- cases
    }
    if ( length (l5[[languages.list[i]]]) == 0 ) {
        l5[[languages.list[i]]] <- cases
    }
 }

And then we add the 5 data frames together element by element, as in matrix addition (a data frame with all-numeric columns behaves much like a matrix here). Note that we need to make sure that each data frame returns the columns in the same order, which is why every term is indexed by names(l1).

all.langs <- l1[names(l1)] + l2[names(l1)] + l3[names(l1)] + l4[names(l1)] + l5[names(l1)]
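The pad-then-add steps can be seen on two tiny data frames. This is a sketch under my own naming: `pad.columns`, `a`, `b` and `total` are hypothetical and not part of the survey script, and the helper does in one function what the if-blocks above do per data frame.

```r
# Add any missing language columns as zeros, then return them in a fixed order
pad.columns <- function(df, all.names) {
    absent <- setdiff(all.names, names(df))
    for (nm in absent) df[[nm]] <- rep(0, nrow(df))
    df[all.names]
}

# Two toy rank frames with different columns in different orders
a <- data.frame(perl = c(1, 0), c = c(0, 1))
b <- data.frame(python = c(2, 0), perl = c(0, 2))

all.names <- unique(c(names(a), names(b)))
a <- pad.columns(a, all.names)
b <- pad.columns(b, all.names)

# Element-wise addition now lines up column by column
total <- a + b
print(total)
```

Because each respondent names a language at most once across the ranks, each cell of the sum holds either a single rank or zero.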

perl <- data$where_perl_belongs_in_list

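# Make room for perl: any language ranked at or after perl's position moves down one place.
# R functions work on copies, so the adjusted row is returned rather than assigned
# into all.langs from inside the function.
insert.perl.order <- function(row.idx) {
    x <- all.langs[row.idx,]
    y <- perl[row.idx]
    change.logical <- which(x >= y)
    if (length(change.logical) > 0) x[change.logical] <- x[change.logical] + 1
    x
}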

all.langs$perl <- perl #append it to the data frame
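Here is the shuffle-along-one idea worked through for a single made-up respondent. I'm assuming that where_perl_belongs_in_list holds the position perl should end up in once it's spliced into the list; the `ranks` vector and `perl.position` are invented for the example.

```r
# One toy respondent: rank per language, 0 meaning "not mentioned"
ranks <- c(c = 1, python = 2, java = 3, ruby = 0)
perl.position <- 2   # assumed meaning: perl belongs second in the final list

# Everything at or after perl's position shuffles along one place
shift <- ranks >= perl.position
ranks[shift] <- ranks[shift] + 1

# Then perl takes the slot that was opened up
ranks["perl"] <- perl.position
print(ranks)
```

After the shuffle python is third, java is fourth, c stays first and ruby stays unmentioned, which is also why the report further down needs six rank levels rather than five.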

We're at an important point here, because all.langs can be glued to other R data structures created in other parts of the survey, so at some point we can work out which programmers are the smartest perl programmers (or something like that).

Next up there's some jiggery-pokery to make sure that R knows that there are six levels of variable in the data frame, otherwise it will only report on the actual counts, and won't report zeros:

for (i in 1:length(names(all.langs)) ) all.langs[,i] <- factor(all.langs[,i],levels=c(1:6) )
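A quick toy example shows why the explicit levels matter. The `ranks` vector is made up; 0 stands for "not mentioned", as in the data frames above.

```r
ranks <- c(1, 0, 3, 0, 1)   # toy column of ranks for one language

# Without declared levels, only the values that actually occur are counted
print(table(ranks))

# With levels 1 to 6 declared, unused ranks show up as zero counts,
# and the zeros (not a valid level) become NA
f <- factor(ranks, levels = 1:6)
tab <- table(f)
print(tab)
print(sum(is.na(f)))
```

The NA entries are what later turn into the non-user count when the column is summarised.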

This single line of code generates the counts for all languages:

lang.summary <- sapply(all.langs, summary)

In case you hadn't realised, R is a functional language at heart, and borrows much of its semantics from Scheme, a dialect of lisp.

Finally we want a report, so we make a new data frame that contains the summary statistics:

report.df <- data.frame('Most used'=integer(),
                    'Second most'=integer(),
                    'Third most'=integer(),
                    'Fourth most'=integer(),
                    'Fifth most'=integer(),
                    'Sixth most'=integer(),
                    'Non-users'=integer(),
                    'Total-users'=integer(),
                    'Percent users'=numeric(),
                    'Mean Rank'=numeric()
                    )

And we have to iterate through the lang.summary matrix to append lines to the report data frame:

for ( i in 1:dim(lang.summary)[2] ) {   # one report row per language column
    this.counts <- lang.summary[,i]     # counts at ranks 1 to 6, then the NA count

    # UGLY and possibly FRAGILE hack to remove non-perl users from counts
    # (relies on perl being the last column of lang.summary)
    this.counts[7] <- this.counts[7] - lang.summary[7,dim(lang.summary)[2]]

    this.counts[8] <- sum(this.counts[1:6])                                         # total users
    this.counts[9] <- round(this.counts[8]/sum(this.counts[1:7]) * 100,2)           # percent users
    this.counts[10] <- round(sum(this.counts[1:6]*c(1:6))/sum(this.counts[1:6]),2)  # mean rank
    report.df[i,] <- this.counts
}

rownames(report.df) <- names(lang.summary[1,])
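To see what the three derived columns mean, here is the same arithmetic on made-up numbers for a single language. None of these figures come from the survey.

```r
counts <- c(10, 5, 3, 1, 1, 0)   # hypothetical respondents at ranks 1 to 6
non.users <- 20                   # hypothetical respondents who never mention it

total.users <- sum(counts)
percent.users <- round(total.users / (total.users + non.users) * 100, 2)

# Mean rank is a weighted average: each rank weighted by how many people gave it
mean.rank <- round(sum(counts * 1:6) / total.users, 2)

print(c(total = total.users, percent = percent.users, mean.rank = mean.rank))
```

A lower mean rank means that the people who use the language tend to put it nearer the top of their list.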

Note the comment about the hack in the loop that builds the report. Getting all of this working just right was very fiddly, and over the last three days it has taken more than three times the time I'd budgeted for it. You'll also notice that the code is very procedural for a functional language. This is basically because statistical computing is often long periods of exploration on the command line (or GUI), followed by consolidating what you've done into a script. The script is basically a duplicate of the procedure you went through on the command line. On the other hand, it's done now, it's reasonably robust, and it's repeatable for future runs of the survey.

Tagged as:

perl survey R results
