Friday, February 29, 2008

The Django stuff was going so well, I switched over to learning R a couple days ago just to depress myself. It's tricky to get into. The documentation is very scattered (kudos to those who are improving it at the wiki but that's a work in progress). I haven't yet found a way to get documentation for arbitrary libaries. The cute name 'R' makes it very hard to Google for R libraries. E.g. I haven't yet found a core graph or a library function that creates horizontal bar graphs (though maybe somebody will tell me to do vertical bar graphs with all the labels turned, then turn the whole image).

One headache in particular was trying to work with display strings that needed to be converted to numbers. I had a table containing this tiny fraction of data (out of 8 columns and 15000 rows):

ICD.Chapter    Crude.Rate
A00-B99 22.0
C00-D48 193.2
D50-D89 3.2 (Unreliable)
E00-E88 33.7
F01-F99 21.0
...

To graph the Crude Rate without the occasional "Unreliable" strings, I needed to break the information about which measurements were unreliable into a separate column (not losing the information), reduce the Crude Rate to be a number and convert it into an R 'numeric' type. I spent about 8 hours. The first four I managed to solve the first problem: I looked for a "contains(string, substring)" style function but there was none; "grep" produced errors that were really hanging me up whenever the input cell did not have the string I was looking for. Finally I wrote my own little contains function using regexpr:

contains <- function(pattern, string) { 
regexpr(pattern, string) != -1
}

It looks simple now, but it was tricky to get a function that worked properly when converting 15000 cells to 15000 new cells.

The next four hours were spent trying to remove the "(Unreliable)" string from the cells that had it. Again poring through documentation, I tried various approaches and got the closest with this one:

first.word <- function(x) { 
substr(x, 0, regexpr(" ", x)[1]-1)
}

The problem was that although this one worked fine on single strings when I tested it, it didn't work on turning a whole column into a whole new column.

Today at 11:00 am I turned to Excel to see how quickly I could do it in Excel. It's been a long time since I've used Excel formulas, but I knew this kind of thing could be done. In many ways the Excel formula documentation is worse, but it's a smaller set of controls (or at least a clear subset) and it was comprehensive and organized.

In Excel the two new columns are created like this from the H column:

=NOT(ISERROR(SEARCH("Unreliable", H2)))
=IF(ISTEXT(H2), VALUE(REPLACE(H2, SEARCH(" ",H2), 13, "")), H2)

I'm sure this could still be done in R but making those two formulas work for this particular table took me 20 min in Excel instead of four hours.

No comments:

Blog Archive

Creative Commons License
This work is licensed under a Creative Commons Attribution 3.0 Unported License.