Work has presented me with a small data manipulation exercise. That’s another opportunity to do some more scripting in F#!
This time I’m processing some Comma-Separated Value (CSV) files. CSV files are one of the lowest forms of semi-structured data, used for representing a simple table of data textually. The basic idea is easy – values with commas between them – so CSV files are widely used. You might think parsing them is trivial. It can be like that if you’re lucky, but sometimes the values can contain commas, so then often the values get quoted, but then any quotes in the values have to be escaped. There are many variations on the “basic” idea.
So, writing a “quick CSV parser” can lead you into a maze of twisty little passages. You don’t want to pull out lex and yacc and roll your own full-blown grammar parser, because the whole point of CSV files is they’re supposed to be lightweight and easy!
Next time you need to write a CSV parser, don’t! You don’t have to reinvent the wheel – other people have already written well-tested libraries you can use. I’ve been using the open source .net FileHelpers library in my F# scripting exercise. (I tried the jet ADO adapter first, but got a strange hard crash I couldn’t be bothered to debug. Anyway..)
It’s easy to use FileHelpers from F#. Here’s how, transliterating the example from the FileHelpers site. Let’s say this is the file “FileIn.txt“:
1732,Juan Perez,435.00,11-05-2002 554,Pedro Gomez,12342.30,06-02-2004 112,Ramiro Politti,0.00,01-02-2000 924,Pablo Ramirez,3321.30,24-11-2002
First, define a (typed) class to represent a row in the CSV file. For the example, the F# type definition might look like this:
[< DelimitedRecord(",") >] type Customer = class val CustId : int val Name : string val Balance : decimal [< FieldConverter(ConverterKind.Date, "dd-MM-yyyy") >] val AddedDate : DateTime end
Note the use of attributes above (in [< … >] brackets). These are annotations that are carried into the compiled code, and can be accessed later by other tools using reflection. The attributes on the type above (e.g. DelimitedRecord) control how FileHelpers treats the overall representation of the file, and attributes on each of the fields (e.g. FieldConverter) are used to control the treatment of values in the corresponding columns in the file.
Create a parsing engine based on the type, like so:
let engine = new FileHelperEngine(typeof<Customer>)
and then you’re good to go:
let res = engine.ReadFile("FileIn.txt")
Actually, there is a wrinkle here. res is an obj array, but you’d prefer it to be a Customer array. You can’t use the ordinary F# dynamic downcast directly, because the array isn’t a super-type itself (its type parameter is, here). So you need to write and use an auxiliary type-casting function, like this:
let downcast_Customer_Array = Array.map (fun (a:obj) -> a :?> Customer) let res_Customers = downcast_Customer_Array res
You end up with an array of your values in your newly defined type, which you can use in the ordinary way, e.g. the date for the first customer is:
Easy, huh? Much easier than writing your own parser.
FileHelpers has a few other tricks if you need them. I’ve been using extra converter attributes to tell FileHelpers that some fields are quoted, and to help parse my dates. I’ve also been using a custom converter to parse a value which was itself a comma-separated list of values. (The only wrinkle there was not being able to use F# lists as .net objects – I had to go via ResizeArray objects instead.)
Thanks for the code it is very useful for me right now. Just one quick question. “let downcast_Customer_Array = Array.map (fun (a:obj) -> a > Customer)” comes through with a smiley face in it. I am new to F# and not sure what the code should be. Any help would be apreciated.
Thanks for pointing that out Doug! It looks OK when I edit it, but I get the smiley face too when viewing it…
The characters are the dynamic downcast symbol. Instead of (smiley-face, then greater than), you should have (colon, then question mark, then greater than).
I’ll see if I can update the post somehow, or stop the smiley-ness.
OK – fixed it. It was a wordpress configurable setting:
settings > writing > Convert emoticons like 🙂 and 😛 to graphics on display
Thanks a lot and your code example was great. It really helped me get going in the right direction.
One other small fix you could put in your blog. In one line of code you declare the engine object like this;
let engine = new FileHelperEngine(typeof)
but the next time you reference it you refer to it as engine2;
let res = engine2.ReadFile(“FileIn.txt”)
I understood what you meant, but it confused me for a minute.
Thanks again for your effort with this. The example was a huge help to me. BTW, with this example I converted some MATLAB code I had for parsing a file and creating timeseries objects. The files contain FX trading data and are huge (10+ million lines of data). In MATLAB the process took 4-5 hours to complete. In F# I could do the entire file and some analysis in slightly over 2 minutes.
Thanks again Doug – I’ve fixed that engine2 typo. I’m pleasantly surprised it’s still helpful/relevant, given that the official F# release happened since the orginal post.
Awesome to get such good results using F# too – I’m sure Don would be happy to hear from you. 🙂
There are some comments about this sample on StackOverflow: http://stackoverflow.com/questions/4971196/f-class-constructor-with-no-args-error-using-filehelpers-wtih-f/4971671#4971671. I posted a suggestion how to make the library a little nicer (remove the need for mutable fields) by moving annotations from fields to constructor parameters… If you ever plan to create a new version, I think this would be nice.
Thanks Tomas! I’m consciously resisting the urge to fix the example in this main blog post, like I did with Doug’s changes. Instead, everyone should head on over to stackoverflow to see the up-to-date version.