This time I’m processing some Comma-Separated Value (CSV) files. CSV files are one of the lowest forms of semi-structured data, used for representing a simple table of data textually. The basic idea is easy (values with commas between them), so CSV files are widely used. You might think parsing them is trivial, and if you’re lucky it is. But values can themselves contain commas, so they often get quoted, and then any quotes inside the values have to be escaped. There are many variations on the “basic” idea.
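To make the problem concrete, here is a small sketch of what goes wrong with the obvious "split on commas" approach. The sample line is made up for illustration (it is not from any real file), and the code uses only the standard .NET string API.

```fsharp
// A hypothetical row whose name field contains a comma and is therefore quoted.
let line = "1732,\"Perez, Juan\",435.00,11-05-2002"

// Naive approach: split on every comma, ignoring the quotes.
let naiveFields = line.Split(',')

// The row has 4 logical fields, but the naive split cuts the quoted
// name in two and so produces 5 pieces.
printfn "%d pieces" naiveFields.Length
for f in naiveFields do
    printfn "[%s]" f
```

The quoted name comes back as two fragments, each still carrying a stray quote character, and that is before escaped quotes inside values even enter the picture.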
So, writing a “quick CSV parser” can lead you into a maze of twisty little passages. You don’t want to pull out lex and yacc and roll your own full-blown grammar parser, because the whole point of CSV files is they’re supposed to be lightweight and easy!
Next time you need to write a CSV parser, don’t! You don’t have to reinvent the wheel – other people have already written well-tested libraries you can use. I’ve been using the open source .NET FileHelpers library in my F# scripting exercise. (I tried the Jet ADO adapter first, but got a strange hard crash I couldn’t be bothered to debug. Anyway...)
It’s easy to use FileHelpers from F#. Here’s how, transliterating the example from the FileHelpers site. Let’s say this is the file “FileIn.txt”:
1732,Juan Perez,435.00,11-05-2002
554,Pedro Gomez,12342.30,06-02-2004
112,Ramiro Politti,0.00,01-02-2000
924,Pablo Ramirez,3321.30,24-11-2002
First, define a (typed) class to represent a row in the CSV file. For the example, the F# type definition might look like this:
[< DelimitedRecord(",") >]
type Customer =
  class
    val CustId : int
    val Name : string
    val Balance : decimal
    [< FieldConverter(ConverterKind.Date, "dd-MM-yyyy") >]
    val AddedDate : DateTime
  end
Note the use of attributes above (in [< … >] brackets). These are annotations that are carried into the compiled code, and can be accessed later by other tools using reflection. The attributes on the type above (e.g. DelimitedRecord) control how FileHelpers treats the overall representation of the file, and attributes on each of the fields (e.g. FieldConverter) are used to control the treatment of values in the corresponding columns in the file.
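If you haven’t seen attributes read back through reflection before, here is a minimal self-contained sketch of the mechanism. The SeparatorAttribute and Row types are hypothetical stand-ins invented for this example. They are not part of FileHelpers, and this is not a description of how FileHelpers itself is implemented.

```fsharp
open System

// Hypothetical attribute, standing in for something like DelimitedRecord.
[<AttributeUsage(AttributeTargets.Class)>]
type SeparatorAttribute(sep : string) =
    inherit Attribute()
    member this.Separator = sep

// Hypothetical record type annotated with the attribute.
[<Separator(",")>]
type Row() = class end

// A tool can later inspect the compiled type and recover the annotation.
let found = typeof<Row>.GetCustomAttributes(typeof<SeparatorAttribute>, false)
let separatorAttr = found.[0] :?> SeparatorAttribute
printfn "separator = %s" separatorAttr.Separator
```

A library can use the same kind of lookup on your type to decide how to split and convert each line, which is why the annotations alone are enough to describe the file layout.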
Create a parsing engine based on the type, like so:
let engine = new FileHelperEngine(typeof<Customer>)
and then you’re good to go:
let res = engine.ReadFile("FileIn.txt")
Actually, there is a wrinkle here. res is an obj array, but you’d prefer it to be a Customer array. You can’t apply the ordinary F# dynamic downcast (:?>) to the array as a whole, because the compiler doesn’t treat an obj array as a super-type of a Customer array; only the element type obj is a super-type of Customer. So you need to write and use an auxiliary function that casts each element, like this:
let downcast_Customer_Array = Array.map (fun (a:obj) -> a :?> Customer)
let res_Customers = downcast_Customer_Array res
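If you need this for more than one record type, the same idea can be written once, generically. The sketch below is self-contained and doesn’t touch FileHelpers: the Person type and the untyped array are hypothetical stand-ins, and it uses unbox in place of the :?> operator, which amounts to the same cast when the source is an obj.

```fsharp
// Hypothetical stand-in for a record type; not the Customer class above.
type Person = { Name : string }

// An obj array whose elements are really Person values, similar in shape
// to what an untyped API hands back.
let untyped : obj[] = [| box { Name = "Juan" }; box { Name = "Pedro" } |]

// Element-wise downcast for any target type 'T.
// Raises InvalidCastException if an element is not actually a 'T.
let downcastArray<'T> (xs : obj[]) : 'T[] =
    Array.map (fun (x : obj) -> unbox<'T> x) xs

let people : Person[] = downcastArray<Person> untyped
printfn "%s" people.[0].Name
```

The cast is still checked at run time for each element, so a mismatch between the declared type and what was actually parsed shows up as an exception rather than silently wrong data.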
You end up with an array of values of your newly defined type, which you can use in the ordinary way, e.g. the date for the first customer is:
res_Customers.[0].AddedDate
Easy, huh? Much easier than writing your own parser.
FileHelpers has a few other tricks if you need them. I’ve been using extra converter attributes to tell FileHelpers that some fields are quoted, and to help parse my dates. I’ve also been using a custom converter to parse a value which was itself a comma-separated list of values. (The only wrinkle there was not being able to use F# lists as .NET objects – I had to go via ResizeArray objects instead.)
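For the list-valued field, the conversion logic itself is tiny. Here is a rough sketch of just that part, assuming the items are plain strings separated by commas with optional surrounding spaces. The parseList name and the sample input are made up for this example, and the wiring into a FileHelpers custom converter is deliberately left out.

```fsharp
// Hypothetical helper: turn a field such as "red, green,blue" into a
// ResizeArray (System.Collections.Generic.List) of trimmed items.
let parseList (field : string) : ResizeArray<string> =
    let items = ResizeArray<string>()
    for piece in field.Split(',') do
        let trimmed = piece.Trim()
        // Skip empty pieces, e.g. from a trailing comma.
        if trimmed <> "" then items.Add(trimmed)
    items

let colours = parseList "red, green,blue"
printfn "%d items, first = %s" colours.Count colours.[0]
```

Returning a ResizeArray rather than an F# list keeps the result an ordinary .NET collection, which is the shape I ended up needing.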