Tokenizer¶
Motivation¶
There is no strictly enforced standard for how CSV files are structured. There is no schema, so every file you get may look completely different. This rules out a single strategy for tokenizing a given line of your CSV data.
Imagine a situation where the column delimiter is also present in the column data, like this:
FirstNameLastName;BirthDate
"Philipp,Wagner",1986/05/12
""Max,Mustermann",2014/01/01
A simple string.Split with a comma as the column delimiter will produce wrong data, so the line
needs to be split differently. And this is exactly where a Tokenizer fits in.
A Tokenizer is used to split a given line of your CSV data into its column values.
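Every Tokenizer can be passed to the parser interchangeably, because all of them share a common contract. The following sketch shows the assumed shape of this contract; the interface name ITokenizer and its namespace are assumptions here, while the Tokenize call matches the usage shown in the examples below:
namespace TinyCsvParser.Tokenizer
{
    // Assumed shape of the tokenizer contract; the Tokenize call matches
    // the tokenizer.Tokenize(...) usage in the examples below.
    public interface ITokenizer
    {
        // Splits a single line of CSV data into its column values:
        string[] Tokenize(string input);
    }
}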
Available Tokenizers¶
StringSplitTokenizer¶
The StringSplitTokenizer splits a line at a given column delimiter.
Philipp,Wagner,1986/05/12
Is tokenized into the following values:
Philipp
Wagner
1986/05/12
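A quick sketch of using the StringSplitTokenizer directly. Note that the constructor parameters shown here (a char array of delimiters and a flag that trims the input line) are assumptions, so check the StringSplitTokenizer source for the exact signature:
using TinyCsvParser.Tokenizer;

// Assumed constructor: the column delimiters as a char array, and a
// flag that controls whether the input line is trimmed first.
var tokenizer = new StringSplitTokenizer(new[] { ',' }, true);

var tokens = tokenizer.Tokenize("Philipp,Wagner,1986/05/12");
// tokens now holds: "Philipp", "Wagner", "1986/05/12"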
RFC4180Tokenizer¶
RFC 4180 proposes a specification for the CSV format, which is commonly accepted. You can use
the RFC4180Tokenizer to parse a CSV file in an RFC 4180-compliant format.
Example¶
Imagine an RFC 4180-compliant CSV file with person data that should be parsed. Each person has a Name,
an Age and a Description. The Name and Description may contain the column
delimiter and also double quotes.
Name, Age, Description
"Michael, Chester", 24, "Also goes by ""Mike"", among friends that is"
"Robert, Willliamson", , "All-around nice guy who always says hi"
The following example shows how to use the RFC4180Tokenizer for the given example data.
// Copyright (c) Philipp Wagner. All rights reserved.
// Licensed under the MIT license. See LICENSE file in the project root for full license information.

using NUnit.Framework;
using System;
using System.Linq;
using System.Text;
using TinyCsvParser.Mapping;
using TinyCsvParser.Tokenizer.RFC4180;

namespace TinyCsvParser.Test.Tokenizer
{
    [TestFixture]
    public class Rfc4180TokenizerTest
    {
        private class SampleEntity
        {
            public string Name { get; set; }

            public int? Age { get; set; }

            public string Description { get; set; }
        }

        private class SampleEntityMapping : CsvMapping<SampleEntity>
        {
            public SampleEntityMapping()
            {
                MapProperty(0, x => x.Name);
                MapProperty(1, x => x.Age);
                MapProperty(2, x => x.Description);
            }
        }

        [Test]
        public void RFC4180_CsvParser_Integration_Test()
        {
            // Use a " as Quote Character, a \\ as Escape Character and a , as Delimiter:
            var options = new Options('"', '\\', ',');

            // Initialize the Rfc4180 Tokenizer:
            var tokenizer = new RFC4180Tokenizer(options);

            // Now Build the Parser:
            CsvParserOptions csvParserOptions = new CsvParserOptions(true, tokenizer);
            SampleEntityMapping csvMapper = new SampleEntityMapping();
            CsvParser<SampleEntity> csvParser = new CsvParser<SampleEntity>(csvParserOptions, csvMapper);

            var stringBuilder = new StringBuilder()
                .AppendLine("Name, Age, Description")
                .AppendLine("\"Michael, Chester\",24,\"Also goes by \"\"Mike\"\", among friends that is\"")
                .AppendLine("\"Robert, Willliamson\", , \"All-around nice guy who always says hi\"");

            // Define the NewLine Character to split at:
            CsvReaderOptions csvReaderOptions = new CsvReaderOptions(new[] { Environment.NewLine });

            var result = csvParser
                .ReadFromString(csvReaderOptions, stringBuilder.ToString())
                .ToList();

            Assert.AreEqual(2, result.Count);
            Assert.AreEqual(true, result.All(x => x.IsValid));

            Assert.AreEqual("Michael, Chester", result[0].Result.Name);
            Assert.AreEqual(24, result[0].Result.Age);
            Assert.AreEqual("Also goes by \"Mike\", among friends that is", result[0].Result.Description);

            Assert.AreEqual("Robert, Willliamson", result[1].Result.Name);
            Assert.AreEqual(false, result[1].Result.Age.HasValue);
            Assert.AreEqual("All-around nice guy who always says hi", result[1].Result.Description);
        }
    }
}
RegularExpressionTokenizer¶
The RegularExpressionTokenizer is an abstract base class that uses a regular expression
to match a given line. So if you need to match a line with a regular expression of your own, you have to derive from
this base class.
The QuotedStringTokenizer is a good example to start with.
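As a rough sketch of what a custom implementation could look like, assuming the abstract member to override is a Regex property whose capture groups yield the column values (the class name and the regular expression below are hypothetical; consult the QuotedStringTokenizer source for the exact contract):
using System.Text.RegularExpressions;
using TinyCsvParser.Tokenizer.RegularExpressions;

// Hypothetical tokenizer that treats runs of non-semicolon characters
// as column values. This assumes the base class matches the line with
// the overridden Regex property and collects the group captures:
public class SemicolonTokenizer : RegularExpressionTokenizer
{
    private static readonly Regex SplitRegex = new Regex("([^;]+)", RegexOptions.Compiled);

    public override Regex Regex
    {
        get { return SplitRegex; }
    }
}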
QuotedStringTokenizer¶
The QuotedStringTokenizer is an implementation of a RegularExpressionTokenizer. It uses
a (rather complicated) regular expression to leave data inside double quotes ("") untouched, so a line
like:
"Philipp,Wagner",1986/05/12
Is tokenized into the following values:
Philipp,Wagner
1986/05/12
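You can also call the QuotedStringTokenizer directly as a quick check. The constructor argument (the column delimiter) and the Tokenize call below are taken from the examples in this document:
using TinyCsvParser.Tokenizer.RegularExpressions;

// The column delimiter is the only constructor argument, as in the
// full example below:
var tokenizer = new QuotedStringTokenizer(',');

var tokens = tokenizer.Tokenize("\"Philipp,Wagner\",1986/05/12");
// tokens now holds: "Philipp,Wagner" and "1986/05/12"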
Example¶
Imagine a CSV file that contains a list of persons with the following data:
FirstNameLastName;BirthDate
"Philipp,Wagner",1986/05/12
""Max,Mustermann",2014/01/01
The first name and the last name are separated by a comma, which is the same character as the column delimiter.
So the file can't be tokenized by simply splitting at the column delimiter, which is what the default
StringSplitTokenizer does.
This is where the QuotedStringTokenizer is needed!
The Tokenizer is set in the CsvParserOptions.
using NUnit.Framework;
using System;
using System.Linq;
using System.Text;
using TinyCsvParser.Mapping;
using TinyCsvParser.Tokenizer.RegularExpressions;

namespace TinyCsvParser.Test.Tokenizer
{
    [TestFixture]
    public class TokenizerExampleTest
    {
        private class Person
        {
            public string FirstNameWithLastName { get; set; }

            public DateTime BirthDate { get; set; }
        }

        private class CsvPersonMapping : CsvMapping<Person>
        {
            public CsvPersonMapping()
            {
                MapProperty(0, x => x.FirstNameWithLastName);
                MapProperty(1, x => x.BirthDate);
            }
        }

        [Test]
        public void QuotedStringTokenizerExampleTest()
        {
            CsvParserOptions csvParserOptions = new CsvParserOptions(true, new QuotedStringTokenizer(','));
            CsvReaderOptions csvReaderOptions = new CsvReaderOptions(new[] { Environment.NewLine });
            CsvPersonMapping csvMapper = new CsvPersonMapping();
            CsvParser<Person> csvParser = new CsvParser<Person>(csvParserOptions, csvMapper);

            var stringBuilder = new StringBuilder()
                .AppendLine("FirstNameLastName;BirthDate")
                .AppendLine("\"Philipp,Wagner\",1986/05/12")
                .AppendLine("\"Max,Mustermann\",2014/01/01");

            var result = csvParser
                .ReadFromString(csvReaderOptions, stringBuilder.ToString())
                .ToList();

            // Make sure we got 2 results:
            Assert.AreEqual(2, result.Count);

            // And all of them have been parsed correctly:
            Assert.IsTrue(result.All(x => x.IsValid));

            // Now check the values:
            Assert.AreEqual("Philipp,Wagner", result[0].Result.FirstNameWithLastName);
            Assert.AreEqual(1986, result[0].Result.BirthDate.Year);
            Assert.AreEqual(5, result[0].Result.BirthDate.Month);
            Assert.AreEqual(12, result[0].Result.BirthDate.Day);

            Assert.AreEqual("Max,Mustermann", result[1].Result.FirstNameWithLastName);
            Assert.AreEqual(2014, result[1].Result.BirthDate.Year);
            Assert.AreEqual(1, result[1].Result.BirthDate.Month);
            Assert.AreEqual(1, result[1].Result.BirthDate.Day);
        }
    }
}
FixedLengthTokenizer¶
Sometimes you need to parse a CSV file that is defined by fixed-width columns. The FixedLengthTokenizer addresses this problem and makes
it possible to define columns by their start and end position in a given line. The FixedLengthTokenizer takes a list of
FixedLengthTokenizer.ColumnDefinition, which define the columns of the file.
Example¶
In the following example the textual input is split into two columns. The first column spans indices 0 to 10, and the second column spans
indices 10 to 20 (the end index is exclusive). These values build the list of ColumnDefinition entries, which is passed into the FixedLengthTokenizer.
// Copyright (c) Philipp Wagner. All rights reserved.
// Licensed under the MIT license. See LICENSE file in the project root for full license information.

using NUnit.Framework;
using System.Text;
using TinyCsvParser.Tokenizer;

namespace TinyCsvParser.Test.Tokenizer
{
    [TestFixture]
    public class FixedLengthTokenizerTests
    {
        [Test]
        public void Tokenize_Line_Test()
        {
            // Each column is defined by its start index (inclusive) and
            // end index (exclusive):
            var columns = new[] {
                new FixedLengthTokenizer.ColumnDefinition(0, 10),
                new FixedLengthTokenizer.ColumnDefinition(10, 20),
            };

            var tokenizer = new FixedLengthTokenizer(columns);

            var input = new StringBuilder()
                .AppendLine("Philipp   Wagner    ")
                .ToString();

            var result = tokenizer.Tokenize(input);

            Assert.AreEqual("Philipp   ", result[0]);
            Assert.AreEqual("Wagner    ", result[1]);
        }
    }
}