Tokenizer

Motivation

There is no single standard for how CSV files are formatted. There is no schema, so every file you receive may look completely different. This rules out a single strategy for tokenizing a line of your CSV data.

Imagine a situation where the column delimiter is also present in the column data, like this:

FirstNameLastName,BirthDate
"Philipp,Wagner",1986/05/12
"Max,Mustermann",2014/01/01

A simple string.Split with a comma as the column delimiter splits the quoted name into two values and so produces wrong data. The line needs to be split differently, and this is exactly where a Tokenizer fits in.

A Tokenizer splits a given line of your CSV data into its column values.
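To make the problem concrete, the following minimal sketch applies a plain string.Split to the first data line from above. The class and method names (NaiveSplitDemo, SplitAtComma) are made up for this illustration and are not part of TinyCsvParser.

```csharp
using System;

public static class NaiveSplitDemo
{
    // Splits a line at every comma and ignores quoting entirely.
    public static string[] SplitAtComma(string line)
    {
        return line.Split(',');
    }

    public static void Main()
    {
        string[] tokens = SplitAtComma("\"Philipp,Wagner\",1986/05/12");

        // Prints three fragments instead of the two intended columns:
        //   "Philipp
        //   Wagner"
        //   1986/05/12
        foreach (string token in tokens)
        {
            Console.WriteLine(token);
        }
    }
}
```

The quoted name is cut in half and the quote characters stay in the data, which is why a quote-aware Tokenizer is needed for this kind of file.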

Available Tokenizers

StringSplitTokenizer

The StringSplitTokenizer splits a line at a given column delimiter. With a comma as the delimiter, the line:

Philipp,Wagner,1986/05/12

is tokenized into the following values:

  • Philipp

  • Wagner

  • 1986/05/12

RFC4180Tokenizer

RFC 4180 proposes a specification for the CSV format, which is commonly accepted. You can use the RFC4180Tokenizer to parse a CSV file in an RFC 4180-compliant format.

Example

Imagine you need to parse an RFC 4180-compliant CSV file that contains persons. Each person has a Name, an Age and a Description. The Name and the Description may contain the column delimiter and also double quotes.

Name, Age, Description
"Michael, Chester", 24, "Also goes by ""Mike"", among friends that is"
"Robert, Willliamson", , "All-around nice guy who always says hi"

The following example shows how to use the RFC4180Tokenizer for the given example data.

// Copyright (c) Philipp Wagner. All rights reserved.
// Licensed under the MIT license. See LICENSE file in the project root for full license information.

using NUnit.Framework;
using System;
using System.Linq;
using System.Text;
using TinyCsvParser.Mapping;
using TinyCsvParser.Tokenizer.RFC4180;

namespace TinyCsvParser.Test.Tokenizer
{
        [TestFixture]
        public class Rfc4180TokenizerTest
        {
                private class SampleEntity
                {
                        public string Name { get; set; }

                        public int? Age { get; set; }

                        public string Description { get; set; }
                }

                private class SampleEntityMapping : CsvMapping<SampleEntity>
                {
                        public SampleEntityMapping()
                        {
                                MapProperty(0, x => x.Name);
                                MapProperty(1, x => x.Age);
                                MapProperty(2, x => x.Description);
                        }
                }

                [Test]
                public void RFC4180_CsvParser_Integration_Test()
                {
                        // Use a " as Quote Character, a \ as Escape Character and a , as Delimiter.
                        var options = new Options('"', '\\', ',');

                        // Initialize the Rfc4180 Tokenizer:
                        var tokenizer = new RFC4180Tokenizer(options);

                        // Now Build the Parser:
                        CsvParserOptions csvParserOptions = new CsvParserOptions(true, tokenizer);
                        SampleEntityMapping csvMapper = new SampleEntityMapping();
                        CsvParser<SampleEntity> csvParser = new CsvParser<SampleEntity>(csvParserOptions, csvMapper);


                        var stringBuilder = new StringBuilder()
                                .AppendLine("Name, Age, Description")
                                .AppendLine("\"Michael, Chester\",24,\"Also goes by \"\"Mike\"\", among friends that is\"")
                                .AppendLine("\"Robert, Willliamson\", , \"All-around nice guy who always says hi\"");

                        // Define the NewLine Character to split at:
                        CsvReaderOptions csvReaderOptions = new CsvReaderOptions(new[] { Environment.NewLine });

                        var result = csvParser
                                .ReadFromString(csvReaderOptions, stringBuilder.ToString())
                                .ToList();

                        Assert.AreEqual(2, result.Count);

                        Assert.AreEqual(true, result.All(x => x.IsValid));

                        Assert.AreEqual("Michael, Chester", result[0].Result.Name);
                        Assert.AreEqual(24, result[0].Result.Age);
                        Assert.AreEqual("Also goes by \"Mike\", among friends that is", result[0].Result.Description);

                        Assert.AreEqual("Robert, Willliamson", result[1].Result.Name);
                        Assert.AreEqual(false, result[1].Result.Age.HasValue);
                        Assert.AreEqual("All-around nice guy who always says hi", result[1].Result.Description);
                }
        }
}
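To see what a tokenizer has to do for such a line, here is a hand-written sketch of the two quoting rules the test relies on: a delimiter inside a quoted field is ordinary data, and a doubled quote inside a quoted field stands for one literal quote. This is not the RFC4180Tokenizer implementation. The name Rfc4180Sketch is hypothetical, and the sketch ignores the escape character that the library's Options accept.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class Rfc4180Sketch
{
    public static List<string> Tokenize(string line, char delimiter = ',', char quote = '"')
    {
        var tokens = new List<string>();
        var current = new StringBuilder();
        bool insideQuotes = false;

        for (int i = 0; i < line.Length; i++)
        {
            char c = line[i];

            if (insideQuotes)
            {
                if (c == quote && i + 1 < line.Length && line[i + 1] == quote)
                {
                    // A doubled quote inside a quoted field is one literal quote.
                    current.Append(quote);
                    i++;
                }
                else if (c == quote)
                {
                    // A single quote closes the quoted field.
                    insideQuotes = false;
                }
                else
                {
                    // Everything else, including the delimiter, is data.
                    current.Append(c);
                }
            }
            else if (c == quote)
            {
                insideQuotes = true;
            }
            else if (c == delimiter)
            {
                // A delimiter outside quotes ends the current column.
                tokens.Add(current.ToString());
                current.Clear();
            }
            else
            {
                current.Append(c);
            }
        }

        // The last column has no trailing delimiter.
        tokens.Add(current.ToString());
        return tokens;
    }

    public static void Main()
    {
        var tokens = Tokenize("\"Michael, Chester\",24,\"Also goes by \"\"Mike\"\", among friends that is\"");

        // Prints:
        //   Michael, Chester
        //   24
        //   Also goes by "Mike", among friends that is
        foreach (string token in tokens)
        {
            Console.WriteLine(token);
        }
    }
}
```

The sketch keeps whitespace outside of quotes as part of the column value and does not handle line breaks inside quoted fields, so it only illustrates the quoting rules.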

RegularExpressionTokenizer

The RegularExpressionTokenizer is an abstract base class that uses a regular expression to match the column values of a given line. So if you need to tokenize a line with your own regular expression, you have to derive from this base class.

The QuotedStringTokenizer is a good example to start with.
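The general pattern can be sketched without the library: an abstract base class owns the matching logic, and each derived class only supplies the regular expression. All names below (RegexTokenizerSketch, Pattern, SemicolonTokenizerSketch) are hypothetical and only mirror the idea. They are not the members of TinyCsvParser's RegularExpressionTokenizer.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical base class for illustration; not TinyCsvParser's actual type.
public abstract class RegexTokenizerSketch
{
    // Each derived class supplies the pattern that matches one column value.
    protected abstract Regex Pattern { get; }

    // Returns every match of the pattern in the line as one token.
    public string[] Tokenize(string line)
    {
        return Pattern.Matches(line)
            .Cast<Match>()
            .Select(match => match.Value)
            .ToArray();
    }
}

// Treats every run of characters that is not a semicolon as one column.
public sealed class SemicolonTokenizerSketch : RegexTokenizerSketch
{
    private static readonly Regex ColumnPattern = new Regex("[^;]+");

    protected override Regex Pattern
    {
        get { return ColumnPattern; }
    }
}

public static class RegexTokenizerDemo
{
    public static void Main()
    {
        string[] tokens = new SemicolonTokenizerSketch().Tokenize("Philipp;Wagner;1986/05/12");

        // Prints: Philipp | Wagner | 1986/05/12
        Console.WriteLine(string.Join(" | ", tokens));
    }
}
```

Note that this simple pattern skips empty columns, because it only matches runs of at least one character.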

QuotedStringTokenizer

The QuotedStringTokenizer is an implementation of a RegularExpressionTokenizer. It uses a (rather complicated) regular expression to leave data inside double quotes untouched, so a line like:

"Philipp,Wagner",1986/05/12

is tokenized into the following values:

  • Philipp,Wagner

  • 1986/05/12
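The same result can be reproduced with a much simpler regular expression than the one the library uses. The sketch below is an illustration only and is not the QuotedStringTokenizer's pattern: it splits at every comma that is followed by an even number of double quotes, which means the comma lies outside a quoted value. The name QuotedSplitSketch is made up for this example.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class QuotedSplitSketch
{
    // Matches a comma only when an even number of double quotes follows it,
    // which means the comma is not inside a quoted value.
    private static readonly Regex CommaOutsideQuotes =
        new Regex(",(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)");

    public static string[] Tokenize(string line)
    {
        return CommaOutsideQuotes.Split(line)
            .Select(token => token.Trim('"')) // drop the surrounding quotes
            .ToArray();
    }

    public static void Main()
    {
        string[] tokens = Tokenize("\"Philipp,Wagner\",1986/05/12");

        // Prints:
        //   Philipp,Wagner
        //   1986/05/12
        foreach (string token in tokens)
        {
            Console.WriteLine(token);
        }
    }
}
```

The sketch assumes balanced quotes and does not treat doubled quotes inside a value specially, so it is only suitable for simple input like the line above.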

Example

Imagine a CSV file that contains a list of persons with the following data:

FirstNameLastName,BirthDate
"Philipp,Wagner",1986/05/12
"Max,Mustermann",2014/01/01

The first name and the last name are separated by a comma, which is the same character as the column delimiter. So the file can’t be tokenized by simply splitting at the column delimiter with the default StringSplitTokenizer.

This is where the QuotedStringTokenizer is needed!

The Tokenizer is set in the CsvParserOptions.

using NUnit.Framework;
using System;
using System.Linq;
using System.Text;
using TinyCsvParser.Mapping;
using TinyCsvParser.Tokenizer.RegularExpressions;

namespace TinyCsvParser.Test.Tokenizer
{
    [TestFixture]
    public class TokenizerExampleTest
    {
        private class Person
        {
            public string FirstNameWithLastName { get; set; }
            public DateTime BirthDate { get; set; }
        }

        private class CsvPersonMapping : CsvMapping<Person>
        {
            public CsvPersonMapping()
            {
                MapProperty(0, x => x.FirstNameWithLastName);
                MapProperty(1, x => x.BirthDate);
            }
        }

        [Test]
        public void QuotedStringTokenizerExampleTest()
        {
            CsvParserOptions csvParserOptions = new CsvParserOptions(true, new QuotedStringTokenizer(','));
            CsvReaderOptions csvReaderOptions = new CsvReaderOptions(new[] { Environment.NewLine });
            CsvPersonMapping csvMapper = new CsvPersonMapping();
            CsvParser<Person> csvParser = new CsvParser<Person>(csvParserOptions, csvMapper);

            var stringBuilder = new StringBuilder()
                .AppendLine("FirstNameLastName,BirthDate")
                .AppendLine("\"Philipp,Wagner\",1986/05/12")
                .AppendLine("\"Max,Mustermann\",2014/01/01");

            var result = csvParser
                .ReadFromString(csvReaderOptions, stringBuilder.ToString())
                .ToList();

            // Make sure we got 2 results:
            Assert.AreEqual(2, result.Count);

            // And all of them have been parsed correctly:
            Assert.IsTrue(result.All(x => x.IsValid));

            // Now check the values:
            Assert.AreEqual("Philipp,Wagner", result[0].Result.FirstNameWithLastName);

            Assert.AreEqual(1986, result[0].Result.BirthDate.Year);
            Assert.AreEqual(5, result[0].Result.BirthDate.Month);
            Assert.AreEqual(12, result[0].Result.BirthDate.Day);

            Assert.AreEqual("Max,Mustermann", result[1].Result.FirstNameWithLastName);

            Assert.AreEqual(2014, result[1].Result.BirthDate.Year);
            Assert.AreEqual(1, result[1].Result.BirthDate.Month);
            Assert.AreEqual(1, result[1].Result.BirthDate.Day);
        }
    }
}

FixedLengthTokenizer

Sometimes you need to parse a file that is defined by fixed-width columns instead of delimiters. The FixedLengthTokenizer addresses this problem and makes it possible to define columns by their start and end position in a line. The FixedLengthTokenizer takes a list of FixedLengthTokenizer.ColumnDefinition instances, which define the columns of the file.

Example

In the following example the textual input is split into two columns. The first column starts at index 0 and ends before index 10, and the second column starts at index 10 and ends before index 20. These values build the list of ColumnDefinition instances, which is passed into the FixedLengthTokenizer.

using NUnit.Framework;
using System.Text;
using TinyCsvParser.Tokenizer;

namespace TinyCsvParser.Test.Tokenizer
{
        [TestFixture]
        public class FixedLengthTokenizerTests
        {
                [Test]
                public void Tokenize_Line_Test()
                {
                        var columns = new[] {
                                new FixedLengthTokenizer.ColumnDefinition(0, 10),
                                new FixedLengthTokenizer.ColumnDefinition(10, 20),
                        };

                        var tokenizer = new FixedLengthTokenizer(columns);

                        var input = new StringBuilder()
                                .AppendLine("Philipp   Wagner    ")
                                .ToString();

                        var result = tokenizer.Tokenize(input);

                        Assert.AreEqual("Philipp   ", result[0]);
                        Assert.AreEqual("Wagner    ", result[1]);
                }
        }
}
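The column arithmetic behind this test can be sketched with plain Substring calls. The sketch assumes, as the expected values in the test suggest, that the start index is inclusive and the end index is exclusive. ColumnRange and FixedLengthSketch are hypothetical names for this illustration and are not part of TinyCsvParser.

```csharp
using System;
using System.Linq;

// Hypothetical column range for this sketch: Start is inclusive, End is exclusive.
public sealed class ColumnRange
{
    public readonly int Start;
    public readonly int End;

    public ColumnRange(int start, int end)
    {
        Start = start;
        End = end;
    }
}

public static class FixedLengthSketch
{
    // Cuts one token per column range out of the line.
    public static string[] Tokenize(string line, params ColumnRange[] columns)
    {
        return columns
            .Select(column => line.Substring(column.Start, column.End - column.Start))
            .ToArray();
    }

    public static void Main()
    {
        string[] tokens = Tokenize(
            "Philipp   Wagner    ",
            new ColumnRange(0, 10),
            new ColumnRange(10, 20));

        // Brackets make the preserved padding visible.
        // Prints:
        //   [Philipp   ]
        //   [Wagner    ]
        foreach (string token in tokens)
        {
            Console.WriteLine("[" + token + "]");
        }
    }
}
```

The padding is kept in each token, just as in the test above. In this sketch, Substring throws an ArgumentOutOfRangeException when a line is shorter than a column's end index, so real input would need a length check first.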