Hi, I'm Ray

I'm a software developer, full-time nerd and occasional human (while heavily caffeinated).

Parsing CSV in C#

Welcome to my first blog post. Hi, I’m Ray (psst! that’s the name of this website).

Parsing comma-separated values (CSV) is an easy exercise that I come across occasionally. I’ve seen a lot of code on the Web purporting to do this, but much of it doesn’t work in practice. This is my take on parsing CSVs.

Note: This is also a real-life application of stacks (but we don’t care about that right now).

 

1. Start simple

If you have a CSV string such as the one below, it’s simple to split the string by the commas. You get a five-item array.

one,two,three,four,five

Sometimes the CSV will have double-quotes surrounding each item. You split the string and trim the quotes. This is the code I see most often, and it’s fine for input like this.

"one","two","three","four","five"
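Here is a minimal sketch of that split-and-trim approach. The class and method names (NaiveCsv, SplitAndTrim) are made up for illustration, and it assumes no value contains a comma or a quote of its own.

```csharp
using System.Linq;

public static class NaiveCsv
{
    // Splits on every comma, then strips leading and trailing
    // double-quotes from each piece. Hypothetical helper: it is only
    // safe when no value contains a comma or an escaped quote.
    public static string[] SplitAndTrim(string line)
    {
        return line.Split(',')
                   .Select(item => item.Trim('"'))
                   .ToArray();
    }
}
```

Calling NaiveCsv.SplitAndTrim on either of the two strings above gives the same five items: one, two, three, four, five.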

 

2. Commas gone wild

The reason for surrounding each item in quotes is so they can contain a comma without breaking the format.

Items that don’t contain a comma don’t need to be wrapped in quotes.

In this case if you split by commas you get an array of seven but there should only be five.

one,"two,three","four,five",six,"seven"
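To see the problem concretely, here is a small sketch that splits the line above on every comma. The names (SplitDemo, SplitOnCommas, Sample) are hypothetical. The sample uses a C# verbatim string, where "" stands for one literal double-quote.

```csharp
public static class SplitDemo
{
    // The example line, written as a verbatim string literal.
    public const string Sample = @"one,""two,three"",""four,five"",six,""seven""";

    // Splits on every comma with no regard for quotes.
    public static string[] SplitOnCommas(string line)
    {
        return line.Split(',');
    }
}
```

SplitDemo.SplitOnCommas(SplitDemo.Sample) returns seven pieces, because the quoted values two,three and four,five are each cut in half.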

You can see that commas are “special” characters that separate the values.

We’ve added double-quotes as special characters for surrounding values. A comma inside a quoted value is not a special character.

If you have a double-quote inside a value, it will be escaped so that it doesn’t affect the CSV format. In the data I deal with this is typically done with a backslash (RFC 4180 doubles the quote instead, but this post sticks with the backslash).

one,"two,three","four,\"five\"",six,"seven"

As you can see, the split and trim doesn’t work any more. We need to find another solution.

 

3. The disclaimer

I’ve seen some solutions using regular expressions and others with complicated loop-di-loops.

This is my take on the problem. It’s what is known as “keep it simple, stupid”, or as I call it, “easy-mode”.

 

4. The solution

This is when we revisit the stack. The stack is simply a list where the last thing added is the first thing removed. Think of it as a stack of plates, except in this case we’re going to be adding and removing characters. Adding is called pushing and removing is called popping.
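As a quick illustration of pushing and popping, here is a small sketch using the standard Stack<char> from System.Collections.Generic. The class and method names (StackDemo, PushThenPopAll) are made up for this example.

```csharp
using System.Collections.Generic;
using System.Text;

public static class StackDemo
{
    // Pushes every character of the input, then pops until the stack
    // is empty, returning the characters in the order they came off.
    public static string PushThenPopAll(string input)
    {
        var stack = new Stack<char>();
        foreach (char c in input)
        {
            stack.Push(c); // the newest character goes on top
        }

        var popped = new StringBuilder();
        while (stack.Count > 0)
        {
            popped.Append(stack.Pop()); // the top character comes off first
        }
        return popped.ToString();
    }
}
```

StackDemo.PushThenPopAll("abc") returns "cba": the last character pushed is the first one popped.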

Okay.

We start off with an empty stack and we’re going to look at each character in the CSV string from left to right. We’ll also want an array for the output items.

Remember that we only care about commas, quotes and backslashes.

one,"two,three","four,\"five\"",six,"seven"

  1. The first special character we come to is the comma.
  2. We check the stack, it’s empty, we add the preceding string to the output.

We have one item in the array.

"two,three","four,\"five\"",six,"seven"

  1. The first special character we come to is the quote.
  2. We check the stack, it’s empty, we push the quote character on to the stack and continue.
  3. The next special character we come to is a comma.
  4. We check the stack, a quote is on top. We ignore the comma and continue.
  5. The next special character we come to is a quote.
  6. We check the stack, a quote is on top. We pop the quote out and continue.
  7. The next special character we come to is a comma.
  8. We check the stack, it’s empty, we add the preceding string to the output.

We have two items in the array.

"four,\"five\"",six,"seven"

  1. The first special character we come to is the quote.
  2. We check the stack, it’s empty, we push the quote character on to the stack and continue.
  3. The next special character we come to is a comma.
  4. We check the stack, a quote is on top. We ignore the comma and continue.
  5. The next special character we come to is a backslash.
  6. We push the backslash onto the stack and continue to the next character.
  7. The next character we come to is a quote.
  8. We check the stack, a backslash is on top. We pop the backslash off, keep the quote as an ordinary character and continue.
  9. The next special character we come to is a backslash.
  10. We push the backslash onto the stack and continue to the next character.
  11. The next character we come to is a quote.
  12. We check the stack, a backslash is on top. We pop the backslash off, keep the quote as an ordinary character and continue.
  13. The next special character we come to is a quote.
  14. We check the stack, a quote is on top. We pop the quote out and continue.
  15. The next special character we come to is a comma.
  16. We check the stack, it’s empty, we add the preceding string to the output.

We have three items in the array.

six,"seven"

  1. The first special character we come to is the comma.
  2. We check the stack, it’s empty, we add the preceding string to the output.

We have four items in the array.

"seven"

  1. The first special character we come to is the quote.
  2. We check the stack, it’s empty, we push the quote character on to the stack and continue.
  3. The next special character we come to is a quote.
  4. We check the stack, a quote is on top. We pop the quote out and continue.
  5. End of string, we add the preceding string to the output.

 

5. The end

You should have five items in the array. The method is long-winded but you won’t be doing it by hand.

  • one
  • two,three
  • four,"five"
  • six
  • seven
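The walk-through above can be sketched in C# roughly as follows. This is a minimal sketch rather than the linked sample code: the names (CsvLineParser, Parse) are hypothetical, it handles a single line, it assumes a backslash escapes the next character wherever it appears, and it does not report unterminated quotes.

```csharp
using System.Collections.Generic;
using System.Text;

public static class CsvLineParser
{
    // Parses one CSV line using a stack, following the walk-through.
    public static List<string> Parse(string line)
    {
        var output = new List<string>();
        var stack = new Stack<char>();
        var current = new StringBuilder(); // the "preceding string"

        foreach (char c in line)
        {
            if (stack.Count > 0 && stack.Peek() == '\\')
            {
                // A backslash is on top: pop it and keep this character
                // as an ordinary one, even if it is a quote or comma.
                stack.Pop();
                current.Append(c);
            }
            else if (c == '\\')
            {
                // Assumption: a backslash always escapes the next character.
                stack.Push(c);
            }
            else if (c == '"')
            {
                // A quote on top means this one closes it; otherwise it opens.
                if (stack.Count > 0 && stack.Peek() == '"') stack.Pop();
                else stack.Push(c);
            }
            else if (c == ',' && stack.Count == 0)
            {
                // A comma outside quotes ends the current item.
                output.Add(current.ToString());
                current.Clear();
            }
            else
            {
                // Ordinary character, or a comma inside quotes.
                current.Append(c);
            }
        }

        // End of string: whatever is left is the last item.
        output.Add(current.ToString());
        return output;
    }
}
```

Passing the example line to CsvLineParser.Parse gives the five items listed above. In C# source the line can be written as the verbatim string @"one,""two,three"",""four,\""five\"""",six,""seven""", where each "" stands for one literal quote.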

I have deliberately overlooked a few issues to keep the walk-through straightforward, but I have addressed them in the code sample. Of course, you will likely have language-specific considerations to address as well.

I hope someone finds this post useful. I’ll try to think up something more interesting next time.

 

6. The code

The sample code differs a little from the walk-through but you can adjust it as you will.

Source Code


Copyright © 2014-2019 Ray Lam. All Rights Reserved.