/*!
A tutorial for handling CSV data in Rust.

This tutorial will cover basic CSV reading and writing, automatic
(de)serialization with Serde, CSV transformations and performance.

This tutorial is targeted at beginner Rust programmers. Experienced Rust
programmers may find this tutorial to be too verbose, but skimming may be
useful. There is also a
[cookbook](../cookbook/index.html)
of examples for those who prefer more information density.

For an introduction to Rust, please see the
[official book](https://doc.rust-lang.org/book/second-edition/).
If you haven't written any Rust code yet but have written code in another
language, then this tutorial might be accessible to you without needing to read
the book first.

# Table of contents

1. [Setup](#setup)
1. [Basic error handling](#basic-error-handling)
    * [Switch to recoverable errors](#switch-to-recoverable-errors)
1. [Reading CSV](#reading-csv)
    * [Reading headers](#reading-headers)
    * [Delimiters, quotes and variable length records](#delimiters-quotes-and-variable-length-records)
    * [Reading with Serde](#reading-with-serde)
    * [Handling invalid data with Serde](#handling-invalid-data-with-serde)
1. [Writing CSV](#writing-csv)
    * [Writing tab separated values](#writing-tab-separated-values)
    * [Writing with Serde](#writing-with-serde)
1. [Pipelining](#pipelining)
    * [Filter by search](#filter-by-search)
    * [Filter by population count](#filter-by-population-count)
1. [Performance](#performance)
    * [Amortizing allocations](#amortizing-allocations)
    * [Serde and zero allocation](#serde-and-zero-allocation)
    * [CSV parsing without the standard library](#csv-parsing-without-the-standard-library)
1. [Closing thoughts](#closing-thoughts)

# Setup

In this section, we'll get you set up with a simple program that reads CSV
data and prints a "debug" version of each record. This assumes that you have
the
[Rust toolchain installed](https://www.rust-lang.org/install.html),
which includes both Rust and Cargo.

We'll start by creating a new Cargo project:

```text
$ cargo new --bin csvtutor
$ cd csvtutor
```

Once inside `csvtutor`, open `Cargo.toml` in your favorite text editor and add
`csv = "1.1"` to your `[dependencies]` section. At this point, your
`Cargo.toml` should look something like this:

```text
[package]
name = "csvtutor"
version = "0.1.0"
authors = ["Your Name"]

[dependencies]
csv = "1.1"
```

Next, let's build your project. Since you added the `csv` crate as a
dependency, Cargo will automatically download it and compile it for you. To
build your project, use Cargo:

```text
$ cargo build
```

This will produce a new binary, `csvtutor`, in your `target/debug` directory.
It won't do much at this point, but you can run it:

```text
$ ./target/debug/csvtutor
Hello, world!
```

Let's make our program do something useful. Our program will read CSV data on
stdin and print debug output for each record on stdout. To write this program,
open `src/main.rs` in your favorite text editor and replace its contents with
this:

```no_run
//tutorial-setup-01.rs
// Import the standard library's I/O module so we can read from stdin.
use std::io;

// The `main` function is where your program starts executing.
fn main() {
    // Create a CSV parser that reads data from stdin.
    let mut rdr = csv::Reader::from_reader(io::stdin());
    // Loop over each record.
    for result in rdr.records() {
        // An error may occur, so abort the program in an unfriendly way.
        // We will make this more friendly later!
        let record = result.expect("a CSV record");
        // Print a debug version of the record.
        println!("{:?}", record);
    }
}
```

Don't worry too much about what this code means; we'll dissect it in the next
section. For now, try rebuilding your project:

```text
$ cargo build
```

Assuming that succeeds, let's try running our program. But first, we will need
some CSV data to play with! For that, we will use a random selection of 100
US cities, along with their population size and geographical coordinates. (We
will use this same CSV data throughout the entire tutorial.) To get the data,
download it from github:

```text
$ curl -LO 'https://raw.githubusercontent.com/BurntSushi/rust-csv/master/examples/data/uspop.csv'
```

And now finally, run your program on `uspop.csv`:

```text
$ ./target/debug/csvtutor < uspop.csv
StringRecord(["Davidsons Landing", "AK", "", "65.2419444", "-165.2716667"])
StringRecord(["Kenai", "AK", "7610", "60.5544444", "-151.2583333"])
StringRecord(["Oakman", "AL", "", "33.7133333", "-87.3886111"])
# ... and much more
```

# Basic error handling

Since reading CSV data can result in errors, error handling is pervasive
throughout the examples in this tutorial. Therefore, we're going to spend a
little bit of time going over basic error handling, and in particular, fix
our previous example to show errors in a more friendly way. **If you're already
comfortable with things like `Result` and `try!`/`?` in Rust, then you can
safely skip this section.**

Note that
[The Rust Programming Language Book](https://doc.rust-lang.org/book/second-edition/)
contains an
[introduction to general error handling](https://doc.rust-lang.org/book/second-edition/ch09-00-error-handling.html).
For a deeper dive, see
[my blog post on error handling in Rust](http://blog.burntsushi.net/rust-error-handling/).
The blog post is especially important if you plan on building Rust libraries.

With that out of the way, error handling in Rust comes in two different forms:
unrecoverable errors and recoverable errors.

Unrecoverable errors generally correspond to things like bugs in your program,
which might occur when an invariant or contract is broken. At that point, the
state of your program is unpredictable, and there's typically little recourse
other than *panicking*. In Rust, a panic is similar to simply aborting your
program, but it will unwind the stack and clean up resources before your
program exits.

On the other hand, recoverable errors generally correspond to predictable
errors. A non-existent file or invalid CSV data are examples of recoverable
errors. In Rust, recoverable errors are handled via `Result`. A `Result`
represents the state of a computation that has either succeeded or failed.
It is defined like so:

```
enum Result<T, E> {
    Ok(T),
    Err(E),
}
```

That is, a `Result` either contains a value of type `T` when the computation
succeeds, or it contains a value of type `E` when the computation fails.
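
As a quick aside that isn't specific to CSV, here's a small sketch of how a
`Result` is typically consumed with `match`. (The `parse_port` function here
is made up purely for illustration.)

```
fn parse_port(s: &str) -> Result<u16, std::num::ParseIntError> {
    // `str::parse` itself returns a `Result`, so we can return it directly.
    s.parse()
}

fn main() {
    match parse_port("8080") {
        // The `Ok` variant carries the successfully parsed value.
        Ok(port) => println!("port: {}", port),
        // The `Err` variant carries a value describing what went wrong.
        Err(err) => println!("could not parse port: {}", err),
    }
}
```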

The relationship between unrecoverable errors and recoverable errors is
important. In particular, it is **strongly discouraged** to treat recoverable
errors as if they were unrecoverable. For example, panicking when a file could
not be found, or if some CSV data is invalid, is considered bad practice.
Instead, predictable errors should be handled using Rust's `Result` type.

With our new found knowledge, let's re-examine our previous example and dissect
its error handling.

```no_run
//tutorial-error-01.rs
use std::io;

fn main() {
    let mut rdr = csv::Reader::from_reader(io::stdin());
    for result in rdr.records() {
        let record = result.expect("a CSV record");
        println!("{:?}", record);
    }
}
```

There are two places where an error can occur in this program. The first is
if there is a problem reading a record from stdin. The second is if there is
a problem writing to stdout. In general, we will ignore the latter problem in
this tutorial, although robust command line applications should probably try
to handle it (e.g., when a broken pipe occurs). The former, however, is worth
looking into in more detail. For example, if a user of this program provides
invalid CSV data, then the program will panic:

```text
$ cat invalid
header1,header2
foo,bar
quux,baz,foobar
$ ./target/debug/csvtutor < invalid
StringRecord { position: Some(Position { byte: 16, line: 2, record: 1 }), fields: ["foo", "bar"] }
thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: UnequalLengths { pos: Some(Position { byte: 24, line: 3, record: 2 }), expected_len: 2, len: 3 }', /checkout/src/libcore/result.rs:859
note: Run with `RUST_BACKTRACE=1` for a backtrace.
```

What happened here? First and foremost, we should talk about why the CSV data
is invalid. The CSV data consists of three records: a header and two data
records. The header and first data record have two fields, but the second
data record has three fields. By default, the csv crate will treat inconsistent
record lengths as an error.
(This behavior can be toggled using the
[`ReaderBuilder::flexible`](../struct.ReaderBuilder.html#method.flexible)
config knob.) This explains why the first data record is printed in this
example, since it has the same number of fields as the header record. That is,
we don't actually hit an error until we parse the second data record.

(Note that the CSV reader automatically interprets the first record as a
header. This can be toggled with the
[`ReaderBuilder::has_headers`](../struct.ReaderBuilder.html#method.has_headers)
config knob.)

So what actually causes the panic to happen in our program? That would be the
first line in our loop:

```ignore
for result in rdr.records() {
    let record = result.expect("a CSV record"); // this panics
    println!("{:?}", record);
}
```

The key thing to understand here is that `rdr.records()` returns an iterator
that yields `Result` values. That is, instead of yielding records, it yields
a `Result` that contains either a record or an error. The `expect` method,
which is defined on `Result`, *unwraps* the success value inside the `Result`.
Since the `Result` might contain an error instead, `expect` will *panic* when
it does contain an error.

It might help to look at the implementation of `expect`:

```ignore
use std::fmt;

// This says, "for all types T and E, where E can be turned into a human
// readable debug message, define the `expect` method."
impl<T, E: fmt::Debug> Result<T, E> {
    fn expect(self, msg: &str) -> T {
        match self {
            Ok(t) => t,
            Err(e) => panic!("{}: {:?}", msg, e),
        }
    }
}
```

Since this causes a panic if the CSV data is invalid, and invalid CSV data is
a perfectly predictable error, we've turned what should be a *recoverable*
error into an *unrecoverable* error. We did this because using unrecoverable
errors is expedient. Since this is bad practice, we will endeavor to avoid
unrecoverable errors throughout the rest of the tutorial.

## Switch to recoverable errors

We'll convert our unrecoverable error to a recoverable error in three steps.
First, let's get rid of the panic and print an error message manually:

```no_run
//tutorial-error-02.rs
use std::io;
use std::process;

fn main() {
    let mut rdr = csv::Reader::from_reader(io::stdin());
    for result in rdr.records() {
        // Examine our Result.
        // If there was no problem, print the record.
        // Otherwise, print the error message and quit the program.
        match result {
            Ok(record) => println!("{:?}", record),
            Err(err) => {
                println!("error reading CSV from <stdin>: {}", err);
                process::exit(1);
            }
        }
    }
}
```

If we run our program again, we'll still see an error message, but it is no
longer a panic message:

```text
$ cat invalid
header1,header2
foo,bar
quux,baz,foobar
$ ./target/debug/csvtutor < invalid
StringRecord { position: Some(Position { byte: 16, line: 2, record: 1 }), fields: ["foo", "bar"] }
error reading CSV from <stdin>: CSV error: record 2 (line: 3, byte: 24): found record with 3 fields, but the previous record has 2 fields
```

The second step for moving to recoverable errors is to put our CSV record loop
into a separate function. This function then has the option of *returning* an
error, which our `main` function can then inspect and decide what to do with.

```no_run
//tutorial-error-03.rs
use std::error::Error;
use std::io;
use std::process;

fn main() {
    if let Err(err) = run() {
        println!("{}", err);
        process::exit(1);
    }
}

fn run() -> Result<(), Box<dyn Error>> {
    let mut rdr = csv::Reader::from_reader(io::stdin());
    for result in rdr.records() {
        // Examine our Result.
        // If there was no problem, print the record.
        // Otherwise, convert our error to a Box<dyn Error> and return it.
        match result {
            Err(err) => return Err(From::from(err)),
            Ok(record) => {
                println!("{:?}", record);
            }
        }
    }
    Ok(())
}
```

Our new function, `run`, has a return type of `Result<(), Box<dyn Error>>`. In
simple terms, this says that `run` either returns nothing when successful, or
if an error occurred, it returns a `Box<dyn Error>`, which stands for "any
kind of error." A `Box<dyn Error>` is hard to inspect if we care about the
specific error that occurred. But for our purposes, all we need to do is
gracefully print an error message and exit the program.

The third and final step is to replace our explicit `match` expression with a
special Rust language feature: the question mark.

```no_run
//tutorial-error-04.rs
use std::error::Error;
use std::io;
use std::process;

fn main() {
    if let Err(err) = run() {
        println!("{}", err);
        process::exit(1);
    }
}

fn run() -> Result<(), Box<dyn Error>> {
    let mut rdr = csv::Reader::from_reader(io::stdin());
    for result in rdr.records() {
        // This is effectively the same code as our `match` in the
        // previous example. In other words, `?` is syntactic sugar.
        let record = result?;
        println!("{:?}", record);
    }
    Ok(())
}
```

This last step shows how we can use the `?` operator to automatically forward
errors to our caller without having to do explicit case analysis with `match`
ourselves. We will use `?` heavily throughout this tutorial, and it's
important to note that it can **only be used in functions that return
`Result`.**
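
To make the sugar concrete with a standalone (non-CSV) sketch: each `?` below
either unwraps the `Ok` value or returns the error to the caller immediately.
(`add_strs` is a made-up function, used here only for illustration.)

```
use std::num::ParseIntError;

fn add_strs(a: &str, b: &str) -> Result<i64, ParseIntError> {
    // If either parse fails, `?` returns the `ParseIntError` from
    // `add_strs` right away, just like our explicit `match` would.
    let x: i64 = a.parse()?;
    let y: i64 = b.parse()?;
    Ok(x + y)
}

fn main() {
    assert_eq!(add_strs("2", "40"), Ok(42));
    // The error is handed back to us instead of causing a panic.
    assert!(add_strs("two", "40").is_err());
}
```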

We'll end this section with a word of caution: using `Box<dyn Error>` as our
error type is the minimally acceptable thing we can do here. Namely, while it
allows our program to gracefully handle errors, it makes it hard for callers
to inspect the specific error condition that occurred. However, since this is
a tutorial on writing command line programs that do CSV parsing, we will
consider ourselves satisfied. If you'd like to know more, or are interested in
writing a library that handles CSV data, then you should check out my
[blog post on error handling](http://blog.burntsushi.net/rust-error-handling/).

With all that said, if all you're doing is writing a one-off program to do
CSV transformations, then using methods like `expect` and panicking when an
error occurs is a perfectly reasonable thing to do. Nevertheless, this tutorial
will endeavor to show idiomatic code.

# Reading CSV

Now that we've got you set up and covered basic error handling, it's time to
do what we came here to do: handle CSV data. We've already seen how to read
CSV data from `stdin`, but this section will cover how to read CSV data from
files and how to configure our CSV reader to read data formatted with
different delimiters and quoting strategies.

First up, let's adapt the example we've been working with to accept a file
path argument instead of stdin.

```no_run
//tutorial-read-01.rs
use std::env;
use std::error::Error;
use std::ffi::OsString;
use std::fs::File;
use std::process;

fn run() -> Result<(), Box<dyn Error>> {
    let file_path = get_first_arg()?;
    let file = File::open(file_path)?;
    let mut rdr = csv::Reader::from_reader(file);
    for result in rdr.records() {
        let record = result?;
        println!("{:?}", record);
    }
    Ok(())
}

/// Returns the first positional argument sent to this process. If there are no
/// positional arguments, then this returns an error.
fn get_first_arg() -> Result<OsString, Box<dyn Error>> {
    match env::args_os().nth(1) {
        None => Err(From::from("expected 1 argument, but got none")),
        Some(file_path) => Ok(file_path),
    }
}

fn main() {
    if let Err(err) = run() {
        println!("{}", err);
        process::exit(1);
    }
}
```

If you replace the contents of your `src/main.rs` file with the above code,
then you should be able to rebuild your project and try it out:

```text
$ cargo build
$ ./target/debug/csvtutor uspop.csv
StringRecord(["Davidsons Landing", "AK", "", "65.2419444", "-165.2716667"])
StringRecord(["Kenai", "AK", "7610", "60.5544444", "-151.2583333"])
StringRecord(["Oakman", "AL", "", "33.7133333", "-87.3886111"])
# ... and much more
```

This example contains two new pieces of code:

1. Code for querying the positional arguments of your program. We put this code
   into its own function called `get_first_arg`. Our program expects a file
   path in the first position (which is indexed at `1`; the argument at index
   `0` is the executable name), so if one doesn't exist, then `get_first_arg`
   returns an error.
2. Code for opening a file. In `run`, we open a file using `File::open`. If
   there was a problem opening the file, we forward the error to the caller of
   `run` (which is `main` in this program). Note that we do *not* wrap the
   `File` in a buffer. The CSV reader does buffering internally, so there's
   no need for the caller to do it.

Now is a good time to introduce an alternate CSV reader constructor, which
makes it slightly more convenient to open CSV data from a file. That is,
instead of:

```ignore
let file_path = get_first_arg()?;
let file = File::open(file_path)?;
let mut rdr = csv::Reader::from_reader(file);
```

you can use:

```ignore
let file_path = get_first_arg()?;
let mut rdr = csv::Reader::from_path(file_path)?;
```

`csv::Reader::from_path` will open the file for you and return an error if
the file could not be opened.

## Reading headers

If you had a chance to look at the data inside `uspop.csv`, you would notice
that there is a header record that looks like this:

```text
City,State,Population,Latitude,Longitude
```

Now, if you look back at the output of the commands you've run so far, you'll
notice that the header record is never printed. Why is that? By default, the
CSV reader will interpret the first record in CSV data as a header, which
is typically distinct from the actual data in the records that follow.
Therefore, the header record is always skipped whenever you try to read or
iterate over the records in CSV data.

The CSV reader does not try to be smart about the header record and does
**not** employ any heuristics for automatically detecting whether the first
record is a header or not. Instead, if you don't want to treat the first record
as a header, you'll need to tell the CSV reader that there are no headers.

To configure a CSV reader to do this, we'll need to use a
[`ReaderBuilder`](../struct.ReaderBuilder.html)
to build a CSV reader with our desired configuration. Here's an example that
does just that. (Note that we've moved back to reading from `stdin`, since it
produces terser examples.)

```no_run
//tutorial-read-headers-01.rs
# use std::error::Error;
# use std::io;
# use std::process;
#
fn run() -> Result<(), Box<dyn Error>> {
    let mut rdr = csv::ReaderBuilder::new()
        .has_headers(false)
        .from_reader(io::stdin());
    for result in rdr.records() {
        let record = result?;
        println!("{:?}", record);
    }
    Ok(())
}
#
# fn main() {
#     if let Err(err) = run() {
#         println!("{}", err);
#         process::exit(1);
#     }
# }
```

If you compile and run this program with our `uspop.csv` data, then you'll see
that the header record is now printed:

```text
$ cargo build
$ ./target/debug/csvtutor < uspop.csv
StringRecord(["City", "State", "Population", "Latitude", "Longitude"])
StringRecord(["Davidsons Landing", "AK", "", "65.2419444", "-165.2716667"])
StringRecord(["Kenai", "AK", "7610", "60.5544444", "-151.2583333"])
StringRecord(["Oakman", "AL", "", "33.7133333", "-87.3886111"])
```

If you ever need to access the header record directly, then you can use the
[`Reader::headers`](../struct.Reader.html#method.headers)
method like so:

```no_run
//tutorial-read-headers-02.rs
# use std::error::Error;
# use std::io;
# use std::process;
#
fn run() -> Result<(), Box<dyn Error>> {
    let mut rdr = csv::Reader::from_reader(io::stdin());
    {
        // We nest this call in its own scope because of lifetimes.
        let headers = rdr.headers()?;
        println!("{:?}", headers);
    }
    for result in rdr.records() {
        let record = result?;
        println!("{:?}", record);
    }
    // We can ask for the headers at any time. There's no need to nest this
    // call in its own scope because we never try to borrow the reader again.
    let headers = rdr.headers()?;
    println!("{:?}", headers);
    Ok(())
}
#
# fn main() {
#     if let Err(err) = run() {
#         println!("{}", err);
#         process::exit(1);
#     }
# }
```

One interesting thing to note in this example is that we put the call to
`rdr.headers()` in its own scope. We do this because `rdr.headers()` returns
a *borrow* of the reader's internal header state. The nested scope in this
code allows the borrow to end before we try to iterate over the records. If
we didn't nest the call to `rdr.headers()` in its own scope, then the code
wouldn't compile because we cannot borrow the reader's headers at the same time
that we try to borrow the reader to iterate over its records.

Another way of solving this problem is to *clone* the header record:

```ignore
let headers = rdr.headers()?.clone();
```

This converts it from a borrow of the CSV reader to a new owned value. This
makes the code a bit easier to read, but at the cost of copying the header
record into a new allocation.

## Delimiters, quotes and variable length records

In this section we'll temporarily depart from our `uspop.csv` data set and
show how to read some CSV data that is a little less clean. This CSV data
uses `;` as a delimiter, escapes quotes with `\"` (instead of `""`) and has
records of varying length. Here's the data, which contains a list of WWE
wrestlers and the year they started, if it's known:

```text
$ cat strange.csv
"\"Hacksaw\" Jim Duggan";1987
"Bret \"Hit Man\" Hart";1984
# We're not sure when Rafael started, so omit the year.
Rafael Halperin
"\"Big Cat\" Ernie Ladd";1964
"\"Macho Man\" Randy Savage";1985
"Jake \"The Snake\" Roberts";1986
```

To read this CSV data, we'll want to do the following:

1. Disable headers, since this data has none.
2. Change the delimiter from `,` to `;`.
3. Change the quote strategy from doubled (e.g., `""`) to escaped (e.g., `\"`).
4. Permit flexible length records, since some omit the year.
5. Ignore lines beginning with a `#`.

All of this (and more!) can be configured with a
[`ReaderBuilder`](../struct.ReaderBuilder.html),
as seen in the following example:

```no_run
//tutorial-read-delimiter-01.rs
# use std::error::Error;
# use std::io;
# use std::process;
#
fn run() -> Result<(), Box<dyn Error>> {
    let mut rdr = csv::ReaderBuilder::new()
        .has_headers(false)
        .delimiter(b';')
        .double_quote(false)
        .escape(Some(b'\\'))
        .flexible(true)
        .comment(Some(b'#'))
        .from_reader(io::stdin());
    for result in rdr.records() {
        let record = result?;
        println!("{:?}", record);
    }
    Ok(())
}
#
# fn main() {
#     if let Err(err) = run() {
#         println!("{}", err);
#         process::exit(1);
#     }
# }
```

Now re-compile your project and try running the program on `strange.csv`:

```text
$ cargo build
$ ./target/debug/csvtutor < strange.csv
StringRecord(["\"Hacksaw\" Jim Duggan", "1987"])
StringRecord(["Bret \"Hit Man\" Hart", "1984"])
StringRecord(["Rafael Halperin"])
StringRecord(["\"Big Cat\" Ernie Ladd", "1964"])
StringRecord(["\"Macho Man\" Randy Savage", "1985"])
StringRecord(["Jake \"The Snake\" Roberts", "1986"])
```

You should feel encouraged to play around with the settings. Some interesting
things you might try:

1. If you remove the `escape` setting, notice that no CSV errors are reported.
   Instead, records are still parsed. This is a feature of the CSV parser. Even
   though it gets the data slightly wrong, it still provides a parse that you
   might be able to work with. This is a useful property given the messiness
   of real world CSV data.
2. If you remove the `delimiter` setting, parsing still succeeds, although
   every record has exactly one field.
3. If you remove the `flexible` setting, the reader will print the first two
   records (since they both have the same number of fields), but will return a
   parse error on the third record, since it has only one field.

This covers most of the things you might want to configure on your CSV reader,
although there are a few other knobs. For example, you can change the record
terminator from a new line to any other character. (By default, the terminator
is `CRLF`, which treats each of `\r\n`, `\r` and `\n` as single record
terminators.) For more details, see the documentation and examples for each of
the methods on
[`ReaderBuilder`](../struct.ReaderBuilder.html).

## Reading with Serde

One of the most convenient features of this crate is its support for
[Serde](https://serde.rs/).
Serde is a framework for automatically serializing and deserializing data into
Rust types. In simpler terms, that means instead of iterating over records
as an array of string fields, we can iterate over records of a specific type
of our choosing.

For example, let's take a look at some data from our `uspop.csv` file:

```text
City,State,Population,Latitude,Longitude
Davidsons Landing,AK,,65.2419444,-165.2716667
Kenai,AK,7610,60.5544444,-151.2583333
```

While some of these fields make sense as strings (`City`, `State`), other
fields look more like numbers. For example, `Population` looks like it contains
integers while `Latitude` and `Longitude` appear to contain decimals. If we
wanted to convert these fields to their "proper" types, then we need to do
a lot of manual work. This next example shows how.

```no_run
//tutorial-read-serde-01.rs
# use std::error::Error;
# use std::io;
# use std::process;
#
fn run() -> Result<(), Box<dyn Error>> {
    let mut rdr = csv::Reader::from_reader(io::stdin());
    for result in rdr.records() {
        let record = result?;

        let city = &record[0];
        let state = &record[1];
        // Some records are missing population counts, so if we can't
        // parse a number, treat the population count as missing instead
        // of returning an error.
        let pop: Option<u64> = record[2].parse().ok();
        // Lucky us! Latitudes and longitudes are available for every record.
        // Therefore, if one couldn't be parsed, return an error.
        let latitude: f64 = record[3].parse()?;
        let longitude: f64 = record[4].parse()?;

        println!(
            "city: {:?}, state: {:?}, \
             pop: {:?}, latitude: {:?}, longitude: {:?}",
            city, state, pop, latitude, longitude);
    }
    Ok(())
}
#
# fn main() {
#     if let Err(err) = run() {
#         println!("{}", err);
#         process::exit(1);
#     }
# }
```

The problem here is that we need to parse each individual field manually, which
can be labor intensive and repetitive. Serde, however, makes this process
automatic. For example, we can ask to deserialize every record into a tuple
type: `(String, String, Option<u64>, f64, f64)`.

```no_run
//tutorial-read-serde-02.rs
# use std::error::Error;
# use std::io;
# use std::process;
#
// This introduces a type alias so that we can conveniently reference our
// record type.
type Record = (String, String, Option<u64>, f64, f64);

fn run() -> Result<(), Box<dyn Error>> {
    let mut rdr = csv::Reader::from_reader(io::stdin());
    // Instead of creating an iterator with the `records` method, we create
    // an iterator with the `deserialize` method.
    for result in rdr.deserialize() {
        // We must tell Serde what type we want to deserialize into.
        let record: Record = result?;
        println!("{:?}", record);
    }
    Ok(())
}
#
# fn main() {
#     if let Err(err) = run() {
#         println!("{}", err);
#         process::exit(1);
#     }
# }
```

Running this code should show similar output as previous examples:

```text
$ cargo build
$ ./target/debug/csvtutor < uspop.csv
("Davidsons Landing", "AK", None, 65.2419444, -165.2716667)
("Kenai", "AK", Some(7610), 60.5544444, -151.2583333)
("Oakman", "AL", None, 33.7133333, -87.3886111)
# ... and much more
```

One of the downsides of using Serde this way is that the type you use must
match the order of fields as they appear in each record. This can be a pain
if your CSV data has a header record, since you might tend to think about each
field as a value of a particular named field rather than as a numbered field.
One way we might achieve this is to deserialize our record into a map type like
[`HashMap`](https://doc.rust-lang.org/std/collections/struct.HashMap.html)
or
[`BTreeMap`](https://doc.rust-lang.org/std/collections/struct.BTreeMap.html).
The next example shows how, and in particular, notice that the only thing that
changed from the last example is the definition of the `Record` type alias and
a new `use` statement that imports `HashMap` from the standard library:

```no_run
//tutorial-read-serde-03.rs
use std::collections::HashMap;
# use std::error::Error;
# use std::io;
# use std::process;

// This introduces a type alias so that we can conveniently reference our
// record type.
type Record = HashMap<String, String>;

fn run() -> Result<(), Box<dyn Error>> {
    let mut rdr = csv::Reader::from_reader(io::stdin());
    for result in rdr.deserialize() {
        let record: Record = result?;
        println!("{:?}", record);
    }
    Ok(())
}
#
# fn main() {
#     if let Err(err) = run() {
#         println!("{}", err);
#         process::exit(1);
#     }
# }
```

Running this program shows similar results as before, but each record is
printed as a map:

```text
$ cargo build
$ ./target/debug/csvtutor < uspop.csv
{"City": "Davidsons Landing", "Latitude": "65.2419444", "State": "AK", "Population": "", "Longitude": "-165.2716667"}
{"City": "Kenai", "Population": "7610", "State": "AK", "Longitude": "-151.2583333", "Latitude": "60.5544444"}
{"State": "AL", "City": "Oakman", "Longitude": "-87.3886111", "Population": "", "Latitude": "33.7133333"}
```
869
870This method works especially well if you need to read CSV data with header
871records, but whose exact structure isn't known until your program runs.
872However, in our case, we know the structure of the data in `uspop.csv`. In
873particular, with the `HashMap` approach, we've lost the specific types we had
874for each field in the previous example when we deserialized each record into a
875`(String, String, Option<u64>, f64, f64)`. Is there a way to identify fields
876by their corresponding header name *and* assign each field its own unique
877type? The answer is yes, but we'll need to bring in Serde's `derive` feature
878first. You can do that by adding this to the `[dependencies]` section of your
879`Cargo.toml` file:
880
881```text
882serde = { version = "1", features = ["derive"] }
883```
884
With this dependency added to our project, we can now define our own custom struct
886that represents our record. We then ask Serde to automatically write the glue
887code required to populate our struct from a CSV record. The next example shows
888how. Don't miss the new Serde imports!
889
890```no_run
891//tutorial-read-serde-04.rs
892use std::error::Error;
893use std::io;
894use std::process;
895
896// This lets us write `#[derive(Deserialize)]`.
897use serde::Deserialize;
898
899// We don't need to derive `Debug` (which doesn't require Serde), but it's a
900// good habit to do it for all your types.
901//
902// Notice that the field names in this struct are NOT in the same order as
903// the fields in the CSV data!
904#[derive(Debug, Deserialize)]
905#[serde(rename_all = "PascalCase")]
906struct Record {
907 latitude: f64,
908 longitude: f64,
909 population: Option<u64>,
910 city: String,
911 state: String,
912}
913
914fn run() -> Result<(), Box<dyn Error>> {
915 let mut rdr = csv::Reader::from_reader(io::stdin());
916 for result in rdr.deserialize() {
917 let record: Record = result?;
918 println!("{:?}", record);
919 // Try this if you don't like each record smushed on one line:
920 // println!("{:#?}", record);
921 }
922 Ok(())
923}
924
925fn main() {
926 if let Err(err) = run() {
927 println!("{}", err);
928 process::exit(1);
929 }
930}
931```
932
933Compile and run this program to see similar output as before:
934
935```text
936$ cargo build
937$ ./target/debug/csvtutor < uspop.csv
938Record { latitude: 65.2419444, longitude: -165.2716667, population: None, city: "Davidsons Landing", state: "AK" }
939Record { latitude: 60.5544444, longitude: -151.2583333, population: Some(7610), city: "Kenai", state: "AK" }
940Record { latitude: 33.7133333, longitude: -87.3886111, population: None, city: "Oakman", state: "AL" }
941```
942
943Once again, we didn't need to change our `run` function at all: we're still
944iterating over records using the `deserialize` iterator that we started with
945in the beginning of this section. The only thing that changed in this example
946was the definition of the `Record` type and a new `use` statement. Our `Record`
947type is now a custom struct that we defined instead of a type alias, and as a
result, Serde doesn't know how to deserialize it by default. However, Serde's
`derive` feature provides a procedural macro, which will read your struct
definition at compile time and generate code that will deserialize a CSV record
951into a `Record` value. To see what happens if you leave out the automatic
952derive, change `#[derive(Debug, Deserialize)]` to `#[derive(Debug)]`.
953
954One other thing worth mentioning in this example is the use of
955`#[serde(rename_all = "PascalCase")]`. This directive helps Serde map your
956struct's field names to the header names in the CSV data. If you recall, our
957header record is:
958
959```text
960City,State,Population,Latitude,Longitude
961```
962
963Notice that each name is capitalized, but the fields in our struct are not. The
964`#[serde(rename_all = "PascalCase")]` directive fixes that by interpreting each
965field in `PascalCase`, where the first letter of the field is capitalized. If
we didn't tell Serde about the name remapping, then the program would quit with
967an error:
968
969```text
970$ ./target/debug/csvtutor < uspop.csv
971CSV deserialize error: record 1 (line: 2, byte: 41): missing field `latitude`
972```
973
974We could have fixed this through other means. For example, we could have used
975capital letters in our field names:
976
977```ignore
978#[derive(Debug, Deserialize)]
979struct Record {
980 Latitude: f64,
981 Longitude: f64,
982 Population: Option<u64>,
983 City: String,
984 State: String,
985}
986```
987
988However, this violates Rust naming style. (In fact, the Rust compiler
989will even warn you that the names do not follow convention!)
990
991Another way to fix this is to ask Serde to rename each field individually. This
992is useful when there is no consistent name mapping from fields to header names:
993
994```ignore
995#[derive(Debug, Deserialize)]
996struct Record {
997 #[serde(rename = "Latitude")]
998 latitude: f64,
999 #[serde(rename = "Longitude")]
1000 longitude: f64,
1001 #[serde(rename = "Population")]
1002 population: Option<u64>,
1003 #[serde(rename = "City")]
1004 city: String,
1005 #[serde(rename = "State")]
1006 state: String,
1007}
1008```
1009
1010To read more about renaming fields and about other Serde directives, please
1011consult the
1012[Serde documentation on attributes](https://serde.rs/attributes.html).
1013
1014## Handling invalid data with Serde
1015
1016In this section we will see a brief example of how to deal with data that isn't
1017clean. To do this exercise, we'll work with a slightly tweaked version of the
1018US population data we've been using throughout this tutorial. This version of
1019the data is slightly messier than what we've been using. You can get it like
1020so:
1021
1022```text
1023$ curl -LO 'https://raw.githubusercontent.com/BurntSushi/rust-csv/master/examples/data/uspop-null.csv'
1024```
1025
1026Let's start by running our program from the previous section:
1027
1028```no_run
1029//tutorial-read-serde-invalid-01.rs
1030# use std::error::Error;
1031# use std::io;
1032# use std::process;
1033#
1034# use serde::Deserialize;
1035#
1036#[derive(Debug, Deserialize)]
1037#[serde(rename_all = "PascalCase")]
1038struct Record {
1039 latitude: f64,
1040 longitude: f64,
1041 population: Option<u64>,
1042 city: String,
1043 state: String,
1044}
1045
1046fn run() -> Result<(), Box<dyn Error>> {
1047 let mut rdr = csv::Reader::from_reader(io::stdin());
1048 for result in rdr.deserialize() {
1049 let record: Record = result?;
1050 println!("{:?}", record);
1051 }
1052 Ok(())
1053}
1054#
1055# fn main() {
1056# if let Err(err) = run() {
1057# println!("{}", err);
1058# process::exit(1);
1059# }
1060# }
1061```
1062
1063Compile and run it on our messier data:
1064
1065```text
1066$ cargo build
1067$ ./target/debug/csvtutor < uspop-null.csv
1068Record { latitude: 65.2419444, longitude: -165.2716667, population: None, city: "Davidsons Landing", state: "AK" }
1069Record { latitude: 60.5544444, longitude: -151.2583333, population: Some(7610), city: "Kenai", state: "AK" }
1070Record { latitude: 33.7133333, longitude: -87.3886111, population: None, city: "Oakman", state: "AL" }
1071# ... more records
1072CSV deserialize error: record 42 (line: 43, byte: 1710): field 2: invalid digit found in string
1073```
1074
1075Oops! What happened? The program printed several records, but stopped when it
1076tripped over a deserialization problem. The error message says that it found
1077an invalid digit in the field at index `2` (which is the `Population` field)
1078on line 43. What does line 43 look like?
1079
1080```text
1081$ head -n 43 uspop-null.csv | tail -n1
1082Flint Springs,KY,NULL,37.3433333,-86.7136111
1083```
1084
1085Ah! The third field (index `2`) is supposed to either be empty or contain a
1086population count. However, in this data, it seems that `NULL` sometimes appears
1087as a value, presumably to indicate that there is no count available.
1088
1089The problem with our current program is that it fails to read this record
1090because it doesn't know how to deserialize a `NULL` string into an
`Option<u64>`. That is, an `Option<u64>` corresponds either to an empty field
1092or an integer.
1093
1094To fix this, we tell Serde to convert any deserialization errors on this field
1095to a `None` value, as shown in this next example:
1096
1097```no_run
1098//tutorial-read-serde-invalid-02.rs
1099# use std::error::Error;
1100# use std::io;
1101# use std::process;
1102#
1103# use serde::Deserialize;
1104#[derive(Debug, Deserialize)]
1105#[serde(rename_all = "PascalCase")]
1106struct Record {
1107 latitude: f64,
1108 longitude: f64,
1109 #[serde(deserialize_with = "csv::invalid_option")]
1110 population: Option<u64>,
1111 city: String,
1112 state: String,
1113}
1114
1115fn run() -> Result<(), Box<dyn Error>> {
1116 let mut rdr = csv::Reader::from_reader(io::stdin());
1117 for result in rdr.deserialize() {
1118 let record: Record = result?;
1119 println!("{:?}", record);
1120 }
1121 Ok(())
1122}
1123#
1124# fn main() {
1125# if let Err(err) = run() {
1126# println!("{}", err);
1127# process::exit(1);
1128# }
1129# }
1130```
1131
1132If you compile and run this example, then it should run to completion just
1133like the other examples:
1134
1135```text
1136$ cargo build
1137$ ./target/debug/csvtutor < uspop-null.csv
1138Record { latitude: 65.2419444, longitude: -165.2716667, population: None, city: "Davidsons Landing", state: "AK" }
1139Record { latitude: 60.5544444, longitude: -151.2583333, population: Some(7610), city: "Kenai", state: "AK" }
1140Record { latitude: 33.7133333, longitude: -87.3886111, population: None, city: "Oakman", state: "AL" }
1141# ... and more
1142```
1143
1144The only change in this example was adding this attribute to the `population`
1145field in our `Record` type:
1146
1147```ignore
1148#[serde(deserialize_with = "csv::invalid_option")]
1149```
1150
1151The
1152[`invalid_option`](../fn.invalid_option.html)
1153function is a generic helper function that does one very simple thing: when
1154applied to `Option` fields, it will convert any deserialization error into a
1155`None` value. This is useful when you need to work with messy CSV data.
1156
1157# Writing CSV
1158
1159In this section we'll show a few examples that write CSV data. Writing CSV data
tends to be a bit more straightforward than reading CSV data, since you get to
1161control the output format.
1162
1163Let's start with the most basic example: writing a few CSV records to `stdout`.
1164
1165```no_run
1166//tutorial-write-01.rs
1167use std::error::Error;
1168use std::io;
1169use std::process;
1170
1171fn run() -> Result<(), Box<dyn Error>> {
1172 let mut wtr = csv::Writer::from_writer(io::stdout());
1173 // Since we're writing records manually, we must explicitly write our
1174 // header record. A header record is written the same way that other
1175 // records are written.
1176 wtr.write_record(&["City", "State", "Population", "Latitude", "Longitude"])?;
1177 wtr.write_record(&["Davidsons Landing", "AK", "", "65.2419444", "-165.2716667"])?;
1178 wtr.write_record(&["Kenai", "AK", "7610", "60.5544444", "-151.2583333"])?;
1179 wtr.write_record(&["Oakman", "AL", "", "33.7133333", "-87.3886111"])?;
1180
1181 // A CSV writer maintains an internal buffer, so it's important
1182 // to flush the buffer when you're done.
1183 wtr.flush()?;
1184 Ok(())
1185}
1186
1187fn main() {
1188 if let Err(err) = run() {
1189 println!("{}", err);
1190 process::exit(1);
1191 }
1192}
1193```
1194
1195Compiling and running this example results in CSV data being printed:
1196
1197```text
1198$ cargo build
1199$ ./target/debug/csvtutor
1200City,State,Population,Latitude,Longitude
1201Davidsons Landing,AK,,65.2419444,-165.2716667
1202Kenai,AK,7610,60.5544444,-151.2583333
1203Oakman,AL,,33.7133333,-87.3886111
1204```
1205
1206Before moving on, it's worth taking a closer look at the `write_record`
1207method. In this example, it looks rather simple, but if you're new to Rust then
1208its type signature might look a little daunting:
1209
1210```ignore
1211pub fn write_record<I, T>(&mut self, record: I) -> csv::Result<()>
1212 where I: IntoIterator<Item=T>, T: AsRef<[u8]>
1213{
1214 // implementation elided
1215}
1216```
1217
1218To understand the type signature, we can break it down piece by piece.
1219
12201. The method takes two parameters: `self` and `record`.
12212. `self` is a special parameter that corresponds to the `Writer` itself.
12223. `record` is the CSV record we'd like to write. Its type is `I`, which is
1223 a generic type.
12244. In the method's `where` clause, the `I` type is constrained by the
1225 `IntoIterator<Item=T>` bound. What that means is that `I` must satisfy the
1226 `IntoIterator` trait. If you look at the documentation of the
1227 [`IntoIterator` trait](https://doc.rust-lang.org/std/iter/trait.IntoIterator.html),
1228 then we can see that it describes types that can build iterators. In this
1229 case, we want an iterator that yields *another* generic type `T`, where
1230 `T` is the type of each field we want to write.
12315. `T` also appears in the method's `where` clause, but its constraint is the
1232 `AsRef<[u8]>` bound. The `AsRef` trait is a way to describe zero cost
1233 conversions between types in Rust. In this case, the `[u8]` in `AsRef<[u8]>`
1234 means that we want to be able to *borrow* a slice of bytes from `T`.
1235 The CSV writer will take these bytes and write them as a single field.
1236 The `AsRef<[u8]>` bound is useful because types like `String`, `&str`,
1237 `Vec<u8>` and `&[u8]` all satisfy it.
12386. Finally, the method returns a `csv::Result<()>`, which is short-hand for
1239 `Result<(), csv::Error>`. That means `write_record` either returns nothing
1240 on success or returns a `csv::Error` on failure.
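
To make those bounds concrete, here is a small function of our own
(hypothetical, not part of this crate) that uses the same `where` clause as
`write_record` to compute the total number of bytes across all fields:

```no_run
// Accepts anything `write_record` would accept: slices of `&str`,
// vectors of `String`, `StringRecord`s and so on.
fn total_field_bytes<I, T>(record: I) -> usize
where
    I: IntoIterator<Item = T>,
    T: AsRef<[u8]>,
{
    record.into_iter().map(|field| field.as_ref().len()).sum()
}
```

For example, both `total_field_bytes(&["a", "bc"])` and
`total_field_bytes(vec![String::from("abc")])` compile, because slices and
vectors satisfy `IntoIterator` while `&str` and `String` satisfy `AsRef<[u8]>`.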
1241
1242Now, let's apply our new found understanding of the type signature of
1243`write_record`. If you recall, in our previous example, we used it like so:
1244
1245```ignore
1246wtr.write_record(&["field 1", "field 2", "etc"])?;
1247```
1248
1249So how do the types match up? Well, the type of each of our fields in this
1250code is `&'static str` (which is the type of a string literal in Rust). Since
1251we put them in a slice literal, the type of our parameter is
1252`&'static [&'static str]`, or more succinctly written as `&[&str]` without the
1253lifetime annotations. Since slices satisfy the `IntoIterator` bound and
1254strings satisfy the `AsRef<[u8]>` bound, this ends up being a legal call.
1255
1256Here are a few more examples of ways you can call `write_record`:
1257
1258```no_run
1259# use csv;
1260# let mut wtr = csv::Writer::from_writer(vec![]);
1261// A slice of byte strings.
1262wtr.write_record(&[b"a", b"b", b"c"]);
1263// A vector.
1264wtr.write_record(vec!["a", "b", "c"]);
1265// A string record.
1266wtr.write_record(&csv::StringRecord::from(vec!["a", "b", "c"]));
1267// A byte record.
1268wtr.write_record(&csv::ByteRecord::from(vec!["a", "b", "c"]));
1269```
1270
1271Finally, the example above can be easily adapted to write to a file instead
1272of `stdout`:
1273
1274```no_run
1275//tutorial-write-02.rs
1276use std::env;
1277use std::error::Error;
1278use std::ffi::OsString;
1279use std::process;
1280
1281fn run() -> Result<(), Box<dyn Error>> {
1282 let file_path = get_first_arg()?;
1283 let mut wtr = csv::Writer::from_path(file_path)?;
1284
1285 wtr.write_record(&["City", "State", "Population", "Latitude", "Longitude"])?;
1286 wtr.write_record(&["Davidsons Landing", "AK", "", "65.2419444", "-165.2716667"])?;
1287 wtr.write_record(&["Kenai", "AK", "7610", "60.5544444", "-151.2583333"])?;
1288 wtr.write_record(&["Oakman", "AL", "", "33.7133333", "-87.3886111"])?;
1289
1290 wtr.flush()?;
1291 Ok(())
1292}
1293
1294/// Returns the first positional argument sent to this process. If there are no
1295/// positional arguments, then this returns an error.
1296fn get_first_arg() -> Result<OsString, Box<dyn Error>> {
1297 match env::args_os().nth(1) {
1298 None => Err(From::from("expected 1 argument, but got none")),
1299 Some(file_path) => Ok(file_path),
1300 }
1301}
1302
1303fn main() {
1304 if let Err(err) = run() {
1305 println!("{}", err);
1306 process::exit(1);
1307 }
1308}
1309```
1310
1311## Writing tab separated values
1312
1313In the previous section, we saw how to write some simple CSV data to `stdout`
1314that looked like this:
1315
1316```text
1317City,State,Population,Latitude,Longitude
1318Davidsons Landing,AK,,65.2419444,-165.2716667
1319Kenai,AK,7610,60.5544444,-151.2583333
1320Oakman,AL,,33.7133333,-87.3886111
1321```
1322
1323You might wonder to yourself: what's the point of using a CSV writer if the
1324data is so simple? Well, the benefit of a CSV writer is that it can handle all
1325types of data without sacrificing the integrity of your data. That is, it knows
1326when to quote fields that contain special CSV characters (like commas or new
1327lines) or escape literal quotes that appear in your data. The CSV writer can
1328also be easily configured to use different delimiters or quoting strategies.
1329
In this section, we'll take a look at how to tweak some of the settings
1331on a CSV writer. In particular, we'll write TSV ("tab separated values")
1332instead of CSV, and we'll ask the CSV writer to quote all non-numeric fields.
1333Here's an example:
1334
1335```no_run
1336//tutorial-write-delimiter-01.rs
1337# use std::error::Error;
1338# use std::io;
1339# use std::process;
1340#
1341fn run() -> Result<(), Box<dyn Error>> {
1342 let mut wtr = csv::WriterBuilder::new()
1343 .delimiter(b'\t')
1344 .quote_style(csv::QuoteStyle::NonNumeric)
1345 .from_writer(io::stdout());
1346
1347 wtr.write_record(&["City", "State", "Population", "Latitude", "Longitude"])?;
1348 wtr.write_record(&["Davidsons Landing", "AK", "", "65.2419444", "-165.2716667"])?;
1349 wtr.write_record(&["Kenai", "AK", "7610", "60.5544444", "-151.2583333"])?;
1350 wtr.write_record(&["Oakman", "AL", "", "33.7133333", "-87.3886111"])?;
1351
1352 wtr.flush()?;
1353 Ok(())
1354}
1355#
1356# fn main() {
1357# if let Err(err) = run() {
1358# println!("{}", err);
1359# process::exit(1);
1360# }
1361# }
1362```
1363
1364Compiling and running this example gives:
1365
1366```text
1367$ cargo build
1368$ ./target/debug/csvtutor
1369"City" "State" "Population" "Latitude" "Longitude"
1370"Davidsons Landing" "AK" "" 65.2419444 -165.2716667
1371"Kenai" "AK" 7610 60.5544444 -151.2583333
1372"Oakman" "AL" "" 33.7133333 -87.3886111
1373```
1374
1375In this example, we used a new type
1376[`QuoteStyle`](../enum.QuoteStyle.html).
1377The `QuoteStyle` type represents the different quoting strategies available
1378to you. The default is to add quotes to fields only when necessary. This
1379probably works for most use cases, but you can also ask for quotes to always
1380be put around fields, to never be put around fields or to always be put around
1381non-numeric fields.
1382
1383## Writing with Serde
1384
1385Just like the CSV reader supports automatic deserialization into Rust types
1386with Serde, the CSV writer supports automatic serialization from Rust types
1387into CSV records using Serde. In this section, we'll learn how to use it.
1388
1389As with reading, let's start by seeing how we can serialize a Rust tuple.
1390
1391```no_run
1392//tutorial-write-serde-01.rs
1393# use std::error::Error;
1394# use std::io;
1395# use std::process;
1396#
1397fn run() -> Result<(), Box<dyn Error>> {
1398 let mut wtr = csv::Writer::from_writer(io::stdout());
1399
1400 // We still need to write headers manually.
1401 wtr.write_record(&["City", "State", "Population", "Latitude", "Longitude"])?;
1402
1403 // But now we can write records by providing a normal Rust value.
1404 //
1405 // Note that the odd `None::<u64>` syntax is required because `None` on
1406 // its own doesn't have a concrete type, but Serde needs a concrete type
1407 // in order to serialize it. That is, `None` has type `Option<T>` but
1408 // `None::<u64>` has type `Option<u64>`.
1409 wtr.serialize(("Davidsons Landing", "AK", None::<u64>, 65.2419444, -165.2716667))?;
1410 wtr.serialize(("Kenai", "AK", Some(7610), 60.5544444, -151.2583333))?;
1411 wtr.serialize(("Oakman", "AL", None::<u64>, 33.7133333, -87.3886111))?;
1412
1413 wtr.flush()?;
1414 Ok(())
1415}
1416#
1417# fn main() {
1418# if let Err(err) = run() {
1419# println!("{}", err);
1420# process::exit(1);
1421# }
1422# }
1423```
1424
1425Compiling and running this program gives the expected output:
1426
1427```text
1428$ cargo build
1429$ ./target/debug/csvtutor
1430City,State,Population,Latitude,Longitude
1431Davidsons Landing,AK,,65.2419444,-165.2716667
1432Kenai,AK,7610,60.5544444,-151.2583333
1433Oakman,AL,,33.7133333,-87.3886111
1434```
1435
1436The key thing to note in the above example is the use of `serialize` instead
1437of `write_record` to write our data. In particular, `write_record` is used
1438when writing a simple record that contains string-like data only. On the other
1439hand, `serialize` is used when your data consists of more complex values like
1440numbers, floats or optional values. Of course, you could always convert the
1441complex values to strings and then use `write_record`, but Serde can do it for
1442you automatically.
1443
1444As with reading, we can also serialize custom structs as CSV records. As a
1445bonus, the fields in a struct will automatically be written as a header
1446record!
1447
1448To write custom structs as CSV records, we'll need to make use of Serde's
1449automatic `derive` feature again. As in the
1450[previous section on reading with Serde](#reading-with-serde),
we'll need the serde dependency (with its `derive` feature) in the
`[dependencies]` section of our `Cargo.toml` (if it isn't already there):
1453
1454```text
1455serde = { version = "1", features = ["derive"] }
1456```
1457
1458And we'll also need to add a new `use` statement to our code, for Serde, as
1459shown in the example:
1460
1461```no_run
1462//tutorial-write-serde-02.rs
1463use std::error::Error;
1464use std::io;
1465use std::process;
1466
1467use serde::Serialize;
1468
1469// Note that structs can derive both Serialize and Deserialize!
1470#[derive(Debug, Serialize)]
1471#[serde(rename_all = "PascalCase")]
1472struct Record<'a> {
1473 city: &'a str,
1474 state: &'a str,
1475 population: Option<u64>,
1476 latitude: f64,
1477 longitude: f64,
1478}
1479
1480fn run() -> Result<(), Box<dyn Error>> {
1481 let mut wtr = csv::Writer::from_writer(io::stdout());
1482
1483 wtr.serialize(Record {
1484 city: "Davidsons Landing",
1485 state: "AK",
1486 population: None,
1487 latitude: 65.2419444,
1488 longitude: -165.2716667,
1489 })?;
1490 wtr.serialize(Record {
1491 city: "Kenai",
1492 state: "AK",
1493 population: Some(7610),
1494 latitude: 60.5544444,
1495 longitude: -151.2583333,
1496 })?;
1497 wtr.serialize(Record {
1498 city: "Oakman",
1499 state: "AL",
1500 population: None,
1501 latitude: 33.7133333,
1502 longitude: -87.3886111,
1503 })?;
1504
1505 wtr.flush()?;
1506 Ok(())
1507}
1508
1509fn main() {
1510 if let Err(err) = run() {
1511 println!("{}", err);
1512 process::exit(1);
1513 }
1514}
1515```
1516
1517Compiling and running this example has the same output as last time, even
1518though we didn't explicitly write a header record:
1519
1520```text
1521$ cargo build
1522$ ./target/debug/csvtutor
1523City,State,Population,Latitude,Longitude
1524Davidsons Landing,AK,,65.2419444,-165.2716667
1525Kenai,AK,7610,60.5544444,-151.2583333
1526Oakman,AL,,33.7133333,-87.3886111
1527```
1528
1529In this case, the `serialize` method noticed that we were writing a struct
1530with field names. When this happens, `serialize` will automatically write a
1531header record (only if no other records have been written) that consists of
1532the fields in the struct in the order in which they are defined. Note that
1533this behavior can be disabled with the
1534[`WriterBuilder::has_headers`](../struct.WriterBuilder.html#method.has_headers)
1535method.
1536
1537It's also worth pointing out the use of a *lifetime parameter* in our `Record`
1538struct:
1539
1540```ignore
1541struct Record<'a> {
1542 city: &'a str,
1543 state: &'a str,
1544 population: Option<u64>,
1545 latitude: f64,
1546 longitude: f64,
1547}
1548```
1549
1550The `'a` lifetime parameter corresponds to the lifetime of the `city` and
1551`state` string slices. This says that the `Record` struct contains *borrowed*
1552data. We could have written our struct without borrowing any data, and
1553therefore, without any lifetime parameters:
1554
1555```ignore
1556struct Record {
1557 city: String,
1558 state: String,
1559 population: Option<u64>,
1560 latitude: f64,
1561 longitude: f64,
1562}
1563```
1564
1565However, since we had to replace our borrowed `&str` types with owned `String`
1566types, we're now forced to allocate a new `String` value for both of `city`
1567and `state` for every record that we write. There's no intrinsic problem with
1568doing that, but it might be a bit wasteful.
1569
1570For more examples and more details on the rules for serialization, please see
1571the
1572[`Writer::serialize`](../struct.Writer.html#method.serialize)
1573method.
1574
1575# Pipelining
1576
1577In this section, we're going to cover a few examples that demonstrate programs
1578that take CSV data as input, and produce possibly transformed or filtered CSV
1579data as output. This shows how to write a complete program that efficiently
1580reads and writes CSV data. Rust is well positioned to perform this task, since
1581you'll get great performance with the convenience of a high level CSV library.
1582
1583## Filter by search
1584
1585The first example of CSV pipelining we'll look at is a simple filter. It takes
1586as input some CSV data on stdin and a single string query as its only
1587positional argument, and it will produce as output CSV data that only contains
1588rows with a field that matches the query.
1589
1590```no_run
1591//tutorial-pipeline-search-01.rs
1592use std::env;
1593use std::error::Error;
1594use std::io;
1595use std::process;
1596
1597fn run() -> Result<(), Box<dyn Error>> {
1598 // Get the query from the positional arguments.
1599 // If one doesn't exist, return an error.
1600 let query = match env::args().nth(1) {
1601 None => return Err(From::from("expected 1 argument, but got none")),
1602 Some(query) => query,
1603 };
1604
1605 // Build CSV readers and writers to stdin and stdout, respectively.
1606 let mut rdr = csv::Reader::from_reader(io::stdin());
1607 let mut wtr = csv::Writer::from_writer(io::stdout());
1608
1609 // Before reading our data records, we should write the header record.
1610 wtr.write_record(rdr.headers()?)?;
1611
1612 // Iterate over all the records in `rdr`, and write only records containing
1613 // `query` to `wtr`.
1614 for result in rdr.records() {
1615 let record = result?;
1616 if record.iter().any(|field| field == &query) {
1617 wtr.write_record(&record)?;
1618 }
1619 }
1620
1621 // CSV writers use an internal buffer, so we should always flush when done.
1622 wtr.flush()?;
1623 Ok(())
1624}
1625
1626fn main() {
1627 if let Err(err) = run() {
1628 println!("{}", err);
1629 process::exit(1);
1630 }
1631}
1632```
1633
1634If we compile and run this program with a query of `MA` on `uspop.csv`, we'll
1635see that only one record matches:
1636
1637```text
1638$ cargo build
1639$ ./csvtutor MA < uspop.csv
1640City,State,Population,Latitude,Longitude
1641Reading,MA,23441,42.5255556,-71.0958333
1642```
1643
1644This example doesn't actually introduce anything new. It merely combines what
1645you've already learned about CSV readers and writers from previous sections.
1646
1647Let's add a twist to this example. In the real world, you're often faced with
1648messy CSV data that might not be encoded correctly. One example you might come
1649across is CSV data encoded in
1650[Latin-1](https://en.wikipedia.org/wiki/ISO/IEC_8859-1).
1651Unfortunately, for the examples we've seen so far, our CSV reader assumes that
1652all of the data is UTF-8. Since all of the data we've worked on has been
1653ASCII---which is a subset of both Latin-1 and UTF-8---we haven't had any
1654problems. But let's introduce a slightly tweaked version of our `uspop.csv`
1655file that contains an encoding of a Latin-1 character that is invalid UTF-8.
1656You can get the data like so:
1657
1658```text
1659$ curl -LO 'https://raw.githubusercontent.com/BurntSushi/rust-csv/master/examples/data/uspop-latin1.csv'
1660```
1661
Even though I've already given away the problem, let's see what happens when
1663we try to run our previous example on this new data:
1664
1665```text
1666$ ./csvtutor MA < uspop-latin1.csv
1667City,State,Population,Latitude,Longitude
1668CSV parse error: record 3 (line 4, field: 0, byte: 125): invalid utf-8: invalid UTF-8 in field 0 near byte index 0
1669```
1670
1671The error message tells us exactly what's wrong. Let's take a look at line 4
1672to see what we're dealing with:
1673
1674```text
1675$ head -n4 uspop-latin1.csv | tail -n1
1676Õakman,AL,,33.7133333,-87.3886111
1677```
1678
1679In this case, the very first character is the Latin-1 `Õ`, which is encoded as
1680the byte `0xD5`, which is in turn invalid UTF-8. So what do we do now that our
1681CSV parser has choked on our data? You have two choices. The first is to go in
1682and fix up your CSV data so that it's valid UTF-8. This is probably a good
1683idea anyway, and tools like `iconv` can help with the task of transcoding.
1684But if you can't or don't want to do that, then you can instead read CSV data
1685in a way that is mostly encoding agnostic (so long as ASCII is still a valid
1686subset). The trick is to use *byte records* instead of *string records*.
1687
1688Thus far, we haven't actually talked much about the type of a record in this
1689library, but now is a good time to introduce them. There are two of them,
1690[`StringRecord`](../struct.StringRecord.html)
1691and
1692[`ByteRecord`](../struct.ByteRecord.html).
Each of them represents a single record in CSV data, where a record is a
sequence of an arbitrary number of fields. The only difference between
`StringRecord` and `ByteRecord` is that `StringRecord` is guaranteed to be
valid UTF-8, whereas `ByteRecord` contains arbitrary bytes.
1697
1698Armed with that knowledge, we can now begin to understand why we saw an error
1699when we ran the last example on data that wasn't UTF-8. Namely, when we call
1700`records`, we get back an iterator of `StringRecord`. Since `StringRecord` is
1701guaranteed to be valid UTF-8, trying to build a `StringRecord` with invalid
1702UTF-8 will result in the error that we see.
1703
1704All we need to do to make our example work is to switch from a `StringRecord`
1705to a `ByteRecord`. This means using `byte_records` to create our iterator
1706instead of `records`, and similarly using `byte_headers` instead of `headers`
1707if we think our header data might contain invalid UTF-8 as well. Here's the
1708change:
1709
1710```no_run
1711//tutorial-pipeline-search-02.rs
1712# use std::env;
1713# use std::error::Error;
1714# use std::io;
1715# use std::process;
1716#
1717fn run() -> Result<(), Box<dyn Error>> {
1718 let query = match env::args().nth(1) {
1719 None => return Err(From::from("expected 1 argument, but got none")),
1720 Some(query) => query,
1721 };
1722
1723 let mut rdr = csv::Reader::from_reader(io::stdin());
1724 let mut wtr = csv::Writer::from_writer(io::stdout());
1725
1726 wtr.write_record(rdr.byte_headers()?)?;
1727
1728 for result in rdr.byte_records() {
1729 let record = result?;
1730 // `query` is a `String` while `field` is now a `&[u8]`, so we'll
1731 // need to convert `query` to `&[u8]` before doing a comparison.
1732 if record.iter().any(|field| field == query.as_bytes()) {
1733 wtr.write_record(&record)?;
1734 }
1735 }
1736
1737 wtr.flush()?;
1738 Ok(())
1739}
1740#
1741# fn main() {
1742# if let Err(err) = run() {
1743# println!("{}", err);
1744# process::exit(1);
1745# }
1746# }
1747```
1748
1749Compiling and running this now yields the same results as our first example,
1750but this time it works on data that isn't valid UTF-8.
1751
1752```text
1753$ cargo build
$ ./target/debug/csvtutor MA < uspop-latin1.csv
1755City,State,Population,Latitude,Longitude
1756Reading,MA,23441,42.5255556,-71.0958333
1757```
1758
1759## Filter by population count
1760
1761In this section, we will show another example program that both reads and
1762writes CSV data, but instead of dealing with arbitrary records, we will use
1763Serde to deserialize and serialize records with specific types.
1764
1765For this program, we'd like to be able to filter records in our population data
1766by population count. Specifically, we'd like to see which records meet a
1767certain population threshold. In addition to using a simple inequality, we must
1768also account for records that have a missing population count. This is where
1769types like `Option<T>` come in handy, because the compiler will force us to
1770consider the case when the population count is missing.
1771
1772Since we're using Serde in this example, don't forget to add the Serde
1773dependencies to your `Cargo.toml` in your `[dependencies]` section if they
1774aren't already there:
1775
1776```text
1777serde = { version = "1", features = ["derive"] }
1778```
1779
1780Now here's the code:
1781
1782```no_run
1783//tutorial-pipeline-pop-01.rs
1784use std::env;
1785use std::error::Error;
1786use std::io;
1787use std::process;
1788
1789use serde::{Deserialize, Serialize};
1790
1791// Unlike previous examples, we derive both Deserialize and Serialize. This
1792// means we'll be able to automatically deserialize and serialize this type.
1793#[derive(Debug, Deserialize, Serialize)]
1794#[serde(rename_all = "PascalCase")]
1795struct Record {
1796 city: String,
1797 state: String,
1798 population: Option<u64>,
1799 latitude: f64,
1800 longitude: f64,
1801}
1802
1803fn run() -> Result<(), Box<dyn Error>> {
1804 // Get the query from the positional arguments.
1805 // If one doesn't exist or isn't an integer, return an error.
1806 let minimum_pop: u64 = match env::args().nth(1) {
1807 None => return Err(From::from("expected 1 argument, but got none")),
1808 Some(arg) => arg.parse()?,
1809 };
1810
1811 // Build CSV readers and writers to stdin and stdout, respectively.
1812 // Note that we don't need to write headers explicitly. Since we're
1813 // serializing a custom struct, that's done for us automatically.
1814 let mut rdr = csv::Reader::from_reader(io::stdin());
1815 let mut wtr = csv::Writer::from_writer(io::stdout());
1816
1817 // Iterate over all the records in `rdr`, and write only records containing
1818 // a population that is greater than or equal to `minimum_pop`.
1819 for result in rdr.deserialize() {
1820 // Remember that when deserializing, we must use a type hint to
1821 // indicate which type we want to deserialize our record into.
1822 let record: Record = result?;
1823
        // `map_or` is a combinator on `Option`. It takes two parameters:
1825 // a value to use when the `Option` is `None` (i.e., the record has
1826 // no population count) and a closure that returns another value of
1827 // the same type when the `Option` is `Some`. In this case, we test it
1828 // against our minimum population count that we got from the command
1829 // line.
1830 if record.population.map_or(false, |pop| pop >= minimum_pop) {
1831 wtr.serialize(record)?;
1832 }
1833 }
1834
1835 // CSV writers use an internal buffer, so we should always flush when done.
1836 wtr.flush()?;
1837 Ok(())
1838}
1839
1840fn main() {
1841 if let Err(err) = run() {
1842 println!("{}", err);
1843 process::exit(1);
1844 }
1845}
1846```
1847
1848If we compile and run our program with a minimum threshold of `100000`, we
1849should see three matching records. Notice that the headers were added even
1850though we never explicitly wrote them!
1851
1852```text
1853$ cargo build
1854$ ./target/debug/csvtutor 100000 < uspop.csv
1855City,State,Population,Latitude,Longitude
1856Fontana,CA,169160,34.0922222,-117.4341667
1857Bridgeport,CT,139090,41.1669444,-73.2052778
1858Indianapolis,IN,773283,39.7683333,-86.1580556
1859```
1860
1861# Performance
1862
1863In this section, we'll go over how to squeeze the most juice out of our CSV
1864reader. As it happens, most of the APIs we've seen so far were designed with
1865high level convenience in mind, and that often comes with some costs. For the
1866most part, those costs revolve around unnecessary allocations. Therefore, most
1867of the section will show how to do CSV parsing with as little allocation as
1868possible.
1869
1870There are two critical preliminaries we must cover.
1871
1872Firstly, when you care about performance, you should compile your code
1873with `cargo build --release` instead of `cargo build`. The `--release`
1874flag instructs the compiler to spend more time optimizing your code. When
1875compiling with the `--release` flag, you'll find your compiled program at
1876`target/release/csvtutor` instead of `target/debug/csvtutor`. Throughout this
1877tutorial, we've used `cargo build` because our dataset was small and we weren't
1878focused on speed. The downside of `cargo build --release` is that it will take
1879longer than `cargo build`.
1880
1881Secondly, the dataset we've used throughout this tutorial only has 100 records.
1882We'd have to try really hard to cause our program to run slowly on 100 records,
1883even when we compile without the `--release` flag. Therefore, in order to
1884actually witness a performance difference, we need a bigger dataset. To get
1885such a dataset, we'll use the original source of `uspop.csv`. **Warning: the
1886download is 41MB compressed and decompresses to 145MB.**
1887
1888```text
1889$ curl -LO http://burntsushi.net/stuff/worldcitiespop.csv.gz
1890$ gunzip worldcitiespop.csv.gz
1891$ wc worldcitiespop.csv
1892 3173959 5681543 151492068 worldcitiespop.csv
1893$ md5sum worldcitiespop.csv
18946198bd180b6d6586626ecbf044c1cca5 worldcitiespop.csv
1895```
1896
1897Finally, it's worth pointing out that this section is not attempting to
1898present a rigorous set of benchmarks. We will stay away from rigorous analysis
1899and instead rely a bit more on wall clock times and intuition.
1900
1901## Amortizing allocations
1902
1903In order to measure performance, we must be careful about what it is we're
1904measuring. We must also be careful to not change the thing we're measuring as
1905we make improvements to the code. For this reason, we will focus on measuring
1906how long it takes to count the number of records corresponding to city
population counts in Massachusetts. This is a very small amount of work
that requires us to visit every record, and therefore represents a decent way
1909to measure how long it takes to do CSV parsing.
1910
1911Before diving into our first optimization, let's start with a baseline by
1912adapting a previous example to count the number of records in
1913`worldcitiespop.csv`:
1914
1915```no_run
1916//tutorial-perf-alloc-01.rs
1917use std::error::Error;
1918use std::io;
1919use std::process;
1920
1921fn run() -> Result<u64, Box<dyn Error>> {
1922 let mut rdr = csv::Reader::from_reader(io::stdin());
1923
1924 let mut count = 0;
1925 for result in rdr.records() {
1926 let record = result?;
1927 if &record[0] == "us" && &record[3] == "MA" {
1928 count += 1;
1929 }
1930 }
1931 Ok(count)
1932}
1933
1934fn main() {
1935 match run() {
1936 Ok(count) => {
1937 println!("{}", count);
1938 }
1939 Err(err) => {
1940 println!("{}", err);
1941 process::exit(1);
1942 }
1943 }
1944}
1945```
1946
1947Now let's compile and run it and see what kind of timing we get. Don't forget
1948to compile with the `--release` flag. (For grins, try compiling without the
1949`--release` flag and see how long it takes to run the program!)
1950
1951```text
1952$ cargo build --release
1953$ time ./target/release/csvtutor < worldcitiespop.csv
19542176
1955
1956real 0m0.645s
1957user 0m0.627s
1958sys 0m0.017s
1959```
1960
1961All right, so what's the first thing we can do to make this faster? This
1962section promised to speed things up by amortizing allocation, but we can do
1963something even simpler first: iterate over
1964[`ByteRecord`](../struct.ByteRecord.html)s
1965instead of
1966[`StringRecord`](../struct.StringRecord.html)s.
1967If you recall from a previous section, a `StringRecord` is guaranteed to be
valid UTF-8, and therefore must validate that its contents are actually UTF-8.
1969(If validation fails, then the CSV reader will return an error.) If we remove
1970that validation from our program, then we can realize a nice speed boost as
1971shown in the next example:
1972
1973```no_run
1974//tutorial-perf-alloc-02.rs
1975# use std::error::Error;
1976# use std::io;
1977# use std::process;
1978#
1979fn run() -> Result<u64, Box<dyn Error>> {
1980 let mut rdr = csv::Reader::from_reader(io::stdin());
1981
1982 let mut count = 0;
1983 for result in rdr.byte_records() {
1984 let record = result?;
1985 if &record[0] == b"us" && &record[3] == b"MA" {
1986 count += 1;
1987 }
1988 }
1989 Ok(count)
1990}
1991#
1992# fn main() {
1993# match run() {
1994# Ok(count) => {
1995# println!("{}", count);
1996# }
1997# Err(err) => {
1998# println!("{}", err);
1999# process::exit(1);
2000# }
2001# }
2002# }
2003```
2004
2005And now compile and run:
2006
2007```text
2008$ cargo build --release
2009$ time ./target/release/csvtutor < worldcitiespop.csv
20102176
2011
2012real 0m0.429s
2013user 0m0.403s
2014sys 0m0.023s
2015```
2016
2017Our program is now approximately 30% faster, all because we removed UTF-8
2018validation. But was it actually okay to remove UTF-8 validation? What have we
2019lost? In this case, it is perfectly acceptable to drop UTF-8 validation and use
2020`ByteRecord` instead because all we're doing with the data in the record is
2021comparing two of its fields to raw bytes:
2022
2023```ignore
2024if &record[0] == b"us" && &record[3] == b"MA" {
2025 count += 1;
2026}
2027```
2028
2029In particular, it doesn't matter whether `record` is valid UTF-8 or not, since
2030we're checking for equality on the raw bytes themselves.
2031
2032UTF-8 validation via `StringRecord` is useful because it provides access to
fields as `&str` types, whereas `ByteRecord` provides fields as `&[u8]` types.
2034`&str` is the type of a borrowed string in Rust, which provides convenient
2035access to string APIs like substring search. Strings are also frequently used
2036in other areas, so they tend to be a useful thing to have. Therefore, sticking
2037with `StringRecord` is a good default, but if you need the extra speed and can
2038deal with arbitrary bytes, then switching to `ByteRecord` might be a good idea.
2039
2040Moving on, let's try to get another speed boost by amortizing allocation.
2041Amortizing allocation is the technique that creates an allocation once (or
2042very rarely), and then attempts to reuse it instead of creating additional
2043allocations. In the case of the previous examples, we used iterators created
by the `records` and `byte_records` methods on a CSV reader. These iterators
allocate a new record for every item they yield, which in turn corresponds
to a new allocation. They do this because iterators cannot yield items that
borrow from the iterator itself, and because creating new allocations tends to
be a lot more convenient.
2049
2050If we're willing to forgo use of iterators, then we can amortize allocations
2051by creating a *single* `ByteRecord` and asking the CSV reader to read into it.
2052We do this by using the
2053[`Reader::read_byte_record`](../struct.Reader.html#method.read_byte_record)
2054method.
2055
2056```no_run
2057//tutorial-perf-alloc-03.rs
2058# use std::error::Error;
2059# use std::io;
2060# use std::process;
2061#
2062fn run() -> Result<u64, Box<dyn Error>> {
2063 let mut rdr = csv::Reader::from_reader(io::stdin());
2064 let mut record = csv::ByteRecord::new();
2065
2066 let mut count = 0;
2067 while rdr.read_byte_record(&mut record)? {
2068 if &record[0] == b"us" && &record[3] == b"MA" {
2069 count += 1;
2070 }
2071 }
2072 Ok(count)
2073}
2074#
2075# fn main() {
2076# match run() {
2077# Ok(count) => {
2078# println!("{}", count);
2079# }
2080# Err(err) => {
2081# println!("{}", err);
2082# process::exit(1);
2083# }
2084# }
2085# }
2086```
2087
2088Compile and run:
2089
2090```text
2091$ cargo build --release
2092$ time ./target/release/csvtutor < worldcitiespop.csv
20932176
2094
2095real 0m0.308s
2096user 0m0.283s
2097sys 0m0.023s
2098```
2099
2100Woohoo! This represents *another* 30% boost over the previous example, which is
2101a 50% boost over the first example.
2102
2103Let's dissect this code by taking a look at the type signature of the
2104`read_byte_record` method:
2105
2106```ignore
2107fn read_byte_record(&mut self, record: &mut ByteRecord) -> csv::Result<bool>;
2108```
2109
2110This method takes as input a CSV reader (the `self` parameter) and a *mutable
2111borrow* of a `ByteRecord`, and returns a `csv::Result<bool>`. (The
2112`csv::Result<bool>` is equivalent to `Result<bool, csv::Error>`.) The return
2113value is `true` if and only if a record was read. When it's `false`, that means
2114the reader has exhausted its input. This method works by copying the contents
2115of the next record into the provided `ByteRecord`. Since the same `ByteRecord`
2116is used to read every record, it will already have space allocated for data.
2117When `read_byte_record` runs, it will overwrite the contents that were there
2118with the new record, which means that it can reuse the space that was
2119allocated. Thus, we have *amortized allocation*.
2120
2121An exercise you might consider doing is to use a `StringRecord` instead of a
2122`ByteRecord`, and therefore
2123[`Reader::read_record`](../struct.Reader.html#method.read_record)
2124instead of `read_byte_record`. This will give you easy access to Rust strings
2125at the cost of UTF-8 validation but *without* the cost of allocating a new
2126`StringRecord` for every record.
2127
2128## Serde and zero allocation
2129
2130In this section, we are going to briefly examine how we use Serde and what we
2131can do to speed it up. The key optimization we'll want to make is to---you
2132guessed it---amortize allocation.
2133
2134As with the previous section, let's start with a simple baseline based off an
2135example using Serde in a previous section:
2136
2137```no_run
2138//tutorial-perf-serde-01.rs
2139use std::error::Error;
2140use std::io;
2141use std::process;
2142
2143use serde::Deserialize;
2144
2145#[derive(Debug, Deserialize)]
2146#[serde(rename_all = "PascalCase")]
2147struct Record {
2148 country: String,
2149 city: String,
2150 accent_city: String,
2151 region: String,
2152 population: Option<u64>,
2153 latitude: f64,
2154 longitude: f64,
2155}
2156
2157fn run() -> Result<u64, Box<dyn Error>> {
2158 let mut rdr = csv::Reader::from_reader(io::stdin());
2159
2160 let mut count = 0;
2161 for result in rdr.deserialize() {
2162 let record: Record = result?;
2163 if record.country == "us" && record.region == "MA" {
2164 count += 1;
2165 }
2166 }
2167 Ok(count)
2168}
2169
2170fn main() {
2171 match run() {
2172 Ok(count) => {
2173 println!("{}", count);
2174 }
2175 Err(err) => {
2176 println!("{}", err);
2177 process::exit(1);
2178 }
2179 }
2180}
2181```
2182
2183Now compile and run this program:
2184
2185```text
2186$ cargo build --release
$ time ./target/release/csvtutor < worldcitiespop.csv
21882176
2189
2190real 0m1.381s
2191user 0m1.367s
2192sys 0m0.013s
2193```
2194
2195The first thing you might notice is that this is quite a bit slower than our
2196programs in the previous section. This is because deserializing each record
2197has a certain amount of overhead to it. In particular, some of the fields need
2198to be parsed as integers or floating point numbers, which isn't free. However,
2199there is hope yet, because we can speed up this program!
2200
2201Our first attempt to speed up the program will be to amortize allocation. Doing
2202this with Serde is a bit trickier than before, because we need to change our
2203`Record` type and use the manual deserialization API. Let's see what that looks
2204like:
2205
2206```no_run
2207//tutorial-perf-serde-02.rs
2208# use std::error::Error;
2209# use std::io;
2210# use std::process;
2211#
2212# use serde::Deserialize;
2213#
2214#[derive(Debug, Deserialize)]
2215#[serde(rename_all = "PascalCase")]
2216struct Record<'a> {
2217 country: &'a str,
2218 city: &'a str,
2219 accent_city: &'a str,
2220 region: &'a str,
2221 population: Option<u64>,
2222 latitude: f64,
2223 longitude: f64,
2224}
2225
2226fn run() -> Result<u64, Box<dyn Error>> {
2227 let mut rdr = csv::Reader::from_reader(io::stdin());
2228 let mut raw_record = csv::StringRecord::new();
2229 let headers = rdr.headers()?.clone();
2230
2231 let mut count = 0;
2232 while rdr.read_record(&mut raw_record)? {
2233 let record: Record = raw_record.deserialize(Some(&headers))?;
2234 if record.country == "us" && record.region == "MA" {
2235 count += 1;
2236 }
2237 }
2238 Ok(count)
2239}
2240#
2241# fn main() {
2242# match run() {
2243# Ok(count) => {
2244# println!("{}", count);
2245# }
2246# Err(err) => {
2247# println!("{}", err);
2248# process::exit(1);
2249# }
2250# }
2251# }
2252```
2253
2254Compile and run:
2255
2256```text
2257$ cargo build --release
$ time ./target/release/csvtutor < worldcitiespop.csv
22592176
2260
2261real 0m1.055s
2262user 0m1.040s
2263sys 0m0.013s
2264```
2265
2266This corresponds to an approximately 24% increase in performance. To achieve
2267this, we had to make two important changes.
2268
2269The first was to make our `Record` type contain `&str` fields instead of
2270`String` fields. If you recall from a previous section, `&str` is a *borrowed*
string, whereas a `String` is an *owned* string. A borrowed string points to
an already existing allocation, whereas a `String` always implies a new
2273allocation. In this case, our `&str` is borrowing from the CSV record itself.
2274
2275The second change we had to make was to stop using the
2276[`Reader::deserialize`](../struct.Reader.html#method.deserialize)
iterator, and instead read each record into a `StringRecord` explicitly
2278and then use the
2279[`StringRecord::deserialize`](../struct.StringRecord.html#method.deserialize)
2280method to deserialize a single record.
2281
2282The second change is a bit tricky, because in order for it to work, our
2283`Record` type needs to borrow from the data inside the `StringRecord`. That
2284means that our `Record` value cannot outlive the `StringRecord` that it was
2285created from. Since we overwrite the same `StringRecord` on each iteration
2286(in order to amortize allocation), that means our `Record` value must evaporate
2287before the next iteration of the loop. Indeed, the compiler will enforce this!
2288
2289There is one more optimization we can make: remove UTF-8 validation. In
2290general, this means using `&[u8]` instead of `&str` and `ByteRecord` instead
2291of `StringRecord`:
2292
2293```no_run
2294//tutorial-perf-serde-03.rs
2295# use std::error::Error;
2296# use std::io;
2297# use std::process;
2298#
2299# use serde::Deserialize;
2300#
2301#[derive(Debug, Deserialize)]
2302#[serde(rename_all = "PascalCase")]
2303struct Record<'a> {
2304 country: &'a [u8],
2305 city: &'a [u8],
2306 accent_city: &'a [u8],
2307 region: &'a [u8],
2308 population: Option<u64>,
2309 latitude: f64,
2310 longitude: f64,
2311}
2312
2313fn run() -> Result<u64, Box<dyn Error>> {
2314 let mut rdr = csv::Reader::from_reader(io::stdin());
2315 let mut raw_record = csv::ByteRecord::new();
2316 let headers = rdr.byte_headers()?.clone();
2317
2318 let mut count = 0;
2319 while rdr.read_byte_record(&mut raw_record)? {
2320 let record: Record = raw_record.deserialize(Some(&headers))?;
2321 if record.country == b"us" && record.region == b"MA" {
2322 count += 1;
2323 }
2324 }
2325 Ok(count)
2326}
2327#
2328# fn main() {
2329# match run() {
2330# Ok(count) => {
2331# println!("{}", count);
2332# }
2333# Err(err) => {
2334# println!("{}", err);
2335# process::exit(1);
2336# }
2337# }
2338# }
2339```
2340
2341Compile and run:
2342
2343```text
2344$ cargo build --release
$ time ./target/release/csvtutor < worldcitiespop.csv
23462176
2347
2348real 0m0.873s
2349user 0m0.850s
2350sys 0m0.023s
2351```
2352
2353This corresponds to a 17% increase over the previous example and a 37% increase
2354over the first example.
2355
2356In sum, Serde parsing is still quite fast, but will generally not be the
2357fastest way to parse CSV since it necessarily needs to do more work.
2358
2359## CSV parsing without the standard library
2360
2361In this section, we will explore a niche use case: parsing CSV without the
2362standard library. While the `csv` crate itself requires the standard library,
2363the underlying parser is actually part of the
2364[`csv-core`](https://docs.rs/csv-core)
2365crate, which does not depend on the standard library. The downside of not
2366depending on the standard library is that CSV parsing becomes a lot more
2367inconvenient.
2368
2369The `csv-core` crate is structured similarly to the `csv` crate. There is a
2370[`Reader`](../../csv_core/struct.Reader.html)
2371and a
2372[`Writer`](../../csv_core/struct.Writer.html),
2373as well as corresponding builders
2374[`ReaderBuilder`](../../csv_core/struct.ReaderBuilder.html)
2375and
2376[`WriterBuilder`](../../csv_core/struct.WriterBuilder.html).
2377The `csv-core` crate has no record types or iterators. Instead, CSV data
2378can either be read one field at a time or one record at a time. In this
2379section, we'll focus on reading a field at a time since it is simpler, but it
2380is generally faster to read a record at a time since it does more work per
2381function call.
2382
2383In keeping with this section on performance, let's write a program using only
2384`csv-core` that counts the number of records in the state of Massachusetts.
2385
2386(Note that we unfortunately use the standard library in this example even
2387though `csv-core` doesn't technically require it. We do this for convenient
2388access to I/O, which would be harder without the standard library.)
2389
2390```no_run
2391//tutorial-perf-core-01.rs
2392use std::io::{self, Read};
2393use std::process;
2394
2395use csv_core::{Reader, ReadFieldResult};
2396
2397fn run(mut data: &[u8]) -> Option<u64> {
2398 let mut rdr = Reader::new();
2399
2400 // Count the number of records in Massachusetts.
2401 let mut count = 0;
2402 // Indicates the current field index. Reset to 0 at start of each record.
2403 let mut fieldidx = 0;
2404 // True when the current record is in the United States.
2405 let mut inus = false;
2406 // Buffer for field data. Must be big enough to hold the largest field.
2407 let mut field = [0; 1024];
2408 loop {
2409 // Attempt to incrementally read the next CSV field.
2410 let (result, nread, nwrite) = rdr.read_field(data, &mut field);
2411 // nread is the number of bytes read from our input. We should never
2412 // pass those bytes to read_field again.
2413 data = &data[nread..];
2414 // nwrite is the number of bytes written to the output buffer `field`.
2415 // The contents of the buffer after this point is unspecified.
        // The contents of the buffer beyond `nwrite` are unspecified.
2417
2418 match result {
2419 // We don't need to handle this case because we read all of the
2420 // data up front. If we were reading data incrementally, then this
2421 // would be a signal to read more.
2422 ReadFieldResult::InputEmpty => {}
2423 // If we get this case, then we found a field that contains more
2424 // than 1024 bytes. We keep this example simple and just fail.
2425 ReadFieldResult::OutputFull => {
2426 return None;
2427 }
2428 // This case happens when we've successfully read a field. If the
2429 // field is the last field in a record, then `record_end` is true.
2430 ReadFieldResult::Field { record_end } => {
2431 if fieldidx == 0 && field == b"us" {
2432 inus = true;
2433 } else if inus && fieldidx == 3 && field == b"MA" {
2434 count += 1;
2435 }
2436 if record_end {
2437 fieldidx = 0;
2438 inus = false;
2439 } else {
2440 fieldidx += 1;
2441 }
2442 }
2443 // This case happens when the CSV reader has successfully exhausted
2444 // all input.
2445 ReadFieldResult::End => {
2446 break;
2447 }
2448 }
2449 }
2450 Some(count)
2451}
2452
2453fn main() {
2454 // Read the entire contents of stdin up front.
2455 let mut data = vec![];
2456 if let Err(err) = io::stdin().read_to_end(&mut data) {
2457 println!("{}", err);
2458 process::exit(1);
2459 }
2460 match run(&data) {
2461 None => {
2462 println!("error: could not count records, buffer too small");
2463 process::exit(1);
2464 }
2465 Some(count) => {
2466 println!("{}", count);
2467 }
2468 }
2469}
2470```
2471
2472And compile and run it:
2473
2474```text
2475$ cargo build --release
2476$ time ./target/release/csvtutor < worldcitiespop.csv
24772176
2478
2479real 0m0.572s
2480user 0m0.513s
2481sys 0m0.057s
2482```
2483
2484This isn't as fast as some of our previous examples where we used the `csv`
2485crate to read into a `StringRecord` or a `ByteRecord`. This is mostly because
2486this example reads a field at a time, which incurs more overhead than reading a
2487record at a time. To fix this, you would want to use the
2488[`Reader::read_record`](../../csv_core/struct.Reader.html#method.read_record)
2489method instead, which is defined on `csv_core::Reader`.
2490
2491The other thing to notice here is that the example is considerably longer than
the other examples. This is because we need to do more bookkeeping to keep
2493track of which field we're reading and how much data we've already fed to the
2494reader. There are basically two reasons to use the `csv_core` crate:
2495
24961. If you're in an environment where the standard library is not usable.
2. If you wanted to build your own CSV-like library, you could build it on top
2498 of `csv-core`.
2499
2500# Closing thoughts
2501
2502Congratulations on making it to the end! It seems incredible that one could
2503write so many words on something as basic as CSV parsing. I wanted this
2504guide to be accessible not only to Rust beginners, but to inexperienced
2505programmers as well. My hope is that the large number of examples will help
2506push you in the right direction.
2507
2508With that said, here are a few more things you might want to look at:
2509
2510* The [API documentation for the `csv` crate](../index.html) documents all
2511 facets of the library, and is itself littered with even more examples.
* The [`csv-index` crate](https://docs.rs/csv-index) provides data structures
  for indexing CSV data that are amenable to being written to disk. (This
  library is still a work in progress.)
2515* The [`xsv` command line tool](https://github.com/BurntSushi/xsv) is a high
2516 performance CSV swiss army knife. It can slice, select, search, sort, join,
2517 concatenate, index, format and compute statistics on arbitrary CSV data. Give
2518 it a try!
2519
2520*/