/*!
A tutorial for handling CSV data in Rust.

This tutorial will cover basic CSV reading and writing, automatic
(de)serialization with Serde, CSV transformations and performance.

This tutorial is targeted at beginner Rust programmers. Experienced Rust
programmers may find this tutorial to be too verbose, but skimming may be
useful. There is also a
[cookbook](../cookbook/index.html)
of examples for those who prefer more information density.

For an introduction to Rust, please see the
[official book](https://doc.rust-lang.org/book/second-edition/).
If you haven't written any Rust code yet but have written code in another
language, then this tutorial might be accessible to you without needing to read
the book first.

# Table of contents

1. [Setup](#setup)
1. [Basic error handling](#basic-error-handling)
    * [Switch to recoverable errors](#switch-to-recoverable-errors)
1. [Reading CSV](#reading-csv)
    * [Reading headers](#reading-headers)
    * [Delimiters, quotes and variable length records](#delimiters-quotes-and-variable-length-records)
    * [Reading with Serde](#reading-with-serde)
    * [Handling invalid data with Serde](#handling-invalid-data-with-serde)
1. [Writing CSV](#writing-csv)
    * [Writing tab separated values](#writing-tab-separated-values)
    * [Writing with Serde](#writing-with-serde)
1. [Pipelining](#pipelining)
    * [Filter by search](#filter-by-search)
    * [Filter by population count](#filter-by-population-count)
1. [Performance](#performance)
    * [Amortizing allocations](#amortizing-allocations)
    * [Serde and zero allocation](#serde-and-zero-allocation)
    * [CSV parsing without the standard library](#csv-parsing-without-the-standard-library)
1. [Closing thoughts](#closing-thoughts)

# Setup

In this section, we'll get you set up with a simple program that reads CSV
data and prints a "debug" version of each record. This assumes that you have
the
[Rust toolchain installed](https://www.rust-lang.org/install.html),
which includes both Rust and Cargo.

We'll start by creating a new Cargo project:

```text
$ cargo new --bin csvtutor
$ cd csvtutor
```

Once inside `csvtutor`, open `Cargo.toml` in your favorite text editor and add
`csv = "1.1"` to your `[dependencies]` section. At this point, your
`Cargo.toml` should look something like this:

```text
[package]
name = "csvtutor"
version = "0.1.0"
authors = ["Your Name"]

[dependencies]
csv = "1.1"
```

Next, let's build your project. Since you added the `csv` crate as a
dependency, Cargo will automatically download it and compile it for you. To
build your project, use Cargo:

```text
$ cargo build
```

This will produce a new binary, `csvtutor`, in your `target/debug` directory.
It won't do much at this point, but you can run it:

```text
$ ./target/debug/csvtutor
Hello, world!
```

Let's make our program do something useful. Our program will read CSV data on
stdin and print debug output for each record on stdout. To write this program,
open `src/main.rs` in your favorite text editor and replace its contents with
this:

```no_run
//tutorial-setup-01.rs
// Import the standard library's I/O module so we can read from stdin.
use std::io;

// The `main` function is where your program starts executing.
fn main() {
    // Create a CSV parser that reads data from stdin.
    let mut rdr = csv::Reader::from_reader(io::stdin());
    // Loop over each record.
    for result in rdr.records() {
        // An error may occur, so abort the program in an unfriendly way.
        // We will make this more friendly later!
        let record = result.expect("a CSV record");
        // Print a debug version of the record.
        println!("{:?}", record);
    }
}
```

Don't worry too much about what this code means; we'll dissect it in the next
section. For now, try rebuilding your project:

```text
$ cargo build
```

Assuming that succeeds, let's try running our program. But first, we will need
some CSV data to play with! For that, we will use a random selection of 100
US cities, along with their population size and geographical coordinates. (We
will use this same CSV data throughout the entire tutorial.) To get the data,
download it from GitHub:

```text
$ curl -LO 'https://raw.githubusercontent.com/BurntSushi/rust-csv/master/examples/data/uspop.csv'
```

And now finally, run your program on `uspop.csv`:

```text
$ ./target/debug/csvtutor < uspop.csv
StringRecord(["Davidsons Landing", "AK", "", "65.2419444", "-165.2716667"])
StringRecord(["Kenai", "AK", "7610", "60.5544444", "-151.2583333"])
StringRecord(["Oakman", "AL", "", "33.7133333", "-87.3886111"])
# ... and much more
```

# Basic error handling

Since reading CSV data can result in errors, error handling is pervasive
throughout the examples in this tutorial. Therefore, we're going to spend a
little bit of time going over basic error handling, and in particular, fix
our previous example to show errors in a more friendly way. **If you're already
comfortable with things like `Result` and `try!`/`?` in Rust, then you can
safely skip this section.**

Note that
[The Rust Programming Language Book](https://doc.rust-lang.org/book/second-edition/)
contains an
[introduction to general error handling](https://doc.rust-lang.org/book/second-edition/ch09-00-error-handling.html).
For a deeper dive, see
[my blog post on error handling in Rust](http://blog.burntsushi.net/rust-error-handling/).
The blog post is especially important if you plan on building Rust libraries.

With that out of the way, error handling in Rust comes in two different forms:
unrecoverable errors and recoverable errors.

Unrecoverable errors generally correspond to things like bugs in your program,
which might occur when an invariant or contract is broken. At that point, the
state of your program is unpredictable, and there's typically little recourse
other than *panicking*. In Rust, a panic is similar to simply aborting your
program, but it will unwind the stack and clean up resources before your
program exits.

On the other hand, recoverable errors generally correspond to predictable
errors. A non-existent file or invalid CSV data are examples of recoverable
errors. In Rust, recoverable errors are handled via `Result`. A `Result`
represents the state of a computation that has either succeeded or failed.
It is defined like so:

```
enum Result<T, E> {
    Ok(T),
    Err(E),
}
```

That is, a `Result` either contains a value of type `T` when the computation
succeeds, or it contains a value of type `E` when the computation fails.

The relationship between unrecoverable errors and recoverable errors is
important. In particular, it is **strongly discouraged** to treat recoverable
errors as if they were unrecoverable. For example, panicking when a file could
not be found, or if some CSV data is invalid, is considered bad practice.
Instead, predictable errors should be handled using Rust's `Result` type.

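To make that advice concrete before we return to CSV, here's a tiny sketch of
handling a fallible parse with `match` instead of panicking. (The
`parse_population` helper and its inputs are ours, invented for illustration;
they are not part of the tutorial's program.)

```rust
// A hypothetical helper: a recoverable parse error becomes `None`
// instead of a panic.
fn parse_population(field: &str) -> Option<u64> {
    match field.parse() {
        Ok(n) => Some(n),
        Err(_) => None, // malformed or empty input is treated as missing
    }
}

fn main() {
    assert_eq!(parse_population("7610"), Some(7610));
    assert_eq!(parse_population(""), None);
}
```

We'll meet this exact pattern again, via `Result::ok`, when we parse missing
population counts later in the tutorial.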
With our newfound knowledge, let's re-examine our previous example and dissect
its error handling.

```no_run
//tutorial-error-01.rs
use std::io;

fn main() {
    let mut rdr = csv::Reader::from_reader(io::stdin());
    for result in rdr.records() {
        let record = result.expect("a CSV record");
        println!("{:?}", record);
    }
}
```

There are two places where an error can occur in this program. The first is
if there was a problem reading a record from stdin. The second is if there is
a problem writing to stdout. In general, we will ignore the latter problem in
this tutorial, although robust command line applications should probably try
to handle it (e.g., when a broken pipe occurs). The former, however, is worth
looking into in more detail. For example, if a user of this program provides
invalid CSV data, then the program will panic:

```text
$ cat invalid
header1,header2
foo,bar
quux,baz,foobar
$ ./target/debug/csvtutor < invalid
StringRecord { position: Some(Position { byte: 16, line: 2, record: 1 }), fields: ["foo", "bar"] }
thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: UnequalLengths { pos: Some(Position { byte: 24, line: 3, record: 2 }), expected_len: 2, len: 3 }', /checkout/src/libcore/result.rs:859
note: Run with `RUST_BACKTRACE=1` for a backtrace.
```

What happened here? First and foremost, we should talk about why the CSV data
is invalid. The CSV data consists of three records: a header and two data
records. The header and first data record have two fields, but the second
data record has three fields. By default, the csv crate will treat inconsistent
record lengths as an error.
(This behavior can be toggled using the
[`ReaderBuilder::flexible`](../struct.ReaderBuilder.html#method.flexible)
config knob.) This explains why the first data record is printed in this
example, since it has the same number of fields as the header record. That is,
we don't actually hit an error until we parse the second data record.

(Note that the CSV reader automatically interprets the first record as a
header. This can be toggled with the
[`ReaderBuilder::has_headers`](../struct.ReaderBuilder.html#method.has_headers)
config knob.)

So what actually causes the panic to happen in our program? That would be the
first line in our loop:

```ignore
for result in rdr.records() {
    let record = result.expect("a CSV record"); // this panics
    println!("{:?}", record);
}
```

The key thing to understand here is that `rdr.records()` returns an iterator
that yields `Result` values. That is, instead of yielding records, it yields
a `Result` that contains either a record or an error. The `expect` method,
which is defined on `Result`, *unwraps* the success value inside the `Result`.
Since the `Result` might contain an error instead, `expect` will *panic* when
it does contain an error.

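This `Result`-per-item shape isn't unique to `rdr.records()`: any fallible
iterator works the same way, and one common alternative to calling `expect` on
each item is collecting into a single `Result`, which stops at the first
error. A sketch with plain string parsing standing in for CSV records (the
`parse_all` helper is ours, for illustration only):

```rust
use std::num::ParseIntError;

// Parse every field, short-circuiting on the first error.
fn parse_all(fields: &[&str]) -> Result<Vec<u64>, ParseIntError> {
    fields.iter().map(|field| field.parse::<u64>()).collect()
}

fn main() {
    // All items succeed, so we get the whole Vec back.
    assert_eq!(parse_all(&["1", "2", "3"]).unwrap(), vec![1, 2, 3]);
    // The first failure becomes the overall result.
    assert!(parse_all(&["1", "two", "3"]).is_err());
}
```
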
It might help to look at the implementation of `expect`:

```ignore
use std::fmt;

// This says, "for all types T and E, where E can be turned into a human
// readable debug message, define the `expect` method."
impl<T, E: fmt::Debug> Result<T, E> {
    fn expect(self, msg: &str) -> T {
        match self {
            Ok(t) => t,
            Err(e) => panic!("{}: {:?}", msg, e),
        }
    }
}
```

Since this causes a panic if the CSV data is invalid, and invalid CSV data is
a perfectly predictable error, we've turned what should be a *recoverable*
error into an *unrecoverable* error. We did this because using unrecoverable
errors is expedient. Since this is bad practice, we will endeavor to avoid
unrecoverable errors throughout the rest of the tutorial.

## Switch to recoverable errors

We'll convert our unrecoverable error to a recoverable error in three steps.
First, let's get rid of the panic and print an error message manually:

```no_run
//tutorial-error-02.rs
use std::io;
use std::process;

fn main() {
    let mut rdr = csv::Reader::from_reader(io::stdin());
    for result in rdr.records() {
        // Examine our Result.
        // If there was no problem, print the record.
        // Otherwise, print the error message and quit the program.
        match result {
            Ok(record) => println!("{:?}", record),
            Err(err) => {
                println!("error reading CSV from <stdin>: {}", err);
                process::exit(1);
            }
        }
    }
}
```

If we run our program again, we'll still see an error message, but it is no
longer a panic message:

```text
$ cat invalid
header1,header2
foo,bar
quux,baz,foobar
$ ./target/debug/csvtutor < invalid
StringRecord { position: Some(Position { byte: 16, line: 2, record: 1 }), fields: ["foo", "bar"] }
error reading CSV from <stdin>: CSV error: record 2 (line: 3, byte: 24): found record with 3 fields, but the previous record has 2 fields
```

The second step for moving to recoverable errors is to put our CSV record loop
into a separate function. This function then has the option of *returning* an
error, which our `main` function can then inspect and decide what to do with.

```no_run
//tutorial-error-03.rs
use std::error::Error;
use std::io;
use std::process;

fn main() {
    if let Err(err) = run() {
        println!("{}", err);
        process::exit(1);
    }
}

fn run() -> Result<(), Box<dyn Error>> {
    let mut rdr = csv::Reader::from_reader(io::stdin());
    for result in rdr.records() {
        // Examine our Result.
        // If there was no problem, print the record.
        // Otherwise, convert our error to a Box<dyn Error> and return it.
        match result {
            Err(err) => return Err(From::from(err)),
            Ok(record) => {
                println!("{:?}", record);
            }
        }
    }
    Ok(())
}
```

Our new function, `run`, has a return type of `Result<(), Box<dyn Error>>`. In
simple terms, this says that `run` either returns nothing when successful, or
if an error occurred, it returns a `Box<dyn Error>`, which stands for "any kind
of error." A `Box<dyn Error>` is hard to inspect if we care about the specific
error that occurred. But for our purposes, all we need to do is gracefully
print an error message and exit the program.

The third and final step is to replace our explicit `match` expression with a
special Rust language feature: the question mark.

```no_run
//tutorial-error-04.rs
use std::error::Error;
use std::io;
use std::process;

fn main() {
    if let Err(err) = run() {
        println!("{}", err);
        process::exit(1);
    }
}

fn run() -> Result<(), Box<dyn Error>> {
    let mut rdr = csv::Reader::from_reader(io::stdin());
    for result in rdr.records() {
        // This is effectively the same code as our `match` in the
        // previous example. In other words, `?` is syntactic sugar.
        let record = result?;
        println!("{:?}", record);
    }
    Ok(())
}
```

This last step shows how we can use `?` to automatically forward errors
to our caller without having to do explicit case analysis with `match`
ourselves. We will use `?` heavily throughout this tutorial, and it's
important to note that it can **only be used in functions that return
`Result`.**

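To see that restriction in action, here's a small sketch (a hypothetical
`parse_port` helper, not part of our CSV program) where `?` works precisely
because the enclosing function returns a `Result`; moving either `?` into a
`main` that returns `()` would be a compile error:

```rust
use std::error::Error;

// Extract and parse the port from a "host:port" string. Both uses of `?`
// forward their errors to the caller as a Box<dyn Error>.
fn parse_port(addr: &str) -> Result<u16, Box<dyn Error>> {
    // `rsplit` always yields at least one piece, so `ok_or` is defensive.
    let port = addr.rsplit(':').next().ok_or("missing port")?;
    // `?` converts the ParseIntError into a Box<dyn Error> automatically.
    Ok(port.parse()?)
}

fn main() {
    assert_eq!(parse_port("localhost:8080").unwrap(), 8080);
    assert!(parse_port("no-port-here").is_err());
}
```
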
We'll end this section with a word of caution: using `Box<dyn Error>` as our
error type is the minimally acceptable thing we can do here. Namely, while it
allows our program to gracefully handle errors, it makes it hard for callers to
inspect the specific error condition that occurred. However, since this is a
tutorial on writing command line programs that do CSV parsing, we will consider
ourselves satisfied. If you'd like to know more, or are interested in writing
a library that handles CSV data, then you should check out my
[blog post on error handling](http://blog.burntsushi.net/rust-error-handling/).

With all that said, if all you're doing is writing a one-off program to do
CSV transformations, then using methods like `expect` and panicking when an
error occurs is a perfectly reasonable thing to do. Nevertheless, this tutorial
will endeavor to show idiomatic code.

# Reading CSV

Now that we've got you set up and covered basic error handling, it's time to
do what we came here to do: handle CSV data. We've already seen how to read
CSV data from `stdin`, but this section will cover how to read CSV data from
files and how to configure our CSV reader to read data formatted with
different delimiters and quoting strategies.

First up, let's adapt the example we've been working with to accept a file
path argument instead of stdin.

```no_run
//tutorial-read-01.rs
use std::env;
use std::error::Error;
use std::ffi::OsString;
use std::fs::File;
use std::process;

fn run() -> Result<(), Box<dyn Error>> {
    let file_path = get_first_arg()?;
    let file = File::open(file_path)?;
    let mut rdr = csv::Reader::from_reader(file);
    for result in rdr.records() {
        let record = result?;
        println!("{:?}", record);
    }
    Ok(())
}

/// Returns the first positional argument sent to this process. If there are no
/// positional arguments, then this returns an error.
fn get_first_arg() -> Result<OsString, Box<dyn Error>> {
    match env::args_os().nth(1) {
        None => Err(From::from("expected 1 argument, but got none")),
        Some(file_path) => Ok(file_path),
    }
}

fn main() {
    if let Err(err) = run() {
        println!("{}", err);
        process::exit(1);
    }
}
```

If you replace the contents of your `src/main.rs` file with the above code,
then you should be able to rebuild your project and try it out:

```text
$ cargo build
$ ./target/debug/csvtutor uspop.csv
StringRecord(["Davidsons Landing", "AK", "", "65.2419444", "-165.2716667"])
StringRecord(["Kenai", "AK", "7610", "60.5544444", "-151.2583333"])
StringRecord(["Oakman", "AL", "", "33.7133333", "-87.3886111"])
# ... and much more
```

This example contains two new pieces of code:

1. Code for querying the positional arguments of your program. We put this code
   into its own function called `get_first_arg`. Our program expects a file
   path in the first position (which is indexed at `1`; the argument at index
   `0` is the executable name), so if one doesn't exist, then `get_first_arg`
   returns an error.
2. Code for opening a file. In `run`, we open a file using `File::open`. If
   there was a problem opening the file, we forward the error to the caller of
   `run` (which is `main` in this program). Note that we do *not* wrap the
   `File` in a buffer. The CSV reader does buffering internally, so there's
   no need for the caller to do it.

Now is a good time to introduce an alternate CSV reader constructor, which
makes it slightly more convenient to open CSV data from a file. That is,
instead of:

```ignore
let file_path = get_first_arg()?;
let file = File::open(file_path)?;
let mut rdr = csv::Reader::from_reader(file);
```

you can use:

```ignore
let file_path = get_first_arg()?;
let mut rdr = csv::Reader::from_path(file_path)?;
```

`csv::Reader::from_path` will open the file for you and return an error if
the file could not be opened.

## Reading headers

If you had a chance to look at the data inside `uspop.csv`, you would notice
that there is a header record that looks like this:

```text
City,State,Population,Latitude,Longitude
```

Now, if you look back at the output of the commands you've run so far, you'll
notice that the header record is never printed. Why is that? By default, the
CSV reader will interpret the first record in CSV data as a header, which
is typically distinct from the actual data in the records that follow.
Therefore, the header record is always skipped whenever you try to read or
iterate over the records in CSV data.

The CSV reader does not try to be smart about the header record and does
**not** employ any heuristics for automatically detecting whether the first
record is a header or not. Instead, if you don't want to treat the first record
as a header, you'll need to tell the CSV reader that there are no headers.

To configure a CSV reader to do this, we'll need to use a
[`ReaderBuilder`](../struct.ReaderBuilder.html)
to build a CSV reader with our desired configuration. Here's an example that
does just that. (Note that we've moved back to reading from `stdin`, since it
produces terser examples.)

```no_run
//tutorial-read-headers-01.rs
# use std::error::Error;
# use std::io;
# use std::process;
#
fn run() -> Result<(), Box<dyn Error>> {
    let mut rdr = csv::ReaderBuilder::new()
        .has_headers(false)
        .from_reader(io::stdin());
    for result in rdr.records() {
        let record = result?;
        println!("{:?}", record);
    }
    Ok(())
}
#
# fn main() {
#     if let Err(err) = run() {
#         println!("{}", err);
#         process::exit(1);
#     }
# }
```

If you compile and run this program with our `uspop.csv` data, then you'll see
that the header record is now printed:

```text
$ cargo build
$ ./target/debug/csvtutor < uspop.csv
StringRecord(["City", "State", "Population", "Latitude", "Longitude"])
StringRecord(["Davidsons Landing", "AK", "", "65.2419444", "-165.2716667"])
StringRecord(["Kenai", "AK", "7610", "60.5544444", "-151.2583333"])
StringRecord(["Oakman", "AL", "", "33.7133333", "-87.3886111"])
```

If you ever need to access the header record directly, then you can use the
[`Reader::headers`](../struct.Reader.html#method.headers)
method like so:

```no_run
//tutorial-read-headers-02.rs
# use std::error::Error;
# use std::io;
# use std::process;
#
fn run() -> Result<(), Box<dyn Error>> {
    let mut rdr = csv::Reader::from_reader(io::stdin());
    {
        // We nest this call in its own scope because of lifetimes.
        let headers = rdr.headers()?;
        println!("{:?}", headers);
    }
    for result in rdr.records() {
        let record = result?;
        println!("{:?}", record);
    }
    // We can ask for the headers at any time. There's no need to nest this
    // call in its own scope because we never try to borrow the reader again.
    let headers = rdr.headers()?;
    println!("{:?}", headers);
    Ok(())
}
#
# fn main() {
#     if let Err(err) = run() {
#         println!("{}", err);
#         process::exit(1);
#     }
# }
```

One interesting thing to note in this example is that we put the call to
`rdr.headers()` in its own scope. We do this because `rdr.headers()` returns
a *borrow* of the reader's internal header state. The nested scope in this
code allows the borrow to end before we try to iterate over the records. If
we didn't nest the call to `rdr.headers()` in its own scope, then the code
wouldn't compile, because we cannot borrow the reader's headers at the same
time that we try to borrow the reader to iterate over its records.

Another way of solving this problem is to *clone* the header record:

```ignore
let headers = rdr.headers()?.clone();
```

This converts it from a borrow of the CSV reader to a new owned value. This
makes the code a bit easier to read, but at the cost of copying the header
record into a new allocation.

## Delimiters, quotes and variable length records

In this section we'll temporarily depart from our `uspop.csv` data set and
show how to read some CSV data that is a little less clean. This CSV data
uses `;` as a delimiter, escapes quotes with `\"` (instead of `""`) and has
records of varying length. Here's the data, which contains a list of WWE
wrestlers and the year they started, if it's known:

```text
$ cat strange.csv
"\"Hacksaw\" Jim Duggan";1987
"Bret \"Hit Man\" Hart";1984
# We're not sure when Rafael started, so omit the year.
Rafael Halperin
"\"Big Cat\" Ernie Ladd";1964
"\"Macho Man\" Randy Savage";1985
"Jake \"The Snake\" Roberts";1986
```

To read this CSV data, we'll want to do the following:

1. Disable headers, since this data has none.
2. Change the delimiter from `,` to `;`.
3. Change the quote strategy from doubled (e.g., `""`) to escaped (e.g., `\"`).
4. Permit flexible length records, since some omit the year.
5. Ignore lines beginning with a `#`.

All of this (and more!) can be configured with a
[`ReaderBuilder`](../struct.ReaderBuilder.html),
as seen in the following example:

```no_run
//tutorial-read-delimiter-01.rs
# use std::error::Error;
# use std::io;
# use std::process;
#
fn run() -> Result<(), Box<dyn Error>> {
    let mut rdr = csv::ReaderBuilder::new()
        .has_headers(false)
        .delimiter(b';')
        .double_quote(false)
        .escape(Some(b'\\'))
        .flexible(true)
        .comment(Some(b'#'))
        .from_reader(io::stdin());
    for result in rdr.records() {
        let record = result?;
        println!("{:?}", record);
    }
    Ok(())
}
#
# fn main() {
#     if let Err(err) = run() {
#         println!("{}", err);
#         process::exit(1);
#     }
# }
```

Now re-compile your project and try running the program on `strange.csv`:

```text
$ cargo build
$ ./target/debug/csvtutor < strange.csv
StringRecord(["\"Hacksaw\" Jim Duggan", "1987"])
StringRecord(["Bret \"Hit Man\" Hart", "1984"])
StringRecord(["Rafael Halperin"])
StringRecord(["\"Big Cat\" Ernie Ladd", "1964"])
StringRecord(["\"Macho Man\" Randy Savage", "1985"])
StringRecord(["Jake \"The Snake\" Roberts", "1986"])
```

You should feel encouraged to play around with the settings. Some interesting
things you might try:

1. If you remove the `escape` setting, notice that no CSV errors are reported.
   Instead, records are still parsed. This is a feature of the CSV parser. Even
   though it gets the data slightly wrong, it still provides a parse that you
   might be able to work with. This is a useful property given the messiness
   of real world CSV data.
2. If you remove the `delimiter` setting, parsing still succeeds, although
   every record has exactly one field.
3. If you remove the `flexible` setting, the reader will print the first two
   records (since they both have the same number of fields), but will return a
   parse error on the third record, since it has only one field.

This covers most of the things you might want to configure on your CSV reader,
although there are a few other knobs. For example, you can change the record
terminator from a new line to any other character. (By default, the terminator
is `CRLF`, which treats each of `\r\n`, `\r` and `\n` as single record
terminators.) For more details, see the documentation and examples for each of
the methods on
[`ReaderBuilder`](../struct.ReaderBuilder.html).

| 712 | ## Reading with Serde |
| 713 | |
| 714 | One of the most convenient features of this crate is its support for |
| 715 | [Serde](https://serde.rs/). |
| 716 | Serde is a framework for automatically serializing and deserializing data into |
| 717 | Rust types. In simpler terms, that means instead of iterating over records |
| 718 | as an array of string fields, we can iterate over records of a specific type |
| 719 | of our choosing. |
| 720 | |
| 721 | For example, let's take a look at some data from our `uspop.csv` file: |
| 722 | |
| 723 | ```text |
| 724 | City,State,Population,Latitude,Longitude |
| 725 | Davidsons Landing,AK,,65.2419444,-165.2716667 |
| 726 | Kenai,AK,7610,60.5544444,-151.2583333 |
| 727 | ``` |
| 728 | |
| 729 | While some of these fields make sense as strings (`City`, `State`), other |
| 730 | fields look more like numbers. For example, `Population` looks like it contains |
| 731 | integers while `Latitude` and `Longitude` appear to contain decimals. If we |
| 732 | wanted to convert these fields to their "proper" types, then we'd need to do
| 733 | a lot of manual work. This next example shows how.
| 734 | |
| 735 | ```no_run |
| 736 | //tutorial-read-serde-01.rs |
| 737 | # use std::error::Error; |
| 738 | # use std::io; |
| 739 | # use std::process; |
| 740 | # |
| 741 | fn run() -> Result<(), Box<dyn Error>> { |
| 742 | let mut rdr = csv::Reader::from_reader(io::stdin()); |
| 743 | for result in rdr.records() { |
| 744 | let record = result?; |
| 745 | |
| 746 | let city = &record[0]; |
| 747 | let state = &record[1]; |
| 748 | // Some records are missing population counts, so if we can't |
| 749 | // parse a number, treat the population count as missing instead |
| 750 | // of returning an error. |
| 751 | let pop: Option<u64> = record[2].parse().ok(); |
| 752 | // Lucky us! Latitudes and longitudes are available for every record. |
| 753 | // Therefore, if one couldn't be parsed, return an error. |
| 754 | let latitude: f64 = record[3].parse()?; |
| 755 | let longitude: f64 = record[4].parse()?; |
| 756 | |
| 757 | println!( |
| 758 | "city: {:?}, state: {:?}, \ |
| 759 | pop: {:?}, latitude: {:?}, longitude: {:?}", |
| 760 | city, state, pop, latitude, longitude); |
| 761 | } |
| 762 | Ok(()) |
| 763 | } |
| 764 | # |
| 765 | # fn main() { |
| 766 | # if let Err(err) = run() { |
| 767 | # println!("{}", err); |
| 768 | # process::exit(1); |
| 769 | # } |
| 770 | # } |
| 771 | ``` |
| 772 | |
| 773 | The problem here is that we need to parse each individual field manually, which |
| 774 | can be labor intensive and repetitive. Serde, however, makes this process |
| 775 | automatic. For example, we can ask to deserialize every record into a tuple |
| 776 | type: `(String, String, Option<u64>, f64, f64)`. |
| 777 | |
| 778 | ```no_run |
| 779 | //tutorial-read-serde-02.rs |
| 780 | # use std::error::Error; |
| 781 | # use std::io; |
| 782 | # use std::process; |
| 783 | # |
| 784 | // This introduces a type alias so that we can conveniently reference our |
| 785 | // record type. |
| 786 | type Record = (String, String, Option<u64>, f64, f64); |
| 787 | |
| 788 | fn run() -> Result<(), Box<dyn Error>> { |
| 789 | let mut rdr = csv::Reader::from_reader(io::stdin()); |
| 790 | // Instead of creating an iterator with the `records` method, we create |
| 791 | // an iterator with the `deserialize` method. |
| 792 | for result in rdr.deserialize() { |
| 793 | // We must tell Serde what type we want to deserialize into. |
| 794 | let record: Record = result?; |
| 795 | println!("{:?}", record); |
| 796 | } |
| 797 | Ok(()) |
| 798 | } |
| 799 | # |
| 800 | # fn main() { |
| 801 | # if let Err(err) = run() { |
| 802 | # println!("{}", err); |
| 803 | # process::exit(1); |
| 804 | # } |
| 805 | # } |
| 806 | ``` |
| 807 | |
| 808 | Running this code should show output similar to that of previous examples:
| 809 | |
| 810 | ```text |
| 811 | $ cargo build |
| 812 | $ ./target/debug/csvtutor < uspop.csv |
| 813 | ("Davidsons Landing", "AK", None, 65.2419444, -165.2716667) |
| 814 | ("Kenai", "AK", Some(7610), 60.5544444, -151.2583333) |
| 815 | ("Oakman", "AL", None, 33.7133333, -87.3886111) |
| 816 | # ... and much more |
| 817 | ``` |
| 818 | |
| 819 | One of the downsides of using Serde this way is that the type you use must |
| 820 | match the order of fields as they appear in each record. This can be a pain |
| 821 | if your CSV data has a header record, since you might tend to think about each |
| 822 | field as a value of a particular named column rather than as a numbered field.
| 823 | One way to address this is to deserialize our record into a map type like
| 824 | [`HashMap`](https://doc.rust-lang.org/std/collections/struct.HashMap.html) |
| 825 | or |
| 826 | [`BTreeMap`](https://doc.rust-lang.org/std/collections/struct.BTreeMap.html). |
| 827 | The next example shows how, and in particular, notice that the only thing that |
| 828 | changed from the last example is the definition of the `Record` type alias and |
| 829 | a new `use` statement that imports `HashMap` from the standard library: |
| 830 | |
| 831 | ```no_run |
| 832 | //tutorial-read-serde-03.rs |
| 833 | use std::collections::HashMap; |
| 834 | # use std::error::Error; |
| 835 | # use std::io; |
| 836 | # use std::process; |
| 837 | |
| 838 | // This introduces a type alias so that we can conveniently reference our |
| 839 | // record type. |
| 840 | type Record = HashMap<String, String>; |
| 841 | |
| 842 | fn run() -> Result<(), Box<dyn Error>> { |
| 843 | let mut rdr = csv::Reader::from_reader(io::stdin()); |
| 844 | for result in rdr.deserialize() { |
| 845 | let record: Record = result?; |
| 846 | println!("{:?}", record); |
| 847 | } |
| 848 | Ok(()) |
| 849 | } |
| 850 | # |
| 851 | # fn main() { |
| 852 | # if let Err(err) = run() { |
| 853 | # println!("{}", err); |
| 854 | # process::exit(1); |
| 855 | # } |
| 856 | # } |
| 857 | ``` |
| 858 | |
| 859 | Running this program shows results similar to before, but each record is
| 860 | printed as a map: |
| 861 | |
| 862 | ```text |
| 863 | $ cargo build |
| 864 | $ ./target/debug/csvtutor < uspop.csv |
| 865 | {"City": "Davidsons Landing", "Latitude": "65.2419444", "State": "AK", "Population": "", "Longitude": "-165.2716667"} |
| 866 | {"City": "Kenai", "Population": "7610", "State": "AK", "Longitude": "-151.2583333", "Latitude": "60.5544444"} |
| 867 | {"State": "AL", "City": "Oakman", "Longitude": "-87.3886111", "Population": "", "Latitude": "33.7133333"} |
| 868 | ``` |
| 869 | |
| 870 | This approach works especially well when you need to read CSV data that has a
| 871 | header record but whose exact structure isn't known until your program runs.
| 872 | However, in our case, we know the structure of the data in `uspop.csv`. In |
| 873 | particular, with the `HashMap` approach, we've lost the specific types we had |
| 874 | for each field in the previous example when we deserialized each record into a |
| 875 | `(String, String, Option<u64>, f64, f64)`. Is there a way to identify fields |
| 876 | by their corresponding header name *and* assign each field its own unique |
| 877 | type? The answer is yes, but we'll need to bring in Serde's `derive` feature |
| 878 | first. You can do that by adding this to the `[dependencies]` section of your |
| 879 | `Cargo.toml` file: |
| 880 | |
| 881 | ```text |
| 882 | serde = { version = "1", features = ["derive"] } |
| 883 | ``` |
| 884 | |
| 885 | With this dependency added to our project, we can now define our own custom
| 886 | struct that represents our record. We then ask Serde to automatically write the
| 887 | glue code required to populate our struct from a CSV record. The next example
| 888 | shows how. Don't miss the new Serde import!
| 889 | |
| 890 | ```no_run |
| 891 | //tutorial-read-serde-04.rs |
| 892 | use std::error::Error; |
| 893 | use std::io; |
| 894 | use std::process; |
| 895 | |
| 896 | // This lets us write `#[derive(Deserialize)]`. |
| 897 | use serde::Deserialize; |
| 898 | |
| 899 | // We don't need to derive `Debug` (which doesn't require Serde), but it's a |
| 900 | // good habit to do it for all your types. |
| 901 | // |
| 902 | // Notice that the field names in this struct are NOT in the same order as |
| 903 | // the fields in the CSV data! |
| 904 | #[derive(Debug, Deserialize)] |
| 905 | #[serde(rename_all = "PascalCase")] |
| 906 | struct Record { |
| 907 | latitude: f64, |
| 908 | longitude: f64, |
| 909 | population: Option<u64>, |
| 910 | city: String, |
| 911 | state: String, |
| 912 | } |
| 913 | |
| 914 | fn run() -> Result<(), Box<dyn Error>> { |
| 915 | let mut rdr = csv::Reader::from_reader(io::stdin()); |
| 916 | for result in rdr.deserialize() { |
| 917 | let record: Record = result?; |
| 918 | println!("{:?}", record); |
| 919 | // Try this if you don't like each record smushed on one line: |
| 920 | // println!("{:#?}", record); |
| 921 | } |
| 922 | Ok(()) |
| 923 | } |
| 924 | |
| 925 | fn main() { |
| 926 | if let Err(err) = run() { |
| 927 | println!("{}", err); |
| 928 | process::exit(1); |
| 929 | } |
| 930 | } |
| 931 | ``` |
| 932 | |
| 933 | Compile and run this program to see output similar to before:
| 934 | |
| 935 | ```text |
| 936 | $ cargo build |
| 937 | $ ./target/debug/csvtutor < uspop.csv |
| 938 | Record { latitude: 65.2419444, longitude: -165.2716667, population: None, city: "Davidsons Landing", state: "AK" } |
| 939 | Record { latitude: 60.5544444, longitude: -151.2583333, population: Some(7610), city: "Kenai", state: "AK" } |
| 940 | Record { latitude: 33.7133333, longitude: -87.3886111, population: None, city: "Oakman", state: "AL" } |
| 941 | ``` |
| 942 | |
| 943 | Once again, we didn't need to change our `run` function at all: we're still |
| 944 | iterating over records using the `deserialize` iterator that we started with |
| 945 | in the beginning of this section. The only thing that changed in this example |
| 946 | was the definition of the `Record` type and a new `use` statement. Our `Record` |
| 947 | type is now a custom struct that we defined instead of a type alias, and as a |
| 948 | result, Serde doesn't know how to deserialize it by default. However, Serde
| 949 | provides a procedural (derive) macro that reads your struct definition at
| 950 | compile time and generates code that will deserialize a CSV record into a
| 951 | `Record` value. To see what happens if you leave out the automatic
| 952 | derive, change `#[derive(Debug, Deserialize)]` to `#[derive(Debug)]`. |
| 953 | |
| 954 | One other thing worth mentioning in this example is the use of |
| 955 | `#[serde(rename_all = "PascalCase")]`. This directive helps Serde map your |
| 956 | struct's field names to the header names in the CSV data. If you recall, our |
| 957 | header record is: |
| 958 | |
| 959 | ```text |
| 960 | City,State,Population,Latitude,Longitude |
| 961 | ``` |
| 962 | |
| 963 | Notice that each name is capitalized, but the fields in our struct are not. The |
| 964 | `#[serde(rename_all = "PascalCase")]` directive fixes that by interpreting each |
| 965 | field name in `PascalCase`, where the first letter of each word is capitalized.
| 966 | If we didn't tell Serde about the name remapping, then the program would quit
| 967 | with an error:
| 968 | |
| 969 | ```text |
| 970 | $ ./target/debug/csvtutor < uspop.csv |
| 971 | CSV deserialize error: record 1 (line: 2, byte: 41): missing field `latitude` |
| 972 | ``` |
| 973 | |
| 974 | We could have fixed this through other means. For example, we could have used |
| 975 | capital letters in our field names: |
| 976 | |
| 977 | ```ignore |
| 978 | #[derive(Debug, Deserialize)] |
| 979 | struct Record { |
| 980 | Latitude: f64, |
| 981 | Longitude: f64, |
| 982 | Population: Option<u64>, |
| 983 | City: String, |
| 984 | State: String, |
| 985 | } |
| 986 | ``` |
| 987 | |
| 988 | However, this violates Rust naming style. (In fact, the Rust compiler |
| 989 | will even warn you that the names do not follow convention!) |
| 990 | |
| 991 | Another way to fix this is to ask Serde to rename each field individually. This |
| 992 | is useful when there is no consistent name mapping from fields to header names: |
| 993 | |
| 994 | ```ignore |
| 995 | #[derive(Debug, Deserialize)] |
| 996 | struct Record { |
| 997 | #[serde(rename = "Latitude")] |
| 998 | latitude: f64, |
| 999 | #[serde(rename = "Longitude")] |
| 1000 | longitude: f64, |
| 1001 | #[serde(rename = "Population")] |
| 1002 | population: Option<u64>, |
| 1003 | #[serde(rename = "City")] |
| 1004 | city: String, |
| 1005 | #[serde(rename = "State")] |
| 1006 | state: String, |
| 1007 | } |
| 1008 | ``` |
| 1009 | |
| 1010 | To read more about renaming fields and about other Serde directives, please |
| 1011 | consult the |
| 1012 | [Serde documentation on attributes](https://serde.rs/attributes.html). |
| 1013 | |
| 1014 | ## Handling invalid data with Serde |
| 1015 | |
| 1016 | In this section we will see a brief example of how to deal with data that isn't |
| 1017 | clean. To do this exercise, we'll work with a slightly tweaked version of the |
| 1018 | US population data we've been using throughout this tutorial. This version of |
| 1019 | the data is slightly messier than what we've been using. You can get it like |
| 1020 | so: |
| 1021 | |
| 1022 | ```text |
| 1023 | $ curl -LO 'https://raw.githubusercontent.com/BurntSushi/rust-csv/master/examples/data/uspop-null.csv' |
| 1024 | ``` |
| 1025 | |
| 1026 | Let's start by running our program from the previous section: |
| 1027 | |
| 1028 | ```no_run |
| 1029 | //tutorial-read-serde-invalid-01.rs |
| 1030 | # use std::error::Error; |
| 1031 | # use std::io; |
| 1032 | # use std::process; |
| 1033 | # |
| 1034 | # use serde::Deserialize; |
| 1035 | # |
| 1036 | #[derive(Debug, Deserialize)] |
| 1037 | #[serde(rename_all = "PascalCase")] |
| 1038 | struct Record { |
| 1039 | latitude: f64, |
| 1040 | longitude: f64, |
| 1041 | population: Option<u64>, |
| 1042 | city: String, |
| 1043 | state: String, |
| 1044 | } |
| 1045 | |
| 1046 | fn run() -> Result<(), Box<dyn Error>> { |
| 1047 | let mut rdr = csv::Reader::from_reader(io::stdin()); |
| 1048 | for result in rdr.deserialize() { |
| 1049 | let record: Record = result?; |
| 1050 | println!("{:?}", record); |
| 1051 | } |
| 1052 | Ok(()) |
| 1053 | } |
| 1054 | # |
| 1055 | # fn main() { |
| 1056 | # if let Err(err) = run() { |
| 1057 | # println!("{}", err); |
| 1058 | # process::exit(1); |
| 1059 | # } |
| 1060 | # } |
| 1061 | ``` |
| 1062 | |
| 1063 | Compile and run it on our messier data: |
| 1064 | |
| 1065 | ```text |
| 1066 | $ cargo build |
| 1067 | $ ./target/debug/csvtutor < uspop-null.csv |
| 1068 | Record { latitude: 65.2419444, longitude: -165.2716667, population: None, city: "Davidsons Landing", state: "AK" } |
| 1069 | Record { latitude: 60.5544444, longitude: -151.2583333, population: Some(7610), city: "Kenai", state: "AK" } |
| 1070 | Record { latitude: 33.7133333, longitude: -87.3886111, population: None, city: "Oakman", state: "AL" } |
| 1071 | # ... more records |
| 1072 | CSV deserialize error: record 42 (line: 43, byte: 1710): field 2: invalid digit found in string |
| 1073 | ``` |
| 1074 | |
| 1075 | Oops! What happened? The program printed several records, but stopped when it |
| 1076 | tripped over a deserialization problem. The error message says that it found |
| 1077 | an invalid digit in the field at index `2` (which is the `Population` field) |
| 1078 | on line 43. What does line 43 look like? |
| 1079 | |
| 1080 | ```text |
| 1081 | $ head -n 43 uspop-null.csv | tail -n1 |
| 1082 | Flint Springs,KY,NULL,37.3433333,-86.7136111 |
| 1083 | ``` |
| 1084 | |
| 1085 | Ah! The third field (index `2`) is supposed to either be empty or contain a |
| 1086 | population count. However, in this data, it seems that `NULL` sometimes appears |
| 1087 | as a value, presumably to indicate that there is no count available. |
| 1088 | |
| 1089 | The problem with our current program is that it fails to read this record |
| 1090 | because it doesn't know how to deserialize a `NULL` string into an |
| 1091 | `Option<u64>`. That is, an `Option<u64>` corresponds either to an empty field
| 1092 | or to an integer.
| 1093 | |
| 1094 | To fix this, we tell Serde to convert any deserialization errors on this field |
| 1095 | to a `None` value, as shown in this next example: |
| 1096 | |
| 1097 | ```no_run |
| 1098 | //tutorial-read-serde-invalid-02.rs |
| 1099 | # use std::error::Error; |
| 1100 | # use std::io; |
| 1101 | # use std::process; |
| 1102 | # |
| 1103 | # use serde::Deserialize; |
| 1104 | #[derive(Debug, Deserialize)] |
| 1105 | #[serde(rename_all = "PascalCase")] |
| 1106 | struct Record { |
| 1107 | latitude: f64, |
| 1108 | longitude: f64, |
| 1109 | #[serde(deserialize_with = "csv::invalid_option")] |
| 1110 | population: Option<u64>, |
| 1111 | city: String, |
| 1112 | state: String, |
| 1113 | } |
| 1114 | |
| 1115 | fn run() -> Result<(), Box<dyn Error>> { |
| 1116 | let mut rdr = csv::Reader::from_reader(io::stdin()); |
| 1117 | for result in rdr.deserialize() { |
| 1118 | let record: Record = result?; |
| 1119 | println!("{:?}", record); |
| 1120 | } |
| 1121 | Ok(()) |
| 1122 | } |
| 1123 | # |
| 1124 | # fn main() { |
| 1125 | # if let Err(err) = run() { |
| 1126 | # println!("{}", err); |
| 1127 | # process::exit(1); |
| 1128 | # } |
| 1129 | # } |
| 1130 | ``` |
| 1131 | |
| 1132 | If you compile and run this example, then it should run to completion just |
| 1133 | like the other examples: |
| 1134 | |
| 1135 | ```text |
| 1136 | $ cargo build |
| 1137 | $ ./target/debug/csvtutor < uspop-null.csv |
| 1138 | Record { latitude: 65.2419444, longitude: -165.2716667, population: None, city: "Davidsons Landing", state: "AK" } |
| 1139 | Record { latitude: 60.5544444, longitude: -151.2583333, population: Some(7610), city: "Kenai", state: "AK" } |
| 1140 | Record { latitude: 33.7133333, longitude: -87.3886111, population: None, city: "Oakman", state: "AL" } |
| 1141 | # ... and more |
| 1142 | ``` |
| 1143 | |
| 1144 | The only change in this example was adding this attribute to the `population` |
| 1145 | field in our `Record` type: |
| 1146 | |
| 1147 | ```ignore |
| 1148 | #[serde(deserialize_with = "csv::invalid_option")] |
| 1149 | ``` |
| 1150 | |
| 1151 | The |
| 1152 | [`invalid_option`](../fn.invalid_option.html) |
| 1153 | function is a generic helper function that does one very simple thing: when |
| 1154 | applied to `Option` fields, it will convert any deserialization error into a |
| 1155 | `None` value. This is useful when you need to work with messy CSV data. |
| 1156 | |
| 1157 | # Writing CSV |
| 1158 | |
| 1159 | In this section we'll show a few examples that write CSV data. Writing CSV data |
| 1160 | tends to be a bit more straightforward than reading CSV data, since you get to
| 1161 | control the output format. |
| 1162 | |
| 1163 | Let's start with the most basic example: writing a few CSV records to `stdout`. |
| 1164 | |
| 1165 | ```no_run |
| 1166 | //tutorial-write-01.rs |
| 1167 | use std::error::Error; |
| 1168 | use std::io; |
| 1169 | use std::process; |
| 1170 | |
| 1171 | fn run() -> Result<(), Box<dyn Error>> { |
| 1172 | let mut wtr = csv::Writer::from_writer(io::stdout()); |
| 1173 | // Since we're writing records manually, we must explicitly write our |
| 1174 | // header record. A header record is written the same way that other |
| 1175 | // records are written. |
| 1176 | wtr.write_record(&["City", "State", "Population", "Latitude", "Longitude"])?; |
| 1177 | wtr.write_record(&["Davidsons Landing", "AK", "", "65.2419444", "-165.2716667"])?; |
| 1178 | wtr.write_record(&["Kenai", "AK", "7610", "60.5544444", "-151.2583333"])?; |
| 1179 | wtr.write_record(&["Oakman", "AL", "", "33.7133333", "-87.3886111"])?; |
| 1180 | |
| 1181 | // A CSV writer maintains an internal buffer, so it's important |
| 1182 | // to flush the buffer when you're done. |
| 1183 | wtr.flush()?; |
| 1184 | Ok(()) |
| 1185 | } |
| 1186 | |
| 1187 | fn main() { |
| 1188 | if let Err(err) = run() { |
| 1189 | println!("{}", err); |
| 1190 | process::exit(1); |
| 1191 | } |
| 1192 | } |
| 1193 | ``` |
| 1194 | |
| 1195 | Compiling and running this example results in CSV data being printed: |
| 1196 | |
| 1197 | ```text |
| 1198 | $ cargo build |
| 1199 | $ ./target/debug/csvtutor |
| 1200 | City,State,Population,Latitude,Longitude |
| 1201 | Davidsons Landing,AK,,65.2419444,-165.2716667 |
| 1202 | Kenai,AK,7610,60.5544444,-151.2583333 |
| 1203 | Oakman,AL,,33.7133333,-87.3886111 |
| 1204 | ``` |
| 1205 | |
| 1206 | Before moving on, it's worth taking a closer look at the `write_record` |
| 1207 | method. In this example, it looks rather simple, but if you're new to Rust then |
| 1208 | its type signature might look a little daunting: |
| 1209 | |
| 1210 | ```ignore |
| 1211 | pub fn write_record<I, T>(&mut self, record: I) -> csv::Result<()> |
| 1212 | where I: IntoIterator<Item=T>, T: AsRef<[u8]> |
| 1213 | { |
| 1214 | // implementation elided |
| 1215 | } |
| 1216 | ``` |
| 1217 | |
| 1218 | To understand the type signature, we can break it down piece by piece. |
| 1219 | |
| 1220 | 1. The method takes two parameters: `self` and `record`. |
| 1221 | 2. `self` is a special parameter that corresponds to the `Writer` itself. |
| 1222 | 3. `record` is the CSV record we'd like to write. Its type is `I`, which is |
| 1223 | a generic type. |
| 1224 | 4. In the method's `where` clause, the `I` type is constrained by the |
| 1225 | `IntoIterator<Item=T>` bound. What that means is that `I` must satisfy the |
| 1226 | `IntoIterator` trait. If you look at the documentation of the |
| 1227 | [`IntoIterator` trait](https://doc.rust-lang.org/std/iter/trait.IntoIterator.html), |
| 1228 | then we can see that it describes types that can build iterators. In this |
| 1229 | case, we want an iterator that yields *another* generic type `T`, where |
| 1230 | `T` is the type of each field we want to write. |
| 1231 | 5. `T` also appears in the method's `where` clause, but its constraint is the |
| 1232 | `AsRef<[u8]>` bound. The `AsRef` trait is a way to describe zero cost |
| 1233 | conversions between types in Rust. In this case, the `[u8]` in `AsRef<[u8]>` |
| 1234 | means that we want to be able to *borrow* a slice of bytes from `T`. |
| 1235 | The CSV writer will take these bytes and write them as a single field. |
| 1236 | The `AsRef<[u8]>` bound is useful because types like `String`, `&str`, |
| 1237 | `Vec<u8>` and `&[u8]` all satisfy it. |
| 1238 | 6. Finally, the method returns a `csv::Result<()>`, which is short-hand for |
| 1239 | `Result<(), csv::Error>`. That means `write_record` either returns nothing |
| 1240 | on success or returns a `csv::Error` on failure. |
| 1241 | |
| 1242 | Now, let's apply our newfound understanding of the type signature of
| 1243 | `write_record`. If you recall, in our previous example, we used it like so: |
| 1244 | |
| 1245 | ```ignore |
| 1246 | wtr.write_record(&["field 1", "field 2", "etc"])?; |
| 1247 | ``` |
| 1248 | |
| 1249 | So how do the types match up? Well, the type of each of our fields in this |
| 1250 | code is `&'static str` (which is the type of a string literal in Rust). Since |
| 1251 | we put them in an array literal and borrow it, the type of our parameter is
| 1252 | `&[&'static str; 3]`, which can be used wherever a slice `&[&str]` is
| 1253 | expected. Since slices satisfy the `IntoIterator` bound and strings satisfy
| 1254 | the `AsRef<[u8]>` bound, this ends up being a legal call.
| 1255 | |
| 1256 | Here are a few more examples of ways you can call `write_record`: |
| 1257 | |
| 1258 | ```no_run |
| 1259 | # use csv; |
| 1260 | # let mut wtr = csv::Writer::from_writer(vec![]); |
| 1261 | // A slice of byte strings. |
| 1262 | wtr.write_record(&[b"a", b"b", b"c"]); |
| 1263 | // A vector. |
| 1264 | wtr.write_record(vec!["a", "b", "c"]); |
| 1265 | // A string record. |
| 1266 | wtr.write_record(&csv::StringRecord::from(vec!["a", "b", "c"])); |
| 1267 | // A byte record. |
| 1268 | wtr.write_record(&csv::ByteRecord::from(vec!["a", "b", "c"])); |
| 1269 | ``` |
| 1270 | |
| 1271 | Finally, the example above can be easily adapted to write to a file instead |
| 1272 | of `stdout`: |
| 1273 | |
| 1274 | ```no_run |
| 1275 | //tutorial-write-02.rs |
| 1276 | use std::env; |
| 1277 | use std::error::Error; |
| 1278 | use std::ffi::OsString; |
| 1279 | use std::process; |
| 1280 | |
| 1281 | fn run() -> Result<(), Box<dyn Error>> { |
| 1282 | let file_path = get_first_arg()?; |
| 1283 | let mut wtr = csv::Writer::from_path(file_path)?; |
| 1284 | |
| 1285 | wtr.write_record(&["City", "State", "Population", "Latitude", "Longitude"])?; |
| 1286 | wtr.write_record(&["Davidsons Landing", "AK", "", "65.2419444", "-165.2716667"])?; |
| 1287 | wtr.write_record(&["Kenai", "AK", "7610", "60.5544444", "-151.2583333"])?; |
| 1288 | wtr.write_record(&["Oakman", "AL", "", "33.7133333", "-87.3886111"])?; |
| 1289 | |
| 1290 | wtr.flush()?; |
| 1291 | Ok(()) |
| 1292 | } |
| 1293 | |
| 1294 | /// Returns the first positional argument sent to this process. If there are no |
| 1295 | /// positional arguments, then this returns an error. |
| 1296 | fn get_first_arg() -> Result<OsString, Box<dyn Error>> { |
| 1297 | match env::args_os().nth(1) { |
| 1298 | None => Err(From::from("expected 1 argument, but got none")), |
| 1299 | Some(file_path) => Ok(file_path), |
| 1300 | } |
| 1301 | } |
| 1302 | |
| 1303 | fn main() { |
| 1304 | if let Err(err) = run() { |
| 1305 | println!("{}", err); |
| 1306 | process::exit(1); |
| 1307 | } |
| 1308 | } |
| 1309 | ``` |
| 1310 | |
| 1311 | ## Writing tab separated values |
| 1312 | |
| 1313 | In the previous section, we saw how to write some simple CSV data to `stdout` |
| 1314 | that looked like this: |
| 1315 | |
| 1316 | ```text |
| 1317 | City,State,Population,Latitude,Longitude |
| 1318 | Davidsons Landing,AK,,65.2419444,-165.2716667 |
| 1319 | Kenai,AK,7610,60.5544444,-151.2583333 |
| 1320 | Oakman,AL,,33.7133333,-87.3886111 |
| 1321 | ``` |
| 1322 | |
| 1323 | You might wonder to yourself: what's the point of using a CSV writer if the |
| 1324 | data is so simple? Well, the benefit of a CSV writer is that it can handle all |
| 1325 | types of data without sacrificing the integrity of your data. That is, it knows |
| 1326 | when to quote fields that contain special CSV characters (like commas or new |
| 1327 | lines) or escape literal quotes that appear in your data. The CSV writer can |
| 1328 | also be easily configured to use different delimiters or quoting strategies. |
| 1329 | |
| 1330 | In this section, we'll take a look at how to tweak some of the settings
| 1331 | on a CSV writer. In particular, we'll write TSV ("tab separated values") |
| 1332 | instead of CSV, and we'll ask the CSV writer to quote all non-numeric fields. |
| 1333 | Here's an example: |
| 1334 | |
| 1335 | ```no_run |
| 1336 | //tutorial-write-delimiter-01.rs |
| 1337 | # use std::error::Error; |
| 1338 | # use std::io; |
| 1339 | # use std::process; |
| 1340 | # |
| 1341 | fn run() -> Result<(), Box<dyn Error>> { |
| 1342 | let mut wtr = csv::WriterBuilder::new() |
| 1343 | .delimiter(b'\t') |
| 1344 | .quote_style(csv::QuoteStyle::NonNumeric) |
| 1345 | .from_writer(io::stdout()); |
| 1346 | |
| 1347 | wtr.write_record(&["City", "State", "Population", "Latitude", "Longitude"])?; |
| 1348 | wtr.write_record(&["Davidsons Landing", "AK", "", "65.2419444", "-165.2716667"])?; |
| 1349 | wtr.write_record(&["Kenai", "AK", "7610", "60.5544444", "-151.2583333"])?; |
| 1350 | wtr.write_record(&["Oakman", "AL", "", "33.7133333", "-87.3886111"])?; |
| 1351 | |
| 1352 | wtr.flush()?; |
| 1353 | Ok(()) |
| 1354 | } |
| 1355 | # |
| 1356 | # fn main() { |
| 1357 | # if let Err(err) = run() { |
| 1358 | # println!("{}", err); |
| 1359 | # process::exit(1); |
| 1360 | # } |
| 1361 | # } |
| 1362 | ``` |
| 1363 | |
| 1364 | Compiling and running this example gives: |
| 1365 | |
| 1366 | ```text |
| 1367 | $ cargo build |
| 1368 | $ ./target/debug/csvtutor |
| 1369 | "City" "State" "Population" "Latitude" "Longitude" |
| 1370 | "Davidsons Landing" "AK" "" 65.2419444 -165.2716667 |
| 1371 | "Kenai" "AK" 7610 60.5544444 -151.2583333 |
| 1372 | "Oakman" "AL" "" 33.7133333 -87.3886111 |
| 1373 | ``` |
| 1374 | |
| 1375 | In this example, we used a new type |
| 1376 | [`QuoteStyle`](../enum.QuoteStyle.html). |
| 1377 | The `QuoteStyle` type represents the different quoting strategies available |
| 1378 | to you. The default is to add quotes to fields only when necessary. This |
| 1379 | probably works for most use cases, but you can also ask for quotes to always |
| 1380 | be put around fields, to never be put around fields or to always be put around |
| 1381 | non-numeric fields. |
| 1382 | |
| 1383 | ## Writing with Serde |
| 1384 | |
| 1385 | Just like the CSV reader supports automatic deserialization into Rust types |
| 1386 | with Serde, the CSV writer supports automatic serialization from Rust types |
| 1387 | into CSV records using Serde. In this section, we'll learn how to use it. |
| 1388 | |
| 1389 | As with reading, let's start by seeing how we can serialize a Rust tuple. |
| 1390 | |
| 1391 | ```no_run |
| 1392 | //tutorial-write-serde-01.rs |
| 1393 | # use std::error::Error; |
| 1394 | # use std::io; |
| 1395 | # use std::process; |
| 1396 | # |
| 1397 | fn run() -> Result<(), Box<dyn Error>> { |
| 1398 | let mut wtr = csv::Writer::from_writer(io::stdout()); |
| 1399 | |
| 1400 | // We still need to write headers manually. |
| 1401 | wtr.write_record(&["City", "State", "Population", "Latitude", "Longitude"])?; |
| 1402 | |
| 1403 | // But now we can write records by providing a normal Rust value. |
| 1404 | // |
| 1405 | // Note that the odd `None::<u64>` syntax is required because `None` on |
| 1406 | // its own doesn't have a concrete type, but Serde needs a concrete type |
| 1407 | // in order to serialize it. That is, `None` has type `Option<T>` but |
| 1408 | // `None::<u64>` has type `Option<u64>`. |
| 1409 | wtr.serialize(("Davidsons Landing", "AK", None::<u64>, 65.2419444, -165.2716667))?; |
| 1410 | wtr.serialize(("Kenai", "AK", Some(7610), 60.5544444, -151.2583333))?; |
| 1411 | wtr.serialize(("Oakman", "AL", None::<u64>, 33.7133333, -87.3886111))?; |
| 1412 | |
| 1413 | wtr.flush()?; |
| 1414 | Ok(()) |
| 1415 | } |
| 1416 | # |
| 1417 | # fn main() { |
| 1418 | # if let Err(err) = run() { |
| 1419 | # println!("{}", err); |
| 1420 | # process::exit(1); |
| 1421 | # } |
| 1422 | # } |
| 1423 | ``` |
| 1424 | |
| 1425 | Compiling and running this program gives the expected output: |
| 1426 | |
| 1427 | ```text |
| 1428 | $ cargo build |
| 1429 | $ ./target/debug/csvtutor |
| 1430 | City,State,Population,Latitude,Longitude |
| 1431 | Davidsons Landing,AK,,65.2419444,-165.2716667 |
| 1432 | Kenai,AK,7610,60.5544444,-151.2583333 |
| 1433 | Oakman,AL,,33.7133333,-87.3886111 |
| 1434 | ``` |
| 1435 | |
| 1436 | The key thing to note in the above example is the use of `serialize` instead |
| 1437 | of `write_record` to write our data. In particular, `write_record` is used |
| 1438 | when writing a simple record that contains string-like data only. On the other |
| 1439 | hand, `serialize` is used when your data consists of more complex values like |
| 1440 | numbers, floats or optional values. Of course, you could always convert the |
| 1441 | complex values to strings and then use `write_record`, but Serde can do it for |
| 1442 | you automatically. |
| 1443 | |
| 1444 | As with reading, we can also serialize custom structs as CSV records. As a |
| 1445 | bonus, the fields in a struct will automatically be written as a header |
| 1446 | record! |
| 1447 | |
| 1448 | To write custom structs as CSV records, we'll need to make use of Serde's |
| 1449 | automatic `derive` feature again. As in the |
| 1450 | [previous section on reading with Serde](#reading-with-serde), |
| 1451 | we'll need to add a couple crates to our `[dependencies]` section in our |
| 1452 | `Cargo.toml` (if they aren't already there): |
| 1453 | |
| 1454 | ```text |
| 1455 | serde = { version = "1", features = ["derive"] } |
| 1456 | ``` |
| 1457 | |
| 1458 | And we'll also need to add a new `use` statement to our code, for Serde, as |
| 1459 | shown in the example: |
| 1460 | |
| 1461 | ```no_run |
| 1462 | //tutorial-write-serde-02.rs |
| 1463 | use std::error::Error; |
| 1464 | use std::io; |
| 1465 | use std::process; |
| 1466 | |
| 1467 | use serde::Serialize; |
| 1468 | |
| 1469 | // Note that structs can derive both Serialize and Deserialize! |
| 1470 | #[derive(Debug, Serialize)] |
| 1471 | #[serde(rename_all = "PascalCase")] |
| 1472 | struct Record<'a> { |
| 1473 | city: &'a str, |
| 1474 | state: &'a str, |
| 1475 | population: Option<u64>, |
| 1476 | latitude: f64, |
| 1477 | longitude: f64, |
| 1478 | } |
| 1479 | |
| 1480 | fn run() -> Result<(), Box<dyn Error>> { |
| 1481 | let mut wtr = csv::Writer::from_writer(io::stdout()); |
| 1482 | |
| 1483 | wtr.serialize(Record { |
| 1484 | city: "Davidsons Landing", |
| 1485 | state: "AK", |
| 1486 | population: None, |
| 1487 | latitude: 65.2419444, |
| 1488 | longitude: -165.2716667, |
| 1489 | })?; |
| 1490 | wtr.serialize(Record { |
| 1491 | city: "Kenai", |
| 1492 | state: "AK", |
| 1493 | population: Some(7610), |
| 1494 | latitude: 60.5544444, |
| 1495 | longitude: -151.2583333, |
| 1496 | })?; |
| 1497 | wtr.serialize(Record { |
| 1498 | city: "Oakman", |
| 1499 | state: "AL", |
| 1500 | population: None, |
| 1501 | latitude: 33.7133333, |
| 1502 | longitude: -87.3886111, |
| 1503 | })?; |
| 1504 | |
| 1505 | wtr.flush()?; |
| 1506 | Ok(()) |
| 1507 | } |
| 1508 | |
| 1509 | fn main() { |
| 1510 | if let Err(err) = run() { |
| 1511 | println!("{}", err); |
| 1512 | process::exit(1); |
| 1513 | } |
| 1514 | } |
| 1515 | ``` |
| 1516 | |
| 1517 | Compiling and running this example has the same output as last time, even |
| 1518 | though we didn't explicitly write a header record: |
| 1519 | |
| 1520 | ```text |
| 1521 | $ cargo build |
| 1522 | $ ./target/debug/csvtutor |
| 1523 | City,State,Population,Latitude,Longitude |
| 1524 | Davidsons Landing,AK,,65.2419444,-165.2716667 |
| 1525 | Kenai,AK,7610,60.5544444,-151.2583333 |
| 1526 | Oakman,AL,,33.7133333,-87.3886111 |
| 1527 | ``` |
| 1528 | |
| 1529 | In this case, the `serialize` method noticed that we were writing a struct |
| 1530 | with field names. When this happens, `serialize` will automatically write a |
| 1531 | header record (only if no other records have been written) that consists of |
| 1532 | the fields in the struct in the order in which they are defined. Note that |
| 1533 | this behavior can be disabled with the |
| 1534 | [`WriterBuilder::has_headers`](../struct.WriterBuilder.html#method.has_headers) |
| 1535 | method. |
| 1536 | |
| 1537 | It's also worth pointing out the use of a *lifetime parameter* in our `Record` |
| 1538 | struct: |
| 1539 | |
| 1540 | ```ignore |
| 1541 | struct Record<'a> { |
| 1542 | city: &'a str, |
| 1543 | state: &'a str, |
| 1544 | population: Option<u64>, |
| 1545 | latitude: f64, |
| 1546 | longitude: f64, |
| 1547 | } |
| 1548 | ``` |
| 1549 | |
| 1550 | The `'a` lifetime parameter corresponds to the lifetime of the `city` and |
| 1551 | `state` string slices. This says that the `Record` struct contains *borrowed* |
| 1552 | data. We could have written our struct without borrowing any data, and |
| 1553 | therefore, without any lifetime parameters: |
| 1554 | |
| 1555 | ```ignore |
| 1556 | struct Record { |
| 1557 | city: String, |
| 1558 | state: String, |
| 1559 | population: Option<u64>, |
| 1560 | latitude: f64, |
| 1561 | longitude: f64, |
| 1562 | } |
| 1563 | ``` |
| 1564 | |
| 1565 | However, since we had to replace our borrowed `&str` types with owned `String` |
| 1566 | types, we're now forced to allocate a new `String` value for both of `city` |
| 1567 | and `state` for every record that we write. There's no intrinsic problem with |
| 1568 | doing that, but it might be a bit wasteful. |
| 1569 | |
| 1570 | For more examples and more details on the rules for serialization, please see |
| 1571 | the |
| 1572 | [`Writer::serialize`](../struct.Writer.html#method.serialize) |
| 1573 | method. |
| 1574 | |
| 1575 | # Pipelining |
| 1576 | |
| 1577 | In this section, we're going to cover a few examples that demonstrate programs |
| 1578 | that take CSV data as input, and produce possibly transformed or filtered CSV |
| 1579 | data as output. This shows how to write a complete program that efficiently |
| 1580 | reads and writes CSV data. Rust is well positioned to perform this task, since |
| 1581 | you'll get great performance with the convenience of a high level CSV library. |
| 1582 | |
| 1583 | ## Filter by search |
| 1584 | |
| 1585 | The first example of CSV pipelining we'll look at is a simple filter. It takes |
| 1586 | as input some CSV data on stdin and a single string query as its only |
| 1587 | positional argument, and it will produce as output CSV data that only contains |
| 1588 | rows with a field that matches the query. |
| 1589 | |
| 1590 | ```no_run |
| 1591 | //tutorial-pipeline-search-01.rs |
| 1592 | use std::env; |
| 1593 | use std::error::Error; |
| 1594 | use std::io; |
| 1595 | use std::process; |
| 1596 | |
| 1597 | fn run() -> Result<(), Box<dyn Error>> { |
| 1598 | // Get the query from the positional arguments. |
| 1599 | // If one doesn't exist, return an error. |
| 1600 | let query = match env::args().nth(1) { |
| 1601 | None => return Err(From::from("expected 1 argument, but got none")), |
| 1602 | Some(query) => query, |
| 1603 | }; |
| 1604 | |
| 1605 | // Build CSV readers and writers to stdin and stdout, respectively. |
| 1606 | let mut rdr = csv::Reader::from_reader(io::stdin()); |
| 1607 | let mut wtr = csv::Writer::from_writer(io::stdout()); |
| 1608 | |
| 1609 | // Before reading our data records, we should write the header record. |
| 1610 | wtr.write_record(rdr.headers()?)?; |
| 1611 | |
| 1612 | // Iterate over all the records in `rdr`, and write only records containing |
| 1613 | // `query` to `wtr`. |
| 1614 | for result in rdr.records() { |
| 1615 | let record = result?; |
| 1616 | if record.iter().any(|field| field == &query) { |
| 1617 | wtr.write_record(&record)?; |
| 1618 | } |
| 1619 | } |
| 1620 | |
| 1621 | // CSV writers use an internal buffer, so we should always flush when done. |
| 1622 | wtr.flush()?; |
| 1623 | Ok(()) |
| 1624 | } |
| 1625 | |
| 1626 | fn main() { |
| 1627 | if let Err(err) = run() { |
| 1628 | println!("{}", err); |
| 1629 | process::exit(1); |
| 1630 | } |
| 1631 | } |
| 1632 | ``` |
| 1633 | |
| 1634 | If we compile and run this program with a query of `MA` on `uspop.csv`, we'll |
| 1635 | see that only one record matches: |
| 1636 | |
| 1637 | ```text |
| 1638 | $ cargo build |
| 1639 | $ ./csvtutor MA < uspop.csv |
| 1640 | City,State,Population,Latitude,Longitude |
| 1641 | Reading,MA,23441,42.5255556,-71.0958333 |
| 1642 | ``` |
| 1643 | |
| 1644 | This example doesn't actually introduce anything new. It merely combines what |
| 1645 | you've already learned about CSV readers and writers from previous sections. |
| 1646 | |
| 1647 | Let's add a twist to this example. In the real world, you're often faced with |
| 1648 | messy CSV data that might not be encoded correctly. One example you might come |
| 1649 | across is CSV data encoded in |
| 1650 | [Latin-1](https://en.wikipedia.org/wiki/ISO/IEC_8859-1). |
| 1651 | Unfortunately, for the examples we've seen so far, our CSV reader assumes that |
| 1652 | all of the data is UTF-8. Since all of the data we've worked on has been |
| 1653 | ASCII---which is a subset of both Latin-1 and UTF-8---we haven't had any |
| 1654 | problems. But let's introduce a slightly tweaked version of our `uspop.csv` |
| 1655 | file that contains an encoding of a Latin-1 character that is invalid UTF-8. |
| 1656 | You can get the data like so: |
| 1657 | |
| 1658 | ```text |
| 1659 | $ curl -LO 'https://raw.githubusercontent.com/BurntSushi/rust-csv/master/examples/data/uspop-latin1.csv' |
| 1660 | ``` |
| 1661 | |
Even though I've already given away the problem, let's see what happens when
| 1663 | we try to run our previous example on this new data: |
| 1664 | |
| 1665 | ```text |
| 1666 | $ ./csvtutor MA < uspop-latin1.csv |
| 1667 | City,State,Population,Latitude,Longitude |
| 1668 | CSV parse error: record 3 (line 4, field: 0, byte: 125): invalid utf-8: invalid UTF-8 in field 0 near byte index 0 |
| 1669 | ``` |
| 1670 | |
| 1671 | The error message tells us exactly what's wrong. Let's take a look at line 4 |
| 1672 | to see what we're dealing with: |
| 1673 | |
| 1674 | ```text |
| 1675 | $ head -n4 uspop-latin1.csv | tail -n1 |
| 1676 | Õakman,AL,,33.7133333,-87.3886111 |
| 1677 | ``` |
| 1678 | |
| 1679 | In this case, the very first character is the Latin-1 `Õ`, which is encoded as |
| 1680 | the byte `0xD5`, which is in turn invalid UTF-8. So what do we do now that our |
CSV parser has choked on our data? There are two choices. The first is to go in
| 1682 | and fix up your CSV data so that it's valid UTF-8. This is probably a good |
| 1683 | idea anyway, and tools like `iconv` can help with the task of transcoding. |
| 1684 | But if you can't or don't want to do that, then you can instead read CSV data |
| 1685 | in a way that is mostly encoding agnostic (so long as ASCII is still a valid |
| 1686 | subset). The trick is to use *byte records* instead of *string records*. |
| 1687 | |
Thus far, we haven't actually talked much about the types of records in this
library, but now is a good time to introduce them. There are two of them,
| 1690 | [`StringRecord`](../struct.StringRecord.html) |
| 1691 | and |
| 1692 | [`ByteRecord`](../struct.ByteRecord.html). |
Each of them represents a single record in CSV data, where a record is a
sequence of an arbitrary number of fields. The only difference between
`StringRecord` and `ByteRecord` is that `StringRecord` is guaranteed to be
valid UTF-8, whereas `ByteRecord` contains arbitrary bytes.
| 1697 | |
| 1698 | Armed with that knowledge, we can now begin to understand why we saw an error |
| 1699 | when we ran the last example on data that wasn't UTF-8. Namely, when we call |
| 1700 | `records`, we get back an iterator of `StringRecord`. Since `StringRecord` is |
| 1701 | guaranteed to be valid UTF-8, trying to build a `StringRecord` with invalid |
| 1702 | UTF-8 will result in the error that we see. |
| 1703 | |
| 1704 | All we need to do to make our example work is to switch from a `StringRecord` |
| 1705 | to a `ByteRecord`. This means using `byte_records` to create our iterator |
| 1706 | instead of `records`, and similarly using `byte_headers` instead of `headers` |
| 1707 | if we think our header data might contain invalid UTF-8 as well. Here's the |
| 1708 | change: |
| 1709 | |
| 1710 | ```no_run |
| 1711 | //tutorial-pipeline-search-02.rs |
| 1712 | # use std::env; |
| 1713 | # use std::error::Error; |
| 1714 | # use std::io; |
| 1715 | # use std::process; |
| 1716 | # |
| 1717 | fn run() -> Result<(), Box<dyn Error>> { |
| 1718 | let query = match env::args().nth(1) { |
| 1719 | None => return Err(From::from("expected 1 argument, but got none")), |
| 1720 | Some(query) => query, |
| 1721 | }; |
| 1722 | |
| 1723 | let mut rdr = csv::Reader::from_reader(io::stdin()); |
| 1724 | let mut wtr = csv::Writer::from_writer(io::stdout()); |
| 1725 | |
| 1726 | wtr.write_record(rdr.byte_headers()?)?; |
| 1727 | |
| 1728 | for result in rdr.byte_records() { |
| 1729 | let record = result?; |
| 1730 | // `query` is a `String` while `field` is now a `&[u8]`, so we'll |
| 1731 | // need to convert `query` to `&[u8]` before doing a comparison. |
| 1732 | if record.iter().any(|field| field == query.as_bytes()) { |
| 1733 | wtr.write_record(&record)?; |
| 1734 | } |
| 1735 | } |
| 1736 | |
| 1737 | wtr.flush()?; |
| 1738 | Ok(()) |
| 1739 | } |
| 1740 | # |
| 1741 | # fn main() { |
| 1742 | # if let Err(err) = run() { |
| 1743 | # println!("{}", err); |
| 1744 | # process::exit(1); |
| 1745 | # } |
| 1746 | # } |
| 1747 | ``` |
| 1748 | |
| 1749 | Compiling and running this now yields the same results as our first example, |
| 1750 | but this time it works on data that isn't valid UTF-8. |
| 1751 | |
| 1752 | ```text |
| 1753 | $ cargo build |
| 1754 | $ ./csvtutor MA < uspop-latin1.csv |
| 1755 | City,State,Population,Latitude,Longitude |
| 1756 | Reading,MA,23441,42.5255556,-71.0958333 |
| 1757 | ``` |
| 1758 | |
| 1759 | ## Filter by population count |
| 1760 | |
| 1761 | In this section, we will show another example program that both reads and |
| 1762 | writes CSV data, but instead of dealing with arbitrary records, we will use |
| 1763 | Serde to deserialize and serialize records with specific types. |
| 1764 | |
| 1765 | For this program, we'd like to be able to filter records in our population data |
| 1766 | by population count. Specifically, we'd like to see which records meet a |
| 1767 | certain population threshold. In addition to using a simple inequality, we must |
| 1768 | also account for records that have a missing population count. This is where |
| 1769 | types like `Option<T>` come in handy, because the compiler will force us to |
| 1770 | consider the case when the population count is missing. |
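That threshold check can be sketched in isolation. The `meets_threshold` helper below is just for illustration (the program that follows inlines the same logic with `map_or`):

```no_run
fn meets_threshold(population: Option<u64>, minimum: u64) -> bool {
    // `map_or` supplies a default (`false`) for the `None` case, so a
    // record with no population count is always filtered out.
    population.map_or(false, |pop| pop >= minimum)
}

fn main() {
    assert!(meets_threshold(Some(169160), 100000));
    assert!(!meets_threshold(Some(7610), 100000));
    // A missing count never passes the threshold.
    assert!(!meets_threshold(None, 100000));
}
```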
| 1771 | |
| 1772 | Since we're using Serde in this example, don't forget to add the Serde |
dependency to the `[dependencies]` section of your `Cargo.toml` if it isn't
already there:
| 1775 | |
| 1776 | ```text |
| 1777 | serde = { version = "1", features = ["derive"] } |
| 1778 | ``` |
| 1779 | |
| 1780 | Now here's the code: |
| 1781 | |
| 1782 | ```no_run |
| 1783 | //tutorial-pipeline-pop-01.rs |
| 1784 | use std::env; |
| 1785 | use std::error::Error; |
| 1786 | use std::io; |
| 1787 | use std::process; |
| 1788 | |
| 1789 | use serde::{Deserialize, Serialize}; |
| 1790 | |
| 1791 | // Unlike previous examples, we derive both Deserialize and Serialize. This |
| 1792 | // means we'll be able to automatically deserialize and serialize this type. |
| 1793 | #[derive(Debug, Deserialize, Serialize)] |
| 1794 | #[serde(rename_all = "PascalCase")] |
| 1795 | struct Record { |
| 1796 | city: String, |
| 1797 | state: String, |
| 1798 | population: Option<u64>, |
| 1799 | latitude: f64, |
| 1800 | longitude: f64, |
| 1801 | } |
| 1802 | |
| 1803 | fn run() -> Result<(), Box<dyn Error>> { |
| 1804 | // Get the query from the positional arguments. |
| 1805 | // If one doesn't exist or isn't an integer, return an error. |
| 1806 | let minimum_pop: u64 = match env::args().nth(1) { |
| 1807 | None => return Err(From::from("expected 1 argument, but got none")), |
| 1808 | Some(arg) => arg.parse()?, |
| 1809 | }; |
| 1810 | |
| 1811 | // Build CSV readers and writers to stdin and stdout, respectively. |
| 1812 | // Note that we don't need to write headers explicitly. Since we're |
| 1813 | // serializing a custom struct, that's done for us automatically. |
| 1814 | let mut rdr = csv::Reader::from_reader(io::stdin()); |
| 1815 | let mut wtr = csv::Writer::from_writer(io::stdout()); |
| 1816 | |
| 1817 | // Iterate over all the records in `rdr`, and write only records containing |
| 1818 | // a population that is greater than or equal to `minimum_pop`. |
| 1819 | for result in rdr.deserialize() { |
| 1820 | // Remember that when deserializing, we must use a type hint to |
| 1821 | // indicate which type we want to deserialize our record into. |
| 1822 | let record: Record = result?; |
| 1823 | |
        // `map_or` is a combinator on `Option`. It takes two parameters:
        // a value to use when the `Option` is `None` (i.e., the record has
        // no population count) and a closure that returns another value of
        // the same type when the `Option` is `Some`. In this case, we test
        // the population against the minimum count that we got from the
        // command line.
| 1830 | if record.population.map_or(false, |pop| pop >= minimum_pop) { |
| 1831 | wtr.serialize(record)?; |
| 1832 | } |
| 1833 | } |
| 1834 | |
| 1835 | // CSV writers use an internal buffer, so we should always flush when done. |
| 1836 | wtr.flush()?; |
| 1837 | Ok(()) |
| 1838 | } |
| 1839 | |
| 1840 | fn main() { |
| 1841 | if let Err(err) = run() { |
| 1842 | println!("{}", err); |
| 1843 | process::exit(1); |
| 1844 | } |
| 1845 | } |
| 1846 | ``` |
| 1847 | |
| 1848 | If we compile and run our program with a minimum threshold of `100000`, we |
| 1849 | should see three matching records. Notice that the headers were added even |
| 1850 | though we never explicitly wrote them! |
| 1851 | |
| 1852 | ```text |
| 1853 | $ cargo build |
| 1854 | $ ./target/debug/csvtutor 100000 < uspop.csv |
| 1855 | City,State,Population,Latitude,Longitude |
| 1856 | Fontana,CA,169160,34.0922222,-117.4341667 |
| 1857 | Bridgeport,CT,139090,41.1669444,-73.2052778 |
| 1858 | Indianapolis,IN,773283,39.7683333,-86.1580556 |
| 1859 | ``` |
| 1860 | |
| 1861 | # Performance |
| 1862 | |
| 1863 | In this section, we'll go over how to squeeze the most juice out of our CSV |
| 1864 | reader. As it happens, most of the APIs we've seen so far were designed with |
| 1865 | high level convenience in mind, and that often comes with some costs. For the |
| 1866 | most part, those costs revolve around unnecessary allocations. Therefore, most |
| 1867 | of the section will show how to do CSV parsing with as little allocation as |
| 1868 | possible. |
| 1869 | |
| 1870 | There are two critical preliminaries we must cover. |
| 1871 | |
| 1872 | Firstly, when you care about performance, you should compile your code |
| 1873 | with `cargo build --release` instead of `cargo build`. The `--release` |
| 1874 | flag instructs the compiler to spend more time optimizing your code. When |
| 1875 | compiling with the `--release` flag, you'll find your compiled program at |
| 1876 | `target/release/csvtutor` instead of `target/debug/csvtutor`. Throughout this |
| 1877 | tutorial, we've used `cargo build` because our dataset was small and we weren't |
| 1878 | focused on speed. The downside of `cargo build --release` is that it will take |
| 1879 | longer than `cargo build`. |
| 1880 | |
| 1881 | Secondly, the dataset we've used throughout this tutorial only has 100 records. |
| 1882 | We'd have to try really hard to cause our program to run slowly on 100 records, |
| 1883 | even when we compile without the `--release` flag. Therefore, in order to |
| 1884 | actually witness a performance difference, we need a bigger dataset. To get |
| 1885 | such a dataset, we'll use the original source of `uspop.csv`. **Warning: the |
| 1886 | download is 41MB compressed and decompresses to 145MB.** |
| 1887 | |
| 1888 | ```text |
| 1889 | $ curl -LO http://burntsushi.net/stuff/worldcitiespop.csv.gz |
| 1890 | $ gunzip worldcitiespop.csv.gz |
| 1891 | $ wc worldcitiespop.csv |
| 1892 | 3173959 5681543 151492068 worldcitiespop.csv |
| 1893 | $ md5sum worldcitiespop.csv |
| 1894 | 6198bd180b6d6586626ecbf044c1cca5 worldcitiespop.csv |
| 1895 | ``` |
| 1896 | |
| 1897 | Finally, it's worth pointing out that this section is not attempting to |
| 1898 | present a rigorous set of benchmarks. We will stay away from rigorous analysis |
| 1899 | and instead rely a bit more on wall clock times and intuition. |
| 1900 | |
| 1901 | ## Amortizing allocations |
| 1902 | |
| 1903 | In order to measure performance, we must be careful about what it is we're |
| 1904 | measuring. We must also be careful to not change the thing we're measuring as |
| 1905 | we make improvements to the code. For this reason, we will focus on measuring |
| 1906 | how long it takes to count the number of records corresponding to city |
| 1907 | population counts in Massachusetts. This represents a very small amount of work |
| 1908 | that requires us to visit every record, and therefore represents a decent way |
| 1909 | to measure how long it takes to do CSV parsing. |
| 1910 | |
| 1911 | Before diving into our first optimization, let's start with a baseline by |
| 1912 | adapting a previous example to count the number of records in |
| 1913 | `worldcitiespop.csv`: |
| 1914 | |
| 1915 | ```no_run |
| 1916 | //tutorial-perf-alloc-01.rs |
| 1917 | use std::error::Error; |
| 1918 | use std::io; |
| 1919 | use std::process; |
| 1920 | |
| 1921 | fn run() -> Result<u64, Box<dyn Error>> { |
| 1922 | let mut rdr = csv::Reader::from_reader(io::stdin()); |
| 1923 | |
| 1924 | let mut count = 0; |
| 1925 | for result in rdr.records() { |
| 1926 | let record = result?; |
| 1927 | if &record[0] == "us" && &record[3] == "MA" { |
| 1928 | count += 1; |
| 1929 | } |
| 1930 | } |
| 1931 | Ok(count) |
| 1932 | } |
| 1933 | |
| 1934 | fn main() { |
| 1935 | match run() { |
| 1936 | Ok(count) => { |
| 1937 | println!("{}", count); |
| 1938 | } |
| 1939 | Err(err) => { |
| 1940 | println!("{}", err); |
| 1941 | process::exit(1); |
| 1942 | } |
| 1943 | } |
| 1944 | } |
| 1945 | ``` |
| 1946 | |
| 1947 | Now let's compile and run it and see what kind of timing we get. Don't forget |
| 1948 | to compile with the `--release` flag. (For grins, try compiling without the |
| 1949 | `--release` flag and see how long it takes to run the program!) |
| 1950 | |
| 1951 | ```text |
| 1952 | $ cargo build --release |
| 1953 | $ time ./target/release/csvtutor < worldcitiespop.csv |
| 1954 | 2176 |
| 1955 | |
| 1956 | real 0m0.645s |
| 1957 | user 0m0.627s |
| 1958 | sys 0m0.017s |
| 1959 | ``` |
| 1960 | |
| 1961 | All right, so what's the first thing we can do to make this faster? This |
| 1962 | section promised to speed things up by amortizing allocation, but we can do |
| 1963 | something even simpler first: iterate over |
| 1964 | [`ByteRecord`](../struct.ByteRecord.html)s |
| 1965 | instead of |
| 1966 | [`StringRecord`](../struct.StringRecord.html)s. |
| 1967 | If you recall from a previous section, a `StringRecord` is guaranteed to be |
valid UTF-8, and therefore must validate that its contents are actually UTF-8.
| 1969 | (If validation fails, then the CSV reader will return an error.) If we remove |
| 1970 | that validation from our program, then we can realize a nice speed boost as |
| 1971 | shown in the next example: |
| 1972 | |
| 1973 | ```no_run |
| 1974 | //tutorial-perf-alloc-02.rs |
| 1975 | # use std::error::Error; |
| 1976 | # use std::io; |
| 1977 | # use std::process; |
| 1978 | # |
| 1979 | fn run() -> Result<u64, Box<dyn Error>> { |
| 1980 | let mut rdr = csv::Reader::from_reader(io::stdin()); |
| 1981 | |
| 1982 | let mut count = 0; |
| 1983 | for result in rdr.byte_records() { |
| 1984 | let record = result?; |
| 1985 | if &record[0] == b"us" && &record[3] == b"MA" { |
| 1986 | count += 1; |
| 1987 | } |
| 1988 | } |
| 1989 | Ok(count) |
| 1990 | } |
| 1991 | # |
| 1992 | # fn main() { |
| 1993 | # match run() { |
| 1994 | # Ok(count) => { |
| 1995 | # println!("{}", count); |
| 1996 | # } |
| 1997 | # Err(err) => { |
| 1998 | # println!("{}", err); |
| 1999 | # process::exit(1); |
| 2000 | # } |
| 2001 | # } |
| 2002 | # } |
| 2003 | ``` |
| 2004 | |
| 2005 | And now compile and run: |
| 2006 | |
| 2007 | ```text |
| 2008 | $ cargo build --release |
| 2009 | $ time ./target/release/csvtutor < worldcitiespop.csv |
| 2010 | 2176 |
| 2011 | |
| 2012 | real 0m0.429s |
| 2013 | user 0m0.403s |
| 2014 | sys 0m0.023s |
| 2015 | ``` |
| 2016 | |
| 2017 | Our program is now approximately 30% faster, all because we removed UTF-8 |
| 2018 | validation. But was it actually okay to remove UTF-8 validation? What have we |
| 2019 | lost? In this case, it is perfectly acceptable to drop UTF-8 validation and use |
| 2020 | `ByteRecord` instead because all we're doing with the data in the record is |
| 2021 | comparing two of its fields to raw bytes: |
| 2022 | |
| 2023 | ```ignore |
| 2024 | if &record[0] == b"us" && &record[3] == b"MA" { |
| 2025 | count += 1; |
| 2026 | } |
| 2027 | ``` |
| 2028 | |
| 2029 | In particular, it doesn't matter whether `record` is valid UTF-8 or not, since |
| 2030 | we're checking for equality on the raw bytes themselves. |
| 2031 | |
| 2032 | UTF-8 validation via `StringRecord` is useful because it provides access to |
fields as `&str` types, whereas `ByteRecord` provides fields as `&[u8]` types.
| 2034 | `&str` is the type of a borrowed string in Rust, which provides convenient |
| 2035 | access to string APIs like substring search. Strings are also frequently used |
| 2036 | in other areas, so they tend to be a useful thing to have. Therefore, sticking |
| 2037 | with `StringRecord` is a good default, but if you need the extra speed and can |
| 2038 | deal with arbitrary bytes, then switching to `ByteRecord` might be a good idea. |
| 2039 | |
| 2040 | Moving on, let's try to get another speed boost by amortizing allocation. |
| 2041 | Amortizing allocation is the technique that creates an allocation once (or |
| 2042 | very rarely), and then attempts to reuse it instead of creating additional |
| 2043 | allocations. In the case of the previous examples, we used iterators created |
| 2044 | by the `records` and `byte_records` methods on a CSV reader. These iterators |
allocate a new record for every item they yield, which in turn corresponds
to a new allocation. They do this because iterators cannot yield items that
borrow from the iterator itself, and because creating new allocations tends to
be a lot more convenient.
| 2049 | |
| 2050 | If we're willing to forgo use of iterators, then we can amortize allocations |
| 2051 | by creating a *single* `ByteRecord` and asking the CSV reader to read into it. |
| 2052 | We do this by using the |
| 2053 | [`Reader::read_byte_record`](../struct.Reader.html#method.read_byte_record) |
| 2054 | method. |
| 2055 | |
| 2056 | ```no_run |
| 2057 | //tutorial-perf-alloc-03.rs |
| 2058 | # use std::error::Error; |
| 2059 | # use std::io; |
| 2060 | # use std::process; |
| 2061 | # |
| 2062 | fn run() -> Result<u64, Box<dyn Error>> { |
| 2063 | let mut rdr = csv::Reader::from_reader(io::stdin()); |
| 2064 | let mut record = csv::ByteRecord::new(); |
| 2065 | |
| 2066 | let mut count = 0; |
| 2067 | while rdr.read_byte_record(&mut record)? { |
| 2068 | if &record[0] == b"us" && &record[3] == b"MA" { |
| 2069 | count += 1; |
| 2070 | } |
| 2071 | } |
| 2072 | Ok(count) |
| 2073 | } |
| 2074 | # |
| 2075 | # fn main() { |
| 2076 | # match run() { |
| 2077 | # Ok(count) => { |
| 2078 | # println!("{}", count); |
| 2079 | # } |
| 2080 | # Err(err) => { |
| 2081 | # println!("{}", err); |
| 2082 | # process::exit(1); |
| 2083 | # } |
| 2084 | # } |
| 2085 | # } |
| 2086 | ``` |
| 2087 | |
| 2088 | Compile and run: |
| 2089 | |
| 2090 | ```text |
| 2091 | $ cargo build --release |
| 2092 | $ time ./target/release/csvtutor < worldcitiespop.csv |
| 2093 | 2176 |
| 2094 | |
| 2095 | real 0m0.308s |
| 2096 | user 0m0.283s |
| 2097 | sys 0m0.023s |
| 2098 | ``` |
| 2099 | |
| 2100 | Woohoo! This represents *another* 30% boost over the previous example, which is |
| 2101 | a 50% boost over the first example. |
| 2102 | |
| 2103 | Let's dissect this code by taking a look at the type signature of the |
| 2104 | `read_byte_record` method: |
| 2105 | |
| 2106 | ```ignore |
| 2107 | fn read_byte_record(&mut self, record: &mut ByteRecord) -> csv::Result<bool>; |
| 2108 | ``` |
| 2109 | |
| 2110 | This method takes as input a CSV reader (the `self` parameter) and a *mutable |
| 2111 | borrow* of a `ByteRecord`, and returns a `csv::Result<bool>`. (The |
| 2112 | `csv::Result<bool>` is equivalent to `Result<bool, csv::Error>`.) The return |
| 2113 | value is `true` if and only if a record was read. When it's `false`, that means |
| 2114 | the reader has exhausted its input. This method works by copying the contents |
| 2115 | of the next record into the provided `ByteRecord`. Since the same `ByteRecord` |
| 2116 | is used to read every record, it will already have space allocated for data. |
| 2117 | When `read_byte_record` runs, it will overwrite the contents that were there |
| 2118 | with the new record, which means that it can reuse the space that was |
| 2119 | allocated. Thus, we have *amortized allocation*. |
| 2120 | |
| 2121 | An exercise you might consider doing is to use a `StringRecord` instead of a |
| 2122 | `ByteRecord`, and therefore |
| 2123 | [`Reader::read_record`](../struct.Reader.html#method.read_record) |
| 2124 | instead of `read_byte_record`. This will give you easy access to Rust strings |
| 2125 | at the cost of UTF-8 validation but *without* the cost of allocating a new |
| 2126 | `StringRecord` for every record. |
| 2127 | |
| 2128 | ## Serde and zero allocation |
| 2129 | |
| 2130 | In this section, we are going to briefly examine how we use Serde and what we |
| 2131 | can do to speed it up. The key optimization we'll want to make is to---you |
| 2132 | guessed it---amortize allocation. |
| 2133 | |
| 2134 | As with the previous section, let's start with a simple baseline based off an |
| 2135 | example using Serde in a previous section: |
| 2136 | |
| 2137 | ```no_run |
| 2138 | //tutorial-perf-serde-01.rs |
| 2139 | use std::error::Error; |
| 2140 | use std::io; |
| 2141 | use std::process; |
| 2142 | |
| 2143 | use serde::Deserialize; |
| 2144 | |
| 2145 | #[derive(Debug, Deserialize)] |
| 2146 | #[serde(rename_all = "PascalCase")] |
| 2147 | struct Record { |
| 2148 | country: String, |
| 2149 | city: String, |
| 2150 | accent_city: String, |
| 2151 | region: String, |
| 2152 | population: Option<u64>, |
| 2153 | latitude: f64, |
| 2154 | longitude: f64, |
| 2155 | } |
| 2156 | |
| 2157 | fn run() -> Result<u64, Box<dyn Error>> { |
| 2158 | let mut rdr = csv::Reader::from_reader(io::stdin()); |
| 2159 | |
| 2160 | let mut count = 0; |
| 2161 | for result in rdr.deserialize() { |
| 2162 | let record: Record = result?; |
| 2163 | if record.country == "us" && record.region == "MA" { |
| 2164 | count += 1; |
| 2165 | } |
| 2166 | } |
| 2167 | Ok(count) |
| 2168 | } |
| 2169 | |
| 2170 | fn main() { |
| 2171 | match run() { |
| 2172 | Ok(count) => { |
| 2173 | println!("{}", count); |
| 2174 | } |
| 2175 | Err(err) => { |
| 2176 | println!("{}", err); |
| 2177 | process::exit(1); |
| 2178 | } |
| 2179 | } |
| 2180 | } |
| 2181 | ``` |
| 2182 | |
| 2183 | Now compile and run this program: |
| 2184 | |
| 2185 | ```text |
| 2186 | $ cargo build --release |
$ time ./target/release/csvtutor < worldcitiespop.csv
| 2188 | 2176 |
| 2189 | |
| 2190 | real 0m1.381s |
| 2191 | user 0m1.367s |
| 2192 | sys 0m0.013s |
| 2193 | ``` |
| 2194 | |
| 2195 | The first thing you might notice is that this is quite a bit slower than our |
| 2196 | programs in the previous section. This is because deserializing each record |
| 2197 | has a certain amount of overhead to it. In particular, some of the fields need |
| 2198 | to be parsed as integers or floating point numbers, which isn't free. However, |
| 2199 | there is hope yet, because we can speed up this program! |
| 2200 | |
| 2201 | Our first attempt to speed up the program will be to amortize allocation. Doing |
| 2202 | this with Serde is a bit trickier than before, because we need to change our |
| 2203 | `Record` type and use the manual deserialization API. Let's see what that looks |
| 2204 | like: |
| 2205 | |
| 2206 | ```no_run |
| 2207 | //tutorial-perf-serde-02.rs |
| 2208 | # use std::error::Error; |
| 2209 | # use std::io; |
| 2210 | # use std::process; |
| 2211 | # |
| 2212 | # use serde::Deserialize; |
| 2213 | # |
| 2214 | #[derive(Debug, Deserialize)] |
| 2215 | #[serde(rename_all = "PascalCase")] |
| 2216 | struct Record<'a> { |
| 2217 | country: &'a str, |
| 2218 | city: &'a str, |
| 2219 | accent_city: &'a str, |
| 2220 | region: &'a str, |
| 2221 | population: Option<u64>, |
| 2222 | latitude: f64, |
| 2223 | longitude: f64, |
| 2224 | } |
| 2225 | |
| 2226 | fn run() -> Result<u64, Box<dyn Error>> { |
| 2227 | let mut rdr = csv::Reader::from_reader(io::stdin()); |
| 2228 | let mut raw_record = csv::StringRecord::new(); |
| 2229 | let headers = rdr.headers()?.clone(); |
| 2230 | |
| 2231 | let mut count = 0; |
| 2232 | while rdr.read_record(&mut raw_record)? { |
| 2233 | let record: Record = raw_record.deserialize(Some(&headers))?; |
| 2234 | if record.country == "us" && record.region == "MA" { |
| 2235 | count += 1; |
| 2236 | } |
| 2237 | } |
| 2238 | Ok(count) |
| 2239 | } |
| 2240 | # |
| 2241 | # fn main() { |
| 2242 | # match run() { |
| 2243 | # Ok(count) => { |
| 2244 | # println!("{}", count); |
| 2245 | # } |
| 2246 | # Err(err) => { |
| 2247 | # println!("{}", err); |
| 2248 | # process::exit(1); |
| 2249 | # } |
| 2250 | # } |
| 2251 | # } |
| 2252 | ``` |
| 2253 | |
| 2254 | Compile and run: |
| 2255 | |
| 2256 | ```text |
| 2257 | $ cargo build --release |
| 2258 | $ time ./target/release/csvtutor < worldcitiespop.csv |
| 2259 | 2176 |
| 2260 | |
| 2261 | real 0m1.055s |
| 2262 | user 0m1.040s |
| 2263 | sys 0m0.013s |
| 2264 | ``` |
| 2265 | |
| 2266 | This corresponds to a roughly 24% reduction in runtime. To achieve this, we |
| 2267 | had to make two important changes. |
| 2268 | |
| 2269 | The first was to make our `Record` type contain `&str` fields instead of |
| 2270 | `String` fields. If you recall from a previous section, `&str` is a *borrowed* |
| 2271 | string while a `String` is an *owned* string. A borrowed string points to |
| 2272 | an already existing allocation, whereas a `String` always implies a new |
| 2273 | allocation. In this case, our `&str` is borrowing from the CSV record itself. |
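| 2273 |  |
| 2273 | The distinction can be seen in miniature with plain standard-library strings |
| 2273 | (nothing here is specific to the csv crate): |
| 2273 |  |
| 2273 | ```rust |
| 2273 | fn main() { |
| 2273 |     let owned: String = String::from("Boston"); // owned: allocates on the heap |
| 2273 |     let borrowed: &str = &owned[..3];           // borrowed: points into `owned` |
| 2273 |     assert_eq!(borrowed, "Bos"); |
| 2273 |     // Turning a borrowed string back into an owned one allocates again: |
| 2273 |     let copied: String = borrowed.to_string(); |
| 2273 |     assert_eq!(copied, "Bos"); |
| 2273 |     println!("{} {}", owned, copied); |
| 2273 | } |
| 2273 | ``` |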
| 2274 | |
| 2275 | The second change we had to make was to stop using the |
| 2276 | [`Reader::deserialize`](../struct.Reader.html#method.deserialize) |
| 2277 | iterator, and instead read each record into a `StringRecord` explicitly |
| 2278 | and then use the |
| 2279 | [`StringRecord::deserialize`](../struct.StringRecord.html#method.deserialize) |
| 2280 | method to deserialize a single record. |
| 2281 | |
| 2282 | The second change is a bit tricky, because in order for it to work, our |
| 2283 | `Record` type needs to borrow from the data inside the `StringRecord`. That |
| 2284 | means that our `Record` value cannot outlive the `StringRecord` that it was |
| 2285 | created from. Since we overwrite the same `StringRecord` on each iteration |
| 2286 | (in order to amortize allocation), that means our `Record` value must evaporate |
| 2287 | before the next iteration of the loop. Indeed, the compiler will enforce this! |
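| 2287 |  |
| 2287 | The shape of that constraint can be sketched with only the standard library: |
| 2287 | a value borrowed from a reused buffer must be dropped before the buffer is |
| 2287 | modified again, just as each `Record` must be dropped before the next call to |
| 2287 | `read_record` (the field values below are invented for illustration): |
| 2287 |  |
| 2287 | ```rust |
| 2287 | fn main() { |
| 2287 |     let mut buf = String::new(); // reused each iteration, like `raw_record` |
| 2287 |     let mut count = 0; |
| 2287 |     for line in ["us,MA", "fr,A8", "us,MA"] { |
| 2287 |         buf.clear(); |
| 2287 |         buf.push_str(line); |
| 2287 |         // `fields` borrows from `buf`, just as `Record<'a>` borrows from |
| 2287 |         // the `StringRecord`. It must go away before the next `buf.clear()`. |
| 2287 |         let fields: Vec<&str> = buf.split(',').collect(); |
| 2287 |         if fields == ["us", "MA"] { |
| 2287 |             count += 1; |
| 2287 |         } |
| 2287 |         // Storing `fields` in a Vec that outlives the loop would be a |
| 2287 |         // compile error: the borrow cannot outlive the buffer it points into. |
| 2287 |     } |
| 2287 |     println!("{}", count); |
| 2287 | } |
| 2287 | ``` |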
| 2288 | |
| 2289 | There is one more optimization we can make: remove UTF-8 validation. In |
| 2290 | general, this means using `&[u8]` instead of `&str` and `ByteRecord` instead |
| 2291 | of `StringRecord`: |
| 2292 | |
| 2293 | ```no_run |
| 2294 | //tutorial-perf-serde-03.rs |
| 2295 | # use std::error::Error; |
| 2296 | # use std::io; |
| 2297 | # use std::process; |
| 2298 | # |
| 2299 | # use serde::Deserialize; |
| 2300 | # |
| 2301 | #[derive(Debug, Deserialize)] |
| 2302 | #[serde(rename_all = "PascalCase")] |
| 2303 | struct Record<'a> { |
| 2304 | country: &'a [u8], |
| 2305 | city: &'a [u8], |
| 2306 | accent_city: &'a [u8], |
| 2307 | region: &'a [u8], |
| 2308 | population: Option<u64>, |
| 2309 | latitude: f64, |
| 2310 | longitude: f64, |
| 2311 | } |
| 2312 | |
| 2313 | fn run() -> Result<u64, Box<dyn Error>> { |
| 2314 | let mut rdr = csv::Reader::from_reader(io::stdin()); |
| 2315 | let mut raw_record = csv::ByteRecord::new(); |
| 2316 | let headers = rdr.byte_headers()?.clone(); |
| 2317 | |
| 2318 | let mut count = 0; |
| 2319 | while rdr.read_byte_record(&mut raw_record)? { |
| 2320 | let record: Record = raw_record.deserialize(Some(&headers))?; |
| 2321 | if record.country == b"us" && record.region == b"MA" { |
| 2322 | count += 1; |
| 2323 | } |
| 2324 | } |
| 2325 | Ok(count) |
| 2326 | } |
| 2327 | # |
| 2328 | # fn main() { |
| 2329 | # match run() { |
| 2330 | # Ok(count) => { |
| 2331 | # println!("{}", count); |
| 2332 | # } |
| 2333 | # Err(err) => { |
| 2334 | # println!("{}", err); |
| 2335 | # process::exit(1); |
| 2336 | # } |
| 2337 | # } |
| 2338 | # } |
| 2339 | ``` |
| 2340 | |
| 2341 | Compile and run: |
| 2342 | |
| 2343 | ```text |
| 2344 | $ cargo build --release |
| 2345 | $ time ./target/release/csvtutor < worldcitiespop.csv |
| 2346 | 2176 |
| 2347 | |
| 2348 | real 0m0.873s |
| 2349 | user 0m0.850s |
| 2350 | sys 0m0.023s |
| 2351 | ``` |
| 2352 | |
| 2353 | This corresponds to a further 17% reduction in runtime over the previous |
| 2354 | example, and a 37% reduction over the first example. |
| 2355 | |
| 2356 | In sum, Serde parsing is still quite fast, but will generally not be the |
| 2357 | fastest way to parse CSV since it necessarily needs to do more work. |
| 2358 | |
| 2359 | ## CSV parsing without the standard library |
| 2360 | |
| 2361 | In this section, we will explore a niche use case: parsing CSV without the |
| 2362 | standard library. While the `csv` crate itself requires the standard library, |
| 2363 | the underlying parser is actually part of the |
| 2364 | [`csv-core`](https://docs.rs/csv-core) |
| 2365 | crate, which does not depend on the standard library. The downside of not |
| 2366 | depending on the standard library is that CSV parsing becomes a lot more |
| 2367 | inconvenient. |
| 2368 | |
| 2369 | The `csv-core` crate is structured similarly to the `csv` crate. There is a |
| 2370 | [`Reader`](../../csv_core/struct.Reader.html) |
| 2371 | and a |
| 2372 | [`Writer`](../../csv_core/struct.Writer.html), |
| 2373 | as well as corresponding builders |
| 2374 | [`ReaderBuilder`](../../csv_core/struct.ReaderBuilder.html) |
| 2375 | and |
| 2376 | [`WriterBuilder`](../../csv_core/struct.WriterBuilder.html). |
| 2377 | The `csv-core` crate has no record types or iterators. Instead, CSV data |
| 2378 | can either be read one field at a time or one record at a time. In this |
| 2379 | section, we'll focus on reading a field at a time since it is simpler, but it |
| 2380 | is generally faster to read a record at a time since it does more work per |
| 2381 | function call. |
| 2382 | |
| 2383 | In keeping with this section on performance, let's write a program using only |
| 2384 | `csv-core` that counts the number of records in the state of Massachusetts. |
| 2385 | |
| 2386 | (Note that we unfortunately use the standard library in this example even |
| 2387 | though `csv-core` doesn't technically require it. We do this for convenient |
| 2388 | access to I/O, which would be harder without the standard library.) |
| 2389 | |
| 2390 | ```no_run |
| 2391 | //tutorial-perf-core-01.rs |
| 2392 | use std::io::{self, Read}; |
| 2393 | use std::process; |
| 2394 | |
| 2395 | use csv_core::{Reader, ReadFieldResult}; |
| 2396 | |
| 2397 | fn run(mut data: &[u8]) -> Option<u64> { |
| 2398 | let mut rdr = Reader::new(); |
| 2399 | |
| 2400 | // Count the number of records in Massachusetts. |
| 2401 | let mut count = 0; |
| 2402 | // Indicates the current field index. Reset to 0 at start of each record. |
| 2403 | let mut fieldidx = 0; |
| 2404 | // True when the current record is in the United States. |
| 2405 | let mut inus = false; |
| 2406 | // Buffer for field data. Must be big enough to hold the largest field. |
| 2407 | let mut field = [0; 1024]; |
| 2408 | loop { |
| 2409 | // Attempt to incrementally read the next CSV field. |
| 2410 | let (result, nread, nwrite) = rdr.read_field(data, &mut field); |
| 2411 | // nread is the number of bytes read from our input. We should never |
| 2412 | // pass those bytes to read_field again. |
| 2413 | data = &data[nread..]; |
| 2414 | // nwrite is the number of bytes written to the output buffer `field`. |
| 2415 |         // The contents of the buffer after this point are unspecified. |
| 2416 | let field = &field[..nwrite]; |
| 2417 | |
| 2418 | match result { |
| 2419 | // We don't need to handle this case because we read all of the |
| 2420 | // data up front. If we were reading data incrementally, then this |
| 2421 | // would be a signal to read more. |
| 2422 | ReadFieldResult::InputEmpty => {} |
| 2423 | // If we get this case, then we found a field that contains more |
| 2424 | // than 1024 bytes. We keep this example simple and just fail. |
| 2425 | ReadFieldResult::OutputFull => { |
| 2426 | return None; |
| 2427 | } |
| 2428 | // This case happens when we've successfully read a field. If the |
| 2429 | // field is the last field in a record, then `record_end` is true. |
| 2430 | ReadFieldResult::Field { record_end } => { |
| 2431 | if fieldidx == 0 && field == b"us" { |
| 2432 | inus = true; |
| 2433 | } else if inus && fieldidx == 3 && field == b"MA" { |
| 2434 | count += 1; |
| 2435 | } |
| 2436 | if record_end { |
| 2437 | fieldidx = 0; |
| 2438 | inus = false; |
| 2439 | } else { |
| 2440 | fieldidx += 1; |
| 2441 | } |
| 2442 | } |
| 2443 | // This case happens when the CSV reader has successfully exhausted |
| 2444 | // all input. |
| 2445 | ReadFieldResult::End => { |
| 2446 | break; |
| 2447 | } |
| 2448 | } |
| 2449 | } |
| 2450 | Some(count) |
| 2451 | } |
| 2452 | |
| 2453 | fn main() { |
| 2454 | // Read the entire contents of stdin up front. |
| 2455 | let mut data = vec![]; |
| 2456 | if let Err(err) = io::stdin().read_to_end(&mut data) { |
| 2457 | println!("{}", err); |
| 2458 | process::exit(1); |
| 2459 | } |
| 2460 | match run(&data) { |
| 2461 | None => { |
| 2462 | println!("error: could not count records, buffer too small"); |
| 2463 | process::exit(1); |
| 2464 | } |
| 2465 | Some(count) => { |
| 2466 | println!("{}", count); |
| 2467 | } |
| 2468 | } |
| 2469 | } |
| 2470 | ``` |
| 2471 | |
| 2472 | And compile and run it: |
| 2473 | |
| 2474 | ```text |
| 2475 | $ cargo build --release |
| 2476 | $ time ./target/release/csvtutor < worldcitiespop.csv |
| 2477 | 2176 |
| 2478 | |
| 2479 | real 0m0.572s |
| 2480 | user 0m0.513s |
| 2481 | sys 0m0.057s |
| 2482 | ``` |
| 2483 | |
| 2484 | This isn't as fast as some of our previous examples where we used the `csv` |
| 2485 | crate to read into a `StringRecord` or a `ByteRecord`. This is mostly because |
| 2486 | this example reads a field at a time, which incurs more overhead than reading a |
| 2487 | record at a time. To fix this, you would want to use the |
| 2488 | [`Reader::read_record`](../../csv_core/struct.Reader.html#method.read_record) |
| 2489 | method instead, which is defined on `csv_core::Reader`. |
| 2490 | |
| 2491 | The other thing to notice here is that the example is considerably longer than |
| 2492 | the other examples. This is because we need to do more bookkeeping to keep |
| 2493 | track of which field we're reading and how much data we've already fed to the |
| 2494 | reader. There are basically two reasons to use the `csv_core` crate: |
| 2495 | |
| 2496 | 1. If you're in an environment where the standard library is not usable. |
| 2497 | 2. If you want to build your own CSV-like library, you can build it on top |
| 2498 |    of `csv-core`. |
| 2499 | |
| 2500 | # Closing thoughts |
| 2501 | |
| 2502 | Congratulations on making it to the end! It seems incredible that one could |
| 2503 | write so many words on something as basic as CSV parsing. I wanted this |
| 2504 | guide to be accessible not only to Rust beginners, but to inexperienced |
| 2505 | programmers as well. My hope is that the large number of examples will help |
| 2506 | push you in the right direction. |
| 2507 | |
| 2508 | With that said, here are a few more things you might want to look at: |
| 2509 | |
| 2510 | * The [API documentation for the `csv` crate](../index.html) documents all |
| 2511 | facets of the library, and is itself littered with even more examples. |
| 2512 | * The [`csv-index` crate](https://docs.rs/csv-index) provides data structures |
| 2513 |   for indexing CSV data that can be written to disk. (This library is still a |
| 2514 |   work in progress.) |
| 2515 | * The [`xsv` command line tool](https://github.com/BurntSushi/xsv) is a high |
| 2516 | performance CSV swiss army knife. It can slice, select, search, sort, join, |
| 2517 | concatenate, index, format and compute statistics on arbitrary CSV data. Give |
| 2518 | it a try! |
| 2519 | |
| 2520 | */ |