Langsec Workshop 2015: Nom, a Byte-oriented, Streaming, Zero-copy Parser Combinators Library in Rust

About me

thank the weak diffie hellman that appeared while I departed

What is Clever Cloud ?

Platform as a service hosting, git push, zero downtime redeployments

Why nom ?

VideoLAN

what I do at VideoLAN

VLC handles most audio, video and streaming formats

VLC media player, in 2015

handling multiple formats is dangerous. MP4 has a lot of flaws, but not the MKV demuxer

fuzzing

We need a practical solution, now

if not for the "embeddable", I would write some haskell

Rust

v1.0 released on May 15th

no null pointers, efficient memory management emphasis on safety, zero cost abstractions fix bugclasses, not bugs

Generics

enum Option<T> {
  Some(T),
  None
}

enum Result<T, E> {
  Ok(T),
  Err(E)
}

generics are useful also, look, option type, no null pointers define some bounds or traits on the generic types

Pattern matching

fn f(i: u8) -> Option<u8> { ... }

match f(1) {
  Some(x) => println!("got int: {}", x),
  None    => println!("got None")
}

if let Some(x) = f(2) {
  ...
}

easier than most control flow alternatives

Immutability by default

let x = 5;
x = 6; // error

//////
let mut x = 5;
x = 6; // ok

fn to_vec(a:u8, b: u8) -> Vec<u8> {
  let mut v = Vec::new();
  v.push(a);
  v.push(b);
  v
}

great for multithreading great to reason on code

Slices

pub struct Slice<T> {
  pub data: *const T,
  pub len: usize,
}

let vec = vec![1, 2, 3];
let int_slice = &vec[1..];

slice = pointer + length since we have slices and lifetime, we could make zero copy parsers

Ownership, borrowing, move semantics

let v = vec![1, 2, 3];
let v2 = v;
println!("v[0] is: {}", v[0]);

//error: use of moved value: `v`
//println!("v[0] is: {}", v[0]);
//                        ^

this is where the compiler is smart only one owner for a defined variable

Lifetimes

fn take<'a>(x: &'a [u8], count: usize) -> &'a [u8]{ }

the compiler knows when to allocate and deallocate a memory zone

No garbage collection

this is huge

Nom

with lifetimes and slices,

a bet that we can make zero copy parsers in Rust

Nom

Basic types

pub enum IResult<'a,I,O> {
  Done(I,O),
  Error(Err<'a>),
  Incomplete(Needed)
}

pub enum Err<'a>{
  Code(u32),
  Node(u32, Box<Err<'a>>),
  Position(u32, &'a [u8]),
  NodePosition(u32, &'a [u8], Box<Err<'a>>)
}

pub enum Needed {
  Unknown,
  Size(u32)
}

Done contains output and remaining input

Generics

IResult<&[u8], &[u8]>
IResult<&[u8], &str>
IResult<&str, MyStruct>

accept any type, stay type safe

writing a parser

pub fn digit<'a>(input:&'a [u8]) -> IResult<'a,&'a [u8], &[u8]> {
  for idx in 0..input.len() {
    if !is_digit(input[idx]) {
      if idx == 0 {
        return Error(Position(ErrorCode::Digit as u32, input))
      } else {
        return Done(&input[idx..], &input[0..idx])
      }
    }
  }
  Done(b"", input)
}

complicated code explain that code

macros

named!( not_space, is_not!( " \t\r\n" ) );

let r = not_space(&b"abcdefgh\nijkl"[..]);
// -> Done(&b"\nijkl"[..], &b"abcdefgh"[..])

macros are combinators

named! defines a function

macros make it easy to define complex parsers

more complex patterns

struct FileType<'a> {
  major_brand:         &'a str,
  major_brand_version: &'a [u8],
  compatible_brands:   Vec<&'a str>
}

named!(brand_name<&[u8],&str>, map_res!(take!(4), str::from_utf8));
named!(mp4_filetype<&[u8], FileType>,
  chain!(
    m: brand_name          ~
    v: take!(4)            ~
    c: many0!(brand_name)  ,
    || { FileType {
      major_brand:         m,
      major_brand_version: v,
      compatible_brands:   c
      } }
  )
);

here, filetype object in MP4 file format

usual combinators and parsers

Performance

Filetype box parsing in MP4

small.mp4 (375 kB) bigbuckbunny.mp4 (5.3 MB)
hammer 32424 ns/iter 26523 ns/iter
attoparsec 1138 ns/iter (+/- 55.2) 1124 ns/iter (+/- 62.3)
cereal 189 ns/iter (+/- 9.9) 193 ns/iter (+/- 12.4)
nom 240 ns/iter (+/- 56) 195 ns/iter (+/- 69)

the benchmarks are available on github for reproduction

Make everything easier

philosophy: make everything easier for the developer It is alright that nom gets more complex if building stuff with nom gets easier

Incomplete

pub enum Needed {
  Unknown,
  Size(u32)
}
pub enum IResult<'a,I,O> {
  Done(I,O),
  Error(Err<'a>),
  Incomplete(Needed)
}

Incomplete

pub fn be_u16<'a>(i: &[u8]) -> IResult<'a,&[u8], u16> {
  if i.len() < 2 {
    Incomplete(Needed::Size(2))
  } else {
    let res = ((i[0] as u16) << 8) + i[1] as u16;
    Done(&i[2..], res)
  }
}

Streaming

Producers

pub enum ProducerState<O> {
  Eof(O),
  Continue,
  Data(O),
  ProducerError(u32),
}

pub trait Producer {
  fn produce(&mut self)                   -> ProducerState<&[u8]>;
  fn seek(&mut self,   position:SeekFrom) -> Option<u64>;
}

Producer example

fn local_print<'a, T: Debug>(input: T) -> IResult<'a, T, ()> {
  println!("{:?}", input);
  Done(input, ())
}

FileProducer::new("links.txt", 20).map(|producer: FileProducer| {
  let mut p = producer;

  pusher!(pipeline, local_print);
  pipeline(&mut p);
});

Consumers

pub enum ConsumerState {
  Await(
    usize,    // consumed
    usize     // needed buffer size
  ),
  Seek(
    usize,    // consumed
    SeekFrom, // new position
    usize     // needed buffer size
  ),
  Incomplete,
  ConsumerDone,
  ConsumerError(u32)
}

Consumers

pub trait Consumer {
  fn consume(&mut self, input: &[u8]) -> ConsumerState;
  fn end(&mut self);

  fn run(&mut self, producer: &mut Producer) { ... }
}

Example consumer

enum State {
  Beginning,
  Middle,
  End,
  Done
}

struct TestConsumer {
  state:   State,
  counter: usize,
}

named!(om_parser,                        tag!("om"));
named!(nomnom_parser<&[u8],Vec<&[u8]> >, many1!(tag!("nom")));
named!(end_parser,                       tag!("kthxbye"));

Example consumer

impl Consumer for TestConsumer {
  fn consume(&mut self, input: &[u8]) -> ConsumerState {
    match self.state {
      State::Beginning => {
        match om_parser(input) {
          IResult::Error(_)      => ConsumerState::ConsumerError(0),
          IResult::Incomplete(_) => ConsumerState::Await(0, 2),
          IResult::Done(_,_)     => {
            self.state = State::Middle;
            ConsumerState::Await(2, 3)
          }
        }
      },
      [ ... ]
    }
  }

  fn end(&mut self) {
    println!("counted {} noms", self.counter);
  }
}

a powerful way to give an API to a parser

State machine

machine!(TrafficLight {
  attributes { cars: u8 }
  impl { }

  states {
    Green { }  => {
      next => Orange;
    };
    Orange { } => {
      next => Red;
    };
    Red { }    =>  {
      next => Green;
    };
  }
});

impl TrafficLight<Green> {
  pub fn pass_car(&mut self) {
    self.cars = self.cars + 1;
  }
}

let mut t = TrafficLight { state: Green, cars: 0 };
t.pass_car();
let t = t.next();
t.pass_car();
// error: no method named `pass_car` found for type `TrafficLight<Orange>` in the current scope
// t.pass_car();
// ^~~~~~~~~~

state machine protected via type system

actually, would be more interesting to handle errors at runtime

Debugging

tooling is crucial, we need easy ways to debug parsers

Hexdump

slice.to_hex(8);
ftyp header:
00000000        6d 70 34 32 00 00 00 00         mp42....
00000008        6d 70 34 32 69 73 6f 6d         mp42isom
00000010        61 76 63 31                     avc1

dbg!

named!(f, dbg!( tag!( "abcd" ) ) );

let a = &b"efgh"[..];

f(a);
// -> Error(Position(0, [101, 102, 103, 104])) at l.1 by ' tag ! ( "abcd" ) '

dbg_dmp!

named!(f, dbg_dmp!( tag!( "abcd" ) ) );

let a = &b"efghijkl"[..];

f(a);
// -> Error(Position(0, [101, 102, 103, 104, 105, 106, 107, 108]))
//    at l.1 by ' tag ! ( "abcd" ) '
// -> 00000000        65 66 67 68 69 6a 6b 6c         efghijkl

Error management

pub enum Err<'a>{
  Code(u32),
  Node(u32, Box<Err<'a>>),
  Position(u32, &'a [u8]),
  NodePosition(u32, &'a [u8], Box<Err<'a>>)
}

pub enum IResult<'a,I,O> {
  Done(I,O),
  Error(Err<'a>),
  Incomplete(Needed)
}

Error macro

named!(err_test, alt!(
    tag!("abcd")
  | preceded!(tag!("efgh"), error!(42,
      chain!(
             tag!("ijkl")              ~
        res: error!(128, tag!("mnop")) ,
        || { res }
      )
    )
  )
));
let a = &b"efghblah"[..];
let err = err_test(a);
// -> Error(NodePosition(42, blah, Box::new(Position(0, &b"blah"[..]))))

cut operator. No backtracking, early return. Catch the returned error from child parser, wraps it

A chain of errors

named!(err_test, alt!(
    tag!("abcd")
  | preceded!(tag!("efgh"), error!(42,
      chain!(
        tag!("ijkl")              ~
        res: error!(128, tag!("mnop")) ,
        || { res }
      )
    )
  )
));

let b = &b"efghijklblah"[..];
let blah = &b"blah"[..];

let err = err_test(b);
// -> Error(
//      NodePosition(42, &b"ijklblah"[..],
//        Box::new(NodePosition(128, blah,
//          Box::new(Position(0, blah))
//        ))
//      )
//    )

we have a chain of errors, with corresponding positions in the input

what can we do with that?

pattern matching

use nom::util::error_to_list;

fn error_to_string(e: Err) -> &str {
  let v:Vec<u32> = error_to_list(e);
  match &v[..] {
    [42, 0]      => "missing `ijkl` tag",
    [42, 128, 0] => "missing `mnop` tag after `ijkl`",
    _            => "unrecognized error"
  }
}

pattern matching will break if the grammar changes

Merr

let mut err_map = collections::HashMap::new();
add_error_pattern(
  &mut err_map,
  err_test(&b"efghaaaa"[..]),
  "missing `ijkl` tag"
);

add_error_pattern(
  &mut err_map,
  err_test(&b"efghijklaaaa"[..]),
  "missing `mnop` tag after `ijkl`"
);

if let IResult::Error(e) = err_test(&b"efghblah"[..]) {
  let msg = err_map.get(&error_to_list(e));
  // -> "missing `ijkl` tag"
};

if let IResult::Error(e) = err_test(&b"efghijklblah"[..]) {
  let msg = err_map.get(&error_to_list(e))
  // -> "missing `mnop` tag after `ijkl`"
};

Merr for Menhir parsers in OCaml

idea: generate error patterns from known bad inputs

Colorful hexdump

named!(err_test, alt!(
  tag!("abcd") |
  error!(12, preceded!(tag!("efgh"),
    error!(42, chain!(
               tag!("ijk")               ~
          res: error!(128, tag!("mnop")) ,
          || { res }
    ))
  ))
));

there's a catch: nested parser errors, not necessarily contiguous ones

Is it usable?

Embeddable as C lib

not hard to do

remove dependance on the standard library

can do static or dynamic lib

Existing parsers

dropbox started using months ago, when it was not very stable

Summing up

summing up: it is fast, it is easy to embed, there are lots of tools to help the developer, people are using it

Thanks!

Questions ?