Real World OCaml

2nd Edition (in progress)

The Compiler Frontend: Parsing and Type Checking

Compiling source code into executable programs involves a fairly complex set of libraries, linkers, and assemblers. It’s important to understand how these fit together to help with your day-to-day workflow of developing, debugging, and deploying applications. 

OCaml has a strong emphasis on static type safety and rejects source code that doesn’t meet its requirements as early as possible. The compiler does this by running the source code through a series of checks and transformations. Each stage performs its job (e.g., type checking, optimization, or code generation) and discards some information from the previous stage. The final native code output is low-level assembly code that doesn’t know anything about the OCaml modules or objects that the compiler started with.  

You don’t have to do all of this manually, of course. The compiler frontends (ocamlc and ocamlopt) are invoked via the command line and chain the stages together for you. Sometimes though, you’ll need to dive into the toolchain to hunt down a bug or investigate a performance problem. This chapter explains the compiler pipeline in more depth so you understand how to harness the command-line tools effectively.   

In this chapter, we’ll cover the following topics:

  • The compilation pipeline and what each stage represents

  • The type-checking process, including module resolution

The details of the compilation process into executable code can be found next, in Chapter 24, The Compiler Backend Byte Code And Native Code.

An Overview of the Toolchain

The OCaml tools accept textual source code as input, using the filename extensions .ml and .mli for modules and signatures, respectively. We explained the basics of the build process in Chapter 4, Files Modules And Programs, so we’ll assume you’ve built a few OCaml programs already by this point. 

Each source file represents a compilation unit that is built separately. The compiler generates intermediate files with different filename extensions to use as it advances through the compilation stages. The linker takes a collection of compiled units and produces a standalone executable or library archive that can be reused by other applications. 
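
For instance, a minimal bytecode build of a hypothetical single-file program main.ml is just a compile step followed by a link step:

ocamlc -c main.ml         # produces main.cmi and main.cmo
ocamlc -o main main.cmo   # links the compiled unit into an executable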

The overall compilation pipeline looks like this:  

Notice that the pipeline branches toward the end. OCaml has multiple compiler backends that reuse the early stages of compilation but produce very different final outputs. The bytecode can be run by a portable interpreter and can even be transformed into JavaScript (via js_of_ocaml) or C source code (via OCamlCC). The native code compiler generates specialized executable binaries suitable for high-performance applications.  

Obtaining the Compiler Source Code

Although it’s not necessary to understand the examples, you may find it useful to have a copy of the OCaml source tree checked out while you read through this chapter. The source code is available from multiple places:

  • Stable releases as zip and tar archives from the OCaml download site

  • A Git repository with all the history and development branches included, browsable online at GitHub

The source tree is split up into subdirectories. The core compiler consists of:

config/
Configuration directives to tailor OCaml for your operating system and architecture.
bytecomp/
Bytecode compiler that converts OCaml into an interpreted executable format.
asmcomp/
Native-code compiler that converts OCaml into high-performance native code executables.
parsing/
The OCaml lexer, parser, and libraries for manipulating them.
typing/
The static type checking implementation and type definitions.
driver/
Command-line interfaces for the compiler tools.

A number of tools and scripts are also built alongside the core compiler:

debugger/
The interactive bytecode debugger.
toplevel/
Interactive top-level console.
emacs/
A caml-mode for the Emacs editor.
stdlib/
The compiler standard library, including the Pervasives module.
ocamlbuild/
Build system that automates common OCaml compilation modes.
otherlibs/
Optional libraries such as the Unix and graphics modules.
tools/
Command-line utilities such as ocamldep that are installed with the compiler.
testsuite/
Regression tests for the core compiler.

We’ll go through each of the compilation stages now and explain how they will be useful to you during day-to-day OCaml development.

Parsing Source Code

When a source file is passed to the OCaml compiler, its first task is to parse the text into a more structured abstract syntax tree (AST). The parsing logic is implemented in OCaml itself using the techniques described earlier in Chapter 17, Parsing With Ocamllex And Menhir. The lexer and parser rules can be found in the parsing directory in the source distribution.    

Syntax Errors

The OCaml parser’s goal is to output a well-formed AST data structure to the next phase of compilation, and so it rejects any source code that doesn’t match basic syntactic requirements. The compiler emits a syntax error in this situation, with a pointer to the filename, line, and character number that’s as close to the error as possible.

Here’s an example syntax error that we obtain by performing a module assignment as a statement instead of as a let binding:

let () =
  module MyString = String;
  ()

The code results in a syntax error when compiled:

ocamlc -c broken_module.ml
>File "broken_module.ml", line 2, characters 2-8:
>2 |   module MyString = String;
>      ^^^^^^
>Error: Syntax error
[2]

The correct version of this source code binds the MyString module via a local module definition, and compiles successfully:

let () =
  let module MyString = String in
  ()

The syntax error points to the line and character number of the first token that couldn’t be parsed. In the broken example, the module keyword isn’t a valid token at that point in parsing, so the error location information is correct.

Automatically Indenting Source Code

Sadly, the reported location of a syntax error can be less accurate, depending on the nature of your mistake. Try to spot the deliberate error in the following function definitions:

let concat_and_print x y =
  let v = x ^ y in
  print_endline v;
  v;

let add_and_print x y =
  let v = x + y in
  print_endline (string_of_int v);
  v

let () =
  let _x = add_and_print 1 2 in
  let _y = concat_and_print "a" "b" in
  ()

When you compile this file, you’ll get a syntax error again:

ocamlc -c follow_on_function.ml
>File "follow_on_function.ml", line 11, characters 0-3:
>11 | let () =
>     ^^^
>Error: Syntax error
[2]

The line number in the error points to the end of the add_and_print function, but the actual error is at the end of the first function definition. There’s an extra semicolon at the end of the first definition that causes the second definition to become part of the first let binding. This eventually results in a parsing error at the very end of the second function.

This class of bug (due to a single errant character) can be hard to spot in a large body of code. Luckily, there’s a great tool available via OPAM called ocp-indent that applies structured indenting rules to your source code on a line-by-line basis. This not only beautifies your code layout, but it also makes this syntax error much easier to locate. 

Let’s run our erroneous file through ocp-indent and see how it processes it:

ocp-indent follow_on_function.ml
>let concat_and_print x y =
>  let v = x ^ y in
>  print_endline v;
>  v;
>
>  let add_and_print x y =
>    let v = x + y in
>    print_endline (string_of_int v);
>    v
>
>let () =
>  let _x = add_and_print 1 2 in
>  let _y = concat_and_print "a" "b" in
>  ()

The add_and_print definition has been indented as if it were part of the first concat_and_print definition, and the errant semicolon is now much easier to spot. We just need to remove that semicolon and rerun ocp-indent to verify that the syntax is correct:

ocp-indent follow_on_function_fixed.ml
>let concat_and_print x y =
>  let v = x ^ y in
>  print_endline v;
>  v
>
>let add_and_print x y =
>  let v = x + y in
>  print_endline (string_of_int v);
>  v
>
>let () =
>  let _x = add_and_print 1 2 in
>  let _y = concat_and_print "a" "b" in
>  ()

The ocp-indent homepage documents how to integrate it with your favorite editor. All the Core libraries are formatted using it to ensure consistency, and it’s a good idea to do this before publishing your own source code online.

Generating Documentation from Interfaces

Whitespace and source code comments are removed during parsing and aren’t significant in determining the semantics of the program. However, other tools in the OCaml distribution can interpret comments for their own ends.    

The ocamldoc tool uses specially formatted comments in the source code to generate documentation bundles. These comments are combined with the function definitions and signatures, and output as structured documentation in a variety of formats. It can generate HTML pages, LaTeX and PDF documents, UNIX manual pages, and even module dependency graphs that can be viewed using Graphviz.

Here’s a sample of some source code that’s been annotated with ocamldoc comments:

(** doc.ml: The first special comment of the file is the comment
    associated with the whole module. *)

(** Comment for exception My_exception. *)
exception My_exception of (int -> int) * int

(** Comment for type [weather]  *)
type weather =
  | Rain of int (** The comment for constructor Rain *)
  | Sun         (** The comment for constructor Sun *)

(** Find the current weather for a country
    @author Anil Madhavapeddy
    @param location The country to get the weather for.
*)
let what_is_the_weather_in location =
  match location with
  | `Cambridge  -> Rain 100
  | `New_york   -> Rain 20
  | `California -> Sun

The ocamldoc comments are distinguished by beginning with the double asterisk. There are formatting conventions for the contents of the comment to mark metadata. For instance, the @tag fields mark specific properties such as the author of that section of code.

Try compiling the HTML documentation and UNIX man pages by running ocamldoc over the source file:

$ mkdir -p html man/man3
$ ocamldoc -html -d html doc.ml
$ ocamldoc -man -d man/man3 doc.ml
$ man -M man Doc

You should now have HTML files inside the html/ directory and also be able to view the UNIX manual pages held in man/man3. There are quite a few comment formats and options to control the output for the various backends. Refer to the OCaml manual for the complete list.         

Using Custom ocamldoc Generators

The default HTML output stylesheets from ocamldoc are pretty spartan and distinctly Web 1.0. The tool supports plugging in custom documentation generators, and there are several available that provide prettier or more detailed output:

  • Argot is an enhanced HTML generator that supports code folding and searching by name or type definition.

  • ocamldoc generators add support for BibTeX references within comments and for generating literate documentation that embeds the code alongside the comments.

  • JSON output is available via a custom generator in Xen.

Static Type Checking

After obtaining a valid abstract syntax tree, the compiler has to verify that the code obeys the rules of the OCaml type system. Code that is syntactically correct but misuses values is rejected with an explanation of the problem.

Although type checking is done in a single pass in OCaml, it actually consists of three distinct steps that happen simultaneously:      

automatic type inference
An algorithm that calculates types for a module without requiring manual type annotations
module system
Combines software components with explicit knowledge of their type signatures
explicit subtyping
Checks for objects and polymorphic variants

Automatic type inference lets you write succinct code for a particular task and have the compiler ensure that your use of variables is locally consistent.

Type inference doesn’t scale to very large codebases that depend on separate compilation of files. A small change in one module may ripple through thousands of other files and libraries and require all of them to be recompiled. The module system solves this by providing the facility to combine and manipulate explicit type signatures for modules within a large project, and also to reuse them via functors and first-class modules.  

Subtyping in OCaml objects is always an explicit operation (via the :> operator). This means that it doesn’t complicate the core type inference engine and can be tested as a separate concern.
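
As a quick illustration, here is a minimal sketch of an explicit coercion; the printable type and obj value are invented for this example:

type printable = < to_string : string >

let obj =
  object
    method to_string = "hello"
    method length = 5
  end

(* The subtyping step must be written explicitly with :> *)
let p = (obj :> printable)
let () = print_endline p#to_string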

Displaying Inferred Types from the Compiler

We’ve already seen how you can explore type inference directly from the toplevel. It’s also possible to generate type signatures for an entire file by asking the compiler to do the work for you. Create a file with a single type definition and value:

type t = Foo | Bar
let v = Foo

Now run the compiler with the -i flag to infer the type signature for that file. This runs the type checker but doesn’t compile the code any further after displaying the interface to the standard output:
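
ocamlc -i typedef.ml
>type t = Foo | Bar
>val v : t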

The output is the default signature for the module that represents the input file. It’s often useful to redirect this output to an mli file to give you a starting signature to edit the external interface without having to type it all in by hand.

The compiler stores a compiled version of the interface as a cmi file. This interface is either obtained from compiling an mli signature file for a module, or by the inferred type if there is only an ml implementation present.

The compiler makes sure that your ml and mli files have compatible signatures. The type checker throws an immediate error if this isn’t the case:

(* conflicting_interface.ml *)
type t = Foo

(* conflicting_interface.mli *)
type t = Bar

ocamlc -c conflicting_interface.mli conflicting_interface.ml
>File "conflicting_interface.ml", line 1:
>Error: The implementation conflicting_interface.ml
>       does not match the interface conflicting_interface.cmi:
>       Type declarations do not match:
>         type t = Foo
>       is not included in
>         type t = Bar
>       Constructors number 1 have different names, Foo and Bar.
>       File "conflicting_interface.mli", line 1, characters 0-12:
>         Expected declaration
>       File "conflicting_interface.ml", line 1, characters 0-12:
>         Actual declaration
[2]

Which Comes First: The ml or the mli?

There are two schools of thought on which order OCaml code should be written in. It’s very easy to begin writing code by starting with an ml file and using the type inference to guide you as you build up your functions. The mli file can then be generated as described, and the exported functions documented.     

If you’re writing code that spans multiple files, it’s sometimes easier to start by writing all the mli signatures and checking that they type-check against one another. Once the signatures are in place, you can write the implementations with the confidence that they’ll all glue together correctly, with no cyclic dependencies among the modules.

As with any such stylistic debate, you should experiment with which system works best for you. Everyone agrees on one thing though: no matter in what order you write them, production code should always explicitly define an mli file for every ml file in the project. It’s also perfectly fine to have an mli file without a corresponding ml file if you’re only declaring signatures (such as module types).

Signature files provide a place to write succinct documentation and to abstract internal details that shouldn’t be exported. Maintaining separate signature files also speeds up incremental compilation in larger code bases, since recompiling an mli signature is much faster than a full compilation of the implementation to native code.
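
As a tiny sketch, a hypothetical counter module might export only an abstract type from its signature, keeping the integer representation private to the implementation:

(* counter.mli: only the abstract type and its operations are exported *)
type t
val create : unit -> t
val incr : t -> t
val value : t -> int

(* counter.ml: the representation (a plain int) stays hidden *)
type t = int
let create () = 0
let incr t = t + 1
let value t = t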

Type Inference

Type inference is the process of determining the appropriate types for expressions based on their use. It’s a feature that’s partially present in many other languages such as Haskell and Scala, but OCaml embeds it as a fundamental feature throughout the core language.   

OCaml type inference is based on the Hindley-Milner algorithm, which is notable for its ability to infer the most general type for an expression without requiring any explicit type annotations. The algorithm can deduce multiple types for an expression and has the notion of a principal type that is the most general choice from the possible inferences. Manual type annotations can specialize the type explicitly, but the automatic inference selects the most general type unless told otherwise.
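
For instance, in this small sketch the compiler infers the most general (principal) type for first, while a manual annotation pins the second definition to a more specific one:

(* Inferred with the principal type 'a -> 'b -> 'a *)
let first x _y = x

(* An annotation specializes the same function to int -> int -> int *)
let first_int (x : int) (_y : int) : int = x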

OCaml does have some language extensions that strain the limits of principal type inference, but by and large, most programs you write will never require annotations (although they sometimes help the compiler produce better error messages).

Adding type annotations to find errors

It’s often said that the hardest part of writing OCaml code is getting past the type checker—but once the code does compile, it works correctly the first time! This is an exaggeration of course, but it can certainly feel true when moving from a dynamically typed language. The OCaml static type system protects you from certain classes of bugs such as memory errors and abstraction violations by rejecting your program at compilation time rather than by generating an error at runtime. Learning how to navigate the type checker’s compile-time feedback is key to building robust libraries and applications that take full advantage of these static checks.     

There are a couple of tricks to make it easier to quickly locate type errors in your code. The first is to introduce manual type annotations to narrow down the source of your error more accurately. These annotations shouldn’t actually change your types and can be removed once your code is correct. However, they act as anchors to locate errors while you’re still writing your code.

Manual type annotations are particularly useful if you use lots of polymorphic variants or objects. Type inference with row polymorphism can generate some very large signatures, and errors tend to propagate more widely than if you are using more explicitly typed variants or classes.  

For instance, consider this broken example that expresses some simple algebraic operations over integers:

let rec algebra =
  function
  | `Add (x,y) -> (algebra x) + (algebra y)
  | `Sub (x,y) -> (algebra x) - (algebra y)
  | `Mul (x,y) -> (algebra x) * (algebra y)
  | `Num x     -> x

let _ =
  algebra (
    `Add (
      (`Num 0),
      (`Sub (
          (`Num 1),
          (`Mul (
              (`Nu 3),(`Num 2)
            ))
        ))
    ))

There’s a single character typo in the code so that it uses Nu instead of Num. The resulting type error is impressive:

ocamlc -c broken_poly.ml
>File "broken_poly.ml", lines 9-18, characters 10-6:
> 9 | ..........(
>10 |     `Add (
>11 |       (`Num 0),
>12 |       (`Sub (
>13 |           (`Num 1),
>14 |           (`Mul (
>15 |               (`Nu 3),(`Num 2)
>16 |             ))
>17 |         ))
>18 |     ))
>Error: This expression has type
>         [> `Add of
>              ([< `Add of 'a * 'a
>                | `Mul of 'a * 'a
>                | `Num of int
>                | `Sub of 'a * 'a
>                > `Num ]
>               as 'a) *
>              [> `Sub of 'a * [> `Mul of [> `Nu of int ] * [> `Num of int ] ]
>              ] ]
>       but an expression was expected of type
>         [< `Add of 'a * 'a | `Mul of 'a * 'a | `Num of int | `Sub of 'a * 'a
>          > `Num ]
>         as 'a
>       The second variant type does not allow tag(s) `Nu
[2]

The type error is perfectly accurate, but rather verbose, and its reported location doesn’t pinpoint the incorrect variant name. The best the compiler can do is to point you in the general direction of the algebra function application.

This is because the type checker doesn’t have enough information to match the inferred type of the algebra definition to its application a few lines down. It calculates types for both expressions separately, and when they don’t match up, outputs the difference as best it can.

Let’s see what happens with an explicit type annotation to help the compiler out:

type t = [
  | `Add of t * t
  | `Sub of t * t
  | `Mul of t * t
  | `Num of int
]

let rec algebra (x:t) =
  match x with
  | `Add (x,y) -> (algebra x) + (algebra y)
  | `Sub (x,y) -> (algebra x) - (algebra y)
  | `Mul (x,y) -> (algebra x) * (algebra y)
  | `Num x     -> x

let _ =
  algebra (
    `Add (
      (`Num 0),
      (`Sub (
          (`Num 1),
          (`Mul (
              (`Nu 3),(`Num 2)
            ))
        ))
    ))

This code contains exactly the same error as before, but we’ve added a closed type definition of the polymorphic variants, and a type annotation to the algebra definition. The compiler error we get is much more useful now:

ocamlc -i broken_poly_with_annot.ml
>File "broken_poly_with_annot.ml", line 22, characters 14-21:
>22 |               (`Nu 3),(`Num 2)
>                   ^^^^^^^
>Error: This expression has type [> `Nu of int ]
>       but an expression was expected of type t
>       The second variant type does not allow tag(s) `Nu
[2]

This error points directly to the correct line number that contains the typo. Once you fix the problem, you can remove the manual annotations if you prefer more succinct code. You can also leave the annotations there, of course, to help with future refactoring and debugging.

Enforcing principal typing

The compiler also has a stricter principal type checking mode that is activated via the -principal flag. This warns about risky uses of type information to ensure that the type inference has one principal result. A type is considered risky if the success or failure of type inference depends on the order in which subexpressions are typed.   

The principality check only affects a few language features:

  • Polymorphic methods for objects

  • Permuting the order of labeled arguments in a function from their type definition

  • Discarding optional labeled arguments

  • Generalized algebraic data types (GADTs) present from OCaml 4.0 onward

  • Automatic disambiguation of record field and constructor names (since OCaml 4.1)

Here’s an example of a principality warning triggered by record field disambiguation:

type s = { foo: int; bar: unit }
type t = { foo: int }

let f x =
  x.bar;
  x.foo

Inferring the signature with -principal will show you a new warning:

ocamlc -i -principal non_principal.ml
>File "non_principal.ml", line 6, characters 4-7:
>6 |   x.foo
>        ^^^
>Warning 18: this type-based field disambiguation is not principal.
>type s = { foo : int; bar : unit; }
>type t = { foo : int; }
>val f : s -> int

This example isn’t principal, since the inferred type for x.foo is guided by the inferred type of x.bar, whereas principal typing requires that each subexpression’s type can be calculated independently. If the x.bar use is removed from the definition of f, its argument would be of type t and not type s.

You can fix this either by permuting the order of the type declarations, or by adding an explicit type annotation:

type s = { foo: int; bar: unit }
type t = { foo: int }

let f (x:s) =
  x.bar;
  x.foo

There is now no ambiguity about the inferred types, since we’ve explicitly given the argument a type, and the order of inference of the subexpressions no longer matters.

ocamlc -i -principal principal.ml
>type s = { foo : int; bar : unit; }
>type t = { foo : int; }
>val f : s -> int

The dune equivalent is to add the flag -principal to your build description.

(executable
  (name principal)
  (flags :standard -principal)
  (modules principal))

(executable
  (name non_principal)
  (flags :standard -principal)
  (modules non_principal))

The :standard directive will include all the default flags, and then -principal will be appended after those in the compiler build flags.

dune build principal.exe
dune build non_principal.exe
>File "non_principal.ml", line 6, characters 4-7:
>6 |   x.foo
>        ^^^
>Error (warning 18): this type-based field disambiguation is not principal.
[1]

Ideally, all code should systematically use -principal. It reduces variance in type inference and enforces the notion of a single known type. However, there are drawbacks to this mode: type inference is slower, and the cmi files become larger. This is generally only a problem if you extensively use objects, which usually have larger type signatures to cover all their methods.

If compiling in principal mode works, it is guaranteed that the program will pass type checking in non-principal mode, too. Bear in mind that the cmi files generated in principal mode differ from the default mode. Try to ensure that you compile your whole project with it activated. Getting the files mixed up won’t let you violate type safety, but it can result in the type checker failing unexpectedly very occasionally. In this case, just recompile with a clean source tree.

Modules and Separate Compilation

The OCaml module system enables smaller components to be reused effectively in large projects while still retaining all the benefits of static type safety. We covered the basics of using modules earlier in Chapter 4, Files Modules And Programs. The module language that operates over these signatures also extends to functors and first-class modules, described in Chapter 9, Functors and Chapter 10, First Class Modules, respectively.  

This section discusses how the compiler implements them in more detail. Modules are essential for larger projects that consist of many source files (also known as compilation units). It’s impractical to recompile every single source file when changing just one or two files, and the module system minimizes such recompilation while still encouraging code reuse.  

The mapping between files and modules

Individual compilation units provide a convenient way to break up a big module hierarchy into a collection of files. The relationship between files and modules can be explained directly in terms of the module system.  

Create a file called alice.ml with the following contents:

let friends = [ Bob.name ]

and a corresponding signature file:

val friends : Bob.t list

These two files are exactly analogous to including the following code directly in another module that references Alice:

module Alice : sig
  val friends : Bob.t list
end = struct
  let friends = [ Bob.name ]
end

Defining a module search path

In the preceding example, Alice also has a reference to another module Bob. For the overall type of Alice to be valid, the compiler also needs to check that the Bob module contains at least a Bob.name value and defines a Bob.t type.  
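
A hypothetical bob.ml that satisfies these requirements could be as simple as:

(* bob.ml: provides the t type and name value that Alice refers to *)
type t = string
let name = "Bob"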

The type checker resolves such module references into concrete structures and signatures in order to unify types across module boundaries. It does this by searching a list of directories for a compiled interface file matching that module’s name. For example, it will look for alice.cmi and bob.cmi on the search path and use the first ones it encounters as the interfaces for Alice and Bob.

The module search path is set by adding -I flags to the compiler command line with the directory containing the cmi files as the argument. Manually specifying these flags gets complex when you have lots of libraries, and is the reason why the OCamlfind frontend to the compiler exists. OCamlfind automates the process of turning third-party package names and build descriptions into command-line flags that are passed to the compiler command line.

By default, only the current directory and the OCaml standard library will be searched for cmi files. The Pervasives module from the standard library will also be opened by default in every compilation unit. The standard library location is obtained by running ocamlc -where and can be overridden by setting the CAMLLIB environment variable. Needless to say, don’t override the default path unless you have a good reason to (such as setting up a cross-compilation environment).    
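
For example, ocamlc -where prints the directory that is searched by default (the exact path depends on your installation), and -I adds further directories, such as a hypothetical lib directory containing bob.cmi:

ocamlc -where
>/usr/lib/ocaml
ocamlc -I lib -c alice.ml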

Inspecting Compilation Units with ocamlobjinfo

For separate compilation to be sound, we need to ensure that all the cmi files used to type-check a module are the same across compilation runs. If they vary, this raises the possibility of two modules checking different type signatures for a common module with the same name. This in turn lets the program completely violate the static type system and can lead to memory corruption and crashes.

OCaml guards against this by recording an MD5 checksum in every cmi. Let’s examine our earlier typedef.ml more closely:

ocamlc -c typedef.ml
ocamlobjinfo typedef.cmi
>File typedef.cmi
>Unit name: Typedef
>Interfaces imported:
>   cdd43318ee9dd1b187513a4341737717    Typedef
>   9b04ecdc97e5102c1d342892ef7ad9a2    Pervasives
>   79ae8c0eb753af6b441fe05456c7970b    CamlinternalFormatBasics

ocamlobjinfo examines the compiled interface and displays what other compilation units it depends on. In this case, we don’t use any external modules other than Pervasives. Every module depends on Pervasives by default, unless you use the -nopervasives flag (this is an advanced use case, and you shouldn’t normally need it).

The long alphanumeric identifier beside each module name is a hash calculated from all the types and values exported from that compilation unit. It’s used during type-checking and linking to ensure that all of the compilation units have been compiled consistently against one another. A difference in the hashes means that a compilation unit with the same module name may have conflicting type signatures in different modules. The compiler will reject such programs with an error similar to this:

$ ocamlc -c foo.ml
File "foo.ml", line 1, characters 0-1:
Error: The files /home/build/bar.cmi
       and /usr/lib/ocaml/map.cmi make inconsistent assumptions
       over interface Map

This hash check is very conservative, but ensures that separate compilation remains type-safe all the way up to the final link phase. Your build system should ensure that you never see the preceding error messages, but if you do run into it, just clean out your intermediate files and recompile from scratch.

Packing Modules Together

The module-to-file mapping described so far rigidly enforces a 1:1 mapping between a top-level module and a file. It’s often convenient to split larger modules into separate files to make editing easier, but still compile them all into a single OCaml module.  

The -pack compiler option accepts a list of compiled object files (.cmo for bytecode and .cmx for native code) and their associated .cmi compiled interfaces, and combines them into a single module that contains them as submodules of the output. Packing thus generates an entirely new .cmo (or .cmx) and .cmi that includes the input modules.

Packing for native code introduces an additional requirement: the modules that are intended to be packed must be compiled with the -for-pack argument that specifies the eventual name of the pack. The easiest way to handle packing is to let ocamlbuild figure out the command-line arguments for you, so let’s try that out next with a simple example.
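
For reference, the manual native-code equivalent looks roughly like this, using the same X, A, and B names as the example below:

ocamlopt -for-pack X -c A.ml
ocamlopt -for-pack X -c B.ml
ocamlopt -pack -o X.cmx A.cmx B.cmx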

First, create a couple of toy modules called A.ml and B.ml that each contain a single value. You will also need a _tags file that adds the -for-pack option for the cmx files (but be careful to exclude the pack target itself). Finally, the X.mlpack file contains the list of modules that are intended to be packed under module X. There are special rules in ocamlbuild that tell it how to map %.mlpack files to the packed %.cmx or %.cmo equivalent:

cat A.ml
>let v = "hello"
cat B.ml
>let w = 42
cat _tags
><*.cmx> and not "X.cmx": for-pack(X)
cat X.mlpack
>A
>B

You can now run corebuild to build the X.cmx file directly, but let’s create a new module to link against X to complete the example:

let v = X.A.v
let w = X.B.w

You can now compile this test module and see that its inferred interface is the result of using the packed contents of X. We further verify this by examining the imported interfaces in Test and confirming that neither A nor B are mentioned in there and that only the packed X module is used:

corebuild test.inferred.mli test.cmi
>ocamlfind ocamldep -package core -ppx 'ppx-jane -as-ppx' -modules test.ml > test.ml.depends
>ocamlfind ocamldep -package core -ppx 'ppx-jane -as-ppx' -modules A.ml > A.ml.depends
>ocamlfind ocamldep -package core -ppx 'ppx-jane -as-ppx' -modules B.ml > B.ml.depends
>ocamlfind ocamlc -c -w A-4-33-40-41-42-43-34-44 -strict-sequence -g -bin-annot -short-paths -thread -package core -ppx 'ppx-jane -as-ppx' -o A.cmo A.ml
>ocamlfind ocamlc -c -w A-4-33-40-41-42-43-34-44 -strict-sequence -g -bin-annot -short-paths -thread -package core -ppx 'ppx-jane -as-ppx' -o B.cmo B.ml
>ocamlfind ocamlc -pack -g -bin-annot A.cmo B.cmo -o X.cmo
>ocamlfind ocamlc -i -thread -short-paths -package core -ppx 'ppx-jane -as-ppx' test.ml > test.inferred.mli
>ocamlfind ocamlc -c -w A-4-33-40-41-42-43-34-44 -strict-sequence -g -bin-annot -short-paths -thread -package core -ppx 'ppx-jane -as-ppx' -o test.cmo test.ml
cat _build/test.inferred.mli
>val v : string
>val w : int
ocamlobjinfo _build/test.cmi
>File _build/test.cmi
>Unit name: Test
>Interfaces imported:
>   7b1e33d4304b9f8a8e844081c001ef22    Test
>   27a343af5f1904230d1edc24926fde0e    X
>   9b04ecdc97e5102c1d342892ef7ad9a2    Pervasives
>   79ae8c0eb753af6b441fe05456c7970b    CamlinternalFormatBasics

Packing and Search Paths

One very common build error that happens with packing is confusion resulting from building the packed cmi in the same directory as the submodules. When you add this directory to your module search path, the submodules are also visible. If you forget to include the top-level prefix (e.g., X.A) and instead use a submodule directly (A), then this will compile and link fine.

However, the types of A and X.A are not automatically equivalent, so the type checker will complain if you attempt to mix and match the packed and unpacked versions of the library.

This mostly only happens with unit tests, since they are built at the same time as the library. You can avoid it by being aware of the need to open the packed module from the test, or only using the library after it has been installed (and hence not exposing the intermediate compiled modules).

Shorter Module Paths in Type Errors

Core uses the OCaml module system quite extensively to provide a complete replacement standard library. It collects these modules under a single Core module, so opening that one module imports all the replacement modules and functions.

There’s one downside to this approach: type errors suddenly get much more verbose. You can see this by running the vanilla OCaml toplevel (not utop):

$ ocaml
# List.map print_endline "" ;;
Error: This expression has type string but an expression was expected of type
         string list

Without Core, this is a straightforward type error. When we switch to Core, though, it gets more verbose:

$ ocaml
# open Core ;;
# List.map ~f:print_endline "" ;;
Error: This expression has type string but an expression was expected of type
         'a Core.List.t = 'a list

The default List module in OCaml is overridden by Core.List. The compiler does its best to show the type equivalence, but at the cost of a more verbose error message.

The compiler can remedy this via a so-called short paths heuristic. This causes the compiler to search all the type aliases for the shortest module path and use that as the preferred output type. The option is activated by passing -short-paths to the compiler, and works on the toplevel, too. 

$ ocaml -short-paths
# open Core;;
# List.map ~f:print_endline "foo";;
Error: This expression has type string but an expression was expected of type
         'a list

The utop enhanced toplevel activates short paths by default, which is why we have not had to do this before in our interactive examples. However, the compiler doesn’t default to the short path heuristic, since there are some situations where the type aliasing information is useful to know, and it would be lost in the error if the shortest module path is always picked.

You’ll need to choose for yourself if you prefer short paths or the default behavior in your own projects, and pass the -short-paths flag to the compiler if you need it. 
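
If you build with dune, the flag can be enabled in the same way as -principal was earlier; here is a minimal sketch with a hypothetical executable name:

(executable
  (name main)
  (flags :standard -short-paths)
  (modules main))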

The Typed Syntax Tree

When the type checking process has successfully completed, it is combined with the AST to form a typed abstract syntax tree. This contains precise location information for every token in the input file, and decorates each token with concrete type information.       

The compiler can output this as compiled cmt and cmti files that contain the typed AST for the implementation and signatures of a compilation unit. This is activated by passing the -bin-annot flag to the compiler.
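
For example, recompiling our toy typedef.ml with the flag should leave a typedef.cmt file alongside the usual outputs:

ocamlc -bin-annot -c typedef.ml
ls typedef.cm*
>typedef.cmi  typedef.cmo  typedef.cmt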

The cmt files are particularly useful for IDE tools to match up OCaml source code at a specific location to the inferred or external types.

Using ocp-index for Autocompletion

One such command-line tool to display autocompletion information in your editor is ocp-index. Install it via OPAM as follows:  

opam install ocp-index
ocp-index

Let’s refer back to our Ncurses binding example from the beginning of Chapter 19, Foreign Function Interface. This module defined bindings for the Ncurses library. First, compile the interfaces with -bin-annot so that we can obtain the cmt and cmti files, and then run ocp-index in completion mode:

(cd ffi/ncurses && corebuild -pkg ctypes.foreign -tag bin_annot ncurses.cmi)
>ocamlfind ocamldep -package ctypes.foreign -package core -ppx 'ppx-jane -as-ppx' -modules ncurses.mli > ncurses.mli.depends
>ocamlfind ocamlc -c -w A-4-33-40-41-42-43-34-44 -strict-sequence -g -bin-annot -short-paths -thread -package ctypes.foreign -package core -ppx 'ppx-jane -as-ppx' -o ncurses.cmi ncurses.mli
ocp-index complete -I ffi Ncur
ocp-index complete -I ffi Ncurses.a
ocp-index complete -I ffi Ncurses.

You need to pass ocp-index a set of directories to search for cmt files in, and a fragment of text to autocomplete. As you can imagine, autocompletion is invaluable on larger codebases. See the ocp-index home page for more information on how to integrate it with your favorite editor.

Examining the Typed Syntax Tree Directly

The compiler has a couple of advanced flags that can dump the raw output of the internal AST representation. You can’t depend on these flags to give the same output across compiler revisions, but they are a useful learning tool. 

We’ll use our toy typedef.ml again:

type t = Foo | Bar
let v = Foo

Let’s first look at the untyped syntax tree that’s generated from the parsing phase:

ocamlc -dparsetree typedef.ml 2>&1
>[
>  structure_item (typedef.ml[1,0+0]..[1,0+18])
>    Pstr_type Rec
>    [
>      type_declaration "t" (typedef.ml[1,0+5]..[1,0+6]) (typedef.ml[1,0+0]..[1,0+18])
>        ptype_params =
>          []
>        ptype_cstrs =
>          []
>        ptype_kind =
>          Ptype_variant
>            [
>              (typedef.ml[1,0+9]..[1,0+12])
>                "Foo" (typedef.ml[1,0+9]..[1,0+12])
>                []
>                None
>              (typedef.ml[1,0+13]..[1,0+18])
>                "Bar" (typedef.ml[1,0+15]..[1,0+18])
>                []
>                None
>            ]
>        ptype_private = Public
>        ptype_manifest =
>          None
>    ]
>  structure_item (typedef.ml[2,19+0]..[2,19+11])
>    Pstr_value Nonrec
>    [
>      <def>
>        pattern (typedef.ml[2,19+4]..[2,19+5])
>          Ppat_var "v" (typedef.ml[2,19+4]..[2,19+5])
>        expression (typedef.ml[2,19+8]..[2,19+11])
>          Pexp_construct "Foo" (typedef.ml[2,19+8]..[2,19+11])
>          None
>    ]
>]

This is rather a lot of output for a simple two-line program, but it shows just how much structure the OCaml parser generates even from a small source file.

Each portion of the AST is decorated with the precise location information (including the filename and character location of the token). This code hasn’t been type checked yet, so the raw tokens are all included.

The typed AST that is normally output as a compiled cmt file can be displayed in a more developer-readable form via the -dtypedtree option:

ocamlc -dtypedtree typedef.ml 2>&1
>[
>  structure_item (typedef.ml[1,0+0]..typedef.ml[1,0+18])
>    Tstr_type Rec
>    [
>      type_declaration t/80 (typedef.ml[1,0+0]..typedef.ml[1,0+18])
>        ptype_params =
>          []
>        ptype_cstrs =
>          []
>        ptype_kind =
>          Ttype_variant
>            [
>              (typedef.ml[1,0+9]..typedef.ml[1,0+12])
>                Foo/81
>                []
>                None
>              (typedef.ml[1,0+13]..typedef.ml[1,0+18])
>                Bar/82
>                []
>                None
>            ]
>        ptype_private = Public
>        ptype_manifest =
>          None
>    ]
>  structure_item (typedef.ml[2,19+0]..typedef.ml[2,19+11])
>    Tstr_value Nonrec
>    [
>      <def>
>        pattern (typedef.ml[2,19+4]..typedef.ml[2,19+5])
>          Tpat_var "v/83"
>        expression (typedef.ml[2,19+8]..typedef.ml[2,19+11])
>          Texp_construct "Foo"
>          []
>    ]
>]

The typed AST is more explicit than the untyped syntax tree. For instance, the type declaration has been given a unique name (t/80), as has the v value (v/83).

You’ll rarely need to look at this raw output from the compiler unless you’re building IDE tools such as ocp-index, or are hacking on extensions to the core compiler itself. However, it’s useful to know that this intermediate form exists before we delve further into the code generation process next, in Chapter 24, The Compiler Backend Byte Code And Native Code.

There are several new integrated tools emerging that combine these typed AST files with common editors such as Emacs or Vim. The best of these is Merlin, which adds value and module autocompletion, displays inferred types, and can build and display errors directly from within your editor. There are instructions available on its homepage for configuring Merlin with your favorite editor.
