2 files changed, 399 insertions, 4 deletions
diff --git a/dev/README.md b/dev/README.md
index 0c6b8020f1..0a6b196ec0 100644
--- a/dev/README.md
+++ b/dev/README.md
@@ -22,14 +22,12 @@
 | [`dev/doc/changes.md`](doc/changes.md) | (partial) Per-version summary of the evolution of Coq ML source |
 | [`dev/doc/style.txt`](doc/style.txt) | A few style recommendations for writing Coq ML files |
 | [`dev/doc/debugging.md`](doc/debugging.md) | Help for debugging or profiling |
-| [`dev/doc/universes.txt`](doc/universes.txt) |  Help for debugging universes |
-| [`dev/doc/extensions.txt`](doc/extensions.txt) | Some help about TACTIC EXTEND |
-| [`dev/doc/perf-analysis`](doc/perf-analysis)|  Analysis of perfs measured on the compilation of user contribs |
+| [`dev/doc/universes.md`](doc/universes.md) |  Help for debugging universes |
 | [`dev/doc/econstr.md`](doc/econstr.md) | Describes `Econstr`, implementation of treatment of `evar` in the engine |
 | [`dev/doc/primproj.md`](doc/primproj.md) | Describes primitive projections |
+| [`dev/doc/parsing.md`](doc/parsing.md) | Grammar and parsing overview |
 | [`dev/doc/proof-engine.md`](doc/proof-engine.md) | Tutorial on new proof engine |
 | [`dev/doc/xml-protocol.md`](doc/xml-protocol.md) | XML protocol that coqtop and IDEs use to communicate |
-| [`dev/doc/MERGING.md`](doc/MERGING.md) |  How pull requests should be merged into `master` |
 | [`dev/doc/release-process.md`](doc/release-process.md) | Process of creating a new Coq release |
 
 
diff --git a/dev/doc/parsing.md b/dev/doc/parsing.md
new file mode 100644
index 0000000000..f8b4537e77
--- /dev/null
+++ b/dev/doc/parsing.md
@@ -0,0 +1,397 @@
+# Parsing
+
+Coq's parser is based on Camlp5 using an extensible grammar.  Somewhat helpful
+Camlp5 documentation is available [here](http://camlp5.github.io/doc/htmlc/grammars.html).
+However, the Camlp5 code has been copied into the Coq source tree and may differ
+from the Camlp5 release.
+
+Notable attributes of the parser include:
+
+* The grammar is extensible at run time.  This is essential for supporting notations
+  and optionally-loaded plugins that extend the grammar.
+
+* The grammar is split into multiple source files.  Nonterminals can be local to a file
+  or global.
+
+* While 95% of the nonterminals and almost all the productions are defined in the grammar,
+  a few are defined directly in OCaml code.  Since many developers have worked on the parser
+  over the years, this code can be idiosyncratic, reflecting various coding styles.
+
+* The parser is a recursive descent parser that, by default, only looks at the next token
+  to make a parsing decision.  It's possible to hand-code additional lookahead where
+  necessary by writing OCaml code.
+
+* There's no code that checks whether a grammar is ambiguous or whether every production
+  can be recognized.  Developers who modify the grammar may, in some cases, need to structure their
+  added productions in specific ways to ensure that their additions are parsable and that they
+  don't break existing productions.
+
+## Contents ##
+
+- [Grammars: `*.mlg` File Structure](#grammars-mlg-file-structure)
+- [Grammars: Nonterminals and Productions](#grammars-nonterminals-and-productions)
+  - [Alternate production syntax](#alternate-production-syntax)
+- [Usage notes](#usage-notes)
+  - [Other components](#other-components)
+  - [Parsing productions](#parsing-productions)
+  - [Lookahead](#lookahead)
+
+## Grammars: `*.mlg` File Structure ##
+
+Grammars are defined in `*.mlg` files, which `coqpp` compiles into `*.ml` files at build time.
+`coqpp` code is in the `coqpp` directory.  `coqpp` uses yacc and lex to parse the grammar files.
+You can examine its yacc and lex input files in `coqpp_lex.mll` and `coqpp_parse.mly` for
+details not fully covered here.
+
+In addition, there is a `doc_grammar` build utility that uses the `coqpp` parser to extract the
+grammar, then edits and inserts it into the documentation.  This is described in
+[`doc/tools/docgram/README.md`](../../doc/tools/docgram/README.md).
+`doc_grammar` generates
+[`doc/tools/docgram/fullGrammar`](../../doc/tools/docgram/fullGrammar),
+which has the full grammar for Coq
+(not including some optionally-loaded plugins).  This may be easier to read since everything is
+in one file and the parser action routines and other OCaml code are omitted.
+
+`*.mlg` files contain the following types of nodes (See `node` in the yacc grammar).  This part is
+very specific to Coq (not so similar to Camlp5):
+
+* OCaml code - OCaml code enclosed in curly braces, which is copied verbatim to the generated `*.ml` file
+
+* Comments - comments in the `*.mlg` file in the form `(* … *)`, which are not copied
+  to the generated `*.ml` file.  Comments in OCaml code are preserved.
+
+* `DECLARE_PLUGIN "*_plugin"` - associates the file with a specific plugin, for example "ltac_plugin"
+
+* `GRAMMAR EXTEND` - adds additional nonterminals and productions to the grammar and declares global
+  nonterminals referenced in the `GRAMMAR EXTEND`:
+
+  ```
+  GRAMMAR EXTEND Gram
+    GLOBAL:
+      bignat bigint …;
+    <nonterminal definitions>
+  END
+  ```
+
+  Global nonterminals are declared in `pcoq.ml`, e.g. `let bignat = Entry.create "Prim.bignat"`.
+  All the `*.mlg` files include `open Pcoq` and often its modules, e.g. `open Pcoq.Prim`.
+
+  `GRAMMAR EXTEND` should be used only for large syntax additions.  To add new commands
+  and tactics, use these instead:
+
+  - `VERNAC COMMAND EXTEND` to add new commands
+  - `TACTIC EXTEND` to add new tactics
+  - `ARGUMENT EXTEND` to add new nonterminals
+
+  These constructs provide essential semantic information that's provided in a more complex,
+  less readable way with `GRAMMAR EXTEND`.
+
+* `VERNAC COMMAND EXTEND` - adds new command syntax by adding productions to the
+  `command` nonterminal.  For example:
+
+  ```
+  VERNAC COMMAND EXTEND ExtractionLibrary CLASSIFIED AS QUERY
+  | [ "Extraction" "Library" ident(m) ]
+    -> { extraction_library false m }
+  END
+  ```
+
+  Productions here are represented with alternate syntax, described later.
+
+  New commands should be added using this construct rather than `GRAMMAR EXTEND` so
+  they are correctly registered, such as having the correct command classifier.
+
+  TODO: explain "ExtractionLibrary", CLASSIFIED AS, CLASSIFIED BY, "{ tactic_mode }", STATE
+
+* `VERNAC { … } EXTEND` - TODO.  A variant.  The `{ … }` is a block of OCaml code.
+
+* `TACTIC EXTEND` - adds new tactic syntax by adding productions to `simple_tactic`.
+  For example:
+
+  ```
+  TACTIC EXTEND btauto
+  | [ "btauto" ] -> { Refl_btauto.Btauto.tac }
+  END
+  ```
+
+  adds a new nonterminal `btauto`.
+
+  New tactics should be added using this construct rather than `GRAMMAR EXTEND`.
+
+  TODO: explain DEPRECATED, LEVEL (not shown)
+
+* `ARGUMENT EXTEND` - defines a new nonterminal
+
+  ```
+  ARGUMENT EXTEND ast_closure_term
+       PRINTED BY { pp_ast_closure_term }
+       INTERPRETED BY { interp_ast_closure_term }
+       GLOBALIZED BY { glob_ast_closure_term }
+       SUBSTITUTED BY { subst_ast_closure_term }
+       RAW_PRINTED BY { pp_ast_closure_term }
+       GLOB_PRINTED BY { pp_ast_closure_term }
+    | [ term_annotation(a) constr(c) ] -> { mk_ast_closure_term a c }
+  END
+  ```
+
+  See comments in `tacentries.mli` for partial information on the various
+  arguments.
+
+* `VERNAC ARGUMENT EXTEND` - (part of `argument_extend` in the yacc grammar) defines
+  productions for a single nonterminal.  For example:
+
+  ```
+  VERNAC ARGUMENT EXTEND language
+  PRINTED BY { pr_language }
+  | [ "Ocaml" ] -> { let _ = warn_deprecated_ocaml_spelling () in Ocaml }
+  | [ "OCaml" ] -> { Ocaml }
+  | [ "Haskell" ] -> { Haskell }
+  | [ "Scheme" ] -> { Scheme }
+  | [ "JSON" ] -> { JSON }
+  END
+  ```
+
+  TODO: explain PRINTED BY, CODE
+
+* DOC_GRAMMAR - Used in `doc_grammar`-generated files to permit simplified syntax
+
+Note that you can reverse engineer many details by comparing the `.mlg` input file with
+the `.ml` generated by `coqpp`.
+
+## Grammars: Nonterminals and Productions
+
+Here's a simple nonterminal definition in the Camlp5 format:
+
+  ```
+  universe:
+    [ [ IDENT "max"; "("; ids = LIST1 universe_expr SEP ","; ")" -> { ids }
+      | u = universe_expr -> { [u] } ] ]
+  ;
+  ```
+
+In which:
+* `universe` is the nonterminal being defined
+* productions are separated by `|` and, as a group, are enclosed in `[ [ … ] ];`
+* `u = universe_expr` refers to the `universe_expr` nonterminal.  `u` is bound to
+  the value returned by that nonterminal's action routine, which can be
+  referred to in the action routine.  For `ids = LIST1 universe_expr SEP ","`,
+  `ids` is bound to the list of values returned by `universe_expr`.
+* `-> { … }` contains the OCaml action routine, which is executed when the production is recognized
+  and returns a value
+* Semicolons separate adjacent grammatical elements (nonterminals, strings or other constructs)
+
+Grammatical elements that appear in productions are:
+
+- nonterminal names - identifiers in the form `[a-zA-Z0-9_]*`.  These correspond to variables in
+  the generated `.ml` code.  In some cases a qualified name, such as `Prim.name`, is used.
+- `"…"` - a literal string that becomes a keyword and cannot be used as an `ident`.
+  The string doesn't have to be a valid identifier; frequently the string will contain only
+  punctuation characters.  Generally we try to avoid adding new keywords that are also valid
+  identifiers--though there is an unresolved debate among the developers about whether having more
+  such keywords in general is good (e.g. it makes it easier to highlight keywords in GUIs)
+  or bad (more keywords for the user to avoid and new keywords may require changes to existing
+  proof files).
+- `IDENT "…"` - a literal string that has the form of an `ident` that doesn't become
+  a keyword
+- `OPT element` - optionally include `element` (e.g. a nonterminal, IDENT "…" or "…").
+  The value is of type `'a option`.
+- `LIST1 element` - a list of one or more `element`s.  The value is of type `'a list`.
+- `LIST0 element` - an optional list of `element`s
+- `LIST1 element SEP sep` - a list of `element`s separated by `sep`
+- `LIST0 element SEP sep` - an optional list of `element`s separated by `sep`
+- `( elements )` - grouping to represent a series of elements as a unit,
+  useful within `OPT` and `LIST*`.
+- `[ elements1 | elements2 | … ]` - alternatives (either `elements1` or `elements2` or …),
+  actually nested productions, each of which can have its own action routines
+
+Nonterminals can also be defined with multiple levels to specify precedence and associativity
+of its productions.  This is described in the Coq documentation under the `Print Grammar`
+command.  The first square bracket around a nonterminal definition is for grouping
+level definitions, which are separated with `|`, for example:
+
+  ```
+  tactic_expr:
+      [ "5" RIGHTA
+        [ te = binder_tactic -> { te } ]
+      | "4" LEFTA
+      :
+  ```
+
+Grammar extensions can specify what level they are modifying, for example:
+
+  ```
+  tactic_expr: LEVEL "1" [ RIGHTA
+    [ tac = tactic_expr; intros = ssrintros_ne -> { tclintros_expr ~loc tac intros }
+    ] ];
+  ```
+
+### Alternate production syntax ###
+
+Except for `GRAMMAR EXTEND`, the `EXTEND` nodes in the `*.mlg`s use simplified syntax in
+productions that's similar to what's used in the `Tactic Notation` and
+`Ltac2 Notation` commands.  For example:
+
+  ```
+  TACTIC EXTEND cc
+  | [ "congruence" ] -> { congruence_tac 1000 [] }
+  | [ "congruence" integer(n) ] -> { congruence_tac n [] }
+  | [ "congruence" "with" ne_constr_list(l) ] -> { congruence_tac 1000 l }
+  | [ "congruence" integer(n) "with" ne_constr_list(l) ] ->
+     { congruence_tac n l }
+  END
+  ```
+
+Nonterminals appearing in the alternate production syntax are accessed through `wit_*` symbols
+defined in the OCaml code.  Some commonly used symbols are defined in `stdarg.ml`.
+Others are defined in the code generated by `ARGUMENT EXTEND` and `VERNAC ARGUMENT EXTEND`
+constructs.  References to nonterminals that don't have `wit_*` symbols cause
+compilation errors.
+
+The differences are:
+* The outer `: [ … ];` is omitted.  Each production is enclosed in `| [ … ]`.
+* The action routine is outside the square brackets
+* Literal strings that are valid identifiers don't become reserved keywords
+* No semicolons separating elements of the production
+* `integer(n)` is used to bind a nonterminal value to a variable instead of `n = integer`
+* Alternate forms of constructs are used:
+  * `ne_entry_list` for `LIST1 entry`
+  * `entry_list` for `LIST0 entry`
+  * `ne_entry_list_sep(var, sep)` for `LIST1 entry SEP sep` where the list is bound to `var`
+  * `entry_list_sep(var, sep)` for `LIST0 entry SEP sep` where the list is bound to `var`
+  * `entry_opt` for OPT entry
+* There's no way to define `LEVEL`s
+* There's no equivalent to `( elements )` or `[ elements1 | elements2 | … ]`, which may
+  require repeating similar syntax several times.  For example, this single production
+  is equivalent to 8 productions in `TACTIC EXTEND` representing all possible expansions of
+  three `OPT`s:
+
+  ```
+  | IDENT "Add"; IDENT "Parametric"; IDENT "Relation"; LIST0 binder; ":"; constr; constr;
+          OPT [ IDENT "reflexivity"; IDENT "proved"; IDENT "by"; constr -> { … } ];
+          OPT [ IDENT "symmetry"; IDENT "proved"; IDENT "by"; constr -> { … } ];
+          OPT [ IDENT "transitivity"; IDENT "proved"; IDENT "by"; constr -> { … } ];
+          IDENT "as"; ident -> { … }
+  ```
+
+## Usage notes
+
+### Other components
+
+Coq's lexer is in `clexer.ml`.  Its 10 token types are defined in `tok.ml`.
+
+The parser is in `grammar.ml`.  The extensive use of GADT (generalized algebraic datatypes)
+makes it harder for the uninitiated to understand it.
+
+When the parser is invoked, the call tells the parser which nonterminal to parse.  `vernac_control`
+is the start symbol for commands.  `tactic_mode` is the start symbol for tactics.
+Tactics give syntax errors if Coq is not in proof mode.  There are additional details
+not mentioned here.
+
+### Parsing productions
+
+Some thoughts, not to be taken as identifying all the issues:
+
+Since the parser examines only the next token to make a parsing decision (and perhaps
+because of other potentially fixable limitations), some productions have to be ordered
+or structured in a particular way to parse correctly in all cases.
+
+For example, consider these productions:
+
+  ```
+  command: [ [
+    | IDENT "Print"; p = printable -> { VernacPrint p }
+    | IDENT "Print"; qid = smart_global; l = OPT univ_name_list -> { VernacPrint (PrintName (qid,l)) }
+    | IDENT "Print"; IDENT "Module"; "Type"; qid = global ->
+        { VernacPrint (PrintModuleType qid) }
+    | IDENT "Print"; IDENT "Module"; qid = global ->
+        { VernacPrint (PrintModule qid) }
+    | IDENT "Print"; IDENT "Namespace" ; ns = dirpath ->
+        { VernacPrint (PrintNamespace ns) }
+    :
+
+  printable:
+    [ [ IDENT "Term"; qid = smart_global; l = OPT univ_name_list -> { PrintName (qid,l) }
+      | IDENT "All" -> { PrintFullContext }
+      | IDENT "Section"; s = global -> { PrintSectionContext s }
+      :
+  ```
+
+Reversing the order of the first two productions in `command` causes the `All` in `Print All` to
+be parsed incorrectly as a `smart_global`, making that command unavailable.  `Print Namespace nat.`
+still works correctly, though.
+
+Similarly, the production for `Print Module Type` has to appear before `Print Module <global>`
+in order to be reachable.
+
+Internally, the parser generates a tree that represents the possible prefixes for the
+productions of a nonterminal as described in
+[the Camlp5 documentation](http://camlp5.github.io/doc/htmlc/grammars.html#b:Rules-insertion).
+
+Here's another example in which the way the productions are written matters.  `OPT` at
+the beginning of a production doesn't always work well:
+
+  ```
+  command: [ [
+      | IDENT "Foo"; n = natural -> { VernacBack 1 }
+      | OPT (IDENT "ZZ"); IDENT "Foo" -> { VernacBack 1 }
+      :
+  ```
+
+`Foo.` looks like it should be accepted, but it gives a parse error:
+
+  ```
+  Unnamed_thm < Foo.
+  Toplevel input, characters 3-4:
+  > Foo.
+  >    ^
+  Error:
+  Syntax error: [prim:natural] expected after 'Foo' (in [vernac:command]).
+  ```
+
+Reversing the order of the productions doesn't help, but splitting
+the 'OPT' production into 2 productions works:
+
+  ```
+      | IDENT "Foo" -> { VernacBack 1 }
+      | IDENT "ZZ"; IDENT "Foo" -> { VernacBack 1 }
+      | IDENT "Foo"; n = natural -> { VernacBack 1 }
+
+  ```
+
+On the other hand, `OPT` works just fine when the parser has already found the
+right production.  For example `Back` and `Back <natural>` can be combined using
+an `OPT`:
+
+  ```
+      | IDENT "Back"; n = OPT natural -> { VernacBack (Option.default 1 n) }
+  ```
+
+### Lookahead
+
+It's possible to look ahead more than one symbol using OCaml code.  Generally we
+avoid doing this unless there's a strong reason to do so.  For example, this
+code defines a new nonterminal `local_test_lpar_id_colon` that checks that
+the next 3 tokens are `"("` `ident` and `":"` without consuming any input:
+
+  ```
+  let local_test_lpar_id_colon =
+    let open Pcoq.Lookahead in
+    to_entry "lpar_id_colon" begin
+      lk_kw "(" >> lk_ident >> lk_kw ":"
+    end
+  ```
+
+This one checks that the next 2 tokens are `"["` and `"|"` with no space between.
+This is a special case: intropatterns can have sequences like `"[|]"` that are
+3 different tokens with empty nonterminals between them.  Making `"[|"` a keyword
+would break existing code with "[|]":
+
+  ```
+  let test_array_opening =
+    let open Pcoq.Lookahead in
+    to_entry "test_array_opening" begin
+      lk_kw "[" >> lk_kw "|" >> check_no_space
+    end
+  ```
+
+TODO: how to add a tactic or command