1 files changed, 346 insertions, 0 deletions
diff --git a/share/doc/cppinternals/Lexer.html b/share/doc/cppinternals/Lexer.html
new file mode 100644
index 0000000..51465a1
--- /dev/null
+++ b/share/doc/cppinternals/Lexer.html
@@ -0,0 +1,346 @@
+<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
+<html>
+<!-- Created by GNU Texinfo 5.1, http://www.gnu.org/software/texinfo/ -->
+<head>
+<title>The GNU C Preprocessor Internals: Lexer</title>
+
+<meta name="description" content="The GNU C Preprocessor Internals: Lexer">
+<meta name="keywords" content="The GNU C Preprocessor Internals: Lexer">
+<meta name="resource-type" content="document">
+<meta name="distribution" content="global">
+<meta name="Generator" content="makeinfo">
+<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
+<link href="index.html#Top" rel="start" title="Top">
+<link href="Concept-Index.html#Concept-Index" rel="index" title="Concept Index">
+<link href="index.html#SEC_Contents" rel="contents" title="Table of Contents">
+<link href="index.html#Top" rel="up" title="Top">
+<link href="Hash-Nodes.html#Hash-Nodes" rel="next" title="Hash Nodes">
+<link href="Conventions.html#Conventions" rel="previous" title="Conventions">
+<style type="text/css">
+<!--
+a.summary-letter {text-decoration: none}
+blockquote.smallquotation {font-size: smaller}
+div.display {margin-left: 3.2em}
+div.example {margin-left: 3.2em}
+div.indentedblock {margin-left: 3.2em}
+div.lisp {margin-left: 3.2em}
+div.smalldisplay {margin-left: 3.2em}
+div.smallexample {margin-left: 3.2em}
+div.smallindentedblock {margin-left: 3.2em; font-size: smaller}
+div.smalllisp {margin-left: 3.2em}
+kbd {font-style:oblique}
+pre.display {font-family: inherit}
+pre.format {font-family: inherit}
+pre.menu-comment {font-family: serif}
+pre.menu-preformatted {font-family: serif}
+pre.smalldisplay {font-family: inherit; font-size: smaller}
+pre.smallexample {font-size: smaller}
+pre.smallformat {font-family: inherit; font-size: smaller}
+pre.smalllisp {font-size: smaller}
+span.nocodebreak {white-space:nowrap}
+span.nolinebreak {white-space:nowrap}
+span.roman {font-family:serif; font-weight:normal}
+span.sansserif {font-family:sans-serif; font-weight:normal}
+ul.no-bullet {list-style: none}
+-->
+</style>
+
+
+</head>
+
+<body lang="en" bgcolor="#FFFFFF" text="#000000" link="#0000FF" vlink="#800080" alink="#FF0000">
+<a name="Lexer"></a>
+<div class="header">
+<p>
+Next: <a href="Hash-Nodes.html#Hash-Nodes" accesskey="n" rel="next">Hash Nodes</a>, Previous: <a href="Conventions.html#Conventions" accesskey="p" rel="previous">Conventions</a>, Up: <a href="index.html#Top" accesskey="u" rel="up">Top</a> &nbsp; [<a href="index.html#SEC_Contents" title="Table of contents" rel="contents">Contents</a>][<a href="Concept-Index.html#Concept-Index" title="Index" rel="index">Index</a>]</p>
+</div>
+<hr>
+<a name="The-Lexer"></a>
+<h2 class="unnumbered">The Lexer</h2>
+<a name="index-lexer"></a>
+<a name="index-newlines"></a>
+<a name="index-escaped-newlines"></a>
+
+<a name="Overview"></a>
+<h3 class="section">Overview</h3>
+<p>The lexer is contained in the file <samp>lex.cc</samp>.  It is a hand-coded
+lexer, and not implemented as a state machine.  It can understand C, C++
+and Objective-C source code, and has been extended to allow reasonably
+successful preprocessing of assembly language.  The lexer does not make
+an initial pass to strip out trigraphs and escaped newlines, but handles
+them as they are encountered in a single pass of the input file.  It
+returns preprocessing tokens individually, not a line at a time.
+</p>
+<p>It is mostly transparent to users of the library, since the library&rsquo;s
+interface for obtaining the next token, <code>cpp_get_token</code>, takes care
+of lexing new tokens, handling directives, and expanding macros as
+necessary.  However, the lexer does expose some functionality so that
+clients of the library can easily spell a given token, such as
+<code>cpp_spell_token</code> and <code>cpp_token_len</code>.  These functions are
+useful when generating diagnostics, and for emitting the preprocessed
+output.
+</p>
+<a name="Lexing-a-token"></a>
+<h3 class="section">Lexing a token</h3>
+<p>Lexing of an individual token is handled by <code>_cpp_lex_direct</code> and
+its subroutines.  In its current form the code is quite complicated,
+with read ahead characters and such-like, since it strives to not step
+back in the character stream in preparation for handling non-ASCII file
+encodings.  The current plan is to convert any such files to UTF-8
+before processing them.  This complexity is therefore unnecessary and
+will be removed, so I&rsquo;ll not discuss it further here.
+</p>
+<p>The job of <code>_cpp_lex_direct</code> is simply to lex a token.  It is not
+responsible for issues like directive handling, returning lookahead
+tokens directly, multiple-include optimization, or conditional block
+skipping.  It necessarily has a minor r&ocirc;le to play in memory
+management of lexed lines.  I discuss these issues in a separate section
+(see <a href="#Lexing-a-line">Lexing a line</a>).
+</p>
+<p>The lexer places the token it lexes into storage pointed to by the
+variable <code>cur_token</code>, and then increments it.  This variable is
+important for correct diagnostic positioning.  Unless a specific line
+and column are passed to the diagnostic routines, they will examine the
+<code>line</code> and <code>col</code> values of the token just before the location
+that <code>cur_token</code> points to, and use that location to report the
+diagnostic.
+</p>
+<p>The lexer does not consider whitespace to be a token in its own right.
+If whitespace (other than a new line) precedes a token, it sets the
+<code>PREV_WHITE</code> bit in the token&rsquo;s flags.  Each token has its
+<code>line</code> and <code>col</code> variables set to the line and column of the
+first character of the token.  This line number is the line number in
+the translation unit, and can be converted to a source (file, line) pair
+using the line map code.
+</p>
+<p>The first token on a logical, i.e. unescaped, line has the flag
+<code>BOL</code> set for beginning-of-line.  This flag is intended for
+internal use, both to distinguish a &lsquo;<samp>#</samp>&rsquo; that begins a directive
+from one that doesn&rsquo;t, and to generate a call-back to clients that want
+to be notified about the start of every non-directive line with tokens
+on it.  Clients cannot reliably determine this for themselves: the first
+token might be a macro, and the tokens of a macro expansion do not have
+the <code>BOL</code> flag set.  The macro expansion may even be empty, and the
+next token on the line certainly won&rsquo;t have the <code>BOL</code> flag set.
+</p>
+<p>New lines are treated specially; exactly how the lexer handles them is
+context-dependent.  The C standard mandates that directives are
+terminated by the first unescaped newline character, even if it appears
+in the middle of a macro expansion.  Therefore, if the state variable
+<code>in_directive</code> is set, the lexer returns a <code>CPP_EOF</code> token,
+which is normally used to indicate end-of-file, to indicate
+end-of-directive.  In a directive a <code>CPP_EOF</code> token never means
+end-of-file.  Conveniently, if the caller was <code>collect_args</code>, it
+already handles <code>CPP_EOF</code> as if it were end-of-file, and reports an
+error about an unterminated macro argument list.
+</p>
+<p>The C standard also specifies that a new line in the middle of the
+arguments to a macro is treated as whitespace.  This white space is
+important in case the macro argument is stringized.  The state variable
+<code>parsing_args</code> is nonzero when the preprocessor is collecting the
+arguments to a macro call.  It is set to 1 when looking for the opening
+parenthesis to a function-like macro, and 2 when collecting the actual
+arguments up to the closing parenthesis, since these two cases need to
+be distinguished sometimes.  One such time is here: the lexer sets the
+<code>PREV_WHITE</code> flag of a token if it meets a new line when
+<code>parsing_args</code> is set to 2.  It doesn&rsquo;t set it if it meets a new
+line when <code>parsing_args</code> is 1, since then code like
+</p>
+<div class="smallexample">
+<pre class="smallexample">#define foo() bar
+foo
+baz
+</pre></div>
+
+<p>would be output with an erroneous space before &lsquo;<samp>baz</samp>&rsquo;:
+</p>
+<div class="smallexample">
+<pre class="smallexample">foo
+ baz
+</pre></div>
+
+<p>This is a good example of the subtlety of getting token spacing correct
+in the preprocessor; there are plenty of tests in the testsuite for
+corner cases like this.
+</p>
+<p>The lexer is written to treat each of &lsquo;<samp>\r</samp>&rsquo;, &lsquo;<samp>\n</samp>&rsquo;, &lsquo;<samp>\r\n</samp>&rsquo;
+and &lsquo;<samp>\n\r</samp>&rsquo; as a single new line indicator.  This allows it to
+transparently preprocess MS-DOS, Macintosh and Unix files without their
+needing to pass through a special filter beforehand.
+</p>
+<p>We also decided to treat a backslash, either &lsquo;<samp>\</samp>&rsquo; or the trigraph
+&lsquo;<samp>??/</samp>&rsquo;, separated from one of the above newline indicators by
+non-comment whitespace only, as intending to escape the newline.  It
+tends to be a typing mistake, and cannot reasonably be mistaken for
+anything else in any of the C-family grammars.  Since handling it this
+way is not strictly conforming to the ISO standard, the library issues a
+warning wherever it encounters it.
+</p>
+<p>Handling newlines like this is made simpler by doing it in one place
+only.  The function <code>handle_newline</code> takes care of all newline
+characters, and <code>skip_escaped_newlines</code> takes care of arbitrarily
+long sequences of escaped newlines, deferring to <code>handle_newline</code>
+to handle the newlines themselves.
+</p>
+<p>The most painful aspect of lexing ISO-standard C and C++ is handling
+trigraphs and backlash-escaped newlines.  Trigraphs are processed before
+any interpretation of the meaning of a character is made, and unfortunately
+there is a trigraph representation for a backslash, so it is possible for
+the trigraph &lsquo;<samp>??/</samp>&rsquo; to introduce an escaped newline.
+</p>
+<p>Escaped newlines are tedious because theoretically they can occur
+anywhere&mdash;between the &lsquo;<samp>+</samp>&rsquo; and &lsquo;<samp>=</samp>&rsquo; of the &lsquo;<samp>+=</samp>&rsquo; token,
+within the characters of an identifier, and even between the &lsquo;<samp>*</samp>&rsquo;
+and &lsquo;<samp>/</samp>&rsquo; that terminates a comment.  Moreover, you cannot be sure
+there is just one&mdash;there might be an arbitrarily long sequence of them.
+</p>
+<p>So, for example, the routine that lexes a number, <code>parse_number</code>,
+cannot assume that it can scan forwards until the first non-number
+character and be done with it, because this could be the &lsquo;<samp>\</samp>&rsquo;
+introducing an escaped newline, or the &lsquo;<samp>?</samp>&rsquo; introducing the trigraph
+sequence that represents the &lsquo;<samp>\</samp>&rsquo; of an escaped newline.  If it
+encounters a &lsquo;<samp>?</samp>&rsquo; or &lsquo;<samp>\</samp>&rsquo;, it calls <code>skip_escaped_newlines</code>
+to skip over any potential escaped newlines before checking whether the
+number has been finished.
+</p>
+<p>Similarly code in the main body of <code>_cpp_lex_direct</code> cannot simply
+check for a &lsquo;<samp>=</samp>&rsquo; after a &lsquo;<samp>+</samp>&rsquo; character to determine whether it
+has a &lsquo;<samp>+=</samp>&rsquo; token; it needs to be prepared for an escaped newline of
+some sort.  Such cases use the function <code>get_effective_char</code>, which
+returns the first character after any intervening escaped newlines.
+</p>
+<p>The lexer needs to keep track of the correct column position, including
+counting tabs as specified by the <samp>-ftabstop=</samp> option.  This
+should be done even within C-style comments; they can appear in the
+middle of a line, and we want to report diagnostics in the correct
+position for text appearing after the end of the comment.
+</p>
+<a name="Invalid-identifiers"></a><p>Some identifiers, such as <code>__VA_ARGS__</code> and poisoned identifiers,
+may be invalid and require a diagnostic.  However, if they appear in a
+macro expansion we don&rsquo;t want to complain with each use of the macro.
+It is therefore best to catch them during the lexing stage, in
+<code>parse_identifier</code>.  In both cases, whether a diagnostic is needed
+or not is dependent upon the lexer&rsquo;s state.  For example, we don&rsquo;t want
+to issue a diagnostic for re-poisoning a poisoned identifier, or for
+using <code>__VA_ARGS__</code> in the expansion of a variable-argument macro.
+Therefore <code>parse_identifier</code> makes use of state flags to determine
+whether a diagnostic is appropriate.  Since we change state on a
+per-token basis, and don&rsquo;t lex whole lines at a time, this is not a
+problem.
+</p>
+<p>Another place where state flags are used to change behavior is whilst
+lexing header names.  Normally, a &lsquo;<samp>&lt;</samp>&rsquo; would be lexed as a single
+token.  After a <code>#include</code> directive, though, it should be lexed as
+a single token as far as the nearest &lsquo;<samp>&gt;</samp>&rsquo; character.  Note that we
+don&rsquo;t allow the terminators of header names to be escaped; the first
+&lsquo;<samp>&quot;</samp>&rsquo; or &lsquo;<samp>&gt;</samp>&rsquo; terminates the header name.
+</p>
+<p>Interpretation of some character sequences depends upon whether we are
+lexing C, C++ or Objective-C, and on the revision of the standard in
+force.  For example, &lsquo;<samp>::</samp>&rsquo; is a single token in C++, but in C it is
+two separate &lsquo;<samp>:</samp>&rsquo; tokens and almost certainly a syntax error.  Such
+cases are handled by <code>_cpp_lex_direct</code> based upon command-line
+flags stored in the <code>cpp_options</code> structure.
+</p>
+<p>Once a token has been lexed, it leads an independent existence.  The
+spelling of numbers, identifiers and strings is copied to permanent
+storage from the original input buffer, so a token remains valid and
+correct even if its source buffer is freed with <code>_cpp_pop_buffer</code>.
+The storage holding the spellings of such tokens remains until the
+client program calls cpp_destroy, probably at the end of the translation
+unit.
+</p>
+<a name="Lexing-a-line"></a><a name="Lexing-a-line-1"></a>
+<h3 class="section">Lexing a line</h3>
+<a name="index-token-run"></a>
+
+<p>When the preprocessor was changed to return pointers to tokens, one
+feature I wanted was some sort of guarantee regarding how long a
+returned pointer remains valid.  This is important to the stand-alone
+preprocessor, the future direction of the C family front ends, and even
+to cpplib itself internally.
+</p>
+<p>Occasionally the preprocessor wants to be able to peek ahead in the
+token stream.  For example, after the name of a function-like macro, it
+wants to check the next token to see if it is an opening parenthesis.
+Another example is that, after reading the first few tokens of a
+<code>#pragma</code> directive and not recognizing it as a registered pragma,
+it wants to backtrack and allow the user-defined handler for unknown
+pragmas to access the full <code>#pragma</code> token stream.  The stand-alone
+preprocessor wants to be able to test the current token with the
+previous one to see if a space needs to be inserted to preserve their
+separate tokenization upon re-lexing (paste avoidance), so it needs to
+be sure the pointer to the previous token is still valid.  The
+recursive-descent C++ parser wants to be able to perform tentative
+parsing arbitrarily far ahead in the token stream, and then to be able
+to jump back to a prior position in that stream if necessary.
+</p>
+<p>The rule I chose, which is fairly natural, is to arrange that the
+preprocessor lex all tokens on a line consecutively into a token buffer,
+which I call a <em>token run</em>, and when meeting an unescaped new line
+(newlines within comments do not count either), to start lexing back at
+the beginning of the run.  Note that we do <em>not</em> lex a line of
+tokens at once; if we did that <code>parse_identifier</code> would not have
+state flags available to warn about invalid identifiers (see <a href="#Invalid-identifiers">Invalid identifiers</a>).
+</p>
+<p>In other words, accessing tokens that appeared earlier in the current
+line is valid, but since each logical line overwrites the tokens of the
+previous line, tokens from prior lines are unavailable.  In particular,
+since a directive only occupies a single logical line, this means that
+the directive handlers like the <code>#pragma</code> handler can jump around
+in the directive&rsquo;s tokens if necessary.
+</p>
+<p>Two issues remain: what about tokens that arise from macro expansions,
+and what happens when we have a long line that overflows the token run?
+</p>
+<p>Since we promise clients that we preserve the validity of pointers that
+we have already returned for tokens that appeared earlier in the line,
+we cannot reallocate the run.  Instead, on overflow it is expanded by
+chaining a new token run on to the end of the existing one.
+</p>
+<p>The tokens forming a macro&rsquo;s replacement list are collected by the
+<code>#define</code> handler, and placed in storage that is only freed by
+<code>cpp_destroy</code>.  So if a macro is expanded in the line of tokens,
+the pointers to the tokens of its expansion that are returned will always
+remain valid.  However, macros are a little trickier than that, since
+they give rise to three sources of fresh tokens.  They are the built-in
+macros like <code>__LINE__</code>, and the &lsquo;<samp>#</samp>&rsquo; and &lsquo;<samp>##</samp>&rsquo; operators
+for stringizing and token pasting.  I handled this by allocating
+space for these tokens from the lexer&rsquo;s token run chain.  This means
+they automatically receive the same lifetime guarantees as lexed tokens,
+and we don&rsquo;t need to concern ourselves with freeing them.
+</p>
+<p>Lexing into a line of tokens solves some of the token memory management
+issues, but not all.  The opening parenthesis after a function-like
+macro name might lie on a different line, and the front ends definitely
+want the ability to look ahead past the end of the current line.  So
+cpplib only moves back to the start of the token run at the end of a
+line if the variable <code>keep_tokens</code> is zero.  Line-buffering is
+quite natural for the preprocessor, and as a result the only time cpplib
+needs to increment this variable is whilst looking for the opening
+parenthesis to, and reading the arguments of, a function-like macro.  In
+the near future cpplib will export an interface to increment and
+decrement this variable, so that clients can share full control over the
+lifetime of token pointers too.
+</p>
+<p>The routine <code>_cpp_lex_token</code> handles moving to new token runs,
+calling <code>_cpp_lex_direct</code> to lex new tokens, or returning
+previously-lexed tokens if we stepped back in the token stream.  It also
+checks each token for the <code>BOL</code> flag, which might indicate a
+directive that needs to be handled, or require a start-of-line call-back
+to be made.  <code>_cpp_lex_token</code> also handles skipping over tokens in
+failed conditional blocks, and invalidates the control macro of the
+multiple-include optimization if a token was successfully lexed outside
+a directive.  In other words, its callers do not need to concern
+themselves with such issues.
+</p>
+<hr>
+<div class="header">
+<p>
+Next: <a href="Hash-Nodes.html#Hash-Nodes" accesskey="n" rel="next">Hash Nodes</a>, Previous: <a href="Conventions.html#Conventions" accesskey="p" rel="previous">Conventions</a>, Up: <a href="index.html#Top" accesskey="u" rel="up">Top</a> &nbsp; [<a href="index.html#SEC_Contents" title="Table of contents" rel="contents">Contents</a>][<a href="Concept-Index.html#Concept-Index" title="Index" rel="index">Index</a>]</p>
+</div>
+
+
+
+</body>
+</html>