String Literal Syntax#
Dylan string literals have two deficiencies when compared to many current
programming languages, most notably the inability to encode multi-line strings
into a single source token that crosses multiple source lines and no fully
supported way to represent unescaped (“raw”) strings. This DEP proposes a
literal syntax for both of these cases. Briefly, it proposes syntax for
one-line raw strings delimited by
#r", multi-line strings delimited by
#r""", and quoted symbols delimited by
Summary: Literal syntax to address both of these deficiencies would be nice to have and is something that people expect to exist in modern programming languages.
Multi-line String Literals#
While it should be noted that encoding long string literals into source code is not good practice when internationalization is desired, it is convenient to have this ability for quick scripts and especially for encoding test data. Most modern programming languages provide this ability.
For very short multi-line strings one can get away with using \n or \r\n in a regular string:
"line one\nline two"
but readability quickly suffers as these strings get longer. An alternative is to put one string on each line and rely on automatic string concatenation by the compiler:
"line one\n" "line two"
The problem here is that changing the contents of the string quickly becomes an editing chore as the programmer attempts to keep the lines a similar length, and remembering to put the \n at the end of each line. Having a multi-line string literal syntax would alleviate both of these problems.
Raw String Literals#
Raw string literals (that is, strings in which
'\\' is not a special
character) are useful for inputting data that needs to contain many
characters. The primary example of this is regular expressions, which quickly
become difficult to interpret when they themselves contain many backslashes.
Open Dylan has a fairly simple work-around for this problem, which it calls parser expansions:
define function string-parser (s) s end; define constant s1 = "\\.\\[\\*\\\\\\]"; // standard Dylan define constant s2 = #:string:"\.\[\*\\\]"; // uses string-parser
There are some problems with this mechanism:
Because the parser can have any name (“string” in the above example) and various different delimiters, it’s difficult to provide good editor support.
The parser only allows a single character to delimit the string, meaning that that character is forbidden within the string literal. For long string literals it can be hard to predict which delimiter character (if any) will not be needed.
For the relatively common task of specifying a raw string, each individual library needs to either define
string-parseror import it. (This could be made easier by exporting it from the
This kind of parser prevents consecutive string literal concatenation by the compiler because the reader doesn’t produce a string literal fragment for it. Instead it emits a call to
Having a standard raw string literal syntax would eliminate or reduce these problems.
Multi-line string literals are delimited by three double quote characters on
""". Any string literal, whether one-line or multi-line, may be
#R to disable backslash escape processing, i.e., to
make it a raw string literal.
Literal end-of-line sequences are always interpreted as a single LF character, in both raw and escaped string literals, regardless of operating system platform. Specifically,
CR (character code 13) by itself is converted to a single LF (character code 10).
CRLF (character codes 13 10, when occurring consecutively) are converted to a single LF (character code 10).
An implication of this design is that the only way to include a CR character in
a string literal is with the escape sequence
\r in a non-raw string.
All string literals, whether escaped, raw, one-line, or multi-line, continue to adhere to the rule that consecutive string literals separated by only whitespace are automatically concatenated.
Because Dylan’s quoted symbol syntax (also known as “unique string” syntax) is
just # followed by any standard string, we also allow
#""" to indicate a
multi-line quoted symbol, to be consistent. No new syntax is provided to create
a “raw” quoted symbols, i.e., quoted symbols without escape processing.
In the Dylan Reference Manual, in the section Tokens,
added to the
The BNF, which augments Character and String Literals in the DRM, is shown below.
CHARACTER-LITERAL: ' character ' character: any printing character (including space) except for ' or \ \ escape-character STRING: " more-string " " " multi-line-string # r raw-string more-string: string-character more-string " multi-line-string: " " " more-multi-line-string more-multi-line-string: multi-line-string-character more-multi-line-string " " " multi-line-string-character: any character except for \ or three " in a row \ escape-character raw-string: " more-raw-string " " " more-raw-string-multi-line more-raw-string: raw-string-character more-raw-string " more-raw-string-multi-line: raw-string-character-multi-line more-raw-string-multi-line " " " string-character: any printing character (including space) except for " or \ \ escape-character raw-string-character: any printing character (including space) except for " raw-string-character-multi-line: any character but not three " in a row escape-character: one of \ ' " a b e f n r t 0 < hex-digits >
"""abc""" or #r"abc" or #r"""abc"""
"line one\nline two" but never equivalent to
"""line one line two"""
"let x = \"foo\";":
"""let x = "foo";"""
""" foo bar """
A reference implementation is underway on github.
The revision history of this document is available here: https://github.com/dylan-lang/website/commits/master/source/proposals/dep-0012-string-literals.rst