C++: Is there a standard definition for end-of-line in a multi-line string constant?

The intent is that a newline in a raw string literal maps to a single
'\n' character. This intent is not expressed as clearly as it
should be, which has led to some confusion.

Citations are to the 2011 ISO C++ standard.

First, here’s the evidence that it maps to a single '\n' character.

A note in section 2.14.5 [lex.string] paragraph 4 says:

[ Note: A source-file new-line in a raw string literal results in a
new-line in the resulting execution string-literal. Assuming no
whitespace at the beginning of lines in the following example, the
assert will succeed:

    const char *p = R"(a\
    b
    c)";
    assert(std::strcmp(p, "a\\\nb\nc") == 0);

end note ]

This clearly states that a newline is mapped to a single '\n'
character. It also matches the observed behavior of g++ 6.2.0 and
clang++ 3.8.1 (tests done on a Linux system using source files with
Unix-style and Windows-style line endings).

Given the clearly stated intent in the note and the behavior of two
popular compilers, I’d say it’s safe to rely on this, though it
would be interesting to see how other compilers handle it.
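For anyone who wants to try this on another compiler, here is a minimal sketch of such a check. The function names are hypothetical and chosen only for illustration. It assumes the second line of the raw string starts in column 0 of the source file.

```cpp
#include <cassert>
#include <cstring>

// Hypothetical name, for illustration: a raw string literal that spans
// two source lines. The second line must start in column 0.
inline const char* two_line_raw() {
    return R"(a
b)";
}

// True when the source line break became exactly one '\n': the literal
// is then the three characters 'a', '\n', 'b', with no '\r'.
inline bool newline_is_single_lf() {
    const char* p = two_line_raw();
    return std::strlen(p) == 3 && p[1] == '\n';
}

// Call this from main() to run the check on a given compiler.
inline void check_raw_newline() {
    assert(newline_is_single_lf());
}
```

To mirror the tests described above, save the file once with Unix-style and once with Windows-style line endings and build it each way.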

However, a literal reading of the normative wording of the
standard could easily lead to a different conclusion, or at least
to some uncertainty.

Section 2.5 [lex.pptoken] paragraph 3 says (emphasis added):

Between the initial and final double quote characters of the
raw string, any transformations performed in phases 1 and 2
(trigraphs, universal-character-names, and line splicing)
are reverted; this reversion shall apply before any d-char,
r-char, or delimiting parenthesis is identified.

The phases of translation are specified in 2.2 [lex.phases]. In phase 1:

Physical source file characters are mapped, in an
implementation-defined manner, to the basic source character set
(introducing new-line characters for end-of-line indicators) if
necessary.

If we assume that the mapping of physical source file characters to the
basic source character set and the introduction of new-line characters are
“transformations”, we might reasonably conclude that, for example,
a newline in the middle of a raw string literal in a Windows-format
source file should be equivalent to a \r\n sequence. (I can imagine
that being useful for Windows-specific code.)

(This interpretation does lead to problems with systems where the
end-of-line indicator is not a sequence of characters, for example
where each line is a fixed-width record. Such systems are rare
these days.)
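If Windows-specific code really does need \r\n sequences, it seems safer to produce them explicitly than to depend on how the source file happens to be stored. The following is only a sketch; to_crlf is a hypothetical helper, not something from the standard library, and it assumes its input contains no '\r' characters already.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: return a copy of text with each '\n' expanded
// to "\r\n". Assumes text contains no '\r' to begin with.
inline std::string to_crlf(const std::string& text) {
    std::string out;
    out.reserve(text.size());
    for (char c : text) {
        if (c == '\n') out += '\r';  // insert CR before each LF
        out += c;
    }
    return out;
}

// Assumes the raw string's line break is a single '\n', as the note in
// 2.14.5 says. Call this from main() to run the check.
inline void check_to_crlf() {
    const std::string raw = R"(a
b)";
    assert(to_crlf(raw) == "a\r\nb");
}
```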

As “Cheers and hth. – Alf”’s answer points out, there is an open
Defect Report for this issue. It was submitted in 2013 and has not
yet been resolved.

Personally, I think the root of the confusion is the word “any”
(emphasis added as before):

Between the initial and final double quote characters of the raw
string, any transformations performed in phases 1 and 2 (trigraphs,
universal-character-names, and line splicing) are reverted; this
reversion shall apply before any d-char, r-char, or delimiting
parenthesis is identified.

Surely the mapping of physical source file characters to
the basic source character set can reasonably be thought of
as a transformation. The parenthesized clause “(trigraphs,
universal-character-names, and line splicing)” seems to be intended
to specify which transformations are to be reverted, but that
either attempts to change the meaning of the word “transformations”
(which the standard does not formally define) or contradicts the use
of the word “any”.

I suggest that changing the word “any” to “certain” would express
the apparent intent much more clearly:

Between the initial and final double quote characters of the raw
string, certain transformations performed in phases 1 and 2 (trigraphs,
universal-character-names, and line splicing) are reverted; this
reversion shall apply before any d-char, r-char, or delimiting
parenthesis is identified.

This wording would make it much clearer that “trigraphs,
universal-character-names, and line splicing” are the only
transformations that are to be reverted. (Not everything done
in translation phases 1 and 2 is reverted, just those specific
listed transformations.)
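As a rough illustration of two of the listed transformations being reverted, consider the sketch below. The function names are hypothetical. Trigraphs are left out because many compilers disable them by default. As before, the second line of the multi-line raw string is assumed to start in column 0.

```cpp
#include <cassert>
#include <cstring>

// Hypothetical names, for illustration only.

// A universal-character-name spelled inside a raw string stays as six
// separate characters (backslash, u, 0, 0, e, 9).
inline const char* raw_ucn() { return R"(\u00e9)"; }

// A backslash at the end of a line inside a raw string is not treated
// as a line splice: the backslash and the line break both survive.
inline const char* raw_splice() {
    return R"(a\
b)";
}

// True when both literals hold the characters as they were spelled,
// with the line break itself still being a single '\n'.
inline bool listed_transformations_reverted() {
    return std::strcmp(raw_ucn(), "\\u00e9") == 0
        && std::strcmp(raw_splice(), "a\\\nb") == 0;
}

// Call this from main() to run the check.
inline void check_reversion() {
    assert(listed_transformations_reverted());
}
```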
