demangling docs

This commit is contained in:
rsc 2005-11-29 04:05:50 +00:00
parent 140c21e2f1
commit c8661ffad4
2 changed files with 611 additions and 0 deletions

236
src/libmach/gpcompare.texi Normal file

@ -0,0 +1,236 @@
@node ANSI
@chapter @sc{gnu} C++ Conformance to @sc{ansi} C++
These changes in the @sc{gnu} C++ compiler were made to comply more
closely with the @sc{ansi} base document, @cite{The Annotated C++
Reference Manual} (the @sc{arm}). Further reducing the divergences from
@sc{ansi} C++ is a continued goal of the @sc{gnu} C++ Renovation
Project.
@b{Section 3.4}, @i{Start and Termination}. It is now invalid to take
the address of the function @samp{main()}.
@b{Section 4.8}, @i{Pointers to Members}. The compiler produces
an error for trying to convert between a pointer to a member and the type
@samp{void *}.
@b{Section 5.2.5}, @i{Increment and Decrement}. It is an error to use
the increment and decrement operators on an enumerated type.
@b{Section 5.3.2}, @i{Sizeof}. Doing @code{sizeof} on a function is now
an error.
@b{Section 5.3.4}, @i{Delete}. The syntax of a @i{cast-expression} is
now more strictly controlled.
@b{Section 7.1.1}, @i{Storage Class Specifiers}. The
@code{static} and @code{extern} specifiers can now only be applied to
names of objects, functions, and anonymous unions.
@b{Section 7.1.1}, @i{Storage Class Specifiers}. The compiler no longer complains
about taking the address of a variable which has been declared to have @code{register}
storage.
@b{Section 7.1.2}, @i{Function Specifiers}. The compiler produces an
error when the @code{inline} or @code{virtual} specifiers are
used on anything other than a function.
@b{Section 8.3}, @i{Function Definitions}. It is now an error to shadow
a parameter name with a local variable; in the past, the compiler only
gave a warning in such a situation.
@b{Section 8.4.1}, @i{Aggregates}. The rules concerning declaration of
an aggregate are now all checked in the @sc{gnu} C++ compiler; they
include having no private or protected members and no base classes.
@b{Section 8.4.3}, @i{References}. Declaring an array of references is
now forbidden. Initializing a reference with an initializer list is
also considered an error.
@b{Section 9.5}, @i{Unions}. Global anonymous unions must be declared
@code{static}.
@b{Section 11.4}, @i{Friends}. Declaring a member to be a friend of a
type that has not yet been defined is an error.
@b{Section 12.1}, @i{Constructors}. The compiler generates a
default copy constructor for a class if no constructor has been declared.
@ignore
@b{Section 12.4}, @i{Destructors}. In accordance with the @sc{ansi} C++
draft standard working paper, a pure virtual destructor must now be
defined.
@end ignore
@b{Section 12.6.2}, @i{Special Member Functions}. When using a
@i{mem-initializer} list, the compiler will now initialize class members
in declaration order, not in the order in which you specify them.
Also, the compiler enforces the rule that non-static @code{const}
and reference members must be initialized with a @i{mem-initializer}
list when their class does not have a constructor.
@b{Section 12.8}, @i{Copying Class Objects}. The compiler generates
default copy constructors correctly, and supplies default assignment
operators compatible with user-defined ones.
@b{Section 13.4}, @i{Overloaded Operators}. An overloaded operator may
no longer have default arguments.
@b{Section 13.4.4}, @i{Function Call}. An overloaded @samp{operator ()}
must be a non-static member function.
@b{Section 13.4.5}, @i{Subscripting}. An overloaded @samp{operator []}
must be a non-static member function.
@b{Section 13.4.6}, @i{Class Member Access}. An overloaded @samp{operator ->}
must be a non-static member function.
@b{Section 13.4.7}, @i{Increment and Decrement}. The compiler will now
make sure a postfix @samp{@w{operator ++}} or @samp{@w{operator --}} has an
@code{int} as its second argument.
@node Encoding
@chapter Name Encoding in @sc{gnu} C++
@c FIXME!! rewrite name encoding section
@c ...to give complete rules rather than diffs from ARM.
@c To avoid plagiarism, invent some different way of structuring the
@c description of the rules than what ARM uses.
@cindex mangling
@cindex name encoding
@cindex encoding information in names
In order to support its strong typing rules and the ability to provide
function overloading, the C++ programming language @dfn{encodes}
information about functions and objects, so that conflicts across object
files can be detected during linking. @footnote{This encoding is also
sometimes called, whimsically enough, @dfn{mangling}; the corresponding
decoding is sometimes called @dfn{demangling}.} These rules tend to be
unique to each individual implementation of C++.
The scheme detailed in the commentary for 7.2.1 of @cite{The Annotated
C++ Reference Manual} offers a description of a possible implementation
which happens to closely resemble the @code{cfront} compiler. The
design used in @sc{gnu} C++ differs from this model in a number of ways:
@itemize @bullet
@item
In addition to the basic types @code{void}, @code{char}, @code{short},
@code{int}, @code{long}, @code{float}, @code{double}, and @code{long
double}, @sc{gnu} C++ supports two additional types: @code{wchar_t}, the wide
character type, and @code{long long} (if the host supports it). The
encodings for these are @samp{w} and @samp{x} respectively.
@item
According to the @sc{arm}, qualified names (e.g., @samp{foo::bar::baz}) are
encoded with a leading @samp{Q}. Followed by the number of
qualifications (in this case, three) and the respective names, this
might be encoded as @samp{Q33foo3bar3baz}. @sc{gnu} C++ adds a leading
underscore to the list, producing @samp{_Q33foo3bar3baz}.
@item
The operator @samp{*=} is encoded as @samp{__aml}, not @samp{__amu}, to
match the normal @samp{*} operator, which is encoded as @samp{__ml}.
@c XXX left out ->(), __wr
@item
In addition to the normal operators, @sc{gnu} C++ also offers the maximum and
minimum operators @samp{>?} and @samp{<?}, encoded as @samp{__mx} and
@samp{__mn}, and the conditional operator @samp{?:}, encoded as @samp{__cn}.
@cindex destructors, encoding of
@cindex constructors, encoding of
@item
Constructors are encoded as simply @samp{__@var{name}}, where @var{name}
is the encoded name (e.g., @code{3foo} for the @code{foo} class
constructor). Destructors are encoded as two leading underscores
separated by either a period or a dollar sign, depending on the
capabilities of the local host, followed by the encoded name. For
example, the destructor @samp{foo::~foo} is encoded as @samp{_$_3foo}.
@item
Virtual tables are encoded with a prefix of @samp{_vt}, rather than
@samp{__vtbl}. The names of their classes are separated by dollar signs
(or periods), and not encoded as normal: the virtual table for
@code{foo} is @samp{__vt$foo}, and the table for @code{foo::bar} is
named @samp{__vt$foo$bar}.
@item
Static members are encoded as a leading underscore, followed by the
encoded name of the class in which they appear, a separating dollar sign
or period, and finally the unencoded name of the variable. For example,
if the class @code{foo} contains a static member @samp{bar}, its
encoding would be @samp{_3foo$bar}.
@item
@sc{gnu} C++ is not as aggressive as other compilers when it comes to always
generating @samp{Fv} for functions with no arguments. In particular,
the compiler does not add the sequence to conversion operators. The
function @samp{foo::bar()} is encoded as @samp{bar__3foo}, not
@samp{bar__3fooFv}.
@item
The argument list for methods is not prefixed by a leading @samp{F}; it
is considered implied.
@item
@sc{gnu} C++ approaches the task of saving space in encodings
differently from that noted in the @sc{arm}. It does use the
@samp{T@var{n}} and @samp{N@var{x}@var{y}} codes to signify copying the
@var{n}th argument's type, and making the next @var{x} arguments be the
type of the @var{y}th argument, respectively. However, the values for
@var{n} and @var{y} begin at zero with @sc{gnu} C++, whereas the
@sc{arm} describes them as starting at one. For the function @samp{foo
(bartype, bartype)}, @sc{gnu} C++ uses @samp{foo__7bartypeT0}, while
compilers following the @sc{arm} example generate @samp{foo__7bartypeT1}.
@c Note it loses on `foo (int, int, int, int, int)'.
@item
@sc{gnu} C++ does not bother using the space-saving methods for types whose
encoding is a single character (like an integer, encoded as @samp{i}).
This is useful in the most common cases (two @code{int}s would otherwise
take three letters, instead of just @samp{ii}).
@end itemize
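The zero-based back-reference rule from the last two items can be
sketched in a few lines of Python. This is only an illustration of the
rule as described above, not the compiler's own code: the helper name
encode_args is hypothetical, it takes argument types that are already
encoded, and it leaves out the N code for runs of repeated arguments.

```python
def encode_args(arg_codes):
    """Join already-encoded argument types, replacing repeats with T<n>.

    n is the zero-based index of the first argument with the same type.
    Single-character encodings (such as 'i') are never replaced.
    Hypothetical helper; the N<x><y> run code is not modelled.
    """
    out = []
    for index, code in enumerate(arg_codes):
        first = arg_codes.index(code)  # earliest argument with this type
        if len(code) > 1 and first < index:
            out.append("T" + str(first))
        else:
            out.append(code)
    return "".join(out)


print("foo__" + encode_args(["7bartype", "7bartype"]))  # foo__7bartypeT0
print("foo__" + encode_args(["i", "i"]))                # foo__ii
```

A compiler following the ARM numbering would emit T1 where this sketch
emits T0.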
@c @node Cfront
@c @chapter @code{cfront} Compared to @sc{gnu} C++
@c
@c
@c FIXME!! Fill in. Consider points in the following:
@c
@c @display
@c Date: Thu, 2 Jan 92 21:35:20 EST
@c From: raeburn@@cygnus.com
@c Message-Id: <9201030235.AA10999@@cambridge.cygnus.com>
@c To: mrs@@charlie.secs.csun.edu
@c Cc: g++@@cygnus.com
@c Subject: Re: ARM and GNU C++ incompatabilities
@c
@c Along with that, we should probably describe how g++ differs from
@c cfront, in ways that the users will notice. (E.g., cfront supposedly
@c allows "free (new char[10])"; does g++? How do the template
@c implementations differ? "New" placement syntax?)
@c @end display
@c
@c XXX For next revision.
@c
@c GNU C++:
@c * supports expanding inline functions in many situations,
@c including those which have static objects, use `for' statements,
@c and other situations. Part of this versatility is due to its
@c ability to not always generate temporaries for assignments.
@c * deliberately allows divide by 0 and mod 0, since [according
@c to Wilson] there are actually situations where you'd like to allow
@c such things. Note on most systems it will cause some sort of trap
@c or bus error. Cfront considers it an error.
@c * does [appear to] support nested classes within templates.
@c * conversion functions among baseclasses are all usable by
@c a class that's derived from all of those bases.
@c * sizeof works even when the class is defined within its ()'s
@c * conditional expressions work with member fns and pointers to
@c members.
@c * can handle non-trivial declarations of variables within switch
@c statements.
@c
@c Cfront:

375
src/libmach/gxxint_15.html Normal file

@ -0,0 +1,375 @@
<HTML>
<HEAD>
<!-- This HTML file has been created by texi2html 1.52
from gxxint.texi on 27 August 1999 -->
<TITLE>G++ internals - Mangling</TITLE>
</HEAD>
<BODY>
Go to the <A HREF="gxxint_1.html">first</A>, <A HREF="gxxint_14.html">previous</A>, <A HREF="gxxint_16.html">next</A>, <A HREF="gxxint_16.html">last</A> section, <A HREF="gxxint_toc.html">table of contents</A>.
<P><HR><P>
<H2><A NAME="SEC20" HREF="gxxint_toc.html#TOC20">Function name mangling for C++ and Java</A></H2>
<P>
Both C++ and Java provide overloaded functions and methods,
which are methods with the same name but different parameter lists.
Selecting the correct version is done at compile time.
Though the overloaded functions have the same name in the source code,
they need to be translated into different assembler-level names,
since typical assemblers and linkers cannot handle overloading.
This process of encoding the parameter types with the method name
into a unique name is called <EM>name mangling</EM>. The inverse
process is called <EM>demangling</EM>.
</P>
<P>
It is convenient that C++ and Java use compatible mangling schemes,
since that makes life easier for tools such as gdb, and it eases
integration between C++ and Java.
</P>
<P>
Note there is also a standard "Java Native Interface" (JNI) which
implements a different calling convention, and uses a different
mangling scheme. The JNI is a rather abstract ABI so Java can call methods
written in C or C++;
we are concerned here about a lower-level interface primarily
intended for methods written in Java, but that can also be used for C++
(and less easily C).
</P>
<H3><A NAME="SEC21" HREF="gxxint_toc.html#TOC21">Method name mangling</A></H3>
<P>
C++ mangles a method by emitting the function name, followed by <CODE>__</CODE>,
followed by encodings of any method qualifiers (such as <CODE>const</CODE>),
followed by the mangling of the method's class,
followed by the mangling of the parameters, in order.
</P>
<P>
For example <CODE>Foo::bar(int, long) const</CODE> is mangled
as <SAMP>`bar__C3Fooil'</SAMP>.
</P>
<P>
For a constructor, the method name is left out.
That is, <CODE>Foo::Foo(int, long)</CODE> is mangled
as <SAMP>`__3Fooil'</SAMP>. (A constructor cannot be declared
<CODE>const</CODE>, so no <SAMP>`C'</SAMP> qualifier appears.)
</P>
<P>
GNU Java does the same.
</P>
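<P>
The rule above is plain string concatenation, which the following
Python sketch illustrates. The helper name <CODE>mangle_method</CODE> is
hypothetical, the class and parameter encodings are passed in already
mangled, and only the <CODE>const</CODE> qualifier is modelled.
</P>

```python
def mangle_method(name, class_code, param_codes, const=False):
    """Concatenate the pieces of a mangled method name.

    name        -- source-level method name ("" for a constructor)
    class_code  -- already-mangled class name, e.g. "3Foo"
    param_codes -- already-mangled parameter types, in order
    const       -- True for a const member function
    Hypothetical helper; other qualifiers are not modelled.
    """
    qualifiers = "C" if const else ""
    return name + "__" + qualifiers + class_code + "".join(param_codes)


print(mangle_method("bar", "3Foo", ["i", "l"], const=True))  # bar__C3Fooil
print(mangle_method("", "3Foo", ["i", "l"]))                 # __3Fooil
```

<P>
Passing an empty name gives the constructor form, since the method name
is simply left out.
</P>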
<H3><A NAME="SEC22" HREF="gxxint_toc.html#TOC22">Primitive types</A></H3>
<P>
The C++ types <CODE>int</CODE>, <CODE>long</CODE>, <CODE>short</CODE>, <CODE>char</CODE>,
and <CODE>long long</CODE> are mangled as <SAMP>`i'</SAMP>, <SAMP>`l'</SAMP>,
<SAMP>`s'</SAMP>, <SAMP>`c'</SAMP>, and <SAMP>`x'</SAMP>, respectively.
The corresponding unsigned types have <SAMP>`U'</SAMP> prefixed
to the mangling. The type <CODE>signed char</CODE> is mangled <SAMP>`Sc'</SAMP>.
</P>
<P>
The C++ and Java floating-point types <CODE>float</CODE> and <CODE>double</CODE>
are mangled as <SAMP>`f'</SAMP> and <SAMP>`d'</SAMP> respectively.
</P>
<P>
The C++ <CODE>bool</CODE> type and the Java <CODE>boolean</CODE> type are
mangled as <SAMP>`b'</SAMP>.
</P>
<P>
The C++ <CODE>wchar_t</CODE> and the Java <CODE>char</CODE> types are
mangled as <SAMP>`w'</SAMP>.
</P>
<P>
The Java integral types <CODE>byte</CODE>, <CODE>short</CODE>, <CODE>int</CODE>
and <CODE>long</CODE> are mangled as <SAMP>`c'</SAMP>, <SAMP>`s'</SAMP>, <SAMP>`i'</SAMP>,
and <SAMP>`x'</SAMP>, respectively.
</P>
<P>
C++ code that has included <CODE>javatypes.h</CODE> will mangle
the typedefs <CODE>jbyte</CODE>, <CODE>jshort</CODE>, <CODE>jint</CODE>
and <CODE>jlong</CODE> as respectively <SAMP>`c'</SAMP>, <SAMP>`s'</SAMP>, <SAMP>`i'</SAMP>,
and <SAMP>`x'</SAMP>. (This has not been implemented yet.)
</P>
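<P>
As a rough illustration, the C++ side of these rules fits in a small
lookup table. The Python sketch below uses hypothetical names
(<CODE>INTEGER_CODES</CODE>, <CODE>OTHER_CODES</CODE>,
<CODE>mangle_primitive</CODE>) and expects fully spelled type names such
as <CODE>unsigned int</CODE>; the codes for <CODE>void</CODE> and
<CODE>long double</CODE> are taken from the table of code characters in
this document.
</P>

```python
# Codes for the C++ integer types that can take an unsigned prefix.
INTEGER_CODES = {"int": "i", "long": "l", "short": "s",
                 "char": "c", "long long": "x"}

# Codes for the remaining primitive types.
OTHER_CODES = {"float": "f", "double": "d", "long double": "r",
               "bool": "b", "wchar_t": "w", "void": "v"}


def mangle_primitive(type_name):
    """Return the mangling of a fully spelled C++ primitive type name.

    Hypothetical helper; raises KeyError for names it does not know.
    """
    if type_name == "signed char":
        return "Sc"
    prefix = "unsigned "
    if type_name.startswith(prefix):
        return "U" + INTEGER_CODES[type_name[len(prefix):]]
    if type_name in INTEGER_CODES:
        return INTEGER_CODES[type_name]
    return OTHER_CODES[type_name]


print(mangle_primitive("unsigned long"))  # Ul
print(mangle_primitive("signed char"))    # Sc
```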
<H3><A NAME="SEC23" HREF="gxxint_toc.html#TOC23">Mangling of simple names</A></H3>
<P>
A simple class, package, template, or namespace name is
encoded as the number of characters in the name, followed by
the actual characters. Thus the class <CODE>Foo</CODE>
is encoded as <SAMP>`3Foo'</SAMP>.
</P>
<P>
If any of the characters in the name are not alphanumeric
(i.e., not one of the standard ASCII letters, digits, or '_'),
or the initial character is a digit, then the name is
mangled as a sequence of encoded Unicode letters.
A Unicode encoding starts with a <SAMP>`U'</SAMP> to indicate
that Unicode escapes are used, followed by the number of
bytes used by the Unicode encoding, followed by the bytes
representing the encoding. ASCII letters and
non-initial digits are encoded without change. However, all
other characters (including underscore and initial digits) are
translated into a sequence starting with an underscore,
followed by the big-endian 4-hex-digit lower-case encoding of the character.
</P>
<P>
If a method name contains Unicode-escaped characters, the
entire mangled method name is followed by a <SAMP>`U'</SAMP>.
</P>
<P>
For example, the method <CODE>X\u0319::M\u002B(int)</CODE> is encoded as
<SAMP>`M_002b__U6X_0319iU'</SAMP>.
</P>
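<P>
The escaping rules can be sketched as follows. This is a minimal
Python illustration under stated assumptions, not the G++
implementation: the function names are hypothetical, names are assumed
to be non-empty, and every character is assumed to fit in four hex
digits.
</P>

```python
import string

PLAIN = string.ascii_letters + string.digits + "_"


def needs_escape(name):
    """True if the name starts with a digit or has a non-plain character."""
    return name[0] in string.digits or any(ch not in PLAIN for ch in name)


def escape(name):
    """Keep ASCII letters and non-initial digits; escape everything else."""
    out = []
    for pos, ch in enumerate(name):
        keep = ch in string.ascii_letters or (ch in string.digits and pos > 0)
        out.append(ch if keep else "_" + format(ord(ch), "04x"))
    return "".join(out)


def mangle_simple_name(name):
    """Length-prefixed name, with a 'U' marker when escapes are needed."""
    if not needs_escape(name):
        return str(len(name)) + name
    escaped = escape(name)
    return "U" + str(len(escaped)) + escaped


print(mangle_simple_name("Foo"))       # 3Foo
print(mangle_simple_name("X\u0319"))   # U6X_0319
# Method name, "__", class, int parameter, trailing 'U' for the escaped name:
print(escape("M\u002B") + "__" + mangle_simple_name("X\u0319") + "i" + "U")
```

<P>
The last line reproduces the <SAMP>`M_002b__U6X_0319iU'</SAMP> example:
the method name itself is escaped but carries no length prefix.
</P>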
<H3><A NAME="SEC24" HREF="gxxint_toc.html#TOC24">Pointer and reference types</A></H3>
<P>
A C++ pointer type is mangled as <SAMP>`P'</SAMP> followed by the
mangling of the type pointed to.
</P>
<P>
A C++ reference type is mangled as <SAMP>`R'</SAMP> followed by the
mangling of the type referenced.
</P>
<P>
A Java object reference type is equivalent
to a C++ pointer parameter, so we mangle such a parameter type
as <SAMP>`P'</SAMP> followed by the mangling of the class name.
</P>
<H3><A NAME="SEC25" HREF="gxxint_toc.html#TOC25">Qualified names</A></H3>
<P>
Both C++ and Java allow a class to be lexically nested inside another
class. C++ also supports namespaces (not yet implemented by G++).
Java also supports packages.
</P>
<P>
These are all mangled the same way: First the letter <SAMP>`Q'</SAMP>
indicates that we are emitting a qualified name.
That is followed by the number of parts in the qualified name.
If that number is 9 or less, it is emitted with no delimiters.
Otherwise, an underscore is written before and after the count.
Then follows each part of the qualified name, as described above.
</P>
<P>
For example <CODE>Foo::\u0319::Bar</CODE> is encoded as
<SAMP>`Q33FooU5_03193Bar'</SAMP>.
</P>
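<P>
A short Python sketch of the counting rule, assuming each part has
already been mangled as a simple name. The helper name
<CODE>mangle_qualified</CODE> is hypothetical.
</P>

```python
def mangle_qualified(part_codes):
    """Prefix already-mangled name parts with 'Q' and the part count.

    Counts of 9 or less are written bare; larger counts are wrapped
    in underscores. Hypothetical helper for illustration only.
    """
    count = len(part_codes)
    count_text = str(count) if count <= 9 else "_" + str(count) + "_"
    return "Q" + count_text + "".join(part_codes)


print(mangle_qualified(["3Foo", "U5_0319", "3Bar"]))  # Q33FooU5_03193Bar
print(mangle_qualified(["1a"] * 10))                  # Q_10_1a1a1a1a1a1a1a1a1a1a
```

<P>
The underscores around a multi-digit count keep it from running into
the length prefix of the first part.
</P>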
<H3><A NAME="SEC26" HREF="gxxint_toc.html#TOC26">Templates</A></H3>
<P>
A class template instantiation is encoded as the letter <SAMP>`t'</SAMP>,
followed by the encoding of the template name, followed by
the number of template parameters, followed by the encoding of the template
parameters. If a template parameter is a type, it is written
as a <SAMP>`Z'</SAMP> followed by the encoding of the type.
</P>
<P>
A function template specialization (either an instantiation or an
explicit specialization) is encoded by an <SAMP>`H'</SAMP> followed by the
encoding of the template parameters, as described above, followed by
an <SAMP>`_'</SAMP>, the encoding of the argument types of the template function (not the
specialization), another <SAMP>`_'</SAMP>, and the return type. (Like the
argument types, the return type is the return type of the function
template, not the specialization.) Template parameters in the argument
and return types are encoded by an <SAMP>`X'</SAMP> for type parameters, or a
<SAMP>`Y'</SAMP> for constant parameters, and an index indicating their position
in the template parameter list declaration.
</P>
<H3><A NAME="SEC27" HREF="gxxint_toc.html#TOC27">Arrays</A></H3>
<P>
C++ array types are mangled by emitting <SAMP>`A'</SAMP>, followed by
the length of the array, followed by an <SAMP>`_'</SAMP>, followed by
the mangling of the element type. Of course, array parameter
types normally decay into pointer types, so you
don't see this.
</P>
<P>
Java arrays are objects. A Java type <CODE>T[]</CODE> is mangled
as if it were the C++ type <CODE>JArray&#60;T&#62;</CODE>.
For example <CODE>java.lang.String[]</CODE> is encoded as
<SAMP>`Pt6JArray1ZPQ34java4lang6String'</SAMP>.
</P>
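<P>
The <CODE>java.lang.String[]</CODE> example combines the pointer,
qualified-name, and class-template rules. The Python sketch below
rebuilds it step by step. The helper names are hypothetical, only type
parameters are handled, and the parameter count is assumed to be a
single digit.
</P>

```python
def pointer(type_code):
    """'P' followed by the mangling of the type pointed to."""
    return "P" + type_code


def template_instance(template_name, type_param_codes):
    """'t', the length-prefixed template name, the parameter count,
    then 'Z' plus the mangling of each type parameter.
    Hypothetical helper; constant parameters are not modelled.
    """
    params = "".join("Z" + code for code in type_param_codes)
    name_code = str(len(template_name)) + template_name
    return "t" + name_code + str(len(type_param_codes)) + params


# A Java object reference is mangled like a C++ pointer to the class.
string_ref = pointer("Q34java4lang6String")

# String[] is treated as a reference to JArray instantiated on String.
string_array = pointer(template_instance("JArray", [string_ref]))

print(string_array)  # Pt6JArray1ZPQ34java4lang6String
```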
<H3><A NAME="SEC28" HREF="gxxint_toc.html#TOC28">Table of demangling code characters</A></H3>
<P>
The following special characters are used in mangling:
</P>
<DL COMPACT>
<DT><SAMP>`A'</SAMP>
<DD>
Indicates a C++ array type.
<DT><SAMP>`b'</SAMP>
<DD>
Encodes the C++ <CODE>bool</CODE> type,
and the Java <CODE>boolean</CODE> type.
<DT><SAMP>`c'</SAMP>
<DD>
Encodes the C++ <CODE>char</CODE> type, and the Java <CODE>byte</CODE> type.
<DT><SAMP>`C'</SAMP>
<DD>
A modifier to indicate a <CODE>const</CODE> type.
Also used to indicate a <CODE>const</CODE> member function
(in which case it precedes the encoding of the method's class).
<DT><SAMP>`d'</SAMP>
<DD>
Encodes the C++ and Java <CODE>double</CODE> types.
<DT><SAMP>`e'</SAMP>
<DD>
Indicates extra unknown arguments <CODE>...</CODE>.
<DT><SAMP>`f'</SAMP>
<DD>
Encodes the C++ and Java <CODE>float</CODE> types.
<DT><SAMP>`F'</SAMP>
<DD>
Used to indicate a function type.
<DT><SAMP>`H'</SAMP>
<DD>
Used to indicate a template function.
<DT><SAMP>`i'</SAMP>
<DD>
Encodes the C++ and Java <CODE>int</CODE> types.
<DT><SAMP>`J'</SAMP>
<DD>
Indicates a complex type.
<DT><SAMP>`l'</SAMP>
<DD>
Encodes the C++ <CODE>long</CODE> type.
<DT><SAMP>`P'</SAMP>
<DD>
Indicates a pointer type. Followed by the type pointed to.
<DT><SAMP>`Q'</SAMP>
<DD>
Used to mangle qualified names, which arise from nested classes.
Should also be used for namespaces (?).
In Java used to mangle package-qualified names, and inner classes.
<DT><SAMP>`r'</SAMP>
<DD>
Encodes the GNU C++ <CODE>long double</CODE> type.
<DT><SAMP>`R'</SAMP>
<DD>
Indicates a reference type. Followed by the referenced type.
<DT><SAMP>`s'</SAMP>
<DD>
Encodes the C++ and Java <CODE>short</CODE> types.
<DT><SAMP>`S'</SAMP>
<DD>
A modifier that indicates that the following integer type is signed.
Only used with <CODE>char</CODE>.
Also used as a modifier to indicate a static member function.
<DT><SAMP>`t'</SAMP>
<DD>
Indicates a template instantiation.
<DT><SAMP>`T'</SAMP>
<DD>
A back reference to a previously seen type.
<DT><SAMP>`U'</SAMP>
<DD>
A modifier that indicates that the following integer type is unsigned.
Also used to indicate that the following class or namespace name
is encoded using Unicode-mangling.
<DT><SAMP>`v'</SAMP>
<DD>
Encodes the C++ and Java <CODE>void</CODE> types.
<DT><SAMP>`V'</SAMP>
<DD>
A modifier for a <CODE>volatile</CODE> type or method.
<DT><SAMP>`w'</SAMP>
<DD>
Encodes the C++ <CODE>wchar_t</CODE> type, and the Java <CODE>char</CODE> type.
<DT><SAMP>`x'</SAMP>
<DD>
Encodes the GNU C++ <CODE>long long</CODE> type, and the Java <CODE>long</CODE> type.
<DT><SAMP>`X'</SAMP>
<DD>
Encodes a template type parameter, when part of a function type.
<DT><SAMP>`Y'</SAMP>
<DD>
Encodes a template constant parameter, when part of a function type.
<DT><SAMP>`Z'</SAMP>
<DD>
Used for template type parameters.
</DL>
<P>
The letters <SAMP>`G'</SAMP>, <SAMP>`M'</SAMP>, <SAMP>`O'</SAMP>, and <SAMP>`p'</SAMP>
also seem to be used for obscure purposes ...
</P>
<P><HR><P>
Go to the <A HREF="gxxint_1.html">first</A>, <A HREF="gxxint_14.html">previous</A>, <A HREF="gxxint_16.html">next</A>, <A HREF="gxxint_16.html">last</A> section, <A HREF="gxxint_toc.html">table of contents</A>.
</BODY>
</HTML>