@setfilename internals.info
@node Assembler Internals
@chapter Assembler Internals

This documentation is not ready for prime time yet. Not even close. It's not
so much documentation as random blathering of mine intended to be notes to
myself that may eventually be turned into real documentation.

I take no responsibility for any negative effect it may have on your
professional, personal, or spiritual life. Read it at your own risk. Caveat
emptor. Delete before reading. Abandon all hope, ye who enter here.

However, enhancements will be gratefully accepted.

* Data types:: Data types

BFD, MANY_SECTIONS, BFD_HEADERS

@cindex internals, data types

@cindex internals, symbols
@cindex symbols, internal

... `local' symbols ... flags ...

The definition for @code{struct symbol}, also known as @code{symbolS}, is
located in @file{struc-symbol.h}. Symbol structures can contain the following
fields:

This is an @code{expressionS} that describes the value of the symbol. It might
refer to another symbol; if so, its true value may not be known until
the symbol is finally resolved.

More generally, however, ... undefined? ... or an offset from the start of a
frag pointed to by the @code{sy_frag} field.

This field is non-zero if the symbol's value has been completely resolved. It
is used during the final pass over the symbol table.

This field is used to detect loops while resolving the symbol's value.

@item sy_used_in_reloc
This field is non-zero if the symbol is used by a relocation entry. If a local
symbol is used in a relocation entry, it must be possible to redirect those
relocations to other symbols, or this symbol cannot be removed from the final
symbol table.

These pointers to other @code{symbolS} structures describe a singly or doubly
linked list. (If @code{SYMBOLS_NEED_BACKPOINTERS} is not defined, the
@code{sy_previous} field will be omitted.) These fields should be accessed
with @code{symbol_next} and @code{symbol_previous}.

This points to the @code{fragS} that this symbol is attached to.

Whether the symbol is used as an operand or in an expression. Note: Not all of
the backends keep this information accurate; backends which use this bit are
responsible for setting it when a symbol is used in backend routines.

If @code{BFD_ASSEMBLER} is defined, this points to the @code{asymbol} that will
be used in writing the object file.

(Only used if @code{BFD_ASSEMBLER} is not defined.) This is the position of
the symbol's name in the symbol table of the object file. On some formats,
this will start at position 4, with position 0 reserved for unnamed symbols.
This field is not used until @code{write_object_file} is called.

(Only used if @code{BFD_ASSEMBLER} is not defined.) This is the
format-specific symbol structure, as it would be written into the object file.

(Only used if @code{BFD_ASSEMBLER} is not defined.) This is a 24-bit symbol
number, for use in constructing relocation table entries.

This format-specific data is of type @code{OBJ_SYMFIELD_TYPE}. If no macro by
that name is defined in @file{obj-format.h}, this field is not defined.

This processor-specific data is of type @code{TC_SYMFIELD_TYPE}. If no macro
by that name is defined in @file{targ-cpu.h}, this field is not defined.

@item TARGET_SYMBOL_FIELDS
If this macro is defined, it defines additional fields in the symbol structure.
This macro is obsolete, and should be replaced when possible by uses of
@code{OBJ_SYMFIELD_TYPE} and @code{TC_SYMFIELD_TYPE}.

Access with @code{S_SET_SEGMENT}, @code{S_SET_VALUE}, @code{S_GET_VALUE},
@code{S_GET_SEGMENT}, etc., etc.
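
The way the ``completely resolved'' flag and the loop-detection flag work
together can be sketched roughly as follows. This is only an illustration
under simplifying assumptions: @code{struct demo_symbol} and
@code{demo_resolve} are hypothetical names, not the assembler's own, and a
symbol's value is modeled as an optional base symbol plus a constant rather
than as a full @code{expressionS}.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for symbolS: a value that may be
   relative to another symbol, plus the two bookkeeping flags.  */
struct demo_symbol
{
  struct demo_symbol *base;   /* symbol this value is relative to, or NULL */
  long addend;                /* constant part of the value */
  int resolved;               /* non-zero once ADDEND is final */
  int resolving;              /* non-zero while resolution is in progress */
};

/* Reduce SYM to a constant.  Returns 0 on success, or -1 if the
   definition loops back on itself.  */
int
demo_resolve (struct demo_symbol *sym)
{
  assert (sym != NULL);
  if (sym->resolved)
    return 0;
  if (sym->resolving)
    return -1;                /* re-entered SYM: a definition loop */

  sym->resolving = 1;
  if (sym->base != NULL)
    {
      if (demo_resolve (sym->base) != 0)
        {
          sym->resolving = 0;
          return -1;
        }
      sym->addend += sym->base->addend;
      sym->base = NULL;
    }
  sym->resolving = 0;
  sym->resolved = 1;
  return 0;
}
```

A symbol defined as another symbol plus 4 resolves to that symbol's value
plus 4; two symbols defined in terms of each other make the function return
-1 instead of recursing forever.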

@subsection Expressions
@cindex internals, expressions
@cindex expressions, internal

Expressions are stored as a combination of operator, symbols, blah.

@cindex internals, fixups

@cindex internals, frags

The frag is the basic unit for storing section contents.

The address of the frag. This is not set until the assembler rescans the list
of all frags after the entire input file is parsed. The function
@code{relax_segment} fills in this field.

Pointer to the next frag in this (sub)section.

Fixed number of characters we know we're going to emit to the output file. May
be zero.

Variable number of characters we may output, after the initial @code{fr_fix}
characters. May be zero.

Points to the lowest-addressed byte of the opcode, for use in relaxation.

Holds line-number info.

Relaxation state. This field indicates the interpretation of @code{fr_offset},
@code{fr_symbol} and the variable-length tail of the frag, as well as the
treatment it gets in various phases of processing. It does not affect the
initial @code{fr_fix} characters; they are always supposed to be output
verbatim (fixups aside). See below for specific values this field can have.

Relaxation substate. If the macro @code{md_relax_frag} isn't defined, this is
assumed to be an index into @code{md_relax_table} for the generic relaxation
code to process. (@xref{Relaxation}.) If @code{md_relax_frag} is defined,
this field is available for any use by the CPU-specific code.

These fields are not used yet. They are intended to keep track of the
alignment of the current frag within its section, even if the exact offset
isn't known. In many cases, we should be able to avoid creating extra frags
when @code{.align} directives are given; instead, the number of bytes needed
may be computable when the @code{.align} directive is processed. Hmm. Is this
the right place for these, or should they be in the @code{frchainS} structure?

@item fr_pcrel_adjust
These fields are only used in the NS32k configuration. But since @code{struct
frag} is defined before the CPU-specific header files are included, they must
unconditionally be defined.

Declared as a one-character array, this last field grows arbitrarily large to
hold the actual contents of the frag.
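
As a rough illustration of a structure whose last member holds the contents,
here is a cut-down sketch. It is not the real @code{fragS}: the names
@code{struct demo_lit_frag} and @code{demo_frag_new} are invented, most fields
are left out, and a C99 flexible array member is used in place of the
one-character array described above.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical cut-down frag: only the size fields and the trailing
   contents.  */
struct demo_lit_frag
{
  struct demo_lit_frag *next;  /* next frag in this (sub)section */
  long fix;                    /* bytes known to be emitted verbatim */
  long var;                    /* bytes in the variable tail */
  char literal[];              /* contents: FIX bytes, then VAR bytes */
};

/* Allocate a frag holding FIX fixed bytes copied from DATA, followed by
   room for VAR variable bytes (zeroed).  Returns NULL if out of memory.  */
struct demo_lit_frag *
demo_frag_new (const char *data, long fix, long var)
{
  struct demo_lit_frag *f;

  assert (fix >= 0 && var >= 0);
  f = malloc (sizeof *f + (size_t) fix + (size_t) var);
  if (f == NULL)
    return NULL;
  f->next = NULL;
  f->fix = fix;
  f->var = var;
  if (fix > 0)
    memcpy (f->literal, data, (size_t) fix);
  memset (f->literal + fix, 0, (size_t) var);
  return f;
}
```

The caller owns the returned block and releases it with @code{free}.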

These are the possible relaxation states, provided in the enumeration type
@code{relax_stateT}, and the interpretations they represent for the other
fields:

The start of the following frag should be aligned on some boundary. In this
frag, @code{fr_offset} is the logarithm (base 2) of the alignment in bytes.
(For example, if alignment on an 8-byte boundary were desired, @code{fr_offset}
would have a value of 3.) The variable characters indicate the fill pattern to
be used. (More than one?)

This indicates that ``broken word'' processing should be done. @xref{Broken
Words,,Broken Words}. If broken word processing is not necessary on the target
machine, this enumerator value will not be defined.

The variable characters are to be repeated @code{fr_offset} times. If
@code{fr_offset} is 0, this frag has a length of @code{fr_fix}.

@item rs_machine_dependent
Displacement relaxation is to be done on this frag. The target is indicated by
@code{fr_symbol} and @code{fr_offset}, and @code{fr_subtype} indicates the
particular machine-specific addressing mode desired. @xref{Relaxation}.

The start of the following frag should be pushed back to some specific offset
within the section. (Some assemblers use the value as an absolute address; the
@sc{gnu} assembler does not handle final absolute addresses, it requires that
the linker set them.) The offset is given by @code{fr_symbol} and
@code{fr_offset}; one character from the variable-length tail is used as the
fill character.

A chain of frags is built up for each subsection. The data structure
describing a chain is called a @code{frchainS}, and contains the following
fields:

Points to the first frag in the chain. May be null if there are no frags in
this chain.

Points to the last frag in the chain, or null if there are none.

Next in the list of @code{frchainS} structures.

Indicates the section this frag chain belongs to.

Subsection (subsegment) number of this frag chain.
@item fix_root, fix_tail
(Defined only if @code{BFD_ASSEMBLER} is defined.) Point to first and last
@code{fixS} structures associated with this subsection.

Not currently used. Intended to be used for frag allocation for this
subsection. This should reduce frag generation caused by switching sections.

A @code{frchainS} corresponds to a subsection; each section has a list of
@code{frchainS} records associated with it. In most cases, only one subsection
of each section is used, so the list will only be one element long, but any
processing of frag chains should be prepared to deal with multiple chains per
section.

After the input files have been completely processed, and no more frags are to
be generated, the frag chains are joined into one per section for further
processing. After this point, it is safe to operate on one chain per section.
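
Once there is a single chain per section, frag addresses can be assigned in
one walk. The sketch below shows the idea for just two of the relaxation
states, using the size rules given earlier (a fill frag is its fixed part plus
the pattern repeated @code{fr_offset} times; an align frag pads up to a
power-of-two boundary). The types and names are hypothetical, and the real
@code{relax_segment} does a good deal more, including machine-dependent
relaxation.

```c
#include <assert.h>
#include <stddef.h>

enum demo_state { DEMO_FILL, DEMO_ALIGN };

/* Hypothetical frag carrying only what address assignment needs.  */
struct demo_addr_frag
{
  struct demo_addr_frag *next;
  enum demo_state type;
  long address;   /* filled in by the walk below */
  long fix;       /* fixed bytes */
  long var;       /* size of the repeated pattern (DEMO_FILL) */
  long offset;    /* repeat count (DEMO_FILL) or log2 alignment (DEMO_ALIGN) */
};

/* Assign an address to every frag on the chain starting at HEAD and
   return the total size of the chain.  */
long
demo_assign_addresses (struct demo_addr_frag *head)
{
  long address = 0;
  struct demo_addr_frag *f;

  for (f = head; f != NULL; f = f->next)
    {
      assert (f->offset >= 0);
      f->address = address;
      address += f->fix;
      if (f->type == DEMO_FILL)
        address += f->var * f->offset;
      else
        {
          /* Round up so the following frag starts on the boundary.  */
          long boundary = 1L << f->offset;
          address = (address + boundary - 1) & ~(boundary - 1);
        }
    }
  return address;
}
```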

@subsection Broken Words
@cindex internals, broken words

@cindex promises, promises

The ``broken word'' idea derives from the fact that some compilers, including
@code{gcc}, will sometimes emit switch tables specifying 16-bit @code{.word}
displacements to branch targets, and branch instructions that load entries from
that table to compute the target address. If this is done on a 32-bit machine,
there is a chance (at least with really large functions) that the displacement
will not fit in 16 bits. Thus the ``broken word'' idea is well named, since
there is an implied promise that the 16-bit field will in fact hold the
specified displacement.

If the ``broken word'' processing is enabled, and a situation like this is
encountered, the assembler will insert a jump instruction into the instruction
stream, close enough to be reached with the 16-bit displacement. This jump
instruction will transfer to the real desired target address. Thus, as long as
the @code{.word} value really is used as a displacement to compute an address
to jump to, the net effect will be correct (minus a very small efficiency
cost). If @code{.word} directives with label differences for values are used
for other purposes, however, things may not work properly. I think there is a
command-line option to turn on warnings when a broken word is discovered.

This code is turned off by the @code{WORKING_DOT_WORD} macro. It isn't needed
if @code{.word} emits a value large enough to contain an address (or, more
correctly, any possible difference between two addresses).

@section What Happens?

Blah blah blah, initialization, argument parsing, file reading, whitespace
munging, opcode parsing and lookup, operand parsing. Now it's time to write
the output file.

In @code{BFD_ASSEMBLER} mode, processing of relocations and symbols and
creation of the output file is initiated by calling @code{write_object_file}.

@node Target Dependent Definitions
@section Target Dependent Definitions

@subsection Format-specific definitions

@defmac obj_sec_sym_ok_for_reloc (section)
(@code{BFD_ASSEMBLER} only.)
Is it okay to use this section's section-symbol in a relocation entry? If not,
a new internal-linkage symbol is generated and emitted if such a relocation
entry is needed. (Default: Always use a new symbol.)
@end defmac

@defmac obj_adjust_symtab
(@code{BFD_ASSEMBLER} only.)
If this macro is defined, it is invoked just before setting the symbol table of
the output BFD. Any finalizing changes needed in the symbol table should be
done here. For example, in the COFF support, if there is no @code{.file}
symbol defined already, one is generated at this point. If no such adjustments
are needed, this macro need not be defined.
@end defmac

@defmac EMIT_SECTION_SYMBOLS
(@code{BFD_ASSEMBLER} only.)
Should section symbols be included in the symbol list if they're used in
relocations? Some formats can generate section-relative relocations, and thus
don't need symbols emitted for them. (Default: 1.)
@end defmac

@defmac obj_frob_file
Any final cleanup needed before writing out the BFD may be done here. For
example, ECOFF formats (and MIPS ELF format) may do some work on the MIPS-style
symbol table with its integrated debug information. The symbol table should
not be modified at this time.
@end defmac

@subsection CPU-specific definitions

@subsubsection Relaxation

If @code{md_relax_frag} isn't defined, the assembler will perform some
relaxation on @code{rs_machine_dependent} frags based on the frag subtype and
the displacement to some specified target address. The basic idea is that many
machines have different addressing modes for instructions that can specify
different ranges of values, with successive modes able to access wider ranges,
including the entirety of the previous range. Smaller ranges are assumed to be
more desirable (perhaps the instruction requires one word instead of two or
three); if this is not the case, don't describe the smaller-range, inferior
mode.

The @code{fr_subtype} field of a frag is an index into a CPU-specific
relaxation table. That table entry indicates the range of values that can be
stored, the number of bytes that will have to be added to the frag to
accommodate the addressing mode, and the index of the next entry to examine if
the value to be stored is outside the range accessible by the current
addressing mode. The @code{fr_symbol} field of the frag indicates what symbol
is to be accessed; the @code{fr_offset} field is added in.

If the @code{fr_pcrel_adjust} field is set, which currently should only happen
for the NS32k family, the @code{TC_PCREL_ADJUST} macro is called on the frag to
compute an adjustment to be made to the displacement.

The value fitted by the relaxation code is always assumed to be a displacement
from the current frag. (More specifically, from @code{fr_fix} bytes into the
frag.) This seems kinda silly. What about fitting small absolute values? I
suppose @code{md_assemble} is supposed to take care of that, but if the operand
is a difference between symbols, it might not be able to, if the difference was
not yet known.

The end of the relaxation sequence is indicated by a ``next'' value of 0. This
is kinda silly too, since it means that the first entry in the table can't be
used. I think -1 would make a more logical sentinel value.

The table @code{md_relax_table} from @file{targ-cpu.c} describes the relaxation
modes available. Currently this must always be provided, even on machines for
which this type of relaxation isn't possible or practical. Probably fewer than
half the machines gas supports use it; it ought to be made conditional on some
CPU-specific macro. Also, the table currently must be declared @code{const}; on
some machines, though, it might make sense to keep it writable, so it can be
modified depending on which CPU of a family is specified. For example, in the
m68k family, the 68020 has some addressing modes that are not available on the
68000.

The relaxation table type contains these fields:

@item long rlx_forward
Forward reach, must be non-negative.
@item long rlx_backward
Backward reach, must be zero or negative.

Length in bytes of this addressing mode.

Index of the next-longer relax state, or zero if there is no ``next''
relax state.

The relaxation is done in @code{relax_segment} in @file{write.c}. The
difference in the length fields between the original mode and the one finally
chosen by the relaxing code is taken as the size by which the current frag will
be increased in size. For example, if the initial relaxing mode has a length
of 2 bytes, and because of the size of the displacement, it gets upgraded to a
mode with a size of 6 bytes, it is assumed that the frag will grow by 4 bytes.
(The initial two bytes should have been part of the fixed portion of the frag,
since it is already known that they will be output.) This growth must be
effected by @code{md_convert_frag}; it should increase the @code{fr_fix} field
by the appropriate size, and fill in the appropriate bytes of the frag.
(Enough space for the maximum growth should have been allocated in the call to
@code{frag_var} as the second argument.)

If relocation records are needed, they should be emitted by
@code{md_estimate_size_before_relax}.

These are the machine-specific definitions associated with relaxation:

@deftypefun int md_estimate_size_before_relax (fragS *@var{frag}, segT @var{sec})
This function should examine the target symbol of the supplied frag and correct
the @code{fr_subtype} of the frag if needed. When this function is called, if
the symbol has not yet been defined, it will not become defined later; however,
its value may still change if the section it is in gets relaxed.

Usually, if the symbol is in the same section as the frag (given by the
@var{sec} argument), the narrowest likely relaxation mode is stored in
@code{fr_subtype}, and that's that.

If the symbol is undefined, or in a different section (and therefore moveable
to an arbitrarily large distance), the largest available relaxation mode is
specified, @code{fix_new} is called to produce the relocation record,
@code{fr_fix} is increased to include the relocated field (remember, this
storage was allocated when @code{frag_var} was called), and @code{frag_wane} is
called to convert the frag to an @code{rs_fill} frag with no variant part.
Sometimes changing addressing modes may also require rewriting the instruction.
It can be accessed via @code{fr_opcode} or @code{fr_fix}.

Sometimes @code{fr_var} is increased instead, and @code{frag_wane} is not
called. I'm not sure, but I think this is to keep @code{fr_fix} referring to
an earlier byte, and @code{fr_subtype} set to @code{rs_machine_dependent} so
that @code{md_convert_frag} will get called.
@end deftypefun
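
The two main cases handled by @code{md_estimate_size_before_relax} can be
modeled as below. Everything here is hypothetical scaffolding: the structure,
the mode numbers, and @code{demo_estimate} are invented, and the calls to
@code{fix_new} and @code{frag_wane} are represented by a counter and a flag
rather than made for real.

```c
#include <assert.h>
#include <stddef.h>

enum demo_mode { DEMO_NARROW = 1, DEMO_WIDE = 2 };

/* Hypothetical frag with just the state this decision touches.  */
struct demo_est_frag
{
  int subtype;   /* chosen relaxation mode */
  long fix;      /* size of the fixed part */
  int variant;   /* non-zero while relaxation may still resize the frag */
  int relocs;    /* relocation records requested so far */
};

/* Pick an initial mode for FRAG.  WIDE_SIZE is the number of bytes the
   widest mode adds to the instruction.  */
void
demo_estimate (struct demo_est_frag *frag, int sym_defined,
               int same_section, long wide_size)
{
  assert (frag != NULL && frag->variant);
  if (sym_defined && same_section)
    {
      frag->subtype = DEMO_NARROW;   /* relaxation may widen it later */
      return;
    }
  frag->subtype = DEMO_WIDE;
  frag->relocs++;                    /* stands in for calling fix_new */
  frag->fix += wide_size;            /* relocated field joins the fixed part */
  frag->variant = 0;                 /* stands in for calling frag_wane */
}
```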

@deftypevar relax_typeS md_relax_table []
@end deftypevar

@defmac md_relax_frag (@var{frag})

This macro, if defined, overrides all of the processing described above. It's
only defined for the MIPS target CPU, and there it doesn't do anything; it's
used solely to disable the relaxing code and free up the @code{fr_subtype}
field for use by the CPU-specific code.
@end defmac

Like @code{obj_frob_file}, this macro handles miscellaneous last-minute
cleanup. Currently it is used only by the PowerPC/POWER support, for setting up
a @code{.debug} section. This macro should not cause the symbol table to be
modified.

@node Source File Summary
@section Source File Summary

@subsection File Format Descriptions

The @code{a.out} format is described by @file{obj-aout.*}.

The @code{b.out} format, described by @file{obj-bout.*}, is similar to
@code{a.out} format, except for a few additional fields in the file header
describing section alignment and address.

Originally, @file{obj-coff} was a purely non-BFD version, and
@file{obj-coffbfd} was created to use BFD for low-level byte-swapping. When
the @code{BFD_ASSEMBLER} conversion started, the first COFF target to be
converted was using @file{obj-coff}, and the two files had diverged somewhat,
and I didn't feel like first converting the support of that target over to use
the low-level BFD interface.

So @file{obj-coff} got converted, and to simplify certain things,
@file{obj-coffbfd} got ``merged'' in with a brute-force approach.
Specifically, preprocessor conditionals testing for @code{BFD_ASSEMBLER}
effectively split the @file{obj-coff} files into the two separate versions. It
isn't pretty. They will be merged more thoroughly, and eventually only the
higher-level interface will be used.

All ECOFF configurations use BFD for writing object files.

ELF is a fairly reasonable format, without many of the deficiencies the other
object file formats have. (It's got some of its own, but not as bad as the
others.) All ELF configurations use BFD for writing object files.

This is the format used on VMS. Yes, someone has actually written BFD support
for it. The code hasn't been integrated yet though.

The XCOFF configuration is based on the COFF configuration (using the
higher-level BFD interface). In fact, it uses the same files in the assembler.

This is the old Vax VMS support. It doesn't use BFD.

@subsection Processor Descriptions

Foo: a29k, alpha, h8300, h8500, hppa, i386, i860, i960, m68k, m88k, mips,
ns32k, ppc, sh, sparc, tahoe, vax, z8k.

The operand syntax handling is atrocious. There is no clear specification of
the operand syntax. I'm looking into using a Bison grammar to replace much of
it.

Operands on the 68k series processors can have two displacement values
specified, plus a base register and a (possibly scaled) index register of which
only some bits might be used. Thus a single 68k operand requires up to two
expressions, two register numbers, and size and scale factors. The
@code{struct m68k_op} type also includes a field indicating the mode of the
operand, and an @code{error} field indicating a problem encountered while
parsing the operand.

An instruction on the 68k may have up to 6 operands, although most of them have
to be simple register operands. Up to 11 (16-bit) words may be required to
express the instruction.

A @code{struct m68k_exp} expression contains an @code{expressionS}, pointers to
the first and last characters of the input that produced the expression, an
indication of the section to which the expression belongs, and a size field.
I'm not sure what the size field describes.

@subsubheading M68k addressing modes

Many instructions use the low six bits of the first instruction word to
describe the location of the operand, or how to compute the location. The six
bits are typically split into three for a ``mode'' and three for a ``register''
value. The interpretation of these values is as follows:

@example
Mode    Register        Operand addressing mode
1       An              address register
3       An              indirect, post-increment
4       An              indirect, pre-decrement
5       An              indirect with displacement
6       An              indirect with optional displacement and index;
                        may involve multiple indirections and two
                        displacements
7       0               16-bit address follows
7       1               32-bit address follows
7       2               PC indirect with displacement
7       3               PC indirect with optional displacements and index
7       4               immediate 16- or 32-bit
@end example
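
Pulling the two three-bit fields out of an instruction word is simple enough
to show directly. In this sketch the mode is taken from bits 5-3 and the
register from bits 2-0, which is the usual 68k layout for this field; the
names @code{struct demo_ea}, @code{demo_decode_ea} and @code{demo_is_mode7}
are made up for the example.

```c
#include <assert.h>

/* Hypothetical holder for the two fields.  */
struct demo_ea
{
  unsigned int mode;   /* bits 5-3 of the instruction word */
  unsigned int reg;    /* bits 2-0 of the instruction word */
};

/* Split the low six bits of the first instruction word OPWORD.  */
struct demo_ea
demo_decode_ea (unsigned int opword)
{
  struct demo_ea ea;

  assert (opword <= 0xffffu);   /* instruction words are 16 bits */
  ea.mode = (opword >> 3) & 7u;
  ea.reg = opword & 7u;
  return ea;
}

/* Non-zero for mode 7, where the register field selects a sub-mode
   (7.0, 7.1, ...) instead of naming a register.  */
int
demo_is_mode7 (struct demo_ea ea)
{
  return ea.mode == 7u;
}
```

For a word whose low six bits are 0x39 this gives mode 7, register 1, which
the table lists as ``32-bit address follows''.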

On the 68000 and 68010, support for modes 6 and 7.3 is incomplete; the
displacement must fit in 8 bits, and no scaling or index suppression is
permitted.

@subsubheading M68k relaxation modes

The relaxation modes used on the 68k are:

Case @samp{g} except when @code{BCC68000} is applicable.

Coprocessor branches.

Mode 7.2 -- program counter indirect with 16-bit displacement. This is
available on all processors. Widens to 32-bit absolute. Used only if the
original code used @code{ABSL} mode, and the CPU is not a 68000 or 68010.
(Why? Those processors support mode 7.2.)

A conditional branch instruction, on the 68000 or 68010. These instructions
support only 16-bit displacements on these processors. If a larger
displacement is needed, the condition is negated and turned into a short branch
around a jump instruction to the specified target. This jump will have a
long absolute addressing mode.

Like @code{BCC68000}, but for @code{dbCC} (decrement and branch on condition)
instructions.

Not currently used?? Short form is mode 7.2 (program counter indirect, 16-bit
displacement); long form is 7.3/0x0170 (program counter indirect, suppressed
index register, 32-bit displacement). Used in progressive-930331 for mode
@code{AOFF} with a PC-relative addressing mode and a displacement that won't
fit in 16 bits, or which is variable and is not specified to have a size other

Newly added. PC indirect with index. An 8-bit displacement is supported on
the 68000 and 68010, wider displacements on later processors.

Well, actually, I haven't added it yet. I need to soon, though. It fixes a
bug reported by a customer.

@subsection ``Emulation'' Descriptions

These are the @file{te-*.h} files.

@subsection Warning and Error Messages

@deftypefun int had_warnings (void)
@deftypefunx int had_errors (void)

Returns non-zero if any warnings or errors, respectively, have been printed
during this invocation.
@end deftypefun

@deftypefun void as_perror (const char *@var{gripe}, const char *@var{filename})

Displays a BFD or system error, then clears the error status.
@end deftypefun

@deftypefun void as_tsktsk (const char *@var{format}, ...)
@deftypefunx void as_warn (const char *@var{format}, ...)
@deftypefunx void as_bad (const char *@var{format}, ...)
@deftypefunx void as_fatal (const char *@var{format}, ...)

These functions display messages about something amiss with the input file, or
internal problems in the assembler itself. The current file name and line
number are printed, followed by the supplied message, formatted using
@code{vfprintf}, and a final newline.

An error indicated by @code{as_bad} will result in a non-zero exit status when
the assembler has finished. Calling @code{as_fatal} will result in immediate
termination of the assembler process.
@end deftypefun
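
The general shape of such message functions, and of the counters behind
@code{had_warnings} and @code{had_errors}, can be sketched as follows. None of
this is the assembler's own code: the @code{demo_} names, the fixed file name
and line number, and the exact layout of the printed prefix are all
assumptions made for the example.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>

static int demo_warning_count;
static int demo_error_count;
static const char *demo_file = "input.s";   /* assumed current file */
static unsigned int demo_line = 1;          /* assumed current line */

/* Print "file:line: KIND: message" and a newline on stderr.  */
static void
demo_report (const char *kind, const char *format, va_list args)
{
  assert (format != NULL);
  fprintf (stderr, "%s:%u: %s: ", demo_file, demo_line, kind);
  vfprintf (stderr, format, args);
  fputc ('\n', stderr);
}

/* Report a warning; the run can still succeed.  */
void
demo_warn (const char *format, ...)
{
  va_list args;

  va_start (args, format);
  demo_report ("Warning", format, args);
  va_end (args);
  demo_warning_count++;
}

/* Report an error; a caller would use the count to pick the exit status.  */
void
demo_bad (const char *format, ...)
{
  va_list args;

  va_start (args, format);
  demo_report ("Error", format, args);
  va_end (args);
  demo_error_count++;
}

int demo_had_warnings (void) { return demo_warning_count; }
int demo_had_errors (void) { return demo_error_count; }
```

A driver built on this would keep assembling after @code{demo_bad} and return
a non-zero status at the end if @code{demo_had_errors} is non-zero.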

@deftypefun void as_warn_where (char *@var{file}, unsigned int @var{line}, const char *@var{format}, ...)
@deftypefunx void as_bad_where (char *@var{file}, unsigned int @var{line}, const char *@var{format}, ...)

These variants permit specification of the file name and line number, and are
used when problems are detected when reprocessing information saved away when
processing some earlier part of the file. For example, fixups are processed
after all input has been read, but messages about fixups should refer to the
original filename and line number that they are applicable to.
@end deftypefun

@deftypefun void fprint_value (FILE *@var{file}, valueT @var{val})
@deftypefunx void sprint_value (char *@var{buf}, valueT @var{val})

These functions are helpful for converting a @code{valueT} value into printable
format, in case it's wider than modes that @code{*printf} can handle. If the
type is narrow enough, a decimal number will be produced; otherwise, it will be
in hexadecimal (FIXME: currently without `0x' prefix). The value itself is not
examined to make this determination.
@end deftypefun
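
The point that the choice depends on the type and not on the value can be
shown in a few lines. This is a sketch only: @code{demo_valueT} is an assumed
stand-in for @code{valueT}, and @code{demo_sprint_value} is a hypothetical
name, not the real function.

```c
#include <assert.h>
#include <stdio.h>

/* Assumed stand-in for valueT; the real type depends on the target.  */
typedef unsigned long long demo_valueT;

/* Write VAL into BUF, which must hold at least 32 characters.  The choice
   between decimal and hexadecimal is made from the width of the type
   alone, never from the value.  */
void
demo_sprint_value (char *buf, demo_valueT val)
{
  assert (buf != NULL);
  if (sizeof (val) <= sizeof (unsigned long))
    sprintf (buf, "%lu", (unsigned long) val);
  else
    sprintf (buf, "%llx", val);
}
```

On a host where @code{unsigned long} is as wide as @code{demo_valueT}, the
value 255 comes out as @samp{255}; on a host where it is narrower, the same
call produces @samp{ff}.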

@node Writing a new target
@section Writing a new target

The test suite is kind of lame for most processors. Often it only checks to
see if a couple of files can be assembled without the assembler reporting any
errors. For more complete testing, write a test which either examines the
assembler listing, or runs @code{objdump} and examines its output. For the
latter, the Tcl procedure @code{run_dump_test} may come in handy. It takes the
base name of a file, and looks for @file{@var{file}.d}. This file should
contain as its initial lines a set of variable settings in @samp{#} comments,
in the form:

@example
#@var{varname}: @var{value}
@end example

The @var{varname} may be @code{objdump}, @code{nm}, or @code{as}, in which case
it specifies the options to be passed to the specified programs. Exactly one
of @code{objdump} or @code{nm} must be specified, as that also specifies which
program to run after the assembler has finished. If @var{varname} is
@code{source}, it specifies the name of the source file; otherwise,
@file{@var{file}.s} is used. If @var{varname} is @code{name}, it specifies the
name of the test to be used in the @code{pass} or @code{fail} messages.
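
To make the header format concrete, here is one way the variable-setting lines
could be picked apart. The real procedure is written in Tcl; this C version is
purely illustrative, the function names are invented, and details such as
skipping blanks after the colon are assumptions.

```c
#include <assert.h>
#include <string.h>

/* Split a header line of the form "#name: value".  On success, copy the
   name into NAME (at most NAME_SIZE - 1 characters), point *VALUE at the
   text after the colon and any following blanks, and return 1.  Return 0
   if LINE is not a variable setting.  */
int
demo_parse_setting (const char *line, char *name, size_t name_size,
                    const char **value)
{
  const char *colon;
  size_t len;

  assert (line != NULL && name != NULL && value != NULL && name_size > 0);
  if (line[0] != '#')
    return 0;
  colon = strchr (line + 1, ':');
  if (colon == NULL)
    return 0;
  len = (size_t) (colon - (line + 1));
  if (len == 0 || len >= name_size)
    return 0;
  memcpy (name, line + 1, len);
  name[len] = '\0';
  colon++;
  while (*colon == ' ' || *colon == '\t')
    colon++;
  *value = colon;
  return 1;
}

/* Exactly one of "objdump" or "nm" must have been given.  */
int
demo_dump_tool_ok (int saw_objdump, int saw_nm)
{
  return (saw_objdump != 0) != (saw_nm != 0);
}
```

A line such as @samp{#objdump: -dr} yields the name @samp{objdump} and the
value @samp{-dr}; a line that does not start with @samp{#} is left for the
regular-expression matching stage.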

The non-commented parts of the file are interpreted as regular expressions, one
per line. Blank lines in the @code{objdump} or @code{nm} output are skipped,
as are blank lines in the @code{.d} file; the other lines are tested to see if
the regular expression matches the program output. If it does not, the test
fails.

Note that this means the tests must be modified if the @code{objdump} output
format changes.