Hardware and Software Concepts:Compiling, Linking and Loading
Compiling, Linking and Loading
Before a program written in a high-level language can execute, it must be translated into machine language, linked with various other machine-language programs on which it depends and loaded into memory. In this section we consider how pro- grams written in high-level languages are compiled into machine-language code and we describe how linkers and loaders prepare compiled code for execution.40
Compiling
Although each type of computer can understand only its own machine language, nearly all programs are written in high-level languages. The first stage in the process of creating executable programs is compiling the high-level programming language code to machine language. A compiler accepts source code, which is written in a high-level language, as input and returns object code, containing the machine-language instructions to execute, as output. Nearly all commercially available pro- grams are delivered as object code, and some distributions (i.e., open-source software) also include the source code.41
The compiling process can be divided into several phases; one view of compil- ing is presented in Fig. 2.8. Each phase modifies the program so that it can be interpreted by the next phase, until the program has been translated into machine code. First, the source code is passed to the lexer (also called lexical analyzer or scanner), which separates the characters of a program’s source into tokens. Examples of tokens include keywords (e.g., if, else and int), identifiers (e.g., named variables and constants), operators (e.g., -, +, * and /) and punctuation (e.g., semicolons).
The lexer passes this stream of tokens to the parser (also called the syntax analyzer), which groups the tokens into syntactically correct statements. The inter-
mediate code generator converts this syntactic structure into a stream of simple instructions that resemble assembly language (although it does not specify the registers used for each operation). The optimizer attempts to improve the code’s execution efficiency and reduce the program’s memory requirements. In the final phase, the code generator produces the object file containing the machine-language instructions.42, 43
Self Review
1. What is the difference between compiling and assembling?
2. Could a Java program run directly on a physical machine instead of on a virtual machine?
Ans: 1) The assembly process simply translates assembly-language instructions into machine language.A compiler translates high-level language code into machine-language code and may also optimize the code. 2) A Java program could run on a physical machine by using a compiler that translates Java source code or bytecode into the corresponding machine language.
Linking
Often, programs consist of several independently developed subprograms, called modules. Functions to perform common computer routines such as I/O manipulations or random number generation are packaged into precompiled modules called libraries. Linking is the process of integrating the various modules referenced by a program into a single executable unit.
Input to the linker can include object modules, load modules and control statements, such as the location of referenced library files.44
The linker often is provided with several object files that form a single pro- gram. These object files typically specify the locations of data and instructions using addresses that are relative to the beginning of each file, called relative addresses.
In Fig. 2.10, symbol X in object module A and symbol Y in object module B have the same relative address in their respective modules. The linker must modify these addresses so that they do not reference invalid data or instructions when the modules are combined to form a linked program. Relocating addresses ensures that each statement is uniquely identified by an address within a file. When an address is modified, all references to it must be updated with the new location. In the resulting load module, X and Y have been relocated to new relative addresses that are unique within the load module. Often, linkers also provide relative addressing in the load module; however, the addresses are assigned such that they are all relative to the beginning of the entire load module.
Fig. 2.11, the external reference to symbol C in object module 2 is resolved with the external name C from object module 1. Once an external reference is paired with the corresponding name in a separate module, the address of the external reference must be modified to reflect this integration.
Often, linkage occurs in two passes. The first pass determines the size of each module and constructs a symbol table. The symbol table associates each symbol (such as a variable name) with an address, so that the linker can locate the reference. On the second pass, the linker assigns addresses to different instruction and data units
and resolves external symbol references.47 Because a load module can become the input of another linking pass, the load module contains a symbol table, in which all symbols are external names. Notice that, in Fig. 2.11, the external reference to symbol Y is not listed in the load module’s symbol table because it has been resolved.
The time at which a program is linked depends on the environment. A pro- gram can be linked at compile time if the programmer includes all necessary code in the source file so that there are no references to external names. This is accomplished by searching for the source code for any externally referenced symbols and placing those symbols in the resulting object file. This method typically is not feasible because many programs rely on shared libraries, which are collections of functions that can be shared between different processes. Many programs can reference the same functions (such as library functions that manipulate input and output streams) without including them in their object code. This type of linking is typically performed after compilation but before loading. As discussed in the Mini Case Study, Mach, shared libraries enable the Mach microkernel to emulate multiple operating systems.
This same process can be performed at load time (see Section 2.8.3, Loading). Linking and loading are sometimes both performed by one application called a
Mini Case Study
The Mach system was developed at Carnegie-Mellon University from 1985–1994 and was based on CMU’s earlier Accent research OS.48 The project was directed by Richard Rashid, now the senior vice president of Microsoft Research.49 Mach was one of the first and best-known microkernel operating systems (see Section 1.13.3, Microkernel Architecture). It has been incorporated into later systems, including Mac OS X, NeXT and OSF/1, and had a strong influence on Windows NT (and ultimately on Windows XP, which we discuss in Chapter 21).50, 51, 52 An open- source implementation, GNU Mach, is used as the kernel for the GNU Hurd operating system, which is currently under development.53
A powerful capability of the Mach microkernel system is that it can emulate other operating systems. Mach achieves this using “transparent shared libraries.”54 A transparent shared library implements the actions for the system calls of the OS it is emulating, then intercepts system calls made by programs that are written to run on the emulated OS.55, 56 The intercepted system calls can then be translated into Mach system calls, and any results are translated back into the emulated form.57, 58 Thus the user’s program does not have to be ported to run on a system running Mach. In addition, any number of these transparent libraries can be in memory, so Mach can emulate multiple operating systems simul- taneously.59
linking loader. Linking can also occur at runtime,a process called dynamic linking. In this case, references to external functions are not resolved until the process is loaded into memory or issues a call to the function. This is useful for large programs that use programs controlled by another party, because a dynamically linked pro- gram does not have to be relinked when a library that it uses is modified.60 Further, because dynamically linked programs are not linked until they are in main memory, shared library code can be stored separately from other program code. Thus, dynamic linking also saves space in secondary storage because only one copy of a shared library is stored for any number of programs that use it.
Self Review
1. How does linking facilitate the development of large programs built by many developers?
2. What is one possible drawback of using a dynamic linker? What is a benefit?
Ans: 1) Linking permits programs to be written as many separate modules. The linker com- bines these modules into a final load module when all pieces of the program have been com- piled. 2) If a library cannot be found during execution, an executing program will be forced to terminate, possibly losing all of the work performed up to that point. A benefit is that pro- grams that are dynamically linked do not have to be relinked when a library changes.
Loading
Once the linker has created the load module, it passes it to a loader program. The loader is responsible for placing each instruction and data unit at a particular memory address, a process called address binding. There are several techniques for loading programs into main memory, most of which are important only for systems that do not support virtual memory. If the load module already specifies physical addresses in memory, the loader simply places the instruction and data units at the addresses specified by the programmer or compiler (assuming the memory addresses are available), a technique called absolute loading. Relocatable loading is performed when the load module contains relative addresses that need to be converted to actual memory addresses. The loader is responsible for requesting a block of memory space in which to place the program, then relocating the program’s addresses to correspond to its location in memory.
In Fig. 2.12, the operating system has allocated the block of memory beginning with memory address 10,000. As the program is loaded, the loader must add 10,000 to each address in the load module. The loader updates the memory address of the vari- able Example in the Fig. 2.12 to 10,450 from its original relative address of 450.
Dynamic loading is a technique that loads program modules upon first use.61 In many virtual memory systems, each process is assigned its own set of virtual addresses starting at zero, so the loader is responsible for loading the program into a valid memory region.
We review the entire compiling, linking and loading process (using load-time address binding) from source code to execution in Fig. 2.13. The programmer begins by writing the source code in some high-level language—in this case, C.
Next, the compiler transforms the foo.c and bar.c source-code files into machine language, creating the object modules foo.o and bar.o. In the code, the programmer has defined variable X in foo.c and variable Y in bar.c; both are located at relative address 100 in their respective object modules. The object modules are placed in secondary storage until requested by the user or another process, at which point the modules must be linked.
In the next step, the linker integrates the two modules into a single load module. The linker accomplishes this task by collecting information about module sizes and external symbols in the first pass and linking the files together in the second pass. Notice that the linker relocates variable Y to relative address 400.
In the third step, the loader requests a block of memory for the program. The operating system provides the address range of 4000 to 5050, so the loader relocates variable X to the absolute address 4100 and variable Y to the absolute address 4400.
Self Review
1. How might absolute loading limit a system’s degree of multiprogramming?
2. How does dynamic loading improve a system’s degree of multiprogramming?
Ans: 1) Two programs that specify overlapping addresses cannot execute at the same time, because only one can be resident at the same location in memory at once. 2) Modules are loaded as needed, so memory contains only the modules that are used.
Comments
Post a Comment