Copyright (c) 2000 David Ascher and Trent Mick. This material may be distributed only subject to the terms and conditions set forth in the Software Carpentry Open Publication License, which is available at:
http://www.software-carpentry.com/openpub-license.html


Black: a make replacement

David Ascher <DavidA@ActiveState.com>
Trent Mick <TrentM@ActiveState.com>

ActiveState Tool Corporation

Version 1.0 - Friday, March 31, 2000 11:42 PM

 

Background

Software projects involve one or more files, some of which are edited by hand exclusively, some of which are generated by program and then edited by hand, and some of which are only generated by program. Some of the files are processed by tools which in turn generate other files as 'side-effects'. When a file is modified and the project needs to be updated, the task of identifying which files need to be updated, and how, can quickly become daunting. With large numbers of files, the task is impossible to perform by hand without either performing tremendous amounts of unnecessary processing or failing to update the right set of files in the right order. When the same tree of files is used to build targets for more than one target machine, or when several types of targets are needed (e.g. optimized and debug), many current tools (such as Make [Make]) become unwieldy, as has been widely documented (e.g. [Miller]).

Make and its derivatives allow the user to specify in a text file (typically one per directory) a set of 'targets', the set of files which each target depends on (the 'sources'), and the steps needed to go from each set of sources to a given target. Make then uses the timestamp of each file to determine which targets need to be rebuilt. Make's format gives the user great freedom in specifying how to go from a given set of sources to a given target -- any sequence of shell commands can be used, invoking any tool available on the operating system. The corollary of this extreme freedom is an utter lack of control by the tool over the target-making process, and as a result Make is unable to use optimal algorithms, as has been well documented [Miller].

Notwithstanding the flaws Make has revealed over more than two decades of use, its authors correctly identified the three phases which make up the process of updating a project:

  1. determining what the targets are, what sources they derive from, and what the derivation process is.
  2. determining what sources have changed since the last update, what targets need to be rebuilt as a result, and in what order.
  3. performing the steps associated with rebuilding the appropriate targets.

General Design Philosophy

The major feature of the design presented below is its reliance on an open, well-specified data format rather than on any particular processing tool.

The emphasis in this design, as in the related proposal for the configuration tool ("Tan"), is on the data format. Given an open data format, several competing data-manipulation programs can be developed to explore optimal algorithms and processes, and can be customized for different target audiences. Our analysis of the existing build tools found that while many tools included powerful features, the learning curve for the sophisticated tools was so steep that most users keep using under-powered tools, or no tool at all, when building their software. By divorcing the data representation from the processing tool, our proposal is designed to allow users to upgrade from simple tools to powerful tools while maintaining the same underlying data format and metaphors. Ideally the user's knowledge becomes more detailed as his/her needs become more complex, but no radical shift is needed, simply an evolutionary development of expertise.

The reliance on XML should allow the tool to evolve faster and with more flexibility than if a new, specialized syntax were invented, by shifting focus away from process (the process of building tools is well understood) to data representation (understanding the data model of tools is the hard part). XML is a terrible user interface, however, so several front-end systems are described which allow users of various levels of sophistication to enter the data, and which allow programs such as Integrated Development Environments (IDEs) to build data files automatically. XML does not specify how to process the data, but that part of the problem is, remarkably, the best-understood part of the build process. Several processing models are also described, from single-process builds to tracing (debugging) builds and distributed parallel builds with client-server GUI feedback monitors and database logging. The separation between data format and processing tool means that different implementation strategies can be used depending on the size of the problem (e.g. whole in-memory DOM tree building for small problems, event-driven SAX processing for very large problems).

Apart from the open data format, the current design is novel in that it requires the user to make available to the build tool all of the relevant consequences of the tools invoked by the build process (naturally, it makes this requirement reasonable by presenting the user with tools designed to help in this task). This requirement allows the design to be optimal in its traversal of the dependency graph. As documented in Recursive Make Considered Harmful [Miller], standard uses of Make typically result in several invocations of the stat() system call per file. If the build tool can be made aware of all of the consequences of tool invocations (which files are created, modified, etc.), then the overhead due to the build process should remain negligible, even on huge projects (a stat() call from Python takes 1-3 msec on a mid-range laptop running Windows 2000, a negligible time compared to a compiler invocation -- on the same machine, a simple-minded Python script can perform a stat() call on every file in the Mozilla source tree (41,000 files) in less than three minutes).

Finally, a key concept in the design is the separation between static dependencies and dynamic processing.  Static dependency graphs are the 'meat and potatoes' of the build system, and novice users will do nothing beyond specifying such graphs (although sometimes without knowledge that they are doing so).  Complicated processing, however, typically involves runtime decisions and configuration of the build system, for example to support remote builds, platform-specific changes in source and target files, compilation switches, etc.  As described in detail below, these configurations are done with Python scripts modifying the graph object which is built by the system, before it is traversed for tool invocation. 
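
To make the division concrete, here is a minimal sketch of the kind of customization script envisioned; the Node class and the attribute names are illustrative assumptions, not the system's defined API:

import sys

# Illustrative stand-in for a node of the dependency DAG; the real
# system would hand the script objects parsed from the project file.
class Node:
    def __init__(self, name, sources=(), **attrs):
        self.name = name
        self.sources = list(sources)
        self.__dict__.update(attrs)

# the static, platform-independent graph (normally parsed from build.xml)
hello_c = Node('hello.c')
hello_o = Node('hello.o', sources=[hello_c], c_optimization_level=2)
hello_exe = Node('hello.exe', sources=[hello_o])

# runtime, platform-specific customization, applied before traversal
if sys.platform.startswith('irix'):
    hello_o.c_optimization_level = 0  # work around an SGI compiler bug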

Typical user scenarios

The mode of interaction with the system depends on the level of expertise of the user and the complexity of the task at hand.  In what follows, IDE refers to an Integrated Development Environment integrated with the proposed build system (not all integrated development environments will qualify). 

Small project, no IDE

In this scenario, there are at most a few targets per directory, and up to a few dozen source files which are processed to produce the targets. Common target types include 1) executable programs derived from C, C++, FORTRAN, Java, etc. source files, and 2) typeset or web-publishable documents derived from TeX, Lout, troff, XML, SGML, etc. marked-up documents. The developer of such a system would typically have to type just:

shell> build -auto

If a configuration file has not yet been built, the -auto option causes the build system to invoke the 'smart parser/guesser subsystem'. This separate tool inspects the current directory and scans the files present for patterns which indicate the names of the likely targets. For example, if a file foo.c includes a function definition for main() and it is the only file in the directory to do so, the build system will infer that the target is the executable named foo (or foo.exe on Windows), and that it links foo.o. Similarly, a LaTeX file which includes \documentclass is deduced to yield a DVI file with the same name and, depending on user preference settings, a PostScript target as well.

Similar scanning is performed to guess at the sets of dependencies. For example, if the abovementioned foo.c also contains the lines

#include "foo.h"
#include "bar.h"

and a file bar.c contains the lines '#include "bar.h"' and '#include "foo.h"', then the build system will deduce that bar.o depends on bar.c, bar.h and foo.h; that foo.o depends on foo.c, foo.h and bar.h; and that foo.exe links foo.o and bar.o. This type of parsing is already done by tools like JAM and by Visual C++, among others.
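
A rough sketch of such a scanner, assuming plain #include scanning and deliberately ignoring conditional compilation and angle-bracket search paths:

import re

INCLUDE_RE = re.compile(r'^\s*#\s*include\s+"([^"]+)"', re.MULTILINE)
MAIN_RE = re.compile(r'\bmain\s*\(')

def local_includes(path):
    """Return the locally-included headers of a C source file."""
    return INCLUDE_RE.findall(open(path).read())

def defines_main(path):
    """Heuristically guess whether a C file defines main(); a mention
    in a comment or a call would also match, hence 'guess'."""
    return MAIN_RE.search(open(path).read()) is not None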

These guesses will be presented to the user who will be allowed to verify whether they are correct or not, edit the resulting file with one of several tools (see below), and then proceed with the build.

If a configuration file is present, then the build system will use it, after checking that no new source files have apparently been added and doing a quick pass through the files modified since the last build to see whether any new guesses about dependencies need to be made (assuming the -auto option was specified; otherwise no such checking is performed).

The set of heuristics used to perform these guesses, and whether they should be performed at all, will be configurable on a per-system and per-user basis. 

The configuration file for a 'hello world' example which just involved hello.c, .h, .o and .exe could be as simple as (XML shown in text form and in a graphical representation thanks to the XMLSpy software):

<?xml version="1.0" encoding="UTF-8"?>
<project>
	<target id="hello_exe" filename="hello.exe" virtualname="hello" type="Executable">
		<sources>
			<source idref="hello_o"/>
		</sources>
		<derivation idref="LinkExecutable"/>
	</target>
	<target id="hello_c" filename="hello.c" type="CsourceFile"/>
	<target id="hello_h" filename="hello.h" type="CHeaderFile"/>
	<target id="hello_o" filename="hello.o" type="ObjectFile">
		<sources>
			<source idref="hello_c"/>
			<source idref="hello_h"/>
		</sources>
	</target>
	<target id="default" type="Virtual">
		<sources>
			<source idref="hello_exe"/>
		</sources>
	</target>
</project>

Small - Medium project, with IDE

IDEs such as Microsoft's Developer Studio (DevStudio) provide a GUI interface to the development environment which has an implicit dependency mechanism and make-like facility built in. By grouping files into projects and subprojects, and by parsing the source files, DevStudio makes the dependency mechanism automatic and transparent to the user. Our design accommodates such a transparent user interface, and we will be implementing something along those lines in our IDE environment. While DevStudio does a very good job at automatically 'guessing' the dependencies and hierarchical nature of a project for "standard" files, it does a terrible job at allowing control over the build process. There are hooks in DevStudio for 'pre-build' and 'post-build' steps associated with each target, but these capabilities are hidden behind obscure configuration dialogs and remain unclear. Determining how to add the ability to process SWIG files generically (as opposed to on a file-by-file basis) was beyond the capabilities of the first author. In this sense, we aim to match the ease of use of DevStudio's solution when it comes to dependency management, but to exceed its ease of use when it comes to configurability, both within the IDE and from outside of it. A screenshot from DevStudio shows a reasonable GUI for the project specification of a complex project (Python).

ActiveState is currently developing an IDE with integrated build control.  Depending on scheduling constraints and feasibility, we will either implement this design or another design submitted to the competition.

Large System

For a large system, the user will most likely want precise control over sources, targets and derivation rules on a per-project basis.  The user will also likely need detailed control over the dependency graph on a per-machine basis, and may need to customize the analysis of the dependency graph by the system. Doing this effectively requires a deeper understanding of the build system, presented in the Design Overview section.

Design Overview

The notion of project is central to our design.  A project defines a set of targets (the products of the build), each target's sources (the files or targets which need to be built/updated before the specified target can be built/updated), and the derivations, i.e. the steps needed to go from the sources to the targets.  The graph of targets and derivations forms a directed acyclic graph (DAG) -- the acyclicity of the graph is ensured by project configuration file editing tools, and the acyclicity of hand-edited project configuration files is verified before a build begins.

A given project can have linked subprojects. Each project has a configuration file associated with it, and a parent project can specify 'links' to subprojects' configuration files. Note that these links are like pointers or references, not like #include statements. These links are resolved at the onset of a build, and all loading of project files is performed before the build is started. Each project is self-contained. While a standard assumption will be to have one project file per directory, and the syntax will make it easiest to specify files in the current directory of a given project file, a project file can refer to files which live in any directory. [It might be worth considering whether, instead of limiting the design to filenames, URIs should also be allowed; this is a decision left for later, as it does not significantly affect the overall design.]

Targets are of two kinds: 'virtual targets', which are simply names for a particular state of affairs, much like 'clean' in standard Makefiles, and filenames (either sources, such as 'foo.c', or targets, such as 'foo.o'). Filenames are automatically mapped to a normalized notation which allows the tool to recognize that './foo.c' and '/home/da/src/foo.c' are the same file. On tool invocation, however, relative pathnames can be used if the absolute pathname is likely to exceed length limits (a requirement pointed out by Wingerd & Seiwald [WingerdSeiwald]). If a tool requires it, a change in the current working directory can be performed before the tool is invoked; since all pathnames are known to the build system, it can conform to whatever file naming restrictions the invoked tool mandates.
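
The normalization step itself needs nothing beyond the standard os.path machinery; a sketch:

import os

def normalize(filename):
    """Map a filename to a canonical key, so that './foo.c' and
    '/home/da/src/foo.c' (with /home/da/src current) compare equal."""
    return os.path.normcase(os.path.abspath(filename))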

Each project is defined by a configuration file and an optional script file. The former defines a static, platform-independent model of the dependency DAG. The latter allows for runtime customization and modification of that DAG, as outlined in the "Scriptability" section below. The processing of the tree is therefore based either on a standard DAG traversal algorithm or on a possibly highly-customized process specified in the script file. The consequence of the tree-walking is typically the on-line building of the software package. However, there are situations where one wants no possibility of variability in the steps which occur across builds, and wishes to remove any runtime decision. To support this goal, a command-line switch can be specified which makes the build process output a distinct XML file stating the steps which must be performed to carry out the build. This XML file, called a 'customized recipe', is simply a thin wrapper around tool invocations, and assumes that all of the tool invocations succeed. The customized recipe is tied to a specific machine, project, and state of affairs (file dirtiness).
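
As an illustration only (no recipe schema is specified here), emitting such a recipe could be as simple as serializing the ordered tool invocations:

from xml.sax.saxutils import quoteattr

def write_recipe(steps, filename='recipe.xml'):
    # all runtime decisions have already been made, so the recipe is
    # a thin, flat wrapper around the tool invocations
    f = open(filename, 'w')
    f.write('<?xml version="1.0"?>\n<recipe>\n')
    for command in steps:
        f.write('  <step command=%s/>\n' % quoteattr(command))
    f.write('</recipe>\n')
    f.close()

write_recipe(['cc -c hello.c', 'cc -o hello.exe hello.o'])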

A schematic diagram of the stages of processing is presented in the following figure:

The Configuration File

The user builds the configuration file (the static dependency DAG) either through the use of editing tools or through the use of scripts.  The end-product of this first step defines the relationships between all the files in a platform-independent fashion.  It will therefore typically contain at least some information which is not necessary for the build on the current platform, and may refer to non-existent files.  This step is also where the greatest flexibility is needed due to the varied backgrounds of the user population.  No restriction is imposed on the tools used to build these configuration files, although naturally some front-ends will be provided with the tool, such as a Makefile parser to aid in migration of existing projects to the new tool, a Python scripting API, and a COM/XPCOM API.   Instead of a mandated user-interface, strong restrictions are placed on the syntax of these files.

Unlike in Make, the set of dependencies is explicit -- there is no way to define a 'pattern' which is used to build dependencies on the fly.  Instead, different front-ends to the system are to be used to build the static dependency model.  This will result in DAG files which are more verbose than is strictly necessary, but at the significant advantage of providing a reliable description of the files which might be involved in a given build.

The configuration file of a project specifies the project's dependency nodes, the derivations which resolve them, and links to any subprojects.

Dependency Nodes

Each dependency node specifies a target (identified by an id, a filename or virtual name, and a type), the sources from which that target derives, an optional reference to a derivation, and a set of named attributes.

Attribute lookup is performed on a given node, and if that node doesn't have the named attribute, the lookup is performed on the project file. If that project file does not have the named attribute, lookup is performed in the parent project, until the root project is reached. The root project is the project on which the build process is invoked. The relationship between projects, subprojects and dependency nodes is one of containment, not of subtyping, and so is similar to environmental acquisition [GilLorenz, Fulton]. This scheme allows one to specify per-build parameters, per-project parameters, and per-file parameters. We will see in a later section how this allows flexibility e.g. in deciding optimization levels or debugging support on a per-project and per-file basis.
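
The lookup scheme can be sketched in a few lines of Python; the class and attribute names are illustrative:

class Container:
    """A dependency node, project, or parent project; lookups acquire
    attributes from the enclosing container when not defined locally."""
    def __init__(self, parent=None, **attrs):
        self.parent = parent
        self.attrs = attrs

    def lookup(self, name):
        if name in self.attrs:
            return self.attrs[name]
        if self.parent is not None:
            return self.parent.lookup(name)
        raise KeyError(name)

root = Container(c_optimization_level=2)       # per-build default
sub = Container(root, c_optimization_level=0)  # per-project override
node = Container(sub)                          # per-file node, no override
assert node.lookup('c_optimization_level') == 0  # acquired from 'sub'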

Derivations

Derivations encapsulate the steps needed to "resolve" a dependency, i.e. go from a source to a target.  When the build system is given a DAG, it determines which dependency nodes are "out of date", meaning that one or more of the sources have been modified significantly since the target of the dependency was built.  The definition of 'significantly modified' defaults to a simple timestamp check, but there is a mechanism described below for customizing what rule is used. The build process then invokes the derivation of the dependency node, which either succeeds or fails.  If it succeeds, then the target of the dependency is considered "up to date".

To perform a derivation, the Derivation objects are invoked, and are passed the dependency node, thus gaining access to the list of sources, the target, and the attributes of the dependency being resolved.  A derivation needs to be well-specified for the build tool to be able to use it efficiently. For example, a C compiler has well-known parameters and produces well-known outputs (or failures). The build system uses that information to optimally identify the tasks that need to be performed. 

Derivations form a type hierarchy. This type hierarchy, while theoretically independent of programming language, will be prototyped using Python classes, although eventual conversion to a language-neutral data representation such as an XML Schema should be kept in mind. The top-level abstract derivation defines the set of attributes which all tools must define at a minimum: at least the inputs the tool consumes, the target(s) it produces, and the side-effect files it creates.

As an example, a derivation for the LaTeX compiler would specify that the tool takes a ".tex" file as the command-line argument, that the target is the .dvi file whose name is derived from the toplevel .tex file, and that .aux, .toc and .log files will be generated. This information can then be used by a clean-up mechanism.
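
A hedged sketch of how such a derivation might be prototyped as a Python class; the attribute names are illustrative, not a defined schema:

import subprocess

class Derivation:
    command = None     # executable to invoke
    side_effects = ()  # extensions of by-product files, for clean-up

    def invoke(self, node):
        raise NotImplementedError

class LatexDerivation(Derivation):
    command = 'latex'
    side_effects = ('.aux', '.toc', '.log')

    def invoke(self, node):
        # the toplevel .tex source is passed on the command line; the
        # target is the .dvi file with the same basename
        return subprocess.call([self.command, node.sources[0]]) == 0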

Different configurations of a given derivation (e.g. debug vs. release compiler invocation) can be specified via the use of mixin classes or other combination mechanisms.  Whether a C compiler includes debugging symbols is probably a standard option which a 'factory-built' CCompiler tool specification could support.  
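
For instance, a debug configuration might be mixed in along these lines (the flag names are illustrative):

class CCompile:
    flags = ['-c']
    def command_line(self, source):
        return ['cc'] + self.flags + [source]

class DebugMixin:
    flags = CCompile.flags + ['-g']  # also emit debugging symbols

class DebugCCompile(DebugMixin, CCompile):
    pass

assert DebugCCompile().command_line('foo.c') == ['cc', '-c', '-g', 'foo.c']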

A catchall derivation can be defined which allows arbitrary shell escapes, although its use should be discouraged.  Special care will be needed to let users define how each shell invocation is expected to modify files (which files are created, deleted, updated, etc.), to allow such shell invocations to fit in the dependency resolution mechanism. 

Dependency DAG

Once a set of linked configuration files is defined, the build program will parse them (using standard XML parsers) and build an in-memory dependency DAG (a cacheable/pickleable data structure). The first tree-processing task is a path analysis to verify that there are no directed cycles in the graph (that a target is not dependent, via others, on itself). Then, given a set of ultimate targets, a path through the dependency tree is identified and processed. As the processing goes on, events can be sent to a specialized service installed either on the machine where the build occurs or on a separate machine accessed through the internet. This allows GUI front-ends to the build process to be updated as the build progresses. Should an error occur, one of two behaviors is possible. For interactive runs, all processing is immediately stopped (including parallel builds), to minimize the delay between the detection of the bug and the notification of the user. When large nightly builds are in progress, it may be preferable to pursue all independent paths through the make tree to allow the build to proceed to maximal completion. Should there exist another path to a given target (for example, should multiple compilers be available to compile a given executable), this scheme could complete the build even in the presence of individual derivation failures.
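
Both the acyclicity check and the build ordering fall out of a standard depth-first traversal; a sketch, with the graph as a plain dictionary mapping each target to its sources:

def build_order(graph):
    """Return targets in dependency order, raising on a directed cycle."""
    VISITING, DONE = 1, 2
    state, order = {}, []

    def visit(node):
        if state.get(node) == VISITING:
            raise ValueError('dependency cycle through %r' % node)
        if state.get(node) == DONE:
            return
        state[node] = VISITING
        for source in graph.get(node, ()):
            visit(source)
        state[node] = DONE
        order.append(node)

    for node in graph:
        visit(node)
    return order

graph = {'hello.exe': ['hello.o'], 'hello.o': ['hello.c', 'hello.h']}
print(build_order(graph))  # sources first, ultimate target last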

Guessing Tool

The guessing tool is not specified in detail here -- prior art in this domain such as DevStudio and JAM indicate that finding out what files are included by a given file is relatively straightforward.  Finding heuristics regarding the expected dependencies between files which do not explicitly include one another is an empirical problem.  

Scriptability of the model

The model as presented thus far defines the interrelationships between the files, the expected behavior for a variety of tasks (compilation, linking, text processing, revision control, etc.), and the default processing of the DAG tree. Many configuration possibilities are offered via the use of the optional script.

Different DAG traversal mechanisms

The mechanism for traversal of the dependency DAG outlined above is designed for optimal processing on a serial machine. Other traversals can be performed. For example, a traversal can be performed which emits instructions but does not perform tasks, for use on a remote machine. Alternatively, one could identify parallel paths through the DAG and launch them in parallel, either on a multiprocessor system or on a cluster.

DAG modifications

The dependency DAG is serializable as an XML file. Thus, it is available for modification by any XML processor. The most common mechanism for modifying the DAG is expected to be the Document Object Model (DOM). In addition to the 'standard' view offered by the DOM, utility routines are available which provide customized access to the tree specifically for the customization of this tool.

Platform-specific tests

Even the most portable code sometimes stresses compilers in unexpected ways. Additionally, many code bases have features which only make sense to enable on specific platforms. Thus nodes of the DAG DOM need to be removable at runtime, and we need the ability to specify code to run as the DOM is traversed and the derivations are invoked. As an example, the code in the complexobject.c file in the Numeric Python distribution causes the C compiler on SGIs to fail if run with optimization enabled. This could be expressed simply as:

# 'platform' stands for the build system's notion of the current host,
# and 'NumPy' for the in-memory project object of the Numeric Python
# subproject; both names are illustrative
if platform == 'sgi':
    NumPy.targets['complexobject.o'].c_optimization_level = 0

Interface

There are several interfaces to the proposed system: to the dependency-description XML tree, to the tool description tree, to the scripting capabilities, and to the build control monitor and error reporting mechanisms.

Interfaces to the dependency description mechanism

As described in the typical user scenarios section, there are several modes of interaction by which the user can build a dependency description. One of these is the use of an 'automatic guessing tool' which analyzes the files in the current working directory for common patterns (or possibly, depending on the languages involved, performs comprehensive parsing). Another is the use of the GUI in an integrated development environment, possibly using file parsing as an ancillary tool. Another is the use of a Makefile, which is automatically converted to the data structure described above (although developing a full solution covering all existing Makefiles is not a worthy use of anyone's time). Alternative input modes can be added as desired, as the data format will be well-defined and portable XML. It is important to note that this flexibility means that the tool has a chance of being adopted by non-Python users, as they can use their favorite programming language to interface to the tool.

Interfaces to the Processor Description Tree

The tool description tree, being just another branch of the XML tree described in the Registry document, has the same flexibility of input mode. That said, the specification of the input-output requirements of a tool needs to be entered in a manner which the build tool can understand. Careful definition of the allowed XML schemas remains to be completed.

Interfaces to the Build Control Monitor

For simple runs, all aspects of processing can be performed in a single process, from parsing of the DAG file, through running of the Python scripts, to traversal of the DOM and invocation of the derivations. As build size increases, however, it is advantageous to have a central build process, controller processes and monitor processes. For example, the build process on a cluster can run on one of the cluster nodes, but users may want to start the build from a remote machine and have GUI feedback on the evolution of the build, either through a website or in a customized GUI client. The protocol used for this interaction is largely immaterial, since small amounts of data are expected to be transferred over the wire, and infrequently (a few packets per derivation step). An initial implementation over XML-RPC or SOAP seems the easiest starting point.
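
A minimal sketch of such an interaction, using the xmlrpc modules from a current Python standard library; the event vocabulary and port number are assumptions:

import threading
import xmlrpc.client
from xmlrpc.server import SimpleXMLRPCServer

# --- monitor process: receives one call per derivation event ---
def report(event, target, detail):
    print('%s: %s (%s)' % (event, target, detail))  # a GUI would update here
    return 1

server = SimpleXMLRPCServer(('localhost', 8000), logRequests=False)
server.register_function(report)
threading.Thread(target=server.serve_forever, daemon=True).start()

# --- build process: a few small packets per derivation step ---
monitor = xmlrpc.client.ServerProxy('http://localhost:8000')
monitor.report('derivation_started', 'hello_o', 'cc -c hello.c')
monitor.report('derivation_succeeded', 'hello_o', '')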

Interfaces to the Error Reporting Mechanism 

The tool as 'shipped' should include interfaces to the platform-standard logging mechanisms, as well as allow GUI feedback on small runs and database logging for complex systems. Given the event-driven nature of the derivation process, the level of specificity of error and diagnostic logging can be determined by the user, who registers for the events of interest: anything from failures in derivations only, to successes, to reports on file sizes at completion or at periodic intervals (to check for hung compiler jobs). It should be clear that the error reporting mechanism should be layered on top of the build control monitor. Error reporting and warning systems also need to be added to the tools which allow the building of the configuration file.

Miscellaneous Issues

Bootstrapping

Much like Make uses shell tools to solve its bootstrapping problem, this tool can use Make to solve its bootstrapping problem.  In addition, the ability to create "customized recipes" and the access to remote registries (as described in the Tan document) should allow the creation of a tool which builds the tool on a remote machine without resorting to legacy technology, but resorting instead to the network.

Mechanism for Determining "Significantly Modified"

By using the script file, one can customize the procedure used by the build system to determine whether a file needs to be rebuilt. Under this mechanism, users register a tool which stores the relevant state of a file, along with a tool which determines whether that state has changed. The mechanism can be applied on a project, subproject or file basis. In all but the last case, the user can specify a MIME-type pattern to further restrict the files which should be affected. Finally, the user specifies a digesting function and a comparison function.

For example, on a system where timestamp data is changed for irrelevant reasons, one can register a digest mechanism such as md5. To register the use of md5 on all files in a project, one can do:

def md5_digester(file):
    # digest the file's entire contents: any change to the content
    # changes the digest, regardless of timestamps
    import md5
    return md5.new(file.read()).digest()

project.register_change_detector(match='*', digester=md5_digester, comparison=cmp)

The cmp() built-in function can be used as the comparison mechanism as md5 digests are strings.  

Alternatively, to register a new mechanism for detecting changes in the major version numbers stored as attributes in all the XML files of a subproject, one can do:

def get_version_number(file):
    """using Sean McGrath's Pyxie tool (xmln) to do the XML parsing"""
    import popen2
    # xmln emits PYX notation, in which attribute lines look like
    # "Aversion_number 1.5.2"; extract those lines
    outf, inf = popen2.popen2('xmln - | grep "^Aversion_number "')
    inf.write(file.read())
    inf.close()
    # return the value of the attribute, e.g. "1.5.2"
    return outf.readline().split()[1]

def compare_version_numbers(v1, v2):
    # report a change only when the major version components differ
    return v1.split('.', 1)[0] != v2.split('.', 1)[0]

project.subproject.register_change_detector(match='text/xml', 
                                            digester=get_version_number,
                                            comparison=compare_version_numbers) 

Clearly, the sophistication of the change detection smarts is up to the user. The only constraint is that the return value from the digester be serializable (pickleable), as it needs to be saved by the build system as part of the state of the project in between invocations.

Development Plan Outline


Assuming that the general design presented above is deemed sound and survives review at least in broad outline, development could proceed on two fronts simultaneously. The first is the committee work of determining the schemas used for the XML trees in this tool and in the Registry tool described in the joint proposal [Tan]. This process should take as a starting point the features of the existing tools (Make, autoconf, etc.) and organize them in a logical, hierarchical fashion. Advanced users should be solicited for feedback on the full extent of the problem requirements (cross-platform compilation, parallel builds on long-latency networks, and so on), so that the hardest aspects of the problem are at least considered before a feature set is settled on. Usability testing should also be performed with novice users to sketch out the user interfaces for the various front-ends.

In parallel, the general infrastructure can be tackled early: XML parsing, dependency-graph construction, graph-traversal algorithms, and the intra- and inter-process communication needed to implement the various tools described above.

Summary

Black separates the data describing a build from the tools which process it. Projects are described by static, platform-independent XML dependency DAGs, which can be produced by a guessing tool, an IDE, a Makefile converter, or any XML-aware program; runtime customization is performed by Python scripts which modify the in-memory graph before traversal. Because the consequences of every tool invocation are declared explicitly, the system can traverse the dependency graph optimally, scale from single-process builds to distributed parallel builds with remote monitoring, and grow with its users from simple to sophisticated use without a change of format or metaphor.

References:

[Make] http://www.gnu.org/software/make/make.html

[Miller] Recursive Make Considered Harmful: http://www.pcug.org.au/~millerp/rmch/recu-make-cons-harm.html

[Jam] http://www.perforce.com/jam/jam.html

[WingerdSeiwald] http://www.perforce.com/jam/doc/scm7.html

[Fulton] http://www.digicool.com/releases/ExtensionClass/Acquisition.html

[GilLorenz] Gil, J. and Lorenz, D., "Environmental Acquisition -- A New Inheritance-Like Abstraction Mechanism", OOPSLA '96 Proceedings, ACM SIGPLAN, October 1996.

Man page


NAME

build [options] [target] - keep a project up to date

DESCRIPTION

  The options are as follows:

-auto:   Invoke the guessing/parsing system, to check for included files, likely patterns of dependencies, etc.

-auto_interactive: Like -auto, except that the user is prompted to confirm every guess

-generate_recipe [filename]: do not build project, but create recipe file (recipe.xml by default) 

...

If a target is not specified, all targets of type 'Default' are built.

FILES

The configuration files created by this program are all called 'build.xml'.

SEE ALSO

registry(1)

http://www.software-carpentry.com

