Private Homepage of Hartmut Henkel

Experimental PdfTeX Patch: Speeding up PdfTeX by Improving PDF-Object Search Using AVL Trees.

Update

This information is obsolete, as the described patch, in modified form, is now part of pdftex-1.20a.

Introduction

This is just an informal write-up of my private/spare-time fiddling with pdfTeX (it's for fun).

When pdfTeX has to process files with many links, it gets increasingly slow (sometimes even excruciatingly slow :-) when the source file size grows. The main reason for this seems to be the organization of objects in a linear linked list within the pdfTeX program. This list must be traversed from the beginning to find, whether an object with a certain identifier (string or number) is already existing in the list, before a new object with that identifier can be appended. Obviously the lookup time per object goes up, when the list is getting longer.

This speed limitation might be annoying, if pdfTeX is to be used for publishing of large documents with many hyperlinks, e. g. documents generated automatically from large databases. Such documents typically are catalogs, phonebooks, dictionaries, and parallel-multilingual books (hyperlinked Rosetta-Stone).

Once the cause of the slow-down was identified to lie in function find_obj in file pdftex.ch, I built a low-invasive scaffolding along the list, using an AVL tree. The original linked list is kept fully intact. The only, very local changes to the pdfTeX code are collected in file tex.ch2. When now an object is appended to the linked list, a pointer to this object is put into the AVL search tree, where the sort criteria is taken from the info entry of the object. This allows then to use the avl_find_obj() function to check, whether an object is already existing, rather than using the slow original find_obj function in the program pdftex.ch. Actually the functionality of the find_obj function is now implemented by the avl_find_obj() function.

The Code

A snapshot of the patch, which is against pdfTeX, Version 3.141592-1.11a (Web2C 7.5.2), can be downloaded as a gzipped tar file: pdftex-avlpatch3-20031018.tgz. Included are also the AVL-tree files avl.c and avl.h from the avl-2.0.1 package (a brilliant AVL implementation) by Ben Pfaff. Unpack the tar-file into directory texk/web2c/pdftexdir, apply the patches, go to the pdfTeX root directory, and type ./Build.

Follow compilation, and check particularily, that in directory texk/web2c/pdftexdir the file libpdf.a really is rebuilt and includes files avl.o and avlstuff.o, and that in directory texk/web2c the file pdftex.ch is recreated.

Experimenting

Before I removed the original find_obj Pascal code, I let run the find_obj (Pascal) and avl_find_obj (C) functions in parallel, and put in an assert(), to check if there are any discrepancies between the outcomes of both searches. After a while of testing with various available TeX files, I switched completely to avl_find_obj, now waiting for the first crash.

For checking the speed improvement, I generated a set of testfiles with hyperlinks by an awk script, with the number of internal links as parameter. Here are some results from my 400 MHz Pentium-III PC, running under debian (Woody) Linux (processing time with AVL-patch / without patch):

1000 links (44 pages): 2.14s / 2.61s - 2000 links (87 pages): 3.42s / 4.72s - 5000 links (218 pages): 7.3s / 18s - 10000 links (435 pages): 14s / 63s - 20000 links (870 pages): 30s / 208s - 50000 links (2174 pages): 80s / 1590s.

Bugs, TODOs

News

End Remark

Help, advice, bug notices, critics greatly welcome! Have fun!

This page first put online 03 October 2003.