Introduction¶
Aduana is a component to be used with a web crawler. It contains the logic to decide which page to crawl next. It accepts as inputs crawled pages and it outputs the next pages to be crawled (requests).
Aduana input/output
The main objectives of Aduana are:
- Speed: it must be able to output thousands of requests per second.
- Scalability: it must be able to consider billions of crawled pages.
- Intelligence: it must be able to direct the crawl to interesting pages.
Components¶
There are two main components right now: a C library and Python bindings.
The C library does the heavy lifting. In addition it also ships with several command line tools. It’s portable ANSI C99 code and all the necessary dependencies are bundled with the library. Ideally you should not concern yourself with this library unless you plan to extend Aduana.
The Python bindings contain low-level bindings to the C library and also: