C Library¶
This section is aimed at developers who want to understand the architecture of the library in order to extend it.
The library can be compiled and installed independently of the python bindings. To build and install:
cd lib
mkdir debug
cd debug
cmake .. -DCMAKE_BUILD_TYPE=Debug
make && sudo make install
When trying to understand some code I like to start with the data structures that make up the inputs and the outputs of the code. CrawledPage is the input of Aduana and the best place to start.
CrawledPage¶
Data structures¶
-
struct
CrawledPage
¶ The information that comes with a crawled page.
Public Members
-
char *
url
¶ ASCII, null terminated string for the page URL
-
double
time
¶ Number of seconds since epoch
-
float
score
¶ A number giving an idea of the page content’s value
-
char *
content_hash
¶ A hash to detect content change since last crawl. Arbitrary byte sequence
-
size_t
content_hash_length
¶ Number of bytes of the content_hash
-
PageLinks *
links
¶ Links found inside the crawled page
The utility of CrawledPage::url and CrawledPage::links is quite obvious; the other fields need an explanation:
- CrawledPage::time: this is used to compute how often a page changes, and it is also useful for a revisiting schedule to know how much time has passed since the page was crawled.
- CrawledPage::score: one of the objectives of Aduana is to guide the crawl towards interesting pages. Since the definition of interesting is application dependent, each crawler can give a measure of how interesting it found the page to be. How exactly this number is used depends on which scorer we are going to use. This field is not mandatory and Aduana can actually be configured to ignore it.
- CrawledPage::content_hash: in order to detect whether a page has changed, this hash is compared with the hash previously stored for the same page. If the hash has changed we consider that the page has changed. Notice that the content hash is also application dependent: some applications may consider that the page has changed only if there are new links, others will consider a page changed if the body text, after stripping HTML tags, has changed, etc. This field can be ignored too, in which case pages will be considered as unchanging.
C is not known for its powerful and flexible data structures. In order to store a variable number of links per crawled page we implement this resizable array: each time we run out of allocated memory the size of the reserved memory is doubled (see the sketch after the definition).
-
struct
PageLinks
¶ A (resizable) array of page links.
Initially: n_links = 0, m_links = PAGE_LINKS_MIN_LINKS
Always: 0 <= n_links <= m_links
Initially we reserve this number of links:
-
PAGE_LINKS_MIN_LINKS
¶ Allocate at least this amount of memory for link info
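The following is a minimal sketch of that growth strategy (our own names, not Aduana's actual code; LinkInfoSketch stands in for LinkInfo):

#include <stdlib.h>
#include <string.h>

typedef struct {
    char *url;
    float score;
} LinkInfoSketch;

typedef struct {
    LinkInfoSketch *link_info; /* array of m_links allocated entries */
    size_t n_links;            /* links actually stored */
    size_t m_links;            /* links allocated, initially PAGE_LINKS_MIN_LINKS */
} PageLinksSketch;

static int
page_links_sketch_add(PageLinksSketch *pl, const char *url, float score)
{
    if (pl->n_links == pl->m_links) {
        /* Out of room: double the reserved memory. */
        size_t m = 2 * pl->m_links;
        LinkInfoSketch *li = realloc(pl->link_info, m * sizeof *li);
        if (!li)
            return -1;
        pl->link_info = li;
        pl->m_links = m;
    }
    char *copy = strdup(url); /* the array owns its own copy of the URL */
    if (!copy)
        return -1;
    pl->link_info[pl->n_links].url = copy;
    pl->link_info[pl->n_links].score = score;
    pl->n_links++;
    return 0;
}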
Finally, each link not only carries a URL, but also a score. The score gives an idea of how good the (maybe uncrawled) link is, according to the web crawler. Think of the link score as an approximation to CrawledPage::score when we have not crawled the link yet.
-
struct
LinkInfo
¶ The information that comes with a link inside a crawled page.
The link score is used to decide which links should be crawled next. It is application dependent and typically computed by looking at the text surrounding the link.
Constructor/Destructor¶
-
CrawledPage *
crawled_page_new
(const char *url)¶ Create a new CrawledPage
The url is stored as a new copy
The following defaults are used for the different fields:
- links: no links initially. Use crawled_page_add_link to add some.
- time: current time
- score: 0. It can be set directly.
- content_hash: NULL. Use crawled_page_set_hash to change
- Return
- NULL if failure, otherwise a newly allocated CrawledPage
-
void
crawled_page_delete
(CrawledPage *cp)¶ Delete a Crawled Page created with crawled_page_new
Manipulate links¶
-
int
crawled_page_add_link
(CrawledPage *cp, const char *url, float score)¶ Add a new link to the crawled page
-
const LinkInfo *
crawled_page_get_link
(const CrawledPage *cp, size_t i)¶ Get a pointer to the link
-
size_t
crawled_page_n_links
(const CrawledPage *cp)¶ Get number of links inside page
Set content hash¶
-
int
crawled_page_set_hash
(CrawledPage *cp, const char *hash, size_t hash_length)¶ Set content hash
The hash is a new copy
-
int
crawled_page_set_hash128
(CrawledPage *cp, char *hash)¶ Set content hash from a 128-bit hash
-
int
crawled_page_set_hash64
(CrawledPage *cp, uint64_t hash)¶ Set content hash from a 64-bit hash
-
int
crawled_page_set_hash32
(CrawledPage *cp, uint32_t hash)¶ Set content hash from a 32-bit hash
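Putting the functions above together, a typical lifecycle looks like this (a sketch; error checks abbreviated, and the header name is an assumption, use whichever header declares CrawledPage in your build):

#include <stdio.h>
#include "aduana.h" /* assumed header name */

int main(void) {
    CrawledPage *cp = crawled_page_new("http://example.com/index.html");
    if (!cp)
        return 1;
    cp->score = 0.5; /* optional: defaults to 0 */
    crawled_page_add_link(cp, "http://example.com/a.html", 0.2);
    crawled_page_add_link(cp, "http://example.com/b.html", 0.8);
    crawled_page_set_hash64(cp, 0x0123456789abcdefULL);
    printf("links: %zu\n", crawled_page_n_links(cp)); /* prints 2 */
    crawled_page_delete(cp);
    return 0;
}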
PageInfo¶
Data structures¶
This structure contains all we know about a given page, and it is updated as new CrawledPage structures arrive. Here it is:
-
struct
PageInfo
¶ The information we keep about crawled and uncrawled pages
PageInfo structures are created by the PageDB; that's why there are no public constructors/destructors available.
Public Members
-
char *
url
¶ A copy of either CrawledPage::url or CrawledPage::links[i]
-
uint64_t
linked_from
¶ The page that first linked this one
-
double
first_crawl
¶ First time this page was crawled
-
double
last_crawl
¶ Last time this page was crawled
-
size_t
n_changes
¶ Number of content changes detected between first and last crawl
-
size_t
n_crawls
¶ Number of times this page has been crawled. Can be zero if it has been observed just as a link
-
float
score
¶ A copy of the same field at the last crawl
-
size_t
content_hash_length
¶ Number of bytes in PageInfo::content_hash
-
char *
content_hash
¶ Byte sequence with the hash of the last crawl
Constructor/Destructor¶
There is no constructor available for this structure. The reason is
that they are automatically created from the info inside
CrawledPage
when page_db_add()
is called.
Functions¶
-
int
page_info_print
(const PageInfo *pi, char *out)¶ Write printed representation of PageInfo.
This function is intended mainly for debugging and development. The representation is: first_crawl last_crawl n_crawls n_changes url
Each field is separated with a space. The string is null terminated. We use the following format for each field:
- first_crawl: the standard fixed size (24 bytes) as output by ctime. For example: Mon Jan 1 08:01:59 2015
- last_crawl: the same as first_crawl
- n_crawls: To ensure a fixed size representation this value is converted to double and represented in exponential notation with two digits. It is therefore always 8 bytes long: 1.21e+01
- n_changes: The same as n_crawls
- url: This is the only variable length field. However, it is truncated to 512 bytes.
- Return
- size of representation or -1 if error
- Parameters
pi
-The PageInfo to be printed
out
-The output buffer, which must be at least 580 bytes long
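For example (a sketch, where pi is a PageInfo obtained from the database; see page_db_get_info below):

char buf[580]; /* minimum buffer size required by page_info_print */
if (page_info_print(pi, buf) >= 0)
    puts(buf);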
PageDB¶
This is one of the main components of the library. Here we store all
the PageInfo
and how pages are linked between them.
The first thing to understand is that there are two different ways to refer to a given page, using either the URL hash or the index. Both ways of addressing the page are linked in the hash2idx database.
URL hash¶
The URL hash is computed using the following function:
-
uint64_t
page_db_hash
(const char *url)¶ Hash function used to convert from URL to hash.
The hash is a 64 bit number where the first 32 bits are a hash of the domain and the last 32 bits are a hash of the full URL. In this way all URLs with the same domain get grouped together in the database. This has some good consequences:
- We can access all pages inside a domain by accessing the first of them in the database and moving sequentially.
- When streaming links this improves locality since pages in the same domain tend to have similar links.
When a new CrawledPage arrives we compute the hash of CrawledPage::url and use it as the key into the hash2info database to retrieve the associated PageInfo. If no entry is found inside the database a new one is created. We do the same with each of the links inside CrawledPage::links.
The following two functions are useful to extract the different parts of the hash.
-
uint32_t
page_db_hash_get_domain
(uint64_t hash)¶ Extract the domain hash from the full hash
-
uint32_t
page_db_hash_get_url
(uint64_t hash)¶ Extract the URL hash from the full hash
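The following sketch shows the layout this implies. The 32-bit hash here is a placeholder (FNV-1a), not necessarily the one Aduana uses, and we read "first 32 bits" as the most significant half, matching the extractors above:

#include <stdint.h>
#include <string.h>

static uint32_t
hash32(const char *s, size_t n) /* placeholder 32-bit hash (FNV-1a) */
{
    uint32_t h = 2166136261u;
    for (size_t i = 0; i < n; i++) {
        h ^= (uint8_t)s[i];
        h *= 16777619u;
    }
    return h;
}

uint64_t
sketch_page_db_hash(const char *domain, const char *url)
{
    uint32_t dh = hash32(domain, strlen(domain)); /* domain part */
    uint32_t uh = hash32(url, strlen(url));       /* full URL */
    return ((uint64_t)dh << 32) | uh;
}

/* Consistent with the extractors above (assumed bit positions):
 *   page_db_hash_get_domain(h) == (uint32_t)(h >> 32)
 *   page_db_hash_get_url(h)    == (uint32_t)h          */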
Index¶
We could store links between pages using their URL hash, for example, as a list of pairs of the form:
004619df1e9191ff 004619df1eb839e2
004619df1e9191ff 004619df1f1a5477
004619df01e223ae 00115773f1ea355c
...
However, hashing would spoil an interesting property of links: locality. Locality means that pages usually link to pages inside their same domain. For example, here are the first links extracted from the front page of Wikipedia:
https://en.wikipedia.org/wiki/Main_Page#mw-head
https://en.wikipedia.org/wiki/Main_Page#p-search
https://en.wikipedia.org/wiki/Wikipedia
https://en.wikipedia.org/wiki/Free_content
https://en.wikipedia.org/wiki/Encyclopedia
https://en.wikipedia.org/wiki/Wikipedia:Introduction
https://en.wikipedia.org/wiki/Special:Statistics
https://en.wikipedia.org/wiki/English_language
Locality also applies when several outgoing links point to the same domain, even if it is different from the domain of the originating page. For example, of the 135 links on the front page of Hacker News more than 100 stayed on the same domain, but there were also the following groups:
http://www.ycombinator.com/
http://www.ycombinator.com/apply/
https://github.com/blog/2024-read-only-deploy-keys
https://github.com/whamtet/Excel-REPL
https://github.com/tadast/switching-to-contracting-uk/blob/master/README.md
https://github.com/HackerNews/API
Instead of storing links using the URL hash, we assign each page an integer that starts at zero for the first page and is automatically incremented whenever a new page is added to the database. Links are then stored as lists where the first element is the index of the originating page and the rest of the elements are the indices of the outgoing links. For example, taken from a real crawl:
7 1243 1245 1251 1254 1260 1262 1263
1264 1267 1269 1271 1274 1275 1276
1277 1280 1283 1286 1289 1291 1295
1309 1311 ...
Since we want to be able to perform big crawls with billions of pages we use 64 bit integers for the indices, which means they still take as much space as the URL hashes. However, these links are delta-encoded: starting at the second element of the list we subtract the previous one:
7 2 6 3 6 2 1 1 3 2 2 3 1 1 1 3 3 3 3 2 4 14 2 ...
Finally we use varint encoding for each integer. As you can see in the above example each link then requires just 8 bits, instead of the 64 bits (or 32 bits, if somehow we could reuse the domain part of the hash) that URL hashing would require.
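Here is a sketch of that storage scheme, following the prose description above (the varint format shown is the common LEB128-style encoding; Aduana's actual byte format may differ):

#include <stdint.h>
#include <stddef.h>

/* Encode one unsigned integer as a varint; returns bytes written. */
static size_t
varint_encode(uint64_t v, uint8_t *out)
{
    size_t n = 0;
    do {
        uint8_t b = v & 0x7f;
        v >>= 7;
        out[n++] = b | (v ? 0x80 : 0); /* high bit set: more bytes follow */
    } while (v);
    return n;
}

/* Pack one link list: the originating index first, then delta-encoded
 * link indices (assumed sorted ascending). */
size_t
pack_link_list(uint64_t from, const uint64_t *to, size_t n_to, uint8_t *out)
{
    size_t len = varint_encode(from, out);
    uint64_t prev = from;
    for (size_t i = 0; i < n_to; i++) {
        len += varint_encode(to[i] - prev, out + len); /* delta from previous */
        prev = to[i];
    }
    return len; /* small deltas take a single byte each */
}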
Having indices instead of hashes is also convenient for the PageRank and HITS algorithms. They can store the page scores in plain arrays where the position of each page is just its index. Fast O(1) access greatly improves the speed of the computation when working with billions of pages. Besides, locality also helps access speed, even when working in-memory.
The index for a given page is automatically created when
page_db_add()
is called.
Data structures¶
-
struct
PageDB
¶ Page database.
We are really talking about four different key/value databases:
- info: contains fixed size information about the whole database. Right now it just contains the number of pages stored.
- hash2idx: maps URL hash to index. Indices are consecutive identifiers for the pages. This makes it possible to map pages to positions inside arrays.
- hash2info: maps URL hash to a PageInfo structure.
- links: maps page index to the indices of its links. This allows fast streaming of all the links inside the database.
Public Members
-
char *
path
¶ Path to the database directory
-
TxnManager *
txn_manager
¶ The transaction manager counts the number of read and write transactions active and is capable of safely performing a database resize
-
DomainTemp *
domain_temp
¶ Track the most crawled domains
-
int
persist
¶ If true, do not delete files after deleting object
Constructor/Destructor¶
-
PageDBError
page_db_new
(PageDB **db, const char *path)¶ Creates a new database and stores data inside path
- Return
- 0 if success, otherwise the error code
- Parameters
db
-In case of page_db_error_memory *db could be NULL. In case of other failures memory is nevertheless allocated so that the error code and message can be accessed.
path
-Path to directory. If it doesn't exist it will be created. If it exists and a database is already present, operations will resume with the existing database. Note that you must have read, write and execute permissions for the directory.
-
PageDBError
page_db_delete
(PageDB *db)¶ Close database
Close database, delete files if it should not be persisted, and free memory
Add page¶
-
PageDBError
page_db_add
(PageDB *db, const CrawledPage *page, PageInfoList **page_info_list)¶ Update PageDB with a new crawled page
It performs the following actions:
- Compute page hash
- If the page is not already in the database:
- It generates a new ID and stores it in hash2idx
- It creates a new PageInfo and stores it in hash2info
- If it is already present, it updates the PageInfo inside hash2info
- For each link:
- Compute hash
- If it is already present in the database, just retrieve the ID
- If not present:
- Generate a new ID and store it in hash2idx
- Create a new PageInfo and store it in hash2info
- Create or overwrite the page ID -> link IDs mapping inside the links database
- Return
- 0 if success, otherwise the error code
- Parameters
db
-The database to update
page
-The information of the crawled page
page_info_list
-If not NULL this function will allocate and populate a new PageInfoList which contains the PageInfo of the updated pages. It is your responsibility to call page_info_list_delete when you no longer need this structure.
Get info from database¶
-
PageDBError
page_db_get_info
(PageDB *db, uint64_t hash, PageInfo **pi)¶ Retrieve the PageInfo stored inside the database.
Beware that if the page is not found the call will signal success, but the returned PageInfo will be NULL
-
PageDBError
page_db_get_idx
(PageDB *db, uint64_t hash, uint64_t *idx)¶ Get index for the given URL
-
PageDBError
page_db_get_scores
(PageDB *db, MMapArray **scores)¶ Build a MMapArray with all the scores
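An end-to-end sketch tying these calls together (error handling abbreviated):

PageDB *db = NULL;
if (page_db_new(&db, "./test-db") != 0)
    return 1;

CrawledPage *cp = crawled_page_new("http://example.com/");
crawled_page_add_link(cp, "http://example.com/about", 0.1);
page_db_add(db, cp, NULL); /* NULL: we don't need the PageInfoList back */
crawled_page_delete(cp);

PageInfo *pi = NULL;
uint64_t idx;
page_db_get_info(db, page_db_hash("http://example.com/"), &pi);
page_db_get_idx(db, page_db_hash("http://example.com/about"), &idx);
if (pi) /* remember: success with pi == NULL means "not found" */
    printf("crawled %zu times\n", pi->n_crawls);
page_db_delete(db);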
Database settings¶
-
PageDBError
page_db_set_domain_temp
(PageDB *db, size_t n_domains, float window)¶ Set domain temperature tracking options
Export database¶
These functions are used by the page_db_dump command line utility.
-
PageDBError
page_db_info_dump
(PageDB *db, FILE *output)¶ Dump database to file in human readable format
-
PageDBError
page_db_links_dump
(PageDB *db, FILE *output)¶ Dump database to file in human readable format
PageInfoList¶
This structure exists just because page_db_add() needs a way of returning which pages had their info created or modified. This information is necessary for schedulers. It's just a linked list, so we are not going to make more comments about it.
Data structures¶
-
struct
PageInfoList
¶ A linked list of PageInfo (and hash), to be returned by page_db_add
Public Members
-
uint64_t
hash
¶ Hash inside the hash2info database
-
struct PageInfoList *
next
¶ A pointer to the next element, or NULL
Constructor/Destructor¶
-
PageInfoList *
page_info_list_new
(PageInfo *pi, uint64_t hash)¶ Create a new PageInfoList, with just one element.
- Return
- A pointer to the first element of the list, or NULL if failure
- Parameters
pi
-The PageInfo to add. From this point on it is the property of the list, so deleting the list deletes this element.
hash
-
-
void
page_info_list_delete
(PageInfoList *pil)¶ Deletes the list and all its contents
Functions¶
-
PageInfoList *
page_info_list_cons
(PageInfoList *pil, PageInfo *pi, uint64_t hash)¶ Add a new element to the head of the list.
- Return
- A pointer to the first element of the list, or NULL if failure
- Parameters
pi
-The PageInfo to add. From this point on it is the property of the list, so deleting the list deletes this element.
hash
-
LinkStream¶
Maybe the most interesting stream going out of PageDB is the link stream, because it is the main interface between PageDB and the different scorers like PageRank and HITS. This stream outputs a list of Link elements, which are just pairs of from and to indices. Right now, because of the way links are stored inside the database, the stream groups together all the links with the same from index; however, this could change in the future and it is actually not necessary for the current PageRank or HITS implementations.
The reason for using a link stream is that when billions of pages are crawled the links database can grow to several hundred megabytes.
Data structures¶
-
struct
PageDBLinkStream
¶
-
struct
Link
¶
Constructor/Destructor¶
-
void
page_db_link_stream_delete
(PageDBLinkStream *es)¶ Delete the link stream and free any transaction held inside the database.
Functions¶
The signatures of these functions use void pointers because they must agree with the following interfaces:
-
typedef
StreamState (LinkStreamNextFunc)(void *state, Link *link)
for
-
StreamState
page_db_link_stream_next
(void *es, Link *link)¶ Get next element inside stream.
- Return
- link_stream_state_next if success
and
-
typedef
StreamState (LinkStreamResetFunc)(void *state)
for
-
StreamState
page_db_link_stream_reset
(void *es)¶ Rewind stream to the beginning
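Draining the stream then looks like this (a sketch; the PageDBLinkStream constructor is not listed in this section, so assume stream was created beforehand, and from/to are assumed member names for the index pair inside Link):

Link link;
while (page_db_link_stream_next(stream, &link) == link_stream_state_next) {
    /* consume the pair link.from -> link.to */
}
page_db_link_stream_reset(stream); /* rewind if a second pass is needed */
page_db_link_stream_delete(stream);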
HashInfoStream¶
Data structures¶
This is used by the command line utility page_db_find, which iterates over all the pages and returns the ones whose URL matches some regexp.
Constructor/Destructor¶
-
PageDBError
hashinfo_stream_new
(HashInfoStream **st, PageDB *db)¶ Create a new stream
-
void
hashinfo_stream_delete
(HashInfoStream *st)¶ Free stream
Functions¶
-
StreamState
hashinfo_stream_next
(HashInfoStream *st, uint64_t *hash, PageInfo **pi)¶ Get next element in stream
HashIdxStream¶
This is used in two different places. The first one is the command line utility page_db_links, which returns the pages that link to, or are linked from, another page.
The other, more important, use case is inside schedulers which, after page scores are updated, need to iterate over all of them to see which ones have changed enough to be rescheduled.
Data structures¶
Constructor/Destructor¶
-
PageDBError
hashidx_stream_new
(HashIdxStream **st, PageDB *db)¶ Create a new stream
-
void
hashidx_stream_delete
(HashIdxStream *st)¶ Free stream
Functions¶
-
StreamState
hashidx_stream_next
(HashIdxStream *st, uint64_t *hash, size_t *idx)¶ Get next element in stream
DomainTemp¶
This is used inside PageDB to track how often the most crawled domains are being crawled. This information will in turn be used by the scheduler, which will try not to serve requests for the most crawled domains.
Ideally, for each domain we would store a (growing) list of timestamps of when some page in the domain was crawled. With this list in hand we could answer questions like how many times has the domain been crawled in the last 60 seconds?. Instead of that we make the following approximation: imagine that we store only how many times the domain has been crawled in the last \(T\) seconds. We don't know how the crawls were distributed in that time: they could all be clustered at the beginning, or follow some strange pattern. Instead we will assume they are evenly distributed.
Now, if some time \(t\) elapses without any more crawls, how many crawls remain in the time window?
The answer is that since there are \(n\) crawls evenly distributed there are \(n/T\) crawls per second, and so \(n\frac{t}{T}\) crawls have moved out of the time window.
If \(t \to dt\) then we have the following differential equation:
\[ \frac{dn}{dt} = -\frac{1}{T}n \]
The solution of the above equation is obviously:
\[ n(t) = n_0 e^{-t/T} \]
For example, with a time window of just 2 seconds and crawls at instants 1, 2.5, 2.6, 2.7, 4 and 5, \(n\) jumps by one at each crawl and decays exponentially in between.
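The cooldown above translates directly into code. Here is a minimal sketch (our own names and fields, not Aduana's implementation) of the update and heat operations implied by the equation, mirroring domain_temp_update and domain_temp_heat documented below:

#include <math.h>

typedef struct {
    float temp;   /* approximate crawl count n */
    float time;   /* last update time */
    float window; /* time window T */
} DomainTempSketch;

static void
domain_temp_sketch_update(DomainTempSketch *d, float t)
{
    /* n(t) = n0 * exp(-t/T): decay since the last update */
    d->temp *= expf(-(t - d->time) / d->window);
    d->time = t;
}

static void
domain_temp_sketch_heat(DomainTempSketch *d)
{
    d->temp += 1.0f; /* one more crawl inside the window */
}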
Data structures¶
-
struct
DomainTemp
¶ Tracks how “hot” the most crawled domains are.
We want to avoid crawling the same domain repeatedly. For this purpose this structure tracks how many times a domain has been crawled within the specified time window. For performance reasons an approximation of the actual number of crawls is maintained. Under certain assumptions it can be shown that if \(n\) is the number of crawls for a domain it follows the following (cool down) differential equation:
\[ \frac{dn}{dt} = -\frac{1}{T}n \]
where \(T\) is the time window.
Public Members
-
DomainTempEntry *
table
¶ An array of domain/temperature pairs
-
size_t
length
¶ Length of DomainTemp::table
-
float
time
¶ Last time temperatures were updated
-
float
window
¶ Time window to consider in the cooldown
-
struct
DomainTempEntry
¶ Associate a domain hash with a temperature
Constructor/Destructor¶
-
DomainTemp *
domain_temp_new
(size_t length, float window)¶ Create a new domain temp tracking structure
- Return
- A pointer to the new struct or NULL if failure
- Parameters
length
-Maximum number of domains to track
window
-Time window
-
void
domain_temp_delete
(DomainTemp *dh)¶ Free memory
Functions¶
-
void
domain_temp_update
(DomainTemp *dh, float t)¶ Updates temp up to current time t
-
void
domain_temp_heat
(DomainTemp *dh, uint32_t hash)¶ Adds another count to a domain.
If the domain is already tracked its counter is incremented. If the domain is not present then we try to initialize it in an empty slot. If no empty slot is available then the domain with the fewest crawls is replaced by the new domain, provided its counter is below 1.
-
float
domain_temp_get
(DomainTemp *dh, uint32_t hash)¶ Gets domain temp
Error handling¶
Errors are signaled in the following ways:
- For functions not returning pointers, 0 means success and any other value some kind of failure. Usually an enumeration of error codes is defined; otherwise -1 is used as the failure code.
- For functions returning pointers, failure is signaled by returning a null pointer.
- If the causes of error are varied enough, the structures inside this library contain an Error structure, which holds the error code and an error message. The error message usually resembles a stack trace, to aid in debugging the problem.
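As an illustration, here is how the first two conventions look with functions from this document (a sketch; error paths abbreviated):

CrawledPage *cp = crawled_page_new("http://example.com/"); /* pointer-returning API */
if (cp == NULL)
    return 1; /* failure signaled with NULL */

if (crawled_page_add_link(cp, "http://example.com/x", 0.0f) != 0)
    return 1; /* non-pointer API: non-zero return code means failure */
crawled_page_delete(cp);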
Data structures¶
-
MAX_ERROR_LENGTH
¶ Maximum length of error message
-
struct
Error
¶
Constructor/Destructor¶
Functions¶
-
void
error_set
(Error *error, int code, const char *msg)¶ Set error.
If an error is already present then do nothing. If you want to overwrite an already existing error then first call error_clean
TxnManager¶
Data structures¶
-
struct
TxnManager
¶ Transaction Manager.
LMDB imposes several restrictions on the operations it allows across multiple threads, and some of these restrictions must be enforced in application code. In particular:
- Some operations require that no transactions in the same process are active, for example mdb_env_set_mapsize
- Some operations require that no write transactions are active. For example, although it is not documented, mdb_env_info seems to crash if write transactions are active.
This structure tracks the number of read and write transactions active inside the process and allows blocking until all of them are aborted or committed.
Public Members
-
MDB_env *
env
¶ LMDB environment where transactions happen
-
InvSemaphore
txn_counter_read
¶ Counter of read transactions
-
InvSemaphore
txn_counter_write
¶ Counter of write transactions
-
struct
InvSemaphore
¶ Inverse Semaphore.
An inverse semaphore blocks when the count is greater than zero (a regular semaphore blocks when the count is at zero).
Constructor/Destructor¶
-
TxnManagerError
txn_manager_new
(TxnManager **tm, MDB_env *env)¶ Allocate a new TxnManager
- Return
- 0 if success, otherwise error code.
- Parameters
tm
-The new transaction manager.
env
-The LMDB environment where transactions will be opened, aborted or committed.
-
TxnManagerError
txn_manager_delete
(TxnManager *tm)¶ Destroy and free manager
Functions¶
The following functions are wrappers around the corresponding ones in LMDB. They will increment/decrement automatically the read and write transactions counters.
-
TxnManagerError
txn_manager_begin
(TxnManager *tm, int flags, MDB_txn **txn)¶ Begin a new transaction.
- Return
- 0 if success, otherwise error code.
- Parameters
tm
-flags
-The flags that you pass to LMDB’s mdb_txn_begin. These flags will be checked for MDB_RDONLY to decide which transaction counter to increment. This operation will block if an environment resize is in progress.
txn
-New transaction.
-
TxnManagerError
txn_manager_commit
(TxnManager *tm, MDB_txn *txn)¶ Commit transaction.
The corresponding counter will be decremented
-
TxnManagerError
txn_manager_abort
(TxnManager *tm, MDB_txn *txn)¶ Abort transaction.
The corresponding counter will be decremented
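A sketch of the intended usage pattern, based only on the wrappers above (error handling abbreviated):

MDB_txn *txn = NULL;
if (txn_manager_begin(tm, 0, &txn) != 0) /* pass MDB_RDONLY in flags for reads */
    return -1;
int ok = 1;
/* ... mdb_get/mdb_put calls using txn, setting ok = 0 on failure ... */
if (ok)
    txn_manager_commit(tm, txn); /* decrements the corresponding counter */
else
    txn_manager_abort(tm, txn);  /* also decrements the counter */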
The following function is the main reason for the existence of
TxnManager
.
-
TxnManagerError
txn_manager_expand
(TxnManager *tm)¶ Check if the environment must be resized. If this is the case then resize it.
This call will always block until there are no write transactions active. It may also block until there are no read transactions active, but only if a resize is necessary.
If a resize happens then creation of new read and write transactions will be blocked until it finishes.
-
MDB_MINIMUM_FREE_PAGES
¶ Parameter associated with txn_manager_expand.
The mmap is resized when the remaining free space is less than this amount.
BFScheduler¶
Data structures¶
-
BF_SCHEDULER_DEFAULT_SIZE
¶ Size of the mmap to store the schedule
-
BF_SCHEDULER_DEFAULT_PERSIST
¶ Default value for BFScheduler::persist
-
struct
BFScheduler
¶ BestFirst scheduler.
As its name implies, this scheduler follows a greedy strategy to decide which page to crawl next. It maintains an ordered list of uncrawled pages. To decide the next page to be crawled the scheduler picks the page with the highest score and removes it from the top of the list.
The key is then to assign valid scores to the pages. If no scorer is selected this scheduler will use the score provided when the page is crawled. Additionally an alternative scorer can be set up, see for example page_rank_scorer_setup or hits_scorer_setup.
Public Members
-
PageDB *
page_db
¶ Page database
The page database is neither created nor destroyed by the scheduler. The rationale is that the scheduler can be changed while using the same PageDB. The schedule is “attached” to the PageDB.
-
Scorer *
scorer
¶ The scorer used to get page scores.
If not set up, the PageInfo.score will be used
-
TxnManager *
txn_manager
¶ The scheduler state is maintained inside an LMDB environment
-
char *
path
¶ Path to the env
It is built by appending
_bfs
to the PageDB::path
-
int
persist
¶ If true, do not delete files after deleting object
-
float
max_soft_domain_crawl_rate
¶ Soft limit on crawls per second per domain
-
float
max_hard_domain_crawl_rate
¶ Hard limit on crawls per second per domain
Constructor/Destructor¶
-
BFSchedulerError
bf_scheduler_new
(BFScheduler **sch, PageDB *db)¶ Allocate memory and create a new scheduler
- Return
- 0 if success, otherwise the error code
- Parameters
sch
-Where to create it. *sch can be NULL in case of memory error
db
-PageDB to attach. Remember it will not be created nor destroyed by the scheduler
-
void
bf_scheduler_delete
(BFScheduler *sch)¶ Delete scheduler.
It may or may not delete associated disk files depending on the BFScheduler::persist flag
Input/Output¶
-
BFSchedulerError
bf_scheduler_add
(BFScheduler *sch, const CrawledPage *page)¶ Add a new crawled page
It will add the page also to the PageDB.
- Return
- 0 if success, otherwise the error code
- Parameters
sch
-page
-
-
BFSchedulerError
bf_scheduler_request
(BFScheduler *sch, size_t n_pages, PageRequest **request)¶ Request the next pages to be crawled
The requests are taken from the top of the schedule, in decreasing score order.
- Return
- 0 if success, otherwise the error code
- Parameters
sch
-
n_pages
-Maximum number of page requests to return
request
-Where the new requests are returned
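To tie the pieces together, here is a sketch of the basic crawl loop (PageRequest's members and destructor are not listed in this section, so the requests are only handled conceptually; error checks abbreviated):

BFScheduler *sch = NULL;
if (bf_scheduler_new(&sch, db) != 0) /* db: an existing PageDB */
    return -1;

CrawledPage *cp = crawled_page_new("http://example.com/");
bf_scheduler_add(sch, cp);           /* also adds the page to the PageDB */
crawled_page_delete(cp);

PageRequest *req = NULL;
bf_scheduler_request(sch, 10, &req); /* ask for up to 10 pages to crawl */
/* ... crawl the requested URLs, feed the results back with
   bf_scheduler_add, and release req as the API requires ... */
bf_scheduler_delete(sch);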
Update scores¶
-
BF_SCHEDULER_UPDATE_BATCH_SIZE
¶ Size of the batch used in updating the schedule.
Updating the schedule involves starting a write transaction. However, write transactions coming from multiple threads are serialized. Since adding new pages to the schedule and returning requests also start write transactions, the update thread could block these more critical operations. To avoid this we avoid long write transactions by splitting them into batches.
-
BF_SCHEDULER_UPDATE_NUM_PAGES
¶ Don’t update scores until this amount of new pages has arrived
-
BF_SCHEDULER_UPDATE_PER_PAGES
¶ Don’t update scores until this percentage of new pages has arrived
-
BFSchedulerError
bf_scheduler_update_start
(BFScheduler *sch)¶ Start the update thread.
The update thread will periodically run the scorer, in case there is one, to recompute page scores.
-
BFSchedulerError
bf_scheduler_update_stop
(BFScheduler *sch)¶ Stop the update thread
Settings¶
-
void
bf_scheduler_set_persist
(BFScheduler *sch, int value)¶ Set persist option for scheduler
-
BF_SCHEDULER_CRAWL_RATE_STEPS
¶ Number of steps to take between soft and hard crawl rate limit
-
BFSchedulerError
bf_scheduler_set_max_domain_crawl_rate
(BFScheduler *sch, float max_soft_crawl_rate, float max_hard_crawl_rate)¶ Set BFScheduler::max_soft_domain_crawl_rate and BFScheduler::max_hard_domain_crawl_rate
Scorer¶
-
struct
Scorer
¶ Scorers are responsible for computing a measure, between 0 and 1, of the relevance of a given page.
In order to be usable by different schedulers they must obey the following interface.
-
typedef
int (ScorerUpdateFunc)(void *state)
Scorer update function interface
-
typedef
int (ScorerAddFunc)(void *state, const PageInfo *page_info, float *score)
Scorer add page function interface
-
typedef
int (ScorerGetFunc)(void *state, size_t idx, float *score_old, float *score_new)
Scorer get page score function
To see concrete implementations have a look at
PageRankScorer
and HitsScorer
.
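For example, wiring a PageRankScorer into a BFScheduler looks roughly like this (a sketch; it assumes, as the BFScheduler section suggests, that the scheduler owns a Scorer instance at sch->scorer):

PageRankScorer *prs = NULL;
if (page_rank_scorer_new(&prs, db) != 0) /* db: the scheduler's PageDB */
    return -1;
page_rank_scorer_setup(prs, sch->scorer); /* fill the Scorer's fields */
bf_scheduler_update_start(sch);           /* scores recomputed periodically */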
PageRankScorer¶
Data structures¶
-
PAGE_RANK_SCORER_USE_CONTENT_SCORES
¶ Default value for PageRankScorer::use_content_scores
-
PAGE_RANK_SCORER_PERSIST
¶ Default value for PageRankScorer::persist
-
struct
PageRankScorer
¶ Public Members
-
int
persist
¶ If true files will not be removed by page_rank_scorer_delete
-
int
use_content_scores
¶ If true use content scores inside PageRank algorithm
Constructor/Destructor¶
-
PageRankScorerError
page_rank_scorer_new
(PageRankScorer **prs, PageDB *db)¶ Create new scorer
-
PageRankScorerError
page_rank_scorer_delete
(PageRankScorer *prs)¶ Delete scorer.
Files will be deleted unless PageRankScorer::persist is true
Functions¶
-
int
page_rank_scorer_add
(void *state, const PageInfo *page_info, float *score)¶ Add new page to scorer.
Function signature complies with Scorer::add
-
int
page_rank_scorer_get
(void *state, size_t idx, float *score_old, float *score_new)¶ Access PageRank scorer as with page_rank_get.
Function signature complies with Scorer::get
-
int
page_rank_scorer_update
(void *state)¶ Update scores.
Function signature complies with Scorer::update
-
void
page_rank_scorer_setup
(PageRankScorer *prs, Scorer *scorer)¶ Given a Scorer fill its fields with the necessary info
Settings¶
-
void
page_rank_scorer_set_persist
(PageRankScorer *prs, int value)¶
-
void
page_rank_scorer_set_use_content_scores
(PageRankScorer *prs, int value)¶
-
void
page_rank_scorer_set_damping
(PageRankScorer *prs, float value)¶ Sets PageRankScorer::page_rank::damping
HitsScorer¶
Data structures¶
-
HITS_SCORER_USE_CONTENT_SCORES
¶ Default value for HitsScorer::use_content_scores
-
HITS_SCORER_PERSIST
¶ Default value for HitsScorer::persist
-
struct
HitsScorer
¶
Constructor/Destructor¶
-
HitsScorerError
hits_scorer_new
(HitsScorer **hs, PageDB *db)¶ Create new scorer
-
HitsScorerError
hits_scorer_delete
(HitsScorer *hs)¶ Delete scorer.
Files will be deleted unless HitsScorer::persist is true
Functions¶
-
int
hits_scorer_add
(void *state, const PageInfo *page_info, float *score)¶ Add new page to scorer.
Function signature complies with Scorer::add
-
int
hits_scorer_get
(void *state, size_t idx, float *score_old, float *score_new)¶ Access HITS scorer as with hits_get_authority.
Function signature complies with Scorer::get
-
int
hits_scorer_update
(void *state)¶ Update scores.
Function signature complies with Scorer::update
-
void
hits_scorer_setup
(HitsScorer *hs, Scorer *scorer)¶ Given a Scorer fill its fields with the necessary info
Settings¶
-
void
hits_scorer_set_persist
(HitsScorer *hs, int value)¶ Sets HitsScorer::persist
-
void
hits_scorer_set_use_content_scores
(HitsScorer *hs, int value)¶
PageRank¶
Data structures¶
-
PAGE_RANK_DEFAULT_DAMPING
¶ Default PageRank::damping
-
PAGE_RANK_DEFAULT_MAX_LOOPS
¶ Default PageRank::max_loops
-
PAGE_RANK_DEFAULT_PRECISION
¶ Default PageRank::precision
-
PAGE_RANK_DEFAULT_PERSIST
¶ Default PageRank::persist
-
struct
PageRank
¶ Implementation of the PageRank algorithm.
See for example Wikipedia.
Additionally, it allows merging the original, purely link-based algorithm with page content scores.
Public Members
-
MMapArray *
out_degree
¶ Number of outgoing links.
If page content scores are used then this array is actually the aggregated scores of all the outgoing links.
-
size_t
n_pages
¶ Number of pages
-
char *
path_out_degree
¶ Path to the out degree mmap array file
-
char *
path_pr
¶ Path to page rank mmap array file
-
float
damping
¶ Probability of making a random page jump: 1.0 - damping
-
float
total_score
¶ Total score
-
size_t
max_loops
¶ If greater than 0 stop computation even if precision was not achieved
-
float
precision
¶ Stop iterating when the largest change in any page score is below this threshold
-
int
persist
¶ If true, do not delete files after deleting
Constructor/Destructor¶
-
PageRankError
page_rank_new
(PageRank **pr, const char *path, size_t max_vertices)¶ Create a new structure.
- Return
- 0 if success, otherwise an error code.
- Parameters
pr
-The new structure is returned here. NULL if memory error.
path
-Directory where all files will be stored.
max_vertices
-Initial hint of the number of pages.
-
PageRankError
page_rank_delete
(PageRank *pr)¶ Free memory and close associated resources.
Files will be deleted or not depending on the value of PageRank::persist.
Functions¶
-
PageRankError
page_rank_set_n_pages
(PageRank *pr, size_t n_pages)¶ Reserve memory for the specified number of pages
-
PageRankError
page_rank_compute
(PageRank *pr, void *link_stream_state, LinkStreamNextFunc *link_stream_next, LinkStreamResetFunc *link_stream_reset)¶ Compute PageRank score for all pages.
The algorithm makes random access of page scores and sequential access of the links.
- Return
- 0 if success, otherwise an error code.
- Parameters
pr
-link_stream_state
-For example PageDBLinkStream
link_stream_next
-For example page_db_link_stream_next
link_stream_reset
-For example page_db_link_stream_reset
-
PageRankError
page_rank_get
(const PageRank *pr, size_t idx, float *score_old, float *score_new)¶ Get PageRank score associated to a given page.
- Return
- 0 if success, otherwise an error code.
- Parameters
pr
-idx
-Page index.
score_old
-Score on the previous call to page_rank_compute.
score_new
-Score on the last call to page_rank_compute.
-
void
page_rank_set_persist
(PageRank *pr, int value)¶ Set value of PageRank::persist
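A sketch of a full PageRank computation over a PageDB link stream, using the signatures above (the PageDBLinkStream constructor is not listed in this section, so assume stream was created from the database beforehand):

PageRank *pr = NULL;
if (page_rank_new(&pr, "./pr-data", 1000 /* initial hint */) != 0)
    return -1;
page_rank_compute(pr, stream,
                  page_db_link_stream_next,
                  page_db_link_stream_reset);
float score_old, score_new;
page_rank_get(pr, 0 /* page index */, &score_old, &score_new);
page_rank_delete(pr);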
Hits¶
Data structures¶
-
HITS_DEFAULT_MAX_LOOPS
¶ Default Hits::max_loops
-
HITS_DEFAULT_PRECISION
¶ Default Hits::precision
-
HITS_DEFAULT_PERSIST
¶ Default Hits::persist
-
struct
Hits
¶ Implementation of the HITS algorithm.
See for example Wikipedia.
Additionally, it allows merging the original, purely link-based algorithm with page content scores. The idea is that the authority scores are distributed back to the hub according to the content score. For example, imagine that page A links to B, C and D and the content/authority scores are:
- B: 0.5 / 0.1
- C: 0.1 / 1.0
- D: 0.9 / 0.5
Then the hub score of A would be computed as:
Hub(A) = 0.5*0.1 + 0.1*1.0 + 0.9*0.5
Public Members
-
size_t
n_pages
¶ Number of pages
-
size_t
max_loops
¶ If greater than 0 stop computation even if precision was not achieved
-
float
precision
¶ Stop iterating when the largest change in any page score is below this threshold
-
int
persist
¶ If true, do not delete files after deleting object
Constructor/Destructor¶
-
HitsError
hits_new
(Hits **hits, const char *path, size_t max_vertices)¶ Create a new structure.
- Return
- 0 if success, otherwise an error code.
- Parameters
hits
-The new structure is returned here. NULL if memory error.
path
-Directory where all files will be stored.
max_vertices
-Initial hint of the number of pages.
-
HitsError
hits_delete
(Hits *hits)¶ Free memory and close associated resources.
Files will be deleted or not depending on the value of Hits::persist.
Functions¶
-
HitsError
hits_set_n_pages
(Hits *hits, size_t n_pages)¶ Reserve memory for the specified number of pages
-
HitsError
hits_compute
(Hits *hits, void *link_stream_state, LinkStreamNextFunc *link_stream_next, LinkStreamResetFunc *link_stream_reset)¶ Compute HITS score for all pages.
The algorithm makes random access of page scores and sequential access of the links.
- Return
- 0 if success, otherwise an error code.
- Parameters
hits
-link_stream_state
-For example PageDBLinkStream
link_stream_next
-For example page_db_link_stream_next
link_stream_reset
-For example page_db_link_stream_reset
-
HitsError
hits_get_hub
(const Hits *pr, size_t idx, float *score_old, float *score_new)¶ Get hub score associated to a given page.
- Return
- 0 if success, otherwise an error code.
- Parameters
pr
-idx
-Page index.
score_old
-Score on the previous call to hits_compute.
score_new
-Score on the last call to hits_compute.
-
HitsError
hits_get_authority
(const Hits *pr, size_t idx, float *score_old, float *score_new)¶ Get authority score associated to a given page.
- Return
- 0 if success, otherwise an error code.
- Parameters
pr
-idx
-Page index.
score_old
-Score on the previous call to hits_compute.
score_new
-Score on the last call to hits_compute.
-
void
hits_set_persist
(Hits *hits, int value)¶ Set value of Hits::persist
MMapArray¶
Data structures¶
-
struct
MMapArray
¶ A memory mapped array
Constructor/Destructor¶
-
MMapArrayError
mmap_array_new
(MMapArray **marr, const char *path, size_t n_elements, size_t element_size)¶ Create a new MMapArray
- Return
- 0 if success, otherwise the error code (also available in marr if not NULL)
- Parameters
marr
-Will be changed to point to the newly allocated structure, or NULL if failure
path
-Path to the associated file. Can be NULL in which case the mapping is made anonymous.
n_elements
-Number of elements (can be changed later with mmap_array_resize)
element_size
-Number of bytes of each element
-
MMapArrayError
mmap_array_delete
(MMapArray *marr)¶ Delete MMapArray
If the structure cannot be deleted, the memory will not be freed
- Return
- 0 if success, otherwise the error code (also available in marr)
Functions¶
-
MMapArrayError
mmap_array_advise
(MMapArray *marr, int flag)¶ Advise memory use pattern
It accepts any flag that madvise accepts
- Return
- 0 if success, otherwise the error code (also available in marr)
-
MMapArrayError
mmap_array_sync
(MMapArray *marr, int flag)¶ Force memory-disk synchronization
It accepts any flag that msync accepts
- Return
- 0 if success, otherwise the error code (also available in marr)
-
void *
mmap_array_idx
(MMapArray *marr, size_t n)¶ Returns pointer to the array element
- Return
- In case of failure it will return NULL. The error code is available in marr
-
MMapArrayError
mmap_array_set
(MMapArray *marr, size_t n, const void *x)¶ Set array element value
- Return
- 0 if success, otherwise the error code (also available in marr)
-
MMapArrayError
mmap_array_resize
(MMapArray *marr, size_t n_elements)¶ Change number of elements
The new memory is initialized to 0
- Return
- 0 if success, otherwise the error code (also available in marr)
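A usage sketch covering the functions above, with a float array backed by a file (MS_SYNC comes from sys/mman.h, since mmap_array_sync accepts msync flags):

MMapArray *marr = NULL;
if (mmap_array_new(&marr, "./scores.bin", 1000, sizeof(float)) != 0)
    return -1;
float x = 0.5f;
mmap_array_set(marr, 42, &x);        /* write element 42 */
float *y = mmap_array_idx(marr, 42); /* pointer to element 42, NULL on error */
mmap_array_resize(marr, 2000);       /* grow; the new memory is zeroed */
mmap_array_sync(marr, MS_SYNC);      /* flush changes to disk */
mmap_array_delete(marr);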