The term weighting and ranking function is at the core of any information retrieval system. The vector space model with cosine similarity is perhaps the best known and most widely used, but there are plenty of alternatives. We'll look at two here: the BM25 function, which is based on a probabilistic model, and a function based on language modeling.
Just to get something to work with, we'll build a quick index over some strings that stand in for our documents. For real-world usage this structure would be too simple, but it provides the bits the ranking functions need. We build a quick inverted index of term => document ID mappings, and store the number of times each term was seen in each document.
We also collect a couple of statistics you wouldn't always need, such as the total and average number of tokens in the collection and the number of tokens in each document, to allow us to do some length-related operations. The code below gives us the $collection and $queryTerms arrays, which are used in the two ranking functions.
<?php
$docs = array(
    "d1" => "this document is the first document that is quite long",
    "d2" => "this is yet another document that is very slightly longer",
    "d3" => "this isn't a very interesting string",
    "d4" => "this isn't a very interesting document either"
);
$query = 'interesting document';
preg_match_all('/\w+/', $query, $matches);
$queryTerms = $matches[0];
$collection = array('terms' => array(), 'length' => 0, 'documents' => array());
foreach($docs as $docID => $doc) {
    preg_match_all('/\w+/', $doc, $matches);
    // store the document length
    $collection['documents'][$docID] = count
Truncated by Planet PHP, read more at the original (another 24562 bytes)
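The index described above can be sketched roughly as follows. This is a minimal, self-contained illustration and not the original listing: the function name buildIndex, the 'averageLength' key and the exact layout of the 'terms' array (term => document ID => count) are assumptions made for the example. It uses the same \w+ tokenisation as the code above.

```php
<?php
// Hypothetical helper (not from the original post): builds a small
// inverted index with the statistics described in the text.
function buildIndex(array $docs): array
{
    $collection = array(
        'terms' => array(),      // assumed layout: term => array(docID => count)
        'length' => 0,           // total number of tokens in the collection
        'documents' => array(),  // docID => number of tokens in that document
    );

    foreach ($docs as $docID => $doc) {
        preg_match_all('/\w+/', $doc, $matches);
        $tokens = $matches[0];

        // store the document length and add it to the collection total
        $collection['documents'][$docID] = count($tokens);
        $collection['length'] += count($tokens);

        // count how often each term occurs in this document
        foreach ($tokens as $token) {
            if (!isset($collection['terms'][$token][$docID])) {
                $collection['terms'][$token][$docID] = 0;
            }
            $collection['terms'][$token][$docID]++;
        }
    }

    // average document length (assumed key name), guarding against an empty collection
    $docCount = count($collection['documents']);
    $collection['averageLength'] = $docCount > 0 ? $collection['length'] / $docCount : 0;

    return $collection;
}

// Usage with two of the example documents
$collection = buildIndex(array(
    "d1" => "this document is the first document that is quite long",
    "d3" => "this isn't a very interesting string",
));
```

Note that \w+ splits on the apostrophe, so "isn't" is indexed as the two tokens "isn" and "t". That is fine for a toy collection, but a real tokeniser would probably want to handle it differently.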