Add full text search with tokenization and classification to your website

TNTSearch is a full-text search (FTS) engine written entirely in PHP. It is free and released under the MIT License. Let’s look at some of its key features, along with code examples.

TNTSearch offers the following functionality out of the box:

  • Fuzzy search
  • Search as you type
  • Geo-search
  • Text classification
  • Stemming
  • Custom tokenizers
  • BM25 ranking algorithm
  • Boolean search
  • Result highlighting
  • Dynamic index updates (no need to reindex each time)
  • Easily deployable via Packagist.org

Installing TNTSearch for PHP

The easiest way to install TNTSearch is via Composer:

composer require teamtnt/tntsearch

Before you proceed, make sure your server meets the following requirements:

  • PHP >= 7.1
  • PDO PHP Extension
  • SQLite PHP Extension
  • mbstring PHP Extension
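
The requirements above can be checked from PHP itself before installing. The following is a minimal sketch: the function name missingRequirements is made up for this example, and it assumes the SQLite requirement is satisfied by the pdo_sqlite extension.

```php
<?php
// Hypothetical helper: lists which of the requirements above are not met.
// Assumption: "SQLite PHP Extension" is checked here as pdo_sqlite.
function missingRequirements(): array
{
    $missing = [];

    if (version_compare(PHP_VERSION, '7.1.0', '<')) {
        $missing[] = 'PHP >= 7.1 (found ' . PHP_VERSION . ')';
    }

    foreach (['pdo', 'pdo_sqlite', 'mbstring'] as $extension) {
        if (!extension_loaded($extension)) {
            $missing[] = $extension . ' extension';
        }
    }

    return $missing;
}

$missing = missingRequirements();
echo $missing === []
    ? "All requirements met\n"
    : 'Missing: ' . implode(', ', $missing) . "\n";
```

An empty result means the server should be ready for the Composer install.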

How to Create an index in TNTSearch

To run full-text search queries, you first have to create an index.

Usage:

use TeamTNT\TNTSearch\TNTSearch;

$tnt = new TNTSearch;

$tnt->loadConfig([
    'driver'    => 'mysql',
    'host'      => 'localhost',
    'database'  => 'dbname',
    'username'  => 'user',
    'password'  => 'pass',
    'storage'   => '/var/www/tntsearch/examples/',
    'stemmer'   => \TeamTNT\TNTSearch\Stemmer\PorterStemmer::class // optional
]);

$indexer = $tnt->createIndex('name.index');
$indexer->query('SELECT id, article FROM articles;');
//$indexer->setLanguage('german');
$indexer->run();

Important: The “storage” setting marks the folder where all of your indexes will be saved, so make sure you have permission to write to this folder. Otherwise you can expect the following exception to be thrown:

  • [PDOException] SQLSTATE[HY000] [14] unable to open database file
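
One way to catch this early is to test the storage folder before creating the index. This is only a sketch: storageProblem is a hypothetical helper, and the path is the example path used in the config above.

```php
<?php
// Hypothetical helper: returns null when the folder is usable,
// or a short message describing what is wrong with it.
function storageProblem(string $path): ?string
{
    if (!is_dir($path)) {
        return "Storage path is not a directory: $path";
    }
    if (!is_writable($path)) {
        return "Storage path is not writable: $path";
    }
    return null;
}

$problem = storageProblem('/var/www/tntsearch/examples/');
echo $problem ?? 'Storage directory looks writable', "\n";
```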

Note: If your primary key is named something other than id, set it like this:

$indexer->setPrimaryKey('article_id');

Boolean Search

use TeamTNT\TNTSearch\TNTSearch;

$tnt = new TNTSearch;

$tnt->loadConfig($config);
$tnt->selectIndex("name.index");

// returns all documents that contain romeo but not juliet
$res = $tnt->searchBoolean("romeo -juliet");

// returns all documents that contain romeo or hamlet
$res = $tnt->searchBoolean("romeo or hamlet");

// returns all documents that contain either romeo AND juliet, or prince AND hamlet
$res = $tnt->searchBoolean("(romeo juliet) or (prince hamlet)");
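
To make the meaning of these three queries concrete, here is a small stand-alone sketch that applies the same logic to a plain PHP array. It does not use TNTSearch and says nothing about how the library evaluates queries internally. The sample documents and the helpers matchIds and unionIds are invented for illustration.

```php
<?php
// Illustration only: sample data and helpers are hypothetical.
$documents = [
    1 => 'romeo and juliet',
    2 => 'romeo alone',
    3 => 'prince hamlet',
    4 => 'macbeth',
];

// Ids of documents that contain every required word and no excluded word.
function matchIds(array $documents, array $required, array $excluded = []): array
{
    $ids = [];
    foreach ($documents as $id => $text) {
        $words = preg_split('/\s+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
        $hasAllRequired = array_diff($required, $words) === [];
        $hasNoExcluded  = array_intersect($excluded, $words) === [];
        if ($hasAllRequired && $hasNoExcluded) {
            $ids[] = $id;
        }
    }
    return $ids;
}

// Sorted union of two id lists, without duplicates ("or").
function unionIds(array $a, array $b): array
{
    $ids = array_values(array_unique(array_merge($a, $b)));
    sort($ids);
    return $ids;
}

// "romeo -juliet"
$onlyRomeo = matchIds($documents, ['romeo'], ['juliet']);
// "romeo or hamlet"
$romeoOrHamlet = unionIds(matchIds($documents, ['romeo']), matchIds($documents, ['hamlet']));
// "(romeo juliet) or (prince hamlet)"
$pairs = unionIds(
    matchIds($documents, ['romeo', 'juliet']),
    matchIds($documents, ['prince', 'hamlet'])
);

echo implode(',', $onlyRomeo), "\n";     // 2
echo implode(',', $romeoOrHamlet), "\n"; // 1,2,3
echo implode(',', $pairs), "\n";         // 1,3
```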

Fuzzy Search

The fuzziness can be tweaked by setting the following member variables:

public $fuzzy_prefix_length  = 2;
public $fuzzy_max_expansions = 50;
public $fuzzy_distance       = 2; // represents the Levenshtein distance

Fuzzy search is switched on with the fuzziness flag:

use TeamTNT\TNTSearch\TNTSearch;

$tnt = new TNTSearch;

$tnt->loadConfig($config);
$tnt->selectIndex("name.index");
$tnt->fuzziness = true;

//when the fuzziness flag is set to true, the keyword juleit will return
//documents that match the word juliet, the default Levenshtein distance is 2
$res = $tnt->search("juleit");
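
To see why juleit can match juliet at distance 2, PHP's built-in levenshtein() function is enough. The sketch below is a stand-alone illustration, not TNTSearch code: fuzzyCandidates is a hypothetical helper, and treating the prefix length as "the first characters must match exactly" is an assumption about how such a setting is typically used.

```php
<?php
// Illustration only: TNTSearch's own candidate selection may differ.
function fuzzyCandidates(
    string $keyword,
    array $vocabulary,
    int $prefixLength = 2,
    int $maxDistance = 2
): array {
    $prefix  = substr($keyword, 0, $prefixLength);
    $matches = [];

    foreach ($vocabulary as $word) {
        // Assumption: only words sharing the keyword's prefix are considered.
        if (substr($word, 0, $prefixLength) !== $prefix) {
            continue;
        }
        $distance = levenshtein($keyword, $word);
        if ($distance <= $maxDistance) {
            $matches[$word] = $distance;
        }
    }

    asort($matches); // closest words first
    return $matches;
}

$candidates = fuzzyCandidates('juleit', ['juliet', 'jungle', 'romeo']);
print_r($candidates); // only juliet qualifies, with distance 2
```

Swapping two neighbouring letters counts as two single-character edits, which is why the typo lands exactly on the default distance of 2.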

Updating the index

Once you have created an index, you don’t need to rebuild it each time your document collection changes. TNTSearch supports dynamic index updates.

use TeamTNT\TNTSearch\TNTSearch;

$tnt = new TNTSearch;

$tnt->loadConfig($config);
$tnt->selectIndex("name.index");

$index = $tnt->getIndex();

//to insert a new document to the index
$index->insert(['id' => '11', 'title' => 'new title', 'article' => 'new article']);

//to update an existing document
$index->update(11, ['id' => '11', 'title' => 'updated title', 'article' => 'updated article']);

//to delete the document from index
$index->delete(12);
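
Why can an index be changed without a full rebuild? A toy inverted index shows the general idea: each word points to the ids of the documents containing it, so a single document can be added or removed by touching only its own words. TinyIndex below is a hypothetical class written for this explanation and is not how TNTSearch stores its data.

```php
<?php
// Hypothetical toy index, for illustration only; not part of TNTSearch.
class TinyIndex
{
    private $postings  = []; // word => [document id => true]
    private $documents = []; // document id => words indexed for that document

    public function insert(int $id, string $text): void
    {
        $words = array_unique(
            preg_split('/\W+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY)
        );
        $this->documents[$id] = $words;
        foreach ($words as $word) {
            $this->postings[$word][$id] = true;
        }
    }

    public function delete(int $id): void
    {
        // Only the words of this one document are touched.
        foreach ($this->documents[$id] ?? [] as $word) {
            unset($this->postings[$word][$id]);
            if ($this->postings[$word] === []) {
                unset($this->postings[$word]);
            }
        }
        unset($this->documents[$id]);
    }

    public function update(int $id, string $text): void
    {
        $this->delete($id);
        $this->insert($id, $text);
    }

    public function search(string $word): array
    {
        return array_keys($this->postings[strtolower($word)] ?? []);
    }
}

$tiny = new TinyIndex();
$tiny->insert(11, 'new article');
$tiny->insert(12, 'another article');
$tiny->update(11, 'updated article');
$tiny->delete(12);

print_r($tiny->search('article')); // only document 11 is left
```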

Custom Tokenizer

First, create your own tokenizer class. It should extend the AbstractTokenizer class, define the word-split $pattern value, and implement TokenizerInterface:

use TeamTNT\TNTSearch\Support\AbstractTokenizer;
use TeamTNT\TNTSearch\Support\TokenizerInterface;

class SomeTokenizer extends AbstractTokenizer implements TokenizerInterface
{
    static protected $pattern = '/[\s,\.]+/';

    public function tokenize($text) {
        return preg_split($this->getPattern(), strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
    }
}

This tokenizer will split words using spaces, commas and periods.
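
The effect of that pattern can be tried out without the library at all. The snippet below runs the same preg_split call on a sample sentence. The sentence is made up. Note that \s also covers tabs and line breaks, not just spaces.

```php
<?php
// Stand-alone demonstration of the split pattern; no TNTSearch classes involved.
$pattern = '/[\s,\.]+/';
$tokens  = preg_split(
    $pattern,
    strtolower('Romeo, Juliet. Prince Hamlet'),
    -1,
    PREG_SPLIT_NO_EMPTY
);

print_r($tokens); // romeo, juliet, prince, hamlet
```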

Once the tokenizer is ready, pass it to TNTIndexer via the setTokenizer method:

use TeamTNT\TNTSearch\Indexer\TNTIndexer;

$someTokenizer = new SomeTokenizer;

$indexer = new TNTIndexer;
$indexer->setTokenizer($someTokenizer);

Another way is to pass the tokenizer via the config:

use TeamTNT\TNTSearch\TNTSearch;

$tnt = new TNTSearch;

$tnt->loadConfig([
    'driver'    => 'mysql',
    'host'      => 'localhost',
    'database'  => 'dbname',
    'username'  => 'user',
    'password'  => 'pass',
    'storage'   => '/var/www/tntsearch/examples/',
    'stemmer'   => \TeamTNT\TNTSearch\Stemmer\PorterStemmer::class, // optional
    'tokenizer' => \TeamTNT\TNTSearch\Support\SomeTokenizer::class
]);

$indexer = $tnt->createIndex('name.index');
$indexer->query('SELECT id, article FROM articles;');
$indexer->run();

Classification

use TeamTNT\TNTSearch\Classifier\TNTClassifier;

$classifier = new TNTClassifier();
$classifier->learn("A great game", "Sports");
$classifier->learn("The election was over", "Not sports");
$classifier->learn("Very clean match", "Sports");
$classifier->learn("A clean but forgettable game", "Sports");

$guess = $classifier->predict("It was a close election");
var_dump($guess['label']); //returns "Not sports"
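
For readers curious how a classifier can arrive at "Not sports" from so little training data, here is a rough sketch of a Naive Bayes text classifier with add-one smoothing, trained on the same four sentences. It is an assumption that this resembles what TNTClassifier does. TinyBayes is a hypothetical class written for this post, and it returns only the label rather than an array.

```php
<?php
// Hypothetical TinyBayes class: a sketch of Naive Bayes with add-one smoothing.
class TinyBayes
{
    private $wordCounts = []; // label => [word => count]
    private $docCounts  = []; // label => number of training statements
    private $vocabulary = []; // word => true

    private function tokenize(string $text): array
    {
        return preg_split('/\W+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
    }

    public function learn(string $text, string $label): void
    {
        $this->docCounts[$label] = ($this->docCounts[$label] ?? 0) + 1;
        foreach ($this->tokenize($text) as $word) {
            $this->wordCounts[$label][$word] = ($this->wordCounts[$label][$word] ?? 0) + 1;
            $this->vocabulary[$word] = true;
        }
    }

    public function predict(string $text): string
    {
        $totalDocs = array_sum($this->docCounts);
        $vocabSize = count($this->vocabulary);
        $scores    = [];

        foreach ($this->docCounts as $label => $docs) {
            $counts = $this->wordCounts[$label] ?? [];
            $total  = array_sum($counts);
            $score  = log($docs / $totalDocs); // log of the label's prior
            foreach ($this->tokenize($text) as $word) {
                // Add-one smoothing keeps unseen words from zeroing the score.
                $score += log((($counts[$word] ?? 0) + 1) / ($total + $vocabSize));
            }
            $scores[$label] = $score;
        }

        arsort($scores); // highest score first
        reset($scores);
        return (string) key($scores);
    }
}

$bayes = new TinyBayes();
$bayes->learn('A great game', 'Sports');
$bayes->learn('The election was over', 'Not sports');
$bayes->learn('Very clean match', 'Sports');
$bayes->learn('A clean but forgettable game', 'Sports');

echo $bayes->predict('It was a close election'), "\n"; // Not sports
```

Even though three of the four training sentences are labelled Sports, the words was and election appear only in the Not sports sentence, which is enough to tip the score.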

Saving the classifier

$classifier->save('sports.cls');

Loading the classifier

$classifier = new TNTClassifier();
$classifier->load('sports.cls');

This work is licensed under the MIT license

The post Add full text search with tokenization and classification to your website first appeared on CodeSnippetsandTutorials™ - Code Tutorials, Graphic Design and More.