Bug Tracker Blog by Corey Trager

Full-Text Search in ASP.NET using Lucene.NET

by Corey Trager 22. February 2009 07:00


This post is about the full-text search engine Lucene.NET and how I integrated it into BugTracker.NET. If you are thinking of adding full-text search to your application, you might find this post useful. I'm not saying this is THE way of using Lucene.NET, but it is an example of ONE way.
 
Lucene.NET is a C# port of the original Lucene, an Apache Software Foundation open source project written in Java.

Why did I use Lucene.NET instead of the SQL Server full-text search engine? Well, I'd like to say that I researched the pros and cons of the two choices, but actually I didn't do any comparative research. What happened was that during a Stack Overflow podcast I heard Joel Spolsky mention that FogBugz uses Lucene as its engine and that he was happy with it. I trust him, and I was curious, so one weekend I downloaded Lucene.NET and played with it a bit, and before the weekend was over I had already finished integrating it into BugTracker.NET. I never looked at the SQL Server alternative at all, so I can't tell you anything about it.

Lucene itself is a class library, not an executable. You call Lucene functions to do the search. There is an open source standalone server built on Lucene called Solr. You send Solr messages to do the search. One way of using Lucene would have been to have my users run Solr side-by-side with SQL Server. As with SQL Server full-text search, I can't tell you anything about Solr because I didn't try it. It wouldn't have made sense to use Solr for BugTracker.NET, I think, because Solr would have been an additional installation hassle. And running a server wouldn't have been doable at all at a cheap shared host like GoDaddy, where my own BugTracker.NET demo lives. So, instead of using Solr, I used the Lucene class libraries directly.

To integrate Lucene, I had to work out the following, which I list here and then describe in more detail below.

1) How Lucene would build its searchable index. Lucene doesn't search my SQL Server database directly. Instead, it searches its own "database", its own index.
2) The design of the Lucene index.
3) How I would update Lucene's index whenever data in my database changes.
4) How to send the search query to Lucene.
5) How to display the results.


Now the details. I've simplified my code for this post, so that you can more easily see the overall design and understand the concepts and my design choices.



1) Building the index.

When an ASP.NET application receives its first HTTP request after having been shut down, the Application_OnStart event fires, which I handle in Global.asax. There I call my "build_lucene_index" method. Notice that I have a configuration setting, "EnableLucene". I was nervous about my understanding of Lucene and whether my way of using it was the right architecture, so I wanted to give my users a way of turning Lucene off in case it was causing trouble. More on that in a bit.

For a really big database, you wouldn't necessarily want to rebuild the search index from scratch every time the application starts, but I'm counting on BugTracker.NET databases being on the small side. Is that a safe assumption? A bug database shouldn't be that big or else you're doing it wrong, right?


public void Application_OnStart(Object sender, EventArgs e)
{
    if (btnet.Util.get_setting("EnableLucene", "1") == "1")
    {
        build_lucene_index(this.Application);
    }
}



The build_lucene_index method starts a new worker thread, where the real work is done.


public static void build_lucene_index(System.Web.HttpApplicationState app)
{
    System.Threading.Thread thread = new System.Threading.Thread(threadproc_build);
    thread.Start(app);
}


The worker thread first grabs a lock so that it can build the index without being disturbed by other threads.   The other threads would be the result of users either searching or users updating text, triggering a modification to Lucene's index.    I don't want those threads to be dealing with a partially built index, so I make those threads wait for the one-and-only lock.

My way of handling multithreading was one of the things that I was nervous about. I feared some sort of hard-to-reproduce deadlock or race condition, but so far there have been no reports of any trouble from BugTracker.NET users, so my design appears to be solid.
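If the locking is hard to picture, here is a minimal sketch of the one-lock idea. It is not the BugTracker.NET code: the list "fake_index" is a hypothetical stand-in for the Lucene index, and the method names are made up for illustration. The only thing it shares with my real code is the pattern of a single static lock object that every thread has to grab before touching the index.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;

public static class OneLockDemo
{
    // The one-and-only lock. Every thread that touches the index grabs it first.
    static readonly object my_lock = new object();

    // Hypothetical stand-in for the Lucene index (an assumption for this sketch).
    static readonly List<string> fake_index = new List<string>();

    // Plays the role of the build thread: fills the "index" while holding the lock.
    static void BuildThreadProc(object obj)
    {
        int count = (int)obj;
        lock (my_lock)
        {
            fake_index.Clear();
            for (int i = 0; i < count; i++)
            {
                fake_index.Add("bug " + i);
            }
        }
    }

    // Plays the role of a search: it waits for the lock, so it never sees a half-built index.
    public static int CountEntries()
    {
        lock (my_lock)
        {
            return fake_index.Count;
        }
    }

    public static int BuildAndCount(int count)
    {
        Thread thread = new Thread(BuildThreadProc);
        thread.Start(count);
        thread.Join(); // only here to make the demo deterministic
        return CountEntries();
    }

    public static void Main()
    {
        Console.WriteLine(BuildAndCount(1000));
    }
}
```

The Join() call exists only so the demo gives the same answer every time. In the real application the build thread just runs in the background, and any search that arrives early waits at the lock statement until the build is finished.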

To create the index, I create a Lucene "IndexWriter".   I run a SQL query against my database to fetch the text I want to be able to search and the database keys that go with that text.   Then I loop through the query results adding a Lucene "Document" for each row.   Actually, in my real code, I get the searchable text from several different fields in my database, but in the snippet below I have simplified my harvesting of text from my database.



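// Shared by all threads, so both are static: the one-and-only lock and the analyzer.
static object my_lock = new object();
static Lucene.Net.Analysis.Standard.StandardAnalyzer analyzer = new Lucene.Net.Analysis.Standard.StandardAnalyzer();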

static void threadproc_build(object obj)
{
    lock (my_lock)
    {
        try
        {

            Lucene.Net.Index.IndexWriter writer = new Lucene.Net.Index.IndexWriter("c:\\folder_where_lucene_index_lives", analyzer , true);
           
            DataSet ds = btnet.DbUtil.get_dataset("select bug_id, bug_text from bugs");
            
            foreach (DataRow dr in ds.Tables[0].Rows)
            {
                writer.AddDocument(create_doc(
                    (int)dr["bug_id"],
                    (string)dr["bug_text"]));
            }
            
            writer.Optimize();
            writer.Close();
        }
        catch (Exception e)
        {
            btnet.Util.write_to_log("exception building Lucene index: " + e.Message);
        }
    }
}



2) The design of the Lucene Index

Here's where I create a Lucene "Document". An index contains a list of documents. A doc has fields that you define. My doc shown here has three fields. The first field, "text", is what Lucene will analyze and index: the searchable text. Because I hand it to Lucene as a reader, it is indexed but not stored. The second field is the key I will use to link the Lucene data to the rows in my database. Notice I tell Lucene that this key should be UN_TOKENIZED, stored as is. That's all you need for a minimal Lucene doc: a key for you and some text for Lucene to search on. The third field in my example is the text again, but this time UN_TOKENIZED, stored as is. I will use that stored text later, when I have Lucene build the highlighted snippets that show where the hits are on my results page. More on highlighting later.

One of the decisions you'll have to make when using Lucene is what text to index and how to package it for Lucene. In my database, the text doesn't just live in one field. A bug has a short text description, a list of comments, a list of incoming and outgoing emails, and even Digg-style tags. In my real code, as opposed to the snippets here, I fetch text from all these places. My real Lucene doc has four fields, the fourth being another database key that I can use to link to the specific comment or email where the search hit is. BugTracker.NET supports custom text fields, and in the future I hope to harvest that text from the database and add it to the Lucene doc.

So, if your app is like mine, with text in many different places, then you'll have a challenge like mine, how to package the text into a Lucene doc.
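Here is a rough sketch of one way to do that packaging, based on the four-field layout I just described. The classes Bug, Post and IndexRecord are hypothetical shapes invented for this example, not the real BugTracker.NET schema. The idea is to flatten a bug and its comments and emails into one record per future Lucene doc, with a post id of zero meaning "this text came from the bug itself".

```csharp
using System;
using System.Collections.Generic;

// Hypothetical shapes for illustration; the real schema is not shown in this post.
public class Post
{
    public int PostId;
    public string Text;
}

public class Bug
{
    public int BugId;
    public string ShortDescription;
    public List<Post> Posts = new List<Post>();
}

// One record per future Lucene doc: two database keys plus the searchable text.
public class IndexRecord
{
    public int BugId;
    public int PostId; // 0 means the text came from the bug itself
    public string Text;
}

public static class Packager
{
    // Flattens a bug and its posts into the records that would each become a Lucene doc.
    public static List<IndexRecord> Flatten(Bug bug)
    {
        List<IndexRecord> records = new List<IndexRecord>();
        records.Add(new IndexRecord { BugId = bug.BugId, PostId = 0, Text = bug.ShortDescription });
        foreach (Post post in bug.Posts)
        {
            records.Add(new IndexRecord { BugId = bug.BugId, PostId = post.PostId, Text = post.Text });
        }
        return records;
    }

    public static void Main()
    {
        Bug bug = new Bug { BugId = 42, ShortDescription = "Login page crashes" };
        bug.Posts.Add(new Post { PostId = 7, Text = "Happens only in IE" });
        foreach (IndexRecord r in Flatten(bug))
        {
            Console.WriteLine(r.BugId + " / " + r.PostId + " / " + r.Text);
        }
    }
}
```

With a layout like this, a search hit carries enough keys to send the user either to the bug or to the specific post inside it.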


static Lucene.Net.Documents.Document create_doc(int bug_id, string text)
{   
    Lucene.Net.Documents.Document doc = new Lucene.Net.Documents.Document();
    
    doc.Add(new Lucene.Net.Documents.Field(
        "text",
        new System.IO.StringReader(text)));
    
    doc.Add(new Lucene.Net.Documents.Field(
        "bug_id",
        Convert.ToString(bug_id),
        Lucene.Net.Documents.Field.Store.YES,
        Lucene.Net.Documents.Field.Index.UN_TOKENIZED));
    
    // For the highlighter, store the raw text
    doc.Add(new Lucene.Net.Documents.Field(
        "raw_text",
        text,
        Lucene.Net.Documents.Field.Store.YES,
        Lucene.Net.Documents.Field.Index.UN_TOKENIZED));

    return doc;
}



3) Updating the index

Whenever a user updates text in a bug I launch a worker thread to update the index.    The worker thread grabs a lock so that only one thread is updating the index at a time. 

The worker thread creates a Lucene "IndexModifier", deletes the old doc, and replaces it with a new one.

Notice that the thread closes the "searcher".   The searcher is a Lucene "Searcher".   The life cycle of a Searcher is that it first loads the index and then does its searches using that loaded, cached version of the index.   If the real index changes on disk, the searcher wouldn't know about it.   It would continue searching the out-of-date cached copy of the index in its memory.    That might be ok for your situation, and if your index is very big and the cost of creating a new searcher is high, you might be forced to use a searcher with a stale index.   BugTracker.NET databases tend to be small, so I can get away with making sure my searcher always has an up-to-date index to work with.

The official Lucene FAQ says that a Searcher (aka IndexSearcher) "is thread-safe. Multiple search threads may use the same instance of IndexSearcher concurrently without any problems. It is recommended to use only one IndexSearcher from all threads in order to save memory."



// Static, because the one searcher is shared by the update thread and the search code.
static Lucene.Net.Search.Searcher searcher = null;

static void threadproc_update(object obj)
{
    lock (my_lock) // If a thread is updating the index, no other thread should be doing anything with it.
    {
        
        try
        {
            if (searcher != null)
            {
                try
                {
                    searcher.Close();
                }
                catch (Exception e)
                {
                    btnet.Util.write_to_log("Exception closing lucene searcher:" + e.Message);
                }
                searcher = null;
            }
            
            Lucene.Net.Index.IndexModifier modifier = new Lucene.Net.Index.IndexModifier("c:\\folder_where_lucene_index_lives", analyzer, false);
            
            // Same as the build, but uses "modifier" instead of "writer"
            // and adds a "where" clause for the bug id.
            
            int bug_id = (int)obj;
            
            modifier.DeleteDocuments(new Lucene.Net.Index.Term("bug_id", Convert.ToString(bug_id)));
            
            DataSet ds = btnet.DbUtil.get_dataset("select bug_id, bug_text from bugs where bug_id = " + Convert.ToString(bug_id));
            
            foreach (DataRow dr in ds.Tables[0].Rows) // one row...
            {
                modifier.AddDocument(create_doc(
                    (int)dr["bug_id"],
                    (string)dr["bug_text"]));
            }
            
            modifier.Flush();
            modifier.Close();
            
        }
        catch (Exception e)
        {
            btnet.Util.write_to_log("exception updating Lucene index: " + e.Message);
        }
    }
}




4) Sending the search query to Lucene

To search, create a Lucene "QueryParser".    Call its Parse() method passing the text the user typed in.   The Parse() method returns a "Query".   Call the Searcher's Search() method passing the Query.   The Search() method returns a Lucene "Hits" object, a collection of the search hits.  
        
As I've mentioned, I want my searcher to always be using the most up-to-date index, so whenever I update the index, I destroy the old searcher and then recreate it the next time it's needed.

Since IIS is handling the HTTP requests with multiple threads, these searches are happening on multiple threads. Each search tries to grab my one-and-only lock, the one that keeps the updating threads from conflicting with each other and that keeps the updating threads from conflicting with searches. Because there is just this one-and-only lock, all the searches on the website have to line up in single file to get through this bottleneck. Sounds terrible, doesn't it? But so far, no reports of any problems. It's just a bug tracker, not Twitter, so I can get away with this design, and there's never any confusion about people doing searches with out-of-date indexes.
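To make the "destroy the searcher on update, recreate it on the next search" idea concrete, here is a small sketch that doesn't use Lucene at all. SnapshotSearcher is a hypothetical stand-in I made up for this example: like a real searcher, it only knows about the index as it was when the searcher was created. Everything else about it is an assumption for illustration.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for a Lucene searcher: it copies the index when it is
// created and never notices later changes.
public class SnapshotSearcher
{
    private readonly List<string> snapshot;

    public SnapshotSearcher(List<string> index)
    {
        snapshot = new List<string>(index);
    }

    public int CountHits(string word)
    {
        int n = 0;
        foreach (string s in snapshot)
        {
            if (s.Contains(word)) n++;
        }
        return n;
    }
}

public static class SearcherCache
{
    static readonly object my_lock = new object();
    static readonly List<string> index = new List<string>(); // stand-in for the index on disk
    static SnapshotSearcher searcher = null;

    // An update throws away the cached searcher so nobody searches stale data.
    public static void Update(string text)
    {
        lock (my_lock)
        {
            searcher = null;
            index.Add(text);
        }
    }

    // A search lazily recreates the searcher if an update destroyed it.
    public static int Search(string word)
    {
        lock (my_lock)
        {
            if (searcher == null)
            {
                searcher = new SnapshotSearcher(index);
            }
            return searcher.CountHits(word);
        }
    }

    public static void Main()
    {
        Update("crash on login");
        Console.WriteLine(Search("crash"));
        Update("crash on save");
        Console.WriteLine(Search("crash"));
    }
}
```

If Update() did not set the searcher to null, the second search would still be reading the old snapshot and would miss the new text. That is the stale-index problem I am avoiding, at the cost of recreating the searcher after every update.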

   
Lucene.Net.QueryParsers.QueryParser parser = new Lucene.Net.QueryParsers.QueryParser("text", analyzer );
Lucene.Net.Search.Query query = null;

try
{
    if (string.IsNullOrEmpty(text_user_entered))
    {
        throw new Exception("You forgot to enter something to search for...");
    }
    
    query = parser.Parse(text_user_entered);
    
}
catch (Exception e)
{
    display_exception(e);
}


lock (my_lock)
{
    
    Lucene.Net.Search.Hits hits = null;
    try
    {
        if (searcher == null)
        {
            searcher = new Lucene.Net.Search.IndexSearcher("c:\\folder_where_lucene_index_lives");
        }

        hits = searcher.Search(query);

    }
    catch (Exception e)
    {
        display_exception(e);
    }
    
    // hits is still null if the search above threw an exception
    for (int i = 0; hits != null && i < hits.Length(); i++)
    {
        Lucene.Net.Documents.Document doc = hits.Doc(i);
        // ... more processing of the hits and the Lucene docs here ...
    }
}



5) Displaying the results

If you didn't like my design prior to this point, what with the locking and the bottleneck, then you are going to really hate it now, because it gets weird now. The search results I get back from Lucene are in the form of a Hits object, a collection of hits that you access by index. The collection is ordered by relevance score, best first, and you can get each score using the Hits.Score() method. You can also get at the Lucene Document related to the hit via the Hits.Doc() method.

Now, back when I was designing my Lucene Document, I had to be thinking ahead regarding how I would display the results. Would I display the results based purely on what's in the document? If so, then I would have had to add fields to the doc for everything I wanted to eventually be displaying, not just the fields I needed for search. The more fields I put in the doc, the more I would have to be updating the doc and the index to keep it in sync with my database, and the more I would be duplicating database data in the Lucene index. So, there was a downside to relying strictly on the Lucene doc for my display.

Also, and for me more importantly, I already have a page in my app that knows how to display a list of bugs based on the result of a SQL query. I didn't want to have to adapt that page to work with a Lucene Hits object. I wanted to somehow convert the Lucene results into the format expected by that existing page.

So, I decided to try importing the Hits into the database, then letting my existing page fetch the hits out of the database, joining the hits to my bugs table to pick up the fields that I had not bothered to duplicate in the Lucene doc as fields.

The code below shows how I imported the Lucene hits into the database.    In short, I create a big batch of SQL Statements and execute them in one trip to the server.    The batch of SQL Statements creates a temporary table with a unique name plus a bunch of insert statements, one for every Lucene hit I want to import and display.    I import the best 100 hits, which is more than enough.   Lucene can find multiple hits in the same document, but I only want to list a given bug once in the search results, so I have logic for that below, the dict_already_seen_ids.

You will probably want to show your users the text around where the hit is, with the searched-for words highlighted, displayed in their context. Lucene can prepare that displayable snippet of text for you. You have to create a bunch of Lucene objects: a Formatter, a SimpleFragmenter, a QueryScorer, a Highlighter, etc., as my code below does. I specified a fragment size of 400 characters and I specified the highlighting to be done using this HTML:  <span style='background:yellow;'></span>. I feed the highlighter the original Query and the raw text that I had saved in the doc. Lucene then gives me the formatted, highlighted snippets, which I insert into my temporary database table.

You might think that the import of the Lucene hits into the database would perform poorly, but actually, it's fast.    Had this not worked, then my plan B would have been to create a more complete Lucene Doc, and then somehow programmatically synthesize an ADO.NET recordset for my page downstream that displays results.



Lucene.Net.Highlight.Formatter formatter = new Lucene.Net.Highlight.SimpleHTMLFormatter(
    "<span style='background:yellow;'>",
    "</span>");

Lucene.Net.Highlight.SimpleFragmenter fragmenter = new Lucene.Net.Highlight.SimpleFragmenter(400);
Lucene.Net.Highlight.QueryScorer scorer = new Lucene.Net.Highlight.QueryScorer(query);
Lucene.Net.Highlight.Highlighter highlighter = new Lucene.Net.Highlight.Highlighter(formatter, scorer);
highlighter.SetTextFragmenter(fragmenter);

StringBuilder sb = new StringBuilder();
string guid = Guid.NewGuid().ToString().Replace("-", "");
Dictionary<string, int> dict_already_seen_ids = new Dictionary<string, int>();
// $GUID is a placeholder for the unique temp table name used by the inserts below
sb.Append(@"
    create table #$GUID
    (
        temp_bg_id int,
        temp_score float,
        temp_text nvarchar(3000)
    )
".Replace("$GUID", guid));


// insert the search results into a temp table which we will join with what's in the database
for (int i = 0; i < hits.Length(); i++)
{
    if (dict_already_seen_ids.Count < 100)
    {
        Lucene.Net.Documents.Document doc = hits.Doc(i);
        string bg_id = doc.Get("bug_id"); // the key field name used in create_doc
        if (!dict_already_seen_ids.ContainsKey(bg_id))
        {
            dict_already_seen_ids[bg_id] = 1;
            sb.Append("insert into #");
            sb.Append(guid);
            sb.Append(" values(");
            sb.Append(bg_id);
            sb.Append(",");
            // Convert.ToString() formats the score using the server's current culture, which
            // in some locales uses "," as the decimal separator. SQL Server needs ".".
            sb.Append(Convert.ToString((hits.Score(i))).Replace(",", "."));
            sb.Append(",N'");
            
            string raw_text = Server.HtmlEncode(doc.Get("raw_text"));


            Lucene.Net.Analysis.TokenStream stream = analyzer.TokenStream("", new System.IO.StringReader(raw_text));

            string highlighted_text = highlighter.GetBestFragments(stream, raw_text, 1, "...").Replace("'", "''");


            if (highlighted_text == "") // sometimes the highlighter fails to emit text...
            {
                highlighted_text = raw_text.Replace("'","''");
            }
            if (highlighted_text.Length > 3000)
            {
                highlighted_text = highlighted_text.Substring(0,3000);
            }
            sb.Append(highlighted_text);
            sb.Append("'");
            sb.Append(")\n");
        }
    }
    else
    {
        break;
    }
}
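A side note on the score formatting in the code above. Replacing "," with "." works, but as far as I can tell a tidier option is to ask .NET for culture-independent formatting in the first place. The sketch below is not what BugTracker.NET does; the class and method names are hypothetical, and it only shows the two string conversions the batch needs: the score as a SQL numeric literal and the snippet as a quoted N'...' literal.

```csharp
using System;
using System.Globalization;

public static class SqlLiterals
{
    // Formats a score with "." as the decimal separator, whatever the server's culture is.
    public static string FormatScore(float score)
    {
        return score.ToString(CultureInfo.InvariantCulture);
    }

    // Doubles any single quotes and wraps the text as a Unicode string literal,
    // the same thing the code above does by hand with Replace("'", "''").
    public static string QuoteText(string text)
    {
        return "N'" + text.Replace("'", "''") + "'";
    }

    public static void Main()
    {
        Console.WriteLine(FormatScore(0.5f));
        Console.WriteLine(QuoteText("can't log in"));
    }
}
```

Keeping the two conversions in small helpers like these also puts the quote-doubling in one place, which matters because the batch is built by string concatenation.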


We're done.  I'd be very interested in your feedback.   Was my explanation here helpful to you?   Were my design choices stupid?   I'd like to hear from you.


 


Comments

2/22/2009 2:45:52 PM #

Corey, this is great. I'm considering doing something similar, and being able to read about your experience has been extremely helpful. Thanks for taking the time to put this together!

Richard |

3/11/2009 7:36:50 PM #

Hmm, interesting post. I'm also planning to do something similar with Python.
I'm interested in how you managed the fields (i.e., the schema) in your real code.
As you mentioned, a bug usually has a list of comments, and a comment could also mention another bug. How did you tell Lucene that a bug mentioned in a comment has a bigger weight?

tpeng |

3/15/2009 6:11:47 AM #

@tpeng - What I did... A bug has a list of posts (comments, emails, etc). My Lucene doc has bugid, postid, text. Postid is zero for the bug itself, non-zero for posts. When I display search results, I only display one result for a given bugid. If that one result is for a post, then I display that. That's good enough to get the user to the right place. I don't add any weight to anything - I just let Lucene do its thing. It seems ok.

Corey Trager |

3/20/2009 1:29:13 AM #

there were a few kinks wherein a habitual programmer would be able to follow what you did and troubleshoot. but nevertheless, good design especially with Lucene. i'm sure it's easier to follow the code with a more experienced programmer.

guest |

3/26/2009 4:07:44 PM #

Very insightful. I am currently integrating Lucene.NET within an ASP.NET MVC project.
Lucene.net to LINQ is a great project but it does not yet meet the full Lucene functionality.

However I am new to threads and would like clarification on the "my_lock" variable used in your code.
Can I create a ReaderWriterLock and apply it to the writer?
Do I need to manage the thread from within the Global.asax or can I manage it with a Repository class?




xman |

3/26/2009 7:42:03 PM #

@xman - "my_lock" is just:

object my_lock = new object();

Just a dummy .NET object.

I don't know what a ReaderWriterLock is.
I don't know what a Repository class is.

More on my_lock:

lock(my_lock)  <<<  grab the lock and execute the code in the block.  Another thread that tries to grab the lock will have to wait for me to release the lock, at the end of the block.
{
   block of code
} <<< lock gets released here

Corey Trager |

4/29/2009 6:35:45 PM #

All locking(while reading or writing) is handled by Lucene internally. You
don't have to use any other locking mechanism.

mail-archives.apache.org/.../%3C002401c9a8b5$254ed4c0$6fec7e40$@com%3E

arachnode.net |

4/29/2009 6:36:18 PM #

Let's try that again...

"mail-archives.apache.org/.../%3C002401c9a8b5$254ed4c0$6fec7e40$@com%3E"

arachnode.net |

4/29/2009 6:47:50 PM #

@arachnode - Sorry, I can't remember the specific details, but I'm sure that I encountered something that caused me to add the locking logic.  It may have been something like...  trying to do a search when the index was being updated.  Nothing bad happened to the index.  Nothing got corrupted.   So in that sense, Lucene handled the threading.  But my search failed because the writer was busy with the index.  I didn't want my search to fail.  I wanted it to wait.

Like I said, I don't remember the details, but they may have been something like that.

Corey Trager |

6/12/2009 4:27:21 PM #

Could you please explain where you put this code and how you use it?
On the web application I am developing I have a page where the user adds/removes entries from the database. The entries contain the information about a document (like the author, for example) and a field to store the name of the file they are associated with. The associated file is added to the index when the entry is created, and removed from the index when the entry is deleted. So, on that page I have written functions to add and remove the files from the index.
I know I am misunderstanding something... It works but I don't think I'm doing it the right way.
Is all this code supposed to be put in the Global.asax file? And if it is, how are you calling it on the appropriate pages? I hope this makes sense, I'm having a hard time understanding how this is used, and apparently I am the only one ;)

Nathan Fast |

6/13/2009 12:44:06 AM #

I read a post on Stack Overflow where you replied about the use of Lucene, and said that you had to make some changes in order to make it work in MEDIUM TRUST.
Could you please post some of the changes that are required to make Lucene.NET run in MEDIUM TRUST?
Did you submit the changes to the Lucene team?

Thanks in advance

Luis Ramriez |

6/19/2009 3:46:49 AM #

The Lucene.NET issue is documented here:
http://issues.apache.org/jira/browse/LUCENENET-169

Corey Trager |

6/19/2009 3:50:03 AM #

@Nathan - The creation of the index is in Global.asax.   I update the index when I change data in the database that is indexed, i.e., when somebody updates a bug.  

Corey Trager |

6/23/2009 8:48:50 AM #

hi Corey Trager ,

where can i download Highlighter.dll ?

chuzon |

