Articles tagged with search

Cucumber scenarios that depend on Sphinx

I love writing apps that make heavy use of search indexes, but testing them can be a bit of a pain. Here is how I got ThinkingSphinx to play nice with Cucumber.

Here is the relevant part of what I put in features/support/env.rb:

# Cucumber::Rails.use_transactional_fixtures

# http://github.com/bmabey/database_cleaner
require 'database_cleaner'
DatabaseCleaner.strategy = :truncation
Before do
  DatabaseCleaner.clean
end

ts = ThinkingSphinx::Configuration.instance
ts.build
FileUtils.mkdir_p ts.searchd_file_path
ts.controller.index
ts.controller.start
at_exit do
  ts.controller.stop
end
ThinkingSphinx.deltas_enabled = true
ThinkingSphinx.updates_enabled = true
ThinkingSphinx.suppress_delta_output = true

# Re-generate the index before each Scenario
Before do
  ts.controller.index
end

What’s going on here?

Start by commenting out the line about using transactional fixtures in env.rb. Using transactional fixtures will run each scenario inside of a transaction and roll it back at the end of the scenario to revert the database state. Thinking Sphinx uses an after_commit callback for kicking off the delta indexing, but the callback never gets run when transactional fixtures are enabled because the entire scenario is run inside of a big transaction.
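To see why the callback never fires, here is a toy sketch in plain Ruby. It is not ActiveRecord or Thinking Sphinx code; ToyTransaction and everything in it are hypothetical names, made up only to illustrate that work queued for "after commit" is thrown away when the surrounding transaction is rolled back.

```ruby
# Toy model of a transaction with after_commit callbacks.
# Hypothetical class, NOT the ActiveRecord API.
class ToyTransaction
  def initialize
    @callbacks = []
  end

  # Queue work that should only happen once the transaction commits.
  def after_commit(&block)
    @callbacks << block
  end

  # Committing runs every queued callback.
  def commit
    @callbacks.each(&:call)
    @callbacks.clear
  end

  # Rolling back discards the callbacks without running them.
  def rollback
    @callbacks.clear
  end
end

delta_indexed = []

# What transactional fixtures do: the whole scenario is rolled back.
scenario = ToyTransaction.new
scenario.after_commit { delta_indexed << "Lots of Boxes" }
scenario.rollback
after_rollback = delta_indexed.dup

# Without transactional fixtures the save really commits.
standalone = ToyTransaction.new
standalone.after_commit { delta_indexed << "Lots of Boxes" }
standalone.commit
after_commit = delta_indexed.dup
```

In the rolled-back case nothing ever reaches the (pretend) delta index, which is the behaviour described above.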

Once we’ve disabled transactional fixtures, our test database will start to fill up, likely causing some problems. So we need to add a Before block that clears out the database before each scenario. I’m using database_cleaner, which gives you some different strategies for cleaning the database. Alternatively, the brute-force solution is just to reload the schema before each scenario, but this is slower than truncating the data.

Before do
  ActiveRecord::Base.establish_connection(ActiveRecord::Base.configurations['test'])
  ActiveRecord::Schema.verbose = false
  load "#{RAILS_ROOT}/db/schema.rb"
end

Next, we start Sphinx when env.rb is loaded, and shut it down when the Ruby process exits. We also enable deltas and updates, which are disabled by default in test mode. Finally, we define a Before block that updates the index before each scenario so we don’t end up with a stale index.

Putting it all together

I’m using Sphinx’s delayed delta support, so whenever I update records, I need to have delayed_job process jobs. Instead of trying to get delayed_job to run in the background, I took the easy way out and defined a step: “When the system processes jobs”.

Scenario: Posting a new listing
  Given I am logged in as "MovinMan" 
  When I create a new listing titled "Lots of Boxes" near "49423" 
  And the system processes jobs
  And I browse listings near "49423" 
  Then I can see a listing titled "Lots of Boxes" 

Which is just implemented as:

When 'the system processes jobs' do
  Delayed::Job.work_off
end
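If it helps to picture why that explicit step is needed, here is a toy sketch of a job queue in plain Ruby. ToyJobQueue is a hypothetical stand-in, not the delayed_job API; the only point is that queued index updates do nothing until something works the queue off.

```ruby
# Toy stand-in for a background job table.
# Hypothetical class, NOT Delayed::Job.
class ToyJobQueue
  def initialize
    @jobs = []
  end

  # Remember a block of work for later.
  def enqueue(&job)
    @jobs << job
  end

  # Run every queued job in order and return how many ran.
  def work_off
    count = 0
    until @jobs.empty?
      @jobs.shift.call
      count += 1
    end
    count
  end
end

queue = ToyJobQueue.new
search_index = []

# Creating a listing only queues the index update.
queue.enqueue { search_index << "Lots of Boxes" }
visible_before = search_index.include?("Lots of Boxes")

# The "system processes jobs" step drains the queue synchronously.
jobs_run = queue.work_off
visible_after = search_index.include?("Lots of Boxes")
```

Here the listing is not searchable until the queue has been worked off, which is why the scenario processes jobs before browsing.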

If you’re just using the default deltas, and not delayed deltas, then you can update the index like this:

When /^the system updates the index$/ do
  MyModel.sphinx_indexes.first.delta_object.index(MyModel)
end

I hope that helps. Post your suggestions in the comments for improving this.

Code: search Jun 01, 2009 ● updated Jun 25, 2009 13 comments

Keepin' Sphinx Indexes Fresh

<infomercial-voice>Stale indexes got you down? Embarrassed that your users’ searches are coming up empty? Act now and you can avoid stale indexes with NEW and IMPROVED delayed delta indexing!!</infomercial-voice>

Ok, maybe it’s not new and improved. It’s actually been around since January, but it’s still awesome. Thinking Sphinx can use delayed_job to keep indexes fresh.

I was slow at jumping on the Sphinx bandwagon for one reason: the index has to be rebuilt to incorporate new data. Delta indexing alleviated some of this by storing frequent changes in a small separate index, but it still had to be occasionally reindexed. It also seemed to only index existing records in my trials with it. New records didn’t ever seem to show up until I rebuilt the whole index.
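For anyone new to the idea, here is a toy sketch of how a core index plus a delta index fit together. This is plain Ruby with hypothetical names (ToyIndex and its methods), not Sphinx's actual data structures: changes land in a small delta, searches look at both, and a full rebuild folds the delta back into the core.

```ruby
# Toy model of core + delta indexing.
# Hypothetical class, NOT how Sphinx stores anything.
class ToyIndex
  def initialize
    @core = {}
    @delta = {}
  end

  # New or changed records go into the small delta index.
  def record_change(id, text)
    @delta[id] = text
  end

  # A full rebuild folds the delta into the core and empties it.
  def rebuild
    @core.merge!(@delta)
    @delta.clear
  end

  # Search both indexes; a delta entry wins over a stale core entry.
  def search(word)
    combined = @core.merge(@delta)
    combined.select { |_id, text| text.downcase.include?(word.downcase) }.keys.sort
  end

  def delta_size
    @delta.size
  end
end

index = ToyIndex.new
index.record_change(1, "Reel mower")
index.rebuild

index.record_change(2, "Box of LPs")
index.record_change(1, "Reel mower, barely used")

hits = index.search("mower")
delta_before = index.delta_size
index.rebuild
delta_after = index.delta_size
```

The delta keeps growing until the next rebuild, which is why the full index still has to be rebuilt occasionally.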

From what I can tell, delayed delta indexing makes everything Just Work™, and here’s how to use it…

After you’ve set up ThinkingSphinx, set the :delta property to :delayed in your index:

class Listing < ActiveRecord::Base
  define_index do
    indexes title
    indexes description
    indexes user.login, :as => :user

    set_property :delta => :delayed
  end
end

The delayed delta support depends on delayed_job, but if you’re using the gem version, it’s already bundled in. I’m using delayed_job for some other things in my project, so I still have it installed separately and that seems to be working just fine.

delayed_job uses a table to keep track of the jobs that need to be run, so create a migration containing:

create_table :delayed_jobs, :force => true do |table| 
   table.integer  :priority, :default => 0 
   table.integer  :attempts, :default => 0 
   table.text     :handler 
   table.string   :last_error 
   table.datetime :run_at 
   table.datetime :locked_at 
   table.datetime :failed_at 
   table.string   :locked_by 
   table.timestamps 
end

And lastly, all you need to do is fire up the worker process:

$ rake thinking_sphinx:delayed_delta

Now whenever changes are made to your models, the index will be updated moments later. And that’s how you keep it fresh!

Code: search Apr 29, 2009 ● updated Apr 29, 2009 2 comments

Location-based search with Sphinx and acts_as_geocodable

Sphinx, everybody’s favorite search engine, has support for location-based search, giving you geo-aware, full-text searching. So now you can find all of the garage sales on Saturday within 20 miles that have LPs and a reel mower.

All you need to do is add latitude and longitude (in radians) to the index, allowing you to limit the results to records within a distance of a point. The hardest part is getting the coordinates, but acts_as_geocodable makes that really easy.

To start, install acts_as_geocodable. Once you have that configured properly, install ThinkingSphinx, define an index on your geocodable model and add the coordinates to the index:

class Listing < ActiveRecord::Base
  acts_as_geocodable

  define_index do
    indexes title
    indexes description

    has geocoding.geocode(:id), :as => :geocode_id
    has 'RADIANS(geocodes.latitude)', :as => :latitude, :type => :float
    has 'RADIANS(geocodes.longitude)', :as => :longitude, :type => :float
  end
end

The three lines that start with has add the geocode id, and the latitude and longitude in radians to the index. Our index doesn’t need the geocode id, but we have to add it so that ThinkingSphinx properly joins the geocodes table.

After you rebuild the index and start the daemon, you can search for records by location. Here’s an example of taking a zip code from a user and finding all records within 20 miles. (Note: you will need to grab the latest version, 0.2.9, of Graticule for this next bit of code to work.)

def search
  location = Geocode.geocoder.locate(params[:zip]).coordinates.map(&:to_radians)
  @listings = Listing.search(params[:q], :geo => location,
    :with => {'@geodist' => 0.0..(20 * 1610.0)})
end

After looking up the coordinates of the zip code that the user entered, we do a search with the :geo parameter, limiting the results using the special @geodist condition. We have to pass in a range of floats that represent the distance of the points in meters (and since the US is in the stone age, I’m converting from miles).
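The unit conversions are easy to get wrong, so here is a small standalone sketch of them in plain Ruby. Nothing here comes from Graticule or ThinkingSphinx: the helper names, the two example points, and the mean Earth radius are all assumptions for illustration. A mile is 1609.344 meters, so the 1610.0 above is a rounded form of that.

```ruby
METERS_PER_MILE = 1609.344
EARTH_RADIUS_METERS = 6_371_000.0 # mean radius, an approximation

# Sphinx wants coordinates in radians, not degrees.
def to_radians(degrees)
  degrees * Math::PI / 180.0
end

def miles_to_meters(miles)
  miles * METERS_PER_MILE
end

# Great-circle distance in meters between two [lat, lng] pairs
# given in radians, using the haversine formula.
def distance_in_meters(a, b)
  lat1, lng1 = a
  lat2, lng2 = b
  h = Math.sin((lat2 - lat1) / 2)**2 +
      Math.cos(lat1) * Math.cos(lat2) * Math.sin((lng2 - lng1) / 2)**2
  2 * EARTH_RADIUS_METERS * Math.asin(Math.sqrt(h))
end

# Two arbitrary example points, in degrees, converted to radians.
point_a = [42.7875, -86.1089].map { |deg| to_radians(deg) }
point_b = [42.9634, -85.6681].map { |deg| to_radians(deg) }

radius = 0.0..miles_to_meters(20)
distance = distance_in_meters(point_a, point_b)
within_radius = radius.cover?(distance)
```

The range built here has the same shape as the one passed to the @geodist condition: a span of floats measured in meters.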

That’s all there is to it. Now go write some cool location-based search and comment here about it.

Code: search Apr 14, 2009 ● updated Apr 14, 2009 3 comments

It's a search party!

While chatting with Dr. John Nunemaker at RubyConf, I realized that I have several problems. Ignoring the many character flaws that are beyond the scope of this post, my problems are:

  1. I tag things I find useful on Delicious. But I rarely look back to Delicious because it’s just easier to search Google, resorting to Delicious if I can’t find something that I remember tagging.
  2. It’s easier to search Google because not everything that I find useful is in Delicious. I don’t want to have to think about where useful things are, I just want to search for them.

I want a search engine that prioritizes things that I’ve found useful in the past. Ideally, Google would let me tag things and take that into account when calculating page rank. But, in the meantime…

Let’s have a search party!

I threw together a simple search interface that pulls in results from Google, Delicious, and just for fun, GitHub. It shows primarily Google results, but then in the sidebar, it shows results from Delicious, with my bookmarks highlighted at the top, and also results from GitHub. Check it out.

And if you use Firefox, you can add it as your search provider:

It is an extremely simple app that has 2 pages and uses JavaScript to read in JSON from all of the services. There is one Ruby class that does screen-scraping since Delicious doesn’t provide an API to their full search.

I first implemented it in Merb, but upon realizing how simplistic it is, I switched it over to Sinatra. And I used jQuery to do the JSON and other JavaScripty goodness. Thanks to Mark Van Holstyn for helping implement it.

The code is available on GitHub, so check it out, fork it, and make it more awesome.

I will be making a few other posts about specific things I learned when building this, including deploying Sinatra apps and using jQuery to do JSON with Merb and Sinatra.

Code: search Nov 08, 2008 ● updated Nov 09, 2008 2 comments

Using shared indexes with acts_as_ferret

So by now we all know how to do wicked-cool search with acts_as_ferret. (If not, the RailsEnvy guys can lend a hand, but the tutorial is a little outdated for the latest trunk version of acts_as_ferret. Just replace #find_by_contents with #find_with_ferret and you should be good.)

But searching a single model is so last year. All the cool kids are getting promiscuous with their searches and involving multiple models. Fortunately for us, recent revisions of acts_as_ferret make this easy-peasy.

The key to making this happen is a shared index: we want all of our models indexed in one place so ferret only has to do one search. We can do it without a shared index, but then we have to do a ferret search and thus a SQL select for each model. Plus, if we want our search results interspersed and sorted by rank, we have to have a shared index.

Enough chit-chat, show us how!

Ok, I’m getting to it. Grab the latest version of acts_as_ferret from trunk:

script/plugin install svn://projects.jkraemer.net/acts_as_ferret/trunk/plugin/acts_as_ferret

Now, instead of defining acts_as_ferret in our models, we define them all in config/aaf.rb:

ActsAsFerret::define_index('shared',
 :models => {
   Person  => {:fields => [:first_name, :last_name, :phone, :bio]},
   Company => {:fields => [:name, :description]},
   Post    => {:fields => [:title, :body]}
 },
 :ferret   => {
   :default_fields => [:first_name, :last_name, :phone, :bio, :name, :description, :title, :body]
 }
)

This defines a new index, called “shared”, and then defines the acts_as_ferret configuration for each model.

Now for the fun part: searching our shiny new index.

def search
  @results = ActsAsFerret.find(params[:q], 'shared')
end

This will give us one array with any models that matched the search query, ordered by rank. And for those times when we only want to search one model, we can still do that.

def search
  @people = Person.find_with_ferret(params[:q])
end

How do we display the results?

Update: Sorry, originally I had an example using resources, but that doesn’t work as-is; I was doing something a little different in the app that this example came from.

To display our search results, we just render a partial for each model in the result:

<% @results.each do |result| %>
  <%= render :partial => "search/#{dom_class(result)}", :object => result %>
<% end %>

This will just look for a partial for each model (like search/_person.html.erb).
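To make the naming concrete, here is a rough plain-Ruby approximation of how a partial path falls out of a result's class. It does not call Rails' dom_class; the underscore and search_partial_for helpers and the two Struct-based models are hypothetical stand-ins for illustration only.

```ruby
# Simplified version of turning "BlogPost" into "blog_post".
# An approximation, not the Rails implementation.
def underscore(name)
  name.gsub(/([a-z\d])([A-Z])/, '\1_\2').downcase
end

# Build the partial path for a search result, e.g. "search/person".
def search_partial_for(record)
  "search/#{underscore(record.class.name)}"
end

# Stand-in models so the sketch runs without Rails.
Person = Struct.new(:first_name)
BlogPost = Struct.new(:title)

results = [Person.new("John"), BlogPost.new("Hello")]
paths = results.map { |result| search_partial_for(result) }
```

Each path maps to a file with a leading underscore, so "search/blog_post" would be rendered from search/_blog_post.html.erb.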

So there you have it. Now you too can have promiscuous searching.

Update: I’ve put together an example rails app that uses the shared index. It uses sqlite and has some data pre-populated. Start up script/server and do a search for “John”.

Code: search Apr 28, 2008 ● updated Oct 14, 2008 37 comments

Hack for partial matches in Ferret

I love ferret (the ruby port of Lucene, not the fuzzy little creatures, you sicko). But something I fight on every project is that ferret turns into a bear when you try to get it to do partial matches, like "ferr" matching "ferret" and "ferrari".

Ferret allows you to append an asterisk to your search query ("ferr*"), which works great, but we can’t expect our users to do that because damn Google[1] has set the expectation that search just works; I don’t need to use any funky syntax to find my pogs, Harry Potter gossip or BRATZ[2].

So, we can do this manually in code by appending an asterisk to anything users enter and problem solved, right? Not quite.

  • It breaks if you’re using the StemFilter, which allows you to match variations of words ("happy" would match "happiness" and "happiest")
  • It will only match partials on the last word that the user entered ("Ed Brad" won’t find "Edward Bradley")
  • Apparently the asterisk tells ferret that there have to be more characters, because full matches no longer work ("ferret*" won’t match "ferret")

So, here’s my hack.

Book.find_by_contents "(#{term})^2 OR (#{term.split.map {|t| t + "*" }.join(' ')})"

This ugly little thing will match exactly what the user entered (making use of stemming and all the magic that comes from it) and give it a little boost in the ranking, or match any part of any of the words entered, giving me partial matches.
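Pulling the string building out into its own method makes it easier to see what actually gets sent to ferret. This is only a sketch: partial_match_query is a hypothetical name, and the whitespace stripping and blank-input guard are small additions that are not in the one-liner above.

```ruby
# Build "(exact phrase)^2 OR (each* word*)" from what the user typed.
# Hypothetical helper, not part of ferret or acts_as_ferret.
def partial_match_query(term)
  words = term.split
  return "" if words.empty? # nothing to search for

  # Append an asterisk to every word so each one can match partially.
  wildcards = words.map { |word| "#{word}*" }.join(" ")

  # Boost the exact (stemmed) match, fall back to the wildcard match.
  "(#{term.strip})^2 OR (#{wildcards})"
end

query = partial_match_query("Ed Brad")
blank = partial_match_query("   ")
```

For the input "Ed Brad" this builds the string (Ed Brad)^2 OR (Ed* Brad*), which could then be handed to find_by_contents. It still does nothing to escape query syntax that a user types in.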

I acknowledge that this is an ugly hack at the moment, and will break miserably if the user is any kind of a wizard that knows how to do advanced searches, but it works for now. I have no idea what kind of consequences this will have as far as search performance and such. The goal is to wrap this into a filter.

Anyone else have any clever ideas for doing partial matches?

  1. Yes, Google, we love and hate you for raising the bar.
  2. We’re talking normal users here, which excludes anyone that is reading this.
Code: search Dec 12, 2007 ● updated Dec 12, 2007 4 comments

acts_as_ferret will_paginate

Update: This is not needed with recent versions of acts_as_ferret.

Here’s a little nugget to add to acts_as_ferret to make your searches paginate with will_paginate.

module ActsAsFerret
  module ClassMethods
    def paginate_search(query, options = {})
      page, per_page, total = wp_parse_options(options)
      pager = WillPaginate::Collection.new(page, per_page, total)
      options.merge!(:offset => pager.offset, :limit => per_page)
      result = find_by_contents(query, options)
      returning WillPaginate::Collection.new(page, per_page, result.total_hits) do |pager|
        pager.replace result
      end
    end
  end
end

Updated from Behrang’s comment based on changes to will_paginate.

There was a slight challenge in that will_paginate expects that you do one query to get the count, create a new collection object based on that count, and then perform the actual search. But acts_as_ferret does it all in one method call, so I have to create a temporary collection object to get the offset, then do the search and create the collection object. It’s a little messier than it needs to be, but it works.

Product.paginate_search params[:q], :page => params[:page]
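
The temporary collection is only there for its arithmetic, so here is that arithmetic on its own as a plain-Ruby sketch. page_window is a hypothetical helper, not part of will_paginate or acts_as_ferret; it just shows how a page number and page size turn into the offset and limit that get passed to the search.

```ruby
# Work out the search window for a given page.
# Hypothetical helper, NOT the will_paginate API.
def page_window(page, per_page, total_hits)
  page = [page.to_i, 1].max # treat nil or 0 as the first page
  offset = (page - 1) * per_page
  total_pages = (total_hits.to_f / per_page).ceil
  { :offset => offset, :limit => per_page, :total_pages => total_pages }
end

window = page_window(3, 30, 75)
first_page = page_window(nil, 30, 75)
```

With 75 hits at 30 per page, page 3 starts at offset 60 and is the last of 3 pages. The real total is only known after the search has run, which is why the code above builds the collection twice.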
Code: search Aug 17, 2007 ● updated May 07, 2009 49 comments
