11,000 html pages searchable
- 17 Responses
- Hue
Hi, I have just built a wordpress site for a client who is a local museum, and now they would like to include the museum collection in the site. Currently the collection is 11,000 html pages and they want to make it searchable. Originally I thought of importing it into wordpress, but I think this would be too much for wordpress to handle. As it is html I thought of using google custom search instead. But wondered if anyone else had any ideas on what to use?
- mikotondria30
I'd do the google option - the 11k pages are online, aren't they ?
- Hue0
They will be online, just in folders at the moment. Only thought is that Google search comes with adverts, and not sure they will be keen on that. The paid option isn't an option for the museum at the moment.
- ETM0
http://www.freefind.com. The paid accounts for non-profits are super reasonable. Their site looks like it's from 1997, but the search tech isn't bad, including defining sub-sections and indexing PDFs and Word docs.
- Hue0
cool, will have a look at the links, also found this: http://codecanyon.net/item/stati… which I was wondering whether you guys would think would be suitable
- utopian0
what is the wordpress link to your museum site?
- Hue0
It's not live yet, I'm afraid, as I'm hoping to put it live with the collection. Do you think the codecanyon script would handle it all?
- acescence0
I used google search without ads, not sure how I managed it..
http://www.ps1.org/search/
wordpress is mysql based, mysql can handle 11k pages no problem.
that script you linked basically uses the filesystem and php to mimic an indexed database, a db would probably be faster.
- yes..but possibly slow with more than 5 or 10 searches at the same time...vaxorcist
- Hue0
Maybe I will give google a go then. Any tips on setting this up, or is it pretty straightforward? I started importing the HTML into wordpress, but with so many posts it seemed to slow down considerably. Are there any other simple methods to import the HTML table into mysql, in order to make a search?
- acescence0
apparently the api i used for search has been replaced. this is the most current: http://code.google.com/apis/cust…
the easiest way to get all of the pages into a database is to write a script that inserts them all in one go in a loop, or batches of a few hundred or whatever
this is a simple wordpress script that will insert a post, you could work this into a loop that opens each page and inserts it in the db..
http://pastebin.com/vQDuV5V9
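For anyone who wants to try the batch idea outside WordPress first, here is a minimal sketch in Python. It is not the pastebin script: it assumes the collection is a folder tree of `*.html` files, and it uses SQLite from the standard library as a stand-in for MySQL so it runs without a server. The `pages` table, the `import_pages` function, and the batch size of 500 are all made up for illustration.

```python
import sqlite3
from html.parser import HTMLParser
from pathlib import Path


class TextExtractor(HTMLParser):
    """Collects the <title> and the visible text of one HTML page."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.chunks = []
        self._in_title = False
        self._skip = 0  # depth inside <script>/<style>, whose text is ignored

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        elif not self._skip and data.strip():
            self.chunks.append(data.strip())


def parse_page(html):
    """Return (title, plain text) for one HTML document."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.title.strip(), " ".join(parser.chunks)


def import_pages(root, conn, batch_size=500):
    """Insert every *.html file under root, committing once per batch."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS pages "
        "(path TEXT PRIMARY KEY, title TEXT, body TEXT)"
    )
    insert = "INSERT OR REPLACE INTO pages VALUES (?, ?, ?)"
    batch, total = [], 0
    for path in sorted(Path(root).rglob("*.html")):
        html = path.read_text(encoding="utf-8", errors="replace")
        title, body = parse_page(html)
        batch.append((str(path), title, body))
        if len(batch) >= batch_size:
            conn.executemany(insert, batch)
            conn.commit()
            total += len(batch)
            batch = []
    if batch:  # flush the final partial batch
        conn.executemany(insert, batch)
        conn.commit()
        total += len(batch)
    return total
```

Calling `import_pages("collection/", sqlite3.connect("museum.db"))` would load the whole folder (both names are placeholders). Committing once per batch rather than once per page is what keeps a bulk import like this from crawling.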
- vaxorcist0
I used:
http://www.wrensoft.com/zoom/edi…
It worked very nicely on a HUGE site we did some years ago. The site had some legacy pages outside the CMS, so we couldn't use the CMS's database search....
wrensoft's system was.... very fast, no mySQL slowing down with huge queries for each search, and it took an hour to set up.... works on PHP/asp/cgi/etc.... but it does require a windows PC to run an indexing script and upload an index.. we automated this to run nightly
- DoktorDavid0
Two words: document management. It is unlikely these pages will stay static, with some vaulted, some active, some dead, some alive. Do it right and think (very) long term.
- vaxorcist0
I would really suggest a nightly running external spider -> index -> search index method, rather than a database method... i.e. google or wrensoft's zoomsearch or similar.... as this method would run fast and require the least modification of the existing site, fewer surprises and messes to fix, etc....
I'm curious about a document management idea for this, but it may not work well in a museum context, as there are probably some pages written by people who won't be available or won't follow rules....
I had a very good experience with wrensoft.com's zoomsearch, but I have no biz dealings except as a customer... it saved my sanity once...
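To make the spider -> index -> search idea concrete, here is a toy sketch in Python. It is not how Google or zoomsearch work internally. It only shows the general shape: build a word-to-pages index ahead of time (the nightly job), then answer searches from that index without touching the pages or a database. The function names and sample pages are invented for illustration, and the spider step is assumed to have already produced a dict of URL to plain text.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(pages):
    """Map each word to the set of page URLs that contain it.

    `pages` is a dict of {url: plain text}; in a real setup the nightly
    spider would produce it by crawling the site.
    """
    index = defaultdict(set)
    for url, text in pages.items():
        for word in tokenize(text):
            index[word].add(url)
    return index


def search(index, query):
    """Return the URLs that contain every word of the query (AND search)."""
    words = tokenize(query)
    if not words:
        return []
    hits = set(index.get(words[0], set()))
    for word in words[1:]:
        hits &= index.get(word, set())
    return sorted(hits)


# Hypothetical sample data standing in for crawled museum pages.
pages = {
    "/a.html": "Bronze Age axe head",
    "/b.html": "Roman bronze coin",
    "/c.html": "Victorian teapot",
}
index = build_index(pages)
print(search(index, "bronze coin"))  # ['/b.html']
```

Each search is a few set lookups, so its cost depends on the query rather than on re-reading 11,000 pages, which is why this approach tends to stay fast.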
- acescence0
searching aside, I think you should definitely put the content into some sort of db. whether or not you're doing the searching yourself, I think it would simplify things.
as far as search is concerned, mysql is definitely not optimized for fulltext search, but it's certainly capable to a degree. to put things in perspective, I worked on a production db that was up to 30 million rows and a lot of text. fulltext searches would take up to 20 seconds on a properly indexed db, but that's with a dataset maybe 2000 times larger than what you're dealing with. anyway, if the whole index doesn't fit in ram, you're still dealing with the overhead of going to disk no matter what means of local search you use.
anyway, I would just use google.
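The scale comparison above can be checked with some quick arithmetic. This is a back-of-the-envelope sketch only. It assumes one database row per museum page and, more loosely, that query time scales linearly with dataset size, which real fulltext indexes do not promise.

```python
production_rows = 30_000_000  # from the post above
museum_pages = 11_000         # from the original question
worst_case_seconds = 20       # slowest fulltext query mentioned in the post

# Assumption: one row per museum page.
ratio = production_rows / museum_pages

# Assumption: query time scales linearly with dataset size.
scaled_ms = worst_case_seconds / ratio * 1000

print(f"dataset ratio: about {ratio:.0f}x")
print(f"scaled worst case: about {scaled_ms:.1f} ms")
```

Under those assumptions the ratio comes out nearer 2,700 than 2,000, and the worst case shrinks to single-digit milliseconds, which fits the point that 11k pages is a small dataset for MySQL.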