Open, Distributed Web Search
Murray Walker, search@minty.org
Consider this an RFC (Request For Comments).
Please add them to the wiki.
--------------------------------------------------------------------------------
Abstract
This article discusses the concepts, goals and ideas for an open
source, community-run (distributed) web search service that allows
anyone to search, and thus navigate around the internet. A key tenet is
impartiality and objectivity - avoiding bias, corruption and "spamming"
of the service.
An outline of one possible technical architecture is presented. This
does not cover the details of any specific algorithms or search
techniques. No working code or implementation exists.
Introduction
The ability to find things on the internet, and particularly the world
wide web, is often taken for granted by the vast majority of people who
(often many times a day) are "searching the web".
Whilst there are a huge number of tools and technologies available to do this, a tiny handful dominate the market:
Google
Yahoo
Microsoft [1]
They, along with many other similar services, share a number of key features:
Their algorithms, or rules for deciding which results to display, are "closed" and often tightly guarded secrets.
They generate money, selling various forms of advertising around the search results.
They are commercial companies whose first priority is to generate value and profit for their shareholders.
There are no obligations or requirements for them to be objective,
impartial, comprehensive or even accurate. And there is very little
anyone can do to measure them against these metrics.
Given the power and influence these services have over everyone that
uses them, it could be argued there is scope for abuse and at least the
possibility for a conflict of interest.
Which leads to the idea for an "open" [2] web search service that is
free to use - both financially and (more importantly) in the sense that
it can be used by anyone, for any purpose. This means that the
technical code and information needed to run such a system are "open
sourced", but also that the service is provided by the community of
users without imposing any restrictions on how the service is used.
It is not known if this idea would work, or if it is simply a
description of a hypothetical, impractical or impossible scenario.
Hopefully by reading and adding to the debate, it will be possible to
find out.
The rest of this article discusses some issues that such a project
might encounter, and briefly outlines one possible technical
architecture that might be used to implement such a system.
If you would like to comment, you can do so on the accompanying wiki.
Assumptions and Scope
This article will discuss only a plain text web search service,
including HTML files. This excludes features like image search for the
sake of simplicity.
Many existing search services, including the "big three" mentioned
above, have paid advertisements around the search results. They also
all have separate, custom technology for delivering such advertisements
[3]. The practice and technology of "ad serving" are not discussed here.
Custom search services, such as searching for books on Amazon or
auctions on ebay, are out of scope. A generic web search service must
be able to handle unstructured, plain text data.
The primary focus is a search service that requires no manual
intervention or classification on a site by site, or page by page
basis. Namely, once it is set up and running, it is fully autonomous.
Advanced features of existing services, such as viewing cached versions of pages, spell checking, etc. are also out of scope.
Resources: Bandwidth, Disk space and CPU
Consider an index of around 4 billion items, roughly the size of
Google's current web index. Taking an average of 30KB per item or URL
(looking only at the plain text, including the HTML source),
considering each document exactly once would mean downloading around
111 Terabytes of data.
Then consider updates, which could range from quarterly or monthly
for resources infrequently used or rarely changed, to weekly, daily or
even hourly in the case of very popular or highly dynamic sites, such
as news sites.
It is not the goal, at least initially, to store complete copies of
everything downloaded. Thus, if a 30KB item could be reduced to 300
bytes of information, disk space requirements would be around 1000
Gigabytes. It is also possible that producing fast results when a
query is run would require storing more than 30KB per item. If that
were a ten-fold increase, disk storage requirements would be close to
1 Petabyte.
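The two storage scenarios can be sketched with the same hypothetical
figures (300 bytes per item in the best case, ten times the original
30KB per item in the worst case):

    items = 4_000_000_000

    # Optimistic: each 30KB item reduced to roughly 300 bytes of index data.
    compact_gb = items * 300 / 1024**3              # ~1100 Gigabytes

    # Pessimistic: ten times the original 30KB stored per item for fast queries.
    expanded_pb = items * 10 * 30 * 1024 / 1024**5  # ~1.1 Petabytes

    print(round(compact_gb), "GB,", round(expanded_pb, 1), "PB")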
CPU or raw processing power is even harder to predict, but it breaks
down into two distinct areas. First there is the task of crawling and
indexing pages, which can be done "offline" or in a batch approach.
For the sake of argument, if a typical 2GHz Pentium 4(r) computer
could process 100 URLs per second, that computer could do just over
8.5 million URLs in a day (assuming the data could be downloaded that
fast). It would take that single computer roughly 15 months to process
the 4 billion URLs. A hundred such computers could do it in just under
5 days.
Then there is the task of returning answers to queries in real time.
Again, for the sake of argument, assume our typical 2GHz Pentium
computer could deliver 100 requests per second. Assuming it can scale
close to linearly, the CPU resource required here would then become a
function of how popular the service became.
Considering a worst case, where our benchmark machine is capable of
crawling only 1 URL per second and answering queries at a rate of 1
request per second, the service could easily require thousands of
computers.
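A sketch of how the query-side requirement scales with traffic, using
the per-machine rates assumed above; the figure of 10,000 queries per
second at peak is purely illustrative and not taken from the text:

    import math

    def machines_needed(peak_queries_per_second, per_machine_rate):
        # Assumes close-to-linear scaling, as discussed above.
        return math.ceil(peak_queries_per_second / per_machine_rate)

    print(machines_needed(10_000, 100))   # 100 machines in the optimistic case
    print(machines_needed(10_000, 1))     # 10,000 machines in the worst case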
It is clear that the resources required are considerable.
Historically, approaches to this problem used a small number of
massively powerful computers (the Mainframe approach). More recent
approaches, including those used by Google, Yahoo and most likely
Microsoft, involve many hundreds or thousands of "commodity" computers (cheap,
standard, generic machines) working in some degree of parallelism,
using the network to communicate.
Taking this model one step further, it would appear that a
distributed, "peer-to-peer" model could potentially be used to provide
these resources. There is the potential for hundreds of thousands of
computers, distributed across the internet, all working together.
Existing (working) examples of peer-to-peer networks:
Traditional file-sharing networks such as Kazaa, which already allow
millions of users to search across millions of files, representing
terabytes of data, and then download those files simultaneously from
multiple sources.
Internet Telephony networks, like Skype, which use peer-to-peer networks to route calls for free around the world in real time.
Less traditional (newer?) file sharing networks, such as BitTorrent,
which implements an "open" service similar to the traditional service
offered by Akamai, and Freenet, which implements an entirely
decentralised hosting network where both publishers and consumers are
anonymous.
Spamming, corruption and abuse
Because a high position in search results drives more people to a web
site, many people and organisations go to considerable lengths and
cost to try to ensure their sites appear as high as possible in those
results. This is commonly referred to as "Search Engine Optimisation".
An example might be trying to ensure your web site appears above
those of your competitors - even when the search query was for a
competitor's name or product.
However this is only the tip of the iceberg when it comes to bias
and corruption in search results. For instance, a technique often
referred to as "Google Bombing" resulted in Microsoft's website
appearing at the top of the results when someone searched for "more
evil than satan himself" towards the end of 1999.
And then, of course, there are laws, lawyers, legal systems and governments. Consider libel and propaganda, to name but two.
There are a number of ways to address these problems, which are
typically coded into the algorithms and rules used by the search
service and are often highly guarded secrets. Although the specific
implementation details are not public knowledge, many of the concepts
are generally well understood and documented. PageRank(tm), as used by
Google, is a good example of this. Various forms of hashing to try to
identify similar documents are another.
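As an illustration of the hashing idea only (not the specific method
used by any existing engine), a minimal "shingling" sketch that
estimates how similar two documents are:

    import hashlib

    def shingles(text, size=5):
        # Overlapping runs of 'size' consecutive words from the document text.
        words = text.lower().split()
        return {" ".join(words[i:i + size])
                for i in range(max(len(words) - size + 1, 1))}

    def fingerprint(text, keep=100):
        # Hash every shingle and keep the smallest hashes as a compact fingerprint.
        hashed = sorted(int(hashlib.md5(s.encode()).hexdigest(), 16)
                        for s in shingles(text))
        return set(hashed[:keep])

    def resemblance(doc_a, doc_b):
        # The overlap between fingerprints approximates how similar the documents are.
        a, b = fingerprint(doc_a), fingerprint(doc_b)
        return len(a & b) / len(a | b)

    # Pages whose resemblance is above some threshold (say 0.9) could be
    # treated as near-duplicates and collapsed in the index.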
Discussion of these techniques, while a critical and very complex
part of any web search solution, is beyond the scope of this document.
There are however two problems specific to a distributed