The Internet has exploded to more than 320 million Web pages, and even the most dedicated surfer using the best search system would be able to find barely one-third of the pages, a study reported earlier this month.
It will not get easier soon, says Steve Lawrence of the NEC Research Institute, because the number of Web pages is expected to grow by 1,000 percent in just a few years. "Hundreds of pages are being added constantly," said Lawrence, co-author of a study published in the journal Science. "There is no simple way to index it all. There could be any percentage of pages out there that nobody has actually accessed yet."
Lawrence and C. Lee Giles, also of NEC, analyzed how well scientists are able to find specific information on the Web using "search engines," which work like electronic librarians, sorting and indexing millions of pages of data by subject or phrase.
What they found, said Lawrence, is that the amount of information on the Web overwhelms even the most sophisticated efforts to sort it all out, and there may be huge numbers of pages existing in an electronic shadowland never seen by humans.
"The results show that each search engine indexes only a fraction of the Web," said Lawrence.
The researchers analyzed the responses to 575 scientific search questions from the five largest search engines. They then individually checked about 150,000 pages for duplication, errors and mis-indexing. They also checked out the links, the Internet addresses of other sites that were referenced by the search engines.
Based on the study, Lawrence said, he estimated the Web has about 320 million pages that are accessible to casual browsers.
"This is probably a low estimate," said Lawrence, but it is nonetheless much larger than earlier studies, which had found the Web to contain only 80 million to 175 million pages.
Haym Hirsh, a computer science professor at Rutgers University, said that knowing the size of the problem could help experts find ways to control the explosion of information.
"Everybody knows the Web is enormous and that finding things on it is very difficult," he said. "It is an unorganized, uncoordinated collection of information sources that is totally overwhelming."
Lawrence said the estimate of 320 million does not include millions of pages that are protected by passwords or "search walls" that block access to browsers or search engines.
A search engine called HotBot had the most comprehensive index of the Web, the researchers said, but it covers only about 34 percent of the indexable pages. At the bottom of the search engine list was Lycos, with 3 percent coverage.
The study said that of the other three engines, AltaVista had 28 percent coverage, Northern Light about 20 percent and Excite about 14 percent.
Graham Spencer, chief technical officer of Excite, said it would be impractical for a search engine to attempt to index the entire Web because people already are complaining about being flooded with information.
A single inquiry can produce a response involving millions of pages, leaving people drowning in data, while still thirsting for information.
Spencer said that instead of trying to swallow the whole Web, Excite and other search engines gobble up only what they consider the best of the data.
"We try to focus on relevance, the information our customers will actually use," he said. "We scan a lot more pages than we actually index."
Lawrence said this results in some data virtually disappearing unseen into cyberspace.
People searching the Web, he said, could increase their chances of success by using two or more of the search engines. When all five engines were turned loose on one query, said Lawrence, the result could be as much as three times what was produced by one engine.
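The gain from querying several engines comes from taking the union of their result sets: pages missed by one index are often present in another. A minimal sketch of that idea, using hypothetical result lists and engine names (none of these appear in the study), might look like this:

```python
# Hypothetical URL result sets from three search engines for the same query.
# The engine names and pages are illustrative, not real data from the study.
results = {
    "engine_a": {"page1", "page2", "page3"},
    "engine_b": {"page3", "page4", "page5"},
    "engine_c": {"page5", "page6", "page7"},
}

def combined_coverage(engine_results):
    """Union of all engines' result sets, counting each unique page once."""
    merged = set()
    for pages in engine_results.values():
        merged |= pages
    return merged

merged = combined_coverage(results)
print(len(results["engine_a"]))  # one engine alone finds 3 pages
print(len(merged))               # the three combined find 7 unique pages
```

Because the engines' indexes overlap only partly, the merged set here is more than twice the size of any single engine's, mirroring the multiplied coverage Lawrence describes.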
Lawrence said the Web's data explosion may be better controlled by "meta-search engines," such as MetaCrawler and Ahoy!, which have developed techniques that infer what readers are looking for and seek out pages missing from most indexes.
More search engines that index only highly specialized subjects also are being developed to help people fight their way through the Web, Lawrence said.