Paris

Student use of a new ‘killer’ Internet application that allows anyone connected to the web to share music files stored on the hard disk of their own computer has become so heavy that US campuses want to ban the software for fear that it will saturate their academic Internet connections.

But some scientists are thinking of adopting the principles behind the so-called ‘Napster’ technology themselves. They believe these could herald a new era in distributed computing, and in particular solve the thorny problem of how the vast community of biologists can collaborate on assigning functions to the genes in the human genome.

Anyone running the Napster software, which can be downloaded from the Internet (http://www.napster.com/), can search for a song across the hard disks of all other Napster users in a single query and download it directly from another user's computer.

When Lincoln Stein, a bioinformaticist at the Cold Spring Harbor Laboratory in New York, heard about Napster, he was struck by the parallels with his own work on writing software for a distributed sequence annotation system for the human genome (see http://stein.cshl.org/das/). Napster, he realized, can be used to find and distribute information located anywhere on the Internet.

Stein believes that annotation, which involves predicting which sequence stretches are genes and what their function might be, calls for a radically different global database structure from the centralized system used for gene- and protein-sequence data. At present, users submit and retrieve records to a few central databases, such as GenBank.

But many scientists believe that annotation is too large a task for a few large genome centres. It is more of an art than a science, and up to half the predictions are wrong. No single group is likely to be able to produce a definitive version.

Annotation centres could in principle contribute data to GenBank-like centres. But GenBank entries, which can be modified only by those who submitted them, are sometimes erroneous. The problem is likely to be worse with the more subjective annotation data; the creation of a single authoritative annotated sequence seems unlikely.

A better solution, Stein argues, might be to allow biologists worldwide to annotate the human genome sequence interactively using diverse computational and experimental methods, much as developers worldwide debug open-source software.

But until now this sort of decentralized solution has raised the spectre of duplication of effort, and the risk that scientists, instead of being able to consult a single central database, would have to search a series of separate versions of human genome databases. Data integration would become a real problem.

Stein believes Napster-like technology could be the answer. A centralized reference server holding a detailed genome map would act as an anchor for data produced locally by third-party annotation servers. Researchers could publish their data electronically without having to maintain their own websites.

Such a system could avoid the need to label entries with subjective identifiers, such as gene names or accession numbers, as occurs now. Instead, maps, gene predictions and functional activities could all be superimposed on the reference map using a system of coordinates, much as astronomers combine multiple-wavelength data with positional or coordinate information. A user could zoom in on any part of the genome and view the region in many ways by calling up related data from the hard disks of participating laboratories.
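The coordinate idea can be illustrated with a minimal sketch in Python. It is not Stein's software or its protocol: the `Feature` record, the `view_region` function, the laboratory names and the positions are all hypothetical, and the in-memory dictionary merely stands in for remote annotation servers. It shows only how features from independent sources can be combined by position on a shared reference sequence, without relying on gene names or accession numbers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Feature:
    """One annotation, located only by coordinates on the reference map."""
    source: str   # hypothetical name of the contributing laboratory
    kind: str     # e.g. "gene_prediction" or "expression"
    start: int    # first base covered (1-based, inclusive)
    end: int      # last base covered (inclusive)


def overlaps(feature, start, end):
    """True if the feature shares at least one base with [start, end]."""
    return feature.start <= end and feature.end >= start


def view_region(servers, start, end):
    """Collect features from every server that fall within the region.

    `servers` maps a server name to its list of features; in a real
    system each list would come from a network request.
    """
    hits = [
        feature
        for features in servers.values()
        for feature in features
        if overlaps(feature, start, end)
    ]
    return sorted(hits, key=lambda f: (f.start, f.end, f.source))


# Made-up data for two hypothetical laboratories.
servers = {
    "lab_a": [
        Feature("lab_a", "gene_prediction", 1000, 5000),
        Feature("lab_a", "gene_prediction", 20000, 24000),
    ],
    "lab_b": [
        Feature("lab_b", "expression", 4500, 4800),
    ],
}

# "Zoom in" on bases 4000 to 6000 and list everything superimposed there.
for feature in view_region(servers, 4000, 6000):
    print(feature.source, feature.kind, feature.start, feature.end)
```

Because every record is anchored to the same coordinates, a new laboratory can add data without agreeing on a naming scheme with anyone else; the reference map is the only thing the contributors need to share.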

Ewan Birney, joint head of Ensembl, a collaboration between the European Bioinformatics Institute and the Sanger Centre in Cambridge to develop automatic annotation of eukaryotic genomes, is backing Stein's idea, and has proposed that Ensembl be used as the reference map. The US National Center for Biotechnology Information, however, apparently opposes the idea on the grounds that it would lead to a proliferation of junk.

That risk is real. Although few genome scientists admit it publicly, the quality control of many smaller laboratories does not match that of specialized centres. “It's a Catch 22,” says Birney. “We want to democratize people's ability to present their work, but quality is a problem.” He believes one answer would be to offer users a full interactive choice between an approved annotation, composed of data from recognized gold-standard laboratories, and the full data.
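The choice Birney describes amounts to a filter on the source of each annotation. The following Python sketch is an illustration only, under stated assumptions: the function name `select_annotations`, the record layout and the laboratory names are hypothetical, and how a laboratory would come to be recognized as gold-standard is left open, as it is in the proposal.

```python
def select_annotations(annotations, approved_sources, approved_only=True):
    """Return either the approved subset or the full set of annotations.

    `annotations` is a list of dicts with a "source" key (hypothetical
    layout); `approved_sources` is the set of recognized laboratories.
    """
    if not approved_only:
        return list(annotations)
    return [a for a in annotations if a["source"] in approved_sources]


# Made-up records from three hypothetical laboratories.
annotations = [
    {"source": "centre_1", "start": 1000, "end": 5000},
    {"source": "small_lab", "start": 1200, "end": 4900},
    {"source": "centre_2", "start": 20000, "end": 24000},
]
approved = {"centre_1", "centre_2"}

approved_view = select_annotations(annotations, approved)
full_view = select_annotations(annotations, approved, approved_only=False)
print(len(approved_view), "approved of", len(full_view), "in total")
```

Nothing is discarded in this arrangement: the data from smaller laboratories remain available, and the user decides which view to trust.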

Few researchers are willing to publicly endorse Napster-like technology. The idea of leaving desktop hard disks open to the Internet is a network manager's nightmare, and is not helped by Napster's ability to scan firewalls and breach weaknesses.

But Birney insists that developing the system should be taken seriously. “If we don't get it right, it will be the difference between a biological web and people just continuing to use the Internet to send e-mails and read web pages.”