One thing software seems to be getting really good at is classifying stuff. Just look at your email spam filter. My Gmail filter is pretty good at sorting the wheat from the Canadian Viagra. Spam filtering is a two-way classification problem: the software has to “decide” whether an email goes to the Inbox or to Spam. Other problems are multi-way classification problems, for example, protein function annotation: what is the function of a given protein? There are millions of different things proteins do in life. There are also millions of protein sequences in databases, and most of them are unannotated: we have no idea what they do. Of those that are annotated, the overwhelming majority (98%) were annotated by computational methods, with no human oversight.
Since genomes started coming out in droves, a new class of biologists, the biocurators, has been working to properly assign functions to sequences, mostly protein sequences. Biocurators look at sequences and assign function using whatever evidence they can find. The best evidence is a published experiment on the protein itself. The curator reads the article and assigns a function or functions to the protein (annotates the protein).
The commonly accepted vocabulary for protein function annotation is the Gene Ontology, or GO, which is used to describe what we know about protein function in a standard fashion. Proteins are assigned standard terms such as “protein tyrosine kinase activity” or “hydrogen:potassium-exchanging ATPase complex”. GO not only lets us record function in a universally accepted standard, it also provides a mechanism for recording how we know what we know, through the use of evidence codes. Curators use evidence codes to record how they inferred the gene function. The first class of evidence codes describes different types of experimental evidence, for example “inferred by protein interaction” or “inferred by genetic assay”.
A second class of evidence codes covers GO terms that curators assign when no experimental evidence exists for the protein’s function. From the GO site: “Use of the computational analysis evidence codes indicates that the annotation is based on an in silico analysis of the gene sequence and/or other data as described in the cited reference. The evidence codes in this category also indicate a varying degree of curatorial input.” The computational analysis evidence codes include inferred from key residues (IKR) and inferred from sequence or structure similarity (ISS). It’s important to remember that in this class, function is assigned not by producing experimental evidence, but by bioinformatic means, supervised by a human curator.
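To make the two classes of evidence codes concrete, here is a minimal Python sketch that sorts annotations by the kind of evidence behind them. The code groupings follow the evidence code categories as I understand them from the GO documentation, so check them against the current GO site before relying on them. The `Annotation` record, the helper functions, and the protein IDs are hypothetical names made up for illustration; they are not part of any GO tool. Codes outside the two classes discussed here are simply labelled "other".

```python
from collections import defaultdict
from dataclasses import dataclass

# Evidence-code groupings (assumed from the GO documentation; verify
# against the current GO site before use).
EXPERIMENTAL = {"EXP", "IDA", "IPI", "IMP", "IGI", "IEP"}
COMPUTATIONAL = {"ISS", "ISO", "ISA", "ISM", "IGC", "IBA", "IBD", "IKR", "IRD", "RCA"}


@dataclass(frozen=True)
class Annotation:
    """One protein-to-GO-term assignment (hypothetical record layout)."""
    protein_id: str
    go_term: str
    evidence_code: str


def evidence_class(code: str) -> str:
    """Return the evidence class that a GO evidence code belongs to."""
    code = code.upper()
    if code in EXPERIMENTAL:
        return "experimental"
    if code in COMPUTATIONAL:
        return "computational analysis"
    return "other"  # everything not covered by the two classes above


def group_by_evidence(annotations):
    """Group annotations into a dict keyed by evidence class."""
    groups = defaultdict(list)
    for ann in annotations:
        groups[evidence_class(ann.evidence_code)].append(ann)
    return dict(groups)


# Made-up example records; the protein IDs are placeholders.
annotations = [
    Annotation("PROT_A", "protein tyrosine kinase activity", "IPI"),
    Annotation("PROT_B", "protein tyrosine kinase activity", "ISS"),
    Annotation("PROT_C", "hydrogen:potassium-exchanging ATPase complex", "IKR"),
]

for cls, anns in group_by_evidence(annotations).items():
    print(cls, [a.protein_id for a in anns])
```

A lookup table like this is all the classification needs, because each evidence code belongs to exactly one class.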