Analyzing CEGS RACE Products

OK, so you've got a sequencing product from a fluorescing fish, and it looks like a genuine insertion. What next?

Your goals:

  • what gene/message is the product from?
  • where is the insertion?
  • what does the gene do?

What gene or message is the product from?

To figure out what gene the product is from, first do a nucleotide-nucleotide BLAST search ('blastn') against known zebrafish messages: go to the NCBI BLAST page and select 'RefSeq RNA' as the database to search. If that turns up a good match to a named gene (e.g. "Sox10") or an anonymous message (e.g. "BC019275"), then you have your gene! This is what my simple RACE Web annotator site does.

Note that for this kind of search, a good match should have an E-value of 1e-30 or lower (1e-40, 1e-50, etc.) -- you're searching with an actual zebrafish sequence, not doing a cross-species search where you would expect weaker hits to be relevant.
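If you run blastn yourself and save the results as tab-separated text, the E-value cutoff above is easy to apply automatically. Here is a minimal sketch, assuming the standard 12-column tabular layout that BLAST+ writes with "-outfmt 6"; the function name and the 1e-30 default are just my choices for illustration, not part of any existing tool.

```python
# Standard column order of BLAST+ tabular output (-outfmt 6).
COLUMNS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
           "qstart", "qend", "sstart", "send", "evalue", "bitscore"]


def good_hits(tabular_text, max_evalue=1e-30):
    """Return hits (as dicts) with E-value <= max_evalue, best hit first.

    Assumes tab-separated lines in the standard -outfmt 6 column order.
    Blank lines, comment lines and short lines are skipped.
    """
    hits = []
    for line in tabular_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) < len(COLUMNS):
            continue
        hit = dict(zip(COLUMNS, fields))
        hit["evalue"] = float(hit["evalue"])
        hit["bitscore"] = float(hit["bitscore"])
        if hit["evalue"] <= max_evalue:
            hits.append(hit)
    # Lowest E-value (strongest match) first.
    return sorted(hits, key=lambda h: h["evalue"])
```

Anything this returns is worth a look as "your gene"; an empty list means it is time to move on to the whole-genome search.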

Once you have the whole-gene sequence, do further searches with that sequence, which won't include cloning sites or other trap sequence but will include all of the interesting protein domains etc.

If blastn against known zebrafish messages doesn't work then (lucky you!) you may have found a new gene! Or (unlucky you) you may have a crappy sequence. It can sometimes be tough to tell which it is; the simplest way is to do a whole-genome search. I personally like the UCSC Genome Browser for this kind of search -- take your sequence and use BLAT to do a search against the zebrafish genome. The resulting hit list can be confusing; what you want to check is the QSIZE (Query Size) and the SCORE (roughly the number of nucleotides that matched, with penalties for mismatches and gaps). The SCORE should be relatively close to the QSIZE; otherwise, most of the sequence isn't being matched. In the example below, only the first hit (SCORE 817 against a QSIZE of 977) passes this check.

   ACTIONS      QUERY           SCORE START  END QSIZE IDENTITY CHRO STRAND  START    END      SPAN
---------------------------------------------------------------------------------------------------
browser details YourSeq          817    71   968   977  97.9%     2   -    6663257   6666644   3388
browser details YourSeq          157   526   926   977  88.3%    24   -    5428980   5433946   4967
browser details YourSeq           65   594   673   977  91.2%     6   -    9027597   9027676     80

Then click on the "browser" link (NOT the "details" link) and zoom out 10x or more. Your BLAT hit will be shown at the top of the page and all of the genome annotations will be mapped below it. This will give you a sense of the genomic neighborhood of the match and let you see if there's a match to genes in other species, etc. At this point you have to judge whether the sequence you found is an undiscovered exon from a previously known gene or an entirely novel gene.
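If you have a lot of products to get through, the SCORE-versus-QSIZE check on the hit list can be done in a few lines. This is only a sketch: it assumes you have pasted the hit list as plain text in the same column layout as the example above, and the 0.8 cutoff is an arbitrary number I picked for illustration, not an official threshold.

```python
EXAMPLE = """\
   ACTIONS      QUERY           SCORE START  END QSIZE IDENTITY CHRO STRAND  START    END      SPAN
---------------------------------------------------------------------------------------------------
browser details YourSeq          817    71   968   977  97.9%     2   -    6663257   6666644   3388
browser details YourSeq          157   526   926   977  88.3%    24   -    5428980   5433946   4967
browser details YourSeq           65   594   673   977  91.2%     6   -    9027597   9027676     80
"""


def parse_blat_hits(text):
    """Parse hit rows from a pasted BLAT hit list into dicts."""
    hits = []
    for line in text.splitlines():
        fields = line.split()
        # Hit rows have 13 whitespace-separated fields and start with
        # "browser details"; the header and ruler lines do not.
        if len(fields) != 13 or fields[0] != "browser":
            continue
        hits.append({
            "query": fields[2],
            "score": int(fields[3]),
            "qstart": int(fields[4]),
            "qend": int(fields[5]),
            "qsize": int(fields[6]),
            "identity": float(fields[7].rstrip("%")),
            "chrom": fields[8],
            "strand": fields[9],
            "start": int(fields[10]),
            "end": int(fields[11]),
            "span": int(fields[12]),
        })
    return hits


def score_fraction(hit):
    """SCORE as a fraction of QSIZE; close to 1.0 means most of the query matched."""
    return hit["score"] / hit["qsize"]


def well_matched(hits, min_fraction=0.8):
    """Keep hits whose SCORE is at least min_fraction of QSIZE (cutoff is an assumption)."""
    return [h for h in hits if score_fraction(h) >= min_fraction]
```

Running well_matched(parse_blat_hits(EXAMPLE)) keeps only the chromosome 2 hit, which is the one you would click through to in the browser.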

If searching the genome doesn't work, odds are that you do not have a real zebrafish gene. However, there is a slim chance that you have a novel gene that is not part of the existing zebrafish genome assembly. Your best bet in this case is to refer your sequence to your nearest friendly genome expert.

Where is the insertion?

Assuming that you find the insert is part of a particular gene, you can also map the location of the insertion by using the UCSC Genome Browser. Just do a BLAT search against the zebrafish genome and go to the browser; there you'll see the BLAT hit mapped against the message, including its exonic structure.
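Once you have read the exon coordinates off the browser, working out which exon or intron a given genomic position falls in is simple bookkeeping. Here is a rough sketch of that step. It assumes you supply the exon intervals yourself as 1-based inclusive (start, end) pairs and that you already know the genomic coordinate of interest (for example, where your BLAT match ends); the function name and the coordinates in the usage note are made up for illustration.

```python
def locate(position, exons, strand="+"):
    """Describe where a genomic position falls relative to a gene's exons.

    exons  -- list of (start, end) genomic intervals, 1-based inclusive
    strand -- "+" or "-"; exons and introns are numbered in transcript
              order, so numbering runs backwards along the genome for "-"
    """
    ordered = sorted(exons)
    n = len(ordered)
    if n == 0 or position < ordered[0][0] or position > ordered[-1][1]:
        return "outside the annotated exons"
    for idx, (start, end) in enumerate(ordered):
        if start <= position <= end:
            number = idx + 1 if strand == "+" else n - idx
            return f"exon {number}"
        if position < start:
            # Position lies in the gap before this exon, i.e. in an intron.
            number = idx if strand == "+" else n - idx
            return f"intron {number}"
```

With made-up exons [(1000, 1200), (2000, 2300), (5000, 5400)], locate(2100, exons) gives "exon 2" and locate(3000, exons) gives "intron 2"; on the minus strand, locate(3000, exons, "-") gives "intron 1" because the numbering runs the other way.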

(This is one of the things that my annotator does for you.)

OK, so what does the gene do?

You really have only two options for this. First, you can hope that the gene itself is the subject of one or more publications; you can figure this out by going to the NCBI Web page and looking at the publication list associated with the RefSeq entry. In the (quite likely) case that the gene is from a large-scale cDNA collection or is just a predicted gene, it might be worth looking at other species for matches to the message or its associated protein, but as a rule NCBI will already have annotated genes based on this information.

The second (and much worse) option is to look at conserved protein domains. I'm not really an expert in this, but I can suggest a few sites: the Pfam site will search for known and interesting protein domains in your protein sequence (which you can always get from NCBI for a RefSeq gene). To use Pfam, select "Sequence Search" and paste in your protein sequence. You can also use NCBI's Conserved Domain Database to do a similar search.