pygr + Google Summer of Code project page, GSoC 2008
Introduction
pygr is a Python-based graph database for bioinformatics. It is a general toolkit for storing and retrieving biological sequence relationships and annotations. Unlike other Python+biology toolkits (BioPython, corebio), pygr provides a high-level abstraction layer for working with these objects that makes it particularly suitable for genome-scale analysis (in our opinion :).
In general, Python seems to be underrepresented in bioinformatics compared to other research fields; this is partly because Perl and R are dominant, but it is still startling given the depth of Python projects available for e.g. numerical work. pygr itself is primarily used by a small subset of labs at UCLA, Caltech, Michigan State, and in Korea.
The aim of these projects for GSoC '08 is to help bring pygr to a wider community by increasing the base quality level of the codebase, improving new user and expert user documentation, providing more and better pre-fab data sets, and adding more generally useful features.
Possible Mentors
- Chris Lee, UCLA
- Namshin Kim, UCLA/Korea
- Titus Brown, Caltech/MSU
- Rob Kirpatrick, BCGSC
- Diane Trout, Caltech
Student submissions (confirmed)
- Jenny Qing Qian (Rob K.) - Ensembl functionality
- Alex Nolley (Chris Lee) - pygr codebase maintenance
- Rachel McCreary (Titus Brown) - examples, documentation, datasets
Subproject ideas
Also see http://lists.idyll.org/pipermail/biology-in-python/2007-September/000145.html
pygr codebase maintenance
Incomplete implementations of dict and similar interfaces need to be rounded out, so that each method either works as expected or raises NotImplementedError.
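As a rough illustration of what "rounded out" could look like, here is a minimal sketch in Python 3. The class name ReadOnlySeqDict is made up for this example and is not a pygr class; the sketch leans on the standard library's collections.abc.Mapping to fill in the read-side methods, and makes the unsupported write operations fail explicitly.

```python
from collections.abc import Mapping


class ReadOnlySeqDict(Mapping):
    """Hypothetical read-only, dict-like container (not a pygr class)."""

    def __init__(self, records):
        self._records = dict(records)

    # Mapping only requires these three methods; it derives keys(),
    # values(), items(), get() and __contains__ from them.
    def __getitem__(self, key):
        return self._records[key]

    def __iter__(self):
        return iter(self._records)

    def __len__(self):
        return len(self._records)

    # Unsupported operations fail loudly instead of being silently absent.
    def __setitem__(self, key, value):
        raise NotImplementedError("ReadOnlySeqDict does not support assignment")

    def __delitem__(self, key):
        raise NotImplementedError("ReadOnlySeqDict does not support deletion")


seqs = ReadOnlySeqDict({"chr1": "ACGT", "chr2": "GGCC"})
```

With this in place, seqs.get("chrX") and "chr1" in seqs behave like their dict counterparts, while seqs["chr3"] = "AAAA" raises NotImplementedError with a readable message.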
A major consideration for me (Titus) is that I don't have a sense of what computation & retrieval demands are placed on what classes. So, when building an NLMSA on a small genome, I tend not to worry about how the features are stored because it's fast; but then when I scale up to a big genome, there's a slowdown due to e.g. unpickling. Examples and information on these issues need to be available somewhere, preferably in the code and in the docs.
Interface expectations between classes seem undocumented; build some simple in-memory implementations, or stub implementations, that "pass" the interface requirements. And/or specify interfaces in some other way.
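One possible shape for this, sketched in Python 3: a single conformance function that states the expectations as assertions, plus the smallest in-memory stub that passes it. Both check_mapping_interface and InMemoryStub are hypothetical names, and the checks listed are illustrative rather than pygr's actual interface requirements.

```python
def check_mapping_interface(factory):
    """Run the same minimal checks against any dict-like implementation.

    `factory` takes a plain dict and returns the object under test.
    The checks are illustrative, not pygr's real requirements.
    """
    obj = factory({"a": 1, "b": 2})
    assert len(obj) == 2
    assert sorted(obj) == ["a", "b"]            # iteration yields keys
    assert obj["a"] == 1                        # lookup by key
    assert "b" in obj and "z" not in obj        # membership
    assert sorted(obj.keys()) == ["a", "b"]
    assert sorted(obj.items()) == [("a", 1), ("b", 2)]
    try:
        obj["z"]
    except KeyError:
        pass
    else:
        raise AssertionError("a missing key must raise KeyError")
    return True


class InMemoryStub:
    """Smallest in-memory class that passes the checks above."""

    def __init__(self, data):
        self._data = dict(data)

    def __len__(self):
        return len(self._data)

    def __iter__(self):
        return iter(self._data)

    def __getitem__(self, key):
        return self._data[key]

    def __contains__(self, key):
        return key in self._data

    def keys(self):
        return list(self._data)

    def items(self):
        return list(self._data.items())
```

Calling check_mapping_interface(InMemoryStub) documents the contract and tests the stub at the same time; the same function could then be pointed at a disk-backed or SQL-backed class.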
pygr contains large swathes of completely uncommented code, both Python and Pyrex/C. This code should be gone through and automated tests built.
Similarly, some of the code is pretty evolved and could use some systematic refactoring once tests have been added.
Automated test architecture: add continuous integration, systematic unit/functional/doctest tests, and more large-data-set tests.
Benchmark and profile a few of the large-data-set tests?
PEP-8 compliance should (IMO -- titus) be a long-term goal.
Measure test coverage of basic tests; improve to >= 80% or so.
pygr.Data additions
sqlite support (sqlite now built into Python as of 2.5)
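To give a feel for how little is needed to get started with the built-in sqlite3 module, here is a small Python 3 sketch that stores interval annotations and queries them by overlap. The table layout, column names, and function names are all assumptions made for this example, not an existing pygr schema or API.

```python
import sqlite3


def build_annotation_db(rows):
    """Load (name, seq_id, seq_start, seq_stop) rows into an in-memory table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE annotation ("
        "name TEXT PRIMARY KEY, seq_id TEXT, "
        "seq_start INTEGER, seq_stop INTEGER)"
    )
    conn.executemany("INSERT INTO annotation VALUES (?, ?, ?, ?)", rows)
    conn.commit()
    return conn


def overlapping(conn, seq_id, start, stop):
    """Names of annotations overlapping the half-open interval [start, stop)."""
    cursor = conn.execute(
        "SELECT name FROM annotation "
        "WHERE seq_id = ? AND seq_start < ? AND seq_stop > ? "
        "ORDER BY seq_start",
        (seq_id, stop, start),
    )
    return [row[0] for row in cursor]


conn = build_annotation_db([
    ("exon1", "chr1", 100, 200),
    ("exon2", "chr1", 300, 400),
    ("exon3", "chr2", 100, 200),
])
```

Swapping ":memory:" for a file path gives a persistent database with no server to install, which is the main attraction for new users.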
"download=True" option to download data sets to local machine if not present (item #12 on mailing list posting, above)
Better dependency detection/hooks for running custom "unpickling" function for installation/detection of dependencies
pygr.Data server Web interface to list available resources
Object deletion schema (?): if obj has edge relations to other objects managed under the pygr.Data schema, what happens when that object is deleted? The answers become data integrity rules in the schema: e.g. just delete its edges; or also delete objects that it has a certain relation to (e.g. its "child objects").
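The two rules mentioned above (only drop the edges, or also delete related objects) can be sketched with a toy object graph in Python 3. The Schema class, the rule names "unlink" and "cascade", and the relation names are all hypothetical; pygr.Data's real schema machinery is not modeled here.

```python
class Schema:
    """Toy object graph with per-relation deletion rules (hypothetical)."""

    def __init__(self):
        self.objects = set()
        self.edges = {}        # (source, target) -> relation name
        self.on_delete = {}    # relation name -> "unlink" or "cascade"

    def add_edge(self, source, target, relation):
        self.objects.update([source, target])
        self.edges[(source, target)] = relation

    def delete(self, obj):
        if obj not in self.objects:
            return             # already gone; also stops cycles
        self.objects.discard(obj)
        # Remember outgoing edges before the edge table is rewritten.
        outgoing = [(pair, rel) for pair, rel in self.edges.items()
                    if pair[0] == obj]
        # Every edge touching the deleted object is always removed.
        self.edges = {pair: rel for pair, rel in self.edges.items()
                      if obj not in pair}
        # "cascade" relations additionally delete the target object;
        # the default rule, "unlink", leaves the target in place.
        for (_, target), rel in outgoing:
            if self.on_delete.get(rel, "unlink") == "cascade":
                self.delete(target)


schema = Schema()
schema.on_delete["child"] = "cascade"
schema.add_edge("genome", "exon_annotations", "child")
schema.add_edge("genome", "alignment", "used_by")
schema.delete("genome")
```

After the final call, "exon_annotations" has been removed along with "genome" because of the cascade rule, "alignment" survives, and no edges are left dangling.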
Fix pickle (in)security. Signed pickles, e.g. with GPG? A TrustedPickle prototype? See item #13 on the mailing list posting, above.
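To make the signed-pickle idea concrete, here is a minimal Python 3 sketch. Note that it swaps in an HMAC with a shared secret key (standard library hmac and hashlib) rather than GPG, purely to keep the example self-contained; the function names dumps_signed and loads_signed are made up. The important property is that the signature is checked before pickle.loads ever sees the data.

```python
import hashlib
import hmac
import pickle

DIGEST_SIZE = hashlib.sha256().digest_size  # 32 bytes for SHA-256


def dumps_signed(obj, key):
    """Pickle `obj` and prepend an HMAC-SHA256 signature of the payload."""
    payload = pickle.dumps(obj)
    signature = hmac.new(key, payload, hashlib.sha256).digest()
    return signature + payload


def loads_signed(blob, key):
    """Verify the signature first; unpickle only if it matches."""
    signature, payload = blob[:DIGEST_SIZE], blob[DIGEST_SIZE:]
    expected = hmac.new(key, payload, hashlib.sha256).digest()
    # compare_digest avoids leaking information through timing.
    if not hmac.compare_digest(signature, expected):
        raise ValueError("signature mismatch: refusing to unpickle")
    return pickle.loads(payload)
```

A shared secret means anyone who can verify a pickle can also forge one, so this only suits data exchanged inside a single trusted group. Public distribution of data sets would need public-key signatures, which is where something like GPG comes in.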
"dry run" query: ability to see how pygr.Data would fulfill a specific request, without actually getting it
Documentation, user help, and community building
The Developer Forum has *tons* of examples. Many of these could be written up as text-file doctests or otherwise made "executable", and then run as part of the continuous integration, standard tests, etc.
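For anyone who hasn't used text-file doctests, this Python 3 sketch shows the mechanics with the standard doctest module. The example text is invented and deliberately uses only plain Python so that it runs without pygr installed; a real write-up would import pygr in the first >>> line. The helper name run_text_doctest is also made up.

```python
import doctest
import os
import tempfile

EXAMPLE_TEXT = '''\
Reverse-complementing a short sequence
======================================

    >>> complement = {"A": "T", "C": "G", "G": "C", "T": "A"}
    >>> seq = "ATGC"
    >>> "".join(complement[base] for base in reversed(seq))
    'GCAT'
'''


def run_text_doctest(text):
    """Write `text` to a temporary file and run it with doctest.testfile."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "example.txt")
        with open(path, "w") as handle:
            handle.write(text)
        # module_relative=False lets us pass an ordinary filesystem path.
        return doctest.testfile(path, module_relative=False)


results = run_text_doctest(EXAMPLE_TEXT)
```

The returned value has failed and attempted counts, so a continuous integration job can simply fail the build whenever results.failed is nonzero.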
Installation etc.
Make easy_install-able, build binary eggs for Windows, Mac OS X
Go through and test Windows installer
Add packages/make packages for debian, fink, redhat
Examples
From start to finish:
- Microbial genome annotations
- importing Wormbase annotations
- importing UCSC, Ensembl (much work already done)
- Using SQL db as a backend for the above
Data sets
Post, maintain UCSC alignment data sets
Post leelab databases
Ensembl annotation databases
Expansion of feature set
Build tools & libraries to wrap/import a variety of alignment types and larger data sets, e.g. CLUSTALW, blastz, LAGAN, ENSEMBL, etc. (#10 & 11 on mailing list post, above)
Serve BLAST/MegaBLAST via XML-RPC (#2 on mailing list posting, above)
Serve AnnotationDB via XML-RPC (#1 on mailing list posting, above)
Fast result filtering (#5)
Store BLAST edge info (#7)
Nucleotide-to-AA annotation and separate coordinates (#8)
TBLASTN and BLASTX support (#9)
