Pharmaceutical Company Data Mining

 

You have been hired to develop a database intensive application for a large pharmaceutical company, "Drug R Us". The pharmaceutical researchers/developers rely on the company’s IT department to develop ways to handle the storage and retrieval of large amounts of complex data such as:

            1.  Genbank Nucleotide Sequences

2.  PubMed References

3.  Basic Local Alignment Search Tool (BLAST) Output

4.  Georgetown’s Protein Information Resource Records

The data is generated, as needed, by the scientist.  This data must be stored so that it is readily available to those involved in the process of rational drug design.  Again, you have been designated to play an important role in storing, managing and collecting this information.

1. Design the necessary relations and their dependencies in order to support the above described.  If you have questions, see the instructor.

Obviously you will have to make the right assumptions and you are required to have primary key and foreign key constraints, potentially NOT NULL constraints, defined for your schema.

2. All input to and output from your application will be done in XML format. You will be provided with a DTD schema. It is up to you if you have a mapping between XML and the DDL schema or you use pretty much the same names for the fields when applicable.

3. You will be provided with all your test data (in XML format).

4. Each team will consists of 2 students, in the case of black sheep or solitaires you will be paired with the instructor.

5. You will need to provide the ability to load data into your database, and also the ability to query data. The format should be XML. The emphasis will not be set on the User Interface, unless you are getting bored with the server side, i.e. works well and you still feel like working on your project. You will need some user interface, however it doesn't have to be web based, or even graphical, it can be command line.

6. Protein Information Resource

a. Given a sequence, get protein entry id and protein name.
b. Given a protein name, get accession number and all reference
information.
c. Given a classification superfamily, get all protein entry id's, their
names, their alt names, and all organism information.
d. Given a gene db and gene uid, get protein name, protein id,
map-position, and introns.


BLAST Output

a. Given a BlastOutput_query-def, get all Hit_num's and number of Hsp's per
each Hit.
b. Given a BlastOutput_query-def, get all Parameters and Statistics
information.
c. Given a BlastOutput_query-def and a decimal value, get all Hit_num,
Hit_id, Hit_def, Hit_accession, Hit_len where there exists an Hsp with a
Hsp_bit-score greater that the given decimal value.
d. Given a BlastOutput_query-def and a decimal value (be aware of
scientific notation), get Hsp_qseq, Hsp_hseq, and Hsp_midline for every Hsp
with a Hsp_evalue less than the given decimal value.
 

 You will need to accommodate at least these, and feel free to be creative and provide even more queries.

Only the query results have to be provided in XML format, the query parameters can be entered command line, or through user interaction.

7. The technologies to be used are JDBC, accessing an Oracle database, and using the JAXP API for XML parsing.

8. There is no preferred solution for this project. Obviously some will be better or more detailed than others. Try to do your best, clearly documenting your assumptions.

9. Additional details will be provided in class.