The National Science Foundation has awarded a $9 million, five-year grant to a collaboration of researchers from Carnegie Mellon, the University of Pittsburgh, the Massachusetts Institute of Technology, Boston University and the National Research Council of Canada to advance a new field called Computational Biolinguistics.
Computational Biolinguistics applies computational tools, including statistical language modeling, machine learning methods and high-level language processing, to help scientists better understand how proteins work inside cells.
As in languages, where there are sequences of letters that fall into patterns that make them understandable, there are sequences of amino acids in proteins that can be read to understand their structure, dynamics and function. Sequences of amino acids and their constituents can be thought of as syllables or words that have particular properties.
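As a rough illustration of the language analogy, and not the project's own method, short overlapping runs of amino acids can be counted much as word n-grams are counted in text. The sketch below assumes a made-up sequence written in one-letter amino acid codes; the function name `ngram_counts` and the toy data are hypothetical.

```python
from collections import Counter


def ngram_counts(sequence, n):
    """Count every overlapping run of n residues in a protein sequence."""
    return Counter(sequence[i:i + n] for i in range(len(sequence) - n + 1))


# Hypothetical toy sequence in one-letter amino acid codes, not real data.
toy_sequence = "MKTAYIAKQRMKTAY"

# Treat each run of three residues as a "word" and tally how often it occurs.
counts = ngram_counts(toy_sequence, 3)
print(counts.most_common(3))
```

Frequency tables of this kind are the usual starting point for statistical language models; whether they reveal anything about a protein's structure or function is the open question the project is meant to explore.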
A deeper understanding of the relationship between protein structure, dynamics and function can help to extract information hidden in the gene sequences of genomes, which may, in turn, help develop drugs to fight disease. Today, there is great societal demand to understand and treat degenerative diseases, many of which are based on defective triggers for protein shape and interactions.
The project's principal investigators are Raj Reddy, Carnegie Mellon's Herbert A. Simon University Professor of Computer Science and Robotics, and Judith Klein-Seetharaman, assistant professor of pharmacology at the University of Pittsburgh Medical School, who also holds an appointment at Carnegie Mellon's Language Technologies Institute (LTI).
"The Human Genome Project and related genome sequencing efforts have provided a wealth of data, which has stirred great hopes for increasing our understanding and treating of disease or for mimicking nature's inventions in nanomachine design," said Klein-Seetharaman. "But the precise relationship between a primary sequence and the structure, dynamics and function of the encoded proteins is one of the most fundamental unanswered questions in biology.
"The Computational Biolinguistics Project promises to provide novel views and approaches to solving these challenges that would not be obvious without thinking in terms of the analogy between language and biology."
Carnegie Mellon will be the central site for the computational biolinguistics project. Its scientists will supply all of the necessary computational and language modeling technologies. Other partners will provide the bulk of biological and proteomic research and the laboratories where experimental work will take place.
There is also an industrial component to the project. The MathWorks, Inc., of Natick, Mass., will work with Carnegie Mellon scientists to enhance its MATLAB mathematical software to better support computational biolinguistics research. Medstory, Inc., of Burlingame, Calif., which deals with drug innovation informatics, will focus on the clinical and drug development relevance of computational discoveries made under this program.
The Computational Biolinguistics grant is one of more than 300 announced by the National Science Foundation as part of its Information Technology Research (ITR) program. This year, NSF awarded a total of $144 million in new grants under the program.
NSF Aids Million Book Project
The National Science Foundation's Information Technology Research program has also awarded a $3 million, three-year grant to the Million Book Project (MBP) to support digitization of core academic materials, technical reports, government documents and cultural treasures.
The project involves partners at Carnegie Mellon, Carnegie Library of Pittsburgh, Indiana University, the National Agricultural Library, OCLC, Penn State University, Stanford University, University of California-Berkeley, University of Washington, and 17 institutions in China and India. Principal investigators are Raj Reddy and university librarian Gloriana St. Clair.
The MBP will create a large testbed of academic resources of all types, in many languages, and make these materials available free for all to read on the Internet. The project is expected to be completed by 2007.
Anne Watzman
(10/10/02)