Recent years have seen an increasing effort to extract relations between entities from natural language text. In this dissertation, I focus on recognizing biomedical relations between entities reported in scientific articles.
Approaches to the relation extraction task fall into two major classes: (1) pattern-based approaches and (2) machine learning-based approaches. Pattern-based approaches often use manually designed rules to extract relations. They do not need annotated corpora for training, and it is generally possible to obtain patterns that yield good precision. However, pattern-based approaches require domain experts to be closely involved in the design of the system, and it is impractical to manually encode all the patterns necessary for high recall. Machine learning-based approaches, on the other hand, are data-driven: they derive models for automated extraction from a set of annotated data. However, their ability to generalize from examples can be hindered by a lack of domain and linguistic knowledge, and their performance depends critically on the availability of large amounts of annotated data, which can be expensive to create.
In this study, I propose a set of methods that leverage linguistic theories to alleviate the problems in both classes of approaches.
For pattern-based approaches, I propose a framework that enables the fast development of pattern-based systems, reducing the involvement of domain experts while still attaining high precision and recall. The approach starts by identifying a list of triggers for the target relation and their corresponding specifications. Given this information, I use linguistic principles to derive variations of lexico-syntactic patterns in a systematic manner. These lexico-syntactic patterns are matched against the input text to extract the target relations. I also incorporate text simplification and referential relation linking to improve the applicability of the generated patterns.
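The step from a trigger specification to pattern variations can be illustrated with a minimal sketch. It is not the framework itself: the `Trigger` fields, the three surface templates (active, passive, nominalization), and the single-token `ENTITY` expression are all assumptions made for illustration, and plain regular expressions stand in for genuine lexico-syntactic patterns over parsed text.

```python
import re
from dataclasses import dataclass

# Hypothetical stand-in for an entity mention: a single alphanumeric token.
ENTITY = r"[A-Za-z0-9-]+"


@dataclass(frozen=True)
class Trigger:
    """Assumed trigger specification: the surface forms of one relation trigger."""
    relation: str    # name of the target relation
    active: str      # finite verb form, e.g. "phosphorylates"
    participle: str  # past participle, e.g. "phosphorylated"
    nominal: str     # nominalized form, e.g. "phosphorylation"


def derive_patterns(trigger):
    """Derive surface variations of one trigger from a fixed set of templates."""
    agent = rf"(?P<agent>{ENTITY})"
    theme = rf"(?P<theme>{ENTITY})"
    templates = [
        rf"{agent} {re.escape(trigger.active)} {theme}",           # active voice
        rf"{theme} is {re.escape(trigger.participle)} by {agent}",  # passive voice
        rf"{re.escape(trigger.nominal)} of {theme} by {agent}",     # nominalization
    ]
    return [re.compile(t) for t in templates]


def extract(sentence, trigger):
    """Match every derived pattern against the sentence and collect the relations."""
    found = []
    for pattern in derive_patterns(trigger):
        for match in pattern.finditer(sentence):
            found.append((trigger.relation, match.group("agent"), match.group("theme")))
    return found


PHOSPHORYLATION = Trigger(
    relation="phosphorylation",
    active="phosphorylates",
    participle="phosphorylated",
    nominal="phosphorylation",
)
```

Under these assumptions, `extract("ERK2 is phosphorylated by MAPK", PHOSPHORYLATION)` yields the same agent and theme as the active sentence "MAPK phosphorylates ERK2", which is the point of deriving the variations from one specification rather than writing each pattern by hand.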
For machine learning-based approaches, I propose a new class of tree-based kernels that incorporate the linguistic knowledge explored above. Their innovative feature is added expressiveness: by increasing the domain of locality, the kernels can state linguistic relations directly. I also propose methods for generating the kernels in a systematic way and for applying sentence simplification to improve the coverage of extraction.
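For readers unfamiliar with tree kernels, the following sketch shows a standard baseline, the subset tree kernel of Collins and Duffy, which counts the tree fragments two parse trees share. It is not the kernel class proposed here; it only illustrates the kind of function being extended. The nested-tuple tree encoding and the function names are assumptions made for this example.

```python
from itertools import product

# Assumed encoding: a tree is a tuple (label, child, child, ...);
# a child is either another tuple (a nonterminal) or a plain string (a word).


def production(node):
    """Return the grammar production at a node: its label plus its child labels."""
    label, *children = node
    rhs = tuple(c if isinstance(c, str) else ("NT", c[0]) for c in children)
    return (label, rhs)


def internal_nodes(tree):
    """Yield every nonterminal node of the tree in pre-order."""
    if isinstance(tree, str):
        return
    yield tree
    for child in tree[1:]:
        yield from internal_nodes(child)


def common_fragments(n1, n2, decay=1.0):
    """Weighted count of the fragments rooted at both n1 and n2."""
    if production(n1) != production(n2):
        return 0.0
    score = decay
    # Equal productions guarantee that the children align position by position.
    for c1, c2 in zip(n1[1:], n2[1:]):
        if not isinstance(c1, str):
            score *= 1.0 + common_fragments(c1, c2, decay)
    return score


def tree_kernel(t1, t2, decay=1.0):
    """Sum the common-fragment counts over all pairs of nodes."""
    return sum(
        common_fragments(a, b, decay)
        for a, b in product(internal_nodes(t1), internal_nodes(t2))
    )


T1 = ("S", ("NP", "MAPK"), ("VP", ("V", "phosphorylates"), ("NP", "ERK2")))
T2 = ("S", ("NP", "RAF"), ("VP", ("V", "phosphorylates"), ("NP", "MEK")))
```

In this baseline, each fragment is built from single productions, so the domain of locality is one node and its immediate children. The two example trees share six fragments, all of them anchored on the common verb and the sentence skeleton.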
To evaluate the proposed methods, I plan to implement both approaches for biomedical relation extraction tasks. I aim to show that experiments on different corpora provide empirical support for my hypothesis and that the results on these datasets compare favorably with state-of-the-art results. I also plan to evaluate the impact of the individual aspects of the proposed methods.