Does an Automated Assignment of Biological Topics Produce Relevant Semantic Meaning?
David C. McLean, Jr.1,2, Bin Zheng2, and Xinghua Lu2
1 Marine Biomedicine and Environmental Sciences Center, Medical University of South Carolina, Charleston, SC
2 Department of Biostatistics, Bioinformatics, and Epidemiology, Medical University of South Carolina, Charleston, SC

As scientific literature grows rapidly, it becomes increasingly valuable to provide accurate, relevant, and automatic identification of topics in the electronic literature stream for both archival and real-time information retrieval. Text documents can be considered as mixtures of words from different topics; these topics can be inferred by statistical learning techniques. A probabilistic topic model, known as the latent Dirichlet allocation (LDA) model, was applied to automatically identify arbitrary biological topics from a corpus of Medline abstracts collected to describe protein function. The documents within the corpus were also annotated with Gene Ontology (GO) terms. The correlation between latent topics