Research

Medical Text Mining

Clinical trials are the gold standard for generating reliable medical evidence, and recruitment is their biggest bottleneck. Tools that let patients search for relevant clinical trials have been developed to facilitate recruitment, but users often suffer from information overload. 1) With nearly 700 coronavirus disease 2019 (COVID-19) trials conducted in the United States as of August 2020, rapid recruitment to these studies is imperative. The COVID-19 Trial Finder was designed to facilitate patient-centered search of COVID-19 trials: users first filter by location and radius distance from trial sites, then answer brief, dynamically generated medical questions to prescreen their eligibility for nearby COVID-19 trials with minimal human-computer interaction. 2) We developed the Clinical Trial Knowledge Base, a regularly updated knowledge base of discrete clinical trial eligibility criteria, equipped with a web-based user interface for querying and aggregate analysis of common eligibility criteria.
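The location step described above can be illustrated with a minimal sketch. It assumes each trial site is stored with a latitude and longitude and that distance is measured along a great circle (the haversine formula); the trial identifiers, field names, and function names are hypothetical and are not taken from the COVID-19 Trial Finder itself.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two points given in degrees."""
    earth_radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * earth_radius_km * math.asin(math.sqrt(a))

def trials_within_radius(trials, user_lat, user_lon, radius_km):
    """Return IDs of trials whose site lies within radius_km, nearest first."""
    nearby = []
    for trial in trials:
        distance = haversine_km(user_lat, user_lon, trial["lat"], trial["lon"])
        if distance <= radius_km:
            nearby.append((distance, trial["id"]))
    return [trial_id for _, trial_id in sorted(nearby)]

# Hypothetical sites: one in New York City, one in Boston.
trials = [
    {"id": "TRIAL-A", "lat": 40.7128, "lon": -74.0060},
    {"id": "TRIAL-B", "lat": 42.3601, "lon": -71.0589},
]

# A user located in New York City searching with two different radii.
close_by = trials_within_radius(trials, 40.7128, -74.0060, 50)
wider = trials_within_radius(trials, 40.7128, -74.0060, 400)
```

The trials that survive this filter would then be narrowed further by the eligibility questions; that step depends on the structured criteria for each trial and is not sketched here.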

Press Coverage of Our Work

“COVID-19 trial finder provides simplified search process for COVID-related clinical trials” (EurekAlert, 05/2020; Medical Press, 05/2020; DBMI, Columbia University, 05/2020)

“A Knowledge Base of Clinical Trial Eligibility Criteria” (DBMI Twitter, Columbia University, 04/2021; DBMI LinkedIn, Columbia University, 04/2021)

Web Text Mining

In the era of the Social Web, there has been explosive growth of user-generated content published on online web forums. Short texts have become a popular writing format because they are convenient to post and to respond to. Examples include comments, tweets, reviews, and questions and answers. Given the large volume of short texts available online, comprehending and filtering them quickly has become a challenging problem. In this dissertation, we explore two questions related to short texts: what are they talking about, and can you trust the source?

To answer the first question, an effective and efficient approach is to discover latent topics from large text datasets. Because text in online discussions is sparse, traditional topic models have had limited success when applied directly to topic mining tasks: short texts do not provide sufficient term co-occurrence information for the reliable discovery of topics. To overcome that limitation, we use (1) the tree structure of the discussion thread, proposing a “popularity” metric that quantifies the number of replies to a given comment and extends the frequency of word occurrences, and (2) the “transitivity” concept, which characterizes topic dependency among nodes in a nested discussion thread. We then build a Conversational Structure Aware Topic Model (CSATM) based on popularity and transitivity to infer topics and their assignments to comments.
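As a rough illustration of the popularity idea, the sketch below counts the direct replies to each comment in a thread and scales that comment's word counts accordingly. The weighting rule (1 plus the number of direct replies), the data layout, and the function names are assumptions made for this example; this is not the CSATM formulation, and it does not model transitivity.

```python
from collections import Counter, defaultdict

def reply_counts(comments):
    """Count the direct replies to each comment, keyed by the parent's ID."""
    counts = defaultdict(int)
    for comment in comments:
        if comment["parent"] is not None:
            counts[comment["parent"]] += 1
    return counts

def popularity_weighted_counts(comments):
    """Scale each comment's word counts by (1 + number of direct replies).

    Assumed weighting for illustration only: a comment that draws more
    replies contributes proportionally more term occurrences.
    """
    popularity = reply_counts(comments)
    weighted = {}
    for comment in comments:
        weight = 1 + popularity[comment["id"]]
        base = Counter(comment["text"].lower().split())
        weighted[comment["id"]] = Counter({word: n * weight for word, n in base.items()})
    return weighted

# Hypothetical thread: comment 1 is the root and has two replies.
comments = [
    {"id": 1, "parent": None, "text": "vaccine trial news"},
    {"id": 2, "parent": 1, "text": "which trial"},
    {"id": 3, "parent": 1, "text": "trial trial results"},
]

weighted = popularity_weighted_counts(comments)
```

In this toy thread the root comment has two replies, so each of its words is counted three times, while the leaf comments keep their raw counts. Boosting the counts of heavily discussed comments is one simple way to give a topic model more co-occurrence evidence from short texts.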

Yingcheng Sun
