Yanchuan Sim, Developer in Singapore, Singapore

Yanchuan Sim

Verified Expert in Engineering

Machine Learning Developer

Singapore, Singapore
Toptal Member Since
June 6, 2016

Yanchuan is a PhD candidate in language technologies from Carnegie Mellon University with over seven years of experience working with machine learning systems and cutting-edge technologies in natural language processing. Currently, he works with several startups to incorporate ML and NLP technologies into their products.






Preferred Environment

Git, Sublime Text, OS X, Linux

The most amazing...

...project I've worked on is a method to uncover and measure the ideological stances of election candidates through their speeches.

Work Experience


2017 - PRESENT
Bot MD
  • Led and managed a remote team of two back-end engineers, two Android engineers, and an assortment of freelancers for Bot MD, a clinical AI assistant for doctors, as part of Y Combinator's S18 batch.
  • Spearheaded the development of a full-featured Android chat application with various productivity features for doctors.
  • Built the chat engine from scratch, leveraging my deep understanding of linguistics and NLP.
Technologies: PostgreSQL, Chatbots, WebSockets, Django, Python

Technical Advisor

2016 - PRESENT
  • Provided technical expertise and advised on extracting information from unstructured text.
  • Led a team of two engineers to build a customized text search algorithm for market research documents.
Technologies: Search Engines, Information Retrieval, Machine Learning, Natural Language Processing (NLP), Generative Pre-trained Transformers (GPT), Elasticsearch, Python


2016 - PRESENT
Institute for Infocomm Research
  • Researched novel techniques for improving state-of-the-art NLP systems.
Technologies: Natural Language Processing (NLP), Generative Pre-trained Transformers (GPT), Machine Learning

Advisor and Data Scientist in Residence

2016 - PRESENT
  • Advised and collaborated with the engineering team on topics and techniques related to natural language processing, information retrieval, and machine learning.
  • Provided domain knowledge and input on product roadmaps.
Technologies: Machine Learning, Information Retrieval, Natural Language Processing (NLP), Generative Pre-trained Transformers (GPT)

Technical Advisor

2015 - PRESENT
AirPR, Inc.
  • Built an automatic key phrase extraction module for PR news (soundbites).
  • Designed customized author ranking algorithms for LinkedIn publishers using social and influence metrics.
  • Improved the Elasticsearch relevance ranking algorithm by designing custom features and metrics, improving the relevance ranking of results by 30%.
  • Implemented a state-of-the-art customized sentiment classifier for Tweets using crowdsourcing and ensemble methods.
  • Built a data processing pipeline for handling millions of articles using Spark and Elasticsearch.
  • Built an NLP pipeline for processing millions of news articles.
  • Utilized techniques that included logistic regression, support vector machines (SVM), random forests, and ensemble methods.
Technologies: Amazon Web Services (AWS), Amazon Elastic MapReduce (EMR), MongoDB, MySQL, Java, Ruby on Rails (RoR), Ruby, Flask, Elasticsearch, Spark, Scala, Python

Visiting PhD Scholar

2015 - 2016
University of Washington
  • Performed a variety of academic duties as a scholar in residence with the University of Washington Computer Science and Engineering department.

Graduate Research Assistant

2011 - 2016
Carnegie Mellon University
  • Assisted with the courses Introduction to Natural Language Processing (NLP) and Graduate Seminar on Advanced NLP.
  • Pursued research interests in Machine Learning (ML), Natural Language Processing (NLP), and Computational Social Science (CSS).
  • Applied NLP techniques to text mining and information extraction tasks.
  • Built tools to support the automatic discovery and analysis of decision-making in the U.S. Supreme Court.
  • Built tools to help political scientists analyze and explore speeches of U.S. presidential candidates.
  • Gained expert knowledge of statistical models, probabilistic graphical models, MCMC and variational methods, deep learning, and topic modeling.
Technologies: LaTeX, Julia, Python, C++, Java

Research Intern

2013 - 2013
Google, Inc.
  • Worked with the Google Knowledge team to improve their state-of-the-art NLP pipeline.
  • Proposed and implemented a novel model for joint inference on named entity recognition/tagging and coreference resolution.
  • Developed efficient algorithms for performing inference in high-dimensional combinatorial spaces using dual decomposition.
  • Utilized techniques including dual decomposition, support vector machine (SVM), conditional random fields (CRF), and graphical models.
Technologies: C++

Research Officer

2010 - 2011
Institute for Infocomm Research
  • Built a state-of-the-art entity resolution system by leveraging unsupervised latent topic features.
  • Designed a robust, high-precision acronym identification module using carefully crafted features.
  • Ranked #3 in the 2011 Knowledge Base Population shared task.
  • Utilized algorithms including SVM, Naive Bayes, and Latent Dirichlet Allocation topic modeling, with UIMA for the NLP pipeline.
Technologies: Apache UIMA, Java

Bot MD - A Clinical AI Assistant for Doctors

Get instantaneous responses to clinical questions on drugs, diseases, guidelines, and medical calculators. Bot MD also helps to transcribe your dictated case notes. He knows 50+ languages.

Bot MD was created by me and my team.

Scalable Text Extraction from Documents

I wrote open-source software for scalably extracting text from binary documents using AWS Lambda. With AWS Lambda's serverless architecture, we can perform OCR on hundreds of pages of text within minutes.

Some key features of the software are:

• Out-of-the-box support for many common binary document formats
• Scalable PDF parsing with parallel OCR using AWS Lambda and asyncio
• Creation of text searchable PDFs after OCR
• Serverless architecture makes deployment quick and easy
• Detailed instructions for preparing the libraries and dependencies necessary for processing binary documents
• Sensible Unicode handling
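
As a rough illustration of the parallel OCR idea above, the following sketch fans page-level work out with asyncio. It is not taken from the project: `ocr_page` is a hypothetical stand-in for a per-page AWS Lambda invocation, and `max_concurrency` is an assumed parameter.

```python
import asyncio


async def ocr_page(page_number: int) -> str:
    """Hypothetical stand-in for invoking an OCR Lambda on one page."""
    await asyncio.sleep(0.01)  # simulates network and OCR latency
    return f"text of page {page_number}"


async def ocr_document(num_pages: int, max_concurrency: int = 50) -> list:
    """OCR all pages concurrently, returning the text in page order."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def bounded(page_number: int) -> str:
        # The semaphore caps how many pages are in flight at once.
        async with semaphore:
            return await ocr_page(page_number)

    # gather preserves the order of its arguments, so results stay in page order.
    return await asyncio.gather(*(bounded(p) for p in range(1, num_pages + 1)))


def extract_text(num_pages: int) -> str:
    """Synchronous entry point that joins the per-page text."""
    pages = asyncio.run(ocr_document(num_pages))
    return "\n".join(pages)
```

Because each page is handled independently, total latency is governed by the slowest page rather than by the page count, which is what makes a serverless fan-out attractive for long documents.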

Twitter Sentiment Classifier

We built a state-of-the-art 3-class (positive, neutral, negative) sentiment classifier for tweets. We performed in-depth analysis to evaluate the performance of the system, including incorporating crowdsourcing to achieve the best performance gains at the least cost. The system is exposed through a JSON REST API.

Soundbite Identification

We built a system for automatically extracting relevant soundbites from text corpora using a hybrid of multiple frequency measures: TF-IDF, PMI, SAGE, and WAPMI.
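
To make two of these measures concrete, here is a minimal sketch of TF-IDF and PMI computed from raw token counts. The function names and the unsmoothed formulas are illustrative assumptions; the system's actual hybrid weighting, and its SAGE and WAPMI components, are not described here and are not reproduced.

```python
import math
from collections import Counter


def tf_idf(term: str, doc: list, corpus: list) -> float:
    """Term frequency in `doc` times the log inverse document frequency."""
    tf = doc.count(term) / len(doc)
    df = sum(1 for d in corpus if term in d)
    return tf * math.log(len(corpus) / df) if df else 0.0


def pmi(term: str, doc: list, corpus: list) -> float:
    """Pointwise mutual information between a term and a document,
    log(p(term | doc) / p(term)), estimated from token counts."""
    all_tokens = Counter(tok for d in corpus for tok in d)
    p_term = all_tokens[term] / sum(all_tokens.values())
    p_term_given_doc = doc.count(term) / len(doc)
    if p_term == 0 or p_term_given_doc == 0:
        return float("-inf")
    return math.log(p_term_given_doc / p_term)
```

Both scores reward terms that are frequent in one document but rare elsewhere, so a word that appears in every document scores zero under TF-IDF.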

Identifying the Salient Entity in an Article

We implemented a system for automatically scoring and ranking named entities within an article by their salience. A "salient" entity is one that is highly relevant to the document and is the main entity being discussed. These features are incorporated into Elasticsearch to improve the relevance scoring algorithm.

Ark-SAGE for Learning Keywords and Text Representations

Ark-SAGE is a Java library that implements the L1-regularized version of Sparse Additive GenerativE models of Text (Eisenstein et al., 2011). SAGE is an algorithm for learning sparse representations of text.

Learning a Combined System for Entity Linking

In NLP, entity linking (EL) is the task of determining the identity of entities mentioned in text. For example, given the sentence "Paris is the capital of France," the idea is to determine that "Paris" refers to the city of Paris and not to Paris Hilton or any other entity that could be referred to as "Paris." Hence, EL maps mentions in documents to entries in a knowledge base by resolving name variations and ambiguities. In this project, we proposed three advancements for entity linking.

1. Expanding acronyms can effectively reduce the ambiguity of acronym mentions. Rule-based approaches rely heavily on the presence of text markers. Here, we propose a supervised learning algorithm to expand more complicated acronyms, which leads to a 15.1% accuracy improvement over state-of-the-art acronym expansion methods.
2. Entity linking annotation is expensive and labor intensive. We propose an instance selection strategy to effectively utilize automatically generated annotations. In our selection strategy, an informative and diverse set of instances is selected for effective disambiguation.
3. Topic modeling is used to model the semantic topics of the articles, significantly improving entity linking performance.

Discovering Factions in the Computational Linguistics Community

We present a joint probabilistic model of who cites whom in computational linguistics, and also of the words they use to do the citing. The model reveals latent factions, or groups of individuals whom we expect to collaborate more closely within their faction, cite within the faction using language distinct from citation outside the faction, and be largely understandable through the language used when cited from without. We conduct an exploratory data analysis on the ACL Anthology. We extend the model to reveal changes in some authors’ faction memberships over time.

Learning Topics and Positions from Debatepedia

We explore Debatepedia, a community-authored encyclopedia of socio-political debates, as evidence for inferring a low-dimensional, human-interpretable representation in the domain of issues and positions. We introduce a generative model positing latent topics and cross-cutting positions that gives special treatment to person mentions and opinion words. We evaluate the resulting representation's usefulness in attaching opinionated documents to arguments and its consistency with human judgements about positions.

Measuring Ideological Proportions in Political Speeches

We seek to measure political candidates’ ideological positioning from their speeches. To accomplish this, we infer ideological cues from a corpus of political writings annotated with known ideologies. We then represent the speeches of U.S. presidential candidates as sequences of cues and lags (filler distinguished only by its length in words). We apply a domain-informed Bayesian HMM to infer the proportions of ideologies each candidate uses in each campaign. The results are validated against a set of preregistered hypotheses authored by domain experts.
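
The cue-and-lag representation can be illustrated with a small sketch. The cue lexicon and example sentence below are invented for illustration (the real cues were inferred from an annotated corpus), only single-word cues are handled, and the Bayesian HMM itself is not shown.

```python
def to_cues_and_lags(tokens: list, cue_lexicon: set) -> list:
    """Convert a token sequence into a sequence of cues and lags.

    A lag records only the number of non-cue words between cues.
    """
    sequence = []
    lag = 0
    for token in tokens:
        if token in cue_lexicon:
            if lag:
                sequence.append(("lag", lag))
                lag = 0
            sequence.append(("cue", token))
        else:
            lag += 1
    if lag:
        # Trailing filler after the last cue is kept as a final lag.
        sequence.append(("lag", lag))
    return sequence


# Hypothetical example: two single-word cues in a short sentence.
tokens = "we must cut taxes and protect liberty".split()
print(to_cues_and_lags(tokens, {"taxes", "liberty"}))
# [('lag', 3), ('cue', 'taxes'), ('lag', 2), ('cue', 'liberty')]
```

Discarding the identity of filler words keeps the observation space small, so a sequence model only has to explain the cues and the distances between them.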

The Utility of Text: The Case of Amicus Briefs and the Supreme Court

We improved the state of the art on a Supreme Court vote prediction task.

We explored the idea that authoring a piece of text is an act of maximizing one’s expected utility. To make this idea concrete, we consider the societally important decisions of the Supreme Court of the United States. Extensive past work in quantitative political science provides a framework for empirically modeling the decisions of justices and how they relate to text. We incorporate into such a model texts authored by amici curiae (“friends of the court” separate from the litigants) who seek to weigh in on the decision, then explicitly model their goals in a random utility model. We demonstrate the benefits of this approach in improved vote prediction and the ability to perform counterfactual analysis.

Modeling User Arguments, Interactions, and Attributes for Stance Prediction in Online Debate Forums

Online debate forums are important social media for people to voice their opinions and debate with each other. Mining user stances or viewpoints from these forums has been a popular research topic. However, most current work does not address an important problem: For a specific issue, there may not be many users participating and expressing their opinions. Despite the sparsity of user stances, users may provide rich side information—for example, users may write arguments to back up their stances, interact with each other, and provide biographical information. In this work, we propose an integrated model to leverage side information. Our proposed method is a regression-based latent factor model which jointly models user arguments, interactions, and attributes. Our method can perform stance prediction for both warm-start and cold-start users. We demonstrate in experiments that our method has promising results on both micro-level and macro-level stance prediction.

A Utility Model of Authors in the Scientific Community

Authoring a scientific paper is a complex process involving many decisions. We introduce a probabilistic model of some of the important aspects of that process: that authors have individual preferences, that writing a paper requires trading off among the preferences of authors as well as extrinsic rewards in the form of community response to their papers, and that preferences (of individuals and the community) and tradeoffs vary over time. Variants of our model lead to improved predictive accuracy of citations given texts and texts given authors. Further, our model’s posterior suggests an interesting relationship between seniority and author choices.

NLP Utility Library

This is an assortment of utilities/functions that I have written and found useful for data manipulation and NLP. It is written mostly in C++, along with a mish-mash of scripts, libraries, modules, headers, etc., written in Python and Java.

Education

2011 - 2016

Ph.D. in Language and Information Technologies

Carnegie Mellon University - Pittsburgh, PA

2007 - 2010

Bachelor of Science Degree in Computer Science

University of Illinois at Urbana-Champaign - Urbana, IL

Certifications

MAY 2001 - MAY 2006

Cisco Certified Network Associate



Natural Language Toolkit (NLTK), OpenNLP, Stanford NLP, Scikit-learn, Matplotlib, libsvm, MPI, Sidekiq, Twitter API, SciPy, NumPy, Google API, Facebook API, jQuery


Amazon Simple Queue Service (SQS), Solr, Terraform, Amazon Elastic Container Service (Amazon ECS), Stanford NER, Sublime Text 3, MATLAB, Subversion (SVN), Apache HTTP Server, Amazon Elastic MapReduce (EMR), Sendmail, Notepad++, LaTeX, Sublime Text, Git, Apache UIMA, Docker Compose, Celery, NGINX, Adobe ColdFusion


Apache Spark, Flask, Bootstrap, Django, Spark, Ruby on Rails 4, Ruby on Rails (RoR), Scrapy, Hadoop


C, Java, HTML, BASIC, Python, C++, Bash, Visual Basic, SQL, Scala, Ruby, PHP, CSS, JavaScript, Julia


Data Science, Functional Programming, REST, MapReduce


Linux, Salesforce, Amazon EC2, DigitalOcean, OS X, Amazon Web Services (AWS), Docker


PostgreSQL, Elasticsearch, Amazon S3 (AWS S3), Redis, Amazon DynamoDB, MySQL, SQLite, MongoDB


Networks, Deep Learning, Machine Learning, Text Processing, Unsupervised Learning, Sentiment Analysis, Text Mining, Linear Algebra, Convex Optimization, Optimization, Topic Modeling, Neural Networks, Text Classification, Web Scraping, Data Engineering, Data Mining, Information Extraction, Natural Language Processing (NLP), Graphical Models, Bayesian Statistics, Statistics, Recurrent Neural Networks (RNNs), Generative Pre-trained Transformers (GPT), Gunicorn, Crowdsourcing, Attribution Modeling, Big Data, Information Retrieval, Search Engines, WebSockets, Chatbots, Amazon Mechanical Turk, CrowdFlower
