Yanchuan Sim, Machine Learning Developer in Singapore, Singapore

Member since June 6, 2016
Yanchuan is a PhD candidate in language technologies from Carnegie Mellon University with over seven years of experience working with machine learning systems and cutting-edge technologies in natural language processing. Currently, he works with several startups to incorporate ML and NLP technologies into their products.

Preferred Environment

Git, Sublime Text, OS X, Linux

The most amazing project I've worked on is a method to uncover and measure the ideological stances of election candidates through their speeches.


  • CTO

    2017 - PRESENT
    Bot MD
    • Led and managed a remote team of two back-end engineers, two Android engineers, and an assortment of freelancers for Bot MD, a clinical AI assistant for doctors, as part of Y Combinator's S18 batch.
    • Spearheaded the development of a full-featured Android chat application with various productivity features for doctors.
    • Built the chat engine from scratch, leveraging my deep understanding of linguistics and NLP.
    Technologies: PostgreSQL, Chatbots, WebSockets, Django, Python
  • Technical Advisor

    2016 - PRESENT
    • Provided technical expertise and advised on extracting information from unstructured text.
    • Led a team of two engineers to build a customized text search algorithm for market research documents.
    Technologies: Search Engines, Information Retrieval, Machine Learning, Natural Language Processing (NLP), Elasticsearch, Python
  • Scientist

    2016 - PRESENT
    Institute for Infocomm Research
    • Researched novel techniques for improving state-of-the-art NLP systems.
    Technologies: Natural Language Processing (NLP), Machine Learning
  • Advisor and Data Scientist in Residence

    2016 - PRESENT
    • Advised and collaborated with the engineering team on topics and techniques related to natural language processing, information retrieval, and machine learning.
    • Provided domain knowledge and input on product roadmaps.
    Technologies: Machine Learning, Information Retrieval, Natural Language Processing (NLP)
  • Technical Advisor

    2015 - PRESENT
    AirPR, Inc.
    • Built an automatic key phrase extraction module for PR news (soundbites).
    • Designed customized author ranking algorithms for LinkedIn publishers using social and influence metrics.
    • Improved the Elasticsearch relevance ranking algorithm by designing custom features and metrics, which improved result relevance rankings by 30%.
    • Implemented a state-of-the-art customized sentiment classifier for Tweets using crowdsourcing and ensemble methods.
    • Built a data processing pipeline for handling millions of articles using Spark and Elasticsearch.
    • Built an NLP pipeline for processing millions of news articles.
    • Utilized techniques that included logistic regression, support vector machines (SVM), random forests, and ensemble methods.
    Technologies: Amazon Web Services (AWS), AWS EMR, MongoDB, MySQL, Java, Ruby on Rails (RoR), Ruby, Flask, Elasticsearch, Spark, Scala, Python
  • Visiting PhD Scholar

    2015 - 2016
    University of Washington
    • Performed a variety of academic duties as a scholar in residence with the University of Washington Computer Science and Engineering department.
  • Graduate Research Assistant

    2011 - 2016
    Carnegie Mellon University
    • Assisted with the courses Introduction to Natural Language Processing (NLP) and Graduate Seminar on Advanced NLP.
    • Pursued research interests in Machine Learning (ML), Natural Language Processing (NLP), and Computational Social Science (CSS).
    • Applied NLP techniques to text mining and information extraction tasks.
    • Built tools for the automatic discovery and analysis of decision making in the U.S. Supreme Court.
    • Built tools to help political scientists analyze and explore speeches of U.S. presidential candidates.
    • Gained expert knowledge of statistical models, probabilistic graphical models, MCMC and variational methods, deep learning, and topic modeling.
    Technologies: LaTeX, Julia, Python, C++, Java
  • Research Intern

    2013 - 2013
    Google, Inc.
    • Worked with the Google Knowledge team to improve their state-of-the-art NLP pipeline.
    • Proposed and implemented a novel model for joint inference on named entity recognition/tagging and coreference resolution.
    • Developed efficient algorithms for performing inference in high-dimensional combinatorial spaces using dual decomposition.
    • Utilized techniques including dual decomposition, support vector machine (SVM), conditional random fields (CRF), and graphical models.
    Technologies: C++
  • Research Officer

    2010 - 2011
    Institute for Infocomm Research
    • Built a state-of-the-art entity resolution system by leveraging unsupervised latent topic features.
    • Designed a robust, high-precision acronym identification module using carefully crafted features.
    • Ranked #3 in the 2011 Knowledge Base Population shared task.
    • Utilized SVM, Naive Bayes, and Latent Dirichlet Allocation topic modeling, with UIMA for the NLP pipeline.
    Technologies: Apache UIMA, Java


  • Bot MD - A Clinical AI Assistant for Doctors

    Get instantaneous responses to clinical questions on drugs, diseases, guidelines, and medical calculators. Bot MD also helps to transcribe your dictated case notes. He knows 50+ languages.

    Bot MD was created by my team and me.

  • Scalable Text Extraction from Documents

    I wrote open-source software for scalably extracting text from binary documents using AWS Lambda. With AWS Lambda's serverless architecture, we can perform OCR on hundreds of pages of text within minutes.

    Some key features of the software are:

    • Out-of-the-box support for many common binary document formats
    • Scalable PDF parsing with parallel OCR using AWS Lambda and asyncio
    • Creation of text-searchable PDFs after OCR
    • Serverless architecture that makes deployment quick and easy
    • Detailed instructions for preparing the libraries and dependencies necessary for processing binary documents
    • Sensible Unicode handling
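
    The parallel OCR fan-out described above can be sketched roughly as follows. This is a minimal illustration rather than the project's actual code: `ocr_page` is a hypothetical stand-in that only simulates a remote OCR call (for example, one Lambda invocation per page), and `max_concurrency` is an assumed parameter.

```python
import asyncio


async def ocr_page(page_number: int) -> str:
    """Hypothetical stand-in for one remote OCR call (e.g. one Lambda invocation)."""
    await asyncio.sleep(0.01)  # simulates network and OCR latency
    return f"text of page {page_number}"


async def ocr_document(num_pages: int, max_concurrency: int = 50) -> list[str]:
    """OCR every page concurrently, capping the number of in-flight calls."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def bounded(page_number: int) -> str:
        async with semaphore:
            return await ocr_page(page_number)

    # gather() returns results in the order the awaitables were passed in,
    # so the pages come back in document order.
    return await asyncio.gather(*(bounded(p) for p in range(1, num_pages + 1)))


pages = asyncio.run(ocr_document(num_pages=200))
```

    Because each page is handled by an independent call, total time is bounded mostly by the slowest page and the concurrency cap rather than by the page count.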

  • Twitter Sentiment Classifier

    We built a state-of-the-art three-class (positive, neutral, negative) sentiment classifier for tweets. We performed in-depth analysis to evaluate the performance of the system, including incorporating crowdsourcing to achieve the best performance gains at the least cost. The system is exposed through a JSON REST API.

  • Soundbite Identification

    We built a system for automatically extracting relevant soundbites from text corpora using a hybrid of multiple frequency measures: TF-IDF, PMI, SAGE, and WAPMI.
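
    As a rough illustration of one of these measures, the sketch below scores terms by pointwise mutual information (PMI) between a term and a target collection, computed as log(p(w | target) / p(w)). It is not the production system, which combined several measures. The function name, the toy documents, and the assumption that text is already tokenized are all illustrative.

```python
import math
from collections import Counter


def pmi_scores(target_docs, background_docs):
    """PMI between each term and the target collection: log(p(w | target) / p(w))."""
    target = Counter(w for doc in target_docs for w in doc)
    overall = target + Counter(w for doc in background_docs for w in doc)
    n_target = sum(target.values())
    n_overall = sum(overall.values())
    return {
        w: math.log((count / n_target) / (overall[w] / n_overall))
        for w, count in target.items()
    }


# Toy, pre-tokenized documents (illustrative only).
target_docs = [["record", "quarterly", "revenue"], ["revenue", "growth"]]
background_docs = [["the", "company", "said"], ["the", "revenue", "report"]]
scores = pmi_scores(target_docs, background_docs)
```

    Terms that appear only in the target collection, such as "growth" here, score higher than terms that are also common in the background text.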

  • Identifying the Salient Entity in an Article

    We implemented a system for automatically scoring and ranking named entities within an article by their salience. A "salient" entity is one that is highly relevant to the document and is the main entity being discussed. These features are incorporated into Elasticsearch to improve the relevance scoring algorithm.

  • Ark-SAGE for Learning Keywords and Text Representations

    Ark-SAGE is a Java library that implements the L1-regularized version of Sparse Additive GenerativE models of Text (Eisenstein et al., 2011). SAGE is an algorithm for learning sparse representations of text.

  • Learning a Combined System for Entity Linking

    In NLP, entity linking (EL) is the task of determining the identity of entities mentioned in text. For example, given the sentence "Paris is the capital of France," the idea is to determine that "Paris" refers to the city of Paris and not to Paris Hilton or any other entity that could be referred to as "Paris." Hence, EL maps mentions in documents to entries in a knowledge base by resolving name variations and ambiguities. In this project, we proposed three advancements for entity linking.

    1. Expanding acronyms can effectively reduce the ambiguity of acronym mentions. Rule-based approaches rely heavily on the presence of text markers. Here, we propose a supervised learning algorithm to expand more complicated acronyms, which leads to a 15.1% accuracy improvement over state-of-the-art acronym expansion methods.
    2. Entity linking annotation is expensive and labor intensive. We propose an instance selection strategy to effectively utilize automatically generated annotations. In our selection strategy, an informative and diverse set of instances is selected for effective disambiguation.
    3. Topic modeling is used to model the semantic topics of the articles, significantly improving entity linking performance.

  • Discovering Factions in the Computational Linguistics Community

    We present a joint probabilistic model of who cites whom in computational linguistics, and also of the words they use to do the citing. The model reveals latent factions, or groups of individuals whom we expect to collaborate more closely within their faction, cite within the faction using language distinct from citation outside the faction, and be largely understandable through the language used when cited from without. We conduct an exploratory data analysis on the ACL Anthology. We extend the model to reveal changes in some authors’ faction memberships over time.

  • Learning Topics and Positions from Debatepedia

    We explore Debatepedia, a community-authored encyclopedia of socio-political debates, as evidence for inferring a low-dimensional, human-interpretable representation in the domain of issues and positions. We introduce a generative model positing latent topics and cross-cutting positions that gives special treatment to person mentions and opinion words. We evaluate the resulting representation's usefulness in attaching opinionated documents to arguments and its consistency with human judgements about positions.

  • Measuring Ideological Proportions in Political Speeches

    We seek to measure political candidates’ ideological positioning from their speeches. To accomplish this, we infer ideological cues from a corpus of political writings annotated with known ideologies. We then represent the speeches of U.S. presidential candidates as sequences of cues and lags (filler distinguished only by its length in words). We apply a domain-informed Bayesian HMM to infer the proportions of ideologies each candidate uses in each campaign. The results are validated against a set of preregistered, domain expert authored hypotheses.
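
    The cue-and-lag representation described above can be illustrated with a small sketch. It covers only the preprocessing step, not the Bayesian HMM inference, and it makes simplifying assumptions: the cue list is hypothetical, cues are single words (real cues may be multiword phrases), and the function name is illustrative.

```python
def to_cues_and_lags(tokens, cue_terms):
    """Convert a token list into a sequence of lag lengths and cue terms.

    Returns a list of ("lag", n) and ("cue", term) items, where n counts the
    non-cue words between consecutive cues.
    """
    sequence = []
    lag = 0
    for token in tokens:
        if token in cue_terms:
            if lag:
                sequence.append(("lag", lag))
            sequence.append(("cue", token))
            lag = 0
        else:
            lag += 1
    if lag:
        sequence.append(("lag", lag))
    return sequence


cue_terms = {"taxes", "liberty"}  # hypothetical single-word cues
tokens = "we will cut taxes and defend liberty for all".split()
sequence = to_cues_and_lags(tokens, cue_terms)
# [("lag", 3), ("cue", "taxes"), ("lag", 2), ("cue", "liberty"), ("lag", 2)]
```

    The filler words themselves are discarded, so two lags of equal length are indistinguishable, which matches the description of lags as filler distinguished only by length.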

  • The Utility of Text: The Case of Amicus Briefs and the Supreme Court

    We improved the state of the art on a Supreme Court vote prediction task.

    We explored the idea that authoring a piece of text is an act of maximizing one’s expected utility. To make this idea concrete, we consider the societally important decisions of the Supreme Court of the United States. Extensive past work in quantitative political science provides a framework for empirically modeling the decisions of justices and how they relate to text. We incorporate into such a model texts authored by amici curiae (“friends of the court” separate from the litigants) who seek to weigh in on the decision, then explicitly model their goals in a random utility model. We demonstrate the benefits of this approach in improved vote prediction and the ability to perform counterfactual analysis.

  • Modeling User Arguments, Interactions, and Attributes for Stance Prediction in Online Debate Forums

    Online debate forums are important social media for people to voice their opinions and debate with each other. Mining user stances or viewpoints from these forums has been a popular research topic. However, most current work does not address an important problem: For a specific issue, there may not be many users participating and expressing their opinions. Despite the sparsity of user stances, users may provide rich side information—for example, users may write arguments to back up their stances, interact with each other, and provide biographical information. In this work, we propose an integrated model to leverage side information. Our proposed method is a regression-based latent factor model which jointly models user arguments, interactions, and attributes. Our method can perform stance prediction for both warm-start and cold-start users. We demonstrate in experiments that our method has promising results on both micro-level and macro-level stance prediction.

  • A Utility Model of Authors in the Scientific Community

    Authoring a scientific paper is a complex process involving many decisions. We introduce a probabilistic model of some of the important aspects of that process: that authors have individual preferences, that writing a paper requires trading off among the preferences of authors as well as extrinsic rewards in the form of community response to their papers, and that preferences (of individuals and the community) and tradeoffs vary over time. Variants of our model lead to improved predictive accuracy of citations given texts and texts given authors. Further, our model’s posterior suggests an interesting relationship between seniority and author choices.

  • NLP Utility Library

    This is an assortment of utilities/functions that I have written and found useful for data manipulation and NLP. It is written mostly in C++, along with a mish-mash of scripts, libraries, modules, headers, etc., written in Python and Java.


  • Languages

    C, Java, HTML, BASIC, Python, C++, Bash, Visual Basic, SQL, Scala, Ruby, PHP, CSS, JavaScript, Julia
  • Frameworks

    Apache Spark, Flask, Bootstrap, Django, Spark, AWS EMR, Ruby on Rails 4, Ruby on Rails (RoR), Scrapy, Hadoop
  • Libraries/APIs

    NLTK, OpenNLP, Stanford NLP, Scikit-learn, Matplotlib, libsvm, MPI, Sidekiq, Twitter API, SciPy, NumPy, Google API, Facebook API, jQuery
  • Tools

    Amazon SQS, Solr, Terraform, Amazon ECS (Amazon Elastic Container Service), Stanford NER, Sublime Text 3, MATLAB, Subversion (SVN), Apache HTTP Server, Amazon Elastic MapReduce (EMR), Sendmail, Notepad++, LaTeX, Sublime Text, Git, Apache UIMA, Docker Compose, Celery, NGINX, Adobe ColdFusion
  • Paradigms

    Data Science, Functional Programming, REST, MapReduce
  • Platforms

    Linux, Salesforce, Amazon EC2 (Amazon Elastic Compute Cloud), DigitalOcean, OS X, Amazon Web Services (AWS), Docker
  • Storage

    PostgreSQL, Elasticsearch, Amazon S3 (AWS S3), Redis, Amazon DynamoDB, MySQL, SQLite, MongoDB
  • Other

    Networks, Deep Learning, Machine Learning, Text Processing, Unsupervised Learning, Sentiment Analysis, Text Mining, Linear Algebra, Convex Optimization, Optimization, Topic Modeling, Neural Networks, Text Classification, Web Scraping, Data Engineering, Data Mining, Information Extraction, Natural Language Processing (NLP), Graphical Models, Bayesian Statistics, Statistics, Recurrent Neural Networks, Gunicorn, Crowdsourcing, Attribution Modeling, Big Data, AWS, Information Retrieval, Search Engines, WebSockets, Chatbots, Amazon Mechanical Turk, CrowdFlower


  • Ph.D. in Language and Information Technologies
    2011 - 2016
    Carnegie Mellon University - Pittsburgh, PA
  • Bachelor of Science Degree in Computer Science
    2007 - 2010
    University of Illinois at Urbana-Champaign - Urbana, IL


  • Cisco Certified Network Associate
    MAY 2001 - MAY 2006
