
Yanchuan Sim
Verified Expert in Engineering
Machine Learning Developer
Yanchuan is a PhD candidate in language technologies at Carnegie Mellon University with over seven years of experience working with machine learning systems and cutting-edge natural language processing technologies. Currently, he works with several startups to incorporate ML and NLP technologies into their products.
Preferred Environment
Git, Sublime Text, OS X, Linux
The most amazing...
...project I've worked on is a method to uncover and measure the ideological stances of election candidates through their speeches.
Work Experience
CTO
Bot MD
- Led and managed a remote team of two back-end engineers, two Android engineers, and an assortment of freelancers for Bot MD, a clinical AI assistant for doctors, as part of Y Combinator's S18 batch.
- Spearheaded the development of a full-featured Android chat application with various productivity features for doctors.
- Built the chat engine from scratch, leveraging my deep understanding of linguistics and NLP.
Technical Advisor
Stravito
- Provided technical expertise and advised on extracting information from unstructured text.
- Led a team of two engineers to build a customized text search algorithm for market research documents.
Scientist
Institute for Infocomm Research
- Researched novel techniques for improving state-of-the-art NLP systems.
Advisor and Data Scientist in Residence
Intelllex
- Advised and collaborated with the engineering team on topics and techniques related to natural language processing, information retrieval, and machine learning.
- Provided domain knowledge and input on product roadmaps.
Technical Advisor
AirPR, Inc.
- Built an automatic key phrase extraction module for PR news (soundbites).
- Designed customized author ranking algorithms for LinkedIn publishers using social and influence metrics.
- Improved the Elasticsearch relevance ranking algorithm by designing custom features and metrics, improving result relevance rankings by 30%.
- Implemented a state-of-the-art customized sentiment classifier for Tweets using crowdsourcing and ensemble methods.
- Built a data processing pipeline for handling millions of articles using Spark and Elasticsearch.
- Built an NLP pipeline for processing millions of news articles.
- Utilized techniques that included logistic regression, support vector machines (SVM), random forests, and ensemble methods.
Visiting PhD Scholar
University of Washington
- Performed a variety of academic duties as a scholar in residence with the University of Washington Computer Science and Engineering department.
Graduate Research Assistant
Carnegie Mellon University
- Assisted with the courses Introduction to Natural Language Processing (NLP) and Graduate Seminar on Advanced NLP.
- Pursued research interests in Machine Learning (ML), Natural Language Processing (NLP), and Computational Social Science (CSS).
- Applied NLP techniques to text mining and information extraction tasks.
- Built tools to help automate the discovery and analysis of decision making in the U.S. Supreme Court.
- Built tools to help political scientists analyze and explore speeches of U.S. presidential candidates.
- Gained expert knowledge of statistical models, probabilistic graphical models, MCMC and variational methods, deep learning, and topic modeling.
Research Intern
Google, Inc.
- Worked with the Google Knowledge team to improve their state-of-the-art NLP pipeline.
- Proposed and implemented a novel model for joint inference on named entity recognition/tagging and coreference resolution.
- Developed efficient algorithms for performing inference in high-dimensional combinatorial spaces using dual decomposition.
- Utilized techniques including dual decomposition, support vector machine (SVM), conditional random fields (CRF), and graphical models.
Research Officer
Institute for Infocomm Research
- Built a state-of-the-art entity resolution system by leveraging unsupervised latent topic features.
- Designed a robust, high-precision acronym identification module using carefully crafted features.
- Ranked #3 in the 2011 Knowledge Base Population shared task.
- Utilized algorithms including SVM, Naive Bayes, Latent Dirichlet Allocation topic modeling, and UIMA for the NLP pipeline.
Experience
Bot MD - A Clinical AI Assistant for Doctors
https://www.botmd.io/en/
Bot MD was created by my team and me.
Scalable Text Extraction from Documents
https://github.com/skylander86/lambda-text-extractor
Some key features of the software are:
• Out-of-the-box support for many common binary document formats
• Scalable PDF parsing with parallel OCR using AWS Lambda and asyncio
• Creation of text searchable PDFs after OCR
• Serverless architecture makes deployment quick and easy
• Detailed instructions for preparing the libraries and dependencies necessary for processing binary documents
• Sensible Unicode handling
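The parallel OCR feature listed above can be illustrated with a minimal asyncio sketch. This is not code from the repository: `ocr_page` is a hypothetical stand-in for one remote OCR call (for example, one AWS Lambda invocation per page), and `ocr_document` and `max_concurrency` are names assumed for illustration only.

```python
import asyncio


async def ocr_page(page_number: int) -> str:
    """Hypothetical stand-in for one remote OCR call (e.g. one Lambda invocation)."""
    await asyncio.sleep(0.01)  # simulates network and processing latency
    return f"text of page {page_number}"


async def ocr_document(page_count: int, max_concurrency: int = 8) -> list[str]:
    """OCR all pages concurrently, with at most max_concurrency calls in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def bounded(page_number: int) -> str:
        async with semaphore:
            return await ocr_page(page_number)

    # gather() returns results in the order the awaitables were passed,
    # so page order is preserved even if pages finish out of order.
    return await asyncio.gather(*(bounded(n) for n in range(1, page_count + 1)))


if __name__ == "__main__":
    pages = asyncio.run(ocr_document(page_count=5))
    print("\n".join(pages))
```

The semaphore caps the number of simultaneous calls, which is one plausible way to stay within a concurrency limit on the remote service.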
Twitter Sentiment Classifier
Soundbite Identification
Identifying the Salient Entity in an Article
Ark-SAGE for Learning Keywords and Text Representations
https://bitbucket.org/skylander/ark-sage/
Learning a Combined System for Entity Linking
1. Expanding acronyms can effectively reduce the ambiguity of acronym mentions. Rule-based approaches rely heavily on the presence of text markers. Here, we propose a supervised learning algorithm to expand more complicated acronyms, which leads to a 15.1% accuracy improvement over state-of-the-art acronym expansion methods.
2. Entity linking annotation is expensive and labor intensive. We propose an instance selection strategy to effectively utilize automatically generated annotations. In our selection strategy, an informative and diverse set of instances is selected for effective disambiguation.
3. Topic modeling is used to model the semantic topics of the articles, significantly improving entity linking performance.
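To make the contrast in point 1 concrete, here is a minimal sketch of the kind of rule-based, marker-dependent expansion that the supervised approach is meant to improve on. It is an illustrative baseline only, not the method from the paper: the `MARKER` pattern and the `expand_acronyms` function are assumptions, and the rule only fires when the long form is immediately followed by the acronym in parentheses.

```python
import re

# Matches "some long form (ACR)": a run of up to eight words followed by
# an all-caps acronym in parentheses. The parentheses are the text marker.
MARKER = re.compile(r"((?:[A-Za-z][\w-]*\s+){1,8})\(([A-Z]{2,8})\)")


def expand_acronyms(text: str) -> dict[str, str]:
    """Map each parenthesized acronym to the preceding words whose initials spell it."""
    expansions = {}
    for match in MARKER.finditer(text):
        words = match.group(1).split()
        acronym = match.group(2)
        # Take as many preceding words as the acronym has letters.
        candidate = words[-len(acronym):]
        initials = "".join(word[0].upper() for word in candidate)
        if initials == acronym:
            expansions[acronym] = " ".join(candidate)
    return expansions


if __name__ == "__main__":
    sample = "We use conditional random fields (CRF) and a support vector machine (SVM)."
    print(expand_acronyms(sample))
```

An acronym that appears without the parenthesized marker, or whose letters do not line up with word initials, is missed by this rule, which is the gap a learned model would target.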
Discovering Factions in the Computational Linguistics Community
Learning Topics and Positions from Debatepedia
Measuring Ideological Proportions in Political Speeches
http://www.cs.cmu.edu/~ark/CLIP/
The Utility of Text: The Case of Amicus Briefs and the Supreme Court
We explore the idea that authoring a piece of text is an act of maximizing one’s expected utility. To make this idea concrete, we consider the societally important decisions of the Supreme Court of the United States. Extensive past work in quantitative political science provides a framework for empirically modeling the decisions of justices and how they relate to text. We incorporate into such a model texts authored by amici curiae (“friends of the court” separate from the litigants) who seek to weigh in on the decision, then explicitly model their goals in a random utility model. We demonstrate the benefits of this approach in improved vote prediction and the ability to perform counterfactual analysis.
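As a rough illustration of what a random utility model computes, the sketch below uses the standard logit form, in which each option's utility is a deterministic part plus independent Gumbel noise, so the choice probabilities reduce to a softmax over the deterministic utilities. This is a generic textbook form, not the specific model from the paper; the function name, the option labels, and the utility values are all hypothetical.

```python
import math


def choice_probabilities(utilities: dict[str, float]) -> dict[str, float]:
    """Logit choice probabilities: a softmax over the deterministic utilities."""
    peak = max(utilities.values())  # subtract the max for numerical stability
    weights = {option: math.exp(u - peak) for option, u in utilities.items()}
    total = sum(weights.values())
    return {option: w / total for option, w in weights.items()}


if __name__ == "__main__":
    # Hypothetical utilities a decision maker assigns to two outcomes.
    print(choice_probabilities({"affirm": 1.2, "reverse": 0.4}))
```

Counterfactual analysis in this setting amounts to changing an input to the utilities and recomputing the probabilities.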
Modeling User Arguments, Interactions, and Attributes for Stance Prediction in Online Debate Forums
A Utility Model of Authors in the Scientific Community
NLP Utility Library
https://bitbucket.org/skylander/yc-utils/
Skills
Languages
C, Java, HTML, BASIC, Python, C++, Bash, Visual Basic, SQL, Scala, Ruby, PHP, CSS, JavaScript, Julia
Frameworks
Apache Spark, Flask, Bootstrap, Django, Ruby on Rails 4, Ruby on Rails (RoR), Scrapy, Hadoop
Libraries/APIs
Natural Language Toolkit (NLTK), OpenNLP, Stanford NLP, Scikit-learn, Matplotlib, libsvm, MPI, Sidekiq, Twitter API, SciPy, NumPy, Google API, Facebook API, jQuery
Tools
Amazon Simple Queue Service (SQS), Solr, Terraform, Amazon Elastic Container Service (Amazon ECS), Stanford NER, Sublime Text 3, MATLAB, Subversion (SVN), Apache HTTP Server, Amazon Elastic MapReduce (EMR), Sendmail, Notepad++, LaTeX, Sublime Text, Git, Apache UIMA, Docker Compose, Celery, NGINX, Adobe ColdFusion
Paradigms
Data Science, Functional Programming, REST, MapReduce
Platforms
Linux, Salesforce, Amazon EC2, DigitalOcean, OS X, Amazon Web Services (AWS), Docker
Storage
PostgreSQL, Elasticsearch, Amazon S3 (AWS S3), Redis, Amazon DynamoDB, MySQL, SQLite, MongoDB
Other
Networks, Deep Learning, Machine Learning, Text Processing, Unsupervised Learning, Sentiment Analysis, Text Mining, Linear Algebra, Convex Optimization, Optimization, Topic Modeling, Neural Networks, Text Classification, Web Scraping, Data Engineering, Data Mining, Information Extraction, Natural Language Processing (NLP), Graphical Models, Bayesian Statistics, Statistics, Recurrent Neural Networks (RNN), Generative Pre-trained Transformers (GPT), Gunicorn, Crowdsourcing, Attribution Modeling, Big Data, Information Retrieval, Search Engines, WebSockets, Chatbots, Amazon Mechanical Turk, CrowdFlower
Education
Ph.D. in Language and Information Technologies
Carnegie Mellon University - Pittsburgh, PA
Bachelor of Science Degree in Computer Science
University of Illinois at Urbana-Champaign - Urbana, IL
Certifications
Cisco Certified Network Associate
Cisco