Yanchuan Sim

Singapore, Singapore
Member since April 28, 2016
Yanchuan is a PhD candidate in language technologies from Carnegie Mellon University with over seven years of experience working with machine learning systems and cutting-edge technologies in natural language processing. Currently, he works with several startups to incorporate ML and NLP technologies into their products.
  • Java, 15 years
  • C/C++, 15 years
  • Python, 9 years
  • Machine Learning, 7 years
  • Natural Language Processing, 7 years
  • Deep Learning, 3 years
  • Scala, 2 years
  • Apache Spark, 1 year
Preferred Environment
Linux, OS X, Sublime, Git
The most amazing...
...project I've worked on is a method to uncover and measure the ideological stances of election candidates through their speeches.
  • Scientist
    Institute for Infocomm Research
    2016 - PRESENT
    • Researched novel techniques for improving state-of-the-art NLP systems.
    Technologies: Machine Learning, NLP
  • Advisor and Data Scientist in Residence
    2016 - PRESENT
    • Advised and collaborated with the engineering team on topics and techniques related to natural language processing, information retrieval, and machine learning.
    • Provided domain knowledge and input on product roadmaps.
    Technologies: NLP, Information retrieval, Machine learning
  • Technical Advisor
    AirPR, Inc.
    2015 - PRESENT
    • Built an automatic key phrase extraction module for PR news (soundbites).
    • Designed customized author ranking algorithms for LinkedIn publishers using social and influence metrics.
    • Improved the Elasticsearch relevance ranking algorithm with custom features and metrics, lifting result relevance by 30%.
    • Implemented a state-of-the-art customized sentiment classifier for Tweets using crowdsourcing and ensemble methods.
    • Built a data processing pipeline for handling millions of articles using Spark and Elasticsearch.
    • Built an NLP pipeline for processing millions of news articles.
    • Utilized techniques that included logistic regression, support vector machines (SVM), random forests, and ensemble methods.
    Technologies: Python, Scala, Spark, Elasticsearch, Flask, Ruby on Rails, Java, MySQL, MongoDB, AWS, AWS Elastic MapReduce
  • Visiting PhD Scholar
    University of Washington
    2015 - 2016
    • Performed a variety of academic duties as a scholar in residence with the University of Washington Computer Science and Engineering department.
    Technologies: N/A
  • Graduate Research Assistant
    Carnegie Mellon University
    2011 - 2016
    • Served as a teaching assistant for Introduction to Natural Language Processing (NLP) and the Graduate Seminar on Advanced NLP.
    • Pursued research interests in Machine Learning (ML), Natural Language Processing (NLP), and Computational Social Science (CSS).
    • Applied NLP techniques to text mining and information extraction tasks.
    • Built tools to support the automatic discovery and analysis of decision-making in the U.S. Supreme Court.
    • Built tools to help political scientists analyze and explore speeches of U.S. presidential candidates.
    • Gained expert knowledge of statistical models, probabilistic graphical models, MCMC and variational methods, deep learning, and topic modeling.
    Technologies: Java, C++, Python, Julia, LaTeX
  • Research Intern
    Google, Inc.
    2013 - 2013
    • Worked with the Google Knowledge team to improve their state-of-the-art NLP pipeline.
    • Proposed and implemented a novel model for joint inference on named entity recognition/tagging and coreference resolution.
    • Developed efficient algorithms for performing inference in high-dimensional combinatorial spaces using dual decomposition.
    • Utilized techniques including dual decomposition, support vector machine (SVM), conditional random fields (CRF), and graphical models.
    Technologies: C++, Borg
  • Research Officer
    Institute for Infocomm Research
    2010 - 2011
    • Built a state-of-the-art entity resolution system by leveraging unsupervised latent topic features.
    • Designed a robust high precision acronym identification module using carefully crafted features.
    • Ranked #3 in the 2011 Knowledge Base Population shared task.
    • Utilized algorithms including SVM, Naive Bayes, and Latent Dirichlet Allocation topic modeling, with UIMA powering the NLP pipeline.
    Technologies: Java, UIMA
  • Twitter Sentiment Classifier (Development)

    We built a state-of-the-art 3-class (positive, neutral, negative) sentiment classifier for tweets. We performed in-depth analysis to evaluate the system's performance, incorporating crowdsourcing to achieve the best performance gains at the least cost. The system is exposed through a JSON REST API.
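
    The production classifier isn't shown here; as a rough illustration of the approach, a minimal three-class sentiment pipeline can be sketched with scikit-learn (toy data and hypothetical labels, not the actual training set):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy training data; the real system used a much larger, crowdsourced corpus.
tweets = [
    "love this phone, best purchase ever",
    "worst service, totally disappointed",
    "the event starts at 9am tomorrow",
    "amazing performance, highly recommend",
    "terrible update, everything is broken",
    "new office opens next week",
]
labels = ["positive", "negative", "neutral",
          "positive", "negative", "neutral"]

# Character n-grams are robust to the misspellings common in tweets.
clf = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)),
    LogisticRegression(max_iter=1000),
)
clf.fit(tweets, labels)

print(clf.predict(["absolutely love it"])[0])
```

    In a deployment like the one described, the fitted pipeline would sit behind a REST endpoint that accepts tweet text and returns the predicted label as JSON.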

  • Soundbite Identification (Development)

    We built a system for automatically extracting relevant soundbites from text corpora using a hybrid of several frequency measures: TF-IDF, PMI, SAGE, and WAPMI.
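
    Of the measures listed, TF-IDF is the simplest; a toy sketch of scoring a document's terms by TF-IDF (hypothetical mini-corpus; the real system combined all four measures over phrases, not single words):

```python
import math
from collections import Counter

docs = [
    "the merger creates the largest cloud provider in the region",
    "the startup raised funding to expand its cloud platform",
    "local weather remains sunny through the weekend",
]

def tf_idf_scores(doc_index, docs):
    """Score each term in docs[doc_index] by term frequency * inverse document frequency."""
    tokenized = [d.split() for d in docs]
    n_docs = len(tokenized)
    doc = tokenized[doc_index]
    tf = Counter(doc)
    scores = {}
    for term, count in tf.items():
        df = sum(1 for d in tokenized if term in d)
        idf = math.log(n_docs / df)
        scores[term] = (count / len(doc)) * idf
    return scores

scores = tf_idf_scores(0, docs)
# Words like "the" appear in every document, so their IDF (and score) is zero;
# document-specific words like "merger" score highest.
```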

  • Identifying the Salient Entity in an Article (Development)

    We implemented a system for automatically scoring and ranking named entities within an article by their salience. A "salient" entity is one that is highly relevant to the document and is the main entity being discussed. These features are incorporated into Elasticsearch to improve the relevance scoring algorithm.
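
    The project's actual feature set isn't described in detail; a minimal sketch of one plausible salience heuristic, combining mention frequency with how early the entity first appears (illustrative only, not the production scorer):

```python
def salience_score(entity, doc_tokens):
    """Toy salience: mention frequency weighted by earliness of first mention.
    Real systems add many more features (syntactic role, coreference, etc.)."""
    positions = [i for i, tok in enumerate(doc_tokens) if tok == entity]
    if not positions:
        return 0.0
    freq = len(positions) / len(doc_tokens)
    earliness = 1.0 - positions[0] / len(doc_tokens)
    return freq * earliness

doc = ("Apple unveiled a new chip . Apple says the chip outperforms rivals . "
       "Analysts from Gartner were briefed .").split()

# Apple is mentioned twice and first, so it outscores the single late
# mention of Gartner.
print(salience_score("Apple", doc))
print(salience_score("Gartner", doc))
```

    A score like this could be indexed as a per-entity field in Elasticsearch and folded into the relevance function.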

  • Learning a Combined System for Entity Linking (Development)

    In NLP, entity linking is the task of determining the identity of entities mentioned in text. For example, given the sentence "Paris is the capital of France," the goal is to determine that "Paris" refers to the city of Paris and not to Paris Hilton or any other entity that could be referred to as "Paris." Entity linking thus maps mentions in documents to entries in a knowledge base by resolving name variations and ambiguities. In this project, we proposed three advancements for entity linking.

    1. Expanding acronyms can effectively reduce the ambiguity of acronym mentions. Rule-based approaches rely heavily on the presence of text markers. Here, we propose a supervised learning algorithm to expand more complicated acronyms, yielding a 15.1% accuracy improvement over state-of-the-art acronym expansion methods.
    2. Entity linking annotation is expensive and labor-intensive. We propose an instance selection strategy to effectively utilize automatically generated annotations, selecting an informative and diverse set of instances for effective disambiguation.
    3. Topic modeling is used to model the semantic topics of the articles, significantly improving entity linking performance.
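
    For context on point 1, the marker-dependent rule-based baseline can be sketched as follows (a simplified Schwartz-Hearst-style pattern matcher; the project's supervised model goes beyond such markers):

```python
import re

def marker_based_expansions(text):
    """Rule-based baseline: find 'long form (ACRONYM)' patterns where the
    acronym's letters match the initials of the immediately preceding words.
    This is the kind of marker-dependent approach supervised models improve on."""
    expansions = {}
    for match in re.finditer(r"\(([A-Z]{2,})\)", text):
        acronym = match.group(1)
        preceding = text[:match.start()].split()
        candidate = preceding[-len(acronym):]
        initials = "".join(w[0].upper() for w in candidate)
        if initials == acronym:
            expansions[acronym] = " ".join(candidate)
    return expansions

text = "The model uses a conditional random field (CRF) for tagging."
print(marker_based_expansions(text))
```

    The obvious failure mode, which motivates a learned model, is any acronym whose letters are not the initials of the adjacent words, or one that appears with no parenthetical marker at all.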

  • Discovering Factions in the Computational Linguistics Community (Development)

    We present a joint probabilistic model of who cites whom in computational linguistics, and also of the words they use to do the citing. The model reveals latent factions, or groups of individuals whom we expect to collaborate more closely within their faction, cite within the faction using language distinct from citation outside the faction, and be largely understandable through the language used when cited from without. We conduct an exploratory data analysis on the ACL Anthology. We extend the model to reveal changes in some authors’ faction memberships over time.

  • Learning Topics and Positions from Debatepedia (Development)

    We explore Debatepedia, a community authored encyclopedia of socio-political debates, as evidence for inferring a low dimensional, human-interpretable representation in the domain of issues and positions. We introduce a generative model positing latent topics and cross-cutting positions that gives special treatment to person mentions and opinion words. We evaluate the resulting representation's usefulness in attaching opinionated documents to arguments and its consistency with human judgements about positions.

  • Measuring Ideological Proportions in Political Speeches (Development)

    We seek to measure political candidates’ ideological positioning from their speeches. To accomplish this, we infer ideological cues from a corpus of political writings annotated with known ideologies. We then represent the speeches of U.S. presidential candidates as sequences of cues and lags (filler distinguished only by its length in words). We apply a domain-informed Bayesian HMM to infer the proportions of ideologies each candidate uses in each campaign. The results are validated against a set of preregistered, domain expert authored hypotheses.
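
    The cue-and-lag representation can be illustrated with a toy example; the cue lexicon below is a hypothetical single-word stand-in (the real cues are multi-word terms inferred from ideologically annotated writings):

```python
def to_cues_and_lags(tokens, cue_lexicon):
    """Rewrite a token sequence as ideological cues interleaved with lags,
    where a lag is filler represented only by its length in words."""
    sequence, lag = [], 0
    for tok in tokens:
        if tok in cue_lexicon:
            if lag:
                sequence.append(("LAG", lag))
                lag = 0
            sequence.append(("CUE", tok))
        else:
            lag += 1
    if lag:
        sequence.append(("LAG", lag))
    return sequence

speech = "we must cut taxes and defend family values for all".split()
cues = {"taxes", "family"}  # hypothetical cue lexicon
print(to_cues_and_lags(speech, cues))
```

    A sequence in this form is what the Bayesian HMM consumes: cue emissions carry ideological signal, while lag lengths are modeled but carry none.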

  • The Utility of Text: The Case of Amicus Briefs and the Supreme Court (Development)

    We improved the state of the art on a Supreme Court vote prediction task.

    We explored the idea that authoring a piece of text is an act of maximizing one’s expected utility. To make this idea concrete, we consider the societally important decisions of the Supreme Court of the United States. Extensive past work in quantitative political science provides a framework for empirically modeling the decisions of justices and how they relate to text. We incorporate into such a model texts authored by amici curiae (“friends of the court” separate from the litigants) who seek to weigh in on the decision, then explicitly model their goals in a random utility model. We demonstrate the benefits of this approach in improved vote prediction and the ability to perform counterfactual analysis.

  • Modeling User Arguments, Interactions, and Attributes for Stance Prediction in Online Debate Forums (Development)

    Online debate forums are important social media for people to voice their opinions and debate with each other. Mining user stances or viewpoints from these forums has been a popular research topic. However, most current work does not address an important problem: for a specific issue, there may not be many users participating and expressing their opinions. Despite the sparsity of user stances, users may provide rich side information, for example, users may write arguments to back up their stances, interact with each other, and provide biographical information. In this work, we propose an integrated model to leverage side information. Our proposed method is a regression-based latent factor model which jointly models user arguments, interactions, and attributes. Our method can perform stance prediction for both warm-start and cold-start users. We demonstrate in experiments that our method has promising results on both micro-level and macro-level stance prediction.

  • A Utility Model of Authors in the Scientific Community (Development)

    Authoring a scientific paper is a complex process involving many decisions. We introduce a probabilistic model of some of the important aspects of that process: that authors have individual preferences, that writing a paper requires trading off among the preferences of authors as well as extrinsic rewards in the form of community response to their papers, that preferences (of individuals and the community) and tradeoffs vary over time. Variants of our model lead to improved predictive accuracy of citations given texts and texts given authors. Further, our model’s posterior suggests an interesting relationship between seniority and author choices.

  • Ark-SAGE for Learning Keywords and Text Representations (Development)

    Ark-SAGE is a Java library that implements the L1-regularized version of Sparse Additive GenerativE models of Text (Eisenstein et al., 2011). SAGE is an algorithm for learning sparse representations of text.
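
    The core idea behind SAGE, modeling a class's word distribution as a sparse additive deviation from a background log-frequency, can be sketched in a few lines (a toy approximation using thresholded log-odds, not the library's actual optimization):

```python
import math
from collections import Counter

background = "the cat sat on the mat the dog ran".split()
foreground = "the neural network learned the embedding".split()

def sage_style_deviations(fg_tokens, bg_tokens, threshold=0.5):
    """Score words by their log-probability deviation from the background,
    zeroing small deviations to mimic SAGE's L1-induced sparsity.
    Add-one smoothing keeps unseen words finite."""
    fg, bg = Counter(fg_tokens), Counter(bg_tokens)
    vocab = set(fg) | set(bg)
    dev = {}
    for w in vocab:
        log_fg = math.log((fg[w] + 1) / (len(fg_tokens) + len(vocab)))
        log_bg = math.log((bg[w] + 1) / (len(bg_tokens) + len(vocab)))
        d = log_fg - log_bg
        dev[w] = d if abs(d) > threshold else 0.0
    return dev

devs = sage_style_deviations(foreground, background)
# Frequent background words like "the" deviate little and are zeroed out;
# class-specific words like "neural" get large positive deviations.
```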

  • NLP Utility Library (Development)

    This is an assortment of utilities/functions that I have written and found useful for data manipulation and NLP. It is written mostly in C++, along with a mishmash of scripts, libraries, modules, and headers written in Python and Java.

  • Languages
    Python, MATLAB, HTML, BASIC, C/C++, Java, HTML/CSS, Bash, SQL, Scala, Visual Basic, Julia, ColdFusion, CSS, JavaScript, PHP
  • Frameworks
    Flask, Spark, Django, Bootstrap, Rails 4, Hadoop, Rails, Ruby on Rails, Scrapy
  • Libraries/APIs
    Stanford NLP, Scikit-learn, OpenNLP, NLTK, MPI, matplotlib, libsvm, Facebook API, Sidekiq, Twitter API, SciPy, NumPy, Google API, jQuery
  • Tools
    Sublime Text 3, Stanford NER, Apache Spark, LaTeX, Notepad++, SVN, DigitalOcean, Sendmail, Nginx
  • Platforms
    Linux, AWS EC2
  • Misc
    Data Mining, Text Processing, Text Classification, Recurrent Neural Networks, Statistics, Bayesian Statistics, Graphical Models, Neural Networks, Topic Modeling, Optimization, Data Science, Information Extraction, Data Engineering, Natural Language Processing, Web Scraping, Convex Optimization, Machine Learning, Deep Learning, Linear Algebra, Text Mining, Sentiment Analysis, Unsupervised Learning, Apache HTTP Server, AWS S3, Amazon Elastic MapReduce, Big Data, Attribution Modeling, Crowdsourcing, Gunicorn, Celery, Amazon Mechanical Turk, CrowdFlower
  • Paradigms
    Functional Programming, REST, MapReduce
  • Storage
    MySQL, Elasticsearch, DynamoDB, PostgreSQL, SQLite, MongoDB, Redis
  • Ph.D. in Language and Information Technologies
    Carnegie Mellon University - Pittsburgh, PA
    2011 - 2016
  • Bachelor of Science degree in Computer Science
    University of Illinois at Urbana-Champaign - Urbana, IL
    2007 - 2010