Shih-hsuan Lee, Developer in Taipei, Taiwan

Shih-hsuan Lee

Verified Expert in Engineering

Machine Learning Developer

Location
Taipei, Taiwan
Toptal Member Since
July 28, 2020

Shih-Hsuan is an entrepreneur, data scientist, and top competitor in machine learning competitions. He specializes in analyzing data pipelines and modeling business problems to deliver data projects with business impact. He built a real-time analytics system that monitored a national product roll-out and provided decision support. Shih-Hsuan excels at sales forecasting, niche image classification, short text classification, and conditional text generation, along with AI, ML, and statistics.

Portfolio

Veritable Technology, Co.
Pandas, R, TensorFlow, PyTorch, SQL, Data Modeling, Data Analysis
Baiwang
Pandas, Apache Airflow, TensorFlow, PostgreSQL, Docker, Python, SQL...
Yongdata
Pandas, Scala, R, Python, SQL, Data Modeling, Data Analysis

Experience

Availability

Part-time

Preferred Environment

Julia, R, Python, TensorFlow, PyTorch, Linux

The most amazing...

...team I led developed a data analytics pipeline for monitoring a roll-out of a product across China in two weeks.

Work Experience

Data Scientist and Founder

2018 - PRESENT
Veritable Technology, Co.
  • Won seventh place in the third YouTube-8M Video Understanding Challenge and published a paper at its ICCV 2019 workshop.
  • Assisted clients that required expertise in data science, machine learning, and artificial intelligence.
  • Created open source research projects, indie data products, and public technical notes and tutorials to help democratize AI.
Technologies: Pandas, R, TensorFlow, PyTorch, SQL, Data Modeling, Data Analysis

Chief Data Scientist

2017 - 2018
Baiwang
  • Built data pipelines to merge data from different sources across the company into a data warehouse.
  • Developed an automatic NLP merchandise classification system, including setting up an annotation procedure, data quality control, and experiment processes.
  • Built a real-time analytics system that monitored a national product roll-out and provided decision support.
Technologies: Pandas, Apache Airflow, TensorFlow, PostgreSQL, Docker, Python, SQL, Data Modeling, Data Analysis

Senior Data Scientist

2015 - 2016
Yongdata
  • Developed a customer churn prediction system for a mobile phone company.
  • Developed a sales and inventory monitoring and forecasting system for a smart vending machine company.
  • Implemented anomaly detection algorithms in the company's analytics SaaS product.
Technologies: Pandas, Scala, R, Python, SQL, Data Modeling, Data Analysis

Software Engineer

2013 - 2015
Soshio
  • Maintained the back end of the company's NLP public opinion analysis product.
  • Developed data visualizations for the customer-facing dashboard.
  • Maintained the scraping system and merged it with the firehoses from commercial data providers.
Technologies: JavaScript, Python, SQL

Seventh Place Solution to The Third YouTube-8M Video Understanding Challenge

https://github.com/ceshine/yt8m-2019
Challenge description: "In this third challenge based on the YouTube 8M dataset, Kagglers will localize video-level labels to the precise time in the video where the label appears and do this at an unprecedented scale. To put it another way, at what point in the video does the cat sneeze?"

Solution: To deal with the limited number of annotated segments, video-level models were pre-trained on the YouTube-8M frame-level features dataset to create meaningful video representations from frames. The weights of the two models were used to build two types of segment classifiers: context-aware and context-agnostic.

Paraphrasing English Sentences

https://github.com/ceshine/finetuning-t5
This open-source project builds models that automatically paraphrase English sentences. It fine-tunes a pretrained T5 transformer model on several public paraphrase datasets. The fine-tuned model can create paraphrases that are both semantically and grammatically correct. Two fine-tuned models have been published on the Hugging Face Model Hub.

Self-Supervised Domain Adaptation

https://blog.ceshine.net/post/byol-domain-adaptation/
Inspired by recent developments in self-supervised learning in CV, I speculated that an unsupervised/self-supervised domain adaptation approach might help when labeled data in the target domain is scarce. I take a model pre-trained on ImageNet and run self-supervised learning on an unlabeled dataset from a different domain, hoping that this process will transfer some general CV knowledge into the new domain. The goal is to achieve more label efficiency in the downstream tasks within the new domain.

My preliminary experiments show visible improvements from the self-supervised domain adaptation approach using images from the downstream task. With longer pre-training and bigger unlabeled datasets, further improvements are probably achievable.
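
The URL of the linked post suggests BYOL (Bootstrap Your Own Latent) as the self-supervised method. As a rough illustration only, and not the code from the post, the two core BYOL ingredients can be sketched in plain Python: a loss that pulls the online network's prediction toward the target network's projection of another view of the same image, and an exponential moving average (EMA) update of the target weights. The function names, the flat-list representation of vectors and weights, and the `tau` default are assumptions made for this sketch.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (assumes v is not all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def byol_loss(online_prediction, target_projection):
    """BYOL regression loss for one pair of views: 2 - 2 * cosine similarity.

    The result is near 0 when the two vectors point the same way
    and 4 when they point in opposite directions.
    """
    p = l2_normalize(online_prediction)
    z = l2_normalize(target_projection)
    return 2.0 - 2.0 * sum(a * b for a, b in zip(p, z))


def ema_update(target_weights, online_weights, tau=0.99):
    """Move target weights toward online weights; no gradients are involved.

    tau close to 1 makes the target network change slowly.
    """
    return [tau * t + (1.0 - tau) * o
            for t, o in zip(target_weights, online_weights)]


# Hypothetical toy values, for illustration only.
loss_aligned = byol_loss([1.0, 2.0], [2.0, 4.0])      # same direction
loss_orthogonal = byol_loss([1.0, 0.0], [0.0, 1.0])   # unrelated directions
new_target = ema_update([0.0, 0.0], [1.0, 2.0], tau=0.9)
```

In a domain adaptation setting like the one described above, the online network would start from ImageNet-pretrained weights and this loss would be minimized on unlabeled images from the new domain before any fine-tuning on labels.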

Forecasting Challenges

https://github.com/ceshine/favorita_sales_forecasting
High-ranking results in forecasting competitions:

1. Corporación Favorita Grocery Sales Forecasting: predicting sales for a large grocery chain—placed 20th out of 1,671 teams

2. Recruit Restaurant Visitor Forecasting: predicting how many future visitors a restaurant will receive—placed 21st out of 1,248 teams

3. Web Traffic Time Series Forecasting: forecasting future traffic to Wikipedia pages—placed 43rd out of 1,095 teams

Languages

Python, SQL, R, Julia, Scala, JavaScript

Libraries/APIs

PyTorch, Pandas, TensorFlow, XGBoost

Paradigms

Data Science, Data-driven Testing

Other

Statistical Modeling, Machine Learning, Deep Learning, Image Classification, Data Analytics, Time Series, Forecasting, Statistical Data Analysis, Data Modeling, Data Analysis, Bayesian Inference & Modeling, Statistics, Time Series Analysis, Gradient Boosting, Data Visualization, Natural Language Processing (NLP), Recommendation Systems, Big Data, Image Recognition, Generative Pre-trained Transformers (GPT), Experimental Design, Stochastic Modeling, Risk Models, Genetic Algorithms, Computer Vision

Frameworks

LightGBM

Platforms

Google Cloud Platform (GCP), Linux, Docker, Amazon Web Services (AWS)

Storage

Data Pipelines, PostgreSQL

Tools

Apache Airflow

2014 - 2015

Master's Degree in Applied Statistics

Australian National University - Canberra, Australia

2004 - 2009

Bachelor of Science Degree in Computer Science and Information Engineering

National Taiwan University - Taipei, Taiwan
