
J Chan
Verified Expert in Engineering
Data Engineer and Software Developer
J is a data engineer with extensive experience in master data management and data governance best practices. She has five years of experience using SQL and Python for data mining, extracting master attributes, and performing general big data quality control. For ETL, J mostly relies on Spark SQL and AWS cloud services to build reliable and seamless workflows.
Preferred Environment
Python 3, SQL, Linux
The most amazing...
...(or one of the most amazing) pieces of code I've written was a handful of lines using PyTorch to categorize eCommerce products by image, at a client's request.
Work Experience
Head of Master Data Management
Early Data
- Helped manage the transfer of inherited Talend master data management workflows to modern technologies such as Python, Spark SQL, and AWS, mapping a further 20% of product master brands.
- Applied text and image classification neural nets to raise the accuracy of master data management attribute tagging to over 90%, beyond what was feasible with simple regex patterns.
- Developed an overarching master data management program to improve downstream business intelligence efforts.
- Restructured the team and improved overall work culture, increasing team productivity by 50%.
- Refined product name clustering and SKU identification with sklearn's linear kernel and proactively arranged human resources to support the master data management SKU tagging workflow, reducing the corresponding workload by 80%.
- Developed automated Spark QC scripts, reducing the quality control workload by 40%.
- Managed testing document sign-off for a data management platform, conducting software unit testing, user acceptance testing, and operational qualification protocols for a Fortune 500 medical device company.
- Maintained an ELT solution spanning Microsoft SQL Server, SSIS, and stored procedures for a Fortune 500 personal care company.
Account Manager
PureLiving
- Ensured the proper maintenance of air filtration systems through regular client follow-ups.
- Created custom bilingual proposals for 60+ existing and prospective clients for a wide range of professional indoor environmental quality programs.
- Educated the public about the importance of environmental air quality strategy at a variety of local community events.
Researcher
Harvard University
- Collected tree crown property data for 235 trees to establish the relationship between functional properties and demographic rates to determine climate change impact on carbon sequestration rates (Bukit Timah Nature Reserve, Singapore).
- Constructed capacitance curves for 100 leaves and projected how climate change will alter forest structure (Pasoh Forest Reserve, Malaysia). Presented results of the study at the 9th Annual Harvard Plant Biology Symposium.
- Moderated discussion sections and provided constructive feedback for undergraduate students in a course on Classical Chinese Ethical and Political Theory.
Experience
Master Data Management Project | A Text and Image Classifier Duo for Intelligent Category Mapping
We found that relying on regex patterns to tag crawled products to the nutritional health category was not accurate enough, so I collaborated with a data science colleague to deploy a text classifier based on MXNet's Transformer model for more intelligent product name tagging.
To further improve the accuracy of category mapping, I managed the implementation of a VGG-16 image classifier with fast.ai and PyTorch so that product images were used as well, since a product name sometimes sounded like a nutritional health item while the image indicated otherwise. Consolidating the two results into an enriched tagging workflow raised our average category mapping accuracy to 93.5%, which met the client's quality requirement for publishing to the production environment and launching the final digital product.
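The consolidation step can be pictured as a small rule that combines the two classifiers' scores. The sketch below is only an illustration under stated assumptions, not the project's actual logic: the function name, the simple averaging rule, and the 0.5 threshold are all hypothetical.

```python
def consolidate(text_prob: float, image_prob: float, threshold: float = 0.5) -> bool:
    """Return True when the combined score says 'nutritional health'.

    Both inputs are hypothetical classifier probabilities in [0, 1].
    Averaging them and comparing against `threshold` is an assumed rule.
    """
    for p in (text_prob, image_prob):
        if not 0.0 <= p <= 1.0:
            raise ValueError("probabilities must be in [0, 1]")
    combined = (text_prob + image_prob) / 2
    return combined >= threshold


# The name sounds like a supplement, but the image strongly disagrees.
name_only_case = consolidate(text_prob=0.9, image_prob=0.05)
# Both models agree that the product belongs to the category.
agreed_case = consolidate(text_prob=0.8, image_prob=0.9)
print(name_only_case, agreed_case)
```

With an averaging rule like this, a confident text score alone is not enough to tag a product when the image score is very low, which is the failure case described above.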
ETL Project | Transfer of Talend ETL workflows to Spark SQL/AWS
After painstakingly reviewing and documenting the logic and monitoring topline statistics of the existing load, I gradually rewrote all nine workflows, comparing each test load against the production load to confirm that no column was missing any crucial logic. After several months, the last workflow was written, and I rounded out the endeavor by packaging the scripts to S3 along with a set of shell scripts that allowed the entire ETL workflow to run on Amazon EMR. The new workflow not only ran more stably but was also approximately 40% faster and easier to QC.
Master Data Management Project | Optimizing Master Brand Mapping in Spark SQL
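A column-by-column comparison of a test load against the production load might look like the following. This is a minimal plain-Python sketch rather than the Spark scripts used in the project; the function names, the choice of statistics (row count, non-null count, distinct count), and the sample rows are assumptions for illustration.

```python
def topline_stats(rows, columns):
    """Per-column row count, non-null count, and distinct-value count."""
    stats = {}
    for col in columns:
        values = [row.get(col) for row in rows]
        non_null = [v for v in values if v is not None]
        stats[col] = {
            "rows": len(values),
            "non_null": len(non_null),
            "distinct": len(set(non_null)),
        }
    return stats


def diff_loads(production_rows, test_rows, columns):
    """Return {column: (production_stats, test_stats)} for columns that differ."""
    prod = topline_stats(production_rows, columns)
    test = topline_stats(test_rows, columns)
    return {c: (prod[c], test[c]) for c in columns if prod[c] != test[c]}


# Hypothetical sample loads: the rewritten workflow fills a brand that
# production leaves empty, so the "brand" column should be flagged.
production = [{"sku": "A1", "brand": "Acme"}, {"sku": "B2", "brand": None}]
test_load = [{"sku": "A1", "brand": "Acme"}, {"sku": "B2", "brand": "Bolt"}]
mismatches = diff_loads(production, test_load, ["sku", "brand"])
print(sorted(mismatches))
```

Matching statistics do not prove that two loads are identical, but a mismatch is a cheap signal that a column's logic was ported incorrectly.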
The biggest challenge by far was using the existing keyword look-up table for keyword mapping: merging the look-up table was effectively a cross join with no efficient way to partition, which greatly reduced the distributed computing potential of Spark SQL. After days of fervent research, I came across the concept of using another column to force certain partitions to travel together so that they were processed together, thereby allowing Spark's distributed power to take over as expected. From this experience, I gained more confidence not only in my Spark code but also in myself as a developer.
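One reading of the extra-column idea is to give every product row a bucket id and pair the look-up table with each bucket, so that each bucket becomes an independent unit of work. The sketch below illustrates that reading in plain Python, not Spark; the bucket count, the function names, the substring matching rule, and the sample data are all hypothetical.

```python
import zlib

NUM_BUCKETS = 4  # hypothetical; in practice this would be tuned to the cluster


def bucket_of(text: str) -> int:
    """Stable bucket id (crc32 is deterministic across runs, unlike hash())."""
    return zlib.crc32(text.encode("utf-8")) % NUM_BUCKETS


def map_brands(product_names, keyword_to_brand):
    """Match look-up keywords against product names one bucket at a time."""
    # Large side: each product name lands in exactly one bucket.
    buckets = {b: [] for b in range(NUM_BUCKETS)}
    for name in product_names:
        buckets[bucket_of(name)].append(name)

    matches = []
    # Small side: the whole look-up table is paired with every bucket, so the
    # buckets can be processed independently of one another.
    for b in range(NUM_BUCKETS):
        for name in buckets[b]:
            for keyword, brand in keyword_to_brand.items():
                if keyword in name.lower():
                    matches.append((name, brand))
    return sorted(matches)


products = ["Acme Whey Protein 2lb", "Bolt Vitamin C 500mg", "Generic Fish Oil"]
lookup = {"acme": "Acme", "bolt": "Bolt"}
print(map_brands(products, lookup))
```

The result is the same as a single cross join followed by a filter; only the grouping of the work changes, which is what lets a distributed engine spread it across workers.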
Skills
Languages
Python 3, SQL, Snowflake
Libraries/APIs
Pandas, PyTorch, TensorFlow
Paradigms
ETL, Business Intelligence (BI)
Platforms
Amazon Web Services (AWS), Talend, Linux
Storage
Master Data Management (MDM), SQL Server Integration Services (SSIS)
Other
Data Engineering, IT Project Management, eCommerce, Data Mining, Machine Learning, People Management, Data Warehouse Design, Cloud Computing, Statistics, Research, Field Research, Customer Service, Customer Service Strategy, Public Speaking, Pitch Presentations, Quotations
Tools
Git, Spark SQL, MATLAB, Microsoft Power BI
Frameworks
MXNet
Education
Master's Degree in Organismic and Evolutionary Biology
Harvard University - Cambridge, MA, USA
Bachelor's Degree in Ecology, Behavior, and Evolution
University of California, Los Angeles - Los Angeles, CA, USA
Certifications
Snowflake Decoded - Fundamentals and Hands On Training
Udemy
Microsoft Power BI Desktop for Business Intelligence
Udemy
AWS Certified Cloud Practitioner
Amazon Web Services
AWS Partner: Accreditation (Technical)
Amazon Web Services