Wenjie Xu
Verified Expert in Engineering
Software Developer
Wenjie is a highly skilled and experienced senior data engineer with a proven track record of designing and implementing robust ETL/ELT pipelines. Wenjie is proficient in AWS technologies, including Athena, Glue, Lambda, and Airflow. Wenjie is also skilled in handling large amounts of data from multiple sources and transforming it into usable information for analytics and data science initiatives.
Preferred Environment
Zeppelin, Jupyter Notebook, PyCharm, Visual Studio Code (VS Code)
The most amazing...
...projects I've developed, deployed, and managed are robust, fault-tolerant data pipelines in AWS and Airflow.
Work Experience
Senior Data Engineer
Carbon Arc
- Developed robust ETL/ELT pipelines with AWS Athena, Glue, Lambda, and Airflow to consume large amounts of data (1,000+ object files daily) from multiple data sources into a data lake and multiple warehouse endpoints.
- Aggregated, tuned, partitioned, and indexed data in support of large-scale analytics and data science initiatives.
- Performed data modeling design and leveraged an advanced understanding of data and analytics concepts to build tables and views autonomously.
- Handled large relational data replication as well as streaming and unstructured data inputs.
- Designed and implemented an ontology platform for data ingestion, cleaning, tagging, and linking of internal databases; deployed the platform on EC2 with Docker.
Data Architect
BASF
- Integrated and maintained unstructured data (metadata and raw data) from various lab equipment, ensuring it adhered to data quality and accessibility standards. Designed data models and data schemas to standardize the data acquisition process.
- Built and automated data collection, wrangling, processing, and visualization using Plotly on internal Kubernetes clusters.
- Created and designed an ETL data pipeline using Python to consolidate data from a variety of sources into cloud storage. Utilized Airflow to schedule and maintain data parsers when integrating data in Elasticsearch.
- Evaluated data frameworks and conducted a proof of concept (POC) to determine harmonized data format standards for heterogeneous lab processes and analytical data across internal LIMS platforms.
- Led team meetings and drove technical discussions to track project progress and achieve OKRs.
- Developed internal Python packages to simplify data processing workflows, reducing data processing and calculation time by about 80%.
Data Analytics Engineer
PACO Technologies, Inc.
- Built an internal data entry system with Python and Flask to improve data quality and eliminate data noise.
- Automated the data acquisition process to reduce human errors significantly. Configured the server environment on AWS EC2 and RDS with reliable security groups.
- Designed and developed complex SQL queries, Python scripts, and triggers for ETL jobs. Integrated and maintained data from a variety of sources, ensuring it adhered to data quality and accessibility standards.
- Generated biweekly ad hoc data reports with CloudWatch, Lambda, and SQL/Excel to eliminate manual queries.
- Developed a KPI dashboard in Power BI to track company recruiting performance internally and facilitate the decision-making process.
- Developed, deployed, and managed the data pipeline (DocumentDB, Athena, Redshift, S3, Lambda) that cleans, transforms, and aggregates unorganized and messy data into databases, allowing for seamless collection, storage, and management of big data.
- Developed data classifiers, mining algorithms, and models for sentiment analysis, topic mining, and data visualization of engineering documents.
Experience
ETL Project: Data Integration from CSV and XML to Relational Database
https://git.toptal.com/Ivan-Ilijasic/wenjie-xu
Web Scraping Using Scrapy
https://github.com/xwjsarah/scraping/blob/master/homedepot.py
Python-Flask Data Entry System Development
Skills
Languages
Python 3, SQL
Tools
Microsoft Power BI, Amazon Elastic MapReduce (EMR), AWS Glue, Spark SQL, CircleCI, PyCharm, Apache Airflow, Amazon Athena
Frameworks
Flask, Scrapy, Spark
Libraries/APIs
Pandas, PySpark, Spark ML
Platforms
AWS Lambda, Jupyter Notebook, Zeppelin, Amazon Web Services (AWS), Visual Studio Code (VS Code)
Storage
Amazon S3 (AWS S3), Redshift, Elasticsearch
Education
Master's Degree in Computer Science
Montclair State University - Montclair, New Jersey, USA
Certifications
AWS Certified Developer
Amazon Web Services (AWS)