
Dauren Baitursyn
Verified Expert in Engineering
Data Transformation Developer
Dubai, United Arab Emirates
Toptal member since November 18, 2021
Dauren is a skilled data scientist and software engineer with vast experience across the banking sector, fast-moving consumer goods (FMCG), and chatbots. He holds a computer science bachelor's degree from the Korea Advanced Institute of Science and Technology (KAIST), one of the top 50 universities by QS ranking. He has strong knowledge of Python, SQL, and Power BI. Dauren has worked with several teams across different business domains and is an excellent communicator.
Portfolio
Experience
- Python - 6 years
- SQL - 5 years
- Data Transformation - 5 years
- ETL - 4 years
- Generative Pre-trained Transformers (GPT) - 4 years
- Natural Language Processing (NLP) - 4 years
- Machine Learning - 4 years
- Deep Learning - 3 years
Availability
Preferred Environment
Visual Studio Code (VS Code), Jupyter, Git, Linux, Jupyter Notebook
The most amazing...
...project I've worked on is a multi-label text categorization model covering 46 topics, built with DeepPavlov and visualized in Power BI.
Work Experience
Senior Data Engineer
Amgreat North America
- Created a robust ETL pipeline that involved extracting data from multiple sources, transforming it into a format suitable for analysis, and loading it into a more organized, accessible system.
- Developed an LSTM sales prediction model in PyTorch, outperforming ARIMA with a custom Mean Quartic Error (MQE) loss to manage outliers (a sketch of such a loss follows this list). Achieved a test loss of 5.8 and a validation loss of 24.5, pending production deployment.
- Took full ownership of the company's sales data and other vital datasets, managing and maintaining these datasets as well as creating the workflows that included automating data extraction, transformation, and loading.
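The custom loss above is not published code; a minimal PyTorch sketch, assuming "mean quartic error" simply means the mean of the fourth power of the residuals, could look like this:

```python
# Hypothetical mean quartic error (MQE) loss: mean of (prediction - target)^4.
import torch
import torch.nn as nn


class MeanQuarticError(nn.Module):
    """Mean of the fourth power of the residuals over the batch."""

    def forward(self, prediction: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
        return torch.mean((prediction - target) ** 4)


# Placeholder usage with any regression model, e.g., an LSTM forecaster.
criterion = MeanQuarticError()
pred = torch.randn(32, 1, requires_grad=True)  # stand-in predictions
true = torch.randn(32, 1)                      # stand-in targets
loss = criterion(pred, true)
loss.backward()                                # gradients flow as with built-in losses
```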
NLP Developer | Chatbot
Eduworks Corporation
- Led the development and implementation of the chatbot project. The chatbot handled more than ten intents and complex user stories using forms; two forms and up to ten simple dialogue paths were implemented and deployed.
- Cleaned, processed, and ingested raw data from two different sources to an Elasticsearch (ES) database for data retrieval on user request. Maintained the ES database indexes and ensured the retrieval process occurred within the time constraints.
- Implemented data retrieval based on the cosine similarity between the user's input query and the data sources; embedded and vectorized data sources were stored in ES for fast retrieval (see the ranking sketch after this list).
- Increased the accuracy and recall metrics for data retrieval: top 1: 37.2% (+9.2%), top 3: 61.2% (+30.5%), top 5: 63.6% (+27.5%), top 10: 69.0% (+26.4%), and recall for out-of-scope (OOS) queries: 79.3% (+65.4%).
- Incorporated synonym handling for user queries using the spaCy package.
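The retrieval itself ran in Elasticsearch, but the core ranking idea, cosine similarity between an embedded query and embedded documents, can be illustrated locally. The embedding dimensions and data below are placeholders, not the project's actual setup:

```python
# Illustrative cosine-similarity ranking, assuming documents and the query
# are already embedded as fixed-size vectors (e.g., by a sentence encoder).
import numpy as np


def cosine_top_k(query_vec: np.ndarray, doc_vecs: np.ndarray, k: int = 5) -> list[int]:
    """Return indices of the k documents most similar to the query."""
    query_norm = query_vec / np.linalg.norm(query_vec)
    doc_norms = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    scores = doc_norms @ query_norm  # cosine similarity per document
    return np.argsort(scores)[::-1][:k].tolist()


# Placeholder data: 100 documents and one query in a 384-dimensional space.
rng = np.random.default_rng(0)
docs = rng.normal(size=(100, 384))
query = rng.normal(size=384)
print(cosine_top_k(query, docs, k=3))
```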
Consumer Insights Data Scientist
Philip Morris International
- Built a multi-label text categorization model (46 topics) to replace a model based on heuristic hard rules, improving the F1 score from 0.11 to 0.81 on production data (a minimal sketch of the multi-label setup follows this list).
- Migrated static reports to dynamic dashboards using Power BI. Built, maintained, and improved dashboards with complex data models with more than 30 entities.
- Built complex ETL pipelines in Python to extract and transform data from different sources into star-schema data models in Power BI.
- Communicated findings and insights to stakeholders in data-story presentations.
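The production model used DeepPavlov; as a hedged illustration of the multi-label setup itself (one binary decision per topic), a scikit-learn baseline with TF-IDF features might look like the following. The topic names and example texts are invented:

```python
# Minimal multi-label text categorization baseline: each text may carry
# several topic labels at once (one-vs-rest setup).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MultiLabelBinarizer

texts = [
    "the pack design feels premium",
    "delivery was late and support did not answer",
    "great taste but the price went up",
]
labels = [["design"], ["delivery", "support"], ["taste", "price"]]

mlb = MultiLabelBinarizer()
y = mlb.fit_transform(labels)  # binary indicator matrix, one column per topic

model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),
    OneVsRestClassifier(LogisticRegression(max_iter=1000)),
)
model.fit(texts, y)

pred = model.predict(["support never replied about the late delivery"])
print(mlb.inverse_transform(pred))  # predicted topic tuples; may be empty on toy data
```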
Software Engineer
One Technologies
- Migrated the chatbot service from Dialogflow to the Rasa open-source chatbot platform.
- Achieved a baseline NLU model performance on production data: F1 score of 0.732.
- Implemented a CRUD microservice for a chatbot platform.
- Created pipelines for data cleaning, data transformation, and data check for the chatbot.
- Automated tests for chatbot data consistency and integrity (a pytest sketch follows this list).
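The actual checks are project-specific; a hedged pytest sketch, assuming the NLU data has been loaded into a simple intent-to-examples mapping, could look like this:

```python
# Hypothetical pytest checks for chatbot training-data consistency.
# NLU_DATA is a placeholder for data loaded from the real training files.
NLU_DATA = {
    "greet": ["hi", "hello", "good morning"],
    "goodbye": ["bye", "see you later"],
}

MIN_EXAMPLES_PER_INTENT = 2


def test_every_intent_has_enough_examples():
    for intent, examples in NLU_DATA.items():
        assert len(examples) >= MIN_EXAMPLES_PER_INTENT, f"{intent} is under-trained"


def test_no_example_is_shared_between_intents():
    seen = {}
    for intent, examples in NLU_DATA.items():
        for example in examples:
            assert example not in seen, (
                f"'{example}' appears in both {seen.get(example)} and {intent}"
            )
            seen[example] = intent
```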
Data Scientist
PrimeSource
- Built and deployed statistical models in SPSS Modeler, such as loan pre-approval and top-up models for targeted consumer campaigns.
- Designed and implemented data views and tables for campaign data panels using SQL and built and maintained ETL processes for the temporary tables needed for statistical models.
- Performed the analysis for a diverse set of ad hoc requests from internal stakeholders, measured the effectiveness of rolled-out campaigns, and communicated the results with data-story presentations.
- Led a team of two data analysts, conducted daily stand-ups for check-ins and progress on tasks, and communicated the projects' status to the principal data scientist.
Assistant Researcher
Graduate School of Knowledge Service Engineering | KAIST
- Implemented an information retrieval framework for clinical decision support in cancer diagnosis and treatment.
- Submitted a paper to the TREC 2017 Precision Medicine Track, but it was not accepted.
- Assisted with the research in the field of precision medicine.
Experience
NewsAgg
https://github.com/biddy1618/newsProject
The project includes a search engine that retrieves articles based on the cosine similarity of the term frequency-inverse document frequency (TF-IDF) representations of queries and articles; a minimal sketch of this idea follows below.
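This is not the repository's actual code, only an illustration of TF-IDF retrieval with scikit-learn; the sample articles and query are invented:

```python
# TF-IDF search sketch: rank articles by cosine similarity to a query.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

articles = [
    "Central bank raises interest rates again",
    "Local team wins the championship final",
    "New smartphone model announced this week",
]

vectorizer = TfidfVectorizer(stop_words="english")
article_matrix = vectorizer.fit_transform(articles)

query_vec = vectorizer.transform(["interest rate decision"])
# linear_kernel on L2-normalized TF-IDF rows equals cosine similarity.
scores = linear_kernel(query_vec, article_matrix).ravel()
best = scores.argsort()[::-1]
print([articles[i] for i in best[:2]])  # most similar articles first
```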
Kaggle Alice Competition
https://github.com/biddy1618/alicekagglecompetition
This project tries to identify a user on the internet by tracking their sequence of visited web pages. The algorithm takes a webpage session (a series of web pages visited consecutively by the same person) and predicts whether it belongs to Alice or somebody else.
As of 23.12.18, my leaderboard (LB) standing is 87th out of roughly 2,000 participants.
For this competition, the data was time-series session information regarding user browser history. We had to distinguish between a regular user and an intruder user based on sites visited and time of visit information.
I performed the exploratory data analysis (EDA), data wrangling, and feature engineering to achieve a ROC-AUC score of 0.95856.
IMPLEMENTED FEATURES
• Dummy-encoded hour of the session start time (referred to below as the start time).
• Sine and cosine transformations of the start time (see the cyclical-encoding sketch after this list).
• Active start hours of the intruder.
• Dummy weekday feature of the start time.
• Active weekdays of the intruder.
• Dummy month feature of the start time.
• Sin and cos transformation of the year and day feature.
• Session length.
• Standard deviation of time spent on sites within a session.
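The sine/cosine transforms encode cyclical time so that, for example, hour 23 sits next to hour 0. A small pandas sketch of the idea (not the competition code; timestamps are placeholders):

```python
# Cyclical encoding of the session start hour with sine and cosine.
import numpy as np
import pandas as pd

sessions = pd.DataFrame({"start_time": pd.to_datetime([
    "2018-12-23 23:45:00",
    "2018-12-24 00:10:00",
    "2018-12-24 13:30:00",
])})

hour = sessions["start_time"].dt.hour
sessions["hour_sin"] = np.sin(2 * np.pi * hour / 24)
sessions["hour_cos"] = np.cos(2 * np.pi * hour / 24)
print(sessions[["hour_sin", "hour_cos"]])  # 23:45 and 00:10 end up close together
```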
Prediction of Churn | Telco Customer Churn Sample
https://github.com/biddy1618/churn-rate
We trained two models: logistic regression and random forest. Logistic regression is a classical classification method that offers good interpretability and reasonably stable behavior, while random forest is robust and works well with small datasets thanks to its bagging (bootstrap aggregation) sampling. A sketch of this comparison follows below.
Final ROC-AUC scores on test data are as follows:
• For logistic regression - 0.652
• For random forest - 0.989
The top five features for logistic regression are state of residence, age, highest education acquired, use of internet services, and number of complaints.
Random forest showed much better performance measures than logistic regression, and its major features make much more sense.
From a business perspective, age, unpaid balance, annual income (and other major features) seem to be valid features for churn rate.
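As a hedged illustration of the comparison described above (not the repository's code), both models can be trained on the same split and scored with ROC-AUC; the data here is synthetic:

```python
# Train logistic regression and random forest on the same (synthetic) churn data
# and compare ROC-AUC on a held-out test set.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the churn dataset (imbalanced, as churn usually is).
X, y = make_classification(n_samples=2000, n_features=20, weights=[0.8], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

for name, model in [
    ("logistic regression", LogisticRegression(max_iter=1000)),
    ("random forest", RandomForestClassifier(n_estimators=200, random_state=42)),
]:
    model.fit(X_train, y_train)
    proba = model.predict_proba(X_test)[:, 1]
    print(f"{name}: ROC-AUC = {roc_auc_score(y_test, proba):.3f}")
```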
Complete ML Project - From Getting Raw Data to Deployment
https://github.com/biddy1618/udacity-mldevops-3-project-mlmodel-fastapi-heroku
This project showcases a CI/CD pipeline implementation using GitHub Actions and Heroku deployment.
CI includes pytest tests and PEP8 compliance checks; DVC encapsulates the different training components into stages (a sketch of the kind of test run in CI follows below).
This project is well-documented and has a model card description.
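The CI details live in the repository's GitHub Actions workflow; as an illustration of the kind of check it runs, here is a minimal FastAPI inference endpoint with a pytest test. The endpoint name, payload, and labels are assumptions, not the project's actual API:

```python
# Minimal FastAPI inference endpoint plus a pytest test, the kind of check
# a CI pipeline (e.g., GitHub Actions) would run before deployment.
from fastapi import FastAPI
from fastapi.testclient import TestClient
from pydantic import BaseModel

app = FastAPI()


class Record(BaseModel):
    age: int
    education: str


@app.post("/predict")
def predict(record: Record) -> dict:
    # Placeholder for the real model inference.
    return {"label": ">50K" if record.age > 40 else "<=50K"}


client = TestClient(app)


def test_predict_returns_a_label():
    response = client.post("/predict", json={"age": 52, "education": "Masters"})
    assert response.status_code == 200
    assert response.json()["label"] in {">50K", "<=50K"}
```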
Education
Bachelor's Degree in Computer Science
Korea Advanced Institute of Science and Technology (KAIST) - Daejeon, South Korea
Exchange Program Specialized in Computer Science
Innopolis - Kazan, Republic of Tatarstan, Russia
Exchange Program Specialized in Computer Engineering
Middle East Technical University - Ankara, Turkey
Certifications
Machine Learning DevOps Engineer Nanodegree
Udacity
Natural Language Processing
Coursera
Introduction to Deep Learning
Coursera
Mathematics for Machine Learning Specialization | PCA
Coursera
IBM Certified Specialist | SPSS Modeler Professional V3
IBM
Machine Learning Specialization
Coursera
Machine Learning
Coursera
Skills
Libraries/APIs
Pandas, PyTorch, Scikit-learn, NumPy, SciPy, Rasa NLU, SQLAlchemy, TensorFlow, REST APIs, SpaCy, LSTM
Tools
Jupyter, Microsoft Power BI, SPSS Modeler, Rasa.ai, IBM SPSS, Git, Pytest, GitLab CI/CD, Docker Compose, GitLab, Logging, Plotly, Tableau
Languages
Python, SQL, Java, JavaScript, Scala
Paradigms
Object-oriented Programming (OOP), ETL, Functional Programming, DevOps, Clean Code, Testing, Human-computer Interaction (HCI), CRUD
Platforms
Visual Studio Code (VS Code), Jupyter Notebook, Docker, Linux, Amazon EC2, Heroku, Amazon Web Services (AWS), Apache Kafka, Databricks
Frameworks
Flask, DeepPavlov, GraphLab, Spark, OAuth 2
Storage
PostgreSQL, Elasticsearch, Data Pipelines, Databases, Relational Databases, MySQL
Other
Algorithms, Data Structures, Machine Learning, Data Transformation, Data Science, Predictive Modeling, Data Analysis, Natural Language Processing (NLP), Deep Learning, Deployment, Data Visualization, Generative Pre-trained Transformers (GPT), BERT, MLflow, Data Versioning, Hugging Face, FastAPI, CI/CD Pipelines, GitHub Actions, Web Crawlers, Web Scraping, DevOps Engineer, Machine Learning Automation, Code Versioning, GitOps, Experiment Tracking, APIs, Flake8, Probability Theory, Linear Algebra, Operating Systems, System Programming, Discrete Mathematics, Linear Optimization, OOP Designs, Computer Science, Information Retrieval, Live Chat, Chatbots, Search Engines, Containers, User Intent Scoring, Conda, Data, Data Cleaning, Dashboards, Data Modeling, Reporting, Statistical Methods, Data Extraction, Authorization, Data Migration, Scripting, Ad Campaigns, Customer Segmentation, Market Segmentation, Time Series, Time Series Analysis, Machine Learning Operations (MLOps), Data Mining