Laura Tolosi, Clustering Algorithms Developer in Sofia, Bulgaria

Member since December 30, 2018
Laura has a Ph.D. from the Max Planck Institute for Informatics, Germany, in the field of computational biology, focused on cancer biomarker detection using statistics and machine learning. She has worked on natural language processing projects such as named entity recognition, sentiment analysis, and fake news detection. Recently, she has worked on applying reinforcement learning to the trading of financial instruments.



  • Machine Learning 14 years
  • Random Forests 14 years
  • Data Visualization 14 years
  • R 14 years
  • Clustering Algorithms 14 years
  • Sentiment Analysis 7 years
  • Natural Language Processing (NLP) 7 years
  • Reinforcement Learning 1 year


Sofia, Bulgaria



Preferred Environment

R, Python

The most amazing...

...project I did was to analyze a novel neuroblastoma-tumor dataset and search for viral DNA that could be causing cancer in small children.


  • Data Scientist and Machine Learning Engineer

    2018 - PRESENT
    • Implemented a reinforcement learning framework for algorithmic trading of cryptocurrencies.
    • Implemented chatbots from scratch using NLP state-of-the-art methods, based on Transformers (BERT).
    • Executed chatbots using Google Dialogflow and Google Cloud.
    • Implemented a framework for automated relation extraction from technical documents.
    • Implemented a module for estimating product repurchase-rate for an eCommerce client. In the same context, wrote algorithms for identifying abnormal purchase rates.
    • Worked on a machine learning-based solution for pattern detection in trading data (financial domain). Wrote heuristics as a semi-automated procedure for producing labeled data.
    Technologies: Python
  • Lead Scientist | Text Analysis

    2012 - 2018
    Ontotext AD
    • Developed ML models for NLP, including methods for domain adaptation, automated feature selection, and F-measure optimization. Applied models such as logistic regression, SVM, and CRF for both classification and sequence tagging.
    • Developed a machine learning model in R for classifying tweets as rumor or non-rumor.
    • Acquired in-depth knowledge of relational databases, ontologies, and linked data. Implemented a classification model in Java that automatically categorizes Wikipedia pages as belonging to the topic "Food and Drink" or not.
    • Experimented with LDA topic models to support a recommender system for a large publishing company.
    • Built prototypes for training word-vectors embeddings and graph embeddings.
    • Developed models for sentiment analysis for English and Bulgarian, in R and Java. The methods were supervised for English and unsupervised for Bulgarian.
    • Acquired significant experience with automated and semi-automated integration of various RDF resources such as DBpedia and Geonames.
    Technologies: Ontologies, RDF, Java, R
  • PhD

    2006 - 2012
    Max Planck Institute for Informatics
    • Gained expertise in cancer genetics, with a focus on copy number aberrations and acquired additional in-depth knowledge in domains like epigenetics, transcriptomics, and viral genomes.
    • Used supervised and unsupervised machine learning methods for modeling cancer genetic data. The supervised methods used were: logistic regression, elastic net, SVM, decision trees, and random forest.
    • Wrote machine learning models in the statistical language R and acquired in-depth expertise with visualization techniques in R.
    • Acquired solid experience with presenting complex AI models to non-experts (medical doctors), by giving the intuition behind the mathematical models.
    • Performed feature selection with various methods: filters with statistical tests, penalty methods for linear models, and pruning.
    • Acquired solid knowledge in computational statistics and statistical learning. This includes statistical tests, statistical distributions, estimators, and bias-variance decomposition.
    • Wrote scientific papers and learned how to deliver high-quality presentations in conferences and in front of clients.
    • Worked closely with medical doctors in hospitals. Conducted interdisciplinary communication with medical doctors, in order to maximize the benefit of the machine learning solutions for their patients.
    Technologies: Python, R
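
The F-measure optimization mentioned in the roles above can be illustrated as decision-threshold tuning: instead of the default 0.5 cutoff, pick the probability threshold that maximizes F1 on a validation set. This is a minimal sketch; all numbers are invented.

```python
# Pick the decision threshold that maximizes F1 on a validation set.

def f1(y_true, y_pred):
    tp = sum(t and p for t, p in zip(y_true, y_pred))
    fp = sum((not t) and p for t, p in zip(y_true, y_pred))
    fn = sum(t and (not p) for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def best_threshold(y_true, scores):
    # Only the observed scores need to be tried as cutoffs.
    candidates = sorted(set(scores))
    return max(candidates, key=lambda t: f1(y_true, [s >= t for s in scores]))

# Toy validation data: true labels and classifier scores.
y_true = [1, 1, 1, 0, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.55, 0.6, 0.4, 0.45, 0.3, 0.1]
t = best_threshold(y_true, scores)  # 0.45 on this toy data
```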


  • Canadian Heritage Information Network (CHIN) - Data Analysis (Development)

    CHIN holds a large database of artifacts from museums all over Canada. The database was assembled by collecting the digital datasets the museums provided, which are end-of-life, meaning they are no longer maintained. In its current state, the collection is hard to use: it suffers from ambiguities (e.g., many spellings of the same author name), repetitions, mixed English/French input that cannot be resolved by automatic means, and a lack of standard object taxonomies (e.g., for object materials and types). Ontotext was tasked with evaluating the effort necessary to clean the database and link it to LOD resources (DBpedia, Getty AAT, and others).

    I worked with two colleagues on this project. My role was to statistically estimate the proportion of malformed data, focusing on its most important features (e.g., museum, object category, type, name, language). I also estimated what proportion of the errors were systematic and addressable by automatic (NLP) methods.

    Eventually, the project was successful, exceeding the expectations of the Canadian institution.
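
The proportion estimate can be illustrated with a minimal sketch, assuming a simple random sample and a normal-approximation confidence interval. The toy database and the malformedness check below are invented stand-ins, not CHIN's real data or rules.

```python
import math
import random

# Estimate the share of malformed records from a random sample,
# with a 95% normal-approximation confidence interval.

def is_malformed(record):
    # Stand-in check: a record is malformed if a required field is empty.
    return not record.get("author")

def estimate_error_rate(database, sample_size, seed=0):
    random.seed(seed)
    sample = random.sample(database, sample_size)
    p = sum(is_malformed(r) for r in sample) / sample_size
    half_width = 1.96 * math.sqrt(p * (1 - p) / sample_size)
    return p, (max(0.0, p - half_width), min(1.0, p + half_width))

# Toy database: 20% of records have a missing author.
db = [{"author": "A"}] * 800 + [{"author": ""}] * 200
p, (lo, hi) = estimate_error_rate(db, sample_size=400)
```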

  • Brexit Twitter Analysis (Development)

    In the weeks preceding the Brexit referendum, Ontotext streamed Twitter discussions on the topic and ingested them into GraphDB, using the PHEME ontology model and the semantic enrichment pipeline, which links entities to LOD resources such as Geonames and DBpedia. I was assigned the task of analyzing these tweets to determine the main actors of the British referendum and their stance on the main question: #leave or #stay? Using simple bootstrapping techniques around polarizing hashtags (e.g., #leave and #exit vs. #stay and #remain), I conducted sentiment analysis on the set of tweets, which provided an estimate of sentiment toward various types of entities: politicians, political parties, geographical locations, and age groups. My analysis showed that the lobby for #exit was much stronger than for #stay, at least on Twitter. The report was published days before the referendum and received a lot of attention after the voting result matched the outcome of the analysis.
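
The hashtag-bootstrapping step can be sketched as follows. The seed hashtags come from the project description, but the tweets and the exact labeling rule are illustrative.

```python
# Tweets carrying a seed hashtag get a provisional stance label,
# which can then seed a supervised sentiment model.

LEAVE_TAGS = {"#leave", "#exit"}
STAY_TAGS = {"#stay", "#remain"}

def bootstrap_stance(tweet_text):
    tags = {w.lower() for w in tweet_text.split() if w.startswith("#")}
    leave, stay = len(tags & LEAVE_TAGS), len(tags & STAY_TAGS)
    if leave > stay:
        return "leave"
    if stay > leave:
        return "stay"
    return None  # ambiguous or untagged: left for the trained model

tweets = [
    "Time to take back control #leave",
    "Stronger together #remain #stay",
    "Watching the debate tonight",
]
labels = [bootstrap_stance(t) for t in tweets]  # ['leave', 'stay', None]
```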

  • Algorithmic Trading of Cryptocurrencies (Development)

    I had the curiosity and ambition to find out how easy it is to implement an algorithmic high-frequency trader for the cryptocurrency market, and I conducted this project by myself. The technology employed was Python and Keras, and the methodology was reinforcement learning with deep neural networks. I worked with open data, namely three years of Bitcoin prices, and also generated synthetic data for testing the algorithms. Along the way, I educated myself on reinforcement learning and time series forecasting.
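
A toy version of the approach is sketched below, using tabular Q-learning on a synthetic price series rather than the deep networks used in the actual project; the state, actions, and rewards are deliberately simplistic.

```python
import random

# Tabular Q-learning on a synthetic price series. State is the sign of
# the last price move; actions are hold (flat) or long.

def synthetic_prices(n, seed=1):
    random.seed(seed)
    prices, p = [], 100.0
    for _ in range(n):
        p *= 1 + random.gauss(0, 0.01)  # 1% daily volatility random walk
        prices.append(p)
    return prices

def train(prices, episodes=50, alpha=0.1, gamma=0.9, eps=0.1):
    q = {(s, a): 0.0 for s in ("up", "down") for a in ("hold", "long")}
    for _ in range(episodes):
        for t in range(1, len(prices) - 1):
            state = "up" if prices[t] > prices[t - 1] else "down"
            # Epsilon-greedy action selection.
            action = (random.choice(("hold", "long")) if random.random() < eps
                      else max(("hold", "long"), key=lambda a: q[(state, a)]))
            # Reward: next-step return if long, zero if flat.
            reward = (prices[t + 1] / prices[t] - 1) if action == "long" else 0.0
            nxt = "up" if prices[t + 1] > prices[t] else "down"
            best_next = max(q[(nxt, a)] for a in ("hold", "long"))
            q[(state, action)] += alpha * (reward + gamma * best_next - q[(state, action)])
    return q

q = train(synthetic_prices(500))
```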

  • Rumor Detection on Social Media (Twitter) (Development)

    PHEME was a very successful research project funded by the EU's 7th Framework Programme for research, technological development, and demonstration. The consortium had the ambitious goal of developing a platform for the automated detection of rumors on social media, intended to help journalists fight the proliferation of misinformation and disinformation through such channels. The tool streamed data from Twitter, filtered to cover interesting political events. The data analysis was a multi-language text-processing pipeline built on top of an ontology that modeled rumors on Twitter.

    I was involved in many aspects of the PHEME project. As a data scientist, I developed an ML model for the prediction of rumors on Twitter. As a member of Ontotext's team, I coordinated the integration of the pipeline components coming from all partners. I wrote deliverables, reports, and scientific papers describing our work.

  • Mining Highly Structured Information (MobiBiz, London) (Development)

    I was hired by MobiBiz as a freelance data scientist (through Toptal) to implement and bring to production a system that extracts relations from highly structured documents, which include tables, sections, figures with captions, etc. The solution is based on the open-source Fonduer framework. My task was to apply this algorithm to a specific dataset while keeping the solution general enough to ensure applicability to future similar cases. The code is in Python, and the application uses a Postgres relational database with the SQLAlchemy interface to Python.
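
A much-simplified stand-in for this pipeline (this is not Fonduer's actual API): candidate pairs are generated from a parsed table, then filtered by a labeling rule. The table, column names, and rule are invented for illustration.

```python
import re

# A parsed table, represented as rows of header-to-cell mappings.
table = [
    {"Part": "XC-100", "Max Voltage": "5.5 V", "Notes": "obsolete"},
    {"Part": "XC-200", "Max Voltage": "3.3 V", "Notes": ""},
]

def candidates(rows, key_col, value_col):
    # Candidate generation: every (key, value) pair in the table.
    for row in rows:
        yield (row[key_col], row[value_col])

def looks_like_voltage(value):
    # Labeling rule: accept values of the form "<number> V".
    return re.fullmatch(r"\d+(\.\d+)?\s*V", value) is not None

relations = [(part, v) for part, v in candidates(table, "Part", "Max Voltage")
             if looks_like_voltage(v)]
# relations == [('XC-100', '5.5 V'), ('XC-200', '3.3 V')]
```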

  • Chatbot for Dialogue with Book Characters (USC Libraries) (Development)

    I helped implement two chatbots depicting the characters Alice and the Cheshire Cat from Lewis Carroll's Alice's Adventures in Wonderland. The characters can interact with users who ask questions verbally: a speech recognition system translates voice to text, and a question answering system provides an appropriate answer. For this prototype, each character has a fixed set of responses to choose from, referring to facts from the book, biographical information on Lewis Carroll and his books, and some topics about the University of Southern California's library.

    My role in the project was to help my team select a speech recognition system for translating users' questions into text, and to implement a question answering model able to select the appropriate answer from the list of possible answers. I used BERT for question answering. The system is deployed as a web service and takes requests in real time through a Flask app.
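
The answer-selection step can be sketched as follows, with a bag-of-words cosine similarity standing in for the BERT-based scorer; the question and candidate answers are invented.

```python
import math
from collections import Counter

# Score each fixed response against the user's question and
# return the highest-scoring one.

def vectorize(text):
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def select_answer(question, answers):
    qv = vectorize(question)
    return max(answers, key=lambda ans: cosine(qv, vectorize(ans)))

answers = [
    "Lewis Carroll wrote the book in 1865.",
    "The Cheshire Cat can disappear, leaving only its grin.",
]
best = select_answer("When was the book written?", answers)
```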


  • Languages

    R, Python 2, Python 3, RDF, Java, SPARQL, SQL
  • Other

    Machine Learning, Data Visualization, Random Forests, Clustering Algorithms, Natural Language Processing (NLP), Sentiment Analysis, Scientific Data Analysis, Research, Statistics, Computational Biology, Google BERT, Neural Networks, Convolutional Neural Networks, Deep Neural Networks, Generalized Linear Model (GLM), Information Retrieval, Applied Mathematics, Algorithms, Reinforcement Learning, Deep Reinforcement Learning, Chatbots, Custom BERT, ASR, Ontologies, Deep Learning, Agile Data Science, Natural Language Understanding, Time Series Analysis
  • Libraries/APIs

    Scikit-learn, TensorFlow, SQLAlchemy
  • Tools

    PyCharm, Dialogflow, Git, GitLab
  • Platforms

    Linux, Jupyter Notebook, RStudio
  • Frameworks

    RStudio Shiny, Flask
  • Storage

    JSON, PostgreSQL, AWS S3


  • PhD in Computational Biology
    2006 - 2012
    Max Planck Institute for Informatics - Saarbrücken, Germany
  • Master's degree in Computational Biology
    2005 - 2006
    Max Planck Institute for Informatics - Saarbrücken, Germany
  • Bachelor's degree in Computer Science
    1999 - 2003
    University of Bucharest - Bucharest, Romania


  • Participation in the EEML Summer School for Deep Learning, organized by Google DeepMind
    JULY 2020 - PRESENT
