Laura Tolosi, Clustering Algorithms Developer in Sofia, Bulgaria

Member since December 30, 2018
Laura has a Ph.D. from the Max Planck Institute for Informatics, Germany, in the field of computational biology, focused on cancer biomarker detection using statistics and machine learning. She has worked on natural language processing projects such as named entity recognition, sentiment analysis, and fake news detection. Recently, she has worked on applying reinforcement learning to the trading of financial instruments.



  • Machine Learning 14 years
  • Random Forests 14 years
  • Data Visualization 14 years
  • R 14 years
  • Clustering Algorithms 14 years
  • Sentiment Analysis 7 years
  • Natural Language Processing (NLP) 7 years
  • Reinforcement Learning 1 year


Sofia, Bulgaria



Preferred Environment

R, Python

The most amazing...

...project I did was to analyze a novel neuroblastoma-tumor dataset and search for viral DNA that could be causing cancer in small children.


  • Data Scientist and Machine Learning Engineer

    2018 - PRESENT
    • Implemented a reinforcement learning framework for algorithmic trading of cryptocurrencies.
    • Implemented chatbots from scratch using NLP state-of-the-art methods, based on Transformers (BERT).
    • Executed chatbots using Google Dialogflow and Google Cloud.
    • Implemented a framework for automated relation extraction from technical documents.
    • Implemented a module for estimating product repurchase-rate for an eCommerce client. In the same context, wrote algorithms for identifying abnormal purchase rates.
    • Worked on a machine learning-based solution for pattern detection in trading data (financial domain). Wrote heuristics as a semi-automated procedure for producing labeled data.
    Technologies: Python
  • Lead Scientist | Text Analysis

    2012 - 2018
    Ontotext AD
    • Developed ML models for NLP, including methods for domain adaptation, automated feature selection, and F-measure optimization. Applied models such as logistic regression, SVM, and CRF for both classification and sequence tagging.
    • Developed a machine learning model in R for classifying tweets as rumor or non-rumor.
    • Acquired in-depth knowledge of relational databases, ontologies, and linked data. Implemented a classification model in Java that automatically categorizes Wikipedia pages as belonging to the topic "Food and Drink" or not.
    • Experimented with LDA topic models to support a recommender system for a large publishing company.
    • Built prototypes for training word-vectors embeddings and graph embeddings.
    • Developed models for sentiment analysis for English and Bulgarian, in R and Java. The methods were supervised for English and unsupervised for Bulgarian.
    • Acquired significant experience with automated and semi-automated integration of various RDF resources such as DBpedia and Geonames.
    Technologies: Ontologies, RDF, Java, R
  • PhD

    2006 - 2012
    Max Planck Institute for Informatics
    • Gained expertise in cancer genetics, with a focus on copy number aberrations and acquired additional in-depth knowledge in domains like epigenetics, transcriptomics, and viral genomes.
    • Used supervised and unsupervised machine learning methods for modeling cancer genetic data. The supervised methods used were: logistic regression, elastic net, SVM, decision trees, and random forest.
    • Wrote machine learning models in the statistical language R and acquired in-depth expertise with visualization techniques in R.
    • Acquired solid experience with presenting complex AI models to non-experts (medical doctors), by giving the intuition behind the mathematical models.
    • Performed feature selection with various methods: filters with statistical tests, penalty methods for linear models, and pruning.
    • Acquired solid knowledge in computational statistics and statistical learning. This includes statistical tests, statistical distributions, estimators, and bias-variance decomposition.
    • Wrote scientific papers and learned how to deliver high-quality presentations in conferences and in front of clients.
    • Worked closely with medical doctors in hospitals. Conducted interdisciplinary communication with medical doctors, in order to maximize the benefit of the machine learning solutions for their patients.
    Technologies: Python, R
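
The F-measure optimization mentioned in the roles above can be illustrated as decision-threshold tuning: instead of the default 0.5 cutoff, pick the probability threshold that maximizes F1 on a validation set. This is a minimal sketch; all numbers are invented.

```python
# Pick the decision threshold that maximizes F1 on a validation set.

def f1(y_true, y_pred):
    tp = sum(t and p for t, p in zip(y_true, y_pred))
    fp = sum((not t) and p for t, p in zip(y_true, y_pred))
    fn = sum(t and (not p) for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def best_threshold(y_true, scores):
    # Only the observed scores need to be tried as cutoffs.
    candidates = sorted(set(scores))
    return max(candidates, key=lambda t: f1(y_true, [s >= t for s in scores]))

# Toy validation data: true labels and classifier scores.
y_true = [1, 1, 1, 0, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.55, 0.6, 0.4, 0.45, 0.3, 0.1]
t = best_threshold(y_true, scores)  # 0.45 on this toy data
```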


  • Canadian Heritage Information Network (CHIN) - Data Analysis (Development)

    CHIN holds a large database of artifacts from museums all over Canada. The database was assembled by collecting the digital datasets the museums provided, which are end-of-life, meaning they are no longer maintained. In its current state, the collection is hard to use: it suffers from ambiguities (e.g., many spellings of the same author name), repetitions, mixed English/French input that cannot be resolved by automatic means, and a lack of standard object taxonomies (e.g., for object materials and types). Ontotext was tasked with evaluating the effort necessary to clean the database and link it to LOD resources (DBpedia, Getty AAT, and others).

    I worked with two colleagues on this project. My role was to statistically estimate the proportion of malformed data, focusing on its most important features (e.g., museum, object category, type, name, language). I also estimated what proportion of the errors were systematic and addressable by automatic (NLP) methods.

    Eventually, the project was successful, exceeding the expectations of the Canadian institution.
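
The proportion estimate can be illustrated with a minimal sketch, assuming a simple random sample and a normal-approximation confidence interval. The toy database and the malformedness check below are invented stand-ins, not CHIN's real data or rules.

```python
import math
import random

# Estimate the share of malformed records from a random sample,
# with a 95% normal-approximation confidence interval.

def is_malformed(record):
    # Stand-in check: a record is malformed if a required field is empty.
    return not record.get("author")

def estimate_error_rate(database, sample_size, seed=0):
    random.seed(seed)
    sample = random.sample(database, sample_size)
    p = sum(is_malformed(r) for r in sample) / sample_size
    half_width = 1.96 * math.sqrt(p * (1 - p) / sample_size)
    return p, (max(0.0, p - half_width), min(1.0, p + half_width))

# Toy database: 20% of records have a missing author.
db = [{"author": "A"}] * 800 + [{"author": ""}] * 200
p, (lo, hi) = estimate_error_rate(db, sample_size=400)
```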

  • Brexit Twitter Analysis (Development)

    In the weeks preceding the Brexit referendum, Ontotext streamed Twitter discussions on the topic and ingested them into GraphDB, using the PHEME ontology model and the semantic enrichment pipeline, which links entities to LOD resources such as Geonames and DBpedia. I was assigned the task of analyzing these tweets to determine the main actors of the British referendum and their stance on the main question: #leave or #stay? Using simple bootstrapping techniques around polarizing hashtags (e.g., #leave and #exit vs. #stay and #remain), I conducted sentiment analysis on the set of tweets, which provided an estimate of sentiment toward various types of entities: politicians, political parties, geographical locations, and age groups. My analysis showed that the lobby for #exit was much stronger than for #stay, at least on Twitter. The report was published days before the referendum and received a lot of attention after the voting result matched the outcome of the analysis.
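
The hashtag-bootstrapping step can be sketched as follows. The seed hashtags come from the project description, but the tweets and the exact labeling rule are illustrative.

```python
# Tweets carrying a seed hashtag get a provisional stance label,
# which can then seed a supervised sentiment model.

LEAVE_TAGS = {"#leave", "#exit"}
STAY_TAGS = {"#stay", "#remain"}

def bootstrap_stance(tweet_text):
    tags = {w.lower() for w in tweet_text.split() if w.startswith("#")}
    leave, stay = len(tags & LEAVE_TAGS), len(tags & STAY_TAGS)
    if leave > stay:
        return "leave"
    if stay > leave:
        return "stay"
    return None  # ambiguous or untagged: left for the trained model

tweets = [
    "Time to take back control #leave",
    "Stronger together #remain #stay",
    "Watching the debate tonight",
]
labels = [bootstrap_stance(t) for t in tweets]  # ['leave', 'stay', None]
```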

  • Algorithmic Trading of Cryptocurrencies (Development)

    I had the curiosity and ambition to find out how easy it is to implement an algorithmic high-frequency trader for the cryptocurrency market, and I conducted this project by myself. The technology employed was Python and Keras, and the methodology was reinforcement learning with deep neural networks. I worked with open data, namely three years of Bitcoin prices, and also generated synthetic data for testing the algorithms. Along the way, I educated myself on reinforcement learning and time series forecasting.
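
A toy version of the approach is sketched below, using tabular Q-learning on a synthetic price series rather than the deep networks used in the actual project; the state, actions, and rewards are deliberately simplistic.

```python
import random

# Tabular Q-learning on a synthetic price series. State is the sign of
# the last price move; actions are hold (flat) or long.

def synthetic_prices(n, seed=1):
    random.seed(seed)
    prices, p = [], 100.0
    for _ in range(n):
        p *= 1 + random.gauss(0, 0.01)  # 1% daily volatility random walk
        prices.append(p)
    return prices

def train(prices, episodes=50, alpha=0.1, gamma=0.9, eps=0.1):
    q = {(s, a): 0.0 for s in ("up", "down") for a in ("hold", "long")}
    for _ in range(episodes):
        for t in range(1, len(prices) - 1):
            state = "up" if prices[t] > prices[t - 1] else "down"
            # Epsilon-greedy action selection.
            action = (random.choice(("hold", "long")) if random.random() < eps
                      else max(("hold", "long"), key=lambda a: q[(state, a)]))
            # Reward: next-step return if long, zero if flat.
            reward = (prices[t + 1] / prices[t] - 1) if action == "long" else 0.0
            nxt = "up" if prices[t + 1] > prices[t] else "down"
            best_next = max(q[(nxt, a)] for a in ("hold", "long"))
            q[(state, action)] += alpha * (reward + gamma * best_next - q[(state, action)])
    return q

q = train(synthetic_prices(500))
```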

  • Rumor Detection on Social Media (Twitter) (Development)

    PHEME was a very successful research project funded by the EU's 7th Framework Programme for research, technological development, and demonstration. The consortium had the ambitious goal of developing a platform for the automated detection of rumors on social media, intended to help journalists fight the proliferation of misinformation and disinformation through such channels. The tool streamed data from Twitter, filtered to cover interesting political events. The data analysis was a multi-language text-processing pipeline built on top of an ontology that modeled rumors on Twitter.

    I was involved in many aspects of the PHEME project. As a data scientist, I developed an ML model for the prediction of rumors on Twitter. As a member of Ontotext's team, I coordinated the integration of the pipeline components coming from all partners. I wrote deliverables, reports, and scientific papers describing our work.

  • Mining Highly Structured Information (MobiBiz, London) (Development)

    I was hired by MobiBiz as a freelance data scientist (through Toptal) to implement and bring to production a system that extracts relations from highly structured documents, which include tables, sections, figures with captions, etc. The solution is based on the open-source Fonduer framework. My task was to apply this algorithm to a specific dataset while keeping the solution general enough to ensure applicability to future similar cases. The code is in Python, and the application uses a Postgres relational database with the SQLAlchemy interface to Python.
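
A much-simplified stand-in for this pipeline (this is not Fonduer's actual API): candidate pairs are generated from a parsed table, then filtered by a labeling rule. The table, column names, and rule are invented for illustration.

```python
import re

# A parsed table, represented as rows of header-to-cell mappings.
table = [
    {"Part": "XC-100", "Max Voltage": "5.5 V", "Notes": "obsolete"},
    {"Part": "XC-200", "Max Voltage": "3.3 V", "Notes": ""},
]

def candidates(rows, key_col, value_col):
    # Candidate generation: every (key, value) pair in the table.
    for row in rows:
        yield (row[key_col], row[value_col])

def looks_like_voltage(value):
    # Labeling rule: accept values of the form "<number> V".
    return re.fullmatch(r"\d+(\.\d+)?\s*V", value) is not None

relations = [(part, v) for part, v in candidates(table, "Part", "Max Voltage")
             if looks_like_voltage(v)]
# relations == [('XC-100', '5.5 V'), ('XC-200', '3.3 V')]
```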

  • Chatbot for Dialogue with Book Characters (USC Libraries) (Development)

    I helped implement two chatbots depicting the characters Alice and the Cheshire Cat from Lewis Carroll's Alice's Adventures in Wonderland. The characters can interact with users who ask questions verbally: a speech recognition system translates voice to text, and a question answering system provides an appropriate answer. For this prototype, each character has a fixed set of responses to choose from, referring to facts from the book, biographical information on Lewis Carroll and his books, and some topics about the University of Southern California's library.

    My role in the project was to help my team select a speech recognition system for translating users' questions into text, and to implement a question answering model able to select the appropriate answer from the list of possible answers. I used BERT for question answering. The system is deployed as a web service and takes requests in real time through a Flask app.
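
The answer-selection step can be sketched as follows, with a bag-of-words cosine similarity standing in for the BERT-based scorer; the question and candidate answers are invented.

```python
import math
from collections import Counter

# Score each fixed response against the user's question and
# return the highest-scoring one.

def vectorize(text):
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def select_answer(question, answers):
    qv = vectorize(question)
    return max(answers, key=lambda ans: cosine(qv, vectorize(ans)))

answers = [
    "Lewis Carroll wrote the book in 1865.",
    "The Cheshire Cat can disappear, leaving only its grin.",
]
best = select_answer("When was the book written?", answers)
```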


  • Languages

    R, Python 2, Python 3, RDF, Java, SPARQL, SQL
  • Other

    Machine Learning, Data Visualization, Random Forests, Clustering Algorithms, Natural Language Processing (NLP), Sentiment Analysis, Scientific Data Analysis, Research, Statistics, Computational Biology, Google BERT, Neural Networks, Convolutional Neural Networks, Deep Neural Networks, Generalized Linear Model (GLM), Information Retrieval, Applied Mathematics, Algorithms, Reinforcement Learning, Deep Reinforcement Learning, Chatbots, Custom BERT, ASR, Ontologies, Deep Learning, Agile Data Science, Natural Language Understanding, Time Series Analysis
  • Libraries/APIs

    Scikit-learn, TensorFlow, SQLAlchemy
  • Tools

    PyCharm, Dialogflow, Git, GitLab
  • Platforms

    Linux, Jupyter Notebook, RStudio
  • Frameworks

    RStudio Shiny, Flask
  • Storage

    JSON, PostgreSQL, AWS S3


  • PhD in Computational Biology
    2006 - 2012
    Max Planck Institute for Informatics - Saarbrücken, Germany
  • Master's degree in Computational Biology
    2005 - 2006
    Max Planck Institute for Informatics - Saarbrücken, Germany
  • Bachelor's degree in Computer Science
    1999 - 2003
    University of Bucharest - Bucharest, Romania


  • Participation in the EEML Summer School for Deep Learning, organized by Google DeepMind
    JULY 2020 - PRESENT
