Michael Völske

Machine Learning Developer in Aschaffenburg, Bavaria, Germany

Member since September 9, 2021
Michael has a PhD in computer science, a decade of experience solving complex data problems, and a dozen publications at top-tier international venues like SIGIR, CIKM, and ACL, many of them based on web-scale datasets. He excels at planning, procuring, and installing on-premise data processing infrastructure spanning hundreds of servers, petabytes of disk, and petaflops of compute. Michael has broad knowledge of modern machine learning and teaches the fundamentals to hundreds of students.

Location

Aschaffenburg, Bavaria, Germany

Availability

Part-time

Preferred Environment

Linux, Emacs, Visual Studio Code, Python 3, Spark, Pandas, Kubernetes, Scikit-learn, Git

The most amazing...

...dataset I've analyzed is a billion-item query log from a major search engine that taught me the power of Apache Spark and resulted in a top-ranked publication.

Employment

  • Assistant Lecturer, Online Teaching

    2020 - PRESENT
    Berliner Hochschule für Technik
    • Taught an online course on the fundamentals of computer operating systems as a freelance assistant lecturer, with classes of 20+ students per semester.
    • Designed and administered the practical exercise labs on digital circuits, assembly language, process scheduling, and file systems.
    • Created and administered the mid-term and final exams.
    Technologies: University Teaching
  • Postdoctoral Research Assistant

    2019 - PRESENT
    Bauhaus-Universität Weimar
    • Published 7+ research papers on information retrieval and data mining over three years, at A-level international venues and several minor ones, often leveraging state-of-the-art machine learning and data processing techniques.
    • Designed teaching materials on the fundamentals of machine learning as part of an annual lecture and led an associated programming lab for more than 100 students each year.
    • Led a team to plan, procure, and install a state-of-the-art GPU computing cluster; designed and implemented the systems for authenticating dozens of researchers using this infrastructure across four institutions.
    • Supported and mentored 11 successful students in writing bachelor's and master's theses.
    Technologies: Machine Learning, Technical Writing, University Teaching, LaTeX, Python 3, Git, SaltStack
  • Research Assistant

    2013 - 2019
    Bauhaus-Universität Weimar
    • Contributed to more than a dozen scientific publications on information retrieval, natural language processing, and data mining, half of which were published at A-rated international venues. Implemented the experimental systems in Python and Java.
    • Taught lab classes on the fundamentals of machine learning and held a recurring seminar class on big data processing architectures. Led small student project groups on a variety of hot topics in machine learning and data mining.
    • Planned and carried out the procurement, installation, maintenance, and monitoring of computing infrastructure comprising more than 200 individual servers.
    • Supported and mentored 11 successful students in the preparation of bachelor's and master's theses.
    Technologies: Data Science, Data Engineering, DevOps, Technical Writing, Python 3, Java, Hadoop, Kubernetes, Spark

Experience

  • Lecture and Lab Class "Introduction to Machine Learning"
    https://webis.de/lecturenotes#machine-learning

    The annual lecture on the fundamentals of machine learning is part of the curricula at Bauhaus-Universität Weimar and Universität Leipzig.

    My responsibilities included teaching in person and online, designing lecture materials and lab exercises, and supervising teaching assistants. To support the associated programming labs, I rolled out a Kubernetes-based JupyterHub deployment serving up to a hundred students at a time. I significantly expanded the teaching materials on neural networks and deep learning.
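
    A deployment like this is typically driven by JupyterHub's `jupyterhub_config.py`. The snippet below is a minimal, illustrative sketch assuming the KubeSpawner backend (as in a Zero to JupyterHub setup); the image name and resource limits are placeholders, not the actual course configuration.

    ```python
    # jupyterhub_config.py -- illustrative sketch only; the image name and
    # resource limits are placeholders, not the real course setup.
    # The `c` configuration object is injected by JupyterHub at startup.

    # Spawn each student's notebook server as a Kubernetes pod.
    c.JupyterHub.spawner_class = "kubespawner.KubeSpawner"

    # A single-user notebook image with the course libraries preinstalled.
    c.KubeSpawner.image = "example.org/ml-course-notebook:latest"

    # Per-student resource caps, so ~100 concurrent users share the cluster fairly.
    c.KubeSpawner.cpu_limit = 2
    c.KubeSpawner.mem_limit = "2G"

    # Give pods a little longer to start when many students log in at once.
    c.KubeSpawner.start_timeout = 300
    ```

    Capping per-user CPU and memory is what makes a hundred concurrent notebook servers predictable on a shared cluster.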

  • Query Classification and Log Analysis on a Billion-item Query Log

    To better understand how web users ask questions and how search engines can serve their information needs, I analyzed a query log of more than one billion individual entries from a major commercial search engine. Due to the amount of data involved, I leveraged big data technologies, including Apache Spark deployed on a 100-node Hadoop cluster, to implement a comprehensive data cleaning and analysis pipeline. Using user-labeled community question answering data as a training set, I developed query classification models based on different feature sets—such as unigrams and topic models—and evaluated their effectiveness and efficiency. The resulting publication was accepted at CIKM 2015 and served as the basis for varied downstream research on question classification.
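
    As a rough illustration of the setup described above (unigram features feeding a supervised classifier), scaled down to a toy example: the queries, labels, and model choice here are invented for illustration and stand in for the CQA-trained models from the paper.

    ```python
    # Toy sketch of unigram-based query classification; the training
    # queries and labels below are invented for illustration.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    train_queries = [
        "how to fix a flat bicycle tire",
        "why is the sky blue",
        "cheap flights to berlin",
        "best pizza near me",
    ]
    train_labels = ["question", "question", "transactional", "transactional"]

    # Unigram bag-of-words features plus a linear classifier.
    clf = make_pipeline(CountVectorizer(ngram_range=(1, 1)), LogisticRegression())
    clf.fit(train_queries, train_labels)

    print(clf.predict(["how do magnets work"])[0])
    ```

    In the actual study, topic-model features were evaluated alongside unigrams, trading classification effectiveness against efficiency at billion-query scale.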

  • Research and Implementation for Axiomatic Information Retrieval
    https://webis.de/publications.html?q=author:volske+axiomatic

    Axiomatic IR deals with formal properties that a good relevance scoring function should fulfill.

    This research project investigated strategies to make retrieval axioms directly usable to benefit real-world search engines. A pilot study published at CIKM 2016 showed how retrieval axioms could directly modify result rankings and thus improve search result quality. Follow-up work in ICTIR 2021 showed how retrieval axioms could generate explanations for arbitrary rankings, making complex relevance scoring functions such as those based on deep neural networks more interpretable. The axiomatic re-ranking pipeline I implemented has contributed to several further publications.
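
    The general shape of such a re-ranking pipeline can be sketched as follows. The two toy axioms and the Copeland-style vote aggregation are simplified illustrations, not the actual Webis implementation.

    ```python
    # Illustrative sketch of axiom-based re-ranking (simplified). Each axiom
    # is a pairwise preference function returning +1 if document `a` should
    # rank above `b`, -1 for the reverse, and 0 for no preference; documents
    # are then re-ordered by how many pairwise contests they win.

    def term_frequency_axiom(query, a, b):
        """Toy TFC1-style axiom: prefer the document matching more query terms."""
        def matches(doc):
            return sum(doc.lower().split().count(t) for t in query.lower().split())
        return (matches(a) > matches(b)) - (matches(a) < matches(b))

    def brevity_axiom(query, a, b):
        """Toy LNC1-style axiom: all else being equal, prefer the shorter document."""
        return (len(a.split()) < len(b.split())) - (len(a.split()) > len(b.split()))

    def rerank(query, docs, axioms):
        # Count, for each document, how many pairwise contests it wins
        # when all axioms' votes are summed per pair.
        wins = {doc: 0 for doc in docs}
        for i, a in enumerate(docs):
            for b in docs[i + 1:]:
                vote = sum(axiom(query, a, b) for axiom in axioms)
                if vote > 0:
                    wins[a] += 1
                elif vote < 0:
                    wins[b] += 1
        return sorted(docs, key=lambda doc: -wins[doc])

    docs = [
        "banana bread recipe",
        "apple pie with extra apple filling",
        "apple tart",
    ]
    print(rerank("apple pie", docs, [term_frequency_axiom]))
    ```

    Because each axiom only expresses pairwise preferences, the same machinery can also explain an existing ranking: for any adjacent pair, the axioms that voted for (or against) the observed order serve as the explanation.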

  • Procurement, Installation, and Maintenance of Computing Infrastructure
    https://webis.de/facilities.html#hardware

    I was primarily responsible for the planning, procurement, installation, and maintenance of research computing infrastructure spanning nearly 300 individual servers over more than five years. To provision, maintain, and monitor these systems, my team leveraged infrastructure-as-code technologies such as SaltStack, and developed custom solutions for user authentication and access control on top of GitLab, Kubernetes, and Slurm. Altogether, the infrastructure has enabled hundreds of successful publications at high-ranking venues across our research group.

  • Mining Reddit for Abstractive Summarization Ground Truth Data
    https://webis.de/data/webis-tldr-17.html

    An abstractive summary distills the core content of a source text using novel words and phrases not necessarily found in the source. This yields a shorter summary and more fluent reading experience than a mere selection of relevant phrases from the source could achieve. To learn automatic abstractive summarization, machine learning models require a large number of source-summary pairs as training data.

    I led an effort to mine more than four million human-written source-summary pairs from posts made to the Reddit platform, where users frequently summarize long messages as a courtesy to their readers, prefixing the summary with "TL;DR" ("too long; didn't read") or similar. To handle the scale of the input data (all Reddit posts ever made up to the year 2017), I leveraged technologies like Hadoop and Spark. The resulting Webis-TLDR-17 dataset formed the basis for a shared-task competition on abstractive summarization organized by a mixed team of researchers from industry and academia at INLG 2019. Our dataset was subsequently included in the Hugging Face and TensorFlow datasets libraries and has been cited in more than 40 publications so far.
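
    The core extraction step can be illustrated with a small self-contained sketch. The real pipeline ran on Hadoop/Spark over the full Reddit corpus and applied far more careful filtering; the regular expression below is a hypothetical approximation of the marker matching, not the production pattern.

    ```python
    # Simplified sketch of TL;DR pair extraction: split a post into
    # (content, summary) at an author-written "TL;DR" marker. The regex
    # is an illustrative approximation, not the production pattern.
    import re

    # Matches variants like "TL;DR:", "tldr", "TL DR," in any case.
    TLDR_MARKER = re.compile(r"\btl\s*;?\s*dr[:;,\s]*", re.IGNORECASE)

    def extract_pair(post: str):
        """Return (content, summary) if the post contains a TL;DR marker, else None."""
        match = TLDR_MARKER.search(post)
        if not match:
            return None
        content = post[:match.start()].strip()
        summary = post[match.end():].strip()
        # Keep only pairs where both sides are non-empty.
        if content and summary:
            return content, summary
        return None

    print(extract_pair("Long day at work, missed the bus, got soaked. TL;DR: bad day."))
    ```

    At corpus scale, the same function becomes the map step of a distributed job, followed by filters on length ratio, language, and summary quality.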

Skills

  • Languages

    Python 3, Java, C, Ada, JavaScript, Scala, SQL
  • Paradigms

    Data Science, DevOps
  • Other

    Machine Learning, Information Retrieval, Data Engineering, University Teaching, Technical Writing, IT Infrastructure, GPU Computing, Programming, Data Mining, Web Technologies, Statistics, Linear Algebra, Optimization, Data Analysis, Cloud Architecture, Natural Language Processing (NLP), Data Visualization, Text Mining, Big Data, Regular Expressions
  • Frameworks

    Spark, Hadoop
  • Libraries/APIs

    Pandas, Scikit-learn, PyTorch
  • Tools

    Git, LaTeX, SaltStack, Emacs, Jupyter, GitLab
  • Platforms

    Linux, Kubernetes, Docker, Visual Studio Code, Jupyter Notebook
  • Storage

    JSON, Ceph, PostgreSQL, On-premise

Education

  • PhD in Computer Science
    2013 - 2019
    Bauhaus-Universität Weimar - Weimar, Germany
  • Master's Degree in Computer Science
    2010 - 2013
    Bauhaus-Universität Weimar - Weimar, Germany
