Thiago De Sousa Silveira, Developer in Beijing, China

Thiago De Sousa Silveira

Verified Expert in Engineering

Machine Learning Developer

Location
Beijing, China
Toptal Member Since
July 10, 2019

Currently, Thiago is working as a machine learning and algorithm engineer with a focus on sentiment analysis, natural language processing, text classification, and recommender systems. He has a master’s degree in computer science from Tsinghua University and over five years of experience developing with Python, specifically in machine learning, data processing, and data scraping.

Availability

Part-time

Preferred Environment

IntelliJ IDEA, PyCharm, Sublime Text, Anaconda, Git, macOS

The most amazing...

...code I've written was a semi-supervised sentiment analysis method for social media texts. The project was eventually published in a journal.

Work Experience

Machine Learning Engineer

2018 - PRESENT
Giance Technologies
  • Created aspect-based sentiment analysis methods (supervised and unsupervised) for polarity classification of social media posts and news sources (see the classifier sketch after this list).
  • Developed a neural network method for text classification focused on news categorization for multiple languages.
  • Constructed deep neural methods for aspect extraction. The tools were used together with the developed sentiment analysis method for social media content.
Technologies: Natural Language Processing (NLP), Generative Pre-trained Transformers (GPT), Deep Neural Networks
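
For illustration only, here is a minimal supervised polarity classifier in the spirit of the first item, using scikit-learn. The training texts, labels, and pipeline are made up for this sketch; the production system described above relied on deep neural models and aspect-level analysis.

```python
# Minimal sketch of a supervised polarity classifier for short social media
# texts. Illustrative toy data; not the production aspect-based system.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Toy training data: (text, polarity) pairs.
train_texts = [
    "the battery life is amazing",
    "battery dies after an hour, terrible",
    "screen quality is great",
    "the screen scratches too easily",
]
train_labels = ["positive", "negative", "positive", "negative"]

# TF-IDF over word n-grams feeds a linear classifier.
model = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2), min_df=1)),
    ("clf", LogisticRegression(max_iter=1000)),
])
model.fit(train_texts, train_labels)

print(model.predict(["the battery is terrible but the screen is great"]))
```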

Big Data Analyst

2017 - 2018
Alpha Lawyer
  • Developed a method for calculating document similarity on a large collection of text documents. As the documents were stored in HBase, we created a MapReduce job to calculate cosine similarity between all pairs of documents; duplicate or very similar documents were removed (see the sketch after this list).
  • Created a neural network model for topic segmentation. The documents followed a general topic structure, and a time-distributed bidirectional long short-term memory (LSTM) network was trained to find the sentences at which a document should be split.
  • Built an annotation tool called YEDDASeg for tagging topic segments in documents. The annotated documents were used to train the topic segmentation model.
Technologies: Deep Neural Networks, MapReduce, HBase, Hadoop
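
A single-machine sketch of the near-duplicate detection idea from the first item: pairwise cosine similarity over TF-IDF vectors with a hypothetical threshold. The real version ran as a MapReduce job over documents stored in HBase; the documents and cutoff below are invented for illustration.

```python
# Sketch: flag document pairs whose cosine similarity exceeds a threshold.
from itertools import combinations

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "contract between party a and party b regarding services",
    "contract between party a and party b regarding the services",
    "court ruling on intellectual property dispute",
]

tfidf = TfidfVectorizer().fit_transform(docs)
sims = cosine_similarity(tfidf)

THRESHOLD = 0.9  # hypothetical cutoff for "very similar" documents
for i, j in combinations(range(len(docs)), 2):
    if sims[i, j] >= THRESHOLD:
        print(f"documents {i} and {j} look like near-duplicates (sim={sims[i, j]:.2f})")
```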

Summer Intern

2013 - 2014
CSIRO
  • Built a crawler that gathers the map coordinates of a transportation station, given an address.
  • Developed a transportation simulator based on a random walk model for simulating layers of public transportation (a toy sketch follows this list).
Technologies: Python
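
A toy sketch of the random-walk idea behind the simulator: an agent hops between neighboring stations chosen uniformly at random. The station graph is hypothetical; the original simulator modeled layers of public transportation.

```python
# Random walk over a small, made-up station graph.
import random

stations = {
    "A": ["B", "C"],
    "B": ["A", "D"],
    "C": ["A", "D"],
    "D": ["B", "C"],
}

def random_walk(start: str, steps: int) -> list[str]:
    path = [start]
    for _ in range(steps):
        path.append(random.choice(stations[path[-1]]))
    return path

print(random_walk("A", steps=5))
```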

UNuSUAL | Unified Unexpectedness Evaluation Sentiment Analysis Tool

https://github.com/fhmourao/UNuSUAL
The source code for the publication "A Framework for Unexpectedness Evaluation in Recommendations."

• https://dl.acm.org/citation.cfm?id=3019760

Technologies: Python, Java

How Good is Your Recommender System? | A Survey on Evaluations in Recommendation

Recommender systems have become a handy tool in a large variety of domains. Researchers have been attempting to improve their algorithms to issue better predictions to users. However, one of the current challenges in the area is how to correctly evaluate the predictions generated by a recommender system.

In the context of offline evaluations, some traditional assessment concepts have been explored, such as accuracy, root-mean-square error, and P/N for top-k recommendations.

In recent years, more research has proposed new concepts such as novelty, diversity, and serendipity, which aim to better satisfy users’ requirements. Numerous definitions and metrics have been proposed in previous work.

Since no existing summary of recommendation evaluation combines traditional metrics with this recent progress, this paper surveys and organizes the primary research that defines these concepts and proposes metrics or strategies to evaluate recommender systems. The survey also establishes the relationships between the concepts, categorizes them according to their objectives, and suggests potential future topics on user satisfaction.
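
For concreteness, here is a minimal sketch of two offline metrics of the kind discussed above: root-mean-square error over predicted ratings and precision at k over a top-k list. These are standard textbook formulations on made-up data, not necessarily the exact variants covered in the survey.

```python
# RMSE for rating prediction and precision@k for top-k recommendation.
import math

def rmse(predicted: list[float], actual: list[float]) -> float:
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual))

def precision_at_k(recommended: list[str], relevant: set[str], k: int) -> float:
    top_k = recommended[:k]
    return sum(1 for item in top_k if item in relevant) / k

print(rmse([4.1, 3.0, 5.0], [4.0, 2.5, 4.5]))                  # rating prediction error
print(precision_at_k(["i1", "i2", "i3"], {"i1", "i3"}, k=3))   # 2 of 3 items relevant
```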

SACI: Sentiment Analysis by Collective Inspection on Social Media Content

Collective opinions observed in social media represent valuable data for a range of applications, yet current methods require prior knowledge of each individual opinion to determine the collective one.

We assumed that a better collective analysis could be obtained by exploiting overlaps among distinct posts of the collection, so we proposed SACI, sentiment analysis by collective inspection: a lexicon-based unsupervised method that extracts the collective sentiment without individual classifications. SACI is based on a directed transition graph among the terms of a post set and a prior classification of these terms regarding their roles in consolidating opinions. Paths in the graph represent subsets of posts, and the collective opinion is defined by traversing all of these paths.

We demonstrated that collective analysis outperforms individual analysis at approximating the opinion of a collection. Assessments of SACI also show that accurate individual classifications do not guarantee reliable aggregate analyses, and vice versa. Further, SACI simultaneously fulfills the requirements of efficacy, efficiency, and handling of dynamicity posed by highly demanding scenarios. A SACI-based web tool for real-time analysis of tweets demonstrates its usefulness.
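
A heavily simplified sketch of the underlying data structure: a directed transition graph over the terms of a small, made-up post collection, with a toy polarity lexicon aggregated along its edges. SACI's actual traversal and term-role classification are more involved; this only illustrates the idea.

```python
# Build a term transition graph from posts and aggregate lexicon polarity.
from collections import defaultdict

posts = [
    "battery life is great",
    "battery life is awful",
    "great screen",
]
lexicon = {"great": 1, "awful": -1}  # toy polarity lexicon

# Directed transition graph: edge (u -> v) for consecutive terms in a post.
graph = defaultdict(set)
for post in posts:
    terms = post.split()
    for u, v in zip(terms, terms[1:]):
        graph[u].add(v)

# Walk every edge once and sum the polarities of the terms it connects.
score = sum(lexicon.get(u, 0) + lexicon.get(v, 0)
            for u, edges in graph.items() for v in edges)
print("collective polarity:",
      "positive" if score > 0 else "negative" if score < 0 else "neutral")
```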

LEGi: Context-aware Lexicon Consolidation by Graph Inspection

The value of subjective content available in social media has boosted the importance of sentiment analysis in this kind of scenario. Performing sentiment analysis on social media is challenging: the vast volume of short textual posts and their high dynamicity demand efficiency and scalability.

Despite all efforts, the literature still lacks proposals that address both requirements. In this sense, we propose LEGi, a corpus-based method for consolidating context-aware sentiment lexicons. It is based on a semi-supervised strategy for propagating lexicon-semantic classes over a transition graph of terms.

Empirical analyses on two distinct domains derived from Twitter demonstrate that LEGi outperforms four well-established methods for lexicon consolidation. Further, we found that LEGi's lexicons may improve the quality of the sentiment analysis performed by a traditional approach from the literature. Thus, our results point to LEGi as a promising method for consolidating lexicons in highly demanding scenarios, such as social media.
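
A toy sketch of semi-supervised propagation over a term transition graph: a few seed terms carry a polarity, and unlabeled terms repeatedly average their neighbors' scores. The graph, seeds, and update rule are illustrative assumptions, not LEGi's actual construction.

```python
# Simple iterative score propagation over a small term graph.
graph = {
    "great": {"screen", "battery"},
    "awful": {"battery"},
    "screen": {"great"},
    "battery": {"great", "awful"},
}
scores = {"great": 1.0, "awful": -1.0}  # seed lexicon
seeds = set(scores)

for _ in range(10):  # fixed number of propagation rounds
    for term, neighbours in graph.items():
        if term in seeds:
            continue  # keep seed polarities fixed
        neighbour_scores = [scores[n] for n in neighbours if n in scores]
        if neighbour_scores:
            scores[term] = sum(neighbour_scores) / len(neighbour_scores)

print(scores)  # "screen" ends up positive, "battery" neutral
```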

FAiR: A Framework for Analyses and Evaluations on Recommender Systems

Recommender systems (RS) have become essential tools in eCommerce applications, helping users in the decision-making process. Their evaluation, however, is a significant point of divergence, since there is no consensus regarding which metrics are necessary to consolidate new recommender systems.

For this reason, distinct frameworks have been developed to ease the deployment of recommender systems in research and production environments. In the present work, we performed an extensive study of the most popular evaluation metrics, organizing them into three groups: effectiveness-based metrics, parallel dimensions of quality, and domain profiling. Further, we consolidated a framework named FAiR to help researchers evaluate their recommender systems using these metrics and identify the characteristics of data collections that may intrinsically affect an RS's performance. FAiR is compatible with the output format of the main existing RS libraries.

Combining Data Mining Techniques to Enhance Cardiac Arrhythmia Detection

Detection of cardiac arrhythmia (CA) is performed using the clinical analysis of the electrocardiogram (ECG) of a patient to prevent cardiovascular diseases.

Machine learning algorithms have been presented as promising tools to aid CA diagnosis, with emphasis on those related to automatic classification. However, these algorithms suffer from two traditional classification problems: (1) an excessive number of numerical attributes generated from the decomposition of an ECG, and (2) the number of patients diagnosed with CAs is much lower than the number classified as “normal,” leading to very unbalanced datasets.

In this paper, we combined several data mining techniques in a coordinated way, such as clustering, feature selection, oversampling strategies, and automatic classification algorithms, to create more efficient classification models for identifying the disease. In our evaluations on a traditional dataset provided by the UCI repository, we significantly improved the effectiveness of the random forest classification algorithm, achieving an accuracy of over 88%, higher than the best result previously reported in the literature.
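
A simplified sketch of the pipeline idea, combining feature selection with a random forest on an imbalanced dataset. The synthetic data and class weighting below are stand-ins for the UCI arrhythmia data, clustering, and oversampling used in the published work.

```python
# Feature selection + random forest on a synthetic imbalanced dataset.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Synthetic imbalanced data mimicking many numeric ECG-derived attributes.
X, y = make_classification(n_samples=1000, n_features=100, n_informative=20,
                           weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

model = Pipeline([
    ("select", SelectKBest(f_classif, k=30)),  # keep the 30 strongest features
    ("forest", RandomForestClassifier(class_weight="balanced", random_state=0)),
])
model.fit(X_train, y_train)
print("test accuracy:", model.score(X_test, y_test))
```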

A Framework for Unexpectedness Evaluation in Recommendations

Nowadays, assessing the usefulness of recommender systems (RS) is a significant research challenge. Due to its close relation to the notion of value, unexpectedness has become the focus of several works. However, there is no consensus in the literature about how to measure it.

In this context, this work implements the most referenced metrics, consolidating a framework for unexpectedness assessment in recommendation that allows us to characterize, compare, and combine all of those metrics.

Empirical evaluations on real data and different recommender systems demonstrated the framework's usefulness. Besides showing that the existing metrics diverge on which recommender system performs best, the framework enabled combining all of the metrics so that we could capture different perspectives.

We aimed to help researchers and practitioners working with recommender systems understand the actual impact of the distinct unexpectedness metrics, as well as how to select the proper metric to highlight gains or losses.
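
As an illustration, here is one common formulation of unexpectedness: the fraction of recommended items that fall outside a set of "expected" items produced by a simple baseline. This is a generic sketch on invented data, not necessarily one of the exact metrics implemented by the framework.

```python
# Unexpectedness as the share of recommendations outside the expected set.
def unexpectedness(recommended: list[str], expected: set[str]) -> float:
    if not recommended:
        return 0.0
    return sum(1 for item in recommended if item not in expected) / len(recommended)

# Example: a popularity baseline "expects" the blockbuster items.
expected_items = {"matrix", "titanic", "avatar"}
recommendations = ["matrix", "primer", "pi", "titanic"]
print(unexpectedness(recommendations, expected_items))  # 0.5
```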

Using Aspect-based Analysis for Explainable Sentiment Predictions

http://tcci.ccf.org.cn/conference/2019/papers/XAI98.pdf
Aiming for the development of more explanatory systems, we argue that aspect-based analysis can help derive a deeper interpretation of the sentiment predicted by a document-level analysis, working as a proxy method.

We propose a framework to verify whether predictions produced by a trained aspect-based model can be used to explain document-level sentiment classifications by calculating an agreement metric between the two models.

In our case study with two benchmark datasets, we achieved 90% agreement between the models, showing that aspect-based analysis should be favored for the sake of explainability.
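
A minimal sketch of the agreement idea: aggregate aspect-level polarities per document and report the fraction of documents on which the aggregate matches the document-level prediction. The labels and the simple majority-style aggregation rule are assumptions for illustration, not the exact setup of the paper.

```python
# Agreement between document-level and aggregated aspect-level predictions.
def aggregate_aspects(aspect_polarities: list[str]) -> str:
    score = sum(1 if p == "positive" else -1 for p in aspect_polarities)
    return "positive" if score >= 0 else "negative"

doc_predictions = ["positive", "negative", "positive"]
aspect_predictions = [
    ["positive", "positive"],               # agrees with document 1
    ["negative", "positive", "negative"],   # agrees with document 2
    ["negative"],                           # disagrees with document 3
]

agreement = sum(
    doc == aggregate_aspects(aspects)
    for doc, aspects in zip(doc_predictions, aspect_predictions)
) / len(doc_predictions)
print(f"agreement: {agreement:.0%}")  # 67% in this toy example
```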

YEDDASeg | Text Segmentation Annotation Tool

https://github.com/ThiagoSousa/YEEDASeg
I developed this tool for the manual annotation of topics in textual data. It can be used for any language and any style of text.

Text segmentation, in this case, is concerned with finding sequential blocks of text that are semantically close. The resulting annotations can later be used to train automatic text segmentation models, such as paragraph segmentation.

Languages

Python, Java, AspectJ

Libraries/APIs

Scikit-learn, Keras, TensorFlow

Tools

VADER Sentiment Analysis, PyCharm, Sublime Text, Git, IntelliJ IDEA

Platforms

Jupyter Notebook, macOS, Anaconda, Apache Kafka

Storage

MySQL, MongoDB, NoSQL, HBase, Elasticsearch

Other

Machine Learning, Sentiment Analysis, Classification Algorithms, Clustering Algorithms, Text Classification, Custom BERT, Natural Language Processing (NLP), Recommendation Systems, Generative Pre-trained Transformers (GPT), Artificial Intelligence (AI), Deep Neural Networks, Annotation Processors

Frameworks

Flask, Hadoop

Paradigms

MapReduce, Data Science

2016 - 2018

Master's Degree in Computer Science

Tsinghua University - Beijing, China

2011 - 2016

Bachelor's Degree in Computer Science

Universidade Federal de São João Del Rei - São João Del Rei, Minas Gerais, Brazil

NOVEMBER 2015 - PRESENT

IELTS Academic

British Council
