A data set funded by the Chan Zuckerberg Initiative shows how research software and tools are used across disciplines — and helps developers gain credit for their work.
Nature article highlighting the impact of the CZ Software Mentions Dataset, which I worked on at the Chan Zuckerberg Initiative. The dataset mines software mentions from the biomedical literature at scale.
Article Link Author: Matthew Hutson
Understanding software used by scientists by mining the biomedical literature
Medium blog post about one of the projects I worked on at the Chan Zuckerberg Initiative: mining software mentions from the biomedical literature.
Blog Post Link Co-Authors: Boris Veytsman, Donghui Li, Dario Taraborelli, Ivana Williams
Meet six trailblazing CZI women working in the technology field and learn how their careers started and how they are going now.
Quote: How It Started
For me, it started with math!
Ever since elementary school, math was something I’ve always been very good at. I liked it because it’s logical and the only subject I didn’t have to study for at home because I picked it up very easily in class. Of course, that’s also because I had an amazing math teacher.
As I grew up, I started developing a similar interest in science. Chemistry, in particular, gave me a logical framework to understand the world around me through molecule interactions. I participated in a number of science competitions, including national and international Science Olympiads. Because of these experiences, I’ve never thought of myself as pursuing anything other than a career in STEM.
...
Blog Post Link
Essential frontiers: open data & software citations, an automated ML approach
Abstract: Science is progressive, and every discovery, set of data, and publication builds on previous work. Today, it's impossible to put every new development in the context of what's gone before. Comprehensive open citations can enable both the attribution of scientific progress and the evaluation of research and its impacts. For citations to live up to their promise as a vehicle for the discovery, dissemination, and evaluation of all scholarly knowledge, the open citation frontier needs to expand beyond traditional bibliographic metadata into other essential scientific resources such as research data and software. We describe a new open corpus of dataset and software mentions in biomedical papers created by applying machine learning to full-text biomedical literature. We share the process of extracting mentions and transforming them into citations, as well as the opportunities and challenges that come with disambiguating and linking the mentions in an open dataset of this size.
Presentation Link (Zenodo) Joint Presentation with Jennifer Lin
Using transformer models for your own NLP task - building an NLP model End To End
Abstract: Transformer models have revolutionized the NLP field and are currently state-of-the-art on a variety of tasks, such as named entity recognition, language inference, or question answering. With new, more performant models being continuously developed (BERT, RoBERTa, ALBERT, ELECTRA, ERNIE, etc.), these models are ubiquitous in virtually all domains that make use of natural language processing. So how can you apply these models to your own task? In this talk, we will go over the process of using state-of-the-art transformer models for your own NLP task. We will discuss the entire pipeline: building a training corpus, developing an NLP model, and evaluating the model. We will offer an example of building a model to extract mentions of Experimental Methods and Datasets from full-text biomedical papers. Even though our example will focus on an NLP task for biomedical text, the framework can be applied to any domain.
....
Featured alongside some amazing women from CZI!
Quote: Ana-Maria went into a career in science because of the beauty she saw in trying to understand the universe and the technology helping to do that in an automated way. In college, she was intrigued by the ideas of randomness and probability, and ended up specializing in artificial intelligence. Now at CZI, she builds machine learning solutions to support the questions brought up by program areas. At the end of the day, Ana-Maria feels that it’s all about the joy she gets out of coming up with solutions to challenging problems.
Blog Post Link
BERT for Named Entity Recognition (NER) on specialized corpora.
Abstract: Transformer models have revolutionized the NLP field and are currently state-of-the-art on a variety of tasks, such as named entity recognition, language inference, or question answering. With new, more performant models being continuously developed (BERT, RoBERTa, ALBERT, ELECTRA, ERNIE, etc.), these models are ubiquitous in virtually all domains that make use of natural language processing. So how can you apply these models to a specific task at hand, especially when the distribution of the data is different from the one the models have been trained on? In this talk, we will go over the process of using transformer models for the Named Entity Recognition (NER) task on specialized corpora. We will offer a specific example of building a BERT-based Named Entity Recognition model for mining Experimental Methods and Datasets from full-text biomedical papers.
....
Deep Linking: Machine learning to connect up the PIDs
Abstract: Science is progressive, and every discovery, set of data, and publication builds on previous work. Today, it's impossible to put every new development in the context of what's gone before, especially if research outputs are largely invisible, scattered all over the web and disconnected from each other. Meta aims to remove this barrier to scientific progress with its graph of biomedical research connecting up PIDs across "people, places, things." We apply machine learning to the scientific literature as a way to retrieve more connections between these essential elements. During this session, we will share the work we've done and the lessons we're learning, and open up the remaining time for a group discussion on best practices, pitfalls, and areas of opportunity.
Presentation Link
Joint Presentation with Jennifer Lin, Alex Wade