Snorkel is an open-source framework for generating large amounts of labeled data using weak supervision techniques. Weak supervision allows you to label data with noisy or incomplete sources of supervision, such as heuristics, rules, or patterns.
Snorkel operates within the paradigm of weak supervision rather than traditional semi-supervised learning. Instead of relying on a large amount of hand-labeled data, the labeling process draws on rules that may be noisy, limited, or imprecise.
In Snorkel, users create labeling functions (LFs) that express heuristic or rule-based labeling strategies. These LFs might not be perfect, and the labels they generate can be noisy or conflict with one another. Snorkel’s label model then learns to denoise and combine these weak labels to produce more accurate and reliable labels for the training data.
While semi-supervised learning typically involves having a small amount of labeled data and a large amount of unlabeled data, Snorkel focuses on the weak supervision scenario, allowing users to leverage various sources of noisy or incomplete supervision to train machine learning models.
In summary, Snorkel is more aligned with the principles of weak supervision, where the emphasis is on handling noisy or imprecise labels generated by heuristic rules, rather than being strictly categorized as a semi-supervised learning framework.
In this section, we will explore the concept of weak supervision and how to generate labels using Snorkel.
Weak supervision
Weak supervision is a technique for generating large amounts of labeled data using noisy or incomplete sources of supervision. The idea is to use a set of LFs that generate noisy labels for each data point. These labels are then combined to generate a final label for each data point. The key advantage of weak supervision is that it allows you to generate labeled data quickly and at a low cost.
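To make the combining step concrete, here is a minimal sketch that merges LF votes by simple majority vote. This is a deliberately simple stand-in and not Snorkel's own method, which learns how much to trust each LF. The abstain value of -1 and the small label matrix are assumptions made for illustration.

```python
from collections import Counter

ABSTAIN = -1  # assumed convention: an LF with no opinion returns -1


def majority_vote(votes):
    """Combine one data point's LF votes into a single label.

    Abstentions are ignored; ties and all-abstain rows return ABSTAIN.
    """
    counts = Counter(v for v in votes if v != ABSTAIN)
    if not counts:
        return ABSTAIN
    ranked = counts.most_common(2)
    if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
        return ABSTAIN  # the two most common labels are tied
    return ranked[0][0]


# Each row holds the votes of three hypothetical LFs for one data point.
label_matrix = [
    [1, 1, 0],
    [0, ABSTAIN, 0],
    [ABSTAIN, ABSTAIN, ABSTAIN],
    [1, 0, ABSTAIN],
]
final_labels = [majority_vote(row) for row in label_matrix]
print(final_labels)
```

Majority vote treats every LF as equally reliable, which is why it leaves the tied and uncovered rows unlabeled instead of guessing.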
Snorkel provides a set of tools to create LFs, combine their outputs, and train a model on the generated labels. It uses a technique called data programming to combine the LF outputs and generate a final label for each data point.
An LF is a function that generates a noisy label for a data point. In Snorkel, an LF returns a discrete class label, or a special abstain value (-1) when the rule does not apply; it does not return continuous values. In the context of image classification, for example, an LF might output a label of 1 if the image appears to contain the object of interest, and 0 otherwise.
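As an illustration, the following sketch writes such an LF in plain Python, without the Snorkel library. The function name lf_red_object, the pixel thresholds, and the representation of an image as a list of (r, g, b) tuples are all hypothetical choices made for this example.

```python
ABSTAIN = -1   # assumed convention for "no opinion"
NEGATIVE = 0
POSITIVE = 1


def lf_red_object(pixels, min_fraction=0.2):
    """Vote POSITIVE if enough pixels are strongly red, otherwise NEGATIVE.

    `pixels` is a list of (r, g, b) tuples with values in 0..255.
    The thresholds are illustrative guesses, not tuned values.
    """
    if not pixels:
        return ABSTAIN  # nothing to inspect, so give no opinion
    red = sum(1 for r, g, b in pixels if r > 150 and r > 2 * max(g, b))
    return POSITIVE if red / len(pixels) >= min_fraction else NEGATIVE


# Two tiny hypothetical "images": one with a red region, one all green.
red_image = [(200, 30, 20)] * 3 + [(40, 120, 40)] * 7
green_image = [(40, 120, 40)] * 10

print(lf_red_object(red_image), lf_red_object(green_image), lf_red_object([]))
```

A rule like this will mislabel some images, for example a red background with no object in it. That is acceptable in weak supervision, because the output of any single LF is treated as a noisy vote and not as ground truth.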
LFs are created using heuristics, rules, or patterns. The key idea is to define a set of rules that capture the relevant information for each data point.
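A set of rules is usually applied to every data point to build a label matrix, with one row per data point and one column per LF. The sketch below shows this in plain Python under assumed inputs: each leaf is a dictionary with hypothetical green_fraction and brown_fraction features, and both rules and their thresholds are invented for illustration.

```python
ABSTAIN, HEALTHY, DISEASED = -1, 0, 1  # assumed label convention


def lf_mostly_green(leaf):
    # Hypothetical rule: a leaf that is almost entirely green looks healthy.
    return HEALTHY if leaf["green_fraction"] > 0.8 else ABSTAIN


def lf_brown_spots(leaf):
    # Hypothetical rule: a large brown area suggests disease.
    return DISEASED if leaf["brown_fraction"] > 0.3 else ABSTAIN


def apply_lfs(lfs, data):
    """Return a label matrix: one row per data point, one column per LF."""
    return [[lf(x) for lf in lfs] for x in data]


def coverage(label_matrix):
    """Fraction of data points on which each LF does not abstain."""
    n = len(label_matrix)
    n_lfs = len(label_matrix[0])
    return [sum(row[j] != ABSTAIN for row in label_matrix) / n for j in range(n_lfs)]


leaves = [
    {"green_fraction": 0.90, "brown_fraction": 0.05},
    {"green_fraction": 0.50, "brown_fraction": 0.40},
    {"green_fraction": 0.60, "brown_fraction": 0.10},
    {"green_fraction": 0.85, "brown_fraction": 0.35},
]
L = apply_lfs([lf_mostly_green, lf_brown_spots], leaves)
print(L)
print(coverage(L))
```

In this toy data the last leaf receives conflicting votes and the third receives none, which are the two situations the combining step has to handle.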
Now, let us see how to define rules and an LF for plant disease labeling, based on visually inspecting the color of the object in an image.