Data Labeling:
- Process of adding labels or categories to training data.
- Challenges include time consumption and subjectivity.
- Methods include expert labeling, crowdsourcing, third-party services, and programmatic labeling.
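The subjectivity challenge is commonly handled in crowdsourcing by collecting several votes per item and aggregating them. The sketch below is a minimal illustration, assuming a simple majority vote; the function name `aggregate_labels` and the sample votes are hypothetical and not taken from any particular labeling platform.

```python
from collections import Counter

def aggregate_labels(votes):
    """Return (majority_label, agreement) for one item's annotator votes."""
    counts = Counter(votes)
    label, top = counts.most_common(1)[0]
    # Agreement is the share of annotators who chose the winning label.
    return label, top / len(votes)

# Hypothetical votes from three annotators per item.
crowd_votes = {
    "great phone, love it": ["positive", "positive", "positive"],
    "it is fine I guess": ["neutral", "positive", "neutral"],
}

for text, votes in crowd_votes.items():
    label, agreement = aggregate_labels(votes)
    print(f"{text!r}: {label} (agreement {agreement:.2f})")
```

Items with low agreement are natural candidates to route to an expert for review.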
Tokenization:
- Dividing text into smaller units called tokens.
- Techniques include word tokenization, character tokenization, and sub-word tokenization.
- Tokenizers are programs or algorithms that perform this task.
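As a rough illustration of the first two techniques, the sketch below implements word and character tokenization with the standard library only. The function names and the regular expression are illustrative choices, not a reference implementation; production tokenizers handle many more edge cases.

```python
import re

def word_tokenize(text):
    # A word is a run of letters, digits, or underscores;
    # each punctuation mark becomes its own token.
    return re.findall(r"\w+|[^\w\s]", text)

def char_tokenize(text):
    # Every character, including spaces, becomes a token.
    return list(text)

print(word_tokenize("Tokenizers split text."))
# ['Tokenizers', 'split', 'text', '.']
print(char_tokenize("text"))
# ['t', 'e', 'x', 't']
```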
Vectorization:
- Converting text data into numerical representations for machine learning.
- Techniques include Bag of Words, TF-IDF, and word embeddings.
- Each technique has its advantages and disadvantages in terms of information representation, context preservation, and efficiency.
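To make the first two techniques concrete, here is a minimal sketch of Bag of Words and TF-IDF on a toy corpus, using only the standard library. The corpus and helper names are made up for illustration, and the TF-IDF formula shown (term frequency times the unsmoothed log of N over document frequency) is one common variant; libraries often use smoothed versions.

```python
import math
from collections import Counter

# Toy corpus (hypothetical), tokenized by whitespace.
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs",
]
tokenized = [d.split() for d in docs]
vocab = sorted({t for doc in tokenized for t in doc})

def bag_of_words(tokens):
    # One count per vocabulary term; word order is discarded.
    counts = Counter(tokens)
    return [counts[t] for t in vocab]

def idf(term):
    # Terms that occur in fewer documents receive a higher weight.
    df = sum(1 for doc in tokenized if term in doc)
    return math.log(len(tokenized) / df)

def tf_idf(tokens):
    counts = Counter(tokens)
    return [(counts[t] / len(tokens)) * idf(t) for t in vocab]

bow = bag_of_words(tokenized[0])
weights = tf_idf(tokenized[0])
for term, count, weight in zip(vocab, bow, weights):
    if count:
        print(f"{term:>4}  count={count}  tf-idf={weight:.3f}")
```

In the first document, "the" has the highest raw count, but "cat" and "mat" receive higher TF-IDF weights because they appear in only one document.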
Quantitative Values and Names of Related Theories:
Data Labeling:
- Expert labeling: High accuracy; limited by expert availability and time.
- Crowdsourcing: Scales to large datasets; prone to inaccuracies and inconsistencies between annotators.
- Third-party service: Professional and accurate; expensive.
- Programmatic labeling (e.g., Snorkel): Scalable, low-cost, but requires time to build accurate logic.
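The idea behind programmatic labeling can be sketched with a few hand-written labeling functions whose votes are combined. The example below does not use the Snorkel API: the functions, label constants, and heuristics are hypothetical, and a plain majority vote stands in for the learned label model that Snorkel uses to combine labeling-function outputs.

```python
from collections import Counter

SPAM, HAM, ABSTAIN = 1, 0, -1

# Each labeling function encodes one heuristic and may abstain.
def lf_contains_link(text):
    return SPAM if "http" in text.lower() else ABSTAIN

def lf_mentions_prize(text):
    lowered = text.lower()
    return SPAM if "prize" in lowered or "winner" in lowered else ABSTAIN

def lf_short_reply(text):
    return HAM if len(text.split()) <= 3 else ABSTAIN

LABELING_FUNCTIONS = [lf_contains_link, lf_mentions_prize, lf_short_reply]

def weak_label(text):
    # Collect the non-abstaining votes and take the majority.
    votes = [lf(text) for lf in LABELING_FUNCTIONS]
    votes = [v for v in votes if v != ABSTAIN]
    if not votes:
        return ABSTAIN
    return Counter(votes).most_common(1)[0][0]

for message in [
    "You are a winner, claim your prize at http://example.com",
    "ok thanks",
    "see you at the meeting tomorrow morning",
]:
    print(weak_label(message), message)
```

The third message matches no heuristic and stays unlabeled, which shows why building logic with good coverage and accuracy takes time.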
Tokenization:
- Word tokenization: Easily understandable; doesn’t handle out-of-vocabulary words.
- Character tokenization: No need for vocabulary; loses word associations.
- Sub-word tokenization: Balances vocabulary size and coverage; rare or unseen words are split into known sub-word units.
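The out-of-vocabulary trade-off can be seen in a small sketch. The code below assumes a greedy longest-match-first split, a simplified version of the matching used by WordPiece-style tokenizers; both vocabularies are hand-picked for illustration, whereas real sub-word vocabularies are learned from data.

```python
def subword_tokenize(word, vocab, unk="<unk>"):
    """Greedy longest-match-first split of one word into vocabulary pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        # Shrink the candidate until it matches a vocabulary entry.
        while end > start and word[start:end] not in vocab:
            end -= 1
        if end == start:
            # No piece matches at this position: give up on the whole word.
            return [unk]
        pieces.append(word[start:end])
        start = end
    return pieces

# Hypothetical vocabularies.
word_vocab = {"token", "play"}
subword_vocab = {"token", "play", "ization", "ing", "s"}

for w in ["tokenization", "playing"]:
    word_level = w if w in word_vocab else "<unk>"
    print(w, "->", word_level, "|", subword_tokenize(w, subword_vocab))
```

With the word-level vocabulary both inputs collapse to `<unk>`, while the sub-word vocabulary covers them as `token` + `ization` and `play` + `ing`.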
Vectorization:
- Bag of Words: Simple; loses contextual information.
- TF-IDF: Evaluates word importance; doesn’t capture word relationships.
- Word embeddings: Capture word relationships; time-consuming to train, but pre-trained options are available for common languages.