How to Combine LLM Embeddings + TF-IDF + Metadata in One Scikit-learn Pipeline

By Emirates Insight | March 7, 2026

In this article, you will learn how to fuse dense LLM sentence embeddings, sparse TF-IDF features, and structured metadata into a single scikit-learn pipeline for text classification.

Topics we will cover include:

  • Loading and preparing a text dataset alongside synthetic metadata features.
  • Building parallel feature pipelines for TF-IDF, LLM embeddings, and numeric metadata.
  • Fusing all feature branches with ColumnTransformer and training an end-to-end classifier.

Let’s break it down.

[Figure: How to Combine LLM Embeddings + TF-IDF + Metadata in One Scikit-learn Pipeline. Image by Editor]

Introduction

Data fusion, the practice of combining diverse kinds of data in a single pipeline, sounds ambitious enough with two sources. With three complementary feature sources, both the challenge and the potential payoff grow. The good news is that scikit-learn lets us unify all of them cleanly within a single, end-to-end workflow. This article walks you step by step through building a complete fusion pipeline from scratch for a downstream text classification task, combining dense semantic information from LLM-generated embeddings, sparse lexical features from TF-IDF, and structured metadata signals.

Step-by-Step Pipeline Building Process

First, we will make all the necessary imports for the pipeline-building process. If you are working in a local environment, you might need to pip install some of them first:

import numpy as np
import pandas as pd

from sklearn.datasets import fetch_20newsgroups
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.decomposition import TruncatedSVD

from sentence_transformers import SentenceTransformer

Let’s look closely at this long list of imports. One element has probably caught your attention: fetch_20newsgroups. It loads the 20 Newsgroups dataset, a freely available text collection that scikit-learn downloads on first use and that we will use throughout this article. It contains posts from Usenet newsgroups covering a wide variety of topics.

To keep our dataset manageable in practice, we will keep only the posts belonging to a subset of categories that we specify. The following code does the trick:

categories = [
    "rec.sport.baseball",
    "sci.space",
    "comp.graphics",
    "talk.politics.misc"
]

dataset = fetch_20newsgroups(
    subset="all",
    categories=categories,
    remove=("headers", "footers", "quotes")
)

X_raw = dataset.data
y = dataset.target

print(f"Number of samples: {len(X_raw)}")

We called this freshly created dataset X_raw to emphasize that it is a raw, far-from-final version of the dataset we will gradually construct for downstream predictive modeling. The name also reflects that it holds the raw text from which three different data components (or streams) will be generated and later merged.

In real-world contexts, structured metadata associated with the documents might already be available or provided by the dataset owner. That’s not the case with this publicly available dataset, so we will synthetically create some simple metadata features based on the text: character length, word count, average word length, uppercase ratio, and digit ratio.

def generate_metadata(texts):
    lengths = [len(t) for t in texts]
    word_counts = [len(t.split()) for t in texts]

    avg_word_lengths = []
    uppercase_ratios = []
    digit_ratios = []

    for t in texts:
        words = t.split()
        if words:
            avg_word_lengths.append(np.mean([len(w) for w in words]))
        else:
            avg_word_lengths.append(0)

        denom = max(len(t), 1)

        uppercase_ratios.append(
            sum(1 for c in t if c.isupper()) / denom
        )

        digit_ratios.append(
            sum(1 for c in t if c.isdigit()) / denom
        )

    return pd.DataFrame({
        "text": texts,
        "char_length": lengths,
        "word_count": word_counts,
        "avg_word_length": avg_word_lengths,
        "uppercase_ratio": uppercase_ratios,
        "digit_ratio": digit_ratios
    })

# Calling the function to generate a structured dataset that contains: raw text + metadata
df = generate_metadata(X_raw)
df["target"] = y

df.head()
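To make the five features concrete, here is a minimal sketch that computes them for a single string. The helper name text_stats and the sample sentence are made up for illustration and are not part of the pipeline; the logic mirrors generate_metadata, including the guard against dividing by zero, which matters because a post can end up empty once headers, footers, and quotes are stripped.

```python
def text_stats(text):
    """Hypothetical helper: the five metadata features for one string."""
    words = text.split()
    denom = max(len(text), 1)  # avoid division by zero on empty strings
    return {
        "char_length": len(text),
        "word_count": len(words),
        "avg_word_length": sum(len(w) for w in words) / len(words) if words else 0,
        "uppercase_ratio": sum(c.isupper() for c in text) / denom,
        "digit_ratio": sum(c.isdigit() for c in text) / denom,
    }


stats = text_stats("NASA launched 2 probes")
empty_stats = text_stats("")

print(stats)        # 22 characters, 4 words, average word length 4.75
print(empty_stats)  # every feature is 0 for an empty string
```

The sample has 22 characters, of which 4 are uppercase and 1 is a digit, so the two ratios are 4/22 and 1/22.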

Before getting fully into the pipeline-building process, we will split the data into train and test subsets:

X = df.drop(columns=["target"])
y = df["target"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
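The stratify=y argument keeps the class proportions the same in both subsets. The sketch below illustrates this on toy labels invented for the example (80 samples of one class, 20 of another); it is not the article's dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy labels (assumption for illustration): 80 samples of class 0, 20 of class 1.
y_toy = np.array([0] * 80 + [1] * 20)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

train_share = y_tr.mean()  # fraction of class 1 in the training split
test_share = y_te.mean()   # fraction of class 1 in the test split
print(train_share, test_share)
```

Both shares come out at 0.2, matching the 20% share of class 1 in the full label array.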

Very important: the data must be split into training and test sets before any features are extracted. Why? Because the feature extraction steps become part of the pipeline, and several of them are fitted with scikit-learn, which is a learning process: the TF-IDF vectorizer learns its vocabulary and inverse document frequency (IDF) statistics, the SVD step learns its components, and the scaler learns per-feature means and variances. The pre-trained embedding model is the exception, since it learns nothing from our data, but it follows the same fit/transform interface. The scikit-learn logic is as follows: every transformation must be fitted (that is, learn its parameters) only on the training data and then applied to the test data using what it learned. This way, no information from the test set will influence or bias feature construction or downstream model training.
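The fit-on-train, transform-on-test rule can be seen in a small sketch with TfidfVectorizer. The three sentences are toy data invented for the example: words that appear only in the test document never enter the learned vocabulary.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy documents (assumption for illustration).
train_docs = ["the pitcher threw a fastball", "the shuttle reached orbit"]
test_docs = ["the rover reached mars"]

vectorizer = TfidfVectorizer()
X_train_tfidf = vectorizer.fit_transform(train_docs)  # learns vocabulary + IDF
X_test_tfidf = vectorizer.transform(test_docs)        # reuses what was learned

vocab = vectorizer.vocabulary_
print(sorted(vocab))
print(X_test_tfidf.shape)
```

The test matrix has the same number of columns as the training matrix, and "rover" and "mars" are simply ignored because the vectorizer never saw them during fitting.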

Now comes a key stage: defining a class that encapsulates a pre-trained sentence transformer (a language model like all-MiniLM-L6-v2 capable of generating text embeddings from raw text) to produce our custom LLM embeddings.

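class EmbeddingTransformer(BaseEstimator, TransformerMixin):
    """Wrap a pre-trained sentence transformer as a scikit-learn transformer."""

    def __init__(self, model_name="all-MiniLM-L6-v2"):
        # Only store constructor parameters here (scikit-learn convention)
        self.model_name = model_name

    def fit(self, X, y=None):
        # Nothing is learned from X; the pre-trained model is simply loaded
        self.model_ = SentenceTransformer(self.model_name)
        return self

    def transform(self, X):
        # ColumnTransformer passes the "text" column as a pandas Series
        embeddings = self.model_.encode(
            list(X),
            show_progress_bar=False
        )
        return np.asarray(embeddings)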
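Because loading a sentence transformer is slow, it can help to test the rest of the pipeline with a cheap stand-in that follows the same fit/transform contract. The class below, FakeEmbeddingTransformer, is a hypothetical placeholder invented for this sketch: it produces deterministic token-bucket counts, not semantic embeddings, and should only be used for wiring tests.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin, clone


class FakeEmbeddingTransformer(BaseEstimator, TransformerMixin):
    """Hypothetical stand-in: maps each text to a fixed-size numeric vector."""

    def __init__(self, dim=8):
        self.dim = dim

    def fit(self, X, y=None):
        # Nothing to learn; record the output width like a fitted attribute.
        self.n_features_out_ = self.dim
        return self

    def transform(self, X):
        rows = []
        for text in list(X):
            vec = np.zeros(self.dim)
            for token in str(text).lower().split():
                # Deterministic bucket: sum of character codes modulo dim.
                vec[sum(map(ord, token)) % self.dim] += 1.0
            rows.append(vec)
        return np.vstack(rows)


texts = pd.Series(["orbit launch", "home run"])
emb = FakeEmbeddingTransformer(dim=8).fit(texts).transform(texts)
cloned = clone(FakeEmbeddingTransformer(dim=4))

print(emb.shape)
```

The output is a dense array with one row per document and dim columns, which is the same shape contract the real embedding branch has to satisfy. Since the constructor stores only its parameters, clone can copy the transformer, as scikit-learn does internally during cross-validation.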

Now we build the three main data branches (or parallel pipelines) we are interested in, one by one. First comes the pipeline for TF-IDF feature extraction, in which scikit-learn’s TfidfVectorizer extracts up to 5,000 sparse lexical features and TruncatedSVD then compresses them into 300 dense components:

tfidf_pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(max_features=5000)),
    ("svd", TruncatedSVD(n_components=300, random_state=42))
])
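To see what this branch outputs, the following sketch runs the same two steps on a toy corpus. The six sentences and n_components=2 are assumptions chosen so the example runs instantly; the real branch uses 300 components, which requires a vocabulary larger than that.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline

# Toy corpus (assumption for illustration).
docs = [
    "the pitcher threw a fast ball",
    "the batter hit a home run",
    "the rocket reached low orbit",
    "the probe left earth orbit",
    "the shader renders each pixel",
    "the polygon mesh renders fast",
]

toy_tfidf_pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("svd", TruncatedSVD(n_components=2, random_state=42)),
])

reduced = toy_tfidf_pipeline.fit_transform(docs)
n_terms = len(toy_tfidf_pipeline.named_steps["tfidf"].vocabulary_)

print(n_terms, reduced.shape)
```

The vectorizer yields a sparse matrix with one column per vocabulary term, and TruncatedSVD turns it into a dense array with n_components columns, so the branch contributes a fixed number of features regardless of vocabulary size.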

Next comes the LLM embeddings pipeline, aided by the custom class we defined earlier:

embedding_pipeline = Pipeline([
    ("embed", EmbeddingTransformer())
])

Last, we define the branch pipeline for the metadata features, in which we aim to standardize these attributes due to their disparate ranges:

metadata_features = [
    "char_length",
    "word_count",
    "avg_word_length",
    "uppercase_ratio",
    "digit_ratio"
]

metadata_pipeline = Pipeline([
    ("scaler", StandardScaler())
])
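The effect of standardization on features with very different ranges can be checked with a short sketch. The numbers below are invented for illustration: one column resembles a character length in the thousands, the other a ratio below 1.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy metadata (assumption for illustration): [char_length, uppercase_ratio].
meta = np.array([
    [1200.0, 0.02],
    [300.0, 0.10],
    [4500.0, 0.05],
])

scaled = StandardScaler().fit_transform(meta)

print(scaled.mean(axis=0))  # approximately 0 for each column
print(scaled.std(axis=0))   # approximately 1 for each column
```

After scaling, both columns have zero mean and unit variance, so neither dominates the logistic regression simply because of its units.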

Now we have three parallel pipelines, but nothing to connect them, at least not yet. Here comes the component that orchestrates the fusion of all three data branches: a ColumnTransformer, a versatile scikit-learn class that applies a different transformer to each group of columns and concatenates the results side by side.

preprocessor = ColumnTransformer(
    transformers=[
        ("tfidf", tfidf_pipeline, "text"),
        ("embedding", embedding_pipeline, "text"),
        ("metadata", metadata_pipeline, metadata_features),
    ],
    remainder="drop"
)
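A reduced, two-branch version of this fusion step shows how the output width is the sum of the branch widths. The toy DataFrame is an assumption for illustration, and the embedding branch is left out so the sketch runs without downloading a model.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import StandardScaler

# Toy frame (assumption for illustration) with one text and two numeric columns.
toy = pd.DataFrame({
    "text": ["rocket orbit", "home run", "pixel shader"],
    "char_length": [12, 8, 12],
    "word_count": [2, 2, 2],
})

fusion = ColumnTransformer(
    transformers=[
        # A string selector passes a 1D Series, which text vectorizers expect.
        ("tfidf", TfidfVectorizer(), "text"),
        # A list selector passes a 2D frame, which scalers expect.
        ("metadata", StandardScaler(), ["char_length", "word_count"]),
    ],
    remainder="drop",
)

fused = fusion.fit_transform(toy)
n_tfidf = len(fusion.named_transformers_["tfidf"].vocabulary_)

print(n_tfidf, fused.shape)
```

Note the difference between the selector "text" and the list of metadata column names: the former hands the transformer a one-dimensional column of strings, while the latter hands it a two-dimensional table. Writing ["text"] instead would pass a DataFrame to the vectorizer, which is not what it expects.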

And the icing on the cake: a full, end-to-end pipeline that combines the fusion step with an example of a downstream machine learning task. In particular, here’s how to chain the data fusion preprocessor we have just built with a logistic regression classifier that predicts the newsgroup category:

full_pipeline = Pipeline([
    ("features", preprocessor),
    ("clf", LogisticRegression(max_iter=2000))
])

The following instruction does all the heavy lifting we have been designing so far: preprocessing, fusion, and model training. The LLM embeddings step may take a few minutes (especially if the model needs to be downloaded), so be patient:

full_pipeline.fit(X_train, y_train)

To finalize, we can make predictions on the test set and see how our fusion-driven classifier performs.

y_pred = full_pipeline.predict(X_test)

print(classification_report(y_test, y_pred, target_names=dataset.target_names))
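One practical consequence of this design is that new documents must arrive in the same tabular shape as the training data, with the text column and the metadata columns already computed. The sketch below illustrates the idea on a reduced pipeline without the embedding branch; the helper to_frame, the toy sentences, and the labels are all assumptions made for the example.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


def to_frame(texts):
    """Hypothetical helper: build the text + metadata columns the pipeline expects."""
    return pd.DataFrame({
        "text": texts,
        "char_length": [len(t) for t in texts],
        "word_count": [len(t.split()) for t in texts],
    })


train_texts = [
    "the rocket reached orbit", "the probe left earth orbit",
    "the batter hit a home run", "the pitcher threw a strike",
]
train_labels = [1, 1, 0, 0]  # toy labels: 1 = space, 0 = baseball

toy_pipeline = Pipeline([
    ("features", ColumnTransformer([
        ("tfidf", TfidfVectorizer(), "text"),
        ("metadata", StandardScaler(), ["char_length", "word_count"]),
    ])),
    ("clf", LogisticRegression(max_iter=2000)),
])
toy_pipeline.fit(to_frame(train_texts), train_labels)

# A new document goes through the same helper before prediction.
new_docs = to_frame(["a satellite entered orbit"])
prediction = toy_pipeline.predict(new_docs)
probabilities = toy_pipeline.predict_proba(new_docs)

print(prediction, probabilities)
```

In the article's setting, the equivalent step would be calling generate_metadata on the new texts and passing the resulting DataFrame to full_pipeline.predict.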

And for a visual wrap-up, here’s what the entire pipeline looks like:

[Figure: Text data fusion pipeline with scikit-learn]

Wrapping Up

This article guided you through the process of building an entire machine learning-oriented workflow that focuses on the fusion of several information sources derived from raw text data, so that everything can be put together in downstream predictive tasks like text classification. We have seen how scikit-learn provides a set of useful classes and methods to make the process easier and more intuitive.
