scincl for sentence embedding #10

@orhansonmeztr

Description

Hi.
Thank you for publishing the model.
I have a problem, but I can't find where I went wrong.
I have the titles and abstracts of some articles and want to get vector embeddings for them.
I used the code below, following your usage suggestion.
Interestingly, while trying to process a dataset of about 500 records, I could not get a response because my computer's 16 GB of RAM filled up.
So I split the data into chunks and got the embeddings quickly.
But, for example, the vectors I get with a chunk size of 10 differ from the vectors I get with a chunk size of 20.
I'm probably using the tokenizer wrong.
If you have any ideas, I would be glad to hear them.
Best wishes,
Orhan
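One suspicion I tested with plain NumPy (made-up vectors, no model involved): dividing a stacked matrix by a single overall norm, as the `normalizer` in my script below does, makes each row depend on which other rows share the chunk, while per-row normalization does not.

```python
import numpy as np

rows = np.array([[3.0, 4.0], [6.0, 8.0], [0.0, 5.0]])

# Whole-matrix (Frobenius) norm: one scalar for the entire batch.
whole = rows / np.linalg.norm(rows)

# Per-row L2 norm: each row is scaled independently of its batch-mates.
per_row = rows / np.linalg.norm(rows, axis=1, keepdims=True)

# The same first row comes out differently once the batch shrinks ...
small = rows[:1] / np.linalg.norm(rows[:1])
print(np.allclose(whole[0], small[0]))  # False: depends on the batch

# ... while per-row normalization is stable across batch sizes.
small_row = rows[:1] / np.linalg.norm(rows[:1], axis=1, keepdims=True)
print(np.allclose(per_row[0], small_row[0]))  # True
```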

import json
import numpy as np
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained('scincl')
model = AutoModel.from_pretrained('scincl')
csize = 20

def normalizer(x):
    normalized_vector = x / np.linalg.norm(x)
    return np.array(normalized_vector)

def split_chunks(data):
    return [data[x:x + csize] for x in range(0, len(data), csize)]

def get_vectors(chunks):
    title_vecs = np.empty(shape=[0, 768])
    abstract_vecs = np.empty(shape=[0, 768])
    for chunk in chunks:
        title = [d['title'] for d in chunk]
        abstract = [d['abstract'] for d in chunk]

        inputs = tokenizer(title, padding=True, truncation=True, return_tensors="pt", max_length=512)
        result = model(**inputs)
        embedT = result.last_hidden_state[:, 0, :]
        title_vecs = np.append(title_vecs, normalizer(embedT.detach().numpy()), axis=0)

        inputs = tokenizer(abstract, padding=True, truncation=True, return_tensors="pt", max_length=512)
        result = model(**inputs)
        embedA = result.last_hidden_state[:, 0, :]
        abstract_vecs = np.append(abstract_vecs, normalizer(embedA.detach().numpy()), axis=0)

    return title_vecs, abstract_vecs

print("started")
with open('abstracts.json', 'r') as f:
    data = json.load(f)
chunks = split_chunks(data)
title_vecs, abstract_vecs = get_vectors(chunks)
np.save('data_title_scincl_norm_' + str(csize) + '.npy', title_vecs)
np.save('data_abstract_scincl_norm_' + str(csize) + '.npy', abstract_vecs)
print("finished")
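If it helps, this is the per-row normalizer I think I should have written instead (a sketch; the rest of the script is unchanged):

```python
import numpy as np

def normalizer(x):
    # Scale each row (one embedding per document) by its own L2 norm,
    # so the result does not depend on the other rows in the chunk.
    return x / np.linalg.norm(x, axis=1, keepdims=True)

batch = np.array([[3.0, 4.0], [6.0, 8.0]])
unit = normalizer(batch)
print(np.linalg.norm(unit, axis=1))  # each row now has norm 1.0
```

Separately, wrapping the `model(**inputs)` calls in a `torch.no_grad()` context should avoid building an autograd graph for the forward passes, which may be why the full 500-record run exhausted 16 GB of RAM.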
