Hi.
Thank you for publishing the model.
I have a problem, but I can't find where I went wrong.
I have the titles and the abstracts of some articles and want to get vector embeddings of these.
I used the code below, following your suggested usage.
While trying to process about 500 records in a single batch, I never got a result because my computer's 16 GB of RAM filled up.
So I split the data into chunks and got the embeddings quickly.
However, the vectors I get with a chunk size of 10 differ from the vectors I get with a chunk size of 20.
I'm probably using the tokenizer wrong.
If you have any ideas, I would be glad to hear them.
Best wishes.
Orhan
```python
import json

import numpy as np
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained('scincl')
model = AutoModel.from_pretrained('scincl')

csize = 20  # chunk (batch) size


def normalizer(x):
    # Normalize by the norm of the whole array.
    return np.array(x / np.linalg.norm(x))


def split_chunks(data):
    # Split the record list into chunks of csize records each.
    return [data[i:i + csize] for i in range(0, len(data), csize)]


def get_vectors(chunks):
    title_vecs = np.empty(shape=[0, 768])
    abstract_vecs = np.empty(shape=[0, 768])
    for chunk in chunks:
        titles = [d['title'] for d in chunk]
        abstracts = [d['abstract'] for d in chunk]

        inputs = tokenizer(titles, padding=True, truncation=True,
                           return_tensors="pt", max_length=512)
        result = model(**inputs)
        embed_t = result.last_hidden_state[:, 0, :]  # [CLS] token embedding per title
        title_vecs = np.append(title_vecs, normalizer(embed_t.detach().numpy()), axis=0)

        inputs = tokenizer(abstracts, padding=True, truncation=True,
                           return_tensors="pt", max_length=512)
        result = model(**inputs)
        embed_a = result.last_hidden_state[:, 0, :]  # [CLS] token embedding per abstract
        abstract_vecs = np.append(abstract_vecs, normalizer(embed_a.detach().numpy()), axis=0)
    return title_vecs, abstract_vecs


print("started")
with open('abstracts.json', "r") as f:
    data = json.load(f)
chunks = split_chunks(data)
title_vecs, abstract_vecs = get_vectors(chunks)
np.save(f'data_title_scincl_norm_{csize}.npy', title_vecs)
np.save(f'data_abstract_scincl_norm_{csize}.npy', abstract_vecs)
print("finished")
```
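One detail I noticed while writing this up: `np.linalg.norm` on a 2-D array with no `axis` argument returns a single Frobenius norm for the whole batch, so the scaling of each embedding depends on how many rows are in the chunk. A per-row variant (a minimal standalone sketch, not the code I actually ran) would look like this:

```python
import numpy as np

def normalize_rows(x):
    # Divide each row (one embedding) by its own L2 norm, so the
    # result does not depend on how many rows share the batch.
    return x / np.linalg.norm(x, axis=1, keepdims=True)

batch = np.array([[3.0, 4.0],
                  [6.0, 8.0]])
print(normalize_rows(batch))  # every row now has unit L2 norm
```

I am not sure whether this is the cause of the chunk-size difference I am seeing, but it is the only place in my code where the batch size enters the arithmetic.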