Custom Encodings¶
Custom Encodings (Bring Your Own Vectors) is for advanced users who have their own encoders and want to use their models with Vi. This is especially important as users are expected to experiment heavily with their vectors (new SOTA models seem to be released every week!).
[2]:
vecdb_url = 'https://api.vctr.ai'
username = 'your_username'
api_key = 'your_api_key'
collection_name = 'your_collection_name'
How To Write Custom Encoders¶
The main requirement we have for encoders is that their outputs need to be JSON-serializable. We made this easy by implementing methods that check inputs prior to uploading, along with garbage collection and chunking for different data formats. This can be seen below.
Note: Vi tries to handle inputs at a document level. This means that functions are written with dictionaries as their main inputs.
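To make the requirement concrete, here is a minimal, purely illustrative sketch (the to_json_serializable helper is an assumption for this example and not part of Vi): before upload, a vector should end up as a plain Python list of floats rather than a framework tensor.
[ ]:
import numpy as np

def to_json_serializable(vector):
    """Illustrative helper: turn a numpy array or eager tensor into a list of floats."""
    return np.asarray(vector).astype(float).tolist()

to_json_serializable(np.random.rand(4))  # e.g. [0.12, 0.98, 0.45, 0.31]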
[3]:
from vectorai.models.base import ViText2Vec, ViAudio2Vec, ViImage2Vec
For example, we may want to use TensorFlow Hub models.
[4]:
texts = [
    {
        "text": "Vi is the ultimate database to upload vectors."
    },
    {
        "text": "The authors of Vi are both named Jacky."
    }
]
[5]:
%%capture
import tensorflow_hub as hub
model = hub.load("https://tfhub.dev/google/universal-sentence-encoder-large/5")
[6]:
from vectorai import ViClient
vi_client = ViClient(username, api_key)

# Here, we inherit from the base class so that the bulk-encoding methods come for free.
class USEEncoder(ViText2Vec):
    def encode_text(self, text):
        """Encode text at an item level if possible, otherwise encode the string directly.
        """
        return model(text)
Logged in. Welcome test. To view list of available collections, call list_collections() method.
[7]:
encoder = USEEncoder()
vectors = encoder.encode_text(["HI"])
[8]:
# This will automatically instantiate a bulk-encoding methodology.
vectors = encoder.bulk_encode_text(["Hi", "Aman!"])
vectors
[8]:
[<tf.Tensor: shape=(2, 512), dtype=float32, numpy=
array([[-0.00337365,  0.07936656, -0.06529631, ..., -0.04366795,
        -0.00061513, -0.03553966],
       [-0.02114094, -0.01992319,  0.03013013, ...,  0.05736907,
        -0.00945254, -0.000416  ]], dtype=float32)>]
[1]:
# Automatically bulk-encode items as well
vectors = encoder.bulk_encode_text(texts, text_input_field="text", vector_output_field="text_vector_")
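As a quick, assumed sanity check (the exact behaviour is implied by vector_output_field and the "Finished updating documents with additional field." message later in this notebook), each dictionary in texts should now also carry the new vector field:
[ ]:
# Assumed check: the documents are updated in place with a 'text_vector_' key.
print(texts[0].keys())
print(len(texts[0]['text_vector_']))  # 512 for universal-sentence-encoder-large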
Training On Transformers¶
We made encoding with transformers really, really easy. You can now access state-of-the-art NLP encoders with 1 line of code and tonnes of customisability.
The current supported models can be seen here: https://github.com/img-more/vecdb-python/blob/master/vectorai/models/transformer_models/transformer_models.py
You can now load the encodings from this model in 1 line of code and you can load custom pre-trained weights in the same function!
[10]:
from vectorai.models import Transformer2Vec
[11]:
encoder = Transformer2Vec('distilbert')
[2]:
vector = encoder.encode_text(text="Vectors are cool.")
[14]:
encoder.bulk_encode_text(texts=texts, text_input_field='text', vector_output_field='text_distilbert_vector_')
Finished updating documents with additional field.
We also added a generic way to fine-tune your encodings.
[15]:
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.losses import MeanSquaredError
from tensorflow.keras.metrics import MeanAbsoluteError
[16]:
x_fields = ['name', 'description']
y_fields = ['store_id']
optimizer = Adam()
[17]:
import json

with open('product.json', 'r') as infile:
    documents = json.load(infile)

# Select the first 300 documents
documents = documents[:300]
[21]:
encoder.run_finetuning_for_classification(
    documents=documents,
    x_fields=['name', 'description'],
    y_fields=['price'],
    optimizer=Adam(),
    loss=MeanSquaredError(),
    metric=[MeanAbsoluteError()])
Epoch 1/2
115/115 [==============================] - 62s 543ms/step - loss: 1858391.0000 - val_loss: 1567845.0000
Epoch 2/2
115/115 [==============================] - 62s 543ms/step - loss: 1608951.8750 - val_loss: 687381.9375
Saved model. This can be found in /home/jacky/.cache/transformers/vectorai-trained-distilbert-base-cased
You can then use the classification model by using the argument use_classification_model=True.
[3]:
vectors = encoder.encode_text(document=documents[0], document_fields=['name', 'description'],
                              use_classification_model=True)
Or you can load your own pre-trained model weights by setting classification_save_dir.
[4]:
vector = encoder.encode_text(document=documents[0], document_fields=['name', 'description'],
                             use_classification_model=True,
                             classification_save_dir='/home/jacky/.cache/transformers/vectorai-trained-distilbert-base-cased')
Voila!
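As a final, hedged sketch (not part of the original walkthrough; the cosine_similarity helper is purely illustrative), you can sanity-check the fine-tuned encodings by comparing two document vectors directly:
[ ]:
import numpy as np

def cosine_similarity(a, b):
    """Illustrative helper: cosine similarity between two vectors."""
    a = np.asarray(a, dtype=float).ravel()
    b = np.asarray(b, dtype=float).ravel()
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

vec_a = encoder.encode_text(document=documents[0], document_fields=['name', 'description'],
                            use_classification_model=True)
vec_b = encoder.encode_text(document=documents[1], document_fields=['name', 'description'],
                            use_classification_model=True)
cosine_similarity(vec_a, vec_b)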