Vector Analytics

Vector analytics is useful when you are interested in determining the quality of your vectors.

Once we have our multi-dimensional vectors (128D, 256D, 768D, etc.), we want a way to visually judge how good our encodings are, to spot any anomalies in our data, and to decide whether to try a different model. Vi supports visualising your vectors in 2D using high-level libraries such as Plotly.

Vector analytics includes clustering the vectors, reducing their dimensionality, and then visualising the result.

[1]:
# %pip install -e ..
[2]:
%load_ext autoreload
%autoreload 2
[3]:
from vectorai import ViClient
import os
username = os.environ['VI_USERNAME']
api_key = os.environ['VI_API_KEY']
collection_name = 'nlp-qa'
[53]:
collection_name = 'nlp-qa'
vi_client = ViClient(username, api_key)

Clustering

Quickstart using the Vi client. Download the Google QUEST Kaggle data from https://www.kaggle.com/c/google-quest-challenge/data. Once you have downloaded the data, move train.csv into your local directory.

[60]:
import pandas as pd
data = pd.read_csv('train.csv')
data.head(2)
[60]:
qa_id question_title question_body question_user_name question_user_page answer answer_user_name answer_user_page url category ... question_well_written answer_helpful answer_level_of_information answer_plausible answer_relevance answer_satisfaction answer_type_instructions answer_type_procedure answer_type_reason_explanation answer_well_written
0 0 What am I losing when using extension tubes in... After playing around with macro photography on... ysap https://photo.stackexchange.com/users/1024 I just got extension tubes, so here's the skin... rfusca https://photo.stackexchange.com/users/1917 http://photo.stackexchange.com/questions/9169/... LIFE_ARTS ... 1.000000 1.000000 0.666667 1.000000 1.000000 0.800000 1.0 0.0 0.000000 1.000000
1 1 What is the distinction between a city and a s... I am trying to understand what kinds of places... russellpierce https://rpg.stackexchange.com/users/8774 It might be helpful to look into the definitio... Erik Schmidt https://rpg.stackexchange.com/users/1871 http://rpg.stackexchange.com/questions/47820/w... CULTURE ... 0.888889 0.888889 0.555556 0.888889 0.888889 0.666667 0.0 0.0 0.666667 0.888889

2 rows × 41 columns

Setup ~ Insert documents

[6]:
# Convert the DataFrame into a list of documents (one dict per row).
documents = data.to_dict(orient='records')

# Run encoding
from vectorai.models.transformer_models import Transformer2Vec
encoder = Transformer2Vec('distilbert')

# Quick test to see if this works on 1 document
# vector = encoder.encode(document=documents[0], document_fields=['question_title', 'question_body'])

vectors = encoder.bulk_encode(
    documents,
    document_fields=['question_title', 'question_body'],
    vector_output_field='question_vector_')
Some weights of the model checkpoint at distilbert-base-cased were not used when initializing TFDistilBertModel: ['vocab_transform', 'vocab_projector', 'vocab_layer_norm', 'activation_13']
- This IS expected if you are initializing TFDistilBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPretraining model).
- This IS NOT expected if you are initializing TFDistilBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
All the weights of TFDistilBertModel were initialized from the model checkpoint at distilbert-base-cased.
If your task is similar to the task the model of the ckeckpoint was trained on, you can already use TFDistilBertModel for predictions without further training.
Finished updating documents with additional field.
[7]:
vi_client.insert_documents(collection_name=collection_name, documents=documents, chunksize=50)

[7]:
{'inserted_successfully': 6079, 'failed': 0, 'failed_document_ids': []}
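
The chunksize argument above suggests that documents are sent in batches rather than in a single request. How the client batches internally is not shown in this notebook, so the following is only a minimal sketch of the idea. The helper name chunk_documents and the toy documents are assumptions made for illustration.

```python
def chunk_documents(documents, chunksize=50):
    """Yield successive slices of at most `chunksize` documents (hypothetical helper)."""
    for start in range(0, len(documents), chunksize):
        yield documents[start:start + chunksize]

# Toy stand-in with the same number of documents as the insert above.
toy_documents = [{'qa_id': i} for i in range(6079)]
chunks = list(chunk_documents(toy_documents, chunksize=50))
print(len(chunks), len(chunks[-1]))
```

A smaller chunksize means more requests with smaller payloads, which is usually the safer choice when each document carries a long vector.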

Quickstart Clustering

We support launching clustering jobs from the client. The backend clustering uses KMeans.
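
To make the n_clusters and n_init arguments used in the clustering job concrete, here is a minimal NumPy sketch of KMeans with random restarts. It is not the backend implementation, which is not shown in this notebook. The function name kmeans and the toy data are illustrative assumptions.

```python
import numpy as np

def kmeans(X, n_clusters, n_init=5, n_iter=100, seed=0):
    """Run Lloyd's algorithm n_init times and keep the run with the lowest inertia."""
    rng = np.random.default_rng(seed)
    best_labels, best_centroids, best_inertia = None, None, np.inf
    for _ in range(n_init):
        # Start from n_clusters distinct points chosen at random.
        centroids = X[rng.choice(len(X), size=n_clusters, replace=False)]
        for _ in range(n_iter):
            dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
            labels = dists.argmin(axis=1)
            # Move each centroid to the mean of its points (keep it if the cluster is empty).
            new_centroids = np.array([
                X[labels == k].mean(axis=0) if np.any(labels == k) else centroids[k]
                for k in range(n_clusters)])
            if np.allclose(new_centroids, centroids):
                break
            centroids = new_centroids
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        inertia = float((dists[np.arange(len(X)), labels] ** 2).sum())
        if inertia < best_inertia:
            best_labels, best_centroids, best_inertia = labels, centroids, inertia
    return best_labels, best_centroids, best_inertia

# Two well-separated toy blobs of 20 points each.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.5, size=(20, 2)),
               rng.normal(10.0, 0.5, size=(20, 2))])
labels, centroids, inertia = kmeans(X, n_clusters=2, n_init=5)
```

A larger n_init costs more compute but reduces the chance of keeping a poor local optimum from an unlucky initialisation.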

[8]:
# Our main clustering is done via advanced_clustering_job. This allows us to easily determine IDs.
vi_client.advanced_clustering_job(
    collection_name=collection_name, vector_field='question_vector_',
    alias='1st_cluster', n_clusters=50, n_init=5)
[8]:
{'job_name': 'advanced_clustering',
 'job_id': '19235f50-fa36-4c4f-8b7e-ba76be800a2a'}

That’s it! Clustering is now running! Now, we wait till the job is complete.

[9]:
# Pass the job_id returned by advanced_clustering_job above.
vi_client.wait_till_jobs_complete(collection_name=collection_name,
    job_id='19235f50-fa36-4c4f-8b7e-ba76be800a2a', job_name='advanced_clustering')
{'status': 'Finished'}
[9]:
'Done'

When we preview the collection, we can see that each document now belongs to a cluster.

[10]:
vi_client.retrieve_documents(collection_name, page_size=1)['documents'][0]['_clusters_']
[10]:
{'question_vectors_': {'1st_cluster': 43},
 'question_vector_': {'1st_cluster': 39}}
[11]:
vi_client.head(collection_name, return_as_pandas_df=True, page_size=2)
[11]:
_id qa_id question_title question_body question_user_name question_user_page answer answer_user_name answer_user_page url ... answer_relevance answer_satisfaction answer_type_instructions answer_type_procedure answer_type_reason_explanation answer_well_written insert_date_ _clusters_ _dr_ question_vector_
0 3e80C3QBW7TQbeJX1K3V 5899 Eigenvalues of a transition probability matrix I have read that, for\n$$I - \alpha P$$\nwhere... user243716 https://math.stackexchange.com/users/243716 As stated, the result is FALSE. Presuming P i... Mark L. Stone https://math.stackexchange.com/users/240387 http://math.stackexchange.com/questions/130034... ... 1.000000 1.000000 0.0 0.000000 1.000000 1.000000 2020-08-21T05:20:49.072579 {'question_vectors_': {'1st_cluster': 43}, 'qu... {'default': {'2': {'question_vectors_': [1.229... [0.15878267586231232, 0.2284708470106125, -0.0...
1 dn00C3QBv-xqKfY1H0KD 2660 Does adding a comma before "or" change the mea... For example, the definition given from the OAL... kiamlaluno https://ell.stackexchange.com/users/95 Regarding general usage (in the U.S., at least... Scott https://ell.stackexchange.com/users/357 http://ell.stackexchange.com/questions/3208/do... ... 0.888889 0.733333 0.0 0.333333 0.666667 0.888889 2020-08-21T05:20:49.073579 {'question_vectors_': {'1st_cluster': 42}, 'qu... {'default': {'2': {'question_vectors_': [-0.19... [0.24962793290615082, 0.05536836013197899, -0....

2 rows × 46 columns
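
Once documents carry a _clusters_ entry, a quick way to inspect a cluster is to group a readable field by cluster id. Below is a minimal sketch on toy documents that mimic the layout shown above. The helper name group_by_cluster, the titles, and the cluster ids are made up for illustration.

```python
from collections import defaultdict

def group_by_cluster(documents, vector_field, alias, label_field):
    """Map each cluster id to the labels of the documents assigned to it."""
    groups = defaultdict(list)
    for doc in documents:
        cluster_id = doc['_clusters_'][vector_field][alias]
        groups[cluster_id].append(doc[label_field])
    return dict(groups)

# Toy documents; real ones would come from vi_client.retrieve_documents.
toy_documents = [
    {'question_title': 'toy question A',
     '_clusters_': {'question_vector_': {'1st_cluster': 39}}},
    {'question_title': 'toy question B',
     '_clusters_': {'question_vector_': {'1st_cluster': 12}}},
    {'question_title': 'toy question C',
     '_clusters_': {'question_vector_': {'1st_cluster': 39}}},
]
groups = group_by_cluster(toy_documents, 'question_vector_', '1st_cluster', 'question_title')
print(groups)
```

Reading a handful of titles per cluster is often the fastest sanity check on whether the clusters are meaningful.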

Custom clustering of the vectors

Sometimes we may want to use our own clustering function because KMeans does not do exactly what we want. Here we show a simple example of custom clustering with the popular HDBSCAN package.

[12]:
%%capture
%pip install hdbscan
[13]:
documents = vi_client.retrieve_all_documents(collection_name)
[14]:
import hdbscan
import numpy as np
clusterer = hdbscan.HDBSCAN()

# get a numpy array of values
question_vectors = np.array([x['question_vector_'] for x in documents])
clusterer.fit(question_vectors)
clusterer.labels_
[14]:
array([-1, -1, -1, ..., -1, 28, 41])
[15]:
# Add the cluster label to each document (-1 marks points HDBSCAN treats as noise).
for x, label in zip(documents, clusterer.labels_):
    x['_cluster_label'] = int(label)
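
HDBSCAN assigns the label -1 to points it considers noise, so it is worth checking how many points were actually clustered before plotting. A minimal sketch, using a short toy label list in place of clusterer.labels_ (the helper name summarise_labels is an assumption):

```python
from collections import Counter

def summarise_labels(labels):
    """Count noise points (label -1) and the sizes of the real clusters."""
    counts = Counter(int(label) for label in labels)
    n_noise = counts.pop(-1, 0)
    return {'n_clusters': len(counts),
            'n_noise': n_noise,
            'largest': counts.most_common(3)}

# Toy labels standing in for clusterer.labels_.
toy_labels = [-1, -1, -1, 28, 28, 41, 28, -1, 41]
summary = summarise_labels(toy_labels)
print(summary)
```

If most points end up as noise, parameters such as min_cluster_size in hdbscan.HDBSCAN are the usual place to start tuning.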

Dimensionality Reduction

Quickstart Dimensionality Reduction

Dimensionality reduction from the client side is as simple as running the steps below. This runs Principal Component Analysis (PCA) to reduce the vectors to 2 dimensions so that we can visualise the end result.

[16]:
vi_client.dimensionality_reduction_job(
    collection_name=collection_name,
    vector_field='question_vector_',
    n_components=2, alias="default")
[16]:
{'job_name': 'dimensionality_reduction',
 'job_id': 'b56b00a1-9fc2-4eb3-b467-260d96715ae3'}
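
For intuition about what a PCA reduction with n_components=2 does, here is a minimal NumPy sketch that projects vectors onto their top principal directions using the singular value decomposition. It is not the backend job's code. The function name pca_reduce and the random toy vectors are assumptions.

```python
import numpy as np

def pca_reduce(vectors, n_components=2):
    """Project vectors onto their top principal components."""
    X = np.asarray(vectors, dtype=float)
    X_centred = X - X.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by explained variance.
    _, _, Vt = np.linalg.svd(X_centred, full_matrices=False)
    return X_centred @ Vt[:n_components].T

# Toy stand-in for 100 question vectors of dimension 768.
rng = np.random.default_rng(0)
toy_vectors = rng.normal(size=(100, 768))
reduced = pca_reduce(toy_vectors, n_components=2)
print(reduced.shape)
```

Each row of reduced is a 2D point that can be plotted directly, which is the role the _dr_ field plays in the collection.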

Custom Dimensionality Reduction

The example below shows a custom implementation of dimensionality reduction using IVIS.

[20]:
%%capture
%pip install ivis
[13]:
from vectorai.analytics.dimensionality_reduction import ViDimensionalityReductionBase
from sklearn.preprocessing import MinMaxScaler
from ivis import Ivis
import numpy as np

class IVISDimensionalityReduction(ViDimensionalityReductionBase):
    """IVIS Dimensionality Reduction.
    """
    def __init__(self, batch_size=120, k=15, embedding_dims=2):
        self.scaler = MinMaxScaler()
        self.model = Ivis(embedding_dims=embedding_dims, k=k, batch_size=batch_size)

    def _get_vectors(self, documents, field_vector):
        """Stack the vectors stored under field_vector into a 2D array.
        Accepts a list of documents or a response dict with a 'documents' key.
        """
        if isinstance(documents, dict) and 'documents' in documents:
            documents = documents['documents']
        return np.stack([self.get_field(field_vector, document) for document in documents])

    def fit_transform(self, documents, field_vector):
        """Fit the scaler and the IVIS model at a document level, then reduce.
        """
        X_scaled = self.scaler.fit_transform(self._get_vectors(documents, field_vector))
        return self.model.fit_transform(X_scaled)

    def transform(self, documents, field_vector):
        """Reduce new documents with the already fitted scaler and model.
        """
        X_scaled = self.scaler.transform(self._get_vectors(documents, field_vector))
        return self.model.transform(X_scaled)
[1]:
dim_reducer = IVISDimensionalityReduction()
red_output = dim_reducer.fit_transform(documents, field_vector='question_vector_')
[17]:
# Store the reduced coordinates on each document.
for q_red_, x in zip(red_output, documents):
    x['question_reduced_'] = q_red_

Visualisations (Advanced)

Visualisation with Vector AI differs from embedding visualisation in other libraries. Our focus is on practicality: helping deep learning practitioners explore their embeddings and focus on what matters, the quality of those embeddings in as practical a situation as possible. A few of the guiding principles we follow when building tools:

  • Is it immediately obvious what the graph shows?

  • Is as much information captured as possible and as clear as possible?

  • What is the status quo in regards to the tools being used?

Our focus is on ensuring that practitioners have a lot of good tools at their disposal when they are evaluating the quality of their vectors.

[27]:
vi_client.collection_schema(collection_name)
[27]:
{'_clusters_': 'dict',
 '_clusters_.question_vector_': 'dict',
 '_clusters_.question_vector_.1st_cluster': 'numeric',
 '_clusters_.question_vectors_': 'dict',
 '_clusters_.question_vectors_.1st_cluster': 'numeric',
 '_dr_': 'dict',
 '_dr_.default': 'dict',
 '_dr_.default.2': 'dict',
 '_dr_.default.2.question_vector_': 'vector',
 '_dr_.default.2.question_vectors_': 'numeric',
 'answer': 'text',
 'answer_helpful': 'numeric',
 'answer_level_of_information': 'numeric',
 'answer_plausible': 'numeric',
 'answer_relevance': 'numeric',
 'answer_satisfaction': 'numeric',
 'answer_type_instructions': 'numeric',
 'answer_type_procedure': 'numeric',
 'answer_type_reason_explanation': 'numeric',
 'answer_user_name': 'text',
 'answer_user_page': 'text',
 'answer_well_written': 'numeric',
 'category': 'text',
 'host': 'text',
 'insert_date_': 'date',
 'qa_id': 'numeric',
 'question_asker_intent_understanding': 'numeric',
 'question_body': 'text',
 'question_body_critical': 'numeric',
 'question_conversational': 'numeric',
 'question_expect_short_answer': 'numeric',
 'question_fact_seeking': 'numeric',
 'question_has_commonly_accepted_answer': 'numeric',
 'question_interestingness_others': 'numeric',
 'question_interestingness_self': 'numeric',
 'question_multi_intent': 'numeric',
 'question_not_really_a_question': 'numeric',
 'question_opinion_seeking': 'numeric',
 'question_title': 'text',
 'question_type_choice': 'numeric',
 'question_type_compare': 'numeric',
 'question_type_consequence': 'numeric',
 'question_type_definition': 'numeric',
 'question_type_entity': 'numeric',
 'question_type_instructions': 'numeric',
 'question_type_procedure': 'numeric',
 'question_type_reason_explanation': 'numeric',
 'question_type_spelling': 'numeric',
 'question_user_name': 'text',
 'question_user_page': 'text',
 'question_vector_': 'vector',
 'question_well_written': 'numeric',
 'url': 'text'}

Plotting Dimensionality-Reduced Vectors

Plotting dimensionality-reduced vectors allows you to understand your clusters better. It is a common (and pretty) technique for checking whether your clustering reflects real groups in the data.

When you visualise the vectors, you should see clear groups where the points are separated. This indicates that the vectors represent their respective clusters well: vector representations that learn groups on their own are a sign of high quality.
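
Visual separation can also be given a number. One common choice, not something the Vi plots compute, is the silhouette coefficient: values near 1 mean tight, well-separated groups and values near 0 mean overlapping ones. The sketch below is a minimal NumPy version under the assumption that there are at least two clusters. The function name mean_silhouette and the toy points are illustrative.

```python
import numpy as np

def mean_silhouette(points, labels):
    """Mean silhouette coefficient; assumes at least two distinct labels."""
    X = np.asarray(points, dtype=float)
    labels = np.asarray(labels)
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    scores = []
    for i in range(len(X)):
        same = labels == labels[i]
        same[i] = False  # exclude the point itself
        if not same.any():
            scores.append(0.0)  # singleton clusters score 0 by convention
            continue
        a = dists[i, same].mean()  # mean distance within the point's own cluster
        b = min(dists[i, labels == other].mean()  # nearest other cluster
                for other in np.unique(labels) if other != labels[i])
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))

# Two tight toy blobs in 2D, as reduced vectors might look.
rng = np.random.default_rng(0)
points = np.vstack([rng.normal(0.0, 0.1, size=(20, 2)),
                    rng.normal(5.0, 0.1, size=(20, 2))])
good_labels = [0] * 20 + [1] * 20   # labels that match the blobs
mixed_labels = [0, 1] * 20          # labels that ignore the blobs
good_score = mean_silhouette(points, good_labels)
mixed_score = mean_silhouette(points, mixed_labels)
```

Comparing the score for category labels against the score for cluster labels gives a rough, plot-independent view of which grouping the vectors respect more.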

[15]:
documents = vi_client.retrieve_all_documents(collection_name)
[8]:
cluster_label = 'category'
dim_reduction_field = '_dr_.default.2.question_vector_' # the '.' in names implies nested dictionaries in Vi
cluster_field = 'question_vector_'
alias = '1st_cluster'
collection_name = 'nlp-qa'

vi_client.plot_dimensionality_reduced_vectors(
    collection=documents,
    cluster_field=cluster_field,
    cluster_label=cluster_label,
    point_label='question_title',
    dim_reduction_field=dim_reduction_field,
    # include_centroids=True,
    alias=alias)
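
The comment in the cell above notes that a '.' in a field name implies nested dictionaries. A minimal sketch of how such a dotted path can be resolved on a plain Python document (the helper name get_nested_field and the toy values are assumptions, not part of the Vi client):

```python
def get_nested_field(document, dotted_path):
    """Follow a dotted path such as 'a.b.c' through nested dictionaries."""
    value = document
    for key in dotted_path.split('.'):
        value = value[key]
    return value

# Toy document that mimics the _dr_ layout in the collection schema.
toy_document = {'_dr_': {'default': {'2': {'question_vector_': [1.2, -0.4]}}}}
coords = get_nested_field(toy_document, '_dr_.default.2.question_vector_')
print(coords)
```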

[18]:
# plotting the IVIS dimensionality reduction
# documents = vi_client.random_documents(collection_name, 30)['documents']
point_label = 'question_title'
cluster_label = 'category'
dim_reduction_field = 'question_reduced_'
vi_client.plot_dimensionality_reduced_vectors(collection=documents,
    cluster_field=cluster_field, cluster_label=cluster_label,
    dim_reduction_field=dim_reduction_field, point_label=point_label)

[61]:
# Colour the points by the KMeans cluster assignment instead of the category.
cluster_label = '_clusters_.question_vector_.1st_cluster'

vi_client.plot_dimensionality_reduced_vectors(
    collection=collection_name,
    cluster_field=cluster_field,
    cluster_label=cluster_label,
    point_label='question_title',
    dim_reduction_field=dim_reduction_field,
    include_centroids=False,
    alias=alias)

1D Cosine Similarity

Vi is the first library that we are aware of that offers practical cosine similarity comparison across documents, so that you can find the most similar items as easily as possible. In addition, users can easily compare multiple vectors at the same time.

[5]:
documents = vi_client.random_documents(collection_name, 30)['documents']

Using With Single Vector

[69]:
vi_client.plot_1d_cosine_similarity(documents, vector_fields=['question_vector_'],
    label='question_title', anchor_document=documents[0])
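
The quantity behind this plot is the cosine similarity between each document vector and the anchor vector. A minimal NumPy sketch on toy 2D vectors (the function name cosine_similarity_to_anchor is an assumption):

```python
import numpy as np

def cosine_similarity_to_anchor(vectors, anchor):
    """Cosine similarity between every row of `vectors` and the anchor vector."""
    V = np.asarray(vectors, dtype=float)
    a = np.asarray(anchor, dtype=float)
    return (V @ a) / (np.linalg.norm(V, axis=1) * np.linalg.norm(a))

# Toy vectors; the first one is used as the anchor.
toy_vectors = np.array([[1.0, 0.0], [1.0, 1.0], [0.0, 1.0], [-1.0, 0.0]])
scores = cosine_similarity_to_anchor(toy_vectors, toy_vectors[0])
order = np.argsort(-scores)  # indices sorted from most to least similar
print(scores, order)
```

The anchor always scores 1.0 against itself, which is why the plotted bars end with a value of exactly 1.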

Using With Multiple Vectors

With Vi, we can easily run the comparison with multiple vectors.

Simply add a second vector field to each document using another model, as below.

[6]:
from vectorai.models.deployed import ViText2Vec
model = ViText2Vec(username, api_key)
[7]:
# Encode the title and body with the second model and store the result on each document.
for x in documents:
    x['question_vector_2'] = model.encode(x['question_title'] + x['question_body'])

The charts below were designed to allow easy comparison of ranks and similarity scores. We believe both are important when comparing vectors, and we have put a lot of effort into making the charts as legible as possible so that you can compare as much as possible at once.

[165]:
vi_client.plot_1d_cosine_similarity(
    documents[0:10], vector_fields=['question_vector_', 'question_vector_2'],
    label='question_title', anchor_document=documents[0],
    num_cols=2, x_axis_tickangle=70, orientation='v')

The blue question vectors do not seem well aligned. We can easily re-sort the vectors simply by moving question_vector_2!
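
Disagreement between two vector fields can also be checked numerically by comparing the rank each document receives under each field. A minimal NumPy sketch on toy vectors (the function name similarity_ranks and the data are assumptions):

```python
import numpy as np

def similarity_ranks(vectors, anchor_index=0):
    """Rank documents by cosine similarity to the anchor (rank 0 is most similar)."""
    V = np.asarray(vectors, dtype=float)
    V = V / np.linalg.norm(V, axis=1, keepdims=True)
    scores = V @ V[anchor_index]
    # argsort of argsort turns scores into ranks; negate so higher scores rank first.
    return np.argsort(np.argsort(-scores))

# Toy vectors for the same four documents under two different models.
field_a = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.5, 0.5]]
field_b = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [0.5, 0.5]]
ranks_a = similarity_ranks(field_a)
ranks_b = similarity_ranks(field_b)
rank_shift = ranks_b - ranks_a  # non-zero entries are documents the models disagree on
print(ranks_a, ranks_b, rank_shift)
```

Documents with a large absolute rank shift are the ones worth reading by hand when deciding which model to keep.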

Plotting Cosine Similarity with 2 Documents

You can also plot cosine similarity in 2D, with the similarity to one anchor document on the x axis and to a second on the y axis, to better evaluate your embeddings. This is useful when you want to understand how documents relate to 2 anchor documents rather than 1.

[52]:
documents = vi_client.random_documents(collection_name, 30)['documents']
fig = vi_client.plot_2d_cosine_similarity(documents=documents, anchor_documents=documents[:2],
    vector_fields=['question_vector_'], label='question_title', text_label_font_size=12)
fig.show()

Radar Plots

Radar plots allow you to compare across different vector spaces and categories. This can be useful when comparing embeddings in multi-label scenarios, e.g. when comparing a specific document to others.

[19]:
docs = vi_client.create_sample_documents(3)
vi_client.plot_radar_across_documents(docs, anchor_documents=docs[0:2],
    vector_field='color_vector_', label_field='country')

We can often use these plots to compare similarities. You can use the function below to compare vectors more easily across vector fields with radar plots.

[25]:
import plotly.graph_objects as go
[23]:
fig = vi_client.plot_radar_across_vector_fields(docs, anchor_document=docs[0],
    vector_fields=['color_vector_', 'color_2_vector_'], label_field='country')

Editing Plots

One of the key design principles of Vi is that users should be able to edit plots as required, with a high degree of customisability. That is why plots are always returned as objects that you can modify rather than as simple plots!

[119]:
fig.data[0].update({'marker': {'color': 'purple'}})
[119]:
Bar({
    'marker': {'color': 'purple'},
    'name': 'question_vector_',
    'orientation': 'h',
    'x': [0.6938691369437981, 0.712230453510121, 0.7170477947581689,
          0.7484762147002305, 0.7586918484846019, 0.7709117492518014,
          0.7748178292306391, 0.7910651604214004, 0.7951652539636912,
          0.7955191413920032, 0.7959471041429305, 0.7998393780205075,
          0.806897820009659, 0.809509839870059, 0.8136720972035922,
          0.8139103784142762, 0.8153082229309127, 0.8160983733249637,
          0.8237726690877931, 0.8290769553450886, 0.8316916003613232,
          0.8321987203719963, 0.8353326492919724, 0.8425234676807904,
          0.8444238944578158, 0.8480631595000561, 0.8584559421842285,
          0.8608480966542127, 0.8709205216599254, 1.0],
    'y': [AdvRef pin setup, Java read pptx file, Framing section headings,
          Graphicsrow title, Overbrace height adjustment, Engine Code P1312 on 2001
          Saab 9-3, Culinary uses for juniper extract?, Google maps marker show
          only icon, What are the symmetries of a principal homogeneous bundle?,
          Can I disable the default calendar app notifications?, I am or was a
          religious jew and purposely ate non kosher, How long does PhD Application
          Process take in the UK?, What is my lcd density?, Arrow tip for curved
          arrow in xymatrix, How to stop shell script if curl failed, How to find
          inter-country buses in Europe?, How do I fill a textbox with text if it
          is empty?, What does biting off the Pitum have to do with Pregnancy?,
          What's wrong with my launchctl config?, Is a self cleaning lube enough
          for chain maintenance?, What feats should a human rogue take to be
          stealthy?, Display relative date just for today, with "today" as prefix,
          up sample and down sample, User @include url Precedence over Script
          @include url, c# parsing json: error reading json object, Looking up for
          an item in a list and different table styles in iOS, How to allow line
          breaks after forward slashes in ConTeXt?, How was the sword of Gryffindor
          pulled from the hat a second time?, IRS2003 half bridge driver, Without a
          map or miniatures, how to best determine line-of-sight, etc?]
})
[120]:
fig.show()

Comparing Tables

When comparing vector search results from different models, it can be difficult to see how the results differ. Vi can place the search results for each vector field side by side in a table.

[12]:
# !pip install datasets
[51]:
from datasets import load_dataset

# Load the SQuAD dataset
squad_dataset = load_dataset('squad')
[20]:
# !pip install vectorhub
# !pip install vectorhub[encoders-text-tfhub-windows]
[20]:
from vectorhub.auto_encoder import AutoEncoder
bert_enc = AutoEncoder.from_model('text/bert')
albert_enc = AutoEncoder.from_model('text/albert')
[21]:
# Rename / set name to the object
vi_client.set_name(bert_enc, 'bert')
vi_client.set_name(albert_enc, 'albert')
[22]:
docs = []
for d in range(100):
    doc = squad_dataset['train'][d]
    doc.update({'_id': d})
    docs.append(doc)
[26]:
collection_name = 'nlp_quickstart_2'
vi_client.insert_documents(collection_name, docs,
    models={'question': [bert_enc, albert_enc]})

[26]:
{'inserted_successfully': 100, 'failed': 0, 'failed_document_ids': []}
[49]:
import pandas as pd
pd.set_option('display.max_colwidth', 300)
[50]:
vi_client.compare_vector_search_results(
    collection_name,
    vector_fields=['question_bert_vector_', 'question_albert_vector_'],
    label='context',
    num_rows=3
)
Using a random document for now. Please specify a document.
[50]:
question_bert_vector_ question_albert_vector_
0 This Main Building, and the library collection, was entirely destroyed by a fire in April 1879, and the school closed immediately and students were sent home. The university founder, Fr. Sorin and the president at the time, the Rev. William Corby, immediately planned for the rebuilding of the st... This Main Building, and the library collection, was entirely destroyed by a fire in April 1879, and the school closed immediately and students were sent home. The university founder, Fr. Sorin and the president at the time, the Rev. William Corby, immediately planned for the rebuilding of the st...
1 Architecturally, the school has a Catholic character. Atop the Main Building's gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend "Venite Ad Me Omnes". Next to the Main Building... This Main Building, and the library collection, was entirely destroyed by a fire in April 1879, and the school closed immediately and students were sent home. The university founder, Fr. Sorin and the president at the time, the Rev. William Corby, immediately planned for the rebuilding of the st...
2 Architecturally, the school has a Catholic character. Atop the Main Building's gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend "Venite Ad Me Omnes". Next to the Main Building... As of 2012[update] research continued in many fields. The university president, John Jenkins, described his hope that Notre Dame would become "one of the pre–eminent research institutions in the world" in his inaugural address. The university has many multi-disciplinary institutes devoted to res...
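
To illustrate how such a side-by-side table can be assembled, the sketch below ranks toy documents against a query document separately for each vector field and lines the top results up by row. It assumes cosine similarity as the ranking score, which may differ from what the Vi search backend uses. The helper names, field names, and data are all made up for illustration.

```python
import numpy as np

def top_labels(documents, query, vector_field, label_field, num_rows=3):
    """Labels of the documents most similar to the query under one vector field."""
    V = np.array([d[vector_field] for d in documents], dtype=float)
    q = np.asarray(query[vector_field], dtype=float)
    scores = (V @ q) / (np.linalg.norm(V, axis=1) * np.linalg.norm(q))
    best = np.argsort(-scores)[:num_rows]
    return [documents[i][label_field] for i in best]

def compare_results(documents, query, vector_fields, label_field, num_rows=3):
    """One row per rank, one column per vector field."""
    columns = {f: top_labels(documents, query, f, label_field, num_rows)
               for f in vector_fields}
    return [{f: columns[f][row] for f in vector_fields} for row in range(num_rows)]

# Toy documents with vectors from two hypothetical models.
toy_docs = [
    {'context': 'c0', 'a_vector_': [1.0, 0.0], 'b_vector_': [1.0, 0.0]},
    {'context': 'c1', 'a_vector_': [0.9, 0.1], 'b_vector_': [0.0, 1.0]},
    {'context': 'c2', 'a_vector_': [0.0, 1.0], 'b_vector_': [0.8, 0.2]},
]
table = compare_results(toy_docs, toy_docs[0], ['a_vector_', 'b_vector_'],
                        label_field='context', num_rows=2)
print(table)
```

Rows where the two columns differ show exactly where the models disagree about what is most similar to the query.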
