Vector Analytics

Vector analytics is useful when you are interested in determining the quality of your vectors.

Once we have our multi-dimensional vectors (128D, 256D, 768D, etc.), we want a way to see how good the encodings are, to spot anomalies in the data, and to decide whether a different model is worth trying. Vi supports visualising your vectors in 2D using high-level libraries such as Plotly.

Vector analytics includes clustering the vectors, reducing their dimensionality, and then visualising the result.

[1]:
%load_ext autoreload
%autoreload 2
[2]:
from vectorai import ViClient
import os
username = os.environ['VI_USERNAME']
api_key = os.environ['VI_API_KEY']
collection_name = 'nlp-qa'
[4]:
vi_client = ViClient(username, api_key)

Clustering

Quickstart using the Vi client. Download the Google QUEST Kaggle data from https://www.kaggle.com/c/google-quest-challenge/data. Once the download finishes, move train.csv into your working directory.

[60]:
import pandas as pd
data = pd.read_csv('train.csv')
data.head(2)
[60]:
qa_id question_title question_body question_user_name question_user_page answer answer_user_name answer_user_page url category ... question_well_written answer_helpful answer_level_of_information answer_plausible answer_relevance answer_satisfaction answer_type_instructions answer_type_procedure answer_type_reason_explanation answer_well_written
0 0 What am I losing when using extension tubes in... After playing around with macro photography on... ysap https://photo.stackexchange.com/users/1024 I just got extension tubes, so here's the skin... rfusca https://photo.stackexchange.com/users/1917 http://photo.stackexchange.com/questions/9169/... LIFE_ARTS ... 1.000000 1.000000 0.666667 1.000000 1.000000 0.800000 1.0 0.0 0.000000 1.000000
1 1 What is the distinction between a city and a s... I am trying to understand what kinds of places... russellpierce https://rpg.stackexchange.com/users/8774 It might be helpful to look into the definitio... Erik Schmidt https://rpg.stackexchange.com/users/1871 http://rpg.stackexchange.com/questions/47820/w... CULTURE ... 0.888889 0.888889 0.555556 0.888889 0.888889 0.666667 0.0 0.0 0.666667 0.888889

2 rows × 41 columns

Setup ~ Insert documents

[6]:
# Convert the DataFrame into a list of documents (one dict per row).
documents = data.to_dict(orient='records')

# Run encoding
from vectorai.models.transformer_models import Transformer2Vec
encoder = Transformer2Vec('distilbert')

# Quick test to see if this works on 1 document
# vector = encoder.encode(document=documents[0], document_fields=['question_title', 'question_body'])

vectors = encoder.bulk_encode(documents,
    document_fields=['question_title', 'question_body'],
    vector_output_field='question_vector_')
Some weights of the model checkpoint at distilbert-base-cased were not used when initializing TFDistilBertModel: ['vocab_transform', 'vocab_projector', 'vocab_layer_norm', 'activation_13']
- This IS expected if you are initializing TFDistilBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPretraining model).
- This IS NOT expected if you are initializing TFDistilBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
All the weights of TFDistilBertModel were initialized from the model checkpoint at distilbert-base-cased.
If your task is similar to the task the model of the ckeckpoint was trained on, you can already use TFDistilBertModel for predictions without further training.
Finished updating documents with additional field.
[7]:
vi_client.insert_documents(collection_name=collection_name, documents=documents, chunksize=50)

[7]:
{'inserted_successfully': 6079, 'failed': 0, 'failed_document_ids': []}
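
The `chunksize` argument suggests that documents are sent in batches rather than in a single request. As a rough illustration only (the `chunk` helper below is hypothetical and is not part of the Vi client, and the client's actual batching behaviour is an assumption), splitting 6079 documents into groups of 50 could look like this:

```python
def chunk(items, chunksize):
    """Yield consecutive slices of `items` with at most `chunksize` elements each."""
    for start in range(0, len(items), chunksize):
        yield items[start:start + chunksize]

# Placeholder documents standing in for the 6079 rows inserted above.
fake_documents = [{'qa_id': i} for i in range(6079)]
batches = list(chunk(fake_documents, chunksize=50))

# Number of batches and size of the final, partially filled batch.
print(len(batches), len(batches[-1]))
```

Smaller chunks keep each request light at the cost of more round trips, which is one reason such a parameter is usually exposed.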

Quickstart Clustering

Clustering can be started from the client side. On the backend, the clustering itself is done with KMeans.
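
For readers unfamiliar with KMeans, the sketch below shows the standard Lloyd's algorithm in NumPy. It is not the backend implementation; `n_clusters` and `n_init` are assumed to carry their usual meaning (the number of centroids, and the number of random restarts from which the lowest-inertia run is kept), and `n_iter` and `seed` are illustrative parameters.

```python
import numpy as np

def kmeans(X, n_clusters, n_init=5, n_iter=50, seed=0):
    """Plain Lloyd's algorithm; returns (labels, centroids, inertia) of the best restart."""
    rng = np.random.default_rng(seed)
    best = None
    for _ in range(n_init):
        # Initialise centroids from randomly chosen data points.
        centroids = X[rng.choice(len(X), size=n_clusters, replace=False)]
        for _ in range(n_iter):
            # Assign every point to its nearest centroid.
            distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
            labels = distances.argmin(axis=1)
            # Move each centroid to the mean of its points (keep it if the cluster is empty).
            new_centroids = np.array([
                X[labels == k].mean(axis=0) if np.any(labels == k) else centroids[k]
                for k in range(n_clusters)
            ])
            if np.allclose(new_centroids, centroids):
                break
            centroids = new_centroids
        # Recompute assignments and inertia for the final centroids of this restart.
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        inertia = float((distances.min(axis=1) ** 2).sum())
        if best is None or inertia < best[2]:
            best = (labels, centroids, inertia)
    return best

# Two well-separated synthetic blobs stand in for real question vectors.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.1, (20, 4)), rng.normal(5, 0.1, (20, 4))])
labels, centroids, inertia = kmeans(X, n_clusters=2)
```

A larger `n_init` makes a poor random initialisation less likely to decide the final result, at the cost of proportionally more computation.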

[8]:
# Our main clustering is done via advanced_clustering_job. This allows us to easily determine IDs.
vi_client.advanced_clustering_job(
    collection_name=collection_name, vector_field='question_vector_',
    alias='1st_cluster', n_clusters=50, n_init=5)
[8]:
{'job_name': 'advanced_clustering',
 'job_id': '19235f50-fa36-4c4f-8b7e-ba76be800a2a'}

That’s it! Clustering is now running! Now, we wait till the job is complete.

[9]:
# Pass the job_id returned by advanced_clustering_job above.
vi_client.wait_till_jobs_complete(collection_name=collection_name,
    job_id='19235f50-fa36-4c4f-8b7e-ba76be800a2a', job_name='advanced_clustering')
{'status': 'Finished'}
[9]:
'Done'

When we preview the collection, we can see that each document now belongs to a cluster.

[10]:
vi_client.retrieve_documents(collection_name, page_size=1)['documents'][0]['_clusters_']
[10]:
{'question_vectors_': {'1st_cluster': 43},
 'question_vector_': {'1st_cluster': 39}}
[11]:
vi_client.head(collection_name, return_as_pandas_df=True, page_size=2)
[11]:
_id qa_id question_title question_body question_user_name question_user_page answer answer_user_name answer_user_page url ... answer_relevance answer_satisfaction answer_type_instructions answer_type_procedure answer_type_reason_explanation answer_well_written insert_date_ _clusters_ _dr_ question_vector_
0 3e80C3QBW7TQbeJX1K3V 5899 Eigenvalues of a transition probability matrix I have read that, for\n$$I - \alpha P$$\nwhere... user243716 https://math.stackexchange.com/users/243716 As stated, the result is FALSE. Presuming P i... Mark L. Stone https://math.stackexchange.com/users/240387 http://math.stackexchange.com/questions/130034... ... 1.000000 1.000000 0.0 0.000000 1.000000 1.000000 2020-08-21T05:20:49.072579 {'question_vectors_': {'1st_cluster': 43}, 'qu... {'default': {'2': {'question_vectors_': [1.229... [0.15878267586231232, 0.2284708470106125, -0.0...
1 dn00C3QBv-xqKfY1H0KD 2660 Does adding a comma before "or" change the mea... For example, the definition given from the OAL... kiamlaluno https://ell.stackexchange.com/users/95 Regarding general usage (in the U.S., at least... Scott https://ell.stackexchange.com/users/357 http://ell.stackexchange.com/questions/3208/do... ... 0.888889 0.733333 0.0 0.333333 0.666667 0.888889 2020-08-21T05:20:49.073579 {'question_vectors_': {'1st_cluster': 42}, 'qu... {'default': {'2': {'question_vectors_': [-0.19... [0.24962793290615082, 0.05536836013197899, -0....

2 rows × 46 columns
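
Once documents carry a `_clusters_` field, as in the preview above, a quick sanity check is to tally how many documents landed in each cluster. The sketch below uses hand-written documents shaped like that preview; `cluster_sizes` is a hypothetical helper, not a Vi client method.

```python
from collections import Counter

# Hand-written stand-ins shaped like the `_clusters_` field previewed above.
sample_documents = [
    {'_id': 'a', '_clusters_': {'question_vector_': {'1st_cluster': 39}}},
    {'_id': 'b', '_clusters_': {'question_vector_': {'1st_cluster': 42}}},
    {'_id': 'c', '_clusters_': {'question_vector_': {'1st_cluster': 39}}},
]

def cluster_sizes(documents, vector_field, alias):
    """Count how many documents fall into each cluster for a vector field and alias."""
    return Counter(
        doc['_clusters_'][vector_field][alias]
        for doc in documents
        # Skip documents that have not been clustered under this field and alias.
        if alias in doc.get('_clusters_', {}).get(vector_field, {})
    )

sizes = cluster_sizes(sample_documents, 'question_vector_', '1st_cluster')
print(sizes.most_common())
```

Very uneven cluster sizes (one huge cluster and many tiny ones) are often a hint that `n_clusters` or the encoder deserves another look.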

Custom clustering of the vectors

Sometimes KMeans does not do exactly what we want, and we would rather use our own clustering function. Here we show a simple example of custom clustering with the popular HDBSCAN package.

[12]:
%%capture
%pip install hdbscan
[13]:
documents = vi_client.retrieve_all_documents(collection_name)
[14]:
import hdbscan
import numpy as np
clusterer = hdbscan.HDBSCAN()

# get a numpy array of values
question_vectors = np.array([x['question_vector_'] for x in documents])
clusterer.fit(question_vectors)
clusterer.labels_
[14]:
array([-1, -1, -1, ..., -1, 28, 41])
[15]:
# Add the HDBSCAN cluster label to each document (cast to int so it is a plain Python value).
for document, label in zip(documents, clusterer.labels_):
    document['_cluster_label'] = int(label)
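
In HDBSCAN output, the label -1 marks points treated as noise rather than as members of any cluster, which is why it appears so often in the label array above. A small NumPy summary can show how much of the collection was left unclustered; the label array below is made up and stands in for `clusterer.labels_`.

```python
import numpy as np

# Made-up labels standing in for clusterer.labels_; -1 means "noise" in HDBSCAN.
labels = np.array([-1, -1, 3, 3, 28, 41, -1, 28])

# Share of points that HDBSCAN did not assign to any cluster.
noise_fraction = float(np.mean(labels == -1))

# Size of each real cluster, ignoring the noise label.
cluster_ids, counts = np.unique(labels[labels != -1], return_counts=True)
sizes = {int(c): int(n) for c, n in zip(cluster_ids, counts)}

print(f"noise fraction: {noise_fraction}")
print(f"clusters found: {len(sizes)}, sizes: {sizes}")
```

A high noise fraction is not necessarily a problem, but it does mean that many documents will have no meaningful cluster when plotted or filtered.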

Dimensionality Reduction

Quickstart Dimensionality Reduction

Dimensionality reduction from the client side is as simple as running the step below. This runs Principal Component Analysis (PCA) to reduce the vectors to n_components dimensions, so that we can visualise the result.

[16]:
vi_client.dimensionality_reduction_job(
    collection_name=collection_name,
    vector_field='question_vector_',
    n_components=2, alias="default")
[16]:
{'job_name': 'dimensionality_reduction',
 'job_id': 'b56b00a1-9fc2-4eb3-b467-260d96715ae3'}
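
To make the PCA step less of a black box, here is a minimal NumPy sketch of PCA via a singular value decomposition. It illustrates the technique only and is not the backend implementation; `pca_reduce` is a hypothetical helper, and the random 768-dimensional input is an assumption standing in for real question vectors.

```python
import numpy as np

def pca_reduce(X, n_components=2):
    """Project the rows of X onto their top principal components using an SVD."""
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by explained variance.
    _, _, Vt = np.linalg.svd(X_centered, full_matrices=False)
    return X_centered @ Vt[:n_components].T

# Random vectors stand in for high-dimensional question vectors.
rng = np.random.default_rng(0)
vectors = rng.normal(size=(100, 768))
reduced = pca_reduce(vectors, n_components=2)
print(reduced.shape)
```

Because PCA is linear, it preserves large-scale structure well but can flatten non-linear groupings, which is one motivation for trying a custom reducer.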

Custom Dimensionality Reduction

Below is a custom dimensionality reduction implementation built on IVIS.

[20]:
%%capture
%pip install ivis
[13]:
from vectorai.analytics.dimensionality_reduction import ViDimensionalityReductionBase
from sklearn.preprocessing import MinMaxScaler
from ivis import Ivis
import numpy as np

class IVISDimensionalityReduction(ViDimensionalityReductionBase):
    """IVIS Dimensionality Reduction.
    """
    def __init__(self, batch_size=120, k=15, embedding_dims=2):
        self.scaler = MinMaxScaler()
        self.model = Ivis(embedding_dims=embedding_dims, k=k, batch_size=batch_size)

    def fit_transform(self, documents, field_vector):
        """Fit the scaler and the IVIS model on the documents' vectors and return the reduced vectors.
        """
        # Accept either a list of documents or an API response of the form {'documents': [...]}.
        if isinstance(documents, dict) and 'documents' in documents:
            documents = documents['documents']
        vectors = [self.get_field(field_vector, document) for document in documents]

        X = np.stack(vectors)
        X_scaled = self.scaler.fit_transform(X)
        return self.model.fit_transform(X_scaled)

    def transform(self, documents, field_vector):
        """Reduce new documents with the already fitted scaler and model.
        """
        if isinstance(documents, dict) and 'documents' in documents:
            documents = documents['documents']
        vectors = [self.get_field(field_vector, document) for document in documents]
        X_scaled = self.scaler.transform(np.stack(vectors))
        return self.model.transform(X_scaled)
[1]:
dim_reducer = IVISDimensionalityReduction()
red_output = dim_reducer.fit_transform(documents, field_vector='question_vector_')
[17]:
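# Attach each reduced vector to its document (as a plain list rather than a NumPy array).
for document, reduced_vector in zip(documents, red_output):
    document['question_reduced_'] = reduced_vector.tolist()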
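
The IVIS class above scales the vectors with `MinMaxScaler` before passing them to the model, and reuses the fitted scaler when transforming new data. The sketch below shows what that scaling does, using a small NumPy stand-in (`MinMax` is a hypothetical class written for illustration, not scikit-learn's implementation).

```python
import numpy as np

class MinMax:
    """Minimal stand-in for min-max scaling to the (0, 1) range, column by column."""

    def fit(self, X):
        self.data_min = X.min(axis=0)
        data_range = X.max(axis=0) - self.data_min
        # Guard against constant columns to avoid dividing by zero.
        self.data_range = np.where(data_range == 0, 1.0, data_range)
        return self

    def transform(self, X):
        return (X - self.data_min) / self.data_range

train = np.array([[0.0, 10.0], [5.0, 20.0], [10.0, 30.0]])
new = np.array([[2.5, 40.0]])

scaler = MinMax().fit(train)
train_scaled = scaler.transform(train)
# New data reuses the minimum and range learned from the training data,
# so values outside the training range can fall outside (0, 1).
new_scaled = scaler.transform(new)
print(train_scaled)
print(new_scaled)
```

Refitting the scaler on new data would map it into a different coordinate system from the one the model was trained on, which is why `transform` only applies the existing fit.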

Visualisations (Advanced)

Visualisation with Vector AI differs from embedding visualisation in other libraries. Our focus is on practicality: helping deep learning practitioners understand the embeddings they want to explore and concentrate on what matters, the quality of their embeddings in as realistic a setting as possible. A few guiding principles when we build tools:

  • Is it immediately obvious what the graph shows?

  • Is as much information captured as possible and as clear as possible?

  • What is the status quo with regard to the tools being used?

Our focus is on ensuring that practitioners have a lot of good tools at their disposal when they are evaluating the quality of their vectors.

[27]:
vi_client.collection_schema(collection_name)
[27]:
{'_clusters_': 'dict',
 '_clusters_.question_vector_': 'dict',
 '_clusters_.question_vector_.1st_cluster': 'numeric',
 '_clusters_.question_vectors_': 'dict',
 '_clusters_.question_vectors_.1st_cluster': 'numeric',
 '_dr_': 'dict',
 '_dr_.default': 'dict',
 '_dr_.default.2': 'dict',
 '_dr_.default.2.question_vector_': 'vector',
 '_dr_.default.2.question_vectors_': 'numeric',
 'answer': 'text',
 'answer_helpful': 'numeric',
 'answer_level_of_information': 'numeric',
 'answer_plausible': 'numeric',
 'answer_relevance': 'numeric',
 'answer_satisfaction': 'numeric',
 'answer_type_instructions': 'numeric',
 'answer_type_procedure': 'numeric',
 'answer_type_reason_explanation': 'numeric',
 'answer_user_name': 'text',
 'answer_user_page': 'text',
 'answer_well_written': 'numeric',
 'category': 'text',
 'host': 'text',
 'insert_date_': 'date',
 'qa_id': 'numeric',
 'question_asker_intent_understanding': 'numeric',
 'question_body': 'text',
 'question_body_critical': 'numeric',
 'question_conversational': 'numeric',
 'question_expect_short_answer': 'numeric',
 'question_fact_seeking': 'numeric',
 'question_has_commonly_accepted_answer': 'numeric',
 'question_interestingness_others': 'numeric',
 'question_interestingness_self': 'numeric',
 'question_multi_intent': 'numeric',
 'question_not_really_a_question': 'numeric',
 'question_opinion_seeking': 'numeric',
 'question_title': 'text',
 'question_type_choice': 'numeric',
 'question_type_compare': 'numeric',
 'question_type_consequence': 'numeric',
 'question_type_definition': 'numeric',
 'question_type_entity': 'numeric',
 'question_type_instructions': 'numeric',
 'question_type_procedure': 'numeric',
 'question_type_reason_explanation': 'numeric',
 'question_type_spelling': 'numeric',
 'question_user_name': 'text',
 'question_user_page': 'text',
 'question_vector_': 'vector',
 'question_well_written': 'numeric',
 'url': 'text'}
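
The dotted field names in the schema correspond to nested dictionaries inside each document. A small helper makes it easy to read such a field from a retrieved document; `get_nested` is a hypothetical function written for this example, and the sample document is hand-written with invented values.

```python
def get_nested(document, dotted_field):
    """Follow a '.'-separated field name through nested dictionaries."""
    value = document
    for key in dotted_field.split('.'):
        value = value[key]
    return value

# Hand-written document shaped like the schema above (vector values are invented).
sample_document = {
    'question_title': 'Example question',
    '_dr_': {'default': {'2': {'question_vector_': [1.2, -0.4]}}},
    '_clusters_': {'question_vector_': {'1st_cluster': 39}},
}

reduced_vector = get_nested(sample_document, '_dr_.default.2.question_vector_')
cluster_id = get_nested(sample_document, '_clusters_.question_vector_.1st_cluster')
print(reduced_vector, cluster_id)
```

Note that the helper raises `KeyError` for documents that lack the field, which is usually the desired behaviour when checking that a job has finished writing its results.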

Plotting Dimensionality-Reduced Vectors

Plotting dimensionality-reduced vectors allows you to understand your clusters better. It is a common (and pretty) technique for checking whether your clustering reflects real groups in the data.

When you visualise the vectors, you should see clear groups where the points are separated. Such separation suggests high-quality vector representations, since the model has learned the groups on its own.

[15]:
documents = vi_client.retrieve_all_documents(collection_name)
[8]:
cluster_label = 'category'
dim_reduction_field = '_dr_.default.2.question_vector_' # the '.' in names implies nested dictionaries in Vi
cluster_field = 'question_vector_'
alias = '1st_cluster'
collection_name = 'nlp-qa'

vi_client.plot_dimensionality_reduced_vectors(
    collection=documents,
    cluster_field=cluster_field,
    cluster_label=cluster_label,
    point_label='question_title',
    dim_reduction_field=dim_reduction_field,
    # include_centroids=True,
    alias=alias)
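
A plot is a visual check; a rough numeric companion is cluster purity, which measures how often documents in the same cluster share the same known label (here, `category`). This is a generic measure swapped in for illustration and is not something the Vi client is known to provide; `cluster_purity` and the toy data below are invented for the example.

```python
from collections import Counter, defaultdict

def cluster_purity(cluster_ids, labels):
    """Fraction of points whose label matches the majority label of their cluster."""
    by_cluster = defaultdict(Counter)
    for cluster_id, label in zip(cluster_ids, labels):
        by_cluster[cluster_id][label] += 1
    # For each cluster, count the points that carry its most common label.
    majority_total = sum(counts.most_common(1)[0][1] for counts in by_cluster.values())
    return majority_total / len(labels)

# Invented toy data: a cluster id next to the category of each document.
cluster_ids = [0, 0, 0, 1, 1, 2, 2, 2]
categories = ['LIFE_ARTS', 'LIFE_ARTS', 'CULTURE', 'CULTURE', 'CULTURE',
              'SCIENCE', 'SCIENCE', 'LIFE_ARTS']

purity = cluster_purity(cluster_ids, categories)
print(purity)
```

Purity tends to rise as the number of clusters grows, so it is best used to compare runs with the same `n_clusters` rather than as an absolute score.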