Hi folks! Here I will share my experience of using OpenAI embeddings, why they are needed, and the scenarios where embeddings are useful.
I was tasked with creating an internal chatbot for searching and summarizing content from our internal wiki and knowledge base. To achieve this, I first built a web crawler to gather information from linked pages, processed the content, and then used an OpenAI LLM to make the information easily retrievable for users.
This is part 1 of the series, covering the first three items from the list below: Data Load from a web page; Data Read, Split and Tokenize; and Data Embedding & Store. In part 2 we will see how this data can be retrieved/searched to answer specific questions from the user.
The high-level approach to building a Q&A or chatbot capability is as follows (the imports shared by all the snippets are listed right after this list):
- Data Load: First we need to load the data. I used ‘requests’ and ‘BeautifulSoup’ to do so.
- Data Read, Split & Tokenize: this is needed to break large data into smaller chunks so it can be easily indexed and searched; a large document also does not fit into the model’s finite context window. This can be done with simple looping logic.
- Data Embedding & Store: the chunks can be stored in Vector Database like Chroma or Faiss by using Embedding model. I used OpenAI embedding create API and stored as csv file.
- Data Retrieve: retrieve the relevant splits based on the user input. I used ‘openai.embeddings_utils’.
- Answer/LLM: using a prompt and the retrieved splits, the LLM produces the answer.
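All the snippets below share a handful of imports. As a minimal sketch (assuming the requests, beautifulsoup4, pandas, numpy, tiktoken and openai packages are installed):
import ast
import logging
import re

import numpy as np
import pandas as pd
import requests
import tiktoken
from bs4 import BeautifulSoup
from openai import OpenAI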
Let’s have a look at the first three steps one by one:
1. Data Load: the code snippet below accepts the main page URL as the start page, uses BeautifulSoup to find all anchor tags on that page, and adds the matching links to the scanned_urls list. As an example I took the Wikipedia page that indexes the lists of forests and collected all the list URLs from it.
main_prefix = "https://en.wikipedia.org"
main_url = "https://en.wikipedia.org/wiki/Lists_of_forests"
scanned_urls = []

def scan_page(page_to_scan):
    res = requests.get(page_to_scan)
    if res.ok:
        page_content = res.text
        bsoup = BeautifulSoup(page_content, "html.parser")
        # collect every link that points to a "List of forests ..." page
        for link in bsoup.find_all('a'):
            href = link.get('href')
            if href and href.startswith('/wiki/List_of_forests'):
                full_url = main_prefix + href
                if full_url not in scanned_urls:
                    scanned_urls.append(full_url)
    print(len(scanned_urls))
    for l in scanned_urls:
        print(l)

scan_page(main_url)
Beautiful Soup is a Python library for pulling data out of HTML and XML files.
Output: 12
https://en.wikipedia.org/wiki/List_of_forests_in_Canada
https://en.wikipedia.org/wiki/List_of_forests_in_Denmark
https://en.wikipedia.org/wiki/List_of_forests_in_France
https://en.wikipedia.org/wiki/List_of_forests_in_Iceland
https://en.wikipedia.org/wiki/List_of_forests_in_India
https://en.wikipedia.org/wiki/List_of_forests_in_Ireland
https://en.wikipedia.org/wiki/List_of_forests_in_Israel
https://en.wikipedia.org/wiki/List_of_forests_in_Lithuania
https://en.wikipedia.org/wiki/List_of_forests_in_Serbia
https://en.wikipedia.org/wiki/List_of_forests_of_South_Africa
https://en.wikipedia.org/wiki/List_of_forests_in_the_United_Kingdom
https://en.wikipedia.org/wiki/List_of_forests_in_Montana
2. Data Read, Split and Tokenize: the next step is to iterate over the collected URLs one by one (crawling) and read each page’s contents. The Beautiful Soup library parses the HTML, and its ‘text’ property returns the actual text data. I created a data frame with three columns to store the URL, its content, and the content size.
def crawl_n_collect_data():
    # dict object to store url, page text and text size
    data_holder = {
        "url": [],
        "content": [],
        "contentsize": []
    }
    # visit the URLs in scanned_urls one by one, read the page text and record its size
    for link in scanned_urls:
        res = requests.get(link)
        if res.ok:
            soup = BeautifulSoup(res.text, "html.parser")
            page_content = soup.text
            data_holder["url"].append(link)
            data_holder["content"].append(page_content)
            data_holder["contentsize"].append(len(page_content))
    df = pd.DataFrame(data_holder)
    # clean the raw page text (normalize_text is defined below)
    df["content"] = df["content"].apply(lambda x: normalize_text(x))
    # count tokens with the encoding used by the target model
    tokenizer = tiktoken.get_encoding(tiktoken.encoding_name_for_model("gpt-3.5-turbo"))
    df["n_tokens"] = df["content"].apply(lambda x: len(tokenizer.encode(x)))
    return df
‘tiktoken‘ is OpenAI’s tokenizer; it is used to determine how many tokens a string has before embedding, and it can split a string into tokens. To get the encoding a model uses:
# 'encoding_name_for_model' returns the name of the encoding used by the model
tiktoken.encoding_name_for_model("gpt-3.5-turbo")
The latest models all use the ‘cl100k_base‘ encoding, and the model’s token limit is 8,192 tokens. If the token count is bigger than the allowed limit, the string has to be split into smaller chunks (and therefore fewer tokens per chunk), for example as sketched after the clean_n_tokenize snippet below.
def clean_n_tokenize(df):
    df['content'] = df["content"].apply(lambda x: normalize_text(x))
    tokenizer = tiktoken.get_encoding("cl100k_base")
    df['n_tokens'] = df["content"].apply(lambda x: len(tokenizer.encode(x)))
    return df
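The splitting itself is just simple looping logic over the token list. As a minimal sketch (the split_into_chunks name and the 500-token chunk size are my own illustrative choices, not from the original code):
def split_into_chunks(text, max_tokens=500, encoding_name="cl100k_base"):
    # encode the whole text, then slice the token list into windows of max_tokens
    tokenizer = tiktoken.get_encoding(encoding_name)
    tokens = tokenizer.encode(text)
    chunks = []
    for start in range(0, len(tokens), max_tokens):
        chunks.append(tokenizer.decode(tokens[start:start + max_tokens]))
    return chunks
Each chunk can then be embedded and stored as its own row, so no single input exceeds the model’s token limit.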
To prepare the data for tokenization, we’ll perform some light data cleaning by removing redundant whitespace and cleaning up the punctuation.
def normalize_text(s, sep_token=" \n "):
    # collapse runs of whitespace into a single space
    s = re.sub(r'\s+', ' ', s).strip()
    # drop stray " ," artifacts (any character followed by " ,")
    s = re.sub(r". ,", "", s)
    # tidy up doubled punctuation and leftover newlines
    s = s.replace("..", ".")
    s = s.replace(". .", ".")
    s = s.replace("\n", "")
    s = s.strip()
    return s
Output:
url content contentsize n_tokens
0 https://en.wikipedia.org/wiki/List_of_forests_… Forests of Canada – Wikipedia Jump to content … 13789 2777
1 https://en.wikipedia.org/wiki/List_of_forests_… List of forests in Denmark – Wikipedia Jump to… 4475 1000
2 https://en.wikipedia.org/wiki/List_of_forests_… List of forests in France – Wikipedia Jump to … 10249 3002
3 https://en.wikipedia.org/wiki/List_of_forests_… List of forests in Iceland – Wikipedia Jump to… 6222 1382
4 https://en.wikipedia.org/wiki/List_of_forests_… List of forests in India – Wikipedia Jump to c… 7824 1526
5 https://en.wikipedia.org/wiki/List_of_forests_… List of forests in Ireland – Wikipedia Jump to… 4990 965
6 https://en.wikipedia.org/wiki/List_of_forests_… List of forests in Israel – Wikipedia Jump to … 6401 1519
7 https://en.wikipedia.org/wiki/List_of_forests_… Forests in Lithuania – Wikipedia Jump to conte… 8748 2277
8 https://en.wikipedia.org/wiki/List_of_forests_… List of forests in Serbia – Wikipedia Jump to … 2700 455
9 https://en.wikipedia.org/wiki/List_of_forests_… List of forests of South Africa – Wikipedia Ju… 3806 728
10 https://en.wikipedia.org/wiki/List_of_forests_… List of forests in the United Kingdom – Wikipe… 6538 1279
11 https://en.wikipedia.org/wiki/List_of_forests_… List of forests in Montana – Wikipedia Jump to… 16246 4082
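3. Data Embedding & Store: each cleaned page content is now turned into an embedding vector by calling OpenAI’s embeddings API with the ‘text-embedding-ada-002’ model (here via the v1 openai-python client, the same OpenAI() client the retrieval code below uses). Unlike the chat/completions endpoints, the embeddings endpoint only needs the model name and the input text.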
client = OpenAI()

def generate_embedding(chunk_data):
    # the embeddings endpoint only needs the model and the input text
    response = client.embeddings.create(
        model="text-embedding-ada-002",
        input=chunk_data,
        encoding_format='float')
    return response.data[0].embedding
df["ada_embedding"] = df["content"].apply(lambda x: embedding_data(x))
Output after embedding:
url content contentsize n_tokens ada_embedding
0 https://en.wikipedia.org/wiki/List_of_forests_… Forests of Canada – Wikipedia Jump to content … 13789 2777 [0.011232808, -0.025230426, -0.0033506253, -0….
1 https://en.wikipedia.org/wiki/List_of_forests_… List of forests in Denmark – Wikipedia Jump to… 4475 1000 [0.0022826288, -0.024155257, 0.018999346, -0.0…
2 https://en.wikipedia.org/wiki/List_of_forests_… List of forests in France – Wikipedia Jump to … 10249 3002 [0.023525473, 0.001991317, 0.005225442, -0.023…
3 https://en.wikipedia.org/wiki/List_of_forests_… List of forests in Iceland – Wikipedia Jump to… 6222 1382 [0.0090938015, 0.00018668556, 0.017149758, -0….
4 https://en.wikipedia.org/wiki/List_of_forests_… List of forests in India – Wikipedia Jump to c… 7824 1526 [0.00852957, -0.001390587, 0.0076518063, 0.010…
5 https://en.wikipedia.org/wiki/List_of_forests_… List of forests in Ireland – Wikipedia Jump to… 4990 965 [0.008661318, -0.01900384, 0.0147074275, -0.02…
6 https://en.wikipedia.org/wiki/List_of_forests_… List of forests in Israel – Wikipedia Jump to … 6401 1519 [0.004974614, -0.0038926275, 0.004324791, -0.0…
7 https://en.wikipedia.org/wiki/List_of_forests_… Forests in Lithuania – Wikipedia Jump to conte… 8748 2277 [0.017135024, 0.0046746093, 0.019561443, -0.02…
8 https://en.wikipedia.org/wiki/List_of_forests_… List of forests in Serbia – Wikipedia Jump to … 2700 455 [0.021920355, -0.012049191, 0.0045980527, -0.0…
9 https://en.wikipedia.org/wiki/List_of_forests_… List of forests of South Africa – Wikipedia Ju… 3806 728 [0.016609546, -0.015374541, 0.018525062, -0.03…
10 https://en.wikipedia.org/wiki/List_of_forests_… List of forests in the United Kingdom – Wikipe… 6538 1279 [0.017363157, -0.004047701, -0.004485791, -0.0…
11 https://en.wikipedia.org/wiki/List_of_forests_… List of forests in Montana – Wikipedia Jump to… 16246 4082 [-0.007826162, -0.0027350672, 0.022195093, -0….
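To make these embeddings reusable for retrieval, I stored the data frame as a CSV file. A minimal sketch (the embed-data.csv file name matches what the retrieval snippet below reads; pandas writes the embedding vectors as stringified lists, which is why they have to be parsed back with ast.literal_eval on load):
# persist the embedded data; the vectors are serialized as stringified lists
df.to_csv("embed-data.csv", index=False)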
As a preview of part 2, the snippet below shows how this data can be retrieved/searched to answer a specific question from the user.
def get_llm_query_response(question: str):
    try:
        # the embeddings were stored in a csv file earlier; read it back with pandas
        df = pd.read_csv("embed-data.csv")
        # convert embeddings from their string representation back to numpy arrays
        df["ada_embedding"] = df["ada_embedding"].apply(
            lambda x: np.array(ast.literal_eval(x))
        )
        client = OpenAI()
        # embed the user question with the same embedding model used for the data
        embed_q = client.embeddings.create(
            model="text-embedding-ada-002",
            input=question
        )
        embedding = embed_q.data[0].embedding
        # score every stored chunk against the question embedding
        df["similarities"] = df.ada_embedding.apply(
            lambda x: cosine_similarity(x, embedding)
        )
        # keep the best matching chunks and use the top one as context
        resp_df = df.sort_values("similarities", ascending=False).head(2)
        txt_resp = resp_df["content"].iloc[0]
        prompt_text = f"Your job is to find the answer from the following text: {txt_resp} \
            Do not hallucinate the answer, only provide information present in the text above."
        messages = [
            {"role": "system", "content": prompt_text},
            {"role": "user", "content": question},
        ]
        response = client.chat.completions.create(
            model="gpt-4",
            messages=messages
        )
        return response.choices[0].message.content
    except Exception as e:
        logging.error("Exception occurred while querying: %s", str(e))
        return None
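One note on cosine_similarity: in older versions of the openai package it could be imported from ‘openai.embeddings_utils’ (as mentioned in the approach list above), but that module is not available with the v1 client. On a recent version you can define an equivalent helper with numpy (a small sketch):
def cosine_similarity(a, b):
    # cosine similarity between two embedding vectors
    a, b = np.array(a), np.array(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
With that in place, a call like get_llm_query_response("Which forests are listed for Iceland?") (a hypothetical example question, assuming embed-data.csv exists) returns the answer GPT-4 produces from the best-matching chunk.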
Stay tuned!