Build an AI-based browser phone extension
2023-05-01
Motivation
On March 14, 2023, OpenAI released GPT-4, a transformer-based model that is more reliable, more capable, and better at handling nuanced instructions than GPT-3.5. GPT-4 has consistently outperformed other Large Language Models (LLMs) on various benchmarks, prompting a surge of diverse AI-powered applications built on top of it. Meanwhile, the phone remains a fundamental human communication need and a core communication service for businesses, so the potential for AI to improve the efficiency and quality of phone interactions is significant and worth exploring.
In contrast to traditional phone applications, we aim to build an AI-driven browser phone extension that leverages AI to assist at every stage of the phone call process, ultimately improving the user's calling experience.
- Before a call, we want it to extract contact information from any third-party webpage, especially complex ones. Crucially, it should intelligently analyze the contact data (call recordings, voicemails, SMS, etc.) associated with each identified contact, letting us gauge each contact's sentiment, prepare draft summaries for upcoming calls, and more.
- During a call, the spoken dialogue will be transcribed into text in real-time. Based on this live text conversation, the extension should offer effective intelligent prompts and assistance. It could even intelligently deploy an AI voicebot to participate in the conversation.
- After a call concludes, it will intelligently generate a call sentiment assessment, a conversation summary, and a to-do list based on the current call information, the conversation transcript, and potentially past interactions with the contact. It will also automatically fill out forms on any CRM page.
Once the extension delivers these capabilities, phone workflows will integrate seamlessly with the platforms around them, demonstrating the practical value of AI in optimizing the entire call workflow.
Integrated AI Services and Capabilities
Before implementing this extension, we should clearly define the AI services and capabilities we plan to integrate:
- OpenAI SDK: Provides the chat/completions API for intelligent content generation.
- OpenAI SDK: Provides the audio/transcriptions API for transcribing audio files to text.
- Azure Speech: provides real-time STT (Speech-to-Text) and TTS (Text-to-Speech) APIs.
Flowchart
Based on the above three sets of APIs, combined with Browser Extension APIs, let’s design the overall business process:

Implementation
Creating a Basic Browser Phone Extension Project
We can base our project on the open-source browser-extension-boilerplate, which supports building browser extensions for multiple browsers:
- Chrome
- Firefox
- Opera (Chrome Build)
- Edge (Chrome Build)
- Brave
- Safari
We will designate content.js as the script the extension injects by default into every tab, and background.js as the extension's service worker, which will open a separate page (client.js) that functions as the phone app.
Taking the manifest.json configuration of a Chrome MV3 extension as an example:
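A minimal MV3 manifest for this setup might look like the following (the extension name and match patterns are placeholders):

```json
{
  "manifest_version": 3,
  "name": "AI Browser Phone",
  "version": "1.0.0",
  "action": { "default_title": "AI Browser Phone" },
  "background": { "service_worker": "background.js" },
  "content_scripts": [
    { "matches": ["<all_urls>"], "js": ["content.js"] }
  ],
  "permissions": ["tabs", "activeTab", "debugger"]
}
```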
We should primarily focus on the permissions section:
- tabs and activeTab are mainly used for tab communication and status capture, etc.
- debugger will primarily be used for automating webpage operations.
Implementing Contact Information Extraction from Any Webpage
In content.js, we will implement basic content extraction from webpages:
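A minimal sketch of that listener (the GET_PAGE_CONTENT message type is a naming assumption, shared with the client.js sketch below):

```js
// content.js: reply to page-content requests from the extension.
chrome.runtime.onMessage.addListener((message, sender, sendResponse) => {
  if (message.type === "GET_PAGE_CONTENT") {
    // Respond synchronously with the page's visible text.
    sendResponse({ content: document.body.innerText });
  }
});
```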
And in client.js, we will implement the corresponding request for webpage content:
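A sketch of the client side, assuming the message type used by the listener above:

```js
// client.js: locate the active tab and request its text content.
async function getActiveTab() {
  const [tab] = await chrome.tabs.query({ active: true, currentWindow: true });
  return tab;
}

async function getPageContent() {
  const tab = await getActiveTab();
  const response = await chrome.tabs.sendMessage(tab.id, {
    type: "GET_PAGE_CONTENT",
  });
  return response.content;
}
```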
Here we only used the simple document.body.innerText API. A more complete extraction would account for content visibility, iframe content, and many other factors; due to space limitations, we won't elaborate further here.
Next, we will submit the extracted content to the OpenAI API to extract contact information:
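A sketch of that request; the Contact interface fields are illustrative, and fetchGPT is the helper discussed next:

```js
// client.js: extract contacts as JSON conforming to a TypeScript interface.
const CONTACTS_PROMPT = `Extract all contacts from the following webpage text.
Respond with ONLY a JSON array matching this TypeScript interface:
interface Contact {
  name: string;
  phoneNumber: string;
  email?: string;
  company?: string;
}`;

async function fetchContactsInfo() {
  const content = await getPageContent();
  const reply = await fetchGPT([
    { role: "system", content: CONTACTS_PROMPT },
    { role: "user", content },
  ]);
  return JSON.parse(reply);
}
```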
It’s worth mentioning that here we used a TypeScript interface as the prompt to ensure OpenAI returns a stable JSON data structure.
fetchGPT can be implemented independently. You can install the OpenAI SDK (openai) and set up your own API server.
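For example, fetchGPT might proxy chat/completions through your own server so the OpenAI key never ships inside the extension (the /api/chat route and response shape are assumptions about that server):

```js
// A possible fetchGPT: forward messages to a self-hosted proxy endpoint.
async function fetchGPT(messages) {
  const res = await fetch("https://your-api.example.com/api/chat", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ model: "gpt-4", messages }),
  });
  const data = await res.json();
  // Assumes the proxy returns the raw chat/completions response.
  return data.choices[0].message.content;
}
```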
Intelligent Contact Analysis and Sorting
Once we have obtained contact information, we can match it against contacts in the current phone system and retrieve the corresponding call recordings, voicemails, and SMS messages. SMS messages are already text, so we can organize them and submit them directly to OpenAI. Call recordings and voicemails, however, are audio files, so we will need the OpenAI audio/transcriptions API.
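A sketch against the audio/transcriptions endpoint; OPENAI_API_KEY is assumed configuration, and in practice the call should go through your own server rather than embedding the key in the extension:

```js
// Transcribe voicemail audio files via OpenAI audio/transcriptions.
async function transcriptVoicemail(files) {
  return Promise.all(
    files.map(async (file) => {
      const form = new FormData();
      form.append("file", file);
      form.append("model", "whisper-1");
      const res = await fetch(
        "https://api.openai.com/v1/audio/transcriptions",
        {
          method: "POST",
          headers: { Authorization: `Bearer ${OPENAI_API_KEY}` },
          body: form,
        }
      );
      const data = await res.json();
      return data.text;
    })
  );
}
```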
This way, we can obtain transcriptions of voicemails. Similarly, we can also get transcriptions of call recordings.
Next, we will submit this information to OpenAI to analyze emotion:
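A sketch of the analysis call; the 1-5 scoring scheme and JSON shape are assumptions:

```js
// Rate a contact's emotional tone from past interactions.
async function getEmotion(callRecordings, voiceMails, sms) {
  const prompt = `Based on the following interactions with one contact,
rate the contact's emotion from 1 (very negative) to 5 (very positive)
and give a one-sentence reason.
Respond as JSON: { "score": number, "reason": string }

Call recordings (transcribed): ${callRecordings.join("\n")}
Voicemails (transcribed): ${voiceMails.join("\n")}
SMS: ${sms.join("\n")}`;
  return JSON.parse(await fetchGPT([{ role: "user", content: prompt }]));
}
```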
Once we obtain the emotion analysis for each contact, we can sort these analyses to let users know the potential emotions associated with customer communication for each contact. This could indicate the ease or difficulty of communication, allowing users to plan their call strategies and schedules accordingly.
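For instance, contacts might simply be surfaced hardest-first (the emotion.score field comes from the sketch above):

```js
// Show the most difficult conversations first.
contacts.sort((a, b) => a.emotion.score - b.emotion.score);
```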
Real-time STT
Azure provides real-time STT and TTS services. We will use the official SDK, microsoft-cognitiveservices-speech-sdk, to implement STT:
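A sketch of continuous recognition over a WebRTC MediaStream; AZURE_KEY and AZURE_REGION are assumed configuration:

```js
import * as sdk from "microsoft-cognitiveservices-speech-sdk";

// Continuously recognize speech from a MediaStream and emit text updates.
function startRecognition(stream, onText) {
  const speechConfig = sdk.SpeechConfig.fromSubscription(AZURE_KEY, AZURE_REGION);
  speechConfig.speechRecognitionLanguage = "en-US";
  const audioConfig = sdk.AudioConfig.fromStreamInput(stream);
  const recognizer = new sdk.SpeechRecognizer(speechConfig, audioConfig);

  // Intermediate hypotheses while the speaker is still talking.
  recognizer.recognizing = (_s, e) => onText(e.result.text, false);
  // Finalized text for a segment, typically emitted at a pause.
  recognizer.recognized = (_s, e) => {
    if (e.result.reason === sdk.ResultReason.RecognizedSpeech) {
      onText(e.result.text, true);
    }
  };

  recognizer.startContinuousRecognitionAsync();
  return recognizer;
}
```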
In this way, using sdk.AudioConfig.fromStreamInput(stream) with our WebRTC stream allows us to achieve real-time STT. Furthermore, by calling this method separately for both the input and output streams of WebRTC, we can obtain real-time text content for both sides of the conversation.
It’s worth noting that microsoft-cognitiveservices-speech-sdk provides a remarkably comprehensive set of input and output stream configurations to accommodate a wide range of STT scenarios.
Intelligent Hints During Calls
In the Azure SDK, the recognized event fires when a segment of speech is finalized, typically at a pause in the conversation (the recognizing event carries the intermediate results before that). When recognized fires, we can submit the conversation text so far to OpenAI to surface potentially useful information:
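A sketch of the hint request; the { speaker, text } turn shape is an assumption:

```js
// Ask the model for one short, actionable hint for the agent.
async function getHint(conversation) {
  const prompt = `You are assisting an agent during a live phone call.
Given the conversation so far, reply with one short, actionable hint,
e.g. perform a requested calculation or suggest the next question.

Conversation:
${conversation.map((turn) => `${turn.speaker}: ${turn.text}`).join("\n")}`;
  return fetchGPT([{ role: "user", content: prompt }]);
}
```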
The hints we obtain are displayed in the conversation UI in real time to provide this intelligent assistance. For example, if the conversation involves discount calculations, the AI will provide the computed result.
Generating Emotion, Summary, and To-Do List After the Call
We will generate reports such as emotion analysis and summaries by using the complete conversation text as part of the prompt for OpenAI:
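A sketch, reusing the same transcript shape as before:

```js
// Post-call report: emotion assessment plus summary from the full transcript.
async function getReport(conversation) {
  const prompt = `Analyze this completed phone call.
Respond as JSON: { "emotion": string, "summary": string }

Transcript:
${conversation.map((turn) => `${turn.speaker}: ${turn.text}`).join("\n")}`;
  return JSON.parse(await fetchGPT([{ role: "user", content: prompt }]));
}
```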
In a similar manner, we will also generate a to-do list for the current conversation:
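A sketch with an assumed output shape:

```js
// Extract agreed follow-up tasks from the transcript.
async function getTodoList(conversation) {
  const prompt = `List the follow-up tasks agreed upon in this call.
Respond as a JSON array: [{ "task": string, "due": string | null }]

Transcript:
${conversation.map((turn) => `${turn.speaker}: ${turn.text}`).join("\n")}`;
  return JSON.parse(await fetchGPT([{ role: "user", content: prompt }]));
}
```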
Intelligently Filling Out Forms
As a fundamental communication service for businesses, phone systems are commonly integrated with third-party CRM platforms. Intelligently auto-filling CRM forms can therefore save substantial manual effort for business (B2B) users.
To implement intelligent form filling, the first step is to address how to extract and compress DOM information from form pages of any third-party CRM. Here, we reference TaxyAI’s DOM compression model:
By traversing all nodes in the DOM and filtering out invalid ones, we retain each valid node's relevant attribute information and assign it a DI (DOM identifier) tag.
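A simplified sketch of this traversal, loosely modeled on TaxyAI's approach; the attribute whitelist and the data-di tagging scheme are illustrative:

```js
// Attributes worth keeping for the model's understanding of each element.
const allowedAttributes = [
  "aria-label",
  "name",
  "placeholder",
  "role",
  "title",
  "type",
  "value",
];

// Walk the DOM, keep visible elements, and tag each with a DI id.
function compressDom(root = document.body) {
  let nextId = 0;
  const lines = [];
  const walker = document.createTreeWalker(root, NodeFilter.SHOW_ELEMENT);
  for (let node = walker.nextNode(); node; node = walker.nextNode()) {
    const rect = node.getBoundingClientRect();
    if (rect.width === 0 || rect.height === 0) continue; // invisible: skip
    const attrs = allowedAttributes
      .filter((name) => node.hasAttribute(name))
      .map((name) => `${name}="${node.getAttribute(name)}"`)
      .join(" ");
    node.setAttribute("data-di", String(nextId));
    lines.push(`<${node.tagName.toLowerCase()} di=${nextId} ${attrs}>`);
    nextId += 1;
  }
  return lines.join("\n");
}
```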
Once we have the compressed DOM information, we can identify the page's form fields by analyzing attributes such as visibility, effective width and height, and operability. Then we submit this to OpenAI:
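A sketch; the parameter shapes (extracted field labels plus the post-call report) are assumptions:

```js
// Map CRM form field labels to values drawn from the call data.
async function getFieldValues(labels, report) {
  const prompt = `Given these CRM form field labels and a phone call report,
produce a value for each field.
Respond as a JSON array: [{ "label": string, "value": string }]

Labels: ${JSON.stringify(labels)}
Report: ${JSON.stringify(report)}`;
  return JSON.parse(await fetchGPT([{ role: "user", content: prompt }]));
}
```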
Once we have fieldValues, we submit the current page and fieldValues to OpenAI, so that it can generate the values for the corresponding input fields:
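A sketch that resolves each value to a DI-tagged input on the page:

```js
// Match each field value to the input it belongs to in the compressed DOM.
async function getInputValues(fieldValues, dom) {
  const prompt = `Given a compressed DOM whose nodes carry di ids, and a set
of field values, match each value to the input it belongs to.
Respond as a JSON array: [{ "di": number, "value": string }]

DOM:
${dom}

Field values: ${JSON.stringify(fieldValues)}`;
  return JSON.parse(await fetchGPT([{ role: "user", content: prompt }]));
}
```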
Then, we send these values to the current CRM form page to complete the automatic form filling.
In client.js, we send inputValues:
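A sketch; here inputValues is passed in as a parameter, and the FILL_FORM message type matches the content.js listener below:

```js
// client.js: push the generated values to the CRM tab's content script.
async function sendInputValues(inputValues) {
  const tab = await getActiveTab();
  await chrome.tabs.sendMessage(tab.id, { type: "FILL_FORM", inputValues });
}
```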
And in content.js, we complete the final input filling:
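A sketch matching the data-di tags assigned during DOM compression:

```js
// content.js: fill each DI-tagged input and fire events the page can observe.
chrome.runtime.onMessage.addListener((message, sender, sendResponse) => {
  if (message.type === "FILL_FORM") {
    for (const { di, value } of message.inputValues) {
      const input = document.querySelector(`[data-di="${di}"]`);
      if (!input) continue;
      input.value = value;
      // Notify frameworks (React, Vue, ...) that the value changed.
      input.dispatchEvent(new Event("input", { bubbles: true }));
      input.dispatchEvent(new Event("change", { bubbles: true }));
    }
    sendResponse({ ok: true });
  }
});
```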
We omit the details of handling more complex form types.
Browser Automated To-Do Execution
Here we leverage TaxyAI's automated DOM-operation execution model, a companion to the compression model referenced earlier, which uses OpenAI to control the browser and perform repetitive actions on behalf of the user.
By appending Thought and Action contextual prompts, OpenAI will iteratively and logically deduce the next DOM operation until the current to-do item is fulfilled.
This is the crucial function for formatting the prompt:
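A sketch of such a ReAct-style prompt; the action vocabulary (click / setValue / finish) is modeled on TaxyAI but assumed here:

```js
// Build the prompt: task, prior Thought/Action pairs, and the current page.
function formatPrompt(taskInstructions, previousActions, pageContents) {
  const history = previousActions
    .map((a, i) => `${i + 1}. Thought: ${a.thought}\n   Action: ${a.action}`)
    .join("\n");
  return `You are controlling a browser to complete this task:
${taskInstructions}

Previous actions:
${history || "(none)"}

Current page (compressed DOM):
${pageContents}

Respond with:
Thought: <your reasoning>
Action: click(di) | setValue(di, "text") | finish()`;
}
```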
Conclusion
At this point, we have essentially implemented an AI-based browser phone extension that covers the entire phone call workflow.
This article serves as a Proof of Concept (PoC) and aims to outline the general implementation of the relevant processes. The application still has many details to improve in areas such as cost, privacy and security, and performance. For instance, cost could be controlled by pre-calculating token counts and splitting requests into segments, and privacy could be improved by de-sensitizing data locally with an in-browser (Web) LLM before it leaves the machine.