Building an AI-based browser phone extension
2023-05-01
Motivation
On March 14, 2023, OpenAI released GPT-4, a transformer-based model that was more reliable, intelligent, and capable of handling nuanced instructions than GPT-3.5. GPT-4 performed strongly across many benchmarks, which accelerated the growth of AI-powered applications. Among these application areas, the phone remains a fundamental human communication tool and a core communication service for businesses. The potential for AI to improve communication efficiency and quality in phone interactions is worth exploring.
Unlike traditional phone applications, the browser phone extension we aim to build uses AI to assist at every stage of the call workflow and improve the user’s calling experience.
- Before a call, we want it to extract contact information from third-party webpages, especially complex ones. It should also analyze relevant contact data such as call recordings, voicemails, and SMS messages associated with each identified contact. This helps users understand past interactions, sentiment signals, and possible call summaries before calling.
- During a call, the spoken dialogue will be transcribed into text in real time. Based on this live transcript, the extension should offer relevant prompts and assistance. It could even deploy an AI voicebot to participate in the conversation.
- After a call concludes, it will generate a call sentiment assessment, conversation summary, and to-do list based on the current call information, the conversation transcript, and potentially past interactions with the contact. It will also fill in forms automatically on any CRM page.
Once this browser phone extension delivers these capabilities, phone workflows can be integrated more naturally with related business platforms. This demonstrates a practical use of AI across the entire call workflow.
Integrated AI Services and Capabilities
Before implementing this extension, we should clearly define the AI services and capabilities we plan to integrate in this PoC:
- OpenAI SDK: Provides the chat/completions API for intelligent content generation.
- OpenAI SDK: Provides the audio/transcriptions API for transcribing audio files to text.
- Azure: Additionally offers real-time STT (Speech-to-Text) and TTS (Text-to-Speech) APIs.
Flowchart
Based on the above three sets of APIs, combined with Browser Extension APIs, let’s design the overall business process:

Implementation
Creating a basic browser phone extension project
We can base our project on the open-source browser-extension-boilerplate, which supports building browser extensions for multiple browsers:
- Chrome
- Firefox
- Opera (Chrome Build)
- Edge (Chrome Build)
- Brave
- Safari
We will use content.js as the JavaScript file injected into tabs by default. We will use background.js as the browser extension service worker, which opens a separate page (client.js) that functions as the phone app.
Taking the manifest.json configuration of a Chrome MV3 extension as an example:
1 | { |
We should primarily focus on the permissions section:
- tabs and activeTab are mainly used for tab communication and state capture.
- debugger is used for automating webpage operations. This permission is sensitive and should be requested only when the automation feature is enabled and clearly explained to users.
Implementing Contact Information Extraction from Any Webpage
In content.js, we will implement basic content extraction from webpages:
1 | chrome.runtime.onMessage.addListener((message, sender, sendResponse) => { |
And in client.js, we will implement the corresponding request for webpage content:
1 | async function getActiveTab() { |
This example only uses the simple document.body.innerText API. A production extractor should consider content visibility, iframe content, shadow DOM, dynamic loading, and other factors. Due to space limitations, we will not elaborate further here.
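As a minimal sketch of the content-script side, the handler below answers a page-text request with whitespace-normalized, length-capped text. The message type get-page-text, the helper names, and the 8000-character cap are assumptions made for illustration; they are not taken from the original code.

```javascript
// Hypothetical message type; the real message shape may differ.
const GET_PAGE_TEXT = "get-page-text";

// Collapse whitespace and cap the length so the text fits in a prompt.
function normalizePageText(rawText, maxChars = 8000) {
  return rawText.replace(/\s+/g, " ").trim().slice(0, maxChars);
}

// Pure handler: returns a response object, or null for unrelated messages.
function handleMessage(message, readText) {
  if (!message || message.type !== GET_PAGE_TEXT) return null;
  return { text: normalizePageText(readText()) };
}

// Register the listener only when running inside an extension content script.
if (typeof chrome !== "undefined" && chrome.runtime?.onMessage) {
  chrome.runtime.onMessage.addListener((message, sender, sendResponse) => {
    const response = handleMessage(message, () => document.body.innerText);
    if (response) sendResponse(response);
  });
}
```

Keeping the handler separate from the listener registration makes it possible to exercise the logic outside the browser.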
Next, we will submit the extracted content to the OpenAI API to extract contact information:
1 | async function fetchContactsInfo() { |
Here we use a TypeScript-like shape in the prompt to guide the model toward a stable JSON structure.
fetchGPT can be implemented independently. In production, avoid putting API keys directly in the browser extension. Use a backend proxy, token exchange, or another controlled authorization mechanism.
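The prompt and the reply handling could look roughly like the sketch below. The Contact fields and both function names are hypothetical; the reply is assumed to be the text that fetchGPT returns. Because models sometimes wrap JSON in extra prose, the parser only reads the span between the first "[" and the last "]".

```javascript
// Prompt that describes the expected JSON with a TypeScript-like shape.
// The field names are illustrative assumptions.
function buildContactsPrompt(pageText) {
  return [
    "Extract every contact from the page text below.",
    "Reply with JSON only, matching this shape:",
    "type Contact = { name: string; phoneNumbers: string[]; email?: string };",
    "type Reply = Contact[];",
    "Page text:",
    pageText,
  ].join("\n");
}

// Parse the span from the first "[" to the last "]" and keep well-formed items.
function parseContactsReply(reply) {
  const start = reply.indexOf("[");
  const end = reply.lastIndexOf("]");
  if (start === -1 || end <= start) return [];
  let parsed;
  try {
    parsed = JSON.parse(reply.slice(start, end + 1));
  } catch {
    return [];
  }
  if (!Array.isArray(parsed)) return [];
  return parsed.filter(
    (c) => c && typeof c.name === "string" && Array.isArray(c.phoneNumbers)
  );
}
```

Malformed replies yield an empty list rather than an exception, so the UI can simply show "no contacts found".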
Intelligent Contact Analysis and Sorting
Once we have obtained contact information, we can match it with contacts in the current phone system and also match the corresponding call recordings, voicemails, and SMS messages. SMS messages are typically text-based, so we can organize this information and submit it directly to OpenAI. However, call recordings and voicemails are audio files, so we will need to use the OpenAI audio/transcriptions API.
1 | async function transcriptVoicemail(files) { |
This way, we can obtain transcriptions of voicemails. Similarly, we can also get transcriptions of call recordings.
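A hedged sketch of the transcription request follows. It assumes the audio is available as a Blob and that requests go through a backend proxy; PROXY_URL and the function names are hypothetical. The form fields file, model, and response_format follow the audio/transcriptions API, with whisper-1 as the model name.

```javascript
// Build the multipart body for an audio transcription request.
function buildTranscriptionForm(audioBlob, fileName) {
  const form = new FormData();
  form.append("file", audioBlob, fileName);
  form.append("model", "whisper-1");
  form.append("response_format", "json");
  return form;
}

// Hypothetical backend endpoint that adds the API key and forwards the
// request to OpenAI, so the key never ships with the extension.
const PROXY_URL = "https://example.com/api/transcriptions";

// fetchImpl is injectable so the function can be tested without a network.
async function transcribe(audioBlob, fileName, fetchImpl = fetch) {
  const response = await fetchImpl(PROXY_URL, {
    method: "POST",
    body: buildTranscriptionForm(audioBlob, fileName),
  });
  if (!response.ok) {
    throw new Error(`Transcription failed: ${response.status}`);
  }
  const { text } = await response.json();
  return text;
}
```

The same function can be reused for call recordings, since both are plain audio files.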
Next, we submit this information to OpenAI to analyze sentiment:
1 | async function getEmotion(callRecordings, voiceMails, sms) { |
Once we obtain sentiment analysis for each contact, we can sort the results to help users understand the likely communication context for each customer. This can help users plan call strategy and scheduling.
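One way to sort the results is sketched below. It assumes each analysis result has the shape { name, sentiment, score }, where sentiment is "negative", "neutral", or "positive" and score is a confidence between 0 and 1. Both the shape and the "negative first" ordering are assumptions; a different team might prefer the opposite order.

```javascript
// Lower number means higher priority in the call list.
const SENTIMENT_PRIORITY = { negative: 0, neutral: 1, positive: 2 };

// Negative contacts first, then by descending confidence score.
// Unknown labels are treated as neutral. The input array is not mutated.
function sortBySentiment(contacts) {
  return [...contacts].sort((a, b) => {
    const byLabel =
      (SENTIMENT_PRIORITY[a.sentiment] ?? 1) -
      (SENTIMENT_PRIORITY[b.sentiment] ?? 1);
    return byLabel !== 0 ? byLabel : b.score - a.score;
  });
}
```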
Real-time STT
Azure provides real-time STT and TTS services. We will use the official SDK, microsoft-cognitiveservices-speech-sdk, to implement STT:
1 | import * as sdk from "microsoft-cognitiveservices-speech-sdk"; |
In this way, passing our WebRTC stream to sdk.AudioConfig.fromStreamInput(stream) gives us real-time STT. By creating a separate recognizer for the local (microphone) stream and for the remote stream, we can obtain real-time text for both sides of the conversation.
It’s worth noting that microsoft-cognitiveservices-speech-sdk also supports other audio input configurations, such as the default microphone, WAV files, and push or pull streams, so the same approach can be adapted to other STT scenarios.
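The wiring can be sketched as follows. The sdk parameter is the imported microsoft-cognitiveservices-speech-sdk module, passed in explicitly so the function has no hidden dependencies. The function name, the options object, and the idea of obtaining token and region from a backend token endpoint are assumptions for illustration.

```javascript
// Start continuous recognition on one audio stream (local or remote).
function startRecognition(sdk, stream, options) {
  const { token, region, language = "en-US", onPartial, onFinal } = options;

  const speechConfig = sdk.SpeechConfig.fromAuthorizationToken(token, region);
  speechConfig.speechRecognitionLanguage = language;
  const audioConfig = sdk.AudioConfig.fromStreamInput(stream);
  const recognizer = new sdk.SpeechRecognizer(speechConfig, audioConfig);

  // Intermediate hypotheses while the speaker is still talking.
  recognizer.recognizing = (_sender, event) => onPartial?.(event.result.text);

  // Final text for one utterance; other reasons (such as no match) are skipped.
  recognizer.recognized = (_sender, event) => {
    if (event.result.reason === sdk.ResultReason.RecognizedSpeech) {
      onFinal?.(event.result.text);
    }
  };

  recognizer.startContinuousRecognitionAsync();
  return recognizer;
}
```

Calling this once per stream, with an onFinal callback that tags the text with the speaker, yields a two-sided transcript.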
Intelligent Hints During Calls
In the Azure SDK, the recognizing event delivers intermediate results while a speaker is still talking, and the recognized event delivers the final result for one utterance, typically once the speaker pauses. When recognized fires, we can submit the conversation text so far to OpenAI for potential hints:
1 | async function getHint(conversation) { |
The hints are displayed in the conversation UI in real time. For example, if the conversation involves a discount calculation, the AI can provide the calculated result.
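To keep each hint request small, one option is to send only the most recent turns. The sketch below assumes conversation items of the form { speaker, text } and uses a NONE sentinel so the model can decline to answer. The function names, the window size, and the prompt wording are illustrative assumptions.

```javascript
// Build chat messages from the last `maxTurns` turns of the conversation.
function buildHintMessages(conversation, maxTurns = 12) {
  const recent = conversation.slice(-maxTurns);
  const transcript = recent
    .map((turn) => `${turn.speaker}: ${turn.text}`)
    .join("\n");
  return [
    {
      role: "system",
      content:
        "You assist a phone agent during a live call. Reply with one short, " +
        "actionable hint, or NONE if no hint is useful.",
    },
    { role: "user", content: transcript },
  ];
}

// Treat an empty reply or the NONE sentinel as "no hint".
function extractHint(reply) {
  const hint = reply.trim();
  return hint === "" || hint.toUpperCase() === "NONE" ? null : hint;
}
```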
Generating Emotion, Summary, and To-Do List After the Call
We will generate reports such as emotion analysis and summaries by using the complete conversation text as part of the prompt for OpenAI:
1 | async function getReport(conversation) { |
In a similar manner, we will also generate a to-do list for the current conversation:
1 | async function getTodoList(conversation) { |
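As a variation on the two separate calls, the sentiment, summary, and to-do list could be requested in a single JSON reply and validated before display. This is a sketch under assumptions: the model name, the JSON field names, and both function names are hypothetical, while the request body fields (model, temperature, messages) follow the chat/completions API.

```javascript
// Request body for the chat/completions API.
function buildReportRequest(transcript) {
  return {
    model: "gpt-4",
    temperature: 0,
    messages: [
      {
        role: "system",
        content:
          "Summarize the call. Reply with JSON only: " +
          '{"sentiment": "positive" | "neutral" | "negative", ' +
          '"summary": string, "todos": string[]}',
      },
      { role: "user", content: transcript },
    ],
  };
}

// Validate the reply before showing it; return null when it is malformed.
function parseReport(reply) {
  try {
    const report = JSON.parse(reply);
    const validSentiment = ["positive", "neutral", "negative"].includes(
      report.sentiment
    );
    const validTodos =
      Array.isArray(report.todos) &&
      report.todos.every((todo) => typeof todo === "string");
    return validSentiment && typeof report.summary === "string" && validTodos
      ? report
      : null;
  } catch {
    return null;
  }
}
```

A single request costs fewer tokens than two, at the price of a slightly more complex reply to validate.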
Intelligently Filling Out Forms
As a fundamental business communication service, phone systems are commonly integrated with third-party CRM platforms. Intelligent CRM form filling can improve collaboration for B2B users.
To implement intelligent form filling, the first step is to address how to extract and compress DOM information from form pages of any third-party CRM. Here, we reference TaxyAI’s DOM compression model:
By traversing all nodes in the DOM and analyzing their validity, we retain useful attribute information and assign a DI (“DOM Identifier”) tag to each valid node.
1 | const allowedAttributes = [ |
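The traversal can be sketched roughly as below. It is a simplified illustration rather than TaxyAI's actual implementation: the attribute list, the set of interactive tags, and the function name are assumptions, and text is only read from leaf elements. The function relies only on tagName, attributes, children, and textContent, which real DOM elements provide.

```javascript
// Attributes worth keeping; an assumed list for illustration.
const KEPT_ATTRIBUTES = ["aria-label", "name", "type", "placeholder", "value", "role", "title"];
const INTERACTIVE_TAGS = new Set(["a", "button", "input", "select", "textarea"]);

// Returns a compact HTML-like string. Each kept node gets a di attribute,
// and registry[di] holds the element so later actions can find it.
function compressElement(element, registry) {
  const tag = element.tagName.toLowerCase();
  const children = Array.from(element.children)
    .map((child) => compressElement(child, registry))
    .filter(Boolean);
  const interactive = INTERACTIVE_TAGS.has(tag);
  const ownText =
    children.length === 0 ? (element.textContent || "").trim() : "";

  // Drop nodes with no text, no interactive role, and no useful children.
  if (!interactive && !ownText && children.length === 0) return "";
  // Unwrap pure containers so nesting depth does not cost tokens.
  if (!interactive && !ownText) return children.join("");

  const attrs = Array.from(element.attributes)
    .filter((attr) => KEPT_ATTRIBUTES.includes(attr.name) && attr.value)
    .map((attr) => ` ${attr.name}="${attr.value}"`)
    .join("");
  const di = registry.push(element) - 1;
  return `<${tag} di="${di}"${attrs}>${ownText}${children.join("")}</${tag}>`;
}
```

In a content script this would be called as compressElement(document.body, []), ideally after filtering out invisible elements.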
Once we have compressed DOM information, we can infer webpage labels by analyzing visibility, effective width and height, operability, and related attributes. Then we submit this information to OpenAI:
1 | async function getFieldValues( |
Once we have fieldValues, we submit the current page DOM and fieldValues to OpenAI so it can map values to the corresponding input fields:
1 | async function getInputValues(fieldValues, dom) { |
Then we send these values to the current CRM form page to complete automatic form filling. In production, this step should include user confirmation, validation, and rollback support before writing data into third-party systems.
In client.js, we send inputValues:
1 | async function sendInputValues() { |
In content.js, complete the final input filling:
1 | chrome.runtime.onMessage.addListener((message, sender, sendResponse) => { |
We omit the details of handling more complex form types.
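For simple text inputs, the filling step might look like the sketch below. It assumes inputValues is a list of { selector, value } pairs, which is a hypothetical shape, and it returns the selectors it could not fill so the UI can report them. Some frameworks keep their own copy of input state and may need additional handling beyond setting value and dispatching events.

```javascript
// Fill matching fields under `root` (for example, document) and return the
// selectors that were skipped because the field was missing or not editable.
function fillInputs(root, inputValues) {
  const skipped = [];
  for (const { selector, value } of inputValues) {
    const field = root.querySelector(selector);
    if (!field || field.disabled || field.readOnly) {
      skipped.push(selector);
      continue;
    }
    field.value = value;
    // Dispatch events so page scripts listening for user edits are notified.
    field.dispatchEvent(new Event("input", { bubbles: true }));
    field.dispatchEvent(new Event("change", { bubbles: true }));
  }
  return skipped;
}
```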
Browser Automated To-Do Execution
We use the TaxyAI-style automated DOM operation model mentioned earlier, which uses OpenAI to control the browser and perform repetitive actions on behalf of the user.
By appending Thought and Action context prompts, OpenAI can iteratively infer the next DOM operation until the current to-do item is completed.
This is the crucial function for formatting the prompt:
1 | function formatPrompt(taskInstructions, previousActions, pageContents) { |
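A minimal sketch of such a prompt formatter, together with a parser for the reply, is shown below. It keeps the formatPrompt signature from the article, but the prompt wording, the action names (click, setValue, finish), the { thought, action } history shape, and the parseStep helper are assumptions for illustration. Here di refers to the DOM identifier assigned during DOM compression.

```javascript
// previousActions is assumed to be [{ thought, action }]; the action names
// in the prompt are illustrative, not a fixed vocabulary.
function formatPrompt(taskInstructions, previousActions, pageContents) {
  const history = previousActions.length
    ? previousActions
        .map((step, i) => `${i + 1}. Thought: ${step.thought}\n   Action: ${step.action}`)
        .join("\n")
    : "(none yet)";
  return [
    "You control a browser. Available actions: click(di), setValue(di, text), finish().",
    `Task: ${taskInstructions}`,
    `Previous steps:\n${history}`,
    `Current page:\n${pageContents}`,
    "Reply with exactly two lines:\nThought: <reasoning>\nAction: <one action>",
  ].join("\n\n");
}

// Extract the Thought and Action lines from a model reply, or return null.
function parseStep(reply) {
  const thought = reply.match(/^Thought:\s*(.+)$/m);
  const action = reply.match(/^Action:\s*(.+)$/m);
  return thought && action
    ? { thought: thought[1].trim(), action: action[1].trim() }
    : null;
}
```

Each parsed step is appended to previousActions before the next request, and the loop stops when the action is finish() or a step limit is reached.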
Conclusion
At this point, we have implemented a proof-of-concept AI-based browser phone extension that covers the full phone call workflow.
This article describes a Proof of Concept (PoC) and outlines the general implementation path. A production application still needs substantial work in cost control, privacy, security, reliability, and performance. For example, cost can be controlled through token estimation and segmented requests. Privacy risks can be reduced through local Web LLM-based redaction, backend policy enforcement, consent handling for call recording, and explicit user confirmation before CRM writes or browser automation.
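As an illustration of token estimation and segmented requests, the sketch below splits a transcript into chunks that stay within a token budget. The four-characters-per-token ratio is a rough heuristic for English text, not an exact count, and the function names are hypothetical. A real tokenizer should be used when precise limits matter.

```javascript
// Rough heuristic: about 4 characters per token for English text.
function estimateTokens(text) {
  return Math.ceil(text.length / 4);
}

// Group transcript lines into chunks that each stay within maxTokens.
// A single line larger than the budget still becomes its own chunk.
function chunkByTokens(lines, maxTokens) {
  const chunks = [];
  let current = [];
  let used = 0;
  for (const line of lines) {
    const cost = estimateTokens(line);
    if (current.length > 0 && used + cost > maxTokens) {
      chunks.push(current.join("\n"));
      current = [];
      used = 0;
    }
    current.push(line);
    used += cost;
  }
  if (current.length > 0) chunks.push(current.join("\n"));
  return chunks;
}
```

Each chunk can then be summarized separately, and the partial summaries combined in a final request.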