Meet Alibaba’s Page Agent: A JavaScript In-Page GUI Agent That Controls Web Interfaces With Natural Language Through the DOM

Most browser automation runs from the outside. Playwright, Puppeteer, Selenium, and browser-use all drive a browser from an external process. They read the page through screenshots, WebDriver, or the Chrome DevTools Protocol.

Alibaba’s Page Agent takes the opposite path. The agent lives inside the webpage as plain JavaScript. It reads the live DOM as text and acts as the real user. No headless browser, no screenshots, no multi-modal model.

The project is open-source under the MIT license. The codebase is TypeScript-first. It builds on browser-use, from which its DOM processing and prompt are derived.

TL;DR

Page Agent runs inside the page as JavaScript, reading the live DOM as text, not screenshots.
DOM dehydration compresses the page into a FlatDomTree so smaller text models can act precisely.
It is model-agnostic through any OpenAI-compatible endpoint and ships under the MIT license.
Prompt-level safety and single-page scope are real limits; keep server-side validation for risky actions.
Best fit: copilots and form-filling inside apps you own, not external or locked-down sites.

What is Page Agent?

Page Agent is a client-side library for adding agent behavior to a web app. You embed it, then issue commands in natural language. The agent finds elements, clicks buttons, and fills forms from within the page.

Because it runs in the browser session, it inherits the user’s cookies, session, and authentication. There is no separate backend to write. The existing UI validation and security rules stay in place.

The design is model-agnostic. You bring your own large language model through any OpenAI-compatible endpoint. Only text is sent to the model, so a strong text model is enough.

How DOM Dehydration Works

The core technique is what the team calls DOM dehydration. A modern page can hold thousands of nodes. Sending raw HTML to a model would be slow and expensive.

When a command arrives, the agent scans the Document Object Model. It identifies every interactive element, such as buttons, links, and input fields. Each element receives an index plus a role and a label.

The live DOM is converted into a FlatDomTree, a clean text map of what matters. Redundant markup is stripped out. The model reads this compact representation, not pixels.
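
To make the idea concrete, here is a minimal sketch of dehydration over a simplified node tree. It is not Page Agent's actual code: the SimpleNode and FlatEntry types, the helper names, and the output format are all assumptions chosen for illustration, and the real library walks the live DOM instead of plain objects.

```typescript
// Hypothetical, simplified stand-in for a DOM node (not the library's type).
interface SimpleNode {
  tag: string;
  attrs?: Record<string, string>;
  text?: string;
  children?: SimpleNode[];
}

// One line of the flattened map: an index plus a role and a label.
interface FlatEntry {
  index: number;
  role: string;
  label: string;
}

const INTERACTIVE_TAGS = new Set(["a", "button", "input", "select", "textarea"]);

function roleOf(node: SimpleNode): string {
  const explicit = node.attrs?.["role"];
  if (explicit) return explicit;
  if (node.tag === "a") return "link";
  if (node.tag === "input") return `input:${node.attrs?.["type"] ?? "text"}`;
  return node.tag;
}

// Gather visible text from a node and its descendants, collapsing whitespace.
function collectText(node: SimpleNode): string {
  const own = node.text ?? "";
  const nested = (node.children ?? []).map(collectText).join(" ");
  return `${own} ${nested}`.replace(/\s+/g, " ").trim();
}

function labelOf(node: SimpleNode): string {
  const a = node.attrs ?? {};
  return a["aria-label"] || a["placeholder"] || a["name"] || collectText(node) || "(unlabeled)";
}

// Depth-first walk that keeps only interactive elements and drops everything else.
function dehydrate(root: SimpleNode): FlatEntry[] {
  const entries: FlatEntry[] = [];
  const visit = (node: SimpleNode): void => {
    if (INTERACTIVE_TAGS.has(node.tag) || node.attrs?.["role"] === "button") {
      entries.push({ index: entries.length, role: roleOf(node), label: labelOf(node) });
      return; // children are folded into the label, not indexed separately
    }
    (node.children ?? []).forEach(visit);
  };
  visit(root);
  return entries;
}

function render(entries: FlatEntry[]): string {
  return entries.map((e) => `[${e.index}] <${e.role}> ${e.label}`).join("\n");
}

const page: SimpleNode = {
  tag: "div",
  children: [
    { tag: "h1", text: "Sign in" },
    { tag: "input", attrs: { type: "email", placeholder: "Email" } },
    { tag: "input", attrs: { type: "password", "aria-label": "Password" } },
    { tag: "div", children: [{ tag: "button", children: [{ tag: "span", text: "Log in" }] }] },
    { tag: "a", attrs: { href: "/reset" }, text: "Forgot password?" },
  ],
};

const flatText = render(dehydrate(page));
console.log(flatText);
```

The heading and wrapper elements disappear, and the model sees four short lines, from "[0] <input:email> Email" through "[3] <link> Forgot password?". An action can then refer to an element by its index alone.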

Under the hood, the agent delegates work to a PageController:

await this.pageController.updateTree()
await this.pageController.clickElement(index)
await this.pageController.inputText(index, text)
await this.pageController.scroll({ down: true, numPages: 1 })
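
The sketch below shows how an observe-decide-act loop might sit on top of those four calls. Only the method names come from the snippet above. The PageControllerLike interface, its return types, the Action union, the Planner function (standing in for the LLM call), and the mock controller are assumptions made so the example runs without a browser.

```typescript
// Hypothetical interface mirroring only the four calls quoted above; return types are assumed.
interface PageControllerLike {
  updateTree(): Promise<string>;
  clickElement(index: number): Promise<void>;
  inputText(index: number, text: string): Promise<void>;
  scroll(opts: { down: boolean; numPages: number }): Promise<void>;
}

type Action =
  | { type: "click"; index: number }
  | { type: "input"; index: number; text: string }
  | { type: "scroll"; down: boolean; numPages: number }
  | { type: "done" };

// Stand-in for the model: given the task, the current tree, and past actions, pick the next action.
type Planner = (task: string, tree: string, history: Action[]) => Promise<Action>;

async function runTask(
  task: string,
  controller: PageControllerLike,
  plan: Planner,
  maxSteps = 10,
): Promise<Action[]> {
  const history: Action[] = [];
  for (let step = 0; step < maxSteps; step++) {
    const tree = await controller.updateTree(); // observe: refresh the dehydrated view
    const action = await plan(task, tree, history); // decide
    history.push(action);
    switch (action.type) { // act
      case "click":
        await controller.clickElement(action.index);
        break;
      case "input":
        await controller.inputText(action.index, action.text);
        break;
      case "scroll":
        await controller.scroll({ down: action.down, numPages: action.numPages });
        break;
      case "done":
        return history;
    }
  }
  return history; // step budget exhausted
}

// Mock controller that records calls instead of touching a page.
const calls: string[] = [];
const mockController: PageControllerLike = {
  async updateTree() {
    calls.push("updateTree");
    return "[0] <input:email> Email\n[1] <button> Log in";
  },
  async clickElement(i) {
    calls.push(`click ${i}`);
  },
  async inputText(i, t) {
    calls.push(`input ${i} ${t}`);
  },
  async scroll(o) {
    calls.push(`scroll ${o.down ? "down" : "up"} ${o.numPages}`);
  },
};

// Scripted planner that replays a fixed sequence in place of a real model.
const DONE: Action = { type: "done" };
const scripted: Action[] = [
  { type: "input", index: 0, text: "ada@example.com" },
  { type: "click", index: 1 },
  DONE,
];
const scriptedPlanner: Planner = async (_task, _tree, history) => scripted[history.length] ?? DONE;

const tracePromise = runTask("Log in as ada@example.com", mockController, scriptedPlanner);
tracePromise.then((trace) => console.log(trace.length, calls));
```

Refreshing the tree before every decision matters because each click or input can change the page, which would make previously assigned indexes stale.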

The monorepo splits these concerns into small packages. @page-agent/core holds the headless agent logic. page-agent is the full entry class with a UI panel. @page-agent/page-controller handles DOM extraction and element indexing, with optional visual feedback through a SimulatorMask.

Developers keep control of scope. Operation allowlists limit which actions the agent may run. Data masking can hide sensitive fields, such as passwords, from the model. Custom knowledge can be injected so the agent follows your domain rules.
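
As a rough illustration of those two controls, the sketch below filters actions against an allowlist and masks sensitive values before anything is serialized for the model. The ScopePolicy shape, the function names, and the label pattern are hypothetical; Page Agent's real configuration options may look quite different.

```typescript
type ActionType = "click" | "input" | "scroll" | "submit";

// Hypothetical shape of one dehydrated form field, including its current value.
interface FieldEntry {
  index: number;
  role: string;
  label: string;
  value: string;
}

// Hypothetical policy object; not the library's real option names.
interface ScopePolicy {
  allowedActions: ReadonlySet<ActionType>;
  sensitiveLabel: RegExp;
}

const policy: ScopePolicy = {
  allowedActions: new Set<ActionType>(["click", "input", "scroll"]), // "submit" is withheld
  sensitiveLabel: /password|card number|cvv|ssn/i,
};

// Replace the value of any sensitive, non-empty field before the text reaches the model.
function maskForModel(entries: FieldEntry[], p: ScopePolicy): FieldEntry[] {
  return entries.map((e) =>
    p.sensitiveLabel.test(e.label) && e.value !== "" ? { ...e, value: "***" } : e,
  );
}

// Gate every action the model proposes against the allowlist.
function isAllowed(action: ActionType, p: ScopePolicy): boolean {
  return p.allowedActions.has(action);
}

const fields: FieldEntry[] = [
  { index: 0, role: "input:email", label: "Email", value: "ada@example.com" },
  { index: 1, role: "input:password", label: "Password", value: "hunter2" },
];

const masked = maskForModel(fields, policy);
console.log(masked.map((f) => `${f.label}=${f.value}`).join(", "));
console.log(isAllowed("click", policy), isAllowed("submit", policy));
```

Masking happens on a copy, so the page itself keeps the real values while the model only ever sees the placeholder.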

How It Compares

| Approach | Where it runs | Reads the page via | Setup | Best fit |
| --- | --- | --- | --- | --- |
| Page Agent | Inside the page (client-side JS) | Dehydrated text DOM | One script tag or npm | Copilots inside apps you own |
| Selenium / Playwright / Puppeteer | External process | DOM via driver (WebDriver/CDP) | Driver plus runtime or server | Scripted end-to-end testing |
| browser-use | External process | DOM plus optional vision | Python plus a browser | Autonomous multi-site agents |
| WebMCP | Server-side tools | Structured function calls | Requires standard adoption | Native agent tool access |

The takeaway is scope, not speed. Page Agent fits products you control and can add code to. External drivers still win for cross-site scraping and locked-down environments.

Use Cases, With Examples

SaaS AI copilot: Ship an assistant that operates the product, not one that only gives instructions. A support bot can perform the steps for the user instead of describing them.
Smart form filling: Collapse a multi-step ERP or CRM form into one instruction. A user types ‘Submit a travel expense for $50 for lunch yesterday.’ The agent handles the navigation and data entry.
Accessibility: Pair it with the Web Speech API for voice control. Any web app becomes reachable through natural language, with screen-reader friendly announcements.
Legacy app modernization: It can wrap a legacy internal tool that has no API. You add a command bar without changing the original code.

Quick Start

For evaluation, a single script tag loads Page Agent with a free testing LLM. That demo endpoint is for technical evaluation only. Production use needs your own model credentials.

For a build, install the package and configure your endpoint:

import { PageAgent } from 'page-agent'

const agent = new PageAgent({
  model: 'qwen3.5-plus',
  baseURL: 'https://dashscope.aliyuncs.com/compatible-mode/v1',
  apiKey: 'YOUR_API_KEY',
  language: 'en-US',
})

await agent.execute('Click the login button')

The model and baseURL fields accept any OpenAI-compatible provider. Swapping models is mostly a base-URL and key change.

Note: a key passed to new PageAgent ships inside your client bundle. For production, proxy requests through your own backend instead. The agent can also show each critical action for approval before running it.
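
One way to keep the key off the client is a thin proxy on your own backend. The sketch below covers only the part that builds the upstream request; the UpstreamConfig type, the buildUpstreamRequest helper, the model allowlist, and the placeholder secret are assumptions for illustration. The /chat/completions path is the standard route on OpenAI-compatible endpoints. Whether the browser-side agent can simply point its baseURL at such a proxy is also an assumption worth checking against the project's documentation.

```typescript
// Hypothetical server-side settings; in a real service the key would come from a secret store.
interface UpstreamConfig {
  baseURL: string;
  apiKey: string;
  allowedModels: string[];
}

interface ChatBody {
  model: string;
  messages: { role: string; content: string }[];
}

interface UpstreamRequest {
  url: string;
  init: { method: "POST"; headers: Record<string, string>; body: string };
}

// Validate what the browser sent, then attach the server-held key.
function buildUpstreamRequest(clientBody: unknown, cfg: UpstreamConfig): UpstreamRequest {
  const body = clientBody as Partial<ChatBody> | null;
  if (!body || typeof body.model !== "string" || !Array.isArray(body.messages)) {
    throw new Error("invalid chat request");
  }
  if (!cfg.allowedModels.includes(body.model)) {
    throw new Error(`model not allowed: ${body.model}`);
  }
  return {
    url: `${cfg.baseURL.replace(/\/+$/, "")}/chat/completions`,
    init: {
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        Authorization: `Bearer ${cfg.apiKey}`, // the key never leaves the server
      },
      // Forward only the known fields; anything else the client sent is dropped.
      body: JSON.stringify({ model: body.model, messages: body.messages }),
    },
  };
}

const upstream: UpstreamConfig = {
  baseURL: "https://dashscope.aliyuncs.com/compatible-mode/v1",
  apiKey: "sk-server-side-secret", // placeholder value
  allowedModels: ["qwen3.5-plus"],
};

const proxied = buildUpstreamRequest(
  { model: "qwen3.5-plus", messages: [{ role: "user", content: "Click the login button" }] },
  upstream,
);
console.log(proxied.url);
```

A route handler would then pass proxied.url and proxied.init to fetch and relay the response. The proxy is also a natural place to add per-user authentication and rate limits.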

What Works and What Doesn’t

Strong integration story: A copilot ships in a few lines of code. There is no backend rewrite and no extension to distribute.
Lower model cost: Because only text moves, you avoid multi-modal models and their pricing. Precision comes from reading structure, not guessing from pixels.
Prompt-based safety has limits: Rules like “never auto-submit a payment form” live in the system prompt. These are persuasive guides, not hard guarantees. For sensitive or destructive actions, keep server-side validation. Prompt instructions should not be your only control.
Single-page focus: The core library targets interactions within one view. It cannot move across tabs or windows by itself. Multi-page automation needs the optional Chrome extension, which requires separate install and permissions. A Beta MCP server also lets outside agents, like Claude Desktop or Copilot, drive it.
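
To illustrate the point about prompt-based safety, here is a minimal sketch of a server-side check that holds no matter what the in-page agent was told. The PaymentRequest shape, the confirmation-token scheme, the amount cap, and the currency list are all hypothetical. The idea is that the backend rejects a destructive request unless it carries proof of a confirmation step the agent cannot complete for the user.

```typescript
// Hypothetical request shape for a sensitive action.
interface PaymentRequest {
  amountCents: number;
  currency: string;
  confirmationToken: string; // issued by the server after a separate user confirmation step
}

type Verdict = { ok: true } | { ok: false; reason: string };

const MAX_AMOUNT_CENTS = 50000; // hypothetical per-request cap
const SUPPORTED_CURRENCIES = ["USD", "EUR"]; // hypothetical

// Runs on the server, so it applies whether the request came from a human or an agent.
function validatePayment(req: PaymentRequest, expectedToken: string): Verdict {
  if (req.confirmationToken === "" || req.confirmationToken !== expectedToken) {
    return { ok: false, reason: "missing or invalid confirmation token" };
  }
  if (!Number.isInteger(req.amountCents) || req.amountCents <= 0) {
    return { ok: false, reason: "amount must be a positive integer number of cents" };
  }
  if (req.amountCents > MAX_AMOUNT_CENTS) {
    return { ok: false, reason: "amount exceeds per-request cap" };
  }
  if (!SUPPORTED_CURRENCIES.includes(req.currency)) {
    return { ok: false, reason: "unsupported currency" };
  }
  return { ok: true };
}

const agentAttempt = validatePayment(
  { amountCents: 5000, currency: "USD", confirmationToken: "" },
  "tok-issued-to-user",
);
const confirmedAttempt = validatePayment(
  { amountCents: 5000, currency: "USD", confirmationToken: "tok-issued-to-user" },
  "tok-issued-to-user",
);
console.log(agentAttempt, confirmedAttempt);
```

Even if the model ignores a "never auto-submit" instruction, the first attempt fails on the server because no confirmation token was supplied.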
