🦾 [Browser Automation] Possibilities and Limitations: Browser Control Boundaries from a CDP Perspective

December 30, 2025 · 12 min read

WeChat Official Account: @卤代烃实验室

If we consider all human operations on a computer as the universal set, then the operational scope of Computer, Browser, CDP, and puppeteer can be categorized as follows:

Computer: The universal set of all operations
Browser: App-level permissions. Browser restricts many capabilities for security, such as direct access to local files
CDP: Focused on debugging capabilities. Non-debugging information of the browser (such as bookmarked web pages) cannot be accessed
puppeteer: Built on top of CDP, but doesn't utilize all CDP APIs, so its capabilities are a subset of CDP

In Browser-Use scenarios, CDP has clear capability boundaries, unlike VNC, which is a general-purpose screen-casting solution. Understanding its strengths and limitations therefore matters for both the overall architecture and its future evolution.

The rest of this article takes the browser's perspective and lists what CDP (puppeteer) can accomplish. Support levels are classified as follows:

Direct Support: puppeteer (pptr) has ready-made APIs available
Indirect Support: Requires combining multiple pptr APIs/CDP APIs to implement the functionality
Not Supported: Things CDP is completely incapable of doing

I. Browser Functions

This mainly refers to global browser-level functions, including tabs management, page navigation, etc.

Tabs

Necessity: High
Support Level: Indirect Support
pptr API: Browser.newPage(), Browser.pages(), Page.bringToFront(), Page.close()

Tabs are a very important feature. CDP's tab-related APIs are quite weak, and puppeteer does not provide a dedicated abstraction on top of them. Implementing the 4 basic tab functions of "create/update/switch/close" therefore requires indirectly combining multiple APIs.

tabs

There is one more issue: if tabs are dragged to change their display order, CDP cannot perceive the change. For example, the tabs in the image above have been reordered by dragging, but CDP still returns them in the order "History, Baidu, Today's Headlines". In practice, this issue can be ignored.
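
The "combine several APIs" idea can be sketched roughly as below. This is only a minimal sketch: `TabManager` and its method names are hypothetical helpers invented for illustration, and only `newPage()`, `pages()`, `bringToFront()`, `close()`, `goto()`, `url()` and `title()` are assumed to come from puppeteer.

```javascript
// Sketch: a thin tab registry over puppeteer's Browser/Page objects.
// `TabManager` is a hypothetical helper, not a puppeteer API.
class TabManager {
  constructor(browser) {
    this.browser = browser;
    this.active = null; // the Page we currently treat as the foreground tab
  }

  // Create: open a new tab and optionally navigate it.
  async create(url) {
    const page = await this.browser.newPage();
    if (url) await page.goto(url);
    this.active = page;
    return page;
  }

  // List: tabs in the order CDP reports them, which may differ from the
  // visual order after the user drags tabs around.
  async list() {
    const pages = await this.browser.pages();
    return Promise.all(
      pages.map(async (page, index) => ({
        index,
        url: page.url(),
        title: await page.title(),
        active: page === this.active,
      }))
    );
  }

  // Switch: bring the tab at `index` to the foreground.
  async switchTo(index) {
    const page = (await this.browser.pages())[index];
    if (!page) throw new RangeError(`no tab at index ${index}`);
    await page.bringToFront();
    this.active = page;
    return page;
  }

  // Close: close the tab; if it was active, fall back to the last remaining tab.
  async close(index) {
    const page = (await this.browser.pages())[index];
    if (!page) throw new RangeError(`no tab at index ${index}`);
    await page.close();
    if (this.active === page) {
      const rest = await this.browser.pages();
      this.active = rest.length ? rest[rest.length - 1] : null;
      if (this.active) await this.active.bringToFront();
    }
  }
}
```

Keeping the "active" tab in our own state is a design choice of this sketch, since the text above only relies on puppeteer's per-page calls.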

Navigate

Necessity: High
Support Level: Direct Support
pptr API: Page.goBack(), Page.goForward(), Page.reload(), Page.goto()

Navigate mainly refers to navigation within a Tab, including the 4 most basic functions: back/forward/refresh/goto.

Puppeteer has provided good encapsulation based on CDP's Page.getNavigationHistory and Page.navigateToHistoryEntry APIs, so they can basically be used directly.
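
As a small illustration of how directly these calls can be used, here is a sketch that routes an action name to the matching Page method. The `navigate` function and its action names are hypothetical; the `waitUntil` value is just one reasonable choice, not something the original setup prescribes.

```javascript
// Sketch: route a navigation action name to the matching puppeteer Page call.
// `navigate` and the action names are hypothetical; the Page methods are real.
async function navigate(page, action, url) {
  const options = { waitUntil: "domcontentloaded" };
  switch (action) {
    case "back":
      return page.goBack(options); // resolves to null when there is no entry to go back to
    case "forward":
      return page.goForward(options);
    case "refresh":
      return page.reload(options);
    case "goto":
      if (!url) throw new TypeError("goto requires a url");
      return page.goto(url, options);
    default:
      throw new RangeError(`unknown navigation action: ${action}`);
  }
}
```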

However, it's worth noting that the suggestion records of the address bar input box cannot be obtained through CDP.

Extensions (Browser Plugins)

Necessity: Medium
Support Level: Direct Support
pptr API: Chrome Extensions

Extensions can be injected directly at browser initialization through pptr APIs. Extension support is quite important: an ad blocker, for example, can remove some ad DOM nodes, making web pages cleaner and easier for models to locate elements in.
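
One commonly documented way to do this is to pass Chrome's extension switches at launch. The sketch below only builds the launch options; `extensionLaunchOptions` and the example path are hypothetical, running headful is an assumption, and whether a given Chrome build still honors these switches should be verified for your version.

```javascript
// Sketch: build launch options that load one unpacked extension.
// `extensionLaunchOptions` is a hypothetical helper; the two switches are Chrome flags.
function extensionLaunchOptions(extensionDir) {
  return {
    headless: false, // assumption: run headful, where extension loading is most predictable
    args: [
      `--disable-extensions-except=${extensionDir}`,
      `--load-extension=${extensionDir}`,
    ],
  };
}

// Usage (assumes the `puppeteer` package is installed; the path is a placeholder):
//   const puppeteer = require("puppeteer");
//   const browser = await puppeteer.launch(extensionLaunchOptions("/path/to/ad-block"));
```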

Internal Settings Pages

Necessity: Medium
Support Level: Direct Support
pptr API: Page.goto()

Internal settings pages are defined here as pages starting with the chrome:// protocol. For example, chrome://history/ is the browser's history page. Since these settings pages are all web-based, they can all be captured by CDP. There are many such pages, like chrome://downloads/, chrome://extensions/, etc.

In other words, as long as we know the settings page's URI, we can navigate directly there. This depends on overall requirements and can be supported as needed.

History Page	History Page Screenshot

Settings Menu

Necessity: Low
Support Level: Not Supported

CDP cannot perceive the browser's native settings menu. However, since the entry points of these second- and third-level menus are buried quite deep, deliberate changes there are rare. Moreover, they all lead to built-in web pages under the chrome:// protocol, so once we have the corresponding URIs we can navigate there directly.

list

Other App-Level Functions

Necessity: Low
Support Level: Not Supported

Besides the common functions above, there are some relatively obscure App-level functions that won't affect the main web browsing process. For example:

Login to Google account popup
Tab groups
Reading mode/Reading list
......

These features enhance the browsing experience but cause no problems if missing. In AI scenarios, simpler is always more robust, so there is no need to support these nice-to-have features.

Login Popup	Tab Groups


Reading Mode	Reading List

II. Keyboard Shortcuts

Necessity: High
Support Level: Indirect Support
pptr API: Keyboard class

Keyboard shortcuts are a relatively complex issue.

The major operating systems have evolved over decades, and although they share the same keyboard, each has made its own choices about shortcuts, so cross-OS shortcuts are inherently messy. In CDP scenarios, shortcuts on macOS additionally require separate adaptation. I'll expand on these two aspects below.

Cross-Platform Perspective

From a cross-platform perspective, Windows and Linux common shortcuts are relatively unified, but macOS is quite special.

First, for most common shortcuts (select all/copy/paste, etc.), macOS uses the Command key (i.e., the Meta key), while Windows/Linux use Ctrl (i.e., the Control key).

Therefore, when the LLM issues a shortcut action such as "select all", the engineering side needs a fallback: first determine the OS of the current runtime environment, then rewrite the action so that it executes correctly:

hotkey("ctrl+A") --> isMacOS? --- true ---> keyboard('Meta+KeyA')
                        └------- false --> keyboard('Control+KeyA')
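
The branch above could be sketched as follows. `normalizeHotkey`, `pressChord` and the alias table are hypothetical helpers, and `process.platform` describes the machine running Node, which is assumed here to be the same machine as the browser. The sketch only rewrites key names; whether the resulting chord actually takes effect in the browser is a separate question.

```javascript
// Sketch: translate an LLM-style hotkey such as "ctrl+A" into puppeteer key names
// for the current OS. All names below are hypothetical helpers.
const MODIFIER_ALIASES = new Map([
  ["ctrl", "Control"], ["control", "Control"],
  ["cmd", "Meta"], ["command", "Meta"], ["meta", "Meta"],
  ["alt", "Alt"], ["option", "Alt"],
  ["shift", "Shift"],
]);

function normalizeHotkey(hotkey, platform = process.platform) {
  const isMac = platform === "darwin";
  return hotkey.split("+").map((raw) => {
    const token = raw.trim();
    const modifier = MODIFIER_ALIASES.get(token.toLowerCase());
    // On macOS the primary modifier is Command (Meta) rather than Control.
    if (modifier) return modifier === "Control" && isMac ? "Meta" : modifier;
    if (/^[a-z]$/i.test(token)) return `Key${token.toUpperCase()}`;
    if (/^[0-9]$/.test(token)) return `Digit${token}`;
    return token; // already a key name such as "Enter" or "ArrowLeft"
  });
}

// Press keys in order, then release them in reverse order.
async function pressChord(page, keys) {
  for (const key of keys) await page.keyboard.down(key);
  for (const key of [...keys].reverse()) await page.keyboard.up(key);
}
```

For example, `normalizeHotkey("ctrl+A", "darwin")` yields `["Meta", "KeyA"]`, while the same input with `"linux"` yields `["Control", "KeyA"]`.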

Besides these common shortcuts, there are actually many situations that need adaptation. For example:

View browser history: macOS is Command+Y, Linux and Windows are Ctrl+H
Quit browser: macOS is Command+Q, Linux and Windows are Alt+F, then press X key
Navigate to previous page: macOS is Command+[, Linux and Windows are Alt+LeftArrow
......

For related shortcuts, you can refer to the official specifications:

Platform	Supported Shortcuts
macOS	https://support.apple.com/zh-cn/102650
Windows	https://support.microsoft.com/en-us/windows/keyboard-shortcuts-in-windows-dcc61a57-8ff0-cffe-9796-cb9706c75eec
Linux GNOME	https://help.gnome.org/users/gnome-help/stable/shell-keyboard-shortcuts.html.en
Chrome	https://support.google.com/chrome/answer/157179?hl=zh-Hans&co=GENIE.Platform%3DDesktop

All of this is quite tedious. Apart from having AI write the mappings, the best approach is to provide only basic adaptation and then fix issues as they arise. Otherwise, it's a bottomless pit.

CDP Perspective

Due to permission restrictions, keyboard instructions sent through CDP are not true system-level keyboard events. Put simply, their scope is limited to Chrome and the page. And once again, macOS is the one that misbehaves.

First, let's talk about the most basic "select all" shortcut. On macOS, if you directly send Meta+KeyA, you'll find that the select all operation doesn't execute at all.

await page.keyboard.down("Meta");
await page.keyboard.down("KeyA");
await page.keyboard.up("KeyA");
await page.keyboard.up("Meta"); // not working in macOS

The specific reason is quite complex. You can refer to #776 and #1313. The core reason is as follows:

The first bug here is that we don't send nativeKeyCodes, so no real OSX events get made. When sending the nativeKeyCodes, "a" is keyCode 0 and protocol decides not to send a falsey keyCode. After these are fixed, OSX doesn't like to perform keyboard shortcuts unless the application has the foreground. And lastly, if Chromium has the foreground, we send the nativeKeyCode, and protocol processes it, the shortcut gets captured by the address bar instead of the page.

https://github.com/puppeteer/puppeteer/issues/776#issuecomment-329589760

Note that this reply dates from 2017. More than eight years later, the problem still exists.

The good news is that for common shortcuts like "select all/copy/paste", CDP offers an alternative. CDP's Input.dispatchKeyEvent, which sends keyboard instructions, accepts an additional commands parameter whose editing commands trigger the corresponding operations. For example, to execute "select all", I can write:

await page.keyboard.down("KeyA", { commands: ["SelectAll"] });
await page.keyboard.up("KeyA"); // working in macOS

This way, these common editing shortcuts can be supported:

Operation	macOS	Windows/Linux	CDP commands
Copy	Command + C	Ctrl + C	Copy
Paste	Command + V	Ctrl + V	Paste
Cut	Command + X	Ctrl + X	Cut
Undo	Command + Z	Ctrl + Z	Undo
Redo	Shift + Command + Z	Ctrl + Y	Redo
Select All	Command + A	Ctrl + A	SelectAll
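
The table can be turned into a small lookup, sketched below. `EDITING_COMMANDS` and `runEditingCommand` are hypothetical names; the command strings are taken from the table above, and the key paired with each command is an assumption of this sketch (for Redo it simply reuses KeyZ).

```javascript
// Sketch: trigger an editing operation through the `commands` option of
// keyboard.down() instead of relying on OS-level modifier shortcuts.
const EDITING_COMMANDS = new Map([
  ["copy", { key: "KeyC", command: "Copy" }],
  ["paste", { key: "KeyV", command: "Paste" }],
  ["cut", { key: "KeyX", command: "Cut" }],
  ["undo", { key: "KeyZ", command: "Undo" }],
  ["redo", { key: "KeyZ", command: "Redo" }], // key choice is an assumption
  ["selectAll", { key: "KeyA", command: "SelectAll" }],
]);

async function runEditingCommand(page, operation) {
  const entry = EDITING_COMMANDS.get(operation);
  if (!entry) throw new RangeError(`unsupported editing operation: ${operation}`);
  // The key press carries the editing command, so no modifier key is needed.
  await page.keyboard.down(entry.key, { commands: [entry.command] });
  await page.keyboard.up(entry.key);
}
```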

For some other shortcuts with higher permissions, we can also do function mapping:

View browser history: Page.goto('chrome://history/')
Quit browser: Browser.close()
Navigate to previous page: Page.goBack()
......

Of course, these should also be added as needed. Full adaptation is not very meaningful.
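
A minimal sketch of such a function mapping is shown below, covering only the three examples listed. `PRIVILEGED_ACTIONS`, `runPrivilegedAction` and the action names are hypothetical; returning `false` for unmapped actions is a design choice of this sketch so the caller can fall back to a plain key press.

```javascript
// Sketch: replace privileged shortcuts with equivalent puppeteer calls.
// The map and function names are hypothetical helpers.
const PRIVILEGED_ACTIONS = new Map([
  ["viewHistory", ({ page }) => page.goto("chrome://history/")],
  ["quitBrowser", ({ browser }) => browser.close()],
  ["previousPage", ({ page }) => page.goBack()],
]);

async function runPrivilegedAction(name, context) {
  const handler = PRIVILEGED_ACTIONS.get(name);
  if (!handler) return false; // unmapped: the caller decides how to fall back
  await handler(context);
  return true;
}
```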

III. Web Page Functions

Mainly refers to operations on the web page itself, affecting mainly the current web page, including screenshots, file uploads/downloads, etc.

Screenshots

Necessity: High
Support Level: Direct Support

CDP's Page.captureScreenshot API can directly screenshot the web page content itself. However, it's important to note that CDP screenshots can only capture the web page itself (content within the green box). The external Chrome UI cannot be screenshotted.

This needs special attention. Some VLMs were trained on complete Chrome screenshots (content within the red box). If their generalization capability is only average, passing CDP screenshots directly to the VLM may cause the predicted action coordinates to be misaligned.

screenshot
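
If a model does predict coordinates against a full-window image, one possible mitigation is to shift them back into viewport space before clicking. The sketch below is an assumption-heavy illustration: `toViewportPoint`, `estimateUiOffset` and the `uiOffset` shape are hypothetical, the offset estimate is a heuristic that assumes all browser UI sits above the page, and screenshot pixels are assumed to equal CSS pixels (device pixel ratio of 1).

```javascript
// Sketch: convert a point predicted on a full-window screenshot (with Chrome UI)
// into viewport coordinates. Returns null when the point lies on the browser UI.
function toViewportPoint(point, uiOffset, viewport) {
  const x = point.x - uiOffset.left;
  const y = point.y - uiOffset.top;
  const inside = x >= 0 && y >= 0 && x < viewport.width && y < viewport.height;
  return inside ? { x, y } : null;
}

// Heuristic estimate of the UI height, evaluated inside the page.
async function estimateUiOffset(page) {
  return page.evaluate(() => ({
    left: 0,
    top: window.outerHeight - window.innerHeight,
  }));
}
```

With an offset of `{ left: 0, top: 85 }` and a 1280x720 viewport, a predicted point `{ x: 100, y: 200 }` maps to `{ x: 100, y: 115 }`.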

Basic Interactions

Necessity: High
Support Level: Indirect Support
pptr API: Keyboard class, Mouse class

Basic interactions here mainly refer to behaviors like click, drag, and keyboard input. Puppeteer provides atomic methods for these that can be combined directly. Moreover, pptr also provides various DOM callbacks for executing related actions, which is very flexible. I recommend checking pptr's documentation directly (pptr: Page Interactions), as it describes them quite clearly.

Dialog Popups

Necessity: High
Support Level: Indirect Support
pptr API: Dialog class

CDP can directly perceive popup-related events (for example Page.javascriptDialogOpening), so triggering of the following 4 types of popups can all be perceived:

Alert	Confirm


Prompt	Beforeunload

Because a dialog is a very high-priority browser behavior, once invoked it interrupts essentially all web page activity, and the JS engine is suspended and stops responding. Dialogs must therefore be answered and closed before any subsequent step can run. Overall, this is a very high-priority feature.
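
A minimal sketch of "always answer the dialog" using pptr's Dialog class might look like this. `installDialogPolicy` and the default policy (accept alerts and beforeunload, dismiss confirms, accept prompts with their default text) are assumptions for illustration; a real agent would probably let the model decide.

```javascript
// Sketch: make sure every JavaScript dialog is answered so the page never stays blocked.
// `installDialogPolicy` is a hypothetical helper; the Dialog methods are puppeteer's.
function installDialogPolicy(page, onDialog = () => {}) {
  page.on("dialog", async (dialog) => {
    const type = dialog.type(); // "alert" | "confirm" | "prompt" | "beforeunload"
    onDialog({ type, message: dialog.message() });
    switch (type) {
      case "prompt":
        await dialog.accept(dialog.defaultValue()); // keep the page-provided default text
        break;
      case "confirm":
        await dialog.dismiss(); // conservative default: answer "Cancel"
        break;
      default:
        await dialog.accept(); // alert and beforeunload
    }
  });
}
```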

Right-Click Menu

Necessity: Low
Support Level: Indirect Support

CDP cannot perceive the native system menu triggered by right-clicking within the page:

Page	Image	Link	Tab

However, the functions in the system menu can basically be achieved through other means. For example, "back/forward/reload" can all be replaced with the corresponding Navigate methods. Based on current user requirements, though, the necessity of these functions is not very high.

For custom right-click menus within web pages, since they are basically drawn with DOM, they can be perceived through in-page screenshots. For example, the DOM menu shown when right-clicking the bilibili player can be captured by CDP screenshots.

Input Selectors

Necessity: High
Support Level: Indirect Support

Most HTML form functions are supported, but some input selectors use system controls (for example, HTML's default select and date pickers), so CDP screenshots cannot capture them.

Current testing shows that these pickers cannot be captured by CDP screenshots:

Select	Date	Time	Color

However, there are workarounds. We can inject JS code that replaces the system controls with DOM controls, which indirectly satisfies the screenshot requirement. For example, for the select option picker, the effect after replacement is as follows:

select-dom
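
The replacement idea can be sketched as a function injected into the page. This is not the implementation behind the image above, only a minimal illustration: `replaceNativeSelects` is a hypothetical name, the list is unstyled, and the sketch does not refresh `aria-selected` after a click.

```javascript
// Sketch: render every native <select> as an ordinary DOM list so that its options
// become visible to page screenshots. Intended to run inside the page.
function replaceNativeSelects(root = document) {
  let replaced = 0;
  for (const select of root.querySelectorAll("select")) {
    const list = root.createElement("ul");
    list.setAttribute("role", "listbox");
    Array.from(select.options).forEach((option, index) => {
      const item = root.createElement("li");
      item.textContent = option.text;
      item.setAttribute("role", "option");
      item.setAttribute("aria-selected", String(option.selected));
      if (!option.disabled) {
        item.addEventListener("click", () => {
          select.selectedIndex = index;
          // Notify page scripts, as a real user selection would.
          select.dispatchEvent(new Event("change", { bubbles: true }));
        });
      }
      list.appendChild(item);
    });
    select.style.display = "none"; // keep the original so form submission still works
    select.insertAdjacentElement("afterend", list);
    replaced += 1;
  }
  return replaced;
}

// Usage from puppeteer: const count = await page.evaluate(replaceNativeSelects);
```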

File Upload/Download

Necessity: High
Support Level: Direct Support
pptr API: ElementHandle.uploadFile(), FileChooser class, DownloadBehavior

pptr's file handling logic is quite complete. Combining the uploadFile API with FileChooser provides good file upload support. For file downloads, however, it does not offer equally good APIs: what users can do is specify the download policy and download path through DownloadBehavior.
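
Both halves can be sketched as follows. `uploadViaChooser` and `allowDownloads` are hypothetical helper names; the download part sends CDP's Browser.setDownloadBehavior through a page-level CDP session, which is one common way to set the policy and path and is assumed here rather than taken from the original setup.

```javascript
// Sketch: upload through the native file chooser and configure downloads via CDP.
async function uploadViaChooser(page, triggerSelector, filePaths) {
  // Start waiting before the click, otherwise the chooser event can be missed.
  const [chooser] = await Promise.all([
    page.waitForFileChooser(),
    page.click(triggerSelector),
  ]);
  await chooser.accept(filePaths);
}

async function allowDownloads(page, downloadPath) {
  const client = await page.createCDPSession();
  await client.send("Browser.setDownloadBehavior", {
    behavior: "allow", // download policy
    downloadPath, // directory where files are saved
    eventsEnabled: true, // emit Browser.downloadWillBegin / Browser.downloadProgress
  });
  return client; // callers can listen for the download events on this session
}
```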

Print

Necessity: Medium
Support Level: Direct Support
pptr API: Page.pdf()

This is also a ready-made API that can be called directly.
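
For reference, a call might look like the sketch below. `printToPdf` is a hypothetical wrapper, and the paper format and background setting are illustrative choices only.

```javascript
// Sketch: "print" the current page to a PDF file with Page.pdf().
async function printToPdf(page, outputPath) {
  return page.pdf({
    path: outputPath, // where the PDF is written
    format: "A4", // illustrative paper size
    printBackground: true, // include CSS backgrounds, as a print dialog option would
  });
}
```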

Other Page-Level Functions

Necessity: Low
Support Level: Not Supported

Besides the high-frequency functions above, browsers also have some minor nice-to-have functions. From my own experience and from user requirements, these are basically unused in Browser-Use scenarios, and they are basically not supported by CDP either. For the completeness of this article, I'll still list them:

Bookmarks
Translation
Search
QR code
......

Bookmarks	Translation


Search	QR Code

Conclusion

In summary, CDP has obvious capability boundaries, but it is already sufficient to support about 95% of business functions. Most importantly, polishing these functions well can maximize the AI's capabilities.
