🦾 [Browser Automation] Possibilities and Limitations: Browser Control Boundaries from a CDP Perspective

If we consider all human operations on a computer as the universal set, then the operational scope of Computer, Browser, CDP, and puppeteer can be categorized as follows:
- Computer: The universal set of all operations
- Browser: App-level permissions. Browser restricts many capabilities for security, such as direct access to local files
- CDP: Focused on debugging capabilities. Non-debugging information of the browser (such as bookmarked web pages) cannot be accessed
- puppeteer: Built on top of CDP, but doesn't utilize all CDP APIs, so its capabilities are a subset of CDP
In Browser-Use scenarios, CDP has clear capability boundaries, unlike VNC, which is a more general screen-casting solution. Understanding its strengths and limitations therefore matters for the overall architecture design and its future evolution.
The sections below take a browser perspective and list what CDP (through puppeteer, abbreviated as pptr) can accomplish. Support difficulty is classified as follows:
- Direct Support: pptr has ready-made APIs available
- Indirect Support: Requires combining multiple pptr APIs/CDP APIs to implement related functionality
- Cannot Support: Things CDP is completely incapable of doing
I. Browser Functions
This mainly refers to global browser-level functions, including tabs management, page navigation, etc.
Tabs
- Necessity: High
- Support Level: Indirect Support
- pptr API: Browser.newPage(), Browser.pages(), Page.bringToFront(), Page.close()
Tabs are a very important feature. CDP's tab-related APIs are quite weak, and puppeteer provides no dedicated abstraction or encapsulation for them. Implementing the 4 basic tab functions of "create/update/switch/close" therefore requires combining multiple APIs.

There's also an issue: if tabs are dragged to change their display order, CDP cannot perceive the change. For example, the tabs in the image above have been dragged, but CDP still reports the order as "History, Baidu, Today's Headlines". However, this issue can be ignored.
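The tab operations described above can be sketched by combining the listed puppeteer calls. This is a minimal sketch, not an abstraction that puppeteer itself ships: the `TabManager` class and its method names are hypothetical, `browser` is assumed to be a puppeteer `Browser` instance, and tabs are addressed by the order CDP reports, which may differ from the visual order.

```javascript
// Hypothetical helper: a thin tab abstraction over puppeteer's Browser/Page APIs.
// `browser` is assumed to be a puppeteer Browser (or any object with the same methods).
class TabManager {
  constructor(browser) {
    this.browser = browser;
  }

  // List tabs in the order CDP reports them (not necessarily the visual order).
  async list() {
    const pages = await this.browser.pages();
    return pages.map((page, index) => ({ index, url: page.url() }));
  }

  // Create: open a new tab and optionally navigate it.
  async create(url) {
    const page = await this.browser.newPage();
    if (url) await page.goto(url);
    return page;
  }

  // Switch: bring the tab at `index` to the foreground.
  async switchTo(index) {
    const page = await this.pageAt(index);
    await page.bringToFront();
    return page;
  }

  // Close: close the tab at `index`.
  async close(index) {
    const page = await this.pageAt(index);
    await page.close();
  }

  // Internal lookup with a bounds check.
  async pageAt(index) {
    const pages = await this.browser.pages();
    if (index < 0 || index >= pages.length) {
      throw new RangeError(`No tab at index ${index}`);
    }
    return pages[index];
  }
}
```

"Update" is left out here because it reduces to navigating the selected tab with `Page.goto()`.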
Navigate
- Necessity: High
- Support Level: Direct Support
- pptr API: Page.goBack(), Page.goForward(), Page.reload(), Page.goto()
Navigate mainly refers to navigation within a Tab, including the 4 most basic functions: back/forward/refresh/goto.
Puppeteer provides good encapsulation on top of CDP's Page.getNavigationHistory and Page.navigateToHistoryEntry APIs, so these functions can basically be used directly.
Note, however, that the address bar's suggestion records cannot be obtained through CDP.

Extensions (Browser Plugins)
- Necessity: Medium
- Support Level: Direct Support
- pptr API: Chrome Extensions
Extensions can be loaded directly when the browser is launched through pptr APIs. Extension capability is quite important: for example, an ad blocker can remove ad DOM nodes, making web pages cleaner and making it easier for models to locate elements.
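One common way to load an unpacked extension at launch is to pass Chrome command-line switches through puppeteer's `args` launch option. The sketch below only builds the options object. The helper name `buildExtensionLaunchOptions` and the extension path are hypothetical, and newer Chrome or puppeteer versions may require a different mechanism, so treat the flags as an assumption to verify against the version in use.

```javascript
// Hypothetical helper: build puppeteer launch options that load unpacked extensions.
// The two flags are Chrome command-line switches; the extension paths are placeholders.
function buildExtensionLaunchOptions(extensionPaths) {
  const joined = extensionPaths.join(",");
  return {
    // Assumption: extensions have historically needed a non-headless (or new headless) browser.
    headless: false,
    args: [
      `--disable-extensions-except=${joined}`,
      `--load-extension=${joined}`,
    ],
  };
}

// Usage (assumes puppeteer is installed and the path points to an unpacked extension):
// const puppeteer = require("puppeteer");
// const browser = await puppeteer.launch(buildExtensionLaunchOptions(["/path/to/adblock"]));
```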
Internal Settings Pages
- Necessity: Medium
- Support Level: Direct Support
- pptr API: Page.goto()
Internal settings pages are defined here as pages starting with the chrome:// protocol. For example, chrome://history/ is the browser's history page. Since these settings pages are all web-based, they can all be captured by CDP. There are many such pages, like chrome://downloads/, chrome://extensions/, etc.
In other words, as long as we know the settings page's URI, we can navigate directly there. This depends on overall requirements and can be supported as needed.
[Figures: History Page; History Page Screenshot]
Settings Menu
- Necessity: Low
- Support Level: Not Supported
CDP cannot perceive the settings menu. However, since the entry points of its secondary and tertiary menus are quite deep, users rarely change them. Moreover, the menu items all navigate to built-in web pages under the chrome:// protocol, so once the corresponding URIs are known, we can navigate to them directly.

Other App-Level Functions
- Necessity: Low
- Support Level: Not Supported
Besides the common functions above, there are some relatively obscure App-level functions that won't affect the main web browsing process. For example:
- Login to Google account popup
- Tab groups
- Reading mode/Reading list
- ......
These features can enhance the browsing experience, but nothing breaks if they are missing. In AI scenarios, simpler is always more robust, so there's no need to support these nice-to-have features.
[Figures: Login Popup; Tab Groups; Reading Mode; Reading List]
II. Keyboard Shortcuts
- Necessity: High
- Support Level: Indirect Support
- pptr API: Keyboard class
Keyboard shortcuts are a relatively complex issue.
Major operating systems have evolved for decades, and although they share the same keyboard, each has its own conventions for shortcuts, so cross-platform shortcuts are inherently complex. In CDP scenarios, shortcuts on macOS additionally require separate adaptation. Both directions are expanded on below.
Cross-Platform Perspective
From a cross-platform perspective, Windows and Linux common shortcuts are relatively unified, but macOS is quite special.
First, for most common shortcuts (select all/copy/paste, etc.), macOS uses the Command key (i.e., the Meta key), while Windows/Linux use Ctrl (i.e., the Control key).
Therefore, when the LLM issues a shortcut action such as "select all", the engineering side needs to adapt it: first determine the OS of the current runtime environment, then rewrite the action command so that it executes correctly:
hotkey("ctrl+A") --> isMacOS? --- true ---> keyboard('Meta+KeyA')
                        └------- false --> keyboard('Control+KeyA')
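The decision above can be sketched as a small translation function. This is an illustrative sketch only: `normalizeHotkey` and the alias table are hypothetical names, the "ctrl becomes Meta on macOS" rule is deliberately naive and only fits primary-modifier shortcuts, and the platform check assumes the automation process runs on the same OS as the browser.

```javascript
// Hypothetical aliases from model-issued modifier names to puppeteer key names.
const MODIFIER_ALIASES = new Map([
  ["cmd", "Meta"],
  ["command", "Meta"],
  ["meta", "Meta"],
  ["alt", "Alt"],
  ["shift", "Shift"],
]);

// Convert a hotkey such as "ctrl+A" into puppeteer key names for the given platform.
// `platform` follows Node's process.platform values ("darwin" means macOS).
function normalizeHotkey(hotkey, platform = process.platform) {
  const isMac = platform === "darwin";
  return hotkey.split("+").map((part) => {
    const token = part.trim();
    const lower = token.toLowerCase();
    // Naive rule: the primary modifier is Command on macOS and Ctrl elsewhere.
    if (lower === "ctrl" || lower === "control") return isMac ? "Meta" : "Control";
    if (MODIFIER_ALIASES.has(lower)) return MODIFIER_ALIASES.get(lower);
    if (/^[a-z]$/i.test(token)) return `Key${token.toUpperCase()}`;
    if (/^[0-9]$/.test(token)) return `Digit${token}`;
    return token; // assume it is already a valid puppeteer key name
  });
}

// normalizeHotkey("ctrl+A", "darwin") returns ["Meta", "KeyA"]
// normalizeHotkey("ctrl+A", "linux")  returns ["Control", "KeyA"]
```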
Besides these common shortcuts, there are actually many situations that need adaptation. For example:
- View browser history: macOS is Command+Y, Linux and Windows are Ctrl+H
- Quit browser: macOS is Command+Q, Linux and Windows are Alt+F, then press the X key
- Navigate to previous page: macOS is Command+[, Linux and Windows are Alt+LeftArrow
- ......
For related shortcuts, you can refer to the official specifications.
Adapting all of these is quite tedious. Besides leaving them to AI to write, the best approach is to provide only basic adaptation and then fix issues as they arise. Otherwise, it's a bottomless pit.
CDP Perspective
Due to permission restrictions, keyboard instructions sent through CDP are not true system-level keyboard events. Put simply, their scope is limited to Chrome and the page. On top of that, macOS misbehaves here as well.
First, let's talk about the most basic "select all" shortcut. On macOS, if you directly send Meta+KeyA, you'll find that the select all operation doesn't execute at all.
await page.keyboard.down("Meta");
await page.keyboard.down("KeyA");
await page.keyboard.up("KeyA");
await page.keyboard.up("Meta"); // not working in macOS
The specific reason is quite complex. You can refer to #776 and #1313. The core reason is as follows:
The first bug here is that we don't send nativeKeyCodes, so no real OSX events get made. When sending the nativeKeyCodes, "a" is keyCode 0 and protocol decides not to send a falsey keyCode. After these are fixed, OSX doesn't like to perform keyboard shortcuts unless the application has the foreground. And lastly, if Chromium has the foreground, we send the nativeKeyCode, and protocol processes it, the shortcut gets captured by the address bar instead of the page.
https://github.com/puppeteer/puppeteer/issues/776#issuecomment-329589760
You can see the response was in 2017. Nearly 10 years later, this problem still exists.
However, the good news is that for common shortcuts like "select all/copy/paste", CDP has some alternative solutions. CDP's Input.dispatchKeyEvent for sending keyboard instructions has an additional commands parameter with some editing commands that can trigger related operations. For example, if I want to execute a "select all" operation, I can write it like this:
await page.keyboard.down("KeyA", { commands: ["SelectAll"] });
await page.keyboard.up("KeyA"); // working in macOS
This way, these common editing shortcuts can be supported:
| Operation | macOS | Windows/Linux | CDP commands |
|---|---|---|---|
| Copy | Command + C | Ctrl + C | Copy |
| Paste | Command + V | Ctrl + V | Paste |
| Cut | Command + X | Ctrl + X | Cut |
| Undo | Command + Z | Ctrl + Z | Undo |
| Redo | Shift + Command + Z | Ctrl + Y | Redo |
| Select All | Command + A | Ctrl + A | SelectAll |
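The table can be turned into a small lookup that follows the same `down`/`up` pattern as the "select all" example. This is a sketch under assumptions: `EDIT_COMMANDS` and `runEditCommand` are hypothetical names, `page` is assumed to be a puppeteer `Page`, and the carrier key chosen for each command simply mirrors the KeyA/SelectAll pairing rather than anything CDP requires.

```javascript
// Editing operations from the table, each mapped to a carrier key and a CDP editing command.
// The carrier keys are an assumption that mirrors the KeyA + "SelectAll" example.
const EDIT_COMMANDS = {
  copy: { key: "KeyC", command: "Copy" },
  paste: { key: "KeyV", command: "Paste" },
  cut: { key: "KeyX", command: "Cut" },
  undo: { key: "KeyZ", command: "Undo" },
  redo: { key: "KeyY", command: "Redo" },
  selectAll: { key: "KeyA", command: "SelectAll" },
};

// Trigger an editing operation through the `commands` option instead of a modifier combo.
async function runEditCommand(page, operation) {
  const entry = Object.hasOwn(EDIT_COMMANDS, operation) ? EDIT_COMMANDS[operation] : null;
  if (!entry) throw new Error(`Unsupported edit operation: ${operation}`);
  await page.keyboard.down(entry.key, { commands: [entry.command] });
  await page.keyboard.up(entry.key);
}

// Usage: await runEditCommand(page, "selectAll");
```

Keeping the mapping in data means the same call works on every OS, which avoids the per-platform modifier handling for this group of shortcuts.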
For some other shortcuts with higher permissions, we can also do function mapping:
- View browser history: Page.goto('chrome://history/')
- Quit browser: Browser.close()
- Navigate to previous page: Page.goBack()
- ......
Of course, these should also be added as needed. Full adaptation is not very meaningful.
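This kind of function mapping can be sketched as a dispatch table that is extended only when a new need appears. The names `PRIVILEGED_ACTIONS` and `runPrivilegedAction` and the intent keys are hypothetical; `page` and `browser` are assumed to be puppeteer `Page` and `Browser` instances.

```javascript
// Hypothetical dispatch table: shortcut intents mapped to equivalent puppeteer calls.
const PRIVILEGED_ACTIONS = {
  viewHistory: ({ page }) => page.goto("chrome://history/"),
  quitBrowser: ({ browser }) => browser.close(),
  goBack: ({ page }) => page.goBack(),
};

// Run a mapped action; return false when no mapping exists so the caller can fall back.
async function runPrivilegedAction(name, context) {
  if (!Object.hasOwn(PRIVILEGED_ACTIONS, name)) return false;
  await PRIVILEGED_ACTIONS[name](context);
  return true;
}

// Usage: await runPrivilegedAction("viewHistory", { page, browser });
```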
III. Web Page Functions
Mainly refers to operations on the web page itself, affecting mainly the current web page, including screenshots, file uploads/downloads, etc.
Screenshots
- Necessity: High
- Support Level: Direct Support
CDP's Page.captureScreenshot API can capture the web page content directly. Note, however, that a CDP screenshot only covers the web page itself (the content within the green box); the surrounding Chrome UI cannot be captured.
This needs special attention. Some VLMs were trained on complete Chrome screenshots (the content within the red box). If such a model generalizes poorly, passing it CDP screenshots directly might cause misaligned action coordinates.
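If a model does return coordinates in full-window space, one possible mitigation is to subtract the size of the browser UI before dispatching the action to the page. The sketch below is only an illustration of that idea, not a method described by the source: `windowToViewport` is a hypothetical helper, and the UI offsets are placeholders that must be measured for the actual browser window.

```javascript
// Convert a point predicted on a full-window screenshot into page-viewport coordinates.
// `chromeUi` is the size of the browser UI around the page; measure it, do not assume it.
function windowToViewport(point, chromeUi) {
  const x = point.x - chromeUi.left;
  const y = point.y - chromeUi.top;
  // A negative result means the point lies on the browser UI, which CDP input cannot reach.
  if (x < 0 || y < 0) return null;
  return { x, y };
}

// Example with a placeholder toolbar height of 87px:
// windowToViewport({ x: 100, y: 200 }, { left: 0, top: 87 }) returns { x: 100, y: 113 }
```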

Basic Interactions
- Necessity: High
- Support Level: Indirect Support
- pptr API: Keyboard class, Mouse class
Basic operations here mainly refer to behaviors like click, drag, keyboard, etc. Puppeteer has created atomic methods for these that can be directly combined and used. Moreover, pptr also provides various DOM callbacks to execute related action operations, which is very flexible. Here, I recommend directly checking pptr's documentation: pptr: Page Interactions, as it's described quite clearly.
Dialog Popups
- Necessity: High
- Support Level: Indirect Support
- pptr API: Dialog class
CDP can directly perceive popup-related events (for example Page.javascriptDialogOpening), so triggering of the following 4 types of popups can all be perceived:
[Figures: Alert; Confirm; Prompt; Beforeunload]
Because a popup is a browser behavior with very high priority, once invoked, it basically interrupts all web page behaviors, and the JS engine also suspends and stops responding. Therefore, popups must be responded to and closed to execute subsequent processes. Overall, this is a very high-priority feature.
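A minimal sketch of responding to every dialog so the page is never left blocked, using puppeteer's `dialog` event and `Dialog` class. The helper name `installDialogHandler`, the "record and accept" policy, and the `promptText` option are assumptions for illustration; a real agent might instead surface the dialog to the model and let it decide.

```javascript
// Hypothetical helper: record every dialog and accept it so the page can continue.
// `page` is assumed to be a puppeteer Page.
function installDialogHandler(page, { promptText = "" } = {}) {
  const log = [];
  page.on("dialog", async (dialog) => {
    const type = dialog.type(); // "alert" | "confirm" | "prompt" | "beforeunload"
    log.push({ type, message: dialog.message() });
    if (type === "prompt") {
      await dialog.accept(promptText); // prompts take the text to submit
    } else {
      await dialog.accept();
    }
  });
  return log; // callers can inspect which dialogs appeared
}

// Usage: const dialogs = installDialogHandler(page, { promptText: "ok" });
```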
Right-Click Menu
- Necessity: Low
- Support Level: Indirect Support
CDP cannot perceive the system menu itself triggered by clicking the "right mouse button" within the page:
[Figures: right-click menus on a Page, an Image, a Link, and a Tab]
However, the functions within the system popup can basically be achieved through other means. For example, functions like "back/forward/reload" can all be substituted with some Navigate methods. But based on current user requirements, the necessity of these functions is not very high.
For custom right-click menus within web pages, since they are basically DOM-drawn, they can actually be perceived through in-page screenshots. For example, the DOM menu of bilibili player's right-click in the image below can be captured by CDP screenshots:

Input Selectors
- Necessity: High
- Support Level: Indirect Support
Most HTML form functions are supported, but some Input selectors use system controls (for example, HTML's default Select selector, date selector), causing CDP screenshots to be unable to perceive them.
Current testing shows these selectors cannot be captured by CDP Screenshot:
[Figures: Select; Date; Time; Color]
However, there are workarounds. We can try using JS code injection to replace existing system controls with DOM controls, indirectly achieving the screenshot requirement. For example, for the Select Option Picker, the effect after replacement is as follows:

File Upload/Download
- Necessity: High
- Support Level: Direct Support
- pptr API: ElementHandle.uploadFile(), FileChooser class, DownloadBehavior
pptr's file handling logic is quite complete. Combining the uploadFile API with FileChooser provides good file upload support. For file downloads, however, it offers no equally good API: users can only specify the download policy and the download path through DownloadBehavior.
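As an alternative to puppeteer's own option, the download policy and path can also be set through a raw CDP session with the `Browser.setDownloadBehavior` command. This is a hedged sketch rather than the approach the text above refers to: `allowDownloads` is a hypothetical helper, `browser` is assumed to be a puppeteer `Browser`, and the download directory is a placeholder.

```javascript
// Hypothetical helper: route downloads to a known directory through a raw CDP session.
// `browser` is assumed to be a puppeteer Browser; `downloadPath` should be an absolute path.
async function allowDownloads(browser, downloadPath) {
  const client = await browser.target().createCDPSession();
  await client.send("Browser.setDownloadBehavior", {
    behavior: "allow",
    downloadPath,
    eventsEnabled: true, // ask CDP to emit download progress events
  });
  return client; // keep the session to listen for download events if needed
}

// Usage: const client = await allowDownloads(browser, "/tmp/agent-downloads");
```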
Print
- Necessity: Medium
- Support Level: Direct Support
- pptr API: Page.pdf()
This is also a ready-made API that can be called directly.
Other Page-Level Functions
- Necessity: Low
- Support Level: Not Supported
Besides the high-frequency functions mentioned above, browsers also have some minor nice-to-have functions. Judging from my own experience and from user requirements, these are basically unused in Browser-Use scenarios, and CDP basically does not support them. For completeness, they are listed here:
- Bookmarks
- Translation
- Search
- QR code
- ......
[Figures: Bookmarks; Translation; Search; QR Code]
Conclusion
In summary, CDP has obvious capability boundaries, but it is already sufficient to support roughly 95% of business functions. Most importantly, polishing these functions well can maximize AI's capabilities.