Feature/impersonate 6.0 #163

Merged
merged 19 commits into from
Dec 31, 2023
Changes from 5 commits
3 changes: 2 additions & 1 deletion .github/workflows/test.yaml
@@ -12,7 +12,8 @@ jobs:
runs-on: ${{ matrix.os }}
strategy:
matrix:
-os: [ubuntu-22.04, macos-11, windows-2019]
+# os: [ubuntu-22.04, macos-11, windows-2019]
+os: [ubuntu-22.04, macos-11]
steps:
- uses: actions/checkout@v3
- uses: actions/setup-python@v4
6 changes: 3 additions & 3 deletions Makefile
@@ -1,7 +1,7 @@
.ONESHELL:
SHELL := bash
-VERSION := 0.5.4
-CURL_VERSION := curl-7.84.0
+VERSION := 0.6.0
+CURL_VERSION := curl-8.1.1

.preprocessed: curl_cffi/include/curl/curl.h curl_cffi/cacert.pem .so_downloaded
touch .preprocessed
@@ -32,7 +32,7 @@ curl_cffi/cacert.pem:
curl https://curl.se/ca/cacert.pem -o curl_cffi/cacert.pem

.so_downloaded:
-python preprocess/download_so.py
+python preprocess/download_so.py $(VERSION)
touch .so_downloaded

preprocess: .preprocessed
28 changes: 26 additions & 2 deletions README-zh.md
@@ -15,6 +15,14 @@ TLS or JA3 fingerprints. If you are mysteriously blocked by some website, you can
- Supports `asyncio`, and the proxy can be changed for each request.
- Supports HTTP/2, which requests does not.

|Library|requests|aiohttp|httpx|pycurl|curl_cffi|
|---|---|---|---|---|---|
|http2|❌|❌|✅|✅|✅|
|sync|✅|❌|✅|✅|✅|
|async|❌|✅|✅|❌|✅|
|fingerprint|❌|❌|❌|❌|✅|
|speed|🐇|🐇🐇|🐇|🐇🐇|🐇🐇|

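## Install

    pip install curl_cffi --upgrade
@@ -23,8 +31,14 @@ TLS or JA3 fingerprints. If you are mysteriously blocked by some website, you can
On other, less common platforms, you may need to compile and install `curl-impersonate` first and set environment variables such as `LD_LIBRARY_PATH`.

To install beta releases:

    pip install curl_cffi --pre

## Usage

Impersonate a recent browser version whenever possible; do not copy `chrome110` directly from the examples below.

### requests-like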

```python
@@ -59,14 +73,21 @@ print(r.json())
# {'cookies': {'foo': 'bar'}}
```

Supported impersonation targets, matching [curl-impersonate](https://github.com/lwthiker/curl-impersonate):
Supported impersonation targets, matching my [fork](https://github.com/yifeikong/curl-impersonate) of [curl-impersonate](https://github.com/lwthiker/curl-impersonate):

However, only Chrome-like browsers are supported.

- chrome99
- chrome100
- chrome101
- chrome104
- chrome107
- chrome110
- chrome116
- chrome117

Currently curl-impersonate only supports up to version 116. Aren't we worried about already offering support up to 120 when it doesn't handle this yet?

Ref. https://github.com/lwthiker/curl-impersonate?tab=readme-ov-file#supported-browsers

On that note, would it not make more sense to offer support for Firefox?

Collaborator Author

According to the usage table here, most users are on the latest versions of Chrome and Safari. Under a strict blocking strategy, it is reasonable for a site to simply block any older browser version.

116 is much newer than 110, but it does not make things significantly better, not least because their fingerprints are actually the same. The interesting part is 117, when ECH was added.

Actually, I have been working on this in my fork of curl-impersonate. Hopefully I can get it landed before Chrome 120 becomes mainstream. I've just been too busy with other stuff recently.

Collaborator Author

As for Firefox, it's really challenging to pack an additional .so file into a Python wheel. There are two options to work around this:

  1. Release another package, e.g. curl_cffi_ff, as suggested by one of our users.
  2. Try to use BoringSSL (Chrome) to emulate NSS (Firefox).

At least one of them should work; I just haven't had time to try them out. Maybe I can experiment with them during the Chinese New Year.

I do agree that it would be nice to have Chrome 117+ available sooner, as this will help a lot with the more challenging sites. (I already saw you are well on the way through the different versions on your fork.)

As for Firefox, probably the easier of the two options you mentioned would be to simply release a new package for the Firefox build of curl, but this would require maintaining both packages simultaneously, which seems like a lot more effort on your part.

I would love to get more involved in this project, although I am extremely new to it. If there are any smaller issues for me to explore and help out on, let me know and I will try to tackle them in my free time.

Contributor

I would be open to trying out the multiple package option. The opencv_python project builds 4 different packages (each with slightly different configurations) out of the same base repo, so I think it should be possible to minimize the maintenance overhead.

Contributor

A related option could be to factor out the ffi binding portion into its own standalone package, build Chrome and Firefox versions of that, and have curl_cffi import the bindings packages. This way, the requests/async interfaces that curl_cffi provides don't need to be duplicated.

Contributor

IMO, emulating NSS on BoringSSL could take priority over maintaining multiple packages here.
This may require some patches to BoringSSL, but I think it's worth investigating.

Collaborator Author

Try whatever you like. I'm open to merging both, since they don't actually conflict.
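
The idea raised in this thread of keeping one front-end package and swapping the bindings underneath could be sketched as a simple import fallback. This is only an illustration of the pattern: the module names and the `load_backend` helper are hypothetical and exist in neither curl_cffi nor curl-impersonate.

```python
import importlib


def load_backend(candidates):
    """Import and return the first module in `candidates` that is installed."""
    for name in candidates:
        try:
            return importlib.import_module(name)
        except ImportError:
            continue  # not installed; try the next candidate
    raise ImportError(f"none of {list(candidates)} is installed")


# The first two names are hypothetical bindings packages used only for
# illustration; "json" is a standard-library stand-in so the sketch runs.
backend = load_backend(
    ["_hypothetical_firefox_bindings", "_hypothetical_chrome_bindings", "json"]
)
print(backend.__name__)
```

With this shape, the requests-like and async layers would only ever talk to `backend`, which is the point made above about not duplicating those interfaces.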

- chrome118
- chrome119
- chrome120
- chrome99_android
- edge99
- edge101
@@ -125,7 +146,10 @@ print(body.decode())

See the [English docs](https://curl-cffi.readthedocs.io) for more details.

If you are using scrapy, check out this middleware: [tieyongjie/scrapy-fingerprint](https://github.com/tieyongjie/scrapy-fingerprint)
If you are using scrapy, check out these middlewares:

- [tieyongjie/scrapy-fingerprint](https://github.com/tieyongjie/scrapy-fingerprint)
- [jxlil/scrapy-impersonate](https://github.com/jxlil/scrapy-impersonate)

For questions and suggestions, please open an issue first (Chinese and English are both fine); you can also join the WeChat group to discuss:

16 changes: 14 additions & 2 deletions README.md
@@ -40,6 +40,8 @@ To install beta releases:

## Usage

Use the latest impersonation targets; do NOT copy `chrome110` from the examples here without changing it.
Contributor

I think this section could be improved


### requests-like

```python
@@ -74,14 +76,21 @@ print(r.json())
# {'cookies': {'foo': 'bar'}}
```

Supported impersonate versions, as supported by [curl-impersonate](https://github.com/lwthiker/curl-impersonate):
Supported impersonate versions, as supported by my [fork](https://github.com/yifeikong/curl-impersonate) of [curl-impersonate](https://github.com/lwthiker/curl-impersonate):

However, only Chrome-like browsers are supported.
Contributor

You could be more explicit here about why the others are not supported.

I agree, it would be great to have some documentation explaining why we don't support Firefox or the other browsers supported by curl-impersonate.

Possibly simply adding a link to #59 (comment) will be enough.

Collaborator Author

Apologies for the misunderstanding; I'll make sure it's well documented in the new version.

@Kwsswart Dec 21, 2023

Please never apologize! Thank you for the work on this, mate, it's an amazing repository! I'm only reviewing to try to help somewhat ^^


- chrome99
- chrome100
- chrome101
- chrome104
- chrome107
- chrome110
- chrome116
- chrome117
- chrome118
- chrome119
- chrome120
- chrome99_android
- edge99
- edge101
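
To make the advice about not hard-coding `chrome110` concrete, here is a minimal sketch that picks the newest desktop Chrome target from the names visible in this hunk. The `latest_chrome` helper is hypothetical and not part of curl_cffi, and whether a given target actually works depends on the curl-impersonate build that ships with the package, as discussed in the review thread.

```python
import re

# Targets copied from the list in this hunk (not the full README list).
SUPPORTED = [
    "chrome99", "chrome100", "chrome101", "chrome104", "chrome107",
    "chrome110", "chrome116", "chrome117", "chrome118", "chrome119",
    "chrome120", "chrome99_android", "edge99", "edge101",
]


def latest_chrome(targets):
    """Return the desktop Chrome target with the highest version number."""
    versions = []
    for name in targets:
        # fullmatch skips suffixed names such as chrome99_android and edge*.
        match = re.fullmatch(r"chrome(\d+)", name)
        if match:
            versions.append((int(match.group(1)), name))
    if not versions:
        raise ValueError("no desktop Chrome target found")
    return max(versions)[1]  # compare numerically, not as strings


target = latest_chrome(SUPPORTED)
print(target)
```

The result would then be passed as the `impersonate` argument, in the same way the README examples pass `chrome110`.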
@@ -140,7 +149,10 @@ print(body.decode())

See the [docs](https://curl-cffi.readthedocs.io) for more details.

If you are using scrapy, check out this middleware: [tieyongjie/scrapy-fingerprint](https://github.com/tieyongjie/scrapy-fingerprint)
If you are using scrapy, check out these middlewares:

- [tieyongjie/scrapy-fingerprint](https://github.com/tieyongjie/scrapy-fingerprint)
- [jxlil/scrapy-impersonate](https://github.com/jxlil/scrapy-impersonate)

## Acknowledgement

2 changes: 1 addition & 1 deletion curl_cffi/__version__.py
@@ -7,5 +7,5 @@
# __description__ = metadata.metadata("curl_cffi")["Summary"]
# __version__ = metadata.version("curl_cffi")
__description__ = "libcurl ffi bindings for Python, with impersonation support"
-__version__ = "0.5.10"
+__version__ = "0.6.0"
__curl_version__ = Curl().version().decode()
9 changes: 9 additions & 0 deletions curl_cffi/const.py
@@ -527,3 +527,12 @@ class CurlHttpVersion(IntEnum):
V2TLS = 4 # use version 2 for HTTPS, version 1.1 for HTTP */
V2_PRIOR_KNOWLEDGE = 5 # please use HTTP 2 without HTTP/1.1 Upgrade */
V3 = 30 # Makes use of explicit HTTP/3 without fallback.


class CurlWsFlag(IntEnum):
    TEXT = 1 << 0
    BINARY = 1 << 1
    CONT = 1 << 2
    CLOSE = 1 << 3
    PING = 1 << 4
    OFFSET = 1 << 5
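
Because each member occupies its own bit, the values can be OR-ed together and tested with a bitwise AND. A small sketch of that usage follows; the enum is redefined locally so the snippet runs on its own, and `describe_flags` is a hypothetical helper, not part of this PR.

```python
from enum import IntEnum


class CurlWsFlag(IntEnum):
    # Values mirror the enum added in this diff.
    TEXT = 1 << 0
    BINARY = 1 << 1
    CONT = 1 << 2
    CLOSE = 1 << 3
    PING = 1 << 4
    OFFSET = 1 << 5


def describe_flags(flags: int) -> list:
    """Return the names of all CurlWsFlag bits set in `flags`."""
    return [flag.name for flag in CurlWsFlag if flags & flag]


# A continued text fragment would carry both the TEXT and CONT bits.
combined = CurlWsFlag.TEXT | CurlWsFlag.CONT
print(combined, describe_flags(combined))
```

Note that OR-ing two `IntEnum` members yields a plain `int`; `enum.IntFlag` would keep the result typed, which may or may not be worth it here.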
28 changes: 27 additions & 1 deletion curl_cffi/curl.py
@@ -5,7 +5,7 @@
from typing import Any, List, Tuple, Union

from ._wrapper import ffi, lib # type: ignore
-from .const import CurlHttpVersion, CurlInfo, CurlOpt
+from .const import CurlHttpVersion, CurlInfo, CurlOpt, CurlWsFlag

try:
import certifi
@@ -107,6 +107,10 @@ def _set_error_buffer(self):
self.setopt(CurlOpt.VERBOSE, 1)
lib._curl_easy_setopt(self._curl, CurlOpt.DEBUGFUNCTION, lib.debug_function)

    def debug(self):
        self.setopt(CurlOpt.VERBOSE, 1)
        lib._curl_easy_setopt(self._curl, CurlOpt.DEBUGFUNCTION, lib.debug_function)

def __del__(self):
self.close()

@@ -335,3 +339,25 @@ def close(self):
self._curl = None
ffi.release(self._error_buffer)
self._resolve = ffi.NULL

    def ws_recv(self, n: int = 1024):
        buffer = ffi.new("char[]", n)
        n_recv = ffi.new("int *")
        p_frame = ffi.new("struct curl_ws_frame **")

        ret = lib.curl_ws_recv(self._curl, buffer, n, n_recv, p_frame)
        self._check_error(ret, "WS_RECV")
        frame = p_frame[0]
        # print(frame.offset, frame.bytesleft)
Contributor

nit: remove commented print


        return ffi.buffer(buffer)[: n_recv[0]], frame

    def ws_send(self, payload: bytes, flags: CurlWsFlag = CurlWsFlag.BINARY) -> int:
        n_sent = ffi.new("int *")
        buffer = ffi.from_buffer(payload)
        ret = lib.curl_ws_send(self._curl, buffer, len(buffer), n_sent, 0, flags)
        self._check_error(ret, "WS_SEND")
        # Dereference the cdata pointer so the declared int is returned.
        return n_sent[0]

    def ws_close(self):
        self.ws_send(b"", CurlWsFlag.CLOSE)
11 changes: 11 additions & 0 deletions curl_cffi/ffi/cdef.c
@@ -46,3 +46,14 @@ struct CURLMsg *curl_multi_info_read(void* curlm, int *msg_in_queue);
extern "Python" void socket_function(void *curl, int sockfd, int what, void *clientp, void *socketp);
extern "Python" void timer_function(void *curlm, int timeout_ms, void *clientp);

// websocket
struct curl_ws_frame {
    int age;        /* zero */
    int flags;      /* See the CURLWS_* defines */
    long offset;    /* the offset of this data into the frame */
    long bytesleft; /* number of pending bytes left of the payload */
    ...;
};

int curl_ws_recv(void *curl, void *buffer, int buflen, int *recv, struct curl_ws_frame **meta);
int curl_ws_send(void *curl, void *buffer, int buflen, int *sent, int fragsize, unsigned int sendflags);
31 changes: 28 additions & 3 deletions curl_cffi/requests/session.py
@@ -20,6 +20,7 @@
from .errors import RequestsError
from .headers import Headers, HeaderTypes
from .models import Request, Response
from .websockets import WebSocket

try:
import gevent
@@ -47,6 +48,11 @@ class BrowserType(str, Enum):
chrome104 = "chrome104"
chrome107 = "chrome107"
chrome110 = "chrome110"
chrome116 = "chrome116"
chrome117 = "chrome117"
chrome118 = "chrome118"
chrome119 = "chrome119"
chrome120 = "chrome120"
chrome99_android = "chrome99_android"
safari15_3 = "safari15_3"
safari15_5 = "safari15_5"
@@ -580,6 +586,13 @@ def stream(self, *args, **kwargs):
finally:
rsp.close()

    def connect(self, url, *args, **kwargs):
        self._set_curl_options(self.curl, "GET", url, *args, **kwargs)
        # https://curl.se/docs/websocket.html
        self.curl.setopt(CurlOpt.CONNECT_ONLY, 2)
        self.curl.perform()
        return WebSocket(self, self.curl)

def request(
self,
method: str,
@@ -752,7 +765,7 @@ def __init__(
```
"""
super().__init__(**kwargs)
-self.loop = loop
+self._loop = loop
self._acurl = async_curl
self.max_clients = max_clients
self._closed = False
@@ -763,10 +776,14 @@ def __init__(
):
warnings.warn(WINDOWS_WARN)

    @property
    def loop(self):
        if self._loop is None:
            self._loop = asyncio.get_running_loop()
        return self._loop

    @property
    def acurl(self):
-        if self.loop is None:
-            self.loop = asyncio.get_running_loop()
        if self._acurl is None:
            self._acurl = AsyncCurl(loop=self.loop)
        return self._acurl
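
The point of moving the lookup into a `loop` property is that `asyncio.get_running_loop()` only succeeds while a loop is running, so resolving it lazily lets the object be constructed outside a coroutine. A minimal sketch of the pattern, using a hypothetical `LazyLoopHolder` class as a stand-in for the session:

```python
import asyncio


class LazyLoopHolder:
    """Hypothetical stand-in for the session; resolves the loop on first use."""

    def __init__(self, loop=None):
        self._loop = loop  # may be None when constructed outside a coroutine

    @property
    def loop(self):
        if self._loop is None:
            # Raises RuntimeError unless an event loop is currently running.
            self._loop = asyncio.get_running_loop()
        return self._loop


holder = LazyLoopHolder()  # safe: no loop is needed yet


async def main():
    # First access happens inside the running loop, so the lookup succeeds.
    return holder.loop is asyncio.get_running_loop()


same_loop = asyncio.run(main())
print(same_loop)
```

One consequence of caching is that the holder stays bound to the first loop it sees, so reusing it across separate `asyncio.run` calls would hand back a closed loop.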
@@ -827,6 +844,14 @@ async def stream(self, *args, **kwargs):
finally:
await rsp.aclose()

    async def connect(self, url, *args, **kwargs):
        curl = await self.pop_curl()
        # curl.debug()
        self._set_curl_options(curl, "GET", url, *args, **kwargs)
        curl.setopt(CurlOpt.CONNECT_ONLY, 2)  # https://curl.se/docs/websocket.html
        await self.loop.run_in_executor(None, curl.perform)
        return WebSocket(self, curl)
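
The async `connect` keeps the event loop responsive by handing the blocking `perform` call to a worker thread through `run_in_executor`. The sketch below shows the same pattern in isolation; `blocking_handshake` is a hypothetical stand-in for the blocking call and does no networking.

```python
import asyncio
import time


def blocking_handshake(delay: float) -> str:
    """Hypothetical stand-in for a blocking call such as curl.perform."""
    time.sleep(delay)
    return "connected"


async def connect(delay: float) -> str:
    loop = asyncio.get_running_loop()
    # Passing None selects the loop's default ThreadPoolExecutor.
    return await loop.run_in_executor(None, blocking_handshake, delay)


async def main():
    # Each handshake runs in its own worker thread, so the two can overlap.
    return await asyncio.gather(connect(0.05), connect(0.05))


results = asyncio.run(main())
print(results)
```

The trade-off is one thread per in-flight blocking call, which is simple but does not scale like a socket-driven approach would.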

async def request(
self,
method: str,
54 changes: 54 additions & 0 deletions curl_cffi/requests/websockets.py
@@ -0,0 +1,54 @@
import asyncio
from curl_cffi.const import CurlECode, CurlWsFlag
from curl_cffi.curl import CurlError


class WebSocket:
    def __init__(self, session, curl):
        self.session = session
        self.curl = curl
        self._loop = None

    def recv_fragment(self):
        return self.curl.ws_recv()

    def recv(self):
        chunks = []
        # TODO use select here
        while True:
            try:
                chunk, frame = self.curl.ws_recv()
                chunks.append(chunk)
                if frame.bytesleft == 0:
                    break
            except CurlError as e:
                if e.code == CurlECode.AGAIN:
                    pass
                else:
                    raise

        return b"".join(chunks)

    def send(self, payload: bytes, flags: CurlWsFlag = CurlWsFlag.BINARY):
        return self.curl.ws_send(payload, flags)

    def close(self):
        # FIXME how to reset. or can a curl handle connect to two websockets?
        self.curl.close()

    @property
    def loop(self):
        if self._loop is None:
            self._loop = asyncio.get_running_loop()
        return self._loop

    async def arecv(self):
        return await self.loop.run_in_executor(None, self.recv)

    async def asend(self, payload: bytes, flags: CurlWsFlag = CurlWsFlag.BINARY):
        return await self.loop.run_in_executor(None, self.send, payload, flags)

    async def aclose(self):
        await self.loop.run_in_executor(None, self.close)
        self.curl.reset()
        self.session.push_curl(self.curl)
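
The `recv` loop above joins fragments until `frame.bytesleft` reaches zero and retries whenever the handle reports that nothing is readable yet. That control flow can be exercised without a network using a scripted fake; `FakeCurl` and `Again` below are hypothetical stand-ins for the curl handle and for `CurlError` with code `CurlECode.AGAIN`.

```python
from collections import deque
from types import SimpleNamespace


class Again(Exception):
    """Hypothetical stand-in for CurlError with code CurlECode.AGAIN."""


class FakeCurl:
    """Hypothetical handle that replays scripted ws_recv results."""

    def __init__(self, script):
        self._script = deque(script)

    def ws_recv(self):
        item = self._script.popleft()
        if item is Again:
            raise Again()
        chunk, bytesleft = item
        return chunk, SimpleNamespace(bytesleft=bytesleft)


def recv_message(curl):
    """Join fragments until the frame reports no bytes left."""
    chunks = []
    while True:
        try:
            chunk, frame = curl.ws_recv()
        except Again:
            continue  # nothing readable yet; try again
        chunks.append(chunk)
        if frame.bytesleft == 0:
            return b"".join(chunks)


curl = FakeCurl([Again, (b"Hello ", 4), Again, (b"Foo!", 0)])
message = recv_message(curl)
print(message)
```

As the TODO in the diff notes, retrying immediately is a busy loop; waiting on the socket with `select` would avoid spinning while no data is available.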
21 changes: 21 additions & 0 deletions examples/websocket.py
@@ -0,0 +1,21 @@
import asyncio
from curl_cffi import requests

with requests.Session() as s:
    w = s.connect("ws://localhost:8765")
    w.send(b"Foo")
    reply = w.recv()
    print(reply)
    assert reply == b"Hello Foo!"


async def async_examples():
    async with requests.AsyncSession() as s:
        w = await s.connect("ws://localhost:8765")
        await w.asend(b"Bar")
        reply = await w.arecv()
        print(reply)
        assert reply == b"Hello Bar!"


asyncio.run(async_examples())
18 changes: 18 additions & 0 deletions examples/websocket_server.py
@@ -0,0 +1,18 @@
import asyncio
import websockets

async def hello(websocket):
    name = (await websocket.recv()).decode()
    print(f"<<< {name}")

    greeting = f"Hello {name}!"

    await websocket.send(greeting)
    print(f">>> {greeting}")

async def main():
    async with websockets.serve(hello, "localhost", 8765):
        await asyncio.Future()  # run forever

if __name__ == "__main__":
    asyncio.run(main())