w3lib release notes
===================
2.2.1 (2024-06-12)
------------------
- :func:`~w3lib.url.canonicalize_url` no longer applies lowercase to the
userinfo URL component. (#229, #230)
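
The host part of a URL is case-insensitive, while the userinfo (user name and password) is not. The following is a minimal standard-library sketch of that distinction. ``lowercase_host_only`` is a hypothetical helper used for illustration only; it is not part of w3lib and does not reflect how ``canonicalize_url`` is implemented.

```python
from urllib.parse import urlsplit, urlunsplit


def lowercase_host_only(url: str) -> str:
    """Lowercase the host (and port) of ``url``, leaving the userinfo as-is."""
    parts = urlsplit(url)
    # netloc has the form "userinfo@host:port"; only the text after the
    # last "@" is case-insensitive.
    userinfo, at, hostport = parts.netloc.rpartition("@")
    netloc = f"{userinfo}{at}{hostport.lower()}"
    return urlunsplit(parts._replace(netloc=netloc))


print(lowercase_host_only("https://User:PaSS@Example.COM/path"))
# https://User:PaSS@example.com/path
```

Lowercasing ``User:PaSS`` as well would silently change the credentials.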
2.2.0 (2024-06-05)
------------------
- Dropped Python 3.7 support (#214).
- Added Python 3.12 and PyPy 3.10 support (#218).
- Added the description to the package metadata (#227).
- Improved type hints (#226).
- Added ``.readthedocs.yml`` (#219).
- Updated the intersphinx URLs (#224).
- Added the ``pre-commit`` configuration, code reformatted with ``black``
(#220).
- Updated CI configuration (#217, #227).
2.1.2 (2023-08-03)
------------------
- Fix test failures on Python 3.11.4+ (#212, #213).
- Fix an incorrect type hint (#211).
- Add project URLs to setup.py (#215).
2.1.1 (2022-12-09)
------------------
- :func:`~w3lib.url.safe_url_string`, :func:`~w3lib.url.safe_download_url`
and :func:`~w3lib.url.canonicalize_url` now strip whitespace and control
characters from URLs according to the URL living standard.
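
For reference, the URL living standard's parser performs two clean-up steps on its input: it removes leading and trailing C0 control characters and spaces, and it removes ASCII tabs and newlines anywhere in the string. The sketch below reproduces those two steps with the standard library only. ``strip_url`` is a hypothetical name, and the code is an illustration, not w3lib's implementation.

```python
# C0 control characters (U+0000 to U+001F) plus space (U+0020).
C0_CONTROL_OR_SPACE = "".join(chr(i) for i in range(0x21))
# ASCII tab or newline, removed from anywhere in the input.
TAB_OR_NEWLINE = str.maketrans("", "", "\t\n\r")


def strip_url(url: str) -> str:
    """Trim C0 controls/spaces at both ends, then drop tabs and newlines."""
    return url.strip(C0_CONTROL_OR_SPACE).translate(TAB_OR_NEWLINE)


print(repr(strip_url(" \x00https://exam\nple.com/a b\t \r\n")))
# 'https://example.com/a b'
```

Note that the space inside the path is kept; only tabs and newlines are removed from the middle of the string.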
2.1.0 (2022-11-28)
------------------
- Dropped Python 3.6 support, and made Python 3.11 support official. (#195,
#200)
- :func:`~w3lib.url.safe_url_string` now generates safer URLs.

  To make URLs safer for the `URL living standard`_:

  .. _URL living standard: https://url.spec.whatwg.org/

  - ``;=`` are percent-encoded in the URL username.

  - ``;:=`` are percent-encoded in the URL password.

  - ``'`` is percent-encoded in the URL query if the URL scheme is `special
    <https://url.spec.whatwg.org/#special-scheme>`__.

  To make URLs safer for `RFC 2396`_ and `RFC 3986`_, ``|[]`` are
  percent-encoded in URL paths, queries, and fragments.

  .. _RFC 2396: https://www.ietf.org/rfc/rfc2396.txt
  .. _RFC 3986: https://www.ietf.org/rfc/rfc3986.txt

  (#80, #203)

- :func:`~w3lib.encoding.html_to_unicode` now checks for the `byte order
mark`_ before inspecting the ``Content-Type`` header when determining the
content encoding, in line with the `URL living standard`_. (#189, #191)
.. _byte order mark: https://en.wikipedia.org/wiki/Byte_order_mark
- :func:`~w3lib.url.canonicalize_url` now strips spaces from the input URL,
to be more in line with the `URL living standard`_. (#132, #136)
- :func:`~w3lib.html.get_base_url` now ignores HTML comments. (#70, #77)
- Fixed :func:`~w3lib.url.safe_url_string` re-encoding percent signs on
the URL username and password even when they were being used as part of an
escape sequence. (#187, #196)
- Fixed :func:`~w3lib.http.basic_auth_header` using the wrong flavor of
base64 encoding, which could prevent authentication in rare cases. (#181,
#192)
- Fixed :func:`~w3lib.html.replace_entities` raising :exc:`OverflowError` in
some cases due to `a bug in CPython
<https://github.com/python/cpython/issues/76763>`__. (#199, #202)
- Improved typing and fixed typing issues. (#190, #206)
- Made CI and test improvements. (#197, #198)
- Adopted a Code of Conduct. (#194)
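
The ``basic_auth_header`` entry above does not name the base64 flavors involved. The sketch below assumes they are the standard and the URL-safe alphabets, which differ only in the characters used for values 62 and 63. It uses the standard library only and made-up credentials.

```python
import base64

# Hypothetical credentials chosen so that the encoding needs value 62.
credentials = b"ab~:pw"

standard = base64.b64encode(credentials)
url_safe = base64.urlsafe_b64encode(credentials)

print(standard)  # b'YWJ+OnB3'
print(url_safe)  # b'YWJ-OnB3'
```

The two results differ only when the encoded credentials contain ``+`` or ``/``, which is consistent with the problem appearing only in rare cases.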
2.0.1 (2022-08-11)
------------------
Minor documentation fix (release date is set in the changelog).
2.0.0 (2022-08-11)
------------------
Backwards incompatible changes:
- Python 2 is no longer supported; Python 3.6+ is required now (#168, #175).
- :func:`w3lib.url.safe_url_string` and :func:`w3lib.url.canonicalize_url`
no longer convert "%23" to "#" when it appears in the URL path. This is a bug
fix. It's listed as a backward-incompatible change because in some cases the
output of :func:`w3lib.url.canonicalize_url` is going to change, and so, if
this output is used to generate URL fingerprints, new fingerprints might be
incompatible with those created with the previous w3lib versions
(#141).
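
To see why decoding ``%23`` in a path is a bug, consider what happens when the result is parsed again. This is a standard-library illustration with a made-up URL; it does not call w3lib.

```python
from urllib.parse import unquote, urlsplit

original = "http://example.com/a%23b?x=1"
decoded = unquote(original)  # "%23" becomes a literal "#"

print(urlsplit(original).path)     # /a%23b
print(urlsplit(decoded).path)      # /a
print(urlsplit(decoded).fragment)  # b?x=1
```

After decoding, the path is truncated and the rest of the URL, including the query string, is reinterpreted as a fragment, so the URL refers to a different resource.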
Deprecation removals (#169):
- The ``w3lib.form`` module is removed.
- The ``w3lib.html.remove_entities`` function is removed.
- The ``w3lib.url.urljoin_rfc`` function is removed.
The following functions are deprecated, and will be removed in future releases
(#170):
- ``w3lib.util.str_to_unicode``
- ``w3lib.util.unicode_to_str``
- ``w3lib.util.to_native_str``
Other improvements and bug fixes:
- Type annotations are added (#172, #184).
- Added support for Python 3.9 and 3.10 (#168, #176).
- Fixed :func:`w3lib.html.get_meta_refresh` for ``<meta>`` tags where
``http-equiv`` is written after ``content`` (#179).
- Fixed :func:`w3lib.url.safe_url_string` for IDNA domains with ports (#174).
- :func:`w3lib.url.url_query_cleaner` no longer adds an unneeded ``#`` when
``keep_fragments=True`` is passed, and the URL doesn't have a fragment
(#159).
- Removed a workaround for an ancient ``pathname2url`` bug (#142).
- CI is migrated to GitHub Actions (#166, #177); other CI improvements (#160,
#182).
- The code is formatted using black (#173).
1.22.0 (2020-05-13)
-------------------
- Python 3.4 is no longer supported (issue #156)
- :func:`w3lib.url.safe_url_string` now supports an optional ``quote_path``
parameter to disable the percent-encoding of the URL path (issue #119)
- :func:`w3lib.url.add_or_replace_parameter` and
:func:`w3lib.url.add_or_replace_parameters` no longer remove duplicate
parameters from the original query string that are not being added or
replaced (issue #126)
- :func:`w3lib.html.remove_tags` now raises a :exc:`ValueError` exception
instead of :exc:`AssertionError` when using both the ``which_ones`` and the
``keep`` parameters (issue #154)
- Test improvements (issues #143, #146, #148, #149)
- Documentation improvements (issues #140, #144, #145, #151, #152, #153)
- Code cleanup (issue #139)
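
The behavior described for ``add_or_replace_parameter`` can be pictured with the following standard-library sketch. ``replace_param`` is a hypothetical helper, not w3lib's code. How it treats repeated occurrences of the parameter being replaced (it keeps only the first) is a choice made for this sketch, not something stated in these notes.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def replace_param(url: str, name: str, value: str) -> str:
    """Set ``name`` to ``value`` in the query, keeping unrelated duplicates."""
    parts = urlsplit(url)
    pairs = []
    replaced = False
    for key, old in parse_qsl(parts.query, keep_blank_values=True):
        if key != name:
            pairs.append((key, old))  # untouched, duplicates included
        elif not replaced:
            pairs.append((key, value))  # replace the first occurrence in place
            replaced = True
    if not replaced:
        pairs.append((name, value))  # parameter was absent: add it
    return urlunsplit(parts._replace(query=urlencode(pairs)))


print(replace_param("http://example.com/?a=1&a=2&b=3", "b", "9"))
# http://example.com/?a=1&a=2&b=9
```

Both ``a`` parameters survive because they are neither added nor replaced.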
1.21.0 (2019-08-09)
-------------------
- Add the ``encoding`` and ``path_encoding`` parameters to
:func:`w3lib.url.safe_download_url` (issue #118)
- :func:`w3lib.url.safe_url_string` now also removes tabs and new lines
(issue #133)
- :func:`w3lib.html.remove_comments` now also removes truncated comments
(issue #129)
- :func:`w3lib.html.remove_tags_with_content` no longer removes tags which
start with the same text as one of the specified tags (issue #114)
- Recommend pytest instead of nose to run tests (issue #124)
1.20.0 (2019-01-11)
-------------------
- Fix ``url_query_cleaner`` so that it does not append ``?`` to URLs without a query string (issue #109)
- Add support for Python 3.7 and drop Python 3.3 (issue #113)
- Add ``w3lib.url.add_or_replace_parameters`` helper (issue #117)
- Documentation fixes (issue #115)
1.19.0 (2018-01-25)
-------------------
- Add a workaround for a CPython segfault (https://bugs.python.org/issue32583)
which affects ``w3lib.encoding`` functions. This is technically **backwards
incompatible** because it changes the way non-decodable bytes are replaced
(in some cases instead of two ``\ufffd`` chars you can get one).
As a side effect, the fix speeds up decoding in Python 3.4+.
- Add the ``encoding`` parameter to ``w3lib.http.basic_auth_header``.
- Fix pypy testing setup, add pypy3 to CI.
1.18.0 (2017-08-03)
-------------------
- Include additional assets used for distribution packages in the source tarball
- Consider ``[`` and ``]`` as safe characters in path and query components
of URLs, i.e. they are not escaped anymore
- Disable codecov project coverage check
1.17.0 (2017-02-08)
-------------------
- Add Python 3.5 and 3.6 support
- Add ``w3lib.url.parse_data_uri`` helper for parsing "data:" URIs
- Add ``w3lib.html.strip_html5_whitespace`` function to strip leading and
trailing whitespace as per W3C recommendations, e.g. for cleaning
"href" attribute values
- Fix ``w3lib.http.headers_raw_to_dict`` for multiple headers with same name
- Do not distribute tests/test_*.pyc artifacts
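
As a rough picture of what stripping HTML5 whitespace means, the sketch below trims only the five ASCII whitespace characters that HTML defines (space, tab, line feed, form feed, carriage return). ``strip_whitespace`` is a hypothetical stand-in written with the standard library; the actual ``strip_html5_whitespace`` may be implemented differently.

```python
# ASCII whitespace as defined by HTML: space, tab, LF, CR, form feed.
HTML5_WHITESPACE = " \t\n\r\x0c"


def strip_whitespace(value: str) -> str:
    """Strip leading and trailing HTML5 whitespace from ``value``."""
    return value.strip(HTML5_WHITESPACE)


print(repr(strip_whitespace("\n  /page.html?id=1 \t")))
# '/page.html?id=1'
```

Unlike a bare ``str.strip()``, this leaves other Unicode whitespace, such as a non-breaking space, in place.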
1.16.0 (2016-11-10)
-------------------
- ``canonicalize_url()`` and ``safe_url_string()``:
strip ":" when no port is specified (as per `RFC 3986`_;
see also https://github.com/scrapy/scrapy/issues/2377)
- ``url_query_cleaner()``: support new ``keep_fragments`` argument
(defaulting to ``False``)
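
The ``:`` stripping mentioned above amounts to dropping a port delimiter that is not followed by a port. A minimal standard-library sketch follows. ``drop_empty_port`` is a hypothetical helper, not the code used by ``canonicalize_url()`` or ``safe_url_string()``.

```python
from urllib.parse import urlsplit, urlunsplit


def drop_empty_port(url: str) -> str:
    """Remove a trailing ":" from the authority when no port follows it."""
    parts = urlsplit(url)
    netloc = parts.netloc
    if netloc.endswith(":"):
        netloc = netloc[:-1]
    return urlunsplit(parts._replace(netloc=netloc))


print(drop_empty_port("http://www.example.com:/path"))
# http://www.example.com/path
```

URLs that do specify a port, such as ``http://www.example.com:8080/``, are returned unchanged.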
1.15.0 (2016-07-29)
-------------------
- Add ``canonicalize_url()`` to ``w3lib.url``
1.14.3 (2016-07-14)
-------------------
Bugfix release:
- Handle IDNA encoding failures in ``safe_url_string()`` (issue #62)
1.14.2 (2016-04-11)
-------------------
Bugfix release:
- fix function import for (deprecated) ``urljoin_rfc`` (issue #51)
- only expose wanted functions from ``w3lib.url``, via ``__all__``
(see issue #54, https://github.com/scrapy/scrapy/issues/1917)
1.14.1 (2016-04-07)
-------------------
Bugfix release:
- For bytes URLs, when the supplied encoding (or the default, UTF-8) is wrong,
``safe_url_string`` falls back to percent-encoding the offending bytes.
1.14.0 (2016-04-06)
-------------------
Changes to ``safe_url_string``:
- proper handling of non-ASCII characters in Python 2 and Python 3
- support IDNs
- new ``path_encoding`` parameter to override the default UTF-8 when
serializing non-ASCII characters before percent-encoding
``html_body_declared_encoding`` also detects the encoding when it is not the
sole attribute in ``<meta>``.
Package is now properly marked as ``zip_safe``.
1.13.0 (2015-11-05)
-------------------
- remove_tags removes uppercase tags as well;
- ignore meta-redirects inside script or noscript tags by default,
but add an option to not ignore them;
- replace_entities now handles entities without trailing semicolon;
- fixed uncaught UnicodeDecodeError when decoding entities.
1.12.0 (2015-06-29)
-------------------
- meta_refresh regex now handles leading newlines and whitespace in the URL;
- include tests folder in source distribution.
1.11.0 (2015-01-13)
-------------------
- url_query_cleaner now supports str or list parameters;
- add support for resolving base URLs in <base> tags with attributes
before href.
1.10.0 (2014-08-20)
-------------------
- reverted all 1.9.0 changes.
1.9.0 (2014-08-16)
------------------
- all url-related functions accept bytes and unicode and now return bytes.
1.8.1 (2014-08-14)
------------------
- w3lib.http.basic_auth_header now returns bytes
1.8.0 (2014-07-31)
------------------
- add support for big5-hkscs encoding.
1.7.1 (2014-07-26)
------------------
- fixed headers_raw_to_dict and headers_dict_to_raw on Python 3;
- documentation improvements;
- provide wheels.
1.6 (2014-06-03)
----------------
- w3lib.form.encode_multipart is deprecated;
- docstrings and docs are improved;
- w3lib.url.add_or_replace_parameter is re-implemented on top of
stdlib functions;
- remove_entities is renamed to replace_entities.
1.5 (2013-11-09)
----------------
- Python 2.6 support is dropped.
1.4 (2013-10-18)
----------------
- Python 3 support;
- get_meta_refresh encoding handling is fixed;
- check for '?' in add_or_replace_parameter;
- ISO-8859-1 is used for HTTP Basic Auth;
- fixed unicode handling in replace_escape_chars.
1.3 (2012-05-13)
----------------
- support non-standard gb_2312_80 encoding;
- drop Python 2.5 support.
1.2 (2012-05-02)
----------------
- Detect the encoding when the ``content`` attribute comes before ``http-equiv`` in a ``<meta>`` tag.
1.1 (2012-04-18)
----------------
- w3lib.html.remove_comments handles multiline comments;
- Added w3lib.encoding module, containing functions for working with character
encoding, like encoding autodetection from HTML pages.
- w3lib.url.urljoin_rfc is deprecated.
1.0 (2011-04-17)
----------------
First release of w3lib.