Releases: shailshouryya/yt-videos-list
0.6.7: Fix pip installation problem and improve features & performance
- BUGFIX
  - fix `pip` installation problem due to incorrectly formatted version specifiers
  - update video duration extraction to correctly extract the duration of each video and avoid writing 'N/A'
- FEATURE IMPROVEMENTS
  - improve identification of seen videos in csv files by
    - avoiding potentially brittle regular expression matching
    - parsing each row of the csv file and extracting the (Video ID|Video URL) value from the corresponding column directly
  - normalize whitespace to avoid including newlines, carriage returns, and multiple consecutive whitespace characters in the video title
  - improve logging messages by including `time.time()` and `time.perf_counter()` when logging the time taken to perform an operation
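The column-based lookup of seen videos can be sketched as follows. This is a minimal illustration, not the project's actual code: the helper name `seen_video_values` is hypothetical, and the header names are assumed to match the `Video ID` and `Video URL` column names mentioned in these notes.

```python
import csv
import io


def seen_video_values(csv_text):
    """Collect the Video ID / Video URL value from every row of a csv document.

    Hypothetical helper: reads the value from the named column instead of
    pattern-matching each raw line with a regular expression.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    # Use whichever of the two columns the header actually contains.
    column = 'Video ID' if 'Video ID' in (reader.fieldnames or []) else 'Video URL'
    return {row[column] for row in reader if row.get(column)}


sample = (
    'Video Number,Video Title,Video Duration,Video URL\n'
    '2,"Title, with a comma",12:34,https://www.youtube.com/watch?v=abcdefghijk\n'
    '1,Another title,1:02,https://www.youtube.com/watch?v=lmnopqrstuv\n'
)
seen = seen_video_values(sample)
```

Because the csv parser handles quoting, a title that itself contains a comma does not shift the column the value is read from.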
- PERFORMANCE IMPROVEMENTS
  - increase write efficiency by completely avoiding writing to a temporary file when no new videos are found for an existing file
- INTERNAL IMPROVEMENT
  - the following change does not affect the functionality of the program
    - add unit tests for the video title whitespace normalization
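As a rough illustration of the title whitespace normalization and the kind of unit test mentioned in this release, the sketch below collapses newlines, carriage returns, and runs of whitespace into single spaces. The function name `normalize_whitespace` is hypothetical, and the program's real implementation may differ.

```python
def normalize_whitespace(title):
    """Collapse newlines, carriage returns, and whitespace runs into single spaces."""
    # str.split() with no arguments splits on any run of whitespace
    # and drops leading/trailing whitespace.
    return ' '.join(title.split())


# Small table of raw titles and the normalized form each should produce.
cases = {
    'A  title\nwith\r\nline breaks': 'A title with line breaks',
    '  padded\ttitle  ': 'padded title',
    'already clean': 'already clean',
}
for raw, expected in cases.items():
    assert normalize_whitespace(raw) == expected
```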
0.6.6: Update scraping logic for the new UI
- BUGFIX
  - around mid-late October, YouTube rolled out a new UI that changed rendering of different parts of the website, including the videos page
  - this broke the previous scraping logic, and this release fixes the endpoints to correctly extract video information
  - for more information, see the following references:
0.6.5: Support newer driver binaries
- BINARY UPDATES
- Mozilla Firefox
- geckodriver v0.32.0 (Firefox versions ≥ 104)
- geckodriver v0.31.0 (Firefox versions ≥ 99)
- Opera Stable 82, 83, 84, 85, 88, 89, 90, 91, 92 & 93
- operadriver v.107.0.5304.88 (Opera Stable 93)
- operadriver v.106.0.5249.119 (Opera Stable 92)
- operadriver v.105.0.5195.102 (Opera Stable 91)
- operadriver v.104.0.5112.81 (Opera Stable 90)
- operadriver v.103.0.5060.66 (Opera Stable 89)
- operadriver v.102.0.5005.61 (Opera Stable 88)
- there was no operadriver release specifically for version 101 (Opera Stable 87)
- there was no operadriver release specifically for version 100 (Opera Stable 86)
- operadriver v.99.0.4844.51 (Opera Stable 85)
- operadriver v.98.0.4758.82 (Opera Stable 84)
- operadriver v.97.0.4692.71 (Opera Stable 83)
- operadriver v.96.0.4664.45 (Opera Stable 82)
- Google Chrome version 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, & 108 (updated version 97 binaries)
- chromedriver 108.0.5359.22
- chromedriver 107.0.5304.62
- chromedriver 106.0.5249.61
- chromedriver 105.0.5195.52
- chromedriver 104.0.5112.79
- chromedriver 103.0.5060.134
- chromedriver 102.0.5005.61
- chromedriver 101.0.4951.41
- chromedriver 100.0.4896.60
- chromedriver 99.0.4844.51
- chromedriver 98.0.4758.102
- chromedriver 97.0.4692.71 (previously 97.0.4692.20)
- Brave Browser version 96, 97, 98, 99, 102, 103, 104, 105, 106, & 107
- bravedriver v.107.0.5304.88 (uses operadriver binaries)
- bravedriver v.106.0.5249.119 (uses operadriver binaries)
- bravedriver v.105.0.5195.102 (uses operadriver binaries)
- bravedriver v.104.0.5112.81 (uses operadriver binaries)
- bravedriver v.103.0.5060.66 (uses operadriver binaries)
- bravedriver v.102.0.5005.61 (uses operadriver binaries)
- there was no operadriver release specifically for version 101
- there was no operadriver release specifically for version 100
- bravedriver v.99.0.4844.51 (uses operadriver binaries)
- bravedriver v.98.0.4758.82 (uses operadriver binaries)
- bravedriver v.97.0.4692.71 (uses operadriver binaries)
- bravedriver v.96.0.4664.45 (uses operadriver binaries)
- Microsoft Edge version 100, 101, 102, 103, 104, 105, 106, 107, 108, & 109 (updated version 96, 97, & 98 binaries)
- msedgedriver 109.0.1481.0
- msedgedriver 108.0.1462.15
- msedgedriver 107.0.1418.42
- msedgedriver 106.0.1370.52
- msedgedriver 105.0.1343.53
- msedgedriver 104.0.1293.91
- msedgedriver 103.0.1264.77
- msedgedriver 102.0.1245.62
- msedgedriver 101.0.1210.53
- msedgedriver 100.0.1185.60
- there was no msedgedriver release specifically for version 99
- msedgedriver 98.0.1085.0 (previously 98.0.1086.0)
- msedgedriver 97.0.1072.76 (previously 97.0.1072.8)
- msedgedriver 96.0.1054.75 (previously 96.0.1054.26)
- MINOR BUGFIXES
- INTERNAL IMPROVEMENTS
0.6.4: Optimize multithreading and use explicit exception chaining
- BUGFIXES
  - update XPath for blocking cookies button
    - commit 62464aa
  - make `url` a required positional argument
    - commit 93029fa
- FEATURE IMPROVEMENTS
  - `raise` error instead of `print`ing message and then `sys.exit()`ing
    - see commits with a commit message starting with "Raise"
    - also, see commit d43ef6a
  - use explicit exception chaining
  - show warning for users on unsupported operating systems
  - include real time taken by program
    - see commits with a commit message
      - starting with "Include real time"
      - including `log_time_taken`
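For readers unfamiliar with explicit exception chaining, the sketch below shows the general Python pattern (`raise ... from ...`) combined with raising an error instead of printing a message and exiting. The helper `read_channel_urls` and the path are hypothetical and are not taken from the project's code.

```python
def read_channel_urls(path):
    """Hypothetical helper: read one channel URL per line from a text file."""
    try:
        with open(path, encoding='utf-8') as file:
            return [line.strip() for line in file if line.strip()]
    except OSError as error:
        # `raise ... from error` stores the original exception on __cause__,
        # so the traceback shows both the low-level and the high-level failure.
        raise RuntimeError(f'could not read channel urls from {path!r}') from error


try:
    read_channel_urls('no-such-directory/channel-urls.txt')
except RuntimeError as error:
    chained = error.__cause__  # the original FileNotFoundError
```

Raising lets the caller decide how to react, whereas printing and calling `sys.exit()` ends the whole interpreter session.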
- PERFORMANCE IMPROVEMENTS
  - optimize multithreading for `create_list_from` function
    - see commit 67d94a0 for more details
      - NOTE: `create_thread_from` mentioned in this commit message was a typo and should be `create_list_from`
    - see the following commits for related changes:
- INTERNAL IMPROVEMENTS
  - these changes do not affect the functionality of the program
    - interesting changes
    - displayed debugging information changes
    - testing/building changes
    - refactoring changes
      - rename variables to be more descriptive
      - rename functions to be more descriptive
      - reorganize code for readability
      - remove unnecessary intermediate variables
      - add intermediate variables for clarity
    - documentation changes
      - improve error messages
      - improve README
      - improve docstrings
0.6.3: Support newer driver binaries
- BINARY UPDATES
- Mozilla Firefox
- geckodriver v0.30.0 (Firefox versions ≥ 92)
- Opera Stable 77, 78, 79, 80, & 81
- operadriver v.95.0.4638.54 (Opera Stable 81)
- operadriver v.94.0.4606.61 (Opera Stable 80)
- operadriver v.93.0.4577.63 (Opera Stable 79)
- operadriver v.92.0.4515.107 (Opera Stable 78)
- operadriver v.91.0.4472.77 (Opera Stable 77)
- Google Chrome version 92, 93, 94, 95, 96, & 97 (updated version 91 binaries)
- chromedriver 97.0.4692.20
- chromedriver 96.0.4664.45
- chromedriver 95.0.4638.69
- chromedriver 94.0.4606.113
- chromedriver 93.0.4577.63
- chromedriver 92.0.4515.107
- chromedriver 91.0.4472.101 (previously 91.0.4472.19)
- Brave Browser version 91, 92, 93, 94, & 95
- operadriver v.95.0.4638.54 (uses operadriver binaries)
- operadriver v.94.0.4606.61 (uses operadriver binaries)
- operadriver v.93.0.4577.63 (uses operadriver binaries)
- operadriver v.92.0.4515.107 (uses operadriver binaries)
- operadriver v.91.0.4472.77 (uses operadriver binaries)
- Microsoft Edge version 93, 94, 95, 96, 97, & 98 (updated version 90, 91, & 92 binaries)
- msedgedriver 98.0.1086.0
- msedgedriver 97.0.1072.8
- msedgedriver 96.0.1054.26
- msedgedriver 95.0.1020.53
- msedgedriver 94.0.992.58
- msedgedriver 93.0.961.52
- msedgedriver 92.0.902.84 (previously 92.0.881.0)
- msedgedriver 91.0.864.71 (previously 91.0.864.19)
- msedgedriver 90.0.818.66 (previously 90.0.818.56)
- MINOR BUGFIXES
  - handle videos with no "Video Duration" field (commit 2f538e1)
    - this is an extremely rare edge case
      - based on anecdotal data, occurs about 1 in every 70000 videos
  - update URLs shown in exception messages (commit 3f09612 & commit 99ed682)
  - correctly handle unfinished threads in `create_list_from()` method (commit aa4ff3d)
  - generalize URL normalization for removing trailing parameters (commit 0789a3e)
    - this removes any trailing tracking parameters that might be associated with a video URL
      - e.g. youtube.com/watch?v=abcdefghijk?pp=sAQB → youtube.com/watch?v=abcdefghijk
  - verify page has videos (commit 82a4856)
    - prevents crashing on channels with 0 public videos
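The URL normalization can be illustrated with a small sketch. It is not the code from commit 0789a3e: the helper name is hypothetical, and it assumes a video ID is always 11 characters drawn from letters, digits, `-`, and `_`, as in the example URL.

```python
import re

# Assumption: a video ID is 11 characters of letters, digits, '-' or '_'.
WATCH_URL = re.compile(r'^(.*?watch\?v=[A-Za-z0-9_-]{11})')


def strip_trailing_parameters(url):
    """Hypothetical helper: drop anything that follows the video ID."""
    match = WATCH_URL.match(url)
    # URLs that do not look like watch URLs are returned unchanged.
    return match.group(1) if match else url


normalized = strip_trailing_parameters('youtube.com/watch?v=abcdefghijk?pp=sAQB')
```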
- LOGGING IMPROVEMENTS
- INTERNAL CHANGES
  - refactor code to:
    - reduce code duplication
    - make variable and function names more context specific
    - place repeated code inside variables
    - make browser naming more specific (commit 81144cb)
0.6.2: Explicitly order videos page & check existing videos more strictly
0.6.1: Change `create_list_for()` return, add features & improvements
- BREAKING CHANGE
  - BEFORE: `create_list_for()` returned a `str` containing the name of the file the program wrote to
  - NOW: `create_list_for()` returns a `tuple` containing
    - a `list` of `list`s containing the video information found by the program for the current run
      - by default, returns dummy video data to avoid cluttering the output
      - to return the actual video data, set the `video_data_returned` ListCreator attribute to `True`
      - dummy data: `[[0, '', '', '']]`
    - a `tuple` containing a `str` with the name of the channel (taken from the channel's heading) and a `str` with the name of the file written to
      - `('The Channel Name', 'the_name_of_the_file')`
      - `('The Channel Name', '')` if the ListCreator attributes are `txt=False`, `csv=False`, `md=False`, AND `video_data_returned=True`
  - see the NEW FEATURES section below for more details about `video_data_returned`
  - access the full documentation for the updated `create_list_for` method with `help(ListCreator.create_list_for)` in the python interpreter
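Code written against the old `str` return value needs to unpack the new tuple. The sketch below works on a literal with the documented shape rather than on a real call, so it makes no claims about the library beyond what these notes state. The helper `summarize_result` is hypothetical.

```python
def summarize_result(result):
    """Hypothetical helper: describe a value shaped like the new return tuple."""
    video_data, (channel_name, file_name) = result
    # The documented placeholder returned unless `video_data_returned` is set to True.
    dummy_data = [[0, '', '', '']]
    return {
        'channel': channel_name,
        'file': file_name or None,  # '' is documented for the case where no file format is enabled
        'has_video_data': video_data != dummy_data,
    }


# A literal with the documented shape, standing in for a real call to create_list_for().
example = ([[0, '', '', '']], ('The Channel Name', 'the_name_of_the_file'))
summary = summarize_result(example)
```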
- BUGFIX
  - fixes `cookie_consent` blocking logic for new HTML in GDPR regions
    - YouTube updated the HTML formatting for blocking cookie consent, and the previous cookie consent blocking logic broke
    - this release fixes the blocking logic to work with the new HTML formatting
- NEW FEATURES
  - overview for the new ListCreator attributes given here, but run `help(ListCreator)` in the python interpreter or read the "More API information" section in the python README to see the full documentation:
    - `file_suffix` allows more control over the file naming (`True` by default)
    - `all_video_data_in_memory` scrapes the ENTIRE YouTube channel's videos page, EVEN if files exist for the channel already (`False` by default)
      - must also set the `video_data_returned` attribute to `True` to actually get this information
    - `video_data_returned` returns the video data for all videos the program scraped (`False` by default)
      - data returned depends on a number of factors, see full documentation for more details
    - `video_id_only` saves only the video ID instead of the entire URL (`False` by default)
      - example: saves 'abcdefghijk' instead of 'https://www.youtube.com/watch?v=abcdefghijk'
  - overview for the updated `file_name` argument options in the `create_list_for` method given here, but run `help(ListCreator.create_list_for)` in the python interpreter to see the full documentation:
    - `file_name='auto'` names the output file(s) using the name that shows up under the banner when you navigate to the channel's homepage (with spaces removed)
    - `file_name='id'` names the output file(s) using the identifier from the URL provided to the `url` argument
      - run `help(ListCreator.create_list_for)` for a comprehensive list of examples
      - using `file_name='id'` is very useful when multiple channels have the SAME channel name
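To make the `video_id_only` example concrete, the sketch below derives the ID from a full watch URL with the standard library. It only illustrates the relationship between the two forms. The helper `video_id_from_url` is hypothetical, and how the program itself obtains the ID is not described in these notes.

```python
from urllib.parse import parse_qs, urlparse


def video_id_from_url(url):
    """Hypothetical helper: return the value of the `v` query parameter."""
    query = parse_qs(urlparse(url).query)
    # parse_qs maps each parameter name to a list of values.
    return query['v'][0]


video_id = video_id_from_url('https://www.youtube.com/watch?v=abcdefghijk')
```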
- PERFORMANCE IMPROVEMENTS
  - BEFORE:
    - the program pulled the video data from the selenium instance and wrote to the file(s) directly
  - NOW:
    - the program loads the video data from the selenium instance into memory, THEN writes the saved video data from memory to the file(s)
      - the performance improvement is more noticeable when writing more information
        - for example:
          - writing information for 200 videos to just a csv file: negligible performance difference between writing to csv file directly and loading to memory & THEN writing to csv file
          - writing information for 200 videos to csv, txt, md files: slight performance difference between writing to files directly and loading to memory & THEN writing to files, but still not much of a performance difference
          - writing information for 20000 videos to just a csv file: noticeable performance difference between writing to csv file directly and loading to memory & THEN writing to csv file
          - writing information for 20000 videos to csv, txt, md files: significant performance difference between writing to files directly and loading to memory & THEN writing to files
        - summary:
          - the performance difference between writing to ONE file directly and loading to memory & THEN writing to ONE file is barely noticeable for small jobs and more noticeable for larger jobs
          - the performance difference between writing to MULTIPLE files directly and loading to memory & THEN writing to MULTIPLE files is more noticeable for small jobs (compared to writing to only ONE file) and SIGNIFICANT for larger jobs
  - logs from tests used to benchmark performance included below:
See logs
for https://www.youtube.com/user/schafer5 (small channel, 230 videos)
writing to 1 file directly with csv=True, txt=False, md=False
- to create the file:
It took 9.240757292005583 seconds to find 230 videos from https://www.youtube.com/user/schafer5/videos
It took 4.265756259999762 seconds to write all 230 videos to CoreySchafer_reverse_chronological_videos_list.csv
This program took 19.537945401003526 seconds to complete.
- to update the file:
It took 0.8453300589972059 seconds to find 60 videos from https://www.youtube.com/user/schafer5/videos
It took 0.6392399440010195 seconds to write the 0 ***NEW*** videos to the pre-existing CoreySchafer_reverse_chronological_videos_list.csv
This program took 7.754261410002073 seconds to complete.
writing to 1 file by loading video information into memory THEN writing to files with csv=True, txt=False, md=False
- to create the file:
It took 9.163404727999989 seconds to find 230 videos from https://www.youtube.com/user/schafer5/videos
It took 4.260267737000007 seconds to load information for 230 videos into memory
It took 0.002389371999996115 seconds to write all 230 videos to CoreySchafer_reverse_chronological_videos_list.csv
This program took 19.483281371000004 seconds to complete.
- to update the file:
It took 0.8521808300000089 seconds to find 60 videos from https://www.youtube.com/user/schafer5/videos
It took 1.0964175420000117 seconds to load information for 60 videos into memory
It took 0.0015745449999826633 seconds to write the 0 ***NEW*** videos to the pre-existing CoreySchafer_reverse_chronological_videos_list.csv
This program took 7.985743492000012 seconds to complete.
writing to 3 files directly with csv=True, txt=True, md=True
- to create the files:
It took 9.166668037003546 seconds to find 230 videos from https://www.youtube.com/user/schafer5/videos
It took 10.160974278995127 seconds to write all 230 videos to CoreySchafer_reverse_chronological_videos_list.txt
It took 10.164936708999448 seconds to write all 230 videos to CoreySchafer_reverse_chronological_videos_list.csv
It took 10.168633003995637 seconds to write all 230 videos to CoreySchafer_reverse_chronological_videos_list.md
This program took 25.594990328005224 seconds to complete.
- to update the files:
It took 0.8503098270011833 seconds to find 60 videos from https://www.youtube.com/user/schafer5/videos
It took 1.5225159670007997 seconds to write the 0 ***NEW*** videos to the pre-existing CoreySchafer_reverse_chronological_videos_list.csv
It took 1.5322243859991431 seconds to write the 0 ***NEW*** videos to the pre-existing CoreySchafer_reverse_chronological_videos_list.txt
It took 1.5359413480036892 seconds to write the 0 ***NEW*** videos to the pre-existing CoreySchafer_reverse_chronological_videos_list.md
This program took 8.472728426997492 seconds to complete.
writing to 3 files by loading video information into memory THEN writing to files with csv=True, txt=True, md=True
- to create the files:
It took 9.367390958000005 seconds to find 230 videos from https://www.youtube.com/user/schafer5/videos
It took 4.218187391999997 seconds to load information for 230 videos into memory
It took 0.003894963000000473 seconds to write all 230 videos to CoreySchafer_reverse_chronological_videos_list.md
It took 0.005060710999998719 seconds to write all 230 videos to CoreySchafer_reverse_chronological_videos_list.csv
It took 0.006283445999997639 seconds to write all 230 videos to CoreySchafer_reverse_chronological_videos_list.txt
This program took 18.754924324 seconds to complete.
- to update the files:
It took 0.8672965029999986 seconds to find 60 videos from https://www.youtube.com/user/schafer5/videos
It took 1.0901944209999996 seconds to load information for 60 videos into memory
It took 0.005667658999996661 seconds to write the 0 ***NEW*** videos to the pre-existing CoreySchafer_reverse_chronological_videos_list.csv
It took 0.008393589000000645 seconds to write the 0 ***NEW*** videos to the pre-existing CoreySchafer_reverse_chronological_videos_list.txt
It took 0.008197031000001687 seconds to write the 0 ***NEW*** videos to the pre-existing CoreySchafer_reverse_chronological_videos_list.md
This program took 8.090583961999997 seconds to complete.
for https://www.youtube.com/c/KhanAcademy (medium channel, 8095 videos)
writing to 1 file directly with csv=True, txt=False, md=False
- to create the file:
It took 322.72226654399856 seconds to find 8095 videos from htt...
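The load-into-memory-then-write approach described under PERFORMANCE IMPROVEMENTS can be sketched roughly as follows. This is a simplified stand-in, not the program's code: a plain list replaces the selenium instance, and the helper `write_from_memory` and the file names are hypothetical.

```python
import os
import tempfile
from pathlib import Path


def write_from_memory(rows, directory, base_name, extensions=('csv', 'txt', 'md')):
    """Hypothetical helper: write already-collected rows to one file per extension."""
    paths = []
    for extension in extensions:
        path = os.path.join(directory, f'{base_name}.{extension}')
        with open(path, 'w', encoding='utf-8') as file:
            # Every file is produced from the same in-memory list, so the
            # (slow) data source is consulted once instead of once per file.
            file.writelines(f'{row}\n' for row in rows)
        paths.append(path)
    return paths


# Step 1: load everything into memory (a plain list stands in for the browser here).
rows_in_memory = [f'video {number}' for number in range(1, 4)]
# Step 2: write the saved data to every requested file.
with tempfile.TemporaryDirectory() as directory:
    written = write_from_memory(rows_in_memory, directory, 'example_videos_list')
    contents = [Path(path).read_text(encoding='utf-8') for path in written]
```

This matches the shape of the logs above, where the time to load information into memory is reported once and each subsequent file write takes only milliseconds.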
0.6.0: Add `verify_page_bottom_n_times`, `file_buffering`, Video Duration
- compare changes to previous version
- if you are an existing user, skim through the BREAKING CHANGE and NON-BREAKING CHANGES sections below
- if you are a new user, you do not need to worry about these sections - just skip to the NEW FEATURES section at the bottom and read the python README to get started
- BREAKING CHANGE
  - the program now extracts the video duration for every video uploaded by a channel
    - this will likely cause problems when updating pre-existing `csv` files, since
      - the video duration information goes in a new column
      - `csv` file renderers expect consistent column formatting throughout the file
        - BUT a pre-existing csv file will only have the `Video Number,Video Title,Video URL,Watched,Watch again later,Notes` columns
        - so updating a pre-existing `csv` file will result in newly extracted videos having the `Video Number,Video Title,Video Duration,Video URL,Watched,Watch again later,Notes` columns while the already extracted videos will only have the `Video Number,Video Title,Video URL,Watched,Watch again later,Notes` columns (no `Video Duration` column)
        - therefore, updating a pre-existing csv file will result in the newly extracted videos having 7 columns, while pre-existing videos will have only 6 columns
    - if you want to continue using your pre-existing csv file and do NOT WANT TO INCLUDE the video duration for previously extracted videos:
      - if you have NOT yet updated the pre-existing csv file:
        - APPROACH 1: use a csv file editor such as Excel, Google Sheets, Numbers, IDE extension, etc.
          - open the csv file
          - insert the `Video Duration` column between the `Video Title` and `Video URL` columns
          - save the file
            - the csv editor should automatically format the existing rows to include the `Video Duration` column
            - therefore, all rows should now have an empty cell for the `Video Duration` column
        - APPROACH 2: use a simple text editor/IDE
          - open the csv file
          - insert the `Video Duration` column between the `Video Title` and `Video URL` columns
          - text editors will NOT automatically format the existing rows to include the `Video Duration` column
            - so you will need to manually format the existing rows to include the `Video Duration` column
            - the simplest way to do this would be to use a `Find and Replace` operation:
              - Find all occurrences of: `,https://`
              - Replace with: `,,https://`
                - this assumes the only urls in the csv file are in the `Video URL` column!
                  - if you have manually added/modified parts of the file and this is no longer true, you will have to modify this approach slightly to meet your needs
      - if you have ALREADY updated the pre-existing csv file:
        - you will not be able to use APPROACH 1 from above
        - you will need to use APPROACH 2 with slight modifications:
          - Find all occurrences of (with regular expression mode enabled): `([^:][^\d]{2}),https://`
          - Replace with: `$1,,https://` (depending on your editor, you may need to substitute `$1` with `\1` or something else)
            - looks for `,https://` where it is NOT preceded with: `\d\d`
              - since the most recently extracted videos will have the video duration but the already existing videos will not have the video duration
              - so this only adds a comma for previously extracted videos without the video duration
            - as with the unmodified APPROACH 2, this also assumes the only urls in the csv file are in the `Video URL` column!
              - if you have manually added/modified parts of the file and this is no longer true, you will also have to modify this approach slightly to meet your needs
          - if the file is a `chronological_videos_list` file (as opposed to a `reverse_chronological_videos_list` file):
            - you will ALSO need to insert the `Video Duration` column between the `Video Title` and `Video URL` columns in the csv header
              - since `chronological_videos_list` files use the csv header from the pre-existing csv file
                - NOTE the program updates the `reverse_chronological_videos_list` csv header every time the program looks for new videos when rerun on a previously scraped channel
                - but usually this csv header update is not noticeable since the header does not change
                - the csv header update is noticeable this time, however, since there is a new column (Video Duration)
                - for `chronological_videos_list` files, however, the program never updates the csv header
    - if you want to continue using your pre-existing csv file and WANT TO INCLUDE the video duration for previously extracted videos:
      - rerun the program for the channel (in a different directory)
      - copy over any notes you took in the pre-existing file to the new file with the video duration information
    - if you do NOT want/care about using the pre-existing csv file
      - just delete the pre-existing csv file and rerun the program on the channel again (or run the program on the same channel from a different directory)
        - NOTE that if the channel deleted a video OR unlisted a video between
          - the time the video information was originally scraped
          - and the time you rerun the program after installing release `0.6.0+`
        - the deleted/unlisted video(s) will not show up (no workaround for this - this is how YouTube displays videos)
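As an alternative to the manual find-and-replace steps, the 6-column rows can also be padded with a short script. The sketch below is not part of the program and is not the method these notes prescribe. The helper `add_duration_column` is hypothetical, and it assumes every old row has exactly the 6 columns listed above, so it will not work for files with manually added columns.

```python
import csv
import io

OLD_HEADER = ['Video Number', 'Video Title', 'Video URL', 'Watched', 'Watch again later', 'Notes']


def add_duration_column(csv_text):
    """Hypothetical helper: pad 6-column rows so every row has 7 columns."""
    output = io.StringIO()
    writer = csv.writer(output, lineterminator='\n')
    for row in csv.reader(io.StringIO(csv_text)):
        if row == OLD_HEADER:
            row.insert(2, 'Video Duration')  # header cell for the new column
        elif len(row) == 6:
            row.insert(2, '')                # empty duration cell for an old row
        writer.writerow(row)                 # 7-column rows pass through unchanged
    return output.getvalue()


old = (
    'Video Number,Video Title,Video URL,Watched,Watch again later,Notes\n'
    '2,Newer video,3:21,https://www.youtube.com/watch?v=lmnopqrstuv,,,\n'
    '1,Older video,https://www.youtube.com/watch?v=abcdefghijk,,,\n'
)
updated = add_duration_column(old)
```

Because the decision is based on the column count rather than on the text before `,https://`, the same function covers both the "not yet updated" and the "already updated" cases, and running it twice leaves the file unchanged.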
- NON-BREAKING CHANGES
  - `txt` and `md` files now also include the video duration information
    - this is simply an extra line in the output file, and will not cause any rendering issues since `txt` and `md` files do not depend on consistent formatting the way `csv` files do
  - `txt` and `md` files now use slightly different formatting such as
    - fewer newlines
    - `md` files using `h3` headings for video information instead of bullet points (the bullet points were also improperly formatted previously, but since they are no longer used, this is not an issue)
  - NOTE that if you want these files to contain the video duration information, you will still need to rerun the program on the channel from scratch (either in a different directory, or after deleting the pre-existing files in the current directory)
- NEW FEATURES
0.5.9: Add built-in multi-threading
- compare changes to previous version
- creates new file I/O threads if writing to more than 1 file
  - see commit 58c5fab for details
- supports scraping multiple channels from a txt file containing urls
  - see the "Scraping multiple channels from a file simultaneously with multi-threading" section in the python README for usage details
  - see `__init__.py` file for code changes
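The general idea of scraping several channels concurrently from a list of urls can be sketched with the standard library. This is only an illustration under stated assumptions: `scrape_channel` is a placeholder for the real per-channel work, `scrape_all` is a hypothetical helper, and `ThreadPoolExecutor` is used here as a stand-in since these notes do not say which threading API the program uses.

```python
from concurrent.futures import ThreadPoolExecutor


def scrape_channel(url):
    """Placeholder for the real per-channel work; here it only echoes the url."""
    return f'scraped {url}'


def scrape_all(urls, max_workers=4):
    """Hypothetical helper: run scrape_channel for every url on a thread pool."""
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # executor.map returns results in the same order as the input urls.
        return list(executor.map(scrape_channel, urls))


# Lines as they might appear in a txt file of channel urls (blank lines skipped).
lines = [
    'https://www.youtube.com/user/schafer5\n',
    '\n',
    'https://www.youtube.com/c/KhanAcademy\n',
]
urls = [line.strip() for line in lines if line.strip()]
results = scrape_all(urls)
```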