Avoid being tracked and located while web scraping #118611
Hello community, I have a project in mind. I want to scrape several pages from retail companies' websites. Since I will be extracting and storing the information for all of their products, I do not want to leave any trace or evidence such as my IP address, the address of my machine, etc. I want to do it in Python using Selenium with Google's ChromeDriver. The goal is that every two days the code collects the information for all the products from the web pages of the main suppliers of a retail store. I want to extract all the information and avoid being tracked, perhaps by switching networks or using a dynamic IP. One of the options mentioned is to run it in a virtual machine, but that seems messy to me. I would like recommendations and suggestions on how to carry out this project without being tracked: what I could apply, which environment to use, and how to set it up. I appreciate your comments and input.
Replies: 4 comments 1 reply
Thanks for posting in the GitHub Community, @francis-fm! We're happy you're here. You are more likely to get a useful response if you post your question in the applicable category. The Discussions category is solely for conversations around the GitHub product Discussions, so this question belongs in the Programming Help category. I've gone ahead and moved it for you. Good luck!
I personally use Selenium for interactions [with an undetected driver], and then BeautifulSoup for parsing elements. I hope this answers your question; if so, please mark it as answered.
Good luck.
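To illustrate the "parse after the browser has loaded the page" step, here is a minimal sketch. It assumes the HTML has already been obtained (for example from Selenium's `driver.page_source`) and uses the standard library's `html.parser` as a stand-in for BeautifulSoup so that it runs without extra installs. The `product-name` class, the `extract_product_names` helper and the sample markup are hypothetical and not taken from any real site.

```python
from html.parser import HTMLParser


class ProductNameParser(HTMLParser):
    """Collects the text of elements whose class list contains 'product-name'.

    Simplification: capturing stops at the first closing tag, so this only
    handles elements without nested child tags.
    """

    def __init__(self):
        super().__init__()
        self._capturing = False
        self.names = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if "product-name" in classes:
            self._capturing = True
            self.names.append("")

    def handle_data(self, data):
        if self._capturing:
            self.names[-1] += data

    def handle_endtag(self, tag):
        self._capturing = False


def extract_product_names(page_source):
    """Return the stripped text of every 'product-name' element."""
    parser = ProductNameParser()
    parser.feed(page_source)
    return [name.strip() for name in parser.names]


# Hypothetical markup standing in for a page loaded by the browser.
sample = (
    '<ul><li class="product-name">Widget</li>'
    '<li class="product-name"> Gadget </li></ul>'
)
print(extract_product_names(sample))
```

With BeautifulSoup installed, the same extraction would usually be a one-line CSS selector query; the split between "browser loads the page" and "parser reads the HTML" stays the same.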
Hiding your identity on the web is a huge topic, and the answer also depends on which website you are talking about.
You could go all out and use Tor or a trustworthy VPN, Qubes OS or Tails OS, etc. But you then also need to be aware of techniques such as browser fingerprinting, which turn your request headers, settings and browser environment (available fonts, rendering, etc.) into a pseudonym (e.g. by hashing them together) in order to re-identify you.
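To make the fingerprinting point concrete, here is a toy sketch of how a handful of browser attributes could be hashed into a stable pseudonym. The attribute names, the values and the `fingerprint` helper are all made up for illustration; real fingerprinting scripts collect far more signals and are not necessarily built this way.

```python
import hashlib
import json


def fingerprint(attributes):
    """Hash a dict of browser attributes into a short, stable pseudonym."""
    # Serialize with sorted keys so the same attributes always yield the same hash.
    canonical = json.dumps(attributes, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]


# Hypothetical attributes a site might read from one browser.
visit_1 = {
    "user_agent": "ExampleBrowser/1.0",
    "language": "en-US",
    "screen": "1920x1080",
    "fonts": ["Arial", "DejaVu Sans"],
}

# Same browser on a later visit, e.g. after switching to another IP address.
# The IP is not part of the fingerprint, so nothing here changes.
visit_2 = {
    "fonts": ["Arial", "DejaVu Sans"],
    "screen": "1920x1080",
    "language": "en-US",
    "user_agent": "ExampleBrowser/1.0",
}

# A different browser environment with another font installed.
other = dict(visit_1, fonts=["Arial", "DejaVu Sans", "Comic Sans MS"])

print(fingerprint(visit_1) == fingerprint(visit_2))  # same pseudonym
print(fingerprint(visit_1) == fingerprint(other))    # different pseudonym
```

The point of the sketch is that the pseudonym depends only on the browser environment, so changing the IP address alone does not change it.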
And by the way, virtual machines alone would not change anything in most setups. The IP the VM has is a local one. It is translated by NAT (network address translation, typically done in your home router). The IP that the webserver can s…