How To Bypass Bot Detection And Scrape A Website Using Python?
Original Article Published At YourQuorum
Today we will look at a great Python module. Scraping is fun once you try it. Scraping and crawling are common terms, but there is a slight difference between them. Web crawling is essentially what Google, Facebook, and so on do: it searches for any information it can find. Scraping, on the other hand, is aimed at specific websites, for specific data, for example product details, prices, and so on.
Check Whether the Development Environment Is Ready
Before moving ahead we need to check whether Python is available. To do so, open a terminal or command line and run the command below:
python --version
Output: Python 2.7.16
Or,
python3 --version
Output: Python 3.8.0
If everything looks like the above, you are good to go. Your Python version may differ from mine; don't worry about it.
Set Up a Virtual Environment
We need to create a virtual environment to avoid version conflicts between Python modules, dependencies, and libraries. This guarantees isolation, so each project's dependencies and library versions can be maintained easily.
Open a terminal or command line, then create a project:
macOS Users
pip install virtualenv
python3 -m virtualenv venv
source venv/bin/activate
Windows Users
pip install virtualenv
virtualenv venv
venv\Scripts\activate
We can see that a venv folder has been created. Congratulations, we have successfully created a virtual environment.
Install the Required Libraries or Modules
Open a terminal or command line, then run the commands below:
pip install beautifulsoup4
pip install cfscrape
Learn the Fundamentals of How Scraping Works
Create an app.py file with the following contents:
import cfscrape
from bs4 import BeautifulSoup

def basic():
    # string html code sample
    html_text = '''
    <div>
        <h1 class="product-name">Product Name 1</h1>
        <h1 custom-attr="price" class="product-price">100</h1>
        <p class="product description">This is basic description 1</p>
    </div>
    <div>
        <h1 class="product-name">Product Name 2</h1>
        <h1 custom-attr="price" class="product-price">200</h1>
        <p class="product description">This is basic description 2</p>
    </div>
    '''
    parsed_html = BeautifulSoup(html_text, 'html.parser')  # String to HTML
    # Note: BeautifulSoup does not fetch URLs itself; download the page first, e.g.:
    # parsed_html = BeautifulSoup(requests.get("https://www.google.com").text, 'html.parser')  # URL to HTML
    # parsed_html = BeautifulSoup(open("from_file.html"), 'html.parser')  # File to HTML
    print(parsed_html.select(".product-name")[0].text)
    print(parsed_html.select(".product-name")[1].text)
    print(parsed_html.select(".product.description")[0].text)
    print(parsed_html.find_all("h1", {"custom-attr": "price"})[0].text)
    print(parsed_html.find("h1", {"custom-attr": "price"}).text)

if __name__ == '__main__':
    basic()
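Printing individual elements is fine for exploring, but in practice you usually want structured records. Here is a minimal sketch that turns the same sample markup into a list of dicts by iterating over each product div (the helper name extract_products is my own, not from the original article):

```python
from bs4 import BeautifulSoup

html_text = '''
<div>
    <h1 class="product-name">Product Name 1</h1>
    <h1 custom-attr="price" class="product-price">100</h1>
    <p class="product description">This is basic description 1</p>
</div>
<div>
    <h1 class="product-name">Product Name 2</h1>
    <h1 custom-attr="price" class="product-price">200</h1>
    <p class="product description">This is basic description 2</p>
</div>
'''

def extract_products(html):
    """Collect one dict per product <div>, instead of printing fields one by one."""
    parsed = BeautifulSoup(html, 'html.parser')
    products = []
    for div in parsed.find_all('div'):
        products.append({
            'name': div.select_one('.product-name').text,
            'price': int(div.select_one('.product-price').text),
            'description': div.select_one('.product.description').text,
        })
    return products

print(extract_products(html_text))
```

Scoping the selectors to each div (via select_one on the div, not the whole document) keeps the name, price, and description of one product together.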
Now for the Anti-Bot Scraping
Add the following to the app.py file:
import cfscrape
from bs4 import BeautifulSoup

def anti_bot_scraping():
    target_url = "https://www.google.com"  # replace url with anti-bot protected website
    scraper = cfscrape.create_scraper()  # works like a requests session, but solves Cloudflare challenges
    html_text = scraper.get(target_url).text
    parsed_html = BeautifulSoup(html_text, 'html.parser')
    print(parsed_html)

if __name__ == '__main__':
    anti_bot_scraping()
Now, open a terminal and run python app.py to execute the file.
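Printing the entire parsed document is noisy; once the protected page is fetched you will usually extract just the pieces you need. A minimal sketch (extract_summary is a hypothetical helper of mine; the sample string stands in for a real fetched page, which you would obtain via scraper.get(target_url).text as above):

```python
from bs4 import BeautifulSoup

def extract_summary(html_text):
    """Return the page <title> and all outgoing link hrefs from raw HTML."""
    parsed = BeautifulSoup(html_text, 'html.parser')
    title = parsed.title.text if parsed.title else None
    links = [a['href'] for a in parsed.find_all('a', href=True)]
    return title, links

# Stand-in for html_text fetched by the cfscrape scraper:
sample = '<title>Example</title><a href="/about">About</a>'
print(extract_summary(sample))
```

Keeping the parsing in a separate function like this also makes it easy to test your extraction logic offline, without hitting the protected site on every run.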