# 7. Web Scraping

Web scraping is the practice of gathering data through any means other than a program interacting with an API (or, obviously, through a human using a web browser). This is most commonly accomplished by writing an automated program that queries a web server, requests data (usually in the form of the HTML and other files that comprise web pages), and then parses that data to extract needed information.
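
The "parse" step can be sketched with Python's standard library alone. The snippet below is a minimal illustration, assuming the HTML has already been downloaded; the sample page and the `LinkCollector` class are made up for this example.

```python
from html.parser import HTMLParser

# Hypothetical page content; in practice this string would come from a web server.
SAMPLE_HTML = """
<html>
  <body>
    <h1>Election results</h1>
    <a href="/EG2021/">General elections 2021</a>
    <a href="/about">About</a>
  </body>
</html>
"""

class LinkCollector(HTMLParser):
    """Collect (href, text) pairs for every <a> tag that has an href."""

    def __init__(self):
        super().__init__()
        self.links = []            # finished (href, text) pairs
        self._current_href = None  # href of the <a> tag being read, if any
        self._current_text = []    # text fragments seen inside that tag

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._current_href = dict(attrs).get("href")
            self._current_text = []

    def handle_data(self, data):
        if self._current_href is not None:
            self._current_text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._current_href is not None:
            text = "".join(self._current_text).strip()
            self.links.append((self._current_href, text))
            self._current_href = None

collector = LinkCollector()
collector.feed(SAMPLE_HTML)
print(collector.links)
```

Selenium becomes useful when the HTML cannot simply be downloaded as a string, for example when the page only shows its data after clicks or form input.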

## 7.0 Selenium
Selenium automates browsers. That's it! <br>
Selenium is a browser-automation tool that can be driven from Python to carry out many tasks. One of them is web scraping: extracting useful data and information that may be otherwise unavailable. <br>
**For this course, we use Chrome.**

## 7.1 Installing Libraries
We need to install these two libraries:

In [1]:
#!pip install selenium
#!pip install webdriver-manager

## 7.2 Calling Libraries

In [1]:
# webdriver lets us control the browser
from selenium import webdriver

# webdriver_manager downloads a driver that matches the installed browser version
# Here we use the Chrome driver manager
from webdriver_manager.chrome import ChromeDriverManager
import re
import time 
from selenium.webdriver.common.by import By

## 7.3 Launch/Set the Driver
This code opens a Chrome window controlled by ChromeDriver. We are going to use it to navigate the web.

In [4]:
pwd

'C:\\Users\\jorge\\Documents\\Github\\jb-python-course\\python_notebooks\\intermediate'

In [None]:
# Case 1 - Use a chromedriver.exe that was downloaded manually into the working directory
driver = webdriver.Chrome('chromedriver.exe')
driver.maximize_window()

url = 'https://resultadoshistorico.onpe.gob.pe/EG2021/'
driver.get( url )

In [3]:
# Case 2 - ChromeDriverManager
driver = webdriver.Chrome( ChromeDriverManager().install() )
driver.maximize_window()



Current google-chrome version is 109.0.5414
Get LATEST chromedriver version for 109.0.5414 google-chrome
Trying to download new driver from https://chromedriver.storage.googleapis.com/109.0.5414.74/chromedriver_win32.zip
Driver has been saved in cache [C:\Users\Anzony\.wdm\drivers\chromedriver\win32\109.0.5414.74]


In [4]:
url = 'https://resultadoshistorico.onpe.gob.pe/EG2021/'
driver.get( url )
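
Recent Selenium 4 releases deprecate, and later remove, passing the driver path as a positional argument as in the two cases above; the path is wrapped in a `Service` object instead. A minimal sketch, assuming Selenium 4 and `webdriver-manager` are installed (the helper name `launch_chrome` is introduced here and is not part of either library):

```python
def launch_chrome(url, driver_path=None):
    """Open Chrome through a Service object and navigate to `url`.

    `driver_path` is an optional path to a chromedriver binary; when it is
    omitted, webdriver-manager downloads a matching driver.
    """
    # Imports live inside the function so that it can be defined even on a
    # machine where Selenium is not installed yet.
    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service
    from webdriver_manager.chrome import ChromeDriverManager

    if driver_path is None:
        driver_path = ChromeDriverManager().install()

    driver = webdriver.Chrome(service=Service(executable_path=driver_path))
    driver.maximize_window()
    driver.get(url)
    return driver

# Usage (opens a real browser window):
# driver = launch_chrome('https://resultadoshistorico.onpe.gob.pe/EG2021/')
```

Written this way, the call should no longer print a deprecation warning under the cell.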

## Chrome is being controlled by automated test software

![Chrome is controlled by automated software](Images/chrome_automated.png)

In [None]:
driver.save_screenshot('Images/resultados_presidenciales.png')

In [130]:
driver.save_screenshot('Images/resultados_presidenciales_peru.png')

True

In [119]:
# Access to the title
print('Title: ', driver.title)

Title:  Presentación de Resultados Elecciones Generales y Parlamento Andino 2021


In [25]:
# Access the current URL
print('Current Page URL: ', driver.current_url)

Current Page URL:  https://resultadoshistorico.onpe.gob.pe/EG2021/


In [26]:
# Make screenshot of the webpage
driver.save_screenshot('Images/resultados_presidenciales.png')

True

In [123]:
driver.current_url

'https://resultadoshistorico.onpe.gob.pe/EG2021/'

In [126]:
re.search(r'historico', driver.current_url)

<re.Match object; span=(18, 27), match='historico'>

In [127]:
if re.search(r'resultadoshistorico', driver.current_url):
    driver.save_screenshot('Images/resultados_presidenciales.png') #save screenshot with provided name
    print('Resultados Presidenciales saved!')

Resultados Presidenciales saved!
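
`re.search` matches the pattern anywhere in the URL, including the path or the query string. When the intent is to confirm which site the driver is on, comparing the host name is a stricter check. A small sketch using only the standard library (the helper `is_on_host` is a name introduced here):

```python
from urllib.parse import urlparse

def is_on_host(current_url, expected_host):
    """Return True when the URL's host name equals `expected_host`."""
    return urlparse(current_url).hostname == expected_host

onpe_url = 'https://resultadoshistorico.onpe.gob.pe/EG2021/'
other_url = 'https://example.com/?q=resultadoshistorico'

print(is_on_host(onpe_url, 'resultadoshistorico.onpe.gob.pe'))
print(is_on_host(other_url, 'resultadoshistorico.onpe.gob.pe'))
```

With a live driver, `driver.current_url` would take the place of the hard-coded strings.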


In [129]:
#get cookie information
cookies = driver.get_cookies()
print('Cookies obtained from resultados_presidenciales')
print(cookies)

Cookies obtained from resultados_presidenciales
[{'domain': 'resultadoshistorico.onpe.gob.pe', 'httpOnly': True, 'name': 'web_server_iron', 'path': '/', 'sameSite': 'Lax', 'secure': True, 'value': '!IaegKT+rbVwj1oBebdNjPhDZ64Q/2TT2B0Cbz/N3Iu4fvIP2IeTWOf76tiGuTaq6QPNtI9g/4MS9MiQ='}, {'domain': '.onpe.gob.pe', 'expiry': 1676230477, 'httpOnly': False, 'name': '_gid', 'path': '/', 'sameSite': 'Lax', 'secure': False, 'value': 'GA1.3.937023796.1676143935'}, {'domain': '.onpe.gob.pe', 'expiry': 1710704077, 'httpOnly': False, 'name': '_ga', 'path': '/', 'sameSite': 'Lax', 'secure': False, 'value': 'GA1.3.1050324605.1676143935'}]
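
`get_cookies()` returns a list of dictionaries, one per cookie. The sketch below turns such a list into a name-to-value lookup and converts the `expiry` timestamps into dates. The sample list contains two of the cookies printed above, copied as plain dictionaries.

```python
from datetime import datetime, timezone

# Two of the cookies printed above, copied as plain dictionaries.
cookies = [
    {'domain': '.onpe.gob.pe', 'expiry': 1676230477, 'httpOnly': False,
     'name': '_gid', 'path': '/', 'sameSite': 'Lax', 'secure': False,
     'value': 'GA1.3.937023796.1676143935'},
    {'domain': '.onpe.gob.pe', 'expiry': 1710704077, 'httpOnly': False,
     'name': '_ga', 'path': '/', 'sameSite': 'Lax', 'secure': False,
     'value': 'GA1.3.1050324605.1676143935'},
]

# Look up a cookie value by its name
cookie_values = {c['name']: c['value'] for c in cookies}
print(cookie_values['_ga'])

# 'expiry' is a Unix timestamp in seconds; cookies without that key
# (such as the first cookie in the output above) are skipped.
expiry_dates = {
    c['name']: datetime.fromtimestamp(c['expiry'], tz=timezone.utc)
    for c in cookies if 'expiry' in c
}
print(expiry_dates['_gid'].date())
```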


In [29]:
# Get page source
type(driver.page_source)
driver.page_source



In [32]:
# Refresh the page - 
driver.refresh() #reload or refresh the browser

In [131]:
driver = webdriver.Chrome( ChromeDriverManager().install() )
# Maximize window
driver.maximize_window()
driver.get('https://www.legacy.com/obituaries/legacy/obituary-search.aspx?isnew=1&affiliateId=0&stateid=17')



In [None]:
# Example 1
driver = webdriver.Chrome( ChromeDriverManager().install() )
driver.maximize_window()

url_1 = "https://www.legacy.com/obituaries/legacy/obituary-search.aspx?isnew=1&affiliateId=0&stateid=17"
driver.get( url_1 )
time.sleep(3)

url_2 = "https://www.google.com/"
driver.get( url_2 )

time.sleep(3)
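
The fixed `time.sleep(3)` pauses above either waste time or are too short when a page loads slowly. A common alternative is to poll until the element of interest appears. The sketch below is a hand-written polling helper, not part of Selenium (Selenium ships its own `WebDriverWait` class for this purpose); `wait_for_elements` and its parameters are names chosen for this example.

```python
import time

def wait_for_elements(parent, by, value, timeout=10.0, poll=0.5):
    """Poll `parent.find_elements(by, value)` until it returns a non-empty list.

    `parent` is anything with a Selenium-style `find_elements` method
    (a driver or a WebElement). Raises TimeoutError after `timeout` seconds.
    """
    deadline = time.monotonic() + timeout
    while True:
        found = parent.find_elements(by, value)
        if found:
            return found
        if time.monotonic() >= deadline:
            raise TimeoutError(f"no element matched {by}={value!r} within {timeout}s")
        time.sleep(poll)

# Usage with a real driver:
# from selenium.webdriver.common.by import By
# links = wait_for_elements(driver, By.TAG_NAME, "a", timeout=15)
```

The helper returns as soon as something matches, so a fast page costs almost no waiting time and a slow page gets up to `timeout` seconds.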

In [140]:
# # Example 2
driver = webdriver.Chrome( ChromeDriverManager().install() )
driver.maximize_window()

url_1 = "https://resultadoshistorico.onpe.gob.pe/EG2021/"
driver.get( url_1 )

time.sleep(3)

url_2 = "https://www.google.com/"
driver.get( url_2 )

time.sleep(3)
driver.back()



In [139]:
driver.close()

In [141]:
driver.quit()
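
`driver.close()` closes the current window, while `driver.quit()` ends the whole session and shuts down the driver process. To make sure the browser is released even if a step fails, the work can be wrapped in `try`/`finally`. A minimal sketch; `scrape_title` and its `driver_factory` argument are names introduced here:

```python
def scrape_title(driver_factory, url):
    """Open `url`, return the page title, and always quit the driver.

    `driver_factory` is any callable that returns a new driver object.
    """
    driver = driver_factory()
    try:
        driver.get(url)
        return driver.title
    finally:
        # quit() runs even when get() or title raises an exception
        driver.quit()

# Usage with Selenium (opens a real browser; assumes a Chrome driver can be found):
# from selenium import webdriver
# title = scrape_title(webdriver.Chrome, 'https://resultadoshistorico.onpe.gob.pe/EG2021/')
```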

![Quit and Close](Images/quite_close.png)

In [40]:
type(driver)

selenium.webdriver.chrome.webdriver.WebDriver

`driver` is a `selenium.webdriver.chrome.webdriver.WebDriver` object. Its attributes and methods let us navigate the web.

Now, you can see in the driver that we are in [this link](https://www.convocatoriascas.com/).

## 7.4 Best Practices before working

1. Maximize the browser

In [142]:
driver = webdriver.Chrome( ChromeDriverManager().install() )

url = 'https://www.kaspersky.com/resource-center/definitions/cookies'
driver.get( url )

driver.maximize_window()



2. Set the Browser Zoom Level to 100 percent

In [147]:
driver.execute_script("document.body.style.zoom='100%'")

### 7.4.1. HTML
HTML stands for HyperText Markup Language. You can deduce that it’s a language for creating web pages. It’s not a programming language like Python or Java, but it’s a markup language. It describes the elements of a page through tags characterized by angle brackets.

1. The document always begins and ends using `<html>` and `</html>`.
2. `<body></body>` constitutes the visible part of HTML document.
3. `<h1>` to `<h6>` tags define the headings.

#### 7.4.1.1. HTML Headings
HTML headings are defined with the `<h1>` to `<h6>` tags.
`<h1>` defines the most important heading. `<h6>` defines the least important heading.

We can use text cells since markdown reads html tags.

<h1>This is heading 1</h1>
<h2>This is heading 2</h2>
<h3>This is heading 3</h3>

#### 7.4.1.2. HTML Paragraphs
HTML paragraphs are defined with the `<p>` tag.
The `<br>` tag inserts a line break, similar to `"\n"` in a Python string.

<html>
<br>
<p>My first paragraph.</p> <br>
<p>This is another paragraph for this text cell.</p>
</html>

#### 7.4.1.3. HTML Links
HTML links are defined with the `<a>` tag:

<a href="http://bayes.cs.ucla.edu/jp_home.html">This is a link for Judea Pearl Website</a>

#### 7.4.1.3. Unordered HTML List
An unordered list starts with the `<ul>` tag. Each list item starts with the `<li>` tag.

<ul>
  <li>Coffee</li>
  <li>Tea</li>
  <li>Milk</li>
</ul>

#### 7.4.1.4. Ordered HTML List
An ordered list starts with the `<ol>` tag. Each list item starts with the `<li>` tag.

<ol>
  <li>Coffee</li>
  <li>Tea</li>
  <li>Milk</li>
</ol>

#### 7.4.1.4. HTML Tables

A table in HTML consists of table cells arranged in rows and columns. Each table row starts with a `<tr>` tag and ends with a `</tr>` tag. Inside a row, each data cell is enclosed in `<td>` and `</td>` tags, and each header cell in `<th>` and `</th>` tags.

<table>
  <tr>
    <th>Manager</th>
    <th>Club</th>
    <th>Nationality</th>
  </tr>
  <tr>
    <td>Mikel Arteta</td>
    <td>Arsenal</td>
    <td>Spain</td>
  </tr>
  <tr>
    <td>Thomas Tuchel</td>
    <td>Chelsea</td>
    <td>Germany</td>
  </tr>
</table>
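
To see how the `<tr>`, `<th>` and `<td>` tags map onto rows and cells, the table above can be read with the standard-library HTML parser. This is only an illustrative sketch (the class name `TableReader` is ours); later in this chapter the tables are read with `pandas.read_html`.

```python
from html.parser import HTMLParser

TABLE_HTML = """
<table>
  <tr><th>Manager</th><th>Club</th><th>Nationality</th></tr>
  <tr><td>Mikel Arteta</td><td>Arsenal</td><td>Spain</td></tr>
  <tr><td>Thomas Tuchel</td><td>Chelsea</td><td>Germany</td></tr>
</table>
"""

class TableReader(HTMLParser):
    """Turn a simple HTML table into a list of rows (lists of cell text)."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])          # a new row starts
        elif tag in ("th", "td"):
            self._in_cell = True
            self.rows[-1].append("")      # a new, still empty, cell

    def handle_data(self, data):
        if self._in_cell:
            self.rows[-1][-1] += data     # text belongs to the open cell

    def handle_endtag(self, tag):
        if tag in ("th", "td"):
            self._in_cell = False

reader = TableReader()
reader.feed(TABLE_HTML)
header, *body = reader.rows
print(header)
print(body)
```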

#### 7.4.1.5. HTML Iframes

An HTML iframe is used to display a web page within a web page.

#### 7.4.1.6. HTML Tags - Key

|Tag|Description|
|---|---|
|`<h1>` to `<h6>`|	Defines HTML headings|
|`<ul>`|	Defines an unordered list|
|`<ol>`|	Defines an ordered list|
|`<p>`|	Defines a paragraph|
|`<a>`|	It is termed as anchor tag and it creates a hyperlink or link.|
|`<div>`|	It defines a division or section within HTML document.|
|`<strong>`|	It is used to define important text.|
|`<table>`|	It is used to present data in tabular form or to create a table within HTML document.|
|`<td>`|	It is used to define cells of an HTML table which contains table data|
|`<iframe>`|	Defines an inline frame|

### 7.4.2. Identifying elements in a web page

To identify the elements of a web page, we need to inspect it. In the Chrome window opened by the driver, press `Ctrl` + `Shift` + `I` to open the developer tools.

#### One Element
|Method|Description|
|---|---|
|`find_element( By.ID, ... )`| Use the `id` attribute.|
|`find_element( By.NAME, ... )`| Use the `name` attribute.|
|`find_element( By.XPATH, ... )`| Use an XPath expression.|
|`find_element( By.TAG_NAME, ... )`| Use the HTML tag.|
|`find_element( By.CLASS_NAME, ... )`| Use the class name.|
|`find_element( By.CSS_SELECTOR, ... )`| Use a CSS selector.|

#### Multiple elements
|Method|Description|
|---|---|
|`find_elements( By.ID, ... )`| Use the `id` attribute.|
|`find_elements( By.NAME, ... )`| Use the `name` attribute.|
|`find_elements( By.XPATH, ... )`| Use an XPath expression.|
|`find_elements( By.TAG_NAME, ... )`| Use the HTML tag.|
|`find_elements( By.CLASS_NAME, ... )`| Use the class name.|
|`find_elements( By.CSS_SELECTOR, ... )`| Use a CSS selector.|
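
One practical difference between the two families: `find_element` raises `NoSuchElementException` when nothing matches, whereas `find_elements` returns an empty list. The helper below (a hypothetical `first_or_none`, not a Selenium function) uses that behavior to make an optional lookup explicit.

```python
def first_or_none(parent, by, value):
    """Return the first matching element, or None when nothing matches.

    `parent` is a driver or a WebElement; `by` is a locator strategy such
    as By.ID, and `value` is the id, name, XPath, ... to look for.
    """
    matches = parent.find_elements(by, value)
    return matches[0] if matches else None

# Usage with a real driver:
# from selenium.webdriver.common.by import By
# ambito = first_or_none(driver, By.ID, 'select_ambito')
# if ambito is None:
#     print('select_ambito is not on this page')
```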

### 7.4.3. XPath

XPath (XML Path Language) is a query language for selecting nodes in an XML or HTML document. In Selenium, an XPath expression describes the route through the page's HTML structure to the element we want to locate.

The basic format of an XPath expression is shown in the screenshot below.

![](../../img/x_path.png)

**DO NOT COMPLICATE!**
Finding the XPath of an element:
1. Go to the element
2. Right click
3. Inspect - You may have to do it twice.
4. Go to the selected line
5. Right click
6. Copy
7. Copy full XPath
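
"Copy full XPath" produces a position-based path from the root of the page, which stops working as soon as the layout changes. A path anchored on an attribute such as `id` is usually more robust. The difference can be tried without a browser on a small, well-formed snippet using the standard library's `xml.etree.ElementTree`, which supports a limited subset of XPath. The markup below is a simplified stand-in, not the real ONPE page.

```python
import xml.etree.ElementTree as ET

# Simplified, hypothetical markup; the real page is much more deeply nested.
PAGE = """
<html>
  <body>
    <div>
      <select id="select_ambito" name="cod_ambito">
        <option>TODOS</option>
        <option>PERÚ</option>
        <option>EXTRANJERO</option>
      </select>
    </div>
  </body>
</html>
"""

root = ET.fromstring(PAGE)  # root is the <html> element

# Position-based path, in the spirit of "Copy full XPath":
# every step and index must match the page layout exactly.
by_position = root.find("./body/div/select/option[2]")

# Attribute-based path: finds the <select> wherever it sits in the tree.
by_attribute = root.findall(".//select[@id='select_ambito']/option")

print(by_position.text)
print([option.text for option in by_attribute])
```

In Selenium the attribute-based form would be written as `'//*[@id="select_ambito"]/option'` and passed to `find_elements( By.XPATH, ... )`.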

**Example**

We are going to locate the image link on the ONPE home page that leads to the general summary (`ResumenGeneral`) page and click it. Use `find_element( By.XPATH, ... )` and then `click()`.

In [148]:
driver = webdriver.Chrome( ChromeDriverManager().install() )
driver.maximize_window()

url_1 = "https://resultadoshistorico.onpe.gob.pe/EG2021/"
driver.get( url_1 )
#time.sleep(3)



In [5]:
resumen_general = driver.find_element( By.XPATH, '/html/body/onpe-root/onpe-home-onpe/div[1]/div/div/div/div[2]/div[1]/div/div/a/div[1]/img' )
resumen_general.click()

In [8]:
driver.find_element( By.ID, 'select_ambito').click()

In [9]:
driver.find_element( By.ID, 'select_ambito').click()

In [10]:
driver.find_element( By.NAME, 'cod_ambito')

<selenium.webdriver.remote.webelement.WebElement (session="b36296bed47a9fae33c70b9df7685cb4", element="cd8127e3-f832-46d5-94ca-cc4d3b5a2fc1")>

In [11]:
# Best practices
driver.find_element( By.ID, 'select_ambito')
driver.find_element( By.NAME, 'cod_ambito')
# driver.find_element( By.XPATH, '/html/body/onpe-root/onpe-layout-container/onpe-onpe-epres-re/div[1]/div[3]/div[1]/div[1]/div/div/div/select' )

<selenium.webdriver.remote.webelement.WebElement (session="b36296bed47a9fae33c70b9df7685cb4", element="cd8127e3-f832-46d5-94ca-cc4d3b5a2fc1")>

In [12]:
driver.find_element( By.CLASS_NAME, 'select_ubigeo')

<selenium.webdriver.remote.webelement.WebElement (session="b36296bed47a9fae33c70b9df7685cb4", element="cd8127e3-f832-46d5-94ca-cc4d3b5a2fc1")>

In [13]:
searchBox = driver.find_element( By.ID, 'select_ambito' )
# Equivalent locators for the same element:
# searchBox = driver.find_element( By.XPATH, '//*[@id="select_ambito"]' )
# searchBox = driver.find_element( By.CSS_SELECTOR, '#select_ambito' )

![Web Element](Images/Web_Elementpng.png)

In [85]:
searchBox.get_attribute('value')

'T'

In [14]:
driver = webdriver.Chrome( ChromeDriverManager().install() )

url = 'https://resultadoshistorico.onpe.gob.pe/EG2021/ResumenGeneral/10/T'
driver.get( url )

driver.maximize_window()



Current google-chrome version is 109.0.5414
Get LATEST chromedriver version for 109.0.5414 google-chrome
Driver [C:\Users\Anzony\.wdm\drivers\chromedriver\win32\109.0.5414.74\chromedriver.exe] found in cache


In [None]:
searchBox = driver.find_element( By.XPATH, '/html/body/onpe-root/onpe-layout-container/onpe-onpe-rgen-rsgr/div/div[2]/div[1]/div[1]/div/div/div[1]/select/option[2]')
searchBox.click()

In [78]:
searchBox = driver.find_element( By.XPATH, '/html/body/onpe-root/onpe-layout-container/onpe-onpe-rgen-rsgr/div/div[2]/div[1]/div[1]/div/div/div[1]/select/option[2]')
searchBox.text



'PERÚ'

In [79]:
searchBox = driver.find_element( By.ID, 'select_ambito')
searchBox



<selenium.webdriver.remote.webelement.WebElement (session="1a3d04055b8811cc78ba76e959f4d952", element="90ff300f-f2ec-49c1-a9be-97f64cfb6d4b")>

**Suggestion** <br>
We do not recommend starting with `By.TAG_NAME`: most web pages use deeply nested tags, so a tag alone rarely identifies a single element. However, it works well for finding elements inside another element that has already been located. Let's see an example.

## 7.5 Example using ONPE webpage

### [First Round](https://resultadoshistorico.onpe.gob.pe/EG2021/ResumenGeneral/10/T)

In [80]:
# pip install lxml
# pip install unidecode

In [94]:
import os
import re
import time

import numpy as np
import pandas as pd
import unidecode
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver import ActionChains
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import Select
from webdriver_manager.chrome import ChromeDriverManager

### Driver Path Address

In [302]:
driver = webdriver.Chrome( ChromeDriverManager().install() )
# Maximize window
driver.maximize_window()



Current google-chrome version is 96.0.4664
Get LATEST driver version for 96.0.4664
Driver [C:\Users\MSI-NB\.wdm\drivers\chromedriver\win32\96.0.4664.45\chromedriver.exe] found in cache


###  Extracting all tables

In [16]:
driver = webdriver.Chrome( ChromeDriverManager().install() )
# Maximize window
driver.maximize_window()

# go to the link
url_1 = "https://resultadoshistorico.onpe.gob.pe/EG2021/"
driver.get( url_1 )

resumen_general = driver.find_element( By.XPATH, '/html/body/onpe-root/onpe-home-onpe/div[1]/div/div/div/div[2]/div[1]/div/div/a/div[1]/img')
resumen_general.click()



Current google-chrome version is 109.0.5414
Get LATEST chromedriver version for 109.0.5414 google-chrome
Driver [C:\Users\Anzony\.wdm\drivers\chromedriver\win32\109.0.5414.74\chromedriver.exe] found in cache


In [17]:
presidential = driver.find_element( By.XPATH, '/html/body/onpe-root/onpe-layout-container/onpe-onpe-rgen-rsgr/div/div[2]/div[2]/ul/li[1]/a')
presidential.click()

In [18]:
opt_peru = driver.find_element( By.XPATH, '/html/body/onpe-root/onpe-layout-container/onpe-onpe-rgen-rsgr/div/div[2]/div[1]/div[1]/div/div/div/select/option[2]')
opt_peru.click()

###  Presidential results

In [329]:
# presidential = driver.find_element( By.XPATH, '/html/body/onpe-root/onpe-layout-container/onpe-menu/div/nav/div/div/div[2]/div/div[2]/a/span')
# presidential.click

In [330]:
# # presidential section
# presidential = driver.find_element( By.XPATH, "/html/body/onpe-root/onpe-layout-container/onpe-menu/div/nav/div/div/div[2]/div/div[2]/a" )
# presidential.click()

### Get all elements from all options

In [None]:
# scope = driver.find_element( By.XPATH, "/html/body/onpe-root/onpe-layout-container/onpe-onpe-epres-re/div[1]/div[3]/div[1]/div[1]/div/div/div/select" )
# scope.click()

In [None]:
scope_options = driver.find_element( By.XPATH, '/html/body/onpe-root/onpe-layout-container/onpe-onpe-epres-re/div[1]/div[3]/div[1]/div[1]/div/div/div/select')
scope_options.find_elements( By.TAG_NAME, "option")[2].text

In [163]:
region_options = driver.find_element( By.XPATH, '/html/body/onpe-root/onpe-layout-container/onpe-onpe-rgen-rsgr/div/div[2]/div[1]/div[1]/div/div/div[2]/select')
region_options

<selenium.webdriver.remote.webelement.WebElement (session="676168fb892c0c95adb790b2a48d1aa1", element="076ac479-aa2e-4f68-aaf0-fc67b52ef0dc")>

In [172]:
region_options

<selenium.webdriver.remote.webelement.WebElement (session="676168fb892c0c95adb790b2a48d1aa1", element="076ac479-aa2e-4f68-aaf0-fc67b52ef0dc")>

In [183]:
region_options.text

'--TODOS--\nAMAZONAS\nANCASH\nAPURIMAC\nAREQUIPA\nAYACUCHO\nCAJAMARCA\nCALLAO\nCUSCO\nHUANCAVELICA\nHUANUCO\nICA\nJUNIN\nLA LIBERTAD\nLAMBAYEQUE\nLIMA\nLORETO\nMADRE DE DIOS\nMOQUEGUA\nPASCO\nPIURA\nPUNO\nSAN MARTIN\nTACNA\nTUMBES\nUCAYALI'
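
Because the `text` of a `<select>` element lists its options one per line, the option names can also be recovered with a single `split`, instead of asking the browser for each option separately. A small sketch using the string shown above:

```python
# Text of the department <select>, copied from the output above
region_text = (
    '--TODOS--\nAMAZONAS\nANCASH\nAPURIMAC\nAREQUIPA\nAYACUCHO\nCAJAMARCA\n'
    'CALLAO\nCUSCO\nHUANCAVELICA\nHUANUCO\nICA\nJUNIN\nLA LIBERTAD\n'
    'LAMBAYEQUE\nLIMA\nLORETO\nMADRE DE DIOS\nMOQUEGUA\nPASCO\nPIURA\nPUNO\n'
    'SAN MARTIN\nTACNA\nTUMBES\nUCAYALI'
)

# One entry per option, in the order shown in the drop-down
region_names = region_text.split('\n')

# Drop the placeholder entry to keep only real departments
departments = [name for name in region_names if name != '--TODOS--']

print(len(region_names))
print(departments[:3])
```

With a live driver, `region_options.text` would take the place of the hard-coded string.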

In [168]:
from selenium.webdriver.common.by import By

In [191]:
region_options.find_elements(By.TAG_NAME,  "option")[1].text
region_options.find_elements(By.TAG_NAME,  "option")[2].text
region_options.find_elements(By.TAG_NAME,  "option")[3].text
region_options.find_elements(By.TAG_NAME,  "option")[4].text


'AREQUIPA'

In [207]:
num_dep =  len(region_options.find_elements(By.TAG_NAME,  "option"))
num_dep

26

In [208]:
region_options.find_elements(By.TAG_NAME,  "option")[3]

<selenium.webdriver.remote.webelement.WebElement (session="676168fb892c0c95adb790b2a48d1aa1", element="18cde593-af20-4316-a227-a260fed01c56")>

In [209]:
for option in region_options.find_elements(By.TAG_NAME, "option"):
    print( option.text )

--TODOS--
AMAZONAS
ANCASH
APURIMAC
AREQUIPA
AYACUCHO
CAJAMARCA
CALLAO
CUSCO
HUANCAVELICA
HUANUCO
ICA
JUNIN
LA LIBERTAD
LAMBAYEQUE
LIMA
LORETO
MADRE DE DIOS
MOQUEGUA
PASCO
PIURA
PUNO
SAN MARTIN
TACNA
TUMBES
UCAYALI


In [None]:
scope_options = driver.find_element( By.XPATH, '/html/body/onpe-root/onpe-layout-container/onpe-onpe-epres-re/div[1]/div[3]/div[1]/div[1]/div/div/div')

In [313]:
scope_options.find_elements( By.TAG_NAME, "option")[0].text
scope_options.find_elements( By.TAG_NAME, "option")[1].text
scope_options.find_elements( By.TAG_NAME, "option")[2].text

'EXTRANJERO'

In [332]:
scope = driver.find_element( By.XPATH, "/html/body/onpe-root/onpe-layout-container/onpe-onpe-epres-re/div[1]/div[3]/div[1]/div[1]/div/div/div[1]/select" )
scope

<selenium.webdriver.remote.webelement.WebElement (session="0b1571ae7fc3933ef3b474fc23ef5109", element="6a4a8af5-141e-4c51-b13a-daa46c99c3fc")>

In [335]:
scope_options = scope.find_elements( By.TAG_NAME, "option")

In [336]:
scope_options

[<selenium.webdriver.remote.webelement.WebElement (session="0b1571ae7fc3933ef3b474fc23ef5109", element="80298da3-cb22-41ce-9e3a-38d6ad99eb24")>,
 <selenium.webdriver.remote.webelement.WebElement (session="0b1571ae7fc3933ef3b474fc23ef5109", element="d28ac835-9d72-4738-b5a6-90564174ed98")>,
 <selenium.webdriver.remote.webelement.WebElement (session="0b1571ae7fc3933ef3b474fc23ef5109", element="1e00eecc-ecb4-4ea2-bbd9-11af08e629c1")>]

In [339]:
dict_scope_options = { i.text : i for i in scope_options }
dict_scope_options

{'TODOS': <selenium.webdriver.remote.webelement.WebElement (session="0b1571ae7fc3933ef3b474fc23ef5109", element="80298da3-cb22-41ce-9e3a-38d6ad99eb24")>,
 'PERÚ': <selenium.webdriver.remote.webelement.WebElement (session="0b1571ae7fc3933ef3b474fc23ef5109", element="d28ac835-9d72-4738-b5a6-90564174ed98")>,
 'EXTRANJERO': <selenium.webdriver.remote.webelement.WebElement (session="0b1571ae7fc3933ef3b474fc23ef5109", element="1e00eecc-ecb4-4ea2-bbd9-11af08e629c1")>}

In [340]:
# There are three options
dict_scope_options.keys()
dict_scope_options

{'TODOS': <selenium.webdriver.remote.webelement.WebElement (session="0b1571ae7fc3933ef3b474fc23ef5109", element="80298da3-cb22-41ce-9e3a-38d6ad99eb24")>,
 'PERÚ': <selenium.webdriver.remote.webelement.WebElement (session="0b1571ae7fc3933ef3b474fc23ef5109", element="d28ac835-9d72-4738-b5a6-90564174ed98")>,
 'EXTRANJERO': <selenium.webdriver.remote.webelement.WebElement (session="0b1571ae7fc3933ef3b474fc23ef5109", element="1e00eecc-ecb4-4ea2-bbd9-11af08e629c1")>}

In [341]:
# We click on Peru
dict_scope_options['PERÚ'].click()

We have to be careful: every time we click, the URL changes and the page re-renders, so elements located earlier may no longer be valid.
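
Since the URL tracks the current selection, its last path segment can serve as an identifier for the page being shown. A minimal sketch with the standard library (`last_path_segment` is a name introduced here), applied to URLs used earlier in this chapter:

```python
from urllib.parse import urlparse

def last_path_segment(url):
    """Return the final non-empty segment of the URL path ('' if none)."""
    path = urlparse(url).path
    segments = [part for part in path.split('/') if part]
    return segments[-1] if segments else ''

summary_url = 'https://resultadoshistorico.onpe.gob.pe/EG2021/ResumenGeneral/10/T'
home_url = 'https://resultadoshistorico.onpe.gob.pe/EG2021/'

print(last_path_segment(summary_url))
print(last_path_segment(home_url))
```

Unlike `url.split("/")[-1]`, this version ignores a trailing slash and any query string.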

### Loop over all departments

In [None]:
# Store all_tables
all_tables = {}

dept_0 = driver.find_element( By.XPATH, "/html/body/onpe-root/onpe-layout-container/onpe-onpe-epres-re/div[1]/div[3]/div[1]/div[1]/div/div/div[2]/select" )
dept_0

In [89]:
# Select wraps a <select> element and exposes its options
from selenium.webdriver.support.ui import Select

dpt = Select( dept_0 )
#dpt.options[15].text

In [352]:
# Get the total number of department options
num_dept_options = len( dpt.options )
num_dept_options

26

In [353]:
# Loop over every department option and print its name
for dpt_idx in range( num_dept_options ):
    
    # Locate the department <select> again on each iteration,
    # since the page re-renders its elements after every change
    dpt = Select( driver.find_element( By.XPATH, "/html/body/onpe-root/onpe-layout-container/onpe-onpe-epres-re/div[1]/div[3]/div[1]/div[1]/div/div/div[2]/select" ) )
    department = dpt.options[ dpt_idx ]
    
    # Get department name
    dpt_name = department.text
    print(dpt_name)

--TODOS--
AMAZONAS
ANCASH
APURIMAC
AREQUIPA
AYACUCHO
CAJAMARCA
CALLAO
CUSCO
HUANCAVELICA
HUANUCO
ICA
JUNIN
LA LIBERTAD
LAMBAYEQUE
LIMA
LORETO
MADRE DE DIOS
MOQUEGUA
PASCO
PIURA
PUNO
SAN MARTIN
TACNA
TUMBES
UCAYALI


## 7.6 Dynamic Pages

In [210]:
driver = webdriver.Chrome( ChromeDriverManager().install() )
# Maximize window
driver.maximize_window()
driver.get('https://www.legacy.com/obituaries/legacy/obituary-search.aspx?isnew=1&affiliateId=0&stateid=17')



In [213]:
# type the Firstname 
keyword = driver.find_element( By.XPATH,'/html/body/div[2]/div[2]/div[2]/form/div[3]/div[1]/div[1]/div[1]/div/div[1]/div[2]/div[3]/div/div[1]/input[1]')
keyword.send_keys('alex')



In [None]:
# range of death
driver.find_element( By.XPATH,'//*[@id="ctl00_ctl00_ContentPlaceHolder1_ContentPlaceHolder1_uxSearchWideControl_ddlSearchRange"]/option[10]').click()


death_begin = driver.find_element( By.XPATH, '//*[@id="ctl00_ctl00_ContentPlaceHolder1_ContentPlaceHolder1_uxSearchWideControl_txtStartDate"]')
death_begin.send_keys('10/10/1994')

In [56]:
death_end = driver.find_element( By.XPATH, '//*[@id="ctl00_ctl00_ContentPlaceHolder1_ContentPlaceHolder1_uxSearchWideControl_txtEndDate"]')    
death_end.send_keys('10/10/2005')

# type the Firstname 
keyword = driver.find_element( By.XPATH, '//*[@id="ctl00_ctl00_ContentPlaceHolder1_ContentPlaceHolder1_uxSearchWideControl_txtFirstName"]')
keyword.send_keys('robert')

# type the Lastname 
keyword = driver.find_element( By.XPATH, '//*[@id="ctl00_ctl00_ContentPlaceHolder1_ContentPlaceHolder1_uxSearchWideControl_txtLastName"]')
keyword.send_keys('brown')

# type the Title 
keyword = driver.find_element( By.XPATH, '//*[@id="ctl00_ctl00_ContentPlaceHolder1_ContentPlaceHolder1_uxSearchWideControl_txtKeyword"]')
keyword.send_keys('professor')

 # Set the state of last residence
driver.find_element( By.XPATH, '//*[@id="ctl00_ctl00_ContentPlaceHolder1_ContentPlaceHolder1_uxSearchWideControl_ddlCountry"]/option[11]').click()
        
# Send information
driver.find_element( By.XPATH, '//*[@id="lnkSearch"]').click()


In [10]:
# We could loop over every department option:
# for dpt_idx in range( len( dpt.options ) ):
# but that would take too long, so here we only cover the first two
# options (`--TODOS--`, which is skipped, and the first department)
for dpt_idx in range( 2 ):
    
    # Locate the department <select> again on each iteration,
    # since the page re-renders its elements after every change
    dpt = Select( driver.find_element(  By.XPATH,  "/html/body/onpe-root/onpe-layout-container/onpe-onpe-epres-re/div[1]/div[3]/div[1]/div[1]/div/div/div[2]/select" ) )
    department = dpt.options[ dpt_idx ]
    
    # Get department name
    dpt_name = department.text
    
    # We select a different department name
    if dpt_name != "--TODOS--" :
        
        # click on department
        department.click()
        
        # Get all elements of province
        prov = Select( driver.find_element(  By.XPATH,  "/html/body/onpe-root/onpe-layout-container/onpe-onpe-epres-re/div[1]/div[3]/div[1]/div[1]/div/div/div[3]/select" ) )
        num_prov_options = len( prov.options )
        
        for prov_idx in range( num_prov_options ):
            
            # Locate the province <select> again, since the page
            # re-renders its elements after every change
            prov = Select( driver.find_element(  By.XPATH,  "/html/body/onpe-root/onpe-layout-container/onpe-onpe-epres-re/div[1]/div[3]/div[1]/div[1]/div/div/div[3]/select" ) )
            province = prov.options[ prov_idx ]
                
            # Get province name
            prov_name = province.text
            
            if prov_name != "--TODOS--" :
                
                # click on province
                province.click()
                
                # Get all elements from district
                dist = Select( driver.find_element(  By.XPATH,  "/html/body/onpe-root/onpe-layout-container/onpe-onpe-epres-re/div[1]/div[3]/div[1]/div[1]/div/div/div[4]/select" ) )
                num_dist_options = len( dist.options )
                
                for dist_idx in range( num_dist_options ):
                    
                    # Get again all districts since HTML is refreshing
                    # all elements
                    dist = Select( driver.find_element(  By.XPATH,  "/html/body/onpe-root/onpe-layout-container/onpe-onpe-epres-re/div[1]/div[3]/div[1]/div[1]/div/div/div[4]/select" ) )
                    district = dist.options[ dist_idx ]
                    
                    # Get district name
                    dist_name = district.text
                    
                    if dist_name != "-- SELECCIONE --" :
                        
                        # click on district
                        district.click()
                        
                        # Get UBIGEO
                        ubigeo = driver.current_url.split("/")[ -1 ]
                        
                        ## Get table of presidential votes
                        # Get html at this point
                        table_path = driver.find_element(  By.XPATH,  "/html/body/onpe-root/onpe-layout-container/onpe-onpe-epres-re/div[1]/div[4]/div[1]/div[3]/div" )
                        table_html = table_path.get_attribute( 'innerHTML' )
                        # Read the table using pandas
                        table = pd.read_html( table_html )
                        
                        # Cleaning tables
                        row_new_columns = table[ 0 ].iloc[ 0 , 2: ]
                        clean_columns = row_new_columns \
                                              .str.replace( " ", "_") \
                                              .str.lower().str.replace( "%", "share_") \
                                              .apply( lambda x : unidecode.unidecode( x ) ) \
                                              .tolist()
                        
                        # Selecting specific columns
                        table_clean = table[0].iloc[ 1:, 2: ].copy()
                        
                        # rename columns
                        table_clean.columns = clean_columns
                        
                        # New values to columns 
                        table_clean[ 'department' ] = dpt_name
                        table_clean[ 'province' ]   = prov_name
                        table_clean[ 'district' ]   = dist_name
                        table_clean[ 'ubigeo' ]     = ubigeo
                        
                        # store tables
                        all_tables[ ubigeo ] = table_clean

In [12]:
final_data = pd.concat( all_tables.values() ).reset_index( drop = True )

In [15]:
final_data.to_excel( r'example_round.xlsx' , index = False )
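
The column-cleaning chain inside the loop can be tested in isolation before running the full scrape. The sketch below reproduces the same steps in plain Python. It swaps `unidecode` for the standard library's `unicodedata` (an assumption made so the example has no third-party dependency; the two can differ on characters other than accented Latin letters), and the header strings are invented for illustration.

```python
import unicodedata

def strip_accents(text):
    """Drop combining marks, e.g. 'ó' -> 'o' (a stand-in for unidecode)."""
    decomposed = unicodedata.normalize('NFKD', text)
    return ''.join(ch for ch in decomposed if not unicodedata.combining(ch))

def clean_column(name):
    """Apply the loop's steps in the same order: spaces, lower case, %, accents."""
    name = name.replace(' ', '_').lower().replace('%', 'share_')
    return strip_accents(name)

# Hypothetical raw headers, not copied from the ONPE site
raw_headers = ['ORGANIZACIÓN POLÍTICA', 'TOTAL', '% VÁLIDOS']
clean_headers = [clean_column(header) for header in raw_headers]
print(clean_headers)
```

Note the double underscore in `share__validos`: the space after `%` is turned into `_` before `%` itself is replaced by `share_`.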