.extract() is unable to get data properly from sparse tables #192

shubham-MLwiz · 2020-05-28T13:16:20Z

I created a small table by hand to reproduce the bug I am facing:

<!DOCTYPE html>
<html lang="en">
<table class="manual_table">
   <thead>
      <tr>
        <th class="">Mar 2008</th>
        <th class="">Mar 2009</th>
        <th class="">Mar 2010</th>
      </tr>
   </thead>
   <tbody>
      <tr>
        <td class="">8,626</td>
        <td class="">8,427</td>
        <td class="">11,525</td>
      </tr>
      <tr>
        <td class="">16,408</td>
        <td class="">19,582</td>
        <td class=""></td>
      </tr>
      <tr>        
        <td class=""></td>
        <td class="">22,574</td>
        <td class="">21,755</td> 
      </tr>
   </tbody>
</table>

When I run the code below against the HTML above, this is the output I get:

>>> rows = response.css(".manual_table tbody tr")
>>> rows[0].css("td::text").extract()
['8,626', '8,427', '11,525']
>>> rows[1].css("td::text").extract()
['16,408', '19,582']
>>> rows[2].css("td::text").extract()
['22,574', '21,755']

As you can see, it returns nothing for the empty data cells. It ignores all empty values, which looks like a bug.

Similarly, the code below gives results I did not expect: the number of extracted text values does not match the number of cells.

>>> len(rows[2].css("td::text").extract())
2
>>> len(rows[2].css("td::text"))
2
>>> len(rows[2].css("td"))
3

Both .getall() and .extract() give the same issue.

elacuesta · 2020-05-28T14:22:06Z

AFAICT, this is expected. "td::text" selects the text nodes inside each td, and an empty cell has no text node. That's why nothing is included for it in the results and why len(rows[2].css("td")) != len(rows[2].css("td::text")).

Were you expecting some other value, None for instance?

PS: to reproduce in parsel:

In [1]: html = """<!DOCTYPE html> 
   ...: <html lang="en"> 
   ...: <table class="manual_table"> 
   ...:    <thead> 
   ...:       <tr> 
   ...:         <th class="">Mar 2008</th> 
   ...:         <th class="">Mar 2009</th> 
   ...:         <th class="">Mar 2010</th> 
   ...:       </tr> 
   ...:    </thead> 
   ...:    <tbody> 
   ...:       <tr> 
   ...:         <td class="">8,626</td> 
   ...:         <td class="">8,427</td> 
   ...:         <td class="">11,525</td> 
   ...:       </tr> 
   ...:       <tr> 
   ...:         <td class="">16,408</td> 
   ...:         <td class="">19,582</td> 
   ...:         <td class=""></td> 
   ...:       </tr> 
   ...:       <tr>         
   ...:         <td class=""></td> 
   ...:         <td class="">22,574</td> 
   ...:         <td class="">21,755</td>  
   ...:       </tr> 
   ...:    </tbody> 
   ...: </table>"""

In [2]: from parsel import Selector

In [3]: s = Selector(text=html)

In [4]: rows = s.css(".manual_table tbody tr")

In [5]: rows[0].css("td::text").extract()
Out[5]: ['8,626', '8,427', '11,525']

In [6]: rows[1].css("td::text").extract()
Out[6]: ['16,408', '19,582']

In [7]: rows[2].css("td::text").extract()
Out[7]: ['22,574', '21,755']

shubham-MLwiz · 2020-05-28T17:33:44Z

Thanks for the clarification.
But I still think that when I scrape a table, I should be able to get all the td values, empty cells included.
Currently I get them by looping over the cells and calling .get() with a default argument.

rows = response.css(".manual_table tbody tr")
dt = []
for row in rows:
    for data in row.css("td"):
        dt.append(data.css("::text").get(default=''))

Is there a better way to parse a sparse table other than the looping method?

What I suggest is that .getall() and .extract() should accept a similar default argument. Then, if a tag exists but has no corresponding "::text", we could assign it a default value instead of dropping it entirely.

shubham-MLwiz · 2020-05-31T10:46:55Z

Is anyone looking into this?

Gallaecio · 2020-06-01T06:18:02Z

> Is there a better way to parse a sparse table other than the looping method?

I believe that is the right way to do it with Parsel.

ilyazub · 2022-02-16T20:54:38Z

@shubham-MLwiz xpath("normalize-space()").getall() returns an empty string for each empty data cell, unlike text(), which skips them.

>>> s.css(".manual_table tbody tr td").xpath("normalize-space()").getall()
['8,626', '8,427', '11,525', '16,408', '19,582', '', '', '22,574', '21,755']

Full code

from parsel import Selector

html = """<!DOCTYPE html> 
<html lang="en"> 
<table class="manual_table"> 
  <thead> 
    <tr> 
      <th class="">Mar 2008</th> 
      <th class="">Mar 2009</th> 
      <th class="">Mar 2010</th> 
    </tr> 
  </thead> 
  <tbody> 
    <tr> 
      <td class="">8,626</td> 
      <td class="">8,427</td> 
      <td class="">11,525</td> 
    </tr> 
    <tr> 
      <td class="">16,408</td> 
      <td class="">19,582</td> 
      <td class=""></td> 
    </tr> 
    <tr>         
      <td class=""></td> 
      <td class="">22,574</td> 
      <td class="">21,755</td>  
    </tr> 
  </tbody> 
</table>
</html>"""

s = Selector(text=html)

rows = s.css(".manual_table tbody tr")

dt = []
for row in rows:
    for data in row.css("td"):
        dt.append(data.css("::text").get(default=''))

print("Loop:", dt)

dt2 = s.css(".manual_table tbody tr td").xpath("normalize-space()").getall()

print("One-liner:", dt2)

Output

Loop: ['8,626', '8,427', '11,525', '16,408', '19,582', '', '', '22,574', '21,755']
One-liner: ['8,626', '8,427', '11,525', '16,408', '19,582', '', '', '22,574', '21,755']

I'm commenting on this old issue because I ran into it today.

elacuesta transferred this issue from scrapy/scrapy May 28, 2020

Gallaecio added the enhancement label Aug 13, 2020
