Scraping
dadosfera.services.scraping.get_content_from_url
get_content_from_url(url)
Retrieves the content from a specified URL using a GET request with a Firefox user agent.
PARAMETER | DESCRIPTION |
---|---|
url | The URL to fetch content from. |
RETURNS | DESCRIPTION |
---|---|
bytes | bytes or None: The raw content of the webpage if successful (status code 200), None if the request fails. |
Example
    content = get_content_from_url('https://example.com')
    if content:
        # Process the content
        pass
Note
- Uses Firefox user agent to mimic browser behavior
- Requires the 'requests' library
- Does not handle exceptions; the caller should implement error handling
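
Since exceptions are left to the caller, the call is usually wrapped. The following is a minimal sketch of one way to do that, not part of dadosfera. It assumes network failures surface as `OSError` subclasses, which holds for `requests` because `requests.exceptions.RequestException` derives from `IOError`, an alias of `OSError`. The names `fetch_or_none` and `unreachable` are hypothetical.

```python
import logging
from typing import Callable, Optional

logger = logging.getLogger(__name__)


def fetch_or_none(fetch: Callable[[str], Optional[bytes]], url: str) -> Optional[bytes]:
    """Call fetch(url) and return None instead of raising on a network error.

    Hypothetical helper: `fetch` is any callable with the same contract as
    get_content_from_url (bytes on success, None on a non-200 response).
    """
    try:
        return fetch(url)
    except OSError as exc:
        # Covers built-in ConnectionError/TimeoutError and, by inheritance,
        # requests.exceptions.RequestException.
        logger.warning("Request to %s failed: %s", url, exc)
        return None


def unreachable(url: str) -> bytes:
    """Stand-in fetcher that simulates a refused connection (no network used)."""
    raise ConnectionError("connection refused")


content = fetch_or_none(unreachable, "https://example.com")
if content is None:
    logger.info("No content retrieved; skipping processing")
```

In real use, `get_content_from_url` would be passed as the `fetch` argument. Both a failed request and a non-200 response then produce `None`, so the caller has a single case to check.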
Source code in dadosfera/services/scraping.py
dadosfera.services.scraping.parse_content_from_html
parse_content_from_html(html_content)
Parses HTML content into a BeautifulSoup object for easy manipulation and searching.
PARAMETER | DESCRIPTION |
---|---|
html_content | Raw HTML content to be parsed. |
RETURNS | DESCRIPTION |
---|---|
BeautifulSoup | A parsed BeautifulSoup object representing the HTML document. |
Example
    soup = parse_content_from_html(html_content)
    title = soup.find('title')
Note
- Requires the 'beautifulsoup4' library
- Uses 'html.parser' as the parsing engine
Source code in dadosfera/services/scraping.py
dadosfera.services.scraping.extract_text_from_html
extract_text_from_html(html_content)
Extracts visible text content from HTML while filtering out script, style, and comment content.
PARAMETER | DESCRIPTION |
---|---|
html_content | Raw HTML content to extract text from. |
RETURNS | DESCRIPTION |
---|---|
str | A single string containing all visible text from the HTML document, with each text segment separated by spaces. |
Implementation Details
- Uses an internal tag_visible() function to determine if text should be included
- Filters out content from: style, script, head, title, meta tags and comments
- Joins all visible text segments with spaces
- Strips whitespace from each text segment
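
The filtering steps listed above can be illustrated with a small standard-library sketch. This is not the library's implementation, which is BeautifulSoup-based. It is a stand-in built on `html.parser.HTMLParser`, and the names `VisibleTextParser` and `visible_text` are hypothetical. One assumption differs from the description: segments that are empty after stripping are dropped here.

```python
from html.parser import HTMLParser

# Elements whose text is treated as invisible. `meta` is a void element
# with no text of its own, so it needs no entry here.
HIDDEN_TAGS = {"style", "script", "head", "title"}


class VisibleTextParser(HTMLParser):
    """Collects text that is not nested inside a hidden element."""

    def __init__(self):
        super().__init__()
        self.hidden_depth = 0  # greater than 0 while inside a hidden element
        self.segments = []

    def handle_starttag(self, tag, attrs):
        if tag in HIDDEN_TAGS:
            self.hidden_depth += 1

    def handle_endtag(self, tag):
        if tag in HIDDEN_TAGS and self.hidden_depth > 0:
            self.hidden_depth -= 1

    def handle_data(self, data):
        text = data.strip()  # strip whitespace from each segment
        if self.hidden_depth == 0 and text:
            self.segments.append(text)

    # handle_comment is not overridden, so comments are discarded.


def visible_text(html: str) -> str:
    """Return the visible text segments joined with single spaces."""
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    return " ".join(parser.segments)


sample = (
    "<html><head><title>Demo</title><style>p { color: red; }</style></head>"
    "<body><!-- hidden comment --><p> Hello </p>"
    "<script>var x = 1;</script><p>World!</p></body></html>"
)
result = visible_text(sample)  # "Hello World!"
```

The depth counter is what keeps the title and style text hidden while they are nested inside `head`: text is collected only when no hidden element is open.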
Example
    # Any simple markup works here; two paragraphs are used for illustration
    html = '<p>Hello</p><p>World!</p>'
    text = extract_text_from_html(html)
    # Returns: "Hello World!"
Note
- Requires the 'beautifulsoup4' library
- Preserves the natural reading order of the document
- Removes unnecessary whitespace while maintaining word separation
Source code in dadosfera/services/scraping.py