Before we start: A short disclaimer
Don’t do this. Don’t import random code you found on the internet into your own code: it is dangerous, because that code could end up running on your machine or a client’s machine, and you don’t want that. Remember to always keep your code and modules in a version control system and to know exactly what you’re loading.
Today I was chatting with some people on Python Discord, a Discord server about Python (which I definitely recommend you check out!), and came across someone who was asking for help.
(Yeah… I’m theinquisitor.)
What he wanted to do was import a variable that contains a list object, which is perfectly fine. The only issue was that the variable lived on a web page at dumptext.com.
And to make things worse, when you inspect the web page’s source code you find that the text isn’t even raw text. I don’t know why anyone would add a raw button that doesn’t even return raw text.
This is a bad idea, but hey: I like to solve problems, and we will work with this scenario just for the fun of it, so let’s start with Solution #1.
Solution #1: Create a module containing the variable name
The first solution consists of the following steps:
- Request the web page’s HTML source code.
- Parse the HTML code with BeautifulSoup.
- Extract the contents inside the HTML’s <pre></pre> tags, which contain the list we want.
- Save the list into a module called wordlist.py that we can import.
- Import the variable words from our newly created module.
Before we start, we have to create an empty module first that will hold our data:
$ touch wordlist.py
Now we proceed to create our main Python code, called solution1.py:
"""Solution 1."""
import requests
from bs4 import BeautifulSoup
from importlib import reload

# our module containing the variable with the list object
import wordlist

# request the HTML source and extract the text inside the '<pre></pre>' tags.
dumptext_data = requests.get('https://dumptext.com/Ai9Ww8j4/raw/').text
parsed_dumptext_html = BeautifulSoup(dumptext_data, 'html.parser')
scraped_list = parsed_dumptext_html.pre.text

# save the contents of the scraped_list into a module
with open('wordlist.py', 'w') as wordlist_object:
    wordlist_object.write(scraped_list)

# reload the module to import the variable 'words'
reload(wordlist)

# now we can use the variable words everywhere we want.
words = wordlist.words
print(type(words))
print(words)
Now, we are familiar with packages like BeautifulSoup, but what about importlib? importlib has a lot of utilities for our imports. In this case we imported reload from importlib, which allows us to reload a previously imported module, fresh as new, including the newly created variable holding our list. Remember that PEP8 recommends adding your imports at the top of the file, following your docstring, not in the middle of the file.
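To see what reload does without touching the network, here is a minimal sketch that repeats the same trick with a throwaway file. The module name demo_wordlist and the three sample words are made up for illustration; everything else is standard library.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Work inside a temporary directory so no real files are touched.
with tempfile.TemporaryDirectory() as tmp_dir:
    # 'demo_wordlist' is a hypothetical module name used only for this demo.
    module_path = Path(tmp_dir) / "demo_wordlist.py"
    module_path.write_text("")  # an empty module, like `touch wordlist.py`

    sys.path.insert(0, tmp_dir)
    importlib.invalidate_caches()
    try:
        demo_wordlist = importlib.import_module("demo_wordlist")
        # The module is empty, so it has no 'words' attribute yet.
        had_words_before = hasattr(demo_wordlist, "words")

        # Rewrite the module on disk, then ask Python to load it again.
        module_path.write_text("words = ['aah', 'aal', 'aas']\n")
        importlib.reload(demo_wordlist)
        reloaded_words = demo_wordlist.words
    finally:
        # Clean up so the demo leaves no trace in the interpreter.
        sys.path.remove(tmp_dir)
        sys.modules.pop("demo_wordlist", None)

print(had_words_before)
print(reloaded_words)
```

Keep in mind that reload executes whatever is in the file, which is exactly why this approach is risky with content you don’t control.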
Let’s see if our solution works well:
$ python solution1.py
<class 'list'>
['aah', 'aal', 'aas', 'aba', 'abo', 'abs', 'aby', 'ace', 'act', 'add', 'ado', 'ads', 'adz', 'aff', 'aft', 'aga', 'age', 'ago', 'ags', 'aha', 'ahi', 'ahs', 'aid', 'ail', 'aim', 'ain', 'air', 'ais', 'ait', 'ala', 'alb', 'ale', 'all', 'alp', 'als', 'alt', 'ama', 'ami', 'amp', 'amu', 'ana', 'and', 'ane', 'ani', 'ant', 'any', ...]
It works! We can see that words is an object of the class list, and it prints successfully.
There’s only one thing… we shouldn’t do this. Importing random code from the Internet is not a great option, believe me. You could do this in a very specific scenario where you absolutely don’t have any other choice, the file is read-only, and you are its owner. If it’s a controlled file then fine, but realistically I don’t see that happening, so please, avoid this solution.
It was fun to write though.
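If you really need to read an assignment like this from text you don’t fully trust, one alternative (not what the original code does) is to parse it with the standard library’s ast module instead of importing it. The sketch below assumes the scraped text looks like words = [...]; the helper name extract_list_literal and the sample strings are hypothetical.

```python
import ast


def extract_list_literal(source_text, variable_name="words"):
    """Return the list assigned to variable_name without executing the text."""
    tree = ast.parse(source_text)
    for node in tree.body:
        if not isinstance(node, ast.Assign):
            continue
        for target in node.targets:
            if isinstance(target, ast.Name) and target.id == variable_name:
                # literal_eval only accepts plain literals (strings, numbers,
                # lists, ...) and raises ValueError for calls or other code.
                value = ast.literal_eval(node.value)
                if not isinstance(value, list):
                    raise ValueError(f"{variable_name!r} is not a list")
                return value
    raise ValueError(f"no assignment to {variable_name!r} found")


# Hypothetical sample standing in for the scraped text.
sample_text = "words = ['aah', 'aal', 'aas']"
words = extract_list_literal(sample_text)
print(type(words))
print(words)

# Text that tries to run code is rejected instead of executed.
malicious_text = "words = __import__('os').getcwd()"
try:
    extract_list_literal(malicious_text)
    rejected = False
except ValueError:
    rejected = True
print(rejected)
```

Nothing in the text is ever run here: the text is only parsed, and anything that isn’t a plain literal is refused.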
Solution #2: A Regular Expression to Match the List Items
This is a much better (and cleaner) solution. We’ll do the same scraping as before, but instead of creating and loading a module, we will pick everything inside a pair of single quotes ('') from the parsed HTML code with a regular expression and add those results to a list.
Let’s jump straight into the code:
"""Solution 2."""
import requests
from bs4 import BeautifulSoup
import re

# request the HTML source and extract the text inside the '<pre></pre>' tags.
dumptext_data = requests.get('https://dumptext.com/Ai9Ww8j4/raw/').text
parsed_dumptext_html = BeautifulSoup(dumptext_data, 'html.parser')
scraped_list = parsed_dumptext_html.pre.text

# define a pattern that captures everything inside a pair of quotes ('')
pattern = re.compile(r'\'(.*?)\'')

# create a list of items using list comprehension with all the matching items
words = [word for word in re.findall(pattern, scraped_list)]

# now we can use the variable words everywhere we want.
print(type(words))
print(words)
Let’s see if it works correctly:
$ python solution2.py
<class 'list'>
['aah', 'aal', 'aas', 'aba', 'abo', 'abs', 'aby', 'ace', 'act', 'add', 'ado', 'ads', 'adz', 'aff', 'aft', 'aga', 'age', 'ago', 'ags', 'aha', 'ahi', 'ahs', 'aid', 'ail', 'aim', 'ain', 'air', 'ais', 'ait', 'ala', 'alb', 'ale', 'all', 'alp', 'als', 'alt', 'ama', 'ami', 'amp', 'amu', 'ana', 'and', 'ane', 'ani', 'ant', 'any', ...]
That’s it! No need for stinkin' modules or injecting random code into our code, no importing and no reloading:
- We imported the library re, which helps us search for strings using regular expressions (or RegEx).
- We use the same scraping technique as before.
- Now we define a RegEx pattern that captures everything inside a pair of single quotes.
- We create a list of words, where word is each of the matches found in the scraped text.
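The RegEx step is easy to try on its own, without any scraping. This small sketch runs the same kind of pattern over a made-up sample string (sample_text is an assumption standing in for the scraped text):

```python
import re

# Hypothetical sample standing in for the text scraped from the page.
sample_text = "words = ['aah', 'aal', 'aas']"

# Capture, non-greedily, whatever sits between two single quotes.
pattern = re.compile(r"'(.*?)'")

# With a single capture group, findall returns only the captured text.
words = pattern.findall(sample_text)

print(type(words))
print(words)
```

Since findall already returns a list, the list comprehension is optional. Note that this simple pattern assumes the items themselves never contain a single quote.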
And that’s all: you can now use your words list in a much safer way.
If you’re interested in being part of a community, I cannot recommend Python Discord enough. It has helped me learn so much about Python, people are really helpful, and we’re always growing.
So please, be my guest and hop in this awesome community, here’s your invite: https://pythondiscord.com/invite
Everyone is welcome.