SwoleSource recoveries

Cdsnuts

Moderator
Messages
114
@Cdsnuts I automated scraping threads from Google's cache now. Anything else you want me to save? (Hold on, let me fix something; the zip will be back soon.)

Wow!

I was really just hoping for the Recoveries as they were starting to pick up steam. Everything else can be replaced/duplicated.

I'm sure there are others I'd like to keep, but we had so many posts there it would be hard for me to remember more of them. Maybe @TubZy or @jacknap has some ideas?
 

barbaar

Well-Known Member
Messages
807
If anyone else is into programming and stuff, here's the script I wrote to save threads. Just edit the thread name and it should be good to go for other threads. It's Python 3; you'll need MechanicalSoup (pip install mechanicalsoup). Google's bot detection is really good, though, so once you trigger it you need to wait a while before the script works again. I managed to get a list of threads too; it's attached here.


Python:
import mechanicalsoup
import time
import re

def stripGoogleHeader(soup):
    # Remove Google's cache banner, if present, so only the thread HTML is saved
    header = soup.find(id="google-cache-hdr")
    if header is not None:
        header.decompose()
    return soup

browser = mechanicalsoup.StatefulBrowser(user_agent="Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36")

# Edit this to save a different thread
threadname = "3199-recoveries"

url = "http://webcache.googleusercontent.com/search?q=cache:http://www.swolesource.com/forum/post-finasteride-syndrome/{0}.html".format(threadname)

baseurl = "http://webcache.googleusercontent.com/search?q=cache:http://www.swolesource.com/forum/post-finasteride-syndrome/{0}".format(threadname) + "-{0}.html"

print("Scraping page 1: {0}".format(url))

browser.open(url)

# Figure out the number of pages from the "Page X of Y" pagination link.
# Use raw strings and \d+ so threads longer than nine pages parse correctly.
rgx = re.compile(r"Page \d+ of (?P<pageno>\d+)")
pagelink = browser.get_current_page().find("a", string=rgx)
if pagelink is None:
    amtpages = 1  # no pagination link means a single-page thread
else:
    amtpages = int(rgx.search(pagelink.string).group("pageno"))

print("Total of {0} pages found".format(amtpages))

# Save first page
with open('{0}-{1}.html'.format(threadname, 1), 'w', encoding='utf-8') as file:
    file.write(str(stripGoogleHeader(browser.get_current_page())))

# Save the rest of the pages
for i in range(2, amtpages + 1):
    time.sleep(15)

    print("Scraping page {0}: {1}".format(i, baseurl.format(i)))

    browser.open(baseurl.format(i))

    with open('{0}-{1}.html'.format(threadname, i), 'w', encoding='utf-8') as file:
        file.write(str(stripGoogleHeader(browser.get_current_page())))
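If anyone wants to run this over the whole attached list instead of one thread at a time, here's a rough sketch of the helper bits: building the cache URLs and backing off between requests so you don't trip the bot detection as fast. This is just my guess at a batch setup -- it assumes cachedthreads.txt has one thread slug per line (like "3199-recoveries"), which you'd want to check against the actual attachment.

```python
import time

# Same cache prefix the script above uses
CACHE_PREFIX = ("http://webcache.googleusercontent.com/search?q=cache:"
                "http://www.swolesource.com/forum/post-finasteride-syndrome/")

def cache_url(threadname, page=1):
    """Build the Google-cache URL for a given thread page.

    Page 1 has no suffix; later pages get a "-N" suffix, matching the
    URL scheme in the script above.
    """
    if page == 1:
        return "{0}{1}.html".format(CACHE_PREFIX, threadname)
    return "{0}{1}-{2}.html".format(CACHE_PREFIX, threadname, page)

def backoff_delays(base=15, factor=2, cap=240):
    """Yield increasing delays (seconds) to ride out bot detection."""
    delay = base
    while True:
        yield delay
        delay = min(delay * factor, cap)

# Rough usage: loop over the thread list with growing pauses between threads.
# with open("cachedthreads.txt") as f:
#     delays = backoff_delays()
#     for threadname in (line.strip() for line in f if line.strip()):
#         # call the scraping routine from the script above for threadname
#         time.sleep(next(delays))
```

The backoff starts at the script's 15-second pause and doubles up to a four-minute cap; once Google flags you it seems to need a real wait anyway, so capping keeps the loop from stalling forever.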
 

Attachments

  • cachedthreads.txt
    31.2 KB