SwoleSource recoveries

barbaar · Mar 8, 2018

@Cdsnuts I've automated scraping threads from Google's cache now. Anything else you want me to save?

Cdsnuts · Mar 8, 2018

barbaar said:
@Cdsnuts I've automated scraping threads from Google's cache now. Anything else you want me to save? (Hold on, let me fix something; the zip will be back soon)

Wow!

I was really just hoping for the Recoveries as they were starting to pick up steam. Everything else can be replaced/duplicated.

I'm sure there are others I'd like to keep, but we had so many posts there it would be hard for me to remember more of them. Maybe @TubZy or @jacknap has some ideas?

barbaar · Mar 8, 2018

If anyone else is into programming, here's the script I wrote to save threads. Just edit the thread name and it should be good to go for other threads. It's Python 3, and you'll need mechanicalsoup (pip install mechanicalsoup). Google's bot detection is really good though, so once you trigger it you need to wait a while before the script works again. I managed to get a list of threads too; it's attached here.

Python:

import mechanicalsoup
import time
import re

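def stripGoogleHeader(soup):
    # Remove the banner Google adds to cached pages, if it is present
    header = soup.find(id="google-cache-hdr")
    if header is not None:
        header.decompose()
    return soup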

browser = mechanicalsoup.StatefulBrowser(user_agent="Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36")

# Edit this to save a different thread
threadname = "3199-recoveries"

url = "http://webcache.googleusercontent.com/search?q=cache:http://www.swolesource.com/forum/post-finasteride-syndrome/{0}.html".format(threadname)

baseurl = "http://webcache.googleusercontent.com/search?q=cache:http://www.swolesource.com/forum/post-finasteride-syndrome/{0}".format(threadname) + "-{0}.html"

print("Scraping page 1: {0}".format(url))

browser.open(url)

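# Figure out the number of pages (\d+ so threads with 10 or more pages work too)
rgx = re.compile(r"Page \d+ of (?P<pageno>\d+)")
pagelink = browser.get_current_page().find("a", string=rgx)
# If no "Page X of Y" link is found, assume the thread has a single page
amtpages = int(rgx.search(pagelink.string).group("pageno")) if pagelink else 1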

print("Total of {0} pages found".format(amtpages))

# Save first page (explicit UTF-8 so non-ASCII characters don't break on Windows)
with open('{0}-{1}.html'.format(threadname, 1), 'w', encoding='utf-8') as file:
    file.write(str(stripGoogleHeader(browser.get_current_page())))

# Save the rest of the pages
for i in range(2, amtpages + 1):
    time.sleep(15)

    print("Scraping page {0}: {1}".format(i, baseurl.format(i)))

    browser.open(baseurl.format(i))

    with open('{0}-{1}.html'.format(threadname, i), 'w', encoding='utf-8') as file:
        file.write(str(stripGoogleHeader(browser.get_current_page())))
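
Since the bot detection means you have to wait a while once it triggers, here is a rough sketch of how the waiting could be automated. The helper name fetch_with_backoff, the delay values, and the way a block is detected (HTTP status 429 or a redirect to a "/sorry/" URL) are all assumptions for illustration, not something the script above does or something verified against Google's current behavior.

```python
import time


def fetch_with_backoff(open_page, url, delays=(60, 300, 900), sleep=time.sleep):
    """Call open_page(url), waiting and retrying while the request looks blocked.

    open_page is any callable that returns an object with .status_code and .url
    attributes (a requests.Response fits that shape).
    """
    # First attempt happens immediately, then once after each delay
    for delay in (0,) + tuple(delays):
        if delay:
            sleep(delay)
        response = open_page(url)
        # Assumption: a block shows up as HTTP 429 or a redirect to a "/sorry/" page
        blocked = response.status_code == 429 or "/sorry/" in response.url
        if not blocked:
            return response
    raise RuntimeError(
        "Still blocked after {0} attempts: {1}".format(len(delays) + 1, url)
    )
```

In the script this would be called as fetch_with_backoff(browser.open, url) in place of browser.open(url). The sleep parameter is only there so the waiting can be swapped out when trying the function without real delays.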

Cdsnuts · Mar 8, 2018

Man...all this stuff is just a completely different language to me....

barbaar · Mar 8, 2018

Cdsnuts said:
Man...all this stuff is just a completely different language to me....

No worries man. I just figured I'd post the script in case anyone else on here is nerdy enough to want to use it.

joekool · Mar 8, 2018

Impressive!

The community dealing with this thanks you big time!

Thank you!

Chapman · Mar 8, 2018

Good lad barbaar, King Babar would be proud lol.

Seriously though, those posts have probably helped most of us at one time or another. I know they did for me.
