
InternetPirate

InternetPirate@lemmy.fmhy.ml
80 posts • 208 comments

There are several Linux command-line tools that can be used to download a mirror of a website for offline browsing. Here are some of the most popular ones:

  1. wget: wget is a free utility for non-interactive download of files from the Web. It supports HTTP, HTTPS, and FTP protocols, as well as retrieval through HTTP proxies. To download a website recursively, use the -r (--recursive) option, or the higher-level --mirror option used below, which turns on recursion with infinite depth plus timestamping. For example, the following command will download the entire website located at http://example.com:

    wget --mirror --convert-links --adjust-extension --page-requisites --no-parent http://example.com
    

    This will create a local copy of the website in a directory named example.com.

  2. httrack: httrack is a free and open-source website copier that allows you to download a website and browse it offline. It supports HTTP, HTTPS, and FTP protocols, as well as retrieval through HTTP proxies. To download a website, use the -O option followed by the directory where you want to save the website. For example, the following command will download the entire website located at http://example.com:

    httrack http://example.com -O /path/to/save/directory
    

    This will create a local copy of the website in the directory /path/to/save/directory/example.com.

  3. curl: curl is a command-line tool for transferring data from or to a server using one of its supported protocols (HTTP, HTTPS, FTP, etc.). Unlike wget and httrack, curl has no recursive mode, so it cannot mirror an entire site on its own; it is best suited to downloading individual pages. For example, the following command will download the page at http://example.com, following any redirects, and save it locally:

    curl -L -A 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36' --retry 3 --retry-delay 5 -o example.html http://example.com

    This will save the page as example.html in the current directory. For a complete offline mirror, use wget or httrack instead.

These tools have different options and features, so you should choose the one that best suits your needs.
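
If you want to drive the mirroring from a script rather than type the command by hand, here is a minimal Python sketch that simply wraps the wget invocation shown above (it assumes wget is installed and on your PATH; the URL and the helper name mirror_site are placeholders):

import subprocess

def mirror_site(url, dest="."):
    # Run wget with the mirroring flags from the first example; raises
    # CalledProcessError if wget exits with a non-zero status.
    subprocess.run(
        [
            "wget", "--mirror", "--convert-links", "--adjust-extension",
            "--page-requisites", "--no-parent", url,
        ],
        cwd=dest,    # wget writes the mirror into this directory
        check=True,
    )

mirror_site("http://example.com")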



What are the best Linux CLI tools to download a mirror of a website for offline browsing?


wget -mkEpnp http://example.org

wget --mirror --convert-links --adjust-extension --page-requisites --no-parent http://example.org

(The first command is simply the short-flag form of the second.)

Explanation of the various flags:

--mirror – Makes (among other things) the download recursive.
--convert-links – convert all the links (also to stuff like CSS stylesheets) to relative, so it will be suitable for offline viewing.
--adjust-extension – Adds suitable extensions to filenames (html or css) depending on their content-type.
--page-requisites – Download things like CSS style-sheets and images required to properly display the page offline.
--no-parent – When recursing, do not ascend to the parent directory. This is useful for restricting the download to only a portion of the site.

wget -mpHkKEb -t 1 -e robots=off -U 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:40.0) Gecko/20100101 Firefox/40.0' http://www.example.com

-m (--mirror) : turn on options suitable for mirroring (infinite recursive download and timestamping).

-p (--page-requisites) : download all files that are necessary to properly display a given HTML page. This includes such things as inlined images, sounds, and referenced stylesheets.

-H (--span-hosts): enable spanning across hosts when doing recursive retrieving.

-k (--convert-links) : after the download, convert the links in the documents for local viewing.

-K (--backup-converted) : when converting a file, back up the original version with a .orig suffix. Affects the behavior of -N.

-E (--adjust-extension) : add the proper extension to the end of the file.

-b (--background) : go to background immediately after startup. If no output file is specified via -o, output is redirected to wget-log.

-e (--execute) : execute command (robots=off).

-t number (--tries=number) : set number of tries to number.

-U (--user-agent) : identify as agent-string to the HTTP server. Some servers may ban you permanently for recursive downloading if you send the default user agent.

Cronjobs

These jobs run every night at 23:00 (0 23 * * *), except the third, which is restricted to January (0 23 * 1 *); the last entry kills any wget still running at 08:00 the next morning and removes the wget-log files it left behind.

0 23 * * * cd ~/Documents/Webs/mirror; wget -mpk -t 1 -e robots=off -U 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:40.0) Gecko/20100101 Firefox/40.0' https://example.com

0 23 * * * cd ~/Documents/Webs/mirror; wget -mpkH -t 1 -e robots=off -U 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:40.0) Gecko/20100101 Firefox/40.0' -D example.com https://example.com

0 23 * 1 * cd ~/Documents/Webs/mirror; wget -mpk -t 1 -e robots=off -U 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:40.0) Gecko/20100101 Firefox/40.0' https://example.com

0 8 * * * pkill wget; cd ~/Documents/Webs/mirror/; rm wget*


It isn’t ChatGPT. It’s an LLM with search.


To achieve a rate of 1.39 requests per second, you can use a global variable to keep track of the time elapsed between requests and then calculate the delay based on that time. Here’s a modified version of your fetch_github_data function that implements this:

import time
import requests
import logging

PERSONAL_ACCESS_TOKEN = "your_personal_access_token"
DELAY = 1 / 1.39  # Calculate the delay for 1.39 requests per second
last_request_time = 0  # Initialize a global variable to store the last request time


def fetch_github_data(url):
    global last_request_time  # Access the global variable

    try:
        headers = {
            "Accept": "application/vnd.github+json",
            "Authorization": f"Bearer {PERSONAL_ACCESS_TOKEN}",
            "X-GitHub-Api-Version": "2022-11-28",
        }

        # Calculate the time elapsed since the last request
        time_elapsed = time.time() - last_request_time

        # Calculate the required delay based on the time elapsed
        required_delay = max(0, DELAY - time_elapsed)

        # Sleep for the required delay
        time.sleep(required_delay)

        response = requests.get(url, headers=headers)

        # Update the last request time
        last_request_time = time.time()

        logging.info(f"Fetched data from {url}")
        return response.json()
    except requests.exceptions.RequestException as e:
        logging.exception(f"Error fetching data from {url}\n{e}")
        raise

This code calculates the required delay based on the desired rate of 1.39 requests per second and the time elapsed since the last request. It then sleeps for the required delay before making the next request. The global variable last_request_time is used to keep track of the time of the last request.
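
For example, calling the function in a loop will now automatically pace the requests (the repository URLs below are only placeholders):

if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    urls = [
        "https://api.github.com/repos/octocat/Hello-World",
        "https://api.github.com/repos/torvalds/linux",
    ]
    for url in urls:
        data = fetch_github_data(url)  # sleeps just long enough to stay at ~1.39 req/s
        print(url, data.get("full_name"))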



Just change lemmy.post.create to lemmy.post.createe to trigger an AttributeError. That way you can debug the code without creating any posts. You can also scatter print statements throughout the code; I would use two for each line to make sure the computer isn’t fooling you. Lastly, you can spin up your own Lemmy instance so you don’t have to worry about the generated posts.
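
A slightly more robust version of the same idea, sketched here on the assumption that lemmy is the client object from the script being debugged (only its post.create attribute is assumed to exist), is to swap the method out temporarily instead of misspelling it:

from contextlib import contextmanager

@contextmanager
def no_posting(lemmy_client):
    # Replace post.create with a stub so the script can run end to end
    # without publishing anything, then restore the real method afterwards.
    real_create = lemmy_client.post.create
    lemmy_client.post.create = lambda *args, **kwargs: print("DRY RUN:", args, kwargs)
    try:
        yield
    finally:
        lemmy_client.post.create = real_create

Wrapping your posting logic in "with no_posting(lemmy):" then prints every would-be post instead of creating it.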


Testing.

https://join-lemmy.org/docs/users/03-votes-and-ranking.html

Edit: I was wrong; the ranking that works like a traditional forum is “New Comments”, and yes, it does seem to take the OP’s own comments into account.


The paper actually demonstrates a 16-million-token context window with 92% accuracy. Most models can be retrained to have a 100k context window with over 92% accuracy, but the accuracy drops to 74% at 256k. The code has already been released on GitHub as well. I’m excited to see the development of 100k models using this method soon!


You don’t have any idea of how GPT works. Read about it and then we can talk.
