John Davidson

php - How would I programmatically determine all the currently existing dirs in a Github dir page?

Message:


I have this: https://github.com/bitcoin-core/guix.sigs/tree/main/22.0


In PHP, I'm trying to grab a list of all subdirs in that dir, as it exists at that given moment. For all I know, they sometimes remove and add (or even rename) the existing ones. In other words: [ '0xb10c', 'CoinForensics', 'Emzy', ... ];


What would be the best way to accomplish this?


Do I really have to cURL-fetch the webpage (Github/Microsoft loves blocking my bots) and then try to parse them out from the absolute clusterduck of HTML code?


Do they really not provide this list of "independent verifiers" as some sort of computer-parseable list somewhere?


My ultimate goal is to be able to fetch all of their verifications for the current version of Bitcoin Core, such as: https://raw.githubusercontent.com/bitcoin-core/guix.sigs/main/22.0/fanquake/all.SHA256SUMS, and compare them with the "official" one on BitcoinCore.org; if they don't all match, I will not install the new update. To be able to do this, I need to know the list of "users" to construct the URLs to fetch.


I don't understand why they always seem to actively make one step impossible or near-impossible to automate, even in highly technical and security-related contexts where it makes no sense. I really hope that I'm missing something obvious.



My Answer:
Good news: GitHub does provide this as a computer-parseable list. The REST "Contents" API returns the listing of any path in a repository as JSON. A GET request to https://api.github.com/repos/bitcoin-core/guix.sigs/contents/22.0?ref=main returns an array of entries, each with a "name" and a "type" field; the signer directories are simply the entries whose type is "dir". This is far more robust than scraping the HTML page.
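
Here is a minimal sketch using plain cURL and json_decode (the User-Agent string is an arbitrary name I made up; the GitHub API rejects requests that send none):

<?php

// List the subdirectories of 22.0/ via the GitHub REST Contents API.
// Endpoint pattern: https://api.github.com/repos/{owner}/{repo}/contents/{path}?ref={branch}
$url = 'https://api.github.com/repos/bitcoin-core/guix.sigs/contents/22.0?ref=main';

$ch = curl_init($url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
// The API refuses requests without a User-Agent; any descriptive string works
curl_setopt($ch, CURLOPT_USERAGENT, 'guix-sigs-checker');
curl_setopt($ch, CURLOPT_HTTPHEADER, ['Accept: application/vnd.github+json']);

$response = curl_exec($ch);
if ($response === false) {
    die('Request failed: ' . curl_error($ch));
}
curl_close($ch);

$entries = json_decode($response, true);
if (!is_array($entries)) {
    die("Unexpected API response\n");
}

// Keep only the entries that are directories
$directories = [];
foreach ($entries as $entry) {
    if ($entry['type'] === 'dir') {
        $directories[] = $entry['name'];
    }
}

print_r($directories); // e.g. [ '0xb10c', 'CoinForensics', 'Emzy', ... ]

?>

Because the listing is fetched fresh on every run, directories that get added, removed, or renamed are picked up automatically, which addresses your concern about the set of verifiers changing.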

If you would rather scrape the HTML page instead (or the API is unavailable to you), you can use PHP libraries like Guzzle or cURL to request the directory page URL (https://github.com/bitcoin-core/guix.sigs/tree/main/22.0) and then use an HTML parsing library like DOMDocument or SimpleHTMLDom to extract the directory names from the markup.

Here is a basic example using cURL and SimpleHTMLDom to fetch and parse the GitHub directory page:

<?php

// URL of the GitHub directory page
$url = 'https://github.com/bitcoin-core/guix.sigs/tree/main/22.0';

// Initialize cURL session; GitHub tends to block requests that send
// no User-Agent header, so set one
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (compatible; guix-sigs-checker)');

// Execute cURL session and bail out on failure
$response = curl_exec($ch);
if ($response === false) {
    die('Fetch failed: ' . curl_error($ch));
}

// Close cURL session
curl_close($ch);

// Parse the HTML content using SimpleHTMLDom
include('simple_html_dom.php');
$html = str_get_html($response);

// Find all links with class "js-navigation-open" (the directory entries in
// GitHub's older server-rendered markup; GitHub's HTML changes over time,
// so inspect the current page source and adjust this selector as needed)
$directories = [];
foreach ($html->find('a.js-navigation-open') as $link) {
    $directories[] = $link->plaintext;
}

// Output the list of directories
print_r($directories);

?>


You will need to download the SimpleHTMLDom library from http://simplehtmldom.sourceforge.net/ and include it in your PHP script.

Please note that scraping is the fragile option here: it depends on the structure of GitHub's HTML, which changes over time (newer GitHub pages render the file listing client-side from JSON embedded in the page, so a fixed CSS selector can silently stop matching). GitHub may also rate-limit or block automated access to the website itself, which matches your experience with blocked bots. The API route avoids the parsing problem entirely; unauthenticated API calls are rate-limited too (currently 60 requests per hour per IP), but sending a personal access token raises that limit substantially.
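
Finally, to tie this back to your ultimate goal, here is a hypothetical sketch of the comparison step. It assumes $directories holds the signer names from the listing above, that you have already downloaded the official checksums file from BitcoinCore.org to a local file (I won't guess its exact URL), and that matching builds publish byte-identical all.SHA256SUMS files; if ordering or whitespace can legitimately differ, compare the parsed hash lines instead:

<?php

// $directories would come from the API listing above; sample values shown
$directories = ['0xb10c', 'CoinForensics', 'Emzy'];

// PLACEHOLDER: official.SHA256SUMS must be fetched from BitcoinCore.org beforehand
$officialSums = trim(file_get_contents('official.SHA256SUMS'));

$allMatch = true;
foreach ($directories as $user) {
    $url = "https://raw.githubusercontent.com/bitcoin-core/guix.sigs/main/22.0/{$user}/all.SHA256SUMS";
    // file_get_contents() on a URL needs allow_url_fopen; swap in cURL if disabled
    $signerSums = file_get_contents($url);

    if ($signerSums === false || trim($signerSums) !== $officialSums) {
        echo "Mismatch (or fetch failure) for {$user}\n";
        $allMatch = false;
    }
}

echo $allMatch ? "All signers match the official checksums.\n"
               : "Verification failed - do not install this update.\n";

?>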
