Package hallucination #950
Open

arjun-krishna1 wants to merge 15 commits into NVIDIA:main from arjun-krishna1:feature/pypi_add_date_first_seen (base: main)
Commits (15, all by arjun-krishna1):
- e427e83 use pypi dataset with date first seen col
- 12d3360 add huggingface data creation scripts
- cc56711 reformat package hallucination dataset scripts
- ef7f9ff add javascript package hallucination dataset script
- 762ff5f update to tsv and format dates
- 8221ed3 update time parsing
- 8b176e6 add date parsing to js
- ca9287e add batching to ruby
- 004d94d add note explaining how to get ruby gems input
- 2760fbb update ruby dataset with tsv and dates
- 6561315 update pypi and npm datasets w tsv and dates
- 9a55150 add cutoff date filtering on pkg hallucination detector
- b62de09 update cutoff for meta llama 3.1
- 16b5904 uses default params for cutoff date
- ce2eb9a test nemotron python
@@ -0,0 +1,43 @@
---
system:
  verbose: 0
  narrow_output: false
  parallel_requests: 16
  parallel_attempts: 16
  lite: true
  show_z: false

run:
  seed:
  deprefix: true
  eval_threshold: 0.5
  generations: 5
  probe_tags:

plugins:
  model_type: openai
  model_name: gpt-3.5-turbo
  probe_spec: all
  detector_spec: auto
  extended_detectors: false
  buff_spec:
  buffs_include_original_prompt: false
  buff_max:
  detectors:
    PythonPypi:
      config:
        cutoff_date: "1994-01-01 00:00:00 +0000"
  generators: {}
  buffs: {}
  harnesses: {}
  probe_spec: packagehallucination.Python
  probes:
    encoding:
      payloads:
        - default

reporting:
  report_prefix:
  taxonomy:
  report_dir: garak_runs
  show_100_pass_modules: true
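The `cutoff_date` configured above drives the new detector behavior: a package should only count as legitimate if it was first published before the target model's training cutoff. A minimal sketch of that comparison, using the same timestamp format as the config (the helper name is illustrative, not garak's actual API):

```python
from datetime import datetime

# Same format as the config's cutoff_date and the datasets' package_first_seen column
TIME_FORMAT = "%Y-%m-%d %H:%M:%S %z"
CUTOFF = datetime.strptime("1994-01-01 00:00:00 +0000", TIME_FORMAT)

def is_before_cutoff(first_seen: str) -> bool:
    # A package is "known" to the model only if it existed before the cutoff
    return datetime.strptime(first_seen, TIME_FORMAT) < CUTOFF

print(is_before_cutoff("1991-02-20 00:00:00 +0000"))  # True
print(is_before_cutoff("2020-01-01 00:00:00 +0000"))  # False
```

A deliberately early cutoff like 1994 makes almost every package look "unknown", which is useful for exercising the filtering path in tests.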
@@ -0,0 +1,102 @@
import time
import requests
from datetime import datetime, timezone
import backoff
import json
from concurrent.futures import ThreadPoolExecutor, as_completed

TIME_FORMAT = "%Y-%m-%d %H:%M:%S %z"

@backoff.on_exception(backoff.expo,
                      (requests.exceptions.RequestException, requests.exceptions.HTTPError),
                      max_tries=5)
def get_package_first_seen(package_name):
    url = f"https://registry.npmjs.org/{package_name}"
    try:
        response = requests.get(url, timeout=30)
        response.raise_for_status()
        data = response.json()
        created_date = data.get('time', {}).get('created', 'N/A')
        # Parse the ISO format date and format it according to TIME_FORMAT
        # (strip the trailing 'Z' so fromisoformat also works before Python 3.11)
        dt = datetime.fromisoformat(created_date.replace("Z", "+00:00"))
        dt = dt.astimezone(timezone.utc)
        created_date = dt.strftime(TIME_FORMAT)
    except (requests.RequestException, ValueError) as e:
        created_date = f"Error: {str(e)}"
        print(f'Error getting data for {package_name}: {created_date}')

    return created_date

def main():
    # names.json from https://github.com/nice-registry/all-the-package-names/blob/master/names.json
    input_file = 'names.json'
    output_file = 'npm_packages3.tsv'
    processed = 0
    included = 0
    excluded = 0
    errors = 0
    start_time = time.time()

    # Read the JSON file into a Python list
    with open(input_file, 'r') as infile:
        package_names = json.load(infile)

    total_packages = len(package_names)
    print(f"Starting to process {total_packages} npm packages...")

    # Process packages in parallel within batches
    batch_size = 1000
    batches = [package_names[i:i+batch_size] for i in range(0, len(package_names), batch_size)]

    with open(output_file, 'a') as outfile:
        outfile.write("text\tpackage_first_seen\n")
        for batch in batches:
            batch_results = []
            with ThreadPoolExecutor(max_workers=batch_size) as executor:
                future_to_package = {executor.submit(get_package_first_seen, package): package for package in batch}

                for future in as_completed(future_to_package):
                    package = future_to_package[future]
                    creation_date = future.result()
                    batch_results.append((package, creation_date))

            batch_output = []
            for package, creation_date in batch_results:
                # Error strings are truthy, so check for them explicitly
                # before treating the result as a valid date
                if creation_date and not str(creation_date).startswith("Error:"):
                    batch_output.append(f"{package}\t{creation_date}")
                    included += 1
                else:
                    excluded += 1

                processed += 1

                if "Error:" in str(creation_date):
                    errors += 1

            if batch_output:
                outfile.write("\n".join(batch_output) + "\n")
            outfile.flush()

            # Progress reporting
            elapsed_time = time.time() - start_time
            packages_per_second = processed / elapsed_time
            estimated_total_time = total_packages / packages_per_second
            estimated_remaining_time = estimated_total_time - elapsed_time

            print(f"Processed: {processed}/{total_packages} ({processed/total_packages*100:.2f}%)")
            print(f"Included: {included}, Excluded: {excluded}, Errors: {errors}")
            print(f"Elapsed time: {elapsed_time:.2f} seconds")
            print(f"Estimated remaining time: {estimated_remaining_time:.2f} seconds")
            print(f"Processing speed: {packages_per_second:.2f} packages/second")
            print("-" * 50)

    print(f"Filtering complete. Results saved in {output_file}")
    print(f"Total packages processed: {processed}")
    print(f"Packages included: {included}")
    print(f"Packages excluded: {excluded}")
    print(f"Packages with errors: {errors}")
    print(f"Total execution time: {time.time() - start_time:.2f} seconds")

if __name__ == "__main__":
    main()
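The npm registry's `time.created` field is an ISO 8601 string with a trailing `Z`, which `datetime.fromisoformat` only accepts natively on Python 3.11+. A standalone sketch of the conversion into the dataset's `TIME_FORMAT`, normalizing the `Z` first (the sample timestamp is illustrative):

```python
from datetime import datetime, timezone

TIME_FORMAT = "%Y-%m-%d %H:%M:%S %z"

def to_tsv_timestamp(iso_ts: str) -> str:
    # npm 'time.created' values look like '2010-12-29T19:38:25.450Z';
    # rewrite 'Z' as '+00:00' so fromisoformat accepts it on older Pythons
    dt = datetime.fromisoformat(iso_ts.replace("Z", "+00:00"))
    return dt.astimezone(timezone.utc).strftime(TIME_FORMAT)

print(to_tsv_timestamp("2010-12-29T19:38:25.450Z"))  # 2010-12-29 19:38:25 +0000
```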
@@ -0,0 +1,85 @@
import requests
from datetime import datetime, timezone
import csv
import backoff
from concurrent.futures import ThreadPoolExecutor, as_completed

TIME_FORMAT = "%Y-%m-%d %H:%M:%S %z"

def get_all_packages():
    url = "https://pypi.org/simple/"
    response = requests.get(url)
    packages = response.text.split("\n")
    return [pkg.split("/")[2] for pkg in packages if "a href" in pkg]

@backoff.on_exception(backoff.expo,
                      (requests.exceptions.RequestException, requests.exceptions.HTTPError),
                      max_tries=5)
def get_package_first_seen(package_name):
    url = f"https://pypi.org/pypi/{package_name}/json"
    response = requests.get(url)
    response.raise_for_status()
    data = response.json()
    releases = data.get("releases", {})
    if releases:
        oldest_release = min(releases.keys(), key=lambda x: releases[x][0]['upload_time'] if releases[x] else '9999-99-99')
        if releases[oldest_release] and releases[oldest_release][0].get("upload_time"):
            # Parse the upload time and format it according to TIME_FORMAT
            upload_time = releases[oldest_release][0]["upload_time"]
            try:
                # Parse the time (PyPI times are in UTC)
                dt = datetime.fromisoformat(upload_time)
                dt = dt.replace(tzinfo=timezone.utc)
                return dt.strftime(TIME_FORMAT)
            except ValueError:
                return None
    return None

def main():
    output_file = "pypi_20241007_NEW.tsv"
    packages = get_all_packages()
    processed = 0
    total_packages = len(packages)
    print(f"Starting to process {total_packages} PyPI packages...")

    batch_size = 1000
    batches = [packages[i:i+batch_size] for i in range(0, total_packages, batch_size)]

    try:
        with open(output_file, "a", newline='') as outfile:
            tsv_writer = csv.writer(outfile, delimiter='\t')
            tsv_writer.writerow(["text", "package_first_seen"])

            for batch in batches:
                batch_results = []
                with ThreadPoolExecutor(max_workers=batch_size) as executor:
                    future_to_package = {executor.submit(get_package_first_seen, package): package for package in batch}

                    for future in as_completed(future_to_package):
                        package = future_to_package[future]
                        try:
                            creation_date = future.result()
                            batch_results.append((package, creation_date))
                            processed += 1
                            if processed % 100 == 0:
                                print(f"Processed: {processed}/{total_packages} ({processed/total_packages*100:.2f}%)")
                        except Exception as e:
                            print(f"Error processing {package}: {str(e)}")

                for package, creation_date in batch_results:
                    if creation_date:
                        tsv_writer.writerow([package, creation_date])
                    else:
                        print(f"No creation date found for {package}")

                outfile.flush()
                print(f"Batch completed. Total processed: {processed}/{total_packages} ({processed/total_packages*100:.2f}%)")
                print("*" * 50)

    except IOError as e:
        print(f"Error writing to file: {str(e)}")

    print(f"Done! Results saved in {output_file}")

if __name__ == "__main__":
    main()
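The oldest-release lookup in the PyPI script leans on lexicographic comparison of ISO timestamps, with a `'9999-99-99'` sentinel pushing releases that have no files to the end of the ordering. A standalone sketch with made-up release data:

```python
# Shape mirrors the 'releases' mapping in PyPI's JSON API: version -> list of files
releases = {
    "0.1": [{"upload_time": "2015-03-01T10:00:00"}],
    "1.0": [{"upload_time": "2016-07-15T08:30:00"}],
    "0.2": [],  # a release with no files sorts last via the sentinel
}

# ISO 8601 timestamps sort correctly as plain strings, so min() finds
# the earliest upload without parsing any dates
oldest = min(
    releases.keys(),
    key=lambda v: releases[v][0]["upload_time"] if releases[v] else "9999-99-99",
)
print(oldest)  # 0.1
```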
@@ -0,0 +1,101 @@
import time
import requests
from datetime import datetime, timezone
import backoff
from concurrent.futures import ThreadPoolExecutor, as_completed

INPUT_TIME_FORMAT = "%Y-%m-%dT%H:%M:%S.%fZ"
TIME_FORMAT = "%Y-%m-%d %H:%M:%S %z"

@backoff.on_exception(backoff.expo,
                      (requests.exceptions.RequestException, requests.exceptions.HTTPError),
                      max_tries=5)
def get_package_first_seen(gem_name):
    url = f"https://rubygems.org/api/v1/versions/{gem_name}.json"
    response = requests.get(url, timeout=30)
    response.raise_for_status()  # Raises an HTTPError for bad responses

    versions = response.json()

    # Sort versions by creation date and get the earliest one
    earliest_version = min(versions, key=lambda v: datetime.strptime(v['created_at'], INPUT_TIME_FORMAT))

    # Parse and format the date
    creation_datetime = datetime.strptime(earliest_version['created_at'], INPUT_TIME_FORMAT)
    creation_datetime = creation_datetime.replace(tzinfo=timezone.utc)
    return creation_datetime.strftime(TIME_FORMAT)

def main():
    # gems.txt is the output from the `gem list --remote` command
    input_file = 'gems.txt'
    output_file = 'filtered_gems.tsv'
    batch_size = 100

    # Read all gem names first
    with open(input_file, 'r') as infile:
        all_gems = [line.strip().split(" (")[0] for line in infile]

    total_gems = len(all_gems)
    processed = 0
    included = 0
    excluded = 0
    errors = 0
    start_time = time.time()

    # Create batches
    batches = [all_gems[i:i+batch_size] for i in range(0, total_gems, batch_size)]

    print(f"Starting to process {total_gems} gems...")

    with open(output_file, 'a') as outfile:
        outfile.write("text\tpackage_first_seen\n")

        for batch in batches:
            batch_results = []
            with ThreadPoolExecutor(max_workers=batch_size) as executor:
                future_to_gem = {executor.submit(get_package_first_seen, gem_name): gem_name for gem_name in batch}

                for future in as_completed(future_to_gem):
                    gem_name = future_to_gem[future]
                    try:
                        formatted_date = future.result()
                        batch_results.append((gem_name, formatted_date))
                        included += 1
                    except Exception as e:
                        print(f"Error processing gem '{gem_name}': {e}")
                        errors += 1

                    processed += 1

                    if processed % 100 == 0 or processed == total_gems:
                        elapsed_time = time.time() - start_time
                        gems_per_second = processed / elapsed_time
                        estimated_total_time = total_gems / gems_per_second
                        estimated_remaining_time = estimated_total_time - elapsed_time

                        print(f"Processed: {processed}/{total_gems} ({processed/total_gems*100:.2f}%)")
                        print(f"Included: {included}, Excluded: {excluded}, Errors: {errors}")
                        print(f"Elapsed time: {elapsed_time:.2f} seconds")
                        print(f"Estimated remaining time: {estimated_remaining_time:.2f} seconds")
                        print(f"Processing speed: {gems_per_second:.2f} gems/second")
                        print("-" * 50)

            # Write batch results
            for gem_name, formatted_date in batch_results:
                if formatted_date:
                    outfile.write(f"{gem_name}\t{formatted_date}\n")
            outfile.flush()
            print(f"Batch completed. Total processed: {processed}/{total_gems} ({processed/total_gems*100:.2f}%)")
            print("*" * 50)

    print(f"Filtering complete. Results saved in {output_file}")
    print(f"Total gems processed: {processed}")
    print(f"Gems included: {included}")
    print(f"Gems excluded: {excluded}")
    print(f"Gems with errors: {errors}")
    print(f"Total execution time: {time.time() - start_time:.2f} seconds")

if __name__ == "__main__":
    main()
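The `gems.txt` parsing in the Ruby script strips the parenthesized version list that `gem list --remote` appends to each name. A quick illustration with made-up lines:

```python
# `gem list --remote` emits lines like: rails (7.1.3, 7.0.8)
lines = ["rails (7.1.3, 7.0.8)", "rake (13.2.1)"]

# Splitting on " (" keeps only the gem name
names = [line.strip().split(" (")[0] for line in lines]
print(names)  # ['rails', 'rake']
```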
Review comments:

jmartin-tech: While the contents of this file are useful for testing, it should not be committed to the repository for distribution as part of the primary package.

arjun-krishna1: Sounds good @jmartin-tech, I will remove this.