Data cleaning fighter jets

I already downloaded and cleaned a dataset once, and it was an experience I didn't want to repeat. I couldn't find a way to enter input without taking the time to learn about widgets, so I had to break a 2-layer for-loop into a series of cells that had to be run in the right order. I assigned numbers to files, and entered the number of the image that didn't belong. This is not how you want to do things.

This time I used a javascript command that captures links to the images you can see, instead of the chromedriver-based command-line script. You build your url files by entering the javascript command in your browser's console (cmd-alt-i or -j) when you see the images you want. You then run a for-loop over them, calling fastai's `download_images` function:

```python
# download dataset
for url_path in urls.ls():
    aircraft_type = url_path.name.split('.')[0]  # get class name
    print(f'downloading: {aircraft_type}')
    dest = path/aircraft_type; dest.mkdir(parents=True, exist_ok=True)  # set & create class folder
    download_images(url_path, dest)
```

where `url_path` is a pathlib path to each of your url files (.csv or .txt), assuming they're named by class.

The point of doing this on my Mac is that I can take advantage of the OS's GUI to quickly delete images I don't want. To rebuild the dataset afterwards, though, I need a list of the good links to download.

At first my dictionary kept coming back empty. I didn't want to take the time to understand python's `global` keyword, and I figured the cause was that a copy of my `url_fname_dict` was being written to in the lower-level function. I don't know per se if that's related to my issue, but it felt right enough and I didn't want to dive into multiprocessing / threading (a small standalone demo of the effect I suspect is at the end of this post). So it turns out I didn't even need a class at all.

*399 file-url maps stored when `max_workers` ≤ 1; 0 stored when > 1*

So this worked. I was stuck in the Jupyter notebook paradigm, and gained a lot by letting go. Then copy-paste the printout to a text file, and run a few regex filters to extract each filepath and its corresponding url, then put them into your dictionary. I made 4 regex filters:

```python
fail_pat = re.compile(r'Error \S+')         # split
clas_pat = re.compile(r'downloading: \S+')  # split
save_pat = re.compile(r'data/\S+')
link_pat = re.compile(r'\s-\s\S+')          # split
```

The ones with 'split' comments need part of their matched text removed afterwards, since I don't know how to do that in regex yet (a capture-group sketch at the end of this post shows one way). The dictionaries are just defaultdicts:

```python
removal_urls = defaultdict(lambda: [])
file_mapping = defaultdict(lambda: {})
```

Then build the file-url mapping, and the first batch of urls to remove:

*urls that don't download are the first additions to `removal_urls`*

Everything's finally ready for cleaning.

2.

One problem: the interface is made for smaller datasets, and there isn't a way to turn off training-set shuffling without playing with PyTorch dataloaders (sketched at the end of this post):

A step in the right direction; but this just isn't going to cut it for 10,000+ images. What would be perfect is if I could have a full screen of images to review, and if I could do so by class.
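
A few of the steps above deserve actual code. First, the dataloader bit: here's roughly what "playing with PyTorch dataloaders" means. This is a sketch assuming fastai v1 — the path, split, and transform settings are placeholders, and `train_dl.new()` is the v1 `DeviceDataLoader` method that rebuilds the underlying PyTorch `DataLoader` with new keyword args:

```python
from fastai.vision import *  # fastai v1 style import

# placeholder DataBunch; path / split / size are stand-in values
data = ImageDataBunch.from_folder(path, train='.', valid_pct=0.2,
                                  ds_tfms=get_transforms(), size=224)

# rebuild the training DataLoader with shuffling off, so batch order is
# stable and each displayed image can be traced back to its file
data.train_dl = data.train_dl.new(shuffle=False)
```

With a fixed order, an index into the training set maps to the same file on every pass, which is what a by-class, full-screen review tool would need.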
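
Second, the 399-vs-0 mystery. My guess was that a *copy* of `url_fname_dict` was being written to in a lower-level function — and that's exactly what happens when work moves from threads to worker processes. Here's a self-contained demo (plain Python, nothing fastai-specific):

```python
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor

results = {}

def record(n):
    results[n] = n * n  # writes to whichever process happens to run this

if __name__ == '__main__':
    # threads share the parent's memory: the dict fills up
    with ThreadPoolExecutor(max_workers=4) as ex:
        list(ex.map(record, range(5)))
    print(len(results))  # 5

    results.clear()

    # worker processes get a *copy* of module state: their writes
    # vanish with them when the pool shuts down
    with ProcessPoolExecutor(max_workers=4) as ex:
        list(ex.map(record, range(5)))
    print(len(results))  # 0
```

Threads share the parent's memory, so the dict fills; each worker process gets its own copy of module state, and its writes die with it. That matches the symptom: maps stored when `max_workers` ≤ 1, nothing when > 1.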
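
Third, the regexes marked `# split`: the "remove part of the matched text" step can live inside the pattern itself, via a capture group. A minimal example with `clas_pat` (the sample line is made up):

```python
import re

# parentheses capture just the class name; the 'downloading: ' prefix
# is matched but excluded from group(1)
clas_pat = re.compile(r'downloading: (\S+)')

line = 'downloading: f22_raptor'  # hypothetical log line
m = clas_pat.search(line)
if m:
    print(m.group(0))  # 'downloading: f22_raptor' -- the whole match
    print(m.group(1))  # 'f22_raptor'              -- just the capture
```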
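
Last, the log-parsing loop itself, sketched end to end. The log filename and exact line shapes here are assumptions — in particular that a saved file prints as `data/<class>/<file> - <url>`, which is what `save_pat` plus `link_pat` imply — so treat this as the shape of the loop, not the real thing:

```python
import re
from collections import defaultdict

fail_pat = re.compile(r'Error (\S+)')         # capture the failing url
clas_pat = re.compile(r'downloading: (\S+)')  # capture the class name
save_pat = re.compile(r'(data/\S+)')          # capture the saved filepath
link_pat = re.compile(r'\s-\s(\S+)')          # capture the url after ' - '

removal_urls = defaultdict(list)
file_mapping = defaultdict(dict)

current_class = None
with open('download_log.txt') as f:           # hypothetical filename
    for line in f:
        clas = clas_pat.search(line)
        if clas:
            current_class = clas.group(1)     # entering a new class section
            continue
        fail = fail_pat.search(line)
        if fail:
            removal_urls[current_class].append(fail.group(1))  # dead link
            continue
        save, link = save_pat.search(line), link_pat.search(line)
        if save and link:
            file_mapping[current_class][save.group(1)] = link.group(1)
```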
