Last week, 1,443 Electronic Theses and Dissertations (ETDs) were uploaded to CORE Scholar. We had talked about adding the ETDs for nearly a year, and decided Summer 2016 was the time for this massive undertaking. Fears of having to download each paper from the OhioLINK ETD Center permeated our thoughts and pushed us to find a way to automate the work. I know several schools are eyeing their ETDs for inclusion in their repositories, and I hope this short explanation of our process can help those who can’t quite figure out how to undertake such a giant project. Please note, however, that these instructions are Ohio-centric and specific to a Digital Commons repository. Below is our process:
Step 1 – Request the Metadata
Request the metadata from OhioLINK. It gets sent to you as a massive spreadsheet. There are probably fields you don’t need, and several fields you’ll have to play with to fit your repository’s schema.
Step 2 – Tests
Bepress provides a demo site/sandbox for all Digital Commons users. We were able to figure out a few things using our Demo Site. First, uploading one record at a time, besides being incredibly laborious, would not allow us to simply import the record from the OhioLINK site. However, we were able to import the ETD via a batch upload spreadsheet. This led to our second breakthrough.
Each ETD has a unique identifier, called the Accession Number, which was included in the metadata given to us by OhioLINK. Adding “https://etd.ohiolink.edu/!etd.send_file?accession=” before the Accession Number and “&disposition=inline” after it produces the direct URL to the document. Luckily, as I said, the batch upload tool in Digital Commons allows for the import of the document. By adding one column containing the part before the Accession Number and another containing the part after it, we were able to merge all three cells to create the direct link to each document. In other words, we did not have to download each ETD. We simply needed to merge three columns in Excel. That’s it.
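For anyone who would rather script the column merge than do it in Excel, here is a minimal Python sketch of the same idea. The prefix and suffix strings come straight from the step above; the function name `etd_url` and the sample accession number are made up for illustration and are not real OhioLINK values.

```python
# The two fixed pieces of the URL described above.
PREFIX = "https://etd.ohiolink.edu/!etd.send_file?accession="
SUFFIX = "&disposition=inline"


def etd_url(accession: str) -> str:
    """Build the direct document URL for one Accession Number.

    Mirrors merging three spreadsheet cells: prefix + accession + suffix.
    Stray whitespace from the spreadsheet cell is trimmed first.
    """
    return f"{PREFIX}{accession.strip()}{SUFFIX}"


if __name__ == "__main__":
    # Hypothetical accession number, used only to show the shape of the result.
    print(etd_url("wright0000000000"))
```

The result would go into whichever column of the batch upload spreadsheet holds the document link, just like the merged Excel cell.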
Step 3 – Metadata cleanup
This past spring, a volunteer came to us seeking experience working with Institutional Repositories. She was a huge help in the metadata cleanup process: capitalizing titles, merging columns, and adding consistent keywords, disciplines, department names, etc. We downloaded ASAP Utilities, a wonderful add-in for Microsoft Excel, which helped her accomplish a large chunk of the cleanup quickly.
Our volunteer left in April, putting this project on hold until June. In the meantime, we investigated OpenRefine. Using OpenRefine for this project was a lifesaver. It detects duplicates, clusters information to find inconsistencies, has a plethora of faceting and editing tools, and is an all-around powerful tool for data cleanup. I’m nowhere near an expert, and have yet to use all of OpenRefine’s tools, but what I did use was incredibly helpful for both the speed and the consistency of the cleanup.
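To give a feel for what clustering does, here is a rough Python sketch in the spirit of OpenRefine's "fingerprint" key-collision method. It is a simplified approximation, not OpenRefine's own code, and the department names are invented examples. Each value is boiled down to a normalized key, and values that share a key are grouped as likely variants of the same thing.

```python
import re
import unicodedata
from collections import defaultdict


def fingerprint(value: str) -> str:
    """Reduce a string to a normalized key (simplified fingerprint)."""
    # Trim, lowercase, and split accented characters into base + mark.
    text = unicodedata.normalize("NFKD", value.strip().lower())
    # Drop anything that is not plain ASCII (this removes the accent marks).
    text = text.encode("ascii", "ignore").decode("ascii")
    # Remove punctuation, keeping letters, digits, and whitespace.
    text = re.sub(r"[^\w\s]", "", text)
    # Sort the unique tokens so word order no longer matters.
    return " ".join(sorted(set(text.split())))


def cluster(values):
    """Group values whose fingerprints collide; return groups of 2 or more."""
    groups = defaultdict(set)
    for value in values:
        groups[fingerprint(value)].add(value)
    return [sorted(group) for group in groups.values() if len(group) > 1]


if __name__ == "__main__":
    # Invented department names with typical inconsistencies.
    departments = [
        "Department of Psychology",
        "department of psychology",
        "Psychology, Department of",
        "Biology",
    ]
    for group in cluster(departments):
        print(group)
```

Each group it prints is a set of spellings that probably should be merged into one consistent value, which is the same decision OpenRefine asks you to make in its clustering dialog.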
Step 4 – Uploading
It was time to actually upload our ETDs. Early on, we had decided that our ETDs would go into one container, and would then be sorted by department into smaller collections. This made the upload process that much easier.
I ran into two snags during the upload process. First, DON’T TRY TO UPLOAD 1,443 RECORDS AT ONCE. That is a terrible idea. After learning my lesson, I uploaded the records in batches of 100. This helped speed up the process, and if I got an error, it was much easier to find and fix. Second, Digital Commons does not allow HTML entities in the Title or Abstract fields in a batch upload. And we had A LOT of HTML entities. Thankfully, find and replace took care of most of that cleanup.
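Both snags can also be handled with a few lines of Python before the spreadsheet ever reaches Digital Commons. The sketch below is only an illustration: the field names `title` and `abstract` and the record layout are assumptions, so adjust them to match your own batch upload spreadsheet. It converts HTML entities back to plain characters with the standard library's `html.unescape` and splits the records into batches of 100.

```python
import html


def unescape_fields(record, fields=("title", "abstract")):
    """Return a copy of the record with HTML entities decoded.

    The field names are assumed; use your spreadsheet's column names.
    """
    cleaned = dict(record)
    for field in fields:
        if isinstance(cleaned.get(field), str):
            cleaned[field] = html.unescape(cleaned[field])
    return cleaned


def batches(records, size=100):
    """Yield consecutive slices of at most `size` records."""
    for start in range(0, len(records), size):
        yield records[start:start + size]


if __name__ == "__main__":
    # Made-up records standing in for the rows of the metadata spreadsheet.
    records = [
        {"title": f"Thesis {i} &amp; Results", "abstract": "&quot;Sample&quot;"}
        for i in range(1443)
    ]
    cleaned = [unescape_fields(r) for r in records]
    for number, batch in enumerate(batches(cleaned), start=1):
        print(f"Batch {number}: {len(batch)} records")
```

Each batch could then be written out as its own spreadsheet, so an error in one upload only sends you back to 100 rows instead of all of them.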
I am confident that adding our Theses and Dissertations to CORE Scholar will increase the visibility of the ETDs and help shine the spotlight on the unique research being performed at Wright State University.
If you have any questions about our workflow, CORE Scholar, the ETDs, etc., please don’t hesitate to contact me.