Importing Files Into Greenstone

 

Return to Main Project Page

Introduction:

This webpage shows, step by step, how to get from a "list" to a basic Greenstone collection of records, with their metadata, ready to build a digital library/archive.  If you have landed on this page, are new to Greenstone, and are unsure why you would want to import a list, click on the Return to Main Project Page link above. If you have immediate questions or comments, please email me, Bob Schmitt.

This suggested process assumes you have some familiarity with Greenstone and Excel.  The Greenstone wiki has links to both a tutorial and the lesson plans for three- to five-day workshops.  All are very good.  I also highly recommend the very complete book, "How to Build a Digital Library" by Witten, Bainbridge and Nichols.  It contains excellent background and considerations for a digital library, as well as a great Greenstone tutorial and reference.

The examples on this webpage are Excel files with real, working data actually imported into Greenstone.  A few were done many times, until I re-read the manual or learned from mistakes.

This process also assumes you have some knowledge of library classification techniques.  If not, do a little research on this topic or talk to an expert.  One of the official Dublin Core websites is a good starting point, especially topic 4, "Elements" at the bottom of that webpage.  Basic categories such as "Title", "Description", and "Subject and Keyword" can be confusing and tedious to correct after an import if you are dealing with many records.  It's best to get it (mostly) right in the beginning.

However:

"The Perfect is the enemy of the Good (enough)"  - Voltaire

So don't hesitate too long to get started!

Why Do This?

If you have a collection of objects - cars, paintings, books, photographs - and have been reasonably organized, you probably have a list (inventory) of these objects and some of their characteristics: "1948 Oldsmobile 88, green, bought in 1985 from Ted Smith, loaned to the Oakville Museum, call Fred Friday for status" or "Snowy Winter, oil on canvas, by Patricia Jones, 26"x18", bought in 1997 for $175".  If you are a bit more organized, this list and its attributes may now be on a computer, in some type of database.  But what do you do with the documents about the restoration of your 1948 Oldsmobile or the hundreds of photos from various car shows?  You can put linked references - or sometimes the actual document - on your PC.  But when the documents grow to hundreds or thousands and you want direct, repeatable access to any type of digital file related to your collection, a software program that goes beyond databases or photo organizers is recommended: Greenstone Digital Library software.  The cost is right - it's open-source (free to use).  And it's not just for "libraries"; it is applicable to collections, archives, libraries and museums.

When you convert your (Excel) list to Greenstone, the attributes of each object become metadata in categories you can use for accessing or classifying your current collection and future acquisitions.  Further, if you use Greenstone's metadata sets, these classifications will be recognized and can be accessible to a wider audience.  Your collection on Greenstone can remain private on your computer or home network, or be made accessible on the Internet.

Lists, Databases and Excel

A good example of a "list" is a simple contact list of names, addresses and other personal or business information.  Such a list would look like this:

Record No | First Name | Last Name | Street | City | State | Comments
01 | Tom | Jones | 122 Maple Lane | Elm Gardens | CA | Tom is the lead salesperson from the Acme Chemical company and should be a first contact.
02 | Sue | Scott | 432 Apple Avenue | Scottsdale | Arizona | Sue is the CFO of Acme Chemical company and will make final decisions on all contracts

Lists such as the example above have been hand-written and typed for generations, but computers allow us to put this data into Excel (or Word) tables or more advanced programs, such as Access.  For a useful table (or list) to be used as a "database", some rules must be followed:

1.  Each line is a single Record, with data on the one person, item or other entity described on that line.

2.  Each column is headed by a descriptor that is termed a "Field" and all the data in that column has similar characteristics.

Note that in the example above, Record 01 uses "CA", the standard abbreviation for "California", whereas Record 02 spells out the full state name, "Arizona".  Mixing the two forms in one column is not good database practice.
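
If the list is already in electronic form, this kind of variation could also be corrected with a short script.  The following is only a minimal sketch in Python, assuming the records have been loaded as dictionaries; the lookup table STATE_ABBREVIATIONS and the function normalize_state are hypothetical names made up for this illustration.

```python
# Hypothetical lookup table; extend it with every spelling found in your list.
STATE_ABBREVIATIONS = {
    "california": "CA",
    "arizona": "AZ",
}

def normalize_state(value):
    """Return the two-letter abbreviation for a state name, if known."""
    cleaned = value.strip()
    # Values that are already two letters are only upper-cased.
    if len(cleaned) == 2:
        return cleaned.upper()
    # Unknown names are returned unchanged so they can be reviewed by hand.
    return STATE_ABBREVIATIONS.get(cleaned.lower(), cleaned)

# The two records from the contact list above (other fields omitted).
records = [
    {"Record No": "01", "City": "Elm Gardens", "State": "CA"},
    {"Record No": "02", "City": "Scottsdale", "State": "Arizona"},
]

for record in records:
    record["State"] = normalize_state(record["State"])

print([record["State"] for record in records])
```

Returning unknown values unchanged, rather than guessing, leaves anything unexpected visible for a manual check.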

We highly recommend, and this webpage will use, Excel as the table/list/database for gathering records and data for import into Greenstone.  If you have typed lists, investigate using OCR to convert them to Excel files.  Or find a good typist familiar with Excel!  If your tables are in Word or another program, they usually can be copied into or imported into Excel.

Excel provides many functions to help you review and clean up your data.  Sorting, copying, pasting and moving cells of data will speed up the work of making your data uniform and consistent with good database practices.

For more information on using Excel as a database, see Using Excel As A Database or any Excel book.  

Let's look at another example of records and a database, closer to data we may want to bring into Greenstone:

Record No | car.Manufacturer | car.Make | car.Year | car.Serial_No | Engine No | Engine Type | car.Model
01 | AFN Ltd. | Frazer Nash | 1948 | 420/E1 | - | | 2-seater Special
02 | AFN Ltd. | Frazer Nash | 1949 | 421/100/003 | 1051 | 85-Series Bristol | Two-seater
03 | AFN Ltd. | Frazer Nash | 1948 | 421/100/004 | 1055 | 85-Series Bristol | Competition two-seater
04 | AFN Ltd. | Frazer Nash | 1950 | 421/100/005 | FNS 1/14 | FNS-Series Bristol | Fast Tourer
05 | AFN Ltd. | Frazer Nash | 1948 | 421/100/006 | 1053 | 85-Series Bristol | High Speed
06 | AFN Ltd. | Frazer Nash | 1949 | 421/100/007 | FNS 1/2 | FNS-Series Bristol | High Speed
07 | AFN Ltd. | Frazer Nash | 1949 | 421/100/008 | FNS 1/3 | FNS-Series Bristol | High Speed
08 | AFN Ltd. | Frazer Nash | 1950 | 421/100/109 | FNS 1/11 | FNS-Series Bristol | Le Mans Replica

This is another example we will use to show how an archive catalog can be imported into Greenstone:

Category | ID | Cat ref: | QTY | Title | Pages | Description | Size | Date | Condition | Est Value | Total Value | Source
Sales-Promo | 0001 | RAN/1 | 1 | 2 Litre cars 400, 401, 402 | 4 page folder | colour illustrations of 400,401,402 + power unit | 8.5 x 13 | 1950 | A | £120 | £120 | Gift of R. Smith
Sales-Promo | 0002 | RAN/2 | 1 | 2 litre cars 400, 401, 402 | 4 page folder | chassis & engine description | 8.5 x 13 | 1950 | B, tears | £30 | £30 |
Sales-Promo | 0003 | RAN/3 | 1 | Brigand Beaufort, Beaufighter & Britannia Sales-Promo | 4 page folder | range description b & w photos | 8 X 11.5 | | A | £5 | £5 |
Sales-Promo | 0004 | RAN/4 | 2 | Beaufighter, Brigand & Britannia | 8 page folder | range description b & w & colour photos | 8 x 11.5 | | A | £15 | £30 |

Finally, without providing much detail at this stage, here are two examples of field names (column headings) that have been used in Excel files to record data on cars (vehicles) and their owners, in separate files:

Vehicle File    Owner File
CarNo    OwnNo
OwnerNo    Last Name
Year    First name
Manufacturer    Salutation
Make    Address1
Model    Address2
ChassisNo    City
EngineNo    District/State
RegNo    Postal Code
Former_RegNo    Country
DeliveryDate    Email Address
OriginalColor    Home Phone
OriginalRegNo    Work Phone
CurrentLocation    Mobile
Remarks    Fax
      Remarks

Create an Excel File/Database

Using the examples above, create or check your Excel file to ensure all the data for each record is on a single line, "like kind" data is in each column, variations in each data item (spelling and abbreviations) have been made uniform, and blank lines have been eliminated.  Blank cells are OK.
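
For a long list, some of these checks can be sketched in a few lines of Python once the sheet has been saved as a CSV file.  This is only an illustration under stated assumptions: the sample data is made up, and review_rows is a hypothetical helper name.  It reports blank rows and counts the distinct values in each column, so that spelling variations such as "High Speed" and "High speed" stand out.

```python
import csv
import io
from collections import Counter

# Made-up sample standing in for an Excel sheet saved as CSV.
SAMPLE_CSV = """car.Make,car.Year,car.Model
Frazer Nash,1949,High Speed
,,
Frazer Nash,1949,High speed
Frazer Nash,1950,Le Mans Replica
"""

def review_rows(csv_text):
    """Return (row numbers of blank rows, value counts per column)."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    blank_rows = []
    counts = {name: Counter() for name in header}
    # Row 1 is the header, so data rows start at row 2.
    for row_number, row in enumerate(reader, start=2):
        if not any(cell.strip() for cell in row):
            blank_rows.append(row_number)
            continue
        for name, cell in zip(header, row):
            counts[name][cell.strip()] += 1
    return blank_rows, counts

blank_rows, counts = review_rows(SAMPLE_CSV)
print("Blank rows:", blank_rows)
for name, counter in counts.items():
    print(name, dict(counter))
```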

Dates often vary greatly.  They can be either "text" or one of Excel's date formats, and the two can look exactly alike.  A good method to fix dates is to sort the entire file on the "date" field.  Dates stored as text will be grouped together, apart from the true dates (in an ascending sort Excel places text after numbers and dates; in a descending sort, before them), and should be corrected to one of the Excel formats for dates.
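
If the dates have already been exported as text, the clean-up can also be sketched outside Excel.  The Python example below is a minimal sketch, not a recommended standard: the list KNOWN_FORMATS and the function to_iso_date are hypothetical, and the day-first reading of "26/04/2012" is an assumption you would need to confirm for your own data.

```python
from datetime import datetime

# Assumed input formats; list the ones that actually occur in your file.
KNOWN_FORMATS = ["%Y-%m-%d", "%d/%m/%Y", "%B %d, %Y", "%Y"]

def to_iso_date(text):
    """Return the date as YYYY-MM-DD (or YYYY for a bare year), else None."""
    cleaned = text.strip()
    for date_format in KNOWN_FORMATS:
        try:
            parsed = datetime.strptime(cleaned, date_format)
        except ValueError:
            continue
        # A bare year stays a bare year rather than gaining a false month/day.
        if date_format == "%Y":
            return f"{parsed.year:04d}"
        return parsed.strftime("%Y-%m-%d")
    # Anything unrecognised is flagged with None for manual correction.
    return None

for value in ["1950", "26/04/2012", "April 26, 2012", "circa 1948"]:
    print(repr(value), "->", to_iso_date(value))
```

Values that match none of the formats come back as None, which makes them easy to list and fix by hand.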

Your file can have a few records - probably best for initial trials - or thousands of records.  Greenstone seems to import very quickly!

Finally, create a new first column in your Excel file with a name such as "RecordID" - the exact field name is not critical.  The data in this column should be a name that means something to you plus a number, so that each record has a unique identifier.  For the Frazer Nash car file above, this would be something like "HighSpeed05", "HighSpeed06", and "LeMansReplica08".  For the archive file above, this would be "Sales-Promo001".  If you have many records, this can be tedious to do manually, so use Excel's fill feature to create a list of consecutive numbers, format the numbers into a standard format (i.e. "01", "02") and then use the "Concatenate" function to combine this number field with a data item from a different column.  This column of data will be very useful later in both Greenstone and Access!
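
The same "name plus padded number" idea can be sketched in Python for anyone preparing the file outside Excel.  The function make_record_id is a hypothetical helper written for this illustration; the model names are taken from records 05, 06 and 08 of the car table above.

```python
def make_record_id(label, number, width=2):
    """Combine a label and a zero-padded number, e.g. 'HighSpeed05'."""
    # Remove spaces so the identifier is a single word.
    compact_label = label.replace(" ", "")
    return f"{compact_label}{number:0{width}d}"

# Record numbers and model names from the car table.
models = {5: "High Speed", 6: "High Speed", 8: "Le Mans Replica"}

record_ids = [make_record_id(model, number) for number, model in models.items()]
print(record_ids)  # ['HighSpeed05', 'HighSpeed06', 'LeMansReplica08']

# Every identifier must be unique.
assert len(record_ids) == len(set(record_ids))
```

Because the number comes from the record number, two records with the same model name still receive different identifiers.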

Import/Backup to Access

Consider importing your Excel file(s) into Access.  Your data will be more secure (harder to delete inadvertently) and Access will give you excellent reporting (print or online) and query abilities.  Files (tables) can be linked together to make a potentially very powerful and useful relational database.  Examples of Access databases for car collections, linking to car owners, events and other historical data can be found on this related webpage.  The "RecordID" field you created in the previous step can become a key index.
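
To illustrate what "linking" two tables on a key field means, here is a minimal Python sketch.  It is not Access and does not use any Access feature; the rows are made-up samples, the field names come from the Vehicle and Owner files listed earlier, and link_vehicles_to_owners is a hypothetical function name.

```python
# Made-up sample rows using field names from the Vehicle and Owner files.
owners = [
    {"OwnNo": "1", "Last Name": "Jones", "First name": "Tom"},
    {"OwnNo": "2", "Last Name": "Scott", "First name": "Sue"},
]
vehicles = [
    {"CarNo": "05", "OwnerNo": "2", "Model": "High Speed"},
    {"CarNo": "08", "OwnerNo": "1", "Model": "Le Mans Replica"},
    {"CarNo": "09", "OwnerNo": "7", "Model": "Fast Tourer"},
]

def link_vehicles_to_owners(vehicles, owners):
    """Attach the matching owner record (or None) to each vehicle."""
    # Index the owners by their key field for direct lookup.
    owners_by_number = {owner["OwnNo"]: owner for owner in owners}
    return [
        {**vehicle, "Owner": owners_by_number.get(vehicle["OwnerNo"])}
        for vehicle in vehicles
    ]

for linked in link_vehicles_to_owners(vehicles, owners):
    owner = linked["Owner"]
    name = owner["Last Name"] if owner else "(no owner record)"
    print(linked["CarNo"], linked["Model"], "-", name)
```

The vehicle whose OwnerNo has no match shows why unique, consistent key values matter before any tables are linked.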

Review Standard Greenstone Metadata Categories

Assuming you have made at least an initial exploration of Greenstone, you should be familiar with the "Dublin Core Metadata Standard", which is the basic classification scheme used in Greenstone and widely recognized by digital libraries and other resources (including web pages).

The Dublin Core basically consists of these elements:

  1. Title

  2. Creator

  3. Subject

  4. Description

  5. Publisher

  6. Contributor

  7. Date

  8. Type

  9. Format

  10. Identifier

  11. Source

  12. Language

  13. Relation

  14. Coverage

  15. Rights

In Greenstone, each Dublin Core element is prefixed with "dc.", so they appear as dc.Title, dc.Creator, etc. Because these elements are widely accepted and recognized, it is a good idea to match your field names to the Dublin Core elements, insofar as that is possible.

For our example of an archive file, we will use this mapping:

Archive File   Dublin Core Metadata
Title = dc.Title
Category = dc.Subject
Description = dc.Description
Date = dc.Date
ID = dc.Type
Cat ref: = dc.Identifier
Source = dc.Source
     dc.Format
     dc.Creator
     dc.Publisher
     dc.Contributor
     dc.Language
     dc.Relation
     dc.Coverage
     dc.Rights
QTY = item.Quantity
Size = item.Size
Pages = item.Pages
Condition = item.Condition
Est Value = item.Value
Total Value = item.TotalValue

Note that not all Dublin Core metatags are mapped from the file scheduled to be imported; unused metatags can be added after the import as needed, directly in Greenstone.  Note also that new "item" metatags have appeared.  These should be added to Greenstone before the import.  This process is described below.

Other metatags will also be added, such as "car.Make", "car.Model", etc., specifically because the data of this archive is from a car company and a car club.

The "mapping" is very easy - just rename your column headings to the relevant Dublin Core element or a new metatag you plan to add to Greenstone.
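
If the file has already been saved as CSV, the renaming could also be done with a short script rather than by hand.  The following Python sketch is one possible approach, not part of Greenstone; COLUMN_MAP shows only part of the mapping above, and rename_headings is a hypothetical function name.

```python
import csv
import io

# Mapping taken from the table above; only some columns are shown here.
COLUMN_MAP = {
    "Title": "dc.Title",
    "Category": "dc.Subject",
    "Description": "dc.Description",
    "Date": "dc.Date",
    "QTY": "item.Quantity",
}

def rename_headings(csv_text, column_map):
    """Return the CSV text with its first row renamed using column_map."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    # Headings without a mapping are kept unchanged.
    rows[0] = [column_map.get(name.strip(), name) for name in rows[0]]
    output = io.StringIO()
    csv.writer(output, lineterminator="\n").writerows(rows)
    return output.getvalue()

sample = 'Category,QTY,Title,Date\nSales-Promo,1,"2 Litre cars 400, 401, 402",1950\n'
print(rename_headings(sample, COLUMN_MAP))
```

Only the heading row changes; the data rows, including titles that contain commas, are written back as they were read.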

Adding Metatags to Greenstone

There is a Greenstone tutorial which describes how to add new metadata elements with the Metadata Set Editor.  Either use this approach or click on the "Manage Metadata Sets" box in the lower left when you are in Greenstone's "Enrich" panel.

For the imports of the Excel example files shown above, two metadata sets were created: one for "car.XXX" and one for "item.XXX".

"Exploding" (Importing) Your Database

If you have worked with Greenstone, you know that "importing" an Excel file/database is very easy: in the "Gather" panel, just drag the file across from the Local Filespace on the left to the Collection panel on the right.  You should do this as the first record in your new collection.  When the collection is set up, this will give you a searchable file containing all your records.

But the purpose of the process on this webpage is to create hundreds of empty ("nul") records, each carrying its data elements as metadata, ready to be joined to a document, photo or other digital item.  For that, the database must be "exploded"!

A Greenstone tutorial explains this - on that page, follow onwards from step 15.

An Excel file cannot be "exploded", but such a file is easily saved to a "comma-delimited" (csv) file.  

  1. Open the Excel file you want to import, choose "Save As" and save it in the "csv" format.

  2. With Greenstone open and in the "Gather" panel, drag the "csv" file into your Collection panel.

  3. Right click on the file and choose "Explode metadata database"

  4. When the "Explode Metadata Database" menu/panel appears, uncheck the box next to "metadata_set"

  5. Check the box for "document_field" and type in the exact field name of your first column, which you created above.  This should be something like "RecordID"

  6. Click "Explode"

  7. When the "import" is finished, a panel "Merging Action Required" appears.  

  8. Look at the first "Source metadata element", select a "Target metadata set" such as "Dublin Core", "Item" or "Car", select a "Target metadata element" and click "Merge"

  9. Greenstone will continue to go through each field name - you make choices and either Merge or Ignore until the end of the field names.

  10. You will then see new folders in the Collection panel, with names derived from the name of your imported file.  Each folder will have 100 (or fewer) records.

  11. Look at any record in the "Enrich" panel.  You should have the correct metadata for that record in each metadata element.

  12. The new folders created by the import process can be renamed (or you can create new ones), for example "Books", "Newsletters", etc., and the records can be moved into the new or renamed folders.

  13. If you have newly scanned documents, existing photos or any digital asset, they can be dragged into the appropriate folder and each can be matched with its correct metadata.  

  14. You can either select the metadata from the existing imported metadata for the new photo/document or you can replace the "nul" record with the actual document/photo.  In either the Gather or Enrich panels, just right click the "nul" record, select "Replace" and browse for the actual document/photo and confirm the replacement.
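
Before exploding a large file, it may be worth checking the "csv" file once more.  The Python sketch below is an assumption-laden illustration, not a Greenstone tool: check_before_explode is a hypothetical function, the sample titles are made up, and the folder estimate simply applies the "100 (or fewer) records per folder" observation from step 10, assuming folders are filled in order.

```python
import csv
import io
import math
from collections import Counter

def check_before_explode(csv_text, id_field="RecordID", folder_size=100):
    """Summarise a CSV file: record count, missing/duplicate IDs, folders."""
    reader = csv.DictReader(io.StringIO(csv_text))
    if id_field not in (reader.fieldnames or []):
        raise ValueError(f"Column {id_field!r} not found in the header row")
    # Short rows yield None for missing cells, so fall back to "".
    ids = [(row[id_field] or "").strip() for row in reader]
    duplicates = sorted(
        value for value, count in Counter(ids).items() if count > 1 and value
    )
    return {
        "records": len(ids),
        "missing_ids": ids.count(""),
        "duplicate_ids": duplicates,
        # Estimate only: assumes each folder is filled to folder_size in turn.
        "expected_folders": math.ceil(len(ids) / folder_size),
    }

sample = (
    "RecordID,dc.Title\n"
    "Sales-Promo001,Folder A\n"
    "Sales-Promo002,Folder B\n"
    "Sales-Promo002,Folder C\n"
)
print(check_before_explode(sample))
```

A duplicate or missing identifier found here is far easier to fix in the spreadsheet than after hundreds of records have been created.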

Your work to complete your new collection has only just begun!

If you would like specific help with your Excel file, send it to me by email (all or part) and I'll send you suggestions or make the actual import to a Greenstone collection.  All will be volunteer work, until I feel fully qualified to charge for services!

Email me with any questions!  Bob Schmitt, rgschmitt@gmail.com


April 26, 2012