1 Overview

The Personal Name Extractor (PNE) is a program that examines blocks of text taken from the opening pages of a document and extracts any personal names, which presumably represent the authors of the document.

This document describes the basic requirements for the PNE system.

2 Background

2.1 Metadata Extraction

Metadata is data that describes data. For our purposes, the second “data” in that statement is a document, so metadata is data that describes a document. Typical metadata fields include the title, the authors, the date of publication, the publisher, descriptive keywords, and the abstract. Large libraries rely upon such metadata to organize and permit searching of their holdings. (“Libraries” in this context are not limited to public lending libraries; the term refers to almost any large, organized collection of documents. Corporate libraries, for example, are quite common.)

Metadata is generally described in terms of named fields. In some libraries, these fields are defined in-house, e.g., the DTIC Cataloging Guidelines. More often, librarians employ a variant of the simple Dublin Core standard metadata set or the substantially more complicated Library of Congress MARC standard.

2.2 The Extract Project

The Extract project was a recent research project at ODU, designed to assist large document repositories in the process of digitizing collections of documents accumulated over years or even decades. With funding from DTIC, NASA, and the U.S. Government Printing Office (GPO), Extract sought to automatically extract metadata from digital (PDF) documents. That metadata could then be used to index those documents, permitting easy and meaningful searches through large online repositories.

The general approach taken by Extract was to have a human describe the layout of a document’s title page by specifying such rules as “The paragraph with the largest font on the 1st page has the title” or “The names of the authors appear after a centered paragraph containing only the word ‘by’.” The Extract software would use those rules to locate the desired lines or paragraphs of text. This focus on “layout” was an attempt to capture a basic intuition: a human cataloger working for DTIC or a librarian working for the GPO could locate the title, authors, date, etc., by visual cues when viewing a title page from across the room, from much too far away to actually read any of the text.

The layout approach would be easy if every document in the repository had the same layout. But these large repositories feature documents from many publishers; each publisher has multiple layouts, and many of those layouts change over the years. (The DTIC collection numbers in the millions of documents, with 30,000 to 50,000 new documents added every year. The U.S. GPO handles even more documents, though the Extract project looked only at Congressional reports and at documents from the EPA.) So Extract needed to determine which of many layouts was actually yielding the appropriate metadata and, by implication, which layout actually described the particular document being processed.

Extract would examine the text obtained via each layout to see if the text matched the kind of metadata expected. Sometimes, this would be a straightforward check. If, for example, a layout had a rule that “the upper right-hand corner of page 1 will contain the date of publication,” then it is easy enough to see whether the text appearing in that corner of the current document really is formatted as a date. Sometimes, we relied on statistical tests. For example, we might have computed the average and standard deviation of the lengths of documents’ titles in the existing collection. Then, if a layout suggested a title that had only a single word, or one that had more than 100 words, we would probably reject that title as improbable.
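As an illustration, a check of that sort might look something like the following sketch. The class name, field names, and three-standard-deviation threshold are all hypothetical, invented here for illustration; Extract’s actual code is not shown in this document.

    public class TitleLengthCheck {

        private final double meanWords;   // average title length, in words, over the collection
        private final double stdDevWords; // standard deviation of that length

        public TitleLengthCheck(double meanWords, double stdDevWords) {
            this.meanWords = meanWords;
            this.stdDevWords = stdDevWords;
        }

        /** Reject candidate titles whose length is improbably far from the collection average. */
        public boolean isPlausibleTitle(String candidate) {
            int words = candidate.trim().split("\\s+").length;
            double z = (words - this.meanWords) / this.stdDevWords;
            return Math.abs(z) <= 3.0; // e.g., keep titles within three standard deviations
        }
    }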

2.3 Name Extraction

One of the more difficult checks performed by Extract was to determine if a block of text, believed to contain a list of author names, actually contained any names and, if so, what those names were.

Some examples, with each personal name surrounded by the <PER>…</PER> tags that this system will use:

· <PER>John H. van Huffel</PER>

· <PER>John C. Ellis, II</PER>

· <PER>Pierce, Edward T.</PER>

· <PER>Jay S. Lewis</PER>, Lieutenant Commander, U.S. Navy

· <PER>K. Alan Kronstadt</PER>, Coordinator
Analyst in Asian Affairs
<PER>Bruce Vaughn</PER>
Analyst in Southeast and South Asian Affairs

· <PER>Jason S. Metcalfe</PER> (DCS); <PER>James A. Davis, Jr.</PER>; <PER>Richard A. Tauson</PER>, and <PER>Kaleb McDowell</PER> (all of ARL)

A complicating factor is that we aren’t interested in personal names that are embedded within names of places or organizations. So none of these should be treated as a personal name:

· George Washington Bridge

· Martha Washington College

· Martin Luther King Drive, Chicago

Punctuation can make a difference. “Joe Montana” is a person. “Jordan, Montana” is a city.

Line breaks can make a difference:

John Smith Advanced Research Laboratory

University of Lower Podunk

describes a location, but

John Smith

Advanced Research Laboratory

University of Lower Podunk

describes a person.

2.4 Machine Learning

The Extract team was never satisfied with their hard-coded functions for extracting names. It was not hard to lay down some basic rules for recognizing first-name-first and last-name-first patterns, but over time, a lot of exceptions were added to those basic patterns. Then exceptions were added to the exceptions, and later exceptions were added to the exceptions to the exceptions, and so on. Part of the problem is that there are so many different things one can take into consideration, e.g.: Is a word capitalized? Is it in all-caps? Are the surrounding words also capitalized or in all-caps? Is it a common word that can be found in any dictionary? Is it preceded by an honorific like “Mr.” or “Dr.” or a military rank like “Lt.” or “Major”? Does the word appear in lists of the most common names in America, or as the name of an author of an earlier document in the collection? With so many things to look at, it is hard for programmers to strike a balance among them.
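To make the problem concrete, here is a hedged sketch of how such hand-written rules tend to accumulate. The word lists, rule bodies, and exceptions are invented for illustration; they are not Extract’s actual code.

    import java.util.Set;

    public class HandCodedNameRules {

        private static final Set<String> HONORIFICS =
                Set.of("Mr.", "Mrs.", "Ms.", "Dr.", "Lt.", "Major");
        private static final Set<String> COMMON_NAMES =
                Set.of("John", "James", "Mary", "Montana"); // stand-in for a much longer list

        /** Could this word begin a personal name? */
        public static boolean looksLikeNameStart(String previousWord, String word) {
            boolean capitalized = !word.isEmpty() && Character.isUpperCase(word.charAt(0));
            boolean allCaps = !word.isEmpty() && word.equals(word.toUpperCase());
            if (HONORIFICS.contains(previousWord)) {
                return capitalized;  // basic rule: a capitalized word after "Dr." etc.
            }
            if (allCaps && word.length() <= 3) {
                return false;        // exception: short all-caps words are usually acronyms
            }
            if (COMMON_NAMES.contains(word) && previousWord.endsWith(",")) {
                return false;        // exception: a preceding comma suggests a place, as in
                                     // "Jordan, Montana" -- but this in turn misfires on
                                     // last-name-first forms like "Pierce, Edward T."
            }
            return capitalized && !allCaps;
        }
    }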

Such “brittle”, easily broken code is a good candidate for replacement by some form of machine learning, and this project may be seen as an attempt to show whether machine learning can outperform the former hand-written version.

A learning machine is a basic calculation framework that has multiple parameters. This framework takes data, of a type and format that we can choose, as input and produces a “classification”: an indication of which of several predefined classes the input belongs to. (“Classes” in the sense of groups of related things, not in the programming-language sense.) In this case, we will use blocks of text as the input and will want to classify each word in the text as either “beginning a name”, “continuing a name”, or “other”.
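Expressed in Java, those three classifications might be represented by something like the following enum (the type and constant names are illustrative, not mandated by this document):

    public enum WordClass {
        BEGIN_NAME,    // this word begins a personal name
        CONTINUE_NAME, // this word continues a name already begun
        OTHER          // this word is not part of any personal name
    }

Under this scheme, in the block “Prepared for Dr. David Chris Arney”, the word “Dr.” would be classified as BEGIN_NAME; “David”, “Chris”, and “Arney” as CONTINUE_NAME; and the remaining words as OTHER.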

These calculation frameworks are generally fairly simple, so computing the probability that a word represents part of a name would be easy enough if only we knew the proper values for the framework parameters. The real challenge is finding a set of parameter values that will accurately distinguish words that are part of a name from words that are not.

We do this by “training” the learning machine on a number of examples. (Such training has its parallel in the real world: both DTIC catalogers and GPO librarians undergo significant training in identifying the appropriate metadata fields.) The outcome of this training is the set of parameters we need to plug into our framework. We will supply a training set of sample phrases, already marked up to indicate how many names each contains and where each name begins and ends. We hope that by training a learning machine on such a set, we will wind up with a set of parameters that somehow mimic our own decision process in identifying and extracting names.
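As a hedged sketch of what that training data might look like internally (the representation is illustrative and reuses the hypothetical WordClass enum from above), each marked-up sample phrase could be reduced to a sequence of labeled words:

    import java.util.ArrayList;
    import java.util.List;

    public class TrainingExample {

        public static final class LabeledWord {
            public final String word;
            public final WordClass label;

            public LabeledWord(String word, WordClass label) {
                this.word = word;
                this.label = label;
            }
        }

        /** Convert, e.g., "Prepared for <PER>Dr. David Arney</PER>" into labeled words. */
        public static List<LabeledWord> label(String markedUp) {
            List<LabeledWord> result = new ArrayList<>();
            boolean inName = false;
            boolean atNameStart = false;
            for (String token : markedUp.split("\\s+")) {
                if (token.startsWith("<PER>")) {
                    inName = true;
                    atNameStart = true;
                    token = token.substring("<PER>".length());
                }
                boolean endsName = token.endsWith("</PER>");
                if (endsName) {
                    token = token.substring(0, token.length() - "</PER>".length());
                }
                WordClass label = !inName ? WordClass.OTHER
                        : atNameStart ? WordClass.BEGIN_NAME
                        : WordClass.CONTINUE_NAME;
                result.add(new LabeledWord(token, label));
                atNameStart = false;
                if (endsName) {
                    inName = false;
                }
            }
            return result;
        }
    }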

3 Requirements

The Personal Name Extractor system should be capable of processing a block of English-language text (up to a page in size) and locating personal names within it. The system’s accuracy should be at least 90% of that of a trained human performing the same task.
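This document does not dictate how that accuracy is to be scored. One plausible, purely illustrative scoring rule compares the set of names the system marks against the set a trained human marks on the same blocks:

    import java.util.HashSet;
    import java.util.Set;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class AccuracyCheck {

        private static final Pattern PER = Pattern.compile("<PER>(.*?)</PER>", Pattern.DOTALL);

        /** Collect the personal names marked in a block of tagged text. */
        private static Set<String> namesIn(String markedUp) {
            Set<String> names = new HashSet<>();
            Matcher m = PER.matcher(markedUp);
            while (m.find()) {
                names.add(m.group(1));
            }
            return names;
        }

        /** Fraction of the human-marked names that the system also found. */
        public static double scoreAgainstHuman(String systemOutput, String humanOutput) {
            Set<String> humanNames = namesIn(humanOutput);
            if (humanNames.isEmpty()) {
                return 1.0; // nothing to find
            }
            Set<String> systemNames = namesIn(systemOutput);
            long matched = humanNames.stream().filter(systemNames::contains).count();
            return (double) matched / humanNames.size();
        }
    }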

Because the purpose of this project is to explore the suitability of replacing the hand-written code for name extraction by a solution based on learning machines, the PNE system must be based upon machine learning.

The ultimate purpose of this project is to provide a library that can be embedded within Extract or similar projects. The public interface of this library should be quite simple:

package edu.odu.cs.cs350.namex;

public class Extractor {

    /**
     * Mark all personal names within a block of text.
     *
     * @param textBlock a block of text, possibly spanning multiple lines.
     * @return the same block of text with "<PER>" and "</PER>" tags
     *         surrounding any personal names found within that text.
     */
    public static String markPersonalNames(String textBlock) {
    }
}

For example, given the code

String input = "Name Extraction — Requirements Definition\nSteven J Zeil\nJan 20, 2016";
String markedUp = Extractor.markPersonalNames(input);
System.out.println(markedUp);

should produce the output:

Name Extraction — Requirements Definition

<PER>Steven J Zeil</PER>

Jan 20, 2016

The tag PER is one of the three traditional markups used in Named Entity Recognition (NER), of which this program is a special case. The others are LOC (for locations or place names) and ORG (for names of organizations).

The markPersonalNames function must be fast enough that, when called from a larger application, it does not appreciably slow that application even when called multiple times.
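A minimal timing harness along the following lines could be used to spot-check that requirement. The call count and the sample block are arbitrary choices made here for illustration; no specific time budget is mandated by this document.

    public class TimingCheck {
        public static void main(String[] args) {
            String sample = "Final Report\nPrepared for :\nDr. David Chris Arney\n";
            final int calls = 1000;
            long start = System.nanoTime();
            for (int i = 0; i < calls; i++) {
                Extractor.markPersonalNames(sample);
            }
            long elapsedMillis = (System.nanoTime() - start) / 1_000_000;
            System.out.println(calls + " calls took " + elapsedMillis + " ms");
        }
    }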

For testing and demonstration purposes, however, the project should also produce an executable version of the extractor. The executable will be operated from a command-line interface. (A command-line interface (CLI) accepts all input as command-line parameters. It is not an interactive interface that prompts the user to type input.)

It should read one or more blocks of text from the standard input and print the marked-up versions of those blocks on standard output. To distinguish one block of text from another, each block will be surrounded by the tags <NER> and </NER> (in both the input and output).
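A driver satisfying that description might look roughly like the sketch below. Only the <NER> framing and the call to Extractor.markPersonalNames come from this document; the class name and the block-buffering details are illustrative.

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStreamReader;

    public class Main {
        public static void main(String[] args) throws IOException {
            BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
            StringBuilder block = null;
            String line;
            while ((line = in.readLine()) != null) {
                if (line.contains("<NER>")) {
                    block = new StringBuilder(); // a new block of text begins
                }
                if (block != null) {
                    block.append(line).append('\n');
                }
                if (line.contains("</NER>")) {
                    // Strip the framing tags, mark any names, and re-wrap the block.
                    String text = block.toString()
                            .replace("<NER>", "")
                            .replace("</NER>", "")
                            .trim();
                    System.out.println("<NER>" + Extractor.markPersonalNames(text) + "</NER>");
                    block = null;
                }
            }
        }
    }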

Developers will be provided with a collection of sample text blocks drawn from application of the Extract layout processor to approximately 4,000 public documents from the DTIC collection. This collection can serve both as training data and as expected output from a system test.
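That dual role suggests a simple system test: run the extractor over the sample blocks and compare its output to the marked-up versions. The file names below are hypothetical placeholders for however the provided collection is actually packaged.

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;

    public class SystemTest {
        public static void main(String[] args) throws IOException {
            // Hypothetical file names; substitute the actual sample-collection files.
            String input = Files.readString(Path.of("sampleBlocks.txt"));
            String expected = Files.readString(Path.of("sampleBlocksMarkedUp.txt"));

            // A fuller test would process each <NER> block separately, as the CLI does.
            String actual = Extractor.markPersonalNames(input);
            System.out.println(actual.equals(expected)
                    ? "System test passed"
                    : "System test FAILED");
        }
    }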

Following completion of the project, head-to-head testing will be employed to determine whether to use the machine learning based PNE as a replacement for the current hand-written Extract name processor.

3.1 Deployment

The Personal Name Extractor will be delivered as a PNE.jar file. This JAR will include:

· Compiled code for the PNE system.

· Any necessary data files (e.g., trained learning machines).

· Compiled code from any third-party libraries required for execution of the system.

It shall not, however, include compiled versions of unit or integration test drivers, nor the libraries that support them.

For testing and demonstration purposes, however, it should be possible to run the executable version of the PNE via a “java -jar” invocation. For example, if the file inputBlocks.txt contains

<NER>(UUV) master plan. Bluefin Robotics Corporation and</NER>

<NER>Final Report

Prepared for :

Dr. David Chris Arney

US Army Research Office

Research Triangle Park, NC 27 70 9 – 22 11

Email : david.arney1@us.army.</NER>

<NER>(WHOI) Department of Applied Ocean Physics and</NER>

then the command

java -jar PNE.jar < inputBlocks.txt > markedUpOutput.txt

would place the following into markedUpOutput.txt:

<NER>(UUV) master plan. Bluefin Robotics Corporation and</NER>

<NER>Final Report

Prepared for :

<PER>Dr. David Chris Arney</PER>

US Army Research Office

Research Triangle Park, NC 27 70 9 – 22 11

Email : david.arney1@us.army.</NER>

<NER>(WHOI) Department of Applied Ocean Physics and</NER>

3.2 Additional Requirements

The system will be implemented in Java 11.

The system should run on Windows, Linux, and OS X systems equipped with an appropriate Java JRE.
