Automatic Timesheet Completion Description: Making tools that make sound guesses at what a user would want to do in filling out a timesheet and do it for them
Data:
Input: User Activity
User Activities:
Document Type (worked on) / Activity Type (e.g. phone call)
Document Title
Time-Period of Work (Total + Work-Intervals)
First 8K of the Document
Past Behavior
Output: Timesheets
Timesheet Entry:
Case Number: i.e. related matter/subject
Phase/Task code: Discovery - Depositions (see UTBMS in wikipedia for details)
Narrative: short summary of the (case?)
Data-Collection: Engineering tracks a users activity and we use the data they scraped to fill in the entries
Sources: 1 UK firm + 1 US firm
Notes:
Labeled Targets Exist (labeled time-sheets)
Input Features are limited
Complications:
Limited Data Resources (quantity, variety)
Different Countries have different documents \(\implies\) different datasets have different targets: Notes: The US and UK have separate phase/task code systems, so we may always have two models.
Single-Tenancy implies no data sharing
Long-term storage of client Data requires full Anonymization
Current NER is, possibly, too “aggressive”
Does it do us any good to store endless documents and emails if every 5th word is replaced with a tag? (I have deep concerns over this.)
Tags eliminate the ability for any large transformer to track semantic relationships between subjects and objects.
Does that matter (I presume it does, but haven’t tested the effect)?
Current Stack:
Automated Time-Sheet Completion:
Model: Gradient-Boosted Models (GBMs)
Model for each data source (1 UK firm, 1 US firm)
Data Anonymization:
Model: Locality-based Hash
Replace all proper nouns in the document titles and bodies.
Desiderata:
How quickly can we get away from a model for every customer.
Anonymize customer data meeting security standards.
Create other metrics/features to track and collect as part of data collection.
Create Personalized Models per/user that attends to there characteristics
An accurate NER system that may be fine-tuned on the domain of legal documents.