Ahmad Badary

Tasks:
- Automatic Timesheet Completion
  Description: Build tools that make sound guesses about how a user would fill out a timesheet, and complete it for them
Data:
- Input: User Activity
  - User Activities:
    - Document Type (worked on) / Activity Type (e.g. phone call)
    - Document Title
    - Time-Period of Work (Total + Work-Intervals)
    - First 8K of the Document
    - Past Behavior
- Output: Timesheets
  - Timesheet Entry:
    - Case Number: i.e. related matter/subject
    - Phase/Task code: e.g. Discovery - Depositions (see UTBMS on Wikipedia for details)
    - Narrative: short summary of the (case?)
- Data-Collection: Engineering tracks a user's activity, and we use the data they scrape to fill in the entries
- Sources: 1 UK firm + 1 US firm
- Notes:
  - Labeled Targets Exist (labeled time-sheets)
  - Input Features are limited
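
A minimal sketch of how the input and output records listed above could be typed. All class and field names here are hypothetical (the notes do not name them), and "Past Behavior" is left out because the notes do not say how it is represented.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List, Optional, Tuple


@dataclass
class UserActivity:
    """One tracked activity (model input). Field names are assumptions."""
    activity_type: str                 # document type or activity type, e.g. "phone call"
    document_title: Optional[str]      # None when the activity has no document
    work_intervals: List[Tuple[datetime, datetime]] = field(default_factory=list)
    document_head: str = ""            # the first 8K of the document

    @property
    def total_seconds(self) -> float:
        """Total time worked, summed over the individual work intervals."""
        return sum(
            ((end - start).total_seconds() for start, end in self.work_intervals),
            0.0,
        )


@dataclass
class TimesheetEntry:
    """One labeled timesheet entry (model target). Field names are assumptions."""
    case_number: str       # related matter/subject
    phase_task_code: str   # phase/task code from the firm's code system
    narrative: str         # short free-text summary
```

Keeping the work intervals and deriving the total from them stores the "Total + Work-Intervals" information once, so the two cannot drift apart.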
Complications:
- Limited Data Resources (quantity, variety)
- Different Countries have different documents \(\implies\) different datasets have different targets:
  Notes: The US and UK have separate phase/task code systems, so we may always have two models.
- Single-Tenancy implies no data sharing
- Long-term storage of client Data requires full Anonymization
  - Current NER is, possibly, too “aggressive”
    - Does it do us any good to store endless documents and emails if every 5th word is replaced with a tag? (I have deep concerns over this.)
    - Tags eliminate the ability for any large transformer to track semantic relationships between subjects and objects.
      Does that matter (I presume it does, but haven’t tested the effect)?
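
One way to put a number on the "every 5th word" concern is to measure what fraction of tokens in an anonymized text are tags. The sketch below assumes tags look like `<PERSON>` or `<ORG_2>`; that format is a guess, since the notes do not show what the current tags look like, and `tag_density` is a hypothetical helper name.

```python
import re

# Assumed tag shape: an upper-case label in angle brackets, e.g. <PERSON> or <ORG_2>.
TAG_PATTERN = re.compile(r"^<[A-Z][A-Z0-9_]*>$")


def tag_density(text: str) -> float:
    """Return the fraction of whitespace-separated tokens that are tags."""
    tokens = text.split()
    if not tokens:
        return 0.0
    # Strip surrounding punctuation so "<ORG>," still counts as a tag.
    tagged = sum(
        1 for tok in tokens if TAG_PATTERN.match(tok.strip(".,;:!?()\"'"))
    )
    return tagged / len(tokens)
```

A density near 0.2 would correspond to the "every 5th word" case. Tracking this per document would show whether that is typical or an outlier.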
Current Stack:
- Automated Time-Sheet Completion:
  - Model: Gradient-Boosted Models (GBMs)
    Model for each data source (1 UK firm, 1 US firm)
- Data Anonymization:
  - Model: Locality-based Hash
    Replace all proper nouns in the document titles and bodies.
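
To make the replacement step concrete, here is a minimal sketch of hash-based pseudonymization. It is not the current implementation: it swaps in a plain keyed hash (HMAC-SHA256) rather than a locality-based hash, and it uses a naive capitalized-word regex as a stand-in for proper-noun detection. The function name, tag format, and key handling are all assumptions.

```python
import hashlib
import hmac
import re

# Naive stand-in for a proper-noun detector: capitalized words of 2+ letters.
# It also catches ordinary sentence-initial words; a real system would use NER.
PROPER_NOUN = re.compile(r"\b[A-Z][a-z]+\b")


def pseudonymize(text: str, key: bytes) -> str:
    """Replace each capitalized word with a short keyed-hash tag.

    The same word and key always produce the same tag, so repeated
    mentions of one name stay linked after anonymization.
    """
    def replace(match: re.Match) -> str:
        digest = hmac.new(
            key, match.group(0).encode("utf-8"), hashlib.sha256
        ).hexdigest()
        return f"<ENT_{digest[:8]}>"  # hypothetical tag format

    return PROPER_NOUN.sub(replace, text)
```

Because the tags are deterministic, a model can still tell that two mentions refer to the same entity, which may recover some of the subject/object tracking lost to generic tags. Using a separate key per customer would keep tags from being comparable across tenants.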
Desiderata:
- How quickly can we move away from having a separate model for every customer?
- Anonymize customer data in a way that meets security standards.
- Create other metrics/features to track and collect as part of data collection.
- Create personalized per-user models that attend to each user's characteristics
- An accurate NER system that may be fine-tuned on the domain of legal documents.

Ping! Roadmap
