• Home
  • Work
  • Projects
  • Blog
  • About

Ping!
Roadmap

Home Work-Space Previous
  • Tasks:
    • Automatic Timesheet Completion
      Description: Making tools that make sound guesses at what a user would want to do in filling out a timesheet and do it for them
  • Data:
    • Input: User Activity
      • User Activities:
        • Document Type (worked on) / Activity Type (e.g. phone call)
        • Document Title
        • Time-Period of Work (Total + Work-Intervals)
        • First 8K of the Document
        • Past Behavior
    • Output: Timesheets
      • Timesheet Entry:
        • Case Number: i.e. related matter/subject
        • Phase/Task code: Discovery - Depositions (see UTBMS in wikipedia for details)
        • Narrative: short summary of the (case?)
    • Data-Collection: Engineering tracks a users activity and we use the data they scraped to fill in the entries
    • Sources: 1 UK firm + 1 US firm
    • Notes:
      • Labeled Targets Exist (labeled time-sheets)
      • Input Features are limited
  • Complications:
    • Limited Data Resources (quantity, variety)
    • Different Countries have different documents \(\implies\) different datasets have different targets:
      Notes: The US and UK have separate phase/task code systems, so we may always have two models.
    • Single-Tenancy implies no data sharing
    • Long-term storage of client Data requires full Anonymization
      • Current NER is, possibly, too “aggressive”
        • Does it do us any good to store endless documents and emails if every 5th word is replaced with a tag? (I have deep concerns over this.)
        • Tags eliminate the ability for any large transformer to track semantic relationships between subjects and objects.
          Does that matter (I presume it does, but haven’t tested the effect)?
  • Current Stack:
    • Automated Time-Sheet Completion:
      • Model: Gradient-Boosted Models (GBMs)
        Model for each data source (1 UK firm, 1 US firm)
    • Data Anonymization:
      • Model: Locality-based Hash
        Replace all proper nouns in the document titles and bodies.
  • Desiderata:
    • How quickly can we get away from a model for every customer.
    • Anonymize customer data meeting security standards.
    • Create other metrics/features to track and collect as part of data collection.
    • Create Personalized Models per/user that attends to there characteristics
    • An accurate NER system that may be fine-tuned on the domain of legal documents.
Site maintained by Ahmad Badary.

© 2017. All rights reserved.

  • GitHub
  • LinkedIn
  • Facebook