Description: For the time entries we can’t quite predict with confidence, can we at least predict the likelihood that the user will actually bill for that time entry?
That way we can put the important stuff at the top of their time entry lists.
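A minimal sketch of the ordering step, assuming some upstream model already assigns each time entry a probability of being billed. The `ScoredEntry` type, its `bill_probability` field, and the example scores are hypothetical and only illustrate the sort.

```python
from dataclasses import dataclass


@dataclass
class ScoredEntry:
    title: str
    bill_probability: float  # hypothetical model output in [0, 1]


def rank_for_review(entries):
    """Return entries sorted so the most likely billable ones come first."""
    return sorted(entries, key=lambda e: e.bill_probability, reverse=True)


# Illustrative scores only; no real model produced these numbers.
entries = [
    ScoredEntry("Personal email", 0.05),
    ScoredEntry("Draft deposition outline", 0.92),
    ScoredEntry("Web session: case law search", 0.61),
]
ranked = rank_for_review(entries)
```

Sorting on the probability alone keeps the list usable even when the case number or phase/task code for an entry cannot be predicted.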
Generalized Modeling + Transfer Learning:
Description: How quickly can we move away from training a separate model for every customer?
Feature Engineering:
Description: Create other metrics/features to track and collect as part of data collection.
User-level Models:
Description: Create personalized per-user models that attend to each user's characteristics.
Named Entity Recognition (NER):
Description: An accurate NER system that can be fine-tuned on the domain of legal documents.
Data Anonymization:
Description: Anonymize customer data to meet security standards.
Current Stack:
Locality-based Hash: Replace all proper nouns in the document titles and bodies.
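One way to picture the replacement step is the sketch below. It is not the current stack's implementation: it swaps in a plain keyed hash (HMAC-SHA256) rather than a locality-based one, and it assumes the list of proper nouns is supplied by some upstream step such as NER. The `pseudonymize` function, the `ENT_` token prefix, and the demo key are all hypothetical.

```python
import hashlib
import hmac
import re


def pseudonymize(text, proper_nouns, secret_key):
    """Replace each listed proper noun with a stable keyed-hash token."""

    def token_for(name):
        digest = hmac.new(
            secret_key, name.lower().encode("utf-8"), hashlib.sha256
        ).hexdigest()
        return "ENT_" + digest[:10]

    # Longest names first so "Acme Holdings" is replaced before "Acme".
    for name in sorted(proper_nouns, key=len, reverse=True):
        pattern = re.compile(r"\b" + re.escape(name) + r"\b", flags=re.IGNORECASE)
        text = pattern.sub(token_for(name), text)
    return text


key = b"demo-key-not-for-production"  # assumption: a per-tenant secret
title = "Smith v. Acme Holdings: memo for Jane Smith"
names = ["Acme Holdings", "Jane Smith", "Smith"]
anonymized = pseudonymize(title, names, key)
```

Because the token depends only on the key and the name, the same entity maps to the same token across documents, which keeps entity co-occurrence usable as a feature after the names are gone.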
Activity Segmentation:
Description: Because we collect user activity at a very fine granularity, we need to help users group several activities together into a single billable unit.
Notes:
Most clustering algos want to know how many clusters to make and don’t like leaving things out of clusters.
So this requires some creativity or research.
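As one possible starting point, the sketch below groups activities by time gaps instead of asking for a cluster count, and it is allowed to leave short stray activities ungrouped. This is a simple heuristic, not a chosen design; density-based methods such as DBSCAN have the same two properties and may be worth comparing. The function name and both thresholds (`max_gap`, `min_total`, in minutes) are assumptions.

```python
def segment_activities(intervals, max_gap=10, min_total=5):
    """Group (start, end) intervals, in minutes, into candidate billable units.

    Returns (kept, leftovers): kept is a list of groups of intervals, and
    leftovers holds intervals from groups whose total time is below min_total.
    """
    groups, current = [], []
    for start, end in sorted(intervals):
        # Start a new group when the idle gap since the last activity is large.
        if current and start - max(e for _, e in current) > max_gap:
            groups.append(current)
            current = []
        current.append((start, end))
    if current:
        groups.append(current)

    kept, leftovers = [], []
    for group in groups:
        total = sum(end - start for start, end in group)
        if total >= min_total:
            kept.append(group)
        else:
            leftovers.extend(group)  # too short to be its own billable unit
    return kept, leftovers


activity_intervals = [(0, 4), (6, 15), (60, 62), (120, 130), (133, 140)]
kept, leftovers = segment_activities(activity_intervals)
```

With these thresholds the first two and last two intervals form two groups, and the isolated two-minute activity at minute 60 is left out of every group.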
I also think that narrative generation should be driven by completion tries.
We all know the theory behind those, but when you get into it, there are lots of fun implementation details (fast serialization/deserialization, how much to send, how quickly can we update it).
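For reference, a minimal in-memory completion trie might look like the following. It ignores the implementation details mentioned above (serialization, payload size, update latency), and the class name, the frequency-based ranking, and the sample narratives are assumptions made for illustration.

```python
class CompletionTrie:
    """Character trie that suggests previously seen narratives for a prefix."""

    def __init__(self):
        self.children = {}
        self.count = 0  # how many times a phrase ending at this node was inserted

    def insert(self, phrase):
        node = self
        for ch in phrase:
            node = node.children.setdefault(ch, CompletionTrie())
        node.count += 1

    def complete(self, prefix, limit=3):
        # Walk down to the node for the prefix, if it exists.
        node = self
        for ch in prefix:
            node = node.children.get(ch)
            if node is None:
                return []
        # Collect every stored phrase under that node.
        results = []
        stack = [(node, prefix)]
        while stack:
            current, text = stack.pop()
            if current.count:
                results.append((current.count, text))
            for ch, child in current.children.items():
                stack.append((child, text + ch))
        # Most frequent first, ties broken alphabetically.
        results.sort(key=lambda item: (-item[0], item[1]))
        return [text for _, text in results[:limit]]


trie = CompletionTrie()
for narrative in [
    "Draft motion to compel",
    "Draft motion to compel",
    "Draft deposition outline",
    "Review discovery responses",
]:
    trie.insert(narrative)

suggestions = trie.complete("Draft")
```

Ranking by how often a narrative was used before is only one option; recency or per-matter weighting would slot into the same sort key.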
Activity-Type Classification:
Description: We also need a model that weeds out personal vs work emails and web sessions.
Topic Modeling (aka Document-Type Classifier):
Description: Can we build a generic document type classifier?
Notes:
Each firm will have their own templates that they use for documents they create, so those should be easy.
They will also get documents through email from other firms that would be great to identify.
I personally think that if we can figure out the document types, we can make significant progress on narratives, phase/task codes and matter categorization for users.
(Long-Term) Time-Series Analytics:
Description: In time, we expect to show law firms analytics that help them understand how their lawyers are working, and how long cases of different types take.
DATA
Input: User Activity
User Activities:
Document Type (worked on) / Activity Type (e.g. phone call)
Document Title
Time-Period of Work (Total + Work-Intervals)
First 8K of the Document
Past Behavior
Output: Timesheets
Timesheet Entry:
Case Number: i.e. related matter/subject
Phase/Task code: e.g. Discovery - Depositions (see UTBMS on Wikipedia for details)
Narrative: short summary of the (case?)
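The input and output records listed above could be represented roughly as follows. The field names, the choice of minutes as the time unit, and the reading of "8K" as 8 KiB of bytes are assumptions; "Past Behavior" is left out because its shape is not specified here, and the case number in the example is made up.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

MAX_HEAD_BYTES = 8 * 1024  # assumption: "8K" means the first 8 KiB


@dataclass
class UserActivity:
    activity_type: str  # document type, or e.g. "phone call"
    document_title: Optional[str]
    work_intervals: List[Tuple[float, float]]  # (start, end) in minutes
    document_head: bytes = b""

    def __post_init__(self):
        # Keep only the leading part of the document.
        self.document_head = self.document_head[:MAX_HEAD_BYTES]

    @property
    def total_minutes(self) -> float:
        return sum(end - start for start, end in self.work_intervals)


@dataclass
class TimesheetEntry:
    case_number: str  # related matter/subject
    phase_task_code: str
    narrative: str


activity = UserActivity(
    activity_type="document",
    document_title="Deposition outline",
    work_intervals=[(0, 30), (45, 60)],
    document_head=b"x" * 10000,
)
entry = TimesheetEntry(
    case_number="M-001",  # hypothetical identifier
    phase_task_code="Discovery - Depositions",
    narrative="Drafted deposition outline",
)
```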
Data-Collection: Engineering tracks a user's activity, and we use the data they scraped to fill in the entries.
Sources: 1 UK firm + 1 US firm
Notes:
Labeled Targets Exist (labeled time-sheets)
Input Features are limited
COMPLICATIONS
Limited Data Resources (quantity, variety)
Different Countries have different documents \(\implies\) different datasets have different targets
Notes:
The US and UK have separate phase/task code systems, so we may always have two models.
Single-Tenancy implies no data sharing
Long-term storage of client Data requires full Anonymization