Add visualization support

2025-01-22 09:22:04 -05:00 · 2018-07-01 12:04:10 -04:00 · a3ac722159
commit a3ac722159
parent 74ed3146a6
2 changed files with 67 additions and 7 deletions
--- a/README.md
+++ b/README.md
@@ -1,10 +1,11 @@
 # hangouts-parser

 This repository parses conversation data from Google Hangouts and gives
-diagnostics on the number of messages in conversations. I'm working on adding
-support for visualizations based on the parsed data for more interesting
-graphical views of the data.
-
+diagnostics on the number of messages in conversations. Two scripts are
+currently supported: `parser.py` and `visualize.py`. The parsing script parses
+raw JSON data from Google Takeout and creates pickled summary files for each
+conversation. The visualization script creates a histogram of messages over
+time using the pickled conversation summaries.

 ## Usage
 1. Clone this repository
@@ -14,13 +15,13 @@ graphical views of the data.
    + Download the data in zip format and move the `Hangouts.json` file into the `raw` folder in this repository
 3. Install dependencies via `pip`
    + `pip install -r requirements.txt`
-    + No dependencies are required for the `parser.py` script, but `visualize.py` will require the dependencies
+    + No dependencies are required for the `parser.py` script, but `visualize.py` requires the dependencies
 4. Run the parser
    + **Note:** if you did not place your hangouts data as `raw/Hangouts.json` you can specify the path to the `.json` file as an argument to the `parser.py` script via the `-f` flag
 ```bash
 python parser.py
 ```
-5. **Coming soon** Run the visualization
+5. Run the visualization
    + The `<conversation_id>` can be found in the output of the `parser.py`
      script
 ```bash
@@ -34,4 +35,4 @@ This code is freely available under the GNU Public License (GPL).
 > All of the data processing in these scripts happens locally on your computer. The data you provide to the script is **NOT** uploaded to an external server. Feel free to examine the code if you are concerned.

 ### Acknowledgements
-> This repository was inspired by [MasterScrat/Chatistics](https://github.com/MasterScrat/Chatistics). Chatistics can parse Facebook Messenger and Telegram data, but not Hangouts group messages. I originally intended to contribute to that repository and hadd Hangouts group message support, but my design drifted far from the existing design in that repository so I created a new project. Shoutout to MasterScrat for great work and thanks for the inspiration!
+> This repository was inspired by [MasterScrat/Chatistics](https://github.com/MasterScrat/Chatistics). Chatistics can parse Facebook Messenger and Telegram data, but not Hangouts group messages. I originally intended to contribute to that repository and add Hangouts group message support, but my design drifted far from the existing design in that repository so I created a new project. Shoutout to MasterScrat for great work and thanks for the inspiration!
--- a/visualize.py
+++ b/visualize.py
@@ -0,0 +1,59 @@
+#!/usr/bin/env python
+
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+import os
+import ggplot
+import pickle
+import argparse
+import datetime
+
+import pandas as pd
+
+import utils
+utils.set_log_level(1)
+
+from utils import LOG_ERROR, LOG_DEBUG, LOG_INFO, LOG_WARN
+
+def main(file_path):
+    # Validate raw data path
+    if not os.path.exists(file_path):
+        LOG_ERROR('Could not find file: {}'.format(file_path))
+        return
+
+    # Validate raw data file type
+    if not file_path.endswith('.pkl'):
+        LOG_ERROR('File path must be a pickle file')
+        return
+
+    with open(file_path, 'rb') as f:
+        LOG_INFO('Parsing pickle file: {}'.format(file_path))
+        conversation = pickle.load(f)
+
+        LOG_INFO('Found conversation: {}'.format(conversation['conversation_name']))
+
+        df = pd.DataFrame(conversation['messages'])
+        df.columns = ['Timestamp', 'Type', 'Participant']
+        # df['Datetime'] = pd.to_datetime(df['Timestamp'])
+        df['Datetime'] = df['Timestamp'].apply(lambda x:
+                datetime.datetime.fromtimestamp(float(x)).toordinal())
+
+        histogram = ggplot.ggplot(df, ggplot.aes(x='Datetime', fill='Participant')) \
+                        + ggplot.geom_histogram(alpha=0.6, binwidth=2) \
+                        + ggplot.scale_x_date(labels='%b %Y') \
+                        + ggplot.ggtitle(conversation['conversation_name']) \
+                        + ggplot.ylab('Number of messages') \
+                        + ggplot.xlab('Date')
+
+        print(histogram)
+
+if __name__ == "__main__":
+    LOG_INFO('Started script')
+    parser = argparse.ArgumentParser()
+    parser.add_argument('-f', '--file_path', required=True,
+                        type=str, help='Path to parsed data file')
+    args = parser.parse_args()
+
+    main(args.file_path)
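
Note on the data preparation in `visualize.py`: the script converts each message timestamp to an ordinal day and lets `geom_histogram(binwidth=2)` group messages per participant. The sketch below approximates that step with the standard library only, so the binning can be checked without `ggplot` or `pandas`. It rests on several assumptions that are not confirmed by the commit: each message is a 3-item record of (timestamp, type, participant), as the column names in `main` suggest; timestamps are Unix seconds; and bins are aligned to multiples of `binwidth`, which may differ from how `ggplot` places its bin edges. The helper names `to_ordinal_day` and `bin_messages` and the sample data are hypothetical. Unlike the script, the sketch converts in UTC rather than local time so that results do not depend on the machine's timezone.

```python
import datetime
from collections import Counter


def to_ordinal_day(timestamp):
    """Convert a Unix timestamp (assumed to be in seconds) to an ordinal day, in UTC."""
    moment = datetime.datetime.fromtimestamp(float(timestamp), tz=datetime.timezone.utc)
    return moment.toordinal()


def bin_messages(messages, binwidth=2):
    """Count messages per (bin start day, participant).

    `messages` is assumed to be an iterable of
    (timestamp, message_type, participant) records.
    """
    counts = Counter()
    for timestamp, _message_type, participant in messages:
        day = to_ordinal_day(timestamp)
        # Align each day to the start of its bin (a multiple of binwidth).
        bin_start = day - (day % binwidth)
        counts[(bin_start, participant)] += 1
    return counts


# Hypothetical sample data: three messages on consecutive UTC days.
sample_messages = [
    ("1530403200", "text", "Alice"),  # 2018-07-01 00:00:00 UTC
    ("1530489600", "text", "Alice"),  # 2018-07-02 00:00:00 UTC
    ("1530576000", "text", "Bob"),    # 2018-07-03 00:00:00 UTC
]

if __name__ == "__main__":
    for (bin_start, participant), count in sorted(bin_messages(sample_messages).items()):
        print(datetime.date.fromordinal(bin_start), participant, count)
```

Comparing these counts against the plotted bars is one way to sanity-check the histogram for a small conversation before running the script on a full export.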