Calculate the 7-day retention rate for mobile apps from a CSV dataset

source

run

assumptions

  • The dataset is small enough to fit in memory. The model loads the entire dataset and performs the calculation on all of it when analyze() is called, so that retrieving the 7-day retention rate for a specific time period afterwards is effectively instant.

    • If the data were streaming or the dataset were too big, I would design the model to load the data as a stream or in blocks, and/or perform the calculation only on the subset of the data sliced to the requested time period.
  • Filtering by OS is cross-platform: a user's history on every OS is considered when deciding whether that user is new.

    • For example, suppose we want the 7-day retention rate for 9-10-2016 filtered by iOS, and that day is the first time user X opened the app on iOS, but user X had EARLIER opened the app on 9-5-2016 on Android. In that case the model will not count user X as a new user, and user X's records will not be used to calculate the 7-day retention rate from 9-10-2016 onwards.
      • This assumption can be reversed by changing only a few lines of code, which would make the model count user X as a new user even if user X had opened the app earlier on another OS.
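
The two readings of "new user" can be sketched with pandas. This is only an illustration, not the code in retention.py: the tiny events table and the helper first_seen are hypothetical, and only the column names user_id and os_name follow the raw data in this notebook.

```python
import pandas as pd

# Hypothetical events: user X opens the app on Android first, then on iOS.
events = pd.DataFrame(
    {
        "date": pd.to_datetime(["2016-09-05", "2016-09-10", "2016-09-10"]),
        "user_id": ["X", "X", "Y"],
        "os_name": ["android", "ios", "ios"],
    }
)

def first_seen(df, per_os=False):
    """Return the first date each user (or each user/OS pair) appears."""
    keys = ["user_id", "os_name"] if per_os else ["user_id"]
    return df.groupby(keys)["date"].min()

# Cross-platform (the assumption used here): X is first seen on 2016-09-05.
cross_platform = first_seen(events)
# Per-OS alternative: X is first seen on iOS on 2016-09-10.
per_os = first_seen(events, per_os=True)

day = pd.Timestamp("2016-09-10")
on_ios = (events["date"] == day) & (events["os_name"] == "ios")
ios_users_on_day = set(events.loc[on_ios, "user_id"])

# New iOS users on that day under each reading
new_cross = {u for u in ios_users_on_day if cross_platform.loc[u] == day}
new_per_os = {u for u in ios_users_on_day if per_os.loc[(u, "ios")] == day}
```

Under the cross-platform reading only Y is new on 9-10-2016; under the per-OS reading both X and Y are.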

import the functions from the retention module

In [1]:
from retention import load_data, filter_data, analyze, interval_rate  
from retention import get_stat_df, get_all_data, get_filltered_data, plotting

Use question 1 as an example

a. What was the overall Day-7 retention over the month of September?

load the CSV data with load_data()

In [2]:
path='sample_data.csv'
load_data(path)
used: 1.47s
from  2016-09-01 to  2016-10-31
142327 records
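
If the file were too large to load at once (see the assumptions above), reading it in blocks is one option. The sketch below is not how load_data() works; the helper load_in_chunks, the inline sample rows and the "date" header name are assumptions for illustration. Repeated rows for the same user and day are dropped early, since they do not change unique-user counts.

```python
import io
import pandas as pd

# Hypothetical stand-in for sample_data.csv; the "date" header is assumed.
csv_text = """date,user_id,os_name,app_version
2016-09-01,a,android,2.5.1
2016-09-01,a,android,2.5.1
2016-09-02,b,ios,6.5.0
2016-09-03,a,android,2.5.1
"""

def load_in_chunks(source, chunksize=2):
    """Read the CSV block by block, dropping exact duplicate rows early."""
    parts = []
    for chunk in pd.read_csv(source, parse_dates=["date"], chunksize=chunksize):
        parts.append(chunk.drop_duplicates())
    combined = pd.concat(parts, ignore_index=True)
    # Duplicates can straddle chunk boundaries, so deduplicate once more.
    return combined.drop_duplicates().set_index("date")

data = load_in_chunks(io.StringIO(csv_text))
```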

Perform the calculation on the entire dataset. If a filter is needed, filter_data() should be called before analyze().

In [3]:
analyze()
used: 5.38s
during 61 days period between 2016-09-01 and 2016-10-31
The overall 7 days retention rate is 4.97%
call get_stat_df() to get details, which will return a DataFrame
call interval_rate("start_date","end_date") for 7-day retention rate of specific time period
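
For reference, a minimal sketch of how per-day counts like these could be built. It is not the code behind analyze(): the function daily_stats and the sample events are hypothetical, and "7 days after" is assumed to mean exactly the date plus 7 days.

```python
import pandas as pd

# Hypothetical events table; one row per app open.
events = pd.DataFrame(
    {
        "date": pd.to_datetime(
            ["2016-09-01", "2016-09-01", "2016-09-02", "2016-09-08", "2016-09-08"]
        ),
        "user_id": ["a", "b", "c", "a", "d"],
    }
)

def daily_stats(df):
    """Build day1 / day7 / matched counts for every calendar day."""
    first_seen = df.groupby("user_id")["date"].min()
    users_by_day = {day: set(group) for day, group in df.groupby("date")["user_id"]}
    days = pd.date_range(df["date"].min(), df["date"].max(), freq="D")
    rows = []
    for day in days:
        new_users = set(first_seen[first_seen == day].index)
        later = users_by_day.get(day + pd.Timedelta(days=7), set())
        rows.append(
            {
                "date": day,
                "day1": len(new_users),              # new users on this day
                "day7": len(later),                  # unique users 7 days later
                "matched": len(new_users & later),   # new users who came back
            }
        )
    stats = pd.DataFrame(rows).set_index("date")
    # Days without new users give 0/0, which pandas reports as NaN.
    stats["single_day_retention"] = stats["matched"] / stats["day1"]
    return stats

stats = daily_stats(events)
```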

calculate 7-day retention rate of a specified time period

In [4]:
interval_rate('9-1-16','9-30-16')
Out[4]:
0.0492223692918597
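
One plausible way to get a rate for an interval is to pool the counts: total matched users divided by total new users over the period. This is an assumption about interval_rate(), not its confirmed definition, though the Android figure later in this notebook (0.0585106...) equals 11/188, which is at least consistent with a ratio of counts. The helper interval_rate_sketch is hypothetical; the three sample rows are the first rows of the table returned by get_stat_df().

```python
import pandas as pd

def interval_rate_sketch(stats, start, end):
    """Pooled retention: total matched over total new users in [start, end]."""
    window = stats.loc[pd.Timestamp(start):pd.Timestamp(end)]
    new_users = window["day1"].sum()
    if new_users == 0:
        return float("nan")  # no new users, so the rate is undefined
    return window["matched"].sum() / new_users

# First three rows of the stats table in this notebook
stats = pd.DataFrame(
    {"day1": [154, 171, 232], "matched": [6, 5, 5]},
    index=pd.to_datetime(["2016-09-01", "2016-09-02", "2016-09-03"]),
)
rate = interval_rate_sketch(stats, "2016-09-01", "2016-09-03")  # 16 / 557
```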

Optional

Calling get_stat_df() returns the calculated DataFrame, which contains per-day information in the following columns, for reference:

  • day1 is the number of NEW users on that day
  • day7 is the number of unique (not necessarily new) users 7 days after that day
  • matched is the number of users from day1 who reopened the app 7 days after that day
In [5]:
df=get_stat_df()
df.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 61 entries, 2016-09-01 to 2016-10-31
Data columns (total 4 columns):
day1                    61 non-null int64
day7                    61 non-null int64
matched                 61 non-null int64
single_day_retention    58 non-null float64
dtypes: float64(1), int64(3)
memory usage: 2.4 KB
In [6]:
df.head()
Out[6]:
day1 day7 matched single_day_retention
date
2016-09-01 154 188 6 0.038961
2016-09-02 171 209 5 0.029240
2016-09-03 232 235 5 0.021552
2016-09-04 180 219 3 0.016667
2016-09-05 114 237 6 0.052632
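
As a quick check, single_day_retention in the rows above matches matched / day1 to the six decimals printed. The snippet only re-enters those five rows by hand; it does not call the retention module.

```python
import pandas as pd

# The five rows printed by df.head() above
head = pd.DataFrame(
    {
        "day1": [154, 171, 232, 180, 114],
        "matched": [6, 5, 5, 3, 6],
        "single_day_retention": [0.038961, 0.029240, 0.021552, 0.016667, 0.052632],
    },
    index=pd.to_datetime(
        ["2016-09-01", "2016-09-02", "2016-09-03", "2016-09-04", "2016-09-05"]
    ),
)

# Recompute the ratio and compare with the printed column
recomputed = (head["matched"] / head["day1"]).round(6)
agrees = bool((recomputed - head["single_day_retention"]).abs().max() < 1e-6)
```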

Calling get_all_data() returns the raw data used in the analysis, for reference.

In [7]:
data=get_all_data()
data.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 142327 entries, 2016-09-01 to 2016-10-31
Data columns (total 3 columns):
user_id        142327 non-null object
os_name        142327 non-null object
app_version    142327 non-null object
dtypes: object(3)
memory usage: 4.3+ MB
In [8]:
data.head()
Out[8]:
user_id os_name app_version
2016-09-01 8f07aa09-530d-4571-86fd-269cf6a255e7 android 2.5.1
2016-09-01 8f07aa09-530d-4571-86fd-269cf6a255e7 android 2.5.1
2016-09-01 8f07aa09-530d-4571-86fd-269cf6a255e7 android 2.5.1
2016-09-01 d60aae20-1c98-48e6-8b2f-72fbcb13bef5 android 2.5.1
2016-09-01 d60aae20-1c98-48e6-8b2f-72fbcb13bef5 android 2.5.1

Calling plotting() plots line charts from the DataFrame returned by get_stat_df().

  • just simple plotting, no further analysis
  • passing a NAME saves the chart as an HTML file; otherwise it is plotted in the Jupyter notebook
In [9]:
plotting()
Loading BokehJS ...

save graph as test.html

In [11]:
# plotting('test')

b. What was the Day-7 retention from September 8 through September 10 for Android users?

filter data by os 'android'

In [12]:
filter_data('android')
filtered by android
86393 records
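
A minimal sketch of the row selection such a filter implies, using the os_name and app_version columns of the raw data. The helper filter_rows and the sample rows (including version 6.4.2) are hypothetical, and the version match is assumed to be exact, so '6.5' would not match '6.5.0'. It shows row selection only, not how filter_data() interacts with the cross-platform assumption.

```python
import numpy as np
import pandas as pd

# Hypothetical rows with the same columns as the raw data
data = pd.DataFrame(
    {
        "user_id": ["a", "b", "c", "d"],
        "os_name": ["android", "ios", "ios", "android"],
        "app_version": ["2.5.1", "6.5.0", "6.4.2", "2.5.1"],
    },
    index=pd.to_datetime(["2016-09-01", "2016-09-01", "2016-09-02", "2016-09-02"]),
)

def filter_rows(df, os_name=None, app_version=None):
    """Keep rows matching the given OS and, optionally, an exact app version."""
    mask = np.ones(len(df), dtype=bool)
    if os_name is not None:
        mask &= (df["os_name"].str.lower() == os_name.lower()).to_numpy()
    if app_version is not None:
        mask &= (df["app_version"] == app_version).to_numpy()
    return df[mask]

android = filter_rows(data, "android")
ios_650 = filter_rows(data, "ios", "6.5.0")
```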
  • every time the data is filtered, the retention rates must be recalculated
    • so call analyze() after filter_data()
In [ ]:
analyze()

get 7-day retention rate of September 8 through September 10

In [13]:
interval_rate('9-8-16','9-10-16')
Out[13]:
0.05851063829787234

c. What was the Day-7 retention over the month of September for iOS users using version 6.5?

filter data first

In [14]:
filter_data('ios','6.5.0')
filtered by ios, 6.5.0
808 records

then analyze and get the 7-day retention rate for September

In [15]:
analyze()
used: 1.90s
during 61 days period between 2016-09-01 and 2016-10-31
The overall 7 days retention rate is 0.00%
call get_stat_df() to get details, which will return a DataFrame
call interval_rate("start_date","end_date") for 7-day retention rate of specific time period
In [16]:
interval_rate('9-1-16','9-30-16')
Out[16]:
0.0

why 0? examine

In [17]:
df=get_stat_df()
In [18]:
df.head()
Out[18]:
day1 day7 matched single_day_retention
date
2016-09-01 0 0 0 NaN
2016-09-02 0 0 0 NaN
2016-09-03 0 0 0 NaN
2016-09-04 0 0 0 NaN
2016-09-05 0 0 0 NaN
In [19]:
df['matched'].sum()
Out[19]:
0L

The retention rate is 0 because 0 records matched.

In [20]:
df
Out[20]:
day1 day7 matched single_day_retention
date
2016-09-01 0 0 0 NaN
2016-09-02 0 0 0 NaN
2016-09-03 0 0 0 NaN
2016-09-04 0 0 0 NaN
2016-09-05 0 0 0 NaN
2016-09-06 0 0 0 NaN
2016-09-07 0 0 0 NaN
2016-09-08 0 0 0 NaN
2016-09-09 0 0 0 NaN
2016-09-10 0 0 0 NaN
2016-09-11 0 0 0 NaN
2016-09-12 0 0 0 NaN
2016-09-13 0 0 0 NaN
2016-09-14 0 0 0 NaN
2016-09-15 0 0 0 NaN
2016-09-16 0 0 0 NaN
2016-09-17 0 0 0 NaN
2016-09-18 0 0 0 NaN
2016-09-19 0 0 0 NaN
2016-09-20 0 0 0 NaN
2016-09-21 0 0 0 NaN
2016-09-22 0 132 0 NaN
2016-09-23 0 86 0 NaN
2016-09-24 0 16 0 NaN
2016-09-25 0 9 0 NaN
2016-09-26 0 1 0 NaN
2016-09-27 0 0 0 NaN
2016-09-28 0 1 0 NaN
2016-09-29 48 0 0 0.0
2016-09-30 6 1 0 0.0
... ... ... ... ...
2016-10-02 0 1 0 NaN
2016-10-03 0 0 0 NaN
2016-10-04 0 0 0 NaN
2016-10-05 0 0 0 NaN
2016-10-06 0 0 0 NaN
2016-10-07 0 0 0 NaN
2016-10-08 0 0 0 NaN
2016-10-09 0 0 0 NaN
2016-10-10 0 0 0 NaN
2016-10-11 0 1 0 NaN
2016-10-12 0 0 0 NaN
2016-10-13 0 0 0 NaN
2016-10-14 0 0 0 NaN
2016-10-15 0 0 0 NaN
2016-10-16 0 1 0 NaN
2016-10-17 0 0 0 NaN
2016-10-18 0 0 0 NaN
2016-10-19 0 0 0 NaN
2016-10-20 0 1 0 NaN
2016-10-21 0 0 0 NaN
2016-10-22 0 0 0 NaN
2016-10-23 1 0 0 0.0
2016-10-24 0 0 0 NaN
2016-10-25 0 0 0 NaN
2016-10-26 0 0 0 NaN
2016-10-27 0 0 0 NaN
2016-10-28 0 0 0 NaN
2016-10-29 0 0 0 NaN
2016-10-30 0 0 0 NaN
2016-10-31 0 0 0 NaN

61 rows × 4 columns
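
The zero can be traced in the table: the only September days with new iOS 6.5.0 users are 2016-09-29 (48) and 2016-09-30 (6), and none of them reopened the app 7 days later. A small check with those numbers, assuming the interval rate is total matched over total new users (my assumption about interval_rate(), not a confirmed definition):

```python
import pandas as pd

# September rows of the table above: day1 is 0 except on the last two days,
# and matched is 0 everywhere.
days = pd.date_range("2016-09-01", "2016-09-30", freq="D")
sept = pd.DataFrame({"day1": 0, "matched": 0}, index=days)
sept.loc[pd.Timestamp("2016-09-29"), "day1"] = 48
sept.loc[pd.Timestamp("2016-09-30"), "day1"] = 6

# Restrict to days that actually have a cohort of new users
cohort = sept[sept["day1"] > 0]
total_new = int(cohort["day1"].sum())
total_matched = int(cohort["matched"].sum())
rate = total_matched / total_new  # 0 / 54
```

So the 0.0 reflects a small, late-September cohort with no returns on day 7, rather than missing data.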
