Exploring Relationships with Machine Learning
By Oscar Ko
Research Questions:
- Do couples that meet on dating apps have higher or lower quality relationships?
- Can any features in this dataset help predict how a subject would rate their relationship quality?
- What insights can I derive from using machine learning for exploratory analysis?
This dataset was taken from a Stanford University survey called "How Couples Meet and Stay Together 2017." It contains many variables, including age, education level, political party, income, usage of dating apps, and quality of relationship (as self-rated by the subject). To prepare this data for analysis, I renamed and recoded a few features.
For example, one notable change I made was to the relationship quality variable. It was originally on a scale from 1 to 5, with 1 being "Excellent," 2 being "Good," and 5 being "Very Poor." I created a new binary variable that specifies only whether a subject rated the relationship as "good" or "not good." Couples rated as "Good" or "Excellent" in the original outcome variable were categorized as "good" in the binary outcome. For the regression models, I flipped the values of the original outcome variable so that 5 was "Excellent" and 1 was "Very Poor," which made interpretation more intuitive.
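The two recodings described above can be sketched in a few lines of plain Python. This is only an illustration, not the code from the cleaning notebook: the function names and the sample ratings are hypothetical, and it assumes every rating is an integer from 1 to 5.

```python
# Original survey coding as described in the text: 1 = "Excellent", 2 = "Good", 5 = "Very Poor".
# Function names and sample values are hypothetical, chosen only for illustration.

def to_binary_quality(rating):
    """Return "good" for ratings of 1 (Excellent) or 2 (Good), otherwise "not good"."""
    return "good" if rating in (1, 2) else "not good"

def flip_scale(rating, low=1, high=5):
    """Reverse the scale so the best rating becomes 5 and the worst becomes 1."""
    return (low + high) - rating

original_ratings = [1, 2, 3, 4, 5]  # made-up sample covering the whole scale
binary = [to_binary_quality(r) for r in original_ratings]
flipped = [flip_scale(r) for r in original_ratings]

print(binary)   # binary outcome used for classification
print(flipped)  # reversed scale used for regression
```

Flipping with `(low + high) - rating` keeps the spacing between ratings intact, so only the direction of the scale changes.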
All the specific cleaning steps I took can be seen in this Jupyter notebook. During the cleaning process, I also consulted this codebook.
Machine Learning Analysis
I analyzed this dataset by creating four notebooks where I conducted Exploratory Data Analysis, Classification, Regression, and Principal Component Analysis.
With the dropdown menu below, you can view visualizations of the results from those four notebooks:
This bar chart shows the numerical features within the dataset that had the strongest correlations with each other, based on the correlation matrix computed in the EDA notebook.
This bar chart shows the numerical features that had the strongest correlations with relationship quality ratings, based on the correlation matrix computed in the EDA notebook.
This bar chart shows the features remaining in the Logistic Regression model after backward elimination removed the statistically insignificant features (those with p-values above the alpha level of .05). The specific steps I took can be found in the classification notebook.
This bar chart shows the features remaining in my Linear Regression model after backward elimination removed the statistically insignificant features (those with p-values above the alpha level of .05). The specific steps I took can be found in the regression notebook.
Top Correlations between Numeric Features
Top Correlations between Numeric Features & Relationship Quality
Classification Feature Coefficients
Regression Feature Coefficients
All the correlations between the feature pairs above had p-values that were reported as 0, so they are statistically significant. Correlations with an absolute value of 0.3 or below are fairly weak.
- Political Views: Subjects were more likely to be with partners that had similar political views.
- Education: Subjects were more likely to be with partners that had similar education levels.
- As a subject's education level is correlated with their mother's, the mothers of the two partners were also likely to have similar education levels.
- Age: The older a subject was when the couple first met, the larger their age gap tended to be and the less time it took for them to become a couple.
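As a rough illustration of how feature pairs can be ranked by correlation strength, the sketch below computes Pearson correlations for every pair of columns and sorts them by absolute value, labeling anything at or below 0.3 as weak. It is not the code from the EDA notebook: the helper names and the toy values are made up, only the column names come from this write-up, and p-values are not computed here.

```python
from itertools import combinations
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def rank_correlations(columns, weak_cutoff=0.3):
    """Return (feature_a, feature_b, r, label) tuples sorted by |r|, strongest first."""
    rows = []
    for a, b in combinations(columns, 2):
        r = pearson(columns[a], columns[b])
        label = "weak" if abs(r) <= weak_cutoff else "stronger"
        rows.append((a, b, r, label))
    return sorted(rows, key=lambda row: abs(row[2]), reverse=True)

# Toy values invented for illustration; they are not taken from the survey.
toy_columns = {
    "subjectAge": [25, 32, 41, 50, 63, 70],
    "partnerAge": [27, 30, 43, 48, 60, 72],
    "householdMinor_num": [0, 2, 1, 2, 0, 1],
}

ranked = rank_correlations(toy_columns)
for a, b, r, label in ranked:
    print(f"{a} vs {b}: r = {r:.2f} ({label})")
```

Sorting on the absolute value keeps strong negative correlations near the top alongside strong positive ones.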
Interesting Findings:
All the top correlations with rQual shown had p-values below 0.05, so they are statistically significant. Correlations with an absolute value of 0.3 or below are fairly weak, so all of the correlations with relationship quality above are very weak.
- Subjects were more likely to rate their relationships as "good" if the couple had:
  - Higher household income
  - Higher education levels
  - Older age
- Subjects were less likely to rate their relationships as "good" if the couple had:
  - More household members below the age of 18
Interesting Findings:
- Subjects were more likely to rate their relationships as "good" if the couple had:
  - Higher household income
  - Higher education levels
  - Older age
  - Sex frequency of once a week or more
- Subjects were less likely to rate their relationships as "good" if the couple had:
  - Sex frequency of once a month or less
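The backward elimination mentioned for the Logistic Regression and Linear Regression models can be sketched generically as a loop that refits the model, drops the feature with the largest p-value, and stops once every remaining p-value is at or below alpha. This is a minimal sketch under assumptions, not the notebooks' actual procedure: `fit_pvalues` is a hypothetical callable standing in for whatever model fit produces the p-values, and the feature names and p-values below are made up.

```python
def backward_eliminate(features, fit_pvalues, alpha=0.05):
    """Drop the least significant feature one at a time until all p-values are <= alpha.

    fit_pvalues is a hypothetical callable: it takes a list of feature names,
    refits the model on them, and returns {feature_name: p_value}.
    """
    remaining = list(features)
    while remaining:
        pvalues = fit_pvalues(remaining)  # refit on the current feature set
        worst = max(remaining, key=lambda name: pvalues[name])
        if pvalues[worst] <= alpha:
            break  # every remaining feature is statistically significant
        remaining.remove(worst)
    return remaining

# Stand-in for a real model fit: fixed, made-up p-values that ignore refitting effects.
MADE_UP_PVALUES = {
    "feature_a": 0.001,
    "feature_b": 0.20,
    "feature_c": 0.04,
    "feature_d": 0.65,
}

def fake_fit_pvalues(names):
    return {name: MADE_UP_PVALUES[name] for name in names}

kept = backward_eliminate(MADE_UP_PVALUES, fake_fit_pvalues)
print(kept)
```

With a real model the p-values change after each refit, which is why the loop recomputes them on every pass instead of filtering once.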
Interesting Findings:
Note! Any racial correlations here should be taken with a grain of salt. Most subjects and their partners in this dataset identified as white, so when it comes to the effects of race on relationships, we would need much more data. Even with more data, racial issues can be very complex, as they may be tied to many other correlated factors such as educational and economic opportunities.
- Subjects were more likely to rate their relationships as "good" if the couple had:
  - Higher household income
  - About the same income earnings as each other
  - Older age
  - Sex frequency of once a week or more
  - Met in school
  - Met as "work neighbors"
- Subjects were less likely to rate their relationships as "good" if the couple had:
  - Sex frequency of once a month or less
Interesting Findings:
Components 0 & 1 - Features
The value of each feature here shows how much it contributes to that principal component.
The grey rectangles have values closer to 0, so those features don't contribute much to that component. (This doesn't necessarily mean the feature isn't significant. It just means the variance captured by this specific component has little to do with that feature.)
A yellow feature has a positive value, so if the component were represented on an axis, moving up the axis would represent more of that feature.
For example, Component 0 has the feature "householdMinor_num" colored in yellow, so as you move up this axis, the subjects tend to have more children at home.
A black feature has a negative value, so if the component were represented on an axis, moving up the axis would represent less of that feature.
For example, Component 0 has the feature "partnerAge" colored in black, so as you move up this axis, the subjects tend to have younger partners.
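To make the sign interpretation concrete, the sketch below computes a subject's position along a component as the weighted sum of their standardized feature values. The loadings and subject values are hypothetical and chosen only to match the signs described above for Component 0. "someOtherFeature" is a made-up stand-in for a grey, near-zero feature.

```python
def component_score(standardized_row, loadings):
    """Position along a component: sum of loading * standardized feature value."""
    return sum(loadings[name] * standardized_row[name] for name in loadings)

# Hypothetical loadings: positive (yellow), negative (black), near zero (grey).
component_0 = {
    "householdMinor_num": 0.6,
    "partnerAge": -0.6,
    "someOtherFeature": 0.02,
}

# Hypothetical subjects, with features expressed as z-scores.
subject_a = {"householdMinor_num": 1.5, "partnerAge": -1.0, "someOtherFeature": 2.0}
subject_b = {"householdMinor_num": -0.5, "partnerAge": 1.2, "someOtherFeature": 2.0}

score_a = component_score(subject_a, component_0)
score_b = component_score(subject_b, component_0)

print(f"subject_a: {score_a:.2f}")  # more minors at home, younger partner
print(f"subject_b: {score_b:.2f}")  # fewer minors at home, older partner
```

Both subjects have the same large value for the near-zero feature, yet it barely moves either score, which is why grey features say little about that component.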
Below is my interpretation of the two components.
Component 0
Positive
- Number of minors in household (strong)
- Partner's mother education (strong)
Negative
- Partner's Age (strong)
Component 1
Positive
- Household income (strong)
- Partner's mother education (strong)
Negative
- Number of minors in household (moderate)
Relationship Quality: Good vs Not Good
- A subject was more likely to rate their relationship as "good" if the couple had:
  - Higher household income
  - More education
  - Older age
- A subject was less likely to rate their relationship as "good" if the couple had:
  - More household members below the age of 18
Interesting Findings:
Conclusions
Do couples that meet on dating apps have higher or lower quality relationships?
In all the models built, it seems there was no statistically significant relationship between meeting on dating apps and rated relationship quality.
What insights can I derive from using machine learning for exploratory analysis? Can any features in this dataset help predict how a subject would rate their relationship quality?
Some interesting findings are that the education levels and political views of subjects and their
partners tend to be about the same. The older a subject was when the couple first met, the larger their
age gap tended to be and the shorter it took for them to become a couple.
As for the best features to predict relationship quality, a feature may show up as significant in one model but not in others. For the purposes of exploratory analysis, a feature that appears as significant in multiple models is likely worth further investigation.
In the table below, I grouped similar features into general concepts. For example, "subjectAge" and "partnerAge" both relate to age and are correlated, so for simplicity I grouped them together as "Age."
With this table, we can see which general concepts showed up as significant in which models.
Classification | Regression | Unsupervised | Total | |
---|---|---|---|---|
# of Household Minors | ✅ | ✅ | 2 | |
Age | ✅ | ✅ | 2 | |
Earned about the same | ✅ | 1 | ||
Education | ✅ | ✅ | 2 | |
Household Income | ✅ | ✅ | ✅ | 3 |
Living Together | ✅ | ✅ | 2 | |
Met as Coworkers / Work Neighbors | ✅ | ✅ | 2 | |
Met in School | ✅ | 1 | ||
Race | ✅ | 1 | ||
Sex Frequency | ✅ | ✅ | 2 |
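A tally like the one in the table can be produced by mapping each feature to its general concept and counting a concept once per model, even when several of its features are significant in that model. The sketch below is illustrative only: the model names and the lists of significant features are made up and do not reproduce the results above. Only the feature names and concept labels come from this write-up.

```python
from collections import Counter

# Feature names and concept labels appear in this write-up; the grouping code is a sketch.
FEATURE_TO_CONCEPT = {
    "subjectAge": "Age",
    "partnerAge": "Age",
    "householdMinor_num": "# of Household Minors",
    "metAs_coworkers": "Met as Coworkers / Work Neighbors",
    "metAs_workNeighbors": "Met as Coworkers / Work Neighbors",
}

# Made-up example input: which features each (hypothetical) model flagged as significant.
significant_by_model = {
    "model_a": ["subjectAge", "metAs_coworkers"],
    "model_b": ["partnerAge", "householdMinor_num", "metAs_workNeighbors"],
    "model_c": ["householdMinor_num", "subjectAge", "partnerAge"],
}

def tally_concepts(significant_by_model, feature_to_concept):
    """Count, for each concept, the number of models with at least one significant feature."""
    counts = Counter()
    for features in significant_by_model.values():
        concepts = {feature_to_concept[f] for f in features}  # one vote per model
        counts.update(concepts)
    return counts

totals = tally_concepts(significant_by_model, FEATURE_TO_CONCEPT)
for concept, total in sorted(totals.items()):
    print(f"{concept}: {total}")
```

Collapsing features into a set per model prevents a concept such as "Age" from being counted twice when both "subjectAge" and "partnerAge" are significant in the same model.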
It seems the general concepts that appeared in more than one model were number of household minors, age, education, income, living together, and sex frequency. Less of the first and more of any of the last five seem to be correlated with a higher likelihood that a subject would rate their relationship as "good."
Meeting through work also seems to be correlated with relationship quality, but the results conflict. "metAs_coworkers" was correlated with a lower likelihood of "good" relationship quality, while "metAs_workNeighbors" was correlated with a higher likelihood. When I consulted the codebook, it was unclear what the distinction between "coworkers" and "work neighbors" was, if any.
The relationship between these seven concepts and relationship quality might be worth investigating further with future studies and more relationship data.
As a quick reminder, correlation does not mean causation. This analysis was done on just one dataset of couples. Relationships in the real world might be very different. Just because the partners in a couple are of certain ages, races, education levels, or incomes does not mean their relationship is doomed to fail or destined to succeed.
"I've learned that you can't predict [love] or plan for it. For someone like me who is obsessed with organization and planning, I love the idea that love is the one exception to that. Love is the one wild card." - Taylor Swift