Exploring Relationships with Machine Learning
By Oscar Ko
Research Questions:
- Do couples that meet on dating apps have higher or lower quality relationships?
 - Can any features in this dataset help predict how a subject would rate their relationship quality?
 - What insights can I derive from using machine learning for exploratory analysis?
 
This dataset was taken from a Stanford University survey called "How Couples Meet and Stay Together 2017." It contains many variables, including age, education level, political party, income, usage of dating apps, and quality of relationship (as self-rated by the subject). To prepare the data for analysis, I renamed and recoded a few features.
                    
For example, one notable change I made was to the relationship quality variable. It was originally on a scale from 1 to 5, with 1 being "Excellent," 2 being "Good," and 5 being "Very Poor." I created a new binary variable that specifies only whether a couple was "good" or "not good." Couples rated as "good" or "excellent" in the original outcome variable were categorized as "good" in the binary outcome. For the regression models, I flipped the values of the original outcome variable so that 5 was "Excellent" and 1 was "Very Poor." This made interpretation more intuitive.
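
The recoding described above can be sketched roughly as follows. This is a minimal illustration rather than the notebook's actual code: the column names `rQual_original`, `rQual_isGood`, and `rQual_flipped` and the sample ratings are hypothetical.

```python
import pandas as pd

# Hypothetical sample of original ratings (1 = "Excellent", 5 = "Very Poor").
df = pd.DataFrame({"rQual_original": [1, 2, 3, 4, 5, 2]})

# Binary outcome: "Excellent" (1) or "Good" (2) counts as "good" (1), else "not good" (0).
df["rQual_isGood"] = df["rQual_original"].isin([1, 2]).astype(int)

# Flipped scale for regression: 6 - x maps 1 -> 5 and 5 -> 1,
# so a higher number means a better rating.
df["rQual_flipped"] = 6 - df["rQual_original"]

print(df)
```

Subtracting from 6 keeps the spacing between categories unchanged, so only the direction of the scale is reversed.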
                    
All the specific cleaning steps I took can be seen in this Jupyter notebook. During the cleaning process, I also consulted this codebook.
                
Machine Learning Analysis
I analyzed this dataset by creating four notebooks, in which I conducted Exploratory Data Analysis, Classification, Regression, and Principal Component Analysis.
                
                With the dropdown menu below, you can view visualizations of the results from those four notebooks:
            
Top Correlations between Numeric Features
This bar chart shows the numerical features in the dataset that had the strongest correlations with each other, based on the correlation matrix computed in the EDA notebook.
Top Correlations between Numeric Features & Relationship Quality
This bar chart shows the numerical features that had the strongest correlations with relationship quality ratings, based on the correlation matrix computed in the EDA notebook.
Classification Feature Coefficients
This bar chart shows the features remaining in the Logistic Regression model after backward elimination removed statistically insignificant features (those with p-values above the .05 significance level). The specific steps I took can be found in the classification notebook.
Regression Feature Coefficients
This bar chart shows the features left in my Linear Regression model after backward elimination removed all statistically insignificant features (those with p-values above the .05 significance level). The specific steps I took can be found in the regression notebook.
All the correlations between the feature pairs above had a reported p-value of 0, so they are statistically significant. Any correlation with an absolute value of 0.3 or below is fairly weak.
- Political Views: Subjects were more likely to be with partners that had similar political views.
 - Education: Subjects were more likely to be with partners that had similar education levels.
 - Because a subject's education level correlates with their mother's, the mothers of both partners also tended to have similar education levels.
 - Age: The older a subject was when the couple first met, the larger their age gap tended to be and the less time it took for them to become a couple.
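
One way to pull the strongest feature pairs out of a correlation matrix is sketched below. It is an assumption about the general approach, not the EDA notebook's code: the helper `top_correlated_pairs`, the column `householdIncome`, and the synthetic data are hypothetical.

```python
import numpy as np
import pandas as pd

def top_correlated_pairs(df, n=5):
    """Return the n feature pairs with the largest absolute Pearson correlation."""
    corr = df.select_dtypes("number").corr()
    cols = corr.columns
    # Upper-triangle indices (k=1) list each pair once and skip the diagonal.
    i, j = np.triu_indices(len(cols), k=1)
    pairs = pd.Series(
        corr.to_numpy()[i, j],
        index=pd.MultiIndex.from_arrays([cols[i], cols[j]]),
    )
    # Rank by magnitude so strong negative correlations are not lost.
    return pairs.sort_values(key=lambda s: s.abs(), ascending=False).head(n)

# Synthetic data: partner age tracks subject age, income is unrelated.
rng = np.random.default_rng(0)
subject_age = rng.uniform(20, 70, size=200)
demo = pd.DataFrame({
    "subjectAge": subject_age,
    "partnerAge": subject_age + rng.normal(0, 3, size=200),
    "householdIncome": rng.normal(60_000, 15_000, size=200),
})
top = top_correlated_pairs(demo, n=3)
print(top)
```

Sorting by absolute value matters here because a strong negative correlation is as informative as a strong positive one.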
 
Interesting Findings:
All the top correlations with rQual shown had p-values below 0.05, so they are statistically significant. However, any correlation with an absolute value of 0.3 or below is fairly weak, so all of the correlations with relationship quality above are very weak.
- Subjects were more likely to rate their relationships as "good" if the couple had:
  - Higher household income
  - Higher education levels
  - Older age
- Subjects were less likely to rate their relationships as "good" if the couple had:
  - More household members below the age of 18
 
 
Interesting Findings:
- Subjects were more likely to rate their relationships as "good" if the couple had:
  - Higher household income
  - Higher education levels
  - Older age
  - Sex frequency of once a week or more
- Subjects were less likely to rate their relationships as "good" if the couple had:
  - Sex frequency of once a month or less
 
 
Interesting Findings:
Note! Any racial correlations here should be taken with a grain of salt. Most subjects and their partners in this dataset identified as white, so when it comes to the role of race in relationships, we would need much more data. And even with more data, racial issues can be very complex, as they may be tied to many other correlated factors such as educational and economic opportunities.
- Subjects were more likely to rate their relationships as "good" if the couple had:
  - Higher household income
  - About the same income earnings as each other
  - Older age
  - Sex frequency of once a week or more
  - Met in school
  - Met as "work neighbors"
- Subjects were less likely to rate their relationships as "good" if the couple had:
  - Sex frequency of once a month or less
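
The backward elimination mentioned for the classification and regression charts can be sketched as below. This is only an illustration under stated assumptions: it uses ordinary least squares fitted with NumPy, approximates p-values with a large-sample normal distribution instead of the exact t-distribution, and the feature names and data are hypothetical. The notebooks may well use a statistics library and, for classification, a logistic model.

```python
import math
import numpy as np

def ols_p_values(X, y):
    """Fit OLS and return two-sided p-values (normal approximation, an assumption)."""
    n, k = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / (n - k)            # residual variance
    cov = sigma2 * np.linalg.inv(X.T @ X)       # coefficient covariance matrix
    z = beta / np.sqrt(np.diag(cov))            # coefficient / standard error
    return np.array([math.erfc(abs(v) / math.sqrt(2)) for v in z])

def backward_eliminate(X, y, names, alpha=0.05):
    """Drop the least significant feature until every p-value is <= alpha."""
    keep = list(range(X.shape[1]))
    while keep:
        design = np.column_stack([np.ones(len(y)), X[:, keep]])  # intercept first
        p_feat = ols_p_values(design, y)[1:]    # ignore the intercept's p-value
        worst = int(np.argmax(p_feat))
        if p_feat[worst] <= alpha:
            break                               # everything left is significant
        del keep[worst]                         # remove one feature, then refit
    return [names[i] for i in keep]

# Synthetic data: two informative features and one pure-noise feature.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))
y = 2.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(size=300)
names = ["householdIncome", "subjectAge", "noise"]
kept = backward_eliminate(X, y, names)
print(kept)
```

Only one feature is removed per iteration because p-values change whenever the model is refit without a feature.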
 
 
Interesting Findings:
Components 0 & 1 - Features
The values of the features here show how they contribute to that principal component.

The grey-colored rectangles have values closer to 0, so they don't contribute much to that component. (It doesn't necessarily mean the feature isn't significant. It just means this specific component's captured variance doesn't have anything to do with that feature.)

A yellow-colored feature has a positive value, so if the component were represented on an axis, increasing along this axis would represent more of that feature.

For example, Component 0 has the feature "householdMinor_num" colored in yellow, so as you go forward on this axis, the subjects tend to have more children at home.

A black-colored feature has a negative value, so if the component were represented on an axis, increasing along this axis would represent less of that feature.

For example, Component 0 has the feature "partnerAge" colored in black, so as you go forward on this axis, the subjects tend to have younger partners.

Below is my interpretation of the two components.
                        
Component 0
Positive
- Number of minors in household (strong)
 - Partner's mother's education (strong)
 
Negative
- Partner's Age (strong)
 
Component 1
Positive
- Household income (strong)
 - Partner's mother's education (strong)
 
Negative
- Number of minors in household (moderate)
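
The positive and negative contributions listed above are component loadings. The sketch below shows one way such loadings can be computed, using a singular value decomposition of standardized data. It is an assumption about the general technique, not the PCA notebook's code: the helper `pca_loadings`, the column `householdIncome`, and the synthetic data are hypothetical.

```python
import numpy as np

def pca_loadings(X, n_components=2):
    """Return an (n_components, n_features) array of principal component loadings."""
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)   # standardize each feature
    _, _, vt = np.linalg.svd(Xs, full_matrices=False)
    return vt[:n_components]                    # rows are unit-length components

# Synthetic data: households with older partners tend to have fewer minors.
rng = np.random.default_rng(0)
partner_age = rng.normal(45, 12, size=500)
minors = 3 - 0.05 * partner_age + rng.normal(0, 0.5, size=500)
income = rng.normal(60_000, 15_000, size=500)
names = ["householdMinor_num", "partnerAge", "householdIncome"]
X = np.column_stack([minors, partner_age, income])

loadings = pca_loadings(X)
for name, weight in zip(names, loadings[0]):
    print(f"{name}: {weight:+.2f}")
```

The overall sign of a principal component is arbitrary, so what carries meaning is which features share a sign and which have opposite signs within the same component.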
 
Relationship Quality: Good vs Not Good
- A subject was more likely to rate their relationship as "good" if the couple had:
  - Higher household income
  - More education
  - Older age
- A subject was less likely to rate their relationship as "good" if the couple had:
  - More household members below the age of 18
 
 
Interesting Findings:
Conclusions
                
                    Do couples that meet on dating apps have higher or lower quality relationships?
                
                
In all the models built, there was no statistically significant relationship between meeting on dating apps and rated relationship quality.
                
                
What insights can I derive from using machine learning for exploratory analysis? Can any features in this dataset help predict how a subject would rate their relationship quality?
                
                
Some interesting findings are that the education levels and political views of subjects and their partners tend to be about the same. The older a subject was when the couple first met, the larger their age gap tended to be and the less time it took for them to become a couple.

As for the best features to predict relationship quality, a feature may show up as significant in one model but not in others. For the purposes of exploratory analysis, if a feature appeared as significant in multiple models, it is likely worth further investigation.
                
In the table below, I grouped similar features into their general concepts. For example, "subjectAge" and "partnerAge" both relate to age, and they are correlated, so for simplicity I grouped them together in the table below as "Age."

With this table, we can see which general concepts appeared as significant in which models.
            
| Classification | Regression | Unsupervised | Total | |
|---|---|---|---|---|
| # of Household Minors | ✅ | ✅ | 2 | |
| Age | ✅ | ✅ | 2 | |
| Earned about the same | ✅ | 1 | ||
| Education | ✅ | ✅ | 2 | |
| Household Income | ✅ | ✅ | ✅ | 3 | 
| Living Together | ✅ | ✅ | 2 | |
| Met as Coworkers / Work Neighbors | ✅ | ✅ | 2 | |
| Met in School | ✅ | 1 | ||
| Race | ✅ | 1 | ||
| Sex Frequency | ✅ | ✅ | 2 | 
It seems the general concepts that appeared in more than one model were number of household minors, age, education, income, living together, and sex frequency. Less of the first and more of any of the last five seem to be correlated with a higher likelihood that a subject would rate their relationship as "good."

Meeting as workmates also seems to be correlated with relationship quality, but the results conflict: "metAs_coworkers" was correlated with a lower likelihood of "good" relationship quality, while "metAs_workNeighbors" was correlated with a higher likelihood. When I consulted the codebook, it was unclear what the distinction between "coworkers" and "work neighbors" was, if any.
                
The relationship between these seven concepts and relationship quality might be worth investigating further with future studies and more relationship data.

As a quick reminder, correlation does not mean causation. This analysis was done on just one dataset of couples. Relationships in the real world might be very different. Just because a couple is of a certain age, race, education level, or income does not mean their relationship is doomed to fail or destined to succeed.
                
"I've learned that you can't predict [love] or plan for it. For someone like me who is obsessed with organization and planning, I love the idea that love is the one exception to that. Love is the one wild card." - Taylor Swift