Evaluation, even on crowdsourcing platforms used by ordinary people, should capture end users’ types of interactions and decisions. The evaluations should demonstrate what happens when the algorithm is integrated into a human decision-making process. Does that alter or improve the decision and the resultant decision-making process as revealed by the downstream outcome?